Re: [PATCH 2/4] libata: Implement disk shock protection support

Previous thread: Re: Linux 2.6.27-rc5: System boot regression caused by commit a2bd7274b47124d2fc4dfdb8c0591f545ba749dd by David Witbrodt on Saturday, August 30, 2008 - 4:29 pm. (2 messages)

Next thread: Marvell 6145 PATA controller by Peter Ambroz on Saturday, August 30, 2008 - 4:50 pm. (3 messages)
From: Elias Oltmanns
Date: Saturday, August 30, 2008 - 4:38 pm

Right, will do.


Maybe I'm missing something but I don't see how I could use msleep()

Well, I really am concerned about dev->sdev. So far, I haven't quite
figured out yet whether under which circumstances I can safely assume
that the scsi counter part of dev including the block layer request
queue has been completely set up and configured so there won't be any
null pointer dereferences. However, if you think that I needn't bother
with stopping the request queue anyway, checking for ATA_DEV_ATA (what

Quite right. It's just that it will be un- and replugged every
(3 * HZ) / 1000, so I thought it might be worthwhile to stop the queue
anyway. Perhaps it really isn't worth bothering and the code would

Just to be sure, are you just referring to the queue lock, or to the host
lock as well? Am I wrong in thinking that we have to protect all access
to dev->flags because bit operations are performed non atomically


Probably, I should insert a comment somewhere. The problem is that
device internal power management will be disabled until the next command
is received. If you have laptop mode enabled and the device has received
the unload command while spinning with no more commands in the queue to
follow, the device may keep spinning for quite a while and won't go into
standby which rather defeats the purpose of laptop mode. This behaviour

Since it's a user process that controls when we have to keep the heads
off the platter, a suspend operation has to be blocked *before* process

Ah, I can see that this while loop can replace my call to wait_event() in
the eh sequence earlier on. However, I wonder how I am to replace the
call to mod_timer(). As you can see, we perform different actions
depending on whether the timer has merely been updated, or a new timer
has been started. Only in the latter case we want to schedule eh. In
order to achieve the equivalent in your setting while preventing any
races, I'd have to protect the deadline field in struct ata_port by the
host lock, ...
From: Tejun Heo
Date: Sunday, August 31, 2008 - 2:25 am

Hello,


Ah.. you need to part ATAPI too?  If so, just test for
ata_dev_enabled().  One way or the other, there's no need to care

While the EH is running, nothing gets through other than commands
issued by EH itself, so no need to worry about how upper layers would

Yes, when modifying the flags.  You don't need to when testing a

Ah.. Okay.  I somehow thought the command would spin down the disk.

Can you please elaborate a bit?  The reloading is done by the kernel
after a timeout, right?  What kind of precarious situation can the

You can just do

    spin_lock_irq(ap->lock);
    ap->deadline = jiffies + timeout;
    ap->link.eh_info.action |= ATA_EH_PARK;
    ata_port_schedule_eh(ap);

Yeah, this looks about right, but you can make it a bit simpler...

    while (time_before((now = jiffies), ap->deadline))
	schedule_timeout_uninterruptible(ap->deadline - now);

As locking on reader side doesn't mean much in cases like this.  The
deadline update which triggered EH is guaranteed to be visible and the
the window between waking up from schedule_timeout and ap->deadline


Oh.. what I meant was whether we need a separate sysfs node to
indicate whether unload feature is enabled or not but now I come to
think about it, that is per-device and the timer is per-port.  :-)

Thanks.

-- 
tejun
--

From: Elias Oltmanns
Date: Sunday, August 31, 2008 - 5:08 am

Well, I'm not quite sure really. Perhaps you are right and I'd better
leave ATAPI alone, especially given the problem that the unload command
might mess up a CD/DVD write operation. As long as no laptop HDs are


Right.


Sorry, I haven't expressed myself very clearly there, it seems. The user
space process detects some acceleration and starts writing timeouts to
the sysfs file. This causes the unload command to be issued to the
device and stops all I/O until the user space daemon decides that the
danger has passed, writes 0 to the sysfs file and leaves it alone
afterwards. Now, if the daemon happens to request head parking right at
the beginning of a suspend sequence, this means that we are in danger of
falling, i.e. we have to make sure that I/O is stopped until that user
space daemon gives the all-clear. However, suspending implies freezing
all processes which means that the daemon can't keep checking and
signalling to the kernel. The last timeout received before the daemon
has been frozen will expire and the suspend procedure goes ahead. By
means of the notifiers we can make sure that suspend is blocked until

Please note that I want to schedule EH (and thus the head unload -
check power command sequence) only once in the event of overlapping
timeouts. For instance, when the daemon sets a timeout of 2 seconds and
does so again after one second has elapsed, I want the following to
happen:

[ Daemon writes 2 to the unload_heads file ]
1. Set timer / deadline;
2. eh_info.action |= ATA_EH_PARK;
3. schedule EH;
4. execute EH and sleep, waiting for the timeout to expire;

[ Daemon writes 2 to the unload_heads file before the previous timeout
has expired. ]

5. update timer / deadline;
6. the EH thread keeps blocking until the new timeout has expired.

In particular, I don't want to reschedule EH in response to the second
write to the unload_heads file. Also, we have to consider the case where
the daemon signals to resume I/O prematurely by writing a timeout of ...
From: Tejun Heo
Date: Sunday, August 31, 2008 - 6:03 am

Hello,



Is it really worth protecting against that?  What if the machine
started to fall after the userland tasks have been frozen?  And how
long the usual timeout would be?  If the machine has been falling for
10 secs, there really isn't much point in trying to protect anything
unless there also is ATA DEPLOY PARACHUTE command somewhere in the new
revision.

In libata, as with any other exceptions, suspend/resume are handled by
EH so while emergency head unload is in progress, suspend won't
commence which is about the same effect as the posted code sans the
timeout extension part.  I don't really think there's any significant
danger in not being able to extend timeout while suspend is in
progress.  It's not a very big window after all.  If you're really
worried about it, you can also let libata reject suspend if head
unload is in progress.

Also, the suspend operation is unloading the head and spin down the
drive which sound like a good thing to do before crashing.  Maybe we
can modify the suspend sequence such that it always unload the head
first and then issue spindown.  That will ensure the head is in safe
position as soon as possible.  If it's done this way, it'll be
probably a good idea to split unloading and loading operations and do
loading only when EH is being finished and the disk is not spun down.

To me, much more serious problem seems to be during hibernation.  The
kernel is actively writing memory image to userland and it takes quite

Whether EH is scheduled multiple times or not doesn't matter at all.
EH can be happily scheduled without any actual action to do and that
does happen from time to time due to asynchronous nature of events.
libata EH doesn't have any problem with that.  The only thing that's
required is there's at least one ata_schedule_eh() after the latest
EH-worthy event.  So, the simpler code might enter EH one more time
once in the blue moon, but it won't do any harm.  EH will just look
around and realize that there's nothing much to do ...
From: Bartlomiej Zolnierkiewicz
Date: Sunday, August 31, 2008 - 7:32 am

Hi,


Seconded.  To be honest I also don't like the change to issue UNLOAD to
all devices on the port (it only needlessly increases complexity right now
since the _only_ use case at the moment is ThinkPad w/ hdaps + 1 HD).


Which also brings again the question whether it is really the best to
use user-space solution instead of kernel thread?

After taking the look into the deamon program and hdaps driver I tend
to "Nope." answer.  The kernel-space solution would be more reliable,
should result in significatly less code and would free us from having
a special purpose libata/ide interfaces.  It should also make the
maintainance and future enhancements (i.e. making hibernation unload
friendly) a lot easier.

I imagine that this comes a bit late but can we at least give it an
another thought, please?

Thanks,
Bart
--

From: Elias Oltmanns
Date: Sunday, August 31, 2008 - 9:14 am

We are not just protecting against free fall. Think of a series of minor

Personaaly, I'm very much against plain rejection because of the odd
head unload. A delay in the suspend procedure, on the other hand, is

Well, scsi midlayer will also issue a flush cache command. Besides, with
previous implementations I have observed occasional lock ups when
suspending while the unload timer was running. Once we have settled the
timer vs deadline issue, I'm willing to do some more investigation in
this area if you really insist that pm_notifiers should be avoided. But
then I am still not too sure about your reasoning and do feel happier

That's right. The first requirement to protect against this problem is
to have the policy all in kernel space which isn't going to happen for
some time yet. This really is a *best effort* solution rather than a

The whole EH machinery is a very complex beast. Any user of the
emergency head park facility has a particular interest that the system
spends as little time as possible in the EH code even if it's real error
recovery that matters most. Perhaps we could agree on the following
compromise:

    spin_lock_irq(ap->lock);
    old_deadline = ap->deadline;
    ap->deadline = jiffies + timeout;
    if (old_deadline < jiffies) {
        ap->link.eh_info.action |= ATA_EH_PARK;
        ata_port_schedule_eh(ap);
    }
    spin_unlock_irq(ap->lock);

There is still a race but it is very unlikely to trigger.

Still, you have dismissed my point about the equivalent of stopping a
running timer by specifying a 0 timeout. In fact, whenever the new
deadline is *before* the old deadline, we have to signal the sleeping EH
thread to wake up in time. This way we end up with something like
wait_event().

Regards,

Elias
--

From: Tejun Heo
Date: Monday, September 1, 2008 - 1:33 am

Hello,



I'm not particularly against pm_notifiers.  I just can't see what
advantages it have given the added complexity.  The only race window
it closes is the one between suspend start and userland task freeze,
which is a relatively short one and there are other much larger gaping
holes there, so I don't really see much benefit in the particular
pm_notifiers usage.  Maybe we need to keep the task running till the
power is pulled?  If we can do that in sane manner which is no easy

Yeap, it is.  I just thought pm_notifiers bit didn't really contribute

The logic is quite complex due to the wonderful ATA but for users
requesting actions, it really is quite simple.  Well, at least I think

Really, it doesn't matter at all.  That's just an over optimization.
The whole EH machinery pretty much expects spurious EH events and
schedules and deals with them quite well.  No need to add extra

Yes, right, reducing the timeout.  How about doing the following?

	wait_event(ata_scsi_part_wq,
		   time_before(jiffies, ap->unload_deadline));

Heh... then again, it's not much different from your original code.  I
won't object strongly to the original code but I still prefer just
setting deadline and kicking EH from the userside instead of directly
manipulating the timer.  That way, implementation is more separate
from the interface and to me it seems easier to follow the code.

Thanks.

-- 
tejun
--

From: Elias Oltmanns
Date: Monday, September 1, 2008 - 7:51 am

I doubt that is feasible without substantial work especially as long as
part of it all is implemented in user space. With regard to the
pm_notifiers, I'll try to figure out exactly what the problem was
without them (I thought it was related to timers not firing anymore
after process freezing, but Pavel doubts that and I'm not too sure
anymore).


Yes, you would ;-). Seriously though, I do agree that it is easy for

... because of its complexity it is hard for me to estimate timing
impacts and constraints. The questions I'm concerned about are: What is
the average and worst case time it takes to get the heads parked and
what can I do if not to improve either of them, then at least not to make
things worse. In particular, I don't really have a clue about how much
time it takes to go through EH if no action is requested in comparison
to, say, the average read / write command. Obviously, I don't want to
schedule EH unnecessarily if that would mean that I won't be able to
issue another head unload for considerably longer than during normal I/O
or, indeed, on an idle system. Arguably, I don't even want to do
anything that causes more logging than absolutely necessary because this
will ultimately result in the disk spinning up from standby. But then I
believe that I only came across this logging issue when I was still
playing around with eh_revalidate and the like. So, can you set my mind
at rest that timing is no issue with spurious EH sequences? Now that I
come to think of it, I suppose it would harm performance anyway, so

That's fine with me. All I want is that my code doesn't end up leaving
the system in an unresponsive state (to a head unload request, that is)
more often than before by spuriously scheduling EH. If that is not a
problem, I'm content.

Regards,

Elias
--

From: Tejun Heo
Date: Monday, September 1, 2008 - 9:43 am

For STR, I think it shouldn't be too difficult.  For STD, it's more
difficult but it would be easy if we used a kexec kernel for

Hmm... I don't think pm notifiers would be necessary to prevent things

I can't tell you the exact delays, but sans times for actual actions,
there is no real delay.  It just involves scheduling a few times and
jumps through various functions in EH.  I don't think that would be
anything measureable.

Also, generating duplicate events.  Events would be duplicate only
when those events occur during EH is in progress, right?  In that
case, as EH finishes it would see that there's another EH action
requested and re-enter EH.  The second invocation of EH would go
through the diagnostic steps (doesn't involve issuing any command,
just checks data structures) and find out that there's nothing to do
and just exit.  For those bogus runs, EH won't print out anything
either.  So, really, nothing to worry about there.

If that's not enough assurance, even ATAPI CHECK SENSE is done via EH.
That is, many ATAPI commands invoke EH after completion but nobody

No, it won't leave the system unresponsive to unload request or
anything else.  If it does, it's a bug in EH core and should be fixed.
So, there should be no problem in making the implementation simple.

Thanks.

-- 
tejun
--

From: Elias Oltmanns
Date: Wednesday, September 3, 2008 - 1:23 pm

As yet, I haven't been able to reproduce the problems with an
implementation without a dedicated park_timer. So, I'm fine with

Yes, but that will be the rule rather than the exception because the
daemon will hardly ever let the timeout expire. Eitehr it decides that
it needs to extend the timeout, or it recons that everything's going to
be alright and issues a 0 timeout in order to resume normal operation

Right then. Here is another patch where, hopefully, most of your
concerns have been addressed. Please tell me what you make of it
(applies to next-20080903).

Regards,

Elias

---

 drivers/ata/ahci.c        |    1 +
 drivers/ata/ata_piix.c    |    6 +++
 drivers/ata/libata-eh.c   |   48 ++++++++++++++++++++++++++
 drivers/ata/libata-scsi.c |   84 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/ata/libata.h      |    1 +
 include/linux/libata.h    |    4 ++
 6 files changed, 144 insertions(+), 0 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index c729e69..9539050 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -316,6 +316,7 @@ static struct device_attribute *ahci_shost_attrs[] = {
 
 static struct device_attribute *ahci_sdev_attrs[] = {
 	&dev_attr_sw_activity,
+	&dev_attr_unload_heads,
 	NULL
 };
 
diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
index b1d08a8..1b470ad 100644
--- a/drivers/ata/ata_piix.c
+++ b/drivers/ata/ata_piix.c
@@ -298,8 +298,14 @@ static struct pci_driver piix_pci_driver = {
 #endif
 };
 
+static struct device_attribute *piix_sdev_attrs[] = {
+	&dev_attr_unload_heads,
+	NULL
+};
+
 static struct scsi_host_template piix_sht = {
 	ATA_BMDMA_SHT(DRV_NAME),
+	.sdev_attrs		= piix_sdev_attrs,
 };
 
 static struct ata_port_operations piix_pata_ops = {
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index bd0b2bc..7754f32 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -2447,6 +2447,40 @@ int ata_eh_reset(struct ata_link *link, int ...
From: Tejun Heo
Date: Thursday, September 4, 2008 - 2:06 am

Oh.. I see.  Either way, it should be fine.  There is no reason to

Hmm... I meant more like


 extern struct device_attribute **libata_sdev_attrs;

 #define ATA_BASE_SHT(name)				\
 ....
	.sdev_attrs		= libata_sdev_attrs;	\
 ....

Which will give unload_heads to all libata drivers.  As ahci needs its

Nitpicking: Do you mind taking the schedule_timeout out of the while
condition?  It's just not very customary to put a statement with that
level of side effect into a condition clause.  Also, it would force

Isn't seconds a bit too crude? Or it just doesn't matter as it's
usually adjusted before expiring?  For most time interval values
(except for transfer timings of course) in ATA land, millisecs seem to


ata_scsi_find_dev() should be inside ap->lock.  Looking through the

You'll probably want to use spin_lock_irqsave and restore.  It's a

It doesn't really matter as all these are under the lock but maybe
moving ata_port_schedule_eh() below unpark_deadline is a good idea
just for clarification - you know, set the state and trigger the

Hmmm... Sorry to bring another issue with it but I think the interface
is a bit convoluted.  The unpark node is per-dev but the action is
per-port but devices can opt out by writing -2.  Also, although the
sysfs nodes are per-dev, writing to a node changes the value of park
node in the device sharing the port except when the value is -1 or -2.
That's strange, right?

How about something like the following?

* In park_store: set dev->unpark_timeout, kick and wake up EH.

* In park EH action: until the latest of all unpark_timeout are
  passed, park all drives whose unpark_timeout is in future.  When
  none of the drives needs to be parked (all timers expired), the
  action completes.

* There probably needs to be a flag to indicate that the timeout is
  valid; otherwise, we could get spurious head unparking after jiffies
  wraps (or maybe just use jiffies_64?).

With something like the above, the interface is cleanly per-dev ...
Previous thread: Re: Linux 2.6.27-rc5: System boot regression caused by commit a2bd7274b47124d2fc4dfdb8c0591f545ba749dd by David Witbrodt on Saturday, August 30, 2008 - 4:29 pm. (2 messages)

Next thread: Marvell 6145 PATA controller by Peter Ambroz on Saturday, August 30, 2008 - 4:50 pm. (3 messages)