login
Header Space

 
 

Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal

Previous thread: [patch 3/6] mempolicy: add MPOL_F_STATIC_NODES flag by David Rientjes on Monday, February 25, 2008 - 11:35 am. (35 messages)

Next thread: [git pull] scheduler fixes by Ingo Molnar on Monday, February 25, 2008 - 11:45 am. (1 message)
To: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Monday, February 25, 2008 - 11:39 am

Ongoing efforts to remove the freezer from the system suspend and
hibernation code ("system sleep" is the proper catch-all term) have
turned up a fundamental flaw in the Power Management subsystem's
design.  In brief, we cannot handle the race between hotplug addition
of new devices and suspending all existing devices.

It's not a simple problem (and I'm going to leave out a lot of details
here).  For a comparison, think about what happens when a device is
hot-unplugged.  When device_del() calls the driver's remove method, the
driver is expected to manage all the details of synchronizing with
other threads that may be trying to add new child devices as well as
removing all existing children.

But when a system sleep begins, the PM core is expected to suspend
all the children of a device before calling the device driver's suspend
method.  If there are other threads trying to add new children at the
same time, it's the PM core's responsibility to synchronize with
them -- an impossible job, since only the device's driver knows what
those other threads are and how to stop them safely.

In the past this deficiency has been hidden by the freezer.  Other 
tasks couldn't register new children because they were frozen.  But now 
we are phasing out the freezer (already most kernel threads are not 
freezable) and the problem is starting to show up.

A change to the PM core present in 2.6.25-rc2 (but which is about to be
reverted!) has the core try to prevent these additions by acquiring the
device semaphores for every registered device.  This has turned out to
be too heavy-handed; for example, it prevents drivers from
unregistering devices during a system sleep.  There are more subtle
synchronization problems as well.

The only possible solution is to have the drivers themselves be
responsible for preventing calls to device_add() or device_register()  
during a system sleep.  (It's also necessary to prevent driver binding,
but this isn't a major issue.)  The most straightforward approach i...
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, February 27, 2008 - 4:36 pm

.../...

Yup, old problem, I think I've said a long time ago that it's the
reponsibility of bus drivers (such as USB khub) to stop issuing device
additions when suspend is in progress. It might be possible to still do


No, I think the global notifier is plenty for now. Later on, we can
re-add early-suspend, late-resume, or we can finally turn the whole
thing into a recursive call tree where the parents are the ones calling
the children suspend :-) But for now, I think the global notifier is
enough.

Cheers,
Ben.


--
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Monday, February 25, 2008 - 6:24 pm

It's not a problem if new children are registered before the parent's
-&gt;suspend() is called, the PM core can handle that.  The problem is the
potential race between the suspending task and the threads registering new
children concurrently to the executing -&gt;suspend(), because if those threads
lose the race, the resume ordering will be broken.

Since the PM core knows nothing about the drivers internals, the drivers'
-&gt;suspend() methods must be responsible for synchronizing with the other

I think we just attempted to take device semaphores too early.  We probably
can take the device semaphores _after_ suspending all devices without
much hassle.  However, it's not actually a problem if a suspended device
gets unregistered - it's removed from the list on which it is at the moment
and won't be resumed.  It also is not a problem if the device is registered
after it's master's -&gt;resume() has run.

Besides, taking the semaphores for all _existing_ devices doesn't prevent
new devices from being added and that's we needed to take


As I said above, I don't see a problem with registering new devices before
the parent's -&gt;suspend() is run and after it's -&gt;resume() has run.  That
doesn't break any ordering rules and we can handle it.

Thanks,
Rafael
--
To: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Monday, February 25, 2008 - 3:46 pm

On further thought maybe the existing methods can be used, with care.  
Drivers would have to assume the responsibility of synchronizing with
their helper threads and stopping addition of new children (something
they should already be doing), and they would also have to check that
all the existing children are already suspended.  They should not make
the assumption that the PM core has already suspended all the children.

The PM core could help detect errors here.  If it tries to suspend a 
device and sees that the device's parent is already suspended, then the 
parent's driver has a bug.

Alan Stern

--
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Monday, February 25, 2008 - 6:25 pm

IMO the device driver should assure that no new children will be registered
concurrently with the -&gt;suspend() method (IOW, -&gt;suspend() should wait for
all such registrations to complete and should prevent any new ones from
being started) and it should make it impossible to register any new children

Yes, I think we ought to fail the suspend in such cases.  Still, that's not
sufficient to prevent a child from being registered after we've run
dpm_suspend().  For this reason, we could also leave dpm_suspend() with
dpm_list_mtx held and not release it until the next dpm_resume() is run.

That will potentially cause some trouble to CPU hotplug cotifiers, but we can
handle that, for example, by using the in_suspend_context() test.

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>, Alan Stern <stern@...>
Cc: <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, February 26, 2008 - 3:13 am

This "flaw" isn't a new thing, of course.  I remember pointing out the rather
annoying proclivity of the PM framework to deadlock when suspend() tried to
remove USB devices ... back around 2.6.10 or so.  Things have shuffled around
a bit, and gotten better in some cases, but not fundamentally changed.

It may be more accurate to say that now we understand some constraints on
device tree management policies ... ones we had previously assumed should
not be issues.  (But AFAICT, without actually considering the question.
Now we know the right question to ask!)



There's also the case where it's framework code that handles the additions
rather than the parent device.  That would be typical for many bridge, hub,
or adapter type drivers ... you may be thinking mostly about drivers acting
as "leaf" nodes in the device tree, at least in terms of real hardware nodes.

Yes, "require that policy from such framework code too".  Just trying to be
sure the description doesn't have gaping holes in the middle.  :)

I can think of a bunch of serial busses where framework code has that sort
of responsiblity.  USB, SPI, I2C ... "legacy" I2C drivers would all need
to be taught not to create/remove children during those intervals, lacking
"new style" conversion (which makes them work more like normal drivers).

- Dave

--
To: David Brownell <david-b@...>
Cc: Rafael J. Wysocki <rjw@...>, Alan Stern <stern@...>, <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, February 26, 2008 - 4:25 am

Hardware can be inserted and removed while we're in a suspend state; and
there's nothing that we can do about it until we resume.  Is it fair to
say, then, that having started suspend, we could reasonably ignore any
device insertion and removal, and handle it on resume?  Presumably we
need to scan for hardware changes on resume.
--
To: David Newall <davidn@...>
Cc: Rafael J. Wysocki <rjw@...>, Alan Stern <stern@...>, <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, February 26, 2008 - 5:16 am

"Ignore" seems a bit strong; those events may be wakeup triggers,
which would cause the hardware to make it a very short suspend state.


Not on most busses I work with; the hardware issues notifications
whenever the devices are removable.

Which is a Good Thing ... scanning, as you suggest, is inherently
not reliable.  If a mouse is swapped, it likely doesn't matter a
lot whether it's the "same" or not.

But ... how about if some removable storage media was taken out,
updated on a different system, then restored before the resume?
Not many filesystems handle that sanely.  You *really* want to
have hardware which reports disconnect/reconnect events, or which
physically prevents them (locked case etc).

- Dave



--
To: David Brownell <david-b@...>
Cc: Rafael J. Wysocki <rjw@...>, Alan Stern <stern@...>, <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, February 26, 2008 - 9:36 am

Of course, "defer".  The insertion has to be handled eventually.  What

There's no notification while we're suspended.  Isn't it necessary to
scan all busses on resume, just to know what's on them?
--
To: David Newall <davidn@...>
Cc: David Brownell <david-b@...>, Rafael J. Wysocki <rjw@...>, <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, February 26, 2008 - 11:58 am

Certainly.  If hardware-change events can get lost because of the
system sleep, the resume method should make every effort to verify that 

It depends on the bus.  If the bus doesn't support hotplugging then 
scanning isn't necessary.  If the bus does support hotplugging then 
scanning after suspend may or may not be necessary, depending on 
whether or not the bus controller remained powered during the suspend.  
For hotpluggable buses, scanning after hibernation is always necessary.

Alan Stern

--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Monday, February 25, 2008 - 7:37 pm

Not only that, the new children will exist temporarily in a state where 
their parent is suspended and they are not.  Clearly we cannot allow 

This is a new requirement.  We didn't have to worry about it with the 

At that stage there isn't any real need to hold the semaphores.  And it 
complicates unregistration, which should always be allowed.  At the 


Yes.  We could keep it, but not acquire it until after all devices have 

Exactly; this has to be added to the PM documentation.

The difficulty arises only for drivers that support hotplugging, that 
is, detecting and registering children from somewhere other than their 


Do they need to register new CPUs at some point?  There ought to be a 
way to handle that.

Alan Stern

--
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Monday, February 25, 2008 - 8:07 pm

Certainly.  That's why I've been saying that it's not really a simple task to





But we should not leave a window between releasing dpm_list_mtx and taking
pm_sleep_rwsem.  Either that, or we should make sure that dpm_active is

No, they don't, but there are some CPU-related device objects that get
uregistered/registered.  Still, all of this work is really redundant if the CPU
in question comes back up during the resume, so it should be avoided in
general.  The CPU hotplug notifiers should only unregister those objects if
the CPU hasn't gone on line during the resume and they have all information
necessary for discovering that.

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, February 26, 2008 - 11:49 am

I've got some ideas on how to implement this.

We can add a new field "suspend_called" to dev-&gt;power.  It would be
owned by the PM core (protect by dpm_list_mtx) and read-only to
drivers.  Normally it will contain 0, but when the suspend method is
running we set it to SUSPEND_RUNNING and when the method returns
successfully we set it to SUSPEND_DONE.  Before calling the resume
method we set it back to 0.  Drivers can use this field as an easy way
of checking that all the child devices have been suspended.

When a new device is registered we check its parent's suspend_called
value.  If it is SUSPEND_DONE then the caller has a bug and we have to
fail the registration.  If it is SUSPEND_RUNNING then the registration
is legal, but we remember what happened.  Then when the
currently-running suspend method returns and we reacquire the
dpm_list_mtx, we will realize that a race was lost.  If the method
completed successfully (which it shouldn't) we can resume that device
immediately without ever taking it off the dpm_active list; but either
way we should continue the suspend loop.  Now the new child will be at
the end of the dpm_active_list, so it will be suspended before the
parent is reached again.

This way we can recover from drivers that are willing to suspend their 
device even though there are unsuspended children.  The only drawback 
will be that for a short time the child will be active while its parent 
is suspended.

We should not abort the entire sleep transition simply because we lost 
a race.  With this scheme we won't even need the pm_sleep_rwsem; the 
dpm_list_mtx will provide all the necessary protection.

This is more intricate than it should be.  It would have been better to
have had "disable_new_children" and "enable_new_children" methods from
the beginning; then there wouldn't be any races at all.  That's life...

The one tricky thing to watch out for is when a suspend or resume 
method wants to unregister the device being suspended or resumed.  Even 
that ...
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Tuesday, February 26, 2008 - 7:17 pm

I'd call it "sleeping" or something like this, for it will also be used by

Why before?  I'd think that any non-suspended children should not be visible

This seems to require some trickery.  Namely, device_add() will notice that
the registration is done concurrently with the running -&gt;suspend() of the
parent and will have to communicate that to dpm_suspend() which is supposed



I don't agree here.  If we require drivers to prevent such races from happening
and they don't comply, we can give up instead of trying to work around the

That can't happen, because dev-&gt;sem is taken by suspend_device() and

I'm still thinking that registering while the parent is suspending should not

The direction seems to be fine, but the details need a bit more clarification,
as far as I'm concerned.  Having a patch to discuss will certainly help a lot,
though. ;-)

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, February 27, 2008 - 12:03 pm

The name refers to the "suspend" method, not the type of sleep being
carried out.  We use the same method for both suspend and hibernation.

All right, we can set it to RESUME_RUNNING before calling the resume
method and then set it to 0 afterwards.  The point is that the value
shouldn't remain SUSPEND_DONE while resume runs, because it should be

It will get noticed in device_pm_add() while holding dpm_list_mtx.  
The information can be stored in a static private flag


Sure.  But it won't be the PM core's problem; it will be a bug in the
bus's driver.  We will print a warning in the log so the bug can be 

You misunderstand.  We can't require drivers to prevent these races 
entirely.  As an example, a properly-written, compliant driver might 
work like this:

	Task 0				Task 1
	------				------
	dev-&gt;power.sleeping =
	  SUSPEND_RUNNING;
	Call (drv-&gt;suspend)(dev)
					Register a child below dev
	suspend method prevents new
	  child registrations
	suspend method waits for
	  existing registration to
	  finish
					Check dev-&gt;power.sleeping and set
					  child_added_while_parent_suspends
					Registration completes successfully
	suspend method sees there is
	  an unsuspended child and
	  returns -EBUSY

	Check child_added_while_parent_suspends
	  and realize that we lost the race

There's nothing illegal about this; it's just an accident of timing.  
Nothing has gone wrong and we shouldn't abort the sleep.  We should
continue where we left off, by suspending the new child and then trying

We'll have to fix device_del() to prevent that from happening.  Your 

Unfortunately the lack of "prevent_new_children" and 
"allow_new_children" methods gives us no choice.  The example above 
shows why.

Alan Stern

--
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, February 27, 2008 - 3:50 pm

I'm not sure.  The core moves the device to dpm_active only after -&gt;resume()
has run.  Thus, if -&gt;resume() registers new children, the ordering of



I'm not sure if we need to do it.  It's always been like this, so the current
drivers' -&gt;suspend() and -&gt;resume() don't unregister the device they're called

Yes, it does.

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Thursday, February 28, 2008 - 6:49 pm

Rafael:

Here's my patch.  It doesn't include the timers for deadlock debugging, 
but it does include all the other stuff we've been talking about.  My 
base probably isn't quite in sync with yours, so this may not apply 
cleanly on your system.  But the divergences should be small.

Incidentally, there seemed to be a bug in your dpm_suspend() -- the 
dpm_list_mtx needs to be reacquired before the error checking.  This
patch fixes that.  It also removes pm_sleep_rwsem, which isn't used 
any more.

We should think about device_pm_schedule_removal().  It won't work
right if a suspend method calls it for the device being suspended, 
because the device gets moved to the dpm_off list after the method 
runs.

Alan Stern



Index: usb-2.6/Documentation/power/devices.txt
===================================================================
--- usb-2.6.orig/Documentation/power/devices.txt
+++ usb-2.6/Documentation/power/devices.txt
@@ -192,11 +192,27 @@ used to resume those devices.
 
 The ordering of the device tree is defined by the order in which devices
 get registered:  a child can never be registered, probed or resumed before
-its parent; and can't be removed or suspended after that parent.
+its parent; and can't be removed or suspended after that parent.  This
+means that special care is needed when calling device_move(); currently
+the kernel does not adjust its suspend and resume ordering to match the
+new device tree.
 
 The policy is that the device tree should match hardware bus topology.
 (Or at least the control bus, for devices which use multiple busses.)
 
+There is an unavoidable race between the PM core suspending all the
+children of a device and the device's driver registering new children.
+As a result, it is possible that the core may try to suspend a device
+without first having suspended all of the device's children.  Drivers
+must check for this; a suspend method should return -EBUSY if there are
+unsuspended children.  (The child-&gt;power.sleeping field c...
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 29, 2008 - 10:26 am

In fact 'sleeping' doesn't look good in this context.  'pm_state' seems
better to me (although it is confusingly similar to 'power_state', we're going


Hmm.

Suppose we add a mutex to dev_pm_info, say pm_mtx, and require it to be:
(1) taken by suspend_device(dev) (at the beginning)
(2) released by resume_device(dev) (at the end)
(3) taken (and released) by device_pm_add() if dev is the parent of the device
    being added.

In that case, device_pm_add() will block on attepmpts to register devices whose
parents are suspended (or suspending) and we're done.  At least so it would


This seems to be a weak spot.  The resuming of the device at this point need
not work correctly, given that the system's target state is still a sleep

Well, I wish it could be simpler ...

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 29, 2008 - 11:53 am

How about instead: "attempts to register a child device below a 




It gets used in only a couple of places, so nobody should object too 

No; this would repeat the same mistake we were struggling with last
week.  Blocking registration attempts (especially if we start _before_
calling the device's suspend method or end _after_ calling the resume
method) will lead to deadlocks while suspending or resuming.

If the blocking starts after the suspend method returns then it will be 
safer.  But what's the point?  If a registration attempt is made at 
that stage, it means there's a bug in the driver.  So failing the 
registration seems like a reasonable thing to do.

One issue: This rule doesn't allow suspend_late or resume_early methods
to register any new devices.  Will that cause problems with the CPU
hotplug or ACPI subsystems?  ACPI in particular may need to freeze the
kacpi_notify workqueue -- in fact, that might solve the problem in

I agree.  But there aren't any -EPOWER* or -EPM* error codes defined!  

That may be true.  But this is an error-recovery path intended to work
around a driver bug.  It doesn't have to guarantee perfect operation, 
just do its best.

Remember too that the target state is set before any devices are
suspended.  Hence, after the state is set there may still be runtime
resumes taking place.  Those _must_ not fail, which means that this

Me too.  But until the API is changed, this seems to be the best we can 
do.  It's not quite as bad as it looks, since a fair amount of the new 
code is just for reporting on and recovering from bugs in drivers.

Alan Stern

--
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 29, 2008 - 1:02 pm

Not exactly, because in that case we blocked all attempts to register devices,
while I think we can only block those regarding the children of a suspending

I'm not really convinced that it'll happen.  If we make the rule that
registering children from the device's own -&gt;suspend() method is forbidden
(I don't really see why it should be allowed), something like this might work
(it seems to me):

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -186,6 +186,7 @@ struct dev_pm_info {
 #ifdef	CONFIG_PM_SLEEP
 	unsigned		should_wakeup:1;
 	struct list_head	entry;
+	struct mutex		lock;
 #endif
 };
 
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -67,9 +67,14 @@ void device_pm_add(struct device *dev)
 	pr_debug("PM: Adding info for %s:%s\n",
 		 dev-&gt;bus ? dev-&gt;bus-&gt;name : "No Bus",
 		 kobject_name(&amp;dev-&gt;kobj));
+	if (dev-&gt;parent)
+		mutex_lock(&amp;dev-&gt;parent.power.lock);
 	mutex_lock(&amp;dpm_list_mtx);
 	list_add_tail(&amp;dev-&gt;power.entry, &amp;dpm_active);
+	mutex_init(&amp;dev-&gt;power.lock);
 	mutex_unlock(&amp;dpm_list_mtx);
+	if (dev-&gt;parent)
+		mutex_unlock(&amp;dev-&gt;parent.power.lock);
 }
 
 /**
@@ -249,6 +254,7 @@ static void dpm_resume(void)
 		list_move_tail(entry, &amp;dpm_active);
 		mutex_unlock(&amp;dpm_list_mtx);
 		resume_device(dev);
+		mutex_unlock(&amp;dev-&gt;power.lock);
 		mutex_lock(&amp;dpm_list_mtx);
 	}
 	mutex_unlock(&amp;dpm_list_mtx);
@@ -427,6 +433,13 @@ static int dpm_suspend(pm_message_t stat
 		struct device *dev = to_device(entry);
 
 		mutex_unlock(&amp;dpm_list_mtx);
+		mutex_lock(&amp;dev-&gt;power.lock);
+		mutex_lock(&amp;dpm_list_mtx);
+		if (dev != to_device(dpm_active.prev)) {
+			mutex_unlock(&amp;dev-&gt...
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 29, 2008 - 2:42 pm

It's the "suspending" case that causes trouble.  Go back and look at 
the race I diagrammed in


It _did_ happen.  Precisely this race occurred in Bug #10030 (the SD
card insertion/removal), although there the window was bigger because
we blocked registrations starting from the start of the system sleep

It's not that the suspend method itself will want to register children.  
The problem is that the method has to wait for other threads that may 
already have started to register a child.  If those other threads are 

This looks pretty awkward.  Won't it cause lockdep to complain about

It buys us one thing: The system will continue to limp along instead of 
locking up.

If drivers don't check whether registration succeeded...  What can I 

During suspend_late and resume_early _every_ device is suspended, 
including the fictitious "device-tree-root" device.  Hence _every_ 
registration is for a child of a suspended device.

Besides, you don't want to allow new devices to be registered during 
suspend_late, do you?  They wouldn't get suspended before the system 

Right now that may be the easiest solution.  In fact, it may still be 

I'm not sure I understand.  Sure, autoresume may not involve calling 
the driver's resume method.  But it does involve actually setting the 

Are you still considering adding separate methods for suspend and 
hibernate (maybe also for freeze and prethaw)?  Perhaps the 
"prevent_new_children" and "allow_new_children" methods could be added 
then.  This would allow some of this complication to go away.

Alan Stern

--
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 29, 2008 - 5:57 pm

I'm still not sure if this particular race would happen if only the registering

Why would it?  It's not taken recursively at any place.

In fact we can't take dev-&gt;power.lock under dpm_list_mtx, because that would
deadlock (we have to be able to take dpm_list_mtx to complete the suspending
of devices).  For this reason, dpm_list_mtx is taken under dev-&gt;power.lock.
However, we don't know which lock to acquire (dev is unknown) until we get the
next device to suspend.  Thus, I first get the device (under dpm_list_mtx) and
release dpm_list_mtx.  Next, I take dev-&gt;power.lock and dpm_list_mtx under it
and check if the selected device is still the last on the list.  If it's not,
something has been registered in the meantime and it's better to get the new
device to suspend first (this seems to be cleaner than resuming an already
suspended device).

I could release dev-&gt;power.lock as soon as suspend_device(dev, state) has run,
but that could result in new devices being added to dpm_active in a wrong
order, so it's better not to release it until resume_device(dev).  However,
it's probably is safe to release it right prior to running resume_device(dev)
(after we've added dev back to dpm_active), because in that case the ordering

It may oops, though, if a driver attempts to use a device that it failed to


Well, people want to remove the freezing of tasks altogether from the suspend

In fact, that's the matter of how we are going to handle the runtime PM vs


I wonder how that would be different from using dev-&gt;power.lock for blocking
the registration of new children.  The only practical difference I see is that
the driver will have to block the registrations, this way or another, instead
of the core.

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 29, 2008 - 6:46 pm

That's different.  Before you were talking about acquiring
dev-&gt;power.lock _before_ calling the suspend method.  Now you're
talking about blocking child registration _after_ the parent is already
suspended.

It might work if you did it that way.  In theory it _should_ work,
since nobody should ever try to register a child below a suspended
parent.

Given that this is merely a way of preventing something which should
never happen in the first place, is it really necessary to add the
extra lock?  Certainly it's simpler just to fail the registration.  If
it turns out later that we'd be better off blocking it instead, we

It is as far as lockdep is concerned.  You acquire power.lock for the 
first device, then you acquire it for the second device.  Lockdep 
doesn't know the two devices are different; all it knows is that you 
have tried to acquire a lock while already holding an instance of that 
same lock.  It's the same problem that affects attempts to convert 
dev-&gt;sem to a mutex.

As for the ordering of the lock and moving the device to dpm_off -- 
it's less of a problem if you don't acquire the lock until after the 
suspend method returns.  You can lock it just before reacquiring 

Which is better, an oops or a hang?  As far as the user is concerned, 
either one is useless.  For kernel developers, an oops is easier to 
debug.

In the end we should just try it and see what happens.  I don't think 
we can decide which will work out better without some real-world 

That's not what I mean.  In the long run it will turn out that certain
kernel threads _want_ to be frozen.  That is, if allowed to run during
a system sleep transition they would mess things up, and their
subsystem is designed so that it can carry out a sleep transition
perfectly well without the thread running.  (An example of such a
thread is khubd.)

To accomodate these threads we can freeze them -- that's easy since the
freezer already exists.  Or we can remove the freezer and provide a new
way for the...
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Friday, February 29, 2008 - 8:13 pm

Hm, perhaps I don't understand the issue correctly, so please let me restate it.

We have the design issue that it's possible to register a child of a device
that was taken for suspend or even suspended which, in turn, leads to a wrong
ordering of devices on dpm_active.  Namely, suppose that device dev is taken
from the end of dpm_active and suspend_device(dev, state) is going to be called
If at the same time a child of dev is being registered and
suspend_device(dev, state) returns 0, dev will be taken out of dpm_active, but
the new device (its child) will be added to the end of dpm_active and
subsequently we'll attempt to suspend it.  That will be wrong.

Also, if a dev's child is registered after suspending all devices, it will go
to the end of dpm_active, so after the subsequent resume dev will end up
closer to the end of the list than the new child.  Thus, during the next
suspend it will be taken for suspending before this child and that will be
wrong either.  Note that this situation need not look dangerously from the
driver's perspective, since it doesn't know of the dpm_active ordering issue.

One of the possible solutions is to require suspend_device(dev, state) to
return an error if it detects a concurrent child registration.   This, however
is not sufficient, because the registration of a child may go unseen, right
after suspend_device(dev, state) returns and before dpm_list_mtx is reacquired.
For this reason, we need an additional mechanism to detect such situations
and work around them.  [Hence, the question arises whether it's really
necessary to require suspend_device(dev, state) to detect concurrent
registrations of children (we need to detect them independently anyway) etc.]
There also would have to be a mechanism for detecting registrations when
all devices have been suspended (we discussed that approach previously).

The other possible solution, and that's the one I'm considering, is to use
locking to prevent the registration of children of suspended or suspending...
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Saturday, March 1, 2008 - 11:30 am

Yes, but it's still wrong.  It's also wrong to register a new device 
(even one that has no parent) during the suspend_late stage, because 

It's worse than you describe, because suspend_device() is unable to
tell whether a concurrent child registration is valid or invalid.  By
"valid", I mean that it took place during the window before the suspend
method was called or before the method was able to prevent new child
registrations.

Valid child registrations must be allowed to proceed.  There's no
reason to fail them, because the driver has done nothing wrong -- and
it wouldn't be safe since drivers don't check for failure.  Likewise,
blocking them is likely to cause the suspend method to deadlock; see 
below.

But since suspend_device() can't tell which concurrent child
registrations are valid, it has no choice but to allow all of them.  
This means our only option for recovering from a concurrent child
registration is to resume the parent (don't move it to dpm_off) and 
continue.  That way the list ordering will remain correct.

Once the parent is fully suspended, suspend_device() has returned, and
the parent has been moved to dpm_off, the situation is different.  
Then we _know_ that child registrations are invalid, so either blocking

In view of my comments above, we _must not_ prevent registration of
children of suspending devices.  It's okay to prevent registration of
suspended devices.  There's nothing wrong with using a lock for this
purpose, but the lock should not be acquired until after
suspend_device() has returned and we have verified that no children

Consider a situation where a kernel thread is used by a driver for 
several purposes, including registering new children.  The suspend 
method will have to synchronize with the thread for obvious reasons: 
you don't want the thread to be using the device at the same time as 
the device is being suspended.

A typical synchronization approach is for the suspend method to wait
until the other thread is idle (the so...
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Sunday, March 2, 2008 - 9:37 am

If new children get registered when the parent is suspended, that's already
wrong, because the children should have been suspended before.  Think of a case
when the parent is a bus and the children are devices on it.  In that case, by
suspending the parent before the children we can make the children

In fact we don't.  We only know that suspend_device() has won the race with

I don't agree with that (not only because the last sentence is oversimplified ;-)).

I think that the rule "the driver must not register new children after
-&gt;suspend() has run" is not a good one, because in fact we don't want
-&gt;suspend() to be called while new children are being registered.  IMO, we
should make the rule that "device registration may fail if it's carried out
concurrently with the parent's -&gt;suspend() method".  At least, that will tell


I agree it's not a good idea to hold the locks throughout the entire cycle,
but that can be overcome if we use an additional variable under

Still, I think we can fail them.

In fact, drivers _should_ check for device_add() failures and if they don't,

Well, I think it's not correct to allow the parent to suspend with active (not

I agree.  What I was trying to say is that the core should protect itself, to
the maximum reasonable extent possible, from suspending a parent before the

On the systems I'm talking about there are some devices referred to by the
ACPI _PTS method, so they must be on line whey this method is being executed.
Since we don't know a priori what devices they are, we must put all devices on

Well, that's correct.  However, if we make the rule that device_add() may fail
if it's run concurrently with the parent's -&gt;suspend(), the changes of the
drivers need not be substantial (they should check for the failures of
device_add() anyway).

[BTW, there is a bug in the error path of device_add() for which I'm going to
send a fix later today.  Namely, in case of a BusError, dpm_sysfs_remove()
is executed after device_pm_remo...
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Sunday, March 2, 2008 - 12:22 pm

Agreed.  That's why I've been saying all along that once the parent has
been moved to dpm_off, we should either block or fail child
registrations.  Drivers trying to register a child at such times are 
clearly buggy anyway.

What we are really trying to agree on is how the PM core should handle 
child registrations just before and while the parent is suspending.  
Drivers trying to register a child at such times need not be buggy at 
all; they may simply have lost a race.

I agree that the core needs to protect itself, but I also think that
drivers should need minimal changes -- preferably none.  If the core
becomes moderately more complicated as a tradeoff for keeping drivers a
little simpler, then IMO it's a win since there are lots and lots of

But you do agree that "drivers must not register new children after 
-&gt;suspend() has run" is correct, right?  You just don't think it goes 

In doing this you are putting a tremendous extra burden on drivers.  
You force _them_ to handle registration failure caused by an impending
suspend, something the driver has no way to know about beforehand!

In effect, you are trying to take the extra complication my patch adds
to the PM core and instead add that complication to _every_

The new patch is no better.  If a driver tries to create a new child
just before its suspend method is called, the registration will fail
with -EBUSY for no reason the driver is aware of.  So now the driver
has to detect the failure (which many drivers don't do!) and figure out

Of course they should check.  But they shouldn't have to deal with 
failures that need to be retried for no good reason.

In the long run, I still believe the best approach is to tell drivers
beforehand they should stop registering children.  That will make both
of us happy: The PM core can fail all later registration attempts with
no qualms, and drivers won't have to worry about unexpected failures.  

How many subsystems register new children at arbitrary times?  The

And I th...
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Sunday, March 2, 2008 - 3:11 pm

That's not enough, though.  As soon as the state of the parent has changed,
it's incorrect to register any new children.  Moving the parent to dpm_off is


Would you agree, however, that the driver should be prepared for its
-&gt;resume() being called right after -&gt;suspend() and the -&gt;suspend() repeated


As I said above, I think that resuming the parent in case of a concurrent
child registration is not a trivial modification.  It may seem trivial, but

That's correct.  However, if such a child registration happens with the code
we currently have, there will be a problem, so the driver doing that may be
considered as buggy _right_ _now_.  Thus, in fact, we need not worry about
any existing drivers and we may require future drivers to follow the additional



IMO they are things done at a wrong time.  What we're talking about is a driver
trying to ignore the fact that it may be suspended and do all things as though
that's never going to happen.  In fact, if you use separate threads for the
registration of children etc., you should have implement a notification
mechanism that will let you know when a suspend is going to occur.  Otherwise,
you do things in a racy way and expecting that they'll never fail is just

I'm fine with that, although I'd call the new methods -&gt;begin() and -&gt;end() or
something like this, since they may generally be used for other purposes



On the systems in question (some NVidia chipsets for K8) USB controllers have
to be in the full power state before executing _PTS.

IMO these systems are just badly designed and we can blacklist or something
 like this.  It also is against the ACPI 2.0 spec, although it's compliant

Then, if a child registration occurs while suspend_device(parent) is being
run, dpm_active will be in a wrong order, unless suspend_device(parent) returns


Well, we can add new callbacks for notifying drivers of an impending suspend.
In that case, say we add a -&gt;begin() callback for this purpose (in fact that
would be...
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Sunday, March 2, 2008 - 11:54 pm

This is now a moot point, but I'll answer it anyway...

Drivers should not depend on any particular time interval between their 
suspend and resume methods being called.  It should be okay to call one 
and then the other a microsecond later, or 100 days later.  The driver 

"begin_sleep" and "end_sleep" are less ambiguous.

I was just thinking the same thing.  For instance, plenty of drivers
register children as part of their probe method, so the begin_sleep
method should prevent subsystems from probing the device.  (Not to
mention that it's kind of awkward for a driver to probe a suspended

After more thought, I'm not so sure about this.  It might be a good
idea to call the begin_sleep method just before the suspend method (or
any of its variants: freeze, hibernate, prethaw, etc.) and call the
end_sleep method just after the resume method.  This minimizes the time
drivers will spend in a peculiar non-hotplug-aware state, although it 
means that begin_sleep would have to be idempotent.

It also allows sophisticated drivers to do all their processing in the
begin_sleep (and end_sleep) method: both preventing new child
registrations and powering down the device.  At the moment I'm not sure
whether this would turn out to be a good strategy, but it might.

Alternatively, some subsystems might want to stop all child
registrations right at the beginning of the sleep transition.  This
would amount to blocking the kernel thread responsible for those
registrations when the sleep begins -- in other words, making that

This isn't quite so simple either.  A notifier isn't good enough
because it doesn't provide any synchronization.  On the other hand, how
many devices ever get registered without a parent?  Does this happen at
all after system startup?  Maybe when a loadable module for a new bus
type initializes...  We could block module loading at the start of a

Fortunately that won't cause any problems for some time to come.  
There are no current plans for doing runtime PM of PCI-base...
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Monday, March 3, 2008 - 12:32 pm

Well, I think there should be a window between -&gt;suspend_begin()
and -&gt;suspend() allowing the core to cancel the suspending of given
device and to select another one.  For example, if there's a child
registration concurrent wrt -&gt;suspend_begin() which completes after
-&gt;suspend_begin() has been called, but before -&gt;suspend_begin() has a chance
to block it, the core should not call -&gt;suspend() for the device, but select
another one (the child).

Of course, it won't be necessary if the -&gt;suspend_begin() methods are called

Well, I think the possibility to freeze kernel threads on demand may be
considered as one of suspend synchronization mechanisms available to drivers.


Yes, that should work.  I'll rework the patch along these lines (it starts to



I meant -&gt;hibernate_begin() as -&gt;freeze_begin() and I don't see any reason why
-&gt;prethaw_begin() would be different from it (the difference between -&gt;freeze()
and -&gt;prethaw() -- now called -&gt;quiesce(), btw -- is that the former saves


If such devices serve as logical parents of some "real" ones, we should
at least mark them as "sleeping" during suspend to protect the dpm_active
ordering.

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Monday, March 3, 2008 - 1:43 pm

With a sophisticated driver that would never happen, because after
blocking new child registrations, the driver would check that the
power.sleeping flag is set in all the children before powering down the 
device.  But like I said, I'm not sure if this would be a good 
strategy.

(This partly has to do with the requirements for runtime PM.  During a
runtime suspend the driver does have to check the children's status; it
can't rely on the PM core.  So if the check has to be done anyway, why
not also check during a system sleep?)

With non-sophisticated drivers, it definitely could happen that a new
child is registered concurrently with begin_sleep.  Then the core would
go back and suspend the child first, as you say.  Eventually the core
would return to the parent device, at which time it would call the
parent's begin_sleep method again -- unless we add another flag to


I had a change like that in my version of the patch.  It's excerpted 

AFAICS there needs to be only one begin_sleep method.  It should apply 
equally well to suspend, freeze, quiesce, and hibernate.  (But not to 

There's no need for that.  If it isn't implemented, treat it as though 
it was successful.

But if a driver implements suspend() and leaves begin_sleep() as just a 
stub (which many drivers might reasonably do, if they never register 
any children), then the core shouldn't require suspend() to succeed 
merely because begin_sleep() did.


They wouldn't have to be marked at all.  They would never get on any of 
the PM lists, because they would never be passed to device_pm_add().

The status of such devices is a little peculiar.  For example, consider 
the MMC subsystem.  There's an MMC controller, generally a platform 
device.  Its child is an mmc_host device, but the mmc_host methods are 
called by the platform device's methods -- there is no driver 
associated with an mmc_host.  At one time the mmc_host was a 
class_device; that's may explain in part why it works this way.

Alan Stern


Ind...
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Monday, March 3, 2008 - 4:47 pm

The _core_ needs the window, not the driver.  Even if the driver discovers that
there are active children, it can only fail -&gt;suspend(), in which the core will
faill the entire power state transition (unless we reserve an error code for

Hm, I wonder why you didn't move dpm_sysfs_add() along with device_pm_add()?

Perhaps it's better to include dpm_sysfs_add() into device_pm_add(), since we

It definitely would be simpler to assume so and introduce just one common

Well, I'm not sure.  Right now we have a problem with distinguishing drivers
that don't implement -&gt;suspend() purposefully from the ones that just don't
support suspend/hibernation ...

OTOH, since we are going to have a pointer to 'struct pm_ops', we can safely


But dev-&gt;parent will be not NULL for their children being on dpm_active ...

IMO it's simpler to just add those devices without any suspend callbacks
defined to dpm_active and handle them normally than to introduce a special
case.

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Monday, March 3, 2008 - 6:48 pm

Let's stick with that for now.

What about on the resume side?  Do you want to prevent child
registrations until after end_sleep has run?  Or would you be okay with
allowing the resume method to register new children?  It should be safe
to assume that drivers are smart enough to bring the device back to
full power before looking for new children.

In fact, maybe we don't need an end_sleep method at all.  There isn't
any synchronization issue to worry about -- drivers aren't going to



That's reasonable.  If a driver doesn't support PM then it won't have 

Yeah, I'm just trying to think too far ahead.

Alan Stern

--
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Monday, March 3, 2008 - 6:56 pm

I think -&gt;resume() can register new children.  There's nothing wrong with



Okay, I'll prepare a patch for that, on top of the one introducing the

Exactly.

The question remains what we're going to do with the drivers without pm_ops
pointers in the long run (in the short run we will use the legacy callbacks in
that cases, if defined).

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Monday, March 3, 2008 - 7:12 pm

While you're at it, could you add a field to indicate whether 
begin_sleep() has been called?  It would help prevent multiple calls to 
that method when a race does occur, and it could be useful for drivers 

One possibility is to unbind those drivers at the start of a sleep
transition and reprobe them at the end.  Another possibility is to
ignore the lack of PM support and hope it doesn't cause any problems.

Alan Stern

--
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Monday, March 3, 2008 - 7:18 pm

That will be added along with the new callbacks.

I'd prefer to do all that in separate patches, so that it's clear what issue is


That would be easy, but that's what we do now and there are problems with it.

Thanks,
Rafael
--
To: Alan Stern <stern@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Thursday, February 28, 2008 - 8:01 pm

No big deal.  In fact we're working with Alex on a patch that modifies main.c

Oh, yes.  Will you please send a fix when

Yeah.  I left pm_sleep_rwsem, because I wasn't sure it would be necessary

I think we should remove device_pm_schedule_removal() early in the 2.6.26
cycle.

I'll review the patch more thoroughly tomorrow, since I have some new material
for the regressions list I need to take care of.

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linux-pm mailing list <linux-pm@...>, Kernel development list <linux-kernel@...>
Date: Wednesday, February 27, 2008 - 4:15 pm

Actually the move is done before the method is called.  So this isn't a
problem.


All right, if it doesn't happen now then we don't need to allow for it.  
That makes life a little simpler.

Alan Stern

--
Previous thread: [patch 3/6] mempolicy: add MPOL_F_STATIC_NODES flag by David Rientjes on Monday, February 25, 2008 - 11:35 am. (35 messages)

Next thread: [git pull] scheduler fixes by Ingo Molnar on Monday, February 25, 2008 - 11:45 am. (1 message)
speck-geostationary