Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 2)

Previous thread: How to avoid spurious lockdep warnings? by Jeremy Fitzhardinge on Thursday, March 20, 2008 - 4:02 pm. (5 messages)

Next thread: [PATCH] cpufreq: Expose cpufreq coordination requirements regardless of coordination mechanism by Darrick J. Wong on Thursday, March 20, 2008 - 5:43 pm. (3 messages)
From: Rafael J. Wysocki
Date: Thursday, March 20, 2008 - 5:01 pm

Hi,

Below is an updated version of the $subject patch.  It has only been
compilation tested, but I'd like to get as much feedback as reasonably possible
at this stage to avoid redesigning things later in the process.

The most important changes from the previous one are the following:
* Callbacks to be executed with interrupts disabled are now separated from
  the "regular" ones.  There still are six of them, but IMHO this is what may
  be necessary in the most general case (eg. a driver may have to carry out
  some operations with interrupts disabled, which are different for SUSPEND,
  FREEZE and HIBERNATE; the complementary "resume" callbacks are needed for
  error recovery, for example)
* It occured to me that the ->freeze() and ->quiesce() callbacks were
  essentially the same, so I removed the latter
* The ->recover() callback was also dropped, as ->thaw() may well be used instead
  just fine (it seems)
* ->prepare() and ->complete() are now called in separate loops which causes
  some funny complications with error recovery.  Namely, we must remember which
  devices have already been ->prepare()d, so that we call ->complete() for them
  in the error path.  Moreover, there may be some ->prepare()d and some
  ->suspend()ed devices at the same time, so if there's an error we should
  ->resume() the ->suspend()ed ones and ->complete() everything etc.
  This may be handled in many different ways, but I decided to introduce a new
  list on which the ->complete()d devices are stored.  Accordingly, the
  registration of new children of a device is blocked between ->prepare() and
  ->complete() (it was only blocked between ->prepare() and ->resume() in the
  previous version).

Please have a look.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>

Introduce 'struct pm_ops' and 'struct pm_noirq_ops' representing
suspend and hibernation operations for bus types, device classes and
device types.

Modify the PM core to use 'struct pm_ops' and 'struct ...
From: Johannes Berg
Date: Thursday, March 20, 2008 - 5:21 pm

Why bother and not just make them return void, the error printing can
most likely be done much much better in the callback since that possibly

Same here, of course.

johannes
From: Rafael J. Wysocki
Date: Thursday, March 20, 2008 - 5:26 pm

The errors returned are used by TRACE_RESUME(), for example.

Well, perhaps the comments should be changed instead. :-)

Thanks,
Rafael
--

From: Oliver Neukum
Date: Tuesday, March 25, 2008 - 2:49 am

A device that cannot wake up is unusable. Shouldn't the pm core disconnect()
such a device?

	Regards
		Oliver

--

From: Rafael J. Wysocki
Date: Tuesday, March 25, 2008 - 6:06 am

Well, if ->resume() returns an error, the driver already knows there's a
problem and it can act upon that, at least in principle.

However, the PM core probably shouldn't try to resume the children of a failing
device.  Also, if ->resume_noirq() fails, it probably is not a good idea to
call ->resume() and ->complete() for the same device and for it's children.

Thanks,
Rafael
--

From: Oliver Neukum
Date: Tuesday, March 25, 2008 - 6:15 am

Then why return an error? If a driver returns an error I would assume that

Exactly. But we need to define what happens in these cases. If we simply
ignore the errors, the drivers must be able to deal with IO to half suspended
devices.

	Regards
		Oliver


--

From: Alan Stern
Date: Tuesday, March 25, 2008 - 7:19 am

It's not safe for the PM core to do such things unilaterally.  The 
decision to unregister a device should be made by the driver or the 
subsystem.

(The only problem is that it's impossible to unregister a device from 
within its suspend or resume methods.  Perhaps there should be a way 
for the driver to tell the PM core that the core should unregister the 
device as soon as the method returns.  I don't know if such a facility 

The PM core prints a warning in the system log whenever an error is 

Just so -- it's up to the drivers to deal with this sort of thing.  The 
PM core can't know the details of what should be done in each case.

For example, if a USB hub can't be resumed then usbcore marks all of
its descendants with USB_STATE_NOTATTACHED.  When the children's resume
methods are called, they return without trying to do anything.  Later
on khubd takes care of unregistering everything.

Alan Stern

--

From: Oliver Neukum
Date: Tuesday, March 25, 2008 - 7:24 am

Why? You can trigger it from user space via sysfs and in many cases
suspending to disk will disconnect all devices on a bus, so I'd say a
failure to resume is just a limited subcase of a device vanishing during
sleep.

	Regards
		Oliver

--

From: Alan Stern
Date: Tuesday, March 25, 2008 - 7:33 am

But these disconnects aren't done by the PM core; they are done by 

I'll go along with that.  If a device vanishes during sleep, the PM 
core isn't responsible for unregistering it -- the device's subsystem 
is.

Alan Stern

--

From: Oliver Neukum
Date: Tuesday, March 25, 2008 - 12:48 pm

Yes, that makes sense. You are right.

	Regards
		Oliver

--

From: Rafael J. Wysocki
Date: Tuesday, March 25, 2008 - 1:41 pm

Still, if ->resume() returns an error, does it make sense, from the PM core's
point of view, to execute ->complete() for that device, for example?

If you think it does, that behavior should be clearly documented (I didn't
think about that before).

Thanks,
Rafael
--

From: Oliver Neukum
Date: Tuesday, March 25, 2008 - 1:49 pm

IMO you must always keep the ordering invariant. If a parent returns an error
the PM core must not wake its children.

	Regards
		Oliver

--

From: Rafael J. Wysocki
Date: Tuesday, March 25, 2008 - 1:56 pm

I'm agreeing here, but one of the previous Alan's comments suggests he has a
differing opinion.  Alan?

I'm considering to make the PM core skip the resuming of the children of
devices that failed to resume and skip calling ->complete() for that devices
and their children.

Thanks,
Rafael
--

From: Alan Stern
Date: Wednesday, March 26, 2008 - 7:10 am

Don't think of it that way.  The PM core doesn't wake anything.  It
merely notifies drivers that the system sleep is ending, so that the
drivers can wake their devices.  It's up to the driver to detect
whether the parent failed to resume, in which case the driver should
take appropriate action.

The situation is no different from what happens when the user tries to 
access a mounted USB disk drive after the USB cable has been unplugged.  

While that might in principle be a reasonable thing to do, it's 
different from how the PM core has behaved in the past.  I don't see 
any point in making a change like that now.

Alan Stern

--

From: Oliver Neukum
Date: Wednesday, March 26, 2008 - 7:24 am

How do you propose that every driver should check the power state

That completely throws away the reason to have a PM core. We've made
a guarantee to drivers that they wil not be woken unless their parents are
awake. In fact the semantics of the callbacks are defined in a way that
adding devices to a parent can be enabled. You cannot add children to a
dead parent. It's the very reason for this rewrite.

	Regards
		Oliver
--

From: Alan Stern
Date: Wednesday, March 26, 2008 - 7:40 am

It depends entirely on the driver and subsystem; generalizations are 
not possible.  When appropriate, they can copy the scheme used by USB.  


No we haven't.  The guarantee is only that the PM core will call the 
parent's resume method before calling the child's method.  There is no 
guarantee about whether the method succeeds or the parent is awake.

Remember, the whole purpose of this is to let drivers know when the 
system is going to sleep or waking up.  Proper handling of devices is 

Depends what you mean.  In some cases it is possible -- i.e., you can
do it and the kernel won't crash, although the new children probably
won't be usable.  In any case it doesn't matter; the PM core doesn't

No it isn't.  See Rafael's changelog ocmments; the rewrite is being 
done for completely different reasons.

Alan Stern

--

From: Oliver Neukum
Date: Wednesday, March 26, 2008 - 8:42 am

Then declare these methods void. We cannot introduce methods that deliberately
ignore errors. Reporting is also better done in the drivers.

	Regards
		Oliver
--

From: Alan Stern
Date: Wednesday, March 26, 2008 - 9:36 am

That decision is up to Rafael.  Changing the methods to return void is 
okay with me.

Alan Stern

--

From: Rafael J. Wysocki
Date: Wednesday, March 26, 2008 - 1:46 pm

But that's not what they currently do, either.

If I change the methods to void and it turns out in the future that it's
better if they return error codes, it will be rather difficult to go back and
change everything.  For this reason, I'd prefer to retain the returning of
error codes.

What exactly do you whink would be wrong with using the error codes to avoid
resuming the children of devices that failed to resume?

Rafael
--

From: Alan Stern
Date: Wednesday, March 26, 2008 - 7:48 pm

Nothing, I guess.  Drivers should work okay either way.

Alan Stern

--

From: Alan Stern
Date: Thursday, March 20, 2008 - 6:21 pm

I don't have time to go through this in detail now, but a few things 
stand out.

One trivial problem is that your dpm_prepare and dpm_complete routines 
go through the list in the wrong order.

More importantly, the situation with prepare and complete is getting
rather complicated.  Now that you're adding dev->power.state, why not
go all the way?  Get rid of all the different lists, keeping just
dpm_active and possibly dpm_destroy.  Instead of moving devices between
different lists, just store in dev->power.state an identification for
which list the device is supposed to be on or is supposed to be moving
to.  (Or else have a bunch of bitfields indicating which methods have
been called.)  This has the advantage that the entries will never get
out of order, even if devices are registered at the wrong time.

There's some question about when we want the PM core to start
preventing new children from being registered.  Should this start right
after prepare() returns, or should it start right before suspend() is
called?  The first alternative sounds better to me.  Not only does it
agree with the purpose of the prepare method, it also makes race
situations easier to handle.

Since dpm_prepare is supposed to go through the list in the forward
direction, logically the "root" of the device tree should be the first
thing "prepared".  This means you should not allow parentless devices
to be registered any time after dpm_prepare has started.  Is that 
liable to cause problems?

There may be similar problems with complete().  A number of drivers
check during their resume method for the presence of new children
plugged in while the system was asleep.  All these drivers would have
to be converted over to the new scheme if they weren't permitted to
register the new children before complete() was called.  Of course,
this is easily fixed by permitting new children starting from when
resume() is called rather than when complete() is called.

Alan Stern

--

From: Rafael J. Wysocki
Date: Thursday, March 20, 2008 - 7:14 pm

I'm not all so sure that the order is actually wrong.  What would be the



I agree and that's implemented in the patch (ie. the registrations of new

I'm still not seeing the advantage of the forward direction in the first place.

Although I don't see what particular problems that may cause, I think the
current approach (first, block registrations of new children for each
->prepare()d device and finally block any registrations of new devices) is

Well, the problem here was the protection of the correct ordering of the
various lists.  However, if the approach with changing 'status' is adopted
instead, which seems to be better, we'll be able to unblock the registering
of new children before ->resume().

Thanks,
Rafael
--

From: Alan Stern
Date: Thursday, March 20, 2008 - 7:53 pm

The advantage is that races no longer matter.

Suppose you're going through the list backwards and a race occurs, so
that a child is registered while the parent's prepare() is running.  
That new child will of course be at the end of dpm_active, so the loop
in dpm_prepare won't see it (assuming you adopt the approach of using
only a single list).  But if you go through the list forwards and a new
child is added, you will naturally see the child when you reach the end
of the list.  You don't even have to go through the business about
changing the parent's state back to DPM_ON.

There are other ways of describing the situation, but they all come 
down to the same thing.  For instance, this is the way Ben was talking 

I suppose for parentless devices it doesn't really matter.  You could
safely allow them to be registered any time up until dpm_prepare()  
finishes -- which is what your patch does.  Perhaps the "all_inactive"

Yep.  The only thing to watch out for is in device_pm_remove(); it
would be a disaster if somehow a device was removed while it was being
prepared/suspended/resumed/completed/whatever.  I know that's not
supposed to happen but there's nothing to prevent it, especially if
the device in question doesn't have a driver.  No doubt you can invent
a way to allow this to happen safely.

Alan Stern

--

From: Rafael J. Wysocki
Date: Saturday, March 22, 2008 - 3:17 pm

Well, that's a separate issue that IMO should be addressed in a separate patch.
Something like the one below comes to mind.

The comment removed by the patch is wrong IMO, because it implies that
device_add() may be called with the device semaphore held and that might
deadlock in bus_attach_device().  Thus, I think we can acquire dev->sem
in device_pm_add() and in device_pm_remove().

Thanks,
Rafael

---
 drivers/base/power/main.c |   21 ++++++++++++++-------
 include/linux/pm.h        |    1 +
 2 files changed, 15 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -41,10 +41,6 @@
  * from the bottom (i.e., end) up and resumed in the opposite order.
  * That way no parent will be suspended while it still has an active
  * child.
- *
- * Since device_pm_add() may be called with a device semaphore held,
- * we must never try to acquire a device semaphore while holding
- * dpm_list_mutex.
  */
 
 LIST_HEAD(dpm_active);
@@ -68,6 +64,7 @@ int device_pm_add(struct device *dev)
 	pr_debug("PM: Adding info for %s:%s\n",
 		 dev->bus ? dev->bus->name : "No Bus",
 		 kobject_name(&dev->kobj));
+	down(&dev->sem);
 	mutex_lock(&dpm_list_mtx);
 	if ((dev->parent && dev->parent->power.sleeping) || all_sleeping) {
 		if (dev->parent->power.sleeping)
@@ -80,10 +77,13 @@ int device_pm_add(struct device *dev)
 		error = -EBUSY;
 	} else {
 		error = dpm_sysfs_add(dev);
-		if (!error)
+		if (!error) {
+			dev->power.present = true;
 			list_add_tail(&dev->power.entry, &dpm_active);
+		}
 	}
 	mutex_unlock(&dpm_list_mtx);
+	up(&dev->sem);
 	return error;
 }
 
@@ -98,10 +98,13 @@ void device_pm_remove(struct device *dev
 	pr_debug("PM: Removing info for %s:%s\n",
 		 dev->bus ? dev->bus->name : "No Bus",
 		 kobject_name(&dev->kobj));
+	down(&dev->sem);
 	mutex_lock(&dpm_list_mtx);
 ...
From: Alan Stern
Date: Saturday, March 22, 2008 - 4:28 pm

It isn't wrong.  device_add() may indeed be called with a device 
semaphore held -- just not the semaphore for the device being added.  
Quite often it is called with device's parent's semaphore held.  The 
implication is not that we may deadlock in bus_attach_device(); rather 
it is that the order of acquisition must always be device semaphore 

No, you have missed the entire point.  The problem doesn't exist in the
current code; it exists only if we switch over to using a single list.  
Routines like dpm_suspend() won't be able to use list_for_each_entry()
to traverse the list because entries may be removed by other threads
during the traversal.  Even list_for_each_entry_safe() won't work
correctly without careful attention to details.

Alan Stern

--

From: Rafael J. Wysocki
Date: Saturday, March 22, 2008 - 4:44 pm

Ah, ok.  Thanks for the clarification.

Doesn't it help that we traverse the list under dpm_list_mtx?  Anyone who
removes an entry is required to take dpm_list_mtx that we're holding while
the list is traversed except when the callbacks are invoked.

The only problem I see is when the device currently being handled is removed
from the list by a concurrent thread.  Is that you were referring to?

Rafael
--

From: Alan Stern
Date: Saturday, March 22, 2008 - 7:07 pm

It doesn't help.  What _does_ help is the fact that these traversals 
are all serialized (since only one thread can carry out a system sleep 

Yes, that is the problem.  If you try to work around it by using
list_for_each_entry_safe() then you run into a problem when a
concurrent thread removes the device _following_ the one being handled
(or when the device being handled is the last one on the list and a
concurrent thread registers a new device, which can only happen in
dpm_prepare()).

It's not hard to fix.  Just something to be aware of.

Alan Stern

P.S.: Oh yes, another related issue...  We should call get_device() and 
put_device() while holding dpm_list_mtx.  Otherwise the device 
structure might vanish when the callbacks are invoked.

--

From: Rafael J. Wysocki
Date: Sunday, March 23, 2008 - 11:41 am

Yes, I've almost finished a new patch taking that into account.  I'll send it

Good idea.

Thanks,
Rafael
--

From: Sam Ravnborg
Date: Friday, March 21, 2008 - 1:15 am

Hi Rafael.

Is it possible to extend this in some way so we avoid the
#ifdef stuff in the drivers?
We could introduce a few special sections that we discard if
PM is not in use.
We have a reliable build time infrastructure to detect
inconsistencies if needed.

Something like:
#define __suspend     __section(.suspend.text)
#define __suspenddata __section(.suspend.data)

#define __hibernate     __section(.hibernate.text)
#define __hibernatedata __section(.hibernate.data)

A few more tricks will be needed when we assign the functon pointers.
We have __devexit_p(*) and we may use something similar.

--

From: Rafael J. Wysocki
Date: Sunday, March 23, 2008 - 2:16 pm

Unfortunately, I have a little experience with linkers and I don't think I'll
be able to do anything like this in a reasonable time without any help.

Thanks,
Rafael
--

Previous thread: How to avoid spurious lockdep warnings? by Jeremy Fitzhardinge on Thursday, March 20, 2008 - 4:02 pm. (5 messages)<