Hi, Below is an updated version of the $subject patch. It has only been compilation tested, but I'd like to get as much feedback as reasonably possible at this stage to avoid redesigning things later in the process. The most important changes from the previous one are the following: * Callbacks to be executed with interrupts disabled are now separated from the "regular" ones. There still are six of them, but IMHO this is what may be necessary in the most general case (eg. a driver may have to carry out some operations with interrupts disabled, which are different for SUSPEND, FREEZE and HIBERNATE; the complementary "resume" callbacks are needed for error recovery, for example) * It occured to me that the ->freeze() and ->quiesce() callbacks were essentially the same, so I removed the latter * The ->recover() callback was also dropped, as ->thaw() may well be used instead just fine (it seems) * ->prepare() and ->complete() are now called in separate loops which causes some funny complications with error recovery. Namely, we must remember which devices have already been ->prepare()d, so that we call ->complete() for them in the error path. Moreover, there may be some ->prepare()d and some ->suspend()ed devices at the same time, so if there's an error we should ->resume() the ->suspend()ed ones and ->complete() everything etc. This may be handled in many different ways, but I decided to introduce a new list on which the ->complete()d devices are stored. Accordingly, the registration of new children of a device is blocked between ->prepare() and ->complete() (it was only blocked between ->prepare() and ->resume() in the previous version). Please have a look. Thanks, Rafael --- From: Rafael J. Wysocki <rjw@sisk.pl> Introduce 'struct pm_ops' and 'struct pm_noirq_ops' representing suspend and hibernation operations for bus types, device classes and device types. Modify the PM core to use 'struct pm_ops' and 'struct ...
Why bother and not just make them return void, the error printing can most likely be done much much better in the callback since that possibly Same here, of course. johannes
The errors returned are used by TRACE_RESUME(), for example. Well, perhaps the comments should be changed instead. :-) Thanks, Rafael --
A device that cannot wake up is unusable. Shouldn't the pm core disconnect() such a device? Regards Oliver --
Well, if ->resume() returns an error, the driver already knows there's a problem and it can act upon that, at least in principle. However, the PM core probably shouldn't try to resume the children of a failing device. Also, if ->resume_noirq() fails, it probably is not a good idea to call ->resume() and ->complete() for the same device and for it's children. Thanks, Rafael --
Then why return an error? If a driver returns an error I would assume that Exactly. But we need to define what happens in these cases. If we simply ignore the errors, the drivers must be able to deal with IO to half suspended devices. Regards Oliver --
It's not safe for the PM core to do such things unilaterally. The decision to unregister a device should be made by the driver or the subsystem. (The only problem is that it's impossible to unregister a device from within its suspend or resume methods. Perhaps there should be a way for the driver to tell the PM core that the core should unregister the device as soon as the method returns. I don't know if such a facility The PM core prints a warning in the system log whenever an error is Just so -- it's up to the drivers to deal with this sort of thing. The PM core can't know the details of what should be done in each case. For example, if a USB hub can't be resumed then usbcore marks all of its descendants with USB_STATE_NOTATTACHED. When the children's resume methods are called, they return without trying to do anything. Later on khubd takes care of unregistering everything. Alan Stern --
Why? You can trigger it from user space via sysfs and in many cases suspending to disk will disconnect all devices on a bus, so I'd say a failure to resume is just a limited subcase of a device vanishing during sleep. Regards Oliver --
But these disconnects aren't done by the PM core; they are done by I'll go along with that. If a device vanishes during sleep, the PM core isn't responsible for unregistering it -- the device's subsystem is. Alan Stern --
Yes, that makes sense. You are right. Regards Oliver --
Still, if ->resume() returns an error, does it make sense, from the PM core's point of view, to execute ->complete() for that device, for example? If you think it does, that behavior should be clearly documented (I didn't think about that before). Thanks, Rafael --
IMO you must always keep the ordering invariant. If a parent returns an error the PM core must not wake its children. Regards Oliver --
I'm agreeing here, but one of the previous Alan's comments suggests he has a differing opinion. Alan? I'm considering to make the PM core skip the resuming of the children of devices that failed to resume and skip calling ->complete() for that devices and their children. Thanks, Rafael --
Don't think of it that way. The PM core doesn't wake anything. It merely notifies drivers that the system sleep is ending, so that the drivers can wake their devices. It's up to the driver to detect whether the parent failed to resume, in which case the driver should take appropriate action. The situation is no different from what happens when the user tries to access a mounted USB disk drive after the USB cable has been unplugged. While that might in principle be a reasonable thing to do, it's different from how the PM core has behaved in the past. I don't see any point in making a change like that now. Alan Stern --
How do you propose that every driver should check the power state That completely throws away the reason to have a PM core. We've made a guarantee to drivers that they wil not be woken unless their parents are awake. In fact the semantics of the callbacks are defined in a way that adding devices to a parent can be enabled. You cannot add children to a dead parent. It's the very reason for this rewrite. Regards Oliver --
It depends entirely on the driver and subsystem; generalizations are not possible. When appropriate, they can copy the scheme used by USB. No we haven't. The guarantee is only that the PM core will call the parent's resume method before calling the child's method. There is no guarantee about whether the method succeeds or the parent is awake. Remember, the whole purpose of this is to let drivers know when the system is going to sleep or waking up. Proper handling of devices is Depends what you mean. In some cases it is possible -- i.e., you can do it and the kernel won't crash, although the new children probably won't be usable. In any case it doesn't matter; the PM core doesn't No it isn't. See Rafael's changelog ocmments; the rewrite is being done for completely different reasons. Alan Stern --
Then declare these methods void. We cannot introduce methods that deliberately ignore errors. Reporting is also better done in the drivers. Regards Oliver --
That decision is up to Rafael. Changing the methods to return void is okay with me. Alan Stern --
But that's not what they currently do, either. If I change the methods to void and it turns out in the future that it's better if they return error codes, it will be rather difficult to go back and change everything. For this reason, I'd prefer to retain the returning of error codes. What exactly do you whink would be wrong with using the error codes to avoid resuming the children of devices that failed to resume? Rafael --
Nothing, I guess. Drivers should work okay either way. Alan Stern --
I don't have time to go through this in detail now, but a few things stand out. One trivial problem is that your dpm_prepare and dpm_complete routines go through the list in the wrong order. More importantly, the situation with prepare and complete is getting rather complicated. Now that you're adding dev->power.state, why not go all the way? Get rid of all the different lists, keeping just dpm_active and possibly dpm_destroy. Instead of moving devices between different lists, just store in dev->power.state an identification for which list the device is supposed to be on or is supposed to be moving to. (Or else have a bunch of bitfields indicating which methods have been called.) This has the advantage that the entries will never get out of order, even if devices are registered at the wrong time. There's some question about when we want the PM core to start preventing new children from being registered. Should this start right after prepare() returns, or should it start right before suspend() is called? The first alternative sounds better to me. Not only does it agree with the purpose of the prepare method, it also makes race situations easier to handle. Since dpm_prepare is supposed to go through the list in the forward direction, logically the "root" of the device tree should be the first thing "prepared". This means you should not allow parentless devices to be registered any time after dpm_prepare has started. Is that liable to cause problems? There may be similar problems with complete(). A number of drivers check during their resume method for the presence of new children plugged in while the system was asleep. All these drivers would have to be converted over to the new scheme if they weren't permitted to register the new children before complete() was called. Of course, this is easily fixed by permitting new children starting from when resume() is called rather than when complete() is called. Alan Stern --
I'm not all so sure that the order is actually wrong. What would be the I agree and that's implemented in the patch (ie. the registrations of new I'm still not seeing the advantage of the forward direction in the first place. Although I don't see what particular problems that may cause, I think the current approach (first, block registrations of new children for each ->prepare()d device and finally block any registrations of new devices) is Well, the problem here was the protection of the correct ordering of the various lists. However, if the approach with changing 'status' is adopted instead, which seems to be better, we'll be able to unblock the registering of new children before ->resume(). Thanks, Rafael --
The advantage is that races no longer matter. Suppose you're going through the list backwards and a race occurs, so that a child is registered while the parent's prepare() is running. That new child will of course be at the end of dpm_active, so the loop in dpm_prepare won't see it (assuming you adopt the approach of using only a single list). But if you go through the list forwards and a new child is added, you will naturally see the child when you reach the end of the list. You don't even have to go through the business about changing the parent's state back to DPM_ON. There are other ways of describing the situation, but they all come down to the same thing. For instance, this is the way Ben was talking I suppose for parentless devices it doesn't really matter. You could safely allow them to be registered any time up until dpm_prepare() finishes -- which is what your patch does. Perhaps the "all_inactive" Yep. The only thing to watch out for is in device_pm_remove(); it would be a disaster if somehow a device was removed while it was being prepared/suspended/resumed/completed/whatever. I know that's not supposed to happen but there's nothing to prevent it, especially if the device in question doesn't have a driver. No doubt you can invent a way to allow this to happen safely. Alan Stern --
Well, that's a separate issue that IMO should be addressed in a separate patch.
Something like the one below comes to mind.
The comment removed by the patch is wrong IMO, because it implies that
device_add() may be called with the device semaphore held and that might
deadlock in bus_attach_device(). Thus, I think we can acquire dev->sem
in device_pm_add() and in device_pm_remove().
Thanks,
Rafael
---
drivers/base/power/main.c | 21 ++++++++++++++-------
include/linux/pm.h | 1 +
2 files changed, 15 insertions(+), 7 deletions(-)
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -41,10 +41,6 @@
* from the bottom (i.e., end) up and resumed in the opposite order.
* That way no parent will be suspended while it still has an active
* child.
- *
- * Since device_pm_add() may be called with a device semaphore held,
- * we must never try to acquire a device semaphore while holding
- * dpm_list_mutex.
*/
LIST_HEAD(dpm_active);
@@ -68,6 +64,7 @@ int device_pm_add(struct device *dev)
pr_debug("PM: Adding info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
+ down(&dev->sem);
mutex_lock(&dpm_list_mtx);
if ((dev->parent && dev->parent->power.sleeping) || all_sleeping) {
if (dev->parent->power.sleeping)
@@ -80,10 +77,13 @@ int device_pm_add(struct device *dev)
error = -EBUSY;
} else {
error = dpm_sysfs_add(dev);
- if (!error)
+ if (!error) {
+ dev->power.present = true;
list_add_tail(&dev->power.entry, &dpm_active);
+ }
}
mutex_unlock(&dpm_list_mtx);
+ up(&dev->sem);
return error;
}
@@ -98,10 +98,13 @@ void device_pm_remove(struct device *dev
pr_debug("PM: Removing info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
+ down(&dev->sem);
mutex_lock(&dpm_list_mtx);
...It isn't wrong. device_add() may indeed be called with a device semaphore held -- just not the semaphore for the device being added. Quite often it is called with device's parent's semaphore held. The implication is not that we may deadlock in bus_attach_device(); rather it is that the order of acquisition must always be device semaphore No, you have missed the entire point. The problem doesn't exist in the current code; it exists only if we switch over to using a single list. Routines like dpm_suspend() won't be able to use list_for_each_entry() to traverse the list because entries may be removed by other threads during the traversal. Even list_for_each_entry_safe() won't work correctly without careful attention to details. Alan Stern --
Ah, ok. Thanks for the clarification. Doesn't it help that we traverse the list under dpm_list_mtx? Anyone who removes an entry is required to take dpm_list_mtx that we're holding while the list is traversed except when the callbacks are invoked. The only problem I see is when the device currently being handled is removed from the list by a concurrent thread. Is that you were referring to? Rafael --
It doesn't help. What _does_ help is the fact that these traversals are all serialized (since only one thread can carry out a system sleep Yes, that is the problem. If you try to work around it by using list_for_each_entry_safe() then you run into a problem when a concurrent thread removes the device _following_ the one being handled (or when the device being handled is the last one on the list and a concurrent thread registers a new device, which can only happen in dpm_prepare()). It's not hard to fix. Just something to be aware of. Alan Stern P.S.: Oh yes, another related issue... We should call get_device() and put_device() while holding dpm_list_mtx. Otherwise the device structure might vanish when the callbacks are invoked. --
Yes, I've almost finished a new patch taking that into account. I'll send it Good idea. Thanks, Rafael --
Hi Rafael. Is it possible to extend this in some way so we avoid the #ifdef stuff in the drivers? We could introduce a few special sections that we discard if PM is not in use. We have a reliable build time infrastructure to detect inconsistencies if needed. Something like: #define __suspend __section(.suspend.text) #define __suspenddata __section(.suspend.data) #define __hibernate __section(.hibernate.text) #define __hibernatedata __section(.hibernate.data) A few more tricks will be needed when we assign the functon pointers. We have __devexit_p(*) and we may use something similar. --
Unfortunately, I have a little experience with linkers and I don't think I'll be able to do anything like this in a reasonable time without any help. Thanks, Rafael --
