Hi, This is the 3rd revision of the patch introducing new callbacks for suspend and hibernation. It has been tested on x86-64. I did my best to address the Alan's comments regarding the 2nd revision of the patch. In particular: * The ->prepare() callbacks are now executed in the forward order to avoid races between the PM core and the registration of new devices. Analogously, the ->complete() callbacks are executed in the reverse order. * Only one list, called dpm_list, is used by the PM core its operations and the dev->power.status field is used to mark the current status of the device wrt the suspend/hibernation operations. * The registrations of parentless devices are disabled before the first ->prepare() method is called and enabled before the first ->resume() method is called * The registration of new devices with a parent is disabled right after the parent's ->prepare() is called and enabled just prior to calling the parent's ->resume(). The patch is on top of the "PM: Remove destroy_suspended_device()" patch I sent yesterday. Please review. Thanks, Rafael --- From: Rafael J. Wysocki <rjw@sisk.pl> Introduce 'struct pm_ops' and 'struct pm_noirq_ops' representing suspend and hibernation operations for bus types, device classes and device types. Modify the PM core to use 'struct pm_ops' and 'struct pm_noirq_ops' objects, if defined, instead of the ->suspend() and ->resume() or, respectively, ->suspend_late() and ->resume_early() callbacks that will be considered as legacy and gradually phased out. The main purpose of doing this is to separate suspend (aka S2RAM and standby) callbacks from hibernation callbacks in such a way that the new callbacks won't take arguments and the semantics of each of them will be clearly specified. This has been requested for multiple times by many people, including Linus himself, and the reason is that within the current scheme if ->resume() is called, for example, it's difficult to say why it's been ...
It would be okay to wait until after the last prepare() method is
Can we also have a DPM_PREPARING state, set when ->prepare() is about
to be called? The PM core wouldn't make use of it but some drivers
would. (I can't think of any use at all for the analogous
Clearer to say: if (dev->parent->power.status >= DPM_SUSPENDING) {
Log a warning here? If this ever happened, it would be the sort of
Clearer to say: if (dev->power.status >= DPM_OFF) {
Note that if dev->power.status is equal to DPM_SUSPENDING then you
don't want to call resume_device(), but you still do want to change
dev->power.status to DPM_RESUMING so that new children can be
This isn't the way I imagined doing it (your extra "list"), but it's
fine.
Instead you could eliminate the list_splice_init() above and put here:
list_splice(&list, dpm_list->prev);
On the whole it looks quite good.
Alan Stern
--
Hmm. dev->power.status is protected by dpm_list_mtx. Do you think it would be The device can't be removed at this point, because we hold dpm_list_mtx, which is needed by device_del(). Still, I'll move the put_device() to avoid Okay, thanks for the comments. Rafael --
I don't think so. What I have in mind is situations where there accessed has already been synchronized by other means, while the prepare() method is running. For example: Task 0 Task 1 ------ ------ ->prepare() is called Waits for currently-running registration in task 1 to finish Does other stuff Receives a request to register a new child under dev Sees that dev->power.state is still DPM_ON, so goes ahead with the child's registration ->prepare() returns dev->power.state is set to DPM_SUSPENDING device_pm_add() checks dev->power.state and fails the registration If dev->power.state had been set to DPM_PREPARING before ->prepare() was called, then task 1 would have avoided trying to register the True, it can't be removed at this point. But it _can_ be removed It will still work correctly. If dpm_list is empty then dpm_list->prev is equal to &dpm_list, so it will do the same thing as your current code does. I just thought of another problem. At the point where local_irq_disable() is called, in between device_suspend() and device_power_down(), it is possible in a preemptible kernel that another task is holding dpm_list_mtx and is in the middle of updating the list pointers. This would mess up the traversal in device_power_down(). I'm not sure about the best way to prevent this. Is it legal to call unlock_mutex() while interrupts or preemption are disabled? Alan Stern --
Well, I think it is, but I'm not sure how that can help. To prevent the race from happening, we can lock dpm_list_mtx before switching interrupts off in kernel/power/main.c:suspend_enter() and analogously in kernel/power/disk.c . Thanks, Rafael --
That's right. And once interrupts are turned off you should unlock dpm_list_mtx again, in case a noirq method wants to unregister a device. Hence my question: Is it legal to call unlock_mutex() while interrupts are disabled? Alan Stern --
Why would a noirq method want to do that? IMO, it's not a big deal if noirq Well, I suspect that will confuse lockdep quite a bit. Otherwise, I don't see a problem with it (it's just changing the value of a shared variable after all). Thanks, Rafael --
Then you have your answer. Perhaps have device_suspend() exit with the mutex held and have device_resume() release it (with appropriate handling for error situations, of course). Alan Stern --
That wouldn't work, because enable_nonboot_cpus() is called before device_resume() and the notifiers in there may want to unregister devices if some CPUs fail to go online. I added two accessor functions device_pm_lock() and device_pm_unlock() to be called just prior to disabling interrupts and right after enabling them, respectively, in the higher-level PM core (ie. kernel/power/main(disk).c). Thanks, Rafael --
The patch below should address all of your comments above. It also seems to
work.
If it looks good to you, please let me know and I'll repost it in a new thread
along with the platform and PCI patches adapted to it.
Thanks,
Rafael
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
arch/x86/kernel/apm_32.c | 4
drivers/base/power/main.c | 654 ++++++++++++++++++++++++++++++++++-----------
drivers/base/power/power.h | 2
drivers/base/power/trace.c | 4
include/linux/device.h | 8
include/linux/pm.h | 282 +++++++++++++++++--
kernel/power/disk.c | 14
kernel/power/main.c | 4
8 files changed, 772 insertions(+), 200 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -114,7 +114,9 @@ typedef struct pm_message {
int event;
} pm_message_t;
-/*
+/**
+ * struct pm_ops - device PM callbacks
+ *
* Several driver power state transitions are externally visible, affecting
* the state of pending I/O queues and (for drivers that touch hardware)
* interrupts, wakeups, DMA, and other hardware state. There may also be
@@ -122,6 +124,254 @@ typedef struct pm_message {
* to the rest of the driver stack (such as a driver that's ON gating off
* clocks which are not in active use).
*
+ * The externally visible transitions are handled with the help of the following
+ * callbacks included in this structure:
+ *
+ * @prepare: Prepare the device for the upcoming transition, but do NOT change
+ * its hardware state. Prevent new children of the device from being
+ * registered and prevent new calls to the probe method from being made
+ * after @prepare() returns. If @prepare() detects a situation it cannot
+ * handle (e.g. registration of a child already in progress), it may return
+ * -EAGAIN, so that the PM core can execute it once again (e.g. after the
+ * new child ...How is a driver supposed to prevent calls to probe()? You can block in probe() or you can fail it, but how do you prevent it from being called? User space can trigger probe via sysfs. To be useful to driver writers, this has Probably it should be mentioned that this is the time to allocate memory if you have to. And it is too late to request firmware. Regards Oliver --
The comment is supposed to mean that probe() should be prevented from running Well, not exactly. Afterwards you cannot use GFP_KERNEL allocations, but GFP_NOIO should still work, although for hibernation it's quite possible that Yes. Thanks, Rafael --
I see no locking that would would prevent disconnect() in the window between This is better documented explicitly. Regards Oliver --
There is no such locking. It's perfectly legal for a device to be unregistered between prepare() and suspend(). I suppose it wouldn't hurt to add a general comment explaining that a device can be unregistered at any time except when one of its methods is running. Alan Stern --
The driver isn't supposed to prevent calls to its own probe(). The comment means that the subsystem -- or the rest of the kernel generally -- is supposed to avoid binding a driver to the device (thereby calling the probe routine), assuming the device isn't already bound. I don't expect this sort of thing to be very common. Mostly it happens when new kernel modules are loaded and new drivers are registered; we will have to block module loading during a sleep transition. It also happens when the user writes to a driver's "bind" attribute in sysfs; the code there will have to block during a sleep transition. Alan Stern --
I don't see anything else to change. ACK. Alan Stern --
Thanks. I'd like to fix the problem with dpm_list_mtx vs the disabling of interrupts you pointed out in the other message, though. Rafael --
