Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

Previous thread: Burning CDs as user produces coasters (Plextor drive) by Gregor Jasny on Monday, March 24, 2008 - 1:39 pm. (3 messages)

Next thread: [PATCH 1/5] fs/gfs2: test for IS_ERR rather than 0 by Julia Lawall on Monday, March 24, 2008 - 2:08 pm. (2 messages)
To: pm list <linux-pm@...>
Cc: ACPI Devel Maling List <linux-acpi@...>, Alan Stern <stern@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Monday, March 24, 2008 - 1:39 pm

Hi,

This is the 3rd revision of the patch introducing new callbacks for suspend
and hibernation. It has been tested on x86-64.

I did my best to address the Alan's comments regarding the 2nd revision of the
patch. In particular:
* The ->prepare() callbacks are now executed in the forward order to avoid
races between the PM core and the registration of new devices. Analogously,
the ->complete() callbacks are executed in the reverse order.
* Only one list, called dpm_list, is used by the PM core its operations and
the dev->power.status field is used to mark the current status of the device
wrt the suspend/hibernation operations.
* The registrations of parentless devices are disabled before the first
->prepare() method is called and enabled before the first ->resume() method
is called
* The registration of new devices with a parent is disabled right after the
parent's ->prepare() is called and enabled just prior to calling the parent's
->resume().

The patch is on top of the "PM: Remove destroy_suspended_device()" patch
I sent yesterday.

Please review.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>

Introduce 'struct pm_ops' and 'struct pm_noirq_ops' representing
suspend and hibernation operations for bus types, device classes and
device types.

Modify the PM core to use 'struct pm_ops' and 'struct pm_noirq_ops'
objects, if defined, instead of the ->suspend() and ->resume() or,
respectively, ->suspend_late() and ->resume_early() callbacks that
will be considered as legacy and gradually phased out.

The main purpose of doing this is to separate suspend (aka S2RAM and
standby) callbacks from hibernation callbacks in such a way that the
new callbacks won't take arguments and the semantics of each of them
will be clearly specified.  This has been requested for multiple
times by many people, including Linus himself, and the reason is that
within the current scheme if ->resume() is called, for example, ...

To: Rafael J. Wysocki <rjw@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Monday, March 24, 2008 - 4:14 pm

It would be okay to wait until after the last prepare() method is

Can we also have a DPM_PREPARING state, set when ->prepare() is about
to be called? The PM core wouldn't make use of it but some drivers
would. (I can't think of any use at all for the analogous

Clearer to say: if (dev->parent->power.status >= DPM_SUSPENDING) {

Log a warning here? If this ever happened, it would be the sort of

Clearer to say: if (dev->power.status >= DPM_OFF) {

Note that if dev->power.status is equal to DPM_SUSPENDING then you
don't want to call resume_device(), but you still do want to change
dev->power.status to DPM_RESUMING so that new children can be

This isn't the way I imagined doing it (your extra "list"), but it's
fine.

Instead you could eliminate the list_splice_init() above and put here:

list_splice(&list, dpm_list->prev);

On the whole it looks quite good.

Alan Stern

--

To: Alan Stern <stern@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Monday, March 24, 2008 - 6:18 pm

The patch below should address all of your comments above. It also seems to
work.

If it looks good to you, please let me know and I'll repost it in a new thread
along with the platform and PCI patches adapted to it.

Thanks,
Rafael

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---

arch/x86/kernel/apm_32.c | 4
drivers/base/power/main.c | 654 ++++++++++++++++++++++++++++++++++-----------
drivers/base/power/power.h | 2
drivers/base/power/trace.c | 4
include/linux/device.h | 8
include/linux/pm.h | 282 +++++++++++++++++--
kernel/power/disk.c | 14
kernel/power/main.c | 4
8 files changed, 772 insertions(+), 200 deletions(-)

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -114,7 +114,9 @@ typedef struct pm_message {
int event;
} pm_message_t;

-/*
+/**
+ * struct pm_ops - device PM callbacks
+ *
* Several driver power state transitions are externally visible, affecting
* the state of pending I/O queues and (for drivers that touch hardware)
* interrupts, wakeups, DMA, and other hardware state. There may also be
@@ -122,6 +124,254 @@ typedef struct pm_message {
* to the rest of the driver stack (such as a driver that's ON gating off
* clocks which are not in active use).
*
+ * The externally visible transitions are handled with the help of the following
+ * callbacks included in this structure:
+ *
+ * @prepare: Prepare the device for the upcoming transition, but do NOT change
+ * its hardware state. Prevent new children of the device from being
+ * registered and prevent new calls to the probe method from being made
+ * after @prepare() returns. If @prepare() detects a situation it cannot
+ * handle (e.g. registration of a child already in progress), it may return
+ * -EAGAIN, so that the PM core can execute it once again (e.g. after the
+ * new chi...

To: Rafael J. Wysocki <rjw@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 25, 2008 - 11:12 am

I don't see anything else to change. ACK.

Alan Stern

--

To: Alan Stern <stern@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 25, 2008 - 4:39 pm

Thanks.

I'd like to fix the problem with dpm_list_mtx vs the disabling of interrupts
you pointed out in the other message, though.

Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Alan Stern <stern@...>, pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 25, 2008 - 7:55 am

How is a driver supposed to prevent calls to probe()? You can block in probe()
or you can fail it, but how do you prevent it from being called? User space
can trigger probe via sysfs. To be useful to driver writers, this has

Probably it should be mentioned that this is the time to allocate memory
if you have to. And it is too late to request firmware.

Regards
Oliver
--

To: Oliver Neukum <oliver@...>
Cc: Alan Stern <stern@...>, pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 25, 2008 - 8:40 am

The comment is supposed to mean that probe() should be prevented from running

Well, not exactly. Afterwards you cannot use GFP_KERNEL allocations, but
GFP_NOIO should still work, although for hibernation it's quite possible that

Yes.

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Oliver Neukum <oliver@...>, pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 25, 2008 - 10:29 am

The driver isn't supposed to prevent calls to its own probe(). The
comment means that the subsystem -- or the rest of the kernel generally
-- is supposed to avoid binding a driver to the device (thereby calling
the probe routine), assuming the device isn't already bound.

I don't expect this sort of thing to be very common. Mostly it happens
when new kernel modules are loaded and new drivers are registered; we
will have to block module loading during a sleep transition. It also
happens when the user writes to a driver's "bind" attribute in sysfs;
the code there will have to block during a sleep transition.

Alan Stern

--

To: Rafael J. Wysocki <rjw@...>
Cc: Alan Stern <stern@...>, pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 25, 2008 - 9:37 am

I see no locking that would would prevent disconnect() in the window between

This is better documented explicitly.

Regards
Oliver

--

To: Oliver Neukum <oliver@...>
Cc: Rafael J. Wysocki <rjw@...>, pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 25, 2008 - 10:24 am

There is no such locking. It's perfectly legal for a device to be
unregistered between prepare() and suspend(). I suppose it wouldn't
hurt to add a general comment explaining that a device can be
unregistered at any time except when one of its methods is running.

Alan Stern

--

To: Alan Stern <stern@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Monday, March 24, 2008 - 5:18 pm

Hmm. dev->power.status is protected by dpm_list_mtx. Do you think it would be

The device can't be removed at this point, because we hold dpm_list_mtx, which
is needed by device_del(). Still, I'll move the put_device() to avoid

Okay, thanks for the comments.

Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 25, 2008 - 11:06 am

I don't think so. What I have in mind is situations where there
accessed has already been synchronized by other means, while the
prepare() method is running. For example:

Task 0 Task 1
------ ------
->prepare() is called
Waits for currently-running
registration in task 1
to finish
Does other stuff
Receives a request to register
a new child under dev
Sees that dev->power.state is
still DPM_ON, so goes ahead
with the child's registration
->prepare() returns
dev->power.state is set to
DPM_SUSPENDING
device_pm_add() checks
dev->power.state and fails
the registration

If dev->power.state had been set to DPM_PREPARING before ->prepare()
was called, then task 1 would have avoided trying to register the

True, it can't be removed at this point. But it _can_ be removed

It will still work correctly. If dpm_list is empty then dpm_list->prev
is equal to &dpm_list, so it will do the same thing as your current
code does.

I just thought of another problem. At the point where
local_irq_disable() is called, in between device_suspend() and
device_power_down(), it is possible in a preemptible kernel that
another task is holding dpm_list_mtx and is in the middle of updating
the list pointers. This would mess up the traversal in
device_power_down().

I'm not sure about the best way to prevent this. Is it legal to call
unlock_mutex() while interrupts or preemption are disabled?

Alan Stern

--

To: Alan Stern <stern@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 25, 2008 - 4:35 pm

Well, I think it is, but I'm not sure how that can help.

To prevent the race from happening, we can lock dpm_list_mtx before switching
interrupts off in kernel/power/main.c:suspend_enter() and analogously in
kernel/power/disk.c .

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 26, 2008 - 10:03 am

That's right. And once interrupts are turned off you should unlock
dpm_list_mtx again, in case a noirq method wants to unregister a
device. Hence my question: Is it legal to call unlock_mutex() while
interrupts are disabled?

Alan Stern

--

To: Alan Stern <stern@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 26, 2008 - 4:26 pm

Why would a noirq method want to do that? IMO, it's not a big deal if noirq

Well, I suspect that will confuse lockdep quite a bit. Otherwise, I don't see
a problem with it (it's just changing the value of a shared variable after
all).

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 26, 2008 - 4:36 pm

Then you have your answer. Perhaps have device_suspend() exit with the
mutex held and have device_resume() release it (with appropriate
handling for error situations, of course).

Alan Stern

--

To: Alan Stern <stern@...>
Cc: pm list <linux-pm@...>, ACPI Devel Maling List <linux-acpi@...>, Greg KH <greg@...>, Len Brown <lenb@...>, LKML <linux-kernel@...>, Alexey Starikovskiy <astarikovskiy@...>, David Brownell <david-b@...>, Pavel Machek <pavel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 26, 2008 - 4:54 pm

That wouldn't work, because enable_nonboot_cpus() is called before
device_resume() and the notifiers in there may want to unregister devices
if some CPUs fail to go online.

I added two accessor functions device_pm_lock() and device_pm_unlock()
to be called just prior to disabling interrupts and right after enabling them,
respectively, in the higher-level PM core (ie. kernel/power/main(disk).c).

Thanks,
Rafael
--

Previous thread: Burning CDs as user produces coasters (Plextor drive) by Gregor Jasny on Monday, March 24, 2008 - 1:39 pm. (3 messages)

Next thread: [PATCH 1/5] fs/gfs2: test for IS_ERR rather than 0 by Julia Lawall on Monday, March 24, 2008 - 2:08 pm. (2 messages)