Greg, Andrew,
The appended patch is a replacement for
gregkh-driver-pm-acquire-device-locks-prior-to-suspending.patch that deadlocked
suspend and hibernation on some systems.Please consider for applying.
Thanks,
Rafael---
From: Alan Stern <stern@rowland.harvard.edu>, Rafael J. Wysocki <rjw@sisk.pl>This patch reorganizes the way suspend and resume notifications are
sent to drivers. The major changes are that now the PM core acquires
every device semaphore before calling the methods, and calls to
device_add() during suspends will fail.It also provides a way to safely remove a suspended device with the help of
the PM core, by using the destroy_suspended_device() callback introduced
specifically for this purpose, and updates two drivers (msr and cpuid) that need
to do that.Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
arch/x86/kernel/cpuid.c | 6
arch/x86/kernel/msr.c | 6
drivers/base/core.c | 67 +++++-
drivers/base/power/main.c | 452 ++++++++++++++++++++++++++-------------------
drivers/base/power/power.h | 12 +
include/linux/device.h | 8
6 files changed, 352 insertions(+), 199 deletions(-)Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -726,11 +726,20 @@ int device_add(struct device *dev)
{
struct device *parent = NULL;
struct class_interface *class_intf;
- int error = -EINVAL;
+ int error;
+
+ error = pm_sleep_lock();
+ if (error) {
+ dev_warn(dev, "Illegal %s during suspend\n", __FUNCTION__);
+ dump_stack();
+ return error;
+ }dev = get_device(dev);
- if (!dev || !strlen(dev->bus_id))
- goto Error;
+ if (!dev || !strlen(dev->bus_id)) {
+ error = -EINVAL;
+ goto Done;
+ }pr_debug("DEV: registering device: ID = '%s'\n", dev->bus_id);
@@ -795,6 +804,7 @@ int device_add(struct device *dev)
}
Done:
put_devi...
since the call to device_del() will occur while the pm_sleep_rwsem is
still locked for writing. That's why I suggested not unregistering
these devices until after everything else has been resumed and the
rwsem has been dropped.Another thing to watch out for: Just in case somebody ends up calling
destroy_suspended_device(dev) from within dev's own resume method, you
should interchange the resume_device() and the list_move_tail()
calls in dpm_resume().Alan Stern
--
However, if we unregister them all at once after releasing pm_sleep_rwsem,
that shouldn't be necessary, right?Rafael
--
It's still necessary, because destroy_suspended_device() still has to
move the device from one list to another. You don't want it to end up
on the dpm_locked list.Alan Stern
--
Hmm. That means we'd have to do the same thing in dpm_power_up() in case
someone calls destroy_suspended_device() from resume_device_early(dev).Still, even doing that is not enough, since someone can call
destroy_suspended_device() from a .suspend() routine and then the device
will end up on a wrong list just as well.Greetings,
Rafael--
That should never happen. The whole idea of destroy_suspended_device()
is that the device couldn't be resumed and in fact should be
unregistered because it is no longer working or no longer present. A
suspend routine won't detect this sort of thing since it doesn't try to
resume the device.But it wouldn't hurt to mention in the kerneldoc that
destroy_suspended_device() is meant to be called only during a system
resume.Alan Stern
--
Hmm. Please have a look at the appended patch.
I have removed the warning from device_del() and used list_empty() to detect
removed devices in the .suspend() routines. Is that viable?Rafael
---
arch/x86/kernel/cpuid.c | 6
arch/x86/kernel/msr.c | 6
drivers/base/core.c | 67 +++++-
drivers/base/power/main.c | 454 ++++++++++++++++++++++++++-------------------
drivers/base/power/power.h | 12 +
include/linux/device.h | 8
6 files changed, 354 insertions(+), 199 deletions(-)Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -726,11 +726,20 @@ int device_add(struct device *dev)
{
struct device *parent = NULL;
struct class_interface *class_intf;
- int error = -EINVAL;
+ int error;
+
+ error = pm_sleep_lock();
+ if (error) {
+ dev_warn(dev, "Illegal %s during suspend\n", __FUNCTION__);
+ dump_stack();
+ return error;
+ }dev = get_device(dev);
- if (!dev || !strlen(dev->bus_id))
- goto Error;
+ if (!dev || !strlen(dev->bus_id)) {
+ error = -EINVAL;
+ goto Done;
+ }pr_debug("DEV: registering device: ID = '%s'\n", dev->bus_id);
@@ -795,6 +804,7 @@ int device_add(struct device *dev)
}
Done:
put_device(dev);
+ pm_sleep_unlock();
return error;
BusError:
device_pm_remove(dev);
@@ -1156,14 +1173,11 @@ error:
EXPORT_SYMBOL_GPL(device_create);/**
- * device_destroy - removes a device that was created with device_create()
+ * find_device - finds a device that was created with device_create()
* @class: pointer to the struct class that this device was registered with
* @devt: the dev_t of the device that was previously registered
- *
- * This call unregisters and cleans up a device that was created with a
- * call to device_create().
*/
-void device_destroy(struct class *class, dev_t devt)
+static struct device *find_device(struct class...
It's not good.
The warning in device_del() is vital. It's what will tell people where
the problem is when a deadlock occurs during system resume because some
driver has mistakenly tried to unregister a device at the wrong time.
It would have pointed immediately to the msr driver in the case of the
bug Andrew found, for instance.If you can figure out a way to disable the warning in device_del() for
just the one device being unregistered by
device_pm_destroy_suspended(), I suppose that would be okay.Alan Stern
--
Something like this, perhaps:
@@ -905,6 +915,18 @@ void device_del(struct device * dev)
struct device * parent = dev->parent;
struct class_interface *class_intf;+ if (down_trylock(&dev->sem)) {
+ if (pm_sleep_lock()) {
+ dev_warn(dev, "Illegal %s during suspend\n",
+ __FUNCTION__);
+ dump_stack();
+ } else {
+ pm_sleep_unlock();
+ }
+ } else {
+ up(&dev->sem);
+ }
+
if (parent)
klist_del(&dev->knode_parent);Rafael
--
Bizarre, but it should work. Be sure to include plenty of explanatory
comments -- otherwise nobody will be able to figure it out! :-)Alan Stern
--
OK
Still, shouldn't we fail the removal of the device apart from giving the
I will.
I think that code can be moved to its own function in
Well, I guess so. :-)
Thanks,
Rafael
--
We can't. device_del() can't fail -- it returns void. Besides, how
can a driver hope to deal with an unregistration failure?Alan Stern
--
Well, right.
Still, our present approach doesn't seem to be correct overall. For example,
I think we should prevent a suspend from happening while a device is being
removed.Rafael
--
We could, however I don't think it's dangerous to allow it. The two
problems to avoid are (1) messing up the PM device list pointers, and
(2) calling a driver's suspend/resume methods while its remove method
is running. (1) is handled by the pm_list_mutex and (2) is handled by
locking dev->sem.Alan Stern
--
Actually, having thought about it a bit more, I don't see the point in
preventing the removal of the device from the list in device_pm_remove() if
we allow all of the operations in device_del() preceding it to be performed.Shouldn't we just take pm_sleep_rwsem in device_del() upfront and block on that
if locked?Rafael
--
That's not the issue. We _don't_ allow all of the operations in
device_del() preceding the call to device_pm_remove(). In particular,
the call to the device's driver's remove method will deadlock becauseNo -- the whole idea here is to print an error message in the system
log if a driver's resume method tries to call device_del(). Deadlock
is unavoidable in this case, but at least we'll know which driver is
guilty.Alan Stern
--
Still, if we do that, we won't need to acquire dev->sem in device_pm_remove()
any more. Apart from this, by acqiring pm_sleep_rwsem for reading in
device_del() we can prevent a suspend from starting while the device is being
removed.Consider, for example, the scenario possible with the $subject patch:
- device_del() starts and notices pm_sleep_rwsem unlocked, so the warning is
not printed
- it proceeds and everything before device_pm_remove() succeeds
- now, device_suspend() is called and locks dev->sem
- device_del() calls device_pm_remove() and blocks on that with the device
partialy removed
I think we should prevent this from happening.Rafael
--
There's a window in lock_all_devices() when dpm_list_mtx isn't held.
We don't want device_pm_remove() taking an already-locked device off
the dpm_locked list at that time. So we do need to acquire dev->sem inI don't see anything wrong with it. All that will happen is that the
removal will start before the suspend and finish after the resume.Alan Stern
--
Not if pm_sleep_rwsem is held by device_del(), since in that case we won't
reach lock_all_devices() (device_add() calls device_pm_remove() underIn that case, we'll attempt to call the device's .suspend() and .resume()
routines, but we shouldn't do that, IMHO.Rafael
--
Let's try to summarize the main issues here:
1. We want the PM core to lock all devices during suspend and
hibernation. This implies that registration and unregistration
at such times can't work, because they need to lock the
device sem in order to make probe and remove method calls.2. Registration calls can be failed, with an error message in the
system log. However unregistration calls cannot fail. They
_can_ block until the system resumes, but if the unregistration
call was made from within a suspend or resume method it will
deadlock. This seems inescapable, but at least we should print
an error in the log so the offending driver can be identified.3. In response to 2, the PM core should have a special routine for
unregistering devices while a suspend is in progress. Rafael
proposed that the core should unlock the device to permit the
call to go through. This seems dangerous to me; I would prefer
to leave the locks in place and defer the unregistration until
after the system is back up and the locks have all been
dropped. This would avoid all sorts of locking, deadlock, and
mutual exclusion problems.(As a side note: destroy_suspended_device() has a rather limited
interface anyway, since it can handle only devices that were created by
create_device().)4. Rafael pointed out that unregistration can occur concurrently
with system suspend. When this happens we can end up trying to
suspend a device which has already been through
bus_remove_device(), because it hasn't yet been removed from
the dpm_active list. He proposes we make unregistration block
system suspend, just as registration does.I don't see 4 as a real problem. Starting an unregistration before
the suspend and finishing it afterward should be okay. Once a device
has gone through bus_remove_device() it hasn't got a suspend method any
more, so trying to suspend it won't do anything at all -- the tests in
suspend_device() will all fail. Conversely, if bus...
Please see the patch at: http://lkml.org/lkml/2008/1/6/298 . It represents my
current idea about how to do that.Thanks,
Rafael
--
It has some problems.
First, note that the list manipulations in dpm_suspend(),
device_power_down(), and so on aren't protected by dpm_list_mtx. So
your patch could corrupt the list pointers. Are you assuming that no
other threads can be running at this time?Note also that device_pm_destroy_suspended() does up(&dev->sem), but it
doesn't know whether or not dev->sem was locked to begin with.Do you want to rule out the possibility of a driver's suspend or remove
methods calling destroy_suspended_device() on its own device? With
your synchronous approach, this would mean that the suspend/resume
method would indirectly end up calling the remove method. This is
dangerous at best; with USB it would be a lockdep violation. With an
asynchronous approach, on the other hand, this wouldn't be a problem.Alan Stern
--
Yes, they need the locking. I have overlooked that, mostly because the locking
was removed by gregkh-driver-pm-acquire-device-locks-prior-to-suspending.patch
too (because you assumed there woundn't be any need to remove a device duringDo you mean it might have been released already by another thread
Well, the asynchronous apprach has the problem that the device may end up
on a wrong list when removed by one of the .suspend() callbacks (and I don't
see how to avoid that without extra complexity). Perhaps that's something we
can live with, though.One more question: is there any particular reason not to call
device_pm_remove() at the beginning of device_del()?Rafael
--
I was thinking that it might be called before lock_all_devices().
However let's ignore that possibility and simplify the discussion by
assuming that destroy_suspended_device() is never called except by a
suspend or resume method for that device or one of its ancestors.
(This still leaves the possibility that it might get called by mistakeThe same problem affects the synchronous approach. We can fix it by
having dpm_suspend() do the list_move() before callingI think it's done this way to avoid having a window where the device
isn't on a PM list and is still owned by the bus and the driver. But
if a suspend occurs during that window, it shouldn't matter that the
device will be left unsuspended. After all, the same thing would have
happened if the suspend occurred after bus_remove_device().So no, there shouldn't be a problem with moving the call.
Alan Stern
--
I've added pm_sleep_start_end_mtx and the locking dance in
Okay, well, now I'm leaning towards the asynchronous approach.
I'll prepare a new patch and send it later today.
Thanks,
Rafael
--
Yes, I see. What about the fact that device_suspend() locks
pm_sleep_start_end_mtx first and pm_sleep_rwsem second, whereas
device_pm_destroy_suspended() locks pm_sleep_start_end_mtx whileThis suggests another approach, simpler but not as general. So far all
the problems we've seen have been associated with those CPU notifiers.
Suppose the notifications about CPUs that failed to come back up were
delayed until after the resume was complete? Drivers like msr would
then have to check in their resume handler whether the CPU was actually
up, but no other changes would be needed.In this way we could fix the immediate problem. It wouldn't help with
other sorts of devices that need to be unregistered during a suspend,Okay.
Alan Stern
--
Appended is what I managed to put together today.
It probably still has some problems, but I'm not seeing them right now (too
tired). At least, it doesn't break my system. ;-)Please review.
Thanks,
Rafael---
From: Alan Stern <stern@rowland.harvard.edu>, Rafael J. Wysocki <rjw@sisk.pl>This patch reorganizes the way suspend and resume notifications are
sent to drivers. The major changes are that now the PM core acquires
every device semaphore before calling the methods, and calls to
device_add() during suspends will fail, while calls to device_del()
during suspends will block.It also provides a way to safely remove a suspended device with the
help of the PM core, by using the device_pm_schedule_removal() callback
introduced specifically for this purpose, and updates two drivers (msr
and cpuid) that need to use it.Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
arch/x86/kernel/cpuid.c | 6
arch/x86/kernel/msr.c | 6
drivers/base/core.c | 60 ++++-
drivers/base/power/main.c | 504 +++++++++++++++++++++++++++++----------------
drivers/base/power/power.h | 12 +
include/linux/device.h | 8
6 files changed, 408 insertions(+), 188 deletions(-)Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -726,11 +726,20 @@ int device_add(struct device *dev)
{
struct device *parent = NULL;
struct class_interface *class_intf;
- int error = -EINVAL;
+ int error;
+
+ error = pm_sleep_lock();
+ if (error) {
+ dev_warn(dev, "Suspicious %s during suspend\n", __FUNCTION__);
+ dump_stack();
+ return error;
+ }dev = get_device(dev);
- if (!dev || !strlen(dev->bus_id))
+ if (!dev || !strlen(dev->bus_id)) {
+ error = -EINVAL;
goto Error;
+ }pr_debug("DEV: registering device: ID = '%s'\n", dev->bus_id);
@@ -795,6 +804,7 @@ int device_add(struct devic...
Okay, this seems to be better. I like the way the complicated tests
are all localized in power/main.c.In dpm_resume() you shouldn't need to use dpm_list_mtx at all, because
the list_move_tail() comes before the resume_device(). It's the same
as in dpm_power_up().The same is true for dpm_suspend(). Once all the device have been
locked, there shouldn't be any other tasks accessing the dpm lists.
Hence there should be no need to protect the list.Which reminds me, the kerneldoc for device_pm_schedule_removal() is
inaccurate. The routine always just moves the device to dpm_destroy
list for later processing.Also, the kerneldoc for destroy_suspended_device() should contain an
extra paragraph warning that the routine should never be called except
within the scope of a system sleep transition. In practice this means
it has to be directly or indirectly invoked by a suspend or resume
method.It looks good.
Alan Stern
--
Still, device_pm_schedule_removal() can (in theory) be called concurrently
with dpm_resume() by another thread and this might corrupt the list withoutThanks for the review.
I'll fix the comments and repost the patch from scratch for merging in a
separate thread.Greetings,
Rafael
--
Any thread doing that would be in violation of the restrictions you're
going to add to the kerneldoc for destroy_suspended_device().However the overhead for the locking isn't critical. There won't be
any contention (if everything is working right) and it isn't a hot path
anyway. So you can leave the extra locking in if you want. But then
you should put it in all the routines where the lists get manipulated,
not just some of them. That is: device_power_down(), dpm_power_up(),In your patch the call is made in response to a CPU_UP_CANCELED_FROZEN
notification. Isn't it true that this notification is issued only as
part of a system sleep transition? We wouldn't want to allow
destroy_suspended_device() to be called when an arbitrary CPU hotplug
notification occurs.Alan Stern
--
Of course.
Greetings,
Rafael
--
So it counts as being indirectly invoked by a resume method.
Alan Stern
--
Rather, by the resume core. Technically, it's invoked by
enable_nonboot_cpus(), which is not a resume method literally.Greetings,
Rafael
--
Okay, then the routine should only be called directly or indirectly
from a suspend or resume method or from the suspend or resume core.Alan Stern
--
Ugh, the $subject patch looks like a city of races. I'm struggling to close
them all, but it's getting complicated.I'll post the result in a new thread.
Thanks,
Rafael
--
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg Kroah-Hartman | [PATCH 001/196] Chinese: Add the known_regression URI to the HOWTO |
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
| Heiko Carstens | Re: -mm merge plans for 2.6.23 -- sys_fallocate |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| David Miller | Re: [GIT]: Networking |
| Jarek Poplawski | [PATCH iproute2] Re: HTB accuracy for high speed |
