Greg, Andrew,
The appended patch is a replacement for
gregkh-driver-pm-acquire-device-locks-prior-to-suspending.patch that deadlocked
suspend and hibernation on some systems.
Please consider for applying.
Thanks,
Rafael
---
From: Alan Stern <stern@rowland.harvard.edu>, Rafael J. Wysocki <rjw@sisk.pl>
This patch reorganizes the way suspend and resume notifications are
sent to drivers. The major changes are that now the PM core acquires
every device semaphore before calling the methods, and calls to
device_add() during suspends will fail.
It also provides a way to safely remove a suspended device with the help of
the PM core, by using the destroy_suspended_device() callback introduced
specifically for this purpose, and updates two drivers (msr and cpuid) that need
to do that.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
arch/x86/kernel/cpuid.c | 6
arch/x86/kernel/msr.c | 6
drivers/base/core.c | 67 +++++-
drivers/base/power/main.c | 452 ++++++++++++++++++++++++++-------------------
drivers/base/power/power.h | 12 +
include/linux/device.h | 8
6 files changed, 352 insertions(+), 199 deletions(-)
Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -726,11 +726,20 @@ int device_add(struct device *dev)
{
struct device *parent = NULL;
struct class_interface *class_intf;
- int error = -EINVAL;
+ int error;
+
+ error = pm_sleep_lock();
+ if (error) {
+ dev_warn(dev, "Illegal %s during suspend\n", __FUNCTION__);
+ dump_stack();
+ return error;
+ }
dev = get_device(dev);
- if (!dev || !strlen(dev->bus_id))
- goto Error;
+ if (!dev || !strlen(dev->bus_id)) {
+ error = -EINVAL;
+ goto Done;
+ }
pr_debug("DEV: registering device: ID = '%s'\n", dev->bus_id);
@@ -795,6 +804,7 @@ int device_add(struct device *dev)
}
Done:
...since the call to device_del() will occur while the pm_sleep_rwsem is still locked for writing. That's why I suggested not unregistering these devices until after everything else has been resumed and the rwsem has been dropped. Another thing to watch out for: Just in case somebody ends up calling destroy_suspended_device(dev) from within dev's own resume method, you should interchange the resume_device() and the list_move_tail() calls in dpm_resume(). Alan Stern --
However, if we unregister them all at once after releasing pm_sleep_rwsem, that shouldn't be necessary, right? Rafael --
It's still necessary, because destroy_suspended_device() still has to move the device from one list to another. You don't want it to end up on the dpm_locked list. Alan Stern --
Hmm. That means we'd have to do the same thing in dpm_power_up() in case someone calls destroy_suspended_device() from resume_device_early(dev). Still, even doing that is not enough, since someone can call destroy_suspended_device() from a .suspend() routine and then the device will end up on a wrong list just as well. Greetings, Rafael --
That should never happen. The whole idea of destroy_suspended_device() is that the device couldn't be resumed and in fact should be unregistered because it is no longer working or no longer present. A suspend routine won't detect this sort of thing since it doesn't try to resume the device. But it wouldn't hurt to mention in the kerneldoc that destroy_suspended_device() is meant to be called only during a system resume. Alan Stern --
Hmm. Please have a look at the appended patch.
I have removed the warning from device_del() and used list_empty() to detect
removed devices in the .suspend() routines. Is that viable?
Rafael
---
arch/x86/kernel/cpuid.c | 6
arch/x86/kernel/msr.c | 6
drivers/base/core.c | 67 +++++-
drivers/base/power/main.c | 454 ++++++++++++++++++++++++++-------------------
drivers/base/power/power.h | 12 +
include/linux/device.h | 8
6 files changed, 354 insertions(+), 199 deletions(-)
Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -726,11 +726,20 @@ int device_add(struct device *dev)
{
struct device *parent = NULL;
struct class_interface *class_intf;
- int error = -EINVAL;
+ int error;
+
+ error = pm_sleep_lock();
+ if (error) {
+ dev_warn(dev, "Illegal %s during suspend\n", __FUNCTION__);
+ dump_stack();
+ return error;
+ }
dev = get_device(dev);
- if (!dev || !strlen(dev->bus_id))
- goto Error;
+ if (!dev || !strlen(dev->bus_id)) {
+ error = -EINVAL;
+ goto Done;
+ }
pr_debug("DEV: registering device: ID = '%s'\n", dev->bus_id);
@@ -795,6 +804,7 @@ int device_add(struct device *dev)
}
Done:
put_device(dev);
+ pm_sleep_unlock();
return error;
BusError:
device_pm_remove(dev);
@@ -1156,14 +1173,11 @@ error:
EXPORT_SYMBOL_GPL(device_create);
/**
- * device_destroy - removes a device that was created with device_create()
+ * find_device - finds a device that was created with device_create()
* @class: pointer to the struct class that this device was registered with
* @devt: the dev_t of the device that was previously registered
- *
- * This call unregisters and cleans up a device that was created with a
- * call to device_create().
*/
-void device_destroy(struct class *class, dev_t devt)
+static struct device *find_device(struct class ...It's not good. The warning in device_del() is vital. It's what will tell people where the problem is when a deadlock occurs during system resume because some driver has mistakenly tried to unregister a device at the wrong time. It would have pointed immediately to the msr driver in the case of the bug Andrew found, for instance. If you can figure out a way to disable the warning in device_del() for just the one device being unregistered by device_pm_destroy_suspended(), I suppose that would be okay. Alan Stern --
Something like this, perhaps:
@@ -905,6 +915,18 @@ void device_del(struct device * dev)
struct device * parent = dev->parent;
struct class_interface *class_intf;
+ if (down_trylock(&dev->sem)) {
+ if (pm_sleep_lock()) {
+ dev_warn(dev, "Illegal %s during suspend\n",
+ __FUNCTION__);
+ dump_stack();
+ } else {
+ pm_sleep_unlock();
+ }
+ } else {
+ up(&dev->sem);
+ }
+
if (parent)
klist_del(&dev->knode_parent);
Rafael
--
Bizarre, but it should work. Be sure to include plenty of explanatory comments -- otherwise nobody will be able to figure it out! :-) Alan Stern --
OK Still, shouldn't we fail the removal of the device apart from giving the I will. I think that code can be moved to its own function in Well, I guess so. :-) Thanks, Rafael --
Actually, having thought about it a bit more, I don't see the point in preventing the removal of the device from the list in device_pm_remove() if we allow all of the operations in device_del() preceding it to be performed. Shouldn't we just take pm_sleep_rwsem in device_del() upfront and block on that if locked? Rafael --
Ugh, the $subject patch looks like a city of races. I'm struggling to close them all, but it's getting complicated. I'll post the result in a new thread. Thanks, Rafael --
That's not the issue. We _don't_ allow all of the operations in device_del() preceding the call to device_pm_remove(). In particular, the call to the device's driver's remove method will deadlock because No -- the whole idea here is to print an error message in the system log if a driver's resume method tries to call device_del(). Deadlock is unavoidable in this case, but at least we'll know which driver is guilty. Alan Stern --
Still, if we do that, we won't need to acquire dev->sem in device_pm_remove() any more. Apart from this, by acqiring pm_sleep_rwsem for reading in device_del() we can prevent a suspend from starting while the device is being removed. Consider, for example, the scenario possible with the $subject patch: - device_del() starts and notices pm_sleep_rwsem unlocked, so the warning is not printed - it proceeds and everything before device_pm_remove() succeeds - now, device_suspend() is called and locks dev->sem - device_del() calls device_pm_remove() and blocks on that with the device partialy removed I think we should prevent this from happening. Rafael --
Not if pm_sleep_rwsem is held by device_del(), since in that case we won't reach lock_all_devices() (device_add() calls device_pm_remove() under In that case, we'll attempt to call the device's .suspend() and .resume() routines, but we shouldn't do that, IMHO. Rafael --
Let's try to summarize the main issues here:
1. We want the PM core to lock all devices during suspend and
hibernation. This implies that registration and unregistration
at such times can't work, because they need to lock the
device sem in order to make probe and remove method calls.
2. Registration calls can be failed, with an error message in the
system log. However unregistration calls cannot fail. They
_can_ block until the system resumes, but if the unregistration
call was made from within a suspend or resume method it will
deadlock. This seems inescapable, but at least we should print
an error in the log so the offending driver can be identified.
3. In response to 2, the PM core should have a special routine for
unregistering devices while a suspend is in progress. Rafael
proposed that the core should unlock the device to permit the
call to go through. This seems dangerous to me; I would prefer
to leave the locks in place and defer the unregistration until
after the system is back up and the locks have all been
dropped. This would avoid all sorts of locking, deadlock, and
mutual exclusion problems.
(As a side note: destroy_suspended_device() has a rather limited
interface anyway, since it can handle only devices that were created by
create_device().)
4. Rafael pointed out that unregistration can occur concurrently
with system suspend. When this happens we can end up trying to
suspend a device which has already been through
bus_remove_device(), because it hasn't yet been removed from
the dpm_active list. He proposes we make unregistration block
system suspend, just as registration does.
I don't see 4 as a real problem. Starting an unregistration before
the suspend and finishing it afterward should be okay. Once a device
has gone through bus_remove_device() it hasn't got a suspend method any
more, so trying to suspend it won't do anything at all -- the tests in
suspend_device() will all fail. Conversely, if ...Please see the patch at: http://lkml.org/lkml/2008/1/6/298 . It represents my current idea about how to do that. Thanks, Rafael --
It has some problems. First, note that the list manipulations in dpm_suspend(), device_power_down(), and so on aren't protected by dpm_list_mtx. So your patch could corrupt the list pointers. Are you assuming that no other threads can be running at this time? Note also that device_pm_destroy_suspended() does up(&dev->sem), but it doesn't know whether or not dev->sem was locked to begin with. Do you want to rule out the possibility of a driver's suspend or remove methods calling destroy_suspended_device() on its own device? With your synchronous approach, this would mean that the suspend/resume method would indirectly end up calling the remove method. This is dangerous at best; with USB it would be a lockdep violation. With an asynchronous approach, on the other hand, this wouldn't be a problem. Alan Stern --
Yes, they need the locking. I have overlooked that, mostly because the locking was removed by gregkh-driver-pm-acquire-device-locks-prior-to-suspending.patch too (because you assumed there woundn't be any need to remove a device during Do you mean it might have been released already by another thread Well, the asynchronous apprach has the problem that the device may end up on a wrong list when removed by one of the .suspend() callbacks (and I don't see how to avoid that without extra complexity). Perhaps that's something we can live with, though. One more question: is there any particular reason not to call device_pm_remove() at the beginning of device_del()? Rafael --
I was thinking that it might be called before lock_all_devices(). However let's ignore that possibility and simplify the discussion by assuming that destroy_suspended_device() is never called except by a suspend or resume method for that device or one of its ancestors. (This still leaves the possibility that it might get called by mistake The same problem affects the synchronous approach. We can fix it by having dpm_suspend() do the list_move() before calling I think it's done this way to avoid having a window where the device isn't on a PM list and is still owned by the bus and the driver. But if a suspend occurs during that window, it shouldn't matter that the device will be left unsuspended. After all, the same thing would have happened if the suspend occurred after bus_remove_device(). So no, there shouldn't be a problem with moving the call. Alan Stern --
I've added pm_sleep_start_end_mtx and the locking dance in Okay, well, now I'm leaning towards the asynchronous approach. I'll prepare a new patch and send it later today. Thanks, Rafael --
Yes, I see. What about the fact that device_suspend() locks pm_sleep_start_end_mtx first and pm_sleep_rwsem second, whereas device_pm_destroy_suspended() locks pm_sleep_start_end_mtx while This suggests another approach, simpler but not as general. So far all the problems we've seen have been associated with those CPU notifiers. Suppose the notifications about CPUs that failed to come back up were delayed until after the resume was complete? Drivers like msr would then have to check in their resume handler whether the CPU was actually up, but no other changes would be needed. In this way we could fix the immediate problem. It wouldn't help with other sorts of devices that need to be unregistered during a suspend, Okay. Alan Stern --
Appended is what I managed to put together today.
It probably still has some problems, but I'm not seeing them right now (too
tired). At least, it doesn't break my system. ;-)
Please review.
Thanks,
Rafael
---
From: Alan Stern <stern@rowland.harvard.edu>, Rafael J. Wysocki <rjw@sisk.pl>
This patch reorganizes the way suspend and resume notifications are
sent to drivers. The major changes are that now the PM core acquires
every device semaphore before calling the methods, and calls to
device_add() during suspends will fail, while calls to device_del()
during suspends will block.
It also provides a way to safely remove a suspended device with the
help of the PM core, by using the device_pm_schedule_removal() callback
introduced specifically for this purpose, and updates two drivers (msr
and cpuid) that need to use it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
arch/x86/kernel/cpuid.c | 6
arch/x86/kernel/msr.c | 6
drivers/base/core.c | 60 ++++-
drivers/base/power/main.c | 504 +++++++++++++++++++++++++++++----------------
drivers/base/power/power.h | 12 +
include/linux/device.h | 8
6 files changed, 408 insertions(+), 188 deletions(-)
Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -726,11 +726,20 @@ int device_add(struct device *dev)
{
struct device *parent = NULL;
struct class_interface *class_intf;
- int error = -EINVAL;
+ int error;
+
+ error = pm_sleep_lock();
+ if (error) {
+ dev_warn(dev, "Suspicious %s during suspend\n", __FUNCTION__);
+ dump_stack();
+ return error;
+ }
dev = get_device(dev);
- if (!dev || !strlen(dev->bus_id))
+ if (!dev || !strlen(dev->bus_id)) {
+ error = -EINVAL;
goto Error;
+ }
pr_debug("DEV: registering device: ID = '%s'\n", dev->bus_id);
@@ -795,6 +804,7 @@ int device_add(struct device *dev)
}
Done:
...Okay, this seems to be better. I like the way the complicated tests are all localized in power/main.c. In dpm_resume() you shouldn't need to use dpm_list_mtx at all, because the list_move_tail() comes before the resume_device(). It's the same as in dpm_power_up(). The same is true for dpm_suspend(). Once all the device have been locked, there shouldn't be any other tasks accessing the dpm lists. Hence there should be no need to protect the list. Which reminds me, the kerneldoc for device_pm_schedule_removal() is inaccurate. The routine always just moves the device to dpm_destroy list for later processing. Also, the kerneldoc for destroy_suspended_device() should contain an extra paragraph warning that the routine should never be called except within the scope of a system sleep transition. In practice this means it has to be directly or indirectly invoked by a suspend or resume method. It looks good. Alan Stern --
Still, device_pm_schedule_removal() can (in theory) be called concurrently with dpm_resume() by another thread and this might corrupt the list without Thanks for the review. I'll fix the comments and repost the patch from scratch for merging in a separate thread. Greetings, Rafael --
Any thread doing that would be in violation of the restrictions you're going to add to the kerneldoc for destroy_suspended_device(). However the overhead for the locking isn't critical. There won't be any contention (if everything is working right) and it isn't a hot path anyway. So you can leave the extra locking in if you want. But then you should put it in all the routines where the lists get manipulated, not just some of them. That is: device_power_down(), dpm_power_up(), In your patch the call is made in response to a CPU_UP_CANCELED_FROZEN notification. Isn't it true that this notification is issued only as part of a system sleep transition? We wouldn't want to allow destroy_suspended_device() to be called when an arbitrary CPU hotplug notification occurs. Alan Stern --
Rather, by the resume core. Technically, it's invoked by enable_nonboot_cpus(), which is not a resume method literally. Greetings, Rafael --
Okay, then the routine should only be called directly or indirectly from a suspend or resume method or from the suspend or resume core. Alan Stern --
We can't. device_del() can't fail -- it returns void. Besides, how can a driver hope to deal with an unregistration failure? Alan Stern --
We could, however I don't think it's dangerous to allow it. The two problems to avoid are (1) messing up the PM device list pointers, and (2) calling a driver's suspend/resume methods while its remove method is running. (1) is handled by the pm_list_mutex and (2) is handled by locking dev->sem. Alan Stern --
