Ongoing efforts to remove the freezer from the system suspend and
hibernation code ("system sleep" is the proper catch-all term) have
turned up a fundamental flaw in the Power Management subsystem's
design. In brief, we cannot handle the race between hotplug addition
of new devices and suspending all existing devices.
It's not a simple problem (and I'm going to leave out a lot of details
here). For a comparison, think about what happens when a device is
hot-unplugged. When device_del() calls the driver's remove method, the
driver is expected to manage all the details of synchronizing with
other threads that may be trying to add new child devices as well as
removing all existing children.
But when a system sleep begins, the PM core is expected to suspend
all the children of a device before calling the device driver's suspend
method. If there are other threads trying to add new children at the
same time, it's the PM core's responsibility to synchronize with
them -- an impossible job, since only the device's driver knows what
those other threads are and how to stop them safely.
In the past this deficiency has been hidden by the freezer. Other
tasks couldn't register new children because they were frozen. But now
we are phasing out the freezer (already most kernel threads are not
freezable) and the problem is starting to show up.
A change to the PM core present in 2.6.25-rc2 (but which is about to be
reverted!) has the core try to prevent these additions by acquiring the
device semaphores for every registered device. This has turned out to
be too heavy-handed; for example, it prevents drivers from
unregistering devices during a system sleep. There are more subtle
synchronization problems as well.
The only possible solution is to have the drivers themselves be
responsible for preventing calls to device_add() or device_register()
during a system sleep. (It's also necessary to prevent driver binding,
but this isn't a major issue.) The most straightforward approach i....../... Yup, old problem, I think I've said a long time ago that it's the reponsibility of bus drivers (such as USB khub) to stop issuing device additions when suspend is in progress. It might be possible to still do No, I think the global notifier is plenty for now. Later on, we can re-add early-suspend, late-resume, or we can finally turn the whole thing into a recursive call tree where the parents are the ones calling the children suspend :-) But for now, I think the global notifier is enough. Cheers, Ben. --
It's not a problem if new children are registered before the parent's ->suspend() is called, the PM core can handle that. The problem is the potential race between the suspending task and the threads registering new children concurrently to the executing ->suspend(), because if those threads lose the race, the resume ordering will be broken. Since the PM core knows nothing about the drivers internals, the drivers' ->suspend() methods must be responsible for synchronizing with the other I think we just attempted to take device semaphores too early. We probably can take the device semaphores _after_ suspending all devices without much hassle. However, it's not actually a problem if a suspended device gets unregistered - it's removed from the list on which it is at the moment and won't be resumed. It also is not a problem if the device is registered after it's master's ->resume() has run. Besides, taking the semaphores for all _existing_ devices doesn't prevent new devices from being added and that's we needed to take As I said above, I don't see a problem with registering new devices before the parent's ->suspend() is run and after it's ->resume() has run. That doesn't break any ordering rules and we can handle it. Thanks, Rafael --
On further thought maybe the existing methods can be used, with care. Drivers would have to assume the responsibility of synchronizing with their helper threads and stopping addition of new children (something they should already be doing), and they would also have to check that all the existing children are already suspended. They should not make the assumption that the PM core has already suspended all the children. The PM core could help detect errors here. If it tries to suspend a device and sees that the device's parent is already suspended, then the parent's driver has a bug. Alan Stern --
IMO the device driver should assure that no new children will be registered concurrently with the ->suspend() method (IOW, ->suspend() should wait for all such registrations to complete and should prevent any new ones from being started) and it should make it impossible to register any new children Yes, I think we ought to fail the suspend in such cases. Still, that's not sufficient to prevent a child from being registered after we've run dpm_suspend(). For this reason, we could also leave dpm_suspend() with dpm_list_mtx held and not release it until the next dpm_resume() is run. That will potentially cause some trouble to CPU hotplug cotifiers, but we can handle that, for example, by using the in_suspend_context() test. Thanks, Rafael --
This "flaw" isn't a new thing, of course. I remember pointing out the rather annoying proclivity of the PM framework to deadlock when suspend() tried to remove USB devices ... back around 2.6.10 or so. Things have shuffled around a bit, and gotten better in some cases, but not fundamentally changed. It may be more accurate to say that now we understand some constraints on device tree management policies ... ones we had previously assumed should not be issues. (But AFAICT, without actually considering the question. Now we know the right question to ask!) There's also the case where it's framework code that handles the additions rather than the parent device. That would be typical for many bridge, hub, or adapter type drivers ... you may be thinking mostly about drivers acting as "leaf" nodes in the device tree, at least in terms of real hardware nodes. Yes, "require that policy from such framework code too". Just trying to be sure the description doesn't have gaping holes in the middle. :) I can think of a bunch of serial busses where framework code has that sort of responsiblity. USB, SPI, I2C ... "legacy" I2C drivers would all need to be taught not to create/remove children during those intervals, lacking "new style" conversion (which makes them work more like normal drivers). - Dave --
Hardware can be inserted and removed while we're in a suspend state; and there's nothing that we can do about it until we resume. Is it fair to say, then, that having started suspend, we could reasonably ignore any device insertion and removal, and handle it on resume? Presumably we need to scan for hardware changes on resume. --
"Ignore" seems a bit strong; those events may be wakeup triggers, which would cause the hardware to make it a very short suspend state. Not on most busses I work with; the hardware issues notifications whenever the devices are removable. Which is a Good Thing ... scanning, as you suggest, is inherently not reliable. If a mouse is swapped, it likely doesn't matter a lot whether it's the "same" or not. But ... how about if some removable storage media was taken out, updated on a different system, then restored before the resume? Not many filesystems handle that sanely. You *really* want to have hardware which reports disconnect/reconnect events, or which physically prevents them (locked case etc). - Dave --
Of course, "defer". The insertion has to be handled eventually. What There's no notification while we're suspended. Isn't it necessary to scan all busses on resume, just to know what's on them? --
Certainly. If hardware-change events can get lost because of the system sleep, the resume method should make every effort to verify that It depends on the bus. If the bus doesn't support hotplugging then scanning isn't necessary. If the bus does support hotplugging then scanning after suspend may or may not be necessary, depending on whether or not the bus controller remained powered during the suspend. For hotpluggable buses, scanning after hibernation is always necessary. Alan Stern --
Not only that, the new children will exist temporarily in a state where their parent is suspended and they are not. Clearly we cannot allow This is a new requirement. We didn't have to worry about it with the At that stage there isn't any real need to hold the semaphores. And it complicates unregistration, which should always be allowed. At the Yes. We could keep it, but not acquire it until after all devices have Exactly; this has to be added to the PM documentation. The difficulty arises only for drivers that support hotplugging, that is, detecting and registering children from somewhere other than their Do they need to register new CPUs at some point? There ought to be a way to handle that. Alan Stern --
Certainly. That's why I've been saying that it's not really a simple task to But we should not leave a window between releasing dpm_list_mtx and taking pm_sleep_rwsem. Either that, or we should make sure that dpm_active is No, they don't, but there are some CPU-related device objects that get uregistered/registered. Still, all of this work is really redundant if the CPU in question comes back up during the resume, so it should be avoided in general. The CPU hotplug notifiers should only unregister those objects if the CPU hasn't gone on line during the resume and they have all information necessary for discovering that. Thanks, Rafael --
I've got some ideas on how to implement this. We can add a new field "suspend_called" to dev->power. It would be owned by the PM core (protect by dpm_list_mtx) and read-only to drivers. Normally it will contain 0, but when the suspend method is running we set it to SUSPEND_RUNNING and when the method returns successfully we set it to SUSPEND_DONE. Before calling the resume method we set it back to 0. Drivers can use this field as an easy way of checking that all the child devices have been suspended. When a new device is registered we check its parent's suspend_called value. If it is SUSPEND_DONE then the caller has a bug and we have to fail the registration. If it is SUSPEND_RUNNING then the registration is legal, but we remember what happened. Then when the currently-running suspend method returns and we reacquire the dpm_list_mtx, we will realize that a race was lost. If the method completed successfully (which it shouldn't) we can resume that device immediately without ever taking it off the dpm_active list; but either way we should continue the suspend loop. Now the new child will be at the end of the dpm_active_list, so it will be suspended before the parent is reached again. This way we can recover from drivers that are willing to suspend their device even though there are unsuspended children. The only drawback will be that for a short time the child will be active while its parent is suspended. We should not abort the entire sleep transition simply because we lost a race. With this scheme we won't even need the pm_sleep_rwsem; the dpm_list_mtx will provide all the necessary protection. This is more intricate than it should be. It would have been better to have had "disable_new_children" and "enable_new_children" methods from the beginning; then there wouldn't be any races at all. That's life... The one tricky thing to watch out for is when a suspend or resume method wants to unregister the device being suspended or resumed. Even that ...
I'd call it "sleeping" or something like this, for it will also be used by Why before? I'd think that any non-suspended children should not be visible This seems to require some trickery. Namely, device_add() will notice that the registration is done concurrently with the running ->suspend() of the parent and will have to communicate that to dpm_suspend() which is supposed I don't agree here. If we require drivers to prevent such races from happening and they don't comply, we can give up instead of trying to work around the That can't happen, because dev->sem is taken by suspend_device() and I'm still thinking that registering while the parent is suspending should not The direction seems to be fine, but the details need a bit more clarification, as far as I'm concerned. Having a patch to discuss will certainly help a lot, though. ;-) Thanks, Rafael --
The name refers to the "suspend" method, not the type of sleep being carried out. We use the same method for both suspend and hibernation. All right, we can set it to RESUME_RUNNING before calling the resume method and then set it to 0 afterwards. The point is that the value shouldn't remain SUSPEND_DONE while resume runs, because it should be It will get noticed in device_pm_add() while holding dpm_list_mtx. The information can be stored in a static private flag Sure. But it won't be the PM core's problem; it will be a bug in the bus's driver. We will print a warning in the log so the bug can be You misunderstand. We can't require drivers to prevent these races entirely. As an example, a properly-written, compliant driver might work like this: Task 0 Task 1 ------ ------ dev->power.sleeping = SUSPEND_RUNNING; Call (drv->suspend)(dev) Register a child below dev suspend method prevents new child registrations suspend method waits for existing registration to finish Check dev->power.sleeping and set child_added_while_parent_suspends Registration completes successfully suspend method sees there is an unsuspended child and returns -EBUSY Check child_added_while_parent_suspends and realize that we lost the race There's nothing illegal about this; it's just an accident of timing. Nothing has gone wrong and we shouldn't abort the sleep. We should continue where we left off, by suspending the new child and then trying We'll have to fix device_del() to prevent that from happening. Your Unfortunately the lack of "prevent_new_children" and "allow_new_children" methods gives us no choice. The example above shows why. Alan Stern --
I'm not sure. The core moves the device to dpm_active only after ->resume() has run. Thus, if ->resume() registers new children, the ordering of I'm not sure if we need to do it. It's always been like this, so the current drivers' ->suspend() and ->resume() don't unregister the device they're called Yes, it does. Thanks, Rafael --
Rafael: Here's my patch. It doesn't include the timers for deadlock debugging, but it does include all the other stuff we've been talking about. My base probably isn't quite in sync with yours, so this may not apply cleanly on your system. But the divergences should be small. Incidentally, there seemed to be a bug in your dpm_suspend() -- the dpm_list_mtx needs to be reacquired before the error checking. This patch fixes that. It also removes pm_sleep_rwsem, which isn't used any more. We should think about device_pm_schedule_removal(). It won't work right if a suspend method calls it for the device being suspended, because the device gets moved to the dpm_off list after the method runs. Alan Stern Index: usb-2.6/Documentation/power/devices.txt =================================================================== --- usb-2.6.orig/Documentation/power/devices.txt +++ usb-2.6/Documentation/power/devices.txt @@ -192,11 +192,27 @@ used to resume those devices. The ordering of the device tree is defined by the order in which devices get registered: a child can never be registered, probed or resumed before -its parent; and can't be removed or suspended after that parent. +its parent; and can't be removed or suspended after that parent. This +means that special care is needed when calling device_move(); currently +the kernel does not adjust its suspend and resume ordering to match the +new device tree. The policy is that the device tree should match hardware bus topology. (Or at least the control bus, for devices which use multiple busses.) +There is an unavoidable race between the PM core suspending all the +children of a device and the device's driver registering new children. +As a result, it is possible that the core may try to suspend a device +without first having suspended all of the device's children. Drivers +must check for this; a suspend method should return -EBUSY if there are +unsuspended children. (The child->power.sleeping field c...
In fact 'sleeping' doesn't look good in this context. 'pm_state' seems
better to me (although it is confusingly similar to 'power_state', we're going
Hmm.
Suppose we add a mutex to dev_pm_info, say pm_mtx, and require it to be:
(1) taken by suspend_device(dev) (at the beginning)
(2) released by resume_device(dev) (at the end)
(3) taken (and released) by device_pm_add() if dev is the parent of the device
being added.
In that case, device_pm_add() will block on attepmpts to register devices whose
parents are suspended (or suspending) and we're done. At least so it would
This seems to be a weak spot. The resuming of the device at this point need
not work correctly, given that the system's target state is still a sleep
Well, I wish it could be simpler ...
Thanks,
Rafael
--How about instead: "attempts to register a child device below a It gets used in only a couple of places, so nobody should object too No; this would repeat the same mistake we were struggling with last week. Blocking registration attempts (especially if we start _before_ calling the device's suspend method or end _after_ calling the resume method) will lead to deadlocks while suspending or resuming. If the blocking starts after the suspend method returns then it will be safer. But what's the point? If a registration attempt is made at that stage, it means there's a bug in the driver. So failing the registration seems like a reasonable thing to do. One issue: This rule doesn't allow suspend_late or resume_early methods to register any new devices. Will that cause problems with the CPU hotplug or ACPI subsystems? ACPI in particular may need to freeze the kacpi_notify workqueue -- in fact, that might solve the problem in I agree. But there aren't any -EPOWER* or -EPM* error codes defined! That may be true. But this is an error-recovery path intended to work around a driver bug. It doesn't have to guarantee perfect operation, just do its best. Remember too that the target state is set before any devices are suspended. Hence, after the state is set there may still be runtime resumes taking place. Those _must_ not fail, which means that this Me too. But until the API is changed, this seems to be the best we can do. It's not quite as bad as it looks, since a fair amount of the new code is just for reporting on and recovering from bugs in drivers. Alan Stern --
Not exactly, because in that case we blocked all attempts to register devices,
while I think we can only block those regarding the children of a suspending
I'm not really convinced that it'll happen. If we make the rule that
registering children from the device's own ->suspend() method is forbidden
(I don't really see why it should be allowed), something like this might work
(it seems to me):
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -186,6 +186,7 @@ struct dev_pm_info {
#ifdef CONFIG_PM_SLEEP
unsigned should_wakeup:1;
struct list_head entry;
+ struct mutex lock;
#endif
};
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -67,9 +67,14 @@ void device_pm_add(struct device *dev)
pr_debug("PM: Adding info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
+ if (dev->parent)
+ mutex_lock(&dev->parent.power.lock);
mutex_lock(&dpm_list_mtx);
list_add_tail(&dev->power.entry, &dpm_active);
+ mutex_init(&dev->power.lock);
mutex_unlock(&dpm_list_mtx);
+ if (dev->parent)
+ mutex_unlock(&dev->parent.power.lock);
}
/**
@@ -249,6 +254,7 @@ static void dpm_resume(void)
list_move_tail(entry, &dpm_active);
mutex_unlock(&dpm_list_mtx);
resume_device(dev);
+ mutex_unlock(&dev->power.lock);
mutex_lock(&dpm_list_mtx);
}
mutex_unlock(&dpm_list_mtx);
@@ -427,6 +433,13 @@ static int dpm_suspend(pm_message_t stat
struct device *dev = to_device(entry);
mutex_unlock(&dpm_list_mtx);
+ mutex_lock(&dev->power.lock);
+ mutex_lock(&dpm_list_mtx);
+ if (dev != to_device(dpm_active.prev)) {
+ mutex_unlock(&dev->...It's the "suspending" case that causes trouble. Go back and look at the race I diagrammed in It _did_ happen. Precisely this race occurred in Bug #10030 (the SD card insertion/removal), although there the window was bigger because we blocked registrations starting from the start of the system sleep It's not that the suspend method itself will want to register children. The problem is that the method has to wait for other threads that may already have started to register a child. If those other threads are This looks pretty awkward. Won't it cause lockdep to complain about It buys us one thing: The system will continue to limp along instead of locking up. If drivers don't check whether registration succeeded... What can I During suspend_late and resume_early _every_ device is suspended, including the fictitious "device-tree-root" device. Hence _every_ registration is for a child of a suspended device. Besides, you don't want to allow new devices to be registered during suspend_late, do you? They wouldn't get suspended before the system Right now that may be the easiest solution. In fact, it may still be I'm not sure I understand. Sure, autoresume may not involve calling the driver's resume method. But it does involve actually setting the Are you still considering adding separate methods for suspend and hibernate (maybe also for freeze and prethaw)? Perhaps the "prevent_new_children" and "allow_new_children" methods could be added then. This would allow some of this complication to go away. Alan Stern --
I'm still not sure if this particular race would happen if only the registering Why would it? It's not taken recursively at any place. In fact we can't take dev->power.lock under dpm_list_mtx, because that would deadlock (we have to be able to take dpm_list_mtx to complete the suspending of devices). For this reason, dpm_list_mtx is taken under dev->power.lock. However, we don't know which lock to acquire (dev is unknown) until we get the next device to suspend. Thus, I first get the device (under dpm_list_mtx) and release dpm_list_mtx. Next, I take dev->power.lock and dpm_list_mtx under it and check if the selected device is still the last on the list. If it's not, something has been registered in the meantime and it's better to get the new device to suspend first (this seems to be cleaner than resuming an already suspended device). I could release dev->power.lock as soon as suspend_device(dev, state) has run, but that could result in new devices being added to dpm_active in a wrong order, so it's better not to release it until resume_device(dev). However, it's probably is safe to release it right prior to running resume_device(dev) (after we've added dev back to dpm_active), because in that case the ordering It may oops, though, if a driver attempts to use a device that it failed to Well, people want to remove the freezing of tasks altogether from the suspend In fact, that's the matter of how we are going to handle the runtime PM vs I wonder how that would be different from using dev->power.lock for blocking the registration of new children. The only practical difference I see is that the driver will have to block the registrations, this way or another, instead of the core. Thanks, Rafael --
That's different. Before you were talking about acquiring dev->power.lock _before_ calling the suspend method. Now you're talking about blocking child registration _after_ the parent is already suspended. It might work if you did it that way. In theory it _should_ work, since nobody should ever try to register a child below a suspended parent. Given that this is merely a way of preventing something which should never happen in the first place, is it really necessary to add the extra lock? Certainly it's simpler just to fail the registration. If it turns out later that we'd be better off blocking it instead, we It is as far as lockdep is concerned. You acquire power.lock for the first device, then you acquire it for the second device. Lockdep doesn't know the two devices are different; all it knows is that you have tried to acquire a lock while already holding an instance of that same lock. It's the same problem that affects attempts to convert dev->sem to a mutex. As for the ordering of the lock and moving the device to dpm_off -- it's less of a problem if you don't acquire the lock until after the suspend method returns. You can lock it just before reacquiring Which is better, an oops or a hang? As far as the user is concerned, either one is useless. For kernel developers, an oops is easier to debug. In the end we should just try it and see what happens. I don't think we can decide which will work out better without some real-world That's not what I mean. In the long run it will turn out that certain kernel threads _want_ to be frozen. That is, if allowed to run during a system sleep transition they would mess things up, and their subsystem is designed so that it can carry out a sleep transition perfectly well without the thread running. (An example of such a thread is khubd.) To accomodate these threads we can freeze them -- that's easy since the freezer already exists. Or we can remove the freezer and provide a new way for the...
Hm, perhaps I don't understand the issue correctly, so please let me restate it. We have the design issue that it's possible to register a child of a device that was taken for suspend or even suspended which, in turn, leads to a wrong ordering of devices on dpm_active. Namely, suppose that device dev is taken from the end of dpm_active and suspend_device(dev, state) is going to be called If at the same time a child of dev is being registered and suspend_device(dev, state) returns 0, dev will be taken out of dpm_active, but the new device (its child) will be added to the end of dpm_active and subsequently we'll attempt to suspend it. That will be wrong. Also, if a dev's child is registered after suspending all devices, it will go to the end of dpm_active, so after the subsequent resume dev will end up closer to the end of the list than the new child. Thus, during the next suspend it will be taken for suspending before this child and that will be wrong either. Note that this situation need not look dangerously from the driver's perspective, since it doesn't know of the dpm_active ordering issue. One of the possible solutions is to require suspend_device(dev, state) to return an error if it detects a concurrent child registration. This, however is not sufficient, because the registration of a child may go unseen, right after suspend_device(dev, state) returns and before dpm_list_mtx is reacquired. For this reason, we need an additional mechanism to detect such situations and work around them. [Hence, the question arises whether it's really necessary to require suspend_device(dev, state) to detect concurrent registrations of children (we need to detect them independently anyway) etc.] There also would have to be a mechanism for detecting registrations when all devices have been suspended (we discussed that approach previously). The other possible solution, and that's the one I'm considering, is to use locking to prevent the registration of children of suspended or suspending...
Yes, but it's still wrong. It's also wrong to register a new device (even one that has no parent) during the suspend_late stage, because It's worse than you describe, because suspend_device() is unable to tell whether a concurrent child registration is valid or invalid. By "valid", I mean that it took place during the window before the suspend method was called or before the method was able to prevent new child registrations. Valid child registrations must be allowed to proceed. There's no reason to fail them, because the driver has done nothing wrong -- and it wouldn't be safe since drivers don't check for failure. Likewise, blocking them is likely to cause the suspend method to deadlock; see below. But since suspend_device() can't tell which concurrent child registrations are valid, it has no choice but to allow all of them. This means our only option for recovering from a concurrent child registration is to resume the parent (don't move it to dpm_off) and continue. That way the list ordering will remain correct. Once the parent is fully suspended, suspend_device() has returned, and the parent has been moved to dpm_off, the situation is different. Then we _know_ that child registrations are invalid, so either blocking In view of my comments above, we _must not_ prevent registration of children of suspending devices. It's okay to prevent registration of suspended devices. There's nothing wrong with using a lock for this purpose, but the lock should not be acquired until after suspend_device() has returned and we have verified that no children Consider a situation where a kernel thread is used by a driver for several purposes, including registering new children. The suspend method will have to synchronize with the thread for obvious reasons: you don't want the thread to be using the device at the same time as the device is being suspended. A typical synchronization approach is for the suspend method to wait until the other thread is idle (the so...
If new children get registered when the parent is suspended, that's already wrong, because the children should have been suspended before. Think of a case when the parent is a bus and the children are devices on it. In that case, by suspending the parent before the children we can make the children In fact we don't. We only know that suspend_device() has won the race with I don't agree with that (not only because the last sentence is oversimplified ;-)). I think that the rule "the driver must not register new children after ->suspend() has run" is not a good one, because in fact we don't want ->suspend() to be called while new children are being registered. IMO, we should make the rule that "device registration may fail if it's carried out concurrently with the parent's ->suspend() method". At least, that will tell I agree it's not a good idea to hold the locks throughout the entire cycle, but that can be overcome if we use an additional variable under Still, I think we can fail them. In fact, drivers _should_ check for device_add() failures and if they don't, Well, I think it's not correct to allow the parent to suspend with active (not I agree. What I was trying to say is that the core should protect itself, to the maximum reasonable extent possible, from suspending a parent before the On the systems I'm talking about there are some devices referred to by the ACPI _PTS method, so they must be on line whey this method is being executed. Since we don't know a priori what devices they are, we must put all devices on Well, that's correct. However, if we make the rule that device_add() may fail if it's run concurrently with the parent's ->suspend(), the changes of the drivers need not be substantial (they should check for the failures of device_add() anyway). [BTW, there is a bug in the error path of device_add() for which I'm going to send a fix later today. Namely, in case of a BusError, dpm_sysfs_remove() is executed after device_pm_remo...
Agreed. That's why I've been saying all along that once the parent has been moved to dpm_off, we should either block or fail child registrations. Drivers trying to register a child at such times are clearly buggy anyway. What we are really trying to agree on is how the PM core should handle child registrations just before and while the parent is suspending. Drivers trying to register a child at such times need not be buggy at all; they may simply have lost a race. I agree that the core needs to protect itself, but I also think that drivers should need minimal changes -- preferably none. If the core becomes moderately more complicated as a tradeoff for keeping drivers a little simpler, then IMO it's a win since there are lots and lots of But you do agree that "drivers must not register new children after ->suspend() has run" is correct, right? You just don't think it goes In doing this you are putting a tremendous extra burden on drivers. You force _them_ to handle registration failure caused by an impending suspend, something the driver has no way to know about beforehand! In effect, you are trying to take the extra complication my patch adds to the PM core and instead add that complication to _every_ The new patch is no better. If a driver tries to create a new child just before its suspend method is called, the registration will fail with -EBUSY for no reason the driver is aware of. So now the driver has to detect the failure (which many drivers don't do!) and figure out Of course they should check. But they shouldn't have to deal with failures that need to be retried for no good reason. In the long run, I still believe the best approach is to tell drivers beforehand they should stop registering children. That will make both of us happy: The PM core can fail all later registration attempts with no qualms, and drivers won't have to worry about unexpected failures. How many subsystems register new children at arbitrary times? The And I th...
That's not enough, though. As soon as the state of the parent has changed, it's incorrect to register any new children. Moving the parent to dpm_off is Would you agree, however, that the driver should be prepared for its ->resume() being called right after ->suspend() and the ->suspend() repeated As I said above, I think that resuming the parent in case of a concurrent child registration is not a trivial modification. It may seem trivial, but That's correct. However, if such a child registration happens with the code we currently have, there will be a problem, so the driver doing that may be considered as buggy _right_ _now_. Thus, in fact, we need not worry about any existing drivers and we may require future drivers to follow the additional IMO they are things done at a wrong time. What we're talking about is a driver trying to ignore the fact that it may be suspended and do all things as though that's never going to happen. In fact, if you use separate threads for the registration of children etc., you should have implement a notification mechanism that will let you know when a suspend is going to occur. Otherwise, you do things in a racy way and expecting that they'll never fail is just I'm fine with that, although I'd call the new methods ->begin() and ->end() or something like this, since they may generally be used for other purposes On the systems in question (some NVidia chipsets for K8) USB controllers have to be in the full power state before executing _PTS. IMO these systems are just badly designed and we can blacklist or something like this. It also is against the ACPI 2.0 spec, although it's compliant Then, if a child registration occurs while suspend_device(parent) is being run, dpm_active will be in a wrong order, unless suspend_device(parent) returns Well, we can add new callbacks for notifying drivers of an impending suspend. In that case, say we add a ->begin() callback for this purpose (in fact that would be...
This is now a moot point, but I'll answer it anyway... Drivers should not depend on any particular time interval between their suspend and resume methods being called. It should be okay to call one and then the other a microsecond later, or 100 days later. The driver "begin_sleep" and "end_sleep" are less ambiguous. I was just thinking the same thing. For instance, plenty of drivers register children as part of their probe method, so the begin_sleep method should prevent subsystems from probing the device. (Not to mention that it's kind of awkward for a driver to probe a suspended After more thought, I'm not so sure about this. It might be a good idea to call the begin_sleep method just before the suspend method (or any of its variants: freeze, hibernate, prethaw, etc.) and call the end_sleep method just after the resume method. This minimizes the time drivers will spend in a peculiar non-hotplug-aware state, although it means that begin_sleep would have to be idempotent. It also allows sophisticated drivers to do all their processing in the begin_sleep (and end_sleep) method: both preventing new child registrations and powering down the device. At the moment I'm not sure whether this would turn out to be a good strategy, but it might. Alternatively, some subsystems might want to stop all child registrations right at the beginning of the sleep transition. This would amount to blocking the kernel thread responsible for those registrations when the sleep begins -- in other words, making that This isn't quite so simple either. A notifier isn't good enough because it doesn't provide any synchronization. On the other hand, how many devices ever get registered without a parent? Does this happen at all after system startup? Maybe when a loadable module for a new bus type initializes... We could block module loading at the start of a Fortunately that won't cause any problems for some time to come. There are no current plans for doing runtime PM of PCI-base...
Well, I think there should be a window between ->suspend_begin() and ->suspend() allowing the core to cancel the suspending of given device and to select another one. For example, if there's a child registration concurrent wrt ->suspend_begin() which completes after ->suspend_begin() has been called, but before ->suspend_begin() has a chance to block it, the core should not call ->suspend() for the device, but select another one (the child). Of course, it won't be necessary if the ->suspend_begin() methods are called Well, I think the possibility to freeze kernel threads on demand may be considered as one of suspend synchronization mechanisms available to drivers. Yes, that should work. I'll rework the patch along these lines (it starts to I meant ->hibernate_begin() as ->freeze_begin() and I don't see any reason why ->prethaw_begin() would be different from it (the difference between ->freeze() and ->prethaw() -- now called ->quiesce(), btw -- is that the former saves If such devices serve as logical parents of some "real" ones, we should at least mark them as "sleeping" during suspend to protect the dpm_active ordering. Thanks, Rafael --
With a sophisticated driver that would never happen, because after blocking new child registrations, the driver would check that the power.sleeping flag is set in all the children before powering down the device. But like I said, I'm not sure if this would be a good strategy. (This partly has to do with the requirements for runtime PM. During a runtime suspend the driver does have to check the children's status; it can't rely on the PM core. So if the check has to be done anyway, why not also check during a system sleep?) With non-sophisticated drivers, it definitely could happen that a new child is registered concurrently with begin_sleep. Then the core would go back and suspend the child first, as you say. Eventually the core would return to the parent device, at which time it would call the parent's begin_sleep method again -- unless we add another flag to I had a change like that in my version of the patch. It's excerpted AFAICS there needs to be only one begin_sleep method. It should apply equally well to suspend, freeze, quiesce, and hibernate. (But not to There's no need for that. If it isn't implemented, treat it as though it was successful. But if a driver implements suspend() and leaves begin_sleep() as just a stub (which many drivers might reasonably do, if they never register any children), then the core shouldn't require suspend() to succeed merely because begin_sleep() did. They wouldn't have to be marked at all. They would never get on any of the PM lists, because they would never be passed to device_pm_add(). The status of such devices is a little peculiar. For example, consider the MMC subsystem. There's an MMC controller, generally a platform device. Its child is an mmc_host device, but the mmc_host methods are called by the platform device's methods -- there is no driver associated with an mmc_host. At one time the mmc_host was a class_device; that's may explain in part why it works this way. Alan Stern Ind...
The _core_ needs the window, not the driver. Even if the driver discovers that there are active children, it can only fail ->suspend(), in which the core will faill the entire power state transition (unless we reserve an error code for Hm, I wonder why you didn't move dpm_sysfs_add() along with device_pm_add()? Perhaps it's better to include dpm_sysfs_add() into device_pm_add(), since we It definitely would be simpler to assume so and introduce just one common Well, I'm not sure. Right now we have a problem with distinguishing drivers that don't implement ->suspend() purposefully from the ones that just don't support suspend/hibernation ... OTOH, since we are going to have a pointer to 'struct pm_ops', we can safely But dev->parent will be not NULL for their children being on dpm_active ... IMO it's simpler to just add those devices without any suspend callbacks defined to dpm_active and handle them normally than to introduce a special case. Thanks, Rafael --
Let's stick with that for now. What about on the resume side? Do you want to prevent child registrations until after end_sleep has run? Or would you be okay with allowing the resume method to register new children? It should be safe to assume that drivers are smart enough to bring the device back to full power before looking for new children. In fact, maybe we don't need an end_sleep method at all. There isn't any synchronization issue to worry about -- drivers aren't going to That's reasonable. If a driver doesn't support PM then it won't have Yeah, I'm just trying to think too far ahead. Alan Stern --
I think ->resume() can register new children. There's nothing wrong with Okay, I'll prepare a patch for that, on top of the one introducing the Exactly. The question remains what we're going to do with the drivers without pm_ops pointers in the long run (in the short run we will use the legacy callbacks in that cases, if defined). Thanks, Rafael --
While you're at it, could you add a field to indicate whether begin_sleep() has been called? It would help prevent multiple calls to that method when a race does occur, and it could be useful for drivers One possibility is to unbind those drivers at the start of a sleep transition and reprobe them at the end. Another possibility is to ignore the lack of PM support and hope it doesn't cause any problems. Alan Stern --
That will be added along with the new callbacks. I'd prefer to do all that in separate patches, so that it's clear what issue is That would be easy, but that's what we do now and there are problems with it. Thanks, Rafael --
No big deal. In fact we're working with Alex on a patch that modifies main.c Oh, yes. Will you please send a fix when Yeah. I left pm_sleep_rwsem, because I wasn't sure it would be necessary I think we should remove device_pm_schedule_removal() early in the 2.6.26 cycle. I'll review the patch more thoroughly tomorrow, since I have some new material for the regressions list I need to take care of. Thanks, Rafael --
Actually the move is done before the method is called. So this isn't a problem. All right, if it doesn't happen now then we don't need to allow for it. That makes life a little simpler. Alan Stern --
| Greg Kroah-Hartman | [PATCH 019/196] DMA: Convert from class_device to device for DMA engine |
| Tejun Heo | [PATCH 4/7] FUSE: implement direct lseek support |
| Parag Warudkar | BUG: soft lockup - CPU#1 stuck for 15s! [swapper:0] |
| Greg Smith | PostgreSQL pgbench performance regression in 2.6.23+ |
git: | |
| Len Brown | fatal: unable to create '.git/index': File exists |
| Dan Farina | backup or mirror a repository |
| André Goddard Rosa | Using kdiff3 to compare two different revisions of a folder |
| Petko Manolov | git and binary files |
| Richard Stallman | Real men don't attack straw men |
| Steve B | Intel Atom and D945GCLF2 |
| Jeff Ross | U320 Drive on U160 controller? |
| Sunnz | How do I configure sendmail? |
| Eric Dumazet | [PATCH] fs: pipe/sockets/anon dentries should not have a parent |
| Denys Fedoryshchenko | thousands of classes, e1000 TX unit hang |
| Wei Yongjun | [PATCH] xfrm: Fix kernel panic when flush and dump SPD entries |
| Steffen Klassert | [RFC PATCH 4/5] crypto: allow allocation of percpu crypto transforms |
