Suspend to RAM on a machine with / on a fuse filesystem turns out to be
a screaming nightmare - either the suspend fails because syslog (for
instance) can't be frozen, or the machine deadlocks for some other
reason I haven't tracked down. We could "fix" fuse, or alternatively we
could do what we do for suspend to RAM on other platforms (PPC and APM)
and just not use the freezer.
Signed-off-by: Matthew Garrett <mjg59@srcf.ucam.org>
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 8812985..5f109d5 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -19,7 +19,6 @@
#include <linux/console.h>
#include <linux/cpu.h>
#include <linux/resume-trace.h>
-#include <linux/freezer.h>
#include <linux/vmstat.h>
#include "power.h"
@@ -81,11 +80,6 @@ static int suspend_prepare(suspend_state_t state)
pm_prepare_console();
- if (freeze_processes()) {
- error = -EAGAIN;
- goto Thaw;
- }
-
if ((free_pages = global_page_state(NR_FREE_PAGES))
< FREE_PAGE_NUMBER) {
pr_debug("PM: free some memory\n");
@@ -93,7 +87,7 @@ static int suspend_prepare(suspend_state_t state)
if (nr_free_pages() < FREE_PAGE_NUMBER) {
error = -ENOMEM;
printk(KERN_ERR "PM: No enough memory\n");
- goto Thaw;
+ goto Restore_console;
}
}
@@ -118,8 +112,7 @@ static int suspend_prepare(suspend_state_t state)
device_resume();
Resume_console:
resume_console();
- Thaw:
- thaw_processes();
+ Restore_console:
pm_restore_console();
return error;
}
@@ -170,7 +163,6 @@ static void suspend_finish(suspend_state_t state)
pm_finish(state);
device_resume();
resume_console();
- thaw_processes();
pm_restore_console();
}
--
Matthew Garrett | mjg59@srcf.ucam.org
-Sorry, no.
* this needs audit of all drivers. Or we can just merge it and then fix all the problems it causes. If you are willing to become suspend maintainer and handle all that mess, perhaps we can do this.
* it does not solve FUSE vs. hibernation
* it does not solve FUSE vs. suspend-to-both
* userspace will now see CPUs going up and down at minimum
Now, we want to do something like this long-term, but I do not think we can just remove the freezer like this.
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
Quite apart from the sync() matter, _any_ synchronous call to a FUSE filesystem during STR will cause trouble. Even if the user task implementing the filesystem isn't frozen, when it tries to carry out some I/O to a suspended device it will either: block until the system wakes up, or cause the suspend to abort. Neither outcome is desirable. Alan Stern -
For the suspend to RAM case, that sounds absolutely fine. -- Matthew Garrett | mjg59@srcf.ucam.org -
It's not so good when your suspend process has to wait for the call to complete! Alan Stern -
Why would it have to? Sorry, I suspect I'm missing something obvious here. -- Matthew Garrett | mjg59@srcf.ucam.org -
Well, the sys_sync() that caused your original problem did exactly that. It's the reason you get deadlocks, right? I agree that in general the suspend process should not have to wait for a userspace callback to complete. Indeed, there's no particular reason that anything running during STR should have to wait for something in userspace to complete. Given that fact, I don't see anything wrong with freezing userspace when doing STR. Alan Stern -
The sys_sync is unnecessary in the first case. There shouldn't be anything in the suspend path that's going to require userspace access to There's nothing wrong with it as such, it's just that our implementation appears to suck in a myriad of small ways that keep cropping up and biting people. Even without the sys_sync(), freezing processes results in the suspend failing because syslog is stuck in D state and won't go into the refrigerator. -- Matthew Garrett | mjg59@srcf.ucam.org -
Okay, I can believe that. The proper response then is to fix the freezer, not eliminate it. Has the syslog problem been reported on linux-pm? I don't recall hearing of it before. Alan Stern -
See the start of this thread. It's just not clear what the freezer buys us - removing it gets rid of a load of subtle issues and complexity, and turns system suspend into something that looks more like runtime suspend (which might then encourage people to get runtime suspend right...) -- Matthew Garrett | mjg59@srcf.ucam.org -
No, no -- you have it exactly backwards. Removing the freezer turns STR into something _less_ like runtime suspend, because it adds the requirement that devices must not automatically be resumed when an I/O request arrives. Alan Stern -
Could you please rediff against the current -mm tree? There are some patches in there that this will clash with. I still think that this is a mistake, BTW. Please see Alan's post at https://lists.linux-foundation.org/pipermail/linux-pm/2007-June/012847.html Greetings, Rafael -- "Premature optimization is the root of all evil." - Donald Knuth -
Hi! Can we get them? They are necessary for debugging the 'what in suspend calls fuse' problem. And yes, that problem is there even when you remove the freezer. Matthew, you seem to be the only one able to produce them... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
I can produce them, but haven't managed to do that in any way that lets sys_sync may not do anything on a fuse filesystem, but if you've loopback-mounted an ext3 filesystem from that fuse filesystem then it's going to result in writes to it and deadlock. -- Matthew Garrett | mjg59@srcf.ucam.org -
Aha, yes, that explains the problem. Thanks! Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
If you can see them, then perhaps you could use a digital camera or just copy the text manually. [snip] -- Jeremy Maitin-Shepard -
Note, though, that this won't help at all when people use the "suspend-to-ram instead of powering down after writing a hibernation image" feature in (uswsusp | tuxonice). Fuse is just a broken idea in the first place, but given that it exists, we still need to find the underlying cause. Regards, Nigel
If / is on fuse it's unlikely that hibernation is high on your list of priorities right now. -- Matthew Garrett | mjg59@srcf.ucam.org -
Hi. Yeah, well... what can you say to that? :) Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info.
No, Fuse is not a broken idea in the first place. It's the freezer that is a totally broken idea. It has proven many times to be racy by design and cannot be made right. The usermode helper mess is just part of that, fuse is another example, etc etc ... So I think Matthew is totally right. In fact, the presence of the freezer is the main reason why Paulus so far NACKed Johannes' attempts at merging the PPC PM code with the generic code in kernel/power.c. We've been doing fine without it so far and intend to continue to do so. As for suspend-to-disk, I refer you to the discussions we had in the past with Linus, where he explains I think quite clearly how wrong the current implementation of STR is :-) Thing is, if you're going to do snapshots, you should probably not sync after you have "frozen" anyway. Cheers, Ben. -
How well does it work on SMP PPC? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
Just fine, on those machines where we know how to reinitialize the video card. We currently require userspace to offline all except the boot cpu before suspending, but that could be moved into the kernel. I have no particular attachment to that way of doing it; it was just a "don't do things in the kernel that can reasonably be done in userspace" kind of thing. Paul. -
Hi. To some extent, I agree. I think the ideal solution would be to simply not schedule processes that are supposed to be frozen. But who wants to play with Fuse depends on !PPC? I assume you mean STD. The problem there is that Linus doesn't care about STD. If he did, I dare say he'd think through the issues more thoroughly than he apparently has. Fully agree. But how do you stop things syncing while you're writing the image if you don't have a freezer or equivalent? (scheduler based, kexec.. they're all workarounds for this issue). Regards, Nigel -- Nigel, Michelle and Alisdair Cunningham 5 Mitchell Street Cobden 3266 Victoria, Australia
No, that's not what I'm saying. I'm saying we've been doing STR without Well, I was saying that in the context of the -current- snapshotting mechanism which is based on the freezer, then you should not sys_sync(). Some random user or kernel thread doing a sync is not a problem. It will stop in the middle of sync and resume on wakeup. The problem is currently because STD -itself- attempts to sync after it has frozen things. I think that should be changed. If you want to sync for whatever reason, (mostly save RAM ?) do it before the freeze. That means you may get new dirty data in memory that isn't written out by the sync before you freeze, but that's all right, that data will be in the suspend image anyway. If you fail to wake up, that's akin to a normal crash, the user will only lose the last data written at the time of the suspend and journaling fs'es should take care of fs metadata integrity.
So to summarize, the plan that makes things work with fuse is:
- For STR, don't do the freezer thing.
- For STD, don't sys_sync() after you froze
There might be -other- issues, but that should get you through some of them at least. Of course, you'll be in trouble if you try to do things like STD-to-a-file which sits on a fuse FS but there's a limit to insanity :-) Cheers, Ben. -
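The ordering point above (sync before the freeze, never after it) can be sketched with a small userspace model. This is only an illustration under stated assumptions: `model_system`, `model_sync()`, `model_freeze()` and the two `prepare_*` helpers are hypothetical names, not kernel interfaces, and the real failure mode is a deadlock, which the model reports as an error return instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a filesystem whose writeback needs a userspace daemon. */
struct model_system {
	bool fs_daemon_frozen;	/* true once the "freezer" has run */
	int dirty_pages;	/* pages that a sync would have to write out */
};

/* Stand-in for sys_sync(): it cannot complete if the daemon is frozen. */
int model_sync(struct model_system *sys)
{
	assert(sys != NULL);
	if (sys->dirty_pages > 0 && sys->fs_daemon_frozen)
		return -1;	/* models the deadlock as an error */
	sys->dirty_pages = 0;
	return 0;
}

/* Stand-in for freezing userspace, including the filesystem daemon. */
void model_freeze(struct model_system *sys)
{
	assert(sys != NULL);
	sys->fs_daemon_frozen = true;
}

/* Order suggested in the message above: sync first, then freeze. */
int prepare_sync_then_freeze(struct model_system *sys)
{
	int err = model_sync(sys);

	if (err)
		return err;
	model_freeze(sys);
	return 0;
}

/* Order the message argues against: freeze first, then sync. */
int prepare_freeze_then_sync(struct model_system *sys)
{
	model_freeze(sys);
	return model_sync(sys);
}
```

In this model the first ordering always succeeds, while the second fails as soon as there is dirty data that depends on a frozen task. Data dirtied between the sync and the freeze is not modelled; as the message says, it would simply end up in the image.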
In the long run, I agree. Still, can you please read this post from Alan Stern: https://lists.linux-foundation.org/pipermail/linux-pm/2007-June/012847.html ? I don't think I'm able to repeat the arguments given in there in a Yeah, I think we can move the syncing before the freezing, so to speak. Yes. :-) Greetings, Rafael -- "Premature optimization is the root of all evil." - Donald Knuth -
That's the same crackpot I've been hearing for the past 3 years or so ... Both Paulus and I think the freezer is just a way to try to put your head in the sand and ignore the problem. It causes as many problems as it solves on its own, and is just not a solution that will be of any use once you start implementing dynamic PM schemes etc... In many cases, having proper support for "live" suspend of devices is just a matter of having a couple of helpers in whatever subsystem those drivers hook up with. In the case of network, for example, it's mostly trivial (stop the queue). For block, it's not terribly hard either, though you want to have some ordering/atomicity between the blocking of the incoming request queue and the sending of things like spindown & flush commands to the disk. For old-style IDE, that was fairly easily solved by piping suspend/resume commands down the request queue itself and having the queue block/unblock itself after processing them. Some of that logic could maybe be moved to the block layer for all block drivers to benefit. But yes, overall, there is work to do on drivers and I'm doing the ones I hit on the platforms I use. I don't think the freezer is any kind of remotely good solution, just a way to continue avoiding the problem. Ben. -
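The "pipe the suspend command down the request queue" idea described above can be sketched as a tiny userspace model. This is not the drivers/ide code; `model_queue`, `queue_submit()`, `queue_run()` and `queue_resume()` are hypothetical names, and the resume command is modelled as being handled ahead of everything still pending, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum req_type { REQ_IO, REQ_SUSPEND };

#define QUEUE_DEPTH 16

/* Hypothetical model of a driver request queue (a simple ring buffer). */
struct model_queue {
	enum req_type pending[QUEUE_DEPTH];
	size_t head;	/* index of the oldest pending request */
	size_t count;	/* number of pending requests */
	bool blocked;	/* set once a suspend request has been processed */
	int io_done;	/* number of I/O requests completed */
};

void queue_init(struct model_queue *q)
{
	assert(q != NULL);
	q->head = 0;
	q->count = 0;
	q->blocked = false;
	q->io_done = 0;
}

/* Requests are always accepted; they just wait while the queue is blocked. */
int queue_submit(struct model_queue *q, enum req_type type)
{
	if (q->count == QUEUE_DEPTH)
		return -1;
	q->pending[(q->head + q->count) % QUEUE_DEPTH] = type;
	q->count++;
	return 0;
}

/* Process requests in submission order until the queue empties or blocks. */
void queue_run(struct model_queue *q)
{
	while (q->count > 0 && !q->blocked) {
		enum req_type type = q->pending[q->head];

		q->head = (q->head + 1) % QUEUE_DEPTH;
		q->count--;
		if (type == REQ_SUSPEND)
			q->blocked = true;	/* everything queued earlier is done */
		else
			q->io_done++;
	}
}

/* Resume unblocks the queue and lets the waiting requests proceed. */
void queue_resume(struct model_queue *q)
{
	q->blocked = false;
	queue_run(q);
}
```

Because the suspend request travels through the same queue as ordinary I/O, everything submitted before it completes first, and everything submitted after it waits until resume. That is the ordering property the message asks for, without freezing the submitting tasks.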
...but the moment you start blocking tasks that made a driver request, you _do_ have a mini-freezer of your own, with pretty much the same problems. In another message I showed that removing the freezer will not help with FUSE in the general case. It probably does not help with firmware, either; as soon as udev attempts to do something with your wireless card, it is blocked, and if the wireless card needs the firmware from udev, you are deadlocked. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
No, not at all the same problems. Those tasks will block, but that will be harmless because we won't have some "freezer" things waiting for all tasks to reach a "stable" point (calling try_to_freeze()). We just let them block wherever we want, as long as it doesn't prevent a -driver- Firmware load has been a problem since day 1, I've talked about it multiple times, it's broken with or without the freezer, and so far, the reaction of pretty much everybody has been to dig their head deeper in the mud and ignore the problem. There are other issues (again, with or without freezer) that should be dealt with. For example, drivers that haven't yet got their suspend() callback or already have got their resume() may rely on services of the kernel that are still blocked, that's where things may go hairy. request_firmware() within resume() is a typical example of that. There are a few things we should do in that area. For example, once we start to call driver suspend's, we should probably set a system wide flag that will do things such as:
- block usermode helpers (either make call_usermodehelper return something like -EBUSY or have it queue up the calls and issue them later when things are resuming, we need to look closely at what semantic we want here).
- Silently add GFP_NOIO to all allocations, to avoid having things blocking in kmalloc() with a mutex held that will deadlock with suspend() in a driver for example. Or set some way to have all GFP waiters wakeup and fail rather than wait for IOs. It's hard/bizarre but necessary, again, with or without a freezer.
- Deal with the firmware problem. The best way is probably to have an async request_firmware() interface. Another thing is, drivers may want to cache their firmware in main memory, that sort of thing...
And that's just a small list off the top of my mind, of known problems that will cause deadlocks or misbehaviours today, with or without the freezer, and that need to be addressed. Ben. -
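A rough userspace sketch of the "system wide flag" proposed above. Everything here is a hypothetical model: `suspend_in_progress`, the `MODEL_GFP_*` bits and the `model_*` functions are invented for illustration and are not the kernel's definitions. Of the options the message lists, the sketch picks "fail with -EBUSY" for usermode helpers and "mask the I/O bits" for allocations.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Model allocation flags; the values are made up for this sketch. */
#define MODEL_GFP_WAIT   0x1u	/* allocation may sleep */
#define MODEL_GFP_IO     0x2u	/* allocation may start block I/O */
#define MODEL_GFP_FS     0x4u	/* allocation may call into filesystems */
#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOIO   MODEL_GFP_WAIT

/* The system-wide flag, set once driver suspend callbacks start running. */
static bool suspend_in_progress;

void model_suspend_begin(void)
{
	suspend_in_progress = true;
}

void model_suspend_end(void)
{
	suspend_in_progress = false;
}

/* Usermode helpers are refused while devices are being suspended. */
int model_call_usermodehelper(const char *path)
{
	assert(path != NULL);
	if (suspend_in_progress)
		return -EBUSY;
	return 0;	/* the real work of spawning the helper is not modelled */
}

/* Allocations silently lose the bits that could wait on suspended devices. */
unsigned int model_effective_gfp(unsigned int gfp)
{
	if (suspend_in_progress)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	return gfp;
}
```

The alternative mentioned in the message, queueing helper calls and replaying them on resume, would replace the `-EBUSY` return with a deferred list; which semantic is right is left open there and is not decided here either.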
Hi. Will you be able to guarantee that every place where a task can/will block will be harmless place? If so, how will you guarantee that? How will you debug issues where a task occasionally doesn't block in the right place, particularly instances where it is some less than obvious interaction with other tasks? This is the whole point to having the freezer. It makes things more predictable and testable. It shows us, clearly, when process X is the one Why? GFP_ATOMIC? (In driver suspend, they shouldn't be sleeping either, right?) Userspace device drivers too? Regards, Nigel
No, the freezer creates all those places what are harmful for a task to NOIO should be enough I think but ATOMIC would do). That's one of the reasons why I used to have the pre-suspend and post-resume hooks in my original powermac implementation, for those few Note that the above firmware problem could be dealt with also with the pre-suspend/post-resume. Allowing to pre-request firmware etc... and keep it around until after resume, because we know we will need it. Gives a chance to drivers to perform things while the system is still live, filesystems still working, etc... (big memory allocations for Maybe but they are less of an issue, most of the time, they don't do DMA or whatever harmful things. If they are USB drivers, for example, they are a non-issue at that level. -
Hi. Nice try :) Okay then, you remove the freezer, try hibernating, then get back to me after you've fixed your filesystem because some process that wasn't frozen started writing things after the atomic copy (making the on disk filesystem inconsistent with the snapshot). As Pavel rightly said, you can get rid of the freezer, but you're only going to have to implement another one that does essentially the same thing, even if it is at some other level. (Leaving the rest of the message intact so we don't have to fragment the discussion into a million subthreads). Regards, Nigel
I was mostly talking about STR... Regarding STD, we have a different problem and we all know it. The freezer is one somewhat horrible way to get it working for now, I would prefer something more along the way that blocks the page cache from writing out new dirty pages though, except those specifically flagged by the snapshot. That is, some kind of proper snapshotting facility, as linus was describing some time ago. -
Hi. The kind of thing Linus was talking about would limit you (as swsusp and uswsusp do now) to only half the amount of memory. I suppose you could lzf compress as you did the snapshot. That would generally get you up to 2/3rds, but then again you can't know what compression ratio you'll get until you try, so reliability would suffer or it would take longer because of retrying. I/O from swsusp and suspend2 use bios directly, so the page cache isn't an issue for them (apart from the fact that Suspend2 saves the page cache separately so it can get a full image). Not sure about uswsusp. Only having half the amount of memory doesn't sound like a big limitation for modern desktops & laptops, but don't forget that there are embedded guys wanting to hibernate too :) Regards, Nigel -- Nigel Cunningham Christian Reformed Church of Cobden 103 Curdie Street, Cobden 3266, Victoria, Australia Ph. +61 3 5595 1185 / +61 417 100 574 Communal Worship: 11 am Sunday.
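The "half the memory" and "2/3rds" figures above follow from simple arithmetic, sketched here. The assumption (not stated explicitly in the message) is that the in-memory copy must sit in RAM next to the pages it was copied from, and that "2/3rds" corresponds to a compression ratio of roughly 2:1. `max_image_pages()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/*
 * Largest number of pages that can be snapshotted when the originals and
 * their copy must both fit in total_pages of RAM. The copy is assumed to
 * occupy ratio_num/ratio_den of the original size (1/1 = no compression).
 *
 *   image + image * num/den <= total
 *   image <= total * den / (den + num)
 *
 * The multiplication can overflow for very large inputs; that is ignored
 * in this sketch.
 */
unsigned long max_image_pages(unsigned long total_pages,
			      unsigned long ratio_num,
			      unsigned long ratio_den)
{
	assert(ratio_den != 0);
	return total_pages * ratio_den / (ratio_den + ratio_num);
}
```

With 3000 pages of RAM, an uncompressed copy (ratio 1/1) allows an image of 1500 pages, which is half; a copy compressed to half its size (ratio 1/2) allows 2000 pages, which is two thirds. As the message notes, the ratio is only known after compressing, so a bound computed this way is a guess that may force a retry.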
How so? Suppose hibernate is implemented like this:
(1) Userspace program calls sys_freeze_processes()
    (a) Pokes all CPUs with IPMIs and tells them to finish the currently running timeslot then stop
    (b) Atomically sends SIGSTOP to all userspace processes in a non-trappable way, except the calling process and any process which is ptracing it.
    (c) Returns to the calling process.
(2) Userspace process sends SIGCONT to only those processes which are necessary for sync and a device-mapper snapshot.
(3) Userspace calls sys_snapshot_kernel(snapshot_overhead_pages)
    (a) Kernel starts freeing memory and swapping stuff out to make room for a copy of *kernel* memory (not pagecache, not process RAM). It does the same for at least snapshot_overhead_pages extra (used by userspace later). It then allocates this memory to keep it from going away. Since most processes are stopped we won't have much else competing with us for the RAM.
    (a) Kernel uses the device-mapper up-call-into-filesystem machinery to get all mounted filesystems synced and ready for a DM snapshot. This may include sending data via the userspace processes resumed in (2). Any deadlocks here are userspace's fault (see (2)). Will need some modification to handle doing multiple blockdevs at a time. Anything using FUSE is basically perma-synced anyways (no dep-handling needed), and anything using loop should already be handled by DM. This includes allocating memory for the basic snapshot datastructures.
    (b) At this point all blockdev operations should be halted and disk caches flushed; that's all we care about.
    (c) Go through the device tree and quiesce DMA and shut off interrupts. Since all the disks are synced this is easy.
    (d) Use IPMIs again to get all the CPUs together, which should be easy as most processes are sleeping in IO or SIGSTOPed, and we're getting no interrupts.
    (e) One CPU turns off all interrupts on itself and takes an atomic sna...
Hi. Sorry for the long delay. Busy weekend and my motivation for working on programming is almost zero at the moment... Ok. First, I'll ignore the specification that userspace does this - I don't think it matters whether it's userspace or kernel that does the suspending and I'm yet to see a good reason for it to be [required to be] done from userspace. In this first step, you've reinvented the first part of the current freezer implementation. The reason we don't use a real signal is precisely so we can have an untrappable SIGSTOP. In this regard, I particularly remember Win4Lin from a few years ago. It would die if you sent it a real signal, so we had to do it this way. No doubt there are other instances I'm not aware of. How do you determine which ones are needed? Why stop them in the first place? Ok. So now you also need processes running that are needed for swapping, because freeing that memory might involve swapping. Fully agree with the logic though (not really surprising - this is what I do in Hotplugging cpus (when all those locking issues are taken care of) is simpler. Prior to cpu hotplugging, I used IMPIs to put secondary cpus into a tight loop, so I know it's possible to do it this way too. That way, though, you have less flexibility. What if a cpu really is plugged in between hibernate and resume? With cpu hotplugging, it's handled properly and transparently. Without cpu hotplugging, you could be using uninitialised data after the atomic restore. Marking userspace as COW makes things more complicated, too. You then have to add code to the COW handling to update the list of pages that need to be saved, and you reduce the reliability of the whole process. You can't predict beforehand how many of these COW pages are going to be needed, and therefore can't know how much memory to free earlier on in the process. If you run out You still need to remembe...
Thanks for the detailed reply! The reason it's _required_ to be done from userspace is that userspace is the only one which can figure out "These processes need to run for suspend to work", and then let those processes continue running after the freeze. The *ONLY* reason this even stops processes at all is so we can do the post-device-mapper-snapshot code with very little usably-free RAM (IE: only about 1MB for a standard Well, you *do* want it to have semi-signal semantics, processes which receive it must not get back to userspace code so that they don't start allocating more memory when we're trying to do the freeze. You also don't want a process to be able to trap it (IE: like SIGSTOP or SIGKILL). On the other hand, it should be delivered asynchronously (IE: It doesn't break an interruptible sleep or respond to most is-a-signal-present checks). You don't actually care if it's sleeping in the kernel somewhere, just as long as it doesn't allocate much memory. You would probably need a new signal "SIGFREEZE" which causes the process to be ignored as runnable the next time they schedule but never actually gets delivered, and a "SIGUNFREEZE" which does the reverse. That way userspace could selectively resume processes based It's userspace's job to know which ones are needed. For example, if you are hibernating over NFS then you need to resume the various NFS/ So they aren't allocating memory when we are doing the device-mapper It may be simpler, but it really screws up things like cpusets, processor affinity, etc. It also ties hibernation to the presently IMHO if the user pulls a CPU while the box is hibernated, then he/she gets what he/she deserves. If you really want to support that, then the user must do the hotplug operation *manually* before suspending. Anything else is just going to be shooting ourselves in the foot You could pretty easily have a spare 128MB swap partition somewhere which is not used ...
Hi Kyle. You're not talking about the same thing Linus was suggesting. He was just wanting a result = sys_snapshot() sort of call. That would limit us to half the amount of memory. I've looked over what you've written below and want to consider it in detail. Right now though, I don't have the time. I'll try to get back to you promptly. Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info.
Wait wait wait ... uses the BIOS ? what do you mean ? I know that for example, things like MacOS X use a separate polled path to the storage driver for suspend (works fine for the built-in IDE, but more complicated in large scale). If you can use BIOS calls to write your suspend image, that is, if you don't need any of the normal block infrastructure, then you don't need a freezer ! not at all ! You just do like STR ... and at the end of the day, once you have stopped all your driver, you shut interrupts off and do the BIOS thing.... I fail to see how processes could dirty pages while/after the BIOS thingy :-) But then, the problem with that approach is that of course, you need a BIOS capable of doing that (or a special sideband path to the "blessed" block driver that will be used for suspend ... not necessarily a hard thing to do, would be trivial to add support to drivers/ide or libata for that sort of things I suppose). Ben. -
Hi. You misread me, Ben. Sorry for not being clearer. bios as in struct bio. Regards, Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info.
Umm, this thread is NOT ABOUT HIBERNATING!!! Please go back and read
the subject, specifically the "suspend to RAM" parts :-D. When your
hardware can put itself to sleep and atomically preserve memory as it
does so, you don't need an atomic copy. For Real Suspend(TM) (IE:
Suspend-to-RAM), the list of things to do is short and simple:
1) Stop DMA and put most hardware into low-power states (stops all
interrupt sources)
2) Ensure that the other CPUs have finished any trailing interrupt
handlers and put them to sleep
3) Put the interrupt-controllers into low-power state
How about a freezer whose job it is to "wait for pending hard
interrupts to complete when we have already guaranteed that we won't
get any more"? That part should be really *REALLY* easy. You don't
need to care about either userspace processes or kernel threads at
all. Specifically, Step 1 consists of:
suspend_device(dev)
{
	set_no_bind_flag(dev);
	for (child in dev->subdevices)
		suspend_device(child);	/* recurse into each child, not dev itself */
	set_no_io_flag(dev);
	wait_for_in_progress_dma(dev);
	turn_off_interrupts(dev);
	go_to_low_power_state(dev);
}
After you've set the "no_bind" flag, you won't get any *new*
subdevices trying to bind, therefore it's safe to iterate over the
list of present sub-devices and suspend them. Once those are
suspended and in low-power states you can set a "no_io" flag to
prevent the driver from submitting more IO. At that point you can
lazily wait for existing DMA/IO/interrupts to finish on the device,
since *NOBODY* will be submitting them anymore, and we certainly
aren't probing for new devices. Then you can just turn off the power
to the device. When all the leaf devices are off, the parent device
can be turned off because everything waiting on the leaf devices is
blocked on them and won't unblock until the parent device *AND* the
leaf device are turned on again, in that order.
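The ordering described above (set no_bind, suspend the children, set no_io, then power down) can be checked with a small userspace model. This is only a sketch of the pseudocode: `model_device` and the `model_*` functions are hypothetical, refused binds and I/O are modelled as an error return rather than blocking, and waiting for DMA and masking interrupts are left as a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical device node in a parent/child tree. */
struct model_device {
	const char *name;
	struct model_device *children[MAX_CHILDREN];
	size_t nr_children;
	bool no_bind;		/* no new children may bind */
	bool no_io;		/* no new I/O may be submitted */
	bool powered;
	int suspend_order;	/* position in the overall suspend sequence */
};

static int suspend_counter;

/* Binding a new child is refused once the parent has set no_bind. */
int model_bind_child(struct model_device *parent, struct model_device *child)
{
	if (parent->no_bind || parent->nr_children == MAX_CHILDREN)
		return -1;
	parent->children[parent->nr_children++] = child;
	return 0;
}

/* New I/O is refused once the device has set no_io. */
int model_submit_io(struct model_device *dev)
{
	return dev->no_io ? -1 : 0;
}

/* Depth-first suspend: children go to sleep before their parent. */
void model_suspend_device(struct model_device *dev)
{
	size_t i;

	assert(dev != NULL);
	dev->no_bind = true;	/* the child list is now stable */
	for (i = 0; i < dev->nr_children; i++)
		model_suspend_device(dev->children[i]);
	dev->no_io = true;
	/* A real driver would wait for in-flight DMA and mask interrupts here. */
	dev->powered = false;
	dev->suspend_order = ++suspend_counter;
}
```

Suspending a controller with one disk attached powers the disk down first and the controller second, and later bind or I/O attempts are refused. What a refused request should do in practice (block, fail, or queue) is exactly the point the rest of the thread argues about, and the model takes no position on it.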
Scheduling and userspace are all still fully enabled in this
scenario. Once all yo...
Your short and simple list omits a few crucial items:
A) Decide what to do about remote wakeup requests.
B) Prevent I/O requests from resuming devices that have been suspended.
C) Prevent devices and drivers from being registered or unregistered; in particular decide what to do about hot-plug or hot-unplug events.
D) Block driver bind or unbind calls.
Any of these things is capable of screwing up the course of events. So what happens if a new subdevice arrives at the wrong time? Do you block instead of binding it? While holding a mutex needed to suspend the parent device? What about drivers trying to bind to existing devices? What happens to I/O requests submitted after the "no_io" flag is set? The driver will have to block them, effectively creating its own little This is a lot like what we already do. The differences are: There is nothing corresponding to your "no-bind" flag. Most drivers don't have anything like your "no_io" flag; they assume that nobody will be around to submit an I/O Nobody doubts that suspend can be made to work without the freezer. The point is that doing it this way dumps a bunch of extra responsibility on drivers. Alan Stern -
Except Linus already decreed (and I heartily agree) that hibernation and suspend-to-RAM were fundamentally completely different operations and therefore any attempts to share code were basically just making a big muddy mess of things. Would a thread "Remove phase-of-the-moon calculations from network-recv code" be relevant to lunar observation Why do we care? If the wakeup request arrives before we go to sleep, we obviously aren't asleep and so can't wake up. If it arrives after we go to sleep then it will wake us up. Anything that depends on a (1a) As I describe below, step (1) includes setting NO_BIND and NO_IO flags on devices as they are processed. Anybody who wants to do IO (1b) Again, that's where the NO_BIND flag comes in. If it's set then If any of those things screw up suspend-to-RAM then it is 100% the driver's fault and no "process freezer" is going to fix it, end of story. And "A" cannot be made reliable. At some point you shut off interrupts right before going to sleep, and at that point any remote wakeup event is just going to get dropped until you actually enter sleep mode and the hardware takes over again. If you miss a wakeup event then whatever sent it should just retry, just as with *every* That would be a driver bug. If you have asynchronous probing then proper suspend handling includes being able to postpone driver probe events until after resume. If you have synchronous probing then the problem doesn't exist because "set_no_bind_flag" is just telling the While binding it will clearly be holding a mutex/spinlock on the parent device, so the suspend process will wait for it. When binding is done the suspend_device() code will take the device lock and tell Oh, so you're calling every waitqueue in the kernel a "freezer" now? We do these things at the driver level *all* *the* *time*. For instance, you can't submit new IOs to an ATA controller while it's Most drivers have an implicit...
For that matter, shutting down a CPU and hibernation are fundamentally
different operations -- but they both use the freezer. Is that a big
You don't understand my point. If a wakeup request arrives before the
system goes to sleep, and it is serviced, then a device which ought to
have been suspended will in fact be awake. This will (if the parent's
driver is written correctly) cause the sleep transition to abort.
Not that there's necessarily anything wrong with that. I just wanted
I didn't say "bind", I said "registered". Admittedly, they are rather
similar.
Still, there are difficulties. Let's say a driver has set the NO_BIND
flag for one of its devices. A bind request comes in, and the driver
puts it on a waitqueue. Note that the binding thread holds the device
semaphore; this is always true when a driver's probe routine is called.
Later on it comes time for the PM core to resume the device, which will
start up the threads on the waitqueue. Before doing so it must acquire
Why do you say that? A "process freezer" can prevent bind and
registration calls from occurring, since these calls have to run in
Who mentioned network packets? And who says a remote wakeup event will
get dropped once interrupts are disabled? More likely it will set a
bit somewhere that causes the system to wake up immediately after it
One of your conditions (embodied in the pseudocode you posted earlier)
was that drivers should be told to prevent binding and registration
before the child devices are suspended. Currently the PM core doesn't
do anything like that. You can't blame the drivers for this lack.
Of course it could be added. Or perhaps more easily, the drivers that
support asynchronous probing could be notified when a suspend is about
("Parent device"? Do you mean the device being bound? If so then I
agree. Or do you mean the device's parent? If so then your statement
is not clear at all. There is special-case code in the driver core to
My question re...
Why would it ? Just make it fail, maybe with some kind of -ERETRY... Or No. The freezer will hide some of those problems under the carpet, but not solve the basic issue which is the driver should be solid. Period. The freezer is a flawed concept in the first place. If you go back to square one, what is the basic idea of it ? I'll basically expose the idea and go down all of the path I have in mind where it stops working and becomes an incredibly difficult thing that in the end doesn't even solve all the problems it's supposed to. So first thing first... I want a quiescent system with no new "IO requests" (whatever that means in the context of drivers) issued to avoid races during suspend/resume. That sounds like a nice idea. Yeah. Sounds... only. Problem is. How do you define that quiescent system ? First idea is ... let's stop userland. There are various ways of doing that, but the freezer hooking into the signal code is not necessarily a bad one. No, I'm purposefully putting aside all the cases where the above doesn't work (user process in the kernel in some uninterruptible wait, etc...), which are the first big setback imho... our simple idea is suddenly not so simple anymore, but we can bring those back later. Now, there is still a problem... kernel threads. In fact, there is no fundamental distinction between a kernel thread and a user process... one has an MM and the other doesn't but as far as we are concerned, it's the same. Kernel threads can issue IOs, or like khubd, detect devices, plug/unplug them, etc etc.... all over the place. Easy answer that comes to mind -> freeze them too. Heh, but kernel threads don't do signals, so we end up with all those try_to_freeze(). Then what about the fact that drivers may need those kernel threads to proceed ?
Some drivers queue up their IO requests to a kernel thread to process them and suspend() might need to flush those down, issue a couple more such as "spin down disks" before that kernel thread can actually be f...
Spinning in the driver with the lock not held is impossible, since the driver is called with the lock already acquired. Failing with -ERETRY is non-transparent. I would prefer to block such requests at their source, before the lock is acquired. Perhaps in the driver core, perhaps even earlier. (And rather than trying to manage a waitqueue or struct completion, it would be easiest to jump directly into the freezer! The driver or the You're missing the point. If the driver and the freezer are both solid, there's no reason they can't share the work. If many drivers can pass off part of their workload to the single freezer, it's a net win. So it isn't a question of how solid the drivers are; it's a question of how solid the freezer is. And bear in mind that if you convince people the freezer is not solid enough to be used, then you will have I'm willing to try, although I think it will be a tremendous amount of work to verify that every driver does the right thing. There's lots of support missing. For example, don't you think we should block all What about systems with no BIOS? I think this would be very hard or even impossible to make work. Alan Stern -
That's wrong. The freezer is NOT a solution for that sort of thing. Just
I've explained already multiple times that the freezer will not do what you guys expect it to do. IOs can be submitted at non-task time and there is no clear distinction between IO generating threads that must and those that must not be frozen.
I really can't understand why you guys work so hard at trying to avoid
Because I'm intimately convinced that the freezer is a wrong approach
Too bad, that's where the interesting points that show that the freezer
sysfs is a matter of driver. If a sysfs read/write callback in a driver is hitting the HW, it most certainly already has some kind of locking. That locking can/should be extended to deal with blocking when the HW is suspended.
However, since it seems that people universally consider it very hard to get right (I don't but heh), Linus and Paul have come up with a solution for most simple enough directly-mapped drivers such as PCI (ok, that doesn't include USB) which is to simply do the HW suspend in a late
You made the same mistake I did when reading Nigel's mail ... BIOs -> Block IO requests, not BIOS :-)
Ben.
-
You misunderstood. Stop trying to incite riot, calm down, and pay attention to what I actually wrote. I'll explain it again in more explicit terms: You agree that drivers need to block various activities during suspend. Principally I/O requests, but other things as well. So when one of these requests arrives, the driver has to make it wait somehow and then has to allow it to proceed at the appropriate time. Normally a waitqueue or a struct completion would be used for this purpose. But either one puts the burden on the driver of defining a data structure and signalling it at the right time. That time is generally when the device is resumed, but there's nothing wrong with delaying it slightly, to after all the devices have been resumed (i.e., the time when the current PM code takes everything out of the freezer). In fact, we definitely don't want to unblock plug events until this later time. So instead, why not have the PM core take care of all this? There could be a block_task_until_suspend_is_over() routine available for all drivers to use. Its effect would be exactly the same as sending the current task into the freezer, but it wouldn't be the freezer that exists now. It would just be some routine that blocks until the system suspend is over. We could call it "the icebox" instead of "the freezer". :-) User tasks can cause driver binding by writing to sysfs. Binding _can't_ be blocked in the driver; by then it's already too late. If it is going to be blocked at all, it has to be blocked earlier. One possibility is in the sysfs attribute code; another is to block all sysfs access. Of course, another possibility is simply to fail the bind. But that's Ben, you haven't given enough thought to the work needed to avoid locking problems. For instance, you agree that during suspend we must not allow device or driver registration or unregistration, right? And we must not allow driver binding or unbinding. But these events generally involve a...
Yes. The main thing I see here is that there is nothing common among drivers in what a "request" is and how it's processed.
For example, for block drivers, it's actually fairly simple to just stop processing the request queue and wait for pending ones to complete. For network, in a similar vein, we can mostly just tell the network stack to stop sending us packets.
That's what I call the "main path". This is often the trivial part to deal with, mostly because for a whole lot of drivers, it can be done via a couple of helpers in the subsystem that the driver provides a service to, via a helper, asking that subsystem to stop calling into the said driver (the asking should be done by the driver itself of course, for ordering reasons).
We have some helpers, but I think not enough, and that's where we should focus imho. For example, I added fb_set_suspend() so that fbdev's can request fbcon to stop accessing them (it doesn't solve the problem of userland mmap's, that will have to be done, if we want to do it, in a more sneaky way, using VM tricks, but the DRM nowadays has the infrastructure to do it).
But that's only the "main" path. Aside from that, almost all drivers also have sideband "request" input and some drivers don't actually live behind a subsystem. That ranges from ioctl, to direct read/write on a char dev from userland.
I think many of those cases can fairly well deal with just taking a PM semaphore, that's how I did for a couple of things in the past, provided that the request path isn't deadlocking with the semaphore held because of the system suspending of course.
But in a whole lot of cases, it's, I believe, perfectly kosher to just return an error. You're trying to capture a frame from your camera while the machine is suspended ? error. At worst, your capture app will be unhappy when you wake up, nothing terrible and totally fixable in userland if it's a problem.
In some cases, we could use a little bit more help from the subsystem. Network for example, co...
Well, that way you'd have to teach applications about suspend... Which
is quite bad. You mentioned it -- returning random errors will be
very bad for machines like OLPC that want to suspend
automatically. Plus it is a step back from current implementation, and
Kernel threads already send _themselves_ to the refrigerator. [Plus we
put all the userland there, which is what you don't like, but kernel
can not rely on userland after suspend starts, anyway, so it should
not hurt].
Anyway.. PPC currently suspends without freezer, which puts rules on
drivers. ("Must handle i/o requests after .suspend() method has run,
must not use GFP_KERNEL to do so, must not try to synchronously
communicate with userspace before _all_ devices are unfrozen") I am
not certain what the exact rules are, but you seem to know them. Could
we get Doc*/power/suspend_wo_freezer.txt describing them for driver
authors? That way we can make sure drivers work on ppc, too, and maybe
This is how it works currently in -mm.
(Plus, the rule is that threads that decide _not to_ stop themselves
should not do any I/O.)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
I'll make this reply short by agreeing up front with most of what you say.
That's what USB does as well (for the drivers which have runtime PM
We can try falling back on this approach for now. If the drivers are smart enough to fail cleanly when the device is already suspended, it should work.
But I'm not sure it's a good idea in the long run. Think of a printer daemon, for example. It shouldn't have to experience unexpected I/O
This will be up to the people responsible for the subsystems. I can
Yes. Rafael, how close is your new notifier chain to mainline? Can it at least be added to Greg KH's development tree so that I can start
That's what I had in mind. Rafael, can we add an "icebox" routine? Like Ben says, it doesn't need to be much more than a waitqueue that the current task puts itself on if a suspend is in progress. Callers arriving at a time when the icebox isn't activated should simply return without blocking. Basically the icebox should be active
Here's a wacky idea which just might work: In order to prevent binding and unbinding, while suspending devices all the PM core has to do is avoid dropping the device semaphores! It can release the semaphores as it resumes the devices.
Of course, for this to work it's necessary to avoid changes to the device list during the suspend. However I believe the iteration can be made safe against unregistration, so we only have to prevent device registration. (And anyway, it won't be possible to unregister a device while the PM core is holding its semaphore.)
If we are willing to be somewhat non-transparent, this is easy to accomplish. After the notifier chain has been alerted about the upcoming suspend, we tell the driver core to disallow adding new devices. Maybe use SRCU to synchronize with registration calls that are in progress. Thus, until the suspend is over device_add() will immediately return an error. We could even add a new ESUSPENDING code to errno.h; it would come in han...
It's in -mm. The patches queued up in -mm are also in my patchset at Hmm, I think this would be functionally equivalent to a modified freezer with the following changes: (1) the actual freezing functionality is not hooked up to the signal code, but instead there are many invocations of try_to_freeze() all over the place, (2) the freezer doesn't wait for tasks to freeze, only sets their TIF_FREEZE flags and exits, where TIF_FREEZE now means "there's a suspend in progress, please freeze yourself if need be". BTW, this is similar to what I'd like to have for kernel threads RSN, except for the "doesn't wait" part. IMO, something like this might be used for hibernation too, but in that case we need to do something about the tasks that haven't gone to the icebox. I have some implementation-related questions: * do we want each task to be notified individually or do we want a global variable, say "pm_entering_sleep_state", that will mean "now you're supposed to go to the icebox" * at what point exactly this should be activated (ie. before the freezer or after it) * how do we release tasks from the icebox (all at a time or individually, and I think we should do something like that in the beginning and probably try to Agreed. Greetings, Rafael -- "Premature optimization is the root of all evil." - Donald Knuth -
Why not ? Printer is offline when machine is asleep... trying to print errors out, I don't see the problem there. At one point, we'll need a cleaner way to also notify userland in which case our daemon could become more intelligent and stop servicing things before sleep and
USB is not that much of a problem in the sense that for most "leaf" drivers, USB is a provider (ie, the bus they sit on), not the client (like the network stack is to network drivers). In most cases, that "helper" thing would sit on the client subsystem, since it's the one feeding drivers with requests. The main ones I see at hand are block, alsa, net, fb/drm... Some of them already have
There is still the race of:
	drivers_sysfs_write()
		try_to_icebox()
			<---- <sleep request gets here>
		hit hardware
Those are akin, in some ways, to the freezer races. Some kind of RCU might take care of them if we enable the icebox, then wait for all tasks to hit an explicit schedule point once (or return to userland). That would mean that drivers need to try_to_icebox() again if they do something that may schedule (such as __get_user). So it's not a magic
True. Also, bus drivers could just flag the port with something saying "try registering again later". Don't underestimate the power of "try
I think having a facility for a given workqueue entry to requeue itself
Ben.
-
It depends on what kind of race you're worried about. In general a
sysfs access that causes I/O isn't much different from a char device's
read or write. For such cases the driver would have to do something
like this:
	drivers_sysfs_write()
	{
	restart:
		mutex_lock(private_io_mutex);
		if (device is suspended) {
			mutex_unlock(private_io_mutex);
			icebox();
			goto restart;
		}
		... hit hardware ...
		mutex_unlock(private_io_mutex);
	}
And of course the suspend method would have to acquire the
private_io_mutex. Some drivers might be able to use the device
semaphore instead of adding a new mutex.
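[Editor's sketch] The retry loop above can be tried out in user space. What follows is only a rough model, assuming pthread mutexes and a condition variable stand in for the kernel's mutex and waitqueue. The names pm_begin_suspend(), pm_end_suspend(), driver_set_suspended() and hw_writes are hypothetical and do not come from the thread or from any kernel API; icebox() is the routine proposed in the discussion, not an existing function.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a PM-core "icebox". */
static pthread_mutex_t icebox_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t icebox_cond = PTHREAD_COND_INITIALIZER;
static bool suspend_in_progress;

/* Return at once if no suspend is in progress; otherwise sleep until it ends. */
void icebox(void)
{
	pthread_mutex_lock(&icebox_lock);
	while (suspend_in_progress)
		pthread_cond_wait(&icebox_cond, &icebox_lock);
	pthread_mutex_unlock(&icebox_lock);
}

/* Assumed PM-core hooks: activate the icebox before devices are suspended... */
void pm_begin_suspend(void)
{
	pthread_mutex_lock(&icebox_lock);
	suspend_in_progress = true;
	pthread_mutex_unlock(&icebox_lock);
}

/* ...and release every waiter once all devices have been resumed. */
void pm_end_suspend(void)
{
	pthread_mutex_lock(&icebox_lock);
	suspend_in_progress = false;
	pthread_cond_broadcast(&icebox_cond);
	pthread_mutex_unlock(&icebox_lock);
}

/* Driver side: the private mutex protects both the flag and the "hardware". */
static pthread_mutex_t private_io_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool device_suspended;
int hw_writes;	/* counts simulated hardware accesses */

/* Models the driver's suspend/resume methods, which take the same mutex. */
void driver_set_suspended(bool suspended)
{
	pthread_mutex_lock(&private_io_mutex);
	device_suspended = suspended;
	pthread_mutex_unlock(&private_io_mutex);
}

void drivers_sysfs_write(void)
{
	for (;;) {
		pthread_mutex_lock(&private_io_mutex);
		if (!device_suspended)
			break;		/* mutex stays held for the I/O below */
		pthread_mutex_unlock(&private_io_mutex);
		icebox();		/* wait out the suspend, then recheck */
	}
	assert(!device_suspended);
	hw_writes++;			/* ... hit hardware ... */
	pthread_mutex_unlock(&private_io_mutex);
}
```

The sketch relies on the ordering discussed in the thread: the icebox is activated before any device is suspended and released only after every device has been resumed. With any other ordering the loop could spin while the device is suspended and the icebox is inactive.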
I'm more concerned about races involving device registration. The best
solution I can come up with is to use an rwsem:
	drivers_sysfs_write()
	{
		down_read(private_rwsem);
		... register/unregister devices ...
		up_read(private_rwsem);
	}
where the suspend notifier callout would do down_write() and the resume
callout would do up_write().
-ESUSPENDING would naturally imply that the caller should retry after
the resume.
Alan Stern
-
Not necessarily. The machine must survive going to sleep while you are printing.
Any other error return than -ERESTARTSYS is not an option. We can't simply change the ABI.
Regards
Oliver
-
...filesystems are offline, too, when the machine is asleep. Yet, unmounting everything on suspend would not result in useful suspend support. Yes, I believe we should be transparent. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
You just compared apple and oranges... Try printing and half way through the page, suspend your USB bus, and see how the printer reacts. Ben. -
