Hi again.
So - trying to get back to the original discussion - what (if anything)
do you see as the way ahead?

The options I can think of are (starting with things I can do):

1) I stop developing Suspend2, thereby pushing however many current
Suspend2 users to move to [u]swsusp and seek to get that up to speed.

2) I quit my day job, see if Redhat will take me full time and give me
the time to start trying to merge Suspend2 bit by bit. Alternatively,
days suddenly become 8 hours longer and I discover the boundless energy
and alertness needed to do this too :). Ok. Not going to happen.

3) Someone else steps up to the plate and tries to merge Suspend2 one
bit at a time.

4) uswsusp and swsusp get dropped and Suspend2 goes into mainline.

5) Everything gets dropped and we start from scratch.

6) The status quo - or some small variant of it - stays. Oh... I said
"way ahead". I guess that rules this one out, even though I'll be very
surprised if it's not the one that wins out.

7) Suspend2 gets merged and people get to choose which they like better.
Nearly forgot this as a conceivable possibility. Yeah, I know you said
you don't want it. I'm just trying to think of what might possibly
happen.

N.
Perhaps do it the EVMS way? Do as much in userspace as possible, and
try having a simple kernel API at the same time.

Perhaps (3) would be it, but ask Redhat _first_ before quitting anything :)

Jan
--
-
Hi.
I think putting suspend-to-disk code in userspace is just a broken idea.
Regards,
Nigel
So which bits do we want to merge? For example, Suspend2
kernel/power/ui.c, kernel/power/compression.c, and
kernel/power/encryption.c seem pointless now that we have uswsusp.
Furthermore, being the shameless Linus cheerleader that I am, I got
the impression that we should fix the snapshot/shutdown logic in the
kernel which Suspend2 doesn't really address?

Pekka
-
Hi.
I agree that the driver logic could be addressed too, but to answer your
question...

* Doing things in the right order? (Prepare the image, then do the
  atomic copy, then save).
* Multithreaded I/O (might as well use multiple cores to compress the
  image, now that we're hotplugging later).
* Support for > 1 swap device.
* Support for ordinary files.
* Full image option.
* Modular design?

Regards,
Nigel
I'd actually like to discuss this a bit..
I'm obviously not a huge fan of the whole user/kernel level split and
interfaces, but I actually do think that there is *one* split that makes
sense:

 - generate the (whole) snapshot image entirely inside the kernel

 - do nothing else (ie no IO at all), and just export it as a single image
   to user space (literally just mapping the pages into user space).
   *one* interface. None of the "pretty UI update" crap. Just a single
   system call:

	void *snapshot_system(u32 *size);

   which will map in the snapshot, return the mapped address and the size
   (and if you want to support snapshots > 4GB, be my guest, but I suspect
   you're actually *better* off just admitting that if you cannot shrink
   the snapshot to less than 32 bits, it's not worth doing)

User space gets a fully running system, with that one process having that
one image mapped into its address space. It can then compress/write/do
whatever to that snapshot.

You need one other system call, of course, which is

	int resume_snapshot(void *snapshot, u32 size);

and for testing, you should be able to basically do

	u32 size;
	void *buffer = snapshot_system(&size);
	if (buffer != MAP_FAILED)
		resume_snapshot(buffer, size);

and it should obviously work.
And btw, the device model changes are a big part of this. Because I don't
think it's even remotely debuggable with the full suspend/resume of the
devices being part of generating the image! That freeze/snapshot/unfreeze
sequence is likely a lot more debuggable, if only because freeze/unfreeze
is actually a no-op for most devices, and snapshotting is trivial too.

Once you have that snapshot image in user space you can do anything you
want. And again: you'd have a fully working system: not any degradation
*at*all*. If you're in X, then X will continue running etc even after the
snapshotting, although obviously the snapshotting will have tried to page
a lot of stuff out in order to...
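
A minimal sketch of what a user-space consumer of that interface might look like. Neither snapshot_system() nor resume_snapshot() exists in any kernel, so the version below is a stand-in that maps anonymous memory and fills it with a pattern; uint32_t replaces the kernel's u32, and save_snapshot() and snapshot_to_tmpfile() are hypothetical helper names used only for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Stand-in for the proposed snapshot_system() call (NOT a real syscall):
 * maps anonymous memory and fills it with a recognisable pattern so the
 * consumer code below has something to work on. */
void *snapshot_system(uint32_t *size)
{
	const uint32_t fake_size = 4 * 4096;
	void *p = mmap(NULL, fake_size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return MAP_FAILED;
	memset(p, 0xA5, fake_size);
	*size = fake_size;
	return p;
}

/* Hypothetical helper: write a size header followed by the image bytes.
 * Returns 0 on success, -1 on any short write. */
int save_snapshot(FILE *out, const void *image, uint32_t size)
{
	assert(out != NULL && image != NULL);
	if (fwrite(&size, sizeof(size), 1, out) != 1)
		return -1;
	if (fwrite(image, 1, size, out) != size)
		return -1;
	return fflush(out) == 0 ? 0 : -1;
}

/* Take the (fake) snapshot and store it in a temporary file.
 * Returns the number of image bytes written, or -1 on failure. */
long snapshot_to_tmpfile(void)
{
	uint32_t size = 0;
	void *image = snapshot_system(&size);
	FILE *out;
	long written;

	if (image == MAP_FAILED)
		return -1;
	out = tmpfile();
	if (!out) {
		munmap(image, size);
		return -1;
	}
	if (save_snapshot(out, image, size) != 0)
		written = -1;
	else
		written = ftell(out) - (long)sizeof(size);
	fclose(out);
	munmap(image, size);
	return written;
}
```

The point of the sketch is only that, once the image is an ordinary mapping, compressing or writing it is ordinary user-space code; everything hard stays behind the two calls.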
In fact... If you're just paging out to make a smaller snapshot (ie, not
to free up memory), couldn't you just swap it out (if it's not backed by a
file) then mark it as "half-released"... ie, the snapshot writing code
ignores it knowing that it will be available on disk at resume, but then
when the snapshot is complete it's still available in physical RAM,
preventing user-space from crawling due to the necessity of paging it all
back in?

Thanks,
Chase
-
your swap space may end up being re-used before you restore with std
David Lang
-
Side note: the exception, of course, is page out more. The swap device has
to be read-only.

We actually have support for that mode (it's how "swapoff" works: it marks
swap devices as not accepting _new_ entries, even though old entries are
still valid). So you can have a fully running system, with 99% of memory
swapped out, and still guarantee that you won't swap out anything *more*
(which would destroy the swap image, which you don't want, since it's
where a lot of the memory may end up being, in order to make the snapshot
itself as small as possible)!

Anybody who cares can look at the code that messes with the
SWP_WRITEOK flag. You'd basically swap out enough to make the snapshot
image fit comfortably in memory, and then you'd clear SWP_WRITEOK on all
swap devices and return to user space. Or something very close to that.

But the point here is that we should actually really be able to have a
fully working system, even _after_ we created the snapshot. I don't even
think you should need any "initrd only" kind of situation.

If somebody can do that, with just those two system calls, I'll remove
every other suspend-to-disk wannabe from the kernel in a heartbeat. I may
have missed something subtle, of course, but I really *think* it should be
doable.

Linus
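
A toy model of the read-only swap idea, as a sketch only. The flag names are modelled on the kernel's SWP_USED and SWP_WRITEOK but everything here (the toy_ prefix, the fixed slot table, the function names) is an assumption made for illustration and is not the kernel's swap code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flags, loosely named after SWP_USED / SWP_WRITEOK. */
enum { TOY_SWP_USED = 1 << 0, TOY_SWP_WRITEOK = 1 << 1 };

#define TOY_SLOTS 8

struct toy_swap_dev {
	unsigned flags;
	bool slot_used[TOY_SLOTS];
};

/* Allocate a new swap slot. Refused when WRITEOK is clear, which is the
 * "no _new_ entries" behaviour described above. Returns slot or -1. */
int toy_swap_alloc(struct toy_swap_dev *dev)
{
	assert(dev != NULL);
	if (!(dev->flags & TOY_SWP_WRITEOK))
		return -1;
	for (int i = 0; i < TOY_SLOTS; i++) {
		if (!dev->slot_used[i]) {
			dev->slot_used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Old entries stay valid whether or not the device accepts new ones. */
bool toy_swap_readable(const struct toy_swap_dev *dev, int slot)
{
	return slot >= 0 && slot < TOY_SLOTS &&
	       (dev->flags & TOY_SWP_USED) && dev->slot_used[slot];
}

/* What the snapshot path would do just before returning to user space. */
void toy_swap_make_readonly(struct toy_swap_dev *dev)
{
	dev->flags &= ~(unsigned)TOY_SWP_WRITEOK;
}
```

After toy_swap_make_readonly() the already-swapped pages can still be faulted back in, but nothing new can land on the device and overwrite what the image depends on.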
-
I think this is very similar to current uswsusp design; except that we
are using read on /dev/snapshot to read the snapshot (not memory
mapping) and that we freeze the system (because I do not think killall
_SIGSTOP is enough).

Can you confirm that it is indeed similar design, or tell me why I'm
wrong? You had some pretty strong words for uswsusp before, so I'd
like to understand your position here. ("Ouch, I do not know, I am out
of time" is still better reply than silence.)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
Agreed.
Greetings,
Rafael
-
remember, this is being done inside the kernel. the kernel can do things like
saving off the scheduler queue to prevent any userspace from running during the
snapshot, it could then move selected pids over to a new queue to selectively
'unfreeze' whatever you need (like the X processes for example) and then proceed
normally (allowing processes to be spawned, forked, etc without activating the
rest of userspace because the rest just won't be available to be scheduled) and
userspace can tell the kernel the list of pids to unfreeze so the kernel doesn't
need to try and guess.

David Lang
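
One way to picture that proposal is the toy model below. It is only a sketch under loud assumptions: the scheduler is reduced to an array of tasks with a runnable flag, and toy_freeze_all() / toy_unfreeze() are invented names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define TOY_MAX_TASKS 16

struct toy_task {
	pid_t pid;
	bool runnable;
};

struct toy_sched {
	struct toy_task tasks[TOY_MAX_TASKS];
	size_t count;
};

/* "Save off the scheduler queue": nothing in user space may run. */
void toy_freeze_all(struct toy_sched *s)
{
	assert(s != NULL && s->count <= TOY_MAX_TASKS);
	for (size_t i = 0; i < s->count; i++)
		s->tasks[i].runnable = false;
}

/* Move the pids named by user space back to a runnable state.
 * Unknown pids are ignored. Returns how many tasks were thawed. */
size_t toy_unfreeze(struct toy_sched *s, const pid_t *pids, size_t npids)
{
	size_t thawed = 0;

	for (size_t i = 0; i < s->count; i++) {
		for (size_t j = 0; j < npids; j++) {
			if (s->tasks[i].pid == pids[j] && !s->tasks[i].runnable) {
				s->tasks[i].runnable = true;
				thawed++;
			}
		}
	}
	return thawed;
}

/* Count the tasks the scheduler would currently be allowed to pick. */
size_t toy_runnable_count(const struct toy_sched *s)
{
	size_t n = 0;

	for (size_t i = 0; i < s->count; i++)
		if (s->tasks[i].runnable)
			n++;
	return n;
}
```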
-
Yep, we "freeze too much", so we can't just use the shell and pipe
it. Too bad.

	int write_image(char *resume_dev_name)
	{
		static struct swap_map_handle handle;
		struct swsusp_info *header;
		unsigned long start;
		int fd;
		int error;

		fd = open(resume_dev_name, O_RDWR | O_SYNC);
		if (fd < 0) {
			printf("suspend: Could not open resume device\n");
			return error;
		}
		error = read(dev, buffer, PAGE_SIZE);
		if (error < PAGE_SIZE)
			return error < 0 ? error : -EFAULT;
		header = (struct swsusp_info *)buffer;
		if (!enough_swap(header->pages)) {
			printf("suspend: Not enough free swap\n");
			return -ENOSPC;
		}
		error = init_swap_writer(&handle, fd);
		if (!error) {
			start = handle.cur_swap;
			error = swap_write_page(&handle, header);
		}
		if (!error)
			error = save_image(&handle, header->pages - 1);
		if (!error) {
			flush_swap_writer(&handle);
			printf( "S" );
			error = mark_swap(fd, start);
			printf( "|\n" );
		}
		fsync(fd);
		close(fd);
		return error;
	}

This is basically the loop above, made complex by the fact that we do
not want to have separate partition for snapshot; we just want to
reuse free space in swap partition.

I think you've just invented uswsusp.
Pavel
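
For comparison, the simple case (image read from one descriptor and written straight to another, with no swap-map bookkeeping) might be sketched as below. This is not the uswsusp source: copy_image(), write_full() and IMG_PAGE_SIZE are names assumed for the example, and the descriptors could be anything readable and writable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define IMG_PAGE_SIZE 4096

/* Write exactly len bytes, retrying on short writes and EINTR.
 * Returns 0 on success or a negative errno value. */
int write_full(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf != NULL);
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Copy everything from in_fd to out_fd in page-sized reads.
 * Returns the number of chunks copied, or a negative errno value. */
long copy_image(int in_fd, int out_fd)
{
	char page[IMG_PAGE_SIZE];
	long chunks = 0;

	for (;;) {
		ssize_t n = read(in_fd, page, sizeof(page));
		int err;

		if (n == 0)
			break;		/* end of image */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		err = write_full(out_fd, page, (size_t)n);
		if (err)
			return err;
		chunks++;
	}
	return chunks;
}
```

Everything write_image() adds on top of a loop like this comes from reusing free swap space instead of a dedicated partition.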
-
with the size of drives today is it really that bad to require a separate
partition for this?

I also don't like the idea of storing this in the swap partition for a couple of
reasons.

1. on many modern linux systems the swap partition is not large enough.
   for example, on my boxes with 16G of ram I only allocate 2G of swap space

2. it's too easy for other things to stomp on your swap partition.
   for example: booting from a live CD that finds and uses swap partitions

if you are needing space for your freeze, allocate it in an unambiguous way, not
by re-using an existing partition.

David Lang
-
WTF? So allocate larger swap partition. You just told me disks are big

That's a feature. If you are booting from live CD, you _want_ to erase

Of course you have that option. Writing image is done in userspace, so
you are free to write it to raw partition (and first versions indeed
did that).
Pavel
-
swap partitions are limited to 2G (or at least they were a couple of months ago
when I last checked). I also don't want to run the risk of having a box try to

why?

it's been stated that doing a std and booting another OS (including windows) is
a valid and common usage. saying that if you boot another OS you trash your
suspended image doesn't sound reasonable.

David Lang
-
They aren't limited anymore, I have a number of machines with 20G swap
for experiments.

OG.
-
If you hibernate your machine, boot from live cd, and change anything
on any filesystem, you are pretty likely to lose that filesystem.

Doing that with Windows is okay as Windows does not usually write to
ext3 partitions.
Pavel
-
booting from a live CD doesn't mean that you are going to mount the filesystem,
let alone change it. but swap is not supposed to be this sensitive.
-
This is basically how uswsusp is designed. (We do not use system call,
you just read from /dev/snapshot, and you have to make few ioctls to

Well... We decided not to do this in the fully working system. SIGSTOP
is just not strong enough, and we want the snapshot atomic.

Now, it would be _very_ nice to be able to snapshot system and
continue running, but I just don't see how to do it without extensive
filesystem support.

Pavel
-
So what kind of support do we need from the filesystem?
Pekka
-
"forcedremount ro, not telling anyone, not killing processes" would do
the trick. FS snapshots might do.
Pavel
-
Hi.
It sounds to me more like Pekka is thinking of checkpointing support. If
that's the case, then remounting filesystems isn't going to be an
option. You want to freeze them for just long enough so that you can
determine what needs saving in the checkpoint. You certainly don't want
to make rw file handles ro and so on.

Nigel
Hi.
That inherently limits the image to half of available ram (you need
somewhere to store the snapshot), so you won't get the full image you

You're describing uswsusp! (At least in so far as I understand it!).

You can't get a fully running system though, because if anything changes
something on disk that was snapshotted (super blocks etc) your snapshot

Please, go apply that logic elsewhere, then cut out (or at least stop
adding) support for users with less common needs in other areas. I fully
acknowledge that most users have only one place to store their image and
it's a swap device. But that doesn't mean one size fits all.

A full image implies that you need to figure out what's not going to
change while you're writing it and save that separately. At the moment,
I'm treating most of the LRU contents as that list. If we're going to
start trying to let every man and his dog run while we're trying to
snapshot the system, that's not going to work anymore - or the logic
will get a lot more complicated.

Sorry. I never thought I'd say this, but I think you're being naive
about how simple the process of snapshotting a system is.

Regards,
Nigel
It doesn't. We can make the userspace mapped pages copy-on-write. As long
as the userspace makes sure there's not much activity during
snapshot/shutdown, we will be fine. What we probably do need to copy is
kernel pages.

Pekka
-
The user space is (and IMHO should be) frozen way before that and what you're
suggesting here is what I wanted to implement some time ago. The problem with
this was that the user space pages may be updated, for example, by device
drivers as a result of some deferred I/O after we've snapshotted the system.

I didn't know how to find out which pages owned by the user space could be
updated this way, so I gave up at that time.

Greetings,
Rafael
-
Hi.
COW is a possibility, but I understood (perhaps wrongly) that Linus was
thinking of a single syscall or such like to prepare the snapshot. If
you're going to start doing things like this, won't that mean you'd then
have to update/redo the snapshot or somehow nullify the effect of
anything the programs do so that doing it again after the snapshot is
restored doesn't cause problems?

I was going to leave it at that and press send, but perhaps that
wouldn't be wise. I feel I should also ask what you're thinking of as a
means of making sure userspace doesn't do much activity.

Thanks for your labours!
Regards,
Nigel
No. The snapshot is just that. A snapshot in time. From kernel point of
view, it doesn't matter one bit when you did it or if the state has
changed before you resume. It's up to userspace to make sure the user
doesn't do real work while the snapshot is being written to disk and
machine is shut down.

When the snapshot pages are COW, we will run out of memory if userspace
writes to those pages too much. If userspace is blocked, say like
displaying a "we are suspending" in X which blocks the user from using
other programs that could generate new writes and mounting filesystems
read-only, we don't need to worry about running out of memory.

Pekka
-
Btw, obviously we need to break the COW when resuming and not include the
snapshot mapping. However, that should be trivially doable by snapshotting
the page mappings before remapping them as COW.

Pekka
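
The copy-on-write behaviour being relied on can be seen at process scale with fork(), which is used here purely as an analogy and is not how a kernel-level snapshot would be built. In this sketch the child plays the role of the snapshot and the parent plays the still-running system; cow_snapshot_demo() is a made-up name.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the value the "snapshot" (child) still sees after the parent
 * has overwritten its own copy, or -1 on any error. */
int cow_snapshot_demo(void)
{
	int go[2], result[2];
	volatile int state = 1;		/* "system state" at snapshot time */
	int seen = -1;
	int status;
	char c = 'g';
	pid_t pid;

	if (pipe(go) < 0)
		return -1;
	if (pipe(result) < 0) {
		close(go[0]);
		close(go[1]);
		return -1;
	}

	pid = fork();			/* child shares our pages copy-on-write */
	if (pid < 0) {
		close(go[0]); close(go[1]);
		close(result[0]); close(result[1]);
		return -1;
	}
	if (pid == 0) {
		int snap;

		close(go[1]);
		close(result[0]);
		/* Wait until the parent has dirtied its copy of the page. */
		if (read(go[0], &c, 1) != 1)
			_exit(1);
		snap = state;		/* still the value from fork() time */
		if (write(result[1], &snap, sizeof(snap)) != (ssize_t)sizeof(snap))
			_exit(1);
		_exit(0);
	}

	close(go[0]);
	close(result[1]);
	state = 2;			/* the running system keeps changing */
	if (write(go[1], &c, 1) == 1 &&
	    read(result[0], &seen, sizeof(seen)) != (ssize_t)sizeof(seen))
		seen = -1;
	close(go[1]);
	close(result[0]);
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
	    WEXITSTATUS(status) != 0)
		return -1;
	return seen;
}
```

Each page the parent writes to costs one extra physical page while the child is alive, which is the out-of-memory concern raised above if user space keeps writing.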
-
Why do you think that keeping the user space frozen after 'snapshot' is a bad
idea? I think that solves many of the problems you're discussing.

Greetings,
Rafael
-
It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do

	gdb -p <snapshotter>

when something goes wrong?) but we also *depend* on user space for various
things (the same way we depend on kernel threads, and why it has been such
a total disaster to try to freeze the kernel threads too!). For example,
if you want to do graphical stuff, just using X would be quite nice,
wouldn't it?

But I do agree that doing everything in the kernel is likely to just be a
hell of a lot simpler for everybody.

Linus
-
Yeah, or gdb vmlinux snapshot
Then you could use kexec for resume...
J
-
While that would certainly be nifty, I think we're arguably starting
from the wrong point here. Why are we booting a kernel, trying to poke
the hardware back into some sort of mock-quiescent state, freeing memory
and then (finally) overwriting the entire contents of RAM rather than
just doing all of this from the bootloader? Given the time spent in
kernel setup and unpacking initramfs nowadays, I'm willing to bet it'd
still be faster even if you're stuck using int 13 on x86.

http://apcmag.com/5873/page14 suggests that Intel is looking into this,
but I haven't heard anything more yet. To the best of my knowledge, this
is also how Windows manages things.
--
Matthew Garrett | mjg59@srcf.ucam.org
-
Sure, you could make suspend generate a complete bootable kernel image
containing all RAM. Doesn't sound too hard to me. You know, from over
here on the sidelines.

J
-
Doing it from the bootloader sounds attractive... but it is lot of
work. I'm essentially using linux as a bootloader.

Ah, so we have a volunteer :-).
Pavel
-
Well. We actually have first class support for using linux as a
bootloader. So you could use linux and do whatever dance you are
doing from a bootloader if you felt the desire.

That might make the dance a little easier.
Eric
-
I think you're right.
Greetings,
Rafael
-
Yes, it would, but as long as we can't protect mounted filesystems from being
Greetings,
Rafael
-
And can you name a _single_ advantage of doing so?
It so happens, that most people wouldn't notice or care that kmirrord got
frozen (kernel thread picked at random - it might be one of the threads
that has gotten special-cased to not do that), but I have yet to hear a
single coherent explanation for why it's actually a good idea in the first
place.

And it has added totally idiotic code to every single kernel thread main
loop. For _no_ reason, except that the concept was broken, and needed more
breakage to just make it work.

Linus
-
Yes. We have a lot less interdependencies to worry about during the whole
Well, I don't know if that's a 'coherent' explanation from your point of view
(probably not), but I'll try nevertheless:
1) if the kernel threads are frozen, we know that they don't hold any locks
that could interfere with the freezing of device drivers,
2) if they are frozen, we know, for example, that they won't call user mode
helpers or do similar things,
3) if they are frozen, we know that they won't submit I/O to disks and
potentially damage filesystems (suspend2 has much more problems with that
than swsusp, but still. And yes, there have been bug reports related to it,
so it's not just my fantasy).

It is actually useful for some things other than the hibernation/suspend, the
code is not idiotic (it's one line of code in the majority of cases) and you
should take that "I hate everything even remotely related to hibernation" hat
off, really.

Greetings,
Rafael
-
this won't matter unless the user mode helpers are going to do I/O or other
if you have the filesystems checkpointed then I/O after the freeze won't matter
as you just revert to the checkpoint (and since this is going to be thrown away
it can stay in ram)

if we are willing to make a break with the past to implement the new snapshot
capability, we should be able to use the LVM snapshot code to handle the
filesystem
-
In that case, I would agree. Currently, however, we're not even close to this
point.

The checkpointing of filesystems would be a very welcome feature, but there's

Yes, we can do that, in principle, and screw all of the current users in the
process. And finally we'd end up with something similar to what is done now,
IMHO.

And no, the things are not just totally broken, as it may follow from these
discussions. The problem is that the people who are discussing them so
viciously have never tried to write anything like the hibernation code.

This is as though I were discussing the design of the CPU schedulers,
although I only know how they work on a general level.

Actually, the really problematic thing with the hibernation _right_ _now_ is
what Linus is so concerned about (and rightfully so) - that we use the
same device drivers' callbacks for the hibernation and suspend (aka s2ram).
The other things work quite well and are really robust.

Greetings,
Rafael
-
if accessing a file on a read-only filesystem changes that filesystem it's a bug
see the recent thread about ext3 journal replays when mounting read-only as an

however, the result may be a lot less 'special case power management' code and a
lot more re-use of code that's in place for other uses.

if work on the current versions was stopped (other than trying to avoid
regressions) and a new version (with new userspace tools) was built in a way
that satisfies everyone the old version could be phased out in a year or two

if simply splitting the functions cleans everything up enough to satisfy
everyone then we're almost done right? ;-)

however I think that there are other fundamental disagreements here, and neither
the 'do absolutely everything in the kernel' or the 'do almost nothing in the
kernel' approaches are going to fly in the long run. I think the
userspace<->kernel interface is going to be different than either approach is
doing now, and as such it's an opportunity to make more drastic changes if they
are appropriate.

for example, why should we have LVM snapshot code and hibernate
snapshot/filesystem checkpoint code instead of just using the LVM code (which
gets exercised and tested far more than the other code ever would be)? saying
that if you want to suspend to disk you need to use LVM is a change, but it's
a change that people could probably live with.

David Lang
-
Oh well. Is this really wrong to protect users from such bugs, if we can do

Practically, yes. Theoretically, there's no software you can't improve

Well, that's a theory. Probably a good one, but still. :-)

The positive aspect of all this is that people have started to pay attention to
what we're doing, and gradually they will learn about the problems that they're
just not seeing right now.

Greetings,
Rafael
-
That's not an advantage. That's why it has *sucked*.

Trying to freeze kernel threads has _caused_ problems. It has _added_
these interdependencies. It hasn't removed a single dependency at any

NONE of these are valid explanations at all. You're listing totally
theoretical problems, and ignoring all the _real_ problems that trying to
freeze kernel threads has _caused_.

If you want to control user-mode helpers, you do that - you do not freeze
kernel threads!

And no, kernel threads do not submit IO to disks on their own. You just
made that up. Yes, they can be involved in that whole disk submission
thing, but in a good way - they can be required in order to make disk
writing work!

The problem that suspend has had is that it's done everything totally the
wrong way around. Do kernel threads do disk IO? Sure, if asked to do so.
For example, kernel threads can be involved in md etc, but that's a *good*
thing. The way to shut them up is not to freeze the threads, but to freeze
the *disk*.

Linus
-
xfs problem was real. And I do not see that many problems caused by
freezing kernel threads: at least you get deadlocks, not silent fs

Yep, so we have md doing io while we are doing atomic copy. That
probably means it will continue when atomic copy is done... getting
image out of sync with disk.

Well, if freezing the disk was available, I'd gladly do it. Is there
easy way to implement that?
Pavel
-
Actually, the less things happen while we're creating and saving the image,
the less sources of potential problems there are and by freezing the kernel
threads (not all of them), we cause less things to happen at that time.

To make you happy, we could stop doing that, but what actual _advantage_

Some of them can be, some others need not be. We don't need any fs-related

They can be asked before we do the snapshot and complete the operation

In principle, you're right. In practice, go and try it.

Anyway, why is it so important that _all_ of the kernel threads be running
while the snapshot is created and saved?
-
That makes no sense.

You have to create the snapshot image with interrupts disabled *anyway*.
I really don't see how you can say that stopping threads etc can make any
difference what-so-ever. If you don't create the snapshot with interrupts
disabled (and just with a single CPU running) you have so many other
problems that it's not even remotely funny.

So there's *by*definition* nothing at all that can happen while you

Like getting rid of all the magic "I don't want you to freeze me" crud?

Or getting rid of this horribly idiotic "three times widdershins" kind of
black magic mentality! It looks like the main reason for the process
freezing has nothing to do with technology, but some irrational fear of
other things happening at the same time, even though they CANNOT happen if
you do things even half-way sanely.

The "let's stop all kernel threads" is superstition. It's the same kind of
superstition that made people write "sync" three times before turning off
the power in the olden times. It's the kind of superstition that comes
from "we don't do things right, so let's be vewy vewy quiet and _pray_
that it works when we are being quiet".

That's bad.

It's doubly bad, because that idiocy has also infected s2ram. Again,
another thing that really makes no sense at all - and we do it not just

Like you wouldn't know. Look at commit b43376927a that you yourself are
credited with, just a month ago.

Then, do something as simple as

	git grep create_freezeable_workthread

and ponder the end results of that grep. If you don't see something wrong,

Who do you think you are kidding? See above.

And if you think that's an isolated example, look again. And start
grepping for PF_NOFREEZE, and other examples.

The fact is, there is not a *single* reason to freeze kernel threads. But
some rocket scientist decided to, and then screwed everybody else over.

Linus
-
Side note: while I think things should probably *work* even with user
processes going full bore while a snapshot is taken, I'll freely admit
that I'll follow that superstition far enough that I think it's probably a
good idea to try to quiesce the system to _some_ degree, and that stopping
user programs is a good idea. Partly because the whole memory shrinking
thing, and partly just because we should do the snapshot with hw IO queues
empty.

But I don't think it would necessarily be wrong (and in many ways it would
probably be *right*) to do that IO queue stopping at the queue level
rather than at a process level. Why stop processes just because you want
to clean out IO queues? They are two totally different things!

Linus
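
What "stopping at the queue level" could mean is sketched in the toy model below: submitters keep running, and a frozen queue simply parks their requests until it is thawed. This is not the block layer; struct toy_io_queue and the toy_ functions are invented for illustration, and a real implementation would also have to drain requests already in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_DEPTH 8

struct toy_io_queue {
	int pending[TOY_QUEUE_DEPTH];	/* request ids parked while frozen */
	size_t npending;
	size_t dispatched;		/* requests handed to the "hardware" */
	bool frozen;
};

/* Submit a request. It is dispatched at once unless the queue is frozen,
 * in which case it is parked. Returns false only if the park area is full. */
bool toy_submit(struct toy_io_queue *q, int id)
{
	assert(q != NULL);
	if (!q->frozen) {
		q->dispatched++;
		return true;
	}
	if (q->npending == TOY_QUEUE_DEPTH)
		return false;
	q->pending[q->npending++] = id;
	return true;
}

/* From here on nothing reaches the hardware, whoever submits it. */
void toy_freeze_queue(struct toy_io_queue *q)
{
	q->frozen = true;
}

/* Thaw the queue and dispatch everything that was parked.
 * Returns the number of requests released. */
size_t toy_thaw_queue(struct toy_io_queue *q)
{
	size_t released = q->npending;

	q->frozen = false;
	q->dispatched += released;
	q->npending = 0;
	return released;
}
```

The submitting thread never needs to know a snapshot is in progress, which is the separation between stopping processes and stopping IO being argued for.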
-
Actually, I'd like to stop I/O queues; if there was easy way to do
that, I'll happily switch. Notice that we'll need to stop 'I/O queues'
of the char devices, too...
Pavel
-
For creating the snapshot alone, it doesn't matter. Except that the restore is
cleaner a bit (we know exactly what all of these threads will be doing when
we restore the image and enable the IRQs after that).

Still, I think that kernel threads can potentially hold locks across the
freezing of devices and image creation and that is fishy. Also I believe,
although I'm not 100% sure, that some of them may cause problems to

Okay. Accidentally, I'm working on a freezer patch, so I'll probably drop
the freezing of kernel threads from swsusp in it and we'll see what happens.

This was a mistake, quite unrelated to the point you're making. And actually,
I was trying to fix a problem with two kernel threads that we thought might
submit I/O to disk after the image had been created. Otherwise I wouldn't

Well, if someone does something in a wrong way, that need not mean the
thing he was trying to do was wrong.

At least _that_ wasn't me. :-)
Greetings,
Rafael
-
In many ways, "at all".

I _do_ realize the IO request queue issues, and that we cannot actually do
s2ram with some devices in the middle of a DMA. So we want to be able to
avoid *that*, there's no question about that. And I suspect that stopping
user threads and then waiting for a sync is practically one of the easier
ways to do so.

So in practice, the "at all" may become a "why freeze kernel threads?" and
freezing user threads I don't find really objectionable.

But as Paul pointed out, Linux on the old powerpc Mac hardware was
actually rather famous for having working (and reliable) suspend long
before it worked even remotely reliably on PC's. And they didn't do even
that.

(They didn't have ACPI, and they had a much more limited set of devices,
but the whole process freezer is really about neither of those issues. The
wild and wacky PC hardware has its problems, but that's _one_ thing we

Did you actually _do_ the "grep" (with the fixed argument)?

I had two totally independent points. #1 was that you yourself have been
fixing bugs in this area. #2 was the result of that grep. It's absolutely
_empty_ except for the define to add that interface.

NOBODY USES IT!

Now, grep for the same interface that creates _non_freezeable workqueues.

Put another way:

	[torvalds@woody linux]$ git grep create_workqueue | wc -l
	35
	[torvalds@woody linux]$ git grep create_freezeable_workqueue | wc -l
	1

and that _one_ hit you get for the "freezeable" case is not actually a
user, it's the definition!

Ie my point is, nobody wants freezeable kernel threads. Absolutely nobody.
Yet we have all this support for freezing them (or rather, we freeze them
by default, and then we have all this support for _not_ doing that wrong
default thing!)

So yes, I think it would be interesting to just stop freezing kernel
threads. Totally.

Linus
-
there was a thread last week (or so) about splitting up the process list, one
list for normal user processes, one for kernel threads, and one for dead
processes waiting to be reaped.

it almost sounds like what you want to do is to act as if the normal user
threads weren't there for a short time (while you make the snapshot) and then
recover them to continue and save the snapshot.

David Lang
-
We freeze user space processes for the reasons that you have quoted above.
Why we freeze kernel threads in there too is a good question, but not for me to

The reason is pretty simple.
We wanted to drop that interface altogether, because it was broken (my fault),
but Oleg suggested that we keep it so that we could fix and use it in the

Okay, I'll do that.
Greetings,
Rafael
-
We do not want kernel threads running:
a) they may hold some locks and deadlock suspend
b) they may do some writes to disk, leading to corruption
We could solve a) by carefully auditing suspend lock usage to make
sure deadlocks are impossible even with kernel threads running.
Pavel
-
remember that we are doing suspend-to-disk, after we do the snapshot we will be
doing a shutdown. that should simplify the locking issues

David Lang
-
That's assuming that we won't need to cancel the hibernation.
Greetings,
Rafael
-
true, but if we cancel the hibernation then why are the locks an issue? they are
appropriate for the system state.

David Lang
-
You're really just making both of those up.

If a kernel thread holds a lock and deadlocks suspend, that would deadlock
anything else _too_. Suspend isn't *that* special. Everything it does are
things other people do too.

And no, kernel threads do not write to disk on their own. Name one. They
help *others* write to disk, but those disk writes need to happen.

The freezer has *caused* those deadlocks (eg by stopping threads that were
needed for the suspend writeouts to succeed!), not solved them.

So stop making these totally bogus arguments up.
Linus
-
I can't remember anything like this, but I believe you have a specific test
Well, they may be bogus, but there's something else.
I have reviewed some kernel threads used by device drivers that currently are
frozen to see if it would be safe not to freeze them, and I'm worried.
What, for example, if such a thread schedules a timeout and waits for
something to happen (eg. the airo driver does something like this), but instead
the hibernation/suspend happens and the device is frozen/suspended under it?
Shouldn't the thread be notified by the driver's freeze/suspend callback?
Moreover, what if after the restore the device is not present (for example, it
may be a pcmcia card that the user has removed) and the thread is scheduled
before the device's unfreeze callback has a chance to run? Shouldn't the
thread check that the device is present? In that case it would have to be
notified by someone that the check is necessary, but who can do that?
Greetings,
Rafael
-
Ehh.. Why do you think we _have_ that PF_NOFREEZE thing in the first place?
Rafael, you really don't know what you're talking about, do you?
Just _look_ at them. It's the IO threads etc that shouldn't be frozen,
exactly *because* they do IO. You claim that kernel threads shouldn't do
IO, but that's the point: if you cannot do IO when snapshotting to disk,
here's a damn big clue for you: how do you think that snapshot is going to
get written?
I *guarantee* you that we've had a lot more problems with threads that
should *not* have been frozen than with those hypothetical threads that
you think should have been frozen.
Linus
-
Well, we had nasty corruption on XFS, caused by thread that was not
frozen and should be. (While the other case leads "only" to deadlocks,
so it is easier to debug.)
The locking point.. when I added freezing to swsusp, I knew very
little about kernel locking, so I "simply" decided to avoid the
problem altogether... using the freezer.
You may be right that locks are not a big problem for the hibernation
after all; I just do not know.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
Still, I think, if a kernel thread is a part of a device driver, then _in_
_principle_ it needs _some_ synchronization with the driver's suspend/freeze
and resume/thaw callbacks. For example, it's reasonable to assume that the
thread should be quiet between suspend/freeze and resume/thaw.
With the freezing of kernel threads we provide a simple means of such
synchronization: use try_to_freeze() in a suitable place of your kernel thread
and you're done. [Well, there should be a second part for making the thread
die if the thaw callback doesn't find the device, but that's in the works.]
Without it, there may be race conditions that we are not even aware of and that
may trigger in, say, 1 in 10 suspends or so and I wish you good luck with
debugging such things.
Greetings,
Rafael
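The synchronization Rafael describes, a thread that stops only at an explicit freeze point in its own loop, can be modeled in userspace. What follows is only a sketch in Python with the threading module, not kernel code. The class FreezableWorker and all of its method names are hypothetical, chosen for illustration, and the sketch makes no claim about how try_to_freeze() is implemented in the kernel.

```python
import threading


class FreezableWorker:
    """Userspace model of a thread that parks only at an explicit freeze point."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.done = []
        self._run_allowed = threading.Event()   # cleared means "freeze requested"
        self._run_allowed.set()
        self._parked = threading.Event()        # set while the worker is parked
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _freeze_point(self):
        # Stand-in for the freeze point: the only place the worker may stop.
        if not self._run_allowed.is_set():
            self._parked.set()
            self._run_allowed.wait()
            self._parked.clear()

    def _loop(self):
        for job in self.jobs:
            self._freeze_point()
            self.done.append(job)               # one unit of "device work"

    def start(self):
        self._thread.start()

    def request_freeze(self):
        self._run_allowed.clear()

    def wait_frozen(self, timeout=5.0):
        return self._parked.wait(timeout)

    def thaw(self):
        self._run_allowed.set()

    def join(self):
        self._thread.join()
```

If the freeze is requested before the worker starts, the worker parks before touching any job and finishes all of them only after thaw() is called. The point of the model is that the worker never stops in the middle of a unit of work.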
-
Well, I don't know why exactly it had been originally introduced. Currently,
it is used by the threads that should be running after the snapshot is done
OK, more precisely: fs-related threads should not try to process their queues,
etc., after the snapshot is done, because that may cause some fs data to be
written at that time and then the fs in question may be corrupted after the
restore. Not all of the I/O in general, fs data.
Still, that alone probably is not a good enough reason for freezing all kernel
Well, I'm not sure whether or not that still would have been the case if we had
stopped freezing kernel threads for the hibernation/suspend. I just see
potential problems that I've mentioned in the previous message and I don't see
any evidence that they cannot occur.
Greetings,
Rafael
-
But that's not true _either_. That's only true because right now I think
we cannot even suspend to a swapfile (I might be wrong).
If you have a swapfile on a filesystem, you'd need those fs queues
Did you miss the email where Paul pointed out that Mac/PowerPC didn't use
to do any of this? And apparently never had any issues with it? And
probably worked more reliably several years ago than suspend/hibernation
does _today_?
Ie we do have history of _not_ freezing things. The freezing came later,
and came with the subsystem that had more problems..
Linus
-
Still works pretty reliably; the last time my PowerBook G4 was
rebooted was 6 weeks ago. Once every 60 suspends or so the kernel
USB driver gets really confused and doesn't wake up the USB
controller properly, leading to dead keyboard/mouse, but other than
that I never have problems. I wouldn't be surprised if I could
comment out 90% of the "suspend" code and still have it work, the
hardware in it is incredibly robust. I can even swap batteries while
it's in suspend-to-RAM, as long as I do it in less than 45 sec or so;
I get around 6-7 days of suspend-to-RAM time on a full charge.
Cheers,
Kyle Moffett
-
No, I don't. It's done by bmapping the file and writing directly to the
underlying blockdev. Otherwise we'd have corrupted filesystems after the
restore.
I have no problems with the hibernation on my test boxes (six of them), except
for one network driver that doesn't bother to define a .suspend() callback.
There are problems with the suspend (s2ram), but they are _not_ related to the
freezing of kernel threads. Some of them are related to the other issue that
you have raised, which is that the same callbacks should not be used for the
suspend and hibernation, and which I think is absolutely valid. The remaining
ones are related to the fact that graphic card vendors don't care for us at
It doesn't have as many problems as you are trying to suggest. At present,
the only problems with it happen if someone tries to "improve" it in the way
I did with the workqueues.
Anyway, the freezing of tasks, including kernel threads, is one of the few
things on which Pavel, Nigel and I completely agree that they should be done,
so perhaps you could accept that?
Greetings,
Rafael
-
Actually, if we want to support OLPC _nicely_, we'll need to get rid
of freezer from suspend-to-RAM. Of course, that _will_ put more
pressure on the drivers -- and break a few of them...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
I think the removal of sys_sync() from freeze_processes() in the s2ram case
might help.
I'm really afraid of dropping the freezing of kernel threads from the
hibernation/suspend altogether before we know we won't break drivers, because
we can introduce some very subtle and difficult to debug problems this way.
Moreover, apart from speeding up the suspend slightly (kernel threads are
frozen very quickly) this won't buy us anything, since kprobes uses the freezer
and all of the infrastructure is needed anyway.
Greetings,
Rafael
-
Hi.
For Suspend2, and I think for swsusp too, we bmap the locations when
allocating the storage, and then submit our own bios. Even if swsusp
isn't using this method, I'm pretty sure the swap code does bmapping at
It also came because of problems. Not working perfectly isn't
necessarily a sign of a faulty reason for being added in the first
place.
I should also add, not freezing things is fine if you're happy with
getting half an image at most. If you want a full
just-as-if-I'd-never-turned-the-power-off image, you need freezing so
that you can have some pages which can be saved before others are
atomically copied, to ensure the whole image is consistent.
Nigel
Yes, we can, but for now it's not been done yet.
Greetings,
Rafael
-
<snip>
Apparently I *CANNOT* wrap my head around this - if just because my laptop,
running a vendor 2.6.17 kernel does s2ram perfectly, at least, it does when
using the "Upstart" init system rather than the classical SysV init system. I
have tried it with the classical init and the suspend isn't triggered by the
buttons that used to do it. I didn't try 'echo ram > /sys/power/state', but I
have a feeling that would have worked as well. I have problems with s2disk,
but that's because I keep my swap partition small - I try to keep it at or
around 256M when I have more than half a gig of Ram in a system. Perhaps one
of these days I'll grab a multi-gig flash disk, set it up as a swap partition
and try it again. (every time I've tried s2disk I wind up running out of disk
space - and this is with nothing but X running. Any kind of progress meter
for when the system is doing s2disk would be nice - every time I've tried it
all I see for the nearly 2 minutes before the s2disk attempt ends is a black
screen. I say 2 minutes because that's how long it takes for it to learn that
there isn't enough space on the swap-partition to save the image)
DRH
-
Just turn up console loglevel to see the messages.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
I agree. I don't like the freezer. We have had working
kernel-controlled suspend to RAM on powerbooks for almost 10 years
now, and we never needed to freeze processes.
That said, I can see two attractions in freezing processes:
1. It provides a way to stop new I/O requests coming in, and thus
somewhat makes up for the lack of a way to freeze device request
queues (at least, we didn't have one last time I looked).
2. Systems do sometimes die while suspended (e.g. run out of battery,
or the resume process fails), and to make the next boot painless,
you want the filesystems on disk to be as clean as possible.
Freezing processes and then doing a sync provides one way to
achieve that. Of course, you have to make sure you don't freeze
any kernel threads that are needed for doing the sync... And if
one of your filesystems is using FUSE, it's not going to get very
far.
Paul.
-
Hi.
A couple of other advantages to freezing other processes:
1) It makes predicting how much memory is available for making and
saving snapshot a tractable problem. It therefore makes hibernation
_much_ more reliable.
2) Racing against other processes would also make hibernation slower,
increasing the chances of your battery running out before the save is
complete.
3) It makes finding potential memory leaks in the code possible. It was
ages ago now, but at one stage I could display a table saying exactly
how many pages had been allocated and freed by different sections of the
process and compare the number of free pages at the start and end of the
I agree with Rafael. Freezing processes greatly helps in ensuring we
have a consistent image. He's right, too, in asserting that it's even
more important for Suspend2. Freezing processes is essential to being
able to know that those LRU pages won't change and therefore being able
I have had problems with MD threads generating I/O that I couldn't
account for - after userspace had been frozen, filesystems had been
nicely synced and so on. I have to speak with reservations though,
because I haven't yet gotten to the bottom of where the I/O is coming
Yeah, so long as we bmap the storage we want to use beforehand (thinking
I have to disagree here. Freezing the disk instead of the threads is
dealing with the symptoms instead of the cause.
Regards,
Nigel
nobody is suggesting that you leave processes running while you do the snapshot,
what is being proposed is
1. pause userspace (prevent scheduling)
2. make snapshot image of memory
3. make mounted filesystems read-only (possibly with snapshot/checkpoint)
4. unpause
5. save image (with full userspace available, including network)
6. shutdown system (throw away all userspace memory, no need to do graceful
shutdown or nice kill signals, revert filesystem to snapshot/checkpoint if
all that's needed for the snapshot is to prevent userspace from scheduling, and
prevent media from being written to in a permanent way (writing to a LVM volume
after invoking a snapshot doesn't count, just revert to the snapshot)
David Lang
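The six steps proposed above can be walked through with a toy model. This is only a sketch in Python under heavy simplification: ToySystem, hibernate() and the during_save callback are hypothetical names, memory is a dict, and "read-only" is a flag rather than a real remount. It shows the property the proposal relies on: whatever userspace does between steps 4 and 6 ends up neither in the image nor on disk.

```python
import copy


class ReadOnlyError(Exception):
    pass


class ToySystem:
    def __init__(self):
        self.memory = {"editor": "draft v1"}
        self.disk = {"/home/notes": "saved v1"}
        self.userspace_paused = False
        self.fs_read_only = False

    def user_write_memory(self, key, value):
        if self.userspace_paused:
            raise RuntimeError("userspace is paused")
        self.memory[key] = value

    def user_write_disk(self, path, data):
        if self.fs_read_only:
            raise ReadOnlyError(path)
        self.disk[path] = data


def hibernate(system, during_save=None):
    system.userspace_paused = True          # 1. pause userspace
    image = copy.deepcopy(system.memory)    # 2. snapshot memory
    system.fs_read_only = True              # 3. filesystems read-only
    system.userspace_paused = False         # 4. unpause
    if during_save is not None:
        during_save(system)                 # userspace runs while the image is saved
    saved = dict(image)                     # 5. "save" the image (stand-in for the write)
    system.memory.clear()                   # 6. shutdown: live userspace memory is discarded
    return saved
```

A callback that edits memory and attempts a disk write during step 5 leaves the saved image and the disk untouched.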
-
Strictly speaking, all you *really* want to make sure is not so much that
user-space isn't scheduling, as the fact that all device IO buffers must
be empty.
We can trivially snapshot an active user-space, and in fact it would
probably be hard to do a snapshot in a way that it could even *know* or
care about whether there are user-space processes running at the time of
the snapshot.
So that's not the real problem.
What we obviously *cannot* snapshot is if some particular device is in the
middle of being written to or read from, and has outstanding commands on
the device itself (as opposed to just queued to the driver). So what we do
want to make sure happens is that there are no IO queues that are active.
And the best way to make sure that there are no IO queues active is to
make sure that there are no new read or write-requests. And *that* you can
do two ways:
- actually intercepting the read/write requests. Probably not too hard,
  we could literally do it in the IO scheduler (and probably much more
  easily than doing it in the process scheduler), but the easy cases will
  only cover the block device layer, and character devices don't have the
  same kind of scheduler you can trap IO in.
- we also don't want to generate new data that needs to be snapshotted,
  so we want to trap people who write even just to the page cache and
  turn pages dirty. Again, we could probably do it at *that* point (ie
  trapping them when they try to dirty a page), and it would be more
  logical, but again, there are other cases of people who generate more
  data (just any memory allocation obviously is a special case of
  generating more data to be snapshotted),
so I do agree that we want to stop producing new data to be snapshotted,
and we want to stop producing new read-requests. But kernel threads really
do neither: in an idle system, kernel threads are idle too. A kernel
thread is not like a user program that actually generates data - they only...
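The first of the two ways Linus lists, intercepting new requests at the queue instead of stopping the threads that issue them, can be sketched as a gate in front of a request queue. This is a toy model in Python, not the block layer: GatedQueue and its methods are hypothetical names, and requests are plain strings.

```python
from collections import deque


class GatedQueue:
    """Toy request queue: once the gate closes, new requests are held, not issued."""

    def __init__(self):
        self.in_flight = deque()   # requests already handed to the "device"
        self.held = deque()        # requests trapped at the gate
        self.gate_closed = False

    def submit(self, request):
        if self.gate_closed:
            self.held.append(request)
        else:
            self.in_flight.append(request)

    def quiesce(self):
        """Close the gate, then let everything already issued complete."""
        self.gate_closed = True
        completed = []
        while self.in_flight:
            completed.append(self.in_flight.popleft())
        return completed

    def reopen(self):
        """Open the gate again and issue whatever was trapped meanwhile."""
        self.gate_closed = False
        self.in_flight.extend(self.held)
        self.held.clear()
```

After quiesce() the device has nothing outstanding, which is the condition named above as the one a snapshot really needs, and submitters are never stopped, only their requests.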
Including network? Your tcp peers will be really confused, then, if
you ACK packets then claim you did not get them. No, you do not want
to start network.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
anyone who is doing a hibernate or suspend who expects all the network
connections to be working afterwards is dreaming or smoking something.
this is just another way that the failure can show up.
in fact, I would say that it would probably be a nice thing to do for
intervening firewalls and external servers if a suspend closed all external TCP
connections rather than leaving them dangling (eating up resources until they
time out)
if your software can't tolerate the network connection going away on you it will
have problems in normal operation anyway, let alone when you suspend/hibernate
your machine.
David Lang
-
Yeah, for suspend-to-ram+resume and for snapshot+restore you probably
want userspace to support some kind of initscript-like mechanism
which is triggered by the lid-switch or something before calling into
the kernel. That way it can close network connections mostly-nicely
and down network interfaces before suspending, then re-run DHCP/
802.11/whatever configuration after resume/restore. That might not
be a bad place to handle NFS mounts and such too.
Cheers,
Kyle Moffett
-
Really? It works today... if the suspend is short enough. And that's
how it should be.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
If we get very good at Wake-on-Lan it should work for any length
of time.
Regards
Oliver
-
for suspend-to-ram this would work, I stand corrected.
for hibernate this would almost certainly not work, and I don't think that
it's worth raising false hopes.
David Lang
-
Check the facts. It used to work, and it should work today.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
I don't dispute that it sometimes works today.
what I dispute is that making it work should be a constraint on a cleaner
design that happens to cause tcp connections to fail on suspend-to-disk
(hibernate).
if you are doing suspend-to-disk for such a short period that TCP
connections are able to recover (typically <15 min for most firewalls, in
some cases <2 min for connections with keep-alive) is it really worth it?
and once you pass the timeframes where the connections are still alive
then it shouldn't matter, and in fact the server should gracefully close
the connections to be nice to other devices and servers on the network.
I dispute the idea that doing a suspend-to-disk and expecting that your
network connections will recover when you wake up is a sane expectation.
David Lang
-
People were using swsusp to move server from one room to another.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
We used it (with great success) to replace bad UPSs on single-PSU
database servers under (light) load. No need for scheduled downtime,
etc.
The whole point of hibernation (or suspend to disk, or whatever you
call it) is that the system goes to a zero-power state and then can be
brought back to its original state. Closing in-progress network
connections has nothing to do with pausing a machine any more than
setting IM clients to 'away' would, or locking an X session. That sort
of side-effect needs to be handled outside the core of "put state out
to disk and read it back".
-
And then you'll have people wonder why the server which sent out all
those files has no log entries. You'd have to selectively unfreeze user
space, which is a cure worse than the disease.
Simply throwing away user space work is a bug. And no, you cannot say that
it'll be redone anyway, as you are throwing away accepted input, too.
Regards
Oliver
-
It's not a bug, it's a feature =). While I totally agree with you that for
the common case, you probably do want to avoid work in the userspace after
taking the snapshot, it is something that should be solved separately.
There is absolutely nothing wrong with taking a snapshot, doing some work,
and then resuming to the snapshot and thus "losing" some of the work (this
is useful for debugging, for example).
Pekka
-
Hi.
It would be nice, yes.
But in doing so you make the contents of the disk inconsistent with the
state you've just snapshotted, leading to filesystem corruption. Even if
you modify filesystems to do checkpointing (which is what we're really
talking about), you still also have the problem that your snapshot has
to be stored somewhere before you write it to disk, so you also have to
either
1) write some known static memory to disk before the snapshot and reuse
it for the snapshot,
2) ensure up to half the RAM is free for your snapshot or
3) compress the snapshot as you take it, guessing beforehand how much
memory the compressed snapshot might take and freeing that might
4) reserve memory at boot time for the atomic copy so that 2) or 3) is
Indeed.
Nigel
Actually, it's a lot simpler than that. We can just combine the
device-mapper snapshot with a VM+kernel snapshot system call and be
almost done:
sys_snapshot(dev_t snapblockdev, int __user *snapshotfd);
When sys_snapshot is run, the kernel does:
1) Sequentially freeze mounted filesystems using blockdev freezing.
If it's an fs that doesn't support freezing then either fail or
force-remount-ro that fs and downgrade all its filedescriptors to RO.
Doesn't need extra locking since processes which try to do IO either
succeed before the freeze call returns for that blockdev or sleep on
the unfreeze of that blockdev. Filesystems are synchronized and made
clean.
2) Iterate over the userspace process list, freezing each process
and remapping all of its pages copy-on-write. Any device-specific
pages need to have state saved by that device.
3) All processes (except kernel threads) are now frozen.
4) Kernel should save internal state corresponding to current
userspace state. The kernel also swaps out excess pages to free up
enough RAM and prepares the snapshot file-descriptor with copies of
kernel memory and the original (pre-COW) mapped userspace pages.
5) Kernel substitutes filesystems for either a device-mapper
snapshot with snapblockdev as backing storage or union with tmpfs and
remounts the underlying filesystems as read-only.
6) Kernel unfreezes all userspace processes and returns the snapshot
FD to userspace (where it can be read from).
Then userspace can do whatever it wants. Any changes to filesystems
mounted at the time of snapshot will be discarded at shutdown.
Freshly mounted filesystems won't have the union or COW thing done,
and so you can write your snapshot to a compressed encrypted file on
a USB key if you want to, you just have to unmount it before the
snapshot() syscall and remount it right afterwards.
Cheers,
Kyle Moffett
-
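The sys_snapshot() flow Kyle Moffett outlines above is a proposal, not an existing syscall, so the following is only a toy model in Python of its visible behaviour. ToyFS, toy_snapshot() and toy_shutdown() are hypothetical names. Filesystems mounted at snapshot time get a discardable overlay, a filesystem mounted afterwards writes straight through, and the image holds memory as it was at snapshot time.

```python
class ToyFS:
    def __init__(self, files):
        self.base = dict(files)
        self.overlay = None          # None: writes go straight to the base

    def write(self, path, data):
        target = self.base if self.overlay is None else self.overlay
        target[path] = data

    def read(self, path):
        if self.overlay is not None and path in self.overlay:
            return self.overlay[path]
        return self.base[path]


def toy_snapshot(mounted, process_memory):
    # Steps 2-4: copy every process's pages as they are right now.
    image = {pid: dict(pages) for pid, pages in process_memory.items()}
    # Step 5: put a copy-on-write layer over each mounted filesystem.
    for fs in mounted:
        fs.overlay = {}
    # Step 6: hand the image back to (unfrozen) userspace.
    return image


def toy_shutdown(mounted):
    # Changes made after the snapshot are dropped with the overlay.
    for fs in mounted:
        fs.overlay = None
```

In this model a "USB key" is simply a ToyFS created after toy_snapshot() returns, so the image written to it survives toy_shutdown() while post-snapshot changes to the root filesystem do not.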
How mature is freezing filesystems -- will it work on at least ext2/3
and vfat?
What happens if you try to boot and filesystems are frozen from
previous run?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
I'm pretty sure it works on ext2/3 and xfs and possibly others, I
don't know either way about VFAT though. Essentially the "freeze"
part involves telling the filesystem to sync all data, flush the
journal, and mark the filesystem clean. The intent under dm/LVM was
to allow you to make snapshots without having to fsck the just-
If you're just doing a fresh boot then the filesystem is already
clean due to the dm freeze and so it mounts up normally. All you
need to do then is have a little startup script which purges the
saved image before you fsck or remount things read-write since either
case means the image is no longer safe to resume.
If the kernel is later modified to purge all filesystem data (dcache/
pagecache) during snapshot and effectively remount and reopen all the
files by path during restore then you could remove that requirement.
You'd just need to make sure that the restore-from-disk scripts did
an fsck or journal-restore before reloading the old kernel data.
Cheers,
Kyle Moffett
-
Wouldn't it be better if freeze wrote a freeze-ID to the fs and returned it?
This would naturally be kept in the image and a UUID mismatch would be
detectable - seems safer and more flexible than 'a script'.
"This isn't the freeze you're looking for, move along"
David
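The freeze-ID idea above amounts to a two-sided check at resume time. A minimal sketch in Python follows, with dicts standing in for the on-disk superblock and the saved image. The function names and the "freeze_id" field are hypothetical, and nothing here reflects an existing filesystem feature.

```python
import uuid


def freeze(superblock):
    """Stamp the filesystem with a fresh ID and return it for the image."""
    freeze_id = uuid.uuid4().hex
    superblock["freeze_id"] = freeze_id
    return freeze_id


def mount_read_write(superblock):
    """Any read-write mount invalidates the stamp."""
    superblock["freeze_id"] = None


def image_is_resumable(image, superblock):
    """Resume only if the image and the filesystem carry the same stamp."""
    stamp = image.get("freeze_id")
    return stamp is not None and stamp == superblock.get("freeze_id")
```

A stale image is then rejected by comparing two values, with no need for a startup script to remember to purge it.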
-
Possibly, but I was referring to the _current_ behavior of the
device-mapper freezing. While perhaps not ideal, it's currently very easily
usable.
Cheers,
Kyle Moffett
-
Okay, but how do we do the error recovery if, for example, the image cannot
This seems to be a good idea.
Greetings,
Rafael
-
it doesn't really need to matter. if you care, just arrange to not schedule user
give the user an error message telling him this, wait for confirmation, and then
jump directly to the restore step. revert everything to the snapshot image(s),
restart it.
David Lang
-
(1) can be done without extra locking. Device-mapper already has
code to freeze filesystems and that makes a natural process-stopping
point. Any threads doing IO will very quickly put themselves to
If the image can't be saved then there are 2 options:
(1) Call sys_restore() with the image
(2) Pass your snapshot file-descriptor to sys_unsnapshot()
In the former case, the system will be restored to the state it was
at a few seconds earlier, right as it took the snapshot. In the
latter case the modified-in-memory snapshot pages will be synced back
to the disk filesystems, the copy-on-write data-structures torn down
(think of merging an LVM snapshot back into its base device), and the
memory allocated for the snapshot will be freed. Either way the
system is properly in sync with disk again, the only difference is
whether you want to preserve the userspace state from during the
attempted snapshot (IE: any error status). You could also save the
error state in case (1) by just auto-posting a bug-report on
http://bugs.$VENDOR.com/ of course :-D.
Cheers,
Kyle Moffett
-
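The difference between the two recovery options Kyle Moffett describes above comes down to what happens to the copy-on-write layer. A toy model in Python: CowVolume and its methods are hypothetical names, and sys_restore() and sys_unsnapshot() are proposals from this thread, not existing calls.

```python
class CowVolume:
    """Base data plus an overlay holding everything written after the snapshot."""

    def __init__(self, files):
        self.base = dict(files)
        self.overlay = {}

    def write(self, path, data):
        self.overlay[path] = data

    def read(self, path):
        return self.overlay.get(path, self.base.get(path))

    def restore(self):
        # Option (1): go back to the snapshot; post-snapshot writes are dropped.
        self.overlay.clear()

    def unsnapshot(self):
        # Option (2): keep post-snapshot writes by merging them into the base.
        self.base.update(self.overlay)
        self.overlay.clear()
```

Either call leaves the overlay empty, so the volume is in sync with its base again. Only unsnapshot() preserves what was written during the failed attempt, such as an error log.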
And where is the benefit in that? How is such user space freezing logic
simpler than having the kernel do the write?
What can you do in user space if all filesystems are r/o that is worth the
hassle?
Regards
Oliver
-
I am talking about snapshot_system() here. It's not given that the
filesystems need to be read-only (you can snapshot them too). The benefit
here is that you can do whatever you want with the snapshot (encrypt,
compress, send over the network) and have a clean well-defined interface
in the kernel. In addition, aborting the snapshot is simpler, simply
munmap() the snapshot.
The problem with writing in the kernel is obvious: we need to add new code
to the kernel for compression, encryption, and userspace interaction
(graphical progress bar) that are important for user experience.
Pekka
-
Well, swsusp currently does almost the same, except that you can read the
image from the kernel as a stream of bytes, using read() and, during the
restore phase, upload the same image using write(). The advantage of this
is that the interface is symmetrical from the user space's point of view.
[You're cancelling the hibernation by closing /dev/snapshot, which also is
quite natural.]
If you look at the interface in user.c, there are only two ioctls really needed
for that in there, SNAPSHOT_ATOMIC_SNAPSHOT and
SNAPSHOT_ATOMIC_RESTORE. Two more are handy for freezing
tasks, SNAPSHOT_FREEZE and SNAPSHOT_UNFREEZE. The others were added
later, to make the user space part simpler or capable of doing some fancy
Yes, and that's why we wanted to introduce the userland part. The problem
with this approach, as it's turned out, is that the userland part must be a
very specialized piece of software, really careful of what it's doing, mainly
because of the inability to checkpoint filesystems. If we could checkpoint
filesystems and were able to unfreeze the user space after creating the
snapshot without the risk of corrupting filesystems in the restore phase,
the userland part could be much simpler (even as simple as Linus suggested).
Greetings,
Rafael
-
this sounds like a really good argument for having a usable userspace running.
we already have the LVM snapshot code in the kernel, so we have the pieces
available to protect the filesystems, we just need to figure out how to put them
together. (the simplest way would be to make a new suspend package that
required the user to use LVM so that snapshots are available, but this is also
the most disruptive approach)
David Lang
-
Yes. I personally know very little about the LVM snapshot code and I wasn't
aware of its capabilities. If we can make it possible to run the user space
safely after we've created the memory snapshot, I'm all for it.
As far as the package is concerned, we can just add the new user space tools
to the suspend package containing our existing userland part.
Greetings,
Rafael
-
The kernel can already do compression and encryption.
Regards
Oliver
-
Hi Oliver,
Yes, if we all could agree on _which_ compression and encryption
algorithm(s) we want to use. It goes beyond that too, where do you
want to save the image? In the swap device or a regular file? And
don't forget about debuggability either. It's faster to do a
snapshot/resume without shutdown/restart in the middle or just do a
snapshot, and examine its contents.
-
A swap device is doubtlessly easier. But isn't the problem of using
a swap file already fixed? The writeout seems the easiest part of
Then use a "fake reboot" option and save the image to a ramdisk.
It isn't that hard. You must be able to survive that, as io errors during
write out are possible.
Regards
Oliver
-
gzip is too slow for this. lzf works okay. Oh and swsusp wants rsa
crypto.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
Then port lzf to the kernel, or help with the lzo port.
Swsusp might want RSA crypto, but it doesn't really need it. Currently
it only uses it to be able to suspend without asking for a passphrase.
So the current sequence is:
1) Generate RSA keys + ask for a passphrase. (Once)
...
2) Suspend. (Encrypt snapshot with public RSA key).
...
3) Ask for the passphrase.
4) Resume.
RSA is used so that the passphrase can be thrown away between 1 and 2.
But the same functionality can be achieved by doing:
1) Define a user password (e.g. /etc/shadow thing). (Once)
2) When a user logs in: get random data and encrypt it with the password,
this becomes the AES key. Store both the data and key in a secure way in
memory, e.g. using the existing kernel key infrastructure.
...
3) Suspend.
(Encrypt snapshot with the AES key and store the random data.)
...
3) Ask for the passphrase.
(To get the AES key, encrypt the stored random data.)
4) Resume.
Variants are possible of course, but this is the main idea.
This is secure because the key infrastructure is secure, and even if
it isn't the system must be compromised to get the suspend key before
the suspend is done. But at that point the attacker already has all
information that can be found in the suspend image, and could have done
all kind of things to inflict damage (like installing a key logger).
Advantage of this scheme is that it only needs AES and can be done (mostly)
in kernel space. It's also faster and simpler than the current RSA scheme.
Disadvantage is that it wastes at least 32 bytes of memory when the system
is running, to store the data and key.
Only thing that needs to be done in userspace is setting the random data
and AES key, but there exists a suitable interface for that (the key system).
As user login is already done in user space, this can be integrated with
that in a nice way.
Greetings,
Indan
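The key handling in the scheme above can be sketched in a few lines. This is only an illustration in Python, with one loudly stated substitution: the proposal derives the AES key by encrypting random data with the password, while the sketch uses HMAC-SHA256 from the standard library as a stand-in keyed function, because the Python standard library has no AES. Encrypting the image itself is left out, and login() and resume_key() are hypothetical names.

```python
import hashlib
import hmac
import os


def derive_key(password: bytes, random_data: bytes) -> bytes:
    # Stand-in for "encrypt the random data with the password" (an assumption:
    # HMAC-SHA256 replaces the AES step of the proposal).
    return hmac.new(password, random_data, hashlib.sha256).digest()


def login(password: bytes):
    """At login: pick random data, derive the image key, keep both in memory."""
    random_data = os.urandom(32)
    return random_data, derive_key(password, random_data)


def resume_key(passphrase: bytes, stored_random_data: bytes) -> bytes:
    """At resume: recompute the key from the passphrase and the stored data."""
    return derive_key(passphrase, stored_random_data)
```

Only the random data is stored next to the image. The key can be recomputed at resume from the passphrase, and a wrong passphrase yields a different key.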
-
Another disadvantage is that you need to hack into PAM infrastructure,
that your suspend password needs to be same as someone's login
password, and that it will really only work with single-user machine.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
Hello,
The first two are only true if you want to integrate it with user login, so
that a user only needs to sign in once, which seems like a convenient thing.
But if you don't want to integrate with the existing login infrastructure,
then just don't. And those disadvantages are true for any system that wants
users to login once.
Then the disadvantage is reduced to a user needing to provide the password
at suspend if the system wasn't booted from a snapshot. But no need for
users to generate any files, just to choose a resume password.
If the resume key is stored per user instead of a single global instance, it
will work with a multi-user system too. A more interesting question is what
should happen when one user did the suspend and the other wants to resume.
Throw away the snapshot? Refuse booting? Or boot and switch "active user"?
If you don't want people to resume each other's suspends then a key per user
works. If you want them to, then it becomes a bit tricky, especially if you
don't integrate with the login system. You don't want that a user can resume
someone else's snapshot and have access to everything that other user left
open. Nor do you want users to give a password twice.
If you want users to be able to resume each other's snapshots, you probably
also want the system to switch users after the resume. No matter what scheme
is used, this becomes hairy and hard to get watertight. (Perhaps "impossible"
is more realistic: how to be able to read the suspend image and copying it
to RAM again, without having access to all data within?)
But if it's an "us" against "them" case, and you want users to resume each
other's snapshots, you're right that the scheme I proposed will fall apart.
In which case it needs to be adjusted a bit to handle this case:
Have one global suspend/resume key, and for each user store it on disk,
encrypted with that user's password. Also store the key in memory as
before. Now when the system is suspended any user needs to have provided
his pa...
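The "one global key, stored once per user under that user's password" idea above can be sketched roughly as follows. This is only an illustration under loud assumptions: the XOR wrap keyed by PBKDF2 is a toy stand-in for a real authenticated cipher, and the names `wrap_key`, `unwrap_key`, and the users "alice" and "bob" are hypothetical, not anything from swsusp or Suspend2.

```python
import hashlib
import os

KEY_LEN = 32  # length of the global suspend/resume key in bytes (assumed)


def _derive(password: str, salt: bytes) -> bytes:
    # Stretch the user's password into KEY_LEN bytes of wrapping material.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000, dklen=KEY_LEN)


def wrap_key(global_key: bytes, password: str):
    """Return (salt, wrapped): the global key hidden under one user's password."""
    salt = os.urandom(16)
    pad = _derive(password, salt)
    # Toy XOR wrap for illustration only; a real design needs a proper cipher.
    return salt, bytes(a ^ b for a, b in zip(global_key, pad))


def unwrap_key(salt: bytes, wrapped: bytes, password: str) -> bytes:
    """Recover the global key; a wrong password yields a different, useless key."""
    pad = _derive(password, salt)
    return bytes(a ^ b for a, b in zip(wrapped, pad))


# One global key, stored on disk once per (hypothetical) user.
global_key = os.urandom(KEY_LEN)
on_disk = {
    "alice": wrap_key(global_key, "alice-password"),
    "bob": wrap_key(global_key, "bob-password"),
}

# Any user who knows their own password recovers the same resume key.
recovered = unwrap_key(*on_disk["bob"], "bob-password")
```

Note that the sketch says nothing about the hard part raised above: whoever unwraps the key can read the whole image, including what the other user left open.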
Hi.
Sorry Pekka, but that's just broken.
It implies firstly that we tell all userspace programs "I'm sorry, but
I'm suspending at the moment. Can you tip toe quietly around while I do
it?" You can't seriously expect every userspace program to be modified
to adjust its behaviour according to whether we're writing a snapshot
to disk at the moment or not.

It also implies that we can prepare a snapshot and then happily have the
contents of the disk change so that they don't match the superblock and
other filesystem details we just saved in the snapshot. We can't. At
least not without modifying all the filesystems so that (at a minimum)
they know how to throw away all the metadata they have at resume time

This sounds feasible, but it's only really acceptable if you're willing to
have hibernation fail or restart multiple times. If your battery is
running out or you need to rush to put a lappy in your bag because the
train just came early, that's not an option. It's for that very reason
that I've put a lot of effort into trying to make it work first time,
every time. Not there yet, but it's a priority.

By the way, sorry. This email feels like it is pouring a lot of cold
water on your ideas. I don't want to be negative!

Regards,
Nigel
It certainly isn't.
You don't need to modify other programs. You just need to display the
progress bar and block _user input_. I don't even claim to know X, but I
would be extremely surprised if you technically can't say "don't let
the user touch any other windows except this one." The user couldn't care
less whether tasks are frozen or not by the kernel. What matters is that
the user can't shoot himself in the foot while snapshotting.

Furthermore, we probably do need to do other things to ensure safety, like
remounting filesystems read-only but again, this has nothing to do with
snapshotting per se. What the kernel needs to worry about is (1) providing
an atomic snapshot that is consistent and (2) resuming to that snapshot
safely. If the _user_ loses data that was generated between snapshot +
shutdown, it's absolutely no concern for the snapshot operation!

But you just explained how we can! We shouldn't bend over backwards for
snapshotting just because the filesystems don't currently support
something we need.

Don't worry, I am used to cold water :-).
Pekka
-
Hi.
User input doesn't account for all system activity. Think of cron jobs
Noooo! If the user loses data, the user will be concerned and we should
be. I for one would do my best to avoid using software that loses my
data for me. I wouldn't care if you said "Well, it's your fault. You
lost the data." From my perspective as a user, I didn't lose the data,

Sorry, but I just don't believe filesystems should need to throw away
metadata post resume. If we let data be changed after snapshotting (or
ourselves cause it to be changed), we're the ones that are broken. Our
snapshot is out of date and the expectations of userspace programs that
were snapshotted will be out of date. Just imagine, for example, a
userspace program that is snapshotted, then reads and deletes a
temporary file. After the snapshot restore, it's running again. But
wait, we can't read or delete the file again because it's already gone.

Maybe, but I'd still rather be encouraging!
Nigel
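The temporary-file example above can be reproduced in miniature. This is a toy model, not kernel behaviour: the "snapshot" is just a copy of a hypothetical program's in-memory state, and the names `state`, `snapshot`, and `consume` are made up for the illustration.

```python
import copy
import os
import tempfile

# Create the temporary file the program is about to consume.
fd, path = tempfile.mkstemp()
os.write(fd, b"pending work")
os.close(fd)

# In-memory state of a hypothetical program: it still owes one read+delete.
state = {"tmpfile": path, "consumed": False}

# "Snapshot": a copy of the in-memory state only; the disk is not part of it.
snapshot = copy.deepcopy(state)


def consume(st):
    """Read and delete the temp file, as the program would after the snapshot."""
    with open(st["tmpfile"], "rb") as f:
        data = f.read()
    os.remove(st["tmpfile"])
    st["consumed"] = True
    return data


first = consume(state)  # runs after the snapshot and changes the disk

# "Resume": memory goes back to the snapshot, the disk does not.
resumed = copy.deepcopy(snapshot)
try:
    consume(resumed)
    outcome = "ok"
except FileNotFoundError:
    outcome = "file already gone"
```

The resumed copy still believes the file is there, which is exactly the mismatch between snapshotted expectations and on-disk reality described above.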
Yes, but the _user_ did not start them so they didn't lose any work. See,
it might or might not be important but that's something _userspace_
knows much more about than the kernel ever will.

You are looking at snapshot/shutdown from kernel and user experience point
of view at the same time which causes confusion here.

Let me repeat: it is _absolutely no concern_ of the _kernel_ whether you
resume to a snapshot that does not contain all your precious data. The
kernel doesn't care one bit!

That being said, the _userspace solution_ obviously needs to take this
into account by blocking user input, making filesystems read-only, and
maybe even blocking certain background processes (cron and beagle indexing
come to mind).

It doesn't. We can either make the filesystem read-only or, surprise,
surprise, make a _snapshot_ of the filesystem!

And while the points you raised are important for the full
end-user solution, it is absolutely not interesting to snapshot_system().
The only thing it needs to guarantee is a consistent snapshot that we can
resume later.

You are. Perhaps you just don't know it yet. ;-)
Pekka
-
I think to some extent that's part of the problem. Consider for a moment
that a /dev/hibernate would be required, and that it must be (a) a disk,
or (b) a partition, or (c) other devices in the future, like an nbd, USB
flash or DVD.

Don't have a device like that, then can't hibernate. Stop trying to be
smart and use swap for two different things. Stop trying to have an
interface between user space and kernel which does things not required
to preserve the system. A progress indicator is not needed, power off is

Hibernate is useful to avoid complex boot, it's useful as the UPS gets
tired, and putting features in the process beyond saving the snap
(possibly compressed and/or encrypted) just adds complexity. Put it all
in the kernel and use /sys/power/state as the user interface. Stop
oversolving the problem.

No, that doesn't avoid other hard issues, but for the most part suspend2
has addressed them.

--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
-
Won't there be problems if e.g. X tries to write something to its
logfile after snapshot?

Xav
-
Sure. But that's a user-level issue.
You do have to allow writing after snapshotting, since at a minimum, you'd
want the snapshot itself to be written. So the kernel has to be fully
running, and support full user space. No "degraded mode" like now.

So when I said "fully running user mode", I really meant it from the
perspective of the kernel - not necessarily from the perspective of the
"user". You do want to limit _what_ user mode does, but you must not limit
it by making the kernel less capable.

Remounting mounted filesystems read-only sounds like a good idea, for
example. We can do that. We have the technology. But we shouldn't limit
user space from doing other things (for example, it might want to actually
*mount* a new filesystem for writing the snapshot image).

For example, right now we try to "fix" that with the whole process freezer
thing. And that code has *caused* more problems than it fixed, since it
tries to freeze all the kernel threads etc, and you simply don't have a
truly *working*system*.

I think it's fine to freeze processes if that is what you want to do (for
example, send them SIGSTOP), but we freeze them *too*much* right now, and
the suspend stuff has taken over policy way too much. We don't actually
leave the system in a runnable state. I can almost guarantee that you'd be
*better* off having the snapshot taking thing do a

	kill(-1, SIGSTOP);

in user space than our current broken process freezer. At least that
wouldn't have screwed up those kernel threads as badly as swsusp did.

And no, I'm not saying that my suggestion is the only way to do it. Go
wild. But the *current* situation is just broken. Three different things,
none of which people can agree on. I'd *much* rather see a conceptually
simpler approach that then required, but even more important is that right
now people aren't even discussing alternatives, they're just pushing one
of the three existing things, and that's simply not viable. Because I'm
not merging...
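As a rough illustration of the user-space SIGSTOP idea above, the sketch below stops and resumes a single throwaway child process. It deliberately does not use pid -1 (which would stop every process the caller may signal), so it is safe to run; the sleeping child is a stand-in, and nothing here is taken from swsusp, uswsusp or Suspend2.

```python
import os
import signal
import subprocess
import sys

# Start a throwaway child that just sleeps (a stand-in for "everything else").
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

# Stop it from user space; the kernel itself stays fully functional.
os.kill(child.pid, signal.SIGSTOP)

# WUNTRACED makes waitpid report stopped children as well as exited ones.
_, status = os.waitpid(child.pid, os.WUNTRACED)
was_stopped = os.WIFSTOPPED(status)

# ... the snapshot would be taken and written out here ...

# Let the child run again, then clean up.
os.kill(child.pid, signal.SIGCONT)
child.terminate()
child.wait()
```

Kernel threads are untouched by this, which is the point being made: user space is paused by policy chosen in user space, while the kernel keeps working normally.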
While the ioctl() interface is horrid, I think it's actually in
principle pretty close to your snapshot_system()/resume_snapshot().
The ugliness probably comes from the fact that suspend to RAM and
snapshot/shutdown are interleaved there too.
-
Hi.
It doesn't need a fully functional userspace (unless you want to write
to a fuse device, and even then that could be worked around - make it
like uswsusp or userui).... can I diverge for a second and say that from
this point of view, fuse is the lamest idea ever invented. Guaranteed to
break your ability to suspend^Wsnapshot.... Anyhow, if the kernel has
bmapped the pages it's going to write to beforehand, it knows where the

We tried that. It would need some work. IIRC remounting filesystems
read-only makes files become marked read-only. Perfectly sensible,
except that if you then remount the filesystem rw at resume time, all
those files are still marked ro and userspace crashes and burns. Not
unfixable, I'll agree, but there is more work to do there.

As to the example, mounting a new filesystem for writing the snapshot
image should probably be done before we do the snapshot. Then it won't
be in danger of triggering anything that might require one of the other

Perhaps you should try to make an alternative yourself instead of
pushing us into making something we don't believe will work (my case) or
have already done but in a way you don't like (Rafael). Don't talk about
Pavel cutting code. He's just acking/nacking what Rafael sends him.

Nigel
I've done that in the past (USB, PCMCIA - screw the maintainers, redo
it basically from scratch). But the thing is, I'm totally uninterested
personally in the whole disk-snapshotting, so I'm not likely to do it
there.

But yes, I'm actually hoping that some new person will come in with a new
idea. The current people seem to be too set in "their" corners, and I
don't expect that to really change.

Quite honestly, I don't foresee any of the current three approaches really
doing something new and obviously better, unless somebody new steps in.

Linus
-
Hi.
That's because there is no other possibility. Sooner or later you have
to do a snapshot, and somehow you have to save it. You're not going to
get a new solution, just one that does those basic things in new and
better ways.

I'm perfectly willing to think through some alternate approach if you
suggest something or prod my thinking in a new direction, but I'm afraid
I just can't see right now how we can achieve what you're after.

Nigel
Ok, what about this approach I've been mulling about for a while:
Suspend-to-disk is pretty much an exercise in state saving. There are
multiple ways to do state saving, but they tend to end up in two
categories: implicit and explicit.

In implicit state saving, you try to save the state of the
system/application/whatever "under its feet", more or less, and then
fixup what is not saved/saveable correctly. A well-known example is
the undumping process Emacs goes (went?) through, where it tries to dump
the state of the memory as a new executable, with a lot of pleasure with
various executable formats and subtleties due to side effects in libc
code you don't control.

In explicit state saving each object saves what is needed from its
state to an independently defined format (instead of "whatever the
memory organization happens to be at that point"). When reloading the
state you have to parse it, and it usually requires
rebuilding/relocating all references/pointers/etc. XEmacs currently
has a "portable dumper" that pretty much does just that. We don't
have any redumping problems anymore, they're over.

Which one is the best depends heavily on the application. The amount
of code in the implicit case depends on the amount of fixups to do.
In the kernel case it happens to be a lot, pretty much everything that
touches hardware has to save to memory the device state and reload it
on resume. And bugs on hardware handling can be quite annoying to
debug. And if some driver does not do saving/resume correctly, you
have no way outside of playing with modules to ensure the safety of
the suspend cycle.

The amount of code in the explicit case is an interesting variable in
the case of the kernel. You have to save what is needed, but how do
you define what is needed? It is, pretty much, what running processes
can observe from userspace. Now, what can a process observe:
- its application text and anonymous memory pages
- its file handles
- its mapped files
- its mapped whatever else
- its ...
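To make the explicit variant described above concrete, here is a small sketch of objects saving themselves to an independently defined format and having their references rebuilt on load. The `Node` class, the JSON layout and the `save`/`load` helpers are all hypothetical; this is not how XEmacs's portable dumper or any kernel code is actually written.

```python
import json


class Node:
    """A hypothetical in-memory object holding a reference to another object."""

    def __init__(self, name, peer=None):
        self.name = name
        self.peer = peer  # direct reference (a "pointer")


def save(nodes):
    # Write an independently defined format: references become stable
    # indices rather than whatever the memory layout happens to be.
    index = {id(n): i for i, n in enumerate(nodes)}
    records = [
        {"name": n.name, "peer": index[id(n.peer)] if n.peer is not None else None}
        for n in nodes
    ]
    return json.dumps(records)


def load(blob):
    # Parse the format, then rebuild the references in a second pass.
    records = json.loads(blob)
    nodes = [Node(r["name"]) for r in records]
    for node, r in zip(nodes, records):
        if r["peer"] is not None:
            node.peer = nodes[r["peer"]]
    return nodes


a = Node("a")
b = Node("b", peer=a)
a.peer = b  # cyclic reference, which has to be relinked on load
restored = load(save([a, b]))
```

The restored objects are fresh allocations with equivalent links, which is what makes this style independent of the memory layout at save time, at the cost of every object type needing its own save and load code.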
If you are looking seriously at this you might want to start with the
code in the OpenVZ kernel (http://openvz.org) that allows a VE to
"checkpoint" to disk and "restore" on the same or a different machine.

This is, as far as I can tell, a portable implementation of this that
already handles real live userspace applications moving transparently
between two machines.

It has the advantage that it lives in an orderly world where most
devices and the file system are virtual but, hey, it works right now.

Regards,
Daniel
--
Digital Infrastructure Solutions -- making IT simple, stable and secure
Phone: 0401 155 707 email: contact@digital-infrastructure.com.au
http://digital-infrastructure.com.au/
-
Hi.
Just to let you know - I'm not ignoring your message. It's just taking
some time to think through the issues and try to formulate a good reply.
Oh, and of course there are a gazillion other messages flying about at
the moment that need attention too.

Regards,
Nigel
On Thursday, 26 April 2007 22:08, Nigel Cunningham wrote:
Well, I think that much of what Linus is saying indicates that he hasn't tried
to write any such thing himself. ;-)

Anyway, I'm tired of this whole thing. Really. I've just been trying to make
things _work_ more-or-less reliably in a way that Pavel liked and I really
didn't know that much about the kernel when I started. In fact, I started as a
user who needed certain functionality from the kernel and that was not there
at the time. I've made some mistakes because of that (like the definitions of
the ioctl numbers in suspend.h - this was just a rookie mistake, and I'm
ashamed of it, but _nobody_ caught it, although I believe many people were
looking at the patch).

Now that I know much more than before, I can say I agree with Linus on his
opinion about the separation of s2ram from the snapshot/restore functionality
(I'll call it 'hibernation' for simplicity from now on). It should be done,
because it would make things simpler and cleaner. Still, it will be difficult
to do without screwing users en masse and that's my main concern here.

I don't agree that we don't need the tasks freezer for suspending and
hibernation. We need it, because we need to be sure that the (other) tasks
will not get in our way, and that also applies to kernel threads (and I
don't think the tasks freezer is 'screwing' them, BTW).

I agree that the userland interface for swsusp is not very nice and I'm going
to do my best to clean that up. I hope that someone will help me, but if not,
then that's fine. OTOH, it's difficult, if not impossible, to do a
userland-driven hibernation in a completely clean way. I've tried that and I'm
not exactly satisfied with the result, although it works and some distros use
it. I wouldn't have done it again, but then I'm going to support the existing
users, as I promised.

Now, I think that the hibernation should better be done completely in the
kernel, because that's just conceptually simpler, although ...
That's definitely true. The only interaction I ever had with "hibernation"
(and yes, we should just call it that) is when I was working on s2ram and
cleaning up the PCI device suspend/resume in particular, and trying
(_mostly_ successfully - I think I broke it once or twice mainly due to
interactions with the console, but on the whole I think it mostly worked)

So my strong opinion on it literally comes from the other end (ie _not_
knowing about hibernation, but trying to work with s2ram, and cursing the

I do agree. It will inevitably affect a lot of devices. That's always

I actually feel much less strongly about that, because just separating out
s2ram and hibernate entirely from each other would already really get the
thing _I_ care about taken care of - being able to work on one or the
other without fear of breaking the other one.

And besides, I actually came into the whole discussion because I'm not a
huge fan of thinking that user-land is "better". If the thing can sanely
be done in kernel, I'm actually all for that. What drives me wild is
having three different things, and nobody driving.

It needs somebody who (a) cares (b) has good taste and (c) has enough time
and personal karma to burn that he can actually take the (obviously)
inevitable heat from just doing things right, and convincing people to
select *one* implementation.

That kind of person is really really hard to find. And if you're it,
you're in for some pain ;)

Linus
-
Hi Rafael.
I don't want to remove user visible interfaces either (I understand that
you mean the ioctls by that?). Perhaps we can find a way to make them
still usable with a more in-kernel solution (ie some things become

Sure. We should spend some time discussing and planning beforehand so we

Regarding open-coded things, I assume you're referring to the extents. I
would argue that they're not open-coded because list.h implements doubly
linked lists, and extents use a singly linked list. That said, I suppose
we could make the extents doubly linked and use list.h, even though that
e ;-))

Yes.

Thanks for this email. It's really encouraging, and I'm more than glad
to work with you. Unfortunately, as you've seen me keep saying already,
I have very limited time to work on this. Thankfully you seem to have
more, and Pekka has also stepped up to help, so maybe we can make good
forward progress despite my limitations.

Regards,
Nigel
There are other solutions, though. One is that we could export a
system call interface which freezes a filesystem and prevents any
further I/O. We mostly have something like that right now (via the
write_super_lockfs function in the superblock operations
structure), but we haven't exported it to userspace. And right now
not all filesystems support it, but in theory that could be fixed (or
you only support suspend/resume if all filesystems support lockfs).

We would also need a similar interface to freeze any block device I/O,
in case you have a database running and doing direct I/O to a block
device. (Or again, we could simply not support that case; how many
people will be running a database accessing a block device on
their laptop?)

So in order to do this right, we would have to double the number of
new interfaces needed from the two proposed by Linus --- which is why
I think the userspace suspend solution is fundamentally NOT the right
one. Rather the right one is the one which Linux ultimately used for
PCMCIA, which is to do it all in the kernel.

- Ted
-
block device I/O uses generic_file*whateveriscurrenthere*_write, which
checks for the freeze flag, so the infrastructure for that is there
as well.
-
Hi Nigel,
As I am a total newbie to the power management code, I am unable to
spot the conceptual difference in uswsusp suspend.c:suspend_system()
and suspend2 kernel/power/suspend.c:suspend_main(). How are they
different?

I assume this doesn't affect the kernel at all with uswsusp?

This is too broad. Please be more specific about the problems the current
suspend and snapshot/shutdown code in the kernel has.

Now to add to your list, as far as I can tell, suspend2 provides
better feedback to the user via the netlink mechanism (although the
kernel shouldn't be sending messages such as userui_redraw but instead
let the userspace know of the actual events, for example, that tasks
have now been frozen). However, I am unsure if this is still relevant
as most of the work (snapshot writing) is being done in userspace
where we explicitly know when processes have been frozen, when the
snapshot is finished, and when it's written to disk.

Pekka
-
Hi.
Well uswsusp would benefit from using multiple threads - if it can - to
Did you see the 'Reasons to merge' email I sent? It has more detail on
From uswsusp's point of view, yeah. But I'm still coming from the 'doing
this in kernelspace makes far more sense' perspective.

Regards,
Nigel
It's doable[1], but I'm not sure that the added complexity is worth it.
I'm surprised that you see a big improvement. I'd expect that the image
write is bottlenecked by the disk performance. On my PC (Core2, locked
at 1.6GHz) lzf can compress 250-280MB/s; even with an older CPU that can
do 1/3 of that it's still more than the disk can handle.

Luca
[1] We may even use MPI to compress over a Beowulf cluster, it's
userspace ;)
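The bottleneck argument above can be put into a back-of-the-envelope model. Only the 250 MB/s compressor figure and the "1/3" factor come from the message; the disk speed, the compression ratio and the function name `image_write_rate` are assumptions chosen for illustration, and the model assumes compression and disk writes overlap perfectly.

```python
def image_write_rate(compress_mb_s, disk_mb_s, ratio):
    """Rate (MB/s of uncompressed image) of an overlapped compress-then-write pipeline.

    ratio is compressed size / original size, so the disk only has to
    absorb ratio * input; the slower of the two stages sets the pace.
    """
    return min(compress_mb_s, disk_mb_s / ratio)


DISK_MB_S = 50.0  # assumed disk write speed, not from the thread
RATIO = 0.7       # assumed LZF compression ratio, not from the thread

fast_cpu = image_write_rate(250.0, DISK_MB_S, RATIO)      # the Core2 figure
slow_cpu = image_write_rate(250.0 / 3, DISK_MB_S, RATIO)  # "older CPU that can do 1/3"
```

With these assumed numbers both cases are disk-bound at roughly 71 MB/s, which matches the expectation that extra compression threads would not help. A faster disk or a better compression ratio would move the limit to the CPU, which is the situation in which multiple threads could show a gain.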
--
"Ricorda sempre che sei unico, esattamente come tutti gli altri".
-
