These patches are from Oren Laadan. I've refactored them a bit to make them a wee bit more reviewable. I think this separates out the per-arch bits pretty well. It should also be at least build-bisectable.

If there are no objections to this general approach, then we plan to start submitting these bits to -mm.

--

At the containers mini-conference before OLS, the consensus among all the stakeholders was that doing checkpoint/restart in the kernel as much as possible was the best approach. With this approach, the kernel will export a relatively opaque 'blob' of data to userspace which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was that a userspace application would be responsible for collecting all of this data. We were also planning on adding lots of new, little kernel interfaces for all of the things that needed checkpointing. This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel structures such as vmas and mm_structs. It will also contain copies of the actual memory that the process uses. Any changes in this blob's format between kernel revisions can be handled by an in-userspace conversion program.

This is a similar approach to virtually all of the commercial checkpoint/restart products out there, as well as the research project Zap.

These patches basically serialize internal kernel state and write it out to a file descriptor. The checkpoint and restore are done with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a single task. The task's address space may consist of only private, simple vma's - anonymous or file-mapped.

--

Oren's original announcement

In the recent mini-summit at OLS 2008 and the following days it was agreed to tackle the checkpoint/restart (CR) by beginning with a very simple case: save and restore a single task, with simple ...
From: Oren Laadan <orenl@cs.columbia.edu>

This patch adds those interfaces, as well as all of the helpers needed to easily manage the file format.

The code is roughly broken out as follows:

ckpt/sys.c - user/kernel data transfer, as well as setting up of the checkpoint/restart context (a per-checkpoint data structure for housekeeping)

ckpt/checkpoint.c - output wrappers and basic checkpoint handling

ckpt/restart.c - input wrappers and basic restart handling

Patches to add the per-architecture support as well as the actual work to do the memory checkpoint follow in subsequent patches.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---

 linux-2.6.git-dave/Makefile          |    2
 linux-2.6.git-dave/ckpt/Makefile     |    1
 linux-2.6.git-dave/ckpt/checkpoint.c |  207 +++++++++++++++++++++++++++++++
 linux-2.6.git-dave/ckpt/ckpt.h       |   82 ++++++++++++
 linux-2.6.git-dave/ckpt/ckpt_hdr.h   |   69 ++++++++++
 linux-2.6.git-dave/ckpt/restart.c    |  189 ++++++++++++++++++++++++++++
 linux-2.6.git-dave/ckpt/sys.c        |  233 +++++++++++++++++++++++++++++++++++
 7 files changed, 782 insertions(+), 1 deletion(-)

diff -puN /dev/null ckpt/checkpoint.c
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/checkpoint.c 2008-08-07 15:37:22.000000000 -0700
@@ -0,0 +1,207 @@
+/*
+ * Checkpoint logic and helpers
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <asm/ptrace.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/**
+ * cr_get_fname - return pathname of a given file
+ * @file: file pointer
+ * @buf: buffer for pathname
+ * @n: ...
Do you rely on the kernel version in order to determine the format of the binary data, or is it just informational? If you think the format can change in incompatible ways, you probably need something more specific than the version number these days, because there are just so many different trees with

Please use the existing pr_debug and dev_dbg here, instead of creating

This structure has an odd number of 32-bit members, which means that if you put it into a larger structure that also contains 64-bit members, the larger structure may get different alignment on x86-32 and x86-64, which you might want to avoid. In this case, I'm pretty sure that sizeof(cr_hdr_task) on x86-32 is

Can (ctx->hpos + n > CR_HBUF_TOTAL) be controlled by the input

get_fs()/set_fs() always feels a bit ouch, and this way you have to use __force to avoid the warnings about __user pointer casts in sparse. I wonder if you can use splice_read/splice_write to get around

Why do you need CAP_SYS_ADMIN for this? Can't regular users

The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead?

Arnd <>< --
Yeah, this is very true. My guess is that we'll need something like

Can't we just declare all these things __packed__ and stop worrying

Ugh, this is crappy code anyway. It needs to return an error and have

I have to wonder if this is just a symptom of us trying to do this the wrong way. We're trying to talk the kernel into writing internal gunk into a FD. You're right, it is like a splice where one end of the pipe is in the kernel.

Yes, eventually. I think one good point is that we should probably remove this now so that we *have* to think about security implications as we add each individual patch. For instance, what kind of checking do we do when we restore an mlock()'d VMA?

Fine with me. Renamed in new patches, hopefully.

I'll send new patches out later today.

-- Dave --
Exactly. The header should eventually contain sufficient information to describe the kernel version, configuration, compiler, cpu (arch and capabilities), and checkpoint code version. How would you suggest to identify the origin tree with an identifier

Actually not quite. 'n' is _not_ controlled by the input data, and at the same time ctx->hpos should always carry enough room by design. If that is not the case, then it's a logical bug, not a DoS attack.

To avoid repetitive malloc/free, ctx->hbuf is a buffer to host headers as they are read; since headers can be read in a nested manner, ctx->hpos points to the next free position in that buffer. So 'n' is the size of the header that we are about to read - decided at compile time, not the user input. The BUG_ON() statement asserts that by design we have enough buffer (like you'd check that you didn't run out of kernel stack...) If it is preferred, we can change this to write a kernel message and

Hmmm... even if not strictly now, we *will* need admin privileges for the CR operations, for the following reasons:

checkpoint: we save the entire state of a set of processes to a file - so we must have privileges to do so, at least within (or with respect to) the said container. Even if we are the user who owns the container, we'll need root access within that container.

restart: we restore the entire state of a set of processes, which may require some privileged operations (again, at least within or with respect to the said container). Otherwise any user could inject any restart data into the kernel and create any set of processes with arbitrary permissions.

--
kmalloc/kfree() are really, really fast. I wonder if the code gets easier or harder to read if we just alloc/free as we need to. How large are these allocations, usually? Will stack allocation work in most cases? -- Dave --
The ctx->hbuf interface is a pair of cr_hbuf_get(ctx, length) and a matching cr_hbuf_put(ctx, length), almost like using kmalloc/kfree(). The main difference is that cleanup in error paths is implicit (the

That depends on how we construct the headers. In Zap there are some headers that use relatively long structures to be put on the stack, and it wouldn't make much sense to divide them into smaller headers artificially.

However, I forgot to mention earlier that an important reason to use this construct is actually in anticipation of a future optimization: during application downtime the checkpoint state will be aggregated into an in-memory buffer, and only after the application is allowed to continue execution (unfrozen) will the buffer be written back to the FD. In that scenario, we will allocate a larger buffer in the ctx (eg based on some heuristics) and cr_hbuf_get() will return the next location in that buffer, while cr_hbuf_put() will do nothing.

Oren. --
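To make the cr_hbuf_get()/cr_hbuf_put() pairing described above concrete, here is a minimal userspace sketch of a stack-like header buffer. It is not the code from the patch: the trimmed-down struct cr_ctx, the 4096-byte CR_HBUF_TOTAL value and the size_t-based signatures are assumptions made for illustration, and the sketch returns NULL where the posted code is said to use BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size; the real value is not shown in this thread. */
#define CR_HBUF_TOTAL 4096

/* Hypothetical, trimmed-down context: only the header buffer and its cursor. */
struct cr_ctx {
	char hbuf[CR_HBUF_TOTAL];
	size_t hpos;		/* next free position in hbuf */
};

/*
 * Reserve n bytes for a header. Returns NULL if the request does not fit
 * (the thread describes a BUG_ON() at this point instead).
 */
void *cr_hbuf_get(struct cr_ctx *ctx, size_t n)
{
	void *p;

	if (n > CR_HBUF_TOTAL - ctx->hpos)
		return NULL;
	p = ctx->hbuf + ctx->hpos;
	ctx->hpos += n;
	return p;
}

/* Release the most recent n bytes; calls must mirror cr_hbuf_get() in LIFO order. */
void cr_hbuf_put(struct cr_ctx *ctx, size_t n)
{
	assert(n <= ctx->hpos);
	ctx->hpos -= n;
}
```

Because nested headers are reserved and released in strict LIFO order, an error path can simply abandon the context without walking back through individual allocations, which appears to be the "implicit cleanup" property mentioned above.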
Including struct utsname in the header covers most of this.

I suppose you can't do it entirely safely, and you always need to be prepared for malicious input data, so there probably isn't much point in getting

My recommendation in general is to make kernel code crash loudly if there is a bug in the kernel itself. Returning error codes makes most sense if they get sent back to the user, which then can make sense of

Exactly. There was a project that implemented checkpoint/restart through ptrace (don't remember what it was called), so with certain limitations it should also be possible to implement the syscalls so that any user that can ptrace the tasks can also checkpoint them.

Arnd <>< --
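The suggestion of carrying struct utsname in the image header can be sketched in userspace roughly as follows. Everything here is an assumption made for illustration: the struct name ckpt_image_hdr, the magic string and both helpers are hypothetical, the userspace struct utsname filled by uname(2) stands in for whatever the kernel would actually write, and comparing release, version and machine is only one possible compatibility check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical image header: a magic tag followed by the saving kernel's identity. */
struct ckpt_image_hdr {
	char magic[8];
	struct utsname uts;
};

/* Fill the header from the running kernel; returns a negative value on failure. */
int ckpt_hdr_fill(struct ckpt_image_hdr *h)
{
	assert(h != NULL);
	memset(h, 0, sizeof(*h));
	memcpy(h->magic, "CKPTHDR", 8);	/* 7 characters plus the terminating NUL */
	return uname(&h->uts);
}

/* Nonzero if both headers describe the same kernel release, build and architecture. */
int ckpt_hdr_same_kernel(const struct ckpt_image_hdr *a,
			 const struct ckpt_image_hdr *b)
{
	return strcmp(a->uts.release, b->uts.release) == 0 &&
	       strcmp(a->uts.version, b->uts.version) == 0 &&
	       strcmp(a->uts.machine, b->uts.machine) == 0;
}
```

A restart path could call ckpt_hdr_same_kernel() on the saved header and a freshly filled one, and hand the image to a userspace converter when they differ. As noted earlier in the thread, the version string alone does not distinguish all trees, so this check is necessary rather than sufficient.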
The only problem I can see with this is that you lose efficiency, especially when you have to build your checkpoint image with lots of things that are config-specific. The approach sounds like a good one in theory, but I'm a bit skeptical that we could stick to it in practice, in a mainline kernel where there are billions of config options. It is definitely something to strive for, though. Good point! -- Dave --
I personally dislike __packed__ because it makes it very easy to get suboptimal object code. If you either pad every structure to a multiple of 64 bits or avoid __u64 members, you don't have a problem. Also, I think avoiding implicit padding inside of data structures is very helpful for user interfaces, if necessary you can always add explicit

Maybe you can invert the logic and let the new syscalls create a file descriptor, and then have user space read or splice the checkpoint data from it, and restore it by writing to the file descriptor. It's probably easy to do using anon_inode_getfd() and would solve this problem, but at the same time make checkpointing the current thread

I think the question can be generalized further: How do you deal with saved tasks that have more privileges than the task doing the restore? There are probably more, but what I can think of right now includes:

* anything you can set using ulimit
* capabilities
* threads running as another user/group
* open files that have had their permissions changed after the open

Arnd <>< --
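The padding rule described above (pad every structure to a multiple of 64 bits, with no implicit padding) can be checked at compile time. The sketch below uses hypothetical structures, not the cr_hdr_* definitions from the patch, and standard C11 _Static_assert rather than any kernel macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical header with an odd number of 32-bit members. Its size is 12,
 * so a 64-bit member placed right after it lands at offset 12 on x86-32
 * (where 64-bit integers are only 4-byte aligned inside structs) but at
 * offset 16 on x86-64.
 */
struct hdr_unpadded {
	uint32_t type;
	uint32_t len;
	uint32_t id;
};

/* Same header with explicit padding: 16 bytes on both ABIs. */
struct hdr_padded {
	uint32_t type;
	uint32_t len;
	uint32_t id;
	uint32_t pad;		/* explicit, always written as zero */
};

/* A larger record embedding the padded header before a 64-bit member. */
struct record {
	struct hdr_padded h;
	uint64_t payload;
};

/* Fail the build if the layout ever drifts. */
_Static_assert(sizeof(struct hdr_padded) % 8 == 0,
	       "header must be a multiple of 64 bits");
_Static_assert(offsetof(struct record, payload) == sizeof(struct hdr_padded),
	       "no implicit padding before payload");
```

With the explicit pad member the layout no longer depends on the ABI's alignment of 64-bit integers, and no __packed__ attribute is needed.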
Yeah, it does seem kinda backwards. But, instead of even having to worry about the anon_inode stuff, why don't we just put it in a fs like everything else? checkpointfs! I'm also really not convinced that putting the entire checkpoint in one glob is really the solution, either. I mean, is system call overhead really a problem here? -- Dave --
Well, anon_inodes are really easy to use and have replaced some of the simple non-mountable file systems in the kernel. checkpointfs sounds interesting and I guess in a plan9 world of fairies and fantasy, you should be able to create a checkpoint of your system using 'tar czf - /proc/', but I'm not sure it helps here. The main problem I see with that would be atomicity: If you want multiple processes to keep interacting with each other, you need to save them at the same point in time, which gets harder as you split your interface into more than a single file descriptor. Arnd <>< --
It could take ages to write out a checkpoint even to a single fd, so I suspect we'd have the exact same kinds of issues either way. -- Dave --
I guess either way, you have to SIGSTOP (or similar) all the tasks you want to checkpoint atomically before you start saving the contents. If you use a single fd, you can do that under the covers, when using a more complex file system, it seems more logical to require an explicit interface for this. Arnd <>< --
Oh, we're already working on patches to the freezer code to do this for us. There's a branch in here from Matt H. that's doing just that: http://git.kernel.org/?p=linux/kernel/git/daveh/linux-2.6-next-lxc.git;a=shortlog -- Dave --
One reason is that I suspect that stops us from being able to send that data straight to a pipe to compress and/or send on the network, without hitting local disk. Though if the checkpointfs was ram-based maybe not? As Oren has pointed out before, passing in an fd means we can pass a socket into the syscall. Using the anon_inodes would also prevent that, but if it makes for a cleaner overall solution then I'm not against considering either one --
With anon_inodes, you can still implement splice_read/splice_write, so you can splice it into a socket. Arnd <>< --
If you do pass a socket, will it handle blocking correctly? Getting deadlocked task would be bad. What happens if I try to snapshot into /proc/self/fd/0 ? Or maybe restore from /proc/cmdline? -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Heh, that's a good point. What was the other code where we kept coming up with deadlocks like that? Anyone remember? -- Dave --
Hmmm... these are good points. Keep in mind that our principal goal is to checkpoint a whole container, rather than to have a task checkpoint itself (which is a by-product). Of course your comments apply to a whole container as well. In both cases, I don't think that blocking on a socket is a problem; the checkpointer will enter a TASK_INTERRUPTIBLE state. Where is the deadlock? Writing or reading to/from /proc/self/... likewise - the programmer must understand the implications, or the program won't work as expected. I don't see a possible deadlock here, though. For example - writing to /proc/self/fd/0 is ok; the state of fd[0] of that task will be captured at some point in the middle of the checkpoint, so after restart one cannot assume anything about the file position; the rest should work. Oren. --
At the checkpoint end, the ptrace checks seem appropriate: If you're allowed to stop and manipulate the process, then you may as well be allowed to checkpoint and see/tweak its memory that way. At the restart end, every resource which was checkpointed will have to be re-created, and permissions checked against the privilege of the task which did the restart. We may end up having to make use of the new credentials for this. This could become unpleasant: if an unprivileged task asked a privileged helper to create something for the unprivileged task to use (i.e. a raw socket), then the user needs to be privileged to re-create the resource. But it's necessary. -serge --
Right. Of course, the hard part here will be to make it obviously safe. Having to check all sorts of permissions means there will be many opportunities for exploitable bugs. The best way I can think of for this would be to use existing syscalls (e.g. sched_setscheduler, setfsuid, ...) from user space wherever possible and do only the bare minimum for the restart part in the kernel. Arnd <>< --
Well, the current direction is about as far away from that as you can get, unless we basically call those system calls from inside our new sys_restart() one. As of now, we're doing as much work in the kernel as possible, and doing the bare minimum in userspace. That's what both Oren and our OpenVZ colleagues have advocated. -- Dave --
Arnd, Jeremy and Oren, Thanks for all of the very interesting comments about the ABI. Considering that we're still *really* early in getting this concept merged up into mainline, what do you all think we should do now? My main goal here is just to get everyone to understand the approach that we're proposing rather than to really fix the interfaces in stone. I bet we're going to be changing them a lot before these patches actually get in. -- Dave --
I think the two most important aspects here need to be security and simplicity. If you have to choose between the two, it probably makes sense to put security first, because loading untrusted data into the kernel puts you at a significant risk to start with. If you can show a restart interface that lets regular users restart their tasks in a way anyone can verify to be secure, that will be a good indication that you're on the right track.

The other problem that you really need to solve is interface stability. What you are creating is a binary representation of many kernel internal data structures, so under our common rules, you have to make sure that you remain forward and backward compatible. Simply saying that you need to run an identical kernel when restarting from a checkpoint is not enough IMHO.

Some more words on specific interfaces that we have discussed:

The single-file-descriptor approach has the big advantage of keeping the complexity in one place (the kernel). To be consistent with other kernel interfaces, I would make the kernel hand out a file descriptor, not let the user open a file and pass that into the kernel as you do now.

A new file system is a good idea for many complex interfaces that make their way into the kernel, but I don't think it will help in this case.

For checkpointing a single task, or even a task with its children, a different interface I could imagine would be to have a new file in procfs per pid that you can read as a pipe giving out the same data that you currently save in the checkpoint file descriptor. It does mean that you won't be able to pass flags down easily (you could write to the pipe before you start reading, but that's not too nice).

On the restart side, I think the most consistent interface would be a new binfmt_chkpt implementation that you can use to execve a checkpoint, just like you execute an ELF file today. The binfmt can be a module (unlike a syscall), so an administrator that is afraid of the security ...
On Mon, 11 Aug 2008 23:47:49 +0200 OTOH, making one of these checkpoint files go into any 2.6.x kernel seems like a very high bar, to the point, perhaps, of killing this feature entirely. There could be a case for viewing sys_restore() as being a lot like sys_init_module() - a view into kernel internals that goes beyond the normal user-space ABI, and beyond the stability guarantee. It might be possible to create a certain amount of version portability with a modversions-like mechanism, but it sure seems hard to do better than that. jon --
The OpenVZ dudes like to refer to something that Andrew Morton said about this (paraphrasing...): if we need cross-version restore support, we can count on userspace to do the conversion. You can almost think of it like the crashdump processing utility that we have. Instead of worrying about having the kernel *always* produce the same crashdump with the same gunk in it, we make userspace do all the parsing and interpretation. It also makes it quite possible for a distribution to make a change (say because of a security fix) in the kernel that changes the checkpoint format, then to quickly code up the necessary bits for the conversion program. -- Dave --
quoting:
> There could be a case for viewing sys_restore() as being a lot like
> sys_init_module() - a view into kernel internals that goes beyond the
> normal user-space ABI, and beyond the stability guarantee. It might be
> possible to create a certain amount of version portability with a
> modversions-like mechanism, but it sure seems hard to do better than
> that.
>
> jon

Extending this view in the context of security - we can require sysadmin privilege to restart, and then sysadmin is responsible for the contents of the file. The kernel will ensure that the data isn't corrupted. Much like with loading a kernel module - the admin may load any sort of crap. Then, sysadmin may, for instance, add a signature on a checkpointed file to verify its integrity. (Well, one problem with this scheme in the context of self-checkpoint

Using a single handle (crid or a special file descriptor) to identify the whole checkpoint is very useful - to be able to stream it (eg. over the network, or through filters). It is also very important for future features and optimizations. For example, to reduce downtime of the application during checkpoint, one can use COW for dirty pages, and only write-back the entire data after the application resumes execution. Or imagine a use-case where one would like to keep the entire checkpoint in memory. These are pretty hard to do if you split the handling between

This is an interesting idea but not without its problems. In particular, a successful execve() by one thread destroys all the others. Also, it isn't clear how this can work with pre-copying and live-migration. And finally, I'm not sure how to handle shared objects in this manner.

As for kernel module - it is easy to implement most of the checkpoint restart functionality in a kernel module, leaving only the syscall stubs in the kernel.

Oren. --
Sorry, I don't buy that argument. I'm convinced that an implementation is possible where any user can load checkpoints of tasks that he could create by starting the processes directly. If you argue that loading a corrupted checkpoint can cause any problems, then I would assume Right, execve currently assumes that the new process starts up with a single thread, but a potential binfmt_chkpt would need to potentially start multithreaded. I guess this either requires execve to reuse the existing threads (assuming they have been set up correctly in advance) or to create new ones according to the context of the checkpoint data. It may not be as easy as I thought initially, but both seem possible. Restarting a whole set of processes from a checkpoint would be What do you mean with pre-copying? How is live-migration different from restarting a previously saved Yeah, I've done the same in spufs, but I still think it's ugly ;-) Arnd <>< --
By pre-copying I refer to the first stage of live-migration: to reduce down time, much of the state of a container can be saved while tasks are still running (most notably memory, but also file system snapshot, if need be). Since the state may change, this is repeated - to save what changed in the meanwhile - until the delta is small enough. During all this time the tasks continue to execute. At this point, we freeze the container, save the last delta, and resume (in case of snapshot) or kill (in case of live-migration) the container. I'm not convinced that execve() is the best way to handle this iterative process. Also, with multiple tasks in a container, data for consecutive tasks will appear in order in the checkpoint image. Moreover, a future optimization would be to have multiple threads checkpoint the container, with data interleaved in the checkpoint image stream. Here, too, I'm not sure how an execve()-like approach plays. Finally there is the case of shared objects: v2 demonstrates this in checkpoint/objhash.c (see also Documentation/checkpoint.txt). Again, I'm not sure how execve() can adapt to this need. I definitely agree that using something like execve() is elegant and has its advantages. It just isn't clear to me that it is truly suitable for the needs. Suggestions are welcome. --
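The iterative pre-copy scheme described above can be illustrated with a toy simulation. This is only a sketch of the control flow, not anything from the patches: struct toy_mem, the per-page dirty flags and toy_run() are hypothetical stand-ins for real dirty-page tracking and for the still-running tasks, and the halving of writes per round is an assumed workload in which the working set shrinks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 64

/* Toy address space: one byte per "page" plus a dirty flag. */
struct toy_mem {
	unsigned char data[NPAGES];
	bool dirty[NPAGES];
};

size_t count_dirty(const struct toy_mem *m)
{
	size_t n = 0;

	for (size_t i = 0; i < NPAGES; i++)
		n += m->dirty[i] ? 1 : 0;
	return n;
}

/* Copy every dirty page into the image and clear its dirty flag. */
void save_delta(struct toy_mem *m, unsigned char *image)
{
	for (size_t i = 0; i < NPAGES; i++) {
		if (m->dirty[i]) {
			image[i] = m->data[i];
			m->dirty[i] = false;
		}
	}
}

/* Stand-in for the tasks that keep running: dirty the first `writes` pages. */
void toy_run(struct toy_mem *m, size_t writes, unsigned char val)
{
	for (size_t i = 0; i < writes && i < NPAGES; i++) {
		m->data[i] = val;
		m->dirty[i] = true;
	}
}

/* Pre-copy until the delta is small enough, then "freeze" and save the rest. */
size_t precopy_checkpoint(struct toy_mem *m, unsigned char *image,
			  size_t threshold)
{
	size_t rounds = 0;
	size_t writes = NPAGES / 2;

	while (count_dirty(m) > threshold) {
		save_delta(m, image);			/* tasks still running */
		rounds++;
		toy_run(m, writes, (unsigned char)rounds); /* dirtied meanwhile */
		writes /= 2;				/* assumed shrinking workload */
	}
	/* Freeze point: nothing runs any more, so the last delta is final. */
	assert(count_dirty(m) <= threshold);
	save_delta(m, image);
	return rounds;
}
```

The downtime in this model is proportional to the final delta (at most `threshold` pages) rather than to the whole address space, which is the motivation given above for repeating the copy while the tasks keep running.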
I closely follow the valuable feedback and fix the code accordingly. I propose to extend the proof of concept, to also be able to save and restore "simple" open files (regular files, directories). My motivation is twofold: (1) Providing enough functionality for people to meaningfully play with the proposed patches and try them. (2) Demonstrate how I propose to handle shared resources (open files). The first point is very important because it makes the concept actually useful to a much broader set of programs than otherwise. I hope this will attract a larger audience :) To this end, I have already extended the patchset, and I should be able to send something working in a day or two. Oren. --
Yes.
It seems to me that worrying about ABI at this point is a bit premature.
This feature, as it currently stands, is essentially useless for any
practical purpose. Self-checkpointing a single process with no handling
of non-file file descriptors and no proper handling of file
file-descriptors is not very useful.
My understanding is that this is basically a prototype for a more useful
multi-process or container-wide checkpoint facility.
While you could try to come up with an extensible file format that would
be able to handle any future extensions, the chances are you'd get it
wrong and need to break file format compatibility anyway.
I'm more interested in seeing a description of how you're going to
handle things like:
* multiple processes
* pipes
* UNIX domain sockets
* INET sockets (both inter and intra machine)
* unlinked open files
* checkpointing file content
* closed files (ie, files which aren't currently open, but will be
soon, esp tmp files)
* shared memory
* (Peter, what have I forgotten?)
Having gone through this before, I don't think an all-kernel solution
can work except for the most simple cases.
Which, come to think of it, is an important point. What are the
expected use-cases for this feature? Do you really mean
checkpoint/restart? Do you expect to be able to checkpoint a process,
leave it running, then "rewind" by restoring the image? Or does
checkpoint always atomically kill the source process(es)? Are you
expecting to be able to resume on another machine?
Lightweight filesystem checkpointing, such as btrfs provides, would seem
like a powerful mechanism for handling a lot of the filesystem state
problems. It would have been useful when we did this...
J
--
>>>>> "Jeremy" == Jeremy Fitzhardinge <jeremy@goop.org> writes: Jeremy> * multiple processes * pipes * UNIX domain sockets * INET Jeremy> sockets (both inter and intra machine) * unlinked open files * Jeremy> checkpointing file content * closed files (ie, files which Jeremy> aren't currently open, but will be soon, esp tmp files) * Jeremy> shared memory * (Peter, what have I forgotten?) File sharing; multiple threads with wierd sharing arrangements (think: clone with various parameters, followed by exec in some of the threads but not others); MERT/system-V shared memory, semaphores and message queues; devices (audio, framebuffer, etc), HugeTLBFS, numa issues (pinning, memory layout), processes being debugged (so, checkpoint.restart a gdb/target pair), futexes, etc., etc. Linux process state keeps expanding. Jeremy> Having gone through this before, I don't think an all-kernel Jeremy> solution can work except for the most simple cases. I agree ... it's better to put mechanisms into the kernel that can then be used by a user-space programme to actually do the checkpointing and restarting. Beefing up ptrace or fixing /proc to be a real debugging interface would be a start ... when you can get at *all* the info you need, quickly and easily, the userspace checkpoint falls out fairly naturally. You still have to work out an extensible file format to store stuff, and how to restore all that state you've so lovingly collected. Jeremy> Lightweight filesystem checkpointing, such as btrfs provides, Jeremy> would seem like a powerful mechanism for handling a lot of the Jeremy> filesystem state problems. It would have been useful when we Jeremy> did this... And how! saving bits of files was very timeconsuming. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au http://www.ertos.nicta.com.au ERTOS within National ICT Australia --
Except we don't really want to export all the info you need for a complete restartable checkpoint. And especially not make it generally writable. We have also started down that path using ptrace (see cryo, at git://git.sr71.net/~hallyn/cryodev.git). Right before the containers mini-summit, where the general agreement was that a complete in-kernel solution ought to be pursued, I had tried a restart using a binary format that read a checkpoint file and used cryo (userspace using ptrace) for the rest of the restart, only because there was no other reasonable way to set tsk->did_exec on Yes, we're looking forward to using btrfs' snapshots :) -serge --
That and unless we get a lot of synergy from authors of debuggers and debugging code it is a more general and slower interface for

Can we please describe this as the giant syscall approach, instead of a complete in-kernel solution? There are things like filesystems that should be checkpointed separately, or not checkpointed at all. However there is a large set of processes and process state that always goes together and that you always want if you checkpoint a container. So building something that is roughly equivalent to a binfmt module but that can save and restore multiple tasks with a single operation

Yep. And in the case of migration we don't even need to snapshot a filesystem, just mount it on the target machine. Except for the unlinked files challenge.

Eric --
Yep, these are all challenges. If you have some really specific questions, or things you truly think can't be done, please speak up. But, I really don't see any show stoppers in your list. We also plan to do this incrementally. The first consumers are likely to be dumb, simple HPC apps that don't have real hardware like audio or video. Eventually, we'll get to real hardware like infiniband (ugh) or audio. Eventually. (Actually futexes aren't that bad because they don't keep state in-kernel) -- Dave --
Yes, that's exactly it. We're diverging from discussing the important bits as it is, and I think we'd do that more and more with extra Amen to that. I won't speak for the rest of the whackos interested in So, there's a lot of stuff there. The networking stuff is way out of my league, so I'll cc Daniel and make him answer. :) All of the other stuff has been done in various in-kernel implementations. OpenVZ, IBM's Metacluster, Zap (Oren's work at Columbia). Most of it *can* be done from userspace, but some of it is very painful. There are some good OLS papers describing most of these things. Zap might have had one or two academic papers written about it. Maybe. ;) Unlinked files, for instance, are actually available in /proc. You can freeze the app, write a helper that opens /proc/1234/fd, then copies its contents to a linked file (ooooh, with splice!) Anyway, if we can do it in userspace, we can surely do it in the kernel. I'm not sure what you mean by "closed files". Either the app has a fd, it doesn't, or it is in sys_open() somewhere. We have to get the app into a quiescent state before we can checkpoint, so we basically just say that we won't checkpoint things that are *in* the kernel. Is there anything specific you are thinking of that particularly worries Yes. We all want different things, and there are a lot of people interested in this stuff. So, I think all of what you've mentioned above are goals, at least long term. Some, *really* long term. I don't want to get into a full virtualization vs. containers debate, but we also want it for all the same reasons that you migrate Xen Yup. We were just chatting about that with some filesystem folks last week. But, as the OpenVZ dudes like to mention, the poor man's way of moving filesystem snapshots around is always rsync. -- Dave --
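The userspace trick for unlinked files mentioned above (open the entry under /proc/<pid>/fd and copy its contents to a linked file) might look roughly like this. The helper name relink_copy() is hypothetical, a plain read/write loop is used instead of splice for brevity, and the sketch assumes Linux with /proc mounted, a task that is already frozen, and a caller with permission to open the target's descriptors.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the contents of descriptor `fd` of process `pid` into a new, linked
 * file at `dst`, going through /proc/<pid>/fd/<fd>. This works even if the
 * original name has been unlinked. Returns 0 on success, -1 on error.
 */
int relink_copy(pid_t pid, int fd, const char *dst)
{
	char path[64];
	char buf[4096];
	ssize_t n;
	int in, out, ret = 0;

	assert(dst != NULL);
	snprintf(path, sizeof(path), "/proc/%ld/fd/%d", (long)pid, fd);

	/* Opening the /proc entry yields a fresh description at offset 0. */
	in = open(path, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}
	while ((n = read(in, buf, sizeof(buf))) > 0) {
		if (write(out, buf, (size_t)n) != n) {
			ret = -1;	/* short write or error */
			break;
		}
	}
	if (n < 0)
		ret = -1;
	close(in);
	if (close(out) < 0)
		ret = -1;
	return ret;
}
```

The copy captures only the file contents. The original descriptor's offset and flags would have to be recorded separately, and the target must stay frozen while the copy runs.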
Inter-machine networking stuff is hard because it's outside the
checkpointed set, so the checkpoint is observable. Migration is easier,
in principle, because you might be able to shift the connection endpoint
without bringing it down. Dealing with networking within your
checkpointed set is just fiddly, particularly remembering and restoring
all the details of things like urgent messages, on-the-fly file
Sure, there's no inherent problem. But do you imagine including the
file contents within your checkpoint image, or would they be saved
It's common for an app to write a tmp file, close it, and then open it a
bit later expecting to find the content it just wrote. If you
checkpoint-kill it in the interim, reboot (clearing out /tmp) and then
resume, then it will lose its tmp file. There's no explicit connection
between the process and its potential working set of files. We had to
deal with it by setting a bunch of policy files to tell the
checkpoint/restart system what filename patterns it had to look out
for. But if you just checkpoint the whole filesystem state along with
So, in other words: whoever wants to work on it gets to define (their)
No, I don't have any real opinion about containers vs virtualization. I
think they're quite distinct solutions for distinct problems.
But I was involved in the design and implementation of a
checkpoint-restart system (along with Peter Chubb), and have the scars
to prove it. We implemented it for IRIX; we called it Hibernator, and
licensed it to SGI for a while (I don't remember what name they marketed
it under). The list of problems that Peter and I mentioned are ones we
had to solve (or, in some cases, failed to solve) to get a workable system.
J
--
All true. Hard stuff. The IBM product works partly by limiting migrations to occurring on a single physical ethernet network. Each container gets its own IP and MAC address. The socket state is checkpointed quite fully and moved Me, personally, I think I'd probably "re-link" the thing, mark it as such, ship it across like a normal file, then unlink it after the I respectfully disagree. The number one prerequisite for checkpoint/restart is isolation. Xen just happens to get this for free. So, instead of saying that there's no explicit connection between the process and its working set, ask yourself how we make a connection. In this case, we can do it with a filesystem (mount) namespace. Each container that we might want to checkpoint must have its writable filesystems contained to a private set that are not shared with other containers. Things like union mounts would help here, but aren't Right. We just start with "everybody has their own disk" which is slow It's almost as big of a problem as trying to virtualize entire machines Cool! I didn't know you guys did the IRIX implementation. I'm sure you guys got a lot farther than any of us are. Did you guys ever write any papers or anything on it? I'd be interested in more information. -- Dave --
We were dealing with checkpointing random sets of processes, and that posed all sorts of problems. Filesystem namespace was one, the pid namespace was another. Doing checkpointing at the container-level No, it's much harder. Hardware is relatively simple and immutable Yeah, there was a paper, but it looks like the internet has lost it. It was at http://www.csu.edu.au/special/conference/apwww95/.papers95/cmaltby/cmaltby.ps http://www.csu.edu.au/special/conference/apwww95/sept-all.html has mention of the paper. J --
Re-linking works well when the file system supports that - some do not allow this, in which case you need to silently rename instead of really un-linking (even with NFS), or copy the entire contents. Of course, you also need a snapshot of the file system in case it changes after the checkpoint is taken, or take other measures. We can safely Yep. [SNIP] Oren. --
Yeah, it will certainly be fs-dependent. This might be a good application for splice.
	new_fd = open("/tmp/linked-newfile", O_WRONLY | O_CREAT, perms);
	splice(unlinked_fd, NULL, new_fd, NULL, INT_MAX, SPLICE_F_MOVE);
I'm not sure if it can re-use the blocks on the fs for this, but it probably doesn't matter.
-- Dave
--
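A user-space sketch of the re-link idea, for illustration only. splice(2) requires one of its two descriptors to be a pipe, so this version bounces the data through an intermediate pipe instead of splicing file to file directly. The helper name relink_copy() and the 64 KiB chunk size are assumptions, not anything from the patch set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy everything readable from src_fd (e.g. an unlinked file that a frozen
 * task still holds open) into a newly created, linked file at dst_path.
 * Returns 0 on success, -1 on error. Hypothetical helper.
 */
static int relink_copy(int src_fd, const char *dst_path, mode_t perms)
{
	int pipefd[2];
	int dst_fd;
	int ret = -1;

	assert(dst_path != NULL);

	/* O_EXCL: refuse to clobber a file that already exists. */
	dst_fd = open(dst_path, O_WRONLY | O_CREAT | O_EXCL, perms);
	if (dst_fd < 0)
		return -1;
	if (pipe(pipefd) < 0)
		goto out_close_dst;

	for (;;) {
		/* Stage 1: source file -> pipe, starting at src_fd's offset. */
		ssize_t in = splice(src_fd, NULL, pipefd[1], NULL,
				    65536, SPLICE_F_MOVE);
		if (in < 0)
			goto out;
		if (in == 0)
			break;		/* end of file */

		/* Stage 2: drain the pipe into the destination file. */
		while (in > 0) {
			ssize_t out = splice(pipefd[0], NULL, dst_fd, NULL,
					     (size_t)in, SPLICE_F_MOVE);
			if (out <= 0)
				goto out;
			in -= out;
		}
	}
	ret = 0;
out:
	close(pipefd[0]);
	close(pipefd[1]);
out_close_dst:
	close(dst_fd);
	return ret;
}
```

Whether the filesystem can share blocks between the two files is a separate question; this sketch only moves the bytes.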
I'm trying to figure out this patch set...here's a few things which This seems like a clunky and error-prone interface - why not just have it allocate the memory always? But, in this case, cr_get_fname() always seems to be called with ctx->tbuf, which, in turn, is an order-1 allocation. Here you're saying that if it's too small, you'll try replacing it with an This magic number is hard-coded in a number of places. Could it maybe This function is going to break every time somebody changes struct task_struct. I'm not quite sure how to prevent that. I wonder if the modversions stuff could somehow be employed to detect changes and make the Like others, I wondered why CAP_SYS_ADMIN was required here. I *still* wonder, though, how you'll ever be able to do restart without a privilege check. There must be a thousand ways to compromise a system by messing Should you maybe check for write access? An attempt to overwrite a read-only file won't succeed, but you could save a lot of work by just failing it with a clear code here. What about the file position? Perhaps there could be a good reason to checkpoint a process into the middle of a file, don't know. In general, I don't see a whole lot of locking going on. Is it really possible to save and restore memory without ever holding mmap_sem? jon --
Yeah, it doesn't make much sense on the surface. I would imagine that
this has some use for when we're stacking things up in the ctx->hbuf
rather than just using it as a completely temporary buffer. But, in any
case, it doesn't make sense as it stands now, so I think it needs to be
In general, I think any time that we are checkpointing $THING and $THING
changes, the checkpoint will break. It just so happens that all we're
checkpointing here is the task_struct, so $THING == task_struct for
now. :)
The things that *really* worry me are things like when flags change
semantics subtly. Or, let's say a flag is used for two different things
in 2.6.26.4 vs 2.6.27. I'm not sure we're ever going to be in a
position to find and fix up stuff like that.
That's one reason I have been advocating doing checkpoint/restart in
much tinier bits so that we can understand each of them as we go along.
As with everything else coming from userspace, the checkpoint file
should be completely untrusted. I do think, though, that the ptrace
That's true. I'll take a look and see.
This patch does reach down and use vfs_write() at some point. There
really aren't any other in-kernel users that do this (short of ecryptfs
and plan9fs). That makes me doubt that we're even using a good approach
I think this is a good example of a place where the kernel can let
userspace shoot itself in its foot if it wants. We might also want to
allow things to be sent over fds that don't necessarily have positions,
I personally haven't audited the locking, yet. It is going to be fun!
But, take a look in patch 3/4:
+	/* write the vma's */
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if ((ret = cr_write_vma(ctx, vma)) < 0)
+			break;
+	}
+	up_read(&mm->mmap_sem);
Thanks for the review, Jonathan!
-- Dave
--
Dave is right on the money: in Zap (the equivalent of) cr_get_fname() may be called with a buffer smaller than PATH_MAX (one page) and hence the need to allocate ad-hoc. Indeed in the current code this is not the
One way to reduce the risk is to use an intermediate representation of kernel-native data and properties (e.g. classify VMAs during checkpoint instead of relying blindly on the flags). The problem is not so much in restarting a checkpoint image from an old kernel on a new kernel - that can be handled by conversion in user space. Tracking changes affecting the checkpoint/restart logic - well, if eventually checkpoint/restart gets to become mainstream enough that
The only reason I made the analogy without actually implementing it is lack
There is some optimistic locking (mmap_sem), improved in the next version. Thanks, Oren. --
Sorry for probably being out-of-date again, but isn't it better to put these headers in include/linux and export them to user space? Why? Because we'll need some image-dumping tool (let alone the image-converting one for compatibility purposes), and these tools would need to know what the image looks like. Thanks, Pavel --
What's the deal with headers being exported these days? Don't we always have to sanitize them before we ship them over to userspace anyway? -- Dave --
The original version of Oren's patch contained a good hunk of #ifdefs. I've extracted all of those and created a bit of an API for new architectures to follow. Leaving Oren's sign-off because this is all still his code, even though he hasn't seen it mangled like this before. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- linux-2.6.git-dave/ckpt/Makefile | 1 linux-2.6.git-dave/ckpt/checkpoint.c | 7 linux-2.6.git-dave/ckpt/ckpt_arch.h | 6 linux-2.6.git-dave/ckpt/restart.c | 7 linux-2.6.git-dave/ckpt/x86.c | 269 ++++++++++++++++++++++++++++++ linux-2.6.git-dave/include/asm-x86/ckpt.h | 46 +++++ 6 files changed, 336 insertions(+) diff -puN ckpt/checkpoint.c~x86_part ckpt/checkpoint.c --- linux-2.6.git/ckpt/checkpoint.c~x86_part 2008-08-04 13:29:59.000000000 -0700 +++ linux-2.6.git-dave/ckpt/checkpoint.c 2008-08-04 13:29:59.000000000 -0700 @@ -19,6 +19,7 @@ #include "ckpt.h" #include "ckpt_hdr.h" +#include "ckpt_arch.h" /** * cr_get_fname - return pathname of a given file @@ -183,6 +184,12 @@ static int cr_write_task(struct cr_ctx * ret = cr_write_task_struct(ctx, t); CR_PRINTK("ret (task_struct) %d\n", ret); + if (!ret) + ret = cr_write_thread(ctx, t); + CR_PRINTK("ret (thread) %d\n", ret); + if (!ret) + ret = cr_write_cpu(ctx, t); + CR_PRINTK("ret (cpu) %d\n", ret); return ret; } diff -puN /dev/null ckpt/ckpt_arch.h --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-04 13:29:59.000000000 -0700 @@ -0,0 +1,6 @@ +#include "ckpt.h" + +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t); +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t); +int cr_read_thread(struct cr_ctx *ctx); +int cr_read_cpu(struct cr_ctx *ctx); diff -puN ckpt/Makefile~x86_part ckpt/Makefile --- linux-2.6.git/ckpt/Makefile~x86_part 2008-08-04 13:29:59.000000000 -0700 +++ linux-2.6.git-dave/ckpt/Makefile 2008-08-04 ...
It seems weird that you use __u64 members for the registers, but don't include r8..r15 in the list. As a consequence, this structure does not seem well suited for either x86-32 or x86-64. I would suggest either using struct pt_regs by reference, or defining it so that you can use the same structure for both 32 and 64 bit x86. Arnd <>< --
Hi,
Thanks for the feedback.
The proof-of-concept is written for 32-bit x86, keeping in mind that
we'll need 64-bit support. My goal is to leverage feedback
and contributions to get support for 64 bits and other architectures
as well.
In the context of CR, x86-32 and x86-64 are distinct architectures because
you cannot always migrate from one to the other (though 32->64 is sometimes
possible). Therefore, each architecture can have a separate checkpoint file
format (e.g. r8..r15 only for x86-64).
The information about the kernel configuration, version and cpu settings
will appear on the header; so the restart code will know the architecture
on which the checkpoint had been taken.
So if we want to restart a task checkpointed on x86-32 on an x86-64 machine
(in 32 bit mode), the code will know not to expect that data (r8..r15).
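A minimal sketch of how restart code might use the architecture recorded in the image header. The tag values, struct cr_image_hdr and both helper names are hypothetical; the real header described above also carries kernel version, configuration and cpu settings. The rule "32 -> 64 may work, 64 -> 32 does not" is taken from the discussion, and the sketch treats the "sometimes possible" case as simply allowed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical architecture tags stored in the image header. */
enum cr_arch {
	CR_ARCH_X86_32 = 1,
	CR_ARCH_X86_64 = 2
};

/* Hypothetical header fragment; only the field this sketch needs. */
struct cr_image_hdr {
	uint32_t arch;	/* architecture the checkpoint was taken on */
	uint32_t pad;	/* explicit padding, keeps the size fixed */
};

static_assert(sizeof(struct cr_image_hdr) == 8, "fixed-size header fragment");

/* Does the CPU record of this image carry r8..r15? */
static int cr_image_has_r8_r15(const struct cr_image_hdr *h)
{
	return h->arch == CR_ARCH_X86_64;
}

/*
 * May a host of type 'host' attempt to restart this image?
 * Same architecture: yes. x86-32 image on an x86-64 host: yes, in
 * 32-bit mode (simplified; the thread says this is only sometimes
 * possible). x86-64 image on an x86-32 host: no.
 */
static int cr_may_restart(const struct cr_image_hdr *h, enum cr_arch host)
{
	if (h->arch == (uint32_t)host)
		return 1;
	return h->arch == CR_ARCH_X86_32 && host == CR_ARCH_X86_64;
}
```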
Except for this special case (32 bit running on 64 bit), simple conversion can
be done in the kernel if needed, but most conversion of the
format between different kernel versions (should it change) can be done in
We prefer not to use the kernel structure directly, but an intermediate
structure that can help mitigate subtle incompatibility issues (between
kernel configurations, versions, and even compiler versions).
Anyway, we can use either a single structure for both 32 and 64 bit x86, or separate
"struct cr_hdr_cpu{_32,_64}", one for each architecture.
Oren.
--
The 32bit on 64bit case is quite common on non-x86 architectures, e.g. powerpc or sparc, where 64 bit kernels typically run 32 bit user space. A particularly interesting case is mixing 32 and 64 bit tasks in a container that you are checkpointing. This is a very realistic scenario, so there may be good arguments for keeping the format identical between the variations struct pt_regs is part of the kernel ABI, it will not change. Arnd <>< --
The idea was that x86-32 checkpoints can be restarted on an x86-64 node in
I'm in favor of keeping the format identical between the variations of each architecture. Note, however, that "struct pt_regs" won't do because it may change with these variations. So we'll take care of the padding and add r8..r15 in the next version. Oren. --
"Part of the kernel ABI" makes it sound to me like it won't change. Who's right here? :) -- Dave --
hehehe .. both; I meant that while it doesn't change per architecture, it varies between architectures. So "struct pt_regs" compiled for x86-32 is different than that compiled for x86-64. Therefore we can't just dump the structure as is and expect that 64 bit would be able to parse the 32 bit. In other words, we need an intermediate representation. Oren. --
Surely we already handle this, though. Don't we allow a 32-bit app running on a 64-bit kernel to PTRACE_GETREGS and get the 32-bit version? A 64-bit app will get the 64-bit version making the same syscall. It's all handled in the syscall compatibility code. -- Dave --
Sure, that's a compatibility layer around ptrace() in the 64-bit kernel.
Recall that Arnd suggested "keeping the format identical between the
variations of each architecture", and I fully agree. If we want to keep
the format identical, we can't simply define:
	struct cr_hdr_cpu {
		struct pt_regs regs;
		...
	};
because that will compile differently on x86-32 and x86-64. So either we
add r8..r15 to the structure as it appears in the patch now (and keep the
format identical), or allow the format to vary, and explicitly test for
this case and add a compatibility layer. Personally I prefer the former.
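To make the "former" option concrete, here is a sketch of a fixed-width record that has the same byte layout whether it is compiled for 32 or 64 bits. struct regs32 is a stand-in for a 32-bit register set (it is not the kernel's struct pt_regs), and struct cr_hdr_cpu_sketch and its field list are assumptions, not the layout from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a 32-bit register set; NOT the kernel's struct pt_regs. */
struct regs32 {
	uint32_t bx, cx, dx, si, di, bp, ax, ip, flags, sp;
};

/*
 * Hypothetical intermediate record. Every member is 64 bits wide, so
 * there is no compiler-dependent padding and the layout is identical
 * in a 32-bit and a 64-bit build.
 */
struct cr_hdr_cpu_sketch {
	uint64_t bx, cx, dx, si, di, bp, ax, ip, flags, sp;
	uint64_t r8_r15[8];	/* left as zero for a 32-bit task */
};

static_assert(sizeof(struct cr_hdr_cpu_sketch) == 18 * sizeof(uint64_t),
	      "record must not contain hidden padding");

/* Widen a 32-bit register set into the common record. */
static void cr_fill_cpu_from_regs32(struct cr_hdr_cpu_sketch *out,
				    const struct regs32 *in)
{
	memset(out, 0, sizeof(*out));	/* also clears r8..r15 */
	out->bx = in->bx;
	out->cx = in->cx;
	out->dx = in->dx;
	out->si = in->si;
	out->di = in->di;
	out->bp = in->bp;
	out->ax = in->ax;
	out->ip = in->ip;
	out->flags = in->flags;
	out->sp = in->sp;
}
```

The cost is some wasted space for 32-bit tasks; the gain is that a reader never has to guess which variant produced the record.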
Oren.
--
Struct pt_regs is not ABI, and can change (and has changed) on x86. It's not
suitable for a checkpoint structure because it only contains the
registers that the kernel trashes, not all usermode registers (on i386,
it leaves out %gs, for example). asm-x86/ptrace-abi.h does define stuff
that's fixed in stone; it expresses it in terms of a register array,
with constants defining what element is which register.
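The "register array plus index constants" style can be sketched as below. The CR_REG_* names, their order and the accessor helpers are purely illustrative; they are not the constants that asm-x86/ptrace-abi.h defines.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical indices; illustrative only, not the ptrace-abi.h values. */
enum cr_reg_idx {
	CR_REG_BX, CR_REG_CX, CR_REG_DX, CR_REG_SI, CR_REG_DI, CR_REG_BP,
	CR_REG_AX, CR_REG_IP, CR_REG_FLAGS, CR_REG_SP,
	CR_REG_GS,	/* a user-mode register that a partial set could miss */
	CR_REG_COUNT
};

/* The saved state is just an array; the constants give it meaning. */
struct cr_regs_array {
	uint64_t r[CR_REG_COUNT];
};

static_assert(sizeof(struct cr_regs_array) == CR_REG_COUNT * sizeof(uint64_t),
	      "array must be densely packed");

/* Store one register; returns -1 for an out-of-range index. */
static int cr_reg_set(struct cr_regs_array *a, unsigned int idx, uint64_t v)
{
	if (idx >= CR_REG_COUNT)
		return -1;
	a->r[idx] = v;
	return 0;
}

/* Fetch one register; returns -1 for an out-of-range index. */
static int cr_reg_get(const struct cr_regs_array *a, unsigned int idx,
		      uint64_t *v)
{
	if (idx >= CR_REG_COUNT)
		return -1;
	*v = a->r[idx];
	return 0;
}
```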
J
--
Thanks for the explanation. I just want to reduce the coding and maintenance burden here. Xen must do this for partition mobility, right? Does it define all its own stuff? -- Dave --
You mean save/restore/migrate? Yes, it defines all its own stuff.
Checkpoint-resume on a whole VM is a rather simpler operation than a
subset of processes.
J
--
Fair enough. How about making the layout in that structure identical to the 64-bit pt_regs though? I don't know if we need that at any time, but my feeling is that it is nicer than a slightly different random layout, e.g. if someone wants to extend gdb to look at checkpointed process dumps. Arnd <>< --
For each vma, there is a 'struct cr_vma'; if the vma is file-mapped,
it will be followed by the file name. The cr_vma->npages will tell
how many pages were dumped for this vma. Then it will be followed
by the actual data: first a dump of the addresses of all dumped
pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next vma and so on.
I guess I could also separate out the x86-specific bits here, but
they're pretty small, comparatively.
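The per-vma layout described above (header, optional file name, npages addresses, then npages pages of data) can be sketched as a user-space serializer into a memory buffer. struct cr_vma_sketch is a simplified stand-in: the fname_len field, the helper names and the 4096-byte page size are assumptions, and the patch's real struct cr_vma has more fields.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical, simplified per-vma header. */
struct cr_vma_sketch {
	uint64_t vm_start;
	uint64_t vm_end;
	uint32_t npages;	/* number of pages dumped for this vma */
	uint32_t fname_len;	/* 0 for an anonymous mapping */
};

/* Bytes one record occupies: header, name, address table, page data. */
static size_t cr_vma_record_size(const struct cr_vma_sketch *v)
{
	return sizeof(*v) + v->fname_len +
	       (size_t)v->npages * sizeof(uint64_t) +
	       (size_t)v->npages * SK_PAGE_SIZE;
}

/*
 * Append one vma record to buf in the order given in the description.
 * Returns the number of bytes written, or 0 if the record does not fit.
 */
static size_t cr_vma_emit(unsigned char *buf, size_t cap,
			  const struct cr_vma_sketch *v, const char *fname,
			  const uint64_t *addrs, const unsigned char *pages)
{
	size_t off = 0;

	if (cr_vma_record_size(v) > cap)
		return 0;

	memcpy(buf + off, v, sizeof(*v));
	off += sizeof(*v);
	if (v->fname_len) {			/* file-mapped: name follows */
		memcpy(buf + off, fname, v->fname_len);
		off += v->fname_len;
	}
	if (v->npages) {
		size_t tbl = (size_t)v->npages * sizeof(uint64_t);
		size_t data = (size_t)v->npages * SK_PAGE_SIZE;

		memcpy(buf + off, addrs, tbl);	/* all addresses first */
		off += tbl;
		memcpy(buf + off, pages, data);	/* then all page contents */
		off += data;
	}
	return off;
}
```

A reader walks the image the same way: read the header, skip fname_len bytes of name, read npages addresses, then npages pages, and the next vma record starts immediately after.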
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
linux-2.6.git-dave/arch/x86/kernel/ldt.c | 2
linux-2.6.git-dave/ckpt/Makefile | 2
linux-2.6.git-dave/ckpt/ckpt_arch.h | 2
linux-2.6.git-dave/ckpt/ckpt_hdr.h | 21 +
linux-2.6.git-dave/ckpt/ckpt_mem.c | 388 ++++++++++++++++++++++++++++++
linux-2.6.git-dave/ckpt/ckpt_mem.h | 32 ++
linux-2.6.git-dave/ckpt/rstr_mem.c | 354 +++++++++++++++++++++++++++
linux-2.6.git-dave/ckpt/sys.c | 3
linux-2.6.git-dave/ckpt/x86.c | 83 ++++++
linux-2.6.git-dave/include/asm-x86/ckpt.h | 5
linux-2.6.git-dave/include/asm-x86/desc.h | 3
11 files changed, 892 insertions(+), 3 deletions(-)
diff -puN arch/x86/kernel/ldt.c~memory_part arch/x86/kernel/ldt.c
--- linux-2.6.git/arch/x86/kernel/ldt.c~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kernel/ldt.c 2008-08-05 08:38:00.000000000 -0700
@@ -183,7 +183,7 @@ static int read_default_ldt(void __user
return bytecount;
}
-static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
+int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
{
struct mm_struct *mm = current->mm;
struct desc_struct ldt;
diff -puN ckpt/ckpt_arch.h~memory_part ckpt/ckpt_arch.h
--- linux-2.6.git/ckpt/ckpt_arch.h~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-05 08:37:29.000000000 -0700
@@ ...
Another structure that is not 32/64 bit ABI safe on x86. It would be safe
if you reorder the members as
struct cr_hdr_mm {
	__u32 tag;	/* sharing identifier */
	__s16 map_count;
	__u16 pad;	/* not actually needed, but better to make it explicit */
	__u64 start_code, end_code, start_data, end_data;
	__u64 start_brk, brk, start_stack;
	__u64 arg_start, arg_end, env_start, env_end;
};
same here.
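The effect of that reordering can be checked at compile time in user space. In this sketch the kernel's __u32/__s16/__u16/__u64 are replaced by their stdint.h equivalents, and the _sketch suffix marks the struct as an illustration rather than the patch's definition. Because the three small members fill exactly 8 bytes, every 64-bit member lands on an 8-byte boundary regardless of whether the ABI aligns 64-bit integers to 4 bytes (i386) or 8 bytes (x86-64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reordered layout as suggested above, using user-space fixed-width types. */
struct cr_hdr_mm_sketch {
	uint32_t tag;		/* sharing identifier */
	int16_t  map_count;
	uint16_t pad;		/* explicit, so no implicit padding is needed */
	uint64_t start_code, end_code, start_data, end_data;
	uint64_t start_brk, brk, start_stack;
	uint64_t arg_start, arg_end, env_start, env_end;
};

/* 4 + 2 + 2 bytes of small members, then eleven 64-bit members. */
static_assert(offsetof(struct cr_hdr_mm_sketch, start_code) == 8,
	      "first 64-bit member must start at offset 8");
static_assert(sizeof(struct cr_hdr_mm_sketch) == 8 + 11 * sizeof(uint64_t),
	      "no trailing or interior padding expected");
```

If a 64-bit member instead followed a lone 32-bit member, an 8-byte-aligning ABI would insert 4 bytes of padding that a 4-byte-aligning ABI would not, and the two builds would disagree about every later offset.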
Arnd <><
--
From: Oren Laadan <orenl@cs.columbia.edu>
Create trivial sys_checkpoint and sys_restart system calls. They will make it
possible to checkpoint and restart an entire container, to and from a
checkpoint image file.
First create a template for both syscalls: they take a file descriptor (for the
image file) and flags as arguments. For sys_checkpoint the first argument
identifies the target container; for sys_restart it will identify the
checkpoint image.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S | 2 ++
linux-2.6.git-dave/include/asm-x86/unistd_32.h | 2 ++
2 files changed, 4 insertions(+)
diff -puN arch/x86/kernel/syscall_table_32.S~introduce_sys_checkpoint_and_sys_restore arch/x86/kernel/syscall_table_32.S
--- linux-2.6.git/arch/x86/kernel/syscall_table_32.S~introduce_sys_checkpoint_and_sys_restore 2008-08-07 15:38:04.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S 2008-08-07 15:38:04.000000000 -0700
@@ -326,3 +326,5 @@ ENTRY(sys_call_table)
.long sys_fallocate
.long sys_timerfd_settime /* 325 */
.long sys_timerfd_gettime
+ .long sys_checkpoint
+ .long sys_restart
diff -puN include/asm-x86/unistd_32.h~introduce_sys_checkpoint_and_sys_restore include/asm-x86/unistd_32.h
--- linux-2.6.git/include/asm-x86/unistd_32.h~introduce_sys_checkpoint_and_sys_restore 2008-08-07 15:38:04.000000000 -0700
+++ linux-2.6.git-dave/include/asm-x86/unistd_32.h 2008-08-07 15:38:04.000000000 -0700
@@ -332,6 +332,8 @@
#define __NR_fallocate 324
#define __NR_timerfd_settime 325
#define __NR_timerfd_gettime 326
+#define __NR_checkpoint 327
+#define __NR_restart 328
#ifdef __KERNEL__
diff -puN Makefile~introduce_sys_checkpoint_and_sys_restore Makefile
_
--
System calls should also be declared in include/linux/syscalls.h. I guess you are aware that this implementation is not enough to support 32 bit tasks on x86_64. In addition to the native 64-bit code, you would also need the 32-bit compat code here. Arnd <>< --
Yes, of course. The current code does not attempt to do that yet. Oren. --
Note that asm/unistd_32.h is not portable, you should use asm/unistd.h
Interface-wise, I would consider checkpointing yourself significantly different from checkpointing some other thread. If checkpointing yourself is the common case, it probably makes sense to allow passing of pid=0 for this. Arnd <>< --
I don't think it is the common case. Probably now when we're screwing around with it, but not in the future. Do you think it is worth adding the pid=0 handling? -- Dave --
If it's the exception, probably not. Otherwise it would be a nice shortcut to avoid having to do two system calls every time you write code using it. Then again, there are probably not many programs calling it anyway, if it gets encapsulated in some user space tool. Arnd <>< --
Hi, Thanks. This is a proof of concept so all sorts of feedback are definitely welcome. Some of the ideas and discussions are found around: http://wiki.openvz.org/Containers/Mini-summit_2008 and the notes: http://wiki.openvz.org/Containers/Mini-summit_2008_notes and the archives of the linux containers mailing list: https://lists.linux-foundation.org/pipermail/containers/ (August and July). Several aspects of the implementation are still experimental and I expect them to evolve with the feedback. In particular, expect the specific user interface (syscalls) and the checkpoint image
The checkpoint/restart code is meant to checkpoint a whole container, that is, to be able to save the state of multiple other tasks. The same code can also be used to checkpoint yourself fairly easily with minimal changes (see comments in the code about "in context" checkpoint/restart that take care of this). I suggest keeping the interface as is, in the sense that the pid will identify the target container (e.g. the pid of the init process of that container). Then, pid=0 would mean "the container to which I belong" if you are inside a container (and therefore don't know the pid of the init process there). Finally, to checkpoint yourself, you would set a bit in the flags argument to something like CR_CKPT_MYSELF. Such a flag will be needed internally anyway to special-case self checkpoint where appropriate. Comments are welcome. Oren. --
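The proposed argument semantics can be written down as a small decision function. This is only a sketch of the rules stated above: the flag name CR_CKPT_MYSELF comes from the message, but its bit value, the enum and the function name are assumptions, and nothing here calls the (unmerged) system calls.

```c
#include <assert.h>
#include <sys/types.h>

/* Flag name from the discussion; the bit value is an assumption. */
#define CR_CKPT_MYSELF 0x1UL

/* Hypothetical classification of what a checkpoint request targets. */
enum cr_target_kind {
	CR_TARGET_CONTAINER,	 /* pid names the init task of a container */
	CR_TARGET_OWN_CONTAINER, /* pid == 0: "the container I belong to" */
	CR_TARGET_SELF,		 /* CR_CKPT_MYSELF: checkpoint the caller */
	CR_TARGET_INVALID
};

/* Decide what (pid, flags) refer to under the proposed interface. */
static enum cr_target_kind cr_classify_target(pid_t pid, unsigned long flags)
{
	if (pid < 0)
		return CR_TARGET_INVALID;
	if (flags & CR_CKPT_MYSELF)
		return CR_TARGET_SELF;	/* the flag takes precedence */
	if (pid == 0)
		return CR_TARGET_OWN_CONTAINER;
	return CR_TARGET_CONTAINER;
}
```

Keeping self-checkpoint behind an explicit flag, instead of overloading pid=0, leaves pid=0 free for the "my own container" case.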
