Re: [RFC][PATCH 0/4] kernel-based checkpoint restart

From: Dave Hansen
Date: Thursday, August 7, 2008 - 3:40 pm

These patches are from Oren Laadan.  I've refactored them
a bit to make them a wee bit more reviewable.  I think this
separates out the per-arch bits pretty well.  It should also
be at least build-bisectable.

If there are no objections to this general approach, then we
plan to start submitting these bits to -mm.

--

At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped.

--
Oren's original announcement


In the recent mini-summit at OLS 2008 and the following days it was
agreed to tackle the checkpoint/restart (CR) by beginning with a very
simple case: save and restore a single task, with simple ...
From: Dave Hansen
Date: Thursday, August 7, 2008 - 3:40 pm

From: Oren Laadan <orenl@cs.columbia.edu>

This patch adds those interfaces, as well as all of the helpers
needed to easily manage the file format.  

The code is roughly broken out as follows:

ckpt/sys.c - user/kernel data transfer, as well as setting up of the
	     checkpoint/restart context (a per-checkpoint data
	     structure for housekeeping)
ckpt/checkpoint.c - output wrappers and basic checkpoint handling
ckpt/restart.c - input wrappers and basic restart handling

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---

 linux-2.6.git-dave/Makefile          |    2 
 linux-2.6.git-dave/ckpt/Makefile     |    1 
 linux-2.6.git-dave/ckpt/checkpoint.c |  207 +++++++++++++++++++++++++++++++
 linux-2.6.git-dave/ckpt/ckpt.h       |   82 ++++++++++++
 linux-2.6.git-dave/ckpt/ckpt_hdr.h   |   69 ++++++++++
 linux-2.6.git-dave/ckpt/restart.c    |  189 ++++++++++++++++++++++++++++
 linux-2.6.git-dave/ckpt/sys.c        |  233 +++++++++++++++++++++++++++++++++++
 7 files changed, 782 insertions(+), 1 deletion(-)

diff -puN /dev/null ckpt/checkpoint.c
--- /dev/null	2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/checkpoint.c	2008-08-07 15:37:22.000000000 -0700
@@ -0,0 +1,207 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <asm/ptrace.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/**
+ * cr_get_fname - return pathname of a given file
+ * @file: file pointer
+ * @buf: buffer for pathname
+ * @n: ...
From: Arnd Bergmann
Date: Friday, August 8, 2008 - 2:46 am

Do you rely on the kernel version in order to determine the format
of the binary data, or is it just informational?

If you think the format can change in incompatible ways, you
probably need something more specific than the version number
these days, because there are just so many different trees with

Please use the existing pr_debug and dev_debug here, instead of creating

This structure has an odd multiple of 32-bit members, which means
that if you put it into a larger structure that also contains
64-bit members, the larger structure may get different alignment
on x86-32 and x86-64, which you might want to avoid.
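
A minimal sketch of the alignment issue, using made-up structure names
rather than the patch's actual cr_hdr_* definitions.  Three 32-bit
members leave the header at 12 bytes, so a following 64-bit member lands
at offset 12 under the i386 ABI (which aligns 64-bit integers to 4
bytes) but at offset 16 on x86-64.  An explicit pad word gives the same
layout on both without resorting to packing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header with an odd number of 32-bit members (12 bytes). */
struct hdr_odd {
	uint32_t type;
	uint32_t len;
	uint32_t id;
};

/* Same header, explicitly padded to a multiple of 64 bits. */
struct hdr_padded {
	uint32_t type;
	uint32_t len;
	uint32_t id;
	uint32_t pad;	/* reserved, always written as zero */
};

/* Embedding the odd header: the offset of 'value' depends on the ABI. */
struct rec_odd {
	struct hdr_odd h;
	uint64_t value;	/* offset 12 on i386, 16 on x86-64 */
};

/* Embedding the padded header: 'value' sits at offset 16 on both. */
struct rec_padded {
	struct hdr_padded h;
	uint64_t value;
};

/* Offset that differs between the two ABIs. */
static size_t odd_value_offset(void)
{
	return offsetof(struct rec_odd, value);
}

/* Layout properties of the padded variant that hold on both ABIs. */
static void check_padded_layout(void)
{
	assert(sizeof(struct hdr_padded) == 16);
	assert(offsetof(struct rec_padded, value) == 16);
	assert(sizeof(struct rec_padded) == 24);
}
```

Calling check_padded_layout() once at startup (or turning the checks
into build-time assertions) catches an accidental layout change before
any image is written.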

In this case, I'm pretty sure that sizeof(cr_hdr_task) on x86-32 is

Can (ctx->hpos + n > CR_HBUF_TOTAL) be controlled by the input

get_fs()/set_fs() always feels a bit ouch, and this way you have
to use __force to avoid the warnings about __user pointer casts
in sparse.
I wonder if you can use splice_read/splice_write to get around

Why do you need CAP_SYS_ADMIN for this? Can't regular users

The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead?

	Arnd <><
--

From: Dave Hansen
Date: Friday, August 8, 2008 - 11:50 am

Yeah, this is very true.  My guess is that we'll need something like


Can't we just declare all these things __packed__ and stop worrying

Ugh, this is crappy code anyway.  It needs to return an error and have

I have to wonder if this is just a symptom of us trying to do this the
wrong way.  We're trying to talk the kernel into writing internal gunk
into a FD.  You're right, it is like a splice where one end of the pipe
is in the kernel.


Yes, eventually.  I think one good point is that we should probably
remove this now so that we *have* to think about security implications
as we add each individual patch.  For instance, what kind of checking do
we do when we restore an mlock()'d VMA?


Fine with me.  Renamed in new patches, hopefully.

I'll send new patches out later today.

-- Dave

--

From: Oren Laadan
Date: Friday, August 8, 2008 - 1:59 pm

Exactly. The header should eventually contain sufficient information
to describe the kernel version, configuration, compiler, cpu (arch and
capabilities), and checkpoint code version.
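
As a rough userspace illustration of the identification part of such a
header, the sketch below fills a record from uname(2).  The structure
image_id, its field sizes, and the IMAGE_FORMAT_VERSION constant are
hypothetical; the patches define their own header, and a real check
would also have to cover configuration and CPU capabilities, which
uname() does not report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

#define IMAGE_FORMAT_VERSION 1	/* hypothetical checkpoint code version */
#define ID_LEN 65		/* hypothetical fixed field size */

struct image_id {
	unsigned int format;
	char release[ID_LEN];
	char version[ID_LEN];
	char machine[ID_LEN];
};

/* Record the running kernel's identity.  Returns 0 on success, -1 on error. */
static int image_id_fill(struct image_id *id)
{
	struct utsname u;

	assert(id != NULL);
	if (uname(&u) < 0)
		return -1;

	memset(id, 0, sizeof(*id));
	id->format = IMAGE_FORMAT_VERSION;
	/* snprintf() truncates safely if a uname field is longer than ID_LEN. */
	snprintf(id->release, sizeof(id->release), "%s", u.release);
	snprintf(id->version, sizeof(id->version), "%s", u.version);
	snprintf(id->machine, sizeof(id->machine), "%s", u.machine);
	return 0;
}

/* Strict comparison: any difference means "do not restart blindly". */
static int image_id_matches(const struct image_id *a, const struct image_id *b)
{
	return a->format == b->format &&
	       strcmp(a->release, b->release) == 0 &&
	       strcmp(a->version, b->version) == 0 &&
	       strcmp(a->machine, b->machine) == 0;
}
```

A mismatch here would not have to be fatal; it could instead be the
trigger for running an image through a userspace conversion program.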

How would you suggest to identify the origin tree with an identifier

Actually not quite. 'n' is _not_ controlled by the input data, and at
the same time ctx->hpos should always carry enough room by design. If
that is not the case, then it's a logical bug, not DoS attack.

To avoid repetitive malloc/free, ctx->hbuf is a buffer to host headers
as they are read; since headers can be read in a nested manner, ctx->hpos
points to the next free position in that buffer. So 'n' is the size of
the header that we are about to read - decided at compile time, not the
user input. The BUG_ON() statement asserts that by design we have enough
buffer (like you'd check that you didn't run out of kernel stack...)
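
The scheme described above behaves like a small stack allocator.  A
minimal userspace sketch, assuming a fixed-size buffer and using
assert() as a stand-in for BUG_ON(); the names hbuf_ctx, hbuf_get,
hbuf_put and the HBUF_TOTAL size are illustrative and not the patch's
code:

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_TOTAL 256	/* hypothetical total size of the header buffer */

struct hbuf_ctx {
	unsigned char hbuf[HBUF_TOTAL];	/* storage for nested headers */
	size_t hpos;			/* next free position in hbuf */
};

/*
 * Reserve n bytes for a header.  n is a compile-time sizeof() at every
 * call site, so running out of room is a programming error, not an
 * input error.  Callers should keep n a multiple of the strictest
 * alignment they need, since no alignment is added here.
 */
static void *hbuf_get(struct hbuf_ctx *ctx, size_t n)
{
	void *p;

	assert(ctx->hpos + n <= HBUF_TOTAL);
	p = &ctx->hbuf[ctx->hpos];
	ctx->hpos += n;
	return p;
}

/* Release the most recent n bytes; calls must mirror hbuf_get() in reverse. */
static void hbuf_put(struct hbuf_ctx *ctx, size_t n)
{
	assert(n <= ctx->hpos);
	ctx->hpos -= n;
}
```

Because the whole buffer belongs to the context, an error path can
simply drop the context instead of unwinding each reservation.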

If it is preferred, we can change this to write a kernel message and

Hmmm... even if not strictly now, we *will* need admin privileges for
the CR operations, for the following reasons:

checkpoint: we save the entire state of a set of processes to a file - so
we must have privileges to do so, at least within (or with respect to) the
said container. Even if we are the user who owns the container, we'll need
root access within that container.

restart: we restore the entire state of a set of processes, which may require
some privileged operations (again, at least within or with respect to the
said container). Otherwise any user could inject any restart data into the
kernel and create any set of processes with arbitrary permissions.

--

From: Dave Hansen
Date: Friday, August 8, 2008 - 3:17 pm

kmalloc/kfree() are really, really fast.  I wonder if the code gets
easier or harder to read if we just alloc/free as we need to.

How large are these allocations, usually?  Will stack allocation work in
most cases?

-- Dave

--

From: Oren Laadan
Date: Friday, August 8, 2008 - 4:27 pm

The ctx->hbuf interface is a pair of cr_hbuf_get(ctx, length) and a
matching cr_hbuf_put(ctx, length), almost like using kmalloc/kfree().
The main difference is that cleanup in error paths is implicit (the

That depends on how we construct the headers. In Zap there are some
headers that use relatively long structures to be put on the stack,
and it wouldn't make much sense to divide them into smaller headers
artificially.

However, I forgot to mention earlier that an important reason to use
this construct is actually in anticipation for a future optimization:
during application downtime the checkpoint state will be aggregated
into an in-memory buffer, and only after the application is allowed
to continue execution (unfrozen) the buffer will be written-back to
the FD. In that scenario, we will allocate a larger buffer in the ctx
(eg based on some heuristics) and cr_hbuf_get() will return the next
location in that buffer, while cr_hbuf_put() will do nothing.

Oren.

--

From: Arnd Bergmann
Date: Friday, August 8, 2008 - 3:23 pm

Including struct utsname in the header covers most of this. I suppose
you can't do it entirely safely, and you always need to be prepared for
malicious input data, so there probably isn't much point in getting

My recommendation in general is to make kernel code crash loudly if
there is a bug in the kernel itself. Returning error codes makes most
sense if they get sent back to the user, which then can make sense of

Exactly. There was a project that implemented checkpoint/restart through
ptrace (don't remember what it was called), so with certain limitations
it should also be possible to implement the syscalls so that any user that
can ptrace the tasks can also checkpoint them.

	Arnd <><
--

From: Pavel Emelyanov
Date: Thursday, August 14, 2008 - 1:09 am

[Empty message]
From: Dave Hansen
Date: Thursday, August 14, 2008 - 8:16 am

The only problem I can see with this is that you lose efficiency,
especially when you have to build your checkpoint image with lots of
things that are config-specific.

The approach sounds like a good one in theory, but I'm a bit skeptical
that we could stick to it in practice, in a mainline kernel where there
are billions of config options.  It is definitely something to strive
for, though.  Good point!

-- Dave

--

From: Arnd Bergmann
Date: Friday, August 8, 2008 - 3:13 pm

I personally dislike __packed__ because it makes it very easy to get
suboptimal object code. If you either pad every structure to a multiple
of 64 bits or avoid __u64 members, you don't have a problem. Also,
I think avoiding implicit padding inside of data structures is very
helpful for user interfaces, if necessary you can always add explicit

Maybe you can invert the logic and let the new syscalls create a file
descriptor, and then have user space read or splice the checkpoint
data from it, and restore it by writing to the file descriptor.
It's probably easy to do using anon_inode_getfd() and would solve this
problem, but at the same time make checkpointing the current thread

I think the question can be generalized further: How do you deal with
saved tasks that have more privileges than the task doing the restore?

There are probably more, but what I can think of right now includes:
* anything you can set using ulimit
* capabilities
* threads running as another user/group
* open files that have had their permissions changed after the open

	Arnd <><
--

From: Dave Hansen
Date: Friday, August 8, 2008 - 3:26 pm

Yeah, it does seem kinda backwards.  But, instead of even having to
worry about the anon_inode stuff, why don't we just put it in a fs like
everything else?  checkpointfs!

I'm also really not convinced that putting the entire checkpoint in one
glob is really the solution, either.  I mean, is system call overhead
really a problem here?

-- Dave

--

From: Arnd Bergmann
Date: Friday, August 8, 2008 - 3:39 pm

Well, anon_inodes are really easy to use and have replaced some of the
simple non-mountable file systems in the kernel.

checkpointfs sounds interesting and I guess in a plan9 world of fairies
and fantasy, you should be able to create a checkpoint of your system using
'tar czf - /proc/', but I'm not sure it helps here.

The main problem I see with that would be atomicity: If you want multiple
processes to keep interacting with each other, you need to save them at
the same point in time, which gets harder as you split your interface into
more than a single file descriptor.

	Arnd <><
--

From: Dave Hansen
Date: Friday, August 8, 2008 - 5:43 pm

It could take ages to write out a checkpoint even to a single fd, so I
suspect we'd have the exact same kinds of issues either way.

-- Dave

--

From: Arnd Bergmann
Date: Friday, August 8, 2008 - 11:37 pm

I guess either way, you have to SIGSTOP (or similar) all the tasks you want
to checkpoint atomically before you start saving the contents.
If you use a single fd, you can do that under the covers, when using a 
more complex file system, it seems more logical to require an explicit
interface for this.
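
From userspace, the explicit stop/continue step could look roughly like
the sketch below.  It only works for direct children of the caller,
because it relies on waitpid() to confirm the stop; freeze_child() and
thaw_child() are hypothetical helper names, not a proposed interface.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stop a child and wait until the kernel reports it as stopped, so the
 * caller knows the task is no longer running before saving its state.
 * Returns 0 on success, -1 on error.
 */
static int freeze_child(pid_t pid)
{
	int status;

	assert(pid > 0);
	if (kill(pid, SIGSTOP) < 0)
		return -1;
	if (waitpid(pid, &status, WUNTRACED) != pid)
		return -1;
	return WIFSTOPPED(status) ? 0 : -1;
}

/* Let a previously stopped child run again. */
static int thaw_child(pid_t pid)
{
	assert(pid > 0);
	return kill(pid, SIGCONT);
}
```

For several tasks, every one of them has to be stopped before the first
is saved; otherwise a still-running task can change state that it
shares with one already written out.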

	Arnd <><
--

From: Dave Hansen
Date: Saturday, August 9, 2008 - 6:39 am

Oh, we're already working on patches to the freezer code to do this for
us.  There's a branch in here from Matt H. that's doing just that:

http://git.kernel.org/?p=linux/kernel/git/daveh/linux-2.6-next-lxc.git;a=shortlog

-- Dave

--

From: Serge E. Hallyn
Date: Monday, August 11, 2008 - 8:07 am

One reason is that I suspect that stops us from being able to send that
data straight to a pipe to compress and/or send on the network, without
hitting local disk.  Though if the checkpointfs was ram-based maybe not?

As Oren has pointed out before, passing in an fd means we can pass a
socket into the syscall.
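
To illustrate why a plain fd is convenient here: code that only ever
calls write() on the descriptor does not care whether it is a file, a
pipe to a compressor, or a socket.  A small userspace sketch, with
write_all() as a hypothetical helper name; short writes and EINTR are
the cases that show up once the target is a pipe or socket rather than
a regular file.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write exactly len bytes to fd, retrying after short writes and
 * interrupted calls.  Returns 0 on success, -1 on error (errno set).
 */
static int write_all(int fd, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	assert(buf != NULL || len == 0);
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

On a blocking pipe or socket this call sleeps until the reader drains
data, which is where the question of who reads the other end matters.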

Using the anon_inodes would also prevent that, but if it makes for a
cleaner overall solution then I'm not against considering either one
--

From: Arnd Bergmann
Date: Monday, August 11, 2008 - 8:25 am

With anon_inodes, you can still implement splice_read/splice_write,
so you can splice it into a socket.

	Arnd <><
--

From: Pavel Machek
Date: Wednesday, August 13, 2008 - 10:53 pm

If you do pass a socket, will it handle blocking correctly? Getting
deadlocked task would be bad. What happens if I try to snapshot into
/proc/self/fd/0 ? Or maybe restore from /proc/cmdline?

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Dave Hansen
Date: Thursday, August 14, 2008 - 8:12 am

Heh, that's a good point.  What was the other code where we kept coming
up with deadlocks like that?  Anyone remember?

-- Dave

--

From: Oren Laadan
Date: Wednesday, August 20, 2008 - 2:40 pm

Hmmm... these are good points.

Keep in mind that our principal goal is to checkpoint a whole container,
rather than a task to checkpoint itself (which is a by-product). Of course
your comments apply to a whole container as well.

In both cases, I don't think that blocking on a socket is a problem; the
checkpointer will enter a TASK_INTERRUPTIBLE state. Where is the deadlock ?
Writing or reading to/from /proc/self/... likewise - the programmer must
understand the implications, or the program won't work as expected. I don't
see a possible deadlock here, though.

For example - writing to /proc/self/fd/0 is ok; the state of fd[0] of that
task will be captured at some point in the middle of the checkpoint, so
after restart one cannot assume anything about the file position; the rest
should work.

Oren.

--

From: Serge E. Hallyn
Date: Monday, August 11, 2008 - 8:22 am

At the checkpoint end, the ptrace checks seem appropriate:  If you're
allowed to stop and manipulate the process, then you may as well be
allowed to checkpoint and see/tweak its memory that way.

At the restart end, every resource which was checkpointed will have to
be re-created, and permissions checked against the privilege of the
task which did the restart.  We may end up having to make use of the new
credentials for this.

This could become unpleasant: if an unprivileged task asked a privileged
helper to create something for the unprivileged task to use (i.e. a
raw socket), then the user needs to be privileged to re-create the
resource.  But it's necessary.

-serge
--

From: Arnd Bergmann
Date: Monday, August 11, 2008 - 9:53 am

Right. Of course, the hard part here will be to make it obviously
safe. Having to check all sorts of permissions means there will
be many opportunities for exploitable bugs.

The best way I can think of for this would be to use existing syscalls
(e.g. sched_setscheduler, setfsuid, ...) from user space wherever
possible and do only the bare minimum for the restart part in the kernel.

	Arnd <><
--

From: Dave Hansen
Date: Monday, August 11, 2008 - 10:11 am

Well, the current direction is about as far away from that as you can
get, unless we basically call those system calls from inside our new
sys_restart() one.  As of now, we're doing as much work in the kernel as
possible, and doing the bare minimum in userspace.  That's what both
Oren and our OpenVZ colleagues have advocated.

-- Dave

--

From: Dave Hansen
Date: Monday, August 11, 2008 - 12:48 pm

Arnd, Jeremy and Oren,

Thanks for all of the very interesting comments about the ABI.  

Considering that we're still *really* early in getting this concept
merged up into mainline, what do you all think we should do now?

My main goal here is just to get everyone to understand the approach
that we're proposing rather than to really fix the interfaces in stone.
I bet we're going to be changing them a lot before these patches
actually get in.

-- Dave

--

From: Arnd Bergmann
Date: Monday, August 11, 2008 - 2:47 pm

I think the two most important aspects here need to be security and
simplicity. If you have to choose between the two, it probably makes
sense to put security first, because loading untrusted data into
the kernel puts you at a significant risk to start with. If you
can show a restart interface that lets regular users restart their
tasks in a way anyone can verify to be secure, that will be a
good indication that you're on the right track.

The other problem that you really need to solve is interface
stability. What you are creating is a binary representation
of many kernel internal data structures, so in our common
rules, you have to make sure that you remain forward and
backward compatible. Simply saying that you need to run
an identical kernel when restarting from a checkpoint is not
enough IMHO.


Some more words on specific interfaces that we have discussed:

The single-file-descriptor approach has the big advantage of
keeping the complexity in one place (the kernel). To be consistent
with other kernel interfaces, I would make the kernel hand out a
file descriptor, not let the user open a file and pass that into
the kernel as you do now.

A new file system is a good idea for many complex interfaces that
make their way into the kernel, but I don't think it will help
in this case.

For checkpointing a single task, or even a task with its children,
a different interface I could imagine would be to have a new
file in procfs per pid that you can read as a pipe giving out
the same data that you currently save in the checkpoint file
descriptor. It does mean that you won't be able to pass flags
down easily (you could write to the pipe before you start reading,
but that's not too nice).

On the restart side, I think the most consistent interface would
be a new binfmt_chkpt implementation that you can use to execve
a checkpoint, just like you execute an ELF file today. The binfmt
can be a module (unlike a syscall), so an administrator that is
afraid of the security ...
From: Jonathan Corbet
Date: Monday, August 11, 2008 - 4:14 pm

On Mon, 11 Aug 2008 23:47:49 +0200

OTOH, making one of these checkpoint files go into any 2.6.x kernel
seems like a very high bar, to the point, perhaps, of killing this
feature entirely.  

There could be a case for viewing sys_restore() as being a lot like
sys_init_module() - a view into kernel internals that goes beyond the
normal user-space ABI, and beyond the stability guarantee.  It might be
possible to create a certain amount of version portability with a
modversions-like mechanism, but it sure seems hard to do better than
that.

jon
--

From: Dave Hansen
Date: Monday, August 11, 2008 - 4:23 pm

The OpenVZ dudes like to refer to something that Andrew Morton said about
this (paraphrasing...):  if we need cross-version restore support, we
can count on userspace to do the conversion.

You can almost think of it like the crashdump processing utility that we
have.  Instead of worrying about having the kernel *always* produce the
same crashdump with the same gunk in it, we make userspace do all the
parsing and interpretation.

It also makes it quite possible for a distribution to make a change (say
because of a security fix) in the kernel that changes the checkpoint
format, then to quickly code up the necessary bits for the conversion
program. 

-- Dave

--

From: Oren Laadan
Date: Wednesday, August 20, 2008 - 10:56 pm

quoting:

 > There could be a case for viewing sys_restore() as being a lot like
 > sys_init_module() - a view into kernel internals that goes beyond the
 > normal user-space ABI, and beyond the stability guarantee.  It might be
 > possible to create a certain amount of version portability with a
 > modversions-like mechanism, but it sure seems hard to do better than
 > that.
 >
 > jon

Extending this view in the context of security - we can require sysadmin
privilege to restart, and then sysadmin is responsible for the contents
of the file. The kernel will ensure that the data isn't corrupted. Much
like with loading a kernel module - the admin may load any sort of crap.
Then, sysadmin may, for instance, add a signature on a checkpointed file
to verify its integrity.

(Well, one problem with this scheme in the context of self-checkpoint

Using a single handle (crid or a special file descriptor) to identify
the whole checkpoint is very useful - to be able to stream it (eg. over
the network, or through filters). It is also very important for future
features and optimizations. For example, to reduce downtime of the
application during checkpoint, one can use COW for dirty pages, and
only write-back the entire data after the application resumes execution.
Or imagine a use-case where one would like to keep the entire checkpoint
in memory. These are pretty hard to do if you split the handling between

This is an interesting idea but not without its problems. In particular,
a successful execve() by one thread destroys all the others. Also, it
isn't clear how this can work with pre-copying and live-migration; And
finally, I'm not sure how to handle shared objects in this manner.

As for kernel module - it is easy to implement most of the checkpoint
restart functionality in a kernel module, leaving only the syscall stubs
in the kernel.

Oren.

--

From: Arnd Bergmann
Date: Thursday, August 21, 2008 - 1:43 am

Sorry, I don't buy that argument. I'm convinced that an implementation
is possible where any user can load checkpoints of tasks that he could
create by starting the processes directly. If you argue that loading
a corrupted checkpoint can cause any problems, then I would assume


Right, execve currently assumes that the new process starts up with
a single thread, but a potential binfmt_chkpt would need to potentially
start multithreaded. I guess this either requires execve to reuse
the existing threads (assuming they have been set up correctly in
advance) or to create new ones according to the context of the
checkpoint data. It may not be as easy as I thought initially, but
both seem possible.
Restarting a whole set of processes from a checkpoint would be

What do you mean by pre-copying?
How is live-migration different from restarting a previously saved

Yeah, I've done the same in spufs, but I still think it's ugly ;-)

	Arnd <><
--

From: Oren Laadan
Date: Thursday, August 21, 2008 - 8:43 am

By pre-copying I refer to the first stage of live-migration: to reduce
down time, much of the state of a container can be saved while tasks
are still running (most notably memory, but also file system snapshot,
if need be). Since the state may change, this is repeated - to save
what changed in the meanwhile - until the delta is small enough. During
all this time the tasks continue to execute. At this point, we freeze
the container, save the last delta, and resume (in case of snapshot) or
kill (in case of live-migration) the container. I'm not convinced that
execve() is the best way to handle this iterative process.
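
The iterative scheme can be sketched as a toy, single-threaded
simulation.  Everything below (the container structure, the page
counts, the workload that dirties fewer pages each round) is made up
for illustration; real dirty tracking would come from the memory
management code, and the tasks would run concurrently with the copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES  8	/* hypothetical container size in pages */
#define PAGE_SZ 16	/* hypothetical (tiny) page size */

struct container {
	unsigned char mem[NPAGES][PAGE_SZ];
	bool dirty[NPAGES];	/* set on write, cleared when saved */
};

typedef void (*run_fn)(struct container *c, int round);

/* A task writing to memory: change the page and mark it dirty. */
static void task_write(struct container *c, size_t page, unsigned char val)
{
	memset(c->mem[page], val, PAGE_SZ);
	c->dirty[page] = true;
}

static size_t count_dirty(const struct container *c)
{
	size_t i, n = 0;

	for (i = 0; i < NPAGES; i++)
		n += c->dirty[i];
	return n;
}

/* Copy every dirty page into the image; returns the number of pages copied. */
static size_t save_delta(struct container *c, unsigned char image[NPAGES][PAGE_SZ])
{
	size_t i, copied = 0;

	for (i = 0; i < NPAGES; i++) {
		if (!c->dirty[i])
			continue;
		memcpy(image[i], c->mem[i], PAGE_SZ);
		c->dirty[i] = false;
		copied++;
	}
	return copied;
}

/* Example workload: touches 4, then 2, then 1 page(s) in successive rounds. */
static void shrinking_workload(struct container *c, int round)
{
	size_t i, n = (size_t)4 >> round;

	for (i = 0; i < n; i++)
		task_write(c, i, (unsigned char)(round + 1));
}

/*
 * Save deltas while the tasks keep running, until the dirty set is at
 * most 'threshold' pages (or max_rounds is hit).  The final delta is
 * saved with the tasks "frozen", i.e. run_tasks is no longer called.
 * Returns the number of pages copied during that frozen phase.
 */
static size_t precopy_checkpoint(struct container *c,
				 unsigned char image[NPAGES][PAGE_SZ],
				 run_fn run_tasks, size_t threshold,
				 int max_rounds)
{
	int round = 0;

	assert(run_tasks != NULL);
	while (count_dirty(c) > threshold && round < max_rounds) {
		save_delta(c, image);
		run_tasks(c, round++);	/* tasks dirty pages meanwhile */
	}
	return save_delta(c, image);	/* downtime is proportional to this */
}
```

The return value is the interesting number: it is the amount of work
done while the container is stopped, which is what the repeated rounds
are meant to shrink.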

Also, with multiple tasks in a container, data for consecutive tasks
will appear in order in the checkpoint image. Moreover, a future
optimization would be to have multiple threads checkpoint the container,
with data interleaved in the checkpoint image stream. Here, too, I'm
not sure how an execve()-like approach plays out.

Finally there is the case of shared objects: v2 demonstrates this in
checkpoint/objhash.c (see also Documentation/checkpoint.txt). Again,
I'm not sure how execve() can adapt to this need.
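
For readers who have not looked at v2: the general idea behind such a
table is that an object referenced from several places is written once
and referred to by a tag afterwards.  The sketch below shows only that
idea, with a linear scan instead of a hash and hypothetical names
throughout; it is not the code in checkpoint/objhash.c.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_MAX 64	/* hypothetical capacity */

struct obj_table {
	const void *ptr[OBJ_MAX];	/* objects seen so far */
	size_t count;
};

/*
 * Return the tag for ptr, registering it on first sight.  Tags start
 * at 1.  *is_new tells the caller whether to write the full object
 * state (1) or just a reference to the tag (0).
 */
static int obj_tag(struct obj_table *t, const void *ptr, int *is_new)
{
	size_t i;

	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*is_new = 0;
			return (int)i + 1;
		}
	}
	assert(t->count < OBJ_MAX);	/* table full is a sizing bug here */
	t->ptr[t->count++] = ptr;
	*is_new = 1;
	return (int)t->count;
}
```

On restart the mapping runs the other way: the first occurrence of a
tag creates the object, and later occurrences look it up and take
another reference.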

I definitely agree that using something like execve() is elegant and
has its advantages. It just isn't clear to me that it is truly suitable
for the needs. Suggestions are welcome.

--

From: Oren Laadan
Date: Monday, August 11, 2008 - 2:54 pm

I closely follow the valuable feedback and fix the code accordingly.

I propose to extend the proof of concept, to also be able to save and
restore "simple" open files (regular files, directories). My motivation
is twofold:

(1) Providing enough functionality for people to meaningfully play
with the proposed patches and try them.

(2) Demonstrate how I propose to handle shared resources (open files).

The first point is very important because it makes the concept actually
useful to a much broader set of programs than otherwise. I hope this
will attract a larger audience :)

To this end, I have already extended the patchset, and I should be
able to send something working in a day or two.

Oren.

--

From: Jeremy Fitzhardinge
Date: Monday, August 11, 2008 - 4:38 pm

Yes.

It seems to me that worrying about ABI at this point is a bit premature.

This feature, as it currently stands, is essentially useless for any 
practical purpose.  Self-checkpointing a single process with no handling 
of non-file file descriptors and no proper handling of file 
file-descriptors is not very useful.

My understanding is that this is basically a prototype for a more useful
multi-process or container-wide checkpoint facility.

While you could try to come up with an extensible file format that would 
be able to handle any future extensions, the chances are you'd get it 
wrong and need to break file format compatibility anyway.

I'm more interested in seeing a description of how you're going to
handle things like:

    * multiple processes
    * pipes
    * UNIX domain sockets
    * INET sockets (both inter and intra machine)
    * unlinked open files
    * checkpointing file content
    * closed files (ie, files which aren't currently open, but will be
      soon, esp tmp files)
    * shared memory
    * (Peter, what have I forgotten?)

Having gone through this before, I don't think an all-kernel solution 
can work except for the most simple cases.

Which, come to think of it, is an important point.  What are the 
expected use-cases for this feature?  Do you really mean 
checkpoint/restart?  Do you expect to be able to checkpoint a process, 
leave it running, then "rewind" by restoring the image?  Or does 
checkpoint always atomically kill the source process(es)?  Are you 
expecting to be able to resume on another machine?

Lightweight filesystem checkpointing, such as btrfs provides, would seem 
like a powerful mechanism for handling a lot of the filesystem state 
problems.  It would have been useful when we did this...

    J
--

From: Peter Chubb
Date: Monday, August 11, 2008 - 4:54 pm

>>>>> "Jeremy" == Jeremy Fitzhardinge <jeremy@goop.org> writes:



Jeremy>    * multiple processes * pipes * UNIX domain sockets * INET
Jeremy> sockets (both inter and intra machine) * unlinked open files *
Jeremy> checkpointing file content * closed files (ie, files which
Jeremy> aren't currently open, but will be soon, esp tmp files) *
Jeremy> shared memory * (Peter, what have I forgotten?)

File sharing; multiple threads with weird sharing arrangements (think:
clone with various parameters, followed by exec in some of the threads
but not others); MERT/system-V shared memory, semaphores and message
queues; devices (audio, framebuffer, etc), HugeTLBFS, numa issues
(pinning, memory layout), processes being debugged (so,
checkpoint/restart a gdb/target pair), futexes, etc., etc.  Linux
process state keeps expanding.

Jeremy> Having gone through this before, I don't think an all-kernel
Jeremy> solution can work except for the most simple cases.

I agree ... it's better to put mechanisms into the kernel that can
then be used by a user-space programme to actually do the
checkpointing and restarting.

Beefing up ptrace or fixing /proc to be a real debugging interface
would be a start ... when you can get at *all* the info you need,
quickly and easily, the userspace checkpoint falls out fairly
naturally.  You still have to work out an extensible file format to
store stuff, and how to restore all that state you've so lovingly
collected.

Jeremy> Lightweight filesystem checkpointing, such as btrfs provides,
Jeremy> would seem like a powerful mechanism for handling a lot of the
Jeremy> filesystem state problems.  It would have been useful when we
Jeremy> did this...

And how!  Saving bits of files was very time-consuming.
--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au           ERTOS within National ICT Australia
--

From: Serge E. Hallyn
Date: Tuesday, August 12, 2008 - 7:49 am

Except we don't really want to export all the info you need for a
complete restartable checkpoint.  And especially not make it
generally writable.

We have also started down that path using ptrace (see cryo, at
git://git.sr71.net/~hallyn/cryodev.git).

Right before the containers mini-summit, where the general agreement was
that a complete in-kernel solution ought to be pursued, I had tried
a restart using a binary format that read a checkpoint file and used
cryo (userspace using ptrace) for the rest of the restart, only
because there was no other reasonable way to set tsk->did_exec on

Yes, we're looking forward to using btrfs' snapshots :)

-serge
--

From: Eric W. Biederman
Date: Thursday, August 28, 2008 - 4:40 pm

That and unless we get a lot of synergy from authors of debuggers
and debugging code it is a more general and slower interface for

Can we please describe this as the giant syscall approach.  Instead
of a complete in-kernel solution.  There are things like filesystems
that should be checkpointed separately, or not checkpointed at all.

However there is a large set of processes and process state that always
goes together and if you checkpoint a container you always want.

So building something that is roughly equivalent to a binfmt module
but that can save and restore multiple tasks with a single operation

Yep.  And in the case of migration we don't even need to snapshot
a filesystem, just mount it on the target machine.  Except for
the unlinked files challenge.

Eric
--

From: Dave Hansen
Date: Tuesday, August 12, 2008 - 8:11 am

Yep, these are all challenges.  If you have some really specific
questions, or things you truly think can't be done, please speak up.
But, I really don't see any show stoppers in your list.

We also plan to do this incrementally.  The first consumers are likely
to be dumb, simple HPC apps that don't have real hardware like audio or
video.  Eventually, we'll get to real hardware like infiniband (ugh) or
audio.  Eventually.

(Actually futexes aren't that bad because they don't keep state
in-kernel)

-- Dave

--

From: Dave Hansen
Date: Tuesday, August 12, 2008 - 7:58 am

Yes, that's exactly it.  We're diverging from discussing the important
bits as it is, and I think we'd do that more and more with extra

Amen to that.  I won't speak for the rest of the whackos interested in

So, there's a lot of stuff there.  The networking stuff is way out of my
league, so I'll cc Daniel and make him answer. :)

All of the other stuff has been done in various in-kernel
implementations.  OpenVZ, IBM's Metacluster, Zap (Oren's work at
Columbia).  Most of it *can* be done from userspace, but some of it is
very painful.  There are some good OLS papers describing most of these
things.  Zap might have had one or two academic papers written about it.
Maybe.  ;)

Unlinked files, for instance, are actually available in /proc.  You can
freeze the app, write a helper that opens /proc/1234/fd, then copies its
contents to a linked file (ooooh, with splice!)  Anyway, if we can do it
in userspace, we can surely do it in the kernel.
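
A minimal userspace sketch of the /proc trick described above, assuming
Linux with /proc mounted and a target that is already frozen.  The
helper name cr_relink_copy() is made up for illustration and is not
part of the patch set; it uses plain read()/write() rather than splice.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the contents of an open (possibly unlinked)
 * file to a new, linked path by reopening it through /proc/<pid>/fd/<fd>.
 * Returns the number of bytes copied, or -1 on error.
 */
long cr_relink_copy(pid_t pid, int fd, const char *dest)
{
	char proc_path[64];
	char buf[4096];
	long total = 0;
	ssize_t n;
	int in, out;

	snprintf(proc_path, sizeof(proc_path), "/proc/%d/fd/%d", (int)pid, fd);

	/* Reopening the magic link gives a new descriptor starting at offset 0. */
	in = open(proc_path, O_RDONLY);
	if (in < 0)
		return -1;

	out = open(dest, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}

	while ((n = read(in, buf, sizeof(buf))) > 0) {
		if (write(out, buf, (size_t)n) != n) {
			total = -1;
			break;
		}
		total += n;
	}
	if (n < 0)
		total = -1;

	close(in);
	close(out);
	return total;
}
```

Because the copy goes through a fresh descriptor, the file position of
the frozen application's own fd is left untouched.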

I'm not sure what you mean by "closed files".  Either the app has a fd,
it doesn't, or it is in sys_open() somewhere.  We have to get the app
into a quiescent state before we can checkpoint, so we basically just
say that we won't checkpoint things that are *in* the kernel.

Is there anything specific you are thinking of that particularly worries

Yes.

We all want different things, and there are a lot of people interested
in this stuff.  So, I think all of what you've mentioned above are
goals, at least long term.  Some, *really* long term.

I don't want to get into a full virtualization vs. containers debate,
but we also want it for all the same reasons that you migrate Xen

Yup.  We were just chatting about that with some filesystem folks last
week.  But, as the OpenVZ dudes like to mention, the poor man's way of
moving filesystem snapshots around is always rsync.

-- Dave

--

From: Jeremy Fitzhardinge
Date: Tuesday, August 12, 2008 - 9:32 am

Inter-machine networking stuff is hard because it's outside the
checkpointed set, so the checkpoint is observable.  Migration is easier, 
in principle, because you might be able to shift the connection endpoint 
without bringing it down.  Dealing with networking within your 
checkpointed set is just fiddly, particularly remembering and restoring 
all the details of things like urgent messages, on-the-fly file 

Sure, there's no inherent problem.  But do you imagine including the 
file contents within your checkpoint image, or would they be saved 

It's common for an app to write a tmp file, close it, and then open it a 
bit later expecting to find the content it just wrote.  If you 
checkpoint-kill it in the interim, reboot (clearing out /tmp) and then 
resume, then it will lose its tmp file.  There's no explicit connection 
between the process and its potential working set of files.  We had to 
deal with it by setting a bunch of policy files to tell the 
checkpoint/restart system what filename patterns it had to look out 
for.  But if you just checkpoint the whole filesystem state along with 


So, in other words: whoever wants to work on it gets to define (their) 

No, I don't have any real opinion about containers vs virtualization.  I 
think they're quite distinct solutions for distinct problems.

But I was involved in the design and implementation of a 
checkpoint-restart system (along with Peter Chubb), and have the scars 
to prove it.  We implemented it for IRIX; we called it Hibernator, and 
licensed it to SGI for a while (I don't remember what name they marketed 
it under).  The list of problems that Peter and I mentioned are ones we 
had to solve (or, in some cases, failed to solve) to get a workable system.

    J
--

From: Dave Hansen
Date: Tuesday, August 12, 2008 - 9:46 am

All true.  Hard stuff.

The IBM product works partly by limiting migrations to occurring on a
single physical ethernet network.  Each container gets its own IP and
MAC address.  The socket state is checkpointed quite fully and moved

Me, personally, I think I'd probably "re-link" the thing, mark it as
such, ship it across like a normal file, then unlink it after the

I respectfully disagree.  The number one prerequisite for
checkpoint/restart is isolation.  Xen just happens to get this for free.
So, instead of saying that there's no explicit connection between the
process and its working set, ask yourself how we make a connection.

In this case, we can do it with a filesystem (mount) namespace.  Each
container that we might want to checkpoint must have its writable
filesystems contained to a private set that are not shared with other
containers.  Things like union mounts would help here, but aren't

Right.  We just start with "everybody has their own disk" which is slow

It's almost as big of a problem as trying to virtualize entire machines

Cool!  I didn't know you guys did the IRIX implementation.  I'm sure you
guys got a lot farther than any of us are.  Did you guys ever write any
papers or anything on it?  I'd be interested in more information.

-- Dave

--

From: Jeremy Fitzhardinge
Date: Tuesday, August 12, 2008 - 10:04 am

We were dealing with checkpointing random sets of processes, and that 
posed all sorts of problems.  Filesystem namespace was one, the pid 
namespace was another.  Doing checkpointing at the container-level 

No, it's much harder.  Hardware is relatively simple and immutable 

Yeah, there was a paper, but it looks like the internet has lost it.  It 
was at 
http://www.csu.edu.au/special/conference/apwww95/.papers95/cmaltby/cmaltby.ps
http://www.csu.edu.au/special/conference/apwww95/sept-all.html has 
mention of the paper.

    J
--

From: Oren Laadan
Date: Wednesday, August 20, 2008 - 2:52 pm

From: Oren Laadan
Date: Wednesday, August 20, 2008 - 2:54 pm

Re-linking works well when the file system supports that - some do not
allow this, in which case you need to silently rename instead of really
un-linking (even with NFS), or copy the entire contents.

Of course, you also need a snapshot of the file system in case it changes
after the checkpoint is taken, or take other measures. We can safely

Yep.

[SNIP]

Oren.

--

From: Dave Hansen
Date: Wednesday, August 20, 2008 - 3:11 pm

Yeah, it will certainly be fs-dependent.

This might be a good application for splice.

	new_fd = open("/tmp/linked-newfile", O_WRONLY | O_CREAT, perms);
	/* splice() wants a pipe on one side, so this would really go via a pipe */
	splice(unlinked_fd, NULL, new_fd, NULL, INT_MAX, SPLICE_F_MOVE);

I'm not sure if it can re-use the blocks on the fs for this, but it
probably doesn't matter.  
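
A rough sketch of what the splice variant could look like from
userspace.  splice(2) requires one of the two descriptors to be a pipe,
so the data is bounced through one.  The function name and the 64 KiB
chunk size are assumptions for illustration only.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything from in_fd (starting at its
 * current offset) to out_fd by splicing through a pipe.
 * Returns the number of bytes copied, or -1 on error.
 */
long cr_splice_copy(int in_fd, int out_fd)
{
	int p[2];
	long total = 0;

	if (pipe(p) < 0)
		return -1;

	for (;;) {
		/* file -> pipe */
		ssize_t n = splice(in_fd, NULL, p[1], NULL, 65536, SPLICE_F_MOVE);

		if (n < 0) {
			total = -1;
			break;
		}
		if (n == 0)
			break;	/* end of file */

		/* pipe -> file: drain exactly what was queued above */
		while (n > 0) {
			ssize_t m = splice(p[0], NULL, out_fd, NULL,
					   (size_t)n, SPLICE_F_MOVE);

			if (m <= 0) {
				total = -1;
				goto out;
			}
			n -= m;
			total += m;
		}
	}
out:
	close(p[0]);
	close(p[1]);
	return total;
}
```

Whether the filesystem can actually move pages instead of copying them
is a separate question; SPLICE_F_MOVE is only a hint.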

-- Dave

--

From: Jonathan Corbet
Date: Monday, August 11, 2008 - 11:03 am

I'm trying to figure out this patch set...here's a few things which

This seems like a clunky and error-prone interface - why not just have it
allocate the memory always?  But, in this case, cr_get_fname() always seems
to be called with ctx->tbuf, which, in turn, is an order-1 allocation.
Here you're saying that if it's too small, you'll try replacing it with an

This magic number is hard-coded in a number of places.  Could it maybe

This function is going to break every time somebody changes struct
task_struct.  I'm not quite sure how to prevent that.  I wonder if the
modversions stuff could somehow be employed to detect changes and make the

Like others, I wondered why CAP_SYS_ADMIN was required here.  I *still*
wonder, though, how you'll ever be able to do restart without a privilege
check.  There must be a thousand ways to compromise a system by messing

Should you maybe check for write access?  An attempt to overwrite a
read-only file won't succeed, but you could save a lot of work by just
failing it with a clear code here.  

What about the file position?  Perhaps there could be a good reason to
checkpoint a process into the middle of a file, don't know.

In general, I don't see a whole lot of locking going on.  Is it really
possible to save and restore memory without ever holding mmap_sem?

jon
--

From: Dave Hansen
Date: Monday, August 11, 2008 - 11:38 am

Yeah, it doesn't make much sense on the surface.  I would imagine that
this has some use for when we're stacking things up in the ctx->hbuf
rather than just using it as a completely temporary buffer.  But, in any
case, it doesn't make sense as it stands now, so I think it needs to be

In general, I think any time that we are checkpointing $THING and $THING
changes, the checkpoint will break.  It just so happens that all we're
checkpointing here is the task_struct, so $THING == task_struct for
now. :)

The things that *really* worry me are things like when flags change
semantics subtly.  Or, let's say a flag is used for two different things
in 2.6.26.4 vs 2.6.27.  I'm not sure we're ever going to be in a
position to find and fix up stuff like that.

That's one reason I have been advocating doing checkpoint/restart in
much tinier bits so that we can understand each of them as we go along.

As with everything else coming from userspace, the checkpoint file
should be completely untrusted.  I do think, though, that the ptrace

That's true.  I'll take a look and see.

This patch does reach down and use vfs_write() at some point.  There
really aren't any other in-kernel users that do this (short of ecryptfs
and plan9fs).  That makes me doubt that we're even using a good approach

I think this is a good example of a place where the kernel can let
userspace shoot itself in its foot if it wants.  We might also want to
allow things to be sent over fds that don't necessarily have positions,

I personally haven't audited the locking, yet.  It is going to be fun!

But, take a look in patch 3/4:

+       /* write the vma's */
+       down_read(&mm->mmap_sem);
+       for (vma = mm->mmap; vma; vma = vma->vm_next) {
+               if ((ret = cr_write_vma(ctx, vma)) < 0)
+                       break;
+       }
+       up_read(&mm->mmap_sem);

Thanks for the review, Jonathan!

-- Dave

--

From: Oren Laadan
Date: Monday, August 11, 2008 - 8:44 pm

Dave is right on the money: in Zap (the equivalent of) cr_get_fname()
may be called with a buffer smaller than PATH_MAX (one page) and hence
the need to allocate ad-hoc. Indeed in the current code this is not the

One way to reduce the risk is to use an intermediate representation for
kernel-native data and properties (e.g. classify VMAs during checkpoint
instead of relying blindly on the flags).
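
To illustrate the idea of classifying VMAs rather than storing raw
flags, here is a small sketch.  The enum, the EX_VM_* bit values and
cr_classify_vma() are all hypothetical; they only stand in for whatever
the kernel's flag bits happen to be in a given version.

```c
#include <stdint.h>

/* Hypothetical kernel-independent VMA classes that would go in the image. */
enum cr_vma_kind {
	CR_VMA_ANON_PRIVATE,
	CR_VMA_ANON_SHARED,
	CR_VMA_FILE_PRIVATE,
	CR_VMA_FILE_SHARED,
	CR_VMA_STACK,
};

/* Stand-ins for kernel flag bits; real values may change between kernels. */
#define EX_VM_SHARED	0x1u
#define EX_VM_GROWSDOWN	0x2u

/* Map kernel-native properties to a stable class at checkpoint time. */
enum cr_vma_kind cr_classify_vma(uint32_t flags, int has_file)
{
	if (flags & EX_VM_GROWSDOWN)
		return CR_VMA_STACK;
	if (has_file)
		return (flags & EX_VM_SHARED) ? CR_VMA_FILE_SHARED
					      : CR_VMA_FILE_PRIVATE;
	return (flags & EX_VM_SHARED) ? CR_VMA_ANON_SHARED
				      : CR_VMA_ANON_PRIVATE;
}
```

If a flag later changes meaning, only the classifier needs to follow
it; images that store the class stay readable.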

The problem is not so much in restarting a checkpoint image from old
kernel on a new kernel - that can be handled by conversion in user space.

Tracking changes affecting the checkpoint/restart logic - well, if
eventually checkpoint/restart becomes mainstream enough that

The only reason I made the analogy without actually implementing it is lack

There is some optimistic locking (mmap_sem), improved in the next version.

Thanks,

Oren.

--

From: Pavel Emelyanov
Date: Monday, August 18, 2008 - 2:26 am

Sorry for probably being out-of-date again, but isn't it better to
put these headers in the include/linux and export them to the user
space?

Why? Because we'll need some image-dumping tool (let alone the
image-converting one for compatibility purposes) and these tools
would need to know what the image looks like.

Thanks,
Pavel
--

From: Dave Hansen
Date: Wednesday, August 20, 2008 - 12:10 pm

What's the deal with headers being exported these days?  Don't we always
have to sanitize them before we ship them over to userspace anyway?

-- Dave

--

From: Dave Hansen
Date: Thursday, August 7, 2008 - 3:40 pm

The original version of Oren's patch contained a good hunk
of #ifdefs.  I've extracted all of those and created a bit
of an API for new architectures to follow.

Leaving Oren's sign-off because this is all still his code,
even though he hasn't seen it mangled like this before.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---

 linux-2.6.git-dave/ckpt/Makefile          |    1 
 linux-2.6.git-dave/ckpt/checkpoint.c      |    7 
 linux-2.6.git-dave/ckpt/ckpt_arch.h       |    6 
 linux-2.6.git-dave/ckpt/restart.c         |    7 
 linux-2.6.git-dave/ckpt/x86.c             |  269 ++++++++++++++++++++++++++++++
 linux-2.6.git-dave/include/asm-x86/ckpt.h |   46 +++++
 6 files changed, 336 insertions(+)

diff -puN ckpt/checkpoint.c~x86_part ckpt/checkpoint.c
--- linux-2.6.git/ckpt/checkpoint.c~x86_part	2008-08-04 13:29:59.000000000 -0700
+++ linux-2.6.git-dave/ckpt/checkpoint.c	2008-08-04 13:29:59.000000000 -0700
@@ -19,6 +19,7 @@
 
 #include "ckpt.h"
 #include "ckpt_hdr.h"
+#include "ckpt_arch.h"
 
 /**
  * cr_get_fname - return pathname of a given file
@@ -183,6 +184,12 @@ static int cr_write_task(struct cr_ctx *
 
 	ret = cr_write_task_struct(ctx, t);
 	CR_PRINTK("ret (task_struct) %d\n", ret);
+	if (!ret)
+		ret = cr_write_thread(ctx, t);
+	CR_PRINTK("ret (thread) %d\n", ret);
+	if (!ret)
+		ret = cr_write_cpu(ctx, t);
+	CR_PRINTK("ret (cpu) %d\n", ret);
 
 	return ret;
 }
diff -puN /dev/null ckpt/ckpt_arch.h
--- /dev/null	2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_arch.h	2008-08-04 13:29:59.000000000 -0700
@@ -0,0 +1,6 @@
+#include "ckpt.h"
+
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+int cr_read_thread(struct cr_ctx *ctx);
+int cr_read_cpu(struct cr_ctx *ctx);
diff -puN ckpt/Makefile~x86_part ckpt/Makefile
--- linux-2.6.git/ckpt/Makefile~x86_part	2008-08-04 13:29:59.000000000 -0700
+++ linux-2.6.git-dave/ckpt/Makefile	2008-08-04 ...
From: Arnd Bergmann
Date: Friday, August 8, 2008 - 5:09 am

It seems weird that you use __u64 members for the registers, but don't
include r8..r15 in the list. As a consequence, this structure does not
seem well suited for either x86-32 or x86-64.

I would suggest either using struct pt_regs by reference, or defining
it so that you can use the same structure for both 32 and 64 bit x86.

	Arnd <><
--

From: Oren Laadan
Date: Friday, August 8, 2008 - 1:28 pm

Hi,

Thanks for the feedback.

The proof-of-concept is written for 32-bit x86, keeping in mind that
we'll need 64-bit support. My goal is to leverage feedback
and contributions to have support for 64 bits and other architectures
as well.


In the context of CR, x86-32 and x86-64 are distinct architectures because
you cannot always migrate from one to the other (though 32->64 is sometimes
possible). Therefore, each architecture can have a separate checkpoint file
format (eg r8..r15 only for x86-64).

The information about the kernel configuration, version and cpu settings
will appear on the header; so the restart code will know the architecture
on which the checkpoint had been taken.

So if we want to restart a task checkpointed on x86-32 on a x86-64 machine
(in 32 bit mode), the code will know to not expect that data (r8..r15).
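
A sketch of how restart code might use the architecture recorded in the
image header.  The struct, enum and function names are invented for
illustration, and the rule "32-bit images are always accepted on a
64-bit host" is a simplification of the "sometimes possible" above.

```c
#include <stdint.h>

/* Hypothetical architecture tags written into the image header. */
enum cr_arch {
	CR_ARCH_X86_32 = 1,
	CR_ARCH_X86_64 = 2,
};

/* Hypothetical fragment of the image header. */
struct cr_hdr_arch_example {
	uint16_t arch;	/* enum cr_arch of the checkpointing machine */
	uint16_t pad;
};

/*
 * Returns 1 if an image taken on h->arch may be restarted on 'host'.
 * *has_ext_regs reports whether the image carries r8..r15.
 */
int cr_check_arch(const struct cr_hdr_arch_example *h, enum cr_arch host,
		  int *has_ext_regs)
{
	*has_ext_regs = (h->arch == CR_ARCH_X86_64);

	if (h->arch == host)
		return 1;
	/* 32-bit image on a 64-bit host: restart the task in 32-bit mode */
	if (h->arch == CR_ARCH_X86_32 && host == CR_ARCH_X86_64)
		return 1;
	return 0;
}
```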

Except for this special case (32 bit running on 64 bit), simple conversion
can be done in the kernel if needed, but most conversion of the format
between different kernel versions (should it change) can be done in

We prefer not to use the kernel structure directly, but an intermediate
structure that can help mitigate subtle incompatibility issues (between
kernel configurations, versions, and even compiler versions).

Anyway, either a single structure for both 32 and 64 bit x86, or separate
"struct cr_hdr_cpu{_32,_64}", one for each architecture.

Oren.

--

From: Arnd Bergmann
Date: Friday, August 8, 2008 - 3:29 pm

The 32bit on 64bit case is quite common on non-x86 architectures, e.g.
powerpc or sparc, where 64 bit kernels typically run 32 bit user space.

A particularly interesting case is mixing 32 and 64 bit tasks in a container
that you are checkpointing. This is a very realistic scenario, so there
may be good arguments for keeping the format identical between the variations

struct pt_regs is part of the kernel ABI, it will not change.

	Arnd <><


--

From: Oren Laadan
Date: Friday, August 8, 2008 - 4:04 pm

The idea was that x86-32 checkpoints can be restarted on a x86-64 node in

I'm in favor of keeping the format identical between the variations of
each architecture. Note, however, that "struct pt_regs" won't do because it
may change with these variations.

So we'll take care of the padding and add r8..r15 in the next version.

Oren.


--

From: Dave Hansen
Date: Friday, August 8, 2008 - 5:38 pm

"Part of the kernel ABI" makes it sound to me like it won't change.
Who's right here? :)

-- Dave

--

From: Oren Laadan
Date: Friday, August 8, 2008 - 6:20 pm

hehehe .. both; I meant that while it doesn't change per architecture, it
varies between architectures. So "struct pt_regs" compiled for x86-32 is
different than that compiled for x86-64. Therefore we can't just dump the
structure as is and expect that 64 bit would be able to parse the 32 bit.
In other words, we need an intermediate representation.

Oren.
--

From: Dave Hansen
Date: Friday, August 8, 2008 - 7:20 pm

Surely we already handle this, though.  Don't we allow a 32-bit app
running on a 64-bit kernel to PTRACE_GETREGS and get the 32-bit version?
A 64-bit app will get the 64-bit version making the same syscall.  It's
all handled in the syscall compatibility code.

-- Dave

--

From: Oren Laadan
Date: Friday, August 8, 2008 - 7:35 pm

Sure, that's a compatibility layer around ptrace() in the 64-bit kernel.

Recall that Arnd suggested "keeping the format identical between the
variations of each architecture", and I fully agree. If we want to keep
the format identical, we can't simply define:
	struct cr_hdr_cpu {
		struct pt_regs regs;
		...
	};
because that will compile differently on x86-32 and x86-64. So either we
add r8..r15 to the structure as it appears in the patch now (and keep the
format identical), or allow the format to vary, and explicitly test for
this case and add a compatibility layer. Personally I prefer the former.
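
To make the "former" option concrete, a sketch of a fixed-width layout
that compiles to the same bytes on both variants.  The field list is
illustrative only (segment registers and the like are left out), and
struct regs32_example is a stand-in, not the real struct pt_regs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical image layout: every field is 64 bits wide on both variants. */
struct cr_hdr_cpu_example {
	uint64_t ax, bx, cx, dx, si, di, bp, sp;
	uint64_t r8, r9, r10, r11, r12, r13, r14, r15;	/* zero for 32-bit tasks */
	uint64_t ip, flags;
};

/* 18 registers of 8 bytes each, regardless of the build target. */
static_assert(sizeof(struct cr_hdr_cpu_example) == 18 * sizeof(uint64_t),
	      "image layout must not depend on the build");

/* Stand-in for a 32-bit native register set. */
struct regs32_example {
	uint32_t ax, bx, cx, dx, si, di, bp, sp, ip, flags;
};

/* Widen a 32-bit register set into the common image layout. */
void cr_cpu_from_regs32(struct cr_hdr_cpu_example *out,
			const struct regs32_example *in)
{
	memset(out, 0, sizeof(*out));	/* also clears r8..r15 */
	out->ax = in->ax;
	out->bx = in->bx;
	out->cx = in->cx;
	out->dx = in->dx;
	out->si = in->si;
	out->di = in->di;
	out->bp = in->bp;
	out->sp = in->sp;
	out->ip = in->ip;
	out->flags = in->flags;
}
```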

Oren.

--

From: Jeremy Fitzhardinge
Date: Sunday, August 10, 2008 - 7:55 am

Struct pt_regs is not ABI, and can change (and has changed) on x86.  It's not
suitable for a checkpoint structure because it only contains the 
registers that the kernel trashes, not all usermode registers (on i386, 
it leaves out %gs, for example).  asm-x86/ptrace-abi.h does define stuff 
that's fixed in stone; it expresses it in terms of a register array, 
with constants defining what element is which register.
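
The register-array style can be sketched like this.  The index names
and their order below are made up for the example; they are not the
constants from asm-x86/ptrace-abi.h.

```c
#include <stdint.h>

/* Illustrative indices only; a fixed ABI would pin these values forever. */
enum cr_reg_example {
	CR_REG_BX,
	CR_REG_CX,
	CR_REG_DX,
	CR_REG_AX,
	CR_REG_IP,
	CR_REG_SP,
	CR_REG_GS,
	CR_REG_COUNT,
};

/* The saved state is just an array; the constants say which slot is which. */
struct cr_regs_example {
	uint64_t reg[CR_REG_COUNT];
};

static inline void cr_set_reg(struct cr_regs_example *r,
			      enum cr_reg_example which, uint64_t val)
{
	r->reg[which] = val;
}

static inline uint64_t cr_get_reg(const struct cr_regs_example *r,
				  enum cr_reg_example which)
{
	return r->reg[which];
}
```

New registers can then be appended as new indices without moving the
existing slots.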

    J
--

From: Dave Hansen
Date: Monday, August 11, 2008 - 8:36 am

Thanks for the explanation.

I just want to reduce the coding and maintenance burden here.  Xen must
do this for partition mobility, right?  Does it define all its own
stuff?

-- Dave

--

From: Jeremy Fitzhardinge
Date: Monday, August 11, 2008 - 9:07 am

You mean save/restore/migrate?  Yes, it defines all its own stuff.  
Checkpoint-resume on a whole VM is a rather simpler operation than a 
subset of processes.

    J

--

From: Arnd Bergmann
Date: Friday, August 8, 2008 - 11:43 pm

Fair enough. How about making the layout in that structure identical to
the 64-bit pt_regs though? I don't know if we need that at any time,
but my feeling is that it is nicer than a slightly different random
layout, e.g. if someone wants to extend gdb to look at checkpointed
process dumps.

	Arnd <><
--

From: Dave Hansen
Date: Thursday, August 7, 2008 - 3:40 pm

For each vma, there is a 'struct cr_vma'; if the vma is file-mapped,
it will be followed by the file name.  The cr_vma->npages will tell
how many pages were dumped for this vma.  Then it will be followed
by the actual data: first a dump of the addresses of all dumped
pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next vma and so on.
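
The record order described above can be sketched as a serializer into a
memory buffer.  struct cr_vma_example, its fields other than npages,
the fname_len convention and the fixed 4096-byte page size are
assumptions for illustration; the real 'struct cr_vma' lives in the
patch.

```c
#include <stdint.h>
#include <string.h>

#define EX_PAGE_SIZE 4096u

/* Hypothetical per-vma header; only 'npages' is named in the description. */
struct cr_vma_example {
	uint64_t start, end;
	uint32_t npages;	/* pages actually dumped for this vma */
	uint32_t fname_len;	/* 0 when the vma is not file-mapped */
};

/*
 * Append one vma record: header, optional file name, npages addresses,
 * then npages pages of data.  Returns bytes written, or 0 if it won't fit.
 */
size_t cr_put_vma(uint8_t *buf, size_t cap, const struct cr_vma_example *hdr,
		  const char *fname, const uint64_t *addrs,
		  const uint8_t *pages)
{
	size_t addrs_len = (size_t)hdr->npages * sizeof(uint64_t);
	size_t pages_len = (size_t)hdr->npages * EX_PAGE_SIZE;
	size_t need = sizeof(*hdr) + hdr->fname_len + addrs_len + pages_len;
	uint8_t *p = buf;

	if (need > cap)
		return 0;

	memcpy(p, hdr, sizeof(*hdr));
	p += sizeof(*hdr);
	if (hdr->fname_len) {
		memcpy(p, fname, hdr->fname_len);
		p += hdr->fname_len;
	}
	if (hdr->npages) {
		memcpy(p, addrs, addrs_len);	/* all addresses first... */
		p += addrs_len;
		memcpy(p, pages, pages_len);	/* ...then all page contents */
		p += pages_len;
	}
	return (size_t)(p - buf);
}
```

Keeping the addresses together ahead of the data lets a reader learn
where every page belongs before it touches any page contents.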

I guess I could also separate out the x86-specific bits here, but
they're pretty small, comparatively.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---

 linux-2.6.git-dave/arch/x86/kernel/ldt.c  |    2 
 linux-2.6.git-dave/ckpt/Makefile          |    2 
 linux-2.6.git-dave/ckpt/ckpt_arch.h       |    2 
 linux-2.6.git-dave/ckpt/ckpt_hdr.h        |   21 +
 linux-2.6.git-dave/ckpt/ckpt_mem.c        |  388 ++++++++++++++++++++++++++++++
 linux-2.6.git-dave/ckpt/ckpt_mem.h        |   32 ++
 linux-2.6.git-dave/ckpt/rstr_mem.c        |  354 +++++++++++++++++++++++++++
 linux-2.6.git-dave/ckpt/sys.c             |    3 
 linux-2.6.git-dave/ckpt/x86.c             |   83 ++++++
 linux-2.6.git-dave/include/asm-x86/ckpt.h |    5 
 linux-2.6.git-dave/include/asm-x86/desc.h |    3 
 11 files changed, 892 insertions(+), 3 deletions(-)

diff -puN arch/x86/kernel/ldt.c~memory_part arch/x86/kernel/ldt.c
--- linux-2.6.git/arch/x86/kernel/ldt.c~memory_part	2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kernel/ldt.c	2008-08-05 08:38:00.000000000 -0700
@@ -183,7 +183,7 @@ static int read_default_ldt(void __user 
 	return bytecount;
 }
 
-static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
+int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 {
 	struct mm_struct *mm = current->mm;
 	struct desc_struct ldt;
diff -puN ckpt/ckpt_arch.h~memory_part ckpt/ckpt_arch.h
--- linux-2.6.git/ckpt/ckpt_arch.h~memory_part	2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_arch.h	2008-08-05 08:37:29.000000000 -0700
@@ ...
From: Arnd Bergmann
Date: Friday, August 8, 2008 - 5:12 am

Another structure that is not 32/64 bit ABI safe on x86. It would be safe
if you reorder the members as

struct cr_hdr_mm {
	__u32 tag;	/* sharing identifier */
	__s16 map_count;
	__u16 pad; /* not actually needed, but better to make it explicit */
	__u64 start_code, end_code, start_data, end_data;
	__u64 start_brk, brk, start_stack;
	__u64 arg_start, arg_end, env_start, env_end;

same here.
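
The reason the reordering matters: inside a struct, i386 aligns 64-bit
integers to 4 bytes while x86-64 aligns them to 8, so a __u64 that
follows a lone __u32 lands at a different offset on each.  A sketch of
a compile-time check for the reordered layout, using <stdint.h> types
in place of the kernel's __u32/__u64 and a made-up struct name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the suggested layout, for checking only. */
struct cr_hdr_mm_example {
	uint32_t tag;		/* sharing identifier */
	int16_t  map_count;
	uint16_t pad;		/* explicit, so the first u64 sits at offset 8 */
	uint64_t start_code, end_code, start_data, end_data;
	uint64_t start_brk, brk, start_stack;
	uint64_t arg_start, arg_end, env_start, env_end;
};

/* These hold on both 32-bit and 64-bit x86 because nothing needs padding. */
static_assert(offsetof(struct cr_hdr_mm_example, start_code) == 8,
	      "first 64-bit member must start at offset 8");
static_assert(sizeof(struct cr_hdr_mm_example) == 8 + 11 * sizeof(uint64_t),
	      "no implicit padding expected");
```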


	Arnd <><
--

From: Dave Hansen
Date: Thursday, August 7, 2008 - 3:40 pm

From: Oren Laadan <orenl@cs.columbia.edu>

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file.

First create a template for both syscalls: they take a file descriptor
(for the image file) and flags as arguments. For sys_checkpoint the
first argument identifies the target container; for sys_restart it will
identify the checkpoint image.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>

---

 linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S |    2 ++
 linux-2.6.git-dave/include/asm-x86/unistd_32.h        |    2 ++
 2 files changed, 4 insertions(+)

diff -puN arch/x86/kernel/syscall_table_32.S~introduce_sys_checkpoint_and_sys_restore arch/x86/kernel/syscall_table_32.S
--- linux-2.6.git/arch/x86/kernel/syscall_table_32.S~introduce_sys_checkpoint_and_sys_restore	2008-08-07 15:38:04.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S	2008-08-07 15:38:04.000000000 -0700
@@ -326,3 +326,5 @@ ENTRY(sys_call_table)
 	.long sys_fallocate
 	.long sys_timerfd_settime	/* 325 */
 	.long sys_timerfd_gettime
+	.long sys_checkpoint
+	.long sys_restart
diff -puN include/asm-x86/unistd_32.h~introduce_sys_checkpoint_and_sys_restore include/asm-x86/unistd_32.h
--- linux-2.6.git/include/asm-x86/unistd_32.h~introduce_sys_checkpoint_and_sys_restore	2008-08-07 15:38:04.000000000 -0700
+++ linux-2.6.git-dave/include/asm-x86/unistd_32.h	2008-08-07 15:38:04.000000000 -0700
@@ -332,6 +332,8 @@
 #define __NR_fallocate		324
 #define __NR_timerfd_settime	325
 #define __NR_timerfd_gettime	326
+#define __NR_checkpoint		327
+#define __NR_restart		328
 
 #ifdef __KERNEL__
 
diff -puN Makefile~introduce_sys_checkpoint_and_sys_restore Makefile
_
--

From: Arnd Bergmann
Date: Friday, August 8, 2008 - 5:15 am

System calls should also be declared in include/linux/syscalls.h.

I guess you are aware that this implementation is not enough to
support 32 bit tasks on x86_64. In addition to the native 64-bit
code, you would also need the 32-bit compat code here.

	Arnd <><
--

From: Oren Laadan
Date: Friday, August 8, 2008 - 1:33 pm

Yes, of course. The current code does not attempt to do that yet.

Oren.
--

From: Arnd Bergmann
Date: Friday, August 8, 2008 - 2:25 am

Note that asm/unistd_32.h is not portable, you should use asm/unistd.h

Interface-wise, I would consider checkpointing yourself significantly
different from checkpointing some other thread. If checkpointing
yourself is the common case, it probably makes sense to allow passing
of pid=0 for this.

	Arnd <><
--

From: Dave Hansen
Date: Friday, August 8, 2008 - 11:06 am

I don't think it is the common case.  Probably now when we're screwing
around with it, but not in the future.  Do you think it is worth adding
the pid=0 handling?

-- Dave

--

From: Arnd Bergmann
Date: Friday, August 8, 2008 - 11:18 am

If it's the exception, probably not. Otherwise it would be a nice shortcut
to avoid having to do two system calls every time you write code using it.
Then again, there are probably not many programs calling it anyway, if it
gets encapsulated in some user space tool.

	Arnd <><
--

From: Oren Laadan
Date: Friday, August 8, 2008 - 12:44 pm

Hi,


Thanks. This is a proof of concept so all sorts of feedback are
definitely welcome. Some of the ideas and discussions are found
around:
   http://wiki.openvz.org/Containers/Mini-summit_2008
and the notes:
   http://wiki.openvz.org/Containers/Mini-summit_2008_notes
and the archives of the linux containers mailing list:
   https://lists.linux-foundation.org/pipermail/containers/
(August and July).

Several aspects of the implementation are still experimental and
I expect them to evolve with the feedback. In particular, expect
the specific user interface (syscalls) and the checkpoint image

The checkpoint/restart code is meant to checkpoint a whole container,
that is, to be able to save the state of multiple other tasks. The same
code can also be used to checkpoint yourself fairly easily with minimal
changes (see comments in the code about "in context" checkpoint/restart
that take care of this).

I suggest keeping the interface as is in the sense that the pid will
identify the target container (e.g. the pid of the init process of that
container).

Then, pid=0 would mean "the container to which I belong" if
you are inside a container (and therefore don't know the pid of the
init process there).

Finally, to checkpoint yourself, you would set a bit in the flags
argument to something like CR_CKPT_MYSELF. Such a flag will be needed
internally anyway to special-case self checkpoint where appropriate.
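
The proposed argument semantics could be sketched roughly as follows.
Only the flag name CR_CKPT_MYSELF comes from the text above; its value,
the enum, the struct and cr_resolve_target() are hypothetical.

```c
#include <sys/types.h>

#define CR_CKPT_MYSELF 0x1u	/* name from the proposal; value is made up */

enum cr_target_kind {
	CR_TARGET_CONTAINER,	 /* pid names the container's init process */
	CR_TARGET_OWN_CONTAINER, /* pid == 0: the container the caller is in */
	CR_TARGET_SELF,		 /* CR_CKPT_MYSELF: checkpoint the caller only */
};

struct cr_target {
	enum cr_target_kind kind;
	pid_t init_pid;		/* meaningful only for CR_TARGET_CONTAINER */
};

/* Decode (pid, flags) into a target.  Returns 0 on success, -1 on bad input. */
int cr_resolve_target(pid_t pid, unsigned int flags, struct cr_target *out)
{
	if (pid < 0)
		return -1;

	out->init_pid = 0;
	if (flags & CR_CKPT_MYSELF) {
		out->kind = CR_TARGET_SELF;
	} else if (pid == 0) {
		out->kind = CR_TARGET_OWN_CONTAINER;
	} else {
		out->kind = CR_TARGET_CONTAINER;
		out->init_pid = pid;
	}
	return 0;
}
```

In this sketch the self-checkpoint flag takes precedence over the pid
argument, which is one possible reading of the proposal.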

Comments are welcome.

Oren.

--
