Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

From: Tejun Heo
Date: Tuesday, November 2, 2010 - 2:35 pm

(cc'ing lkml too)
Hello,


The patch size itself isn't too big but I still think it's one scary
patch, mostly because of the breadth of the code checkpointing needs to
modify, and I suspect that is probably the biggest concern regarding
checkpoint-restart from an implementation point of view.

FWIW, I'm not quite convinced checkpoint-restart can be something
which can be generally useful.  In controlled environments where the
target application behavior can be relatively well defined and
contained (including actions necessary to rollback in case something
goes bonkers), it would work and can be quite useful, but I'm afraid
the states which need to be saved and restored aren't defined well
enough to be generally applicable.  Not only is it a difficult
problem, it actually is impossible to define a common set of states to
be saved and restored - it depends on each application.

As such, I have a difficult time believing it can be something generally
useful.  IOW, I think talking about its usage in complex environments
like common desktops is mostly handwaving.  What about X sessions,
network connections, states established in other applications via dbus
or whatnot?  Which files need to be snapshotted together?  What about
shared mmaps?  These questions are not difficult to answer in a generic
way, they are impossible.

There is a very distinctive difference between system wide
suspend/hibernation and process checkpointing.  Most programs are
already written with the conditions in mind which can be caused by
system level suspend/hibernation.  Most programs don't expect to be
scheduled and run in any definite amount of time.  There usually
are provisions for loss or failure of resources which are out of the
local system.  There are corner cases which are affected and those
programs contain code to respond to suspend/hibernation.  Please note
that this is about userland application behavior but not
implementation detail in the kernel.  It is a much more fundamental
property.

So, although ...
From: Christoph Hellwig
Date: Tuesday, November 2, 2010 - 2:47 pm

Thanks Tejun,

your writeup brought up a lot of the same issues that I see with
the in-kernel C/R.  Various C/R implementations that are entirely
in userspace or with limited kernel assistance have been in production
in HPC environments for years.  I think especially for these workloads
C/R is an extremely useful feature, and a standard implementation would
do Linux well.

But I think the "transparent" in-kernel one is the wrong approach.  It
tries to give the illusion that C/R will just work, while a lot of
things are simply not supported.  In this case whitelisting the allowed
state by requiring special APIs for all I/O (or even just standard
APIs as long as they are supported by the C/R lib you're linked against)
is the more pragmatic, and I think faithful approach.  In addition to
the amount of state not supported despite looking transparent the
other big problem with the patchset is that it saves the kernel internal
state which changes all the time from one release to another.  The
handwaving is that a userspace tool will solve it.  I'm pretty sure
that's not the case; it might solve a few cases but the general
version n to version m conversion is impossible to maintain.  Just look
at the problems qemu has migrating between just a handful of versions
of the relatively well (compared to random kernel state) defined vmstate
format.


--

From: Nathan Lynch
Date: Wednesday, November 3, 2010 - 6:47 pm

FWIW there are a couple of kernel-based C/R implementations (BLCR,

I think this is somewhat true of the implementation under consideration
here (although generally it should fail checkpoints that it can't
restart), but it needn't be true of all possible kernel-based

I don't think users will go for it.  They'll continue to use dodgy
out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting
their applications to a new library.  I think a C/R library is an
"ideal" solution, but it's one that nobody would use - especially in
HPC, unless the library somehow provides better performance.

The namespace/isolation features of Linux (CLONE_NEWPID et al) already
provide a pretty workable basis for creating tractably checkpoint-
and-restartable jobs, with a minimum of performance overhead and

Most of the objects that the patchset saves and restores are right at
the "border" of the user/kernel interface, and they're not apt to change
much quickly (e.g. vma start and end, task sigaltstack info).   The
patchset certainly isn't serializing deep internal state such as wait

With this I agree, though.  But if a change in kernel implementation
details forces an incompatible change in the checkpoint image format, is
that really a big deal?  Would it be so bad to say that a checkpoint
image may be restarted only on the same kernel version that created it?
With -stable or enterprise kernels I suspect the issue is unlikely to
come up.


--

From: Tejun Heo
Date: Thursday, November 4, 2010 - 12:36 am

Hello,


I hear that there are plans to integrate one of the userland
snapshotting implementations with an HPC workload manager.  ISTR the
combination to be condor + dmtcp but not sure.  I think things like
that make a lot of sense.  Scientists writing programs for HPC
clusters already work in given frameworks and what those applications
do and how to recover are pretty well confined/defined.  If you
integrate snapshotting with such frameworks, it becomes pretty easy
for both the admins and users.

I'll talk about other issues in the reply to Oren's email.

Thanks.

-- 
tejun
--

From: Gene Cooperman
Date: Thursday, November 4, 2010 - 9:04 am

Yes, we are working with Condor to have them validate DMTCP.  Time will tell.
							- Gene

--

From: Nathan Lynch
Date: Thursday, November 4, 2010 - 1:45 pm

If you look at the C/R implementations of those two projects you'll see
that they don't implement what I take to be hch's suggestion - a library
or platform with special-purpose APIs to which applications are ported
in order to gain C/R ability.  For all their good points, the projects
you mention do interposition for glibc's syscall wrappers and provide a
few optional hooks so apps can control certain aspects of C/R.


--

From: Matt Helsley
Date: Friday, November 5, 2010 - 11:48 pm

And even if they did, I don't think asking application developers to use
such a broad API -- one that requires special APIs for all I/O -- is
practical for many of the purposes outlined at kernel summit.
I think DMTCP is better off for not attempting to mandate such APIs.

How rare is it for an application or library to change the underlying
APIs it uses? How many applications have been ported say from Gnome to
KDE (or vice-versa) over the lifetime of the project? Relative to all
the other applications? I would hazard a guess that most were rewritten
rather than ported and that those that were ported are an utterly
insignificant fraction of what's out there.

It's much better to offer tools that, as much as possible, don't care
which APIs the applications use.

Cheers,
	-Matt Helsley
--

From: Oren Laadan
Date: Wednesday, November 3, 2010 - 9:34 pm

Hi Christoph,

I really wish you would have raised these concerns during the
ksummit or thereafter. I'm here (LPC) until Friday, and would be
happy to discuss any aspect of the linux-cr while at it (and if
needed can post a summary to the list).


The fact is that an in-kernel implementation can and does support
a significantly larger feature-set.

Linux-cr does not and will not support everything. Nearly all driver
devices won't be supported in the near future (but interested vendors
could build such functionality into their drivers!). Also, pseudo
file systems like sysfs, procfs, debugfs will at most get partial
support.

But apart from that, it really covers (or will soon) nearly everything.

"Transparent" means that applications don't know that they are
being checkpointed, nor do they need to cooperate. So linux-cr
is *completely* transparent to applications that are checkpointable.

Perhaps you can elaborate on the "state not supported despite

It is our experience that the format is pretty immune to changes
that occur to in-kernel (and not user/ABI visible) structures.
It mainly changes when we add new features - and I expect that
to happen less frequently once the patchset finds its way to the

The problem space is smaller, because we are aiming at a simpler
goal. We need to always know how to convert from version N to
version N+1. Then conversion from N to N+k is a series of these
conversions.
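
A minimal sketch of that chaining idea, in Python for brevity. The image
layout (a dict with a "version" key) and both converter functions are
invented for illustration; they are not linux-cr's actual image format
or tooling.

```python
from typing import Callable, Dict

# Hypothetical image representation: a plain dict carrying a "version" key.
Image = dict
Converter = Callable[[Image], Image]


def upgrade(image: Image, target: int, steps: Dict[int, Converter]) -> Image:
    """Apply N -> N+1 converters one after another until `target` is reached."""
    current = dict(image)  # work on a copy, leave the input untouched
    while current["version"] < target:
        version = current["version"]
        if version not in steps:
            raise ValueError(f"no converter from version {version}")
        current = steps[version](current)
        if current["version"] != version + 1:
            raise ValueError(f"converter for version {version} skipped a step")
    return current


# Two made-up single-step converters.
def v1_to_v2(img: Image) -> Image:
    out = dict(img, version=2)
    out["sigaltstack"] = None  # pretend v2 added a new field
    return out


def v2_to_v3(img: Image) -> Image:
    out = dict(img, version=3)
    # pretend v3 changed VMAs from (start, end) tuples to dicts
    out["vmas"] = [{"start": s, "end": e} for s, e in img["vmas"]]
    return out


STEPS = {1: v1_to_v2, 2: v2_to_v3}
```

Calling upgrade(image, 3, STEPS) on a version-1 image runs both steps in
order. Only adjacent-version converters ever have to be written, which is
the simplification being claimed; whether every such converter can be
maintained in practice is the point under dispute in this thread.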

QEMU has a broader goal: IIUC, both QEMU and KVM versions may
change, they are not tied to each other. So the problem is harder.

In linux-cr, the format is tied to the version of objects that
the kernel that outputs/inputs the data knows. That makes things
much simpler.

Oren.
--

From: Christoph Hellwig
Date: Thursday, November 4, 2010 - 7:25 am

Discussing technical topics with slides in a big room is utterly
pointless.  Just like during all the other such boring talks during KS I
was either asleep, working on something important or out of the room
doing the extended hallway track.

If you want to discuss invasive kernel changes with people do it by
email.  The chance that anyone is going to listen to you is a lot
higher.
--

From: Kapil Arya
Date: Wednesday, November 3, 2010 - 8:40 pm

(Sorry for resending the message; the last message contained some html
tags and was rejected by server)

We would like to thank the previous post for bringing up the topic
of kernel C/R versus userland C/R.  We are two of the developers of DMTCP
(userland checkpointing):  Distributed MultiThreaded CheckPointing .
	     http://dmtcp.sourceforge.net
We had waited to write to the kernel developers because we had wanted
to ensure that DMTCP is sufficiently robust before wasting the time of the
kernel developers.  This thread seems like a good opportunity to begin
a dialogue.

In fact, we only became aware of Linux kernel C/R this September.
Of course, we were aware of Oren Laadan's fine earlier work on ZapC
for distributed checkpointing using the Linux kernel (CLUSTER-2005).
We have a high respect for Oren Laadan and the other Linux C/R developers,
as well as for the developers of BLCR (a C/R kernel module with a userland
component that is widely used in HPC batch facilities).

By coincidence, when we became aware of Linux C/R, we were already in
the middle of development for a major new release of DMTCP (from version
1.1.x to 1.2.0).  We just finished that release.  Among other features,
this release supports checkpointing of GNU 'screen', and we have tested
screen in some common use cases (with vim, with emacs, etc.).  While it
supports ssh (e.g. checkpointing OpenMPI, which uses ssh), it doesn't yet
support _interactive_ ssh sessions.  That will come in the next release.

We believe that both Linux C/R and DMTCP are becoming quite mature, and
that in general, one can achieve good application coverage with either.

In our personal view, a key difference between in-kernel and userland
approaches is the issue of security.  The Linux C/R developers state

The previous posts also brought up the issue of external connections.
While DMTCP has been developed over six years, in the last year we
have concentrated especially on the issue of external connections.
While we've accumulated ...
From: Tejun Heo
Date: Thursday, November 4, 2010 - 1:05 am

Hello,


And please also don't top-post.  Being the antisocial egomaniacs we
are, people on lkml prefer to dissect the messages we're replying to,
insert insulting comments right where they would be most effective and

That's an interesting point but I don't think it's a dealbreaker.
Kernel CR is gonna require userland agent anyway and access control
can be done there.  Being able to snapshot w/o root privilege
definitely is a plus but it's not like CR is gonna be deployed on

Yeap, agreed.  There gotta be user agents which can monitor and
manipulate userland states.  It's a fundamentally nasty job, that of
collecting and applying application-specific workarounds.  I've only
glanced at the dmtcp paper so my understanding is pretty superficial.
With that in mind, can you please answer some of my curiosities?

* As Oren pointed out in another message, there are some things which
  could seem a bit too visible to the target application.  Like the
  manager thread (is it visible to the application or is it hidden by
  the libc wrapper?) and reserved signal.  Also, while it's true that
  all programs should be ready to handle -EINTR failure from system
  calls, it's something which is very difficult to verify and test and
  could lead to once-in-a-blue-moon head scratchy kind of failures.

  I think most of those issues can be tackled with minor narrow-scoped
  changes to the kernel.  Do you guys have things in mind which the
  kernel can do to make these things more transparent or safer?

* The feats dmtcp achieves with its set of workarounds are impressive
  but at the same time look quite hairy.  Christoph said that having a
  standard userland C-R implementation would be quite useful and IMHO
  it would be helpful in that direction if the implementation is
  modularized enough so that the core functionality and the set of
  workarounds can be easily separated.  Is it already so?
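
To make the -EINTR concern in the first point concrete, here is a minimal
Python sketch of the kind of retry logic an application (or a wrapper
around it) needs when a checkpoint signal interrupts a blocking call. The
helper name and the retry limit are assumptions for illustration, not
anything taken from DMTCP. Note that Python 3.5 and later already retry
most interrupted system calls internally (PEP 475), so the wrapper mainly
shows the pattern that C code has to implement by hand.

```python
import errno


def retry_on_eintr(call, *args, max_retries=100):
    """Run call(*args), re-issuing it whenever it fails with EINTR.

    Python raises InterruptedError (an OSError subclass) for errno EINTR.
    """
    for _ in range(max_retries):
        try:
            return call(*args)
        except InterruptedError:
            continue  # the call was interrupted by a signal; try again
    raise OSError(errno.EINTR, "still interrupted after retries")
```

Blind retrying is not always correct (a call with a timeout, for
instance, would need its remaining time recomputed), which is part of why
this class of bug is hard to test for.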

Thanks.

-- 
tejun
--

From: Gene Cooperman
Date: Thursday, November 4, 2010 - 9:44 am

This is a good point to clarify some issues.  C/R has several good
targets.  For example, BLCR has targeted HPC batch facilities, and
does it well.
    DMTCP started life on the desktop, and it's still a primary focus of DMTCP.
We worked to support screen on this release precisely so that advanced
desktop users have the option of putting their whole screen session
under checkpoint control.  It complements the core goal of screen:
If you walk away from a terminal, you can get back the session elsewhere.
If your session crashes, you can get back the session elsewhere.

These are also some excellent points for discussion!  The manager thread
is visible.  For example, if you run a gdb session under checkpoint
control (only available in our unstable branch, currently), then
the gdb session will indeed see the checkpoint manager thread.
So, yes.  We are not totally transparent, and a skilled user must
account for this.  There are analogies (the manager thread in the
original LinuxThreads, the rare misfortune of gdb to lose
track of the stack frames).
    We try to hide the reserved signal (SIGUSR2 by default, but the user can
configure it to anything else).  We put wrappers around system calls
that might see our signal handler, but I'm sure there are cases where
we might not succeed --- and so a skilled user would have to configure
to use a different signal handler.  And of course, there is the rare
application that repeatedly resets _every_ signal.  We encountered
this in an earlier version of Maple, and the Maple developers worked
Exactly right!  Excellent point.  Perhaps this gets down to philosophy,
and what is the nature of a bug.  :-)  In some cases, we have encountered
this issue.  Our solution was either to refuse to checkpoint within
certain system calls, or to check the return value and if there was
an -EINTR, then we would re-execute the system call.  This works again,
For the most part, we've always found a way to work within the current
design of the kernel.  We consider this ...
From: Tejun Heo
Date: Friday, November 5, 2010 - 2:28 am

Hello,


Call me skeptical but I still don't see, yet, it being a mainstream
thing (for average sysadmin John and proverbial aunt Tilly).  It
definitely is useful for many different use cases tho.  Hey, but let's

I don't think gdb seeing it is a big deal as long as it's hidden from

I'm probably missing something but can't you stop the application
using PTRACE_ATTACH?  You wouldn't need to hijack a signal or worry
about -EINTR failures (there are some exceptions but nothing really to
worry about).  Also, unless the manager thread needs to be always
online, you can inject manager thread by manipulating the target


I see.  I just thought that it would be helpful to have the core part
- which does per-process checkpointing and restoring and corresponds
to the features implemented by in-kernel CR - as a separate thing.  It
already sounds like that is mostly the case.

I don't have much idea about the scope of the whole thing, so please
feel free to hammer senses into me if I go off track.  From what I
read, it seems like once the target process is stopped, dmtcp is able
to get most information necessary from kernel via /proc and other
methods but the paper says that it needs to intercept socket related
calls to gather enough information to recreate them later.  I'm
curious what's missing from the current /proc.  You can map socket to
inode from /proc/*/fd which can be matched to an entry in
/proc/*/net/PROTO to find out the addresses and most socket options
should be readable via getsockopt.  Am I missing something?
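
The lookup described above can be sketched roughly as follows, in Python.
It only parses text: the link target of a /proc/<pid>/fd entry and the
contents of a /proc/<pid>/net/tcp-style table. The function names are
made up for this example, only IPv4 is handled, and a little-endian host
is assumed when decoding addresses.

```python
import re
import socket
import struct

SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def socket_inode(link_target):
    """Return the inode from a /proc/<pid>/fd/<n> link target, or None."""
    match = SOCKET_LINK.match(link_target)
    return int(match.group(1)) if match else None


def _decode_ipv4(hex_addr):
    """Decode 'AABBCCDD:PPPP' as printed in /proc/net/tcp."""
    ip_hex, port_hex = hex_addr.split(":")
    # Assumption: little-endian host, so the 32-bit value is byte-swapped.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_proc_net_tcp(text):
    """Map socket inode -> (local, remote) address pairs."""
    table = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 10:
            continue
        inode = int(fields[9])  # tenth column is the inode
        table[inode] = (_decode_ipv4(fields[1]), _decode_ipv4(fields[2]))
    return table
```

In use, the link target would come from os.readlink() on an fd entry and
the table text from reading the net/tcp file of the same process. This
recovers only the endpoints; options still need getsockopt, and data
queued in the kernel is not visible this way at all.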

I think this is why userland CR implementation makes much more sense.
Most of states visible to a userland process are rather rigidly
defined by standards and, ultimately, ABI and the kernel exports most
of that information to userland one way or the other.  Given the
right set of needed features, most of which are probably already
implemented, a userland implementation should have access to most
information necessary to checkpoint without resorting to too ...
From: Oren Laadan
Date: Friday, November 5, 2010 - 4:18 pm

This is an excellent example to demonstrate several points:

* To freeze the processes, you can use the (quote) "hairy" signal
overload mechanism, or the even more hairy ptrace; both, by the way,
have their performance problems with many processes/threads.
Or you can use the in-kernel freezer-cgroup, and forget about 
workarounds, like linux-cr does. And ~200 lines in said diff
are dedicated exactly to that.

* Then, because both the workaround and the entire philosophy
of MTCP c/r engine is that affected processes _participate_ in
the checkpoint, their syscalls _must_ be interrupted. In contrast, the
linux-cr kernel approach not only allows checkpointing processes
without collaboration, but also builds on the native signal
handling kernel code to restart the system calls (both after
unfreeze, and after restart), such that the original process

Aha ... another great example: yet another piece of the suspect
diff in question is dedicated to allow a restarting process to
request a specific location for the vdso.

BTW, a real security expert (and I'm not one...) may argue that
this operation should only be allowed to privileged users. In fact,
if your code gets around the linux ASLR mechanisms, then someone

FWIW, the restart portion of linux-cr is designed with this in
mind - it is flexible enough to accommodate for smart userspace
tools and wrappers that wish to muck with the processes and
their resources post-restart (but before the processes resume
execution). For example, a distributed checkpoint tool could,
at restart time, reestablish the necessary network connections
(which is much different than live migration of connections,
and clearly not a kernel task). This way, it is trivial to migrate
a distributed application from one set of hosts to another, on

So you'll need mechanisms not only to read the data at checkpoint
time but also to reinstate the data at restart time. By the time
you are done, the kernel all the c/r code (the suspect diff in
question _and_ the rest of the ...
From: Tejun Heo
Date: Saturday, November 6, 2010 - 3:13 am

Hello,


The above problems can be solved for userland C/R with small
self-contained modifications to a small part of the kernel.  You're
insisting that because currently some obscure corner cases aren't
handled, the whole thing should be shoved in the kernel and the kernel
should be serializing and deserializing its internal data structures
for everything visible in the userland.  That's silly at best.  Note
the "visible in the userland" part.  Most of those parts are already
discoverable without further modifications to kernel.  The only sane
approach would be to add missing pieces which would not only benefit CR
but other applications too.

Also, you said the patches didn't have to change much because the data
structures facing userland didn't change much over different kernel
versions, which of course is true as it's so close to the userland
visible ABI.  That is _NOT_ a selling point for kernel CR.  That's a
BIG GLOWING SIGN telling you that you're on the frigging wrong side of

ASLR is to protect a program from itself not from outside.  If you can

Yeap, that was the reason why I asked how modularized that part of
dmtcp was as it would directly compare with the in-kernel
implementation.  If they can be well separated, I think it would even
be possible to switch between the two while keeping the upper set of

Unfortunately, for most things which matter, everything is already in
place and if you just concentrate on the core part the hackiness seems
quite manageable and I think it wouldn't be too difficult to reduce it
further.  I don't see why userland implementation wouldn't be able to
snapshot any random process without LD_PRELOADs or whatever
cooperation from it.  And, if the COW thing is so important, we can
collect the information and export it to userland via proc or
ringbuffer.  That's what qemu-kvm would need anyway, right?  I don't
think kvm guys would be so crazy as putting the whole snapshotter into

No, that's primarily not the feature of kernel CR.  It's of ...
From: Kapil Arya
Date: Friday, November 5, 2010 - 5:36 pm

In fact CryoPid uses exactly the same approach and has been around for around 5
years. Not as much development effort has gone into CryoPid as DMTCP and so its
application coverage is not as broad. But the larger issue for using PTRACE is
that you can not have two superiors tracing the same inferior process. So if you
want to checkpoint a gdb session or valgrind or tmux or strace, then you can not
directly control and quiesce the inferior process being traced.

Beyond that, we also have a vision (not yet implemented) of process
virtualization by which one can change the behavior of a program. For example,
if a distributed computation runs over infiniband, can we migrate to a TCP/IP
cluster. For this, one needs the flexibility of wrappers around system calls.
This vision of process virtualization also motivates why our own research

Yes, we would love to elaborate :-). We began DMTCP with Linux kernel 2.6.3.
When Address Space Layout Randomization was added, we were forced to add some
hacks concerning VDSO location and end-of-data. end-of-data is the uglier part.
On restart, we directly map each memory segment into the original address at
checkpoint time. The issue comes in mapping heap back to its original location.
We call sbrk() to reset the end-of-data to the end of the original heap. This
fails if the randomized beginning-of-data/end-of-data given to us by the kernel
for the restarted process is too far away from where we want to remap the heap.
To get around this, we play games with legacy layout, other personality
parameters, and RLIMIT_STACK (since the kernel uses RLIMIT_STACK in choosing the
appropriate memory layout).
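
As a rough illustration of the layout problem, the Python sketch below
parses /proc/<pid>/maps-style text and reports how far the [heap] region
of a freshly started process sits from where it was at checkpoint time.
The function names are invented for this example. It does not reproduce
DMTCP's actual restart logic or the sbrk() step, and the thread does not
say how large a displacement is "too far".

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps-style text into a list of region dicts."""
    regions = []
    for line in text.splitlines():
        parts = line.split(None, 5)  # addr perms offset dev inode [name]
        if len(parts) < 5:
            continue
        start_hex, end_hex = parts[0].split("-")
        regions.append({
            "start": int(start_hex, 16),
            "end": int(end_hex, 16),
            "perms": parts[1],
            "name": parts[5].strip() if len(parts) > 5 else "",
        })
    return regions


def find_region(regions, name):
    """Return the first region with the given name (e.g. '[heap]'), or None."""
    for region in regions:
        if region["name"] == name:
            return region
    return None


def heap_displacement(saved_regions, current_regions):
    """Bytes between the checkpointed heap start and the current one."""
    saved = find_region(saved_regions, "[heap]")
    current = find_region(current_regions, "[heap]")
    if saved is None or current is None:
        return None
    return current["start"] - saved["start"]
```

The same parser also finds the [vdso] entry, the other region mentioned
above whose randomized placement has to be dealt with on restart.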

For our wish list, we would like a way of telling the kernel where to set
beginning-of-data/end-of-data. Curiously enough, at the time at which Linux
started randomizing address space, there was discussion of offering exactly this
facility for the sake of legacy programs, but it turned out not to be needed.

Similarly, it would be nice to tell the ...
From: Oren Laadan
Date: Saturday, November 6, 2010 - 3:55 pm

This is a very useful vision. However, it is unrelated to how you
do c/r, but rather to what you do after you restart and before you
let the application resume execution.

For example, in your example, you'd need to wrap the library calls
(e.g. of an MPI implementation) and replace them to use TCP/IP or
infiniband. Wrapping on system calls won't help you.

Or you could just replace the resource - e.g., make the restarted
application use a socket for stdout instead of the tty, so you can
redirect the output to wherever.
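
At the file-descriptor level, that kind of resource replacement comes
down to pointing an existing descriptor number at a different open file.
A minimal Python sketch of just that step, using os.dup2(); the function
name is made up, and how a restart tool would perform this inside the
restarted process before it resumes is deliberately left out.

```python
import os
import socket


def replace_fd_with_socket(target_fd, sock):
    """Make target_fd refer to the same open socket as `sock`.

    Whatever target_fd referred to before (a tty, a pipe, ...) is closed
    as a side effect of dup2(); code writing to that descriptor number
    afterwards sends its output into the socket.
    """
    os.dup2(sock.fileno(), target_fd)
```

For stdout this would be called with target_fd=1. The application keeps
writing to the same descriptor number and never needs to know that the
object behind it changed.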

Both methods are orthogonal to the c/r itself: linux-cr will allow
you to replace/modify resources if you so wish, and I suspect that
MTCP also can/will.

Interposing on library calls is possible with MTCP methods, or
using binary instrumentation, or PIN, or DynInst, or LD_PRELOAD.

The only two reasons to interpose on system calls, as I noted
in an earlier message (http://lkml.org/lkml/2010/11/5/262 - see
points "2)" and "3)" about userland-workarounds):

One - to virtualize in userspace resources (e.g. pids) that the
kernel already knows how to virtualize.

Two - to track state of resources during execution and lie about
their state when needed, because userspace can't cleanly save
and restore their state.

Virtualization through interposition is extremely tricky in and
out of the kernel. The examples given throughout this thread (by
either side) expose the tip of the iceberg. Interposition as a
technique is full of security and other pitfalls, as discussed
by extensive literature in the area (which I cited in another email).

So I'll repeat the question I asked there: is re-implementing
chunks of kernel functionality and all namespaces in userspace


What is "reasonable" overhead ?
For which applications ?
What about a 'kernel make' ?
What about servers (db, web, etc) ?
What about VPSs/VDIs ?
Can we do better, including for HPC ?

Exactly !  Wrapping around apps to isolate them from the environment
is desirable, regardless of how you ...
From: Gene Cooperman
Date: Sunday, November 7, 2010 - 12:42 pm

I'd like to add a few clarifications, below, about DMTCP concerning
Oren's comments.  I'd also like to point out that we've had about 100
downloads per month from sourceforge (and some interesting use cases
from end users) over the last year (although the sourceforge numbers
do go up and down :-) ).  In general, I think we'll all understand the
situation better after having had the opportunity to talk offline.
Below are some clarifications about DMTCP.

We do not put any wrappers around MPI library calls.  MPI calls things
like open, close, connect, listen, execve({"ssh", ...}, ...), etc.
At this time, DMTCP adds wrappers _only_ around calls to libc.so
and libpthread.so .  This is sufficient to checkpoint a distributed

Just a small correction about interposition.  The primary "Reason Two"
for interposing on system calls should be to _spy_ on what the user process
is doing and save that information.  For the most part, we do not
_lie about their state when needed_.  I agree that virtualization of pids
is an exception where we have to lie, but that was already stated as
"Reason One" above.  At restart time, we may also recreate resources that are
no longer in the kernel.  But this is not an example of interposition.
I suppose that it is an example of lying, but every C/R technique will
need to do this.
    Later, perhaps Oren, Kapil and I can browse the DMTCP code together,
and we can look exactly at what each wrapper is doing.  The system call
wrappers are, in fact, the smaller part of the DMTCP code.  It's about
3000 lines of code.  For anybody who is curious about what our wrappers do,
please download the DMTCP source code, and look at

If you're referring to interposition here, that takes place essentially
in the wrappers, and the wrappers are only 3000 lines of code in DMTCP.
Also, I don't believe that we're "re-implementing chunks of kernel


I still haven't understood why you object to the DMTCP use of LD_PRELOAD.
How will the user app ever know that we used LD_PRELOAD, ...
From: Oren Laadan
Date: Sunday, November 7, 2010 - 2:30 pm

Of course. And you don't need syscall virtualization for this.
Zap did it already many years ago :)  Only problem with the above
is that, conveniently enough, you _left out_ the context:

 >> For example,
 >> if a distributed computation runs over infiniband, can we migrate to
 >> a TCP/IP cluster. For this, one needs the flexibility of wrappers
 >> around system calls.

Do you also support checkpointing a distributed app that uses an
infiniband MPI stack and restart it with a TCP based MPI stack ?
Can you do it with only syscall wrapping and without knowledge
on the MPI implementation and some MPI-specific logic in the
wrappers ?   I'm curious how you do that without wrapping around
MPI calls, or without an c/r-aware implementation of MPI.

Again, this is unrelated to how you do the core c/r work. I think
we both agree that _this_ kind of app-wrappers/app-awareness is
useful for certain uses of c/r.


The interposition itself is relatively simple (though not atomic).
The problem is the logic to "spy" on and "lie" to the applications.
Examples: saving ptrace state, saving FD_CLOEXEC flag, correctly
maintaining a userspace pid-ns, etc.


I don't object to it per se - it's actually pretty useful oftentimes.
But in our context, it has limitations. For example, it does not
cover static applications, nor apps that call syscalls directly
using int 0x80. Also, it conflicts with LD_PRELOAD possibly needed
for other software (like valgrind) - for which again you would need

I mean that the application needs to be scheduled and to run to
participate in its own checkpoint. You use syscall interposition
and signal games to do exactly that - gain control over the app
and run your library's code. This has at least three negatives:
first, some apps don't want to or can't run - e.g. ptraced, or
swapped (think incremental checkpoint: why swap everything in ?!);
Second, the coordination can take significant time, especially if
many tasks/threads and resources are involved; Third, it ...
From: Gene Cooperman
Date: Sunday, November 7, 2010 - 4:05 pm

Yes, that's exactly what we plan to do.  And we have begun some of the
initial work.  And yes, we plan to do it without any MPI-specific logic.
When we talk to each other offline, I'd be happy to give you more
details of how we do it now for TCP "without wrapping around MPI calls,
or without an c/r-aware implementation of MPI", and how we are working

And let's wait for the offline discussion for that --- and we'll describe
in detail at that time how we do each one of the things that you mention.
It will be easier to discuss each of the things that you mention by
looking at the DMTCP code "side-by-side" over the phone.  We hope to

For static apps, we would use other interposition techniques.  And yes,
we haven't implemented support of static apps so far, because our
user base hasn't asked for it.  We do handle apps that use the
syscall system call to make system calls.  We don't handle apps
that directly use "int 0x80".  Again, there are ways to do this, but
our user base hasn't asked for it.
    In general, please keep in mind the principles that you rightly had
to remind me of in a previous post.  :-)  Our two pieces of work are coming
from two different directions with two different visions.  Linux C/R wants
to be so transparent that no user app can ever detect it.  DMTCP wants to be
transparent enough that any reasonable use case is covered.
    In particular, DMTCP considers distributed computations to be equally
valid use cases for the core DMTCP C/R.  I also agree that Linux C/R can be
extended to cover distributed apps -- either through userland extensions,
or maybe with techniques like in your excellent CLUSTER-2005 paper.
     Hence, DMTCP has grown its coverage of apps over the years.  When we
talk offline, let's talk about future use cases, and whether there are

DMTCP does not conflict with the fact that valgrind uses LD_PRELOAD.
We add dmtcphijack.so to the beginning of LD_PRELOAD before the user app
starts.  We then remove it before the app really starts.  The ...
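
A minimal sketch of the LD_PRELOAD handling described above, in Python.
It is an illustration only, not DMTCP's code: the helper names
prepend_preload() and strip_preload() are hypothetical, and the only
detail taken from the post is the library name dmtcphijack.so.  The
point is that other preloaded libraries (valgrind's, for example) are
left in place and in order.

```python
HIJACK_LIB = "dmtcphijack.so"  # library name as given in the post


def _entries(value):
    # ld.so accepts both colons and spaces as LD_PRELOAD separators.
    return value.replace(":", " ").split()


def prepend_preload(env, lib=HIJACK_LIB):
    """Return a copy of env with lib placed first in LD_PRELOAD."""
    new_env = dict(env)
    others = [e for e in _entries(new_env.get("LD_PRELOAD", "")) if e != lib]
    new_env["LD_PRELOAD"] = ":".join([lib] + others)
    return new_env


def strip_preload(env, lib=HIJACK_LIB):
    """Return a copy of env with lib removed from LD_PRELOAD."""
    new_env = dict(env)
    others = [e for e in _entries(new_env.get("LD_PRELOAD", "")) if e != lib]
    if others:
        new_env["LD_PRELOAD"] = ":".join(others)
    else:
        # Nothing else was preloaded, so drop the variable entirely.
        new_env.pop("LD_PRELOAD", None)
    return new_env
```

With these helpers, a launcher would pass prepend_preload(os.environ) to
the program it starts, and the injected library would apply the
equivalent of strip_preload() before the application proper runs.
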
From: Oren Laadan
Date: Sunday, November 7, 2010 - 8:55 pm

On 11/07/2010 06:05 PM, Gene Cooperman wrote:


Agreed - as long as we are considering the c/r-engine functionality
(and not the "glue" logic to keep apps outside their context after
the restart).

That said, I'm afraid we'll need more definitions to what is "reasonable"

Distributed c/r is one of the proposed use-cases for linux-cr.

The technique in that paper, BTW, was a userspace glue: during
restart, that glue re-establishes connectivity by using new TCP 
connections, and c/r uses those new sockets in lieu of restoring
the old ones.

For that and other use-cases we designed linux-cr to be flexible



Wrappers are great (I did TA the w4118 class here...). They are
a powerful tool; however in _our_ context they have downsides:
(a) wrappers add visible overhead (less so for cpu-bound apps,
more so with server apps)
(b) wrappers that do virtualization to a "black-box" API (as
opposed to integrate with the API) are prone to races (see the
paper that I cited before)
(c) wrappers duplicate kernel logic, IMHO unnecessarily (and I
don't refer to the userspace "glue" from above)
(d) wrappers are hard to make hermetic (no escapes) to apps.

IMO, the one excellent reason to use wrappers is to support
the userspace glue that allows restarted apps to run out of

I clearly failed to explain well. Lemme try again:

If you use PTRACE to checkpoint, then you ptrace the target tasks,
peek at and save their state, and then let them resume execution.
The target apps need not collaborate - they are forced by the kernel
to the ptraced state regardless of what they were doing, and resume
execution without knowing what happened.
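
A rough illustration of the "targets need not collaborate" point,
assuming a Linux/POSIX host.  Note that this is not ptrace: it uses
SIGSTOP/SIGCONT on a child process as a simpler stand-in, because the
target likewise cannot refuse to stop and runs no checkpoint code of its
own.  The function names are hypothetical, and waitpid() works here only
because the target is our own child.

```python
import os
import signal
import subprocess
import sys


def freeze_inspect_resume(pid):
    """Stop a child process, confirm it is stopped, then resume it."""
    os.kill(pid, signal.SIGSTOP)              # target cannot block or ignore this
    _, status = os.waitpid(pid, os.WUNTRACED)  # returns once the child has stopped
    stopped = os.WIFSTOPPED(status)
    # A real checkpointer would save registers, memory, fds, ... at this point.
    os.kill(pid, signal.SIGCONT)              # target continues, unaware
    return stopped


def demo():
    """Run the freeze/resume cycle against a sleeping child process."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        return freeze_inspect_resume(child.pid)
    finally:
        child.kill()
        child.wait()
```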

In linux-cr it works similarly: checkpoint does not require that
the processes be scheduled to run - they don't participate; rather,
external process(es) do the work.

In contrast, IIUC, dmtcp uses syscall wrappers and overloading of
signal(s) in order to make every checkpointed process/thread actively
execute the checkpoint logic. I refer to this as ...
From: Gene Cooperman
Date: Monday, November 8, 2010 - 9:26 am

As before, Oren, let's have that phone discussion so that we can preprocess
a lot of this, instead of acting like the three blind men and the
elephant.  I will _tell you_ the strengths and weaknesses of DMTCP
on the phone, instead of you having to guess at them here on LKML.  And
of course, I hope you will be similarly frank about Linux C/R on the phone.

Thank you for lowering the heat on this last post.  I'll reply only to
some relevant issues in this post, rather than trying to respond to all
of your posts.  I remind you that I still have my own questions about
Linux C/R, but I'm saving them for the phone discussion, since that will

In our experience, the primary overhead of C/R is to save the
data to disk.  This far outweighs the question of how many ms

The paper you cited was:
  http://www.stanford.edu/~talg/papers/traps/abstract.html
  Traps and Pitfalls: Practical Problems in System Call
     Interposition Based Security Tools
  That paper is about Sandboxing.  DMTCP is about C/R.  If DMTCP was trying
to do a sandbox, it might have some of the same traps and pitfalls.
Luckily, userland C/R is a _lot_ easier than userland sandboxing.
By the way, although of less importance, I'll point out that the paper
was written in 2003, before DMTCP even started.
    Next, you talk about races.  The authors of that paper have races
because they are trying to do sandboxing.  I already answered Matt's
post earlier about why we don't see races in DMTCP.
I'll answer it again, but in more detail.
    At ordinary run-time, the DMTCP checkpoint thread is just waiting
on a select -- waiting for instructions from the DMTCP coordinator.
Our system call wrappers around user threads do not change the issue
of races.  If two user threads used to have a race, they will continue
to do so in DMTCP.  If two user threads did not have a race, then
DMTCP will not introduce any new races.  How should DMTCP introduce
a new race when DMTCP wrappers _never_ communicate with any other thread?
    At ...
From: Oren Laadan
Date: Monday, November 8, 2010 - 11:14 am

Hi,

Ok, I'll bite the bullet for now - to be continued...


VMware, Xen and KVM already do live migration. However, VMs
are a separate beast.

We are concerned about _application_ level c/r and migration
(complete containers or individual applications). Many proven
techniques from the VM world apply to our context too (in your
example, post-copy migration).

Oren.
--

From: Gene Cooperman
Date: Monday, November 8, 2010 - 11:37 am

Thanks for the careful response, Oren.  For others who read this,
one could interpret Oren's rapid post as criticizing the work of
Andres Lagar Cavilla.  I'm sure that this was not Oren's intention.
Please read below for a brief clarification of the novelty of SnowFlock.
    Anyway, I really look forward to the phone discussion.  I've also
enjoyed our interchange, for giving me an opportunity to explain more about
the DMTCP design.  Thank you.
                                                        Best wishes,
                                                        - Gene


I absolutely agree with your point that live migration of
applications is a different beast, and technically very novel.
    Since I know Andres Lagar Cavilla personally, I also feel obligated
to comment why SnowFlock truly is novel in the VM space.  First, as Andres
writes:
"SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3
VMM [Barham 2003]."
In the abstract, Andres points out one of the major points of novelty:
"To evaluate SnowFlock, we focus on the demanding
scenario of services requiring on-the-fly creation of hundreds
of parallel workers in order to solve computationally-intensive
queries in seconds."
We must be careful that we don't destroy someone's reputation without
--

From: Oren Laadan
Date: Monday, November 8, 2010 - 12:34 pm

Err... yes, that was careless of me. I was too focused on

Yes, it's really nice work - I saw it when I visited there.
(Coincidentally the post-copy idea with Xen appeared also in
VEE 09 briefly before).

Oren.

--

From: Dan Smith
Date: Monday, November 8, 2010 - 12:05 pm

GC> As before, Oren, let's have that phone discussion so that we can
GC> preprocess a lot of this, instead of acting like the three
GC> blind men and the elephant.  I will _tell you_ the strengths and
GC> weaknesses of DMTCP on the phone, instead of you having to guess
GC> at them here on LKML.  And of course, I hope you will be similarly
GC> frank about Linux C/R on the phone.

I want to be in on that discussion too, as do a lot of other people
here.  However, I doubt we'll all be able to find a common spot on our
collective schedules, nor would that conversation be archived for
posterity.  I think sticking to LKML is the right (and time-tested)
approach.

OL> Linux-cr can do live migration - e.g. VDI, move the desktop - in
OL> which case skype's sockets' network stacks are reconstructed,
OL> transparently to both skype (local apps) and the peer (remote
OL> apps).  Then, at the destination host and skype continues to work.

GC> That's a really cool thing to do, and it's definitely not part of
GC> what DMTCP does.  It might be possible to do userland live
GC> migration, but it's definitely not part of our current scope.

How would you go about doing that in userland?  With the current
linux-cr implementation, I can move something like sshd or sendmail
from one machine to another without a remote (connected) client
noticing anything more than a bit of delay during the move.

I think that saving and restoring the state of a TCP connection from
userland is probably a good example of a case where it makes sense to
have it as part of a C/R function, but not necessarily exposed in /sys
or /proc somewhere.  Unless it can be argued that doing so is not
useful, I think that's a good talking point for discussing the kernel
vs. user approach, no?

-- 
Dan Smith
IBM Linux Technology Center
--

From: Tejun Heo
Date: Wednesday, November 17, 2010 - 4:14 am

Hello,



Meh, just implementing a conntrack module should be good enough for
most use cases.  If it ever becomes a general enough problem (which I
extremely strongly doubt), we can think about allowing processes in a
netns to change sequence number but that would be a single setsockopt
option instead of the horror show of dumping in-kernel data structures
in binary blob.

Thanks.

-- 
tejun
--

From: Dan Smith
Date: Wednesday, November 17, 2010 - 8:33 am

TH> If it ever becomes a general enough problem (which I extremely
TH> strongly doubt),

Migration of a container?  Yeah, it's one of the primary reasons for
doing what we're doing :)

TH> we can think about allowing processes in a netns to change
TH> sequence number but that would be a single setsockopt option

Yeah, well there's more than that, of course, if you want to be able
to checkpoint a socket in any state.  Buffers, time-wait, etc.

TH> instead of the horror show of dumping in-kernel data structures in
TH> binary blob.

Well, as should be evident from a review of the code, we don't dump
binary kernel data structures as a general rule.  We canonicalize them
into checkpoint headers on the way out and build the new data
structures (or use existing kernel interfaces to do so) on the way in.
You know, just like netlink does.
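
To make "canonicalize into checkpoint headers" concrete, here is a small
sketch in Python.  The record layout below is entirely hypothetical and
is not the linux-cr image format: it only shows the general idea of
writing selected fields in a fixed, explicit layout on the way out and
rebuilding an object from them on the way in, rather than copying a raw
in-memory structure.

```python
import struct

# Hypothetical record header: type (u32), payload length (u32), little-endian.
HDR = struct.Struct("<II")
# Hypothetical socket payload: family (u16), state (u16), backlog (u32).
SOCK = struct.Struct("<HHI")
CKPT_TYPE_SOCK = 1


def dump_sock(family, state, backlog):
    """Serialize selected socket fields into a typed, length-prefixed record."""
    payload = SOCK.pack(family, state, backlog)
    return HDR.pack(CKPT_TYPE_SOCK, len(payload)) + payload


def load_sock(blob):
    """Validate the header, then rebuild the fields from the payload."""
    rec_type, length = HDR.unpack_from(blob, 0)
    if rec_type != CKPT_TYPE_SOCK or length != SOCK.size:
        raise ValueError("unexpected record type or length")
    family, state, backlog = SOCK.unpack_from(blob, HDR.size)
    return {"family": family, "state": state, "backlog": backlog}
```

Because the reader checks type and length before touching the payload, a
record from an incompatible writer is rejected instead of being
reinterpreted as some other structure.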

It has even been suggested that we do this with netlink instead, to
mirror the other "horror show" tools that we all use on a daily basis.
We're not opposed to this, but we do have some concerns about
performance.

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com
--

From: Tejun Heo
Date: Wednesday, November 17, 2010 - 8:40 am

Hello,


Well, then push for the feature.  If the rationale is strong enough,

I haven't really thought about it too deeply but for all other misc
states, you should be able to emulate it by talking to a netfilter
module.  The reason why I suggested sequence number changing setsockopt
option is because that is the only performance sensitive part and with
that you should be able to resume live sockets without conntracking.


The horror show part is dumping internal data structure without due
scrutinization in a way which can only ever be useful for CR when most
of the same states are already exported via ABI defined ways.

Thanks.

-- 
tejun
--

From: Alexey Dobriyan
Date: Wednesday, November 17, 2010 - 10:04 am

That's what review process is for, isn't it?

Please, look at what is being dumped and what isn't.
--

From: Tejun Heo
Date: Wednesday, November 17, 2010 - 3:45 am

Hello, sorry about the long delay.  Was lost in something else.


I've been thinking about this.  We can easily introduce a new ptrace
call which allows nesting.  AFAICS, ptrace already exports most of
information necessary to restart the task - where it's stopped and
why.  The only missing thing seems to be the wait state (including for
group stop) which can be added without too much difficulty.  I'll try
to write up an RFC patch.  Things like that would be useful for other
things too - say, you would be able to attach gdb to a strace'd

Yeah, definitely, for the higher level workarounds, there's no way
around it but I think it would still be worthwhile to be able to
provide a baseline implementation which can checkpoint and restart a


I haven't really looked at the VDSO generation but symbol offsets
inside VDSO page can differ depending on kernel version,
configuration, toolchains used, etc... right?  You would need an extra

I wrote in another mail but you can find out which fd's are shared by
flipping O_NONBLOCK and looking at the flags field of
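
The O_NONBLOCK trick can be sketched as follows, in Python.  The excerpt
is cut off before it says where the flags are read back, so this sketch
makes an assumption: it stays inside one process and reads them with
fcntl(F_GETFL).  The function name shares_open_file() is hypothetical.
The underlying fact is that file status flags belong to the open file
description, so flipping one on a descriptor is visible through every
descriptor that shares that description.

```python
import fcntl
import os


def shares_open_file(fd_a, fd_b):
    """True if fd_a and fd_b refer to the same open file description."""
    orig_a = fcntl.fcntl(fd_a, fcntl.F_GETFL)
    before_b = fcntl.fcntl(fd_b, fcntl.F_GETFL)
    try:
        # Flip O_NONBLOCK on fd_a only ...
        fcntl.fcntl(fd_a, fcntl.F_SETFL, orig_a ^ os.O_NONBLOCK)
        # ... and see whether fd_b's view of the flags changed too.
        after_b = fcntl.fcntl(fd_b, fcntl.F_GETFL)
    finally:
        # Always restore the original flags.
        fcntl.fcntl(fd_a, fcntl.F_SETFL, orig_a)
    return bool((before_b ^ after_b) & os.O_NONBLOCK)
```

A dup()'ed descriptor reports as shared, while the two ends of a pipe do
not, since each end has its own open file description.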

As a few others have already pointed out, I think it's better to keep
technical discussions on-line.  Different people think at different
paces and the schedules don't always match.  Plus, other people can
jump in and look up things later.  It may take a bit more effort at
the beginning but I think it gets easier in time.

Thank you.

-- 
tejun
--

From: Tejun Heo
Date: Wednesday, November 17, 2010 - 5:12 am

Ooh, one more thing, /proc/*/net/* has tx/rx queue counts.  With
those, you wouldn't need the cookie based connection draining, right?
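
For reference, a short sketch of reading those queue counts, in Python.
It assumes the usual /proc/net/tcp row layout, where the fifth
whitespace-separated field is tx_queue:rx_queue in hexadecimal.  The
function names are hypothetical and the SAMPLE row is made up for
illustration, not captured from a real system.

```python
def parse_queues(line):
    """Return (tx_queue, rx_queue) from one data row of /proc/net/tcp."""
    fields = line.split()
    tx_hex, rx_hex = fields[4].split(":")
    return int(tx_hex, 16), int(rx_hex, 16)


def socket_is_drained(line):
    """True when the row reports nothing queued in either direction."""
    tx, rx = parse_queues(line)
    return tx == 0 and rx == 0


# Made-up example row: 0x10 bytes queued for transmit, nothing to receive.
SAMPLE = (
    "   0: 0100007F:1F90 0100007F:C350 01 00000010:00000000 "
    "00:00000000 00000000  1000        0 12345 1 0000000000000000 20 4 30 10 -1"
)
```

A checkpointer following this suggestion would poll until
socket_is_drained() holds for each connection of interest before saving
state.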

Thanks.

-- 
tejun
--

From: Matt Helsley
Date: Friday, November 5, 2010 - 10:32 pm

Rightly so. It hasn't been widely proven as something that distros
would be willing to integrate into a normal desktop session. We've got
some demos of it working with VNC, twm, and vim. Oren has his own VNC,
twm, etc demos too. We haven't looked very closely at more advanced
desktop sessions like (in no particular order) KDE or Gnome. Nor have
we yet looked at working with any portions of X that were meant to provide
this but were never popular enough to do so (XSMP iirc).

Does DMTCP handle KDE/Gnome sessions? X too?

On the kernel side of things for the desktop, right now we think our
biggest obstacle is inotify. I've been working on kernel patches for
kernel-cr to do that and it seems fairly do-able. Does DMTCP handle
restarting inotify watches without dropping events that were present
during checkpoint?

The other problem for kernel c/r of X is likely to be DRM. Since the
different graphics chipsets vary so widely there's nothing we can do
to migrate DRM state of an NVIDIA chipset to DRM state of an ATI chipset
as far as I know. Perhaps if that would help hybrid graphics systems
then it's something that could be common between DRM and
checkpoint/restart but it's very much pie-in-the-sky at the moment.

kernel c/r of input devices might be a lot easier. We just simulate
hot [un]plug of the devices and rely on X responding. We can even
checkpoint the events X would have missed and deliver them prior to hot
unplug.

Also, how does DMTCP handle unlinked files? They are important because
lots of processes open a file in /tmp and then unlink it. And that's not
even the most difficult case to deal with. How does DMTCP handle:

link a to b
open a (stays open)
rm a
<checkpoint and restart>
open b
write to b
read from a (the write must appear)
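
The sequence above, minus the checkpoint and restart step, can be
reproduced with a short Python sketch (the function name is
hypothetical).  It shows the behaviour any C/R implementation has to
preserve: the descriptor opened on "a" and the path "b" name the same
inode, so data written through "b" must be readable through the old
descriptor even after "a" has been unlinked.

```python
import os
import tempfile


def hardlink_scenario():
    """Run the link/open/unlink/write/read sequence; return the bytes read."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a")
        b = os.path.join(d, "b")
        with open(a, "w"):
            pass                         # create a
        os.link(a, b)                    # link a to b
        fd_a = os.open(a, os.O_RDONLY)   # open a (stays open)
        os.unlink(a)                     # rm a
        # <checkpoint and restart would happen here>
        with open(b, "w") as f:          # open b
            f.write("hello")             # write to b
        data = os.read(fd_a, 100)        # read from a
        os.close(fd_a)
        return data
```

After a restart, simply recreating "a" as a fresh file would break this:
the read would no longer see what was written through "b".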


Is the checkpoint control process hidden from the application? What
happens if it gets killed or dies in the middle of checkpoint? Can
a malicious task being checkpointed (perhaps for later analysis)


Wouldn't checkpoint and ...
From: Oren Laadan
Date: Saturday, November 6, 2010 - 8:01 am

Actually, I do have a demo of Zap (linux-cr predecessor) with a _full_
gnome desktop running under VNC with:
* a movie player,
* firefox,
* thunderbird,
* openoffice,
* kernel make,
* gdb debugging something,
* WINE with microsoft office (oops)
all of these checkpointed with < 25ms of downtime and resumed an
arbitrary time later, successfully.


At the very least userspace would need to interpose on all
inotify related syscalls to track (log) what the user did to
be able to redo it at restart. (And I'm sure there will be
crazy to impossible races and corner cases there).
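
A minimal sketch of the bookkeeping such interposition implies, in
Python.  It is not DMTCP or linux-cr code; the WatchLog class is
hypothetical and only records calls, where a real wrapper would also
forward them to inotify_add_watch() and inotify_rm_watch().  The two
mask constants carry their values from <sys/inotify.h>.

```python
IN_MODIFY = 0x00000002
IN_CREATE = 0x00000100


class WatchLog:
    """Records inotify-style watch calls so they can be redone at restart."""

    def __init__(self):
        self._watches = {}   # watch descriptor -> (path, mask)
        self._next_wd = 1

    def add_watch(self, path, mask):
        # A real wrapper would call inotify_add_watch() and log its result.
        wd = self._next_wd
        self._next_wd += 1
        self._watches[wd] = (path, mask)
        return wd

    def rm_watch(self, wd):
        # Forget watches the application removed before the checkpoint.
        self._watches.pop(wd, None)

    def replay(self, add_watch_fn):
        """Re-issue surviving watches; return a map of old wd to new wd."""
        return {
            wd: add_watch_fn(path, mask)
            for wd, (path, mask) in sorted(self._watches.items())
        }
```

The returned map is needed because the application still holds the old
watch descriptors.  Note that this log covers only the watch set; events
that were queued but unread at checkpoint time, which is what the
question above asks about, are not captured by it.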

Does it make sense to replicate in userspace everything already done

DRM is hardware, and is complex for both userspace and kernel. Let's
assume it isn't supported until it's properly virtualized.

(In the long-long run, I'd envision hardware manufacturers providing
c/r support within their drivers - e.g. checkpoint() and restart()
kernel methods. But that's only if they care about it, and in any

[snip]

Oren.
--

From: Gene Cooperman
Date: Saturday, November 6, 2010 - 1:40 pm

By the way, Oren, Kapil and I are hoping to find time in the next few
days to talk offline.  Apparently the Linux C/R and DMTCP had continued
for some years unaware of each other.  We appreciate that a huge amount
of work has gone into both of the approaches, and so we'd like to reap
the benefit of the experiences of the two approaches.  We're still learning
more about each others' approaches.  Below, I'll try to answer as best
I can the questions that Matt brings up.  Since Matt brings up _lots_
of questions, and I add my own topics, I thought it best to add a table
of contents to this e-mail.  For each topic, you'll see a discussion
inline below.

1.  Distros, checkpointing a desktop, KDE/Gnome, X
  [ Trying to answer Matt's question ]

2.  Directly checkpointing a single X11 app
  [ Our own preferred approach, as opposed to checkpointing an entire desktop;
    This is easy, but we just haven't had the time lately.  I estimate
    the time to do it is about one person working straight out for two weeks
    or so.  But who has that much spare time.  :-)  ]

3. OpenGL
  [ Checkpointing OpenGL would be a really big win.  We don't know the
    right way, but we're looking.  Do you have some thoughts on that?  Thanks.]

4. inotify and NSCD
  [ We try to virtualize a single app, instead of also checkpointing
    inotify and NSCD themselves.  It would have been interesting to consider
    checkpointing them in userland, but that would require root privilege,
    and one core design principle we have, is that all of our C/R is
    completely unprivileged.  So, we would see distributing DMTCP as
    a package in a distro, and letting individual users decide for
    what computation they might want to use it. ]

5. Checkpointing DRM state and other graphics chip state
  [ It comes down to virtualization around a single app versus checkpointing
    _all_ of X. --- Two different approaches. ]

6. kernel c/r of input devices might be a lot easier
  [ We agree with you.  By virtualizing ...
From: Oren Laadan
Date: Saturday, November 6, 2010 - 3:41 pm

That was my understanding too. However, I also felt that I'd better


Hmmm... that sounds pretty fast .. given that you will need to
save and reconstruct an arbitrary state kept by the X server...

More importantly, this line of thought was brought up in this
thread multiple times, yet in a very misleading way.

The question is _not_ whether one can do c/r of single apps
without their surrounding environment. The answer for that is
simple: it _is_ possible either using proper (and more likely
per-app) wrappers, or by adapting the apps to tolerate that.

The above is entirely orthogonal to whether the c/r is in kernel
or in userspace.

So for terminal based apps, one can use 'screen'. For individual X
apps, one can use a light VNC server with proper embedding in the
desktop (e.g. metavnc). Or you could use screen-for-X like 'xpra'.
Or you can write wrappers (messy or hairy or not) that will try to
do that, or you could modify the apps. IIUC, dmtcp chose the way
of the wrappers.

But that is independent of where you do c/r !  The issue on the
table is whether the _core_ c/r should go in kernel or userspace.
Those wrappers of dmtcp are great and will be useful with either
approach.

So let us please _not_ argue that only one approach can c/r apps
or processes out of their context. That is inaccurate and misleading.

And while one may argue that one use-case is more important than
another, let us also _not_ dismiss such use cases (as was argued
by others in this thread). For example, c/r of a full desktop
session in VNC, or a VPS, is a perfectly valid and useful case.


FYI, inotify() is a syscall and does not require root privileges. It's
a kernel API used to get notifications of changes to file system inodes.

Back to the point argued above, "virtualization around a single app"
are the wrappers that allow to take an app out of context and sort of
implant it in another context. It's a very desirable feature, but

Hmm... can you really c/r from userspace a ...
From: Gene Cooperman
Date: Sunday, November 7, 2010 - 11:49 am

These are all good points by Oren.  It's not about in-kernel _or_ userland.
There are opportunities to use both -- each where it is strongest,
and I'm looking forward to that discussion with Oren.  I do think
that reconstructing the state of the X server is not as hard as Oren


Yes, I know.  I was writing too fast in trying to respond to all the points.
Matt had asked how we would handle inotify(), but I was getting swamped
by all the questions.  There is a virtualization approach to inotify in which
one puts wrappers around inotify_add_watch(), inotify_rm_watch() and
friends in the same way as we wrap open() and could wrap close().
One would then need to wrap read() (which we don't like to do, just
in case it could add significant overhead).  But if we consider kernel
and userland virtualization together, then something similar to  TIOCSTI

I agree.  I look forward to the discussion where we can put all this

Let's try it and see.  If you write a program, we'll try it out in
DMTCP (unstable branch) and see.  So far, checkpointing gdb sessions
has worked well for us.  If there is something we don't cover, it will

Another excellent topic for discussion.  I look forward to the discussion.

Also a very good point above, and I agree.  The offline discussion should
be a better forum for putting this all into perspective.

Thanks again for your thoughtful response,
- Gene
--

From: Oren Laadan
Date: Sunday, November 7, 2010 - 2:59 pm

[cc'ing linux containers mailing list]

On 11/07/2010 01:49 PM, Gene Cooperman wrote:


This sounds like reimplementation in userspace the very same logic

We could work to add ABIs and APIs for each and every possible piece
of state that affects userspace. And for each we'll argue forever
about the design and some time later regret that it wasn't designed
correctly :p

Even if that happens (which is very unlikely and unnecessary),
it will generate all the very same code in the kernel that Tejun
has been complaining about, and _more_. And we will still suffer
from issues such as lack of atomicity and being unable to do many
simple and advanced optimizations.

Or we could use linux-cr for that: do the c/r in the kernel,
keep the know-how in the kernel, expose (and commit to) a
per-kernel-version ABI (not vow to keep countless new individual
ABIs forever after getting them wrongly...), be able to do all
sorts of useful optimization and provide atomicity and guarantees
(see under "leak detection" in the OLS linux-cr paper). Also,
once the c/r infrastructure is in the kernel, it will be easy
(and encouraged) to support newly introduced features.

Finally, then we would use dmtcp as well as other tools on top
of the kernel-cr - and I'm looking forward to doing that !


Try "strace bash" :)
I suspect it won't work - and for the reasons I described.


Same here. Talk to you soon...

Oren.
--

From: Tejun Heo
Date: Wednesday, November 17, 2010 - 4:57 am

Hello, Oren.



It may be harder but those will be localized for specific features
which would be useful for other purposes too.  With in-kernel CR,
you're adding a bunch of intrusive changes which can't be tested or

And the only reason it seems easier is because you're working around
the ABI problem by declaring that these binary blobs wouldn't be kept
compatible between different kernel versions and configurations.  That
simply is the wrong approach.  If you want to export something, build

Yeah, this part I agree.  The higher level workarounds implemented in
dmtcp are quite impressive and useful no matter what happens to lower
layer.

Thanks.

-- 
tejun
--

From: Serge E. Hallyn
Date: Wednesday, November 17, 2010 - 8:39 am

By this do you mean the very idea of having CR support in the kernel?
Or our design of it in the kernel?  Let's go back to July 2008, at the
containers mini-summit, where it was unanimously agreed upon that the
kernel was the right place (Checkpoint/Restart [CR] under
http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
we would start by supporting a single task with no resources.  Was that
whole discussion effectively misguided, in your opinion?  Or do you
feel that since the first steps outlined in that discussion we've
either "gone too far" or strayed in the subsequent design?

-serge
--

From: Tejun Heo
Date: Wednesday, November 17, 2010 - 8:46 am

Hello, Serge.



The conclusion doesn't seem like such a good idea, well, at least to
me for what it's worth.  Conclusions at summits don't carry decisive
weight.  It'll still have to prove its worthiness for mainline all the
same and in light of already working userland alternative and the
expanded area now covered by virtualization, the arguments in this
thread don't seem too strong.

Thanks.

-- 
tejun
--

From: Pavel Emelyanov
Date: Thursday, November 18, 2010 - 2:13 am

From: Tejun Heo
Date: Thursday, November 18, 2010 - 2:48 am

Hello, Pavel.


I think I already did that several times in this thread but here's an
attempt at summary.

* It adds a bunch of pseudo ABI when most of the same information is
  available via already established ABI.

* In a way which can only ever be used and tested by CR.  If possible,
  kernel should provide generic mechanisms which can be used to
  implement features in userland.  One of the reasons why we'd like to
  export small basic building blocks instead of full end-to-end
  solutions from the kernel is that we don't know how things will
  change in the future.  In-kernel CR puts too much in the kernel in a
  way too inflexible manner.

* It essentially adds a separate complete set of entry/exit points for
  a lot of things, which makes things more error prone and increases
  maintenance overhead across the board.

* And, most of all, there are userland implementation and
  virtualization, making the benefit to overhead ratio completely off.
  Userland implementation _already_ achieves most of what's necessary
  for the most important use case of HPC without any special help from
  the kernel.  The only reasonable thing to do is taking a good look
  at it and finding ways to improve it.

Thanks.

-- 
tejun
--

From: Jose R. Santos
Date: Thursday, November 18, 2010 - 1:13 pm

On Thu, 18 Nov 2010 10:48:34 +0100

Yet the arguments seem to be vague enough not to be convincing to the

Can you elaborate on this?  What established ABI are you proposing we

So what if it can only be tested with CR as long as we can make CR work
on a variety of environments?  Scalability changes for _really_ large
SMP boxes can only be reliably tested by people with such equipment.  We are
not imposing any such restriction and this code can be tested on very

I partially agree with you here.  There will be maintenance overhead
every time you add code to the kernel that _may_ make changes in the
future more complicated.  This is true for _any_ code that is added to the
core kernel.  Now in my experience such maintenance burden is most
disruptive when the code being added creates a lot of new state that
need to be tracked in multiple places unrelated to CR (in this case).
Our argument is that the CR code is not creating new state that will
cause painful future changes to the kernel.  If you have specific
example that you are concerned with, great.  Let's discuss those.

Are we promising zero maintenance cost?  No.  But guess what, neither do most
features that make it into the kernel.

Now, if we change the argument around...  What would be the maintenance
cost keeping this outside the kernel.  I would argue that it is much

Can we keep virtualization out of this.  Every time someone mentions
virtualization as a solution, it makes me feel like these people just
don't understand the problem we are trying to solve.  It is just not
practical to create a new VM for every application you want to CR.

What are these _most_ important cases of HPC that you are referring to?
Can we do a lot of these cases from userspace? Sure, but why are the
ones that can't be done from userspace any less important.  If nobody

The userspace vs in-kernel discussion has been done before as multiple
people have already said in this thread.  Show me a version of userspace
CR that can correctly do all that an ...
From: Serge Hallyn
Date: Thursday, November 18, 2010 - 8:54 pm

Guess I'll just be offensive here and say, straight-out:  I don't
believe it.  Can I see the userspace implementation of c/r?

If it's as good as the kernel level c/r, then awesome - we don't
need the kernel patches.

If it's not as good, then the thing is, we're not drawing arbitrary
lines saying "is this good enough", rather we want completely
reliable and transparent c/r.  IOW, the running task and the other
end can't tell that a migration happened, and, if checkpoint says
it worked, then restart must succeed.

-serge
--

From: Oren Laadan
Date: Thursday, November 18, 2010 - 12:53 pm

While it's your opinion that userland alternatives "already work",
in reality they are unsuitable for several real use-cases. The
userland approach has serious restrictions - which I will cover
in a follow-up post to my discussion with Gene soon.

Note that one important point of agreement was that DMTCP's ability
to provide "glue" to restart applications without their original
context is _orthogonal_ to how the core c/r is done. IOW - these
exciting goodies from DMTCP are useful with either form of c/r.

You also argue that "virtualization" (VMs?) covers everything else,
implying that lightweight virtualization is useless. In reality it
is an important technology, already in the kernel (surely you don't
suggest to pull it out ?!) and for a reason. That is already a very
good reason to provide, e.g. containers c/r and live-migration to
keep it competitive and useful.

Thanks,

Oren.
--

From: Serge Hallyn
Date: Thursday, November 18, 2010 - 9:10 pm

Of course.  It allows us to present at kernel summit and look for early
rejections to save us all some time (which we did, at the container
mini-summit readout at ksummit 2008), but it would be silly to read


Here's where we disagree.  If you are right about a viable userland
alternative ('already working' isn't even a prereq in my opinion,
so long as it is really viable), then I'm with you, but I'm not buying
it at this point.

Seriously.  Truly.  Honestly.  I am *not* looking for any extra kernel

-serge
--

From: Tejun Heo
Date: Friday, November 19, 2010 - 7:04 am

What's so wrong with Gene's work?  Sure, it has some hacky aspects but
let's fix those up.  To me, it sure looks like much saner and
manageable approach than in-kernel CR.  We can add nested ptrace,
CLONE_SET_PID (or whatever) in pidns, integrate it with various ns
supports, add an ability to adjust brk, export inotify state via
fdinfo and so on.

The thing is already working, the codebase of core part is fairly
small and condor is contemplating integrating it, so at least some
people in HPC segment think it's already viable.  Maybe the HPC
cluster I'm currently sitting near is special case but people here
really don't run very fancy stuff.  In most cases, they're fairly
simple (from system POV) C programs reading/writing data and burning a
_LOT_ of CPU cycles inbetween and admins here seem to think dmtcp
integrated with condor would work well enough for them.

Sure, in-kernel CR has better or more reliable coverage now but by how
much?  The basic things are already there in userland.  The tradeoff
simply doesn't make any sense.  If it were a well separated self
sustained feature, it probably would be able to get in, but it's all
over the place and requires a completely new concept - the
quasi-ABI'ish binary blob which would probably be portable across
different kernel versions with some massaging.  I personally think the
idea is fundamentally flawed (just go through the usual ABI!) but even
if it were not it would require _MUCH_ stronger rationale than it
currently has to be even considered for mainline inclusion.

Maybe it's just me but most of the arguments for in-kernel CR look
very weak.  They're either about remote toy use cases or along the
line that userland CR currently doesn't do everything kernel CR does
(yet).  Even if it weren't for me, I frankly can't see how it would be
included in mainline.

I think it would be best for everyone to improve userland CR.  A lot
of knowledge and experience gained through kernel CR would be
applicable and won't go to waste.  ...
From: Kirill Korotaev
Date: Friday, November 19, 2010 - 7:36 am

Tejun,

Sorry for getting into the middle of the discussion, but...

Can you imagine how many userland APIs are needed to make userspace C/R?

Do you really want APIs in user-space which allow to:
- send signals with siginfo attached (kill() doesn't work...)
- read inotify configuration
- insert SKB's into socket buffers
- setup all TCP/IP parameters for sockets
- wait for AIO pending in other processes
- set different statistics counters (like netdev stats etc.)
and so on...

For every small piece of functionality you will need to export ABI and maintain it forever.
It's thousands of APIs! And why the hell they are needed in user space at all?

BTW, the HPC case you are talking about is probably the simplest one. Last time I looked into it, IBM Meiosys c/r
didn't even bother with tty's migration. In OpenVZ we really do need much more than that, like
autofs/NFS support, preserved statistics, TTYs, etc. etc. etc.

Thanks,
Kirill


--

From: Tejun Heo
Date: Friday, November 19, 2010 - 8:33 am

Hello,




Can't we drain kernel buffers?  ie. Stop further writing and wait the

I _think_ most can be restored by talking to netfilter module.

I haven't looked at aio implementation for a while now but can't we
drain these upon checkpointing and just carry the completion status?
Also, if aio is what you're concerned about, I would say the problem


I think it's actually quite the contrary.  Most things are already
visible to userland.  They _have_ to be and that's the reason why
userland implementation can already get most things working without
any change to the kernel with some amount of hackery.  To me in-kernel
CR seems to approach the problem from the exactly wrong direction -
rather than dealing with specific exceptions, it creates a completely
new framework which is very foreign and not useful outside of CR.

Also, think about it.  Which one is better?  A kernel which can fully
show its ABI visible states to userland or one which dumps its
internal data structures in binary blobs.  To me, the latter seems


Would it be impossible to preserve autofs/NFS and TTYs from userland?
Then, why so?  For statistics, I'm a bit lost.  Why does it matter and
even if it does would it justify putting the whole CR inside kernel?

Thank you.

-- 
tejun
--

From: Alexey Dobriyan
Date: Friday, November 19, 2010 - 9:00 am

On send:
if network dies right after freeze, you lose.

On receive:

Because you'll introduce a million stupid interfaces not interesting to
anyone but C/R.
--

From: Alexey Dobriyan
Date: Friday, November 19, 2010 - 9:01 am

Just like CLONE_SET_PID.
--

From: Tejun Heo
Date: Friday, November 19, 2010 - 9:10 am

Well, if you ask me, having pidns w/o a way to reinstate PID from
userland is pretty silly and you and I might not know yet but it's
quite imaginable that there will be other use cases for the capability
unlike in-kernel CR.  Kernel provides building blocks not the whole
frigging package and for very good reasons.

-- 
tejun
--

From: Alexey Dobriyan
Date: Friday, November 19, 2010 - 9:25 am

No.
Chrome uses CLONE_NEWPID so that exploit couldn't attach to processes in

Speaking of pids, pid's value itself is never interesting (except maybe pid 1).
It's a cookie.

CLONE_SET_PID came up only now because only C/R wants it.
--

From: Tejun Heo
Date: Friday, November 19, 2010 - 9:06 am

Hello,


Gosh, if you're really worried about that, put a netfilter module
which would buffer and simulate acks to extract the packets before

Just store the data somewhere.  The checkpointer can drain the socket,

In this thread, how many have you guys come up with?  Not even a dozen
and most can be solved almost trivially.  Seriously, what the hell..

Thanks.

-- 
tejun
--

From: Alexey Dobriyan
Date: Friday, November 19, 2010 - 9:16 am

I do not count them.

The paragon of absurdity is struct task_struct::did_exec .
--

From: Tejun Heo
Date: Friday, November 19, 2010 - 9:19 am

Yeah, then go and figure how to do that in a way which would be useful
for other purposes too instead of trying to shove the whole
checkpointer inside the kernel.  It sure would be harder but hey
that's the way it is.

-- 
tejun
--

From: Alexey Dobriyan
Date: Friday, November 19, 2010 - 9:27 am

System call for one bit? This is ridiculous.
Doing execve(2) for userspace C/R is ridiculous too (and likely doesn't work).
--

From: Tejun Heo
Date: Friday, November 19, 2010 - 9:32 am

Really, whatever.  Just keep doing what you're doing.  Hey, if it
makes you happy, it can't be too wrong.

-- 
tejun
--

From: Alexey Dobriyan
Date: Friday, November 19, 2010 - 9:38 am

Because /proc/*/did_exec is useless to anyone but C/R (even for reading!).

Because code is much simpler:

    tsk->did_exec = !!tsk_img->did_exec;
+
    __u8 did_exec;
--

From: Tejun Heo
Date: Friday, November 19, 2010 - 9:50 am

I don't think you'll need a full file.  Just shove it in status or
somewhere.  Your argument is completely absurd.  So, because exporting
single bit is so horrible to everyone else, you want to shove the

Sigh, yeah, except for the horror show to create tsk_img.  Your
"paragon of absurdity" is did_exec which is only ever used to decide
whether setpgid() should fail with -EACCES, seriously?  Here's a
thought.  Ignore it for now and concentrate on more relevant problems.
I'm fairly sure CR'd program malfunctioning over did_exec wouldn't
mark the beginning of the end of our civilization.  You gotta be
kidding me.
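
For reference, the setpgid() behaviour mentioned above can be observed from
userspace: POSIX specifies that setpgid() fails with EACCES when the target
is a child of the caller that has already called one of the exec functions.
The sketch below is illustrative only and not from this thread; the function
name setpgid_errno_after_exec and the use of "sleep" as the exec target are
assumptions. A close-on-exec pipe tells the parent that the exec has happened
before it calls setpgid().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the errno set by setpgid() on a child that has already exec'd,
 * 0 if setpgid() unexpectedly succeeded, or -1 if the setup failed.
 * (Hypothetical helper, written for illustration.) */
int setpgid_errno_after_exec(void)
{
    int fds[2];
    char c;

    if (pipe(fds) < 0)
        return -1;
    assert(fds[0] >= 0 && fds[1] >= 0);

    /* The write end is closed automatically once the child's exec succeeds,
     * so the parent sees EOF exactly when the exec has taken effect. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (child == 0) {
        close(fds[0]);
        execlp("sleep", "sleep", "30", (char *)NULL);
        /* exec failed: send one byte so the parent does not see a bare EOF */
        ssize_t w = write(fds[1], "x", 1);
        (void)w;
        _exit(127);
    }

    close(fds[1]);
    ssize_t n = read(fds[0], &c, 1);   /* n == 0 means EOF: exec happened */
    close(fds[0]);

    int result = -1;
    if (n == 0) {
        errno = 0;
        result = (setpgid(child, child) == 0) ? 0 : errno;
    }

    kill(child, SIGKILL);              /* clean up the sleeping child */
    waitpid(child, NULL, 0);
    return result;
}
```

On a Linux system with "sleep" in PATH, calling setpgid_errno_after_exec()
should return EACCES, which is the only user-visible effect of the bit being
argued about here.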

-- 
tejun
--

From: Alexey Dobriyan
Date: Friday, November 19, 2010 - 9:55 am

task_struct image work is common for both userspace C/R and in-kernel.
You _have_ to define it.

You're so newjerseyly now.
--

From: Oren Laadan
Date: Saturday, November 20, 2010 - 10:58 am

You assume that c/r is done by the checkpointed processes _themselves_,
that is that to checkpoint a process that process needs to be made runnable
and it will save its own state (which is the model of dmtcp, but not of
using ptrace). 

This model is restrictive: it requires that you hijack the execution of
that process somehow and make it run. What if the process isn't runnable
(e.g. in vfork waiting for completion, or ptraced deep in the kernel) ?
letting it run even just a bit may modify its state. It also means that
if you have many processes in the checkpointed session, e.g. 1000, then
_all_ of them will have to be scheduled to run !

With kernel c/r this is unnecessary:  you can use an auxiliary process
to checkpoint other processes without scheduling the other processes.
I.e. it's _transparent_ and _preemptive_.

Another advantage is that if anything fails during checkpoint (for 
whatever reason), there are no side-effects (which is not the case with

Are we judging aesthetics ?  To me the former looks uglier...

The amount of fragile hacks you need to go through to make it work
in userspace for the generic cases (including userspace trickery
and new crazy APIs from the kernel for state that was never even an 
ABI, like skb's), and the restrictions it poses simply suggest that
userspace is not the right place to do it. 

Thanks,

Oren.
--

From: Oren Laadan
Date: Saturday, November 20, 2010 - 11:05 am

Hi,

Based on discussion with Gene, I'd like to clarify key points and
difference between kernel and userspace approaches (specifically
linux-cr and dmtcp): three parts to break the long post...

part I: perspective about the types of scopes of c/r in discussion
part II: linux-cr design and objectives
part III: comparison kernel/userspace approaches

[now relax, grab (another) cup of coffee and read on...]

PART I:  ==PERSPECTIVE==

A rough classification of c/r categories:

* container-c/r: important use-case, e.g. c/r and migration of an
  application containers like VPS (virtual private server), VDI
  (desktop) or  other self-contained application (e.g. Oracle server).
  Here _all_ the relevant processes are included in the checkpoint.

* standalone-c/r: another use-case is standalone-c/r where a set of
  processes is checkpointed, but not the entire environment, and then
  those processes are restarted in a different "eco-system".

* distributed-c/r: meaning several sets of processes, each running
  on a different host. (Each set may be a separate container there).

In container-c/r, the main challenge is to be _reliable_ in the sense
that a restart from a successful checkpoint should always succeed.

In standalone-c/r, the main challenge is that an application resumes
execution after a restart in a possible _different_ eco-system. Some
application don't care (e.g 'bc'). Other applications do care, and to
different degrees; for these we need "glue" to pacify the application.

There are generally three types of "glue":

(1) Modify the application or selected libraries to be c/r-aware, and
  notify it when restart completes. (e.g. CoCheck MPI library).
(2) Add a userspace helper that will run post-restart to do necessary
  trickery (eg. send a SIGWINCH to 'screen'; mount proper filesystem
  at the new host after migration; reconnect a socket to a peer).
(3) Use interposition on selected library calls and add wrapper code
  that will glue in what's missing (e.g. ...
From: Oren Laadan
Date: Saturday, November 20, 2010 - 11:08 am

Hi,

[continuation of posting regarding kernel vs userspace approach]

part I: perspective about the types of scopes of c/r in discussion
part II: linux-cr design and objectives
part III: comparison kernel/userspace approaches


PART II:  ==PHILOSOPHY==

Linux-cr is a _generic_ c/r-engine with multiple capabilities. It can
checkpoint a full container, a process hierarchy, or a single process.
For containers, it provides guarantees like restart-ability; for the
others, it provides the flexibility so that c/r-aware applications,
libraries, helpers, and wrappers can glue what they wish to glue.

1) Transparent - completely transparent for container-c/r, and largely
  so for standalone-cr ("largely" - as in except for the glue which is
  needed due to loss of eco-system, not due to restarting).
2) Reliable - if checkpoint succeeds then restart is guaranteed
  to succeed too (for container-c/r).
3) Preemptive - works without requiring that checkpointed processes
  be scheduled to run (and thus "collaborate")
4) Complete - covers all visible and hidden state in the kernel
  about processes (even if not directly visible to userspace)
5) Efficient - can be optimized ...
From: Oren Laadan
Date: Saturday, November 20, 2010 - 11:11 am

Hi,

[continuation of discussion of kernel vs userspace c/r approach]
part I: perspective about the types of scopes of c/r in discussion
part II: linux-cr design and objectives
part III: comparison kernel/userspace approaches


PART III:  ==SOME TECHNICAL ASPECTS==

Important to know about userspace (DMTCP example) before presenting a
comparison between kernel and userspace approaches:

DMTCP has two components: 1) c/r-engine to save/restore process state,
and 2) glue to restart processes out of their original context. They
are _orthogonal_: the glue can be used with other c/r-engines, like
linux-cr. This discussion refers to the c/r-engine _only_.

Focusing on the c/r-engine of DMTCP - it uses syscall interposition
for three reasons:

1) To take control of processes at checkpoint
2) To always track state of resources not visible to userspace
3) To virtualize identifiers after restart

#1 is needed because processes save their own state (and need to run
the checkpoint code for that).

#2 is needed because the kernel does not expose all state, and #3 is
needed because the kernel does not give ways to restore ...
From: Oren Laadan
Date: Saturday, November 20, 2010 - 11:15 am

[[apologies for the silly prefix on last two posts - a combination
of windows, putty, pine and slow connection is not helping me :( ]]
--

From: Tejun Heo
Date: Saturday, November 20, 2010 - 12:33 pm

Hello,


Maybe it's a good idea to post a clean concatenated version for later
reference?

Thanks.

-- 
tejun
--

From: Gene Cooperman
Date: Sunday, November 21, 2010 - 1:18 am

In this post, Kapil and I will provide our own summary of how we
see the issues for discussion so far.  In the next post, we'll reply
specifically to comment on Oren's table of comparison between
linux-cr and userspace.

In general, we'd like to add that the conversation with Oren was very
useful for us, and I think Oren will also agree that we were able to
converge on the purely technical questions.

Concerning opinions, we want to be cautious on opinions, since we're
still learning the context of this ongoing discussion on LKML.  There is
probably still some context that we're missing.

Below, we'll summarize the four major questions that we've understood from
this discussion so far.  But before doing so, I want to point out that a single
process or process tree will always have many possible interactions with
the rest of the world.  Within our own group, we have an internal slogan:
  "You can't checkpoint the world."
A virtual machine can have a relatively closed world, which makes it more
robust, but checkpointing will always have some fragile parts.
We give four examples below: 
a.  time virtualization
b.  external database
c.  NSCD daemon
d.  screen and other full-screen text programs
These are not the only examples of difficult interactions with the
rest of the world.

Anyway, in my opinion, the conversation with Oren seemed to converge
into two larger cases:
1.  In a pure userland C/R like DMTCP, how many corner cases are not handled,
	or could not be handled, in a pure userland approach?
	Also, how important are those corner cases?  Do some
	have important use cases that rise above just a corner case?
	[ inotify is one of those examples.  For DMTCP to support this,
	  it would have to put wrappers around inotify_add_watch,
	  inotify_rm_watch, read, etc., and maybe even tracking inodes in case
	  the file had been renamed after the inotify_add_watch.  Something
	  could be made to work for the common cases, but it would
	  still be a hack --- to be done only if ...
From: Gene Cooperman
Date: Sunday, November 21, 2010 - 1:21 am

As Kapil and I wrote before, we benefited greatly from having talked with Oren,
and learning some more about the context of the discussion.  We were able
to understand better the good technical points that Oren was making.
    Since the comparison table below concerns DMTCP, we'd like to

In our experiments so far, the overhead of system calls has been
unmeasurable.  We never wrap read() or write(), in order to keep overhead low.
We also never wrap pthread synchronization primitives such as locks,
for the same reason.  The other system calls are used much less often, and so
 
As above, we believe that the overhead while running is negligible.  I'm
assuming that image size refers to in-kernel advantages for incremental
checkpointing.  This is useful for apps where the modified pages tend
not to dominate.  We agree with this point.  As an orthogonal point,
by default DMTCP compresses all checkpoint images using gzip on the fly.
This is useful even when most pages are modified between checkpoints.
Still, as Oren writes, Linux C/R could also add a userland component
to compress checkpoint images on the fly.
    Next, live migration is a question that we simply haven't thought much
about.  If it's important, we could think about what userland approaches might

We'd like to clarify what may be some misconceptions.  The DMTCP
controller does not launch or manage any tasks.  The DMTCP controller
is stateless, and is only there to provide a barrier, namespace server,
and single point of contact to relay ckpt/restart commands.  Recall that
the DMTCP controller handles processes across hosts --- not just on a
single host.
    Also, in any computation involving multiple processes, _every_ process
of the computation is a point of failure.  If any process of the computation
dies, then the simple application strategy is to give up and revert to an
earlier checkpoint.  There are techniques by which an app or DMTCP can
recreate certain failed processes.  DMTCP doesn't currently recreate

Our ...
From: Sukadev Bhattiprolu
Date: Monday, November 22, 2010 - 11:02 am

Gene Cooperman [gene@ccs.neu.edu] wrote:
| > RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
| >                 atomic operation. guaranteed    to determine restartability
| >                 restartability for containers
| 
| My understanding is that the guarantees apply for Linux containers, but not
| for a tree of processes.  Does this imply that linux-cr would have some
| of the same reliability issues as DMTCP for a tree of processes?  (I mean
| the question sincerely, and am not intending to be rude.)  In any case,
| won't DMTCP and Linux C/R have to handle orthogonal reliability issues
| such as external database, time virtualization, and other examples
| from our previous post?

Yes, if the user attempts to checkpoint a partial container (what we refer
to as a process subtree) or fails to snapshot/restore filesystem there could be
leaks that we cannot detect.

But one guarantee we are trying to provide is that if the user checkpoints
a _complete_ container, then we will detect a leak if one exists.

Is there a way to establish a set of constraints (eg: run application in a
container, snapshot/restore filesystem) and then provide leak detection with
a pure userspace implementation ?

Sukadev
--

From: Oren Laadan
Date: Tuesday, November 23, 2010 - 10:53 am

Syscall interception will have visible effect on applications that
use those syscalls. You may not observe overhead with HPC ones,
but do you have numbers on server apps ?  apps that use fork/clone
and pipes extensively ?  threads benchmarks etc. ?  compare that




The controller is another point of failure. I already pointed out that
the (controlled) application crashes when your controller dies, and
you mentioned it's a bug that should be fixed. But then there will always 
be a risk for another, and another ...   You also mentioned that if the
controller dies, then the app should continue to run, but will not be
checkpointable anymore (IIUC).

The point is, that the controller is another point of failure, and makes 
the execution/checkpoint intrusive. It also adds security and 
user-management issues as you'll need one (or more ?) controller per user 
(right now, it's one for all, no ?). and so on.

Plus, because the restarted apps get their virtualized IDs from the 
controller, then they can't now "see" existing/new processes that

The point is that you _add_ a point of failure: you make the "checkpoint" 
operation a possible reason for the application to crash. In contrast, in 
linux-cr the checkpoint is idempotent - harmless because it does not

There are two points in the claim above:

1) linux-cr can checkpoint with a single syscall - it's atomic. This
gives you more guarantees about the consistency of the checkpointed 
application(s), and fewer "opportunities" for the operation as a whole to
fail.

2) restartability - for full-container checkpoint only.

There is no "reliability" issue with c/r of non-containers - it's a matter 
of definition: it depends on what your requirements are from the userspace
application and what sort of "glue" you have for it.
 
And I request again - let's leave out the questions of "time 
virtualization" and "external databases" - how are they different for the 
VM virtualization solution ?  they are completely orthogonal to the ...
From: Kapil Arya
Date: Tuesday, November 23, 2010 - 8:50 pm

(Our first comment below actually replies to an earlier post by Oren. It seemed

We would guess that Zap would not be able to support screen without a user
space component. The bug occurs when screen is configured to have a status line
at the bottom. We would be interested if you want to try it and let us know the
results.


It's true that we haven't taken serious data on overhead with server apps. Is
there a particular server app that you are thinking of as an example? I would
expect fork/clone and pipes to be invoked infrequently in the server apps and
not to add measurably to CPU time. In most server apps such as MySQL, it is
common to maintain a pool of threads for reuse rather than to repeatedly call
clone for a new thread. This is done to ensure that the overhead of the clone
calls is not significant. I would expect a similar policy for fork and pipes.


Just to clarify, DMTCP uses one coordinator for each checkpointable
computation. A single user may be running multiple computations with one
coordinator for each computation. We don't actually use the word controller
in DMTCP terminology because the coordinator is stateless and so in

This appears to be a misconception. The wrappers within the user process
maintain the pid-translation table for that process. The translation table is
the translation between the original pid given by the kernel and the current
pid set by the kernel on restart. This is handled locally and does not involve
the coordinator.
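
A per-process translation table of the kind described above, together with
the pid-clash check discussed next, might look roughly like the following.
This is a minimal sketch under stated assumptions, not DMTCP's code: the
names (struct pid_map, pid_map_add and so on), the fixed capacity and the
linear search are all hypothetical choices made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define PID_MAP_MAX 64            /* arbitrary capacity for this sketch */

struct pid_map {
    pid_t original[PID_MAP_MAX];  /* pid the application saw before checkpoint */
    pid_t current[PID_MAP_MAX];   /* pid the kernel assigned after restart */
    size_t len;
};

void pid_map_init(struct pid_map *m)
{
    assert(m != NULL);
    m->len = 0;
}

/* A freshly forked child's kernel pid clashes if some process already
 * uses that number as its original pid. */
int pid_map_is_clash(const struct pid_map *m, pid_t fresh)
{
    for (size_t i = 0; i < m->len; i++)
        if (m->original[i] == fresh)
            return 1;
    return 0;
}

/* Record original -> current. Returns 0, or -1 if full or already mapped. */
int pid_map_add(struct pid_map *m, pid_t original, pid_t current)
{
    if (m->len == PID_MAP_MAX || pid_map_is_clash(m, original))
        return -1;
    m->original[m->len] = original;
    m->current[m->len] = current;
    m->len++;
    return 0;
}

/* For arguments going into the kernel, e.g. in a kill() wrapper.
 * Unknown pids pass through unchanged. */
pid_t pid_map_to_current(const struct pid_map *m, pid_t original)
{
    for (size_t i = 0; i < m->len; i++)
        if (m->original[i] == original)
            return m->current[i];
    return original;
}

/* For values coming back from the kernel, e.g. in a getpid() wrapper. */
pid_t pid_map_to_original(const struct pid_map *m, pid_t current)
{
    for (size_t i = 0; i < m->len; i++)
        if (m->current[i] == current)
            return m->original[i];
    return current;
}
```

A fork wrapper built on this would call pid_map_is_clash() on the new
child's pid and, on a clash, have the child exit and fork again.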

In the case of a fork there could be a pid-clash (the original pid generated
for a new process that conflicts with someone else's original pid). However, DMTCP
handles this by checking within the fork wrapper for a pid-clash. In the rare
case of a pid-clash, the child process exits and the parent forks again. Same

We were speaking above of the case when the process dies during a
computation. We were not referring to checkpoint time.

<snip>

We would like to add our own comment/question. To set the context we quote ...
From: Oren Laadan
Date: Thursday, November 25, 2010 - 9:04 am

I apologize for being blunt - but this is probably an issue specific to 

What is a "local" socket ?  af_unix, or locally connected af_inet ?

Anyway, with linux-cr you'd do what's needed after the restarted tasks are
created, but before their state is restored. For each such "old" socket
that you want to replace, you'd create (in userspace with arbitrary "glue"
code!) a new socket, and use this socket when restoring the state of the

Repainting during restart is the least of your problems.

Leak detection is not a problem: 
If the socket connects out of the containers (like af_inet) - then it is 
not a leak, and you treat it as described above.
If the socket connects within the container but you don't checkpoint the
"peer" process - then it is not a container-c/r (in which case you don't 
look for leaks).

Also, the application could mark resources to not be checkpointed (e.g. 
scratch memory to save storage, or sockets to not count as leaks).


I explain again - in case it wasn't clear from my 3-part post: leak 
detection is relevant _only_ for full container-c/r. It doesn't make 
sense otherwise.

If you want to checkpoint individual components of an application,
then it's up to userspace to produce/provide the relevant "glue" to
make it "make sense" when those components restart without their 
original eco-system.

Thanks,

Oren.
From: Gene Cooperman
Date: Sunday, November 28, 2010 - 9:09 pm

Hi Oren,


I completely agree with you, Oren.  DMTCP was never designed to be split
into a userland and in-kernel replacement.  We will want to re-factor
DMTCP to make this happen.
    I'm sorry if my e-mail came off as confrontational.  That was not my
intention.  I was just looking forward to an interesting intellectual
experiment --- how to go about combining DMTCP and Linux C/R.   I was
trying to guess ahead of time where there are interesting challenges, and
my hope is that we will find a way to solve them together.

Best wishes,
- Gene
--

From: Grant Likely
Date: Sunday, November 21, 2010 - 3:41 pm

Hi Gene,


At the risk of restating already applied arguments, and as a c/r
outsider, this touches on the real crux of the issue for me.  What is
the complete set of boundaries between a c/r group of processes and
the outside world?  Is it bounded and is it understandable by mere
kernel engineers?  Does it change the assumptions about what a Linux
process /is/, and how to handle it?  How much?  The broad strokes seem
to be straightforward, but as already pointed out, the devil is in

Temporal issues need to be (are being?) addressed regardless.  In
certain respects, I'm sure c/r can be seen as a *really long*
scheduler latency, and would have the same effect as a system going
into suspend, or a vm-level checkpoint.  I would think the same

Right here is exactly the example of a boundary that needs explicit
rules.  When a pair of processes have a shared region, and only one of
them is checkpointed, then what is the behaviour on restore?  In this
specific example, a context-specific hack is used to achieve the
desired result, but that doesn't work (as I believe you agree) in the
--

From: Oren Laadan
Date: Monday, November 22, 2010 - 10:34 am

That depends on your definition of "world". One definition
is "world := VM", as you state above. Another is "world := container"
which I stated in my post(s). You can checkpoint both.

For those cases where the "world" cannot be fully checkpointed, 
I explicitly pointed out that we should focus on the core c/r

IMHO, irrelevant to current discussion. And btw, this is done in

This falls within the category of "glue", and is - as I try once
again to remind - entirely orthogonal to the topic of where

This actually never required a userspace "component" with Zap
or linux-cr (to the best of my knowledge)..

Even if it did - the question is not how to deal with "glue"
(you demonstrated quite well how to do that with DMTCP), but 
how should the basic, core c/r functionality work - which is
below, and orthogonal to the "glue".

Let us please focus on the base c/r engine functionality...

(gotta disconnect now .. more later)

Oren.
--

From: Oren Laadan
Date: Monday, November 22, 2010 - 10:18 am

Sure, as soon as I am back on a sane connection (~1 week)
(I cut it in three to make it easier for people to digest ...)

Oren.
--

From: Matt Helsley
Date: Wednesday, November 17, 2010 - 3:17 pm

You seem to be arguing "Z is only testable/useful for doing the things Z
was made for". I couldn't agree more with that. CR is useful for:

	Fault-tolerance (typical HPC)
	Load-balancing (less-typical HPC)
	Debugging (simple [e.g. instead of coredumps] or complex
		time-reversible)
	Embedded devices that need to deal with persistent low-memory
		situations.

I think Oren's Kernel Summit presentation succinctly summarized these:
	http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf

My personal favorite idea (that hasn't been implemented yet) is an
application startup cache. I've been wondering if caching bash startup
after all the shared libraries have been searched, loaded, and linked
couldn't save a bunch of time spent in shell scripts. Post-link actually
seems like a checkpoint in application startup which would be generally
useful too. Of course you'd want to flush [portions of] the cache when
packages get upgraded/removed or shell PATHs change and the caches
would have to be per-user.

I'm less confident but still curious about caching after running rc
scripts (less confident because it would depend highly on the content
of the rc scripts). A scripted boot, for example, might be able to save
some time if the same rc scripts are run and they don't vary over time.
That in turn might be useful for carefully-tuned boots on embedded devices.

That said we don't currently have code for application caching. Yet we
can't be expected to write tools for every possible use of our API in

Oren, that statement might be read to imply that it's based on
something as useless as kernel version numbers. Arnd has pointed out in the
past how unsuitable that is and I tend to agree. There are at least two
possible things we can relate it to: the SHA of the compiled kernel tree
(which doesn't quite work because it assumes everybody uses git trees :( ),
or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could
also stuff that header into the kernel (much like kconfigs ...
From: Tejun Heo
Date: Thursday, November 18, 2010 - 3:06 am

Hello, Matt.


I'm saying it's way too narrow scoped and inflexible to be a kernel
feature.  Kernel features should be like the basic tools, you know,
hammers, saws, drills and stuff.  In-kernel CR is more like an over
complicated food processor which usually sits in the top drawer after

which can do all of the above, a lot of which can be achieved in

What does that have to do with the kernel?  If you want
post-link cache, implement it in ld.so where it belongs.  That's like

Continuing the same line of thought.  It _CAN_ be used to do that in a

Yeah, exactly, so just do it inside the established ABI extending
where it makes sense.  No reason to add a whole separate set.

Thanks.

-- 
tejun
--

From: Oren Laadan
Date: Thursday, November 18, 2010 - 1:25 pm

BTW, it's the same for userspace c/r: for the same set of features,
the format (ABI) remains unchanged. Adding features breaks this and
a new version is necessary, and conversion from old to new will be
needed.

Moreover, supporting a new feature in userspace means adding the
proper API/ABI in the kernel, including refactoring etc, which is
even harder than adding the support for it in linux-cr.

Oren.
--

From: Oren Laadan
Date: Sunday, November 7, 2010 - 2:44 pm

[cc'ing linux containers mailing list]


My experience is different:

I downloaded dmtcp and followed the quick-start guide:
(1) "dmtcp_coordinator" on one terminal
(2) "dmtcp_checkpoint bash" on another terminal

Then I:
(3) pkill -9 dmtcp_coordinator
... oops - 'bash' died.

I didn't even try to take a checkpoint :(

Oren.
--

From: Gene Cooperman
Date: Sunday, November 7, 2010 - 4:31 pm

You're right.  I just reproduced your example.  But please remember that
we're working in a design space where if any process of a computation
dies, then we kill the computation and restart.  It doesn't matter to us
if it's a user process or the DMTCP coordinator that died.  I do think
this is getting too detailed for the LKML list, but since you bring it
up, here is the analysis.  The user bash process exits with:

[31331] ERROR at dmtcpmessagetypes.cpp:62 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
     _magicBits = 
Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?

This means that when the DMTCP coordinator died, it sent a message to the
checkpoint thread within the user process.  The message was ill-formed.
The current DMTCP code says that if a checkpoint thread receives an
ill-formed message from the coordinator, then it should die.  It's not
hard to change the protocol between DMTCP coordinator and checkpoint
thread of the user process into a more robust protocol with RETRY, further
ACK, etc.  We haven't done this.  Right now, the user simply restarts from
the last checkpoint.  If one process of a computation has been compromised
(either DMTCP coordinator or user process), then the whole computation
has been compromised.  I think in a previous version of DMTCP, the policy
was to allow the computation to continue when the coordinator dies.
Policies change.
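
As an illustration of the policy described above (treat any ill-formed
coordinator message as fatal), here is a minimal sketch. The wire format,
the MAGIC value and the names pack_message/parse_message are assumptions
made up for this example; DMTCP's real message layout is not shown in this
thread.

```python
import struct

# Hypothetical wire format, for illustration only.
MAGIC = b"DMTCP_MAGIC"            # assumed placeholder value
HEADER = struct.Struct("!16sI")   # 16-byte magic field, 4-byte message type


class CoordinatorLost(Exception):
    """Raised when the coordinator connection yields an invalid message."""


def pack_message(msg_type: int) -> bytes:
    # struct pads the magic field with NUL bytes up to 16 bytes.
    return HEADER.pack(MAGIC, msg_type)


def parse_message(data: bytes) -> int:
    """Return the message type, or raise CoordinatorLost on a bad header."""
    if len(data) < HEADER.size:
        # A peer that died uncleanly often shows up as a short or empty read.
        raise CoordinatorLost("short read from coordinator")
    magic, msg_type = HEADER.unpack_from(data)
    if magic.rstrip(b"\0") != MAGIC:
        raise CoordinatorLost("magic mismatch")
    return msg_type
```

A more robust protocol of the kind mentioned above (RETRY, further ACK)
would catch CoordinatorLost and try to reconnect instead of letting the
process die.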

But I think you're missing the larger point.  We've developed DMTCP
over six years, largely with programmers who are much less experienced
than the kernel developers.  Yet DMTCP works reliably for many users.
I consider this a credit to the DMTCP design.  The Linux C/R design
is also excellent.

Can we get back to questions of design, using the implementations as
reference implementations?  If you don't object, I'll also skip replying
to the other post, since I think we're getting too detailed.  I'm having
trouble keeping up with the ...
From: Oren Laadan
Date: Friday, November 5, 2010 - 3:24 pm

Indeed, this is a restriction on the new eclone() syscall, and can
be addressed with proper userspace tools (including crypto-signing the
checkpoint image). The core of the c/r code allows a user to

Why not ?  it has zero overhead when not in use, and a reasonable
code footprint (which can be reduced by modularizing some of it,

Are we talking about distributed checkpoint or "standalone" ?

DMTCP relies on user agents to allow distributed/remote execution
in a manner mostly transparent to the application. Many distributed
systems don't require (and do not use) user agents. Consider a
multi-tier system with web server, sql server and some applications
server. These are not suitable to DMTCP's mode of work.

(This is not to say DMTCP isn't useful - it's a clever piece of
software with specific goals and more geared towards HPC needs).

Now regarding "standalone" c/r, if you want to save/restore single
or a subset of processes of a system without the rest of it, then
you will always need user agents, regardless of userspace/kernel
method. Likewise, their work on those tools will be as useful
independently of which c/r 'engine' it uses.

When you include all the relevant processes (e.g. an entire VNC
session, a web server, HPC and batch jobs), you generally don't
need the user agents. The checkpoint is self-contained, and linux-cr

If there is a will, there is (almost always) a way ;)

What MTCP does, IIUC, is wrap around the applications with a complete
pid-namespace (and more) in userspace. There are/were also commercial
products that do that. It's a tremendous effort and I'm impressed by
their (MTCP) work so far.

It is important to understand that it has a price tag: performance
and complexity. It's usually useful for HPC needs, but unsuitable

Hmmm... the kernel already does much of it - for instance, we have
neat pid-namespace infrastructure; does it make sense to go into
the trouble of adding interfaces to provide for pid-virtualization
in userspace ?  we should ...
From: Oren Laadan
Date: Wednesday, November 3, 2010 - 9:03 pm

Hi,

(disclaimer: you may want to grab a cup of your favorite coffee)


I agree, it *looks* scary. But that's mostly because it's a dumb
diff out of context, rather than a standard "patch" as a set of
logical incremental changes. So posting this diff is probably the
worst way to present the impact on existing code. It merely gives
a ballpark of that.

However, please keep in mind that this diff is really an aggregate
of multiple unrelated, structured, small changes, including:
- cleanups (e.g. x86 ptrace)
- refactoring (e.g. ipc, eventpoll, user-ns)
- new features/enhancements (e.g. splice, freezer, mm)

I'm confident that each of these will make more sense when presented

In the ksummit presentation I gave an extensive list of real
use-cases (existing and future). The slides are here:
    http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf

For more technical details there is also the OLS-2010 paper here:
    http://www.cs.columbia.edu/~orenl/papers/ols2010-linuxcr.pdf
presentation slides from there are here:

I'm unsure which states you have in mind that will not be well defined.

It is a difficult problem, and C/R has limitations, but I think we've
got it pretty right this time :)

* we save and restore *all* *execution* state of the applications
 (except for well-defined unsupported features; hardware devices
 are one such example).

* we don't save FS state (use filesystem snapshots for that); but
 we do save runtime FS state (e.g. open files, etc).

* we don't save state of peers (applications/systems) over network;
 but we do save network connections for proper live-migration.

(Of course, there is a supporting userspace ecosystem, like utilities
to do the checkpoint/restart, to freeze/thaw the application, to
snapshot the filesystem etc).

So unless the application uses an unsupported resource - it will be

I have a cool demo (and I gave one today!) that shows how I run one
desktop session and restart an older desktop session that then runs
in ...
From: Tejun Heo
Date: Thursday, November 4, 2010 - 2:43 am

Hello, Oren.



Yeah, could be so but I wasn't really referring to the scariness of
the patch per-se but rather how many subsystems CR needs to interact


If you think only about target processes, yeah sure, you can cover
most of the stuff but that's not the impossible part.  What's not
defined is interaction with the rest of the system and userland.
Userland ecosystem is crazy complex.  You simply cannot stop, say,
banshee or even pidgin, let it mingle with the rest of the system and

I'm afraid I can't agree with that.  You can store and restore the
states which kernel is aware of but that's a very small fraction of

Sure, you can freeze whole tree of related processes and move them
around, but if you think about it, it's an already broken scenario.
For example, dbus (or rather agents listening to it) doesn't only
carry states specific to the set of applications being snapshotted.
It also carries whole bunch of system-wide states or states for other
applications.  As soon as the system goes on executing after
checkpointing, the checkpointed image of dbus and its agents become
inconsistent and useless.  You can't restore it later.  You don't know
what happened to other parts of the system inbetween.

And this problem doesn't stem from technical details of the
implementation.  It's fundamental.  CR tries to snapshot subset of a
big state machine and then use the snapshot later or elsewhere.  It
doesn't and can't have full visibility into how the subset of states
have and are going to interact with the rest of the states.  As soon
as the whole state machine makes progress, there is no guarantee of
consistency.

Without explicit provisions for specific applications, it just can't
work in generic manner.  Can I move my banshee or gwibber to my next
machine transparently with in-kernel CR or even restore it later?  In
many cases, even I (the user) can't define what the desired states

So, that's why it comes down to containers and namespaces.  You need
to preemptively put ...
From: Luck, Tony
Date: Thursday, November 4, 2010 - 5:48 am

This is why I think it is important to define the limits of
which kernel state features are covered (or going to be
covered) by checkpoint/restart - and then list applications
that are supported (Oren mentioned mysql server in this thread).
It will always be easy for someone to point at some application
like powertop and say "we can't migrate that, so checkpoint
restart is therefore useless" ... this just is not true. This
can be useful without having to be complete (as long as the

See above - it may be enough to cover a significant number of

Okay - so "dbus" is in the list of "can't do that now, and will
never be able to checkpoint/restore that class" - big deal. I'm
getting repetitive now, but one last time: just because this can't

I don't think that you'll ever make virtualization good enough

The CR cool-aid hasn't gotten so far into my system to accept
this claim.  If these "can't stop for more than a few milli-seconds"
processes are HPC workloads, then I'm not seeing how you can do
much to help them.  I think these applications are using almost
all of the RAM on the system, and most of the pages are anonymous.
Just how do you checkpoint several GB of dirty pages in a few
milli-seconds (when there is almost no free memory on the system)?

If you have something else in mind, then please explain a little more.

-Tony
--

From: Tejun Heo
Date: Thursday, November 4, 2010 - 6:06 am

Hello,


I was arguing that it is far from being _generally_ useful or
transparent.  If you're saying that it is something useful for certain

If you think about HPC, userland implementation is enough.  In 99% of
cases, those programs just read and write data files and burn a lot of
CPU cycles.  You don't need a lot of fancy stuff to do that.  More
important things would be integrating with job management so that
snapshots and rollbacks can be automatically done.

I agree that CR would be very useful for certain use cases and
applications.  I just can't see where the giant patchset fits between
userland implementation which seems enough for the most common use
case of HPC and virtualization which is maturing fast.

Thanks.

-- 
tejun
--

From: Matt Helsley
Date: Saturday, November 6, 2010 - 3:12 am

On Thu, Nov 04, 2010 at 10:43:15AM +0100, Tejun Heo wrote:



If you think specialized hardware acceleration is necessary for
containers then perhaps you have a poor understanding of what a container
is. Chances are if you're running a container with namespaces configured
then you're already paying the performance costs of running in a
container. If you've compared the performance of that kernel to your
virtualization hardware then you already know how they compare.

For containers everything is native. You're not emulating instructions.
You're not running most instructions and trapping some. You're not
running whole other kernels, coordinating sharing of pages and cpu
with those kernels, etc. You're not emulating devices, busses,
interrupts, etc. And you're also not then circumventing every
virtualization mechanism you just added in order to provide decent
performance.

I rather doubt you'll see a difference between "native" hardware and...
native hardware. And I expect you'll see much better performance in one of
your containers than you'll ever see in some hand-waved
hypothetically-improved virtualization that your response implored us to
work on instead.

Our checkpoint/restart patches do *NOT* implement containers. They
sometimes work with containers to make use of checkpoint/restart simple.
In fact they are the strategy we use to enable "generic"
checkpoint/restart that you seem to think we lack. Everything else is
an optimization choice that we give userspace which virtualization
notably lacks.

Like above, I expect that your virtualization hardware will compare
unfavorably to kernel-based checkpoint/restart of containers. Imagine
checkpointing "ls" or "sleep 10" in a VM. Then imagine doing so for a
container. It takes way less time and way less disk for the container.

(It's also going to be easier to manage since you won't have to do
lots of special steps to get at the information in a container which is
shutdown or even one that's running. If "mycontainer" is ...
From: Tejun Heo
Date: Saturday, November 6, 2010 - 4:03 am

Hello,



Sure, that was my point.  So, let's drop the handwaving about being

Yeah, and imagine what people would say if ext4, or heaven forbid,

I don't believe my notion of containers was or is flawed and already
said that the diffstat per-se didn't look too bad.  With enough
benefits, I wouldn't be opposed against the rather invasive changes.
It's just that the whole thing is conceived backwards and there are
already working alternatives which may be somewhat messy now but
nevertheless achieve about the same effect without the craziness of
serializing in-kernel data structures which are already mostly visible
to userland to begin with.

Thanks.

-- 
tejun
--

From: Davide Libenzi
Date: Sunday, November 7, 2010 - 3:59 pm

Please, do not compare things like single file systems, drivers, or 
otherwise fairly isolated components, with this "thing".
This thing touches a freaky-large number of subsystems, effectively 
adding a glueage between them, which might end up causing problems 
(and/or restrict design choices) in the future.
The naked patch looks like just a sugar coating to me, which left out 300+ 
lines of extra logic in epoll alone.
This is one of the widest, deepest, intrusive patches I have seen in a 
while, whose inclusion would require a little bit more than handwaving and 
continuous re-posting IMO.



- Davide


--

From: david
Date: Sunday, November 7, 2010 - 7:32 pm

I've got a question about the ABI that would be created

I see two possible areas that could be considered an ABI

1. control of the C/R process

   This is very clearly a userspace ABI, to be figured out and locked down 
like any other ABI

2. the details of how things are stored and added back into a system

   This is not as clear. at one extreme, this could be like the module 
interface, (the checkpointed image is only guaranteed to work on a new 
system with a kernel compiled with the same config options as the system 
it was checkpointed from). At the other extreme, this could be something 
that allows you to checkpoint an image on 2.6.40 and restore it on 2.6.80. 
Or it could be something in between.

I don't see any way that it is sane to make the C/R image definition and 
interface (#2) be an ABI that is guaranteed to never change without 
hurting future kernel development (exactly the type of things that Davide 
is worried about above), but what sort of guarantee are people interested 
in?

is it enough to say that it must be the same kernel version compiled with 
the same options? (or at least the same options for some list of things 
that matter, most device drivers probably would not matter for example)

or would you need compatibility across all compile options for a kernel 
release?

would you require compatibility between 2.6.x.y and 2.6.x.z?

would you require compatibility between 2.6.x and 2.6.x+n (for some value 
of n)?

is this something that could go in with the weakest guarantee initially, 
and then as everyone is more comfortable with it, start extending the 
guarantee (and as-needed adding code to the kernel to maintain 
compatibility with old images)?

would you require compatibility between 2.6.x and 2.6.x-n?

David Lang
--

From: Oren Laadan
Date: Thursday, November 18, 2010 - 1:41 pm

Agreed. The guarantee should be to specific kernels, in a sense (see
Matt's post in this thread 11/17).

The image format is tied to "set of features supported" (which boils
down to something like kernel version).  The format is constructed
in a modular way such that most new features can be added without
breaking old format. For the rare cases that they do, conversion
can be done in userspace in a straightforward manner. (All you need

We don't "require" compatibility. The compatibility is defined per
object (type) in the image format. New objects need not break
compatibility. Changes to objects are very rare; and when they happen
they "bump" the version. This can help avoid issues related to kernel
configs/options. Restarting an image incompatible with a particular
kernel will fail, adjustments should be done by userspace filtering.
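
To make the per-object compatibility idea concrete, here is a rough sketch
of a userspace pre-check. The object type names and version numbers are
hypothetical and are not taken from the linux-cr image format; the point is
only that compatibility is decided per object type, so a new type or a
bumped version is detected individually.

```python
# Hypothetical object types and versions, for illustration only.
KERNEL_SUPPORTED = {"task": 1, "mm": 2, "file": 1}


def incompatible_objects(image_objects, supported=KERNEL_SUPPORTED):
    """Return sorted (type, version) pairs that this kernel cannot restore.

    image_objects: iterable of (object_type, version) pairs found in an image.
    """
    bad = {
        (obj_type, version)
        for obj_type, version in image_objects
        if supported.get(obj_type) != version
    }
    return sorted(bad)


def can_restart(image_objects, supported=KERNEL_SUPPORTED):
    """True if every object in the image matches a supported version."""
    return not incompatible_objects(image_objects, supported)
```

A userspace filter would take the list returned by incompatible_objects()
as its to-do list: those are the records to convert before retrying the
restart.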

Thanks,

Oren.
--

From: Kapil Arya
Date: Thursday, November 4, 2010 - 8:55 pm

(Sorry for the length of this email, we are excited about being able
to discuss technical details.)

This is wonderful to have this exchange of techniques and visions.  Oren, we
are guessing that you are at Columbia. If so, we would love to have you come up
here and give a talk in Boston. Alternatively, if you prefer, we would be happy
to go to Columbia and give a talk there.

In comparing functionality, one recent bug we had to overcome was with screen
with a hardstatus line and a scroll region for the terminal. We eventually
solved it in a subtle way by sending SIGWINCH, and then lying to screen about
changing the kernel window size, and then sending screen another SIGWINCH while
telling it the true window size. We were pleased to see that Linux C/R also
supports screen and we are curious how it handles this issue of restoring the
scroll region in the X11 terminal window. Thanks.

Oren noted that sometimes it's important to stop the process only for a few
milliseconds while one checkpoints. In DMTCP, we do that by configuring with
--enable-forked-checkpointing. This causes us to fork a child process taking
advantage of copy-on-write and then checkpoint the memory pages of the child
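
The fork-based idea can be sketched in a few lines. This is not DMTCP code;
the function names are invented for the example, and the snapshot is sent
back over a pipe rather than written to a checkpoint file so that the
result is easy to inspect. The parent is blocked only for the duration of
fork(); the child keeps a copy-on-write view of memory as of that moment.

```python
import os
import pickle


def forked_checkpoint(state):
    """Fork a child that serializes `state`; return (read_fd, child_pid)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: its memory is a copy-on-write view taken at fork() time.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(state, out)
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path
    # Parent: resumes immediately and may keep modifying `state`.
    os.close(write_fd)
    return read_fd, pid


def collect_checkpoint(read_fd, pid):
    """Read the child's snapshot and reap the child."""
    with os.fdopen(read_fd, "rb") as src:
        snapshot = pickle.load(src)
    os.waitpid(pid, 0)
    return snapshot
```

Every page the parent touches while the child is still alive gets copied,
which is where the memory overhead of this approach comes from.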

This is a good example of distinct approaches when starting from Kernel C/R or
user-space C/R. We currently checkpoint VNC servers in a way similar to Linux
C/R. However, in the next few months, we want to directly checkpoint a single
X-windows application without the X11-server. The approach is easily understood
by analogy. Currently libc.so talks to the kernel. At checkpoint time, we
interrogate the kernel state and then "break" the connection to the kernel and
checkpoint. Similarly, libX11.so (or libX11-xcb.so) talks to the X11-server. At
checkpoint time, we will interrogate the state of the X11-server and then break

Thanks very much for bringing up these implementation questions. It's wonderful
to have someone interested in the low level technology to talk to. We would
like to share with you ...
From: Luck, Tony
Date: Friday, November 5, 2010 - 4:57 am

Interesting ... but while the process is only stopped for the duration
of the fork, it may be taking COW faults on almost every page it
touches.  I think this will not work well for large HPC applications
that allocate most of physical memory as anonymous pages for the
application. It may even result in an OOM kill if you don't complete
the checkpoint of the child and have it exit in a timely manner.

-Tony


--

From: Gene Cooperman
Date: Friday, November 5, 2010 - 10:17 am

I agree with you that forked checkpointing is probably not what you
want in the middle of an HPC computation.  But isn't that part of
the nature of COW?  Whether the COW is invoked within the kernel,
or from outside the kernel via fork --- in either case, when you have
mostly dirty pages, you will have to copy most of the pages.
Do I understand your point correctly?			Thanks,
							- Gene
--

From: Matt Helsley
Date: Friday, November 5, 2010 - 6:16 pm

The current linux-cr approach to handling [dirty] pages doesn't use COW.
The tasks are frozen using the cgroup freezer and thus unable to modify
the pages. So we don't have to mess with page tables nor do we pay
any extra overhead for page faults.
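
For readers unfamiliar with the freezer, the freeze/checkpoint/thaw
sequence looks roughly like this from userspace. The sketch assumes the
cgroup (v1) freezer interface, where a group is frozen by writing FROZEN to
its freezer.state file; the cgroup path and the do_checkpoint callback are
placeholders, and a real tool would also poll freezer.state until it reads
back FROZEN rather than FREEZING.

```python
import os


def set_freezer_state(cgroup_dir, state):
    """Write FROZEN or THAWED to the group's freezer.state file."""
    if state not in ("FROZEN", "THAWED"):
        raise ValueError("state must be FROZEN or THAWED")
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def checkpoint_frozen(cgroup_dir, do_checkpoint):
    """Freeze the group, run do_checkpoint(), and always thaw afterwards."""
    set_freezer_state(cgroup_dir, "FROZEN")
    try:
        # Tasks cannot run here, so their pages cannot change underneath us.
        return do_checkpoint()
    finally:
        set_freezer_state(cgroup_dir, "THAWED")
```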

If we ever implement thawed checkpointing -- checkpointing while
the task isn't frozen -- then we'd probably use COW and see
the same faults. The difference then would be that in-kernel we
wouldn't have one extra task per mm being checkpointed.

Cheers,
	-Matt Helsley
--

From: Oren Laadan
Date: Friday, November 5, 2010 - 9:06 pm

The current linux-cr patchset leaves out any optimizations
for simplicity of reviewing - first get it working and reviewed.

Thawed checkpointing can be done without any COW tax, by leveraging
the native hardware dirty bit in page tables. There is no need to
trigger additional checkpoints. Tracking modified pages using the
dirty bit is a feature also desired by the KVM community, and we
plan to work with them on implementing it.

Oren.
--

From: Matt Helsley
Date: Friday, November 5, 2010 - 10:18 pm

s/checkpoints/faults/

Cheers,
	-Matt Helsley
--

From: Oren Laadan
Date: Saturday, November 6, 2010 - 2:00 pm

COW is one way of reducing down time (whether through fork or
in-kernel checkpoint). However, it is possible to avoid using
it (and thus avoid extra page faults and memory overload) by
using the page-table "dirty" bit to track dirty pages. This way
one can "pre-copy" the checkpoint image while the application is
running, without additional overhead (the idea is similar to how
live-migration is done).
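
The pre-copy loop can be modelled with a small simulation. Nothing here
touches real page tables: run_slice is a stand-in for "let the application
run for a while, then read and clear the dirty bits", and the round limit
and threshold are arbitrary values chosen for the example.

```python
def precopy_checkpoint(memory, run_slice, max_rounds=5, threshold=1):
    """Copy pages while the app runs, then stop it for a small final copy.

    memory:    dict of page number -> contents, mutated by the running app
    run_slice: callable that lets the app run briefly and returns the set
               of pages it dirtied (stand-in for hardware dirty bits)
    Returns (image, pages_copied_while_stopped).
    """
    image = {}
    dirty = set(memory)  # first round: every page still has to be copied
    for _ in range(max_rounds):
        if len(dirty) <= threshold:
            break
        for page in dirty:
            image[page] = memory[page]
        dirty = set(run_slice())  # pages touched while we were copying
    # The application is stopped here; only the remaining dirty pages
    # are copied during the down time.
    for page in dirty:
        image[page] = memory[page]
    return image, len(dirty)
```

If the application dirties pages faster than they can be copied, the loop
hits max_rounds and the final stop-and-copy is large, which matches the
concern raised earlier in the thread about workloads with mostly dirty
anonymous memory.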

Oren.
--

From: Sukadev Bhattiprolu
Date: Friday, November 5, 2010 - 10:31 am

Like Oren said, we run the application inside the container - which would have
its own pid namespace. When we restart, we again create a container, which
starts with a fresh pid namespace, so the pids will not be in use. IOW, a process
has a virtual pid and a global pid. The virtual pid is what the application sees
when it calls getpid() and that pid will be correctly restored when you create
the container.
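
A toy model of the virtual/global pid split, to show why the restored pids
cannot collide. This is not kernel code or a kernel API; the PidNamespace
class and the global counter are invented for the example.

```python
import itertools

# Host-wide pid allocator; the starting value is arbitrary.
_global_pids = itertools.count(1000)


class PidNamespace:
    """Toy pid namespace: maps virtual pids to host-global pids."""

    def __init__(self):
        self.virt_to_global = {}

    def spawn(self, want_vpid=None):
        """Create a task, optionally asking for a specific virtual pid."""
        if want_vpid is None:
            vpid = max(self.virt_to_global, default=0) + 1
        else:
            vpid = want_vpid
        if vpid in self.virt_to_global:
            raise ValueError("virtual pid %d already in use" % vpid)
        self.virt_to_global[vpid] = next(_global_pids)
        return vpid
```

Restart creates a new PidNamespace and calls spawn() with the virtual pids
recorded in the image. The namespace is empty, so every request succeeds,
while the global pids differ from the ones the tasks had at checkpoint
time.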

Sukadev
--

From: Oren Laadan
Date: Saturday, November 6, 2010 - 2:05 pm

With pleasure.
(LPC would have been a good opportunity - I was in Boston).

Oren.
--
