(cc'ing lkml too) Hello, The patch size itself isn't too big but I still think it's one scary patch mostly because the breadth of the code checkpointing needs to modify and I suspect that probably is the biggest concern regarding checkpoint-restart from implementation point of view. FWIW, I'm not quite convinced checkpoint-restart can be something which can be generally useful. In controlled environments where the target application behavior can be relatively well defined and contained (including actions necessary to rollback in case something goes bonkers), it would work and can be quite useful, but I'm afraid the states which need to be saved and restored aren't defined well enough to be generally applicable. Not only is it a difficult problem, it actually is impossible to define common set of states to be saved and restored - it depends on each application. As such, I have difficult time believing it can be something generally useful. IOW, I think talking about its usage in complex environments like common desktops is mostly handwaving. What about X sessions, network connections, states established in other applications via dbus or whatnot? Which files need to be snapshotted together? What about shared mmaps? These questions are not difficult to answer in generic way, they are impossible. There is a very distinctive difference between system wide suspend/hibernation and process checkpointing. Most programs are already written with the conditions in mind which can be caused by system level suspend/hibernation. Most programs don't expect to be scheduled and run in any definite amount of time. There usually are provisions for loss or failure of resources which are out of the local system. There are corner cases which are affected and those programs contain code to respond to suspend/hibernation. Please note that this is about userland application behavior but not implementation detail in the kernel. It is a much more fundamental property. So, although ...
Thanks Tejun, your writeup brought up a lot of the same issues that I see with the in-kernel C/R. Various C/R implementations that are entirely in userspace or with limited kernel assistance have been in production in HPC environments for years. I think especially for these workloads C/R is an extremely useful feature, and a standard implementation would do Linux well. But I think the "transparent" in-kernel one is the wrong approach. It tries to give the illusion that C/R will just work, while a lot of things are simply not supported. In this case whitelisting the allowed state by requiring special APIs for all I/O (or even just standard APIs as long as they are supported by the C/R lib you're linked against) is the more pragmatic, and I think faithful approach. In addition to the amount of state not supported despite looking transparent the other big problem with the patchset is that it saves the kernel internal state which changes all the time from one release to another. The handwaving is that a userspace tool will solve it. I'm pretty sure that's not the case; it might solve a few cases but the general version n to version m conversion is impossible to maintain. Just look at the problems qemu has with migration between just a handful of versions of the relatively well (compared to random kernel state) defined vmstate format. --
FWIW there are a couple of kernel-based C/R implementations (BLCR, I think this is somewhat true of the implementation under consideration here (although generally it should fail checkpoints that it can't restart), but it needn't be true of all possible kernel-based I don't think users will go for it. They'll continue to use dodgy out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting their applications to a new library. I think a C/R library is an "ideal" solution, but it's one that nobody would use - especially in HPC, unless the library somehow provides better performance. The namespace/isolation features of Linux (CLONE_NEWPID et al) already provide a pretty workable basis for creating tractably checkpoint- and-restartable jobs, with a minimum of performance overhead and Most of the objects that the patchset saves and restores are right at the "border" of the user/kernel interface, and they're not apt to change much quickly (e.g. vma start and end, task sigaltstack info). The patchset certainly isn't serializing deep internal state such as wait With this I agree, though. But if a change in kernel implementation details forces an incompatible change in the checkpoint image format, is that really a big deal? Would it be so bad to say that a checkpoint image may be restarted only on the same kernel version that created it? With -stable or enterprise kernels I suspect the issue is unlikely to come up. --
Hello, I hear that there are plans to integrate one of the userland snapshotting implementations with HPC workload manager. ISTR the combination to be condor + dmtcp but not sure. I think things like that make a lot of sense. Scientists writing programs for HPC clusters already work in given frameworks and what those applications do and how to recover are pretty well confined/defined. If you integrate snapshotting with such frameworks, it becomes pretty easy for both the admins and users. I'll talk about other issues in the reply to Oren's email. Thanks. -- tejun --
Yes, we are working with Condor to have them validate DMTCP. Time will tell. - Gene --
If you look at the C/R implementations of those two projects you'll see that they don't implement what I take to be hch's suggestion - a library or platform with special-purpose APIs to which applications are ported in order to gain C/R ability. For all their good points, the projects you mention do interposition for glibc's syscall wrappers and provide a few optional hooks so apps can control certain aspects of C/R. --
And even if they did, I don't think asking application developers to use such a broad API -- one that requires special APIs for all I/O -- is practical for many of the purposes outlined at kernel summit. I think DMTCP is better off for not attempting to mandate such APIs. How rare is it for an application or library to change the underlying APIs it uses? How many applications have been ported say from Gnome to KDE (or vice-versa) over the lifetime of the project? Relative to all the other applications? I would hazard a guess that most were rewritten rather than ported and that those that were ported are an utterly insignificant fraction of what's out there. It's much better to offer tools that, as much as possible, don't care which APIs the applications use. Cheers, -Matt Helsley --
Hi Christoph, I really wish you would have raised these concerns during the ksummit or thereafter. I'm here (LPC) until Friday, and would be happy to discuss any aspect of the linux-cr while at it (and if needed can post a summary to the list). The fact is that an in-kernel implementation can and does support a significantly larger feature-set. Linux-cr does not and will not support everything. Nearly all driver devices won't be supported in the near future (but interested vendors could build such functionality into their drivers!). Also, pseudo file systems like sysfs, procfs, debugfs will at most get partial support. But apart from that, it really covers (or will soon) nearly everything. "Transparent" means that applications don't know that they are being checkpointed, nor do they need to cooperate. So linux-cr is *completely* transparent to applications that are checkpointable. Perhaps you can elaborate on the "state not supported despite It is our experience that the format is pretty immune to changes that occur to in-kernel (and not user/ABI visible) structures. It mainly changes when we add new features - and I expect that to happen less frequently once the patchset finds its way to the The problem space is smaller, because we are aiming at a simpler goal. We need to always know how to convert from version N to version N+1. Then conversion from N to N+k is a series of these conversions. QEMU has a broader goal: IIUC, both QEMU and KVM versions may change, they are not tied to each other. So the problem is harder. In linux-cr, the format is tied to the version of objects that the kernel that outputs/inputs the data knows. That makes things much simpler. Oren. --
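The "N to N+1, then chain" idea in the message above can be illustrated with a small sketch. This is not linux-cr code: the image is modelled as a plain dict, and the field names (altstack, sigaltstack, vdso_base) and the two converters are hypothetical, chosen only to show how single-step converters compose into an N to N+k upgrade.

```python
from typing import Callable, Dict

# Hypothetical table: CONVERTERS[n] upgrades an image from version n to n + 1.
CONVERTERS: Dict[int, Callable[[dict], dict]] = {}


def step(from_version: int):
    """Register a converter from `from_version` to `from_version + 1`."""
    def register(func):
        CONVERTERS[from_version] = func
        return func
    return register


@step(1)
def v1_to_v2(image: dict) -> dict:
    # Illustrative change only: version 2 renames a field.
    out = dict(image)
    out["sigaltstack"] = out.pop("altstack")
    out["version"] = 2
    return out


@step(2)
def v2_to_v3(image: dict) -> dict:
    # Illustrative change only: version 3 adds a field with a default.
    out = dict(image)
    out.setdefault("vdso_base", None)
    out["version"] = 3
    return out


def convert(image: dict, target: int) -> dict:
    """Upgrade `image` to `target` by applying N -> N+1 steps in order."""
    current = image["version"]
    if current > target:
        raise ValueError("downgrade not supported")
    while current < target:
        if current not in CONVERTERS:
            raise ValueError(f"no converter from version {current}")
        image = CONVERTERS[current](image)
        current = image["version"]
    return image
```

The sketch also makes the maintenance cost visible: every released format version needs exactly one converter, and a missing link anywhere in the chain blocks all upgrades that pass through it.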
Discussing technical topics with slides in a big room is utterly pointless. Just like during all the other such boring talks during KS I was either asleep, working on something important or out of the room doing the extended hallway track. If you want to discuss invasive kernel changes with people do it by email. The chance that anyone is going to listen to you is a lot higher. --
(Sorry for resending the message; the last message contained some html tags and was rejected by server) We would like to thank the previous post for bringing up the topic of kernel C/R versus userland C/R. We are two of the developers of DMTCP (userland checkpointing): Distributed MultiThreaded CheckPointing . http://dmtcp.sourceforge.net We had waited to write to the kernel developers because we had wanted to ensure that DMTCP is sufficiently robust before wasting the time of the kernel developers. This thread seems like a good opportunity to begin a dialogue. In fact, we only became aware of Linux kernel C/R this September. Of course, we were aware of Oren Laadan's fine earlier work on ZapC for distributed checkpointing using the Linux kernel (CLUSTER-2005). We have a high respect for Oren Laadan and the other Linux C/R developers, as well as for the developers of BLCR (a C/R kernel module with a userland component that is widely used in HPC batch facilities). By coincidence, when we became aware of Linux C/R, we were already in the middle of development for a major new release of DMTCP (from version 1.1.x to 1.2.0). We just finished that release. Among other features, this release supports checkpointing of GNU 'screen', and we have tested screen in some common use cases (with vim, with emacs, etc.). While it supports ssh (e.g. checkpointing OpenMPI, which uses ssh), it doesn't yet support _interactive_ ssh sessions. That will come in the next release. We believe that both Linux C/R and DMTCP are becoming quite mature, and that in general, one can achieve good application coverage with either. In our personal view, a key difference between in-kernel and userland approaches is the issue of security. The Linux C/R developers state The previous posts also brought up the issue of external connections. While DMTCP has been developed over six years, in the last year we have concentrated especially on the issue of external connections. While we've accumulated ...
Hello, And please also don't top-post. Being the antisocial egomaniacs we are, people on lkml prefer to dissect the messages we're replying to, insert insulting comments right where they would be most effective and That's an interesting point but I don't think it's a dealbreaker. Kernel CR is gonna require userland agent anyway and access control can be done there. Being able to snapshot w/o root privilege definitely is a plus but it's not like CR is gonna be deployed on Yeap, agreed. There gotta be user agents which can monitor and manipulate userland states. It's a fundamentally nasty job, that of collecting and applying application-specific workarounds. I've only glanced at the dmtcp paper so my understanding is pretty superficial. With that in mind, can you please answer some of my curiosities? * As Oren pointed out in another message, there are some things which could seem a bit too visible to the target application. Like the manager thread (is it visible to the application or is it hidden by the libc wrapper?) and reserved signal. Also, while it's true that all programs should be ready to handle -EINTR failure from system calls, it's something which is very difficult to verify and test and could lead to once-in-a-blue-moon head scratchy kind of failures. I think most of those issues can be tackled with minor narrow-scoped changes to the kernel. Do you guys have things on mind which the kernel can do to make these things more transparent or safer? * The feats dmtcp achieves with its set of workarounds are impressive but at the same time look quite hairy. Christoph said that having a standard userland C-R implementation would be quite useful and IMHO it would be helpful in that direction if the implementation is modularized enough so that the core functionality and the set of workarounds can be easily separated. Is it already so? Thanks. -- tejun --
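For readers unfamiliar with the -EINTR concern raised above, the usual defensive pattern is a retry loop around the interrupted call. The following is a minimal sketch of that pattern only; the helper name retry_on_eintr and the max_retries bound are assumptions made for illustration, and this is not taken from DMTCP or the kernel. (Python itself has retried most interrupted system calls automatically since version 3.5, so the loop is shown for the pattern rather than out of necessity.)

```python
import errno


def retry_on_eintr(call, *args, max_retries=100):
    """Call `call(*args)`, re-issuing it whenever it fails with EINTR.

    Any other OSError is propagated unchanged. The retry bound is an
    arbitrary safety limit chosen for this sketch.
    """
    for _ in range(max_retries):
        try:
            return call(*args)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise
    raise OSError(errno.EINTR, "still interrupted after retries")
```

The difficulty described in the message is that an application must apply this pattern at every call site that can be interrupted, and a single missed site only fails when a signal happens to arrive at that exact moment.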
This is a good point to clarify some issues. C/R has several good
targets. For example, BLCR has targeted HPC batch facilities, and
does it well.
DMTCP started life on the desktop, and it's still a primary focus of DMTCP.
We worked to support screen on this release precisely so that advanced
desktop users have the option of putting their whole screen session
under checkpoint control. It complements the core goal of screen:
If you walk away from a terminal, you can get back the session elsewhere.
If your session crashes, you can get back the session elsewhere
These are also some excellent points for discussion! The manager thread
is visible. For example, if you run a gdb session under checkpoint
control (only available in our unstable branch, currently), then
the gdb session will indeed see the checkpoint manager thread.
So, yes. We are not totally transparent, and a skilled user must
account for this. There are analogies (the manager thread in the
original LinuxThreads, the rare misfortune of gdb to lose
track of the stack frames).
We try to hide the reserved signal (SIGUSR2 by default, but the user can
configure it to anything else). We put wrappers around system calls
that might see our signal handler, but I'm sure there are cases where
we might not succeed --- and so a skilled user would have to configure
to use a different signal handler. And of course, there is the rare
application that repeatedly resets _every_ signal. We encountered
this in an earlier version of Maple, and the Maple developers worked
Exactly right! Excellent point. Perhaps this gets down to philosophy,
and what is the nature of a bug. :-) In some cases, we have encountered
this issue. Our solution was either to refuse to checkpoint within
certain system calls, or to check the return value and if there was
an -EINTR, then we would re-execute the system call. This works again,
For the most part, we've always found a way to work within the current
design of the kernel. We consider this ...Hello, Call me skeptical but I still don't see, yet, it being a mainstream thing (for average sysadmin John and proverbial aunt Tilly). It definitely is useful for many different use cases tho. Hey, but let's I don't think gdb seeing it is a big deal as long as it's hidden from I'm probably missing something but can't you stop the application using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry about -EINTR failures (there are some exceptions but nothing really to worry about). Also, unless the manager thread needs to be always online, you can inject manager thread by manipulating the target I see. I just thought that it would be helpful to have the core part - which does per-process checkpointing and restoring and corresponds to the features implemented by in-kernel CR - as a separate thing. It already sounds like that is mostly the case. I don't have much idea about the scope of the whole thing, so please feel free to hammer senses into me if I go off track. From what I read, it seems like once the target process is stopped, dmtcp is able to get most information necessary from kernel via /proc and other methods but the paper says that it needs to intercept socket related calls to gather enough information to recreate them later. I'm curious what's missing from the current /proc. You can map socket to inode from /proc/*/fd which can be matched to an entry in /proc/*/net/PROTO to find out the addresses and most socket options should be readable via getsockopt. Am I missing something? I think this is why userland CR implementation makes much more sense. Most of states visible to a userland process are rather rigidly defined by standards and, ultimately, ABI and the kernel exports most of those information to userland one way or the other. 
Given the right set of needed features, most of which are probably already implemented, a userland implementation should have access to most information necessary to checkpoint without resorting to too ...
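The /proc lookup suggested in the message above (socket inode from /proc/*/fd, then the matching row in /proc/*/net/PROTO) can be sketched roughly as follows. This only illustrates the parsing step for IPv4 TCP; it assumes the usual /proc/net/tcp column order (inode in the tenth field) and a little-endian host for the address encoding, and the function names are made up for this example.

```python
import re
import socket

SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def socket_inode(link_target):
    """Return the inode from a /proc/PID/fd symlink target, or None."""
    match = SOCKET_LINK.match(link_target)
    return int(match.group(1)) if match else None


def decode_ipv4(hex_addr):
    """Decode 'AABBCCDD:PPPP' as printed in /proc/net/tcp.

    Assumes a little-endian host, where the address bytes appear reversed.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(bytes.fromhex(ip_hex)[::-1])
    return ip, int(port_hex, 16)


def find_tcp_entry(inode, proc_net_tcp_text):
    """Find the /proc/net/tcp row whose inode column matches `inode`."""
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 9 and int(fields[9]) == inode:
            return {
                "local": decode_ipv4(fields[1]),
                "remote": decode_ipv4(fields[2]),
                "state": int(fields[3], 16),
            }
    return None
```

On a live system the inputs would come from os.readlink() on an entry under /proc/PID/fd and from reading /proc/PID/net/tcp. Note that this recovers addresses and connection state only; data queued in socket buffers is not visible this way.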
This is an excellent example to demonstrate several points: * To freeze the processes, you can use (quote) "hairy" signal overload mechanism, or even more hairy ptrace; both by the way have their performance problem with many processes/threads. Or you can use the in-kernel freezer-cgroup, and forget about workarounds, like linux-cr does. And ~200 lines in said diff are dedicated exactly to that. * Then, because both the workaround and the entire philosophy of MTCP c/r engine is that affected processes _participate_ in the checkpoint, their syscalls _must_ be interrupted. Contrastly, linux-cr kernel approach allows not only to checkpoint processes without collaboration, but also builds on the native signal handling kernel code to restart the system calls (both after unfreeze, and after restart), such that the original process Aha ... another great example: yet another piece of the suspect diff in question is dedicated to allow a restarting process to request a specific location for the vdso. BTW, a real security expert (and I'm not one...) may argue that this operation should only be allowed to privileged users. In fact, if your code gets around the linux ASLR mechanisms, then someone FWIW, the restart portion of linux-cr is designed with this in mind - it is flexible enough to accommodate for smart userspace tools and wrappers that wish to mock with the processes and their resource post-restart (but before the processes resume execution). For example, a distributed checkpoint tool could, at restart time, reestablish the necessary network connections (which is much different than live migration of connections, and clearly not a kernel task). This way, it is trivial to migrate a distributed application from one set of hosts to another, on So you'll need mechanisms not only to read the data at checkpoint time but also to reinstate the data at restart time. By the time you are done, the kernel all the c/r code (the suspect diff in question _and_ the rest of the ...
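As a rough illustration of the freezer-cgroup mechanism mentioned in the message above: with the cgroup v1 freezer controller, a group of tasks is frozen by writing FROZEN to the group's freezer.state file and resumed by writing THAWED. The sketch below assumes a cgroup directory already exists and is writable (typically under a freezer mount point, which varies by system); the helper names and the polling timeout are choices made for this example, not part of linux-cr.

```python
import os
import time


def set_freezer_state(cgroup_dir, state):
    """Write FROZEN or THAWED to the cgroup's freezer.state file."""
    if state not in ("FROZEN", "THAWED"):
        raise ValueError(f"unsupported freezer state: {state}")
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def read_freezer_state(cgroup_dir):
    """Return the current contents of freezer.state, stripped."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()


def freeze(cgroup_dir, timeout=5.0, poll=0.05):
    """Request a freeze and wait until the kernel reports FROZEN.

    The state may read FREEZING while tasks are still being stopped,
    so the file is polled until it settles or the timeout expires.
    """
    set_freezer_state(cgroup_dir, "FROZEN")
    deadline = time.monotonic() + timeout
    while read_freezer_state(cgroup_dir) != "FROZEN":
        if time.monotonic() > deadline:
            raise TimeoutError("cgroup did not reach FROZEN in time")
        time.sleep(poll)
```

The frozen tasks take no part in this: they are stopped by the kernel without being scheduled to run any checkpoint logic, which is the contrast the message draws with signal-based approaches.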
Hello, The above problems can be solved for userland C/R with small self-contained modification to a small part of the kernel. You're insisting that because currently some obscure corner cases aren't handled, the whole thing should be shoved in the kernel and the kernel should be serializing and deserializing its internal data structures for everything visible in the userland. That's silly at best. Note the "visible in the userland" part. Most of those parts are already discoverable without further modifications to kernel. The only sane approach would be to add missing pieces which would not only benefit CR but other applications too. Also, you said the patches didn't have to change much because the data structures facing userland didn't change much over different kernel versions, which of course is true as it's so close to the userland visible ABI. That is _NOT_ a selling point for kernel CR. That's a BIG GLOWING SIGN telling you that you're on the frigging wrong side of ASLR is to protect a program from itself not from outside. If you can Yeap, that was the reason why I asked how modularized that part of dmtcp was as it would directly compare with the in-kernel implementation. If they can be well separated, I think it would even be possible to switch between the two while keeping the upper set of Unfortunately, for most things which matter, everything is already in place and if you just concentrate on the core part the hackiness seems quite manageable and I think it wouldn't be too difficult to reduce it further. I don't see why userland implementation wouldn't be able to snapshot any random process without LD_PRELOADs or whatever cooperation from it. And, if the COW thing is so important, we can collect the information and export it to userland via proc or ringbuffer. That's what qemu-kvm would need anyway, right? I don't think kvm guys would be so crazy as putting the whole snapshotter into No, that's primarily not the feature of kernel CR. It's of ...
In fact CryoPid uses exactly the same approach and has been around for around 5 years. Not as much development effort has gone into CryoPid as DMTCP and so its application coverage is not as broad. But the larger issue for using PTRACE is that you can not have two superiors tracing the same inferior process. So if you want to checkpoint a gdb session or valgrind or tmux or strace, then you can not directly control and quiesce the inferior process being traced. Beyond that, we also have a vision (not yet implemented) of process virtualization by which one can change the behavior of a program. For example, if a distributed computation runs over infiniband, can we migrate to a TCP/IP cluster. For this, one needs the flexibility of wrappers around system calls. This vision of process virtualization also motivates why our own research Yes, we would love to elaborate :-). We began DMTCP with Linux kernel 2.6.3. When Address Space Layout Randomization was added, we were forced to add some hacks concerning VDSO location and end-of-data. end-of-data is the uglier part. On restart, we directly map each memory segment into the original address at checkpoint time. The issue comes in mapping heap back to its original location. We call sbrk() to reset the end-of-data to the end of the original heap. This fails if the randomized beginning-of-data/end-of-data given to us by the kernel for the restarted process is too far away from where we want to remap the heap. To get around this, we play games with legacy layout, other personality parameters, and RLIMIT_STACK (since the kernel uses RLIMIT_STACK in choosing the appropriate memory layout). For our wish list, we would like a way of telling the kernel, where to set beginning-of-data/end-of-data. Curiously enough, at the time at which Linux started randomizing address space, there was discussion of offering exactly this facility for the sake of legacy programs, but it turned out not to be needed. 
Similarly, it would be nice to tell the ...
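To make the memory-layout discussion above a little more concrete, here is a small sketch of how a userland tool can read where a process's segments, heap, and vdso currently sit by parsing /proc/PID/maps. It is not DMTCP's code; the Mapping type and helper names are invented for this example, and it only covers the reading side, not the remapping on restart that the message describes.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str  # empty for anonymous mappings; "[heap]", "[vdso]", etc. for special ones


def parse_maps(text: str) -> List[Mapping]:
    """Parse the contents of a /proc/PID/maps file into Mapping records."""
    mappings = []
    for line in text.splitlines():
        # Columns: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_hex, end_hex = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_hex, 16), int(end_hex, 16), fields[1], path))
    return mappings


def find_by_name(mappings: List[Mapping], name: str) -> List[Mapping]:
    """Return the mappings whose pathname column equals `name`."""
    return [m for m in mappings if m.path == name]
```

For example, find_by_name(parse_maps(open("/proc/self/maps").read()), "[heap]") reports the current heap range, when the process has one. Comparing that range across a checkpoint and a restarted process shows the randomization problem described above: the kernel may place the new heap far from the address recorded at checkpoint time.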
This is a very useful vision. However, it is unrelated to how you do c/r, but rather to what you do after you restart and before you let the application resume execution. For example, in your example, you'd need to wrap the library calls (e.g. of MPI implementation) and replace them to use TCP/IP or infiniband. Wrapping on system calls won't help you. Or you could just replace the resource - e.g., make the restarted application use a socket for stdout instead of the tty, so you can redirect the output to where-ever. Both methods are orthogonal to the c/r itself: linux-cr will allow you to replace/modify resources if you so wish, and I suspect that MTCP also can/will. Interposing on library calls is possible with MTCP methods, or using binary instrumentation, or PIN, or DynInst, or LD_PRELOAD. The only two reasons to interpose on system calls, as I noted in earlier message (http://lkml.org/lkml/2010/11/5/262 - see points "2)" and "3)" about userland-workarounds): One - to virtualize in userspace resources (e.g. pids) that the kernel already knows how to virtualize. Two - to track state of resources during execution and lie about their state when needed, because userspace can't cleanly save and restore their state. Virtualization through interposition is extremely tricky in and out of the kernel. The examples given throughout this thread (by either side) expose the tip of the iceberg. Interposition as a technique is full of security and other pitfalls, as discussed by extensive literature in the area. (I cited in another email). So I'll repeat the question I asked there: is re-implementing chunks of kernel functionality and all namespaces in userspace What is "reasonable" overhead ? For which applications ? What about a 'kernel make' ? What about servers (db, web, etc) ? What about VPSs/VDIs ? Can we do better, including for HPC ? Exactly ! Wrapping around apps to isolate them from the environment is desirable, regardless of how you ...
I'd like to add a few clarifications, below, about DMTCP concerning
Oren's comments. I'd also like to point out that we've had about 100
downloads per month from sourceforge (and some interesting use cases
from end users) over the last year (although the sourceforge numbers
do go up and down :-) ). In general, I think we'll all understand the
situation better after having had the opportunity to talk offline.
Below are some clarifications about DMTCP.
We do not put any wrappers around MPI library calls. MPI calls things
like open, close, connect, listen, execve({"ssh", ...}, ...), etc.
At this time, DMTCP adds wrappers _only_ around calls to libc.so
and libpthread.so . This is sufficient to checkpoint a distributed
Just a small correction about interposition. The primary "Reason Two"
for interposing on system calls should be to _spy_ on what the user process
is doing and save that information. For the most part, we do not
_lie about their state when needed_. I agree that virtualization of pids
is an exception where we have to lie, but that was already stated as
"Reason One" above. At restart time, we may also recreate resources that are
no longer in the kernel. But this is not an example of interposition.
I suppose that it is an example of lying, but every C/R technique will
need to do this.
Later, perhaps Oren, Kapil and I can browse the DMTCP code together,
and we can look exactly at what each wrapper is doing. The system call
wrappers are, in fact, the smaller part of the DMTCP code. It's about
3000 lines of code. For anybody who is curious about what our wrappers do,
please download the DMTCP source code, and look at
If you're referring to interposition here, that takes place essentially
in the wrappers, and the wrappers are only 3000 lines of code in DMTCP.
Also, I don't believe that we're "re-implementing chunks of kernel
I still haven't understood why you object to the DMTCP use of LD_PRELOAD.
How will the user app ever know that we used LD_PRELOAD, ...Of course. And you don't need syscall virtualization for this. Zap did it already many years ago :) Only problem with the above is that, conveniently enough, you _left out_ the context: >> For example, >> if a distributed computation runs over infiniband, can we migrate to a TCP/IP >> cluster. For this, one needs the flexibility of wrappers around system calls. Do you also support checkpoint a distributed app that uses an infiniband MPI stack and restart it with a TCP based MPI stack ? Can you do it with only syscall wrapping and without knowledge on the MPI implementation and some MPI-specific logic in the wrappers ? I'm curious how you do that without wrapping around MPI calls, or without an c/r-aware implementation of MPI. Again, this is unrelated to how you do the core c/r work. I think we both agree that _this_ kind of app-wrappers/app-awareness is useful for certain uses of c/r. The interposition itself is relatively simple (though not atomic). The problem is the logic to "spy" on and "lie" to the applications. Examples: saving ptrace state, saving FD_CLOEXEC flag, correctly maintaining a userspace pid-ns, etc. I don't object to it per se - it's actually pretty useful oftentimes. But in our context, it has limitations. For example, it does not cover static applications, nor apps that call syscalls directly using int 0x80. Also, it conflicts with LD_PRELOAD possibly needed for other software (like valgrind) - for which again you would need I mean that the applications needs to be scheduled and to run to participate in its own checkpoint. You use syscall interposition and signals games to do exactly that - gain control over the app and run your library's code. This has at least three negatives: first, some apps don't want to or can't run - e.g. 
ptraced, or swapped (think incremental checkpoint: why swap everything in ?!); Second, the coordination can take significant time, especially if many tasks/threads and resources are involved; Third, it ...
Yes, that's exactly what we plan to do. And we have begun some of the
initial work. And yes, we plan to do it without any MPI-specific logic.
When we talk to each other offline, I'd be happy to give you more
details of how we do it now for TCP "without wrapping around MPI calls,
or without an c/r-aware implementation of MPI", and how we are working
And let's wait for the offline discussion for that --- and we'll describe
in detail at that time how we do each one of the things that you mention.
It will be easier to discuss each of the things that you mention by
looking at the DMTCP code "side-by-side" over the phone. We hope to
For static apps, we would use other interposition techniques. And yes,
we haven't implemented support of static apps so far, because our
user base hasn't asked for it. We do handle apps that use the
syscall system call to make system calls. We don't handle apps
that directly use "int 0x80". Again, there are ways to do this, but
our user base hasn't asked for it.
In general, please keep in mind the principles that you rightly had
to remind me of in a previous post. :-) Our two pieces of work are coming
from two different directions with two different visions. Linux C/R wants
to be so transparent that no user app can ever detect it. DMTCP wants to be
transparent enough that any reasonable use case is covered.
In particular, DMTCP considers distributed computations to be equally
valid use cases for the core DMTCP C/R. I also agree that Linux C/R can be
extended to cover distributed apps -- either through userland extensions,
or maybe with techniques like in your excellent CLUSTER-2005 paper.
Hence, DMTCP has grown its coverage of apps over the years. When we
talk offline, let's talk about future use cases, and whether there are
DMTCP does not conflict with the fact that valgrind uses LD_PRELOAD.
We add dmtcphijack.so to the beginning of LD_PRELOAD before the user app
starts. We then remove it before the app really starts. The ...
On 11/07/2010 06:05 PM, Gene Cooperman wrote: Agreed - as long as we are considering the c/r-engine functionality (and not the "glue" logic to keep apps outside their context after the restart). That said, I'm afraid we'll need more definitions to what is "reasonable" Distributed c/r is one of the proposed use-cases for linux-cr. The technique in that paper, BTW, was a userspace glue: during restart, that glue re-establishes connectivity by using new TCP connections, and c/r uses those new sockets in lieu of restoring the old ones. For that and other use-cases we designed linux-cr to be flexible Wrappers are great (I did TA the w4118 class here...). They are a powerful tool; however in _our_ context they have downsides: (a) wrappers add visible overhead (less so for cpu-bound apps, more so with server apps) (b) wrappers that do virtualization to a "black-box" API (as opposed to integrate with the API) are prone to races (see the paper that I cited before) (c) wrappers duplicate kernel logic, IMHO unnecessarily (and I don't refer to the userspace "glue" from above) (d) wrappers are hard to make hermetic (no escapes) to apps. IMO, the one excellent reason to use wrappers is to support the userspace glue that allows restarted apps to run out of I clearly failed to explain well. Lemme try again: If you use PTRACE to checkpoint, then you ptrace the target tasks, peek at and save their state, and then let them resume execution. The target apps need not collaborate - they are forced by the kernel to the ptraced state regardless of what they were doing, and resume execution without knowing what happened. In linux-cr it works similarly: checkpoint does not require that the processes be scheduled to run - they don't participate; rather, external process(es) do the work. 
In contrast, IIUC, dmtcp uses syscall wrappers and overloading of signal(s) in order to make every checkpointed process/thread actively execute the checkpoint logic. I refer to this as ...
As before, Oren, let's have that phone discussion so that we can preprocess a lot of this, instead of acting like the three blind men and the elephant. I will _tell you_ the strengths and weaknesses of DMTCP on the phone, instead of you having to guess at them here on LKML. And of course, I hope you will be similarly frank about Linux C/R on the phone.

Thank you for lowering the heat on this last post. I'll reply only to some relevant issues in this post, rather than trying to respond to all of your posts. I remind you that I still have my own questions about Linux C/R, but I'm saving them for the phone discussion, since that will

In our experience, the primary overhead of C/R is to save the data to disk. This far outweighs the question of how many ms

The paper you cited was:
http://www.stanford.edu/~talg/papers/traps/abstract.html
Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools

That paper is about sandboxing. DMTCP is about C/R. If DMTCP were trying to do a sandbox, it might have some of the same traps and pitfalls. Luckily, userland C/R is a _lot_ easier than userland sandboxing. By the way, although of less importance, I'll point out that the paper was written in 2003, before DMTCP even started.

Next, you talk about races. The authors of that paper have races because they are trying to do sandboxing. I already answered Matt's post earlier about why we don't see races in DMTCP. I'll answer it again, but in more detail.

At ordinary run-time, the DMTCP checkpoint thread is just waiting on a select -- waiting for instructions from the DMTCP coordinator. Our system call wrappers around user threads do not change the issue of races. If two user threads used to have a race, they will continue to do so in DMTCP. If two user threads did not have a race, then DMTCP will not introduce any new races. How should DMTCP introduce a new race when DMTCP wrappers _never_ communicate with any other thread? At ...
Hi, Ok, I'll bite the bullet for now - to be continued... VMware, Xen and KVM already do live migration. However, VMs are a separate beast. We are concerned about _application_ level c/r and migration (complete containers or individual applications). Many proven techniques from the VM world apply to our context too (in your example, post-copy migration). Oren. --
Thanks for the careful response, Oren. For others who read this,
one could interpret Oren's rapid post as criticizing the work of
Andres Lagar Cavilla. I'm sure that this was not Oren's intention.
Please read below for a brief clarification of the novelty of SnowFlock.
Anyway, I really look forward to the phone discussion. I've also
enjoyed our interchange, for giving me an opportunity to explain more about
the DMTCP design. Thank you.
Best wishes,
- Gene
I absolutely agree with your point that live migration of
applications is a different beast, and technically very novel.
Since I know Andres Lagar Cavilla personally, I also feel obligated
to comment why SnowFlock truly is novel in the VM space. First, as Andres
writes:
"SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3
VMM [Barham 2003]."
In the abstract, Andres points out one of the major points of novelty:
"To evaluate SnowFlock, we focus on the demanding
scenario of services requiring on-the-fly creation of hundreds
of parallel workers in order to solve computationally-intensive
queries in seconds."
We must be careful that we don't destroy someone's reputation without
--
Err... yes, that was careless of me. I was too focused on Yes, it's really nice work - I saw it when I visited there. (Coincidentally the post-copy idea with Xen appeared also in VEE 09 briefly before). Oren. --
GC> As before, Oren, let's have that phone discussion so that we can
GC> preprocess a lot of this, instead of acting like the three
GC> blind men and the elephant. I will _tell you_ the strengths and
GC> weaknesses of DMTCP on the phone, instead of you having to guess
GC> at them here on LKML. And of course, I hope you will be similarly
GC> frank about Linux C/R on the phone.

I want to be in on that discussion too, as do a lot of other people here. However, I doubt we'll all be able to find a common spot on our collective schedules, nor would that conversation be archived for posterity. I think sticking to LKML is the right (and time-tested) approach.

OL> Linux-cr can do live migration - e.g. VDI, move the desktop - in
OL> which case skype's sockets' network stacks are reconstructed,
OL> transparently to both skype (local apps) and the peer (remote
OL> apps). Then, at the destination host and skype continues to work.

GC> That's a really cool thing to do, and it's definitely not part of
GC> what DMTCP does. It might be possible to do userland live
GC> migration, but it's definitely not part of our current scope.

How would you go about doing that in userland? With the current linux-cr implementation, I can move something like sshd or sendmail from one machine to another without a remote (connected) client noticing anything more than a bit of delay during the move.

I think that saving and restoring the state of a TCP connection from userland is probably a good example of a case where it makes sense to have it as part of a C/R function, but not necessarily exposed in /sys or /proc somewhere. Unless it can be argued that doing so is not useful, I think that's a good talking point for discussing the kernel vs. user approach, no?

-- Dan Smith IBM Linux Technology Center --
Hello, Meh, just implementing a conntrack module should be good enough for most use cases. If it ever becomes a general enough problem (which I extremely strongly doubt), we can think about allowing processes in a netns to change sequence number but that would be a single setsockopt option instead of the horror show of dumping in-kernel data structures in binary blob. Thanks. -- tejun --
TH> If it ever becomes a general enough problem (which I extremely TH> strongly doubt), Migration of a container? Yeah, it's one of the primary reasons for doing what we're doing :) TH> we can think about allowing processes in a netns to change TH> sequence number but that would be a single setsockopt option Yeah, well there's more than that, of course, if you want to be able to checkpoint a socket in any state. Buffers, time-wait, etc. TH> instead of the horror show of dumping in-kernel data structures in TH> binary blob. Well, as should be evident from a review of the code, we don't dump binary kernel data structures as a general rule. We canonicalize them into checkpoint headers on the way out and build the new data structures (or use existing kernel interfaces to do so) on the way in. You know, just like netlink does. It has even been suggested that we do this with netlink instead, to mirror the other "horror show" tools that we all use on a daily basis. We're not opposed to this, but we do have some concerns about performance. -- Dan Smith IBM Linux Technology Center email: danms@us.ibm.com --
Hello,

Well, then push for the feature. If the rationale is strong enough,

I haven't really thought about it too deeply but for all other misc states, you should be able to emulate it by talking to a netfilter module. The reason why I suggested a sequence-number-changing setsockopt option is that it is the only performance-sensitive part, and with that you should be able to resume live sockets without conntracking.

The horror show part is dumping internal data structures without due scrutiny in a way which can only ever be useful for CR when most of the same states are already exported via ABI-defined ways.

Thanks.

-- tejun --
That's what review process is for, isn't it? Please, look at what is being dumped and what isn't. --
Hello, sorry about the long delay. Was lost in something else.

I've been thinking about this. We can easily introduce a new ptrace call which allows nesting. AFAICS, ptrace already exports most of the information necessary to restart the task - where it's stopped and why. The only missing thing seems to be the wait state (including for group stop), which can be added without too much difficulty. I'll try to write up an RFC patch. Things like that would be useful for other things too - say, you would be able to attach gdb to a strace'd

Yeah, definitely, for the higher level workarounds, there's no way around it but I think it would still be worthwhile to be able to provide a baseline implementation which can checkpoint and restart a

I haven't really looked at the VDSO generation but symbol offsets inside the VDSO page can differ depending on kernel version, configuration, toolchains used, etc... right? You would need an extra

I wrote in another mail but you can find out which fd's are shared by flipping O_NONBLOCK and looking at the flags field of

As a few others have already pointed out, I think it's better to keep technical discussions on-line. Different people think at different paces and the schedules don't always match. Plus, other people can jump in and look up things later. It may take a bit more effort at the beginning but I think it gets easier in time.

Thank you.

-- tejun --
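A minimal sketch of the O_NONBLOCK trick mentioned above, assuming Python 3 on Linux. It probes two descriptors inside one process with fcntl() instead of reading the flags field of another task's fdinfo, and the helper name shares_open_file is hypothetical, not part of DMTCP or linux-cr.

```python
import fcntl
import os


def shares_open_file(fd_a: int, fd_b: int) -> bool:
    """Return True if both descriptors refer to the same open file description.

    Hypothetical helper: flip O_NONBLOCK through fd_a and check whether the
    change becomes visible through fd_b. File status flags are stored in the
    open file description, so only descriptors sharing one see the flip.
    """
    flags_a = fcntl.fcntl(fd_a, fcntl.F_GETFL)
    before_b = fcntl.fcntl(fd_b, fcntl.F_GETFL)
    fcntl.fcntl(fd_a, fcntl.F_SETFL, flags_a ^ os.O_NONBLOCK)
    try:
        after_b = fcntl.fcntl(fd_b, fcntl.F_GETFL)
    finally:
        # Restore the original flags so the probe leaves no trace.
        fcntl.fcntl(fd_a, fcntl.F_SETFL, flags_a)
    return bool((before_b ^ after_b) & os.O_NONBLOCK)


r, w = os.pipe()
r_dup = os.dup(r)       # shares the open file description with r
r2, w2 = os.pipe()      # unrelated pipe with its own descriptions

print(shares_open_file(r, r_dup))  # True
print(shares_open_file(r, r2))     # False
```

A dup()ed or inherited descriptor sees the flip while an independently opened one does not. An external checkpointer would have to observe the same change through /proc rather than through fcntl().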
Ooh, one more thing, /proc/*/net/* has tx/rx queue counts. With those, you wouldn't need the cookie based connection draining, right? Thanks. -- tejun --
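For illustration, the queue counts can be read like this. A sketch only, assuming Python 3 and the usual /proc/net/tcp column layout, where the fifth whitespace-separated field is tx_queue:rx_queue in hex. The names QueueCounts, parse_tcp_line and is_drained are hypothetical, and the sample line is made up.

```python
from typing import NamedTuple


class QueueCounts(NamedTuple):
    local: str      # local address:port, hex as printed by the kernel
    remote: str     # remote address:port, hex as printed by the kernel
    tx_queue: int   # bytes still queued for transmission
    rx_queue: int   # bytes received but not yet read by the application


def parse_tcp_line(line: str) -> QueueCounts:
    """Parse one socket line of /proc/<pid>/net/tcp (not the header line)."""
    fields = line.split()
    tx_hex, rx_hex = fields[4].split(":")
    return QueueCounts(fields[1], fields[2], int(tx_hex, 16), int(rx_hex, 16))


def is_drained(counts: QueueCounts) -> bool:
    """A socket with empty queues has no in-kernel data left to save."""
    return counts.tx_queue == 0 and counts.rx_queue == 0


# Made-up sample line in the /proc/net/tcp format.
sample = ("   1: 0100007F:0016 0100007F:C350 01 00000000:0000001A "
          "00:00000000 00000000  1000        0 23456 1 0000000000000000")
counts = parse_tcp_line(sample)
print(counts.tx_queue, counts.rx_queue, is_drained(counts))  # 0 26 False
```

A checkpointer could poll these counters after freezing the writers and proceed once both reach zero, which is the alternative to cookie-based draining suggested above.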
Rightly so. It hasn't been widely proven as something that distros would be willing to integrate into a normal desktop session. We've got some demos of it working with VNC, twm, and vim. Oren has his own VNC, twm, etc demos too. We haven't looked very closely at more advanced desktop sessions like (in no particular order) KDE or Gnome. Nor have we yet looked at working with any portions of X that were meant to provide this but were never popular enough to do so (XSMP iirc).

Does DMTCP handle KDE/Gnome sessions? X too?

On the kernel side of things for the desktop, right now we think our biggest obstacle is inotify. I've been working on kernel patches for kernel-cr to do that and it seems fairly do-able. Does DMTCP handle restarting inotify watches without dropping events that were present during checkpoint?

The other problem for kernel c/r of X is likely to be DRM. Since the different graphics chipsets vary so widely there's nothing we can do to migrate DRM state of an NVIDIA chipset to DRM state of an ATI chipset as far as I know. Perhaps if that would help hybrid graphics systems then it's something that could be common between DRM and checkpoint/restart but it's very much pie-in-the-sky at the moment.

kernel c/r of input devices might be a lot easier. We just simulate hot [un]plug of the devices and rely on X responding. We can even checkpoint the events X would have missed and deliver them prior to hot unplug.

Also, how does DMTCP handle unlinked files? They are important because lots of processes open a file in /tmp and then unlink it. And that's not even the most difficult case to deal with. How does DMTCP handle:

link a to b
open a (stays open)
rm a
<checkpoint and restart>
open b
write to b
read from a (the write must appear)

Is the checkpoint control process hidden from the application? What happens if it gets killed or dies in the middle of checkpoint? Can a malicious task being checkpointed (perhaps for later analysis)

Wouldn't checkpoint and ...
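The hard-link sequence listed above can be reproduced in a few lines, which makes it easy to see what a restart has to preserve. A sketch assuming Python 3 on a filesystem that supports hard links. hardlink_scenario is a hypothetical helper, and no checkpoint is actually taken at the marked point.

```python
import os
import tempfile


def hardlink_scenario(directory: str) -> bytes:
    """Run the link/open/rm/write/read sequence and return what 'a' reads."""
    a = os.path.join(directory, "a")
    b = os.path.join(directory, "b")
    with open(a, "wb"):
        pass                           # create an empty file a
    os.link(a, b)                      # link a to b
    fd_a = os.open(a, os.O_RDONLY)     # open a (stays open)
    os.unlink(a)                       # rm a
    # <a checkpoint and restart would happen here>
    fd_b = os.open(b, os.O_WRONLY)     # open b
    os.write(fd_b, b"hello")           # write to b
    os.close(fd_b)
    data = os.read(fd_a, 16)           # read from a: same inode as b
    os.close(fd_a)
    return data


with tempfile.TemporaryDirectory() as scratch:
    print(hardlink_scenario(scratch))  # b'hello'
```

Without a checkpoint the write is visible because both names referred to one inode. A restart that re-creates the unlinked file as a private copy would return an empty read instead, which is the failure the question is probing for.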
Actually, I do have a demo of Zap (linux-cr predecessor) with a _full_ gnome desktop running under VNC with:

* a movie player,
* firefox,
* thunderbird,
* openoffice,
* kernel make,
* gdb debugging something,
* WINE with microsoft office (oops)

all of these checkpointed with < 25ms of downtime and resumed an arbitrary time later, successfully.

At the very least userspace would need to interpose on all inotify related syscalls to track (log) what the user did to be able to redo it at restart. (And I'm sure there will be crazy to impossible races and corner cases there).

Does it make sense to replicate in userspace everything already done

DRM is hardware, and is complex for both userspace and kernel. Let's assume it isn't supported until it's properly virtualized. (In the long-long run, I'd envision hardware manufacturers providing c/r support within their drivers - e.g. checkpoint() and restart() kernel methods. But that's only if they care about it, and in any

[snip]

Oren. --
By the way, Oren, Kapil and I are hoping to find time in the next few
days to talk offline. Apparently the Linux C/R and DMTCP had continued
for some years unaware of each other. We appreciate that a huge amount
of work has gone into both of the approaches, and so we'd like to reap
the benefit of the experiences of the two approaches. We're still learning
more about each others' approaches. Below, I'll try to answer as best
I can the questions that Matt brings up. Since Matt brings up _lots_
of questions, and I add my own topics, I thought it best to add a table
of contents to this e-mail. For each topic, you'll see a discussion
inline below.
1. Distros, checkpointing a desktop, KDE/Gnome, X
[ Trying to answer Matt's question ]
2. Directly checkpointing a single X11 app
[ Our own preferred approach, as opposed to checkpointing an entire desktop;
This is easy, but we just haven't had the time lately. I estimate
the time to do it is about one person working straight out for two weeks
or so. But who has that much spare time. :-) ]
3. OpenGL
[ Checkpointing OpenGL would be a really big win. We don't know the
right way, but we're looking. Do you have some thoughts on that? Thanks.]
4. inotify and NSCD
[ We try to virtualize a single app, instead of also checkpointing
inotify and NSCD themselves. It would have been interesting to consider
checkpointing them in userland, but that would require root privilege,
and one core design principle we have, is that all of our C/R is
completely unprivileged. So, we would see distributing DMTCP as
a package in a distro, and letting individual users decide for
what computation they might want to use it. ]
5. Checkpointing DRM state and other graphics chip state
[ It comes down to virtualization around a single app versus checkpointing
_all_ of X. --- Two different approaches. ]
6. kernel c/r of input devices might be a lot easier
[ We agree with you. By virtualizing ...

That was my understanding too. However, I also felt that I'd better

Hmmm... that sounds pretty fast .. given that you will need to save and reconstruct an arbitrary state kept by the X server...

More importantly, this line of thought was brought up in this thread multiple times, yet in a very misleading way. The question is _not_ whether one can do c/r of a single app without its surrounding environment. The answer for that is simple: it _is_ possible either using proper (and more likely per-app) wrappers, or by adapting the apps to tolerate that.

The above is entirely orthogonal to whether the c/r is in kernel or in userspace. So for terminal based apps, one can use 'screen'. For individual X apps, one can use a light VNC server with proper embedding in the desktop (e.g. metavnc). Or you could use screen-for-X like 'xpra'. Or you can write wrappers (messy or hairy or not) that will try to do that, or you could modify the apps. IIUC, dmtcp chose the way of the wrappers. But that is independent of where you do c/r !

The issue on the table is whether the _core_ c/r should go in kernel or userspace. Those wrappers of dmtcp are great and will be useful with either approach. So let us please _not_ argue that only one approach can c/r apps or processes out of their context. That is inaccurate and misleading.

And while one may argue that one use-case is more important than another, let us also _not_ dismiss such use cases (as was argued by others in this thread). For example, c/r of a full desktop session in VNC, or a VPS, is a perfectly valid and useful case.

FYI, inotify() is a syscall and does not require root privileges. It's a kernel API used to get notifications of changes to file system inodes.

Back to the point argued above, "virtualization around a single app" are the wrappers that allow taking an app out of context and sort of implanting it in another context. It's a very desirable feature, but

Hmm...
can you really c/r from userspace a ...
These are all good points by Oren. It's not about in-kernel _or_ userland. There are opportunities to use both -- each where it is strongest, and I'm looking forward to that discussion with Oren. I do think that reconstructing the state of the X server is not as hard as Oren Yes, I know. I was writing too fast in trying to respond to all the points. Matt had asked how we would handle inotify(), but I was getting swamped by all the questions. There is a virtualization approach to inotify in which one puts wrappers around inotify_add_watch(), inotify_rm_watch() and friends in the same way as we wrap open() and could wrap close(). One would then need to wrap read() (which we don't like to do, just in case it could add significant overhead). But if we consider kernel and userland virtualization together, then something similar to TIOCSTI I agree. I look forward to the discussion where we can put all this Let's try it and see. If you write a program, we'll try it out in DMTCP (unstable branch) and see. So far, checkpointing gdb sessions has worked well for us. If there is something we don't cover, it will Another excellent topic for discussion. I look forward to the discussion. Also a very good point above, and I agree. The offline discussion should be a better forum for putting this all into perspective. Thanks again for your thoughtful response, - Gene --
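The wrapper idea for inotify_add_watch() and inotify_rm_watch() can be sketched as a log that is replayed at restart. This is an illustration in Python 3 under loud assumptions: WatchLog and FakeInotify are hypothetical, the backend is a stand-in rather than the real syscalls, and neither queued events nor the read() wrapper discussed above are handled.

```python
from typing import Callable, Dict, Tuple


class FakeInotify:
    """Stand-in for the kernel side (hypothetical, for illustration only)."""

    def __init__(self, first_wd: int) -> None:
        self.next_wd = first_wd
        self.watches: Dict[int, Tuple[str, int]] = {}

    def add_watch(self, path: str, mask: int) -> int:
        # Like inotify_add_watch(), re-adding a path updates the existing watch.
        for wd, (watched, _) in self.watches.items():
            if watched == path:
                self.watches[wd] = (path, mask)
                return wd
        wd = self.next_wd
        self.next_wd += 1
        self.watches[wd] = (path, mask)
        return wd

    def rm_watch(self, wd: int) -> None:
        del self.watches[wd]


class WatchLog:
    """Wrapper that logs live watches and hands out stable virtual descriptors."""

    def __init__(self, add_watch: Callable[[str, int], int],
                 rm_watch: Callable[[int], None]) -> None:
        self._add = add_watch
        self._rm = rm_watch
        self._live: Dict[int, Tuple[str, int]] = {}  # virtual wd -> (path, mask)
        self._real: Dict[int, int] = {}              # virtual wd -> real wd
        self._next_virtual = 1

    def add_watch(self, path: str, mask: int) -> int:
        real = self._add(path, mask)
        for vwd, known in self._real.items():
            if known == real:                 # existing watch, mask updated
                self._live[vwd] = (path, mask)
                return vwd
        vwd = self._next_virtual
        self._next_virtual += 1
        self._live[vwd] = (path, mask)
        self._real[vwd] = real
        return vwd

    def rm_watch(self, vwd: int) -> None:
        self._rm(self._real.pop(vwd))
        del self._live[vwd]

    def real_wd(self, vwd: int) -> int:
        return self._real[vwd]

    def replay(self, add_watch: Callable[[str, int], int],
               rm_watch: Callable[[int], None]) -> None:
        """After restart: re-create every logged watch on the new backend."""
        self._add, self._rm = add_watch, rm_watch
        self._real = {vwd: add_watch(path, mask)
                      for vwd, (path, mask) in self._live.items()}


before = FakeInotify(first_wd=1)
log = WatchLog(before.add_watch, before.rm_watch)
keep = log.add_watch("/tmp/y", 0x2)
gone = log.add_watch("/tmp/x", 0x100)
log.rm_watch(gone)

after = FakeInotify(first_wd=10)          # "restarted" state, new numbering
log.replay(after.add_watch, after.rm_watch)
print(keep, log.real_wd(keep), after.watches)  # 1 10 {10: ('/tmp/y', 2)}
```

The application keeps using the virtual descriptor it was given before the checkpoint, while the mapping to the real descriptor is rebuilt at restart. Events that were queued at checkpoint time are exactly the part this sketch leaves open.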
[cc'ing linux containers mailing list]

On 11/07/2010 01:49 PM, Gene Cooperman wrote:

This sounds like reimplementation in userspace of the very same logic

We could work to add ABIs and APIs for each and every possible piece of state that affects userspace. And for each we'll argue forever about the design and some time later regret that it wasn't designed correctly :p

Even if that happens (which is very unlikely and unnecessary), it will generate all the very same code in the kernel that Tejun has been complaining about, and _more_. And we will still suffer from issues such as lack of atomicity and being unable to do many simple and advanced optimizations.

Or we could use linux-cr for that: do the c/r in the kernel, keep the know-how in the kernel, expose (and commit to) a per-kernel-version ABI (not vow to keep countless new individual ABIs forever after getting them wrong...), be able to do all sorts of useful optimization and provide atomicity and guarantees (see under "leak detection" in the OLS linux-cr paper).

Also, once the c/r infrastructure is in the kernel, it will be easy (and encouraged) to support newly introduced features.

Finally, then we would use dmtcp as well as other tools on top of the kernel-cr - and I'm looking forward to doing that !

Try "strace bash" :) I suspect it won't work - and for the reasons I described.

Same here. Talk to you soon...

Oren. --
Hello, Oren. It may be harder but those will be localized for specific features which would be useful for other purposes too. With in-kernel CR, you're adding a bunch of intrusive changes which can't be tested or And the only reason it seems easier is because you're working around the ABI problem by declaring that these binary blobs wouldn't be kept compatible between different kernel versions and configurations. That simply is the wrong approach. If you want to export something, build Yeah, this part I agree. The higher level workarounds implemented in dmtcp are quite impressive and useful no matter what happens to lower layer. Thanks. -- tejun --
By this do you mean the very idea of having CR support in the kernel? Or our design of it in the kernel? Let's go back to July 2008, at the containers mini-summit, where it was unanimously agreed upon that the kernel was the right place (Checkpoint/Restart [CR] under http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that we would start by supporting a single task with no resources. Was that whole discussion effectively misguided, in your opinion? Or do you feel that since the first steps outlined in that discussion we've either "gone too far" or strayed in the subsequent design? -serge --
Hello, Serge. The conclusion doesn't seem like such a good idea, well, at least to me for what it's worth. Conclusions at summits don't carry decisive weight. It'll still have to prove its worthiness for mainline all the same and in light of already working userland alternative and the expanded area now covered by virtualization, the arguments in this thread don't seem too strong. Thanks. -- tejun --
Hello, Pavel. I think I already did that several times in this thread but here's an attempt at summary. * It adds a bunch of pseudo ABI when most of the same information is available via already established ABI. * In a way which can only ever be used and tested by CR. If possible, kernel should provide generic mechanisms which can be used to implement features in userland. One of the reasons why we'd like to export small basic building blocks instead of full end-to-end solutions from the kernel is that we don't know how things will change in the future. In-kernel CR puts too much in the kernel in a way too inflexible manner. * It essentially adds a separate complete set of entry/exit points for a lot of things, which makes things more error prone and increases maintenance overhead across the board. * And, most of all, there are userland implementation and virtualization, making the benefit to overhead ratio completely off. Userland implementation _already_ achieves most of what's necessary for the most important use case of HPC without any special help from the kernel. The only reasonable thing to do is taking a good look at it and finding ways to improve it. Thanks. -- tejun --
On Thu, 18 Nov 2010 10:48:34 +0100

Yet the arguments seem to be vague enough not to be convincing to the

Can you elaborate on this? What established ABI are you proposing we

So what if it can only be tested with CR as long as we can make CR work on a variety of environments? Scalability changes for _really_ large SMP boxes can only be reliably tested by people with such equipment. We are not imposing any such restriction and this code can be tested on very

I partially agree with you here. There will be maintenance overhead every time you add code to the kernel that _may_ make changes in the future more complicated. This is true for _any_ code that is added to the core kernel. Now in my experience such maintenance burden is most disruptive when the code being added creates a lot of new state that needs to be tracked in multiple places unrelated to CR (in this case). Our argument is that the CR code is not creating new state that will cause painful future changes to the kernel. If you have a specific example that you are concerned with, great. Let's discuss those. Are we promising zero maintenance cost? But guess what, neither do most features that make it into the kernel.

Now, if we change the argument around... What would be the maintenance cost of keeping this outside the kernel? I would argue that it is much

Can we keep virtualization out of this? Every time someone mentions virtualization as a solution, it makes me feel like these people just don't understand the problem we are trying to solve. It is just not practical to create a new VM for every application you want to CR.

What are these _most_ important cases of HPC that you are referring to? Can we do a lot of these cases from userspace? Sure, but why are the ones that can't be done from userspace any less important? If nobody

The userspace vs in-kernel discussion has been done before as multiple people have already said in this thread. Show me a version of userspace CR that can correctly do all that an ...
Guess I'll just be offensive here and say, straight-out: I don't believe it. Can I see the userspace implementation of c/r? If it's as good as the kernel level c/r, then awesome - we don't need the kernel patches. If it's not as good, then the thing is, we're not drawing arbitrary lines saying "is this good enough", rather we want completely reliable and transparent c/r. IOW, the running task and the other end can't tell that a migration happened, and, if checkpoint says it worked, then restart must succeed. -serge --
While it's your opinion that userland alternatives "already work", in reality they are unsuitable for several real use-cases. The userland approach has serious restrictions - which I will cover in a follow-up post to my discussion with Gene soon.

Note that one important point of agreement was that DMTCP's ability to provide "glue" to restart applications without their original context is _orthogonal_ to how the core c/r is done. IOW - these exciting goodies from DMTCP are useful with either form of c/r.

You also argue that "virtualization" (VMs?) covers everything else, implying that lightweight virtualization is useless. In reality it is an important technology, already in the kernel (surely you don't suggest pulling it out ?!) and for a reason. That is already a very good reason to provide, e.g. containers c/r and live-migration to keep it competitive and useful.

Thanks,

Oren. --
Of course. It allows us to present at kernel summit and look for early
rejections to save us all some time (which we did, at the container
mini-summit readout at ksummit 2008), but it would be silly to read
Here's where we disagree. If you are right about a viable userland
alternative ('already working' isn't even a prereq in my opinion,
so long as it is really viable), then I'm with you, but I'm not buying
it at this point.
Seriously. Truly. Honestly. I am *not* looking for any extra kernel
-serge
--
What's so wrong with Gene's work? Sure, it has some hacky aspects but let's fix those up. To me, it sure looks like a much saner and more manageable approach than in-kernel CR. We can add nested ptrace, CLONE_SET_PID (or whatever) in pidns, integrate it with various ns supports, add an ability to adjust brk, export inotify state via fdinfo and so on.

The thing is already working, the codebase of the core part is fairly small and condor is contemplating integrating it, so at least some people in the HPC segment think it's already viable. Maybe the HPC cluster I'm currently sitting near is a special case but people here really don't run very fancy stuff. In most cases, they're fairly simple (from system POV) C programs reading/writing data and burning a _LOT_ of CPU cycles in between and admins here seem to think dmtcp integrated with condor would work well enough for them.

Sure, in-kernel CR has better or more reliable coverage now but by how much? The basic things are already there in userland. The tradeoff simply doesn't make any sense. If it were a well separated self sustained feature, it probably would be able to get in, but it's all over the place and requires a completely new concept - the quasi-ABI'ish binary blob which would probably be portable across different kernel versions with some massaging. I personally think the idea is fundamentally flawed (just go through the usual ABI!) but even if it were not it would require _MUCH_ stronger rationale than it currently has to be even considered for mainline inclusion.

Maybe it's just me but most of the arguments for in-kernel CR look very weak. They're either about remote toy use cases or along the line that userland CR currently doesn't do everything kernel CR does (yet). Even if it weren't for me, I frankly can't see how it would be included in mainline.

I think it would be best for everyone to improve userland CR. A lot of knowledge and experience gained through kernel CR would be applicable and won't go wasted. ...
Tejun,

Sorry for getting into the middle of the discussion, but... Can you imagine how many userland APIs are needed to make userspace C/R? Do you really want APIs in user-space which allow to:

- send signals with siginfo attached (kill() doesn't work...)
- read inotify configuration
- insert SKB's into socket buffers
- setup all TCP/IP parameters for sockets
- wait for AIO pending in other processes
- setting different statistics counters (like netdev stats etc.)

and so on... For every small piece of functionality you will need to export ABI and maintain it forever. It's thousands of APIs! And why the hell are they needed in user space at all?

BTW, the HPC case you are talking about is probably the simplest one. Last time I looked into it, IBM Meiosis c/r didn't even bother with tty's migration. In OpenVZ we really do need much more than that, like autofs/NFS support, preserve statistics, TTYs, etc. etc. etc.

Thanks,
Kirill --
Hello,

Can't we drain kernel buffers? ie. Stop further writing and wait the

I _think_ most can be restored by talking to netfilter module.

I haven't looked at aio implementation for a while now but can't we drain these upon checkpointing and just carry the completion status? Also, if aio is what you're concerned about, I would say the problem

I think it's actually quite the contrary. Most things are already visible to userland. They _have_ to be and that's the reason why userland implementation can already get most things working without any change to the kernel with some amount of hackery. To me in-kernel CR seems to approach the problem from exactly the wrong direction - rather than dealing with specific exceptions, it creates a completely new framework which is very foreign and not useful outside of CR.

Also, think about it. Which one is better? A kernel which can fully show its ABI visible states to userland or one which dumps its internal data structures in binary blobs. To me, the latter seems

Would it be impossible to preserve autofs/NFS and TTYs from userland? Then, why so? For statistics, I'm a bit lost. Why does it matter and even if it does would it justify putting the whole CR inside kernel?

Thank you.

-- tejun --
On send: if network dies right after freeze, you lose. On receive: Because you'll introduce million stupid interfaces not interesting to anyone but C/R. --
Well, if you ask me, having pidns w/o a way to reinstate PID from userland is pretty silly and you and I might not know yet but it's quite imaginable that there will be other use cases for the capability unlike in-kernel CR. Kernel provides building blocks not the whole frigging package and for very good reasons. -- tejun --
No. Chrome uses CLONE_NEWPID so that exploit couldn't attach to processes in

Speaking of pids, pid's value itself is never interesting (except maybe pid 1). It's a cookie. CLONE_SET_PID came up only now because only C/R wants it. --
Hello,

Gosh, if you're really worried about that, put a netfilter module which would buffer and simulate acks to extract the packets before

Just store the data somewhere. The checkpointer can drain the socket,

In this thread, how many have you guys come up with? Not even a dozen and most can be solved almost trivially. Seriously, what the hell..

Thanks.

-- tejun --
I do not count them. The paragon of absurdity is struct task_struct::did_exec . --
Yeah, then go and figure how to do that in a way which would be useful for other purposes too instead of trying to shove the whole checkpointer inside the kernel. It sure would be harder but hey that's the way it is. -- tejun --
System call for one bit? This is ridiculous. Doing execve(2) for userspace C/R is ridiculous too (and likely doesn't work). --
Really, whatever. Just keep doing what you're doing. Hey, if it makes you happy, it can't be too wrong. -- tejun --
Because /proc/*/did_exec useless to anyone but C/R (even for reading!).
Because code is much simpler:
tsk->did_exec = !!tsk_img->did_exec;
+
__u8 did_exec;
--
I don't think you'll need a full file. Just shove it in status or somewhere. Your argument is completely absurd. So, because exporting single bit is so horrible to everyone else, you want to shove the Sigh, yeah, except for the horror show to create tsk_img. Your "paragon of absurdity" is did_exec which is only ever used to decide whether setpgid() should fail with -EACCES, seriously? Here's a thought. Ignore it for now and concentrate on more relevant problems. I'm fairly sure CR'd program malfunctioning over did_exec wouldn't mark the beginning of the end of our civilization. You gotta be kidding me. -- tejun --
task_struct image work is common for both userspace C/R and in-kernel. You _have_ to define it. You're so newjerseyly now. --
You assume that c/r is done by the checkpointed processes _themselves_, that is that to checkpoint a process that process needs to be made runnable and it will save its own state (which is the model of dmtcp, but not of using ptrace). This model is restrictive: it requires that you hijack the execution of that process somehow and make it run. What if the process isn't runnable (e.g. in vfork waiting for completion, or ptraced deep in the kernel) ? Letting it run even just a bit may modify its state. It also means that if you have many processes in the checkpointed session, e.g. 1000, then _all_ of them will have to be scheduled to run ! With kernel c/r this is unnecessary: you can use an auxiliary process to checkpoint other processes without scheduling the other processes. I.e. it's _transparent_ and _preemptive_. Another advantage is that if anything fails during checkpoint (for whatever reason), there are no side-effects (which is not the case with Are we judging aesthetics ? To me the former looks uglier... The amount of fragile hacks you need to go through to make it work in userspace for the generic cases (including userspace trickery and new crazy APIs from the kernel for state that was never even an ABI, like skb's), and the restrictions it poses simply suggest that userspace is not the right place to do it. Thanks, Oren. --
Hi,

Based on discussion with Gene, I'd like to clarify key points and differences between kernel and userspace approaches (specifically linux-cr and dmtcp). Three parts to break the long post...

part I: perspective about the types of scopes of c/r in discussion
part II: linux-cr design and objectives
part III: comparison kernel/userspace approaches

[now relax, grab (another) cup of coffee and read on...]

PART I: ==PERSPECTIVE==

A rough classification of c/r categories:

* container-c/r: important use-case, e.g. c/r and migration of application containers like VPS (virtual private server), VDI (desktop) or other self-contained application (e.g. Oracle server). Here _all_ the relevant processes are included in the checkpoint.

* standalone-c/r: another use-case is standalone-c/r where a set of processes is checkpointed, but not the entire environment, and then those processes are restarted in a different "eco-system".

* distributed-c/r: meaning several sets of processes, each running on a different host. (Each set may be a separate container there).

In container-c/r, the main challenge is to be _reliable_ in the sense that a restart from a successful checkpoint should always succeed.

In standalone-c/r, the main challenge is that an application resumes execution after a restart in a possibly _different_ eco-system. Some applications don't care (e.g. 'bc'). Other applications do care, and to different degrees; for these we need "glue" to pacify the application.

There are generally three types of "glue":

(1) Modify the application or selected libraries to be c/r-aware, and notify it when restart completes. (e.g. CoCheck MPI library).

(2) Add a userspace helper that will run post-restart to do necessary trickery (e.g. send a SIGWINCH to 'screen'; mount proper filesystem at the new host after migration; reconnect a socket to a peer).

(3) Use interposition on selected library calls and add wrapper code that will glue in what's missing (e.g. ...
To : Tejun Heo <tj@kernel.org>
Cc : Serge Hallyn <serge.hallyn@canonical.com>,
Kapil Arya <kapil@ccs.neu.edu>,
Gene Cooperman <gene@ccs.neu.edu>,
linux-kernel@vger.kernel.org,
xemul@sw.ru,
"Eric W. Biederman" <ebiederm@xmission.com>,
Linux Containers <containers@lists.osdl.org>
Subject : Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
Hi,
[continuation of posting regarding kernel vs userspace approach]
part I: perspective about the types of scopes of c/r in discussion
part II: linux-cr design and objectives
part III: comparison kernel/userspace approaches
PART II: ==PHILOSOPHY==
Linux-cr is a _generic_ c/r-engine with multiple capabilities. It can
checkpoint a full container, a process hierarchy, or a single process.
For containers, it provides guarantees like restart-ability; For the
others, it provides the flexibility so that c/r-aware applications,
libraries, helpers, and wrappers can glue what they wish to glue.
1) Transparent - completely transparent for container-c/r, and largely
so for standalone-c/r ("largely" - as in except for the glue which is
needed due to loss of eco-system, not due to restarting).
2) Reliable - if checkpoint succeeds then restart is guaranteed
to succeed too (for container-c/r).
3) Preemptive - works without requiring that checkpointed processes
be scheduled to run (and thus "collaborate")
4) Complete - covers all visible and hidden state in the kernel
about processes (even if not directly visible to userspace)
5) Efficient - can be optimized ...
Hi,
[continuation of discussion of kernel vs userspace c/r approach]
part I: perspective about the types of scopes of c/r in discussion
part II: linux-cr design and objectives
part III: comparison kernel/userspace approaches
PART III: ==SOME TECHNICAL ASPECTS==
Important to know about userspace (DMTCP example) before presenting a
comparison between kernel and userspace approaches:
DMTCP has two components: 1) c/r-engine to save/restore process state,
and 2) glue to restart processes out of their original context. They
are _orthogonal_: the glue can be used with other c/r-engines, like
linux-cr. This discussion refers to the c/r-engine _only_.
Focusing on the c/r-engine of DMTCP - it uses syscall interposition
for three reasons:
1) To take control of processes at checkpoint
2) To always track state of resources not visible to userspace
3) To virtualize identifiers after restart
#1 is needed because processes save their own state (and need to run
the checkpoint code for that).
#2 is needed because the kernel does not expose all state, and #3 is
needed because the kernel does not give ways to restore ...
[[apologies for the silly prefix on last two posts - a combination of windows, putty, pine and slow connection is not helping me :( ]] --
Hello, Maybe it's a good idea to post a clean concatenated version for later reference? Thanks. -- tejun --
In this post, Kapil and I will provide our own summary of how we see the issues for discussion so far. In the next post, we'll reply specifically to comment on Oren's table of comparison between linux-cr and userspace. In general, we'd like to add that the conversation with Oren was very useful for us, and I think Oren will also agree that we were able to converge on the purely technical questions. Concerning opinions, we want to be cautious on opinions, since we're still learning the context of this ongoing discussion on LKML. There is probably still some context that we're missing. Below, we'll summarize the four major questions that we've understood from this discussion so far. But before doing so, I want to point out that a single process or process tree will always have many possible interactions with the rest of the world. Within our own group, we have an internal slogan: "You can't checkpoint the world." A virtual machine can have a relatively closed world, which makes it more robust, but checkpointing will always have some fragile parts. We give four examples below: a. time virtualization b. external database c. NSCD daemon d. screen and other full-screen text programs These are not the only examples of difficult interactions with the rest of the world. Anyway, in my opinion, the conversation with Oren seemed to converge into two larger cases: 1. In a pure userland C/R like DMTCP, how many corner cases are not handled, or could not be handled, in a pure userland approach? Also, how important are those corner cases? Do some have important use cases that rise above just a corner case? [ inotify is one of those examples. For DMTCP to support this, it would have to put wrappers around inotify_add_watch, inotify_rm_watch, read, etc., and maybe even tracking inodes in case the file had been renamed after the inotify_add_watch. Something could be made to work for the common cases, but it would still be a hack --- to be done only if ...
As Kapil and I wrote before, we benefited greatly from having talked with Oren,
and learning some more about the context of the discussion. We were able
to understand better the good technical points that Oren was making.
Since the comparison table below concerns DMTCP, we'd like to
In our experiments so far, the overhead of system calls has been
unmeasurable. We never wrap read() or write(), in order to keep overhead low.
We also never wrap pthread synchronization primitives such as locks,
for the same reason. The other system calls are used much less often, and so
As above, we believe that the overhead while running is negligible. I'm
assuming that image size refers to in-kernel advantages for incremental
checkpointing. This is useful for apps where the modified pages tend
not to dominate. We agree with this point. As an orthogonal point,
by default DMTCP compresses all checkpoint images using gzip on the fly.
This is useful even when most pages are modified between checkpoints.
Still, as Oren writes, Linux C/R could also add a userland component
to compress checkpoint images on the fly.
Next, live migration is a question that we simply haven't thought much
about. If it's important, we could think about what userland approaches might
We'd like to clarify what may be some misconceptions. The DMTCP
controller does not launch or manage any tasks. The DMTCP controller
is stateless, and is only there to provide a barrier, namespace server,
and single point of contact to relay ckpt/restart commands. Recall that
the DMTCP controller handles processes across hosts --- not just on a
single host.
Also, in any computation involving multiple processes, _every_ process
of the computation is a point of failure. If any process of the computation
dies, then the simple application strategy is to give up and revert to an
earlier checkpoint. There are techniques by which an app or DMTCP can
recreate certain failed processes. DMTCP doesn't currently recreate
Our ...
Gene Cooperman [gene@ccs.neu.edu] wrote:
| > RELIABILITY checkpoint w/ single syscall; non-atomic, cannot find leaks
| > atomic operation. guaranteed to determine restartability
| > restartability for containers
|
| My understanding is that the guarantees apply for Linux containers, but not
| for a tree of processes. Does this imply that linux-cr would have some
| of the same reliability issues as DMTCP for a tree of processes? (I mean
| the question sincerely, and am not intending to be rude.) In any case,
| won't DMTCP and Linux C/R have to handle orthogonal reliability issues
| such as external database, time virtualization, and other examples
| from our previous post?

Yes, if the user attempts to checkpoint a partial container (what we refer to as a process subtree) or fails to snapshot/restore the filesystem, there could be leaks that we cannot detect. But one guarantee we are trying to provide is that if the user checkpoints a _complete_ container, then we will detect a leak if one exists. Is there a way to establish a set of constraints (e.g. run the application in a container, snapshot/restore the filesystem) and then provide leak detection with a pure userspace implementation ? Sukadev --
Syscall interception will have visible effect on applications that use those syscalls. You may not observe overhead with HPC ones, but do you have numbers on server apps ? apps that use fork/clone and pipes extensively ? threads benchmarks etc. ? compare that The controller is another point of failure. I already pointed out that the (controlled) application crashes when your controller dies, and you mentioned it's a bug that should be fixed. But then there will always be a risk for another, and another ... You also mentioned that if the controller dies, then the app should continue to run, but will not be checkpointable anymore (IIUC). The point is, that the controller is another point of failure, and makes the execution/checkpoint intrusive. It also adds security and user-management issues as you'll need one (or more ?) controller per user (right now, it's one for all, no ?). and so on. Plus, because the restarted apps get their virtualized IDs from the controller, then they can't now "see" existing/new processes that The point is that you _add_ a point of failure: you make the "checkpoint" operation a possible reason for the application to crash. In contrast, in linux-cr the checkpoint is idempotent - harmless because it does not There are two points in the claim above: 1) linux-cr can checkpoint with a single syscall - it's atomic. This gives you more guarantees about the consistency of the checkpointed application(s), and fewer "opportunities" for the operation as a whole to fail. 2) restartability - for full-container checkpoint only. There is no "reliability" issue with c/r of non-containers - it's a matter of definition: it depends on what your requirements are from the userspace application and what sort of "glue" you have for it. And I request again - let's leave out the questions of "time virtualization" and "external databases" - how are they different for the VM virtualization solution ? they are completely orthogonal to the ...
(Our first comment below actually replies to an earlier post by Oren. It seemed We would guess that Zap would not be able to support screen without a user space component. The bug occurs when screen is configured to have a status line at the bottom. We would be interested if you want to try it and let us know the results. It's true that we haven't taken serious data on overhead with server apps. Is there a particular server app that you are thinking of as an example? I would expect fork/clone and pipes to be invoked infrequently in the server apps and do not add measurably to CPU time. In most server apps such as MySQL, it is common to maintain a pool of threads for reuse rather than to repeatedly call clone for a new thread. This is done to ensure that the overhead of the clone calls is not significant. I would expect a similar policy for fork and pipes. Just to clarify, DMTCP uses one coordinator for each checkpointable computation. A single user may be running multiple computations with one coordinator for each computation. We don't actually use the word controller in DMTCP terminology because the coordinator is stateless and so in This appears to be a misconception. The wrappers within the user process maintain the pid-translation table for that process. The translation table is the translation between the original pid given by the kernel and the current pid set by the kernel on restart. This is handled locally and does not involve the coordinator. In the case of a fork there could be a pid-clash (the original pid generated for a new process that conflicts with someone else's original pid). However, DMTCP handles this by checking within the fork wrapper for a pid-clash. In the rare case of a pid-clash, the child process exits and the parent forks again. Same We were speaking above of the case when the process dies during a computation. We were not referring to checkpoint time. <snip> We would like to add our own comment/question. To set the context we quote ...
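The per-process pid-translation table described above can be sketched roughly as follows. This is not DMTCP code: `pid_map`, its fixed capacity and the function names are assumptions made for illustration. The sketch only covers the bookkeeping (original pid as seen by the application, current pid as assigned by the kernel after restart) and the clash test a fork wrapper would run; the "child exits and the parent forks again" retry is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define PID_MAP_MAX 64  /* arbitrary capacity chosen for the sketch */

struct pid_map_entry {
    pid_t original;     /* pid the application saw before the checkpoint */
    pid_t current;      /* pid the kernel assigned after restart */
};

struct pid_map {
    struct pid_map_entry e[PID_MAP_MAX];
    size_t n;
};

/* Record or update a translation. Returns 0 on success, -1 if the table is full. */
static int pid_map_set(struct pid_map *m, pid_t original, pid_t current)
{
    assert(m != NULL);
    for (size_t i = 0; i < m->n; i++) {
        if (m->e[i].original == original) {
            m->e[i].current = current;
            return 0;
        }
    }
    if (m->n == PID_MAP_MAX)
        return -1;
    m->e[m->n].original = original;
    m->e[m->n].current = current;
    m->n++;
    return 0;
}

/* original -> current: applied to pids the application passes into syscalls. */
static pid_t pid_map_to_current(const struct pid_map *m, pid_t original)
{
    for (size_t i = 0; i < m->n; i++)
        if (m->e[i].original == original)
            return m->e[i].current;
    return original;    /* unknown pids pass through untranslated */
}

/* current -> original: applied to pids the kernel hands back to the application. */
static pid_t pid_map_to_original(const struct pid_map *m, pid_t current)
{
    for (size_t i = 0; i < m->n; i++)
        if (m->e[i].current == current)
            return m->e[i].original;
    return current;
}

/* Clash test for a fork wrapper: the new child's kernel pid would become its
 * original pid, so it must not equal an original pid already in use. */
static int pid_map_clashes(const struct pid_map *m, pid_t new_pid)
{
    for (size_t i = 0; i < m->n; i++)
        if (m->e[i].original == new_pid)
            return 1;
    return 0;
}
```

A fork wrapper built on this would call `pid_map_clashes()` with the child's pid and, on a clash, have the child exit and fork again, as the post describes.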
I apologize for being blunt - but this is probably an issue specific to What is a "local" socket ? af_unix, or locally connected af_inet ? Anyway, with linux-cr you'd do what's needed after the restarted tasks are created, but before their state is restored. For each such "old" socket that you want to replace, you'd create (in userspace with arbitrary "glue" code!) a new socket, and use this socket when restoring the state of the Repainting during restart is the least of your problems. Leak detection is not a problem: If the socket connects out of the containers (like af_inet) - then it is not a leak, and you treat it as described above. If the socket connects within the container but you don't checkpoint the "peer" process - then it is not a container-c/r (in which case you don't look for leaks). Also, the application could mark resources to not be checkpointed (e.g. scratch memory to save storage, or sockets to not count as leaks). I explain again - in case it wasn't clear from my 3-part post: leak detection is relevant _only_ for full container-c/r. It doesn't make sense otherwise. If you want to checkpoint individual components of an application, then it's up to userspace to produce/provide the relevant "glue" to make it "make sense" when those components restart without their original eco-system. Thanks, Oren.
Hi Oren,
I completely agree with you, Oren. DMTCP was never designed to be split
into a userland and in-kernel replacement. We will want to re-factor
DMTCP to make this happen.
I'm sorry if my e-mail came off as confrontational. That was not my
intention. I was just looking forward to an interesting intellectual
experiment --- how to go about combining DMTCP and Linux C/R. I was
trying to guess ahead of time where there are interesting challenges, and
my hope is that we will find a way to solve them together.
Best wishes,
- Gene
--
Hi Gene, At the risk of restating already applied arguments, and as a c/r outsider, this touches on the real crux of the issue for me. What is the complete set of boundaries between a c/r group of processes and the outside world? Is it bounded and is it understandable by mere kernel engineers? Does it change the assumptions about what a Linux process /is/, and how to handle it? How much? The broad strokes seem to be straight forward, but as already pointed out, the devil is in Temporal issues need to be (are being?) addressed regardless. In certain respects, I'm sure c/r can be seen as a *really long* scheduler latency, and would have the same effect as a system going into suspend, or a vm-level checkpoint. I would think the same Right here is exactly the example of a boundary that needs explicit rules. When a pair of processes have a shared region, and only one of them is checkpointed, then what is the behaviour on restore? In this specific example, a context-specific hack is used to achieve the desired result, but that doesn't work (as I believe you agree) in the --
That depends on your definition of "world". One definition is "world := VM", as you state above. Another is "world := container" which I stated in my post(s). You can checkpoint both. For those cases where the "world" cannot be fully checkpointed, I explicitly pointed that we should focus on the core c/r IMHO, irrelevant to current discussion. And btw, this is done in This falls within the category of "glue", and is - as I try once again to remind - entirely orthogonal to the topic of where This actually never required a userspace "component" with Zap or linux-cr (to the best of my knowledge). Even if it did - the question is not how to deal with "glue" (you demonstrated quite well how to do that with DMTCP), but how should the basic, core c/r functionality work - which is below, and orthogonal to the "glue". Let us please focus on the base c/r engine functionality... (gotta disconnect now .. more later) Oren. --
Sure, as soon as I am back on a sane connection (~1 week) (I cut it in three to make it easier for people to digest ...) Oren. --
You seem to be arguing "Z is only testable/useful for doing the things Z was made for". I couldn't agree more with that. CR is useful for: Fault-tolerance (typical HPC) Load-balancing (less-typical HPC) Debugging (simple [e.g. instead of coredumps] or complex time-reversible) Embedded devices that need to deal with persistent low-memory situations. I think Oren's Kernel Summit presentation succinctly summarized these: http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf My personal favorite idea (that hasn't been implemented yet) is an application startup cache. I've been wondering if caching bash startup after all the shared libraries have been searched, loaded, and linked couldn't save a bunch of time spent in shell scripts. Post-link actually seems like a checkpoint in application startup which would be generally useful too. Of course you'd want to flush [portions of] the cache when packages get upgraded/removed or shell PATHs change and the caches would have to be per-user. I'm less confident but still curious about caching after running rc scripts (less confident because it would depend highly on the content of the rc scripts). A scripted boot, for example, might be able to save some time if the same rc scripts are run and they don't vary over time. That in turn might be useful for carefully-tuned boots on embedded devices. That said we don't currently have code for application caching. Yet we can't be expected to write tools for every possible use of our API in Oren, that statement might be read to imply that it's based on something as useless as kernel version numbers. Arnd has pointed out in the past how unsuitable that is and I tend to agree. There are at least two possible things we can relate it to: the SHA of the compiled kernel tree (which doesn't quite work because it assumes everybody uses git trees :( ), or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could also stuff that header into the kernel (much like kconfigs ...
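The idea of tying an image to a hash of the checkpoint header layout, rather than to a kernel version number, could look roughly like this. It is a sketch only: the post suggests a SHA of the cpp-processed checkpoint_hdr.h, while the code below swaps in 64-bit FNV-1a purely to stay dependency-free, and `image_header_model`, `make_image_header` and `image_is_restorable` are hypothetical names, not part of linux-cr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a, used here as a stand-in for the SHA mentioned in the post. */
static uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;              /* FNV prime */
    }
    return h;
}

/* Hypothetical image header: records which header layout produced the image. */
struct image_header_model {
    uint64_t format_id;
};

/* At checkpoint time: stamp the image with the hash of the preprocessed header text. */
static struct image_header_model make_image_header(const char *hdr_text, size_t len)
{
    struct image_header_model h = { .format_id = fnv1a64(hdr_text, len) };
    return h;
}

/* At restart time: accept the image only if the running kernel was built
 * from the same header text. Returns 1 if it matches, 0 otherwise. */
static int image_is_restorable(const struct image_header_model *img,
                               const char *hdr_text, size_t len)
{
    return img->format_id == fnv1a64(hdr_text, len);
}
```

A mismatch here says only that the layouts differ; converting between layouts is the separate, harder problem raised earlier in the thread.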
Hello, Matt. I'm saying it's way too narrow scoped and inflexible to be a kernel feature. Kernel features should be like the basic tools, you know, hammers, saws, drills and stuff. In-kernel CR is more like an over complicated food processor which usually sits in the top drawer after which can do all of the above, a lot of which can be achieved in What does that have anything to do with the kernel? If you want post-link cache, implement it in ld.so where it belongs. That's like Continuing the same line of thought. It _CAN_ be used to do that in a Yeah, exactly, so just do it inside the established ABI extending where it makes sense. No reason to add a whole separate set. Thanks. -- tejun --
BTW, it's the same for userspace c/r: for the same set of features, the format (ABI) remains unchanged. Adding features breaks this and a new version is necessary, and conversion from old to new will be needed. Moreover, supporting a new feature in userspace means adding the proper API/ABI in the kernel, including refactoring etc, which is even harder than adding the support for it in linux-cr. Oren. --
[cc'ing linux containers mailing list] My experience is different: I downloaded dmtcp and followed the quick-start guide: (1) "dmtcp_coordinator" on one terminal (2) "dmtcp_checkpoint bash" on another terminal Then I: (3) pkill -9 dmtcp_coordinator ... oops - 'bash' died. I didn't even try to take a checkpoint :( Oren. --
You're right. I just reproduced your example. But please remember that
we're working in a design space where if any process of a computation
dies, then we kill the computation and restart. It doesn't matter to us
if it's a user process or the DMTCP coordinator that died. I do think
this is getting too detailed for the LKML list, but since you bring it
up, here is the analysis. The user bash process exits with:
[31331] ERROR at dmtcpmessagetypes.cpp:62 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
_magicBits =
Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
This means that when the DMTCP coordinator died, it sent a message to the
checkpoint thread within the user process. The message was ill-formed.
The current DMTCP code says that if a checkpoint thread receives an
ill-formed message from the coordinator, then it should die. It's not
hard to change the protocol between DMTCP coordinator and checkpoint
thread of the user process into a more robust protocol with RETRY, further
ACK, etc. We haven't done this. Right now, the user simply restarts from
the last checkpoint. If one process of a computation has been compromised
(either DMTCP coordinator or user process), then the whole computation
has been compromised. I think in a previous version of DMTCP, the policy
was to allow the computation to continue when the coordinator dies.
Policies change.
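As a rough illustration of the "more robust protocol with RETRY" mentioned above, the receive path could classify a malformed message instead of asserting on it. Everything below is assumed for the sake of the sketch: the message layout, the placeholder magic value and the names `coord_msg_model` and `classify_msg` are not taken from the DMTCP sources.

```c
#include <assert.h>
#include <string.h>

#define MODEL_MAGIC     "MODEL_MAGIC"   /* placeholder, not DMTCP's real magic string */
#define MODEL_MAGIC_LEN 16

/* Hypothetical wire format of a coordinator message. */
struct coord_msg_model {
    char magic[MODEL_MAGIC_LEN];
    int  type;
};

enum msg_verdict {
    MSG_OK,         /* well-formed, process it */
    MSG_RETRY,      /* malformed or short read, ask again */
    MSG_GIVE_UP     /* retries exhausted, treat the coordinator as dead */
};

/*
 * Decide what to do with a message that was just read from the coordinator.
 * bytes_read is what read() returned; attempts_so_far counts earlier failures.
 */
static enum msg_verdict classify_msg(const struct coord_msg_model *msg,
                                     size_t bytes_read,
                                     int attempts_so_far,
                                     int max_attempts)
{
    assert(msg != NULL && max_attempts > 0);

    int well_formed = bytes_read == sizeof *msg &&
                      memcmp(msg->magic, MODEL_MAGIC, sizeof MODEL_MAGIC) == 0;

    if (well_formed)
        return MSG_OK;
    return (attempts_so_far + 1 < max_attempts) ? MSG_RETRY : MSG_GIVE_UP;
}
```

Under this scheme the policy choice discussed above (kill the computation, or let it continue without checkpointing) only has to be made in the `MSG_GIVE_UP` branch.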
But I think you're missing the larger point. We've developed DMTCP
over six years, largely with programmers who are much less experienced
than the kernel developers. Yet DMTCP works reliably for many users.
I consider this a credit to the DMTCP design. The Linux C/R design
is also excellent.
Can we get back to questions of design, using the implementations as
reference implementations? If you don't object, I'll also skip replying
to the other post, since I think we're getting too detailed. I'm having
trouble keeping up with the ...
Indeed, this is a restriction on the new eclone() syscall, and can be addressed with proper userspace tools (including crypto-signing the checkpoint image). The core of the c/r code allows a user to Why not ? it has zero overhead when not in use, and a reasonable code footprint (which can be reduced by modularizing some of it, Are we talking about distributed checkpoint or "standalone" ? DMTCP relies on user agents to allow distributed/remote execution in a manner mostly transparent to the application. Many distributed systems don't require (and do not use) user agents. Consider a multi-tier system with web server, sql server and some applications server. These are not suitable to DMTCP's mode of work. (This is not to say DMTCP isn't useful - it's a clever piece of software with specific goals and more geared towards HPC needs). Now regarding "standalone" c/r, if you want to save/restore single or a subset of processes of a system without the rest of it, then you will always need user agents, regardless of userspace/kernel method. Likewise, their work on those tools will be as useful independently of which c/r 'engine' it uses. When you include all the relevant processes (e.g. an entire VNC session, a web server, HPC and batch jobs), you generally don't need the user agents. The checkpoint is self-contained, and linux-cr If there is a will, there is (almost always) a way ;) What MTCP does, IIUC, is wrap around the applications with a complete pid-namespace (and more) in userspace. There are/were also commercial products that do that. It's a tremendous effort and I'm impressed by their (MTCP) work so far. It is important to understand that it has a price tag: performance and complexity. It's usually useful for HPC needs, but unsuitable Hmmm... the kernel already does much of it - for instance, we have neat pid-namespace infrastructure; does it make sense to go into the trouble of adding interfaces to provide for pid-virtualization in userspace ? we should ...
Hi,
(disclaimer: you may want to grab a cup of your favorite coffee)
I agree, it *looks* scary. But that's mostly because it's a dumb
diff out of context, rather than a standard "patch" as set of
logical incremental changes. So posting this diff is probably the
worst way to present the impact on existing code. It merely gives
a ballpark of that.
However, please keep in mind that this diff is really an aggregate
of multiple unrelated, structured, small changes, including:
- cleanups (e.g. x86 ptrace)
- refactoring (e.g. ipc, eventpoll, user-ns)
- new features/enhancements (e.g. splice, freezer, mm)
I'm confident that each of these will make more sense when presented
In the ksummit presentation I gave an extensive list of real
use-cases (existing and future). The slides are here:
http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf
For more technical details there is also the OLS-2010 paper here:
http://www.cs.columbia.edu/~orenl/papers/ols2010-linuxcr.pdf
presentation slide from there are here:
I'm unsure which states you have in mind that will not be well defined.
It is a difficult problem, and C/R has limitations, but I think we've
got it pretty right this time :)
* we save and restore *all* *execution* state of the applications
(except for well-defined unsupported features; hardware devices
are one such example).
* we don't save FS state (use filesystem snapshots for that); but
we do save runtime FS state (e.g. open files, etc).
* we don't save state of peers (applications/systems) over network;
but we do save network connections for proper live-migration.
(Of course, there is a supporting userspace ecosystem, like utilities
to do the checkpoint/restart, to freeze/thaw the application, to
snapshot the filesystem etc).
So unless the application uses an unsupported resource - it will be
I have a cool demo (and I gave one today!) that shows how I run one
desktop session and restart an older desktop session that then runs
in ...
Hello, Oren. Yeah, could be so but I wasn't really referring to the scariness of the patch per-se but rather how many subsystems CR needs to interact If you think only about target processes, yeah sure, you can cover most of the stuff but that's not the impossible part. What's not defined is interaction with the rest of the system and userland. Userland ecosystem is crazy complex. You simply cannot stop, say, banshee or even pidgin, let it mingle with the rest of the system and I'm afraid I can't agree with that. You can store and restore the states which kernel is aware of but that's a very small fraction of Sure, you can freeze whole tree of related processes and move them around, but if you think about it, it's an already broken scenario. For example, dbus (or rather agents listening to it) doesn't only carry states specific to the set of applications being snapshotted. It also carries whole bunch of system-wide states or states for other applications. As soon as the system goes on executing after checkpointing, the checkpointed image of dbus and its agents become inconsistent and useless. You can't restore it later. You don't know what happened to other parts of the system in between. And this problem doesn't stem from technical details of the implementation. It's fundamental. CR tries to snapshot subset of a big state machine and then use the snapshot later or elsewhere. It doesn't and can't have full visibility into how the subset of states have and are going to interact with the rest of the states. As soon as the whole state machine makes progress, there is no guarantee of consistency. Without explicit provisions for specific applications, it just can't work in generic manner. Can I move my banshee or gwibber to my next machine transparently with in-kernel CR or even restore it later? In many cases, even I (the user) can't define what the desired states So, that's why it comes down to containers and namespaces. You need to preemptively put ...
This is why I think it is important to define the limits of which kernel state features are covered (or going to be covered) by checkpoint/restart - and then list applications that are supported (Oren mentioned mysql server in this thread). It will always be easy for someone to point at some application like powertop and say "we can't migrate that, so checkpoint restart is therefore useless" ... this just is not true. This can be useful without having to be complete (as long as the See above - it may be enough to cover a significant number of Okay - so "dbus" is in the list of "can't do that now, and will never be able to checkpoint/restore that class" - big deal. I'm getting repetitive now, but one last time: just because this can't I don't think that you'll ever make virtualization good enough The CR Kool-Aid hasn't gotten so far into my system to accept this claim. If these "can't stop for more than a few milli-seconds" processes are HPC workloads, then I'm not seeing how you can do much to help them. I think these applications are using almost all of the RAM on the system, and most of the pages are anonymous. Just how do you checkpoint several GB of dirty pages in a few milli-seconds (when there is almost no free memory on the system)? If you have something else in mind, then please explain a little more. -Tony --
Hello, I was arguing that it is far from being _generally_ useful or transparent. If you're saying that it is something useful for certain If you think about HPC, userland implementation is enough. In 99% of cases, those programs just read and write data files and burn a lot of CPU cycles. You don't need a lot of fancy stuff to do that. More important things would be integrating with job management so that snapshots and rollbacks can be automatically done. I agree that CR would be very useful for certain use cases and applications. I just can't see where the giant patchset fits between userland implementation which seems enough for the most common use case of HPC and virtualization which is maturing fast. Thanks. -- tejun --
On Thu, Nov 04, 2010 at 10:43:15AM +0100, Tejun Heo wrote: If you think specialized hardware acceleration is necessary for containers then perhaps you have a poor understanding of what a container is. Chances are if you're running a container with namespaces configured then you're already paying the performance costs of running in a container. If you've compared the performance of that kernel to your virtualization hardware then you already know how they compare. For containers everything is native. You're not emulating instructions. You're not running most instructions and trapping some. You're not running whole other kernels, coordinating sharing of pages and cpu with those kernels, etc. You're not emulating devices, busses, interrupts, etc. And you're also not then circumventing every virtualization mechanism you just added in order to provide decent performance. I rather doubt you'll see a difference between "native" hardware and... native hardware. And I expect you'll see much better performance in one of your containers than you'll ever see in some hand-waved hypothetically-improved virtualization that your response implored us to work on instead. Our checkpoint/restart patches do *NOT* implement containers. They sometimes work with containers to make use of checkpoint/restart simple. In fact they are the strategy we use to enable "generic" checkpoint/restart that you seem to think we lack. Everything else is an optimization choice that we give userspace which virtualization notably lacks. Like above, I expect that your virtualization hardware will compare unfavorably to kernel-based checkpoint/restart of containers. Imagine checkpointing "ls" or "sleep 10" in a VM. Then imagine doing so for a container. It takes way less time and way less disk for the container. (It's also going to be easier to manage since you won't have to do lots of special steps to get at the information in a container which is shutdown or even one that's running. If "mycontainer" is ...
Hello, Sure, that was my point. So, let's drop the handwaving about being Yeah, and imagine what people would say if ext4, or heaven forbid, I don't believe my notion of containers was or is flawed and already said that the diffstat per se didn't look too bad. With enough benefits, I wouldn't be opposed to the rather invasive changes. It's just that the whole thing is conceived backwards and there are already working alternatives which may be somewhat messy now but nevertheless achieve about the same effect without the craziness of serializing in-kernel data structures which are already mostly visible to userland to begin with. Thanks. -- tejun --
Please, do not compare things like single file systems, drivers, or otherwise fairly isolated components, with this "thing". This thing touches a freaky-large number of subsystems, effectively adding a glueage between them, which might end up causing problems (and/or restrict design choices) in the future. The naked patch looks like just a sugar coating to me, which left out 300+ lines of extra logic in epoll alone. This is one of the widest, deepest, most intrusive patches I have seen in a while, whose inclusion would require a little bit more than handwaving and continuous re-posting IMO. - Davide --
I've got a question about the ABI that would be created I see two possible areas that could be considered an ABI 1. control of the C/R process This is very clearly a userspace ABI, to be figured out and locked down like any other ABI 2. the details of how things are stored and added back into a system This is not as clear. at one extreme, this could be like the module interface, (the checkpointed image is only guaranteed to work on a new system with a kernel compiled with the same config options as the system it was checkpointed from). At the other extreme, this could be something that allows you to checkpoint an image on 2.6.40 and restore it on 2.6.80. Or it could be something in between. I don't see any way that it is sane to make the C/R image definition and interface (#2) be an ABI that is guaranteed to never change without hurting future kernel development (exactly the type of things that Davide is worried about above), but what sort of guarantee are people interested in? is it enough to say that it must be the same kernel version compiled with the same options? (or at least the same options for some list of things that matter, most device drivers probably would not matter for example) or would you need compatibility across all compile options for a kernel release? would you require compatibility between 2.6.x.y and 2.6.x.z? would you require compatibility between 2.6.x and 2.6.x+n (for some value of n)? is this something that could go in with the weakest guarantee initially, and then as everyone is more comfortable with it, start extending the guarantee (and as-needed adding code to the kernel to maintain compatibility with old images)? would you require compatibility between 2.6.x and 2.6.x-n? David Lang --
Agreed. The guarantee should be to specific kernels, in a sense (see Matt's post in this thread 11/17). The image format is tied to "set of features supported" (which boils down to something like kernel version). The format is constructed in a modular way such that most new features can be added without breaking old format. For the rare cases that they do, conversion can be done in userspace in a straightforward manner. (All you need We don't "require" compatibility. The compatibility is defined per object (type) in the image format. New objects need not break compatibility. Changes to objects are very rare; and when they happen they "bump" the version. This can help avoid issues related to kernel configs/options. Restarting an image incompatible with a particular kernel will fail, adjustments should be done by userspace filtering. Thanks, Oren. --
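The per-object versioning described in this message can be illustrated with a small sketch. It assumes a hypothetical image layout in which every record carries an object type and a format version; the names `SUPPORTED` and `check_image`, the object types and the version numbers are all made up for illustration and are not taken from the linux-cr patchset. The sketch also assumes that a restart requires an exact version match, with any conversion left to a userspace filter.

```python
# Hypothetical table: object type -> image-format version this kernel can restart.
SUPPORTED = {"task": 2, "mm": 1, "file": 3}


def check_image(records, supported=SUPPORTED):
    """Return the (type, version) records that this kernel cannot restart.

    An empty list means every object in the image matches a known format.
    Anything else means the restart should fail, and a userspace filter
    would have to convert the listed objects first.
    """
    incompatible = []
    for obj_type, version in records:
        # Unknown object types and mismatched versions are both rejected.
        if supported.get(obj_type) != version:
            incompatible.append((obj_type, version))
    return incompatible
```

With this shape, adding a new object type only affects images that actually contain it, which is one way to read "new objects need not break compatibility".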
(Sorry for the length of this email, we are excited about being able to discuss technical details.) This is wonderful to have this exchange of techniques and visions. Oren, we are guessing that you are at Columbia. If so, we would love to have you come up here and give a talk in Boston. Alternatively, if you prefer, we would be happy to go to Columbia and give a talk there. In comparing functionality, one recent bug we had to overcome was with screen with a hardstatus line and a scroll region for the terminal. We eventually solved it in a subtle way by sending SIGWINCH, and then lying to screen about changing the kernel window size, and then sending screen another SIGWINCH while telling it the true window size. We were pleased to see that Linux C/R also supports screen and we are curious how it handles this issue of restoring the scroll region in the X11 terminal window. Thanks. Oren noted that sometimes it's important to stop the process only for a few milliseconds while one checkpoints. In DMTCP, we do that by configuring with --enable-forked-checkpointing. This causes us to fork a child process taking advantage of copy-on-write and then checkpoint the memory pages of the child This is a good example of distinct approaches when starting from Kernel C/R or user-space C/R. We currently checkpoint VNC servers in a way similar to Linux C/R. However, in the next few months, we want to directly checkpoint a single X-windows application without the X11-server. The approach is easily understood by analogy. Currently libc.so talks to the kernel. At checkpoint time, we interrogate the kernel state and then "break" the connection to the kernel and checkpoint. Similarly, libX11.so (or libX11-xcb.so) talks to the X11-server. At checkpoint time, we will interrogate the state of the X11-server and then break Thanks very much for bringing up these implementation questions. It's wonderful to have someone interested in the low level technology to talk to. 
We would like to share with you ...
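The forked-checkpointing idea mentioned in this message (fork a child, let copy-on-write preserve a point-in-time view, and write the checkpoint from the child) can be sketched in a few lines. This is only an illustration of the general technique, not DMTCP's implementation: `forked_checkpoint` is a hypothetical helper, and pickling a Python object stands in for dumping the child's memory pages.

```python
import os
import pickle


def forked_checkpoint(state, path):
    """Snapshot `state` to `path` from a forked child; return the child's pid.

    fork() gives the child a copy-on-write view of the parent's memory, so
    the parent is stopped only for the fork itself and may keep mutating
    `state` while the child writes the snapshot.
    """
    pid = os.fork()
    if pid == 0:
        # Child: serialize the point-in-time view, then exit immediately
        # without running any of the parent's cleanup handlers.
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
            status = 0
        finally:
            os._exit(status)
    # Parent: continues at once; the caller should waitpid() on the child.
    return pid
```

The parent has to reap the child with `os.waitpid(pid, 0)` at some point. Until the child exits, every page the parent writes is duplicated by the kernel, which is the cost raised in the replies that follow.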
Interesting ... but while the process is only stopped for the duration of the fork, it may be taking COW faults on almost every page it touches. I think this will not work well for large HPC applications that allocate most of physical memory as anonymous pages for the application. It may even result in an OOM kill if you don't complete the checkpoint of the child and have it exit in a timely manner. -Tony --
I agree with you that forked checkpointing is probably not what you want in the middle of an HPC computation. But isn't that part of the nature of COW? Whether the COW is invoked within the kernel, or from outside the kernel via fork --- in either case, when you have mostly dirty pages, you will have to copy most of the pages. Do I understand your point correctly? Thanks, - Gene --
The current linux-cr approach to handling [dirty] pages doesn't use COW. The tasks are frozen using the cgroup freezer and thus unable to modify the pages. So we don't have to mess with page tables nor do we pay any extra overhead for page faults. If we ever implement thawed checkpointing -- checkpointing while the task isn't frozen -- then we'd probably use COW and see the same faults. The difference then would be that in-kernel we wouldn't have one extra task per mm being checkpointed. Cheers, -Matt Helsley --
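For readers unfamiliar with the cgroup freezer mentioned here, the following is a minimal sketch of how userspace drives it through the cgroup-v1 `freezer.state` file. The helper names are hypothetical, and the directory passed in (for example a group under a mounted freezer hierarchy) is an assumption that depends on how cgroups are mounted on a given system. On a real system this needs sufficient privileges, and the kernel may report `FREEZING` while the transition is still in progress.

```python
import os


def set_freezer_state(cgroup_dir, state):
    """Request a cgroup-v1 freezer state ("FROZEN" or "THAWED") for a group."""
    if state not in ("FROZEN", "THAWED"):
        raise ValueError(f"unsupported freezer state: {state!r}")
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def read_freezer_state(cgroup_dir):
    """Return the current contents of freezer.state, stripped of whitespace."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()
```

A checkpoint tool following this approach would freeze the group, poll until the state reads `FROZEN`, take the checkpoint, and then write `THAWED`.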
The current linux-cr patchset leaves out any optimizations for simplicity of reviewing - first get it working and reviewed. Thawed checkpointing can be done without any COW tax, by leveraging the native hardware dirty bit in page tables. There is no need to trigger additional checkpoints. Tracking modified pages using the dirty bit is a feature also desired by the KVM community, and we plan to work with them on implementing it. Oren. --
s/checkpoints/faults/ Cheers, -Matt Helsley --
COW is one way of reducing down time (whether through fork or in-kernel checkpoint). However, it is possible to avoid using it (and thus avoid extra page faults and memory overload) by using the page-table "dirty" bit to track dirty pages. This way one can "pre-copy" the checkpoint image while the application is running, without additional overhead (the idea is similar to how live-migration is done). Oren. --
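The pre-copy idea described here can be illustrated with a toy simulation. Everything in it is an assumption made for the sake of the example: memory is modelled as a dict of page numbers, and the `run_app` callback stands in for reading and clearing the hardware dirty bits after letting the application run for a while. It is not code from linux-cr or KVM.

```python
def precopy_checkpoint(memory, run_app, max_rounds=5, threshold=1):
    """Simulate pre-copy checkpointing driven by a per-page dirty set.

    memory:  dict mapping page number -> contents (mutated by run_app)
    run_app: callable that lets the application run briefly and returns
             the set of pages it wrote in that interval
    Returns (image, number_of_pages_copied_while_frozen).
    """
    image = {}
    dirty = set(memory)  # first round: every page still needs copying
    for _ in range(max_rounds):
        if len(dirty) <= threshold:
            break
        for page in dirty:  # copied while the application keeps running
            image[page] = memory[page]
        dirty = set(run_app())  # pages written since the last scan
    # Final step: the application is frozen, so only the remaining dirty
    # pages have to be copied during the actual downtime.
    for page in dirty:
        image[page] = memory[page]
    return image, len(dirty)
```

The downtime is then proportional to the last dirty set rather than to the whole address space, provided the application dirties pages more slowly than they can be copied.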
Like Oren said, we run the application inside the container - which would have its own pid namespace. When we restart, we again create a container, which starts with a fresh pid namespace, so the pids will not be in use. IOW, a process has a virtual pid and a global pid. The virtual pid is what the application sees when it calls getpid() and that pid will be correctly restored when you create the container. Sukadev --
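The distinction between the virtual pid and the global pid can be observed from userspace on kernels that expose an `NSpid:` line in `/proc/<pid>/status`. That field was added in Linux 4.1, well after this thread, and is unrelated to the checkpoint/restart patches. The helper below is a hypothetical sketch that only parses the text of a status file.

```python
def parse_nspid(status_text):
    """Return the pids on the NSpid line of a /proc/<pid>/status dump.

    The leftmost value is the pid as seen from the namespace of the procfs
    mount being read, followed by the pid in each successively nested pid
    namespace. Returns [] if the line is absent (older kernels).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []
```

For a container's init process read from the host, the last entry would typically be 1 (what `getpid()` returns inside the container) while the first entry is the pid the host sees.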
With pleasure. (LPC would have been a good opportunity - I was in Boston). Oren. --
