Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

From: Kapil Arya
Date: Wednesday, November 3, 2010 - 8:40 pm

(Sorry for resending this message; the previous copy contained some HTML
tags and was rejected by the server.)

We would like to thank the author of the previous post for bringing up
the topic of kernel C/R versus userland C/R.  We are two of the developers
of DMTCP (Distributed MultiThreaded CheckPointing), a userland checkpointer:
	     http://dmtcp.sourceforge.net
We had held off writing to the kernel developers because we wanted to
be sure that DMTCP was sufficiently robust before taking up their time.
This thread seems like a good opportunity to begin a dialogue.

In fact, we only became aware of Linux kernel C/R this September.
Of course, we were aware of Oren Laadan's fine earlier work on ZapC
for distributed checkpointing using the Linux kernel (CLUSTER-2005).
We have high respect for Oren Laadan and the other Linux C/R developers,
as well as for the developers of BLCR (a C/R kernel module with a userland
component that is widely used in HPC batch facilities).

By coincidence, when we became aware of Linux C/R, we were already in
the middle of developing a major new release of DMTCP (from version
1.1.x to 1.2.0).  We have just finished that release.  Among other
features, this release supports checkpointing of GNU 'screen', and we have
tested screen in some common use cases (with vim, with emacs, etc.).  While
it supports ssh (e.g., checkpointing OpenMPI, which uses ssh), it does not
yet support _interactive_ ssh sessions.  That will come in the next release.

We believe that both Linux C/R and DMTCP are becoming quite mature, and
that in general, one can achieve good application coverage with either.

In our personal view, a key difference between in-kernel and userland
approaches is the issue of security.  The Linux C/R developers state
the issue very well in their FAQ (question number 7):

The previous posts also brought up the issue of external connections.
DMTCP has been in development for six years, and in the last year we
have concentrated especially on external connections.  We have
accumulated many war stories; one will illustrate the point.
Most Linux distros link vi to vim.  Vim supports mouse and other operations
via the X11 server.  When vim starts up, it connects to the X11
server (which may be local, or remote if ssh uses X11 forwarding).
After a transparent checkpoint and restart, vim expects to continue
talking to the X11 server.  Currently, DMTCP recognizes such
X11 server connections and refuses them.  Vim still survives, without
its mouse and other X11 services.  For the future, we are considering
a more flexible approach that takes the X11 protocol into account.
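To make the "recognize and refuse" idea concrete, here is a minimal C sketch of how a userland wrapper might classify a connect() target as an X11 server and refuse it. It is not DMTCP's code. It assumes the usual Linux conventions that X display N listens on the Unix socket /tmp/.X11-unix/XN (possibly in the abstract namespace) and on TCP port 6000+N, which is also where ssh X11 forwarding typically appears (DISPLAY=localhost:10 means port 6010). The names is_x11_address and filter_connect, and the cutoff of 63 displays, are hypothetical.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Assumed conventions (standard X11 layout on Linux, not taken from DMTCP):
 *   display N listens on the Unix socket /tmp/.X11-unix/XN, possibly in
 *   the Linux abstract namespace, and on TCP port 6000 + N. */
#define X11_UNIX_PREFIX     "/tmp/.X11-unix/X"
#define X11_TCP_BASE        6000
#define X11_TCP_MAX_DISPLAY 63      /* hypothetical cutoff */

static bool is_x11_port(unsigned port)
{
    return port >= X11_TCP_BASE && port <= X11_TCP_BASE + X11_TCP_MAX_DISPLAY;
}

/* Hypothetical helper: does this connect() target look like an X server? */
bool is_x11_address(const struct sockaddr *addr, socklen_t len)
{
    if (addr == NULL || len < (socklen_t)sizeof(sa_family_t))
        return false;

    switch (addr->sa_family) {
    case AF_UNIX: {
        const struct sockaddr_un *un = (const struct sockaddr_un *)addr;
        size_t off = offsetof(struct sockaddr_un, sun_path);
        size_t plen = strlen(X11_UNIX_PREFIX);
        const char *path = un->sun_path;
        size_t path_len;

        if (len <= off)
            return false;
        path_len = (size_t)len - off;
        if (path_len > sizeof(un->sun_path))
            path_len = sizeof(un->sun_path);
        if (path[0] == '\0') {          /* abstract socket: skip leading NUL */
            path++;
            path_len--;
        }
        return path_len >= plen && memcmp(path, X11_UNIX_PREFIX, plen) == 0;
    }
    case AF_INET:
        if (len < (socklen_t)sizeof(struct sockaddr_in))
            return false;
        return is_x11_port(ntohs(((const struct sockaddr_in *)addr)->sin_port));
    case AF_INET6:
        if (len < (socklen_t)sizeof(struct sockaddr_in6))
            return false;
        return is_x11_port(ntohs(((const struct sockaddr_in6 *)addr)->sin6_port));
    default:
        return false;
    }
}

/* Hypothetical wrapper: an LD_PRELOAD shim would call this from its own
 * connect().  Refusing with ECONNREFUSED lets the client take its normal
 * "no X server available" path instead of holding a socket that the
 * checkpointer could not transparently restore after restart. */
int filter_connect(int fd, const struct sockaddr *addr, socklen_t len)
{
    if (is_x11_address(addr, len)) {
        errno = ECONNREFUSED;
        return -1;
    }
    return connect(fd, addr, len);
}
```

Failing the connection up front, rather than dropping it at checkpoint time, is one way to get the behaviour described for vim: the application simply runs as if no X server were present.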

Strategies like these are easily handled in userspace.  We suspect
that even if one begins with a pure kernel approach, one will eventually
want to add a userland component to achieve this kind of flexibility,
just as BLCR has already done.

					   Best wishes,
                                           - Gene Cooperman and Kapil Arya
					     from the DMTCP team

On Tue, Nov 2, 2010 at 5:35 PM, Tejun Heo <tj@kernel.org> wrote: