Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

From: Tejun Heo
Date: Tuesday, November 2, 2010 - 2:35 pm

(cc'ing lkml too)
Hello,

On 11/02/2010 08:30 PM, Oren Laadan wrote:

The patch size itself isn't too big but I still think it's one scary
patch, mostly because of the breadth of the code checkpointing needs to
modify, and I suspect that is probably the biggest concern regarding
checkpoint-restart from an implementation point of view.

FWIW, I'm not quite convinced checkpoint-restart can be something
which can be generally useful.  In controlled environments where the
target application behavior can be relatively well defined and
contained (including actions necessary to rollback in case something
goes bonkers), it would work and can be quite useful, but I'm afraid
the states which need to be saved and restored aren't defined well
enough to be generally applicable.  Not only is it a difficult
problem, it actually is impossible to define a common set of states to
be saved and restored - it depends on each application.

As such, I have a difficult time believing it can be something generally
useful.  IOW, I think talking about its usage in complex environments
like common desktops is mostly handwaving.  What about X sessions,
network connections, states established in other applications via dbus
or whatnot?  Which files need to be snapshotted together?  What about
shared mmaps?  These questions are not just difficult to answer in a
generic way, they are impossible.
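
To make the shared mmap question concrete, here is a minimal sketch.
It is only an illustration under stated assumptions: two mappings of
one temporary file stand in for two processes, and the function and
variable names are hypothetical, not taken from any checkpoint-restart
patch.  It shows that naively saving and restoring the memory of only
one mapper silently rolls back what the other mapper wrote in between.

```python
import mmap
import os
import tempfile


def demo_shared_mapping_rollback():
    """Return the peer's view (before_restore, after_restore)."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, 16)
        # Two independent shared mappings of one file stand in for
        # two processes that share memory.
        view_a = mmap.mmap(fd, 16)  # the "checkpointed" process
        view_b = mmap.mmap(fd, 16)  # a peer that is not checkpointed
        try:
            view_a[:5] = b"A:one"
            snapshot = bytes(view_a)   # naive checkpoint of A only
            view_b[8:13] = b"B:two"    # the peer keeps running
            before = bytes(view_b)
            view_a[:] = snapshot       # naive restore of A
            after = bytes(view_b)
        finally:
            view_a.close()
            view_b.close()
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    before, after = demo_shared_mapping_rollback()
    print("peer view before restore:", before)
    print("peer view after restore: ", after)
```

Whether the peer's write should survive the restore is exactly the
kind of thing only the applications involved can decide.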

There is a very distinctive difference between system wide
suspend/hibernation and process checkpointing.  Most programs are
already written with the conditions in mind which can be caused by
system level suspend/hibernation.  Most programs don't expect to be
scheduled and run in any definite amount of time.  There usually
are provisions for loss or failure of resources which are outside the
local system.  There are corner cases which are affected, and those
programs contain code to respond to suspend/hibernation.  Please note
that this is about userland application behavior, not an
implementation detail in the kernel.  It is a much more fundamental
property.

So, although checkpoint-restart can be very useful in certain
circumstances, I don't believe there can be a general implementation.
It inevitably needs to put somewhat strict restrictions on what the
applications being checkpointed are allowed to do.  And after my
train of thought reaches there, I fail to see what the advantages of
an in-kernel implementation would be compared to something like the
following.

  http://dmtcp.sourceforge.net/

Sure, an in-kernel implementation would be able to fake it better, but I
don't think it's anything major.  The coverage would be slightly
better but breaking the illusion wouldn't take much.  Just push it a
bit further and it will break all the same.  In addition, to be
useful, it would need a userland framework or a set of workarounds which
are aware of and can manipulate userland states anyway.  For workloads
for which checkpointing would be most beneficial (HPC for example), I
think something like the above would do just fine and it would make
much more sense to add small features to make userland checkpointing
work better than doing the whole thing in the kernel.

I think in-kernel checkpointing is in an awkward place in terms of
the tradeoff between its benefits and the added complexity needed to
implement it.  If you give up coverage slightly, userland checkpointing is
there.  If you need reliable coverage, proper virtualization isn't too
far away.  As such, FWIW, I fail to see enough justification for the
added complexity.  I'll be happy to be proven wrong tho.  :-)

Thank you.

-- 
tejun