Re: checkpoint/restart ABI

From: Jeremy Fitzhardinge
Date: Tuesday, August 12, 2008 - 9:32 am

Dave Hansen wrote:

Inter-machine networking stuff is hard because it's outside the 
checkpointed set, so the checkpoint is observable.  Migration is easier, 
in principle, because you might be able to shift the connection endpoint 
without bringing it down.  Dealing with networking within your 
checkpointed set is just fiddly, particularly remembering and restoring 
all the details of things like urgent messages, on-the-fly file 
descriptors, packet boundaries, etc.
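
To make the "on-the-fly file descriptors" case concrete, here is a minimal sketch (Python 3.9+ on a Unix system; the function name is made up for illustration, and this is not taken from any checkpoint/restart implementation). It passes the read end of a pipe over a Unix socket and closes the sender's copy, so for a moment the only reference to that descriptor lives in a queued socket message rather than in any process's descriptor table:

```python
import os
import socket

def demo_in_flight_fd():
    """Show a file descriptor that exists only inside a queued socket message."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()

    # Queue the pipe's read end on the socket as SCM_RIGHTS ancillary data.
    socket.send_fds(a, [b"x"], [r])

    # Drop the sender's copy. Until the receiver calls recv_fds(), the read
    # end is referenced only by the message sitting in b's receive queue.
    # A checkpoint taken here must look inside socket buffers to find it.
    os.close(r)

    os.write(w, b"hello")

    # The receiver picks the descriptor up (it gets a new fd number).
    _data, fds, _flags, _addr = socket.recv_fds(b, 1, 1)
    received = os.read(fds[0], 5)

    for fd in (fds[0], w):
        os.close(fd)
    a.close()
    b.close()
    return received

if __name__ == "__main__":
    print(demo_in_flight_fd())
```

A checkpointer that only walks each task's open-file table would miss the pipe's read end entirely during the window between the send and the receive.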


Sure, there's no inherent problem.  But do you imagine including the 
file contents within your checkpoint image, or would they be saved 
separately?


It's common for an app to write a tmp file, close it, and then open it a 
bit later expecting to find the content it just wrote.  If you 
checkpoint-kill it in the interim, reboot (clearing out /tmp) and then 
resume, then it will lose its tmp file.  There's no explicit connection 
between the process and its potential working set of files.  We had to 
deal with it by setting a bunch of policy files to tell the 
checkpoint/restart system what filename patterns it had to look out 
for.  But if you just checkpoint the whole filesystem state along with 
the process(es), then perhaps it isn't an issue.
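
As a rough illustration of the policy-file idea, the sketch below matches candidate paths against a list of filename patterns to decide which files should be saved alongside the checkpoint image. The pattern list, the function name, and the glob syntax are all assumptions made for this example; it does not describe the actual policy file format we used.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: glob patterns naming files the application may
# reopen later, so their contents must be saved with the checkpoint.
POLICY_PATTERNS = ["/tmp/*", "/var/tmp/app-*"]

def files_to_save(candidate_paths, patterns=POLICY_PATTERNS):
    """Return the candidate paths that match at least one policy pattern."""
    return [
        path
        for path in candidate_paths
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]

if __name__ == "__main__":
    candidates = [
        "/tmp/scratch.dat",
        "/home/user/data.txt",
        "/var/tmp/app-1.log",
    ]
    for path in files_to_save(candidates):
        print(path)
```

The weakness is the one described above: the patterns are a guess at the application's working set, so any file that falls outside them is silently lost across a checkpoint, reboot, and resume.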


No, that's the problem; it all worries me.  It's a big problem space.


So, in other words: whoever wants to work on it gets to define (their) 
goals.  Fair enough.


No, I don't have any real opinion about containers vs virtualization.  I 
think they're quite distinct solutions for distinct problems.

But I was involved in the design and implementation of a 
checkpoint-restart system (along with Peter Chubb), and have the scars 
to prove it.  We implemented it for IRIX; we called it Hibernator, and 
licensed it to SGI for a while (I don't remember what name they marketed 
it under).  The problems that Peter and I mentioned are ones we 
had to solve (or, in some cases, failed to solve) to get a workable system.

    J