Inter-machine networking stuff is hard because it's outside the
checkpointed set, so the checkpoint is observable. Migration is easier,
in principle, because you might be able to shift the connection endpoint
without bringing it down. Dealing with networking within your
checkpointed set is just fiddly, particularly remembering and restoring
all the details of things like urgent messages, on-the-fly file
descriptors, packet boundaries, etc.
Sure, there's no inherent problem. But do you imagine including the
file contents within your checkpoint image, or would they be saved
separately?
It's common for an app to write a tmp file, close it, and then open it a
bit later expecting to find the content it just wrote. If you
checkpoint-kill it in the interim, reboot (clearing out /tmp) and then
resume, then it will lose its tmp file. There's no explicit connection
between the process and its potential working set of files. We had to
deal with it by setting a bunch of policy files to tell the
checkpoint/restart system what filename patterns it had to look out
for. But if you just checkpoint the whole filesystem state along with
the process(es), then perhaps it isn't an issue.
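
A minimal sketch of the policy idea, in Python. The pattern list and the
function name are made up for illustration; this is not Hibernator's
actual policy format, just glob patterns matched against the files a
process has touched.

```python
import fnmatch

# Hypothetical policy: glob patterns naming files that must be saved
# alongside the checkpoint image (illustrative, not a real format).
POLICY_PATTERNS = ["/tmp/*", "/var/tmp/*", "*.lock"]


def files_to_save(paths, patterns=POLICY_PATTERNS):
    """Return the paths matching any policy pattern, in input order."""
    return [
        p for p in paths
        if any(fnmatch.fnmatchcase(p, pat) for pat in patterns)
    ]


# Files the process is known to have written (assumed example data).
touched = ["/tmp/sort1234", "/home/j/data.db", "/var/run/app.lock"]
print(files_to_save(touched))
```

Note that fnmatch's `*` also matches `/`, so `/tmp/*` covers nested
paths too; a real policy would need to decide whether that is wanted.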
No, that's the problem; it all worries me. It's a big problem space.
So, in other words: whoever wants to work on it gets to define (their)
goals. Fair enough.
No, I don't have any real opinion about containers vs virtualization. I
think they're quite distinct solutions for distinct problems.
But I was involved in the design and implementation of a
checkpoint-restart system (along with Peter Chubb), and have the scars
to prove it. We implemented it for IRIX; we called it Hibernator, and
licensed it to SGI for a while (I don't remember what name they marketed
it under). The problems that Peter and I mentioned are ones we
had to solve (or, in some cases, failed to solve) to get a workable system.
J
--