This is an excellent example to demonstrate several points:
* To freeze the processes, you can use the (quote) "hairy" signal
overload mechanism, or the even hairier ptrace; both, by the way,
have performance problems with many processes/threads.
Or you can use the in-kernel freezer-cgroup, and forget about
workarounds, like linux-cr does. And ~200 lines in said diff
are dedicated exactly to that.
* Then, because both the workaround and the entire philosophy
of the MTCP c/r engine assume that affected processes _participate_
in the checkpoint, their syscalls _must_ be interrupted. In contrast,
the linux-cr kernel approach not only allows checkpointing processes
without their collaboration, but also builds on the native signal
handling kernel code to restart the system calls (both after
unfreeze, and after restart), such that the original process
does not observe -EINTR.
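
As a rough illustration of the freezer-cgroup route from the first
point (a sketch only, not code from linux-cr): assuming a cgroup-v1
freezer hierarchy is mounted, freezing every task in a group is a
single write to that group's freezer.state file. The helper names and
the example path in the comment are made up for illustration.

```python
import os

VALID_STATES = ("FROZEN", "THAWED")


def set_freezer_state(cgroup_dir, state):
    """Ask the kernel to freeze or thaw every task in the cgroup."""
    if state not in VALID_STATES:
        raise ValueError("state must be FROZEN or THAWED")
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def read_freezer_state(cgroup_dir):
    """Return the current state; a real cgroup may briefly report
    FREEZING while the kernel is still stopping tasks."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()


# Hypothetical usage, assuming such a group exists and we may write it:
#   set_freezer_state("/sys/fs/cgroup/freezer/myjob", "FROZEN")
#   ... take the checkpoint ...
#   set_freezer_state("/sys/fs/cgroup/freezer/myjob", "THAWED")
```

Note that nothing here sends signals to, or requires cooperation
from, the frozen tasks themselves.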
Aha ... another great example: yet another piece of the suspect
diff in question is dedicated to allowing a restarting process to
request a specific location for the vdso.
BTW, a real security expert (and I'm not one...) may argue that
this operation should only be allowed to privileged users. In fact,
if your code gets around the linux ASLR mechanisms, then someone
should fix the kernel ASLR code :)
FWIW, the restart portion of linux-cr is designed with this in
mind - it is flexible enough to accommodate smart userspace
tools and wrappers that wish to muck with the processes and
their resources post-restart (but before the processes resume
execution). For example, a distributed checkpoint tool could,
at restart time, reestablish the necessary network connections
(which is very different from live migration of connections,
and clearly not a kernel task). This way, it is trivial to migrate
a distributed application from one set of hosts to another, on
different networks, with very little effort.
So you'll need mechanisms not only to read the data at checkpoint
time but also to reinstate the data at restart time. By the time
you are done, the kernel will have all the c/r code (the suspect diff
in question _and_ the rest of the logic) in the form of new
interfaces and ABIs to userspace...; the userspace code will grow
some more hair; and there will be zero maintainability gain. And at
the same time you won't be able to leverage optimizations only
possible in the kernel.
To be precise, there are three types of userland workarounds:
1) userland workarounds to make a restarted application work when
peer processes aren't saved - e.g., in distributed checkpoint you
need a workaround to rebuild the socket to the peer; or in his
example with the 'nscd' daemon from earlier in the thread.
These are needed regardless of the c/r engine of choice. In many
cases they can be avoided if applications are run in containers
(which can be as simple as running a program using 'nohup').
2) userland workarounds to duplicate virtualization logic already
done by the kernel - like the userspace pid-namespace and the
complex logic and hacks needed to make it work. This is completely
unnecessary when you do kernel c/r.
3) userland workarounds to compensate for the fact that userspace
can't get or set some state during checkpoint or restart. For
example, in the kernel it's trivial to track shared files. How
would you tell, from userspace, whether fd[0] of parent A and child
B is the same file opened once and then inherited, or the same
filename opened twice individually ? For files, it is possible to
figure this out in user space, e.g. by intercepting and tracking all
forks and all file operations (including passing fd's via AF_UNIX
sockets).
There are other hairy ways to do it, but not quite so for other
resources.
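
To make the shared-file question concrete, here is a small sketch of
one of those hairy userspace probes. It is an illustration only, not
something linux-cr or MTCP is known to do, and the function name is
made up. It relies on the fact that descriptors referring to the same
open file share one file offset, while two independent open() calls on
the same filename do not. It only works for seekable files, briefly
disturbs the offset, and is racy if anything else touches the
descriptors meanwhile.

```python
import os


def shares_open_file(fd_a, fd_b):
    """Heuristic: True if both descriptors share a single file offset."""
    saved_a = os.lseek(fd_a, 0, os.SEEK_CUR)
    saved_b = os.lseek(fd_b, 0, os.SEEK_CUR)
    # Pick a probe offset that neither descriptor currently has, so a
    # coincidental match cannot be mistaken for sharing.
    probe = max(saved_a, saved_b) + 1
    try:
        os.lseek(fd_a, probe, os.SEEK_SET)
        return os.lseek(fd_b, 0, os.SEEK_CUR) == probe
    finally:
        # Restoring fd_a also restores fd_b when the offset is shared.
        os.lseek(fd_a, saved_a, os.SEEK_SET)
```

Within one process this tells a dup()'ed descriptor apart from a
second open() of the same name. Doing the same across parent A and
child B would need both processes to cooperate in the probe, which is
exactly the kind of participation a kernel c/r does not need.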
As another example, consider SIDs and PGIDs. With proper algorithms
you can ensure that your processes get the right SID at fork time.
But in the general case, you can't reproduce PGIDs accurately
without replaying how the processes (including those that have
already died) behaved.
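
A minimal sketch of the constraint a userspace restart has to work
within (the helper name is hypothetical): a process that calls
setsid() becomes the leader of a new session and a new process group,
and both IDs equal its own pid. A restart tool therefore cannot simply
assign an arbitrary SID; it has to arrange for the right process, with
the right pid, to call setsid() before forking the descendants that
should inherit that session.

```python
import os


def session_ids_after_setsid():
    """Fork a child that calls setsid(); return (child_pid, sid, pgid)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: start a new session and report the resulting IDs.
        status = 1
        try:
            os.close(r)
            os.setsid()
            os.write(w, f"{os.getsid(0)} {os.getpgid(0)}".encode())
            status = 0
        finally:
            os._exit(status)
    # Parent: read the child's report, then reap the child.
    os.close(w)
    data = os.read(r, 64)
    os.close(r)
    os.waitpid(pid, 0)
    sid, pgid = (int(x) for x in data.split())
    return pid, sid, pgid
```

Process groups are less forgiving: setpgid() can move a process only
into its own new group or into a group that currently exists in its
session, which is why history matters for PGIDs.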
And to track zombies at checkpoint, you'd need to actually collect
them, so you must do it in a hairy wrapper, and keep the secret
until the application calls wait(). But then, there may be some
side effects due to collecting zombies, e.g. the pid may be reused
against the application's expectation.
Some of these have workarounds, some do not. Do you really think
that re-implementing Linux and namespaces in userspace is the way
to go ?
Then, you can add to the kernel an endless number of interfaces to
export all of this - both the data, and the functionality to
reinstate this data at restart. But ... wait -- isn't that what
linux-cr already does ?
That is one opinion. Then there are people using VPSs in commercial
and private environments, for example.
VMs are a wonderful (re)invention. Regardless of any one person's
opinion about VMs vs containers, both are here to stay, and both
have their use-cases and users. IMHO, it is wrong to ignore the
need for c/r and migration capabilities for containers, whether
they run full desktop environments, multiple applications or single
processes.
Oren.
--