These patches implement checkpoint-restart [CR v2]. This version adds
save and restore of open files state (regular files and directories)
which makes it more usable. Other changes address the feedback given
for the previous version. It is also refactored (along Dave's posting)
for easier reviewing.
Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Extend to handle (multiple) tasks in a container
- Security (without CAPS_SYS_ADMIN files restore may fail)
Changelog:
[2008-Aug-20] v2:
- Added dump and restore of open files (regular and directories);
see the changes in the test program (ckpt.c)
- Added basic handling of shared objects, and use 'parent tag'
- Added documentation
- Improved ABI, add 64bit padding for image data
- Improved locking when saving/restoring memory
- Added UTS information to header (release, version, machine)
- Cleanup extraction of filename from a file pointer
- Refactor to allow easier reviewing
- Remove requirement for CAPS_SYS_ADMIN until we come up with a
security policy (this means that file restore may fail)
- Other cleanup in response to comments for v1
[2008-Jul-29] v1:
- Initial version: support a single task with address space of only
private anonymous or file-mapped VMAs; syscalls ignore pid/crid
argument and act on current process.
--
(Dave Hansen's announcement)
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach. With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.
This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data. We were also planning on adding lots of new,
little kernel interfaces for all of the things that ...Create trivial sys_checkpoint and sys_restore system calls. They will enable to checkpoint and restart an entire container, to and from a checkpoint image file. First create a template for both syscalls: they take a file descriptor (for the image file) and flags as arguments. For sys_checkpoint the first argument identifies the target container; for sys_restart it will identify the checkpoint image. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- arch/x86/kernel/syscall_table_32.S | 2 ++ include/asm-x86/unistd_32.h | 2 ++ include/linux/syscalls.h | 2 ++ 3 files changed, 6 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index d44395f..5543136 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -332,3 +332,5 @@ ENTRY(sys_call_table) .long sys_dup3 /* 330 */ .long sys_pipe2 .long sys_inotify_init1 + .long sys_checkpoint + .long sys_restart diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h index d739467..88bdec4 100644 --- a/include/asm-x86/unistd_32.h +++ b/include/asm-x86/unistd_32.h @@ -338,6 +338,8 @@ #define __NR_dup3 330 #define __NR_pipe2 331 #define __NR_inotify_init1 332 +#define __NR_checkpoint 333 +#define __NR_restart 334 #ifdef __KERNEL__ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index d6ff145..edc218b 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr); asmlinkage long sys_eventfd(unsigned int count); asmlinkage long sys_eventfd2(unsigned int count, int flags); asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags); +asmlinkage long sys_restart(int crid, int fd, unsigned long flags); int kernel_execve(const char ...
this, of course, should be the last patch, not the first. --
Uh oh. These got whitespace mangled somehow. I'll look through them, but probably can't produce patches on top of them for now. -- Dave --
This patch also has to go *AFTER* you actually define the syscall functions. Otherwise, this won't be build-bisectable when it gets committed. -- Dave --
Oops .. not sure what happened. I'll re-send after addressing the recent comments. Oren. --
For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it will be followed by the file name. The cr_vma->npages will tell
how many pages were dumped for this VMA. Then it will be followed
by the actual data: first a dump of the addresses of all dumped
pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next VMA and so on.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.c | 3 +
checkpoint/ckpt.h | 6 +
checkpoint/ckpt_arch.h | 1 +
checkpoint/ckpt_hdr.h | 31 ++++
checkpoint/ckpt_mem.c | 382 +++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/ckpt_mem.h | 30 ++++
checkpoint/ckpt_x86.c | 28 ++++
checkpoint/rstr_x86.c | 2 +
checkpoint/sys.c | 7 +-
include/asm-x86/ckpt.h | 5 +
11 files changed, 495 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/ckpt_mem.c
create mode 100644 checkpoint/ckpt_mem.h
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 29dbb2d..032fc9f 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,2 +1,2 @@
-obj-y += sys.o checkpoint.o restart.o
+obj-y += sys.o checkpoint.o restart.o ckpt_mem.o
obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 949ed58..bf868ae 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -169,6 +169,9 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
ret = cr_write_task_struct(ctx, t);
cr_debug("task_struct: ret %d\n", ret);
if (!ret)
+ ret = cr_write_mm(ctx, t);
+ cr_debug("memory: ret %d\n", ret);
+ if (!ret)
ret = cr_write_thread(ctx, t);
cr_debug("thread: ret %d\n", ret);
if (!ret)
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
index 7ecafb3..0addb63 100644
--- a/checkpoint/ckpt.h
+++ b/checkpoint/ckpt.h
@@ -29,6 +29,9 @@ struct cr_ctx {
...please move these into arch/x86/mm/checkpoint.c and arch/x86/mm/restore.c. (also, please dont try to abbreviate too much in filenames, makes it harder to follow changes later on, etc.) Cool stuff btw! Any ETA on when more complex apps can be checkpointed/restored? Also, shouldnt at least part of this effort be joined with the live migration efforts of openvz? (or do you see fundamental, unsolvable differences?) Ingo --
CONFIG_INGO=Y (sorry saw andrew Morton do that Couldn't resist); justin P. Mattock --
My understanding is that this is a result of the joint discussion (with the OpenVZ folks present) that happened during the containers min-summit prior to OLS this year. -- Balbir --
great news :-) This stuff should be completed ASAP, user-transparent kernel upgrades on a generic Linux distro would be really nice :-) Ingo --
Need to check ret here: + if (ret) [...] Thanks, Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Thanks Louis for all the comments. Will fix in v3. Oren. --
This classification eventually simplifies both dump and restore. For instance, it decides whether a file name follows or not. There will be more, later: CR_VMA_FILE_UNLINKED (mapped to an unlinked file), CR_VMA_ANON_SHARED (shared anonymous), CR_VMA_ANON_SHARED_SKIP I figured the use of 'tag' for the identifiers of shared objects is clear. By using a __u32 the size of that field is immediately visible, while using a structure will hide the actual size; in turn this is what we want visible Using __u32 instead of enum guarantees the size. I'll change the It is my understanding that the code must not sleep between kmap_atomic() Yes. I'll encapsulate that in it's own function. Oren. --
I still don't see there being a need to explicitly specify the distinction. Why should a VMA mapping an unlinked file be any different from a linked one? It should point over to the 'file' checkpoint structure and let the real work be done there. There are no truly anonymous shared areas. They anon ones are still file-backed as far as the kernel is concerned. If we do the file saving Either way you stack it, I think 'tag' is a horrendous variable name. What kind of tag? What does it do? What does it tag? What does it reference? What values are valid in there? If we have a separate structure, we simply make it clear that the structure will get used all over the place, and that it, too is part of the ABI. Knowing the sizes isn't as important as knowing that the ABI Exactly. The code is copy-and-pasted. If there's a bug in the original, who will change this one? Better to simply consolidate the This is where the mentality has to shift. Instead of thinking "there is no exact in-kernel match for what I want here" we need to consider that we can modify the in-kernel ones. They can be modified to suit both Yes, but you're going to absolutely kill performance on systems which have expensive global TLB flushes. Frequent kmaps()s should be avoided at all costs. The best way around this would be to change the interface to cr_kwrite() so that it didn't have to use *mapped* kernel memory. Maybe an interface that takes 'struct page'. Or, one where we kmap_atomic() the -ENOSUP might be clearest for now. "Your system call tried to do something unsupported." -- Dave --
The classifications helps to make the code cleaner (and more readable). In any case, you need at least to save a classifier that tells whether a shared VMA is anonymous, or mapped to a file, or is an IPC shmem segment (and also whether that IPC segment has been removed - a la unlinked): you simply don't treat them equally; and on the restart side you can't re-test it on the VMA structure itself. Given that we need a classifier anyway, re-using it to describe all VMAs and not only distinguish the above (and maybe more) cases not only enhances the readability of the code, but also allows to merge common code paths based on the value of the classifier. Probably you won't be convinced until I add the code to support all types of I agree. However, my main goal now is not to make this patch perfect, but to provide a viable proof-of-concept that demonstrates how we want to do things. Unless you feel we are near ready to send these for inclusion soon (...), I intend to prioritize design and functionality. I'll add a FIXME comment there. Of course, should you provide a patch to fix Are you suggesting adding a new error code ? Oren. --
So, you currently have buy-in on this basic approach from the OpenVZ guys, and probably Ingo. If you get too far along in the "design and functionality" and a community member comes up with a really good objection to some basic part of the "design and functionality" you're going to have a lot of code to rewrite. I'm trying to get minimized the quantity of "design and functionality" down to the bare minimum set where we can get this merged. So, yes, I do think these should be sent for inclusion soon. I believe that we're on course to create another large out-of-tree patch set that does in-kernel checkpoint/restart. We have no shortage of those. It's been done many times before. This one will *HOPEFULLY* be different from all the rest when it gets Dang. What's the thing we return from system calls that are unsupported? Did I just dream that up? -- Dave --
ENOSYS? Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Brain not functioning, yet. Thanks. -- Dave --
Or EOPNOTSUPP - but that's documented as a network error ("Operation not
supported on transport endpoint").
J
--
It's true there are out-of-tree proofs-of-concept for in-kernel c/r, but on the other hand it is nice seeing where Oren's code is going. Based on the patchsets you and he have been sending, the impression I've gotten was that Oren was fleshing out the longer-term tree, while you were working on the upstream-acceptable minimal patchset. That seems like a good model to me. (Was I wrong about either of your intents?) It does mean that much of Oren's patchset may end up having to be re-written based on how Dave's patches look when they get upstream, but I'm sure Oren knows that's par for the course and doesn't mind. (Or, is that too much effort required on your part to try and cherrypick bits of Oren's changes back into your tree?) In any case, if that *is* the model, then I'd really like to see a repost of your (Dave's) simplified patchset. I think we need to decide precisly which features we want to try to push first (I thought we were not going to support fds?), throw out any extraneous infrastructure, put the source for the one-and-only program that it can checkpoint and restart in the patch description, and see what feedback we get from the community. The reason it'll be good to have Oren continue on his own path is that you know we'll get questions like "well how are you going to handle (X)", so then we can point them to Oren's tree. At least, that's how I was seeing it. -serge --
That would probably work as long as Oren is working on top of what I have. It's going to be a lot harder if I have to repeat the same break-out process for each iteration of Oren's patches, though. -- Dave --
If Oren's purpose is not quite to create a upstreamable patchset, then it would make more sense for him to keep a git tree and put new patches on top of his existing ones (within reaason as he rebases). Then you'd at least be able to trivially look at the latest deltas. -serge --
The trick would be compromising on things that I, for instance, think need to be rewritten or removed. Oren would have to rebase his work against what I do. I guess you could think about it like me being upstream from Oren. Anything that I would change, Oren would need to rebase on top of. Oren, are you willing to keep a set of patches that add functionality on top of a minimal set that I'm keeping? Mine being the set that we're trying to push into mainline as soon as possible. -- Dave --
Dave, Serge: I'm currently away so I must keep this short. I think we have so far more discussion than an actual problem. I'm happy to coordinate with every interested party to eventually see this work go into main stream. My only concerns are twofold: first, to get more feedback I believe we need to get the code a bit more usable; including FDs is an excellent way to actually do that. That will add significant value to the patch. I think it's important to demonstrate how shared resources and multiple processes are handled. FDs demonstrate the former (with a fixed version of the recent patchset - I will post soon). The latter will increase the size of the patchset significantly, so perhaps can indeed wait for now. It should not be hard for me to add functionality on top of a more basic patchset. The question is, what is "basic" ? Anyway, I will be back towards the end of the week. Let's try to discuss this over IRC then (e.g. Friday afternoon ?). Oren. --
hmm, yes and no. fd's are a must have but I would be more interested to see an external checkpoint/restart and signal support first. why ? because it would be already usable for most computational programs in HPC, like this stupid one : https://www.ccs.uky.edu/~bmadhu/pi/pi1.c signals are required because it's how 'load' and/or 'system' managers interact with the jobs they spawn. external checkpoint/restart for the same reason. for files, I would first only care about stdios (make sure they're relinked to something safe on restart) and file states of regular files. contents is generally handled externally (deleted files being an annoying exception) then, support for openmp application is a nice to have, so I'd probably shared resources are only useful in a multiprocess/multitask context. I'd start working on this first. here we jump directly in the pid namespace issues, how we start a set of process in a pidnamespace ? how do we relink it to its parent pidnamespace ? are signals well propagated ? etc. but hey, we'll have to solve it one day. FD's are shared but have many types which are pain to handle. (it would interesting to see if we can add checkpoint() and restart() operations in fileops) So, for shared resources demonstration, I'd work on sysvipc, there are less types to handle and they force us to think how we are going hmm, that depends how you do it. If you restart all the hierarchy in the kernel, It will increase for sure the patch footprint. However if you restore the hierarchy from user space and then let each process restore itself from some binary blob, it should not. This, of course, means that the binary blob representing the state of the container (we call it statefile) is not totally opaque. It see it a bit like /proc, a directory containing shared states (all namespaces) and tasks states. That's something to discuss. I do prefer the second option for many reasons: . each process restarts itself from its current ...
I'm sorry for being out of discussion for so long time. I've just sent a patchset with external checkpoint/restart as it is implemented in OpenVZ. External checkpoint/restart also is very usefull if we will have container's infrastructure in mainstream. It won't be very hard to add support for signals in my patchset. I'll wait for some time for comments and will add support for signals in next version. Regards, --
OK. Then, we should be able to have a simple C/R framework (user and kernel) if we integrate your patchset or Oren's with Daniel's user tools : http://lxc.cvs.sourceforge.net/lxc/ I was just starting to work on Oren's patchset when you sent yours :) Thanks, C. --
I will take a look on Daniel's user tools. Thanks, Andrey --
Excellent. Having two trees or branches, one very basic and nigh-upon useless, with another advanced enough to be a valuable proof of concept to answer "oh Sounds great. (I may be taking the day off to paint, but will try to monitor irc) thanks, -serge --
Restoring the memory address space begins with nuking the existing one of the current process, and then reading the VMA state and contents. Call do_mmap_pgoffset() for each VMA and then read in the data. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- checkpoint/Makefile | 2 +- checkpoint/ckpt_arch.h | 1 + checkpoint/restart.c | 3 + checkpoint/rstr_mem.c | 356 ++++++++++++++++++++++++++++++++++++++++++++++++ checkpoint/rstr_x86.c | 55 ++++++++ 5 files changed, 416 insertions(+), 1 deletions(-) create mode 100644 checkpoint/rstr_mem.c diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 032fc9f..41e0877 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -1,2 +1,2 @@ -obj-y += sys.o checkpoint.o restart.o ckpt_mem.o +obj-y += sys.o checkpoint.o restart.o ckpt_mem.o rstr_mem.o obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h index 3b87a6f..ab7ac1c 100644 --- a/checkpoint/ckpt_arch.h +++ b/checkpoint/ckpt_arch.h @@ -6,3 +6,4 @@ int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag); int cr_read_thread(struct cr_ctx *ctx); int cr_read_cpu(struct cr_ctx *ctx); +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag); diff --git a/checkpoint/restart.c b/checkpoint/restart.c index a85f48b..81ce0a4 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -183,6 +183,9 @@ static int cr_read_task(struct cr_ctx *ctx) ret = cr_read_task_struct(ctx); cr_debug("task_struct: ret %d\n", ret); if (!ret) + ret = cr_read_mm(ctx); + cr_debug("memory: ret %d\n", ret); + if (!ret) ret = cr_read_thread(ctx); cr_debug("thread: ret %d\n", ret); if (!ret) diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c new file mode 100644 index 0000000..df602a9 --- /dev/null +++ b/checkpoint/rstr_mem.c @@ -0,0 +1,356 @@ +/* + * Restart memory contents + * + * Copyright (C) 2008 Oren Laadan + * + ...
Should down_write(&mm->mmap_sem) again here, and hold it until all vmas are restored. This means removing down_write() from cr_vma_writable(). Or perha= ps make it finer grain: release it before looping on the vmas and make Thanks, Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
(Following Dave Hansen's refactoring of the original post) Add logic to save and restore architecture specific state, including thread-specific state, CPU registers and FPU state. Currently only x86-32 is supported. Compiling on x86-64 will trigger an explicit error. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- checkpoint/Makefile | 1 + checkpoint/checkpoint.c | 9 ++- checkpoint/ckpt_arch.h | 7 ++ checkpoint/ckpt_x86.c | 190 +++++++++++++++++++++++++++++++++++++++++++++++ checkpoint/restart.c | 11 ++- checkpoint/rstr_x86.c | 176 +++++++++++++++++++++++++++++++++++++++++++ include/asm-x86/ckpt.h | 72 ++++++++++++++++++ 7 files changed, 463 insertions(+), 3 deletions(-) create mode 100644 checkpoint/ckpt_arch.h create mode 100644 checkpoint/ckpt_x86.c create mode 100644 checkpoint/rstr_x86.c create mode 100644 include/asm-x86/ckpt.h diff --git a/checkpoint/Makefile b/checkpoint/Makefile index d129878..29dbb2d 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -1 +1,2 @@ obj-y += sys.o checkpoint.o restart.o +obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index 25343f5..949ed58 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -20,6 +20,7 @@ #include "ckpt.h" #include "ckpt_hdr.h" +#include "ckpt_arch.h" /** * cr_fill_fname - return pathname of a given file @@ -166,7 +167,13 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t) } ret = cr_write_task_struct(ctx, t); - cr_debug("ret %d\n", ret); + cr_debug("task_struct: ret %d\n", ret); + if (!ret) + ret = cr_write_thread(ctx, t); + cr_debug("thread: ret %d\n", ret); + if (!ret) + ret = cr_write_cpu(ctx, t); + cr_debug("cpu: ret %d\n", ret); return ret; } diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h new file mode 100644 index 0000000..b7cc8c9 --- /dev/null +++ b/checkpoint/ckpt_arch.h @@ ...
Well, I told you I'd write some Kconfig magic for you, but I can't since I can't apply these patches. :) You need this in a Kconfig file somewhere: config CHECKPOINT_RESTART prompt "Good prompt" def_bool y depends on X86_32 && EXPERIMENTAL help Lots of fun help... config SELECT_ALL_NAMESPACES prompt "Enable all kernel namespace support" selects IPC_NS NET_NS ... help This option will save you having to go track down all of the individual places that namespaces are supported in the kernel. If you don't turn this on, the restart porttion of checkpoint/restart gets a lots less reliable... probably in checkpoint/Kconfig. Then, in all of the architectures' arch/*/Kconfig files, you need to: source "checkpoint/Kconfig" Then, the Makefile looks like this: --- /dev/null 2008-04-22 10:49:52.000000000 -0700 +++ oren-cr.git-dave/checkpoint/Makefile 2008-08-22 12:30:53.000000000 -0700 @@ -0,0 +1 @@ +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o -- Dave --
Covers application checkpoint/restart, overall design, interfaces and checkpoint image format. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- Documentation/checkpoint.txt | 177 ++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 177 insertions(+), 0 deletions(-) create mode 100644 Documentation/checkpoint.txt diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt new file mode 100644 index 0000000..fdc69cb --- /dev/null +++ b/Documentation/checkpoint.txt @@ -0,0 +1,177 @@ + + === Checkpoint-Restart support in the Linux kernel === + +Copyright (C) 2008 Oren Laadan + +Author: Oren Laadan <orenl@cs.columbia.edu> + +License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) +Reviewers: + +Application checkpoint/restart [CR] is the ability to save the state +of a running application so that it can later resume its execution +from the time at which it was checkpointed. An application can be +migrated by checkpointing it on one machine and restarting it on +another. CR can provide many potential benefits: + +* Failure recovery: by rolling back an to a previous checkpoint + +* Improved response time: by restarting applications from checkpoints + instead of from scratch. + +* Improved system utilization: by suspending long running CPU + intensive jobs and resuming them when load decreases. + +* Fault resilience: by migrating applications off of faulty hosts. + +* Dynamic load balancing: by migrating applications to less loaded + hosts. + +* Improved service availability and administration: by migrating + applications before host maintenance so that they continue to run + with minimal downtime + +* Time-travel: by taking periodic checkpoints and restarting from + any previous checkpoint. + + +=== Overall design + +Checkpoint and restart is done in the kernel as much as possible. The +kernel exports a relative opaque 'blob' of data to userspace which can +then be handed to the ...
Add those interfaces, as well as helpers needed to easily manage the file format. The code is roughly broken out as follows: ckpt/sys.c - user/kernel data transfer, as well as setup of the checkpoint/restart context (a per-checkpoint data structure for housekeeping) ckpt/checkpoint.c - output wrappers and basic checkpoint handling ckpt/restart.c - input wrappers and basic restart handling Patches to add the per-architecture support as well as the actual work to do the memory checkpoint follow in subsequent patches. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- Makefile | 2 +- checkpoint/Makefile | 1 + checkpoint/checkpoint.c | 191 +++++++++++++++++++++++++++++++++++++++ checkpoint/ckpt.h | 71 +++++++++++++++ checkpoint/ckpt_hdr.h | 86 ++++++++++++++++++ checkpoint/restart.c | 199 +++++++++++++++++++++++++++++++++++++++++ checkpoint/sys.c | 227 +++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 776 insertions(+), 1 deletions(-) create mode 100644 checkpoint/Makefile create mode 100644 checkpoint/checkpoint.c create mode 100644 checkpoint/ckpt.h create mode 100644 checkpoint/ckpt_hdr.h create mode 100644 checkpoint/restart.c create mode 100644 checkpoint/sys.c diff --git a/Makefile b/Makefile index f3e2065..584c8f9 100644 --- a/Makefile +++ b/Makefile @@ -619,7 +619,7 @@ export mod_strip_cmd ifeq ($(KBUILD_EXTMOD),) -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/ vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \ $(core-y) $(core-m) $(drivers-y) $(drivers-m) \ diff --git a/checkpoint/Makefile b/checkpoint/Makefile new file mode 100644 index 0000000..d129878 --- /dev/null +++ b/checkpoint/Makefile @@ -0,0 +1 @@ +obj-y += sys.o checkpoint.o restart.o diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c new file mode 100644 index ...
I thought we agreed that this counter should be per-container. Perhaps add a I'm a bit puzzled by this assumption. I would say: either this is a self-checkpoint (only current process), or this is a container checkpoint. = In the latter case, I expect that in the general case the checkpointer lives outside the container (and the interface of sys_checkpoint() below confirms this), so it's root fs is probably not the container's one. Does it differ from what you're planning? Thanks, --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
You are correct. We lack infrastructure for what I'd call "container-object", and the patchset does not yet tackle a container and multiple tasks, so this is an interim solution. Will add a FIXME comment. Thanks, Oren. --
These all need to be fixed up to Linux style: ret = = cr_kwrite(ctx, h, sizeof(*h))); if (ret < 0) ...; You might want to try checkpatch.pl on these. It is pretty good at We at least neex EXTRAVERSION in there if we're going to even pretend alloc/delloc is not a bad thing. :) I think at least a few of these TASK_COMM_LEN might be subject to changing. Can it be part of the user<->kernel ABI? I replaced all of the uses of these with kmalloc()/kfree() and stack allocations. I'm really not sure what these buy us since our allocators are already so fast. tbuf, especially, worries me if it gets used in That ( ? : ) needs to get broken out so it is easier to read. This function is already nice and tight as it is, so saving a line or two How about just making cr_read_obj_type() stop returning positive values? Well, if TASK_COMM_LEN changed, then the size of the structure changed, First of all, the unlikely() needs to die. This isn't performance-critical code, and I don't see any justification for why it is needed. Also, errors like -EAGAIN are just fine and normal to come back from a write to a file. We shoudn't be erroring out on them. (Note that this is something that userspace would be handling by itself if we were doing Any particular reason you're using __get_free_pages() here? kmalloc() would save you a cast, and having to deal with keeping "order" vs. a plain size. Also, why don't you just stick tbuf and hbuf's allocation into the ctx structure itself for now? It'll save some of the error paths. Yeah, it might become an order 2 or 3 allocation, but that Should we just combine most of these two functions? If we passed a fd into cr_ctx_alloc(), it could do the fget/put_light() bits, and we wouldn't need the code in both sys_checkpoint() and sys_restart(). -- Dave --
It isn't clear exactly what we need in the header in terms of version, config, and compile information. I put this as a starter, because it's obvious and because this info is likely to be useful even at the user/admin level. I don't expect this to remain as is as we continue development. This is a temporary solution since this patchset doesn't really handle real These are flags (more will come later). Good point. Having the 'task_comm_len' field there isn't enough because the size of the entire struct changes anyways. I disagree with putting some of the cr_hdr_... on the stack: first, if they grow in size, it's easy to forget to change the allocation to stack. Second, using a standard code/handling for all cr_hdr_... makes the code more readable and easier for others to extend by simply following existing code. The main argument for ->hbuf is that eventually we would like to buffer data in the kernel to avoid potentially slow writing/reading when processes are frozen during checkpoint/restart. By using the simple cr_get_hbuf() and cr_put_hbuf() we prepare for that: now they just allocate room in ->hbuf, but later they will provide a pointer directly in the data buffer. Moreover, it simplifies the error path since cleanup (of ->hbuf) is implicit. Also, it is designed to allow checkpoint and restart function to be called in a nested manner, again simplifying the code. Finally, my experience was that it can impact performance if you are after very short downtimes, especially for the checkpoint. The use of ->tbuf is less critical and is mainly for performance and to simplify the error path, never to be called nested. I'm ok with removing it for now. The idea with cr_read_obj_type() is that it returns the 'ptag' - parent tag - field of the object that it reads in. The 'ptag' is the tag of the parent object, and is useful in several places. For instance, the 'ptag' of an MM header identifies (is equal to) the 'tag' of TASK to which it belongs. In this ...
It actually makes it harder for others to jump into because they see something which is basically a kmalloc() to the rest of the kernel. We don't have ext2_alloc() or usb_alloc(), so I'm not sure we should have Um, we're writing to a file descriptor and kmap()'ing. Those can both potentially be very, very slow. I don't think that a few kmalloc()s It simplifies *one* error path. But, it complicates the ctx creation Right, but I'm comparing it to kmalloc() Certainly kmalloc()s can be Since this ptag isn't actually used, I can't really offer a suggestion. I don't see the whole picture. -- Dave --
I disagree. It's more than "allocate memory", it's: "give me some room on the buffer". When we don't care about downtime (like now), it's enough to kmalloc a temporary buffer and flush (write to the FD) immediately. But when we will care about downtime, "the buffer" will be memory in kernel where the entire checkpoint image data will be kept while the container is frozen (memory contents need only be marked copy-on-write, not explicitly buffered). In this setting, cr_kwrite() will not really write the data to the FD, but only record somewhere that the data exists. Only after the container resumes execution will we eventually write the data to the FD and free the buffer. (This is also useful in case you want to keep the checkpoint image entirely in memory without flushing to any FD; and yes, there are use-cases for that). So for this, instead of using kmalloc() to allocate a temp-buf, fill it with data, and then copy that data to the actual (big) buffer, the mechanism provides a shortcut to allocate space directly on the buffer, and save the But you miss the point: the writing to the file descriptor in the (future) optimized version will _not_ happen while the container is frozen. Only much later after the container resumes, so at the end it will be: (1) freeze container (2) dump data to in-kernel buffer (mark dirty pages copy-on-write) (3) unfreeze container (4) write in-kernel buffer to FD. By deferring step #4 until after the freeze-period, you can save tens and hundreds of milliseconds, and seconds. Now the goal is to make step #2 be as fast as possible, and every millisecond (and less) counts, e.g. to take fast checkpoints of interactive desktops. And my experience is that in such "workload" repetitive kmalloc/kfree inside that dump loop has a measurable impact on performance (of course, for those objects which are Actually it simplifies *many* error paths -- in *all* cr_write_...() You can see how it's used in the code that dumps files. Or look at the ...
Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent' and lookup tag in the hash table; if not found (first occurence), read in 'struct cr_hdr_fd_data', create a new FD and register in the hash. Otherwise attach the file pointer from the hash as an FD. This patch only handles basic FDs - regular files, directories and also symbolic links. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- checkpoint/Makefile | 2 +- checkpoint/checkpoint.c | 3 + checkpoint/ckpt.h | 6 +- checkpoint/restart.c | 3 + checkpoint/rstr_file.c | 202 +++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 213 insertions(+), 3 deletions(-) create mode 100644 checkpoint/rstr_file.c diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 179175b..fd073cd 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -1,3 +1,3 @@ obj-y += sys.o checkpoint.o restart.o objhash.o \ - ckpt_mem.o rstr_mem.o ckpt_file.o + ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index bf868ae..fe30ebb 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -172,6 +172,9 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t) ret = cr_write_mm(ctx, t); cr_debug("memory: ret %d\n", ret); if (!ret) + ret = cr_write_files(ctx, t); + cr_debug("files: ret %d\n", ret); + if (!ret) ret = cr_write_thread(ctx, t); cr_debug("thread: ret %d\n", ret); if (!ret) diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h index ef2f74d..b83dea1 100644 --- a/checkpoint/ckpt.h +++ b/checkpoint/ckpt.h @@ -83,11 +83,13 @@ int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n); int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type); int cr_read_str(struct cr_ctx *ctx, void *str, int n); +int do_checkpoint(struct cr_ctx *ctx); int cr_write_mm(struct cr_ctx *ctx, ...
To restore the open files, cr_read_files() first closes all the
existing FDs of the current process. This breaks the assumption
underlying the use of fget_light/fput_light in sys_restart(), so
use fget/fput instead.
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 7b2670a..ec1609b 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -210,10 +210,9 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
{
struct cr_ctx *ctx;
struct file *file;
- int fput_needed;
int ret;
- file = fget_light(fd, &fput_needed);
+ file = fget(fd);
if (!file)
return -EBADF;
@@ -223,14 +222,14 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
ctx = cr_ctx_alloc(crid, file, flags | CR_CTX_RSTR);
if (!ctx) {
- fput_light(file, fput_needed);
+ fput(file);
return -ENOMEM;
}
ret = do_restart(ctx);
cr_ctx_free(ctx);
- fput_light(file, fput_needed);
+ fput(file);
return ret;
}
--
Infrastructure to handle objects that may be shared and referenced by multiple tasks or other objects, e..g open files, memory address space etc. The state of shared objects is saved once. On the first encounter, the state is dumped and the object is assigned a unique identifier and also stored in a hash table (indexed by its physical kenrel address). From then on the object will be found in the hash and only its identifier is saved. On restart the identifier is looked up in the hash table; if not found then the state is read, the object is created, and added to the hash table (this time indexed by its identifier). Otherwise, the object in the hash table is used. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- Documentation/checkpoint.txt | 44 ++++++++++ checkpoint/Makefile | 2 +- checkpoint/ckpt.h | 18 ++++ checkpoint/objhash.c | 193 ++++++++++++++++++++++++++++++++++++++++++ checkpoint/sys.c | 4 + 5 files changed, 260 insertions(+), 1 deletions(-) create mode 100644 checkpoint/objhash.c diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt index fdc69cb..923ec10 100644 --- a/Documentation/checkpoint.txt +++ b/Documentation/checkpoint.txt @@ -163,6 +163,50 @@ cr_hdr + cr_hdr_task cr_hdr + cr_hdr_tail +=== Shared resources (objects) + +Many resources used by tasks may be shared by more than one task (e.g. +file descriptors, memory address space, etc), or even have multiple +references from other resources (e.g. a single inode that represents +two ends of a pipe). + +Clearly, the state of shared objects need only be saved once, even if +they occur multiple times. We use a hash table (ctx->objhash) to keep +track of shared objects in the following manner. + +On the first encounter, the state is dumped and the object is assigned +a unique identifier and also stored in the hash table (indexed by its +physical kenrel address). From then on the object will be found in ...
^
|
Only true when PAGE_SIZE =3D=3D 4K and in 32bits. Perhaps like below?
#define CR_OBJHASH_BUCKET_NBITS (BITS_PER_LONG =3D=3D 64 ? 3 : 2)
#define CR_MIN_OBJHASH_NBITS ((PAGE_SHIFT - CR_OBJHASH_BUCKET_NBITS)
#define CR_OBJHASH_NBITS (CR_MIN_OBJHASH_NBITS >=3D 10 ? CR_MIN_OBJHASH_NBI=
TS : 10)
#define CR_OBJHASH_ORDER (CR_OBJHASH_NBITS + CR_OBJHASH_BUCKET_NBITS - PAGE=
_SHIFT)
Louis
--=20
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
Gah. Can't we just use the existing radix tree or hash implementations we already have? -- Dave --
AFAICS, all hash implementations work like this (I'm referring to pid hashes and compat_ioctl hashes for instance). The only common code is the hash functions, which Oren seems to have used. Radix trees may be an alternative, given that object tags are sequentially generated in some way. I have no idea about the simplicity we can gain thou= gh. Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Since FDs can be shared, they are assigned a
tag and registered in the object hash.
For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag
and its close-on-exec property. If the FD is to be saved (first time)
then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
Then will come the next FD and so on.
This patch only handles basic FDs - regular files, directories and also
symbolic links.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
checkpoint/Makefile | 3 +-
checkpoint/ckpt.h | 2 +-
checkpoint/ckpt_file.c | 234 ++++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/ckpt_file.h | 17 ++++
checkpoint/ckpt_hdr.h | 31 +++++++
5 files changed, 285 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/ckpt_file.c
create mode 100644 checkpoint/ckpt_file.h
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index cd57d9d..179175b 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,2 +1,3 @@
-obj-y += sys.o checkpoint.o restart.o objhash.o ckpt_mem.o rstr_mem.o
+obj-y += sys.o checkpoint.o restart.o objhash.o \
+ ckpt_mem.o rstr_mem.o ckpt_file.o
obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
index 8b02c4c..ef2f74d 100644
--- a/checkpoint/ckpt.h
+++ b/checkpoint/ckpt.h
@@ -13,7 +13,7 @@
#include <linux/path.h>
#include <linux/fs.h>
-#define CR_VERSION 1
+#define CR_VERSION 2
struct cr_ctx {
pid_t pid; /* container identifier */
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..18faaf1
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,234 @@
+/*
+ * Checkpoint file descriptors
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main ...Why not simply use fcheck_files(files, n) and check if the result is not NU=
You should probably use fcheck_files() instead of copying its code here. I =
agree
that this means dereferencing files->fdt a second time below, but is it so
^
^
We are not at a 64bits boundary here. Should add one __32 padding or reorde=
r the
--=20
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
Actually this is ok - note there are two fields on fourth row. I'll change to one field per row so it's easily seen. Thanks, Oren. --
Have to revise my eyes, really ;) Sorry for the noise! Louis --=20 Dr Louis Rilling Kerlabs - IRISA Skype: louis.rilling Campus Universitaire de Beaulieu Phone: (+33|0) 2 99 84 71 52 Avenue du General Leclerc Fax: (+33|0) 2 99 84 71 71 35042 Rennes CEDEX - France http://www.kerlabs.com/
Subject line should have been 's/PATCH 1/PATCH 0/' ... I left cr_debug() in for now to provide more info than pr_debug(); eventually that will be changed back to pr_debug() In the mini-conference we considered doing CR in a kernel module, but decided against because we needed a system call. It is still possible to put the bulk of the code in a module. This is useful, besides reducing debug time (recompile, unload, reload), to reduce the kernel memory footprint. Also, an administrator can load/unload the module to enable/disable this feature. Any thoughts ? Oren. --
