These patches implement checkpoint-restart [CR v3]. This version is
aimed at addressing feedback and eliminating bugs, after having added
save and restore of open files state (regular files and directories)
which makes it more usable.
Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Extend to handle (multiple) tasks in a container
- Security (without CAPS_SYS_ADMIN files restore may fail)
Changelog:
[2008-Aug-29] v3:
- Various fixes and clean-ups
- Use standard hlist_... for hash table
- Better use of standard kmalloc/kfree
[2008-Aug-20] v2:
- Added Dump and restore of open files (regular and directories)
- Added basic handling of shared objects, and improve handling of
'parent tag' concept
- Added documentation
- Improved ABI, 64bit padding for image data
- Improved locking when saving/restoring memory
- Added UTS information to header (release, version, machine)
- Cleanup extraction of filename from a file pointer
- Refactor to allow easier reviewing
- Remove requirement for CAPS_SYS_ADMIN until we come up with a
security policy (this means that file restore may fail)
- Other cleanup and response to comments for v1
[2008-Jul-29] v1:
- Initial version: support a single task with address space of only
private anonymous or file-mapped VMAs; syscalls ignore pid/crid
argument and act on current process.
--
(Dave Hansen's announcement)
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach. With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.
This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data. We were also planning on adding lots of new,
little ...Create trivial sys_checkpoint and sys_restore system calls. They will enable to checkpoint and restart an entire container, to and from a checkpoint image file descriptor. The syscalls take a file descriptor (for the image file) and flags as arguments. For sys_checkpoint the first argument identifies the target container; for sys_restart it will identify the checkpoint image. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- arch/x86/kernel/syscall_table_32.S | 2 ++ checkpoint/Kconfig | 11 +++++++++++ checkpoint/Makefile | 5 +++++ checkpoint/sys.c | 35 +++++++++++++++++++++++++++++++++++ include/asm-x86/unistd_32.h | 2 ++ include/linux/syscalls.h | 2 ++ init/Kconfig | 2 ++ kernel/sys_ni.c | 4 ++++ 8 files changed, 63 insertions(+), 0 deletions(-) create mode 100644 checkpoint/Kconfig create mode 100644 checkpoint/Makefile create mode 100644 checkpoint/sys.c diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index d44395f..5543136 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -332,3 +332,5 @@ ENTRY(sys_call_table) .long sys_dup3 /* 330 */ .long sys_pipe2 .long sys_inotify_init1 + .long sys_checkpoint + .long sys_restart diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig new file mode 100644 index 0000000..a9f22ef --- /dev/null +++ b/checkpoint/Kconfig @@ -0,0 +1,11 @@ +config CHECKPOINT_RESTART + prompt "Enable checkpoint/restart (EXPERIMENTAL)" + def_bool y + depends on X86_32 && EXPERIMENTAL + help + Application checkpoint/restart is the ability to save the + state of a running application so that it can later resume + its execution from the time at which it was checkpointed. + + Turning this option on will enable checkpoint and restart + functionality in the kernel. diff --git ...
^ | there are some spaces at the beginning of this line which makes the patch not applicable for me. This needs a fix for v4 I think. Thanks for the pachset. anyhow, I'll play with it. --
So can we compare your api to Andrey's? You've explained before that crid is used to tie together multiple calls to checkpoint, but why do you have to specify it for restart? Can't it just come from the fd? Or, the fd will be passed in seek()d to the right position for the data for this task, so the crid won't be available there? Andrey, how will the 'ctid' in your patchset be used? It sounds like it's actually going to set some integer id on the created container? We actually don't have container ids (or even containers) right now, so we probably don't want that in our api, --
I added the 'crid' inside to support a mode of operation in which we would like the checkpoint data to remain in memory across multiple system calls. Here are example scenarios: 1) We will want to reduce down time by first buffering the checkpoint image in memory, then resuming the container, and only then writing the data back to a (the) file descriptor. So instead of: freeze -> checkpoint and write back -> unfreeze We want: freeze -> checkpoint to buffer -> unfreeze -> write back I envision each of these steps to be a separate invocation of a syscall. to the 'crid' returned by the sys_checkpoint() at the 2nd step, will be used to identify that data in the 4th step. (Note, that between the unfreeze and the write-back, another checkpoint may be already taken). 2) A task may want to take a checkpoint (e.g. of itself, or a whole container) and keep that checkpoint in memory; at a later time it may want to revert to that checkpoint. Moreover, it may keep multiple such checkpoints (to where it may want to return). 'crid' tells sys_restart which one to use. Note that this 'crid' will in fact be tied to resources that are kept by the kernel - e.g. references to COW pages (when we add that). Louis suggested to use a specialized FD instead of a numeric 'crid' (that is: create a anonymous inode and a struct file that represent that checkpoint in the kernel, and return an FD to it). This approach has pros and cons of 'crid' (see the archives of the containers mailing list). For now I kept 'crid', but I'm definitely open to change it to a FD. --
Oh, so the crid identifies one checkpoint inside the file - the single -serge --
Not quite. Let me rephrase the motivation first: There are occasions when we would like to keep the checkpoint data in the kernel for some (relatively long) time, between syscalls. By "checkpoint data" I mean references to memory contents (pages) and all the other data. The two scenarios above are two examples: between the syscall to checkpoint and the syscall to unfreeze and then write-back the data to a file (first example), and for some time until a task may want to "go back in time" (second example, useful for ultra fast "undo" for a task). Note that in both cases when I say "keep in kernel" I mean before it is written to a file, or to the network. Simply in memory, in some efficient manner. Subsequent syscalls will need to refer to a specific checkpoint data that is kept in memory - e.g. to write-back to a file-descriptor, or to clean up, or to restart from it. (At any single time a specific container may have multiple checkpoints associated with it - eg. because they have not yet been written back to storage but already taken). Once the data is written back to a file descriptor, the in-kernel data can be discarded and cleaned-up. The main reason why I want to keep the data in the kernel and not instead copy to user space, is efficiency: most of the checkpoint data is the memory footprint; by keeping the data in the kernel, one can merely keep a COW reference instead of a whole copy of everything (save space and copy time). So, if we have keep data in kernel between syscalls, then we must have a way to refer to it. The current implementation uses a very simple 'crid' value to do that - although, clearly, at the moment it isn't used. I hope this explains better. --
Ah, ok. So we're either using an fd or a crid. Personally I'd then prefer two syscalls, which are wrappers around a more flexible in-kernel api. That way we can start with just sys_restart(int fd, long flags) and add sys_restart_crid(int crid, long flags) later. -serge --
'ctid' in my patchset will be used later (when we will have container infrastructure) to specify container ID. Right now we can drop this 'ctid' --
Jumping in the API thread : how will this API interact with the namespaces ? I think the exact question is how are we seeing the restart sequence ? shall we (1) restart from inside a set of pre established namespaces or (2) restore the state of the namespaces upon restart ? I think (1) is the best option in semantic, because it's closer to what the kernel does: create a directory (a container) and then fill it with files (tasks). That's how the cgroup framework works and I have the feeling we will be using this framework to build the 'super' container object. nop ? This direction has an impact on the API because the restart sequence will depend on a set of preliminary settings to create an 'empty' container which can then be used to exec() tasks or restart() tasks. This is a very different API than a magical restart() syscall creating hundreds of namespaces and zillions of tasks from scratch using an opaque binary blob. less attractive for sure but it feels more kernel friendly :) But, may be you have addressed this topic at the summit and the question is closed ? C. --
Add those interfaces, as well as helpers needed to easily manage the file format. The code is roughly broken out as follows: checkpoint/sys.c - user/kernel data transfer, as well as setup of the checkpoint/restart context (a per-checkpoint data structure for housekeeping) checkpoint/checkpoint.c - output wrappers and basic checkpoint handling checkpoint/restart.c - input wrappers and basic restart handling Patches to add the per-architecture support as well as the actual work to do the memory checkpoint follow in subsequent patches. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- Makefile | 2 +- checkpoint/Makefile | 2 +- checkpoint/checkpoint.c | 188 ++++++++++++++++++++++++++++++++++++++ checkpoint/restart.c | 189 ++++++++++++++++++++++++++++++++++++++ checkpoint/sys.c | 226 +++++++++++++++++++++++++++++++++++++++++++++- include/linux/ckpt.h | 65 +++++++++++++ include/linux/ckpt_hdr.h | 82 +++++++++++++++++ include/linux/magic.h | 3 + 8 files changed, 751 insertions(+), 6 deletions(-) create mode 100644 checkpoint/checkpoint.c create mode 100644 checkpoint/restart.c create mode 100644 include/linux/ckpt.h create mode 100644 include/linux/ckpt_hdr.h diff --git a/Makefile b/Makefile index f448e00..a558ad2 100644 --- a/Makefile +++ b/Makefile @@ -619,7 +619,7 @@ export mod_strip_cmd ifeq ($(KBUILD_EXTMOD),) -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/ vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \ $(core-y) $(core-m) $(drivers-y) $(drivers-m) \ diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 07d018b..d2df68c 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -2,4 +2,4 @@ # Makefile for linux checkpoint/restart. # -obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o ...
64bits alignment issue? I probably missed it in previous versions... --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
In the first version it was followed by two __u16's (pid and tgid)... -serge --
This really looks good to me - nothing particularly exotic, nice and simple. Dave, are you happy with the allocations here, or were you objecting to cr_hbuf_get() and cr_hbuf_put()? thanks, --
I still don't think there's really enough justification as it stands, but don't let me get in the way. If it ends up being an issue, it's pretty straightforward to rip them out or put back. The code is very nice that way. -- Dave --
(Following Dave Hansen's refactoring of the original post)
Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.
Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
arch/x86/mm/Makefile | 2 +
arch/x86/mm/checkpoint.c | 194 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/restart.c | 178 ++++++++++++++++++++++++++++++++++++++++
checkpoint/checkpoint.c | 13 +++-
checkpoint/ckpt_arch.h | 7 ++
checkpoint/restart.c | 13 +++-
include/asm-x86/ckpt_hdr.h | 72 ++++++++++++++++
include/linux/ckpt_hdr.h | 1 +
8 files changed, 478 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/mm/checkpoint.c
create mode 100644 arch/x86/mm/restart.c
create mode 100644 checkpoint/ckpt_arch.h
create mode 100644 include/asm-x86/ckpt_hdr.h
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index dfb932d..58fe072 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -22,3 +22,5 @@ endif
obj-$(CONFIG_ACPI_NUMA) += srat_$(BITS).o
obj-$(CONFIG_MEMTEST) += memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..71d21e6
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,194 @@
+/*
+ * Checkpoint/restart - architecture specific support for x86
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct ...Covers application checkpoint/restart, overall design, interfaces and checkpoint image format. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- Documentation/checkpoint.txt | 182 ++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 182 insertions(+), 0 deletions(-) create mode 100644 Documentation/checkpoint.txt diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt new file mode 100644 index 0000000..71930af --- /dev/null +++ b/Documentation/checkpoint.txt @@ -0,0 +1,182 @@ + + === Checkpoint-Restart support in the Linux kernel === + +Copyright (C) 2008 Oren Laadan + +Author: Oren Laadan <orenl@cs.columbia.edu> + +License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) +Reviewers: + +Application checkpoint/restart [CR] is the ability to save the state +of a running application so that it can later resume its execution +from the time at which it was checkpointed. An application can be +migrated by checkpointing it on one machine and restarting it on +another. CR can provide many potential benefits: + +* Failure recovery: by rolling back an to a previous checkpoint + +* Improved response time: by restarting applications from checkpoints + instead of from scratch. + +* Improved system utilization: by suspending long running CPU + intensive jobs and resuming them when load decreases. + +* Fault resilience: by migrating applications off of faulty hosts. + +* Dynamic load balancing: by migrating applications to less loaded + hosts. + +* Improved service availability and administration: by migrating + applications before host maintenance so that they continue to run + with minimal downtime + +* Time-travel: by taking periodic checkpoints and restarting from + any previous checkpoint. + + +=== Overall design + +Checkpoint and restart is done in the kernel as much as possible. The +kernel exports a relative opaque 'blob' of data to userspace which can +then be handed to the ...
Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoffset() for each VMA and then read in the data.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
arch/x86/mm/restart.c | 56 ++++++++
checkpoint/Makefile | 2 +-
checkpoint/ckpt_arch.h | 1 +
checkpoint/restart.c | 43 ++++++
checkpoint/rstr_mem.c | 351 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ckpt.h | 2 +
6 files changed, 454 insertions(+), 1 deletions(-)
create mode 100644 checkpoint/rstr_mem.c
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index d7fb89a..7c5a7d7 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -177,3 +177,59 @@ int cr_read_cpu(struct cr_ctx *ctx)
out:
return ret;
}
+
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+ struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int n, rparent;
+
+ rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+ cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+ if (rparent < 0)
+ return rparent;
+ if (rparent != parent)
+ return -EINVAL;
+
+ if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+ return -EINVAL;
+
+ /* to utilize the syscall modify_ldt() we first convert the data
+ * in the checkpoint image from 'struct desc_struct' to 'struct
+ * user_desc' with reverse logic of inclue/asm/desc.h:fill_ldt() */
+
+ for (n = 0; n < hh->nldt; n++) {
+ struct user_desc info;
+ struct desc_struct desc;
+ mm_segment_t old_fs;
+ int ret;
+
+ ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+ if (ret < 0)
+ return ret;
+
+ info.entry_number = n;
+ info.base_addr = desc.base0 | (desc.base1 << 16);
+ info.limit = desc.limit0;
+ info.seg_32bit = desc.d;
+ info.contents = ...Wouldn't it just be better to save the checkpoint image in the format Do you find it any easier to read this as: nr = npages; if (nr > pgarr->nleft) nr = pgarr->nleft; I'm having some difficulty parsing this function. Could we s/data/contents/ in the function name? It also looks like addrs is being used like an array here. Can we use it explicitly that way? I'd also like to see it called vaddr or something explicit about what kinds "cr_vma_writable" is a question to me. This needs to be "cr_vma_make_writable" or something to indicate that it is modifying the Kill the unlikely(), please. It's unnecessary and tends to make things slower when not used correctly. Can you please check all the future Is this to fixup the same things that setup_arg_pages() uses it for? We Looking at this code, I can now tell that we need to be more explicit about what nr_pages is. Is it the nr_pages that the vma spans, The english here is a bit funky. I think this needs to be cr_vma_read_page_addrs(). The other way makes it sound like you're Where did this sucker get allocated? This is going to be a bit difficult to audit since it isn't allocated and freed (or released) at the same level. Seems like it would be much nicer if it was allocated Ugh. Is this a security hole? What if the user was not allowed to write to the file being mmap()'d by this VMA? Is this a window where someone could come in and (using ptrace or something similar) write to the file? We copy into the process address space all the time when not in its Man, that's hard to parse. Could you break that up a little bit to make I'd probably just make a local vm_start to make these all consistent and Why the heck is this here? Isn't this a fresh mm? We shouldn't have to do this unless we had a VMA here previously. Maybe it would be more The other approach would be to do something more analogous to exec(): create the entire new mm and switch to it in the end. -- Dave --
Because the syscall accepts a different format (user_desc) than it gives back (desc_struct, which is the kernel format), conversion must occur, either during checkpoint or during restart. I prefer in restart: because checkpoint is more performance critical, because it allows - in the future - to directly insert the data bypassing the syscall (like openvz), and because we may need conversions anyway in This is needed for a VMA that has modified pages but is read-only: e.g. a task modifies memory then calls mprotect() to make it read-only. Since during restart we will recreate this VMA as read-only, we need to temporarily make it read-write to be able to read the saved contents into it, and then restore the read-only protection. setup_arg_pages() is unrelated, and the code doesn't seem to have much in nr_pages is the number of pages _saved_ for this VMA, that is, the number of pages to follow to read in. If it's zero, we don't read anything (the VMA is had no dirty pages). Otherwise, we need to read that many addresses The name is misleading, should be: cr_pgarr_reset(). What happens is that we allocate the chain of cr_pgarr on demand. Once done with one MM, we reset the state and reuse the same chain until we need to expand it. Not a security hole: this is only for private memory, so it never modifies the underlying file. This is related to what I explained before about read-only VMAs that have modified pages. The process is restarting, inside a container that is restarting. All tasks inside should be calling sys_restart() (by design) and no other process from outside should be allowed to ptrace them at this point. (In any case, if some other tasks ptraces this task, it can make it do Good point, thanks. [...] Oren. --
I'd suggest not storing virtual addresses in 'unsigned long'. It's passable when you're doing lots of arithmetic on the addresses, but that isn't happening here. That probably means cascading back and changing Have you looked at mprotect_fixup()? It deals with two things: 1. altering the commit charge against RSS if the mapping is actually writable. 2. Merging the VMA with an adjacent one if possible We don't want to do either of these two things. Even if we do merge the VMA, it will be a waste of time and energy since we'll just re-split it Right, but that doesn't come out of the code easily. I'm asking you to please change those variable names to make it easier to figure out what those things are used for. You, as the author, have a good grip but OK, so a shared, read-only mmap() should never get into this code path. What if an attacker modified the checkpoint file to pretend to have No. I'm suggesting that since this lets you effectively write to something that is not writable, it may be a hole with which to bypass I'm just saying that you don't need to be in a process's context in order to copy contents into its virtual address space. Check out access_process_vm(). -- Dave --
Your observation is correct; I chose this interface because it's really simple and handy. I'm not worried about the performance because such VMAs (read only but modified) are really rare, and the code can be optimized later on. VMAs of shared maps (IPC, anonymous shared) will be treated differently. VMAs of shared files (mapped shared) are saved without their contents, as the contents remains available on the file system ! (yes, for that we will eventually need file system snapshots). As for an attack that provides an altered checkpoint image: since we (currently) don't escalate privileges, the attacker will not be able Once we get positive responses about the current patchset, the next step is to handle multiple processes: I plan to extend the freezer That's a good comment, but here all we are doing here is to modify a That would be the other way to implement the restart. But, since restart executes in task's context, it's simpler and more efficient to leverage copy-to-user(). In terms of security, both methods brings about the same end results: the memory is modified (perhaps bypassing the read-only property of the VMA) Oren. --
The worry is that it will never get cleaned up, and it is basically cruft as it stands. People may think that it is here protecting or I bugged Serge about this. He said that this, at least, bypasses the SE Linux checks that are normally done with an mprotect() system call. That's a larger design problem that we need to keep in mind: we need to OK, but I just asked you why a ptrace() of a process during this elevated privilege operation couldn't potentially do something bad. You responded that, by design, we can't ptrace things. The design is all well and good, but the patch isn't, because it doesn't implement that But copy_to_user() is fundamentally different. It writes *over* contents and in to files. Simulating a fault fills in those pages, but it never writes over things or in to files. Faulting is fundamentally safer. Faulting today can also handle populating a memory area with pages that appear read-only via userspace. That's exactly what we're doing here as well. Anyway, I don't expect that you'll agree with this. I'll prototype doing it the other way at some point and we can compare how both look. -- Dave --
Here's the restore side of my cleanups on top of the v4 patches and the
one against 4/9 I just sent.
This purely makes it compile again.
---
linux-2.6.git-dave/checkpoint/ckpt_mem.c | 2 --
linux-2.6.git-dave/checkpoint/ckpt_mem.h | 4 ++++
linux-2.6.git-dave/checkpoint/rstr_mem.c | 30 +++++++++++++++++++-----------
3 files changed, 23 insertions(+), 13 deletions(-)
diff -puN checkpoint/rstr_mem.c~p5-dave checkpoint/rstr_mem.c
--- linux-2.6.git/checkpoint/rstr_mem.c~p5-dave 2008-09-10 14:51:26.000000000 -0700
+++ linux-2.6.git-dave/checkpoint/rstr_mem.c 2008-09-10 14:58:53.000000000 -0700
@@ -31,27 +31,35 @@
* read in directly to the address space of the current process
*/
+static int pgarr_nr_free(struct cr_pgarr *pgarr)
+{
+ return CR_PGARR_TOTAL - pgarr->nr_used;
+}
+
/**
- * cr_vma_read_pages_vaddrs - read addresses of pages to page-array chain
+ * cr_vma_restore_contents - read addresses of pages to page-array chain
* @ctx - restart context
* @npages - number of pages
*/
-static int cr_vma_read_pages_vaddrs(struct cr_ctx *ctx, int npages)
+static int cr_vma_restore_contents(struct cr_ctx *ctx, int pages_to_read)
{
struct cr_pgarr *pgarr;
int nr, ret;
- while (npages) {
- pgarr = cr_pgarr_prep(ctx);
+ while (pages_to_read) {
+ unsigned long *vaddr_position;
+ pgarr = cr_get_empty_pgarr(ctx);
if (!pgarr)
return -ENOMEM;
- nr = min(npages, (int) pgarr->nr_free);
- ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
+ nr = pages_to_read;
+ if (nr > pgarr_nr_free(pgarr))
+ nr = pgarr_nr_free(pgarr);
+ vaddr_position = &pgarr->vaddrs[pgarr->nr_used];
+ ret = cr_kread(ctx, vaddr_position, nr * sizeof(unsigned long));
if (ret < 0)
return ret;
- pgarr->nr_free -= nr;
pgarr->nr_used += nr;
- npages -= nr;
+ pages_to_read -= nr;
}
return 0;
}
@@ -67,7 +75,7 @@ static int cr_vma_read_pages_contents(st
unsigned long *vaddrs;
int i, ret;
...Let me start with the bottom line - since this creates too much confusion, I'll just switch to the alternative: will use get_user_pages() to bring pages in and copy the data directly. Hopefully this will end the discussion. (Note, there there is a performance penalty in the form of extra data copy: instead of reading data directly to the page, we instead read into a buffer, I also discussed this with Serge, and I got the impression that he agreed that there was no security issue because it was all and only If a task is ptraced, then the tracer can easily arrange for the tracee to call mprotect(), or to call sys_restart() with a tampered checkpoint file, or do other tricks. The call to mprotect_fix(), on a private vma, does not make this any worse. That is why I didn't bother implementing copy_to_user() does not write into a file with private VMAs. copy_to_user() in our case will always trigger a page fault. Back to bottom line - whether or not I agree - I already changed the code to use get_user_pages() and got rid of this controversy. Oren. --
We will want the checks there eventually. Now it's going to require new selinux policy to deal with new denials, so in other words we're basically going to be adding checks which people will be required to work their way around :) But we'll still need checks, because it is bypassing selinux checks, and just because you need to bypass them to be able to do restart (or run lisp) doesn't mean we can just drop the checks. -serge --
Yep but, as we discussed on IRC, this code needs some optimization for pages in swap, anyway. It isn't optimal for those, either. So, for this we'll leave it at a minimal amount of code rather than maximal Yep, as long as there are some sanity checks to make sure that we're not I completely agree that it isn't an issue on a private VMA. My only concern is if this is done to any shared memory or could potentially be abused in such a way that it gets applied to shared memory. -- Dave --
For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it will be followed by the file name. The cr_vma->npages will tell
how many pages were dumped for this VMA. Then it will be followed
by the actual data: first a dump of the addresses of all dumped
pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next VMA and so on.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
arch/x86/mm/checkpoint.c | 30 ++++
arch/x86/mm/restart.c | 1 +
checkpoint/Makefile | 3 +-
checkpoint/checkpoint.c | 53 ++++++
checkpoint/ckpt_arch.h | 1 +
checkpoint/ckpt_mem.c | 409 ++++++++++++++++++++++++++++++++++++++++++++
checkpoint/ckpt_mem.h | 30 ++++
checkpoint/sys.c | 19 ++-
include/asm-x86/ckpt_hdr.h | 5 +
include/linux/ckpt.h | 9 +-
include/linux/ckpt_hdr.h | 30 ++++
11 files changed, 582 insertions(+), 8 deletions(-)
create mode 100644 checkpoint/ckpt_mem.c
create mode 100644 checkpoint/ckpt_mem.h
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 71d21e6..50cfd29 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -192,3 +192,33 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
cr_hbuf_put(ctx, sizeof(*hh));
return ret;
}
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+ struct cr_hdr h;
+ struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int ret;
+
+ h.type = CR_HDR_MM_CONTEXT;
+ h.len = sizeof(*hh);
+ h.parent = parent;
+
+ mutex_lock(&mm->context.lock);
+
+ hh->ldt_entry_size = LDT_ENTRY_SIZE;
+ hh->nldt = mm->context.size;
+
+ cr_debug("nldt %d\n", hh->nldt);
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ if (ret < 0)
+ return ret;
+
+ ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * ...What we effectively have here is:
void *addrs[CR_PGARR_TOTAL];
void *pages[CR_PGARR_TOTAL];
right?
Would any of this get simpler if we just had:
struct cr_page {
struct page *page;
unsigned long vaddr;
};
struct cr_pgarr {
struct cr_page *cr_pages;
struct cr_pgarr *next;
unsigned short nleft;
unsigned short nused;
};
Also, we do have lots of linked list implementations in the kernel.
They do lots of fun stuff like poisoning and checking for
initialization. We should use them instead of rolling our own. It lets
us do other fun stuff like list_for_each().
Also, just looking at this structure 'nleft' and 'nused' sound a bit
redundant. I know from looking at the code that this is how many have
been filled and read back at restore time, but that is not very obvious
looking at the structure. I think we can do a bit better in the
structure itself.
The length of the arrays is fixed at compile-time, right? Should we
just make that explicit as well?
-- Dave
--
The reason I use separate arrays instead of an array of tuples is that the logic is to write all vaddr at once - simply by dumping the array The length of the array may be tunable, or even adaptive (e.g. based on statistics from recent checkpoints), in the future. Oren. --
I'm not sure "tuning" it to make those arrays longer than PAGE_SIZE will ever be a good idea. Seems like we should just keep the structures at PAGE_SIZE to me. -- Dave --
Infrastructure to handle objects that may be shared and referenced by multiple tasks or other objects, e..g open files, memory address space etc. The state of shared objects is saved once. On the first encounter, the state is dumped and the object is assigned a unique identifier and also stored in a hash table (indexed by its physical kenrel address). From then on the object will be found in the hash and only its identifier is saved. On restart the identifier is looked up in the hash table; if not found then the state is read, the object is created, and added to the hash table (this time indexed by its identifier). Otherwise, the object in the hash table is used. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- Documentation/checkpoint.txt | 44 +++++++++ checkpoint/Makefile | 2 +- checkpoint/objhash.c | 205 ++++++++++++++++++++++++++++++++++++++++++ checkpoint/sys.c | 4 + include/linux/ckpt.h | 18 ++++ 5 files changed, 272 insertions(+), 1 deletions(-) create mode 100644 checkpoint/objhash.c diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt index 71930af..18725e6 100644 --- a/Documentation/checkpoint.txt +++ b/Documentation/checkpoint.txt @@ -163,6 +163,50 @@ cr_hdr + cr_hdr_task cr_hdr + cr_hdr_tail +=== Shared resources (objects) + +Many resources used by tasks may be shared by more than one task (e.g. +file descriptors, memory address space, etc), or even have multiple +references from other resources (e.g. a single inode that represents +two ends of a pipe). + +Clearly, the state of shared objects need only be saved once, even if +they occur multiple times. We use a hash table (ctx->objhash) to keep +track of shared objects in the following manner. + +On the first encounter, the state is dumped and the object is assigned +a unique identifier and also stored in the hash table (indexed by its +physical kenrel address). From then on the object will be found in ...
Why -1? This makes a total number of 512 entries, which will break below wi= th access to entries 0..1023 Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Ugh !!! Was fixed and tested, but sneaked back in :( Thanks for spotting, will resend the patchset. --
Once you get to the point of putting function prototypes in Documentation/, it's probably a good time to start using kerneldocs. :) -- Dave --
Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent' and lookup tag in the hash table; if not found (first occurence), read in 'struct cr_hdr_fd_data', create a new FD and register in the hash. Otherwise attach the file pointer from the hash as an FD. This patch only handles basic FDs - regular files, directories and also symbolic links. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- checkpoint/Makefile | 2 +- checkpoint/restart.c | 4 + checkpoint/rstr_file.c | 205 ++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/ckpt.h | 1 + 4 files changed, 211 insertions(+), 1 deletions(-) create mode 100644 checkpoint/rstr_file.c diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 7496695..88bbc10 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -3,4 +3,4 @@ # obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \ - ckpt_mem.o rstr_mem.o ckpt_file.o + ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o diff --git a/checkpoint/restart.c b/checkpoint/restart.c index f8c919d..bc49523 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -212,6 +212,10 @@ static int cr_read_task(struct cr_ctx *ctx) cr_debug("memory: ret %d\n", ret); if (ret < 0) goto out; + ret = cr_read_files(ctx); + cr_debug("files: ret %d\n", ret); + if (ret < 0) + goto out; ret = cr_read_thread(ctx); cr_debug("thread: ret %d\n", ret); if (ret < 0) diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c new file mode 100644 index 0000000..56f4f38 --- /dev/null +++ b/checkpoint/rstr_file.c @@ -0,0 +1,205 @@ +/* + * Checkpoint file descriptors + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <linux/kernel.h> +#include <linux/sched.h> +#include ...
Dump the files_struct of a task with 'struct cr_hdr_files', followed by all open file descriptors. Since FDs can be shared, they are assigned a tag and registered in the object hash. For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag and its close-on-exec property. If the FD is to be saved (first time) then this is followed by a 'struct cr_hdr_fd_data' with the FD state. Then will come the next FD and so on. This patch only handles basic FDs - regular files, directories and also symbolic links. Signed-off-by: Oren Laadan <orenl@cs.columbia.edu> --- checkpoint/Makefile | 2 +- checkpoint/checkpoint.c | 4 + checkpoint/ckpt_file.c | 224 ++++++++++++++++++++++++++++++++++++++++++++++ checkpoint/ckpt_file.h | 17 ++++ include/linux/ckpt.h | 7 +- include/linux/ckpt_hdr.h | 34 +++++++- 6 files changed, 283 insertions(+), 5 deletions(-) create mode 100644 checkpoint/ckpt_file.c create mode 100644 checkpoint/ckpt_file.h diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 9843fb9..7496695 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -3,4 +3,4 @@ # obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \ - ckpt_mem.o rstr_mem.o + ckpt_mem.o rstr_mem.o ckpt_file.o diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index 4dae775..aebbf22 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -217,6 +217,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t) cr_debug("memory: ret %d\n", ret); if (ret < 0) goto out; + ret = cr_write_files(ctx, t); + cr_debug("files: ret %d\n", ret); + if (ret < 0) + goto out; ret = cr_write_thread(ctx, t); cr_debug("thread: ret %d\n", ret); if (ret < 0) diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c new file mode 100644 index 0000000..34df371 --- /dev/null +++ b/checkpoint/ckpt_file.c @@ -0,0 +1,224 @@ +/* + * Checkpoint file ...
fdlist =3D krealloc(fdlist, max, GFP_KERNEL)? Sorry, I should have suggested this in my first review. --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
That's a good point; I did it this way to be paranoid, even though the the checkpointee is supposed to be frozen (e.g., if the checkpointee is forcefully killed by, say, OOM, and it's fdt->max_fds goes to zero. But now I notice that check_files() already tests for this. I'm not sure it makes the code simpler, but I'll fix that. --
Have you tried emailing these to yourself and trying to apply them? That's usually a good way to iron out these whitespace issues. I think this set is still munged. -- Dave --
That loop needs some love. At least save us from one level of My gut also says that there has to be a better way to find a good size for fdlist() than growing it this way. Why do we even have a fixed size for this? Why a BUG_ON()? We'll deref it in just a sec anyway. We prefer to just Is there a plan to save off the 'struct user' here instead? Nested user Why don't we just store (and use) (inode->i_mode & S_IFMT) in fd_type Since the kernel always seems to make fds integers, it would make sense to me to store them as integers in the checkpoint image. Why bother to This if() block is in the normal flow path of the function and should go at the top indentation level. You can just do this: if (ret < 0) goto out; // if block contents here... -- Dave --
Can you suggest a better way to find the open files of a task ? Either I loop twice (loop to count, then allocate, then loop to fill), Of course. Eventually, 'struct user' will be another shared object that There will be others that cannot be inferred from inode->i_mode, e.g. CR_FD_FILE_UNLINKED, CR_FD_DIR_UNLINKED, CR_FD_SOCK_UNIX, [...] Oren. --
I'd suggest the double loop. I think it is much more straightforward I think we have a basically different philosophy on these. I'd say don't define them unless absolutely necessary. The less you spell out explicitly, the more flexibility you have in the future, and the less code you spend doing simple conversions. Anyway, I see why you're doing it this way. -- Dave --
