This patchset introduces kernel based checkpointing/restart as it is implemented in OpenVZ project. This patchset has limited functionality and are able to checkpoint/restart only single process. Recently Oren Laaden sent another kernel based implementation of checkpoint/restart. The main differences between this patchset and Oren's patchset are: * In this patchset checkpointing initiated not from the process (right now we do not have a container, only namespaces), Oren's patchset performs checkpointing from the process context. * Restart in this patchset is initiated from process, which restarts a new process (in new namespaces) with saved state. Oren's patchset uses the same process from which restart was initiated and restore saved state over it. * Checkpoint/restart functionality in this patchset is implemented as a kernel module As checkpointing is initiated not from the process which state should be saved we should freeze a process before saving its state. Right now Container Freezer from Matt Helsley can be used for this. This patchset introduce only a concept how kernel based checkpointing/restart can be implemented and are able to checkpoint/restart only a single process with simple VMAs. I've tried to split my patchset in small patches to make review more easier. --
Right now they just return -ENOSYS. Later they will provide functionality to checkpoint and restart a container. Both syscalls take as arguments a file descriptor and flags. Also sys_checkpoint take as the first argument a PID of container's init (later it will be container ID); sys_restart takes as the first argument a container ID (right now it will not be used). Signed-off-by: Andrey Mirkin <major@openvz.org> --- Makefile | 2 +- arch/x86/kernel/syscall_table_32.S | 2 + cpt/Makefile | 1 + cpt/sys_core.c | 38 ++++++++++++++++++++++++++++++++++++ include/asm-x86/unistd_32.h | 2 + 5 files changed, 44 insertions(+), 1 deletions(-) create mode 100644 cpt/Makefile create mode 100644 cpt/sys_core.c diff --git a/Makefile b/Makefile index ea413fa..1dee5c0 100644 --- a/Makefile +++ b/Makefile @@ -619,7 +619,7 @@ export mod_strip_cmd ifeq ($(KBUILD_EXTMOD),) -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ cpt/ vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \ $(core-y) $(core-m) $(drivers-y) $(drivers-m) \ diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index fd9d4f4..4a0d7fb 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -333,3 +333,5 @@ ENTRY(sys_call_table) .long sys_pipe2 .long sys_inotify_init1 .long sys_hijack + .long sys_checkpoint + .long sys_restart /* 335 */ diff --git a/cpt/Makefile b/cpt/Makefile new file mode 100644 index 0000000..2276fb1 --- /dev/null +++ b/cpt/Makefile @@ -0,0 +1 @@ +obj-y += sys_core.o diff --git a/cpt/sys_core.c b/cpt/sys_core.c new file mode 100644 index 0000000..1a97fb6 --- /dev/null +++ b/cpt/sys_core.c @@ -0,0 +1,38 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is ...
A config option CONFIG_CHECKPOINT is introduced.
New structure cpt_operations is introduced to store pointers to
checkpoint/restart functions from module.
Signed-off-by: Andrey Mirkin <major@openvz.org>
---
cpt/Kconfig | 7 +++++++
cpt/Makefile | 4 ++++
cpt/cpt.h | 19 +++++++++++++++++++
cpt/sys.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
cpt/sys_core.c | 29 +++++++++++++++++++++++++++--
init/Kconfig | 2 ++
6 files changed, 107 insertions(+), 2 deletions(-)
create mode 100644 cpt/Kconfig
create mode 100644 cpt/cpt.h
create mode 100644 cpt/sys.c
diff --git a/cpt/Kconfig b/cpt/Kconfig
new file mode 100644
index 0000000..b9bc72d
--- /dev/null
+++ b/cpt/Kconfig
@@ -0,0 +1,7 @@
+config CHECKPOINT
+ tristate "Checkpoint & restart for containers"
+ depends on EXPERIMENTAL
+ default n
+ help
+ This option adds module "cptrst", which allow to save a running
+ container to a file and restart it later using this image file.
diff --git a/cpt/Makefile b/cpt/Makefile
index 2276fb1..bfe75d5 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -1 +1,5 @@
obj-y += sys_core.o
+
+obj-$(CONFIG_CHECKPOINT) += cptrst.o
+
+cptrst-objs := sys.o
diff --git a/cpt/cpt.h b/cpt/cpt.h
new file mode 100644
index 0000000..381a9bf
--- /dev/null
+++ b/cpt/cpt.h
@@ -0,0 +1,19 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <major@openvz.org>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+struct cpt_operations
+{
+ struct module * owner;
+ int (*checkpoint)(pid_t pid, int fd, unsigned long flags);
+ int (*restart)(int ctid, int fd, unsigned long flags);
+};
+extern struct cpt_operations cpt_ops;
diff --git a/cpt/sys.c b/cpt/sys.c
new file mode 100644
index 0000000..4051286
--- /dev/null
+++ b/cpt/sys.c
@@ -0,0 ...Add functions for context allocation/destroy.
Introduce functions to read/write image.
Introduce image header and object header.
Signed-off-by: Andrey Mirkin <major@openvz.org>
---
cpt/cpt.h | 37 ++++++++++++++++
cpt/cpt_image.h | 63 +++++++++++++++++++++++++++
cpt/sys.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 224 insertions(+), 3 deletions(-)
create mode 100644 cpt/cpt_image.h
diff --git a/cpt/cpt.h b/cpt/cpt.h
index 381a9bf..607ac1b 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -10,6 +10,8 @@
*
*/
+#include "cpt_image.h"
+
struct cpt_operations
{
struct module * owner;
@@ -17,3 +19,38 @@ struct cpt_operations
int (*restart)(int ctid, int fd, unsigned long flags);
};
extern struct cpt_operations cpt_ops;
+
+#define CPT_CTX_ERROR -1
+#define CPT_CTX_IDLE 0
+#define CPT_CTX_DUMPING 1
+#define CPT_CTX_UNDUMPING 2
+
+typedef struct cpt_context
+{
+ pid_t pid; /* should be changed to ctid later */
+ int ctx_id; /* context id */
+ struct list_head ctx_list;
+ int refcount;
+ int ctx_state;
+ struct semaphore main_sem;
+
+ int errno;
+
+ struct file *file;
+ loff_t current_object;
+
+ struct list_head object_array[CPT_OBJ_MAX];
+
+ int (*write)(const void *addr, size_t count, struct cpt_context *ctx);
+ int (*read)(void *addr, size_t count, struct cpt_context *ctx);
+} cpt_context_t;
+
+extern int debug_level;
+
+#define cpt_printk(lvl, fmt, args...) do { \
+ if (lvl <= debug_level) \
+ printk(fmt, ##args); \
+ } while (0)
+
+#define eprintk(a...) cpt_printk(1, "CPT ERR: " a)
+#define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h
new file mode 100644
index 0000000..3d26229
--- /dev/null
+++ b/cpt/cpt_image.h
@@ -0,0 +1,63 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <major@openvz.org>
+ *
+ * This program is free software; you can redistribute it ...Actually right now we are going to dump only one process.
Function for dumping head of image file are added.
Signed-off-by: Andrey Mirkin <major@openvz.org>
---
cpt/Makefile | 2 +-
cpt/checkpoint.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
cpt/cpt.h | 3 ++
cpt/sys.c | 3 +-
kernel/fork.c | 2 +
5 files changed, 82 insertions(+), 2 deletions(-)
create mode 100644 cpt/checkpoint.c
diff --git a/cpt/Makefile b/cpt/Makefile
index bfe75d5..173346b 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
obj-$(CONFIG_CHECKPOINT) += cptrst.o
-cptrst-objs := sys.o
+cptrst-objs := sys.o checkpoint.o
diff --git a/cpt/checkpoint.c b/cpt/checkpoint.c
new file mode 100644
index 0000000..b4d9686
--- /dev/null
+++ b/cpt/checkpoint.c
@@ -0,0 +1,74 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <major@openvz.org>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+
+#include "cpt.h"
+
+static int cpt_write_head(struct cpt_context *ctx)
+{
+ struct cpt_head hdr;
+
+ memset(&hdr, 0, sizeof(hdr));
+ hdr.cpt_signature[0] = CPT_SIGNATURE0;
+ hdr.cpt_signature[1] = CPT_SIGNATURE1;
+ hdr.cpt_signature[2] = CPT_SIGNATURE2;
+ hdr.cpt_signature[3] = CPT_SIGNATURE3;
+ hdr.cpt_hdrlen = sizeof(hdr);
+ hdr.cpt_image_major = (LINUX_VERSION_CODE >> 16) & 0xff;
+ hdr.cpt_image_minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+ hdr.cpt_image_sublevel = (LINUX_VERSION_CODE) & 0xff;
+ hdr.cpt_image_extra = 0;
+#if defined(CONFIG_X86_32)
+ hdr.cpt_arch = CPT_ARCH_I386;
+#else
+#error Arch is not supported
+#endif
+ return ctx->write(&hdr, sizeof(hdr), ctx);
+}
+
+int dump_container(struct ...Functions to dump task struct, fpu state and registers are added.
All IDs are saved from the POV of process (container) namespace.
Signed-off-by: Andrey Mirkin <major@openvz.org>
---
cpt/Makefile | 2 +-
cpt/checkpoint.c | 2 +-
cpt/cpt.h | 1 +
cpt/cpt_image.h | 123 +++++++++++++++++++++++++++
cpt/cpt_process.c | 236 +++++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 362 insertions(+), 2 deletions(-)
create mode 100644 cpt/cpt_process.c
diff --git a/cpt/Makefile b/cpt/Makefile
index 173346b..457cc96 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
obj-$(CONFIG_CHECKPOINT) += cptrst.o
-cptrst-objs := sys.o checkpoint.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o
diff --git a/cpt/checkpoint.c b/cpt/checkpoint.c
index b4d9686..41c3523 100644
--- a/cpt/checkpoint.c
+++ b/cpt/checkpoint.c
@@ -65,7 +65,7 @@ int dump_container(struct cpt_context *ctx)
/* Dump task here */
if (!err)
- err = -ENOSYS;
+ err = cpt_dump_task(root, ctx);
out:
ctx->nsproxy = NULL;
diff --git a/cpt/cpt.h b/cpt/cpt.h
index b421a11..1bb483d 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -57,3 +57,4 @@ extern int debug_level;
#define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
int dump_container(struct cpt_context *ctx);
+int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h
index 3d26229..b7b68e1 100644
--- a/cpt/cpt_image.h
+++ b/cpt/cpt_image.h
@@ -13,6 +13,9 @@
#ifndef __CPT_IMAGE_H_
#define __CPT_IMAGE_H_ 1
+#include <linux/sched.h>
+#include <asm/segment.h>
+
enum _cpt_object_type
{
CPT_OBJ_TASK = 0,
@@ -20,6 +23,8 @@ enum _cpt_object_type
/* The objects above are stored in memory while checkpointing */
CPT_OBJ_HEAD = 1024,
+ CPT_OBJ_X86_REGS,
+ CPT_OBJ_BITS,
};
enum _cpt_content_type {
@@ -28,6 +33,8 @@ enum _cpt_content_type {
CPT_CONTENT_DATA,
CPT_CONTENT_NAME,
...Functions to dump mm struct, VMAs and mm context are added.
Signed-off-by: Andrey Mirkin <major@openvz.org>
---
arch/x86/mm/hugetlbpage.c | 2 +
cpt/Makefile | 2 +-
cpt/cpt.h | 1 +
cpt/cpt_image.h | 61 +++++++
cpt/cpt_mm.c | 431 +++++++++++++++++++++++++++++++++++++++++++++
cpt/cpt_process.c | 8 +-
mm/memory.c | 1 +
7 files changed, 497 insertions(+), 5 deletions(-)
create mode 100644 cpt/cpt_mm.c
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8f307d9..63028e7 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/sysctl.h>
+#include <linux/module.h>
#include <asm/mman.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
@@ -221,6 +222,7 @@ int pmd_huge(pmd_t pmd)
{
return !!(pmd_val(pmd) & _PAGE_PSE);
}
+EXPORT_SYMBOL(pmd_huge);
int pud_huge(pud_t pud)
{
diff --git a/cpt/Makefile b/cpt/Makefile
index 457cc96..bbb0e37 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
obj-$(CONFIG_CHECKPOINT) += cptrst.o
-cptrst-objs := sys.o checkpoint.o cpt_process.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o
diff --git a/cpt/cpt.h b/cpt/cpt.h
index 1bb483d..73ae296 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -58,3 +58,4 @@ extern int debug_level;
int dump_container(struct cpt_context *ctx);
int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
+int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h
index b7b68e1..ae019e7 100644
--- a/cpt/cpt_image.h
+++ b/cpt/cpt_image.h
@@ -16,13 +16,19 @@
#include <linux/sched.h>
#include <asm/segment.h>
+#define CPT_NULL (~0ULL)
+
enum _cpt_object_type
{
CPT_OBJ_TASK = 0,
+ CPT_OBJ_MM,
CPT_OBJ_MAX,
/* The objects above are stored in memory while ...Actually, right now this function will restart only one process.
Function to read head of dump file is introduced.
Signed-off-by: Andrey Mirkin <major@openvz.org>
---
cpt/Makefile | 2 +-
cpt/cpt.h | 1 +
cpt/restart.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
cpt/sys.c | 2 +-
4 files changed, 90 insertions(+), 2 deletions(-)
create mode 100644 cpt/restart.c
diff --git a/cpt/Makefile b/cpt/Makefile
index bbb0e37..47c7852 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
obj-$(CONFIG_CHECKPOINT) += cptrst.o
-cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o
diff --git a/cpt/cpt.h b/cpt/cpt.h
index 73ae296..6471246 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -59,3 +59,4 @@ extern int debug_level;
int dump_container(struct cpt_context *ctx);
int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
+int restart_container(struct cpt_context *ctx);
diff --git a/cpt/restart.c b/cpt/restart.c
new file mode 100644
index 0000000..5770985
--- /dev/null
+++ b/cpt/restart.c
@@ -0,0 +1,87 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <major@openvz.org>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+
+#include "cpt.h"
+#include "cpt_image.h"
+
+int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx)
+{
+ int err;
+ struct cpt_object_hdr *hdr = tmp;
+ err = ctx->read(hdr, sizeof(struct cpt_object_hdr), ctx);
+ if (err)
+ return err;
+ if (type > 0 && type != hdr->cpt_type)
+ return -EINVAL;
+ if ...Functions to restart process, restore its state, fpu and registers are added. Signed-off-by: Andrey Mirkin <major@openvz.org> --- arch/x86/kernel/entry_32.S | 21 +++ arch/x86/kernel/process_32.c | 3 + cpt/Makefile | 3 +- cpt/cpt.h | 2 + cpt/restart.c | 2 +- cpt/rst_process.c | 277 ++++++++++++++++++++++++++++++++++++++++++ kernel/sched.c | 1 + 7 files changed, 307 insertions(+), 2 deletions(-) create mode 100644 cpt/rst_process.c diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index 109792b..a4848a3 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -225,6 +225,7 @@ ENTRY(ret_from_fork) GET_THREAD_INFO(%ebp) popl %eax CFI_ADJUST_CFA_OFFSET -4 +ret_from_fork_tail: pushl $0x0202 # Reset kernel eflags CFI_ADJUST_CFA_OFFSET 4 popfl @@ -233,6 +234,26 @@ ENTRY(ret_from_fork) CFI_ENDPROC END(ret_from_fork) +ENTRY(i386_ret_from_resume) + CFI_STARTPROC + pushl %eax + CFI_ADJUST_CFA_OFFSET 4 + call schedule_tail + GET_THREAD_INFO(%ebp) + popl %eax + CFI_ADJUST_CFA_OFFSET -4 + movl (%esp), %eax + testl %eax, %eax + jz 1f + pushl %esp + call *%eax + addl $4, %esp +1: + addl $256, %esp + jmp ret_from_fork_tail + CFI_ENDPROC +END(i386_ret_from_resume) + /* * Return to user mode is not as complex as all this looks, * but we want the default path for a system call return to diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 4711eed..1bdec02 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -58,6 +58,9 @@ #include <asm/kdebug.h> asmlinkage void ret_from_fork(void) __asm__("ret_from_fork"); +EXPORT_SYMBOL(ret_from_fork); +asmlinkage void i386_ret_from_resume(void) __asm__("i386_ret_from_resume"); +EXPORT_SYMBOL(i386_ret_from_resume); DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task; ...
Functions to restore mm, VMAs and mm context are added.
Signed-off-by: Andrey Mirkin <major@openvz.org>
---
cpt/Makefile | 2 +-
cpt/cpt.h | 1 +
cpt/cpt_image.h | 5 +
cpt/rst_mm.c | 320 +++++++++++++++++++++++++++++++++++++++++++++++++++++
cpt/rst_process.c | 3 +-
mm/mmap.c | 1 +
mm/mprotect.c | 2 +
8 files changed, 336 insertions(+), 2 deletions(-)
create mode 100644 cpt/rst_mm.c
diff --git a/cpt/Makefile b/cpt/Makefile
index 689a0eb..19ca732 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -3,4 +3,4 @@ obj-y += sys_core.o
obj-$(CONFIG_CHECKPOINT) += cptrst.o
cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o \
- rst_process.o
+ rst_process.o rst_mm.o
diff --git a/cpt/cpt.h b/cpt/cpt.h
index d59255f..f2a1b28 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -62,3 +62,4 @@ int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
int restart_container(struct cpt_context *ctx);
int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx);
int rst_restart_process(struct cpt_context *ctx);
+int rst_restore_mm(struct cpt_context *ctx);
diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h
index ae019e7..3b5b418 100644
--- a/cpt/cpt_image.h
+++ b/cpt/cpt_image.h
@@ -233,6 +233,11 @@ struct cpt_x86_regs
__u32 cpt_ss;
} __attribute__ ((aligned (8)));
+static inline void __user * cpt_ptr_import(__u64 ptr)
+{
+ return (void*)(unsigned long)ptr;
+}
+
static inline __u64 cpt_timespec_export(struct timespec *tv)
{
return (((u64)tv->tv_sec) << 32) + tv->tv_nsec;
diff --git a/cpt/rst_mm.c b/cpt/rst_mm.c
new file mode 100644
index 0000000..cccbff6
--- /dev/null
+++ b/cpt/rst_mm.c
@@ -0,0 +1,320 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <major@openvz.org>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the ...This space reserved for hooks deserves some explanations IMO. I partly know= the reasons, since I have looked at OpenVZ code with you, but I'm sure that alm= ost nobody knows... Btw, the constant HOOK_RESERVE should be defined in a common place for asse= mbly Doing try_module_get(THIS_MODULE) and checking its result is almost always = the symptom of a BUG. I explain myself: If it is possible that THIS_MODULE is currently unloading, its memory (especially its text section) can be freed at any time during the call to local_kernel_thread(), which leads to a beautiful OOPS. So, either the code is buggy, or this check is useless. I agree however tha= t the module ref count must be incremented, since the new thread will run code fr= om I don't see the point of this two-steps completion. Could you explain? Thanks, Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Can we do something with naming here? resume usually means 'resume ....and you are not even consistent. Plus, rst_ will probably be understood as a reset by most people. Hmmm... your syscalls get fd. What happens if I pass something like /proc/self/maps to them? -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
IIUC, as soon as ZERO_PAGE is hit, checkpoint aborts. I doubt that many Why not down_reading mm->mmap_sem here, while it is properly down_written d= uring restart? Is it because we can rely on the tasks being frozen during checkpo= int, and not during restart? Thanks, Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
The get_task_struct() above won't pin the tsk->nsproxy though, will it? So should you be doing a rcu_read_lock(); nsproxy = get_task_nsproxy(root); rcu_read_unlock(); to make sure the nsproxy doesn't disappear out from under you? (Maybe you do it in a later patch...) thanks, --
You right here, will fix it in next version. Thanks, --
It will be more clear that ctx_state refers to it. Thanks, -- Matthieu Fertré IRISA, Campus universitaire de Beaulieu - 35 042 Rennes, France tel: +33 2 99 84 22 74 fax: +33 2 99 84 71 71 mail: mfertre@irisa.fr --
Thanks for the idea with enum, I'll fix it in next version. Regards, --
This layout looks a bit awkward for 32bits/64bits compatibility. Maybe put cpt_hdrlen before cpt_type? Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
You right here. Thanks for the catch. I will change that in next version. Thanks, Andrey --
it looks like fput(file) is done twice in checkpoint() and context_release() ? C. --
Oh, you are right. I'm sorry, I was in rush and sent an outdated version. Will resend correct patch shortly. Thanks, Andrey --
Thanks Andrey. The structure of this patchset is very nice, and I'm finding it very easy to read, at least the first half. The differences (in yours and Oren's) between checkpointing and restarting yourself or another task should probably not be important as people will want to be able to do both in the end, and I'm sure both patchsets can do both. And you have things like your own printk helpers in patch 3 that, just as in Oren's set, will have to go away. But I'm not yet able to see what the real differences are. --
Thanks for reviewing. My owen printk is just for debug purpose, I will change them to something already existing. I'm waiting for more comments from you. Thanks, --
this patchset is based on top of : git://git.kernel.org/pub/scm/linux/kernel/git/daveh/linux-2.6-lxc.git right ? C. --
Yes, it is based on Dave's git tree. Regards, Andrey --
why ? Do we really think that C/R implementations will be so different that we will need C/R ops to support all of them ? I imagine that there could be different models : 1. brute force : dump it all and kill 2. incremental 3. live migration ... But I see all of them really tied to the kernel internals. The first issues I see with this direction are some EXPORT_SYMBOL() that would OK that's integrated and Daniel's tools : http://lxc.cvs.sourceforge.net/lxc/ one more reason to work on integration :) --
Checkpoint/restart functionality is implemented as a kernel module to provide more flexibility during development process - you can easily unload module, fix what you need and load module again, you do not need to reboot you server in this case. We have discussed this question during Container's mini summit on OLS-2008 and we have found out that it will be convenient to have such an Yes, you're right :) Regards, --
It would probably be best to separate out the bits that are needed to make this feature build as a module. Since, as you say, this is only useful during development, we'll have to rip these bits out eventually to get it merged. Might as well do that from the start. -- Dave --
At the mini-summit two reasons were brought up to make it a module: 1. So sysadmins worried about security implications can completely unload the module 2. So developers can unload and reload the module while Actually I don't think we expected to use different implementations for --
er. Thank you Andrey for having highlighted the differences with Oren's approac= h, and for having split this patchset. Few remarks in reply to the patches. Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Forgot a global comment: you will probably get the same (rather pointless f= or a proof of concept, IMHO) requests as Oren got, to 1) improve coding style: a) especially avoid error handling like: err =3D foo(); if (!err) err =3D bar(); if (!err) err =3D baz(); and prefer err =3D foo(); if (err) goto foo_err; err =3D bar(); ... b) do not write conditions on a single line, like if (foo) bar; 2) put arch-dependent code in arch/ subdirs. 3) I probably forgot other ones. I obviously do not personally request you to take these requests into accou= nt ;) Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Thanks for comments. Unfortunately I have missed 2-3 weeks of discussion of Oren's patches. Right now I'm reading all threads devoted to checkpointing. I'll take into account you comments and I'll try to fix all comments in next version. Thanks, Andrey --
./scripts/checkpatch.pl catches most of it (and it does catch a few on your patches) but Dave is even better ! :) Cheers, --
I'm waiting for his comments :) Regards, --
Thanks Andrey. I'll take a close look at it by the weekend. Just a couple Actually, my patch is written to checkpoint from outside the context of the Here, too, handling of namespaces was left out in order to provide a simple proof of concept. In fact, spawning a new process and entering a new namespace can be easily done in user-space. --
So, just like the I/O controller patches, we surely can't just throw patch sets back and forth at each other. We're also sure to wear out any potential reviewers, especially on LKML. The differences you've described between this and Oren's patches are pretty small, all things considered. Would you be willing to simply help with Oren's patches or port things over from this set instead of working on a completely disjoint set? -- Dave --
Hi Andrey, I'm curious what you want to happen with this patch set. Is there something specific in Oren's set that deficient which you need implemented? Are there some technical reasons you prefer this code? -- Dave --
To be fair, and since (IIRC) the initial intent was to start with OpenVZ's approach, shouldn't Oren answer the same questions with respect to Andrey's patchset? I'm afraid that we are forgetting to take the best from both approaches... Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
I agree with Louis. I played with Oren's patchset and tryed to port it on x86_64. I was able to sys_checkpoint/sys_restart but if you remove the restoring of the general registers, the restart still works. I am not an expert on asm, but my hypothesis is when we call sys_checkpoint the registers are saved on the stack by the syscall and when we restore the memory of the process, we restore the stack and the stacked registers are restored when exiting the sys_restart. That make me feel there is an important gap between external checkpoint and internal checkpoint. Dmitry's patchset is nice too, but IMO, it goes too far from what we decided to do at the container mini-summit. I think there are a lot of design questions to be solved before going further. IMHO we should look at Dmitry patchset and merge the external checkpoint code to Oren's patchset in order to checkpoint *one* process and have the process to restart itself. At this point, we can begin to talk about the restart itself, shall we have the kernel to fork the processes to be restarted ? shall we fork from userspace and implement some mechanism to have each processes to restart themselves ? etc... --
I think you are talking about Andrey. --
This is a misconception: my patches are not "internal checkpoint". My patches are basically "external checkpoint" by design, which *also* accommodates self-checkpointing (aka internal). The same holds for the restart. The implementation is demonstrated with "self-checkpoint" to avoid complicating things at this early stage of proof-of-concept. For multiple processes all that is needed is a container and a loop on the checkpoint side, and a method to recreate processes on the restart side. Andrew suggests to do it in kernel space, I still have doubts. While I held out the multi-process part of the patch so far because I was explicitly asked to do it, it seems like this would be a good time In both approaches, processes restart themselves, in the sense that a process to be restarted eventually calls "do_restart()" (or equivalent). The only question is how processes are created. Andrew's patch creates everything inside the kernel. I would like to still give it a try outside the kernel. Everything is ready, except that we need a way to pre-select a PID for the new child... we never agreed on that one, did we ? If we go ahead with the kernel-based process creation, it's easy to merge it to the current patch-set. Oren. --
Yep, I read your patchset :)
I just want to clarify what we want to demonstrate with this patchset
for the proof-of-concept ? A self CR does not show what are the
complicate parts of the CR, we are just showing we can dump the memory
from the kernel and do setcontext/getcontext.
We state at the container mini-summit on an approach:
1. Pre-dump
2. Freeze the container
3. Dump
4. Thaw/Kill the container
5. Post-dump
We already have the freezer, and we can forget for now pre-dump and
post-dump.
IMHO, for the proof-of-concept we should do a minimal CR (like you did),
but conforming with these 5 points, but that means we have to do an
external checkpoint.
If the POC conforms with that, the patchset will be a little different
and that will show what are the difficult part for restarting a process,
especially to restart it at the frozen state :) and that will give an
A question to Andrey, do you, in OpenVZ, restart "externally" or it is
the first process of the pid namespace which calls sys_restart and then
IMHO it is too soon...
--
Right, Oren, iiuc you are insisting that 'external checkpoint' and 'multiple task checkpoint' are the same thing. But they aren't. Rather, I think that what we say is 'multiple tasks c/r' is what you say should be done from user-space :) So particularly given that your patchset seems to be in good shape, I'd like to see external checkpoint explicitly supported. Please just call me a dunce if v7 already works for that. thanks, -serge --
Then I don't explain myself clearly :) The only thing I consider doing in user space is the creation of the container, the namespaces and the processes. I argue that "external checkpoint of a single process" is very few lines of code away from "self checkpoint" that is in v7. I'm not sure how you define "external restart" ? eventually, the processes restart themselves. It is a question of how the processes It seems like you want a single process to checkpoint a single (other) process, and then a single process to start a single (other) process. I tried to explicitly avoid dealing with the container (user space ? kernel space ?) and with creating new processes (user space ? kernel space ?). Nevertheless, it's the _same_ code. All that is needed is to make the checkpoint syscall refer to the other task instead of self, and the restart should create a container and fork there, then call sys_restart. I guess instead of repeating this argument over, I will go ahead and post a patch on top of v7 to demonstrate this (without a container, therefore without preserving the original pid). Oren. --
If I ever said external restart, I actually meant external checkpoint. I was under the impression that sys_checkpoint() on some other task's Yes, as i believe you said in another email earlier today, we have not decided about how to restore the pid. Eric continues to argue for playing games with /proc/sys/kernel/pid_max. -serge --
Cedric made a patch for the external checkpoint: http://lxc.sourceforge.net/patches/2.6.27/2.6.27-rc8-lxc1/0035-enable-external-checkpo... The main difference is you will need to freeze the process because it will not block itself via a syscall (there is the freezer patchset). For the restart, perhaps you can just do a process calling sys_restart and so we delay the fork from user/kernel discussion, no ? --
In OpenVZ we are creating first task and namespaces from sys_restart. --
Yes I still prefer in-kernel. Can you elaborate on advantages of doing Yup, and i appreciate your restraint until now :) It made your patchset Can you send that out as a patch(set) on top of your v7? I'd love to see (and test) it. thanks, -serge --
Both solution are valid. Nevertheless, I would choose the solution sharing existing code and being arch independent. Now, we can start by duplicating code and see later how we could share. But for mainline inclusion, I think that code reuse is a faster path. C. --
I'm only really "supporting" Oren's patch set because he got it out way before the OpenVZ one showed up, and has kept integrating improvements into it to keep me interested. The OpenVZ approach does a few things that Oren's does not, and it *is* a bit more mature. I'm quite willing to jump camps if there's something compelling in what Andrey has posted, even though I've put quite a bit of time in to reviewing and improving Oren's. I'm looking for that "something". :) -- Dave --
Hi Dave, Right now my patchset (v2) provides an ability to checkpoint and restart a group of processes. The process of checkpointing and restart can be initiated from external process (not from the process which should be checkpointed). Also I think that all the restart job (including process forking) should be done in kernel, as in this case we will not depend on user space and will be more secure. This is also implemented in my patchset. Andrey --
Absolutely. Oren's code does it this way to make for a smaller patch at first. The syscall takes a pid argument so it is surely expected to be Do you think that this is an approach that Oren's patches are married to, or is this a "feature" we can add on later? I don't care which patch set we end up sticking in the kernel. I'm trying to figure out which code we can more easily build upon in the future. The fact that Oren's or yours can't do certain little things right now does not bother me. Honestly, I'm a little more confident that everyone can work with Oren since he managed to get 7 revisions of his patch out and make some pretty large changes while in the same time the OpenVZ patch was only released twice. I'm not sure what has changed in the OpenVZ patch between releases, either. Are there any reasons that you absolutely can not use the code Oren posted? Will it not fulfill your needs somehow? If so, could you please elaborate on how? -- Dave --
Well, AFAICS from Oren's patch set his approach is oriented on process creation in user space. I think we should choose right now what approach will be used for process creation. We have two options here: fork processes in kernel or fork them in user space. If process will be forked in user space, then there will be a gap when process will be in user space and can be killed with received signal before entering kernel. Also we will need a functionolity to create processes with predefined PID. I think it is not very good to provide such ability to user space. That That is my fault. I am working right now on my Ph.D, that is why my activity We have one major difference with Oren's code - how processes are created during restr. Right now I'm trying to port kernel process creation on top of Oren's patches. I agree that working in collaboration will speed up merging of checkpointing to mainstream. Andrey P.S.: Sorry for late reply, my mailer attached your e-mail to wrong thread. --
This is inaccurate. I intentionally did not address how processes will be created, by simply allowing either way to be added to the patch. I do agree that we probably want to decide how to do it. However, there is also room to allow for both approaches, in a compatible Why do we care about it ? Why is there a difference if it is killed before or after entering This is the weak side of creating the processes in user space - that we need such an interface. Note, however, that we can easily "hide" it inside the interface of the sys_restart() call, and restrict how it may be used. --
Yes, you right. Either way is possible with your patchset. But as I understand in ZAP you are using user space process creation. No? That is why I think that your design is more convenient for user process If one process is killed during restart then you can even do not notice that (if processes are created from user space and then call sys_restart). And you Of course we can "hide" it somehow, but anyway we will have a hole and that is not good. Anyway we should ask everyone what they think about user- and kernel- based process creation. Dave, Serge, Cedric, Daniel, Louis what do you think about that? --
Frankly, I'm not convinced (yet) that one approach is better than the other= one. I only *tend* to prefer kernel-based, for the reasons explained below. I kn= ow that there are arguments in favor of userspace (I've at least seen security-related ones), but I let their authors detail them (again). In Kerrighed this is kernel-based, and will remain kernel-based because we checkpoint a distributed task tree, and want to restart it as mush as possi= ble with the same distribution. The distributed protocol used for restart is currently too fragile and complex to rely on customized user-space implementations. That said, if someone brings very good arguments in favor = of userspace implementations, we might consider changing this. Without taking distributed restart into account, I also tend to prefer kernel-based, mainly for two (not so strong) reasons: 1) this prevents userspace from doing weird things, like changing the task = tree and let the kernel detect it and deal with the mess this creates (think abo= ut two threads being restarted in separate processes that do not even share th= eir parents). But one can argue that userspace can change the checkpoint image = as well, so that the kernel must check for such weird things anyway. 2) restart will be more efficient with respect to shared objects. Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
To me, this is one of the strongest arguments out there for doing restart as much as possible with existing user<->kernel APIs. Having the kernel detect and clean up userspace's messes is not going to work. We might as well just do things in the kernel rather than do that. What we *should* do is leverage all of the existing APIs that we already have instead of creating completely new code paths into which my butter Can you quantify this? Which objects? How much more efficient? -- Dave --
Quantify? No. I expect that investigating both approaches will show us numb= ers. Unless Oren already has some? Which objects? I think that two kinds will especially matter: objects usual= ly shared only inside a thread group (mm_struct, fs_struct, files_struct, signal_struct and sighand_struct), and individual file descriptors. The poi= nt is to avoid creating new structures before destroying them because the restart= ed task shares them with a previously restarted one. Concerning individual file descriptors, limiting the number of open files b= efore calling sys_restart() may avoid these useless creations/destructions (actua= lly the "useless" work mainly consists in managing ref counts since file descri= ptors are shared after fork()). Concerning thread-shared structures, it is probably easy for userspace to g= uess which clone flags to use when restarting threads, but 1) kernel-space will have to check that the sharing is correct anyway, and 2) kernel-space will have to fix it anyway if structures are not shared in = an obvious manner between tasks (think about A creating B with shared files_st= ruct, B creating C with shared files_struct, B unsharing its files_struct, and th= en checkpoint). So, with a userspace implementation, useless structures will be created any= way, and optimizing the common cases (regular threads) just duplicates kernel's = work of checking which shared structure to use for each task to restart. With a kernel-space implementation, all useless creations can be avoided, a= nd no duplicate work is needed. That said, numbers may show us that useless creations are not so time-consuming, but we won't know before seeing them... Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
I do have some. it's pretty quick :) see the usenix 2007 paper... all the forks in the user space will be done with CLONE_VM etc, to avoid exactly that sort of overhead. in any event, my experience is that this is not the dominant factor in the ok. that's not a lot of work :p they can also be avoided in user space - you "optimistically" create everything shared to begin with, and in the kernel (inside sys_restart) you "unshare" and create the necessary resources on demand - just like you would do with kernel based process creation. in this case, the extra work is only ref-counting, and then sys_restart will unconditionally attach the right shared resource to the restarting process (the "right" shared resource will be found, of course, in the shared pool). this way, you don't even need to check what the user gave you - you simply yes, odds are that you are right. Oren --
I'm not convinced either. I think both implementation can eventually Zap also has distributed checkpoint which does not require strict I don't really buy this argument. First, as you say, user can change the checkpoint image file. Second, you can verify in the kernel that the real relationships of the processes match those specified (and Can you elaborate on this ? In what sense "more efficient" ? Note that the topic in question is not whether to do the entire restart from user space (and I argue that most work should be done in the kernel), but rather whether process creation (and only that) should be done in kernel or user space. Quick thoughts of pros/cons of each approach are: user space: + re-use existing api (fork) + easier to debug + will allow 'handmade' resources restart: it was mentioned before that one may want to reattach stdout to a different place after restart; a user based restart of processes can make this much easier: e.g. the user process can create the alternative resources, give them to the kernel and only then call sys_restart) + arch-independent code - a bit slower than in kernel space - requires a clone-with-specific-pid syscall or interface kernel space: + a bit easier to control everything + a bit faster than user space + no need for user-visible interface for clone-with-... - arch-dependent code - needs special code to fight 'fork-bomb' So, I'm not convinced, and I even think there may be room to both, for the time being. I volunteer to support the user-space alternative while we make up our minds. Oren. --
Yes. Tasks from different nodes have parent-children, session leader, etc. relationships, and the distributed management of struct pid lifecycle is a = bit touchy too. By the way, splitting the checkpoint image in one file for each= task helps us a lot to make restart parallel, because it is more efficient for t= he file system to handle parallel reads of different files from different nodes than Yes, I hope that investigating both approaches will give us stronger argume= nts. Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
You can also make parallel restart work with the single stream, without much effort. Particularly if you store everything on the file system. In both cases, the limiting factor is shared resources - where one task cannot proceed with checkpoint because it waits for another task to first (re)create that resource. [...] Oren. --
Sure we can use a single stream, since we already share file descriptors ac= cross nodes. But the distributed synchronization of the file pointer is costly compared to having each node access different files. This way we push the parallelization bottelneck down to the file system rather than in the We just try to avoid other bottlenecks :) And besides file descriptors, sha= red resources are as common as multi-threaded programs, which are not the major= ity of the workloads we can address. Louis --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
I prefer kernel. Pretty sure Dave prefers user-space. I'd say the thing to do is push the core API that supports single-thread c/r. Then try to push the patch to do recreate processes from the kernel, and let the arguments for and against that stand on their own. -serge --
My worry is where a single sys_restart() plus in-kernel process creation takes us. In practice, what do we do? Do we single-thread the entire restore process? Or, do we do in-kernel process creation and have multiple kernel threads trying to read out of different points in the checkpoint file, trying to restore all their own states in parallel? Does that mean that we can't in practice restore from a fd like a pipe or a network socket? In the same way, if we *do* create the processes in userspace, how do we do _that_? Do we just fork() and sleep() until the kernel comes along and blows our state away? How does the kernel process doing the restoring tell userspace how many things to fork? How do we match these new userspace processes up with the ones coming out of the checkpoint process? To me, it's just way too early to talk about this stuff. Both approaches have their issues, and I'm yet to see the differences manifested in code so I can really sink my teeth into them -- Dave --
Rethinking this -- if the user wishes she can construct a suitable checkpoint image that would do exactly that. It takes more effort than using a system call, but the result is similar. What I had in mind for that special clone-with-pid is to restrict when it can be used (e.g. when the container is in a "restarting" state or something like that. [...] Oren. --
Both patchsets share the same design, namely be able to checkpoint and restart multiple processes, with the operation initiated by an external processes. I deliberately left out the part that handles multiple processes to keeps things simple for initial review, and until we decide on the I'm not convinced that creating the processes in the kernel makes it more secure. Can you elaborate ? for the discussion, let's compare these two basic scenarios: 1) container and processes are created in user space; each process calls "sys_restart()" which eventually calls "do_restart()" that does kernel-based restart. 2) container and processes are created in kernel space; each process calls "do_restart()" to do kernel-based restart. In fact, creating in user based makes it easier to enforce capabilities and limits of the user. It also simplifies the debugging significantly, and allows us to delegate the entire issue of containers and namespace management back to user space, where it probably belongs. On the other hand, doing it in kernel space likely to produce simpler code for the creation of the processes. Thanks, Oren. --
I agree that multiple process handling is not needed for initial review. But I believe that the question with process creation should be discussed right Well, in this case there will be a gap after process is returned from fork but before entering kernel. During that time process can be killed by delivered signal. Another drawback of this approach is that we will need to provide an You right here. Both approaches have pros and cons, but I think that kernel approach has more advantages Andrey --
