Re: [Devel] Re: [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore

Previous thread: none

Next thread: [PATCH 1/2] bcm5974-0.63: Finger counting improved by Henrik Rydberg on Wednesday, September 3, 2008 - 5:01 am. (1 message)
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

This patchset introduces kernel based checkpointing/restart as it is
implemented in OpenVZ project. This patchset has limited functionality and
are able to checkpoint/restart only single process. Recently Oren Laaden
sent another kernel based implementation of checkpoint/restart. The main
differences between this patchset and Oren's patchset are:

* In this patchset checkpointing initiated not from the process
(right now we do not have a container, only namespaces), Oren's patchset
performs checkpointing from the process context.

* Restart in this patchset is initiated from process, which restarts a new
process (in new namespaces) with saved state. Oren's patchset uses the same
process from which restart was initiated and restore saved state over it.

* Checkpoint/restart functionality in this patchset is implemented as a kernel
module


As checkpointing is initiated not from the process which state should be saved
we should freeze a process before saving its state. Right now Container Freezer
from Matt Helsley can be used for this.

This patchset introduce only a concept how kernel based checkpointing/restart
can be implemented and are able to checkpoint/restart only a single process
with simple VMAs. 

I've tried to split my patchset in small patches to make review more easier.
--

From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

Right now they just return -ENOSYS. Later they will provide functionality
to checkpoint and restart a container.

Both syscalls take as arguments a file descriptor and flags.
Also sys_checkpoint take as the first argument a PID of container's init
(later it will be container ID); sys_restart takes as the first argument
a container ID (right now it will not be used).

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 Makefile                           |    2 +-
 arch/x86/kernel/syscall_table_32.S |    2 +
 cpt/Makefile                       |    1 +
 cpt/sys_core.c                     |   38 ++++++++++++++++++++++++++++++++++++
 include/asm-x86/unistd_32.h        |    2 +
 5 files changed, 44 insertions(+), 1 deletions(-)
 create mode 100644 cpt/Makefile
 create mode 100644 cpt/sys_core.c

diff --git a/Makefile b/Makefile
index ea413fa..1dee5c0 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ cpt/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index fd9d4f4..4a0d7fb 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -333,3 +333,5 @@ ENTRY(sys_call_table)
 	.long sys_pipe2
 	.long sys_inotify_init1
 	.long sys_hijack
+	.long sys_checkpoint
+	.long sys_restart		/* 335 */
diff --git a/cpt/Makefile b/cpt/Makefile
new file mode 100644
index 0000000..2276fb1
--- /dev/null
+++ b/cpt/Makefile
@@ -0,0 +1 @@
+obj-y += sys_core.o
diff --git a/cpt/sys_core.c b/cpt/sys_core.c
new file mode 100644
index 0000000..1a97fb6
--- /dev/null
+++ b/cpt/sys_core.c
@@ -0,0 +1,38 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is ...
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

A config option CONFIG_CHECKPOINT is introduced.
New structure cpt_operations is introduced to store pointers to
checkpoint/restart functions from module.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 cpt/Kconfig    |    7 +++++++
 cpt/Makefile   |    4 ++++
 cpt/cpt.h      |   19 +++++++++++++++++++
 cpt/sys.c      |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 cpt/sys_core.c |   29 +++++++++++++++++++++++++++--
 init/Kconfig   |    2 ++
 6 files changed, 107 insertions(+), 2 deletions(-)
 create mode 100644 cpt/Kconfig
 create mode 100644 cpt/cpt.h
 create mode 100644 cpt/sys.c

diff --git a/cpt/Kconfig b/cpt/Kconfig
new file mode 100644
index 0000000..b9bc72d
--- /dev/null
+++ b/cpt/Kconfig
@@ -0,0 +1,7 @@
+config CHECKPOINT
+	tristate "Checkpoint & restart for containers"
+	depends on EXPERIMENTAL
+	default n
+	help
+	  This option adds module "cptrst", which allow to save a running
+	  container to a file and restart it later using this image file.
diff --git a/cpt/Makefile b/cpt/Makefile
index 2276fb1..bfe75d5 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -1 +1,5 @@
 obj-y += sys_core.o
+
+obj-$(CONFIG_CHECKPOINT) += cptrst.o
+
+cptrst-objs := sys.o
diff --git a/cpt/cpt.h b/cpt/cpt.h
new file mode 100644
index 0000000..381a9bf
--- /dev/null
+++ b/cpt/cpt.h
@@ -0,0 +1,19 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+struct cpt_operations
+{
+	struct module * owner;
+	int (*checkpoint)(pid_t pid, int fd, unsigned long flags);
+	int (*restart)(int ctid, int fd, unsigned long flags);
+};
+extern struct cpt_operations cpt_ops;
diff --git a/cpt/sys.c b/cpt/sys.c
new file mode 100644
index 0000000..4051286
--- /dev/null
+++ b/cpt/sys.c
@@ -0,0 ...
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

Add functions for context allocation/destroy.
Introduce functions to read/write image.
Introduce image header and object header.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 cpt/cpt.h       |   37 ++++++++++++++++
 cpt/cpt_image.h |   63 +++++++++++++++++++++++++++
 cpt/sys.c       |  127 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 224 insertions(+), 3 deletions(-)
 create mode 100644 cpt/cpt_image.h

diff --git a/cpt/cpt.h b/cpt/cpt.h
index 381a9bf..607ac1b 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -10,6 +10,8 @@
  *
  */
 
+#include "cpt_image.h"
+
 struct cpt_operations
 {
 	struct module * owner;
@@ -17,3 +19,38 @@ struct cpt_operations
 	int (*restart)(int ctid, int fd, unsigned long flags);
 };
 extern struct cpt_operations cpt_ops;
+
+#define CPT_CTX_ERROR           -1
+#define CPT_CTX_IDLE            0
+#define CPT_CTX_DUMPING         1
+#define CPT_CTX_UNDUMPING       2
+
+typedef struct cpt_context
+{
+	pid_t		pid;		/* should be changed to ctid later */
+	int		ctx_id;		/* context id */
+	struct list_head ctx_list;
+	int		refcount;
+	int		ctx_state;
+	struct semaphore main_sem;
+
+	int		errno;
+
+	struct file	*file;
+	loff_t		current_object;
+
+	struct list_head object_array[CPT_OBJ_MAX];
+
+	int		(*write)(const void *addr, size_t count, struct cpt_context *ctx);
+	int		(*read)(void *addr, size_t count, struct cpt_context *ctx);
+} cpt_context_t;
+
+extern int debug_level;
+
+#define cpt_printk(lvl, fmt, args...)	do {	\
+		if (lvl <= debug_level)		\
+			printk(fmt, ##args);	\
+	} while (0)
+
+#define eprintk(a...) cpt_printk(1, "CPT ERR: " a)
+#define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h
new file mode 100644
index 0000000..3d26229
--- /dev/null
+++ b/cpt/cpt_image.h
@@ -0,0 +1,63 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it ...
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

Actually right now we are going to dump only one process.
Function for dumping head of image file are added.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 cpt/Makefile     |    2 +-
 cpt/checkpoint.c |   74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 cpt/cpt.h        |    3 ++
 cpt/sys.c        |    3 +-
 kernel/fork.c    |    2 +
 5 files changed, 82 insertions(+), 2 deletions(-)
 create mode 100644 cpt/checkpoint.c

diff --git a/cpt/Makefile b/cpt/Makefile
index bfe75d5..173346b 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
 
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
-cptrst-objs := sys.o
+cptrst-objs := sys.o checkpoint.o
diff --git a/cpt/checkpoint.c b/cpt/checkpoint.c
new file mode 100644
index 0000000..b4d9686
--- /dev/null
+++ b/cpt/checkpoint.c
@@ -0,0 +1,74 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+
+#include "cpt.h"
+
+static int cpt_write_head(struct cpt_context *ctx)
+{
+	struct cpt_head hdr;
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.cpt_signature[0] = CPT_SIGNATURE0;
+	hdr.cpt_signature[1] = CPT_SIGNATURE1;
+	hdr.cpt_signature[2] = CPT_SIGNATURE2;
+	hdr.cpt_signature[3] = CPT_SIGNATURE3;
+	hdr.cpt_hdrlen = sizeof(hdr);
+	hdr.cpt_image_major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hdr.cpt_image_minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hdr.cpt_image_sublevel = (LINUX_VERSION_CODE) & 0xff;
+	hdr.cpt_image_extra = 0;
+#if defined(CONFIG_X86_32)
+	hdr.cpt_arch = CPT_ARCH_I386;
+#else
+#error  Arch is not supported
+#endif
+	return ctx->write(&hdr, sizeof(hdr), ctx);
+}
+
+int dump_container(struct ...
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

Functions to dump task struct, fpu state and registers are added.
All IDs are saved from the POV of process (container) namespace.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 cpt/Makefile      |    2 +-
 cpt/checkpoint.c  |    2 +-
 cpt/cpt.h         |    1 +
 cpt/cpt_image.h   |  123 +++++++++++++++++++++++++++
 cpt/cpt_process.c |  236 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 362 insertions(+), 2 deletions(-)
 create mode 100644 cpt/cpt_process.c

diff --git a/cpt/Makefile b/cpt/Makefile
index 173346b..457cc96 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
 
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
-cptrst-objs := sys.o checkpoint.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o
diff --git a/cpt/checkpoint.c b/cpt/checkpoint.c
index b4d9686..41c3523 100644
--- a/cpt/checkpoint.c
+++ b/cpt/checkpoint.c
@@ -65,7 +65,7 @@ int dump_container(struct cpt_context *ctx)
 	
 	/* Dump task here */
 	if (!err)
-		err = -ENOSYS;	
+		err = cpt_dump_task(root, ctx);
 
 out:
 	ctx->nsproxy = NULL;
diff --git a/cpt/cpt.h b/cpt/cpt.h
index b421a11..1bb483d 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -57,3 +57,4 @@ extern int debug_level;
 #define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
 
 int dump_container(struct cpt_context *ctx);
+int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h
index 3d26229..b7b68e1 100644
--- a/cpt/cpt_image.h
+++ b/cpt/cpt_image.h
@@ -13,6 +13,9 @@
 #ifndef __CPT_IMAGE_H_
 #define __CPT_IMAGE_H_ 1
 
+#include <linux/sched.h>
+#include <asm/segment.h>
+
 enum _cpt_object_type
 {
 	CPT_OBJ_TASK = 0,
@@ -20,6 +23,8 @@ enum _cpt_object_type
 	/* The objects above are stored in memory while checkpointing */
 
 	CPT_OBJ_HEAD = 1024,
+	CPT_OBJ_X86_REGS,
+	CPT_OBJ_BITS,
 };
 
 enum _cpt_content_type {
@@ -28,6 +33,8 @@ enum _cpt_content_type {
 	CPT_CONTENT_DATA,
 	CPT_CONTENT_NAME,
 ...
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

Functions to dump mm struct, VMAs and mm context are added.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 arch/x86/mm/hugetlbpage.c |    2 +
 cpt/Makefile              |    2 +-
 cpt/cpt.h                 |    1 +
 cpt/cpt_image.h           |   61 +++++++
 cpt/cpt_mm.c              |  431 +++++++++++++++++++++++++++++++++++++++++++++
 cpt/cpt_process.c         |    8 +-
 mm/memory.c               |    1 +
 7 files changed, 497 insertions(+), 5 deletions(-)
 create mode 100644 cpt/cpt_mm.c

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8f307d9..63028e7 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
+#include <linux/module.h>
 #include <asm/mman.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
@@ -221,6 +222,7 @@ int pmd_huge(pmd_t pmd)
 {
 	return !!(pmd_val(pmd) & _PAGE_PSE);
 }
+EXPORT_SYMBOL(pmd_huge);
 
 int pud_huge(pud_t pud)
 {
diff --git a/cpt/Makefile b/cpt/Makefile
index 457cc96..bbb0e37 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
 
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
-cptrst-objs := sys.o checkpoint.o cpt_process.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o
diff --git a/cpt/cpt.h b/cpt/cpt.h
index 1bb483d..73ae296 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -58,3 +58,4 @@ extern int debug_level;
 
 int dump_container(struct cpt_context *ctx);
 int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
+int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h
index b7b68e1..ae019e7 100644
--- a/cpt/cpt_image.h
+++ b/cpt/cpt_image.h
@@ -16,13 +16,19 @@
 #include <linux/sched.h>
 #include <asm/segment.h>
 
+#define CPT_NULL (~0ULL)
+
 enum _cpt_object_type
 {
 	CPT_OBJ_TASK = 0,
+	CPT_OBJ_MM,
 	CPT_OBJ_MAX,
 	/* The objects above are stored in memory while ...
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

Actually, right now this function will restart only one process.
Function to read head of dump file is introduced.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 cpt/Makefile  |    2 +-
 cpt/cpt.h     |    1 +
 cpt/restart.c |   87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 cpt/sys.c     |    2 +-
 4 files changed, 90 insertions(+), 2 deletions(-)
 create mode 100644 cpt/restart.c

diff --git a/cpt/Makefile b/cpt/Makefile
index bbb0e37..47c7852 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o
 
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
-cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o
diff --git a/cpt/cpt.h b/cpt/cpt.h
index 73ae296..6471246 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -59,3 +59,4 @@ extern int debug_level;
 int dump_container(struct cpt_context *ctx);
 int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
 int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
+int restart_container(struct cpt_context *ctx);
diff --git a/cpt/restart.c b/cpt/restart.c
new file mode 100644
index 0000000..5770985
--- /dev/null
+++ b/cpt/restart.c
@@ -0,0 +1,87 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+
+#include "cpt.h"
+#include "cpt_image.h"
+
+int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx)
+{
+	int err;
+	struct cpt_object_hdr *hdr = tmp;
+	err = ctx->read(hdr, sizeof(struct cpt_object_hdr), ctx);
+	if (err)
+		return err;
+	if (type > 0 && type != hdr->cpt_type)
+		return -EINVAL;
+	if ...
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

Functions to restart process, restore its state, fpu and registers are added.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 arch/x86/kernel/entry_32.S   |   21 +++
 arch/x86/kernel/process_32.c |    3 +
 cpt/Makefile                 |    3 +-
 cpt/cpt.h                    |    2 +
 cpt/restart.c                |    2 +-
 cpt/rst_process.c            |  277 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c               |    1 +
 7 files changed, 307 insertions(+), 2 deletions(-)
 create mode 100644 cpt/rst_process.c

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 109792b..a4848a3 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
 	GET_THREAD_INFO(%ebp)
 	popl %eax
 	CFI_ADJUST_CFA_OFFSET -4
+ret_from_fork_tail:
 	pushl $0x0202			# Reset kernel eflags
 	CFI_ADJUST_CFA_OFFSET 4
 	popfl
@@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
 	CFI_ENDPROC
 END(ret_from_fork)
 
+ENTRY(i386_ret_from_resume)
+	CFI_STARTPROC
+	pushl %eax
+	CFI_ADJUST_CFA_OFFSET 4
+	call schedule_tail
+	GET_THREAD_INFO(%ebp)
+	popl %eax
+	CFI_ADJUST_CFA_OFFSET -4
+	movl (%esp), %eax
+	testl %eax, %eax
+	jz    1f
+	pushl %esp
+	call  *%eax
+	addl  $4, %esp
+1:
+	addl  $256, %esp
+	jmp   ret_from_fork_tail
+	CFI_ENDPROC
+END(i386_ret_from_resume)
+
 /*
  * Return to user mode is not as complex as all this looks,
  * but we want the default path for a system call return to
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4711eed..1bdec02 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -58,6 +58,9 @@
 #include <asm/kdebug.h>
 
 asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
+EXPORT_SYMBOL(ret_from_fork);
+asmlinkage void i386_ret_from_resume(void) __asm__("i386_ret_from_resume");
+EXPORT_SYMBOL(i386_ret_from_resume);
 
 DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
 ...
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 3:57 am

Functions to restore mm, VMAs and mm context are added.

Signed-off-by: Andrey Mirkin <major@openvz.org>
---
 cpt/Makefile      |    2 +-
 cpt/cpt.h         |    1 +
 cpt/cpt_image.h   |    5 +
 cpt/rst_mm.c      |  320 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 cpt/rst_process.c |    3 +-
 mm/mmap.c         |    1 +
 mm/mprotect.c     |    2 +
 8 files changed, 336 insertions(+), 2 deletions(-)
 create mode 100644 cpt/rst_mm.c

diff --git a/cpt/Makefile b/cpt/Makefile
index 689a0eb..19ca732 100644
--- a/cpt/Makefile
+++ b/cpt/Makefile
@@ -3,4 +3,4 @@ obj-y += sys_core.o
 obj-$(CONFIG_CHECKPOINT) += cptrst.o
 
 cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o \
-	       rst_process.o
+	       rst_process.o rst_mm.o
diff --git a/cpt/cpt.h b/cpt/cpt.h
index d59255f..f2a1b28 100644
--- a/cpt/cpt.h
+++ b/cpt/cpt.h
@@ -62,3 +62,4 @@ int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
 int restart_container(struct cpt_context *ctx);
 int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx);
 int rst_restart_process(struct cpt_context *ctx);
+int rst_restore_mm(struct cpt_context *ctx);
diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h
index ae019e7..3b5b418 100644
--- a/cpt/cpt_image.h
+++ b/cpt/cpt_image.h
@@ -233,6 +233,11 @@ struct cpt_x86_regs
 	__u32	cpt_ss;
 } __attribute__ ((aligned (8)));
 
+static inline void __user * cpt_ptr_import(__u64 ptr)
+{
+	return (void*)(unsigned long)ptr;
+}
+
 static inline __u64 cpt_timespec_export(struct timespec *tv)
 {
 	return (((u64)tv->tv_sec) << 32) + tv->tv_nsec;
diff --git a/cpt/rst_mm.c b/cpt/rst_mm.c
new file mode 100644
index 0000000..cccbff6
--- /dev/null
+++ b/cpt/rst_mm.c
@@ -0,0 +1,320 @@
+/*
+ *  Copyright (C) 2008 Parallels, Inc.
+ *
+ *  Author: Andrey Mirkin <major@openvz.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the ...
From: Louis Rilling
Date: Wednesday, September 3, 2008 - 7:32 am

This space reserved for hooks deserves some explanations IMO. I partly know=
 the
reasons, since I have looked at OpenVZ code with you, but I'm sure that alm=
ost
nobody knows...

Btw, the constant HOOK_RESERVE should be defined in a common place for asse=
mbly


Doing try_module_get(THIS_MODULE) and checking its result is almost always =
the
symptom of a BUG. I explain myself:
If it is possible that THIS_MODULE is currently unloading, its memory
(especially its text section) can be freed at any time during the call to
local_kernel_thread(), which leads to a beautiful OOPS.

So, either the code is buggy, or this check is useless. I agree however tha=
t the
module ref count must be incremented, since the new thread will run code fr=
om


I don't see the point of this two-steps completion. Could you explain?

Thanks,

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Pavel Machek
Date: Saturday, September 13, 2008 - 10:34 am

Can we do something with naming here? resume usually means 'resume

....and you are not even consistent. Plus, rst_ will probably be
understood as a reset by most people.

Hmmm... your syscalls get fd. What happens if I pass something like
/proc/self/maps to them?


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Louis Rilling
Date: Wednesday, September 3, 2008 - 7:17 am

IIUC, as soon as ZERO_PAGE is hit, checkpoint aborts. I doubt that many

Why not down_reading mm->mmap_sem here, while it is properly down_written d=
uring
restart? Is it because we can rely on the tasks being frozen during checkpo=
int,
and not during restart?

Thanks,

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Serge E. Hallyn
Date: Wednesday, September 3, 2008 - 7:23 am

The get_task_struct() above won't pin the tsk->nsproxy
though, will it?  So should you be doing a
	rcu_read_lock();
	nsproxy = get_task_nsproxy(root);
	rcu_read_unlock();
to make sure the nsproxy doesn't disappear out from under
you?

(Maybe you do it in a later patch...)

thanks,
--

From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 7:45 am

You right here, will fix it in next version.

Thanks,
--

From: Matthieu Fertré
Date: Wednesday, September 3, 2008 - 5:29 am

It will be more clear that ctx_state refers to it.

Thanks,



-- 
Matthieu Fertré

IRISA, Campus universitaire de Beaulieu - 35 042 Rennes, France
tel: +33 2 99 84 22 74
fax: +33 2 99 84 71 71
mail: mfertre@irisa.fr

--

From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 7:11 am

Thanks for the idea with enum, I'll fix it in next version.

Regards,
--

From: Louis Rilling
Date: Wednesday, September 3, 2008 - 6:56 am

This layout looks a bit awkward for 32bits/64bits compatibility. Maybe put
cpt_hdrlen before cpt_type?

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 7:07 am

You right here. Thanks for the catch. I will change that in next version.

Thanks,
Andrey
--

From: Cedric Le Goater
Date: Wednesday, September 3, 2008 - 7:13 am

it looks like fput(file) is done twice in checkpoint() and context_release() ?

C.
--

From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 7:29 am

Oh, you are right. I'm sorry, I was in rush and sent an outdated version.
Will resend correct patch shortly.

Thanks,
Andrey
--

From: Serge E. Hallyn
Date: Wednesday, September 3, 2008 - 7:27 am

Thanks Andrey.  The structure of this patchset is very nice, and I'm
finding it very easy to read, at least the first half.

The differences (in yours and Oren's) between checkpointing and
restarting yourself or another task should probably not be important
as people will want to be able to do both in the end, and I'm sure
both patchsets can do both.  And you have things like your own
printk helpers in patch 3 that, just as in Oren's set, will have to
go away.  But I'm not yet able to see what the real differences are.

--

From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 7:51 am

Thanks for reviewing. My owen printk is just for debug purpose, I will change 
them to something already existing.
I'm waiting for more comments from you.

Thanks,
--

From: Cedric Le Goater
Date: Wednesday, September 3, 2008 - 4:44 am

this patchset is based on top of :

git://git.kernel.org/pub/scm/linux/kernel/git/daveh/linux-2.6-lxc.git

right ?

C.
--

From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 6:05 am

Yes, it is based on Dave's git tree.

Regards,
Andrey
--

From: Cedric Le Goater
Date: Wednesday, September 3, 2008 - 5:28 am

why ? Do we really think that C/R implementations will be so different that
we will need C/R ops to support all of them ? I imagine that there could be 
different models :

	1. brute force : dump it all and kill
	2. incremental
	3. live migration 
	...

But I see all of them really tied to the kernel internals.

The first issues I see with this direction are some EXPORT_SYMBOL() that would 

OK that's integrated and Daniel's tools :

http://lxc.cvs.sourceforge.net/lxc/

one more reason to work on integration :)


--

From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 6:59 am

Checkpoint/restart functionality is implemented as a kernel module to provide 
more flexibility during development process - you can easily unload module, 
fix what you need and load module again, you do not need to reboot you server 
in this case. We have discussed this question during Container's mini summit 
on OLS-2008 and we have found out that it will be convenient to have such an 

Yes, you're right :)

Regards,
--

From: Dave Hansen
Date: Thursday, September 4, 2008 - 3:55 pm

It would probably be best to separate out the bits that are needed to
make this feature build as a module.  Since, as you say, this is only
useful during development, we'll have to rip these bits out eventually
to get it merged.  Might as well do that from the start.  

-- Dave

--

From: Serge E. Hallyn
Date: Wednesday, September 3, 2008 - 7:18 am

At the mini-summit two reasons were brought up to make it a module:

	1. So sysadmins worried about security implications can
	completely unload the module

	2. So developers can unload and reload the module while

Actually I don't think we expected to use different implementations for
--

From: Louis Rilling
Date: Wednesday, September 3, 2008 - 6:49 am

er.

Thank you Andrey for having highlighted the differences with Oren's approac=
h,
and for having split this patchset. Few remarks in reply to the patches.

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Louis Rilling
Date: Wednesday, September 3, 2008 - 7:06 am

Forgot a global comment: you will probably get the same (rather pointless f=
or a
proof of concept, IMHO) requests as Oren got, to
1) improve coding style:
  a) especially avoid error handling like:
	err =3D foo();
	if (!err)
		err =3D bar();
	if (!err)
		err =3D baz();

and prefer
	err =3D foo();
	if (err)
		goto foo_err;
	err =3D bar();
	...

  b) do not write conditions on a single line, like
	if (foo) bar;

2) put arch-dependent code in arch/ subdirs.
3) I probably forgot other ones.

I obviously do not personally request you to take these requests into accou=
nt ;)

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 7:19 am

Thanks for comments.
Unfortunately I have missed 2-3 weeks of discussion of Oren's patches.
Right now I'm reading all threads devoted to checkpointing.
I'll take into account you comments and I'll try to fix all comments in next 
version.

Thanks,
Andrey
--

From: Cedric Le Goater
Date: Wednesday, September 3, 2008 - 7:26 am

./scripts/checkpatch.pl catches most of it (and it does catch a few on your 
patches)

but Dave is even better ! :) 

Cheers, 


--

From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 7:53 am

I'm waiting for his comments :)

Regards,
--

From: Oren Laadan
Date: Thursday, September 4, 2008 - 1:14 am

Thanks Andrey. I'll take a close look at it by the weekend. Just a couple

Actually, my patch is written to checkpoint from outside the context of the

Here, too, handling of namespaces was left out in order to provide a simple
proof of concept. In fact, spawning a new process and entering a new namespace
can be easily done in user-space.

--

From: Dave Hansen
Date: Thursday, September 4, 2008 - 7:05 am

So, just like the I/O controller patches, we surely can't just throw
patch sets back and forth at each other.  We're also sure to wear out
any potential reviewers, especially on LKML.

The differences you've described between this and Oren's patches are
pretty small, all things considered.  Would you be willing to simply
help with Oren's patches or port things over from this set instead of
working on a completely disjoint set?

-- Dave

--

From: Dave Hansen
Date: Friday, October 17, 2008 - 4:33 pm

Hi Andrey,

I'm curious what you want to happen with this patch set.  Is there
something specific in Oren's set that deficient which you need
implemented?  Are there some technical reasons you prefer this code?

-- Dave

--

From: Louis Rilling
Date: Monday, October 20, 2008 - 4:10 am

To be fair, and since (IIRC) the initial intent was to start with OpenVZ's
approach, shouldn't Oren answer the same questions with respect to Andrey's
patchset?

I'm afraid that we are forgetting to take the best from both approaches...

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Daniel Lezcano
Date: Monday, October 20, 2008 - 6:25 am

I agree with Louis.

I played with Oren's patchset and tryed to port it on x86_64. I was able 
to sys_checkpoint/sys_restart but if you remove the restoring of the 
general registers, the restart still works. I am not an expert on asm, 
but my hypothesis is when we call sys_checkpoint the registers are saved 
on the stack by the syscall and when we restore the memory of the 
process, we restore the stack and the stacked registers are restored 
when exiting the sys_restart. That make me feel there is an important 
gap between external checkpoint and internal checkpoint.

Dmitry's patchset is nice too, but IMO, it goes too far from what we 
decided to do at the container mini-summit. I think there are a lot of 
design questions to be solved before going further.

IMHO we should look at Dmitry patchset and merge the external checkpoint 
code to Oren's patchset in order to checkpoint *one* process and have 
the process to restart itself. At this point, we can begin to talk about 
the restart itself, shall we have the kernel to fork the processes to be 
restarted ? shall we fork from userspace and implement some mechanism to 
have each processes to restart themselves ? etc...








--

From: Cedric Le Goater
Date: Monday, October 20, 2008 - 6:48 am

I think you are talking about Andrey.

--

From: Daniel Lezcano
Date: Monday, October 20, 2008 - 6:49 am

Yes :)
--

From: Oren Laadan
Date: Monday, October 20, 2008 - 8:53 am

This is a misconception: my patches are not "internal checkpoint". My
patches are basically "external checkpoint" by design, which *also*
accommodates self-checkpointing (aka internal). The same holds for the
restart. The implementation is demonstrated with "self-checkpoint" to
avoid complicating things at this early stage of proof-of-concept.

For multiple processes all that is needed is a container and a loop
on the checkpoint side, and a method to recreate processes on the
restart side. Andrew suggests to do it in kernel space, I still have
doubts.

While I held out the multi-process part of the patch so far because I
was explicitly asked to do it, it seems like this would be a good time

In both approaches, processes restart themselves, in the sense that a
process to be restarted eventually calls "do_restart()" (or equivalent).

The only question is how processes are created. Andrew's patch creates
everything inside the kernel. I would like to still give it a try outside
the kernel. Everything is ready, except that we need a way to pre-select
a PID for the new child... we never agreed on that one, did we ?

If we go ahead with the kernel-based process creation, it's easy to merge
it to the current patch-set.

Oren.

--

From: Daniel Lezcano
Date: Monday, October 20, 2008 - 9:37 am

Yep, I read your patchset :)

I just want to clarify what we want to demonstrate with this patchset 
for the proof-of-concept ? A self CR does not show what are the 
complicate parts of the CR, we are just showing we can dump the memory 
from the kernel and do setcontext/getcontext.

We state at the container mini-summit on an approach:

    1. Pre-dump
    2. Freeze the container
    3. Dump
    4. Thaw/Kill the container
    5. Post-dump

We already have the freezer, and we can forget for now pre-dump and 
post-dump.

IMHO, for the proof-of-concept we should do a minimal CR (like you did), 
but conforming with these 5 points, but that means we have to do an 
external checkpoint.

If the POC conforms with that, the patchset will be a little different 
and that will show what are the difficult part for restarting a process, 
especially to restart it at the frozen state :) and that will give an 

A question to Andrey, do you, in OpenVZ, restart "externally" or it is 
the first process of the pid namespace which calls sys_restart and then 

IMHO it is too soon...

--

From: Serge E. Hallyn
Date: Monday, October 20, 2008 - 10:23 am

Right, Oren, iiuc you are insisting that 'external checkpoint' and
'multiple task checkpoint' are the same thing.  But they aren't.
Rather, I think that what we say is 'multiple tasks c/r' is what you say
should be done from user-space :)

So particularly given that your patchset seems to be in good shape,
I'd like to see external checkpoint explicitly supported.  Please
just call me a dunce if v7 already works for that.

thanks,
-serge
--

From: Oren Laadan
Date: Monday, October 20, 2008 - 5:18 pm

Then I don't explain myself clearly :)

The only thing I consider doing in user space is the creation of
the container, the namespaces and the processes.

I argue that "external checkpoint of a single process" is very few
lines of code away from "self checkpoint" that is in v7.

I'm not sure how you define "external restart" ?  eventually, the
processes restart themselves. It is a question of how the processes

It seems like you want a single process to checkpoint a single (other)
process, and then a single process to start a single (other) process.

I tried to explicitly avoid dealing with the container (user space ?
kernel space ?) and with creating new processes (user space ? kernel
space ?).

Nevertheless, it's the _same_ code. All that is needed is to make the
checkpoint syscall refer to the other task instead of self, and the
restart should create a container and fork there, then call sys_restart.

I guess instead of repeating this argument over, I will go ahead and
post a patch on top of v7 to demonstrate this (without a container,
therefore without preserving the original pid).

Oren.

--

From: Serge E. Hallyn
Date: Monday, October 20, 2008 - 5:58 pm

If I ever said external restart, I actually meant external checkpoint.



I was under the impression that sys_checkpoint() on some other task's


Yes, as i believe you said in another email earlier today, we have not
decided about how to restore the pid.  Eric continues to argue for
playing games with /proc/sys/kernel/pid_max.

-serge
--

From: Daniel Lezcano
Date: Tuesday, October 21, 2008 - 6:24 am

Cedric made a patch for the external checkpoint:

http://lxc.sourceforge.net/patches/2.6.27/2.6.27-rc8-lxc1/0035-enable-external-checkpo...

The main difference is you will need to freeze the process because it 
will not block itself via a syscall (there is the freezer patchset).

For the restart, perhaps you can just do a process calling sys_restart 
and so we delay the fork from user/kernel discussion, no ?

--

From: Andrey Mirkin
Date: Monday, October 27, 2008 - 7:45 am

In OpenVZ we are creating first task and namespaces from sys_restart.

--

From: Serge E. Hallyn
Date: Monday, October 20, 2008 - 9:51 am

Yes I still prefer in-kernel.  Can you elaborate on advantages of doing

Yup, and i appreciate your restraint until now :)  It made your patchset

Can you send that out as a patch(set) on top of your v7?  I'd love to
see (and test) it.

thanks,
-serge
--

From: Cedric Le Goater
Date: Tuesday, October 21, 2008 - 2:36 am

Both solution are valid. Nevertheless, I would choose the solution 
sharing existing code and being arch independent.

Now, we can start by duplicating code and see later how we could
share. But for mainline inclusion, I think that code reuse is a
faster path.

C.
--

From: Dave Hansen
Date: Monday, October 20, 2008 - 9:36 am

I'm only really "supporting" Oren's patch set because he got it out way
before the OpenVZ one showed up, and has kept integrating improvements
into it to keep me interested.  The OpenVZ approach does a few things
that Oren's does not, and it *is* a bit more mature.

I'm quite willing to jump camps if there's something compelling in what
Andrey has posted, even though I've put quite a bit of time in to
reviewing and improving Oren's.  I'm looking for that "something". :)

-- Dave

--

From: Andrey Mirkin
Date: Monday, October 20, 2008 - 5:14 am

Hi Dave,

Right now my patchset (v2) provides an ability to checkpoint and restart a 
group of processes. The process of checkpointing and restart can be initiated 
from external process (not from the process which should be checkpointed).
Also I think that all the restart job (including process forking) should be 
done in kernel, as in this case we will not depend on user space and will be 
more secure. This is also implemented in my patchset.

Andrey
--

From: Dave Hansen
Date: Monday, October 20, 2008 - 8:55 am

Absolutely.  Oren's code does it this way to make for a smaller patch at
first.  The syscall takes a pid argument so it is surely expected to be

Do you think that this is an approach that Oren's patches are married
to, or is this a "feature" we can add on later?

I don't care which patch set we end up sticking in the kernel.  I'm
trying to figure out which code we can more easily build upon in the
future.  The fact that Oren's or yours can't do certain little things
right now does not bother me.

Honestly, I'm a little more confident that everyone can work with Oren
since he managed to get 7 revisions of his patch out and make some
pretty large changes while in the same time the OpenVZ patch was only
released twice.  I'm not sure what has changed in the OpenVZ patch
between releases, either.

Are there any reasons that you absolutely can not use the code Oren
posted?  Will it not fulfill your needs somehow?  If so, could you
please elaborate on how?

-- Dave

--

From: Andrey Mirkin
Date: Monday, October 27, 2008 - 7:07 am

Well, AFAICS from Oren's patch set his approach is oriented on process 
creation in user space. I think we should choose right now what approach will 
be used for process creation.
We have two options here: fork processes in kernel or fork them in user space.
If process will be forked in user space, then there will be a gap when process 
will be in user space and can be killed with received signal before entering 
kernel. Also we will need a functionolity to create processes with predefined 
PID. I think it is not very good to provide such ability to user space. That 

That is my fault. I am working right now on my Ph.D, that is why my activity 

We have one major difference with Oren's code - how processes are created 
during restr.
Right now I'm trying to port kernel process creation on top of Oren's patches.
I agree that working in collaboration will speed up merging of checkpointing 
to mainstream.

Andrey

P.S.: Sorry for late reply, my mailer attached your e-mail to wrong thread.
--

From: Oren Laadan
Date: Monday, October 27, 2008 - 7:39 am

This is inaccurate.

I intentionally did not address how processes will be created, by
simply allowing either way to be added to the patch.

I do agree that we probably want to decide how to do it. However,
there is also room to allow for both approaches, in a compatible

Why do we care about it ?
Why is there a difference if it is killed before or after entering

This is the weak side of creating the processes in user space -
that we need such an interface. Note, however, that we can
easily "hide" it inside the interface of the sys_restart() call,
and restrict how it may be used.

--

From: Andrey Mirkin
Date: Wednesday, October 29, 2008 - 11:02 pm

Yes, you right. Either way is possible with your patchset. But as I understand 
in ZAP you are using user space process creation. No?
That is why I think that your design is more convenient for user process 


If one process is killed during restart then you can even do not notice that 
(if processes are created from user space and then call sys_restart). And you 

Of course we can "hide" it somehow, but anyway we will have a hole and that is 
not good.

Anyway we should ask everyone what they think about user- and kernel- based 
process creation.
Dave, Serge, Cedric, Daniel, Louis what do you think about that?

--

From: Louis Rilling
Date: Thursday, October 30, 2008 - 4:47 am

Frankly, I'm not convinced (yet) that one approach is better than the other=
 one.
I only *tend* to prefer kernel-based, for the reasons explained below. I kn=
ow
that there are arguments in favor of userspace (I've at least seen
security-related ones), but I let their authors detail them (again).

In Kerrighed this is kernel-based, and will remain kernel-based because we
checkpoint a distributed task tree, and want to restart it as mush as possi=
ble
with the same distribution. The distributed protocol used for restart is
currently too fragile and complex to rely on customized user-space
implementations. That said, if someone brings very good arguments in favor =
of
userspace implementations, we might consider changing this.

Without taking distributed restart into account, I also tend to prefer
kernel-based, mainly for two (not so strong) reasons:
1) this prevents userspace from doing weird things, like changing the task =
tree
and let the kernel detect it and deal with the mess this creates (think abo=
ut
two threads being restarted in separate processes that do not even share th=
eir
parents). But one can argue that userspace can change the checkpoint image =
as
well, so that the kernel must check for such weird things anyway.
2) restart will be more efficient with respect to shared objects.

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Dave Hansen
Date: Thursday, October 30, 2008 - 10:08 am

To me, this is one of the strongest arguments out there for doing
restart as much as possible with existing user<->kernel APIs.  Having
the kernel detect and clean up userspace's messes is not going to work.
We might as well just do things in the kernel rather than do that.

What we *should* do is leverage all of the existing APIs that we already
have instead of creating completely new code paths into which my butter

Can you quantify this?  Which objects?  How much more efficient?

-- Dave

--

From: Louis Rilling
Date: Thursday, October 30, 2008 - 11:01 am

Quantify? No. I expect that investigating both approaches will show us numb=
ers.
Unless Oren already has some?

Which objects? I think that two kinds will especially matter: objects usual=
ly
shared only inside a thread group (mm_struct, fs_struct, files_struct,
signal_struct and sighand_struct), and individual file descriptors. The poi=
nt is
to avoid creating new structures before destroying them because the restart=
ed
task shares them with a previously restarted one.

Concerning individual file descriptors, limiting the number of open files b=
efore
calling sys_restart() may avoid these useless creations/destructions (actua=
lly
the "useless" work mainly consists in managing ref counts since file descri=
ptors
are shared after fork()).

Concerning thread-shared structures, it is probably easy for userspace to g=
uess
which clone flags to use when restarting threads, but
1) kernel-space will have to check that the sharing is correct anyway, and
2) kernel-space will have to fix it anyway if structures are not shared in =
an
obvious manner between tasks (think about A creating B with shared files_st=
ruct,
B creating C with shared files_struct, B unsharing its files_struct, and th=
en
checkpoint).

So, with a userspace implementation, useless structures will be created any=
way,
and optimizing the common cases (regular threads) just duplicates kernel's =
work
of checking which shared structure to use for each task to restart.
With a kernel-space implementation, all useless creations can be avoided, a=
nd no
duplicate work is needed.

That said, numbers may show us that useless creations are not so
time-consuming, but we won't know before seeing them...

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Oren Laadan
Date: Thursday, October 30, 2008 - 11:28 am

I do have some. it's pretty quick :)  see the usenix 2007 paper...

all the forks in the user space will be done with CLONE_VM etc, to avoid
exactly that sort of overhead.

in any event, my experience is that this is not the dominant factor in the

ok. that's not a lot of work :p

they can also be avoided in user space - you "optimistically" create everything
shared to begin with, and in the kernel (inside sys_restart) you "unshare" and
create the necessary resources on demand - just like you would do with kernel
based process creation.

in this case, the extra work is only ref-counting, and then sys_restart will
unconditionally attach the right shared resource to the restarting process
(the "right" shared resource will be found, of course, in the shared pool).

this way, you don't even need to check what the user gave you - you simply

yes, odds are that you are right.

Oren

--

From: Oren Laadan
Date: Thursday, October 30, 2008 - 10:45 am

I'm not convinced either. I think both implementation can eventually

Zap also has distributed checkpoint which does not require strict

I don't really buy this argument. First, as you say, user can change
the checkpoint image file. Second, you can verify in the kernel that
the real relationships of the processes match those specified (and

Can you elaborate on this ?  In what sense "more efficient" ?

Note that the topic in question is not whether to do the entire restart
from user space (and I argue that most work should be done in the kernel),
but rather whether process creation (and only that) should be done in
kernel or user space.

Quick thoughts of pros/cons of each approach are:

user space:

+ re-use existing api (fork)
+ easier to debug
+ will allow 'handmade' resources restart: it was mentioned before that
  one may want to reattach stdout to a different place after restart; a
  user based restart of processes can make this much easier: e.g. the
  user process can create the alternative resources, give them to the
  kernel and only then call sys_restart)
+ arch-independent code

- a bit slower than in kernel space
- requires a clone-with-specific-pid syscall or interface

kernel space:

+ a bit easier to control everything
+ a bit faster than user space
+ no need for user-visible interface for clone-with-...

- arch-dependent code
- needs special code to fight 'fork-bomb'

So, I'm not convinced, and I even think there may be room to both, for
the time being. I volunteer to support the user-space alternative while
we make up our minds.

Oren.

--

From: Louis Rilling
Date: Thursday, October 30, 2008 - 11:14 am

Yes. Tasks from different nodes have parent-children, session leader, etc.
relationships, and the distributed management of struct pid lifecycle is a =
bit
touchy too. By the way, splitting the checkpoint image in one file for each=
 task
helps us a lot to make restart parallel, because it is more efficient for t=
he file
system to handle parallel reads of different files from different nodes than


Yes, I hope that investigating both approaches will give us stronger argume=
nts.

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Oren Laadan
Date: Thursday, October 30, 2008 - 11:32 am

You can also make parallel restart work with the single stream, without
much effort. Particularly if you store everything on the file system.

In both cases, the limiting factor is shared resources - where one task
cannot proceed with checkpoint because it waits for another task to first
(re)create that resource.

[...]

Oren.

--

From: Louis Rilling
Date: Friday, October 31, 2008 - 3:37 am

Sure we can use a single stream, since we already share file descriptors ac=
cross
nodes. But the distributed synchronization of the file pointer is costly
compared to having each node access different files. This way we push the
parallelization bottelneck down to the file system rather than in the

We just try to avoid other bottlenecks :) And besides file descriptors, sha=
red
resources are as common as multi-threaded programs, which are not the major=
ity
of the workloads we can address.

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Serge E. Hallyn
Date: Thursday, October 30, 2008 - 7:08 am

I prefer kernel.  Pretty sure Dave prefers user-space.

I'd say the thing to do is push the core API that supports single-thread
c/r.  Then try to push the patch to do recreate processes from the
kernel, and let the arguments for and against that stand on their own.

-serge
--

From: Dave Hansen
Date: Thursday, October 30, 2008 - 10:03 am

My worry is where a single sys_restart() plus in-kernel process creation
takes us.

In practice, what do we do?  Do we single-thread the entire restore
process?  Or, do we do in-kernel process creation and have multiple
kernel threads trying to read out of different points in the checkpoint
file, trying to restore all their own states in parallel?  Does that
mean that we can't in practice restore from a fd like a pipe or a
network socket?

In the same way, if we *do* create the processes in userspace, how do we
do _that_?  Do we just fork() and sleep() until the kernel comes along
and blows our state away?  How does the kernel process doing the
restoring tell userspace how many things to fork?  How do we match these
new userspace processes up with the ones coming out of the checkpoint
process?

To me, it's just way too early to talk about this stuff.  Both
approaches have their issues, and I'm yet to see the differences
manifested in code so I can really sink my teeth into them

-- Dave

--

From: Oren Laadan
Date: Monday, November 3, 2008 - 12:35 pm

Rethinking this -- if the user wishes she can construct a suitable
checkpoint image that would do exactly that. It takes more effort than
using a system call, but the result is similar.

What I had in mind for that special clone-with-pid is to restrict when
it can be used (e.g. when the container is in a "restarting" state or
something like that.

[...]

Oren.



--

From: Oren Laadan
Date: Monday, October 20, 2008 - 10:17 am

Both patchsets share the same design, namely be able to checkpoint and
restart multiple processes, with the operation initiated by an external
processes.

I deliberately left out the part that handles multiple processes to
keeps things simple for initial review, and until we decide on the

I'm not convinced that creating the processes in the kernel makes it
more secure.  Can you elaborate ?  for the discussion, let's compare
these two basic scenarios:

1) container and processes are created in user space; each process
calls "sys_restart()" which eventually calls "do_restart()" that
does kernel-based restart.

2) container and processes are created in kernel space; each process
calls "do_restart()" to do kernel-based restart.

In fact, creating in user based makes it easier to enforce capabilities
and limits of the user. It also simplifies the debugging significantly,
and allows us to delegate the entire issue of containers and namespace
management back to user space, where it probably belongs.

On the other hand, doing it in kernel space likely to produce simpler
code for the creation of the processes.

Thanks,

Oren.

--

From: Andrey Mirkin
Date: Monday, October 27, 2008 - 7:38 am

I agree that multiple process handling is not needed for initial review. But I 
believe that the question with process creation should be discussed right 

Well, in this case there will be a gap after process is returned from fork but 
before entering kernel. During that time process can be killed by delivered 
signal. Another drawback of this approach is that we will need to provide an 

You right here. Both approaches have pros and cons, but I think that kernel 
approach has more advantages

Andrey
--

Previous thread: none

Next thread: [PATCH 1/2] bcm5974-0.63: Finger counting improved by Henrik Rydberg on Wednesday, September 3, 2008 - 5:01 am. (1 message)