Re: [RFC v5][PATCH 0/9] Kernel based checkpoint/restart

Previous thread: Re: [linux-dvb] Multiproto API/Driver Update by Markus Rechberger on Saturday, September 13, 2008 - 3:56 pm. (21 messages)

Next thread: 2.6.27-rc6: lockdep warning: iprune_mutex at shrink_icache_memory+0x38/0x1a8 by Alexander Beregalov on Saturday, September 13, 2008 - 4:31 pm. (6 messages)
From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:05 pm

These patches implement basic checkpoint-restart [CR]. This version (v5)
supports basic tasks with simple private memory, and open files (regular
files and directories only). The main changes include rework of memory
restart code and response to feedback. See original announcements below.

Oren.

--
Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Extend to handle (multiple) tasks in a container
- Security (without CAPS_SYS_ADMIN files restore may fail)

Changelog:

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use stand list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAPS_SYS_ADMIN until we come up with ...
From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:06 pm

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
checkpoint/restart context (a per-checkpoint data structure for
housekeeping)

checkpoint/checkpoint.c - output wrappers and basic checkpoint handling

checkpoint/restart.c - input wrappers and basic restart handling

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Makefile                       |    2 +-
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |  174 ++++++++++++++++++++++++++++++++
 checkpoint/restart.c           |  189 ++++++++++++++++++++++++++++++++++
 checkpoint/sys.c               |  218 +++++++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint.h     |   60 +++++++++++
 include/linux/checkpoint_hdr.h |   75 ++++++++++++++
 include/linux/magic.h          |    3 +
 8 files changed, 717 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h

diff --git a/Makefile b/Makefile
index f448e00..a558ad2 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += ...
From: Dave Hansen
Date: Monday, September 15, 2008 - 10:54 am

This one has been bugging me a bit.  This adds one net line of code and
I think it's much easier to read:

{
	struct cr_hdr h;
	int ret;

	ret = cr_read_obj(ctx, &h, buf, n);
	if (ret)
		return ret;

	ret = -EINVAL;	
	if (h.type == type)
		ret = h.parent;

	return ret;
}                     

-- Dave

--

From: Dave Hansen
Date: Monday, September 15, 2008 - 10:59 am

It's probably time to address this fixme.  

-- Dave

--

From: Dave Hansen
Date: Monday, September 15, 2008 - 11:00 am

Could you explain why you're not using the slab here?

-- Dave

--

From: Dave Hansen
Date: Monday, September 15, 2008 - 11:02 am

All of the casting here is unnecessary.  'void *' behaves like 'char *'
when you do arithmetic on it.  

I really do detest having a memory allocator BUG_ON() when it runs out
of space.  

-- Dave

--

From: Oren Laadan
Date: Monday, September 15, 2008 - 11:52 am

The BUG_ON() statement asserts that we don't run out of buffer space.
Buffer usage is a function of the checkpoint/restart logic, and does
not depend on user input, hence not susceptible to DoS.

In other words, if the code is correct, this should never happen (much
like a kernel stack overflow), and if it happens it's a kernel bug. I
think it was Arnd who recommended with regard to this to crash loudly
if there is a bug in the kernel ...

Oren.
--

From: Dave Hansen
Date: Monday, September 15, 2008 - 12:13 pm

OK, that's fair enough.  But, can we document it as such?  "Only headers
and things of known, static sizes can go in here.  We don't use

Yes, it does mean that there is a bug because someone either made a
structure bigger, PAGE_SIZE smaller, or a call path got deeper than we
expected.  I'm just having visions of the email hitting my inbox in 18
months. :)

The structures are sized consistently across all architectures and
configurations.  However, PAGE_SIZE and the size of that buffer are not.
The buffer will be 8k on x86, but 128k on most ppc64 configurations.

Can we at least make it sized in numbers of bytes rather than pages?
Also, please remember that using kmalloc() buys you lots of fun stuff on
top of get_free_pages(), like redzones and easier debugging.

-- Dave

--

From: Bastian Blank
Date: Tuesday, September 16, 2008 - 5:27 am

No. Arithmetic on void pointer is not allowed by the standard. This is a
gcc extension. It's okay to be used in the Linux kernel but not in
general.

The second cast (from char * to void *) however is useless.

Bastian

-- 
Prepare for tomorrow -- get ready.
		-- Edith Keeler, "The City On the Edge of Forever",
		   stardate unknown
--

From: Serge E. Hallyn
Date: Monday, September 15, 2008 - 2:15 pm

This feels a litlte weird, since the ctx->crid actually was calculated
in ctx_alloc() in sys_checkpoint().  I'd almost prefer do_checkpoint()
return 0 on success.


If h->len > n should we return -ENOSPC so the caller can theoretically

Here I am getting confused about return values.  

cr_read_obj() says it returns size of payload.

cr_read_obj_type() says it returns "object reference of the parent
object", though if cr_read_obj() returned non-zero then it returns its
return value, which could be size of object read.

cr_read_string() returns cr_read_obj_type()'s return value.

Now here you only copy buf into t->comm if return value was 0.

Am I misreading?  Or is this code wrong?

Annoying as it may be (causing patch conflicts in patches 3-9), could
you please put the return values of all these functions in the comments
--

From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:06 pm

For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it will be followed by the file name. Then comes the actual contents,
in one or more chunk: each chunk begins with a header that specifies
how many pages it holds, then the virtual addresses of all the dumped
pages in that chunk, followed by the actual contents of all dumped
pages. A header with zero number of pages marks the end of the contents.
Then comes the next VMA and so on.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/mm/checkpoint.c         |   30 +++
 arch/x86/mm/restart.c            |    1 +
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   53 ++++
 checkpoint/checkpoint_arch.h     |    2 +
 checkpoint/checkpoint_mem.h      |   41 ++++
 checkpoint/ckpt_mem.c            |  492 ++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c                 |   27 ++-
 include/asm-x86/checkpoint_hdr.h |    5 +
 include/linux/checkpoint.h       |   12 +
 include/linux/checkpoint_hdr.h   |   32 +++
 11 files changed, 692 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/checkpoint_mem.h
 create mode 100644 checkpoint/ckpt_mem.c

diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index eb60003..fa87e4a 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -196,3 +196,33 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
 	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret;
 }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	mutex_lock(&mm->context.lock);
+
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	cr_debug("nldt %d\n", hh->nldt);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto ...
From: MinChan Kim
Date: Tuesday, September 16, 2008 - 11:48 pm

If COW occur in file-backed pages, Is page_mapping is modified ?
I don't understand why this condition represent "file backed clean cow"?



-- 
Kinds regards,
MinChan Kim
--

From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:06 pm

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoffset() for each VMA and then read in the data.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/mm/restart.c            |   54 ++++++
 checkpoint/Makefile              |    2 +-
 checkpoint/checkpoint_arch.h     |    2 +
 checkpoint/checkpoint_mem.h      |    5 +
 checkpoint/restart.c             |   43 +++++
 checkpoint/rstr_mem.c            |  370 ++++++++++++++++++++++++++++++++++++++
 include/asm-x86/checkpoint_hdr.h |    4 +
 include/linux/checkpoint.h       |    3 +
 8 files changed, 482 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_mem.c

diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index 79780df..a0cd5ab 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -176,3 +176,57 @@ int cr_read_cpu(struct cr_ctx *ctx)
  out:
 	return ret;
 }
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int n, rparent;
+
+	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+	if (rparent < 0)
+		return rparent;
+	if (rparent != parent)
+		return -EINVAL;
+
+	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+		return -EINVAL;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+
+	for (n = 0; n < hh->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+		int ret;
+
+		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			return ret;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = ...
From: Dave Hansen
Date: Monday, September 15, 2008 - 12:14 pm

PATH_MAX is about as short of a global macro name as you're going to

That's an interesting loop condition, especially since it will never
actually get triggered.  while(1) would give the same functionality,
right?


-- Dave

--

From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:05 pm

Create trivial sys_checkpoint and sys_restore system calls. They will
enable to checkpoint and restart an entire container, to and from a
checkpoint image file descriptor.

The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Kconfig                 |   11 +++++++++
 checkpoint/Makefile                |    5 ++++
 checkpoint/sys.c                   |   41 ++++++++++++++++++++++++++++++++++++
 include/asm-x86/unistd_32.h        |    2 +
 include/linux/syscalls.h           |    2 +
 init/Kconfig                       |    2 +
 kernel/sys_ni.c                    |    4 +++
 8 files changed, 69 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..ffaa635
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,11 @@
+config CHECKPOINT_RESTART
+	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
+	def_bool n
+	depends on X86_32 && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile ...
From: Serge E. Hallyn
Date: Monday, September 15, 2008 - 1:28 pm

Acked-by: Serge Hallyn <serue@us.ibm.com>

thanks,
--

From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:06 pm

Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Since FDs can be shared, they are assigned an
objref and registered in the object hash.

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its objref
and its close-on-exec property. If the FD is to be saved (first time)
then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
Then will come the next FD and so on.

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |    4 +
 checkpoint/checkpoint_file.h   |   17 +++
 checkpoint/ckpt_file.c         |  221 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h     |    7 +-
 include/linux/checkpoint_hdr.h |   32 ++++++-
 6 files changed, 278 insertions(+), 5 deletions(-)
 create mode 100644 checkpoint/checkpoint_file.h
 create mode 100644 checkpoint/ckpt_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 9843fb9..7496695 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index d4c1b31..87420dc 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -203,6 +203,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_files(ctx, t);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/checkpoint_file.h
@@ -0,0 ...
From: Bastian Blank
Date: Sunday, September 14, 2008 - 2:51 am

Why not?
| int i;
| int n = 0;
| int tot = CR_DEFAULT_FDTABLE;



_NO_. tot is signed, this does not have documented overflow behaviour.

krealloc does not free the memory on error, so this is a leak.

Bastian

-- 
The more complex the mind, the greater the need for the simplicity of play.
		-- Kirk, "Shore Leave", stardate 3025.8
--

From: Oren Laadan
Date: Sunday, September 14, 2008 - 8:40 am

Yes, the assumption is that the process is frozen (or that we checkpoint
ourselves).

With this assumption, it is possible to (a) leave out RCU locking, and also
(b) continue after the krealloc() from where we left off. Also, it means that
(c) the contents of our 'fds' array remain valid by the time the caller makes
use of it.

This certainly deserves a comment in the code, will add.

If the assumption doesn't hold, then I'll have to add the RCU locking. Cases
(b) and (c) are already safe because the caller(s) use fcheck_files() to
access the file-descriptors collected in the array.

While in my mind a task should never be unfrozen while being checkpointed, in
reality future code may allow it e.g. if a OOM kicks in a kills it. So I tend


Ok.

Thanks,

Oren.
--

From: Serge E. Hallyn
Date: Tuesday, September 16, 2008 - 4:03 pm

More to the point, you're not preventing them being unfrozen, so I think

Right, so you may as well drop this.  You're not protecting from
userspace here, right?  You're protecting against a bogus max_fds.
--

From: Dave Hansen
Date: Monday, September 22, 2008 - 8:31 am

Oren,

You should probably also cc the x86 maintainers on at least the
x86-specific bits.


-- Dave

--

From: Dave Hansen
Date: Tuesday, September 16, 2008 - 8:54 am

Dude, you don't need to cast to void*.  

-- Dave

--

From: Dave Hansen
Date: Tuesday, September 16, 2008 - 9:55 am

One of the big things I'm looking for in these file patches is to make
sure that we can expand some day to lots of mounts and lots of
filesystem namespaces.

This tells me that we probably need a per-process vfsroot (which is also
a shared object).  We probably need to make a note in the task structure
as well as here.  

-- Dave

--

From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:06 pm

Covers application checkpoint/restart, overall design, interfaces
and checkpoint image format.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Documentation/checkpoint.txt |  207 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 207 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint.txt

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
new file mode 100644
index 0000000..6bf75ce
--- /dev/null
+++ b/Documentation/checkpoint.txt
@@ -0,0 +1,207 @@
+
+	=== Checkpoint-Restart support in the Linux kernel ===
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+Reviewers:
+
+Application checkpoint/restart [CR] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. CR can provide many potential benefits:
+
+* Failure recovery: by rolling back an to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off of faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relative opaque 'blob' of data to userspace which can
+then be handed to the new ...
From: Serge E. Hallyn
Date: Monday, September 15, 2008 - 1:26 pm

This really should include your demo programs from your patch 0/9
--

From: MinChan Kim
Date: Tuesday, September 16, 2008 - 11:23 pm

which is right between id and parent?



-- 
Kinds regards,
MinChan Kim
--

From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:06 pm

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e..g open files, memory address space
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kenrel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Documentation/checkpoint.txt |   38 +++++++
 checkpoint/Makefile          |    2 +-
 checkpoint/objhash.c         |  237 ++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c             |    4 +
 include/linux/checkpoint.h   |   20 ++++
 5 files changed, 300 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/objhash.c

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
index 6bf75ce..2929512 100644
--- a/Documentation/checkpoint.txt
+++ b/Documentation/checkpoint.txt
@@ -169,6 +169,44 @@ cr_hdr + cr_hdr_task
 cr_hdr + cr_hdr_tail
 
 
+=== Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects in the following manner.
+
+On the first encounter, the state is dumped and the object is assigned
+a unique identifier and also stored in the hash table (indexed by its
+physical kenrel address). From then on the object will be found in ...
From: Dave Hansen
Date: Tuesday, September 16, 2008 - 9:48 am

I'd much rather see all this documentation put properly inline with the

I have some nits about some of that.  I'd prefer to see the main code
flow of the functions get taken out of the if(){} blocks and put at the
first indenting level (cr_obj_new() could use this).

One other nit would be to try and get rid of some of the '?:' use.  I
personally find those hard to read and we don't use them that heavily in
the core kernel.  For instance, the above can be written:

	obj = cr_obj_find_by_ptr(ctx, ptr);
	if (!obj)
		return -ESRCH;
	if (obj->type != type)
		return -EINVAL;
	return obj->objref;

Which make the error conditions and their results just pop out much
easier at the cost of a single added line of code.

-- Dave

--

From: MinChan Kim
Date: Wednesday, September 17, 2008 - 12:31 am

One more thing, 'physical' is rather awkward.
Many kernel developer use physical address and virtual address with
different meaning. It is confusing them.



-- 
Kinds regards,
MinChan Kim
--

From: Serge E. Hallyn
Date: Tuesday, September 16, 2008 - 1:54 pm

Acked-by: Serge Hallyn <serue@us.ibm.com>

Thanks, Oren, I actually think this is quite nice and readable.

Though three questions below.  First one is, since you've mentioned
having multiple threads doing checkpoint, won't you need some locking?

What is the point of the 'type'?

By that I mean: is it meant to catch bugs in the implementation, or bad

What exactly will objref_index be used for?  I don't see any real
--

From: Oren Laadan
Date: Tuesday, September 16, 2008 - 2:36 pm

yes.


The hash table is maintained in the kernel during checkpoint/restart,
adding shared objects to the hash table when they are first seen. An
object can be a 'struct file', 'struct mm_struct', etc.

When an object is added, it's reference count is incremented to ensure
that the pointer is still valid for as long as it's in the hash table.
Similarly, when we remove an object from the hash, we need to decrement
the reference count. This is done in cr_obj_ref_grab(), cr_obj_ref_drop().

There are different functions to inc/dec the reference count of objects
of different types. '->type' keeps track of the type of the object, so
we know which function to call. (At this point, we only track shared

'objref_index' is a counter used to assign unique identifiers (objref)
to objects as they are added to the hash table. It is incremented with
every object that joins the hash table (and the old value is used as


[...]

Oren.

--

From: Serge E. Hallyn
Date: Tuesday, September 16, 2008 - 3:09 pm

If you were to rename objref_index to next_free_ref, I think that'd
be helpful.  (I also don't think 'obj' should in the name at all, and
you should just s/objref/id/, but that's really too much pain)

thanks,
-serge
--

From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:06 pm

(Following Dave Hansen's refactoring of the original post)

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/mm/Makefile             |    2 +
 arch/x86/mm/checkpoint.c         |  198 ++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/restart.c            |  177 ++++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c          |   13 +++-
 checkpoint/checkpoint_arch.h     |    7 ++
 checkpoint/restart.c             |   13 +++-
 include/asm-x86/checkpoint_hdr.h |   72 ++++++++++++++
 include/linux/checkpoint_hdr.h   |    1 +
 8 files changed, 481 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/mm/checkpoint.c
 create mode 100644 arch/x86/mm/restart.c
 create mode 100644 checkpoint/checkpoint_arch.h
 create mode 100644 include/asm-x86/checkpoint_hdr.h

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index dfb932d..58fe072 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -22,3 +22,5 @@ endif
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..eb60003
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,198 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx ...
From: Serge E. Hallyn
Date: Monday, September 15, 2008 - 2:31 pm

From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:06 pm

Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
and lookup objref in the hash table; if not found (first occurence), read
in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
Otherwise attach the file pointer from the hash as an FD.

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/restart.c       |    4 +
 checkpoint/rstr_file.c     |  202 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h |    1 +
 4 files changed, 208 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index a0d5e60..956e274 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -212,6 +212,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_files(ctx);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..780c0fc
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,202 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include ...
From: Serge E. Hallyn
Date: Tuesday, September 16, 2008 - 4:08 pm

From: Oren Laadan
Date: Tuesday, September 16, 2008 - 5:11 pm

No. (this was discussed earlier already).

cr_hbuf_get() "allocates" space inside a dedicated buffer for headers
in the checkpoint context (ctx->hbuf). It does not allocate new kernel
memory. Instead, it returns the current position in that buffer
ctx->hbuf[ctx->hpos], and advances ctx->hpos appropriately. On the
other side, cr_hbuf_put() reverses that effect, reducing ctx->hpos
accordingly.

If an error occurs, the checkpoint (or restart) operation is aborted,
and eventually the context (ctx) will be cleaned up; at that point the
special purpose buffer will be freed.

[...]

Oren.


--

From: Serge E. Hallyn
Date: Tuesday, September 16, 2008 - 9:56 pm

Yes I realize you're not doing a real allocation here and so, especially
if the whole thing is about to fail anyway, there may seem to be little
point in bothering to _put().  The thing is it's an unbalanced
operation, and the behind-the-scenes implementation may change at some
later point so IMO it's definately worth balancing these things now.

-serge
--

From: Dave Hansen
Date: Monday, September 22, 2008 - 9:02 am

I think this is like claiming that my malloc() will get freed if my
applications exits, so I don't have to worry about free().  It is messy,
it makes lifetime rules and use less explicit, and it makes bugs harder
to find.  If I were a good programmer (which I'm not) I probably
wouldn't do that.

-- Dave

--

From: Oren Laadan
Date: Saturday, September 13, 2008 - 4:22 pm

Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
and lookup objref in the hash table; if not found (first occurence), read
in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
Otherwise attach the file pointer from the hash as an FD.

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/restart.c       |    4 +
 checkpoint/rstr_file.c     |  202 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h |    1 +
 4 files changed, 208 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index a0d5e60..956e274 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -212,6 +212,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_files(ctx);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..780c0fc
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,202 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include ...
From: Serge E. Hallyn
Date: Wednesday, September 17, 2008 - 7:16 am

At the kernel summit, it was asked whether at some point this CR work
could be used as the basis for a new suspend to disk implementation.
I answered that we don't plan to handle kernel threads.  In your
experience, are there any reasons why eventually we couldn't eventually
handle kernel threads?

-serge
--

From: Oren Laadan
Date: Wednesday, October 8, 2008 - 2:59 am

As far as I can see, there isn't an inherent reason not to handle
kernel threads. However, I never looked deep into the problem.

The main issue that I can see with it is similar to what the
hibernation developers must tackle anyway - how to freeze kernel
threads when some of them may still be needed to take the system
down.

Assuming that is solved, then we're left with how to freeze the
kernel threads in a state that makes sense for a restart; for
regular tasks this is right before going back to user-land (*),
for kernel threads it may not be the best place :)

(*) however, tasks that are ptraced or in the middle of a vfork
will require special treatment - since upon freezing they cannot
be forced to that convenient position; so upon restart there must
be special code to make their behavior after restart compatible
with what they would have done originally - probably by emulation
as opposed to rebuilding their old kernel stack. For instance,
if a task was stopped due to ptrace before return from a syscall,
then upon restart it should return to that exact state.

So I'd assume that kernel threads could be treated in a similar
manner by special-casing, if necessary.

Question: I'd assume that at least for some of the kernel threads
a simple re-launch at restart will do; how many really require that
we save and restore their state ?

Oren

--

From: Serge E. Hallyn
Date: Wednesday, September 24, 2008 - 2:42 pm

Playing around a bit, all seems to work as advertised.

Has anyone playing with this run into any oopses with the latest
version?

Cedric, could you send out your patch (or a pointer to them) to make
ckpt/restart on another task work, so we can pound the core code a bit
harder with liblxc?

thanks,
-serge
--

From: Cedric Le Goater
Date: Thursday, September 25, 2008 - 5:58 am

sure. the user tools are in CVS for the moment :

	http://lxc.cvs.sourceforge.net/lxc/

but Daniel will probably release something soon. the lxc-checkpoint and 
lxc-restart tools are a bit too optimistic on the availability of the 
checkpoint and restart syscalls, a cleanup might be needed.

the patchset is here :

	http://legoater.free.fr/patches/2.6.27/2.6.27-rc7-lxc2/

it's based on top of current mainline and includes :

	. sysfs patches for net namespace
	. freezer subsystem
	. oren's V5 checkpoint and restart patches
	. a personal hack to do external checkpoint and restart using
	  the lxc-checkpoint and lxc-restart tools
	. some fixes
	
and also

	an old mq namespace implementation (being reworked currently)

restart is still bogus because some state is not well captured at
checkpoint time. 

I'm working on it right now !

Thanks,

C.
--

Previous thread: Re: [linux-dvb] Multiproto API/Driver Update by Markus Rechberger on Saturday, September 13, 2008 - 3:56 pm. (21 messages)

Next thread: 2.6.27-rc6: lockdep warning: iprune_mutex at shrink_icache_memory+0x38/0x1a8 by Alexander Beregalov on Saturday, September 13, 2008 - 4:31 pm. (6 messages)