Re: [RFC v4][PATCH 5/9] Memory managemnet (restore)

Previous thread: none

Next thread: Re: + ehea-fix-dlpar-memory-handling.patch added to -mm tree by Hannes Hering on Tuesday, September 9, 2008 - 12:58 am. (1 message)
From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

These patches implement checkpoint-restart [CR v3]. This version is
aimed at addressing feedback and eliminating bugs, after having added
save and restore of open files state (regular files and directories)
which makes it more usable.

Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Extend to handle (multiple) tasks in a container
- Security (without CAPS_SYS_ADMIN files restore may fail)

Changelog:

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use stand list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAPS_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.

--
(Dave Hansen's announcement)

At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, ...
From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

(Following Dave Hansen's refactoring of the original post)

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/mm/Makefile       |    2 +
 arch/x86/mm/checkpoint.c   |  194 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/restart.c      |  178 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c    |   13 +++-
 checkpoint/ckpt_arch.h     |    7 ++
 checkpoint/restart.c       |   13 +++-
 include/asm-x86/ckpt_hdr.h |   72 ++++++++++++++++
 include/linux/ckpt_hdr.h   |    1 +
 8 files changed, 478 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/mm/checkpoint.c
 create mode 100644 arch/x86/mm/restart.c
 create mode 100644 checkpoint/ckpt_arch.h
 create mode 100644 include/asm-x86/ckpt_hdr.h

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index dfb932d..58fe072 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -22,3 +22,5 @@ endif
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..71d21e6
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,194 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/ckpt.h>
+#include <linux/ckpt_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr ...
From: Ingo Molnar
Date: Tuesday, September 9, 2008 - 1:17 am

please use the correct comment style consistently throughout your 



hm, the preemption disabling seems pointless here. What does it protect 


plsdntuseannyngabbrvtsngnrcd. [1]

"checkpoint_" should be just fine in most cases.

	Ingo

[1] (please dont use annoying abbreviations in generic code)
--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 4:23 pm

Oren.
--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

Covers application checkpoint/restart, overall design, interfaces
and checkpoint image format.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Documentation/checkpoint.txt |  187 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 187 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint.txt

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
new file mode 100644
index 0000000..f67aef1
--- /dev/null
+++ b/Documentation/checkpoint.txt
@@ -0,0 +1,187 @@
+
+	=== Checkpoint-Restart support in the Linux kernel ===
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+Reviewers:
+
+Application checkpoint/restart [CR] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. CR can provide many potential benefits:
+
+* Failure recovery: by rolling back an to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off of faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relative opaque 'blob' of data to userspace which can
+then be handed to the new ...
From: MinChan Kim
Date: Wednesday, September 10, 2008 - 12:13 am

-- 
Kinds regards,
MinChan Kim
--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e..g open files, memory address space
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kenrel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Documentation/checkpoint.txt |   38 +++++++
 checkpoint/Makefile          |    2 +-
 checkpoint/objhash.c         |  237 ++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c             |    4 +
 include/linux/ckpt.h         |   18 +++
 5 files changed, 298 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/objhash.c

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
index f67aef1..97eee24 100644
--- a/Documentation/checkpoint.txt
+++ b/Documentation/checkpoint.txt
@@ -163,6 +163,44 @@ cr_hdr + cr_hdr_task
 cr_hdr + cr_hdr_tail
 
 
+=== Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects in the following manner.
+
+On the first encounter, the state is dumped and the object is assigned
+a unique identifier and also stored in the hash table (indexed by its
+physical kenrel address). From then on the object will be found in ...
From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

Create trivial sys_checkpoint and sys_restore system calls. They will
enable to checkpoint and restart an entire container, to and from a
checkpoint image file descriptor.

The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/kernel/syscall_table_32.S |    2 ++
 checkpoint/Kconfig                 |   11 +++++++++++
 checkpoint/Makefile                |    5 +++++
 checkpoint/sys.c                   |   35 +++++++++++++++++++++++++++++++++++
 include/asm-x86/unistd_32.h        |    2 ++
 include/linux/syscalls.h           |    2 ++
 init/Kconfig                       |    2 ++
 kernel/sys_ni.c                    |    4 ++++
 8 files changed, 63 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..a9f22ef
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,11 @@
+config CHECKPOINT_RESTART
+	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
+	def_bool y
+	depends on X86_32 && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile ...
From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
and lookup objref in the hash table; if not found (first occurence), read
in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
Otherwise attach the file pointer from the hash as an FD.

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile    |    2 +-
 checkpoint/restart.c   |    4 +
 checkpoint/rstr_file.c |  205 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ckpt.h   |    1 +
 4 files changed, 211 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f8c919d..bc49523 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -212,6 +212,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_files(ctx);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..28c4109
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,205 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include ...
From: Dave Hansen
Date: Tuesday, September 9, 2008 - 9:26 am

This needs to use an ERR_PTR().  It will save using the double-pointer.

-- Dave

--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 6:49 pm

I suppose you refer to the call to cr_scan_fds(): either 'fdtable'
or 'n' will have to pass-by-reference. Is it that you prefer it to be
	fdtable = cr_scan_fds(files, &n);
?

Oren.

--

From: Dave Hansen
Date: Wednesday, September 10, 2008 - 9:09 am

I was misreading the use of 'n'.  Can you really not use close_files()
for this operation?  You'd need to add some locking around it, but I
think it does what you need here.

-- Dave

--

From: Oren Laadan
Date: Wednesday, September 10, 2008 - 11:55 am

I thought about that. However, close_files() assumes that the files_struct
will be discarded thereafter, so it does not reset ->fd_open->fd_bits[] bits,
does not adjust ->next_fd field, and does not use rcu_assign_pointer(). And
then, even if we adjust, we'll have to watch future differences between
sys_close() and close_files(), so using sys_close() is more future-proof.

Besides, cr_scan_fds() is used by the checkpoint logic already, so it's easy
to reuse for restart as well.

Oren.
--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Since FDs can be shared, they are assigned an
objref and registered in the object hash.

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its objref
and its close-on-exec property. If the FD is to be saved (first time)
then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
Then will come the next FD and so on.

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile      |    2 +-
 checkpoint/checkpoint.c  |    4 +
 checkpoint/ckpt_file.c   |  221 ++++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/ckpt_file.h   |   17 ++++
 include/linux/ckpt.h     |    7 +-
 include/linux/ckpt_hdr.h |   34 +++++++-
 6 files changed, 280 insertions(+), 5 deletions(-)
 create mode 100644 checkpoint/ckpt_file.c
 create mode 100644 checkpoint/ckpt_file.h

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 9843fb9..7496695 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 4dae775..aebbf22 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -217,6 +217,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_files(ctx, t);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..ca58b28
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,221 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ ...
From: Vegard Nossum
Date: Tuesday, September 9, 2008 - 1:06 am

Hm, shouldn't this be max * sizeof(*fdlist) like the kmalloc above?


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Vegard Nossum
Date: Tuesday, September 9, 2008 - 1:23 am

Hi,

Below are some concerns, I would be grateful for explanations (or
pointers if I missed them before).


Should cr_hbuf_put() come before the return here?

As far as I've understood, "leaking" the buffer size/data isn't
critical (1. because it's just some extra space, and/or 2. the buffer
is discarded on error anyway). The code looks really unbalanced




Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 7:01 pm

You are right on the money: the space is allocated on a temporary
buffer that is part of the checkpoint context, and is discarded on
error (and success) anyway.

Although the code may seem somewhat unbalanced, I personally find it
useful in that it simplifies the error paths in the code. "Balancing"
the code by adding cr_hbuf_put() calls is not functionally necessary,
will clobber the code and add to its (source and compiled) size.

Certainly it could use better documentation, probably in sys.c where


[...]

Thanks,

Oren.

--

From: MinChan Kim
Date: Wednesday, September 10, 2008 - 10:02 pm

max is read-only variable so that you don't need to declare local variable.




-- 
Kinds regards,
MinChan Kim
--

From: Oren Laadan
Date: Wednesday, September 10, 2008 - 11:37 pm

Oops ... yes. Thanks.

[...]

Oren.


--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
checkpoint/restart context (a per-checkpoint data structure for
housekeeping)

checkpoint/checkpoint.c - output wrappers and basic checkpoint handling

checkpoint/restart.c - input wrappers and basic restart handling

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Makefile                 |    2 +-
 checkpoint/Makefile      |    2 +-
 checkpoint/checkpoint.c  |  188 +++++++++++++++++++++++++++++++++++++++
 checkpoint/restart.c     |  189 +++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c         |  218 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/ckpt.h     |   60 +++++++++++++
 include/linux/ckpt_hdr.h |   84 ++++++++++++++++++
 include/linux/magic.h    |    3 +
 8 files changed, 740 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/ckpt.h
 create mode 100644 include/linux/ckpt_hdr.h

diff --git a/Makefile b/Makefile
index f448e00..a558ad2 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
diff ...
From: MinChan Kim
Date: Tuesday, September 9, 2008 - 11:10 pm

Why do you need get_file?
You already called fget.



-- 
Kinds regards,
MinChan Kim
--

From: Oren Laadan
Date: Wednesday, September 10, 2008 - 11:36 am

This was meant for when we will restart multiple processes, each would
have access to the checkpoint-context, such that the checkpoint-context
may outlives the task that created it and initiated the restart. Thus
the file-pointer will need to stay around longer than that task.

Of course, restart of multiple processes _can_ be coded such that this
first task will always terminate last - either after restart completes
successfully, or after all the other tasks aborted and won't use the
checkpoint-context anymore.

Because that code is not part of the this patch-set, I considered it
safer to grab a reference of the file pointer, making it less likely

Oren.

--

From: MinChan Kim
Date: Wednesday, September 10, 2008 - 3:54 pm

Hi, Oren.


OK. Thanks for your explanation.

What do you mean by that ? Isn't it a your part of your code?
When the last checkpoint-context is ended, who free file ?



-- 
Kinds regards,
MinChan Kim
--

From: Oren Laadan
Date: Wednesday, September 10, 2008 - 11:44 pm

I meant that future code would make that clear, but that creates more
confusion than not.

Instead, I'll leave that to the future and for now just clean the code
as you suggested.

Oren.


--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it will be followed by the file name.  The cr_vma->npages will tell
how many pages were dumped for this VMA.  Then it will be followed
by the actual data: first a dump of the addresses of all dumped
pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next VMA and so on.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/mm/checkpoint.c   |   30 +++
 arch/x86/mm/restart.c      |    1 +
 checkpoint/Makefile        |    3 +-
 checkpoint/checkpoint.c    |   53 ++++++
 checkpoint/ckpt_arch.h     |    1 +
 checkpoint/ckpt_mem.c      |  448 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/ckpt_mem.h      |   35 ++++
 checkpoint/sys.c           |   23 ++-
 include/asm-x86/ckpt_hdr.h |    5 +
 include/linux/ckpt.h       |   12 ++
 include/linux/ckpt_hdr.h   |   30 +++
 11 files changed, 635 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/ckpt_mem.c
 create mode 100644 checkpoint/ckpt_mem.h

diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 71d21e6..50cfd29 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -192,3 +192,33 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
 	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret;
 }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	mutex_lock(&mm->context.lock);
+
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	cr_debug("nldt %d\n", hh->nldt);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * ...
From: Vegard Nossum
Date: Tuesday, September 9, 2008 - 2:22 am

Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: MinChan Kim
Date: Wednesday, September 10, 2008 - 12:51 am

-- 
Kinds regards,
MinChan Kim
--

From: MinChan Kim
Date: Wednesday, September 10, 2008 - 4:49 pm

one more thing.


At first trial, ctx->pgcur is null.



-- 
Kinds regards,
MinChan Kim
--

From: Dave Hansen
Date: Wednesday, September 10, 2008 - 9:55 am

get_user_pages() is really the wrong thing to use here.  It makes pages
*present* so that we can do things like hand them off to a driver.  For
checkpointing, we really don't care about that.  It's a waste of time,
for instance to perform faults to fill the mappings up with zero pages
and page tables.  Just think of what will happen the first time we touch
a very large, very sparse anonymous area.  We'll probably kill the
system just allocating page tables.  Take a look at the comment in
follow_page().  This is a similar operation to core dumping, and we need
to be careful.

This might be fine for a proof of concept, but it needs to be thought
out much more thoroughly before getting merged.  I guess I'm
volunteering to go do that.

-- Dave

--

From: Dave Hansen
Date: Wednesday, September 10, 2008 - 10:45 am

Why is the FOLL_TOUCH required here?

-- Dave

--

From: Oren Laadan
Date: Wednesday, September 10, 2008 - 11:28 am

The intention is not to allocate unallocated pages, but to get the page
pointer and bring in swapped out pages if necessary. (Avoiding swap-in
is possible, but left for future optimization).

Indeed, follow_page() does the work just fine; Of course, it should be
called with FOLL_ANON instead of FOLL_TOUCH. Thanks for pointing out.

Oren.
--

From: Dave Hansen
Date: Wednesday, September 10, 2008 - 2:03 pm

This is a lot of changes.  But, they're all kinda intertwined
so it's hard to make individual patches out of them.  I've
tried to explain each of the changes as you look through
the patch sequentially.

Note that this patch removes more code than it adds, and I think it
makes everything more readable.  There are a few things I need to fix up
in the restore patch (like the use of nr_free), but nothing really
fundamental.

- Remove use of get_free_pages() for buffers, use kmalloc()
  for its debugging advantages, and not having to deal with
  "orders".
- Now that we use kfree(), don't check for NULL pages/vaddr
  since kfree() does it for us.
- Zero out the pgarr as we remove things from it.  Bug
  hunters will thank us.
- Change ctx->pgarr name to pgarr_list.
- Change the ordering of the pgarr_list so that the first
  entry is always the old "pgcurr".  Make a function to
  find the first entry, and kill "pgcurr".
- Get rid of pgarr->nr_free.  It's redundant with nr_used
  and the fixed-size allocation of all pgarrs.  Create
  a helper function (pgarr_is_full()) to replace it.
- Remove cr_pgarr_prep(), just use cr_pgarr_alloc().
- Create cr_add_to_pgarr() helper which also does some
  checking of the page states that come back from
  follow_page().
- Rename cr_vma_fill_pgarr() to cr_private_vma() to make
  it painfully obvious that it does not deal with
  shared memory of any kind.
- Don't fault in pages with handle_mm_fault(), we do not
  need them to be present, nor should we be wasting
  space on pagetables that need to be created for sparse
  memory areas.
- Add parenthesis around 'page_mapping(page) != NULL' check
- Don't bother releasing pgarr pages for a VMA since it
  will never free all of the VMA's pages anyway
- Don't track total pages.  If we really need this, we can
  track the number of full 'pgarr's on the list, and add
  the used pages from the first one.
- Reverse list iteration since we changed the list order
- Give cr_pgarr_reset() a better name: ...
From: Dave Hansen
Date: Wednesday, September 10, 2008 - 2:38 pm

This confuses me.  cr_vma_fill_pgarr() if it runs into an error attempts
to free up the pgarr references from the current pgarr that was just
filled.  But, that could only be a portion of a large VMA.  If it can't
free up the entire VMA worth of references (at least), why does it even
try to free a portion?  Why not just return since the upper levels need
to clean up the other portions anyway?

Also, is it really necessary to track the total amount filled in here?
I kinda gums up the code.

-- Dave

--

From: Dave Hansen
Date: Friday, September 12, 2008 - 9:57 am

Oren, please take a good, hard look through all these patches (i.e.
grep) and find all of the (un)likely calls and toss them out.  They're
-- Dave

--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 12:42 am

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoffset() for each VMA and then read in the data.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/mm/checkpoint.c   |    5 +-
 arch/x86/mm/restart.c      |   54 +++++++
 checkpoint/Makefile        |    2 +-
 checkpoint/ckpt_arch.h     |    2 +
 checkpoint/restart.c       |   43 ++++++
 checkpoint/rstr_mem.c      |  351 ++++++++++++++++++++++++++++++++++++++++++++
 include/asm-x86/ckpt_hdr.h |    4 +
 include/linux/ckpt.h       |    2 +
 include/linux/ckpt_hdr.h   |    2 +-
 9 files changed, 460 insertions(+), 5 deletions(-)
 create mode 100644 checkpoint/rstr_mem.c

diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 50cfd29..534684f 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -208,17 +208,16 @@ int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
 
 	hh->ldt_entry_size = LDT_ENTRY_SIZE;
 	hh->nldt = mm->context.size;
-
 	cr_debug("nldt %d\n", hh->nldt);
 
 	ret = cr_write_obj(ctx, &h, hh);
 	cr_hbuf_put(ctx, sizeof(*hh));
 	if (ret < 0)
-		return ret;
+		goto out;
 
 	ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
 
 	mutex_unlock(&mm->context.lock);
-
+ out:
 	return ret;
 }
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index d7fb89a..be5e0cd 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -177,3 +177,57 @@ int cr_read_cpu(struct cr_ctx *ctx)
  out:
 	return ret;
 }
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int n, rparent;
+
+	rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+	if (rparent < 0)
+		return rparent;
+	if (rparent != parent)
+		return ...
From: Serge E. Hallyn
Date: Tuesday, September 9, 2008 - 9:07 am

As Dave has pointed out, this appears to be a security problem.  I think
what you need to do is create a new helper mprotect_fixup_withchecks(),
which does all the DAC+MAC checks which are done in the sys_mprotect()
loop starting with "for (nstart = start ; ; ) {...  Otherwise an
unprivileged user can create a checkpoint image of a program which has
done a ro shared file mmap, edit the checkpoint, then restart it and (i
assume) cause the modified contents to be written to the file.  This
could violate both DAC checks and selinux checks.

So create that helper which does the security checks, and use it
--

From: Oren Laadan
Date: Tuesday, September 9, 2008 - 4:35 pm

As I replied to Dave, I don't see why this would be a security problem.

This handles private memory only. In particular, the uncommon case of a
read-only VMA tha has modified contents. This _cannot_ affect the file
from which this VMA may have been mapped.

Shared memory (not file-mapped) will be handled differently: since it is
always backed up by an inode in shmfs, the restart will populate the
relevant pages directly. Besides, non-file-mapped shared memory is again
not a security concern.

Finally, shared memory that maps to a file is simply _not saved_ at all;
it is part of the file system, and belongs to the (future) file system
snapshot capability. Since the contents are always available in the file
system, we don't need to save it (like we don't save shared libraries).

That said, it is necessary that the code ensures that the vm_flags that
belong to a VMA of a private type, e.g. CR_VMA_ANON/CR_VMA_FILE, indeed

[...]

Oren.

--

From: Serge E. Hallyn
Date: Wednesday, September 10, 2008 - 8:00 am

Cool.  That sounds good and I'll look for that in the next version.

There still may be objections about bypassing selinux execmem/execheap
permission checks, but I think that's ok for now.  Long-term I expect
we'll want the security_file_mprotect checks there, and selinux users
will have to use a policy where restart is started in a privileged
restart_t domain or somesuch (and eventually transitions back to the
checkpointed selinux type if possible).

thanks,
--

From: Dave Hansen
Date: Wednesday, September 10, 2008 - 12:31 pm

cr_pgarr_prep() can return a partially full pgarr, right?  Won't the
cr_kread() always start at the beginning of the pgarr->vaddrs[] array?
Seems to me like it will clobber things from the last call.

-- Dave

--

From: Oren Laadan
Date: Wednesday, September 10, 2008 - 12:48 pm

Note that 'nr' is either equal to ->nr_free - in which case we consume
the entire 'pgarr' vaddr array such that the next call to cr_pgarr_prep()
will get a fresh one, or is smaller than ->nr_free - in which case that
is the last iteration of the loop anyhow, so it won't be clobbered.

Also, after we return - our caller, cr_vma_read_pages(), resets the state
of the page-array chain by calling cr_pgarr_reset().

Oren.
--

From: Dave Hansen
Date: Wednesday, September 10, 2008 - 1:49 pm

Man, that's awfully subtle for something which is so simple.

I think it is a waste of memory to have to hold *all* of the vaddrs in
memory at once.  Is there a real requirement for that somehow?  The code
would look a lot simpler use less memory if it was done (for instance)
using a single 'struct pgaddr' at a time.  There are an awful lot of HPC
apps that have nearly all physical memory in the machine allocated and
mapped into a single VMA.  This approach could be quite painful there.

I know it's being done this way because that's what the dump format
looks like.  Would you consider changing the dump format to have blocks
of pages and vaddrs together?  That should also parallelize a bit more
naturally.

Anyway, this either needs a big fat comment or something that is
self-describing like this:

+       while (npages) {
+               pgarr = alloc_fresh_pgarr(...)
+               if (!pgarr)
+                       return -ENOMEM;
+               nr = min(npages, (int) pgarr->nr_free);
+               ret = cr_kread(ctx, pgarr->vaddrs, nr * sizeof(unsigned long));
+               if (ret < 0)
+                       return ret;
+               pgarr->nr_free -= nr;
+               pgarr->nr_used += nr;
+               npages -= nr;
		add_pgarr_to_ctx(ctx, pgarr);
+       }
+       return 0;

When someone is looking at that, it is painfully obvious that they're
not writing over anyone else's vaddrs since the pgarr is fresh.  

-- Dave

--

From: Oren Laadan
Date: Wednesday, September 10, 2008 - 11:59 pm

It's being done this way to allow for a future optimization that will aim
at reducing downtime of the application by buffering all the data that is
to be saved while the container is frozen, so that the write-back of the
buffer happens after the container resumes execution.

(It is this reasoning that dictates the dump format and the code, not the
other way around).

That said, the point about reducing memory footprint of checkpoint/restart
is valid as well. Moreover, it conflicts with the above in requiring small
buffering, if any.

To enable both modes of operation, I'll modify the dump format to allow
multiple blocks of (addresses list followed by pages contents).

Oren.

--

From: Dave Hansen
Date: Tuesday, September 9, 2008 - 11:06 am

Cool, I can actually apply these! :)

-- Dave

--

Previous thread: none

Next thread: Re: + ehea-fix-dlpar-memory-handling.patch added to -mm tree by Hannes Hering on Tuesday, September 9, 2008 - 12:58 am. (1 message)