Re: [RFC v2][PATCH 8/9] File descriprtors - dump state

Previous thread: Linux v2.6.27-rc4 by Linus Torvalds on Wednesday, August 20, 2008 - 7:52 pm. (1 message)

Next thread: [PATCH 2.6.27-rc4 1/27] mdb: export genapic and machine_restart functions by jmerkey on Wednesday, August 20, 2008 - 7:50 pm. (1 message)
From: Oren Laadan
Date: Wednesday, August 20, 2008 - 7:58 pm

These patches implement checkpoint-restart [CR v2]. This version adds
save and restore of open files state (regular files and directories)
which makes it more usable. Other changes address the feedback given
for the previous version. It is also refactored (along Dave's posting)
for easier reviewing.

Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Extend to handle (multiple) tasks in a container
- Security (without CAPS_SYS_ADMIN files restore may fail)

Changelog:

[2008-Aug-20] v2:
   - Added dump and restore of open files (regular and directories);
     see the changes in the test program (ckpt.c)
   - Added basic handling of shared objects, and use 'parent tag'
   - Added documentation
   - Improved ABI, add 64bit padding for image data
   - Improved locking when saving/restoring memory
   - Added UTS information to header (release, version, machine)
   - Cleanup extraction of filename from a file pointer
   - Refactor to allow easier reviewing
   - Remove requirement for CAPS_SYS_ADMIN until we come up with a
     security policy (this means that file restore may fail)
   - Other cleanup in response to comments for v1

[2008-Jul-29] v1:
   - Initial version: support a single task with address space of only
     private anonymous or file-mapped VMAs; syscalls ignore pid/crid
     argument and act on current process.

--
(Dave Hansen's announcement)

At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that ...
From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:03 pm

Create trivial sys_checkpoint and sys_restore system calls. They will
enable to checkpoint and restart an entire container, to and from a
checkpoint image file.

First create a template for both syscalls: they take a file descriptor
(for the image file) and flags as arguments. For sys_checkpoint the
first argument identifies the target container; for sys_restart it will
identify the checkpoint image.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  arch/x86/kernel/syscall_table_32.S |    2 ++
  include/asm-x86/unistd_32.h        |    2 ++
  include/linux/syscalls.h           |    2 ++
  3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
  	.long sys_dup3			/* 330 */
  	.long sys_pipe2
  	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index d739467..88bdec4 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -338,6 +338,8 @@
  #define __NR_dup3		330
  #define __NR_pipe2		331
  #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334

  #ifdef __KERNEL__

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d6ff145..edc218b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
  asmlinkage long sys_eventfd(unsigned int count);
  asmlinkage long sys_eventfd2(unsigned int count, int flags);
  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);

  int kernel_execve(const char ...
From: Oren Laadan
Date: Wednesday, August 20, 2008 - 10:17 pm

this, of course, should be the last patch, not the first.

--

From: Dave Hansen
Date: Friday, August 22, 2008 - 12:32 pm

Uh oh.  These got whitespace mangled somehow.  I'll look through them,
but probably can't produce patches on top of them for now.

-- Dave

--

From: Dave Hansen
Date: Friday, August 22, 2008 - 1:11 pm

This patch also has to go *AFTER* you actually define the syscall
functions.  Otherwise, this won't be build-bisectable when it gets
committed.

-- Dave

--

From: Oren Laadan
Date: Friday, August 22, 2008 - 2:20 pm

Oops .. not sure what happened.
I'll re-send after addressing the recent comments.

Oren.

--

From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:05 pm

For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it will be followed by the file name.  The cr_vma->npages will tell
how many pages were dumped for this VMA.  Then it will be followed
by the actual data: first a dump of the addresses of all dumped
pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next VMA and so on.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  checkpoint/Makefile     |    2 +-
  checkpoint/checkpoint.c |    3 +
  checkpoint/ckpt.h       |    6 +
  checkpoint/ckpt_arch.h  |    1 +
  checkpoint/ckpt_hdr.h   |   31 ++++
  checkpoint/ckpt_mem.c   |  382 +++++++++++++++++++++++++++++++++++++++++++++++
  checkpoint/ckpt_mem.h   |   30 ++++
  checkpoint/ckpt_x86.c   |   28 ++++
  checkpoint/rstr_x86.c   |    2 +
  checkpoint/sys.c        |    7 +-
  include/asm-x86/ckpt.h  |    5 +
  11 files changed, 495 insertions(+), 2 deletions(-)
  create mode 100644 checkpoint/ckpt_mem.c
  create mode 100644 checkpoint/ckpt_mem.h

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 29dbb2d..032fc9f 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,2 +1,2 @@
-obj-y += sys.o checkpoint.o restart.o
+obj-y += sys.o checkpoint.o restart.o ckpt_mem.o
  obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 949ed58..bf868ae 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -169,6 +169,9 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
  	ret = cr_write_task_struct(ctx, t);
  	cr_debug("task_struct: ret %d\n", ret);
  	if (!ret)
+		ret = cr_write_mm(ctx, t);
+	cr_debug("memory: ret %d\n", ret);
+	if (!ret)
  		ret = cr_write_thread(ctx, t);
  	cr_debug("thread: ret %d\n", ret);
  	if (!ret)
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
index 7ecafb3..0addb63 100644
--- a/checkpoint/ckpt.h
+++ b/checkpoint/ckpt.h
@@ -29,6 +29,9 @@ struct cr_ctx {
 ...
From: Ingo Molnar
Date: Thursday, August 21, 2008 - 12:30 am

please move these into arch/x86/mm/checkpoint.c and 
arch/x86/mm/restore.c. (also, please dont try to abbreviate too much in 
filenames, makes it harder to follow changes later on, etc.)

Cool stuff btw! Any ETA on when more complex apps can be 
checkpointed/restored? Also, shouldnt at least part of this effort be 
joined with the live migration efforts of openvz? (or do you see 
fundamental, unsolvable differences?)

	Ingo
--

From: Justin P. Mattock
Date: Thursday, August 21, 2008 - 1:01 am

CONFIG_INGO=Y
(sorry saw andrew Morton do that
Couldn't resist);

justin P. Mattock



--

From: Balbir Singh
Date: Thursday, August 21, 2008 - 3:28 am

My understanding is that this is a result of the joint discussion (with the
OpenVZ folks present) that happened during the containers min-summit prior to
OLS this year.

-- 
	Balbir
--

From: Ingo Molnar
Date: Thursday, August 21, 2008 - 4:59 am

great news :-)

This stuff should be completed ASAP, user-transparent kernel upgrades on 
a generic Linux distro would be really nice :-)

	Ingo
--

From: Louis Rilling
Date: Thursday, August 21, 2008 - 2:53 am

Need to check ret here:

+ 		if (ret)

[...]

Thanks,

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Oren Laadan
Date: Friday, August 22, 2008 - 2:21 pm

Thanks Louis for all the comments. Will fix in v3.

Oren.

--

From: Dave Hansen
Date: Friday, August 22, 2008 - 1:37 pm

[Empty message]
From: Oren Laadan
Date: Saturday, August 23, 2008 - 10:40 pm

This classification eventually simplifies both dump and restore. For
instance, it decides whether a file name follows or not.

There will be more, later:  CR_VMA_FILE_UNLINKED (mapped to an unlinked
file), CR_VMA_ANON_SHARED (shared anonymous), CR_VMA_ANON_SHARED_SKIP

I figured the use of 'tag' for the identifiers of shared objects is clear.
By using a __u32 the size of that field is immediately visible, while using
a structure will hide the actual size; in turn this is what we want visible

Using __u32 instead of enum guarantees the size. I'll change the



It is my understanding that the code must not sleep between kmap_atomic()

Yes. I'll encapsulate that in it's own function.

Oren.

--

From: Dave Hansen
Date: Tuesday, August 26, 2008 - 9:33 am

I still don't see there being a need to explicitly specify the
distinction.  Why should a VMA mapping an unlinked file be any different
from a linked one?  It should point over to the 'file' checkpoint
structure and let the real work be done there.

There are no truly anonymous shared areas.  They anon ones are still
file-backed as far as the kernel is concerned.  If we do the file saving

Either way you stack it, I think 'tag' is a horrendous variable name.
What kind of tag?  What does it do?  What does it tag?  What does it
reference?  What values are valid in there?

If we have a separate structure, we simply make it clear that the
structure will get used all over the place, and that it, too is part of
the ABI.  Knowing the sizes isn't as important as knowing that the ABI

Exactly.  The code is copy-and-pasted.  If there's a bug in the
original, who will change this one?  Better to simply consolidate the

This is where the mentality has to shift.  Instead of thinking "there is
no exact in-kernel match for what I want here" we need to consider that
we can modify the in-kernel ones.  They can be modified to suit both

Yes, but you're going to absolutely kill performance on systems which
have expensive global TLB flushes.  Frequent kmaps()s should be avoided
at all costs.

The best way around this would be to change the interface to cr_kwrite()
so that it didn't have to use *mapped* kernel memory.  Maybe an
interface that takes 'struct page'.  Or, one where we kmap_atomic() the

-ENOSUP might be clearest for now.  "Your system call tried to do
something unsupported."

-- Dave

--

From: Oren Laadan
Date: Tuesday, August 26, 2008 - 5:14 pm

The classifications helps to make the code cleaner (and more readable). In
any case, you need at least to save a classifier that tells whether a shared
VMA is anonymous, or mapped to a file, or is an IPC shmem segment (and also
whether that IPC segment has been removed - a la unlinked): you simply don't
treat them equally; and on the restart side you can't re-test it on the VMA 
structure itself.

Given that we need a classifier anyway, re-using it to describe all VMAs and
not only distinguish the above (and maybe more) cases not only enhances the
readability of the code, but also allows to merge common code paths based on
the value of the classifier.

Probably you won't be convinced until I add the code to support all types of


I agree.

However, my main goal now is not to make this patch perfect, but to provide
a viable proof-of-concept that demonstrates how we want to do things. Unless
you feel we are near ready to send these for inclusion soon (...), I intend
to prioritize design and functionality.

I'll add a FIXME comment there. Of course, should you provide a patch to fix

Are you suggesting adding a new error code ?

Oren.


--

From: Dave Hansen
Date: Wednesday, August 27, 2008 - 8:41 am

So, you currently have buy-in on this basic approach from the OpenVZ
guys, and probably Ingo.  If you get too far along in the "design and
functionality" and a community member comes up with a really good
objection to some basic part of the "design and functionality" you're
going to have a lot of code to rewrite.

I'm trying to get minimized the quantity of "design and functionality"
down to the bare minimum set where we can get this merged.  So, yes, I
do think these should be sent for inclusion soon.

I believe that we're on course to create another large out-of-tree patch
set that does in-kernel checkpoint/restart.  We have no shortage of
those.  It's been done many times before.

This one will *HOPEFULLY* be different from all the rest when it gets

Dang.  What's the thing we return from system calls that are
unsupported?  Did I just dream that up?

-- Dave

--

From: Louis Rilling
Date: Wednesday, August 27, 2008 - 8:57 am

ENOSYS?

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Dave Hansen
Date: Wednesday, August 27, 2008 - 9:12 am

Brain not functioning, yet.  Thanks.

-- Dave

--

From: Jeremy Fitzhardinge
Date: Wednesday, August 27, 2008 - 9:19 am

Or EOPNOTSUPP - but that's documented as a network error ("Operation not
supported on transport endpoint").

    J
--

From: Serge E. Hallyn
Date: Wednesday, August 27, 2008 - 1:34 pm

It's true there are out-of-tree proofs-of-concept for in-kernel c/r, but
on the other hand it is nice seeing where Oren's code is going.  Based
on the patchsets you and he have been sending, the impression I've
gotten was that Oren was fleshing out the longer-term tree, while you
were working on the upstream-acceptable minimal patchset.  That seems
like a good model to me.  (Was I wrong about either of your intents?)

It does mean that much of Oren's patchset may end up having to be
re-written based on how Dave's patches look when they get upstream, but
I'm sure Oren knows that's par for the course and doesn't mind.

(Or, is that too much effort required on your part to try and
cherrypick bits of Oren's changes back into your tree?)

In any case, if that *is* the model, then I'd really like to see a
repost of your (Dave's) simplified patchset.  I think we need to
decide precisly which features we want to try to push first (I
thought we were not going to support fds?), throw out any extraneous
infrastructure, put the source for the one-and-only program that
it can checkpoint and restart in the patch description, and see what
feedback we get from the community.  The reason it'll be good to
have Oren continue on his own path is that you know we'll get
questions like "well how are you going to handle (X)", so then
we can point them to Oren's tree.

At least, that's how I was seeing it.

-serge
--

From: Dave Hansen
Date: Wednesday, August 27, 2008 - 1:38 pm

That would probably work as long as Oren is working on top of what I
have.  It's going to be a lot harder if I have to repeat the same
break-out process for each iteration of Oren's patches, though.

-- Dave

--

From: Serge E. Hallyn
Date: Wednesday, August 27, 2008 - 1:48 pm

If Oren's purpose is not quite to create a upstreamable patchset,
then it would make more sense for him to keep a git tree and
put new patches on top of his existing ones (within reaason as he
rebases).  Then you'd at least be able to trivially look at the latest
deltas.

-serge
--

From: Dave Hansen
Date: Wednesday, August 27, 2008 - 1:56 pm

The trick would be compromising on things that I, for instance, think
need to be rewritten or removed.

Oren would have to rebase his work against what I do.  I guess you could
think about it like me being upstream from Oren.  Anything that I would
change, Oren would need to rebase on top of.

Oren, are you willing to keep a set of patches that add functionality on
top of a minimal set that I'm keeping?  Mine being the set that we're
trying to push into mainline as soon as possible.

-- Dave

--

From: Oren Laadan
Date: Sunday, August 31, 2008 - 12:16 am

Dave, Serge:

I'm currently away so I must keep this short. I think we have so far
more discussion than an actual problem. I'm happy to coordinate with
every interested party to eventually see this work go into main stream.

My only concerns are twofold: first, to get more feedback I believe we
need to get the code a bit more usable; including FDs is an excellent
way to actually do that. That will add significant value to the patch.
I think it's important to demonstrate how shared resources and multiple
processes are handled. FDs demonstrate the former (with a fixed version
of the recent patchset - I will post soon). The latter will increase
the size of the patchset significantly, so perhaps can indeed wait for
now.

It should not be hard for me to add functionality on top of a more
basic patchset. The question is, what is "basic" ?  Anyway, I will be
back towards the end of the week. Let's try to discuss this over IRC
then (e.g. Friday afternoon ?).

Oren.

--

From: Cedric Le Goater
Date: Sunday, August 31, 2008 - 10:34 am

hmm, yes and no.

fd's are a must have but I would be more interested to see an external
checkpoint/restart and signal support first. why ? because it would be 
already usable for most computational programs in HPC, like this stupid 
one :

	https://www.ccs.uky.edu/~bmadhu/pi/pi1.c

signals are required because it's how 'load' and/or 'system' managers
interact with the jobs they spawn. external checkpoint/restart for the
same reason.

for files, I would first only care about stdios (make sure they're 
relinked to something safe on restart) and file states of regular files. 
contents is generally handled externally (deleted files being an annoying 
exception)

then, support for openmp application is a nice to have, so I'd probably 

shared resources are only useful in a multiprocess/multitask context. 
I'd start working on this first. here we jump directly in the pid namespace
issues, how we start a set of process in a pidnamespace ? how do we 
relink it to its parent pidnamespace ? are signals well propagated ? etc. 
but hey, we'll have to solve it one day.

FD's are shared but have many types which are pain to handle. (it would 
interesting to see if we can add checkpoint() and restart() operations in 
fileops) So, for shared resources demonstration, I'd work on sysvipc, 
there are less types to handle and they force us to think how we are going 

hmm, that depends how you do it.

If you restart all the hierarchy in the kernel, It will increase for sure 
the patch footprint. However if you restore the hierarchy from user space 
and then let each process restore itself from some binary blob, it should
not. This, of course, means that the binary blob representing the state of 
the container (we call it statefile) is not totally opaque. It see it a bit 
like  /proc, a directory containing shared states (all namespaces) and tasks 
states. That's something to discuss.

I do prefer the second option for many reasons: 

. each process restarts itself from its current ...
From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 4:43 am

I'm sorry for being out of discussion for so long time.
I've just sent a patchset with external checkpoint/restart as it is 
implemented in OpenVZ.
External checkpoint/restart also is very usefull if we will have container's 
infrastructure in mainstream.
It won't be very hard to add support for signals in my patchset. I'll wait for 
some time for comments and will add support for signals in next version.

Regards,
--

From: Cedric Le Goater
Date: Wednesday, September 3, 2008 - 5:15 am

OK.

Then, we should be able to have a simple C/R framework (user and kernel) if we
integrate your patchset or Oren's with Daniel's user tools :

	http://lxc.cvs.sourceforge.net/lxc/

I was just starting to work on Oren's patchset when you sent yours :) 

Thanks,

C.
--

From: Andrey Mirkin
Date: Wednesday, September 3, 2008 - 6:29 am

I will take a look on Daniel's user tools.

Thanks,
Andrey
--

From: Serge E. Hallyn
Date: Tuesday, September 2, 2008 - 8:32 am

Excellent.

Having two trees or branches, one very basic and nigh-upon useless, with
another advanced enough to be a valuable proof of concept to answer "oh


Sounds great.  (I may be taking the day off to paint, but will try to
monitor irc)

thanks,
-serge
--

From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:05 pm

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoffset() for each VMA and then read in the data.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  checkpoint/Makefile    |    2 +-
  checkpoint/ckpt_arch.h |    1 +
  checkpoint/restart.c   |    3 +
  checkpoint/rstr_mem.c  |  356 ++++++++++++++++++++++++++++++++++++++++++++++++
  checkpoint/rstr_x86.c  |   55 ++++++++
  5 files changed, 416 insertions(+), 1 deletions(-)
  create mode 100644 checkpoint/rstr_mem.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 032fc9f..41e0877 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,2 +1,2 @@
-obj-y += sys.o checkpoint.o restart.o ckpt_mem.o
+obj-y += sys.o checkpoint.o restart.o ckpt_mem.o rstr_mem.o
  obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
index 3b87a6f..ab7ac1c 100644
--- a/checkpoint/ckpt_arch.h
+++ b/checkpoint/ckpt_arch.h
@@ -6,3 +6,4 @@ int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag);

  int cr_read_thread(struct cr_ctx *ctx);
  int cr_read_cpu(struct cr_ctx *ctx);
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index a85f48b..81ce0a4 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -183,6 +183,9 @@ static int cr_read_task(struct cr_ctx *ctx)
  	ret = cr_read_task_struct(ctx);
  	cr_debug("task_struct: ret %d\n", ret);
  	if (!ret)
+		ret = cr_read_mm(ctx);
+	cr_debug("memory: ret %d\n", ret);
+	if (!ret)
  		ret = cr_read_thread(ctx);
  	cr_debug("thread: ret %d\n", ret);
  	if (!ret)
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..df602a9
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,356 @@
+/*
+ *  Restart memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ ...
From: Louis Rilling
Date: Thursday, August 21, 2008 - 3:07 am

Should down_write(&mm->mmap_sem) again here, and hold it until all vmas are
restored. This means removing down_write() from cr_vma_writable(). Or perha=
ps
make it finer grain: release it before looping on the vmas and make

Thanks,

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:04 pm

(Following Dave Hansen's refactoring of the original post)

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  checkpoint/Makefile     |    1 +
  checkpoint/checkpoint.c |    9 ++-
  checkpoint/ckpt_arch.h  |    7 ++
  checkpoint/ckpt_x86.c   |  190 +++++++++++++++++++++++++++++++++++++++++++++++
  checkpoint/restart.c    |   11 ++-
  checkpoint/rstr_x86.c   |  176 +++++++++++++++++++++++++++++++++++++++++++
  include/asm-x86/ckpt.h  |   72 ++++++++++++++++++
  7 files changed, 463 insertions(+), 3 deletions(-)
  create mode 100644 checkpoint/ckpt_arch.h
  create mode 100644 checkpoint/ckpt_x86.c
  create mode 100644 checkpoint/rstr_x86.c
  create mode 100644 include/asm-x86/ckpt.h

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d129878..29dbb2d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1 +1,2 @@
  obj-y += sys.o checkpoint.o restart.o
+obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 25343f5..949ed58 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,7 @@

  #include "ckpt.h"
  #include "ckpt_hdr.h"
+#include "ckpt_arch.h"

  /**
   * cr_fill_fname - return pathname of a given file
@@ -166,7 +167,13 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
  	}

  	ret = cr_write_task_struct(ctx, t);
-	cr_debug("ret %d\n", ret);
+	cr_debug("task_struct: ret %d\n", ret);
+	if (!ret)
+		ret = cr_write_thread(ctx, t);
+	cr_debug("thread: ret %d\n", ret);
+	if (!ret)
+		ret = cr_write_cpu(ctx, t);
+	cr_debug("cpu: ret %d\n", ret);

  	return ret;
  }
diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
new file mode 100644
index 0000000..b7cc8c9
--- /dev/null
+++ b/checkpoint/ckpt_arch.h
@@ ...
From: Dave Hansen
Date: Friday, August 22, 2008 - 1:17 pm

Well, I told you I'd write some Kconfig magic for you, but I can't since
I can't apply these patches. :)

You need this in a Kconfig file somewhere:

config CHECKPOINT_RESTART
	prompt "Good prompt"
	def_bool y
	depends on X86_32 && EXPERIMENTAL
	help
		Lots of fun help...

config SELECT_ALL_NAMESPACES
	prompt "Enable all kernel namespace support"
	selects IPC_NS NET_NS ...
	help
		This option will save you having to go track down
		all of the individual places that namespaces are
		supported in the kernel.

		If you don't turn this on, the restart porttion
		of checkpoint/restart gets a lots less reliable...	

probably in checkpoint/Kconfig.

Then, in all of the architectures' arch/*/Kconfig files, you need to:

source "checkpoint/Kconfig"

Then, the Makefile looks like this:

--- /dev/null   2008-04-22 10:49:52.000000000 -0700
+++ oren-cr.git-dave/checkpoint/Makefile        2008-08-22 12:30:53.000000000 -0700
@@ -0,0 +1 @@
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o

-- Dave

--

From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:06 pm

Covers application checkpoint/restart, overall design, interfaces
and checkpoint image format.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  Documentation/checkpoint.txt |  177 ++++++++++++++++++++++++++++++++++++++++++
  1 files changed, 177 insertions(+), 0 deletions(-)
  create mode 100644 Documentation/checkpoint.txt

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
new file mode 100644
index 0000000..fdc69cb
--- /dev/null
+++ b/Documentation/checkpoint.txt
@@ -0,0 +1,177 @@
+
+	=== Checkpoint-Restart support in the Linux kernel ===
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+Reviewers:
+
+Application checkpoint/restart [CR] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. CR can provide many potential benefits:
+
+* Failure recovery: by rolling back an to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off of faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relative opaque 'blob' of data to userspace which can
+then be handed to the ...
From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:04 pm

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

ckpt/sys.c - user/kernel data transfer, as well as setup of the
checkpoint/restart context (a per-checkpoint data structure for
housekeeping)

ckpt/checkpoint.c - output wrappers and basic checkpoint handling

ckpt/restart.c - input wrappers and basic restart handling

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  Makefile                |    2 +-
  checkpoint/Makefile     |    1 +
  checkpoint/checkpoint.c |  191 +++++++++++++++++++++++++++++++++++++++
  checkpoint/ckpt.h       |   71 +++++++++++++++
  checkpoint/ckpt_hdr.h   |   86 ++++++++++++++++++
  checkpoint/restart.c    |  199 +++++++++++++++++++++++++++++++++++++++++
  checkpoint/sys.c        |  227 +++++++++++++++++++++++++++++++++++++++++++++++
  7 files changed, 776 insertions(+), 1 deletions(-)
  create mode 100644 checkpoint/Makefile
  create mode 100644 checkpoint/checkpoint.c
  create mode 100644 checkpoint/ckpt.h
  create mode 100644 checkpoint/ckpt_hdr.h
  create mode 100644 checkpoint/restart.c
  create mode 100644 checkpoint/sys.c

diff --git a/Makefile b/Makefile
index f3e2065..584c8f9 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd


  ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/

  vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
  		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..d129878
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1 @@
+obj-y += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index ...
From: Louis Rilling
Date: Thursday, August 21, 2008 - 2:35 am

I thought we agreed that this counter should be per-container. Perhaps add a

I'm a bit puzzled by this assumption. I would say: either this is a
self-checkpoint (only current process), or this is a container checkpoint. =
In
the latter case, I expect that in the general case the checkpointer lives
outside the container (and the interface of sys_checkpoint() below confirms
this), so it's root fs is probably not the container's one.

Does it differ from what you're planning?

Thanks,


--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Oren Laadan
Date: Saturday, August 23, 2008 - 10:58 pm

You are correct. We lack infrastructure for what I'd call "container-object",
and the patchset does not yet tackle a container and multiple tasks, so this
is an interim solution. Will add a FIXME comment.

Thanks,

Oren.

--

From: Dave Hansen
Date: Friday, August 22, 2008 - 1:01 pm

These all need to be fixed up to Linux style:

	ret = = cr_kwrite(ctx, h, sizeof(*h)));
	if (ret < 0)
		...;

You might want to try checkpatch.pl on these.  It is pretty good at

We at least neex EXTRAVERSION in there if we're going to even pretend

alloc/delloc is not a bad thing. :)  I think at least a few of these




TASK_COMM_LEN might be subject to changing.  Can it be part of the
user<->kernel ABI?


I replaced all of the uses of these with kmalloc()/kfree() and stack
allocations.  I'm really not sure what these buy us since our allocators
are already so fast.  tbuf, especially, worries me if it gets used in

That ( ? : ) needs to get broken out so it is easier to read.  This
function is already nice and tight as it is, so saving a line or two

How about just making cr_read_obj_type() stop returning positive values?


Well, if TASK_COMM_LEN changed, then the size of the structure changed,

First of all, the unlikely() needs to die.  This isn't
performance-critical code, and I don't see any justification for why it
is needed.

Also, errors like -EAGAIN are just fine and normal to come back from a
write to a file.  We shoudn't be erroring out on them.  (Note that this
is something that userspace would be handling by itself if we were doing

Any particular reason you're using __get_free_pages() here?  kmalloc()
would save you a cast, and having to deal with keeping "order" vs. a
plain size.  Also, why don't you just stick tbuf and hbuf's allocation
into the ctx structure itself for now?  It'll save some of the error
paths.  Yeah, it might become an order 2 or 3 allocation, but that

Should we just combine most of these two functions?  If we passed a fd
into cr_ctx_alloc(), it could do the fget/put_light() bits, and we
wouldn't need the code in both sys_checkpoint() and sys_restart().

-- Dave

--

From: Oren Laadan
Date: Sunday, August 24, 2008 - 7:47 pm

It isn't clear exactly what we need in the header in terms of version, config,
and compile information. I put this as a starter, because it's obvious and
because this info is likely to be useful even at the user/admin level. I don't
expect this to remain as is as we continue development.



This is a temporary solution since this patchset doesn't really handle real

These are flags (more will come later).


Good point. Having the 'task_comm_len' field there isn't enough because the
size of the entire struct changes anyways.


I disagree with putting some of the cr_hdr_... on the stack: first, if they
grow in size, it's easy to forget to change the allocation to stack. Second,
using a standard code/handling for all cr_hdr_... makes the code more readable
and easier for others to extend by simply following existing code.

The main argument for ->hbuf is that eventually we would like to buffer
data in the kernel to avoid potentially slow writing/reading when processes
are frozen during checkpoint/restart. By using the simple cr_get_hbuf() and
cr_put_hbuf() we prepare for that: now they just allocate room in ->hbuf,
but later they will provide a pointer directly in the data buffer. Moreover,
it simplifies the error path since cleanup (of ->hbuf) is implicit. Also,
it is designed to allow checkpoint and restart function to be called in a
nested manner, again simplifying the code. Finally, my experience was that
it can impact performance if you are after very short downtimes, especially
for the checkpoint.

The use of ->tbuf is less critical and is mainly for performance and to
simplify the error path, never to be called nested. I'm ok with removing
it for now.


The idea with cr_read_obj_type() is that it returns the 'ptag' - parent tag -
field of the object that it reads in. The 'ptag' is the tag of the parent
object, and is useful in several places. For instance, the 'ptag' of an MM
header identifies (is equal to) the 'tag' of TASK to which it belongs. In
this ...
From: Dave Hansen
Date: Tuesday, August 26, 2008 - 9:42 am

It actually makes it harder for others to jump into because they see
something which is basically a kmalloc() to the rest of the kernel.  We
don't have ext2_alloc() or usb_alloc(), so I'm not sure we should have

Um, we're writing to a file descriptor and kmap()'ing.  Those can both
potentially be very, very slow.  I don't think that a few kmalloc()s

It simplifies *one* error path.  But, it complicates the ctx creation

Right, but I'm comparing it to kmalloc()  Certainly kmalloc()s can be

Since this ptag isn't actually used, I can't really offer a suggestion.
I don't see the whole picture.

-- Dave

--

From: Oren Laadan
Date: Tuesday, August 26, 2008 - 5:38 pm

I disagree. It's more than "allocate memory", it's: "give me some room on
the buffer". When we don't care about downtime (like now), it's enough to
kmalloc a temporary buffer and flush (write to the FD) immediately.

But when we will care about downtime, "the buffer" will be memory in kernel
where the entire checkpoint image data will be kept while the container is
frozen (memory contents need only be marked copy-on-write, not explicitly
buffered).

In this setting, cr_kwrite() will not really write the data to the FD, but
only record somewhere that the data exists. Only after the container resumes
execution will we eventually write the data to the FD and free the buffer.

(This is also useful in case you want to keep the checkpoint image entirely
in memory without flushing to any FD; and yes, there are use-cases for that).

So for this, instead of using kmalloc() to allocate a temp-buf, fill it with
data, and then copy that data to the actual (big) buffer, the mechanism
provides a shortcut to allocate space directly on the buffer, and save the

But you miss the point: the writing to the file descriptor in the (future)
optimized version will _not_ happen while the container is frozen. Only
much later after the container resumes, so at the end it will be:
(1) freeze container (2) dump data to in-kernel buffer (mark dirty pages
copy-on-write) (3) unfreeze container (4) write in-kernel buffer to FD.

By deferring step #4 until after the freeze-period, you can save tens and
hundreds of milliseconds, and seconds. Now the goal is to make step #2 be
as fast as possible, and every millisecond (and less) counts, e.g. to take
fast checkpoints of interactive desktops. And my experience is that in
such "workload" repetitive kmalloc/kfree inside that dump loop has a
measurable impact on performance (of course, for those objects which are

Actually it simplifies *many* error paths -- in *all* cr_write_...()

You can see how it's used in the code that dumps files. Or look at the ...
From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:07 pm

Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
and lookup tag in the hash table; if not found (first occurence), read
in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
Otherwise attach the file pointer from the hash as an FD.

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  checkpoint/Makefile     |    2 +-
  checkpoint/checkpoint.c |    3 +
  checkpoint/ckpt.h       |    6 +-
  checkpoint/restart.c    |    3 +
  checkpoint/rstr_file.c  |  202 +++++++++++++++++++++++++++++++++++++++++++++++
  5 files changed, 213 insertions(+), 3 deletions(-)
  create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 179175b..fd073cd 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,3 +1,3 @@
  obj-y += sys.o checkpoint.o restart.o objhash.o \
-	ckpt_mem.o rstr_mem.o ckpt_file.o
+	ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
  obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index bf868ae..fe30ebb 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -172,6 +172,9 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
  		ret = cr_write_mm(ctx, t);
  	cr_debug("memory: ret %d\n", ret);
  	if (!ret)
+		ret = cr_write_files(ctx, t);
+	cr_debug("files: ret %d\n", ret);
+	if (!ret)
  		ret = cr_write_thread(ctx, t);
  	cr_debug("thread: ret %d\n", ret);
  	if (!ret)
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
index ef2f74d..b83dea1 100644
--- a/checkpoint/ckpt.h
+++ b/checkpoint/ckpt.h
@@ -83,11 +83,13 @@ int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
  int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
  int cr_read_str(struct cr_ctx *ctx, void *str, int n);

+int do_checkpoint(struct cr_ctx *ctx);
  int cr_write_mm(struct cr_ctx *ctx, ...
From: Oren Laadan
Date: Wednesday, August 20, 2008 - 10:26 pm

To restore the open files, cr_read_files() first closes all the
existing FDs of the current process. This breaks the assumption
underlying the use of fget_light/fput_light in sys_restart(), so
use fget/fput instead.


diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 7b2670a..ec1609b 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -210,10 +210,9 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
  {
  	struct cr_ctx *ctx;
  	struct file *file;
-	int fput_needed;
  	int ret;

-	file = fget_light(fd, &fput_needed);
+	file = fget(fd);
  	if (!file)
  		return -EBADF;

@@ -223,14 +222,14 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)

  	ctx = cr_ctx_alloc(crid, file, flags | CR_CTX_RSTR);
  	if (!ctx) {
-		fput_light(file, fput_needed);
+		fput(file);
  		return -ENOMEM;
  	}

  	ret = do_restart(ctx);

  	cr_ctx_free(ctx);
-	fput_light(file, fput_needed);
+	fput(file);

  	return ret;
  }
--

From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:06 pm

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e..g open files, memory address space
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier and also
stored in a hash table (indexed by its physical kenrel address). From
then on the object will be found in the hash and only its identifier is
saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  Documentation/checkpoint.txt |   44 ++++++++++
  checkpoint/Makefile          |    2 +-
  checkpoint/ckpt.h            |   18 ++++
  checkpoint/objhash.c         |  193 ++++++++++++++++++++++++++++++++++++++++++
  checkpoint/sys.c             |    4 +
  5 files changed, 260 insertions(+), 1 deletions(-)
  create mode 100644 checkpoint/objhash.c

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
index fdc69cb..923ec10 100644
--- a/Documentation/checkpoint.txt
+++ b/Documentation/checkpoint.txt
@@ -163,6 +163,50 @@ cr_hdr + cr_hdr_task
  cr_hdr + cr_hdr_tail


+=== Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects in the following manner.
+
+On the first encounter, the state is dumped and the object is assigned
+a unique identifier and also stored in the hash table (indexed by its
+physical kenrel address). From then on the object will be found in ...
From: Louis Rilling
Date: Thursday, August 21, 2008 - 3:40 am

^
    |

Only true when PAGE_SIZE =3D=3D 4K and in 32bits. Perhaps like below?

#define CR_OBJHASH_BUCKET_NBITS (BITS_PER_LONG =3D=3D 64 ? 3 : 2)
#define CR_MIN_OBJHASH_NBITS ((PAGE_SHIFT - CR_OBJHASH_BUCKET_NBITS)
#define CR_OBJHASH_NBITS (CR_MIN_OBJHASH_NBITS >=3D 10 ? CR_MIN_OBJHASH_NBI=
TS : 10)
#define CR_OBJHASH_ORDER (CR_OBJHASH_NBITS + CR_OBJHASH_BUCKET_NBITS - PAGE=
_SHIFT)

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Dave Hansen
Date: Tuesday, August 26, 2008 - 10:01 am

Gah.  Can't we just use the existing radix tree or hash implementations
we already have?

-- Dave

--

From: Louis Rilling
Date: Wednesday, August 27, 2008 - 1:26 am

AFAICS, all hash implementations work like this (I'm referring to pid hashes
and compat_ioctl hashes for instance). The only common code is the hash
functions, which Oren seems to have used.

Radix trees may be an alternative, given that object tags are sequentially
generated in some way. I have no idea about the simplicity we can gain thou=
gh.

Louis

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:07 pm

Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Since FDs can be shared, they are assigned a
tag and registered in the object hash.

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag
and its close-on-exec property. If the FD is to be saved (first time)
then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
Then will come the next FD and so on.

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  checkpoint/Makefile    |    3 +-
  checkpoint/ckpt.h      |    2 +-
  checkpoint/ckpt_file.c |  234 ++++++++++++++++++++++++++++++++++++++++++++++++
  checkpoint/ckpt_file.h |   17 ++++
  checkpoint/ckpt_hdr.h  |   31 +++++++
  5 files changed, 285 insertions(+), 2 deletions(-)
  create mode 100644 checkpoint/ckpt_file.c
  create mode 100644 checkpoint/ckpt_file.h

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index cd57d9d..179175b 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,2 +1,3 @@
-obj-y += sys.o checkpoint.o restart.o objhash.o ckpt_mem.o rstr_mem.o
+obj-y += sys.o checkpoint.o restart.o objhash.o \
+	ckpt_mem.o rstr_mem.o ckpt_file.o
  obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
index 8b02c4c..ef2f74d 100644
--- a/checkpoint/ckpt.h
+++ b/checkpoint/ckpt.h
@@ -13,7 +13,7 @@
  #include <linux/path.h>
  #include <linux/fs.h>

-#define CR_VERSION  1
+#define CR_VERSION  2

  struct cr_ctx {
  	pid_t pid;		/* container identifier */
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..18faaf1
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,234 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main ...
From: Louis Rilling
Date: Thursday, August 21, 2008 - 4:06 am

Why not simply use fcheck_files(files, n) and check if the result is not NU=

You should probably use fcheck_files() instead of copying its code here. I =
agree
that this means dereferencing files->fdt a second time below, but is it so
    ^
    ^



We are not at a 64bits boundary here. Should add one __32 padding or reorde=
r the

--=20
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
From: Oren Laadan
Date: Sunday, August 24, 2008 - 8:28 pm

Actually this is ok - note there are two fields on fourth row. I'll
change to one field per row so it's easily seen.

Thanks,

Oren.

--

From: Louis Rilling
Date: Monday, August 25, 2008 - 3:30 am

Have to revise my eyes, really ;) Sorry for the noise!

Louis

--=20
Dr Louis Rilling			Kerlabs - IRISA
Skype: louis.rilling			Campus Universitaire de Beaulieu
Phone: (+33|0) 2 99 84 71 52		Avenue du General Leclerc
Fax: (+33|0) 2 99 84 71 71		35042 Rennes CEDEX - France
http://www.kerlabs.com/
From: Oren Laadan
Date: Wednesday, August 20, 2008 - 10:15 pm

Subject line should have been 's/PATCH 1/PATCH 0/' ...

I left cr_debug() in for now to provide more info than pr_debug();
eventually that will be changed back to pr_debug()

In the mini-conference we considered doing CR in a kernel module,
but decided against because we needed a system call. It is still
possible to put the bulk of the code in a module. This is useful,
besides reducing debug time (recompile, unload, reload), to reduce
the kernel memory footprint. Also, an administrator can load/unload
the module to enable/disable this feature. Any thoughts ?

Oren.


--

Previous thread: Linux v2.6.27-rc4 by Linus Torvalds on Wednesday, August 20, 2008 - 7:52 pm. (1 message)

Next thread: [PATCH 2.6.27-rc4 1/27] mdb: export genapic and machine_restart functions by jmerkey on Wednesday, August 20, 2008 - 7:50 pm. (1 message)