Hi Andrew,
Following up on the thread on the checkpoint-restart patch set
(http://lkml.org/lkml/2010/3/1/422), the following series is the
latest checkpoint/restart patch set, based on 2.6.33.
The first 20 patches are cleanups and preparation for c/r; they
are followed by the actual c/r code.
Please apply to -mm, and let us know if there is any way we can
help.
Thanks,
Oren.
---
Linux Checkpoint-Restart:
web, wiki: http://www.linux-cr.org
bug track: https://www.linux-cr.org/redmine
The repositories for the project are in:
kernel: http://www.linux-cr.org/git/?p=linux-cr.git;a=summary
user tools: http://www.linux-cr.org/git/?p=user-cr.git;a=summary
tests suite: http://www.linux-cr.org/git/?p=tests-cr.git;a=summary
---
CHANGELOG:
[2010-Mar-16] v20
BUG FIXES (only)
- [Serge Hallyn] Fix unlabeled restore case
- [Serge Hallyn] Always restore msg_msg label
- [Serge Hallyn] Selinux prevents msgrcv on restore message queues?
- [Serge Hallyn] save_access_regs for self-checkpoint
- [Serge Hallyn] send uses_interp=1 to arch_setup_additional_pages
- Fix "scheduling in atomic" while restoring ipc (sem, shm, msg)
- Cleanup: no need to restore perm->{id,key,seq}
- Fix sysvipc=n compile
- Make uts_ns=n compile
- Only use arch_setup_additional_pages() if supported by arch
- Export key symbols to enable c/r from kernel modules
- Avoid crash if incoming object doesn't have .restore
- Replace error_sem with an event completion
- [Serge Hallyn] Change sysctl and default for unprivileged use
- [Nathan Lynch] Use syscall_get_error
- Add entry for checkpoint/restart in MAINTAINERS
[2010-Feb-19] v19
NEW FEATURES
- Support for x86-64 architecture
- Support for c/r of LSM (smack, selinux)
- Support for c/r of task fs_root and pwd
- Support for c/r of epoll
- Support for c/r of eventfd
- Enable C/R while executing over NFS
- Preliminary c/r of mounts namespace
- Add @logfd argument to sys_{checkpoint,restart} prototypes
...
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
do_fork_with_pids() is the same as do_fork(), except that it takes an
additional parameter, 'pid_set'. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.
Changelog[v7]:
- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
instead of 'struct pid_set *'.
Changelog[v6]:
- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
Change 'pid_set.pids' to a 'pid_t pids[]' so size of
'struct pid_set' is constant across architectures.
- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.
Changelog[v4]:
- Rename 'struct target_pid_set' to 'struct pid_set' since it may
be useful in other contexts.
Changelog[v3]:
- Fix "long-line" warning from checkpatch.pl
Changelog[v2]:
- To facilitate moving architecture-independent code to kernel/fork.c
pass in 'struct target_pid_set __user *' to do_fork_with_pids()
rather than 'pid_t *' (next patch moves the arch-independent
code to kernel/fork.c)
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
include/linux/sched.h | 3 +++
kernel/fork.c | 17 +++++++++++++++--
2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d57eab8..4f079f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2189,6 +2189,9 @@ extern int disallow_signal(int);
extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+ unsigned long, int __user *, int __user *,
+ unsigned int, pid_t __user *);
struct ...
From: Alexey Dobriyan <adobriyan@gmail.com>
Add a "start" argument to request that the vDSO be mapped at a specific
address, and fail the operation if that is not possible. This is useful
for restart(2) to ensure that the memory layout is restored exactly as
needed.
Changelog[v19]:
- [serge hallyn] Fix potential use-before-set ret
Changelog[v2]:
- [ntl] powerpc: vdso build fix (ckpt-v17)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
arch/powerpc/include/asm/elf.h | 1 +
arch/powerpc/kernel/vdso.c | 13 ++++++++++++-
arch/s390/include/asm/elf.h | 2 +-
arch/s390/kernel/vdso.c | 13 ++++++++++++-
arch/sh/include/asm/elf.h | 1 +
arch/sh/kernel/vsyscall/vsyscall.c | 2 +-
arch/x86/include/asm/elf.h | 3 ++-
arch/x86/vdso/vdso32-setup.c | 9 +++++++--
arch/x86/vdso/vma.c | 11 ++++++++---
fs/binfmt_elf.c | 2 +-
10 files changed, 46 insertions(+), 11 deletions(-)
diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index c376eda..0b06255 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -266,6 +266,7 @@ extern int ucache_bsize;
#define ARCH_HAS_SETUP_ADDITIONAL_PAGES
struct linux_binprm;
extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+ unsigned long start,
int uses_interp);
#define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b);
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index d84d192..74210ab 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -188,7 +188,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
* This is called from binfmt_elf, we create the special vma for the
* vDSO and insert it into the mm struct tree
*/
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm ...
From: Serge E. Hallyn <serue@us.ibm.com>
When restarting tasks, we want to be able to change xuid and
xgid in a struct cred, and do so with security checks. Break
the core functionality of set{fs,res}{u,g}id into cred_setX
which performs the access checks based on current_cred(),
but performs the requested change on a passed-in cred.
This will allow us to securely construct struct creds based
on a checkpoint image, constrained by the caller's permissions,
and apply them to the caller at the end of sys_restart().
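The split described above (permission checks against the caller, changes applied to a separate cred) can be sketched in plain userspace C. This is only an illustrative model: the model_* names, the simplified struct and the cap_setuid flag are assumptions made for the example and are not the kernel's cred_setresuid() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
typedef unsigned int model_uid_t;
#define MODEL_NO_CHANGE ((model_uid_t)-1)

struct model_cred {
	model_uid_t uid, euid, suid;
	bool cap_setuid;	/* stands in for a CAP_SETUID check */
};

/* An id is acceptable if unchanged, if the caller is privileged, or if
 * it matches one of the caller's own real/effective/saved ids. */
static bool model_uid_allowed(const struct model_cred *caller, model_uid_t id)
{
	return id == MODEL_NO_CHANGE || caller->cap_setuid ||
	       id == caller->uid || id == caller->euid || id == caller->suid;
}

/* Checks are made against @caller; the change is applied to @new only. */
static int model_cred_setresuid(const struct model_cred *caller,
				struct model_cred *new, model_uid_t ruid,
				model_uid_t euid, model_uid_t suid)
{
	if (!model_uid_allowed(caller, ruid) ||
	    !model_uid_allowed(caller, euid) ||
	    !model_uid_allowed(caller, suid))
		return -EPERM;	/* @new is left untouched on failure */

	if (ruid != MODEL_NO_CHANGE)
		new->uid = ruid;
	if (euid != MODEL_NO_CHANGE)
		new->euid = euid;
	if (suid != MODEL_NO_CHANGE)
		new->suid = suid;
	return 0;
}
```

In this model an unprivileged caller cannot build a cred with ids it could not have obtained through setresuid() itself, which is the constraint the description asks for when reconstructing creds from a checkpoint image.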
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
include/linux/cred.h | 8 +++
kernel/cred.c | 114 ++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 134 ++++++++------------------------------------------
3 files changed, 143 insertions(+), 113 deletions(-)
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4e3387a..e35631e 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -22,6 +22,9 @@ struct user_struct;
struct cred;
struct inode;
+/* defined in sys.c, used in cred_setresuid */
+extern int set_user(struct cred *new);
+
/*
* COW Supplementary groups list
*/
@@ -396,4 +399,9 @@ do { \
*(_fsgid) = __cred->fsgid; \
} while(0)
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid);
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, gid_t sgid);
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid);
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid);
+
#endif /* _LINUX_CRED_H */
diff --git a/kernel/cred.c b/kernel/cred.c
index 1ed8ca1..1fefcb1 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -890,3 +890,117 @@ void validate_creds_for_do_exit(struct task_struct *tsk)
}
#endif /* CONFIG_DEBUG_CREDENTIALS */
+
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid)
+{
+ int retval;
+ const struct cred *old;
+
+ retval = ...
From: Matt Helsley <matthltc@us.ibm.com>
When the cgroup freezer is used to freeze tasks we do not want to thaw
those tasks during resume. Currently we test the cgroup freezer
state of the resuming tasks to see if the cgroup is FROZEN. If so
then we don't thaw the task. However, the FREEZING state also indicates
that the task should remain frozen.
This also avoids a problem pointed out by Oren Laadan: the freezer state
transition from FREEZING to FROZEN is updated lazily when userspace reads
or writes the freezer.state file in the cgroup filesystem. This means that
resume will thaw tasks in cgroups which should be in the FROZEN state if
there is no read/write of the freezer.state file to trigger this
transition before suspend.
NOTE: Another "simple" solution would be to always update the cgroup
freezer state during resume. However, it's a bad choice for several reasons:
Updating the cgroup freezer state is somewhat expensive because it requires
walking all the tasks in the cgroup and checking if they are each frozen.
Worse, this could easily make resume run in N^2 time where N is the number
of tasks in the cgroup. Finally, updating the freezer state from this code
path requires trickier locking because of the way locks must be ordered.
Instead of updating the freezer state we rely on the fact that lazy
updates only manage the transition from FREEZING to FROZEN. We know that
a cgroup with the FREEZING state may actually be FROZEN so test for that
state too. This makes sense in the resume path even for partially-frozen
cgroups -- those that really are FREEZING but not FROZEN.
Reported-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Cedric Le Goater <legoater@free.fr>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-pm@lists.linux-foundation.org
Seems like a candidate for -stable.
---
include/linux/freezer.h | 7 ...
From: Matt Helsley <matthltc@us.ibm.com>
The CHECKPOINTING state prevents userspace from unfreezing tasks until
sys_checkpoint() is finished. When doing container checkpoint userspace
will do:
echo FROZEN > /cgroups/my_container/freezer.state
...
rc = sys_checkpoint( <pid of container root> );
To ensure a consistent checkpoint image userspace should not be allowed
to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state)
during checkpoint.
"CHECKPOINTING" can only be set on a "FROZEN" cgroup using the checkpoint
system call. Once in the "CHECKPOINTING" state, the cgroup may not leave until
the checkpoint system call is finished and ready to return. Then the
freezer state returns to "FROZEN". Writing any new state to freezer.state while
checkpointing will return EBUSY. These semantics ensure that userspace cannot
unfreeze the cgroup midway through the checkpoint system call.
The cgroup_freezer_begin_checkpoint() and cgroup_freezer_end_checkpoint()
functions make relatively few assumptions about the task that is passed in.
However, the
way they are called in do_checkpoint() assumes that the root of the container
is in the same freezer cgroup as all the other tasks that will be
checkpointed.
Notes:
As a side effect, this prevents multiple tasks from entering the
CHECKPOINTING state simultaneously. All but one will get -EBUSY.
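The state rules above can be summarized as a small userspace state machine. This is a sketch only: the model_* names are hypothetical, the -EBUSY returned for a cgroup that is not FROZEN and the -EINVAL for unsupported writes are assumptions, and accepted writes jump straight to the requested state instead of passing through FREEZING.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the freezer states discussed in this patch. */
enum model_freezer_state {
	MODEL_THAWED,
	MODEL_FREEZING,
	MODEL_FROZEN,
	MODEL_CHECKPOINTING,
};

/* Models entry into sys_checkpoint(): only a FROZEN cgroup may proceed. */
static int model_begin_checkpoint(enum model_freezer_state *state)
{
	if (*state != MODEL_FROZEN)
		return -EBUSY;	/* assumed error code for non-FROZEN states */
	*state = MODEL_CHECKPOINTING;
	return 0;
}

/* Models the end of sys_checkpoint(): the cgroup returns to FROZEN. */
static void model_end_checkpoint(enum model_freezer_state *state)
{
	if (*state == MODEL_CHECKPOINTING)
		*state = MODEL_FROZEN;
}

/* Models a userspace write to freezer.state. */
static int model_write_state(enum model_freezer_state *state,
			     enum model_freezer_state wanted)
{
	if (*state == MODEL_CHECKPOINTING)
		return -EBUSY;	/* no state change while checkpointing */
	if (wanted != MODEL_THAWED && wanted != MODEL_FROZEN)
		return -EINVAL;	/* assumed: only these two may be written */
	*state = wanted;	/* simplification: skips the FREEZING phase */
	return 0;
}
```

A second model_begin_checkpoint() on the same cgroup finds it in CHECKPOINTING rather than FROZEN and fails, which mirrors the side effect mentioned in the notes.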
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Cedric Le Goater <legoater@free.fr>
---
Documentation/cgroups/freezer-subsystem.txt | 10 ++
include/linux/freezer.h | 8 ++
kernel/cgroup_freezer.c | 166 ++++++++++++++++++++-------
3 files changed, 142 insertions(+), 42 deletions(-)
diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
index 41f37fe..92b68e6 100644
--- ...
From: Dave Hansen <dave@linux.vnet.ibm.com>
Let's not steal too much space in the 'General Setup' menu.
Take a cue from the cgroups code and create a submenu.
This can go upstream now.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
---
init/Kconfig | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig
index d95ca7c..0c00a78 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -668,7 +668,7 @@ config RELAY
If unsure, say N.
-config NAMESPACES
+menuconfig NAMESPACES
bool "Namespaces support" if EMBEDDED
default !EMBEDDED
help
--
1.6.3.3
--
These two are used in the next patch when calling vfs_read/write()
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
---
fs/read_write.c | 10 ----------
include/linux/fs.h | 10 ++++++++++
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index b7f4a1f..e258301 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
EXPORT_SYMBOL(vfs_write);
-static inline loff_t file_pos_read(struct file *file)
-{
- return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
- file->f_pos = pos;
-}
-
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ebb1cd5..6c08df2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1543,6 +1543,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
struct iovec *fast_pointer,
struct iovec **ret_pointer);
+static inline loff_t file_pos_read(struct file *file)
+{
+ return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+ file->f_pos = pos;
+}
+
extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
--
1.6.3.3
--
Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.
In addition, architecture capabilities are saved in an architecture-
specific extension of the header (ckpt_hdr_head_arch); currently this
includes only FPU capabilities.
Currently only x86-32 is supported.
Changelog[v19]:
- [Serge Hallyn] Use ckpt_err() for arch incompatibilities
Changelog[v19-rc3]:
- Rebase to kernel 2.6.33:
* Use PTREGSCALL4 for sys_{checkpoint,restart}
* Remove debug-reg support (need to redo with perf_events)
- [Serge Hallyn] Support for ia32 (checkpoint, restart)
- Split arch/x86/checkpoint.c to generic and 32bit specific parts
- sys_{checkpoint,restart} to use ptregs
Changelog[v19-rc1]:
- Fix up headers so we can munge them for use by userspace
- [Matt Helsley] Add cpp definitions for enums
- Allow X86_EFLAGS_RF on restart
Changelog[v17]:
- Fix compilation for architectures that don't support checkpoint
- Validate cpu registers and TLS descriptors on restart
- Validate debug registers on restart
- Export asm/checkpoint_hdr.h to userspace
Changelog[v16]:
- All objects are preceded by ckpt_hdr (TLS and xstate_buf)
- Add architecture identifier to main header
Changelog[v14]:
- Use new interface ckpt_hdr_get/put()
- Embed struct ckpt_hdr in struct ckpt_hdr...
- Remove preempt_disable/enable() around init_fpu() and fix leak
- Revert change to pr_debug(), back to ckpt_debug()
- Move code related to task_struct to checkpoint/process.c
Changelog[v12]:
- A couple of missed calls to ckpt_hbuf_put()
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
- Add arch-specific header that details architecture capabilities;
split FPU restore to send capabilities only once.
- Test for zero TLS entries in ckpt_write_thread()
- Fix asm/checkpoint_hdr.h so it can be included from user-space
Changelog[v7]:
- Fix save/restore state of FPU
Changelog[v5]:
- Remove ...
From: Matt Helsley <matthltc@us.ibm.com>
The robust futex lists record which futexes the task holds. To keep the
overhead of robust futexes low, the list is kept in userspace. When the
task exits, the kernel carefully walks these lists to recover held futexes
that other tasks may be attempting to acquire with FUTEX_WAIT.
Because they point to userspace memory that is saved/restored by
checkpoint/restart, saving the list pointers themselves is safe.
While saving the pointers is safe during checkpoint, restart is tricky
because the robust futex ABI contains provisions for changes based on
checking the size of the list head. So we need to save the length of
the list head too in order to make sure that the kernel used during
restart is capable of handling that ABI. Since there is only one ABI
supported at the moment, taking the list head's size is simple. Should
the ABI change, we will need to use the same size as specified during
sys_set_robust_list(), and hence some new means of determining the length
of this userspace structure in sys_checkpoint would be required.
Rather than rewrite the logic that checks and handles the ABI, we reuse
sys_set_robust_list() by factoring out the body of the function and
calling it during restart.
Changelog [v19]:
- Keep __u32s in even groups for 32-64 bit compatibility
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
[orenl@cs.columbia.edu: move save/restore code to checkpoint/process.c]
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/process.c | 49 ++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint_hdr.h | 5 ++++
include/linux/compat.h | 3 +-
include/linux/futex.h | 1 +
kernel/futex.c | 19 +++++++++-----
kernel/futex_compat.c | 13 ++++++++--
6 files changed, 79 insertions(+), 11 deletions(-)
diff --git a/checkpoint/process.c b/checkpoint/process.c
index c47dea1..f36e320 100644
--- ...
Add an interface to postpone an action until the end of the entire
checkpoint or restart operation. This is useful when during the
scan of tasks an operation cannot be performed in place, to avoid
the need for a second scan.
One use case is restoring an ipc shared memory region that has
been deleted (but is still attached): during restart it needs to be
created, attached and then deleted. However, creation and attachment
are performed in distinct locations, so deletion can not be performed
on the spot. Instead, this work (delete) is deferred until later.
(This example is in one of the following patches).
This interface allows chronic procrastination in the kernel:
deferqueue_create(void):
Allocates and returns a new deferqueue.
deferqueue_run(deferqueue):
Executes all the pending works in the queue. Returns the number
of works executed, or an error upon the first error reported by
a deferred work.
deferqueue_add(deferqueue, data, size, func, dtor):
Enqueue a deferred work. @func is the callback function to
do the work, which will be called with @data as an argument.
@size tells the size of data. @dtor is a destructor callback
that is invoked for deferred works remaining in the queue when
the queue is destroyed. NOTE: for a given deferred work, @dtor
is _not_ called if @func was already called (regardless of the
return value of the latter).
deferqueue_destroy(deferqueue):
Free the deferqueue and any queued items while invoking the
@dtor callback for each queued item.
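The semantics listed above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel code: the model_* names and callback type are hypothetical, the queue copies @data (suggested by the @size argument), works run in FIFO order, and a failing work is consumed while later works stay queued for the destructor.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model of the deferqueue semantics. */
typedef int (*model_dq_func_t)(void *data);

struct model_dq_entry {
	struct model_dq_entry *next;
	model_dq_func_t func;
	model_dq_func_t dtor;
	char data[];		/* private copy of the caller's data */
};

struct model_deferqueue {
	struct model_dq_entry *head;
	struct model_dq_entry **tail;
};

static struct model_deferqueue *model_deferqueue_create(void)
{
	struct model_deferqueue *q = malloc(sizeof(*q));

	if (q) {
		q->head = NULL;
		q->tail = &q->head;
	}
	return q;
}

static int model_deferqueue_add(struct model_deferqueue *q, const void *data,
				size_t size, model_dq_func_t func,
				model_dq_func_t dtor)
{
	struct model_dq_entry *e = malloc(sizeof(*e) + size);

	if (!e)
		return -1;
	e->next = NULL;
	e->func = func;
	e->dtor = dtor;
	if (size)
		memcpy(e->data, data, size);
	*q->tail = e;		/* append: assumed FIFO order */
	q->tail = &e->next;
	return 0;
}

/* Returns the number of works executed, or the first negative error. */
static int model_deferqueue_run(struct model_deferqueue *q)
{
	int nr = 0;

	while (q->head) {
		struct model_dq_entry *e = q->head;
		int ret;

		q->head = e->next;
		if (!q->head)
			q->tail = &q->head;
		ret = e->func(e->data);
		free(e);	/* func was called, so dtor is skipped */
		if (ret < 0)
			return ret;
		nr++;
	}
	return nr;
}

/* Calls dtor only for works that never ran, then frees the queue. */
static void model_deferqueue_destroy(struct model_deferqueue *q)
{
	while (q->head) {
		struct model_dq_entry *e = q->head;

		q->head = e->next;
		if (e->dtor)
			e->dtor(e->data);
		free(e);
	}
	free(q);
}

/* Example callbacks used to exercise the model. */
static int model_total;
static int model_dtor_calls;

static int model_add_to_total(void *data)
{
	model_total += *(int *)data;
	return 0;
}

static int model_count_dtor(void *data)
{
	(void)data;
	model_dtor_calls++;
	return 0;
}
```

Queuing two works and running the queue executes both without touching their destructors; a work added afterwards and never run is cleaned up only by its destructor when the queue is destroyed.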
Why aren't we using the existing kernel workqueue mechanism? We need
to defer the work until the end of the operation: not earlier, since we
need other things to be in place; not later, to not block waiting for
it. However, the workqueue schedules the work for 'some time later'.
Also, the kernel workqueue may run in any task context, but we often
require that an operation be run in the context of some specific
restarting task (e.g., ...
Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the vma state and contents.
Call do_mmap_pgoff() for each vma and then read in the data.
Changelog[v20]:
- Only use arch_setup_additional_pages() if supported by arch
Changelog[v19]:
- [Serge Hallyn] do_munmap(): remove unused local vars
- [Serge Hallyn] Checkpoint saved_auxv as u64s
Changelog[v19-rc3]:
- [Serge Hallyn] move destroy_mm into mmap.c and remove size check
- [Serge Hallyn] fill vdso (syscall32_setup_pages) for TIF_IA32/x86_64
- Do not hold mmap_sem when reading memory pages on restart
Changelog[v19-rc2]:
- Expose page write functions
- [Serge Hallyn] Fix return value of read_pages_contents()
Changelog[v18]:
- Tighten checks on supported vma to checkpoint or restart
Changelog[v17]:
- Restore mm->{flags,def_flags,saved_auxv}
- Fix bogus warning in do_restore_mm()
Changelog[v16]:
- Restore mm->exe_file
Changelog[v14]:
- Introduce per vma-type restore() function
- Merge restart code into same file as checkpoint (memory.c)
- Compare saved 'vdso' field of mm_context with current value
- Check whether calls to ckpt_hbuf_get() fail
- Discard field 'h->parent'
- Revert change to pr_debug(), back to ckpt_debug()
Changelog[v13]:
- Avoid access to hh->vma_type after the header is freed
- Test for no vma's in exit_mmap() before calling unmap_vma() (or it
may crash if restart fails after having removed all vma's)
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
- Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
Changelog[v7]:
- Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)
Changelog[v5]:
- Improve memory restore code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, ...
Export functionality to retrieve specific pages from shared memory
given an inode in shmem-fs; this will be used in the next two patches
to provide support for c/r of shared memory.
mm/shmem.c:
- shmem_getpage() and 'enum sgp_type' moved to linux/mm.h
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/mm.h | 11 +++++++++++
mm/shmem.c | 15 ++-------------
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bdeb0b5..b37a9f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -337,6 +337,17 @@ void put_pages_list(struct list_head *pages);
void split_page(struct page *page, unsigned int order);
+/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+enum sgp_type {
+ SGP_READ, /* don't exceed i_size, don't allocate page */
+ SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
+ SGP_WRITE, /* may exceed i_size, may allocate page */
+};
+
+extern int shmem_getpage(struct inode *inode, unsigned long idx,
+ struct page **pagep, enum sgp_type sgp, int *type);
+
/*
* Compound pages have a destructor function. Provide a
* prototype for that function and accessor functions.
diff --git a/mm/shmem.c b/mm/shmem.c
index eef4ebe..d93c394 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -98,14 +98,6 @@ static struct vfsmount *shm_mnt;
/* Pretend that each entry is of this size in directory's i_size */
#define BOGO_DIRENT_SIZE 20
-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
-enum sgp_type {
- SGP_READ, /* don't exceed i_size, don't allocate page */
- SGP_CACHE, /* don't exceed i_size, may allocate page */
- SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
- SGP_WRITE, /* may exceed i_size, may allocate page */
-};
-
#ifdef CONFIG_TMPFS
static unsigned long ...
During c/r of pipes we need to save and restore pipe buffers. But
do_splice() requires two file descriptors, therefore we can't use it,
as we always have one file descriptor (checkpoint image) and one
pipe_inode_info.
This patch exports interfaces that work at the pipe_inode_info level,
namely link_pipe(), do_splice_to() and do_splice_from(). They are used
in the following patch to save and restore pipe buffers without
unnecessary data copy.
It slightly modifies both do_splice_to() and do_splice_from() to
detect the case of pipe-to-pipe transfer, in which case they invoke
splice_pipe_to_pipe() directly.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
fs/splice.c | 61 ++++++++++++++++++++++++++++++++---------------
include/linux/splice.h | 9 +++++++
2 files changed, 50 insertions(+), 20 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c
index 3920866..76acb55 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1051,18 +1051,43 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
EXPORT_SYMBOL(generic_splice_sendpage);
/*
+ * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
+ * location, so checking ->i_pipe is not enough to verify that this is a
+ * pipe.
+ */
+static inline struct pipe_inode_info *pipe_info(struct inode *inode)
+{
+ if (S_ISFIFO(inode->i_mode))
+ return inode->i_pipe;
+
+ return NULL;
+}
+
+static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
+ struct pipe_inode_info *opipe,
+ size_t len, unsigned int flags);
+
+/*
* Attempt to initiate a splice from pipe to file.
*/
-static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
- loff_t *ppos, size_t len, unsigned int flags)
+long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+ loff_t *ppos, size_t len, unsigned int flags)
{
ssize_t ...
A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless of whether it is the read- or write- end.
To checkpoint a file descriptor that refers to a pipe (either end), we
first look up the inode in the hash table:
If not found, it is the first encounter of this pipe. Besides the file
descriptor, we also (a) save the pipe data, and (b) register the pipe
inode in the hash. If found, it is the second encounter of this pipe,
namely, as we hit the other end of the same pipe. In both cases we write
the pipe-objref of the inode.
To restore, create a new pipe and thus have two file pointers (read- and
write- ends). We only use one of them, depending on which side was
checkpointed first. We register the file pointer of the other end in the
hash table, with the pipe_objref given for this pipe from the checkpoint,
to be used later when the other end arrives. At this point we also
restore the contents of the pipe buffers.
To save the pipe buffer, given a source pipe, use do_tee() to clone its
contents into a temporary 'struct pipe_inode_info', and then use
do_splice_from() to transfer it directly to the checkpoint image file.
To restore the pipe buffer, with a freshly allocated target pipe, use
do_splice_to() to splice the data directly between the checkpoint image
file and the pipe.
Changelog[v19-rc1]:
- Switch to ckpt_obj_try_fetch()
- [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
- Adjust format of pipe buffer to include the mandatory pre-header
Changelog[v17]:
- Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/files.c | 7 ++
fs/pipe.c | 157 ++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint_hdr.h | 9 +++
include/linux/pipe_fs_i.h ...
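The two-encounter checkpoint scheme from the pipe patch description above can be sketched as a tiny userspace model. All names here (model_objhash, model_checkpoint_pipe_fd, and so on) are hypothetical, the fixed-size array merely stands in for the real object hash, and a counter stands in for dumping the pipe buffers.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_OBJS 16

/* Hypothetical stand-in for the checkpoint object hash. */
struct model_objhash {
	const void *ptr[MODEL_MAX_OBJS];
	int nr;		/* objrefs are 1..nr; 0 means "not found" */
};

static int model_obj_lookup(const struct model_objhash *h, const void *ptr)
{
	for (int i = 0; i < h->nr; i++)
		if (h->ptr[i] == ptr)
			return i + 1;
	return 0;
}

static int model_obj_register(struct model_objhash *h, const void *ptr)
{
	if (h->nr == MODEL_MAX_OBJS)
		return -1;
	h->ptr[h->nr++] = ptr;
	return h->nr;
}

/*
 * Called once per pipe file descriptor. Returns the pipe objref to write
 * with the descriptor, and bumps *buffers_saved only on first encounter.
 */
static int model_checkpoint_pipe_fd(struct model_objhash *h,
				    const void *pipe_inode, int *buffers_saved)
{
	int objref = model_obj_lookup(h, pipe_inode);

	if (objref)
		return objref;	/* second encounter: other end of the pipe */

	objref = model_obj_register(h, pipe_inode);
	if (objref > 0)
		(*buffers_saved)++;	/* stands in for dumping pipe data */
	return objref;
}
```

Both ends of one pipe resolve to the same objref while the buffer is saved exactly once, which is the property the description relies on at restart to pair the two ends.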
FIFOs are almost like pipes.
Checkpoint adds the FIFO pathname. The first time the FIFO is found
it also assigns an @objref and dumps the contents in the buffers.
To restore, use the @objref only to determine whether a particular
FIFO has already been restored earlier. Note that it ignores the file
pointer that matches that @objref (unlike with pipes, where that file
corresponds to the other end of the pipe). Instead, it creates a new
FIFO using the saved pathname.
Changelog [v19-rc3]:
- Rebase to kernel 2.6.33
Changelog [v19-rc1]:
- Switch to ckpt_obj_try_fetch()
- [Matt Helsley] Add cpp definitions for enums
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/files.c | 6 +++
fs/pipe.c | 81 +++++++++++++++++++++++++++++++++++++++-
include/linux/checkpoint_hdr.h | 2 +
include/linux/pipe_fs_i.h | 2 +
4 files changed, 90 insertions(+), 1 deletions(-)
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 1c294fe..c647bfd 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -599,6 +599,12 @@ static struct restore_file_ops restore_file_ops[] = {
.file_type = CKPT_FILE_PIPE,
.restore = pipe_file_restore,
},
+ /* fifo */
+ {
+ .file_name = "FIFO",
+ .file_type = CKPT_FILE_FIFO,
+ .restore = fifo_file_restore,
+ },
};
static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 747b2d7..8c79493 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -830,6 +830,8 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
return ret;
}
+static struct vfsmount *pipe_mnt __read_mostly;
+
#ifdef CONFIG_CHECKPOINT
static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
{
@@ -877,7 +879,11 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
if (!h)
return -ENOMEM;
- h->common.f_type = ...
From: Matt Helsley <matthltc@us.ibm.com>
We do not support restarting fsnotify watches. inotify and fanotify utilize
anon_inodes for pseudofiles which lack the .checkpoint operation. So they
already cleanly prevent checkpoint. dnotify on the other hand registers
its watches using fcntl() which does not require the userspace task to
hold an fd with an empty .checkpoint operation. This means userspace
could use dnotify to set up fsnotify watches which won't be re-created during
restart.
Check for fsnotify watches created with dnotify and reject checkpoint
if there are any.
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/files.c | 5 +++++
fs/notify/dnotify/dnotify.c | 18 ++++++++++++++++++
include/linux/dnotify.h | 6 ++++++
3 files changed, 29 insertions(+), 0 deletions(-)
diff --git a/checkpoint/files.c b/checkpoint/files.c
index c647bfd..62feadd 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -207,6 +207,11 @@ int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
return -EBADF;
}
+ if (is_dnotify_attached(file)) {
+ ckpt_err(ctx, -EBADF, "%(T)%(P)dnotify unsupported\n", file);
+ return -EBADF;
+ }
+
ret = file->f_op->checkpoint(ctx, file);
if (ret < 0)
ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 7e54e52..0a63bf6 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -289,6 +289,24 @@ static int attach_dn(struct dnotify_struct *dn, struct dnotify_mark_entry *dnent
return 0;
}
+int is_dnotify_attached(struct file *filp)
+{
+ struct fsnotify_mark_entry *entry;
+ struct inode *inode;
+
+ inode = filp->f_path.dentry->d_inode;
+ if (!S_ISDIR(inode->i_mode))
+ return 0;
+
+ spin_lock(&inode->i_lock);
+ entry = ...
Checkpoint of sysvipc shared memory is performed in two steps: first,
the entire ipc namespace is dumped as a whole by iterating through all
shm objects and dumping the contents of each one. The shmem inode is
registered in the objhash. Second, for each vma that refers to ipc
shared memory we find the inode in the objhash, and save the objref.
(If we find a new inode, that indicates that the ipc namespace is not
entirely frozen and someone must have manipulated it since step 1).
Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.
Changelog[v20]:
- Fix "scheduling in atomic" while restoring ipc shm
Changelog[v19-rc3]:
- Rebase to kernel 2.6.33
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
- Collect files used by shm objects
- Use file instead of inode as shared object during checkpoint
Changelog[v17]:
- Restore objects in the right namespace
- Properly initialize ctx->deferqueue
- Fix compilation with CONFIG_CHECKPOINT=n
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/checkpoint_hdr.h | 21 +++
ipc/Makefile | 3 +-
ipc/checkpoint.c | 2 +-
ipc/checkpoint_msg.c | 380 ++++++++++++++++++++++++++++++++++++++++
ipc/msg.c | 10 +-
ipc/msgutil.c | 8 -
ipc/util.h | 13 ++
7 files changed, 420 insertions(+), 17 deletions(-)
create mode 100644 ipc/checkpoint_msg.c
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1b2ffef..07e918e 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -114,6 +114,8 @@ enum {
#define CKPT_HDR_IPC_SHM CKPT_HDR_IPC_SHM
CKPT_HDR_IPC_MSG,
#define CKPT_HDR_IPC_MSG CKPT_HDR_IPC_MSG
+ CKPT_HDR_IPC_MSG_MSG,
+#define CKPT_HDR_IPC_MSG_MSG ...
We need to use this value in the checkpoint/restart code and would like to
have a constant instead of a magic '3'.
Changelog:
Jan 20:
. Define s390x sys_restart wrapper
Mar 30:
. Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Mar 03:
. Picked up additional use of magic '3' in ptrace.h
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
arch/s390/Kconfig | 4 ++++
arch/s390/kernel/process.c | 9 +++++++++
2 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index c802352..95bb4ed 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
config GENERIC_CLOCKEVENTS
def_bool y
+config CHECKPOINT_SUPPORT
+ bool
+ default y if 64BIT
+
config GENERIC_BUG
bool
depends on BUG
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 5b0729a..eb834fd 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -240,6 +240,15 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
parent_tidptr, child_tidptr);
}
+#ifdef CONFIG_CHECKPOINT
+extern long do_sys_restart(pid_t pid, int fd, unsigned long flags, int logfd);
+SYSCALL_DEFINE4(restart, pid_t, pid, int, fd, unsigned long, flags,
+ int, logfd)
+{
+ return do_sys_restart(pid, fd, flags, logfd);
+}
+#endif
+
SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct clone_args __user *,
uca, int, args_size, pid_t __user *, pids)
{
--
1.6.3.3
--
From: Dan Smith <danms@us.ibm.com>
As suggested by Dave[1], this provides us a way to make the copy-in and
copy-out processes symmetric. CKPT_COPY_ARRAY() provides us a way to do
the same thing but for arrays. It's not critical, but it helps us unify
the checkpoint and restart paths for some things.
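As a rough illustration of the symmetry, here is a small userspace
sketch. The CKPT_COPY() macro has the same shape as the one added by
this patch; struct toy_regs, struct toy_hdr_cpu and toy_copy_cpu() are
hypothetical and exist only for the example.

```c
#include <assert.h>

/* Same shape as the macros added by this patch. */
#define CKPT_CPT 1
#define CKPT_RST 2

#define CKPT_COPY(op, SAVE, LIVE)	\
	do {				\
		if (op == CKPT_CPT)	\
			SAVE = LIVE;	\
		else			\
			LIVE = SAVE;	\
	} while (0)

/* Hypothetical "live" cpu state and its checkpoint image. */
struct toy_regs { unsigned long ip, sp; };
struct toy_hdr_cpu { unsigned long long ip, sp; };

/* One routine serves both checkpoint (CKPT_CPT) and restart (CKPT_RST). */
static void toy_copy_cpu(int op, struct toy_hdr_cpu *h, struct toy_regs *regs)
{
	assert(op == CKPT_CPT || op == CKPT_RST);
	CKPT_COPY(op, h->ip, regs->ip);
	CKPT_COPY(op, h->sp, regs->sp);
}
```

Because the field list is written once, a field cannot be saved on
checkpoint and then forgotten on restart.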
Changelog:
Mar 04:
. Removed semicolons
. Added build-time check for __must_be_array in CKPT_COPY_ARRAY
Feb 27:
. Changed CKPT_COPY() to use assignment, eliminating the need
for the CKPT_COPY_BIT() macro
. Add CKPT_COPY_ARRAY() macro to help copying register arrays,
etc
. Move the macro definitions inside the CR #ifdef
Feb 25:
. Changed WARN_ON() to BUILD_BUG_ON()
Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
---
include/linux/checkpoint.h | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 81e2150..9eeb71c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -260,6 +260,34 @@ static inline int ckpt_validate_errno(int errno)
return (errno >= 0) && (errno < MAX_ERRNO);
}
+/* useful macros to copy fields and buffers to/from ckpt_hdr_xxx structures */
+#define CKPT_CPT 1
+#define CKPT_RST 2
+
+#define CKPT_COPY(op, SAVE, LIVE) \
+ do { \
+ if (op == CKPT_CPT) \
+ SAVE = LIVE; \
+ else \
+ LIVE = SAVE; \
+ } while (0)
+
+/*
+ * Copy @count items from @LIVE to @SAVE if op is CKPT_CPT (otherwise,
+ * copy in the reverse direction)
+ */
+#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count) \
+ do ...
From: Dan Smith <danms@us.ibm.com>
Implement the s390 arch-specific checkpoint/restart helpers. This
is on top of Oren Laadan's c/r code.
With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro. While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to. That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as
a problem.
Changelog [v20]:
- [Serge Hallyn] save_access_regs for self-checkpoint
- [Serge Hallyn] send uses_interp=1 to arch_setup_additional_pages
Changelog [v19]:
- [Serge Hallyn] Move get_signal_to_deliver() up in do_signal
Changelog [v19-rc3]:
- [Serge Hallyn] Use simpler test_task_thread to test current ti flags
- [Serge Hallyn] Fix 31-bit s390 checkpoint/restart wrappers
- [Serge Hallyn] Update sys_checkpoint (do_sys_checkpoint on all archs)
- [Oren Laadan] Move checkpoint.c from arch/s390/mm->arch/s390/kernel
Changelog [v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog:
Jun 15:
. Fix checkpoint and restart compat wrappers
May 28:
. Export asm/checkpoint_hdr.h to userspace
. Define CKPT_ARCH_ID for S390
Apr 11:
. Introduce ckpt_arch_vdso()
Feb 27:
. Add checkpoint_s390.h
. Fixed up save and restore of PSW, with the non-address bits
properly masked out
Feb 25:
. Make checkpoint_hdr.h safe for inclusion in userspace
. Replace comment about vdso code
. Add comment about restoring access registers
. Write and read an empty ckpt_hdr_head_arch record to appease
code (mktree) that expects it to be there
. Utilize NUM_CKPT_WORDS in checkpoint_hdr.h
Feb 24:
. Use CKPT_COPY() to unify the un/loading of cpu and mm state
. Fix fprs definition in ckpt_hdr_cpu
. Remove debug WARN_ON() from checkpoint.c
...
This patch adds checkpoint/restart of blocked signals mask
(t->blocked) and a template for shared signals (t->signal).
Because t->signal sharing is tied to threads, we ensure proper
sharing of t->signal (struct signal_struct) for threads only.
Access to t->signal is protected by locking t->sighand->lock.
Therefore, the usual checkpoint_obj() invoking the callback
checkpoint_signal(ctx, signal) is insufficient because the task
pointer is unavailable. Instead, handling of t->signal sharing is
explicit using helpers like ckpt_obj_lookup_add(), ckpt_obj_fetch()
and ckpt_obj_insert(). The actual state is saved (if needed) _after_
the task_objs data.
To prevent tasks from handling restored signals during restart,
set their mask to block all signals and only restore the original
mask at the very end (before the last sync point).
Introduce per-task pointer 'ckpt_data' to temporarily store data
for restore actions that are deferred to the end (like restoring the
signal block mask).
Changelog [ckpt-v19]:
- Use task->saved_sigmask and drop task->checkpoint_data
- [Serge Hallyn] Handle saved_sigmask at checkpoint
Changelog [ckpt-v19-rc1]:
- Defer restore of blocked signals mask during restart
- [Matt Helsley] Add cpp definitions for enums
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
arch/s390/kernel/checkpoint.c | 2 -
arch/s390/kernel/signal.c | 5 ++
arch/x86/kernel/signal.c | 5 ++
checkpoint/objhash.c | 7 +++
checkpoint/process.c | 71 +++++++++++++++++++++++++-
checkpoint/restart.c | 13 +++++
checkpoint/signal.c | 111 ++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint.h | 8 +++
include/linux/checkpoint_hdr.h | 16 ++++++
include/linux/signal.h | 3 +
kernel/fork.c | 3 +
11 ...
This patch adds support for real/virt/prof itimers.
Expiry and the interval values are both saved in nanoseconds.
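As a userspace illustration of saving both values in nanoseconds, the
sketch below converts a struct itimerval into a pair of nanosecond
counts and back. The toy_* helpers and struct toy_itimer_image are
hypothetical names; the patch itself uses in-kernel helpers on the
signal_struct fields, which this sketch does not reproduce.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

#define TOY_NSEC_PER_SEC  1000000000ULL
#define TOY_NSEC_PER_USEC 1000ULL

/* Convert a non-negative timeval to nanoseconds. */
static uint64_t toy_timeval_to_ns(const struct timeval *tv)
{
	assert(tv->tv_sec >= 0 && tv->tv_usec >= 0);
	return (uint64_t)tv->tv_sec * TOY_NSEC_PER_SEC +
	       (uint64_t)tv->tv_usec * TOY_NSEC_PER_USEC;
}

/* Convert nanoseconds back to a timeval (sub-microsecond part is dropped). */
static struct timeval toy_ns_to_timeval(uint64_t ns)
{
	struct timeval tv;

	tv.tv_sec = (time_t)(ns / TOY_NSEC_PER_SEC);
	tv.tv_usec = (suseconds_t)((ns % TOY_NSEC_PER_SEC) / TOY_NSEC_PER_USEC);
	return tv;
}

/* Hypothetical image of one itimer: expiry and interval, both in ns. */
struct toy_itimer_image {
	uint64_t value_ns;
	uint64_t incr_ns;
};

static void toy_save_itimer(const struct itimerval *it,
			    struct toy_itimer_image *img)
{
	img->value_ns = toy_timeval_to_ns(&it->it_value);
	img->incr_ns = toy_timeval_to_ns(&it->it_interval);
}
```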
Changelog[v19-rc2]:
- Adjust virt/prof itimer code for kernel 2.6.32
Changelog[v1]:
- [Louis Rilling] Fix saving of signal->it_real_incr if not expired
- Fix restoring of signal->it_real_incr if expire is zero
- Save virt/prof expire relative to process accumulated time
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/signal.c | 90 ++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint_hdr.h | 6 +++
include/linux/posix-timers.h | 9 ++++
kernel/posix-cpu-timers.c | 9 ----
4 files changed, 105 insertions(+), 9 deletions(-)
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 3d13c56..ecb94f8 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -15,6 +15,8 @@
#include <linux/signal.h>
#include <linux/errno.h>
#include <linux/resource.h>
+#include <linux/timer.h>
+#include <linux/posix-timers.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>
@@ -315,6 +317,9 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
struct signal_struct *signal;
struct sigpending shared_pending;
struct rlimit *rlim;
+ struct timeval tval;
+ struct cpu_itimer *it;
+ cputime_t cputime;
unsigned long flags;
int i, ret;
@@ -351,6 +356,53 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
h->rlim[i].rlim_cur = rlim[i].rlim_cur;
h->rlim[i].rlim_max = rlim[i].rlim_max;
}
+
+ /* real/virt/prof itimers */
+ if (hrtimer_active(&signal->real_timer)) {
+ /* For an active timer compute the time delta */
+ ktime_t delta = hrtimer_get_remaining(&signal->real_timer);
+ /*
+ * If the timer expired after the test above, then
+ * set the expire to the ...
From: Dan Smith <danms@us.ibm.com>
Make these helpers available to others.
Changes in v2:
- Avoid checking the groupinfo in ctx->realcred against the current in
may_setgid()
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/user.h | 9 +++++++++
kernel/user.c | 13 ++++++++++++-
2 files changed, 21 insertions(+), 1 deletions(-)
diff --git a/include/linux/user.h b/include/linux/user.h
index 68daf84..c231e9c 100644
--- a/include/linux/user.h
+++ b/include/linux/user.h
@@ -1 +1,10 @@
+#ifndef _LINUX_USER_H
+#define _LINUX_USER_H
+
#include <asm/user.h>
+#include <linux/sched.h>
+
+extern int may_setuid(struct user_namespace *ns, uid_t uid);
+extern int may_setgid(gid_t gid);
+
+#endif
diff --git a/kernel/user.c b/kernel/user.c
index 15f762c..8ddafea 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -604,7 +604,7 @@ int checkpoint_user(struct ckpt_ctx *ctx, void *ptr)
return do_checkpoint_user(ctx, (struct user_struct *) ptr);
}
-static int may_setuid(struct user_namespace *ns, uid_t uid)
+int may_setuid(struct user_namespace *ns, uid_t uid)
{
/*
* this next check will one day become
@@ -631,6 +631,17 @@ static int may_setuid(struct user_namespace *ns, uid_t uid)
return 0;
}
+int may_setgid(gid_t gid)
+{
+ if (capable(CAP_SETGID))
+ return 1;
+
+ if (in_egroup_p(gid))
+ return 1;
+
+ return 0;
+}
+
static struct user_struct *do_restore_user(struct ckpt_ctx *ctx)
{
struct user_struct *u;
--
1.6.3.3
--
The main challenge with restoring the pgid of tasks is that the
original "owner" (the process with that pid) might have exited
already. I call these "ghost" pgids. 'mktree' does create these
processes, but they then exit without participating in the restart.
To solve this, this patch introduces a RESTART_GHOST flag, used for
"ghost" owners that are created only to pass their pgid to other
tasks. ('mktree' now makes them call restart(2) instead of exiting).
When a "ghost" task calls restart(2), it will be placed on a wait
queue until the restart completes and then exit. This guarantees that
the pgid that it owns remains available for all (regular) restarting
tasks for when they need it.
Regular tasks perform the restart as before, except that they also
now restore their old pgrp, which is guaranteed to exist.
Changelog [v19-rc1]:
- Simplify logic of tracking restarting tasks
- Debug final process-tree state on restart
- [Matt Helsley] Add cpp definitions for enums
- Self-restart to tolerate missing pgid
Changelog [v3]:
- Fix leak of ckpt_ctx when restoring "ghost" tasks
Changelog [v2]:
- Call change_pid() only if new pgrp differs from current one
Changelog [v1]:
- Verify that pgid owner is a thread-group-leader.
- Handle the case of pgid/sid == 0 using root's parent pid-ns
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/process.c | 101 ++++++++++++++++++++++++++++++++++++++
checkpoint/restart.c | 59 +++++++++++++++++++---
checkpoint/sys.c | 3 +-
include/linux/checkpoint.h | 11 +++-
include/linux/checkpoint_hdr.h | 3 +
include/linux/checkpoint_types.h | 7 ++-
6 files changed, 171 insertions(+), 13 deletions(-)
diff --git a/checkpoint/process.c b/checkpoint/process.c
index c5e9357..e0ef795 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -24,6 +24,57 @@
...
From: Dan Smith <danms@us.ibm.com>
This is an incremental step towards supporting checkpoint/restart on
AF_INET sockets. In this scenario, any sockets that were in TCP_LISTEN
state are restored as they were. Any that were connected are forced to
TCP_CLOSE. This should cover a range of use cases that involve
applications that are tolerant of such an interruption.
Changelog [v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changes in v2:
- Fix whitespace
- Fix return in inet_checkpoint() on failed ckpt_hdr_get_type()
- Fix garbage free on error path of inet_read_buffer()
- Fix unnecessary ret=0 in inet_read_buffers()
- Add inet_precheck() (like unix) to validate the address lengths
(and more later)
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: Oren Laadan <orenl@librato.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
Documentation/checkpoint/readme.txt | 21 ++++
include/linux/checkpoint_hdr.h | 12 ++
include/net/inet_common.h | 13 +++
net/checkpoint.c | 9 ++
net/ipv4/Makefile | 1 +
net/ipv4/af_inet.c | 6 +
net/ipv4/checkpoint.c | 190 +++++++++++++++++++++++++++++++++++
7 files changed, 252 insertions(+), 0 deletions(-)
create mode 100644 net/ipv4/checkpoint.c
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
index 4fa5560..2548bb4 100644
--- a/Documentation/checkpoint/readme.txt
+++ b/Documentation/checkpoint/readme.txt
@@ -344,6 +344,27 @@ we will be forced to more carefully review each of those
features. However, this can be controlled with a sysctl-variable.
+Sockets
+=======
+
+For AF_UNIX sockets, both endpoints must be within the checkpointed
+task set to maintain a connected state after restart. UNIX sockets
+that are in the process of passing a descriptor will cause ...
Add checkpoint/restart of controlling terminal: current->signal->tty.
This is only done for session leaders.
If the session leader belongs to the ancestor pid-ns, then checkpoint
skips this tty; on restart, it will not be restored, and whatever tty
is in place from parent pid-ns (at restart) will be inherited.
Changelog [v1]:
- Don't restore tty_old_pgrp if pgid is CKPT_PID_NULL
- Initialize pgrp to NULL in restore_signal
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/signal.c | 79 +++++++++++++++++++++++++++++++++++++++-
drivers/char/tty_io.c | 33 +++++++++++++----
include/linux/checkpoint.h | 1 +
include/linux/checkpoint_hdr.h | 6 +++
include/linux/tty.h | 5 +++
5 files changed, 115 insertions(+), 9 deletions(-)
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index ecb94f8..9d0e9c3 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -316,12 +316,13 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
struct ckpt_hdr_signal *h;
struct signal_struct *signal;
struct sigpending shared_pending;
+ struct tty_struct *tty = NULL;
struct rlimit *rlim;
struct timeval tval;
struct cpu_itimer *it;
cputime_t cputime;
unsigned long flags;
- int i, ret;
+ int i, ret = 0;
h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
if (!h)
@@ -403,9 +404,34 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
cputime_to_timeval(it->incr, &tval);
h->it_prof_incr = timeval_to_ns(&tval);
+ /* tty */
+ if (signal->leader) {
+ h->tty_old_pgrp = ckpt_pid_nr(ctx, signal->tty_old_pgrp);
+ tty = tty_kref_get(signal->tty);
+ if (tty) {
+ /* irq is already disabled */
+ spin_lock(&tty->ctrl_lock);
+ h->tty_pgrp = ckpt_pid_nr(ctx, tty->pgrp);
+ spin_unlock(&tty->ctrl_lock);
+ tty_kref_put(tty);
+ }
+ }
+
...
From: Matt Helsley <matthltc@us.ibm.com>
Save/restore epoll items during checkpoint/restart respectively.
Output the epoll header and items separately. Chunk the output much
like the pid array gets chunked. This ensures that even sub-order 0
allocations will enable checkpoint of large epoll sets. A subsequent
patch will do something similar for the restore path.
On restart, we grab a piece of memory suitable to store a "chunk" of
items for input. Read the input one chunk at a time and add epoll
items for each item in the chunk.
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Changelog [v19]:
- [Oren Laadan] Fix broken compilation for no-c/r architectures
Changelog [v19-rc1]:
- [Oren Laadan] Return -EBUSY (not BUG_ON) if fd is gone on restart
- [Oren Laadan] Fix the chunk size instead of auto-tune
Changelog v5:
Fix potential recursion during collect.
Replace call to ckpt_obj_collect() with ckpt_collect_file(). [Oren]
Fix checkpoint leak detection when there are more items than expected.
Cleanup/simplify error write paths. (will complicate in a later patch) [Oren]
Remove files_deferq bits. [Oren]
Remove extra newline. [Oren]
Remove aggregate check on number of watches added. [Oren]
This is OK since these will be done individually anyway.
Remove check for negative objrefs during restart. [Oren]
Fixup comment regarding race that indicates checkpoint leaks. [Oren]
s/ckpt_read_obj/ckpt_read_buf_type/ [Oren]
Patch for lots of epoll items follows.
Moved sys_close(epfd) right under fget(). [Oren]
Use CKPT_HDR_BUFFER rather than custom ckpt_read/write_*
This makes it more similar to the pid array code. [Oren]
It also simplifies the error recovery paths.
Tested polling a pipe and 50,000 UNIX sockets.
Changelog v4: ckpt-v18
Use files_deferq as submitted by Dan Smith
Cleanup to only report >= 1 items when debugging.
Changelog v3: ...
From: Matt Helsley <matthltc@us.ibm.com>
Save/restore eventfd files. These are anon_inodes just like epoll
but instead of a set of files to poll they are a 64-bit counter
and a flag value. Used for AIO.
[Oren Laadan] Added #ifdef's around checkpoint/restart to compile even
without CONFIG_CHECKPOINT
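To make the "64-bit counter and a flag value" concrete, here is a
userspace sketch built on eventfd(2). It is an analogy only and is not
how the patch works inside the kernel: struct toy_eventfd_image and the
toy_* helpers are hypothetical, and the save helper assumes a
non-blocking, non-semaphore eventfd that nobody else touches while it
runs.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical image of an eventfd: the counter plus its creation flags. */
struct toy_eventfd_image {
	uint64_t count;
	int flags;
};

/*
 * Capture the counter by reading it and writing it straight back.
 * Assumes @fd was created with EFD_NONBLOCK and without EFD_SEMAPHORE.
 * Returns 0 on success, -1 on error.
 */
static int toy_eventfd_save(int fd, int flags, struct toy_eventfd_image *img)
{
	uint64_t val = 0;
	ssize_t n;

	assert(img != NULL);
	n = read(fd, &val, sizeof(val));
	if (n < 0 && errno == EAGAIN)
		val = 0;	/* counter was zero, nothing to put back */
	else if (n != (ssize_t)sizeof(val))
		return -1;
	else if (write(fd, &val, sizeof(val)) != (ssize_t)sizeof(val))
		return -1;	/* could not put the value back */

	img->count = val;
	img->flags = flags;
	return 0;
}

/* Re-create an eventfd from the image; returns the new fd or -1. */
static int toy_eventfd_restore(const struct toy_eventfd_image *img)
{
	int fd = eventfd(0, img->flags);

	if (fd < 0)
		return -1;
	if (img->count &&
	    write(fd, &img->count, sizeof(img->count)) !=
	    (ssize_t)sizeof(img->count)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The read/write-back pair is not atomic, which is one reason a userspace
approach like this is only suitable for illustration.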
Changelog[v19]:
- Fix broken compilation for architectures that don't support c/r
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/files.c | 7 +++++
fs/eventfd.c | 55 ++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint_hdr.h | 8 ++++++
include/linux/eventfd.h | 12 ++++++++
4 files changed, 82 insertions(+), 0 deletions(-)
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 6aaaf22..4b551fe 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -23,6 +23,7 @@
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>
#include <linux/eventpoll.h>
+#include <linux/eventfd.h>
#include <net/sock.h>
@@ -644,6 +645,12 @@ static struct restore_file_ops restore_file_ops[] = {
.file_type = CKPT_FILE_EPOLL,
.restore = ep_file_restore,
},
+ /* eventfd */
+ {
+ .file_name = "EVENTFD",
+ .file_type = CKPT_FILE_EVENTFD,
+ .restore = eventfd_restore,
+ },
};
static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 7758cc3..f2785c0 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -18,6 +18,7 @@
#include <linux/module.h>
#include <linux/kref.h>
#include <linux/eventfd.h>
+#include <linux/checkpoint.h>
struct eventfd_ctx {
struct kref kref;
@@ -287,11 +288,65 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
return res;
}
+#ifdef CONFIG_CHECKPOINT
+static int eventfd_checkpoint(struct ckpt_ctx *ckpt_ctx, struct file ...
We only allow c/r when all processes share a single mounts ns.
We do intend to implement c/r of mounts and mounts namespaces in the
kernel. It shouldn't be ugly or complicate locking to do so. Just
haven't gotten around to it. A more complete solution is more than we
want to take on now for v19.
But we'd like, as far as possible, for everything we don't support
to be non-checkpointable, since allowing such checkpoints has in the
past invited slanderous accusations of being a toy implementation :)
Meanwhile, we get the following:
1) Checkpoint bails if not all tasks share the same mnt-ns
2) Leak detection works for full container checkpoint
On restart, all tasks inherit the same mnt-ns of the coordinator, by
default. A follow-up patch to user-cr will add a new switch to the
'restart' program to request the CLONE_NEWNS flag (a new mounts
namespace) when creating the root-task of the restart.
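Rule 1 above amounts to a simple invariant over the set of tasks being
checkpointed. A toy userspace sketch follows, assuming a hypothetical
struct toy_task and an error code (-EPERM) chosen only for the
illustration; the patch itself relies on the objhash leak detection
rather than a loop like this.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical task record: only the mounts namespace pointer matters. */
struct toy_task {
	const void *mnt_ns;
};

/* Return 0 if all tasks share one mounts namespace, -EPERM otherwise. */
static int toy_check_single_mnt_ns(const struct toy_task *tasks, size_t n)
{
	size_t i;

	assert(tasks || n == 0);
	for (i = 1; i < n; i++)
		if (tasks[i].mnt_ns != tasks[0].mnt_ns)
			return -EPERM;
	return 0;
}
```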
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/objhash.c | 25 +++++++++++++++++++++++++
include/linux/checkpoint.h | 2 +-
include/linux/checkpoint_hdr.h | 4 ++++
kernel/nsproxy.c | 16 +++++++++++++---
4 files changed, 43 insertions(+), 4 deletions(-)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 5c4749d..42998b2 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -19,6 +19,7 @@
#include <linux/sched.h>
#include <linux/ipc_namespace.h>
#include <linux/user_namespace.h>
+#include <linux/mnt_namespace.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>
#include <net/sock.h>
@@ -214,6 +215,22 @@ static int obj_ipc_ns_users(void *ptr)
return atomic_read(&((struct ipc_namespace *) ptr)->count);
}
+static int obj_mnt_ns_grab(void *ptr)
+{
+ get_mnt_ns((struct mnt_namespace *) ptr);
+ return 0;
+}
+
+static void obj_mnt_ns_drop(void *ptr, int lastref)
+{
+ put_mnt_ns((struct mnt_namespace *) ptr);
+}
+
+static int obj_mnt_ns_users(void ...
From: Nathan Lynch <ntl@pobox.com>
Changelog [v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/checkpoint_hdr.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 28dfc36..acf964a 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -185,6 +185,10 @@ enum {
#define CKPT_ARCH_X86_64 CKPT_ARCH_X86_64
CKPT_ARCH_S390X,
#define CKPT_ARCH_S390X CKPT_ARCH_S390X
+ CKPT_ARCH_PPC32,
+#define CKPT_ARCH_PPC32 CKPT_ARCH_PPC32
+ CKPT_ARCH_PPC64,
+#define CKPT_ARCH_PPC64 CKPT_ARCH_PPC64
};
/* shared objrects (objref) */
--
1.6.3.3
--
From: Serge E. Hallyn <serue@us.ibm.com>
Documentation/checkpoint/readme.txt begins:
"""
Application checkpoint/restart is the ability to save the state
of a running application so that it can later resume its execution
from the time at which it was checkpointed.
"""
This patch implements checkpoint and restore of Smack security
labels. The rules are the same as in previous versions:
1. when objects are created during restore() they are
automatically labeled with current_security().
2. if there was a label checkpointed with the object,
and that label != current_security() (which is the
same as obj->security), then the object is relabeled
if the sys_restart() caller has CAP_MAC_ADMIN.
Otherwise we return -EPERM.
This has been tested by checkpointing tasks under labels
_, vs1, and vs2, and restarting from tasks under _, vs1,
and vs2, with and without CAP_MAC_ADMIN in the bounding set,
and with and without the '-k' (keep_lsm) flag to mktree.
Expected results:
#shell 1:
echo vs1 > /proc/self/attr/current
ckpt > out
echo vs2 > /proc/self/attr/current
mktree -F /cgroup/2 < out
    (frozen)
# shell 2:
cat /proc/`pidof ckpt`/attr/current
    vs2
echo THAWED > /cgroup/2/freezer.state
# shell 1:
mktree -k -F /cgroup/2 < out
    (frozen)
# shell 2:
cat /proc/`pidof ckpt`/attr/current
    vs1
echo THAWED > /cgroup/2/freezer.state
# shell 1:
capsh --drop=cap_mac_admin -- mktree -k -F /cgroup/2 < out
    (permission denied)
There are testcases in git://git.sr71.net/~hallyn/cr_tests.git
under cr_tests/smack, which automate the above (and pass).
Changelog:
sep 3: add a version to smack lsm, accessible through
/smack/version (Casey and Serge)
sep 10: rename xyz_get_ctx() to xyz_checkpoint()
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
checkpoint/restart.c | 1 +
security/smack/smack.h | 1 +
...
From: Serge E. Hallyn <serue@us.ibm.com>
Documentation/checkpoint/readme.txt begins:
"""
Application checkpoint/restart is the ability to save the state
of a running application so that it can later resume its execution
from the time at which it was checkpointed.
"""
This patch adds the ability to checkpoint and restore selinux
contexts for tasks, open files, and sysvipc objects. Contexts are
checkpointed as strings. For tasks and files, where a security
struct actually points to several contexts, all contexts are written
out in one string, separated by ':::'.
The default behaviors are to checkpoint contexts, but not to
restore them. To attempt to restore them, sys_restart() must
be given the RESTART_KEEP_LSM flag. If this is given then
the caller of sys_restart() must have the new 'restore' permission
to the target objclass, or for instance PROCESS__SETFSCREATE to
itself to specify a create_sid.
There are some tests under cr_tests/selinux at
git://git.sr71.net/~hallyn/cr_tests.git.
A corresponding simple refpolicy (and /usr/share/selinux/devel/include)
patch is needed.
The programs to checkpoint and restart (called 'checkpoint' and
'restart') come from git://git.ncl.cs.columbia.edu/pub/git/user-cr.git.
This patch applies against the checkpoint/restart-enabled kernel
tree at git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git/.
Changelog:
Feb 02: [orenl] rebase to kernel 2.6.33
* add tags in classmap.h (includes files autogenerated)
Dec 09: update to use common_audit_data.
oct 09: fix memory overrun in selinux_cred_checkpoint.
oct 02: (Stephen Smalley suggestions):
1. s/__u32/u32/
2. enable the fown sid restoration
3. use process_restore to authorize resetting osid
4. don't make new hooks inline.
oct 01: Remove some debugging that is redundant with
avc log data.
sep 10: (Most addressing suggestions by Stephen Smalley)
1. change xyz_get_ctx() to xyz_checkpoint().
2. check entrypoint permission on cred_restore
3. always dec ...
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
MAINTAINERS | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index 2533fc4..65c8954 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1435,6 +1435,18 @@ M: Andy Whitcroft <apw@canonical.com>
S: Supported
F: scripts/checkpatch.pl
+CHECKPOINT-RESTART
+M: Oren Laadan <orenl@cs.columbia.edu>
+M: Serge E. Hallyn <serue@us.ibm.com>
+L: containers@lists.linux-foundation.org
+W: http://ckpt.wiki.kernel.org/index.php/Main_Page
+S: Maintained
+F: *checkpoint*
+K: checkpoint
+K: restore
+K: ckpt
+K: c/r
+
CISCO 10G ETHERNET DRIVER
M: Scott Feldman <scofeldm@cisco.com>
M: Joe Eykholt <jeykholt@cisco.com>
--
1.6.3.3
--
From: Serge E. Hallyn <serue@us.ibm.com>
Documentation/checkpoint/readme.txt begins:
"""
Application checkpoint/restart is the ability to save the state
of a running application so that it can later resume its execution
from the time at which it was checkpointed.
"""
This patch adds generic support for c/r of LSM credentials. Support
for Smack and SELinux (and TOMOYO if appropriate) will be added later.
Capabilities are already supported through generic creds code.
This patch supports ipc_perm, msg_msg, cred (task) and file ->security
fields. Inodes, superblocks, netif, and xfrm currently are restored
not through sys_restart() but through container creation, and so the
security fields should be done then as well. Network should be added
when network c/r is added.
Briefly, all security fields must be exported by the LSM as a simple
null-terminated string. They are checkpointed through the
security_checkpoint_obj() helper, because we must pass it an extra
sectype field. Splitting SECURITY_OBJ_SEC into one type per object
type would not work because, in Smack, one void* security is used for
all object types. But we must pass the sectype field because in
SELinux a different type of structure is stashed in each object type.
The RESTART_KEEP_LSM flag indicates that the LSM should
attempt to reuse checkpointed security labels. It is always
invalid when the LSM at restart differs from that at checkpoint.
It is currently only usable for capabilities.
(For capabilities, restart without RESTART_KEEP_LSM is technically
not implemented. There actually might be a use case for that,
but the safety of it is dubious so for now we always re-create
checkpointed capability sets whether RESTART_KEEP_LSM is
specified or not)
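The RESTART_KEEP_LSM rule described above can be sketched as a small
userspace decision function. Everything here is hypothetical and for
illustration only: toy_may_keep_lsm(), the flag value, and the choice
of -EINVAL as the error are assumptions, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TOY_RESTART_KEEP_LSM 0x1	/* illustrative value, not the kernel's */

/*
 * Decide whether checkpointed labels may be reused at restart.
 * Returns 1 to reuse labels, 0 to let the LSM label objects itself,
 * and -EINVAL if label reuse was requested but the LSM active at
 * restart differs from the one active at checkpoint.
 */
static int toy_may_keep_lsm(unsigned long flags, const char *ckpt_lsm,
			    const char *cur_lsm)
{
	assert(ckpt_lsm && cur_lsm);
	if (!(flags & TOY_RESTART_KEEP_LSM))
		return 0;
	if (strcmp(ckpt_lsm, cur_lsm) != 0)
		return -EINVAL;
	return 1;
}
```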
Changelog[v20]
- [Serge Hallyn] Fix unlabeled restore case
- [Serge Hallyn] Always restore msg_msg label
- [Serge Hallyn] Selinux prevents msgrcv on restore message queues?
Changelog:
sep 3: fix memory leak on LSM restore error path
...
From: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
---
arch/powerpc/Kconfig | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ba3948c..dd88d3d 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -26,6 +26,9 @@ config MMU
bool
default y
+config CHECKPOINT_SUPPORT
+ def_bool y
+
config GENERIC_CMOS_UPDATE
def_bool y
--
1.6.3.3
--
From: Serge E. Hallyn <serue@us.ibm.com>
The LSM name is 'selinux', 'smack', 'tomoyo', or 'dummy'. We add
that to the container configuration section. We also add a LSM
policy configuration section. That is placed after the LSM name.
It is written by the LSM in security_checkpoint_header(), called
during checkpoint container(), and read by the LSM during
security_may_restart(), which is called from restore_lsm() in
restore_container().
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
Documentation/checkpoint/readme.txt | 24 ++++++++++++++
checkpoint/checkpoint.c | 13 +++++++-
checkpoint/restart.c | 41 +++++++++++++++++++++++
checkpoint/sys.c | 22 ++++++++++++
include/linux/checkpoint.h | 6 +++
include/linux/checkpoint_hdr.h | 16 +++++++++
include/linux/checkpoint_types.h | 2 +
include/linux/security.h | 61 +++++++++++++++++++++++++++++++++++
security/capability.c | 25 ++++++++++++++
security/security.c | 26 +++++++++++++++
10 files changed, 235 insertions(+), 1 deletions(-)
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
index 2548bb4..030a001 100644
--- a/Documentation/checkpoint/readme.txt
+++ b/Documentation/checkpoint/readme.txt
@@ -343,6 +343,30 @@ So that's why we don't want CAP_SYS_ADMIN required up-front. That way
we will be forced to more carefully review each of those
features. However, this can be controlled with a sysctl-variable.
+LSM
+===
+
+Security modules use custom labels on subjects and objects to
+further mediate access decisions beyond DAC controls. When
+checkpoint applications, these labels are [ work in progress ]
+checkpointed along with the objects. At restart, the
+RESTART_KEEP_LSM flag tells the kernel whether re-created objects
+whould keep their checkpointed labels, or get automatically
+recalculated ...
From: Nathan Lynch <ntl@pobox.com>
Changelog [v19]:
- checkpoint/powerpc: fix up checkpoint syscall, tidy restart
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
---
arch/powerpc/include/asm/systbl.h | 2 ++
arch/powerpc/include/asm/unistd.h | 4 +++-
arch/powerpc/kernel/entry_32.S | 23 +++++++++++++++++++++++
arch/powerpc/kernel/entry_64.S | 16 ++++++++++++++++
arch/powerpc/kernel/process.c | 19 +++++++++++++++++++
5 files changed, 63 insertions(+), 1 deletions(-)
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index ee41254..2c1dd27 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -327,3 +327,5 @@ COMPAT_SYS_SPU(preadv)
COMPAT_SYS_SPU(pwritev)
COMPAT_SYS(rt_tgsigqueueinfo)
PPC_SYS(eclone)
+PPC_SYS(checkpoint)
+PPC_SYS(restart)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 37357a2..1551242 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -346,10 +346,12 @@
#define __NR_pwritev 321
#define __NR_rt_tgsigqueueinfo 322
#define __NR_eclone 323
+#define __NR_checkpoint 324
+#define __NR_restart 325
#ifdef __KERNEL__
-#define __NR_syscalls 324
+#define __NR_syscalls 326
#define __NR__exit __NR_exit
#define NR_syscalls __NR_syscalls
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 579f1da..853814b 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -594,6 +594,29 @@ ppc_eclone:
stw r0,_TRAP(r1) /* register set saved */
b sys_eclone
+/* To handle self-checkpoint we must save nvpgprs */
+ .globl ppc_checkpoint
+ppc_checkpoint:
+ SAVE_NVGPRS(r1)
+ lwz r0,_TRAP(r1)
+ rlwinm r0,r0,0,0,30 /* clear LSB to indicate full */
+ stw r0,_TRAP(r1) /* register set saved */
+ b sys_checkpoint
+
+/* The full register set must be restored upon return ...
From: Nathan Lynch <ntl@pobox.com>
A checkpointed task image may specify a value for the DABR (Data
Access Breakpoint Register). The restart code needs to validate this
value before making any changes to the current task.
ptrace_set_debugreg encapsulates the bounds checking and platform
dependencies of programming the DABR. Split this into "validate"
(debugreg_valid) and "update" (debugreg_update) functions, and make
them available for use outside of the ptrace code.
Also ptrace_set_debugreg has extern linkage, but no users outside of
ptrace.c. Make it static.
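To illustrate the validate/update split, here is a minimal user-space
sketch. It is not the code from this patch: the mock_* names, the address
limit and the flag value are hypothetical, and the checks are only assumed
to resemble the bounds checking that ptrace_set_debugreg performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's user address limit and the
 * DABR "translation enabled" flag; the values are assumptions. */
#define MOCK_TASK_SIZE        0x80000000UL
#define MOCK_DABR_TRANSLATION 0x4UL

/* Hypothetical per-task debug state (not the kernel's task_struct). */
struct mock_task {
	unsigned long dabr;
};

/* "Validate" half: pure check, never touches the task. */
static bool mock_debugreg_valid(unsigned long val, unsigned int index)
{
	if (index != 0)                         /* assume a single debug register */
		return false;
	if ((val & ~0x7UL) >= MOCK_TASK_SIZE)   /* low 3 bits are flags */
		return false;
	if (val && !(val & MOCK_DABR_TRANSLATION))
		return false;
	return true;
}

/* "Update" half: assumes the value was validated by the caller. */
static void mock_debugreg_update(struct mock_task *task, unsigned long val,
				 unsigned int index)
{
	assert(mock_debugreg_valid(val, index));
	task->dabr = val;
}

/* Restart-style caller: reject a bad image value before changing the task. */
static int mock_restore_debugreg(struct mock_task *task, unsigned long val)
{
	if (!mock_debugreg_valid(val, 0))
		return -1;
	mock_debugreg_update(task, val, 0);
	return 0;
}
```

Keeping the check free of side effects is what lets a restart path validate
every value from the image first and only then start modifying the task.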
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
---
arch/powerpc/include/asm/ptrace.h | 7 +++
arch/powerpc/kernel/ptrace.c | 88 +++++++++++++++++++++++++------------
2 files changed, 66 insertions(+), 29 deletions(-)
diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index cbd759e..df5825b 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -81,6 +81,8 @@ struct pt_regs {
#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
#define instruction_pointer(regs) ((regs)->nip)
#define user_stack_pointer(regs) ((regs)->gpr[1])
#define regs_return_value(regs) ((regs)->gpr[3])
@@ -142,6 +144,11 @@ extern void user_disable_single_step(struct task_struct *);
#define ARCH_HAS_USER_SINGLE_STEP_INFO
+/* for reprogramming DABR/DAC during restart of a checkpointed task */
+extern bool debugreg_valid(unsigned long val, unsigned int index);
+extern void debugreg_update(struct task_struct *task, unsigned long val,
+ unsigned int index);
+
#endif /* __ASSEMBLY__ */
#endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index ef14988..913ec8f 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -755,22 +755,25 @@ void user_disable_single_step(struct task_struct *task)
clear_tsk_thread_flag(task, ...
From: Nathan Lynch <ntl@pobox.com>
The powerpc implementations of syscall_get_error and
syscall_set_return_value should use the CR0[SO] (summary overflow) bit
of the CCR (0x10000000) for testing and setting syscall error status.
Fortunately these APIs don't seem to be used at the moment.
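As a rough illustration of why the mask matters, the following user-space
sketch mimics the two helpers with the 0x10000000 mask. The mock_pt_regs
struct and the mock_* function names are hypothetical simplifications, not
the kernel definitions.

```c
#include <assert.h>

/* Mask for the error-status bit in the condition register, as used by the
 * fixed helpers. */
#define MOCK_CR0_SO 0x10000000UL

/* Hypothetical, reduced register frame: only the fields used here. */
struct mock_pt_regs {
	unsigned long gpr[32];
	unsigned long ccr;
};

/* Returns 0 on success, or a negative errno if the error bit is set. */
static long mock_syscall_get_error(const struct mock_pt_regs *regs)
{
	return (regs->ccr & MOCK_CR0_SO) ? -(long)regs->gpr[3] : 0;
}

/* error is 0 or a negative errno; val is the success return value. */
static void mock_syscall_set_return_value(struct mock_pt_regs *regs,
					  int error, long val)
{
	assert(error <= 0);
	if (error) {
		regs->ccr |= MOCK_CR0_SO;
		regs->gpr[3] = (unsigned long)-error;  /* positive errno in r3 */
	} else {
		regs->ccr &= ~MOCK_CR0_SO;
		regs->gpr[3] = (unsigned long)val;
	}
}
```

With the old 0x1000 mask, the getter would test a bit that the syscall exit
path does not use for error status, so an error could be read back as
success.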
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
---
arch/powerpc/include/asm/syscall.h | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index efa7f0b..23913e9 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -30,7 +30,7 @@ static inline void syscall_rollback(struct task_struct *task,
static inline long syscall_get_error(struct task_struct *task,
struct pt_regs *regs)
{
- return (regs->ccr & 0x1000) ? -regs->gpr[3] : 0;
+ return (regs->ccr & 0x10000000) ? -regs->gpr[3] : 0;
}
static inline long syscall_get_return_value(struct task_struct *task,
@@ -44,10 +44,10 @@ static inline void syscall_set_return_value(struct task_struct *task,
int error, long val)
{
if (error) {
- regs->ccr |= 0x1000L;
+ regs->ccr |= 0x10000000L;
regs->gpr[3] = -error;
} else {
- regs->ccr &= ~0x1000L;
+ regs->ccr &= ~0x10000000L;
regs->gpr[3] = val;
}
}
--
1.6.3.3
--
From: Nathan Lynch <ntl@pobox.com>
Support for checkpointing and restarting GPRs, FPU state, DABR, and
Altivec state.
The portion of the checkpoint image manipulated by this code begins
with a bitmask of features indicating the various contexts saved.
Fields in the image that can vary depending on kernel configuration
(e.g. FP regs due to VSX) have their sizes explicitly recorded, except
for GPRs, so migrating between ppc32 and ppc64 won't work yet.
The restart code ensures that the task is not modified until the
checkpoint image is validated against the current kernel configuration
and hardware features (e.g. can't restart a task using Altivec on
non-Altivec systems).
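The validate-before-modify rule can be sketched in user-space C as follows.
The feature bits, struct layouts and mock_* names are hypothetical and do
not reflect the image format defined by this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical feature bits describing which contexts an image holds. */
#define MOCK_CTX_FPU     (1u << 0)
#define MOCK_CTX_DABR    (1u << 1)
#define MOCK_CTX_ALTIVEC (1u << 2)

/* Hypothetical image header: feature bitmask plus a recorded size. */
struct mock_image_hdr {
	uint32_t features;
	uint32_t fpr_size;   /* size of the FP register block in the image */
};

/* What the running kernel/hardware supports (also hypothetical). */
struct mock_host {
	uint32_t features;
	uint32_t fpr_size;
};

struct mock_task_state {
	uint32_t active_features;
};

/* Pure check: no task state is touched here. */
static bool mock_image_compatible(const struct mock_image_hdr *img,
				  const struct mock_host *host)
{
	if (img->features & ~host->features)   /* e.g. Altivec image, no Altivec */
		return false;
	if ((img->features & MOCK_CTX_FPU) && img->fpr_size != host->fpr_size)
		return false;                   /* e.g. VSX vs. non-VSX layout */
	return true;
}

/* Restore only after the whole header has been accepted. */
static int mock_restore_cpu(struct mock_task_state *task,
			    const struct mock_image_hdr *img,
			    const struct mock_host *host)
{
	assert(task && img && host);
	if (!mock_image_compatible(img, host))
		return -1;
	task->active_features = img->features;
	return 0;
}
```

A rejected image leaves the task untouched, so a failed restart does not
leave it half-restored.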
What works:
* self and external checkpoint of simple (single thread, one open
file) 32- and 64-bit processes on a ppc64 kernel
What doesn't work:
* restarting a 32-bit task from a 64-bit task and vice versa
Untested:
* ppc32 (but it builds)
Changelog[v19]:
- [Serge Hallyn] Add hook task_has_saved_sigmask()
Changelog[v19-rc3]:
- [Oren Laadan] Move checkpoint.c from arch/powerpc/{mm->kernel}
- [Nathan Lynch] Warn if full register state unavailable
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
[Oren Laadan <orenl@cs.columbia.edu>] Add arch-specific tty support
---
arch/powerpc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/checkpoint_hdr.h | 37 ++
arch/powerpc/kernel/Makefile | 1 +
arch/powerpc/kernel/checkpoint.c | 533 +++++++++++++++++++++++++++++
arch/powerpc/kernel/signal.c | 6 +
5 files changed, 578 insertions(+), 0 deletions(-)
create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h
create mode 100644 arch/powerpc/kernel/checkpoint.c
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 5ab7d7f..20379f1 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -12,6 +12,7 @@ header-y += shmbuf.h
header-y += ...
Checkpoint and restore task->fs. Tasks that shared an fs_struct at
checkpoint will share one again after restart.
Original patch by Serge Hallyn <serue@us.ibm.com>
Changelog:
Jan 25: [orenl] Addressed comments by .. myself:
- add leak detection
- change order of save/restore of chroot and cwd
- save/restore fs only after file-table and mm
- rename functions to follow existing conventions
Dec 28: [serge] Addressed comments by Oren (and Dave)
- define and use {get,put}_fs_struct helpers
- fix locking comment
- define ckpt_read_fname() and use in checkpoint/files.c
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
---
checkpoint/files.c | 203 +++++++++++++++++++++++++++++++++++++++-
checkpoint/objhash.c | 34 +++++++
checkpoint/process.c | 17 ++++
fs/fs_struct.c | 21 ++++
fs/open.c | 58 +++++++-----
include/linux/checkpoint.h | 8 ++-
include/linux/checkpoint_hdr.h | 12 +++
include/linux/fs.h | 4 +
include/linux/fs_struct.h | 2 +
9 files changed, 331 insertions(+), 28 deletions(-)
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 4b551fe..7855bae 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -15,6 +15,9 @@
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/fs_struct.h>
+#include <linux/fs.h>
#include <linux/fdtable.h>
#include <linux/fsnotify.h>
#include <linux/pipe_fs_i.h>
@@ -374,6 +377,62 @@ int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
return objref;
}
+int checkpoint_obj_fs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct fs_struct *fs;
+ int fs_objref;
+
+ task_lock(current);
+ fs = t->fs;
+ get_fs_struct(fs);
+ task_unlock(current);
+
+ fs_objref = checkpoint_obj(ctx, fs, CKPT_OBJ_FS);
+ put_fs_struct(fs);
+
return ...
This patch adds support for checkpoint and restart of pseudo terminals
(PTYs). Since PTYs are shared (pointed to by file, and signal), they
are managed via objhash.
PTYs are master/slave pairs; the code arranges for the master to
always be checkpointed first, followed by the slave. This is important
since during restart both ends are created when restoring the master.
In this patch only UNIX98 style PTYs are supported.
Currently only PTYs that are referenced by open files are handled.
Thus PTY checkpoint starts with a file in tty_file_checkpoint(). It
will first checkpoint the master and slave PTYs via tty_checkpoint(),
and then complete the saving of the file descriptor. This means that
in the image file, the order of objects is: master-tty, slave-tty,
file-desc.
During restart, to restore the master side, we open the /dev/ptmx
device and get a file handle. But at this point we don't know the
designated objref for this file, because the file is due later on in
the image stream. On the other hand, we can't just fput() the file
because it will close the PTY too.
Instead, when we checkpoint the master PTY, we _reserve_ an objref
for the file (which won't be further used in checkpoint). Then, at
restart, we use it to insert the file into the objhash.
TODO:
* Better sanitize input from checkpoint image on restore
* Check the locking when saving/restoring tty_struct state
* Echo position/buffer isn't saved (is it needed ?)
* Handle multiple devpts mounts (namespaces)
* Paths of ptmx and slaves are hard coded (/dev/ptmx, /dev/pts/...)
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog[v4]:
- Fix error path(s) in restore_tty_ldisc()
- Fix memory leak in restore_tty_ldisc()
Changelog[v3]:
- [Serge Hallyn] Set tty on error path
Changelog[v2]:
- Don't save/restore tty->{session,pgrp}
- Fix leak: drop file reference after ckpt_obj_insert()
- Move get_file() inside locked clause (fix race)
Changelog[v1]:
- Adjust ...
During restart, we need to allocate pty slaves with the same
identifiers as recorded during checkpoint. Modify the allocation code
to allow an in-kernel caller to request a specific slave identifier.
For this, add a new field to task_struct - 'required_id'. It will
hold the desired identifier when restoring a (master) pty.
The code in ptmx_open() will use this value only for restarting tasks
(PF_RESTARTING) that open /dev/ptmx, and only if the value isn't
CKPT_REQUIRED_NONE (-1).
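The selection rule can be sketched in user-space C as below. Everything
here is a hypothetical model (the mock_* names, the flag value and the
fixed-size table); only the -1 "no requirement" sentinel and the two
conditions are taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_PF_RESTARTING 0x1u   /* hypothetical flag value */
#define MOCK_REQUIRED_NONE (-1)   /* mirrors CKPT_REQUIRED_NONE (-1) */
#define MOCK_NR_PTYS       8      /* hypothetical table size */

struct mock_task {
	unsigned int flags;
	int required_id;
};

static bool mock_pty_in_use[MOCK_NR_PTYS];

/* Allocate a slave index: the requested one if given, else the lowest
 * free one. Returns -1 if the request cannot be satisfied. */
static int mock_new_index(int req)
{
	int i;

	if (req != MOCK_REQUIRED_NONE) {
		if (req < 0 || req >= MOCK_NR_PTYS || mock_pty_in_use[req])
			return -1;
		mock_pty_in_use[req] = true;
		return req;
	}
	for (i = 0; i < MOCK_NR_PTYS; i++) {
		if (!mock_pty_in_use[i]) {
			mock_pty_in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/* Honor required_id only for restarting tasks that actually set it. */
static int mock_ptmx_open(const struct mock_task *task)
{
	int req = MOCK_REQUIRED_NONE;

	assert(task != 0);
	if ((task->flags & MOCK_PF_RESTARTING) &&
	    task->required_id != MOCK_REQUIRED_NONE)
		req = task->required_id;
	return mock_new_index(req);
}
```

In this model a task that is not restarting always gets the default
allocation, regardless of what its required_id field holds.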
Changelog[v19-rc3]:
- Rebase to kernel 2.6.33
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
drivers/char/pty.c | 65 +++++++++++++++++++++++++++++++++++++++++---
drivers/char/tty_io.c | 1 +
fs/devpts/inode.c | 13 +++++++--
include/linux/devpts_fs.h | 6 +++-
include/linux/tty.h | 2 +
5 files changed, 78 insertions(+), 9 deletions(-)
diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index 385c44b..33f720b 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -614,9 +614,10 @@ static const struct tty_operations pty_unix98_ops = {
};
/**
- * ptmx_open - open a unix 98 pty master
+ * __ptmx_open - open a unix 98 pty master
* @inode: inode of device file
* @filp: file pointer to tty
+ * @index: desired slave index
*
* Allocate a unix98 pty master device from the ptmx driver.
*
@@ -625,16 +626,15 @@ static const struct tty_operations pty_unix98_ops = {
* allocated_ptys_lock handles the list of free pty numbers
*/
-static int __ptmx_open(struct inode *inode, struct file *filp)
+static int __ptmx_open(struct inode *inode, struct file *filp, int index)
{
struct tty_struct *tty;
int retval;
- int index;
nonseekable_open(inode, filp);
/* find a device that is not in use. */
- index = devpts_new_index(inode);
+ index = devpts_new_index(inode, index);
if (index < 0)
return index;
@@ ...
This adds new 'proto_ops' functions for checkpointing and restoring
sockets. This allows the checkpoint/restart code to compile nicely
when, e.g., AF_UNIX sockets are selected as a module.
It also adds a function 'collecting' a socket for leak-detection
during full-container checkpoint. This is useful for those sockets
that hold references to other "collectable" objects. Two examples are
AF_UNIX buffers which reference the socket of origin, and sockets that
have file descriptors in-transit.
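A hedged sketch of how generic code might dispatch through such optional
per-family hooks, written as stand-alone C. The mock_* types and the choice
of return values for a missing hook are assumptions for illustration, not
the behavior of this patch set.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_ENOSYS 38   /* hypothetical error code for "not supported" */

struct mock_ctx {
	int collected;   /* counts objects seen by leak detection */
};

struct mock_socket;

/* Reduced model of an ops table with optional c/r hooks. */
struct mock_proto_ops {
	int (*checkpoint)(struct mock_ctx *ctx, struct mock_socket *sock);
	int (*collect)(struct mock_ctx *ctx, struct mock_socket *sock);
};

struct mock_socket {
	const struct mock_proto_ops *ops;
};

/* Generic entry point: families without a hook cannot be checkpointed. */
static int mock_sock_checkpoint(struct mock_ctx *ctx, struct mock_socket *sock)
{
	assert(sock->ops != NULL);
	if (!sock->ops->checkpoint)
		return -MOCK_ENOSYS;
	return sock->ops->checkpoint(ctx, sock);
}

/* Collecting is optional: a missing hook means nothing extra to collect. */
static int mock_sock_collect(struct mock_ctx *ctx, struct mock_socket *sock)
{
	assert(sock->ops != NULL);
	if (!sock->ops->collect)
		return 0;
	return sock->ops->collect(ctx, sock);
}

/* Example family that implements both hooks. */
static int mock_unix_checkpoint(struct mock_ctx *ctx, struct mock_socket *sock)
{
	(void)ctx;
	(void)sock;
	return 0;
}

static int mock_unix_collect(struct mock_ctx *ctx, struct mock_socket *sock)
{
	(void)sock;
	ctx->collected++;
	return 0;
}

static const struct mock_proto_ops mock_unix_ops = {
	mock_unix_checkpoint, mock_unix_collect
};
static const struct mock_proto_ops mock_bare_ops = { NULL, NULL };
```

Because the generic code only ever calls through the function pointers, it
needs no link-time dependency on any particular socket family, which is the
point made above about families built as modules.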
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/net.h | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/include/linux/net.h b/include/linux/net.h
index 5e8083c..72a53b9 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -153,6 +153,9 @@ struct sockaddr;
struct msghdr;
struct module;
+struct ckpt_ctx;
+struct ckpt_hdr_socket;
+
struct proto_ops {
int family;
struct module *owner;
@@ -197,6 +200,12 @@ struct proto_ops {
int offset, size_t size, int flags);
ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);
+ int (*checkpoint)(struct ckpt_ctx *ctx,
+ struct socket *sock);
+ int (*collect)(struct ckpt_ctx *ctx,
+ struct socket *sock);
+ int (*restore)(struct ckpt_ctx *ctx, struct socket *sock,
+ struct ckpt_hdr_socket *h);
};
#define DECLARE_SOCKADDR(type, dst, src) \
--
1.6.3.3
--
This patch adds checkpoint and restart of pending signal queues:
struct sigpending, both per-task t->sigpending and shared
(per-thread-group) t->signal->shared_sigpending.
To checkpoint pending signals (private/shared) we first detach the
signal queue (and copy the mask) to a separate struct sigpending.
This separate structure can be iterated through without locking.
Once the state is saved, we re-attach (prepend) the original signal
queue back to the original struct sigpending.
Signals that arrive(d) in the meantime will be suitably queued after
these (for real-time signals). Repeated non-realtime signals will not
be queued because they will already be marked in the pending mask,
which remains as is. This is the expected behavior of non-realtime
signals.
Changelog [v19-rc1]:
- Switch to ckpt_obj_try_fetch()
- [Matt Helsley] Add cpp definitions for enums
Changelog [v4]:
- Rename headerless struct ckpt_hdr_* to struct ckpt_*
Changelog [v3]:
- [Dan Smith] Sanity check for number of pending signals in buffer
Changelog [v2]:
- Validate si_errno from checkpoint image
Changelog [v1]:
- Fix compilation warnings
- [Louis Rilling] Remove SIGQUEUE_PREALLOC flag from queued signals
- [Louis Rilling] Fail if task has posix-timers or SI_TIMER signal
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/signal.c | 280 +++++++++++++++++++++++++++++++++++++++-
include/linux/checkpoint_hdr.h | 24 ++++
2 files changed, 301 insertions(+), 3 deletions(-)
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 5884462..3d13c56 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -167,12 +167,156 @@ int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
* signal checkpoint/restart
*/
+static void fill_siginfo(struct ckpt_siginfo *si, siginfo_t ...
This patch adds checkpoint and restart of rlimit information that is
part of shared signal_struct.
Changelog[v19-rc3]:
- Rebase to kernel 2.6.33
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/checkpoint.c | 2 ++
checkpoint/restart.c | 3 +++
checkpoint/signal.c | 27 +++++++++++++++++++++++----
include/linux/checkpoint_hdr.h | 17 +++++++++++++++++
include/linux/resource.h | 1 +
kernel/sys.c | 36 +++++++++++++++++++++++-------------
6 files changed, 69 insertions(+), 17 deletions(-)
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 445fef7..71a4bec 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -122,6 +122,8 @@ static void fill_kernel_const(struct ckpt_const *h)
h->uts_version_len = sizeof(uts->version);
h->uts_machine_len = sizeof(uts->machine);
h->uts_domainname_len = sizeof(uts->domainname);
+ /* rlimit */
+ h->rlimit_nlimits = RLIM_NLIMITS;
}
/* write the checkpoint header */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 026911e..863ee87 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -583,6 +583,9 @@ static int check_kernel_const(struct ckpt_const *h)
return -EINVAL;
if (h->uts_domainname_len != sizeof(uts->domainname))
return -EINVAL;
+ /* rlimit */
+ if (h->rlimit_nlimits != RLIM_NLIMITS)
+ return -EINVAL;
return 0;
}
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index fedb8f8..5884462 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -14,6 +14,7 @@
#include <linux/sched.h>
#include <linux/signal.h>
#include <linux/errno.h>
+#include <linux/resource.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>
@@ -169,13 +170,22 @@ int restore_obj_sighand(struct ckpt_ctx *ctx, int ...
This patch adds the checkpointing and restart of signal handling
state - 'struct sighand_struct'. Since the contents of this state
only affect userspace, no input validation is required.
Add _NSIG to kernel constants saved/tested with image header.
The number of signals (_NSIG) is arch-dependent, but it is defined under
__KERNEL__ and is not visible to userspace builds. Therefore, define a
per-arch CKPT_ARCH_NSIG in <asm/checkpoint_hdr.h>.
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog[v1]:
- Rename headerless struct ckpt_hdr_* to struct ckpt_*
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
arch/s390/include/asm/checkpoint_hdr.h | 8 ++
arch/x86/include/asm/checkpoint_hdr.h | 8 ++
checkpoint/Makefile | 3 +-
checkpoint/checkpoint.c | 2 +
checkpoint/objhash.c | 26 +++++
checkpoint/process.c | 19 ++++
checkpoint/restart.c | 3 +
checkpoint/signal.c | 163 ++++++++++++++++++++++++++++++++
include/linux/checkpoint.h | 9 ++-
include/linux/checkpoint_hdr.h | 24 +++++
10 files changed, 263 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/signal.c
diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
index e3312c0..7d30317 100644
--- a/arch/s390/include/asm/checkpoint_hdr.h
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -91,6 +91,14 @@ struct ckpt_hdr_mm_context {
unsigned long asce_limit;
};
+#define CKPT_ARCH_NSIG 64
+#ifdef __KERNEL__
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _SIGCONTEXT_NSIG
+#error CKPT_ARCH_NSIG size is wrong (asm/sigcontext.h and asm/checkpoint_hdr.h)
+#endif
+#endif
+
struct ckpt_hdr_header_arch {
struct ckpt_hdr h;
};
diff --git a/arch/x86/include/asm/checkpoint_hdr.h ...
From: Serge E. Hallyn <serue@us.ibm.com>
[ Andrew: I am punting on dealing with the subsystem cooperation
issues in this version, in favor of trying to get LSM issues
straightened out ]
An application checkpoint image will store capability sets (and the
bounding set) as __u64s. Define checkpoint and restart functions to
translate between those and kernel_cap_t's.
Define a common function do_capset_tocred() which applies capability
set changes to a passed-in struct cred.
The restore function uses do_capset_tocred() to apply the restored
capabilities to the struct cred being crafted, subject to the current
task's (task executing sys_restart()) permissions.
Changelog [v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog:
Jun 09: Can't choose securebits or drop bounding set if
file capabilities aren't compiled into the kernel.
Also just store caps in __u32s (looks cleaner).
Jun 01: Made the checkpoint and restore functions and the
ckpt_hdr_capabilities struct more opaque to the
rest of the c/r code, as suggested by Andrew Morgan,
and using naming suggested by Oren.
Jun 01: Add commented BUILD_BUG_ON() to point out that the
current implementation depends on 64-bit capabilities.
(Andrew Morgan and Alexey Dobriyan).
May 28: add helpers to c/r securebits
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
include/linux/capability.h | 6 ++
include/linux/checkpoint_hdr.h | 12 +++
kernel/capability.c | 164 +++++++++++++++++++++++++++++++++++++---
security/commoncap.c | 19 +----
4 files changed, 173 insertions(+), 28 deletions(-)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 39e5ff5..5abd86c 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -566,6 +566,12 @@ extern int capable(int cap);
struct dentry;
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data ...
From: Serge E. Hallyn <serue@us.ibm.com>
Restore a file's f_cred. This is set to the cred of the task doing
the open, so often it will be the same as that of the restarted task.
Changelog[v1]:
- [Nathan Lynch] discard const from struct cred * where appropriate
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
checkpoint/files.c | 18 ++++++++++++++++--
include/linux/checkpoint_hdr.h | 2 +-
2 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 62feadd..63a611f 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -148,15 +148,21 @@ static int scan_fds(struct files_struct *files, int **fdtable)
int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
struct ckpt_hdr_file *h)
{
+ struct cred *f_cred = (struct cred *) file->f_cred;
+
h->f_flags = file->f_flags;
h->f_mode = file->f_mode;
h->f_pos = file->f_pos;
h->f_version = file->f_version;
+ h->f_credref = checkpoint_obj(ctx, f_cred, CKPT_OBJ_CRED);
+ if (h->f_credref < 0)
+ return h->f_credref;
+
ckpt_debug("file %s credref %d", file->f_dentry->d_name.name,
h->f_credref);
- /* FIX: need also file->uid, file->gid, file->f_owner, etc */
+ /* FIX: need also file->f_owner, etc */
return 0;
}
@@ -522,8 +528,16 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
fmode_t new_mode = file->f_mode;
fmode_t saved_mode = (__force fmode_t) h->f_mode;
int ret;
+ struct cred *cred;
+
+ /* FIX: need to restore owner etc */
- /* FIX: need to restore uid, gid, owner etc */
+ /* restore the cred */
+ cred = ckpt_obj_fetch(ctx, h->f_credref, CKPT_OBJ_CRED);
+ if (IS_ERR(cred))
+ return PTR_ERR(cred);
+ put_cred(file->f_cred);
+ file->f_cred = get_cred(cred);
/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
diff --git ...
From: Serge E. Hallyn <serue@us.ibm.com>
This patch adds the checkpointing and restart of credentials (uids,
gids, and capabilities) to Oren's c/r patchset (on top of v14). It
goes to great pains to re-use (and define when needed) common helpers,
in order to make sure that as security code is modified, the c/r code
will be updated. Some of the helpers should still be moved (i.e.
_creds() functions should be in kernel/cred.c).
When building the credentials for the restarted process, I
1. create a new struct cred as a copy of the running task's
cred (using prepare_cred())
2. always authorize any changes to the new struct cred
based on the permissions of current_cred() (not the current
transient state of the new cred).
While this may mean that certain transient_cred1->transient_cred2
states are allowed which otherwise wouldn't be allowed, the fact
remains that current_cred() is allowed to transition to
transient_cred2.
The reconstructed creds are applied to the task at the very end of
the sys_restart call. This ensures that any objects which need to be
re-created (file, socket, etc) are re-created using the creds of the
task calling sys_restart - preventing an unpriv user from creating a
privileged object, and ensuring that a root task can restart a process
which had started out privileged, created some privileged objects,
then dropped its privilege.
With these patches, the root user can restart checkpoint images
(created by either hallyn or root) of user hallyn's tasks, resulting
in a program owned by hallyn.
Changelog [v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog:
Sep 08: [NTL] discard const from struct cred * where appropriate
Jun 15: Fix user_ns handling when !CONFIG_USER_NS
Set creator_ref=0 for root_ns (discard @flags)
Don't overwrite global user-ns if CONFIG_USER_NS
Jun 10: Merge with ckpt-v16-dev (Oren Laadan)
Jun 01: Don't check ordering of groups in group_info, bc
set_groups() will sort it for us.
May 28: 1. ...
Checkpoint of sysvipc semaphores is performed by iterating through all
sem objects and dumping the contents of each one. The semaphore array
of each sem is dumped with that object.
The semaphore array (sem->sem_base) holds an array of 'struct sem',
which is a {int, int}. Because this translates into the same format
on 32- and 64-bit architectures, the checkpoint format is simply the
dump of this array as is.
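A small user-space sketch of what "dump the array as is" amounts to. The
mock_sem layout follows the {int, int} description above; the mock_*
function names, the buffer handling and the length checks are assumptions
added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two ints per semaphore, as described above; field names are illustrative. */
struct mock_sem {
	int semval;
	int sempid;
};

/* Copy nsems entries verbatim into buf. Returns the number of bytes
 * written, or 0 if buf is too small. */
static size_t mock_dump_sems(void *buf, size_t buflen,
			     const struct mock_sem *sems, unsigned int nsems)
{
	size_t len = (size_t)nsems * sizeof(*sems);

	/* The raw dump is only portable if the struct has no padding. */
	assert(sizeof(struct mock_sem) == 2 * sizeof(int));
	if (len > buflen)
		return 0;
	memcpy(buf, sems, len);
	return len;
}

/* Inverse operation: the payload length must match nsems exactly. */
static int mock_load_sems(struct mock_sem *sems, unsigned int nsems,
			  const void *buf, size_t buflen)
{
	size_t len = (size_t)nsems * sizeof(*sems);

	if (len != buflen)
		return -1;
	memcpy(sems, buf, len);
	return 0;
}
```

The exact-length check on load is one simple way to reject an image whose
recorded sem_nsems disagrees with its payload; whether the real restore code
checks this in the same way is not shown in this excerpt.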
TODO: this patch does not handle semaphore-undo -- this data should be
saved per-task while iterating through the tasks.
Changelog[v20]:
- Fix "scheduling in atomic" while restoring ipc sem
Changelog[v19-rc3]:
- Don't free sma if it's an error on restore
Changelog[v18]:
- Handle kmalloc failure in restore_sem_array()
Changelog[v17]:
- Restore objects in the right namespace
- Forward declare struct msg_msg (instead of include linux/msg.h)
- Fix typo in comment
- Don't unlock ipc before calling freeary in error path
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/checkpoint_hdr.h | 8 ++
ipc/Makefile | 2 +-
ipc/checkpoint.c | 4 -
ipc/checkpoint_sem.c | 240 ++++++++++++++++++++++++++++++++++++++++
ipc/sem.c | 11 +-
ipc/util.h | 8 ++
6 files changed, 261 insertions(+), 12 deletions(-)
create mode 100644 ipc/checkpoint_sem.c
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 07e918e..64c3f8f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -494,6 +494,14 @@ struct ckpt_hdr_ipc_msg_msg {
__u32 m_ts;
} __attribute__((aligned(8)));
+struct ckpt_hdr_ipc_sem {
+ struct ckpt_hdr h;
+ struct ckpt_hdr_ipc_perms perms;
+ __u64 sem_otime;
+ __u64 sem_ctime;
+ __u32 sem_nsems;
+} __attribute__((aligned(8)));
+
#define CKPT_TST_OVERFLOW_16(a, b) \
((sizeof(a) > ...
Checkpoint of sysvipc shared memory is performed in two steps: first,
the entire ipc namespace is dumped as a whole by iterating through all
shm objects and dumping the contents of each one. The shmem inode is
registered in the objhash. Second, for each vma that refers to ipc
shared memory we find the inode in the objhash, and save the objref.
(If we find a new inode, that indicates that the ipc namespace is not
entirely frozen and someone must have manipulated it since step 1).
Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.
Changelog[v20]:
- Fix "scheduling in atomic" while restoring ipc shm
Changelog[v19-rc3]:
- Rebase to kernel 2.6.33
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
- Collect files used by shm objects
- Use file instead of inode as shared object during checkpoint
Changelog[v17]:
- Restore objects in the right namespace
- Properly initialize ctx->deferqueue
- Fix compilation with CONFIG_CHECKPOINT=n
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/checkpoint.c | 5 +
checkpoint/memory.c | 28 +++-
checkpoint/restart.c | 6 +
checkpoint/sys.c | 7 +
include/linux/checkpoint.h | 10 ++
include/linux/checkpoint_hdr.h | 21 +++-
include/linux/checkpoint_types.h | 1 +
include/linux/shm.h | 15 ++
ipc/Makefile | 2 +-
ipc/checkpoint.c | 25 +++-
ipc/checkpoint_shm.c | 306 ++++++++++++++++++++++++++++++++++++++
ipc/shm.c | 84 ++++++++++-
ipc/util.h | 9 +
kernel/nsproxy.c | 8 +
mm/shmem.c | 2 +-
15 files changed, 514 insertions(+), 15 deletions(-)
create mode 100644 ipc/checkpoint_shm.c
diff ...
For a given namespace type, say XXX, if a checkpoint was taken on a
CONFIG_XXX_NS system and is restarted on a !CONFIG_XXX_NS system, then
ensure that:
1) The global settings of the global (init) namespace do not get
overwritten. Creating new objects in that namespace is ok, as long as
the requested identifier is available.
2) All restarting tasks use a single namespace, because it is
impossible to create additional namespaces to accommodate what had
been checkpointed.
Original patch introducing nsproxy c/r by Dan Smith <danms@us.ibm.com>
Changelog[v19]:
- Restart to handle checkpoint images lacking {uts,ipc}-ns
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
- Add a few more ckpt_write_err()s
Changelog[v17]:
- Only collect sub-objects of struct_nsproxy once.
- Restore namespace pieces directly instead of using sys_unshare()
- Proper handling of restart from namespace(s) without namespace(s)
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/checkpoint.c | 29 +++++++++--
checkpoint/objhash.c | 28 ++++++++++
checkpoint/process.c | 81 +++++++++++++++++++++++++++++
include/linux/checkpoint.h | 5 ++
include/linux/checkpoint_hdr.h | 16 ++++++
kernel/nsproxy.c | 110 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 265 insertions(+), 4 deletions(-)
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index fd88d5f..9bafb13 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -218,6 +218,8 @@ static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
{
struct task_struct *root = ctx->root_task;
+ struct nsproxy *nsproxy;
+ int ret = 0;
ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
@@ -257,11 +259,30 @@ static int ...
During restart, we need to allocate ipc objects with the same
identifiers as recorded during checkpoint. Modify the allocation
code to allow an in-kernel caller to request a specific ipc identifier.
The system call interface remains unchanged.
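The idea of "request a specific identifier, or -1 for don't care" can be
modelled in user-space C as below. This is a simplified sketch: the
mock_* names, the table size, and the way an id is split into a slot and a
sequence number via MOCK_SEQ_MULTIPLIER are assumptions, not the code in
ipc/util.c.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_SLOTS          16      /* hypothetical table size */
#define MOCK_SEQ_MULTIPLIER 32768   /* assumed id = seq * multiplier + slot */

struct mock_ipc_ids {
	bool used[MOCK_SLOTS];
	int seq;   /* next sequence number for "don't care" allocations */
};

/* Returns the new id, or -1 if the request cannot be satisfied.
 * req_id == -1 means the caller does not care which id it gets. */
static int mock_ipc_addid(struct mock_ipc_ids *ids, int req_id)
{
	int slot, seq;

	assert(req_id >= -1);
	if (req_id >= 0) {
		/* Restart path: reproduce exactly the checkpointed id. */
		slot = req_id % MOCK_SEQ_MULTIPLIER;
		seq = req_id / MOCK_SEQ_MULTIPLIER;
		if (slot >= MOCK_SLOTS || ids->used[slot])
			return -1;
	} else {
		/* Normal path: lowest free slot, next sequence number. */
		for (slot = 0; slot < MOCK_SLOTS && ids->used[slot]; slot++)
			;
		if (slot == MOCK_SLOTS)
			return -1;
		seq = ids->seq++;
	}
	ids->used[slot] = true;
	return seq * MOCK_SEQ_MULTIPLIER + slot;
}
```

Since user space never passes a requested id in this model, the existing
callers simply pass -1 and see the same behavior as before.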
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
ipc/msg.c | 17 ++++++++++++-----
ipc/sem.c | 17 ++++++++++++-----
ipc/shm.c | 19 +++++++++++++------
ipc/util.c | 42 +++++++++++++++++++++++++++++-------------
ipc/util.h | 9 +++++----
5 files changed, 71 insertions(+), 33 deletions(-)
diff --git a/ipc/msg.c b/ipc/msg.c
index af42ef8..9230e7c 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -73,7 +73,7 @@ struct msg_sender {
#define msg_unlock(msq) ipc_unlock(&(msq)->q_perm)
static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
-static int newque(struct ipc_namespace *, struct ipc_params *);
+static int newque(struct ipc_namespace *, struct ipc_params *, int);
#ifdef CONFIG_PROC_FS
static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
#endif
@@ -175,10 +175,12 @@ static inline void msg_rmid(struct ipc_namespace *ns, struct msg_queue *s)
* newque - Create a new msg queue
* @ns: namespace
* @params: ptr to the structure that contains the key and msgflg
+ * @req_id: request desired id if available (-1 if don't care)
*
* Called with msg_ids.rw_mutex held (writer)
*/
-static int newque(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newque(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
{
struct msg_queue *msq;
int id, retval;
@@ -202,7 +204,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
/*
* ipc_addid() locks msq
*/
- id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
+ id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni, req_id);
if (id < 0) {
security_msg_queue_free(msq);
...Add the helpers to checkpoint and restore the contents of 'struct
kern_ipc_perm'. Add header structures for ipc state. Put place-holders
to save and restore ipc state.
Save and restore the common state (parameters) of the ipc namespace.
Add generic code to iterate through the objects of sysvipc shared memory,
message queues and semaphores. The logic to save and restore the state
of these objects will be added in the next few patches.
Right now, we return -EPERM if the user calling sys_restart() isn't
allowed to create an object with the checkpointed uid. We may prefer
to simply use the caller's uid in that case - but that could lead to
subtle userspace bugs? Unsure, so going for the stricter behavior.
TODO: restore kern_ipc_perms->security.
Changelog[v20]:
- Fix "scheduling while atomic" in ipc c/r
- Cleanup: no need to restore perm->{id,key,seq}
- Fix sysvipc=n compile
Changelog[v19]:
- Restart to handle checkpoint images lacking {uts,ipc}-ns
Changelog[v19-rc3]:
- ipc_objref should be s32 like all objrefs
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
- [Serge Hallyn] Fix compile with !CONFIG_CHECKPOINT_DEBUG
Changelog[v17]:
- Fix include: use checkpoint.h not checkpoint_hdr.h
- Collect nsproxy->ipc_ns
- Restore objects in the right namespace
- If !CONFIG_IPC_NS only restore objects, not global settings
- Don't overwrite global ipc-ns if !CONFIG_IPC_NS
- Reset the checkpointed uid and gid info on ipc objects
- Fix compilation with CONFIG_SYSVIPC=n
Changelog [Dan Smith <danms@us.ibm.com>]
- Fix compilation with CONFIG_SYSVIPC=n
- Update to match UTS changes
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/checkpoint.c | 2 -
checkpoint/objhash.c | 28 +++
include/linux/checkpoint.h | 13 ++
include/linux/checkpoint_hdr.h | 60 +++++++
...From: Dan Smith <danms@us.ibm.com>
This patch adds a "phase" of checkpoint that saves out information about any
namespaces the task(s) may have. Do this by tracking the namespace objects
of the tasks and making sure that tasks with the same namespace that follow
get properly referenced in the checkpoint stream.
Changes[v20]:
- Make uts_ns=n compile
Changes[v19]:
- Restart to handle checkpoint images lacking {uts,ipc}-ns
Changes[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changes[v17]:
- Collect nsproxy->uts_ns
- Save uts string lengths once in ckpt_hdr_const
- Save and restore all fields of uts-ns
- Don't overwrite global uts-ns if !CONFIG_UTS_NS
- Replace sys_unshare() with create_uts_ns()
- Take uts_sem around access to uts data
Changes:
- Remove the kernel restore path
- Punt on nested namespaces
- Use __NEW_UTS_LEN in nodename and domainname buffers
- Add a note to Documentation/checkpoint/internals.txt to indicate where
in the save/restore process the UTS information is kept
- Store (and track) the objref of the namespace itself instead of the
nsproxy (based on comments from Dave on IRC)
- Remove explicit check for non-root nsproxy
- Store the nodename and domainname lengths and use ckpt_write_string()
to store the actual name strings
- Catch failure of ckpt_obj_add_ptr() in ckpt_write_namespaces()
- Remove "types" bitfield and use the "is this new" flag to determine
whether or not we should write out a new ns descriptor
- Replace kernel restore path
- Move the namespace information to be directly after the task
information record
- Update Documentation to reflect new location of namespace info
- Support checkpoint and restart of nested UTS namespaces
Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/Makefile | 1 ...
The bulk of the work is in ckpt_read_vma(), which has been refactored:
the part that creates the suitable 'struct file *' for the mapping is
now larger and moved to a separate function. What's left is to read
the VMA description, get the file pointer, create the mapping, and
proceed to read the contents in.
Both anonymous shared VMAs that have been read earlier (as indicated
by a look up to objhash) and file-mapped shared VMAs are skipped.
Anonymous shared VMAs seen for the first time have their contents
read in directly to the backing inode, as indexed by the page numbers
(as opposed to virtual addresses).
Changelog[v14]:
- Introduce patch
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/memory.c | 66 ++++++++++++++++++++++++++++++++------------
include/linux/checkpoint.h | 6 ++++
include/linux/mm.h | 2 +
mm/filemap.c | 13 ++++++++-
mm/shmem.c | 49 ++++++++++++++++++++++++++++++++
5 files changed, 117 insertions(+), 19 deletions(-)
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 0fe3b38..b56124e 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -875,16 +875,39 @@ int restore_read_page(struct ckpt_ctx *ctx, struct page *page)
return 0;
}
+static struct page *bring_private_page(unsigned long addr)
+{
+ struct page *page;
+ int ret;
+
+ ret = get_user_pages(current, current->mm, addr, 1, 1, 1, &page, NULL);
+ if (ret < 0)
+ page = ERR_PTR(ret);
+ return page;
+}
+
+static struct page *bring_shared_page(unsigned long idx, struct inode *ino)
+{
+ struct page *page = NULL;
+ int ret;
+
+ ret = shmem_getpage(ino, idx, &page, SGP_WRITE, NULL);
+ if (ret < 0)
+ return ERR_PTR(ret);
+ if (page)
+ unlock_page(page);
+ return page;
+}
+
/**
* read_pages_contents - read in data of pages in page-array chain
* @ctx - restart context
*/
-static int ...
We now handle anonymous and file-mapped shared memory. Support for IPC
shared memory requires support for IPC first.
We extend ckpt_write_vma() to detect shared memory VMAs and handle them
separately from private memory.
There is not much to do for file-mapped shared memory, except to force
msync() on the region to ensure that the file system is consistent
with the checkpoint image. Use our internal type CKPT_VMA_SHM_FILE.
Anonymous shared memory is always backed by an inode in the shmem
filesystem. We use that inode to look up the VMA in the objhash and
register it if not found (on first encounter). In this case, the type
of the VMA is CKPT_VMA_SHM_ANON, and we dump the contents. On the
other hand, if it is found there, we must have already saved it
before, so we change the type to CKPT_VMA_SHM_ANON_SKIP and skip it.
To dump the contents of a shmem VMA, we loop through the pages of the
inode in the shmem filesystem, and dump the contents of each dirty
(allocated) page - unallocated pages must be clean.
Note that we save the original size of a shmem VMA because it may have
been re-mapped partially. The format itself remains as with private
VMAs, except that instead of addresses we record _indices_ (page nr)
into the backing inode.
Changelog[v19-rc3]:
- Rebase to kernel 2.6.33
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
- Mark the backing file as visited at checkpoint
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/memory.c | 141 ++++++++++++++++++++++++++++++++++++----
checkpoint/objhash.c | 17 +++++
include/linux/checkpoint.h | 15 +++--
include/linux/checkpoint_hdr.h | 12 ++++
mm/filemap.c | 39 +++++++++++-
mm/mmap.c | 2 +-
mm/shmem.c | 35 ++++++++++
7 files changed, 239 insertions(+), 22 deletions(-)
diff ...
* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
drivers/char/mem.c | 2 ++
drivers/char/random.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 48788db..57e3443 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -763,6 +763,7 @@ static const struct file_operations null_fops = {
.read = read_null,
.write = write_null,
.splice_write = splice_write_null,
+ .checkpoint = generic_file_checkpoint,
};
#ifdef CONFIG_DEVPORT
@@ -779,6 +780,7 @@ static const struct file_operations zero_fops = {
.read = read_zero,
.write = write_zero,
.mmap = mmap_zero,
+ .checkpoint = generic_file_checkpoint,
};
/*
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 2849713..c082789 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1169,6 +1169,7 @@ const struct file_operations random_fops = {
.poll = random_poll,
.unlocked_ioctl = random_ioctl,
.fasync = random_fasync,
+ .checkpoint = generic_file_checkpoint,
};
const struct file_operations urandom_fops = {
@@ -1176,6 +1177,7 @@ const struct file_operations urandom_fops = {
.write = random_write,
.unlocked_ioctl = random_ioctl,
.fasync = random_fasync,
+ .checkpoint = generic_file_checkpoint,
};
/***************************************************************
--
1.6.3.3
--
From: Dave Hansen <dave@linux.vnet.ibm.com>
This marks ext[234] as being checkpointable. Many more filesystems
will need the same treatment, but this is a start.
Changelog[ckpt-v19-rc3]:
- Rebase to kernel 2.6.33 (ext2)
Changelog[v1]:
- [Serge Hallyn] Use filemap_checkpoint() in ext4_file_vm_ops
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
fs/ext2/dir.c | 1 +
fs/ext2/file.c | 2 ++
fs/ext3/dir.c | 1 +
fs/ext3/file.c | 1 +
fs/ext4/dir.c | 1 +
fs/ext4/file.c | 4 ++++
6 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..84c17f9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -722,4 +722,5 @@ const struct file_operations ext2_dir_operations = {
.compat_ioctl = ext2_compat_ioctl,
#endif
.fsync = ext2_fsync,
+ .checkpoint = generic_file_checkpoint,
};
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 586e358..b38d7b9 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -75,6 +75,7 @@ const struct file_operations ext2_file_operations = {
.fsync = ext2_fsync,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
+ .checkpoint = generic_file_checkpoint,
};
#ifdef CONFIG_EXT2_FS_XIP
@@ -90,6 +91,7 @@ const struct file_operations ext2_xip_file_operations = {
.open = generic_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
+ .checkpoint = generic_file_checkpoint,
};
#endif
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 373fa90..65f98af 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
#endif
.fsync = ext3_sync_file, /* BKL held */
.release = ext3_release_dir,
+ .checkpoint = generic_file_checkpoint,
};
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 388bbdf..bcd9b88 ...
Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, they are assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).
Also provide generic_checkpoint_file() and generic_restore_file(),
which are suitable for normal files and directories. They do not yet
support unlinked files or directories.
Changelog[v19]:
- Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
- [Serge Hallyn] Rename fs_mnt to root_fs_path
- [Dave Hansen] Error out on file locks and leases
- [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
- Add a few more ckpt_write_err()s
- [Dan Smith] Export fill_fname() as ckpt_fill_fname()
- Introduce ckpt_collect_file() that also uses file->collect method
- In collect_file_stabl() use retval from ckpt_obj_collect() to
test for first-time-object
Changelog[v17]:
- Only collect sub-objects of files_struct once
- Better file error debugging
- Use (new) d_unlinked()
Changelog[v16]:
- Fix compile warning in checkpoint_bad()
Changelog[v16]:
- Reorder patch (move earlier in series)
- Handle shared files_struct objects
Changelog[v14]:
- File objects are dumped/restored prior to the first reference
- Introduce a per file-type restore() callback
- Use struct file_operations->checkpoint()
- Put code for generic file descriptors in a separate function
- Use one CKPT_FILE_GENERIC for both regular files and dirs
- Revert change to pr_debug(), back to ckpt_debug()
- Use only unsigned fields in checkpoint headers
- Rename: ckpt_write_files() => checkpoint_fd_table()
- Rename: ckpt_write_fd_data() => checkpoint_file()
- Discard field 'h->parent'
Changelog[v12]:
- Replace obsolete ...
Changelog[v17]
- Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
include/linux/mm.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60c467b..48d67ee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@ struct file_ra_state;
struct user_struct;
struct writeback_control;
struct rlimit;
+struct ckpt_ctx;
#ifndef CONFIG_DISCONTIGMEM /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;
@@ -220,6 +221,9 @@ struct vm_operations_struct {
int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
const nodemask_t *to, unsigned long flags);
#endif
+#ifdef CONFIG_CHECKPOINT
+ int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma);
+#endif
};
struct mmu_gather;
--
1.6.3.3
--
This is a preparatory patch necessary for checkpoint/restart (next two
patches) of memory to work correctly.
The patch introduces a new FOLL_DIRTY flag which tells follow_page()
to return -EFAULT also for not-present file-backed pages.
In 2.6.32 follow_page() changes its behavior due to this commit:
  mm: FOLL_DUMP replace FOLL_ANON
  8e4b9a60718970bbc02dfd3abd0b956ab65af231
Also introduce __get_dirty_page() that returns a page only if it's
"dirty", that is, it has been modified before, and otherwise returns
NULL. It uses FOLL_DUMP | FOLL_DIRTY and converts the error value
EFAULT to NULL - telling the caller that the page in question is clean.
(This also optimizes for checkpoint in the next patch: before, if a
file-backed page was not-present we would first fault it in (read from
disk) and then detect that it was virgin. Instead, now we detect that
the page is clean earlier without needing to fault it in).
To see why it's needed, consider these scenarios:
1. Task maps a file beyond its limit, never touches those extra pages
   (if it did, it would get EFAULT/Bus error)
2. Task maps a file and writes the last page, then the file gets
   truncated (by at least a page). A subsequent access to the page
   will cause bus error (VM_FAULT_SIGBUS).
3. If the file size is extended back (using truncate) and the task
   accesses that page, then the task will get a fresh page (losing
   data it had written to that address before).
[Before kernel 2.6.32, that page would become anonymous once it was
dirtied, such that accesses in case #2 are valid, and in case #3 the
task would see the old page regardless of the file contents.]
--CHECKPOINT: before we used FOLL_ANON flags to tell follow_page() to
return the zero-page for case#1. For case#2, the actual page was
returned.
Without this patch, in kernel 2.6.32, FOLL_DUMP would make
follow_page() return NULL and then fault handler would have returned
VM_FAULT_SIGBUS in case#1 (and depending on arch, case#2 too), and
checkpoint would ...
For each vma, there is a 'struct ckpt_vma'. Then come the actual
contents, in one or more chunks: each chunk begins with a header that
specifies how many pages it holds, then the virtual addresses of all
the dumped pages in that chunk, followed by the actual contents of all
dumped pages. A header with zero pages marks the end of the
contents. Then comes the next vma and so on.
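The chunk layout described above can be sketched as a small userspace walker. This is only an illustration under stated assumptions: the real on-disk headers are the ones defined in checkpoint_hdr.h, whereas here the header is assumed to be a bare 32-bit page count, addresses are assumed to be 64-bit, the "page" is shrunk to 16 bytes, and walk_vma_chunks() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16	/* tiny stand-in for PAGE_SIZE */

/*
 * Walk the chunks that follow one vma description in @buf (@len bytes).
 * Assumed layout of a chunk:
 *   u32 nr_pages, nr_pages * u64 vaddr, nr_pages * SKETCH_PAGE_SIZE bytes
 * A chunk header with nr_pages == 0 terminates the vma's contents.
 * Returns the total number of pages seen, or -1 if the buffer is
 * truncated. On success, *consumed is the number of bytes used,
 * including the terminating header.
 */
long walk_vma_chunks(const unsigned char *buf, size_t len, size_t *consumed)
{
	const size_t per_page = sizeof(uint64_t) + SKETCH_PAGE_SIZE;
	size_t off = 0;
	long total = 0;

	for (;;) {
		uint32_t nr;

		if (len - off < sizeof(nr))
			return -1;
		memcpy(&nr, buf + off, sizeof(nr));
		off += sizeof(nr);
		if (nr == 0)
			break;		/* end of this vma's contents */

		/* Addresses first, then page data; both must fit. */
		if (nr > (len - off) / per_page)
			return -1;
		off += (size_t)nr * per_page;
		total += nr;
	}
	*consumed = off;
	return total;
}
```

A reader of the real image would, at the point where this sketch merely skips bytes, fault in each listed address and copy the page data into it.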
To checkpoint a vma, call the ops->checkpoint() method of that vma.
Normally the per-vma function will invoke generic_vma_checkpoint()
which first writes the vma description, followed by the specific
logic to dump the contents of the pages.
Currently for private mapped memory we save the pathname of the file
that is mapped (restart will use it to re-open it and then map it).
Later we change that to reference a file object.
Changelog[v19-rc]:
- [Serge Hallyn] Checkpoint saved_auxv as u64s
Changelog[v19-rc3]:
- Separate __get_dirty_page() into its own patch
- Export filemap_checkpoint()
- [Serge Hallyn] Disallow checkpoint of tasks with aio requests
- Fix compilation failure when !CONFIG_CHECKPOINT (regression)
Changelog[v19-rc2]:
- Expose page write functions
- Take mmap_sem() around vma_fill_pgarr() (fix regression)
- Move consider_private_page() to mm/memory.c:__get_dirty_page()
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for enums
- Do not hold mmap_sem while checkpointing vma's
Changelog[v18]:
- Tighten checks on supported vma to checkpoint or restart
- Add a few more ckpt_write_err()s
- [Serge Hallyn] Export filemap_checkpoint() (used later for ext4)
- Use ckpt_collect_file() instead of ckpt_obj_collect() for files
- In collect_mm() use retval from ckpt_obj_collect() to test for
first-time-object
Changelog[v17]:
- Only collect sub-objects of mm_struct once
- Save mm->{flags,def_flags,saved_auxv}
Changelog[v16]:
- Precede vaddrs/pages with a buffer header
- Checkpoint mm->exe_file
- Handle shared ...
For each fd, read 'struct ckpt_hdr_file_desc' and look up objref in the
hash table. If not found in the hash table (first occurrence), read in
'struct ckpt_hdr_file', create a new file and register it in the hash.
Otherwise attach the file pointer from the hash as an FD.
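The lookup-or-create step above can be sketched in userspace as follows. All names here (struct sketch_file, struct restore_hash, restore_file_ref(), MAX_OBJREF) are hypothetical stand-ins rather than the code in checkpoint/files.c, and a flat array replaces the real hash table. The rejection of zero or negative objrefs follows the objhash changelog elsewhere in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_OBJREF 16	/* hypothetical bound, for illustration only */

struct sketch_file {
	int objref;
	int refcount;	/* one reference per fd that points here */
};

struct restore_hash {
	struct sketch_file *by_objref[MAX_OBJREF];
};

/*
 * Return the file for @objref, creating and registering it on the
 * first occurrence. *created tells the caller whether the full file
 * record still has to be read from the image. Returns NULL for an
 * invalid objref or on allocation failure.
 */
struct sketch_file *restore_file_ref(struct restore_hash *h, int objref,
				     int *created)
{
	struct sketch_file *f;

	*created = 0;
	if (objref <= 0 || objref >= MAX_OBJREF)
		return NULL;

	f = h->by_objref[objref];
	if (!f) {
		/* First occurrence: build the object and register it. */
		f = calloc(1, sizeof(*f));
		if (!f)
			return NULL;
		f->objref = objref;
		h->by_objref[objref] = f;
		*created = 1;
	}
	f->refcount++;
	return f;
}
```

Two descriptors that carried the same objref in the image therefore end up sharing one object, which is what preserves dup()/fork() sharing across a restart.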
Changelog[v19-rc1]:
- Fix lockdep complaint in restore_obj_files()
Changelog[v19-rc1]:
- Restore thread/cpu state early
- Ensure null-termination of file names read from image
- Fix compile warning in restore_open_fname()
Changelog[v18]:
- Invoke set_close_on_exec() unconditionally on restart
Changelog[v17]:
- Validate f_mode after restore against saved f_mode
- Fail if f_flags have O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC
- Reorder patch (move earlier in series)
- Handle shared files_struct objects
Changelog[v14]:
- Introduce a per file-type restore() callback
- Revert change to pr_debug(), back to ckpt_debug()
- Rename: restore_files() => restore_fd_table()
- Rename: ckpt_read_fd_data() => restore_file()
- Check whether calls to ckpt_hbuf_get() fail
- Discard field 'hh->parent'
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/files.c | 318 ++++++++++++++++++++++++++++++++++++++++++++
checkpoint/objhash.c | 2 +
checkpoint/process.c | 20 +++
include/linux/checkpoint.h | 7 +
4 files changed, 347 insertions(+), 0 deletions(-)
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 7a57b24..b404c8f 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -16,6 +16,8 @@
#include <linux/sched.h>
#include <linux/file.h>
#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
#include <linux/deferqueue.h>
...While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.
This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.
As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single "generic"
f_op entry.
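To make the opt-in idea concrete, here is a minimal userspace sketch of an operations table with an optional checkpoint hook. The struct and function names are hypothetical, and returning -EBADF when the hook is missing is an assumption made for the example, not something stated in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_file;

/* Hypothetical stand-in for the relevant part of 'file_operations'. */
struct sketch_file_ops {
	int (*checkpoint)(struct sketch_file *file); /* NULL: unsupported */
};

struct sketch_file {
	const struct sketch_file_ops *ops;
	int saved;	/* set once the "state" has been written */
};

/* The "generic" method that simple file types can reuse as-is. */
int sketch_generic_checkpoint(struct sketch_file *file)
{
	file->saved = 1;
	return 0;
}

/* A file type that opts in with a single entry ... */
const struct sketch_file_ops sketch_regular_ops = {
	.checkpoint = sketch_generic_checkpoint,
};

/* ... and one that stays uncheckpointable by leaving the hook unset. */
const struct sketch_file_ops sketch_exotic_ops = {
	.checkpoint = NULL,
};

/* Core code dispatches through the hook and refuses files without one. */
int sketch_checkpoint_file(struct sketch_file *file)
{
	if (!file->ops || !file->ops->checkpoint)
		return -EBADF;	/* assumed error code */
	return file->ops->checkpoint(file);
}
```

The design choice this illustrates is that support is declared next to the rest of a file type's methods, so it is maintained together with them instead of in a separate list inside the checkpoint code.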
Also adds a new 'file_operations' function for 'collecting' a file for
leak-detection during full-container checkpoint. This is useful for
those files that hold references to other "collectable" objects. Two
examples are pty files that point to corresponding tty objects, and
eventpoll files that refer to the files they are monitoring.
Finally, this patch introduces vfs_fcntl() so that it can be called
from restart (see patch adding restart of files).
Changelog[v17]
- Introduce 'collect' method
Changelog[v17]
- Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
fs/fcntl.c | 21 +++++++++++++--------
include/linux/fs.h | 7 +++++++
2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 97e01dc..e1f02ca 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
return err;
}
+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+ int err;
+
+ err = security_file_fcntl(filp, cmd, arg);
+ if (err)
+ goto out;
+ err = do_fcntl(fd, cmd, arg, filp);
+ out:
+ return err;
+}
+
SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned ...
The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.
On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.
The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.
The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.
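The checkpoint side of this scheme can be sketched in userspace as below. It is a simplified illustration, not the code in checkpoint/objhash.c: the names, the fixed-size pool and the hash function are hypothetical, and the sketch does not take a reference on the object, which the real hash does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BUCKETS  64
#define SKETCH_MAX_OBJS 128

struct sketch_obj {
	const void *ptr;	/* key: address of the shared object */
	int objref;		/* identifier written to the image */
	struct sketch_obj *next;
};

/* Must start out zero-initialized. */
struct sketch_objhash {
	struct sketch_obj *bucket[SKETCH_BUCKETS];
	struct sketch_obj pool[SKETCH_MAX_OBJS];
	int nr_objs;
};

static unsigned int sketch_hash_ptr(const void *ptr)
{
	return (unsigned int)(((uintptr_t)ptr >> 4) % SKETCH_BUCKETS);
}

/*
 * Look up @ptr; on the first encounter assign it the next objref and
 * insert it. Returns the objref (> 0), or -1 for a NULL pointer or a
 * full pool. *first is 1 only when the caller must dump full state.
 * Entries are never removed; the table is dropped as a whole.
 */
int sketch_obj_lookup_add(struct sketch_objhash *h, const void *ptr,
			  int *first)
{
	struct sketch_obj *o;
	unsigned int b;

	*first = 0;
	if (!ptr)
		return -1;

	b = sketch_hash_ptr(ptr);
	for (o = h->bucket[b]; o; o = o->next)
		if (o->ptr == ptr)
			return o->objref;	/* seen before */

	if (h->nr_objs >= SKETCH_MAX_OBJS)
		return -1;
	o = &h->pool[h->nr_objs];
	o->ptr = ptr;
	o->objref = ++h->nr_objs;	/* objrefs start at 1 */
	o->next = h->bucket[b];
	h->bucket[b] = o;
	*first = 1;
	return o->objref;
}
```

On restart the roles flip: the table is indexed by objref instead of by address, and a miss means the object's state follows in the image and must be read and instantiated.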
Changelog[v20]:
- Export key symbols to enable c/r from kernel modules
- Avoid crash if incoming object doesn't have .restore
Changelog[v19-rc1]:
- Define ckpt_obj_try_fetch
- Disallow zero or negative objref during restart
- [Matt Helsley] Add cpp definitions for enums
- [Serge Hallyn] Use ckpt_err() in ckpt_obj_fetch()
- [Serge Hallyn] Use ckpt_err() in ckpt_read_obj_type()
- Factor out objref handling from {_,}ckpt_read_obj()
Changelog[v18]:
- Add ckpt_obj_reserve()
- Change ref_drop() to accept a @lastref argument (useful for cleanup)
- Disallow multiple objects with same objref in restart
- Allow _ckpt_read_obj_type() to read object header only (w/o payload)
Changelog[v17]:
- Add ckpt_obj->flags with CKPT_OBJ_CHECKPOINTED flag
- Add prototype of ckpt_obj_lookup
- Complain on attempt to add NULL ptr to objhash
- Prepare for 'leaks detection'
Changelog[v16]:
- Introduce ckpt_obj_lookup() to find an object by its ptr
Changelog[v14]:
- Introduce 'struct ckpt_obj_ops' to better modularize shared objs.
- Replace long ...
Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
checkpoint, return an error code if the actual objects' counts are
higher, indicating leaks (references to the objects from a task not
being checkpointed). The comparison of the objhash user counts to
object refcounts as a basis for checking for leaks comes from Alexey's
OpenVZ-based c/r patchset.
"Leak detection" occurs _before_ any real state is saved, as a
pre-step. This prevents races due to sharing with outside world where
the sharing ceases before the leak test takes place, thus protecting
the checkpoint image from inconsistencies.
Once leak testing concludes, checkpoint will proceed. Because objects
are already in the objhash, checkpoint_obj() cannot distinguish
between the first and subsequent encounters. This is solved with a
flag (CKPT_OBJ_CHECKPOINTED) per object.
Two additional checks take place during checkpoint: for objects that
were created, and for objects that were destroyed, while the
leak-detection pre-step took place. (By the time this occurs part of
the checkpoint image has been written out to disk, so this is purely
advisory).
Changelog[v20]:
- Export key symbols to enable c/r from kernel modules
Changelog[v18]:
- [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
- Replace some EAGAIN with EBUSY
- Add a few more ckpt_write_err()s
- Introduce CKPT_OBJ_VISITED
- ckpt_obj_collect() returns objref for new objects, 0 otherwise
- Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
- Introduce ckpt_obj_visit() to mark objects as visited
- Set the CHECKPOINTED flag on objects before calling checkpoint
Changelog[v17]:
- Leak detection is performed in two-steps
- Detect reverse-leaks (objects disappearing unexpectedly)
- Skip reverse-leak detection if ops->ref_users isn't defined
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/checkpoint.c | ...
During checkpoint, a zombie process need only save p->comm,
p->state, p->exit_state, and p->exit_code.
During restart, zombie processes are created like all other
processes. They validate the saved exit_code, then restore p->comm
and p->exit_code. Then they call do_exit() instead of waking
up the next task in line.
But before that, they place the @ctx in p->checkpoint_ctx, so that
only at exit time will they wake up the next task in line
and drop the reference to the @ctx.
This provides the guarantee that when the coordinator's wait
completes, all normal tasks have completed their restart, and all
zombie tasks are already zombified (as opposed to perhaps only
becoming a zombie).
Changelog[v19-rc1]:
- Simplify logic of tracking restarting tasks
Changelog[v18]:
- Fix leak of ckpt_ctx when restoring zombie tasks
- Add a few more ckpt_write_err()s
Changelog[v17]:
- Validate t->exit_signal for both threads and leader
- Skip zombies in most of may_checkpoint_task()
- Save/restore t->pdeath_signal
- Validate ->exit_signal and ->pdeath_signal
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/checkpoint.c | 10 ++++--
checkpoint/process.c | 69 +++++++++++++++++++++++++++++++++++-----
checkpoint/restart.c | 22 +++++++++++--
include/linux/checkpoint.h | 1 +
include/linux/checkpoint_hdr.h | 1 +
5 files changed, 89 insertions(+), 14 deletions(-)
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 1e38ae3..ea1494d 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -218,7 +218,7 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
- if (t->state == TASK_DEAD) {
+ if (t->exit_state == EXIT_DEAD) {
_ckpt_err(ctx, -EBUSY, "%(T)Task state EXIT_DEAD\n");
return -EBUSY;
}
@@ ...
Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself in the same order that they were saved. The internals of the
syscall will take care of in-kernel synchronization between tasks.
This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.
There is one special task - the coordinator - that is not part of the
restarted hierarchy. The coordinator task allocates the restart
context (ctx) and orchestrates the restart. Thus even if a restart
fails after, or during the restore of the root task, the user
perceives a clean exit and an error message.
The coordinator task will:
 1) read header and tree, create @ctx (wake up restarting tasks)
 2) set the ->checkpoint_ctx field of itself and all descendants
 3) wait for all restarting tasks to reach sync point #1
 4) activate first restarting task (root task)
 5) wait for all other tasks to complete and reach sync point #3
 6) wake up everybody
(Note that in step #2 the coordinator assumes that the entire task
hierarchy exists by the time it enters sys_restart; this is arranged
in user space by 'mktree')
Tasks that are restarting have three sync points:
 1) wait for its ->checkpoint_ctx to be set (by the coordinator)
 2) wait for the task's turn to restore (be active)
 [...now the task restores its state...]
 3) wait for all other tasks to complete
The third sync point ensures that a task may only resume execution
after all tasks have successfully restored their state (or fail if an
error has occurred). This prevents tasks from returning to user space
prematurely, before the entire restart completes.
If a single task wishes to restart, it can set the "RESTART_TASKSELF"
flag to restart(2) to skip the logic of the coordinator.
The root-task is a child of the coordinator, identified by the ...
To restore zombies we will create a task that, on its turn to run,
calls do_exit(). Unlike normal tasks that exit, we need to prevent
notification side effects that send signals to other processes, e.g.
parent (SIGCHLD) or child tasks (per child's request).
There are three main cases for such notifications:
1) do_notify_parent(): parent of a process is notified about a change
   in status (e.g. become zombie, reparent, etc). If parent ignores,
   then mark child for immediate release (skip zombie).
2) kill_orphan_pgrp(): a process group that becomes orphaned will
   signal stopped jobs (HUP then CONT).
3) forget_original_parent(): children of a process are signaled (per
   request) with p->pdeath_signal
Remember that restoring signal state (for any restarting task) must
complete _before_ it is allowed to resume execution, and not during
the resume. Otherwise, a running task may send a signal to another
task that hasn't restored yet, so the new signal will be lost
soon-after.
I considered two possible ways to address this:
1. Add another sync point to restart: all tasks will first restore
   their state without signals (all signals blocked), and zombies call
   do_exit(). A sync point then will ensure that all zombies are gone
   and their effects done. Then all tasks restore their signal state
   (and mask), and sync (new point) again. Only then they may resume
   execution.
   The main disadvantage is the added complexity and inefficiency,
   for no good reason.
2. Introduce PF_RESTARTING: mark all restarting tasks with a new flag,
   and teach the above three notifications to skip sending the signal
   if this flag is set.
   The main advantage is simplicity and completeness. Also, such a flag
   may be useful later on. This is the method implemented.
Changelog [ckpt-v19-rc3]:
- Rebase to kernel 2.6.33
Changelog [ckpt-v19-rc1]:
- In reparent_thread() test for PF_RESTARTING on parent
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn ...
Checkpointing of multiple processes works by recording the task tree
structure below a given "root" task. The root task is expected to be a
container init, and then an entire container is checkpointed. However,
passing CHECKPOINT_SUBTREE to checkpoint(2) relaxes this requirement
and allows checkpointing a subtree of processes from the root task.
For a given root task, do a DFS scan of the task tree and collect the
tasks into an array (keeping a reference to each task). Using DFS
simplifies the recreation of tasks either in user space or kernel
space. For each task collected, test if it can be checkpointed, and
save its pid, tgid, and ppid.
The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.
Whether checkpoints and restarts require CAP_SYS_ADMIN is determined by
sysctl 'ckpt_unpriv_allowed': if 1, unprivileged use is allowed and the
regular permission checks are intended to prevent privilege escalation;
if 0, CAP_SYS_ADMIN is required, which prevents unprivileged users from
exploiting any privilege escalation bugs.
The logic is suitable for creation of processes during restart either
in userspace or by the kernel.
Currently we ignore threads and zombies.
Changelog[v20]:
- [Serge Hallyn] Change sysctl and default for unprivileged use
Changelog[v19-rc3]:
- Rebase to kernel 2.6.33 (fix sysctl entry for ckpt_unpriv_allowed)
Changelog[v19-rc1]:
- Introduce walk_task_subtree() to iterate through descendants
- [Matt Helsley] Add cpp definitions for enums
- [Serge Hallyn] Add global section container to image format
Changelog[v18]:
- Replace some EAGAIN with EBUSY
- Add a few more ckpt_write_err()s
- Rename headerless struct ckpt_hdr_* to struct ckpt_*
Changelog[v16]:
- CHECKPOINT_SUBTREE flag allows subtree (not whole container)
- sysctl variable 'ckpt_unpriv_allowed' controls needed privileges
Changelog[v14]:
- Refuse non-self checkpoint if target task isn't frozen
- Refuse checkpoint (for now) if task is ptraced
- ...
Now we can do "external" checkpoint, i.e. act on another task.
sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of a
container, unless the CHECKPOINT_SUBTREE flag is given.
Set state of freezer cgroup of checkpointed task hierarchy to
"CHECKPOINTING" during a checkpoint, to ensure that task(s) cannot be
thawed while at it.
Ensure that all tasks belong to root task's freezer cgroup (the root
task is also tested, to detect if it changes its freezer cgroup before
it moves to "CHECKPOINTING").
sys_restart() remains nearly the same, as the restart is always done in
the context of the restarting task. However, the original task may have
been frozen from user space, or interrupted from a syscall for the
checkpoint. This is accounted for by restoring a suitable retval for
the restarting task, according to how it was checkpointed.
Changelog[v20]:
- [Nathan Lynch] Use syscall_get_error
Changelog[v19-rc1]:
- [Serge Hallyn] Add global section container to image format
Changelog[v17]:
- Move restore_retval() to this patch
- Tighten ptrace check for checkpoint to PTRACE_MODE_ATTACH
- Use CHECKPOINTING state for hierarchy's freezer for checkpoint
Changelog[v16]:
- Use CHECKPOINT_SUBTREE to allow subtree (partial container)
Changelog[v14]:
- Refuse non-self checkpoint if target task isn't frozen
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
- Copy contents of 'init->fs->root' instead of pointing to them
Changelog[v10]:
- Grab vfs root of container init, rather than current process
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
checkpoint/Kconfig | 1 +
checkpoint/checkpoint.c | 98 +++++++++++++++++++++++++++++++++++++-
checkpoint/restart.c | 63 ++++++++++++++++++++++++-
checkpoint/sys.c ...
(Paraphrasing what's said in this message:
http://lists.openwall.net/linux-kernel/2007/12/05/64)
Restart blocks are callbacks used to cause a system call to be
restarted with the arguments specified in the system call restart
block. They are useful for system calls that are not idempotent, i.e.
the argument(s) might be a relative timeout, where some adjustments are
required when restarting the system call. The mechanism relies on the
system call itself to set up its restart point and the argument save
area. Restart blocks are rare: an actual signal would turn the call
into an EINTR. The only case that should ever trigger this is some
kernel action that interrupts the system call, but does not actually
result in any user-visible state changes - like freeze and thaw.
So restart blocks are about time remaining for the system call to
sleep/wait. Generally in c/r, there are two possible time models that
we can follow: absolute, relative. Here, I chose to save the relative
timeout, measured from the beginning of the checkpoint. The time when
the checkpoint (and restart) begin is also saved. This information is
sufficient to restart in either model (absolute or relative).
Which model to use should eventually be a per application choice (and
possibly configurable via cradvise() or something of the sort). For
now, we adopt the relative model, namely, at restart the timeout is set
relative to the beginning of the restart.
To checkpoint, we check if a task has a valid restart block, and if so
we save the *remaining* time that it has to wait/sleep, and the type of
the restart block.
To restart, we fill in the data required at the proper place in the
thread information. If the system call returns an error (possibly
-ERESTARTSYS, for example), we not only use that error as our own
return value, but also arrange for the task to execute the signal
handler (by faking a signal). The handler, in turn, already has the
code to handle these restart requests gracefully.
Changelog[v19-rc1]:
- [Matt Helsley] Add cpp definitions for ...
To support c/r of restart-blocks (system calls that need to be
restarted because they were interrupted but there was no userspace
visible side-effect), export restart-block callbacks for poll()
and futex() syscalls.
More details on c/r of restart-blocks and how it works are in the
following patch.
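As a rough illustration only (a user-space sketch, not the code in this
series): a restart block pairs a callback with the arguments needed to
resume an interrupted call. All names below (demo_restart_block,
demo_checkpoint_block, demo_restore_block, demo_wait_restart) are
hypothetical, and carrying only the remaining time across a checkpoint
is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in; NOT the kernel's struct restart_block. */
struct demo_restart_block {
	/* Callback that resumes the interrupted wait; returns ns left to wait. */
	uint64_t (*fn)(const struct demo_restart_block *rb, uint64_t now_ns);
	uint64_t expires_ns;	/* absolute deadline on the local clock */
};

/* Hypothetical image record: only the remaining time is saved. */
struct demo_saved_block {
	uint64_t remaining_ns;
};

/* Remaining wait time at 'now_ns', or 0 if the deadline has passed. */
static uint64_t demo_wait_restart(const struct demo_restart_block *rb,
				  uint64_t now_ns)
{
	return now_ns >= rb->expires_ns ? 0 : rb->expires_ns - now_ns;
}

/* Checkpoint side: record the time left, measured from checkpoint start. */
static struct demo_saved_block
demo_checkpoint_block(const struct demo_restart_block *rb,
		      uint64_t ckpt_start_ns)
{
	struct demo_saved_block saved;

	assert(rb != NULL && rb->fn != NULL);
	saved.remaining_ns = rb->fn(rb, ckpt_start_ns);
	return saved;
}

/* Restart side: re-arm the deadline relative to the start of the restart. */
static void demo_restore_block(const struct demo_saved_block *saved,
			       uint64_t restart_start_ns,
			       struct demo_restart_block *rb)
{
	assert(saved != NULL && rb != NULL);
	rb->fn = demo_wait_restart;
	rb->expires_ns = restart_start_ns + saved->remaining_ns;
}
```

With a deadline of 1500 and a checkpoint that starts at 1000, the saved
remaining time is 500; restoring at 90000 re-arms the deadline at 90500,
independent of the clock on the original machine.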
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
fs/select.c | 2 +-
include/linux/futex.h | 11 +++++++++++
include/linux/poll.h | 3 +++
include/linux/posix-timers.h | 6 ++++++
kernel/compat.c | 4 ++--
kernel/futex.c | 12 +-----------
kernel/posix-timers.c | 2 +-
7 files changed, 25 insertions(+), 15 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index fd38ce2..7e3de2c 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -873,7 +873,7 @@ out_fds:
return err;
}
-static long do_restart_poll(struct restart_block *restart_block)
+long do_restart_poll(struct restart_block *restart_block)
{
struct pollfd __user *ufds = restart_block->poll.ufds;
int nfds = restart_block->poll.nfds;
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 1e5a26d..ae755f6 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -136,6 +136,17 @@ extern int
handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi);
/*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'flags' shared capability
+ */
+#define FLAGS_SHARED 0x01
+#define FLAGS_CLOCKRT 0x02
+#define FLAGS_HAS_TIMEOUT 0x04
+
+/* for c/r */
+extern long futex_wait_restart(struct restart_block *restart);
+
+/*
* Futexes are matched on equal values of this key.
* The key type depends on whether it's a shared or private mapping.
* Don't rearrange members without looking at hash_futex().
diff --git a/include/linux/poll.h b/include/linux/poll.h
index ...
Support for checkpoint and restart for X86_32 architecture.
Partly based on Alexey's work.
Support for 32bit on 64bit and fixes from Serge Hallyn.
Checkpoint     Restart
(app/arch)     (app/arch/program*)
---------------------------------------
64/x86-64  ->  64/x86-64   works
32/x86-64  ->  32/x86-64   works
32/x86-64  ->  32/x86-32   ?
32/x86-32  ->  32/x86-64   ?
32/x86-64  ->  32/x86-32   ?
32/x86-32  ->  32/x86-64   ?
(*) "program" indicates the bit-ness of 'restart' executable.
Changelog[v19-rc3]:
- Rebase to kernel 2.6.33
- [Serge Hallyn] Changes to fs/gs register handling
- [Serge Hallyn] Allow 32-bit restart of 64-bit and vice versa
- [Serge Hallyn] Only allow 'restart' with same bit-ness as image.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
---
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/checkpoint_hdr.h | 6 +
arch/x86/include/asm/unistd_64.h | 4 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/checkpoint.c | 16 +++
arch/x86/kernel/checkpoint_64.c | 241 +++++++++++++++++++++++++++++++++
arch/x86/kernel/entry_64.S | 7 +
include/linux/checkpoint_hdr.h | 2 +
8 files changed, 279 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/kernel/checkpoint_64.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d5a7284..a6ae38a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -93,7 +93,7 @@ config HAVE_LATENCYTOP_SUPPORT
config CHECKPOINT_SUPPORT
bool
- default y if X86_32
+ default y
config MMU
def_bool y
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index e6cfc99..6f600dd 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -36,6 +36,10 @@
#include <asm/processor.h>
#endif
+#ifdef CONFIG_X86_64
+#define ...
Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:
checkpoint/sys.c - user/kernel data transfer, as well as setup of the
c/r context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling
checkpoint/process.c - c/r of task data
For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.
Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.
Changelog[v20]:
- Export key symbols to enable c/r from kernel modules
Changelog[v19]:
- [Serge Hallyn] Use ckpt_err() for bad header values
Changelog[v19-rc3]:
- sys_{checkpoint,restart} to use ptregs prototype
Changelog[v19-rc1]:
- Set ctx->errno in do_ckpt_msg() if needed
- Document prototype of ckpt_write_err in header
- Update prototype of ckpt_read_obj()
- Fix up headers so we can munge them for use by userspace
- [Matt Helsley] Check for empty string for _ckpt_write_err()
- [Matt Helsley] Add cpp definitions for enums
- [Serge Hallyn] Add global section container to image format
- [Matt Helsley] Fix total byte read/write count for large images
- ckpt_read_buf_type() to accept max payload (excludes ckpt_hdr)
- [Serge Hallyn] Define new api for error and debug logging
- Use logfd in sys_{checkpoint,restart}
Changelog[v18]:
- Detect error-headers in input data on restart, and abort.
- Standard format for checkpoint error strings (and documentation)
- [Matt Helsley] Rename headerless struct ckpt_hdr_* to struct ckpt_*
- [Dan Smith] Add an errno validation function
- Add ckpt_read_payload(): read a variable-length object (no header)
- Add ckpt_read_string(): same for strings (ensures null-terminated)
- Add ckpt_read_consume(): consumes ...
Create trivial sys_checkpoint and sys_restart system calls. They make
it possible to checkpoint and restart an entire container, to and from
a checkpoint image file descriptor.
The syscalls take a pid, a file descriptor (for the image file) and
flags as arguments. The pid identifies the top-most (root) task in the
process tree, e.g. the container init: for sys_checkpoint the first
argument identifies the pid of the target container/subtree; for
sys_restart it will identify the pid of restarting root task.
A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.
By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
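A minimal sketch of how a caller might tell the three outcomes apart,
in the spirit of checking fork()'s return value. The concrete convention
used here (negative means error, zero means "resumed from a restart",
positive means "checkpoint was taken") is an assumption for
illustration, and the demo_* names are hypothetical.

```c
#include <assert.h>

/* Hypothetical outcome labels for the return value of checkpoint(2). */
enum demo_ckpt_outcome {
	DEMO_CKPT_ERROR,	/* the call failed */
	DEMO_CKPT_RESTARTED,	/* we are running again after a restart */
	DEMO_CKPT_TAKEN		/* the checkpoint completed, we keep running */
};

/* Classify a raw return value under the assumed convention. */
static enum demo_ckpt_outcome demo_classify(long ret)
{
	if (ret < 0)
		return DEMO_CKPT_ERROR;
	if (ret == 0)
		return DEMO_CKPT_RESTARTED;
	return DEMO_CKPT_TAKEN;
}

/* Human-readable name, e.g. for a log line after a self-checkpoint. */
static const char *demo_outcome_name(enum demo_ckpt_outcome outcome)
{
	switch (outcome) {
	case DEMO_CKPT_ERROR:
		return "error";
	case DEMO_CKPT_RESTARTED:
		return "restarted";
	case DEMO_CKPT_TAKEN:
		return "checkpointed";
	}
	assert(!"unknown outcome");
	return "?";
}
```

A task that checkpoints itself runs the same code after both the
checkpoint and a later restart, so only the return value tells it which
of the two just happened.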
We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart. Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic. They also would significantly complicate
checkpoint that includes self.
Changelog[v19-rc1]:
- Add 'int logfd' to prototype of sys_{checkpoint,restart}
Changelog[v18]:
- [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
Changelog[v17]:
- Move checkpoint closer to namespaces (kconfig)
- Kill "Enable" in c/r config option
Changelog[v16]:
- Change sys_restart() first argument to be 'pid_t pid'
Changelog[v14]:
- Change ...
Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.
Changelog[v19-rc1]:
- Update documentation and examples for new syscalls API
- [Liu Alexander] Fix typos
- [Serge Hallyn] Update checkpoint image format
Changelog[v16]:
- Update documentation
- Unify into readme.txt and usage.txt
Changelog[v14]:
- Discard the 'h.parent' field
- New image format (shared objects appear before they are referenced
unless they are compound)
Changelog[v8]:
- Split into multiple files in Documentation/checkpoint/...
- Extend documentation, fix typos and comments from feedback
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
Documentation/checkpoint/checkpoint.c | 38 +++
Documentation/checkpoint/readme.txt | 370 ++++++++++++++++++++++++++++
Documentation/checkpoint/self_checkpoint.c | 69 +++++
Documentation/checkpoint/self_restart.c | 40 +++
Documentation/checkpoint/usage.txt | 247 +++++++++++++++++++
5 files changed, 764 insertions(+), 0 deletions(-)
create mode 100644 Documentation/checkpoint/checkpoint.c
create mode 100644 Documentation/checkpoint/readme.txt
create mode 100644 Documentation/checkpoint/self_checkpoint.c
create mode 100644 Documentation/checkpoint/self_restart.c
create mode 100644 Documentation/checkpoint/usage.txt
diff --git a/Documentation/checkpoint/checkpoint.c b/Documentation/checkpoint/checkpoint.c
new file mode 100644
index 0000000..8560f30
--- /dev/null
+++ b/Documentation/checkpoint/checkpoint.c
@@ -0,0 +1,38 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+#include <linux/checkpoint.h>
+
+static inline int checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+ return syscall(__NR_checkpoint, pid, fd, ...
Add public interface to freeze a cgroup freezer given a task that
belongs to that cgroup: cgroup_freezer_make_frozen(task)
Freezing the root cgroup is not permitted. Freezing the cgroup to
which the current process belongs is also not permitted.
This will be used for restart(2) to be able to leave the restarted
processes in a frozen state, instead of resuming execution.
This is useful for debugging, if the user would like to attach a
debugger to the restarted task(s).
It is also useful if the restart procedure would like to perform
additional setup once the tasks are restored but before they are
allowed to resume execution.
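The two restrictions above can be modelled in user space roughly as
follows. This is only a sketch: struct demo_cgroup, demo_make_frozen()
and the choice of -EPERM as the error code are assumptions made for the
example, not the interface added by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum demo_freezer_state { DEMO_THAWED, DEMO_FROZEN };

/* Hypothetical stand-in for a freezer cgroup. */
struct demo_cgroup {
	int is_root;			/* non-zero for the root cgroup */
	enum demo_freezer_state state;
};

/*
 * Freeze 'target' on behalf of a caller that lives in 'caller_cg'.
 * Refuses the root cgroup and the caller's own cgroup; the -EPERM
 * return value is an assumption of this sketch.
 */
static int demo_make_frozen(struct demo_cgroup *target,
			    const struct demo_cgroup *caller_cg)
{
	assert(target != NULL && caller_cg != NULL);
	if (target->is_root)
		return -EPERM;	/* freezing the root cgroup is refused */
	if (target == caller_cg)
		return -EPERM;	/* the caller would freeze itself */
	target->state = DEMO_FROZEN;
	return 0;
}
```

Refusing the caller's own cgroup matters for restart(2): the task that
asks for the restarted processes to be left frozen must itself stay
runnable to finish the call.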
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
CC: Matt Helsley <matthltc@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Cedric Le Goater <legoater@free.fr>
---
include/linux/freezer.h | 1 +
kernel/cgroup_freezer.c | 27 +++++++++++++++++++++++++++
2 files changed, 28 insertions(+), 0 deletions(-)
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 3d32641..0cb22cb 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -68,6 +68,7 @@ extern int cgroup_freezing_or_frozen(struct task_struct *task);
extern int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q);
extern int cgroup_freezer_begin_checkpoint(struct task_struct *task);
extern void cgroup_freezer_end_checkpoint(struct task_struct *task);
+extern int cgroup_freezer_make_frozen(struct task_struct *task);
#else /* !CONFIG_CGROUP_FREEZER */
static inline int cgroup_freezing_or_frozen(struct task_struct *task)
{
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index dd87010..efd4597 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -479,4 +479,31 @@ void cgroup_freezer_end_checkpoint(struct task_struct *task)
*/
WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING);
}
+
+int cgroup_freezer_make_frozen(struct task_struct ...
From: Matt Helsley <matthltc@us.ibm.com>
Update stale comments regarding locking order and add a little more
detail so it's easier to follow the locking between the cgroup freezer
and the power management freezer code.
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Cedric Le Goater <legoater@free.fr>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
kernel/cgroup_freezer.c | 21 +++++++++++++--------
1 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index eb3f34d..2c44736 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -88,10 +88,10 @@ struct cgroup_subsys freezer_subsys;
/* Locks taken and their ordering
* ------------------------------
- * css_set_lock
* cgroup_mutex (AKA cgroup_lock)
- * task->alloc_lock (AKA task_lock)
* freezer->lock
+ * css_set_lock
+ * task->alloc_lock (AKA task_lock)
* task->sighand->siglock
*
* cgroup code forces css_set_lock to be taken before task->alloc_lock
@@ -99,33 +99,38 @@ struct cgroup_subsys freezer_subsys;
* freezer_create(), freezer_destroy():
* cgroup_mutex [ by cgroup core ]
*
- * can_attach():
- * cgroup_mutex
+ * freezer_can_attach():
+ * cgroup_mutex (held by caller of can_attach)
*
- * cgroup_frozen():
+ * cgroup_freezing_or_frozen():
* task->alloc_lock (to get task's cgroup)
*
* freezer_fork() (preserving fork() performance means can't take cgroup_mutex):
- * task->alloc_lock (to get task's cgroup)
* freezer->lock
* sighand->siglock (if the cgroup is freezing)
*
* freezer_read():
* cgroup_mutex
* freezer->lock
+ * write_lock css_set_lock (cgroup iterator start)
+ * task->alloc_lock
* read_lock css_set_lock (cgroup iterator start)
*
* freezer_write() (freeze):
* cgroup_mutex
* freezer->lock
+ * write_lock css_set_lock (cgroup iterator start)
+ * task->alloc_lock
* read_lock ...
Looks reasonable. Is anyone handling that already or do you want me to take it to my tree? --
From: Serge E. Hallyn <serue@us.ibm.com>
Break out the core function which checks privilege and (if
allowed) creates a new user namespace, with the passed-in
creating user_struct. Note that a user_namespace, unlike
other namespace pointers, is not stored in the nsproxy.
Rather it is purely a property of user_structs.
This will let us keep the task restore code simpler.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
include/linux/user_namespace.h | 8 ++++++
kernel/user_namespace.c | 53 ++++++++++++++++++++++++++++------------
2 files changed, 45 insertions(+), 16 deletions(-)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index cc4f453..f6ea75d 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -20,6 +20,8 @@ extern struct user_namespace init_user_ns;
#ifdef CONFIG_USER_NS
+struct user_namespace *new_user_ns(struct user_struct *creator,
+ struct user_struct **newroot);
static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
{
if (ns)
@@ -38,6 +40,12 @@ static inline void put_user_ns(struct user_namespace *ns)
#else
+static inline struct user_namespace *new_user_ns(struct user_struct *creator,
+ struct user_struct **newroot)
+{
+ return ERR_PTR(-EINVAL);
+}
+
static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
{
return &init_user_ns;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 076c7c8..e624b0f 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -11,15 +11,8 @@
#include <linux/user_namespace.h>
#include <linux/cred.h>
-/*
- * Create a new user namespace, deriving the creator from the user in the
- * passed credentials, and replacing that user with the new root user for the
- * new namespace.
- *
- * This is called by copy_creds(), which will finish setting the target task's
- * credentials.
- */
-int ...
From: Serge E. Hallyn <serue@us.ibm.com>
Implement the s390 hook for sys_eclone().
Changelog:
Nov 24: Removed user-space code from commit log. See user-cr git tree.
Nov 17: remove redundant flags_high check
Nov 13: As suggested by Heiko, convert eclone to take its
parameters via registers.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
arch/s390/include/asm/unistd.h | 3 ++-
arch/s390/kernel/compat_linux.c | 17 +++++++++++++++++
arch/s390/kernel/compat_wrapper.S | 8 ++++++++
arch/s390/kernel/process.c | 37 +++++++++++++++++++++++++++++++++++++
arch/s390/kernel/syscalls.S | 1 +
5 files changed, 65 insertions(+), 1 deletions(-)
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index 6e9f049..2250950 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,8 @@
#define __NR_pwritev 329
#define __NR_rt_tgsigqueueinfo 330
#define __NR_perf_event_open 331
-#define NR_syscalls 332
+#define __NR_eclone 332
+#define NR_syscalls 333
/*
* There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 11c3aba..f9e8983 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -663,6 +663,23 @@ asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count)
return sys_write(fd, buf, count);
}
+asmlinkage long sys32_clone(void)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ unsigned long clone_flags;
+ unsigned long newsp;
+ int __user *parent_tidptr, *child_tidptr;
+
+ clone_flags = regs->gprs[3] & 0xffffffffUL;
+ newsp = regs->orig_gpr2 & 0x7fffffffUL;
+ parent_tidptr = compat_ptr(regs->gprs[4]);
+ child_tidptr = compat_ptr(regs->gprs[5]);
+ if (!newsp)
+ newsp = regs->gprs[15];
+ return do_fork(clone_flags, newsp, regs, 0,
+ parent_tidptr, child_tidptr);
+}
+
/*
* 31 bit emulation ...
From: Nathan Lynch <ntl@pobox.com>
Wired up for both ppc32 and ppc64, but tested only with the latter.
Changelog:
- Jan 20: (ntl) fix 32-bit build
- Nov 17: (serge) remove redundant flags_high check, and
don't fold it into flags.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
arch/powerpc/include/asm/syscalls.h | 6 ++++
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/asm/unistd.h | 3 +-
arch/powerpc/kernel/entry_32.S | 8 +++++
arch/powerpc/kernel/entry_64.S | 5 +++
arch/powerpc/kernel/process.c | 54 ++++++++++++++++++++++++++++++++++-
6 files changed, 75 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index eb8eb40..1674544 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -24,6 +24,12 @@ asmlinkage int sys_execve(unsigned long a0, unsigned long a1,
asmlinkage int sys_clone(unsigned long clone_flags, unsigned long usp,
int __user *parent_tidp, void __user *child_threadptr,
int __user *child_tidp, int p6, struct pt_regs *regs);
+asmlinkage int sys_eclone(unsigned long flags_low,
+ struct clone_args __user *args,
+ size_t args_size,
+ pid_t __user *pids,
+ unsigned long p5, unsigned long p6,
+ struct pt_regs *regs);
asmlinkage int sys_fork(unsigned long p1, unsigned long p2,
unsigned long p3, unsigned long p4, unsigned long p5,
unsigned long p6, struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 07d2d19..ee41254 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -326,3 +326,4 @@ SYSCALL_SPU(perf_event_open)
COMPAT_SYS_SPU(preadv)
COMPAT_SYS_SPU(pwritev)
COMPAT_SYS(rt_tgsigqueueinfo)
+PPC_SYS(eclone)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index ...
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
This gives a brief overview of the eclone() system call. We should
eventually describe more details in existing clone(2) man page or in
a new man page.
Changelog[v13]:
- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
  ->child_stack and ensure ->child_stack_size is 0 on architectures
  that don't need it.
- [Arnd Bergmann] Remove ->reserved1 field
- [Louis Rilling, Dave Hansen] Combine the two asm statements in the
  example into one and use memory constraint to avoid unnecessary
  copies.
Changelog[v12]:
- [Serge Hallyn] Fix/simplify stack-setup in the example code
- [Serge Hallyn, Oren Laadan] Rename syscall to eclone()
Changelog[v11]:
- [Dave Hansen] Move clone_args validation checks to arch-independent
  code.
- [Oren Laadan] Make args_size a parameter to system call and remove
  it from 'struct clone_args'
- [Oren Laadan] Fix some typos and clarify the order of pids in the
  @pids parameter.
Changelog[v10]:
- Rename clone3() to clone_with_pids() and fix some typos.
- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
- [Pavel Machek]: Fix an inconsistency and rename new file to
  Documentation/clone3.
- [Roland McGrath, H. Peter Anvin] Updates to description and example
  to reflect new prototype of clone3() and the updated/renamed
  'struct clone_args'.
Changelog[v8]:
- clone2() is already in use in IA64. Rename syscall to clone3()
- Add notes to say that we return -EINVAL if invalid clone flags are
  specified or if the reserved fields are not 0.
Changelog[v7]:
- Rename clone_with_pids() to clone2()
- Changes to reflect new prototype of clone2() (using clone_struct).
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
Documentation/eclone | 348 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 348 ...
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Container restart requires that a task have the same pid it had when it
was checkpointed. When containers are nested, the tasks within the
containers exist in multiple pid namespaces and hence have multiple
pids to specify during restart.
eclone(), intended for use during restart, is the same as clone(),
except that it takes a 'pids' parameter. This parameter lets the caller
choose specific pid numbers for the child process, in the process's
active and ancestor pid namespaces. (Descendant pid namespaces in
general don't matter since processes don't have pids in them anyway,
but see comments in copy_target_pids() regarding CLONE_NEWPID).
eclone() also attempts to address a second limitation of the clone()
system call. clone() is restricted to 32 clone flags and all but one of
these are in use. If more new clone flags are needed, we will be forced
to define a new variant of the clone() system call. To address this,
eclone() allows at least 64 clone flags with some room for more if
necessary.
To prevent unprivileged processes from misusing this interface,
eclone() currently needs CAP_SYS_ADMIN when the 'pids' parameter is
non-NULL.
See Documentation/eclone in next patch for more details and an example
of its usage.
NOTE:
- System calls are restricted to 6 parameters and the number and sizes
  of parameters needed for eclone() exceed 6 integers. The new
  prototype works around this restriction while providing some
  flexibility if eclone() needs to be further extended in the future.
TODO:
- We should convert clone-flags to 64-bit value in all architectures.
  It's probably best to do that as a separate patchset since
  clone_flags touches several functions and that patchset seems
  independent of this new system call.
Changelog[v14]:
- [Oren Laadan] Rebase to kernel 2.6.33
  * introduce PTREGSCALL4 for sys_eclone
  * consolidate syscall definitions for 32/64 bit
- [Oren Laadan] Merge x86_64 (trivial ...
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
This parameter is currently NULL, but will be used in a follow-on patch.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
include/linux/pid.h | 2 +-
kernel/fork.c | 3 ++-
kernel/pid.c | 9 +++++++--
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
int next_pidmap(struct pid_namespace *pid_ns, int last);
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
extern void free_pid(struct pid *pid);
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index e9cf524..2e10cb8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -985,6 +985,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
+ pid_t *target_pids = NULL;
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1167,7 +1168,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
if (pid != &init_struct_pid) {
- pid = alloc_pid(p->nsproxy->pid_ns);
+ pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 1f15bb6..b0d7fc9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -276,13 +276,14 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct ...
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Add a 'target_pids' parameter to copy_process(). The new parameter will be
used in a follow-on patch when eclone() is implemented.
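To make the intent of 'target_pids' concrete, here is a user-space
sketch of how a per-namespace pid request might be honoured. It is not
the implementation in this series: struct demo_pid_ns, the demo_*
helpers, the convention that a target of 0 means "pick any free pid",
the order of the namespace chain, and the error codes are all
assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_PID_MAX 64

/* Hypothetical pid namespace: one "in use" flag per pid number. */
struct demo_pid_ns {
	unsigned char used[DEMO_PID_MAX];
};

/* Allocate 'target' in 'ns', or the lowest free pid if target is 0. */
static int demo_alloc_pidnr(struct demo_pid_ns *ns, int target)
{
	int nr;

	if (target < 0 || target >= DEMO_PID_MAX)
		return -EINVAL;
	if (target > 0) {
		if (ns->used[target])
			return -EBUSY;	/* requested pid already taken */
		ns->used[target] = 1;
		return target;
	}
	for (nr = 1; nr < DEMO_PID_MAX; nr++) {
		if (!ns->used[nr]) {
			ns->used[nr] = 1;
			return nr;
		}
	}
	return -EAGAIN;
}

/*
 * Allocate one pid per namespace level. 'target_pids' may be NULL
 * (no requests at all), mirroring the NULL passed by do_fork() here.
 * On failure, pids already taken at earlier levels are released.
 */
static int demo_alloc_pid(struct demo_pid_ns **ns, int levels,
			  const int *target_pids, int *out)
{
	int i, nr;

	assert(ns != NULL && out != NULL);
	for (i = 0; i < levels; i++) {
		nr = demo_alloc_pidnr(ns[i], target_pids ? target_pids[i] : 0);
		if (nr < 0) {
			while (--i >= 0)
				ns[i]->used[out[i]] = 0;
			return nr;
		}
		out[i] = nr;
	}
	return 0;
}
```

Passing NULL keeps the existing behaviour (every level picks its own
pid), which is why threading the parameter through copy_process() does
not change what fork() and clone() do.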
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
kernel/fork.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 2e10cb8..737bca9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -980,12 +980,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
+ pid_t *target_pids,
int trace)
{
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
- pid_t *target_pids = NULL;
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1359,7 +1359,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
struct pt_regs regs;
task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
- &init_struct_pid, 0);
+ &init_struct_pid, NULL, 0);
if (!IS_ERR(task))
init_idle(task, cpu);
@@ -1382,6 +1382,7 @@ long do_fork(unsigned long clone_flags,
struct task_struct *p;
int trace = 0;
long nr;
+ pid_t *target_pids = NULL;
/*
* Do some preliminary argument and permissions checking before we
@@ -1422,7 +1423,7 @@ long do_fork(unsigned long clone_flags,
trace = tracehook_prepare_clone(clone_flags);
p = copy_process(clone_flags, stack_start, regs, stack_size,
- child_tidptr, NULL, trace);
+ child_tidptr, NULL, target_pids, trace);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
--
1.6.3.3
--
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future use. To ensure this,
define a mask of clone-flags and check the flags in the clone() system calls.
Changelog[v9]:
- Include the unused clone-flag (CLONE_UNUSED) in VALID_CLONE_FLAGS
to avoid breaking any applications that may have set it. IOW, this
patch/check only applies to clone-flags bits 33 and higher.
Changelog[v8]:
- New patch in set
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
include/linux/sched.h | 12 ++++++++++++
kernel/fork.c | 3 +++
2 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78efe7c..d57eab8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -29,6 +29,18 @@
#define CLONE_NEWNET 0x40000000 /* New network namespace */
#define CLONE_IO 0x80000000 /* Clone io context */
+#define CLONE_UNUSED 0x00001000 /* Can be reused ? */
+
+#define VALID_CLONE_FLAGS (CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+ CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
+ CLONE_VFORK | CLONE_PARENT | CLONE_THREAD |\
+ CLONE_NEWNS | CLONE_SYSVSEM | CLONE_SETTLS |\
+ CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |\
+ CLONE_DETACHED | CLONE_UNTRACED |\
+ CLONE_CHILD_SETTID | CLONE_STOPPED |\
+ CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+ CLONE_NEWPID | CLONE_NEWNET | CLONE_IO)
+
/*
* Scheduling policies
*/
diff --git a/kernel/fork.c b/kernel/fork.c
index 737bca9..f95cbd2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -987,6 +987,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
struct task_struct *p;
int cgroup_callbacks_done = ...
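Note: the kernel/fork.c hunk of the clone-flags patch is cut off, so the check itself is not visible in this copy. The following is only a userspace sketch of the general technique (reject any bit that falls outside a known mask), not the patch's actual code. The DEMO_* names and values are made up for illustration, and returning -EINVAL is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define DEMO_FLAG_A      0x1ULL
#define DEMO_FLAG_B      0x2ULL
#define DEMO_FLAG_C      0x4ULL
#define DEMO_VALID_FLAGS (DEMO_FLAG_A | DEMO_FLAG_B | DEMO_FLAG_C)

/*
 * Return 0 if every set bit belongs to the known mask,
 * -EINVAL (assumed error code) if any unknown bit is set.
 */
static int demo_check_flags(uint64_t flags)
{
	if (flags & ~DEMO_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits up front is what keeps them available later: no existing caller can come to depend on a bit that has always been refused.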
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.
Changelog[v4]:
- [Oren Laadan] Adapt to kernel 2.6.33-rc5
Changelog[v3]:
- Earlier version of patchset called alloc_pidmap_page() from two
places. But now it's called from only one place. Even so, moving
this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
-ENOMEM on error instead of -1.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
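Note for readers unfamiliar with the pattern used by alloc_pidmap_page(): the page is allocated without the lock held, installed under the lock only if nobody else installed one first, and the losing copy is freed. Below is a minimal userspace analogue, assuming a pthread mutex in place of pidmap_lock and calloc() in place of kzalloc(). All demo_* names are hypothetical.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096

struct demo_map {
	void *page;
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

static int demo_alloc_map_page(struct demo_map *map)
{
	void *page;

	/*
	 * Unlocked fast-path check, as in the patch. Portable userspace
	 * code would want an atomic load here.
	 */
	if (map->page)
		return 0;

	/* Allocate before taking the lock so the lock is held only briefly. */
	page = calloc(1, DEMO_PAGE_SIZE);

	pthread_mutex_lock(&demo_lock);
	if (!map->page) {
		/* We won the race (or there was none): install our page. */
		map->page = page;
		page = NULL;
	}
	pthread_mutex_unlock(&demo_lock);

	/* Frees our copy if someone else installed first; free(NULL) is a no-op. */
	free(page);

	/* Still no page means the allocation failed and nobody else installed one. */
	return map->page ? 0 : -ENOMEM;
}
```

A second call on the same map takes the fast path and leaves the installed page untouched.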
kernel/pid.c | 41 ++++++++++++++++++++++++++---------------
1 files changed, 26 insertions(+), 15 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index 2e17c9c..39292e6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,30 @@ static void free_pidmap(struct upid *upid)
atomic_inc(&map->nr_free);
}
+static int alloc_pidmap_page(struct pidmap *map)
+{
+ void *page;
+
+ if (likely(map->page))
+ return 0;
+
+ page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ /*
+ * Free the page if someone raced with us installing it:
+ */
+ spin_lock_irq(&pidmap_lock);
+ if (!map->page) {
+ map->page = page;
+ page = NULL;
+ }
+ spin_unlock_irq(&pidmap_lock);
+ kfree(page);
+ if (unlikely(!map->page))
+ return -ENOMEM;
+
+ return 0;
+}
+
static int alloc_pidmap(struct pid_namespace *pid_ns)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -134,22 +158,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
- if (unlikely(!map->page)) {
- void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
- /*
- * Free the page if someone raced with us
- * installing ...
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Define a set_pidmap() interface, which is like alloc_pidmap() except that
the caller specifies the pid number to be assigned.
Changelog[v13]:
- Don't let do_alloc_pidmap return 0 if it failed to find a pid.
Changelog[v9]:
- Completely rewrote this patch based on Eric Biederman's code.
Changelog[v7]:
- [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
- (Eric Biederman): Avoid set_pidmap() function. Added couple of
checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
- (Serge Hallyn) Check for 'pid < 0' in set_pidmap(). (The code
actually checks for 'pid <= 0' for completeness.)
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
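Note: the diff for this patch is truncated in this copy, so the real set_pidmap() is not shown. As a rough userspace sketch of the idea suggested by the do_alloc_pidmap(pid_ns, last, min, max) signature, one range-based helper can serve both "any free pid" and "this specific pid" by narrowing the range to [target, target + 1). The demo_* names, the flat bool array, and the error codes are illustrative assumptions, not the kernel's bitmap code.

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_MAX_IDS 64

static bool demo_used[DEMO_MAX_IDS];
static int demo_last;

/*
 * Claim the first free id in [min, max), scanning from last + 1
 * and wrapping around to min. Returns the id or a negative errno.
 */
static int demo_alloc_range(int last, int min, int max)
{
	int span = max - min;
	int start, i;

	if (min < 0 || max > DEMO_MAX_IDS || span <= 0)
		return -EINVAL;

	start = last + 1;
	if (start < min || start >= max)
		start = min;

	for (i = 0; i < span; i++) {
		int id = min + (start - min + i) % span;

		if (!demo_used[id]) {
			demo_used[id] = true;
			return id;
		}
	}
	return -EBUSY;	/* every id in the range is taken */
}

/* "Any free id": scan the whole usable range, remembering the last hit. */
static int demo_alloc_any(void)
{
	int id = demo_alloc_range(demo_last, 1, DEMO_MAX_IDS);

	if (id >= 0)
		demo_last = id;
	return id;
}

/* "This specific id": a one-element range either yields target or fails. */
static int demo_set_id(int target)
{
	if (target <= 0 || target >= DEMO_MAX_IDS)
		return -EINVAL;
	return demo_alloc_range(target - 1, target, target + 1);
}
```

Note that demo_set_id() deliberately leaves demo_last alone, so a specific request does not disturb where the next ordinary allocation resumes. Whether the real patch behaves the same way cannot be confirmed from the truncated diff.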
kernel/pid.c | 41 +++++++++++++++++++++++++++++++++--------
1 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index 252babf..1f15bb6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -146,17 +146,18 @@ static int alloc_pidmap_page(struct pidmap *map)
return 0;
}
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+ int max)
{
- int i, offset, max_scan, pid, last = pid_ns->last_pid;
+ int i, offset, max_scan, pid;
struct pidmap *map;
pid = last + 1;
if (pid >= pid_max)
- pid = RESERVED_PIDS;
+ pid = min;
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
- max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+ max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
if (unlikely(!map->page))
if (alloc_pidmap_page(map) < 0)
@@ -165,7 +166,6 @@ static int alloc_pidmap(struct ...
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed. With support for setting a specific
pid number, alloc_pidmap() would also fail if the given pid number is
either invalid or already in use.
Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.
Changelog[v1]:
- [Oren Laadan] Rebase to kernel 2.6.33
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
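Note: the caller-side change in copy_process() relies on the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() convention, in which a small negative errno value is encoded in the returned pointer itself. Below is a minimal userspace imitation of that convention. The demo_* helpers are hypothetical stand-ins written for this note, not the kernel's definitions, and demo_alloc_obj() is an invented example allocator.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static inline long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Error pointers occupy the top DEMO_MAX_ERRNO addresses. */
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_obj {
	int nr;
};

/* Report why allocation failed instead of a bare NULL. */
static struct demo_obj *demo_alloc_obj(int nr, int in_use)
{
	struct demo_obj *obj;

	if (in_use)
		return demo_err_ptr(-EBUSY);

	obj = malloc(sizeof(*obj));
	if (!obj)
		return demo_err_ptr(-ENOMEM);

	obj->nr = nr;
	return obj;
}
```

The caller then tests the result with demo_is_err() and propagates demo_ptr_err() rather than assuming -ENOMEM, which is the shape of the copy_process() hunk in this patch.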
kernel/fork.c | 5 +++--
kernel/pid.c | 10 ++++++----
2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index f88bd98..e9cf524 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1167,10 +1167,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
if (pid != &init_struct_pid) {
- retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns);
- if (!pid)
+ if (IS_ERR(pid)) {
+ retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+ }
if (clone_flags & CLONE_NEWPID) {
retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 39292e6..252babf 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -160,7 +160,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
for (i = 0; i <= max_scan; ++i) {
if (unlikely(!map->page))
if (alloc_pidmap_page(map) < 0)
- break;
+ return -ENOMEM;
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
@@ -191,7 +191,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
}
pid = mk_pid(pid_ns, map, offset);
}
- return -1;
+ return -EBUSY;
}
int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -260,8 +260,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
...
Hi Andrew,
Oren sent v20 of the checkpoint/restart patchset out two weeks ago. We've
addressed some feedback from linux-fsdevel and added network and pid
namespace support. So we could resend again now.
However, we also have a bigger patchset in the works which is feature-neutral,
but moves all the code out of linux-2.6/checkpoint/ and next to the code it
affects. I anticipate #ifdef clashes though, so we'll need to do quite a bit
of various-config-and-arch testing of the new code layout.
If you're at a good point to pull it, we can resend the code as is now so as
to get some wider testing exposure. Or, if you prefer, we can wait until
after the code move in case that would be seen as more amenable to
meaningful review.
We don't want to patch-bomb needlessly so thought we'd ask first :)
thanks,
-serge
--
On Thu, 1 Apr 2010 18:37:10 -0500
I guess the final product would be better. It sounds like it'll have the
added benefit of making the various interested parties pay more attention,
too ;)
--
