Following there will be two preparatory patches for an ARM port of the checkpoint-restart code and finally a third patch implementing the architecture-specific parts of c/r. The preparatory patches consist of a systrace implementation for ARM based on a previous patch from Roland McGrath and an eclone implementation for ARM. The systrace implementation is partial and provides the needed functionality for c/r. There is a separate patch for the user space code, which supports cross-compilation, extracting headers for ARM and an eclone implementation for ARM. The kernel patches presented here are based on the ckpt-v20 patch set. Signed-off-by: Christoffer Dall <christofferdall@christofferdall.dk> Acked-by: Oren Laadan <orenl@cs.columbia.edu> --
This small commit introduces a global state of system calls for ARM making it possible for a debugger or checkpointing to gain information about another process' state with respect to system calls. The patch is based on this proposal from Roland McGrath: https://patchwork.kernel.org/patch/32101/ Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Christoffer Dall <christofferdall@christofferdall.dk> Acked-by: Oren Laadan <orenl@cs.columbia.edu> --- arch/arm/include/asm/syscall.h | 31 +++++++++++++++++++++++++++++++ arch/arm/kernel/asm-offsets.c | 1 + arch/arm/kernel/entry-common.S | 8 +++++++- arch/arm/kernel/ptrace.c | 2 -- arch/arm/kernel/signal.c | 14 +++++++------- 5 files changed, 46 insertions(+), 10 deletions(-) create mode 100644 arch/arm/include/asm/syscall.h diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h new file mode 100644 index 0000000..3b3248f --- /dev/null +++ b/arch/arm/include/asm/syscall.h @@ -0,0 +1,31 @@ +/* + * syscalls.h - Linux syscall interfaces for ARM + * + * Copyright (c) 2010 Christoffer Dall + * + * This file is released under the GPLv2. + * See the file COPYING for more details. + */ + +#ifndef _ASM_ARM_SYSCALLS_H +#define _ASM_ARM_SYSCALLS_H + +static inline int syscall_get_nr(struct task_struct *task, + struct pt_regs *regs) +{ + return (int)(task_thread_info(task)->syscall); +} + +static inline long syscall_get_return_value(struct task_struct *task, + struct pt_regs *regs) +{ + return regs->ARM_r0; +} + +static inline long syscall_get_error(struct task_struct *task, + struct pt_regs *regs) +{ + return regs->ARM_r0; +} + +#endif /* _ASM_ARM_SYSCALLS_H */ diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c index 4a88125..726a0ad 100644 --- a/arch/arm/kernel/asm-offsets.c +++ b/arch/arm/kernel/asm-offsets.c @@ -48,6 +48,7 @@ int main(void) DEFINE(TI_CPU, offsetof(struct thread_info, cpu)); ...
I don't particularly like the idea that we always store the syscall number to memory for every system call, whether the stored version is used or not. Since ARM caches are generally not write allocate, this means mostly write-only variables can have a higher than expected expense. Is there not some thread flag which can be checked to see if we need to store the syscall number? --
Perhaps before we freeze the task we can save the syscall number on ARM. The patches suggest that the signal delivery path -- which the freezer utilizes -- has the syscall number already. Should work since the threads must be frozen first anyway. Cheers, -Matt Helsley --
I like the idea. However, would it also work for those cases when the freezing does not occur from the signal delivery path - e.g. for vfork and ptraced tasks ? Oren. --
We could just as easily set it before the vfork uninterruptible completion. ptracing I'd don't know about though. Cheers, -Matt Helsley --
Actually, the signal path doesn't have the syscall number, it has vfork() uses freezer_do_not_count() to tell the freezer that it's effectively frozen. It's also used by drivers/char/apm-emulation.c Looking at calls to ptrace_notify(), ptrace_stop() and ptace_event(), there are several places where a ptraced task can stop with TASK_TRACED (which is good enough for the freezer), outside the signal handling path. This means that recording the syscall number for all these cases is going to be tedious and intrusive. I prefer to somehow figure out the syscall from the task's state or pt_regs, or by (re)using the same assembly code that already does that. Oren. --
Well, this could be changed to pass the syscall number through
registers along to try_to_freeze without any mentionable performance
Re-using the assembly code or factoring it out so that it can be used
from multiple places doesn't seem very pleasing to me, as the assembly
code is in the critical path and written specifically for the context
of a process entering the kernel. Please correct me if I'm wrong.
I imagine simply a function in C, more or less re-implementing the
logic that's already in entry-common.S, might do the trick. I wouldn't
worry much about the performance in this case as it will not be used
often. The following _untested_ snippet illustrates my idea:
---
arch/arm/include/asm/syscall.h | 93 +++++++++++++++++++++++++++++++++++++++-
1 files changed, 92 insertions(+), 1 deletions(-)
diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
index 3b3248f..a7f2615 100644
--- a/arch/arm/include/asm/syscall.h
+++ b/arch/arm/include/asm/syscall.h
@@ -10,10 +10,101 @@
#ifndef _ASM_ARM_SYSCALLS_H
#define _ASM_ARM_SYSCALLS_H
+static inline int get_swi_instruction(struct task_struct *task,
+ struct pt_regs *regs,
+ unsigned long *instr)
+{
+ struct page *page = NULL;
+ unsigned long instr_addr;
+ unsigned long *ptr;
+ int ret;
+
+ instr_addr = regs->ARM_pc - 4;
+
+ down_read(&task->mm->mmap_sem);
+ ret = get_user_pages(task, task->mm, instr_addr,
+ 1, 0, 0, &page, NULL);
+ up_read(&task->mm->mmap_sem);
+
+ if (ret < 0)
+ return ret;
+
+ ptr = (unsigned long *)kmap_atomic(page, KM_USER1);
+ memcpy(instr,
+ ptr + (instr_addr >> PAGE_SHIFT),
+ sizeof(unsigned long));
+ kunmap_atomic(ptr, KM_USER1);
+
+ page_cache_release(page);
+
+ return 0;
+}
+
+static inline int __syscall_get_nr(struct task_struct *task,
+ struct pt_regs *regs)
+{
+ int ret;
+ int scno;
+ unsigned long instr;
+ bool config_oabi = false;
+ bool config_aeabi = false;
+ bool config_arm_thumb = ...Yes, that's possible. I was thinking we could still use your thread info field but only store to it when we know it will be useful for c/r rather than for each syscall. Personally, I'd rather avoid passing the extra ^shouldn't this be: (again, not familiar with ARM so my understanding is: I guess swi is "syscall word immediate". The syscall nr is embedded in the instruction as an immediate value and you're getting a copy of that instruction using the value of the pc register just after the syscall instruction was executed.) Perhaps I am missing or forgetting something. Why isn't this as simple as calling get_user() or even copy_from_user() using instr_addr? Cheers, -Matt Helsley --
Oops, made my own mistake. I think the address of the kmap'd instruction would be: ptr + (instr_addr & ~PAGE_MASK) Cheers, -Matt Helsley --
In c/r, we only need it at restart when a task calls it on itself. However the interface itself of get_syscall_nr() can be called by any task on another task. (In fact, I think that for the most part, saving the syscall number at checkpoint time may be better than figuring out at restart time). Oren. --
So, as Oren is saying, the point was to make the syscall_get_nr(..) work according to the interface specified in include/asm-generic/syscall.h. Considering it's unknown how we will deal with checkpoint/restart across CONFIG_ARM_THUMB, CONFIG_OABI_COMPAT etc., I also think it's a better idea to checkpoint the syscall number at checkpoint and for the restore, place architecture specific hooks to get the syscall number instead of calling syscall_get_nr(...) directly. In this way we should always be able to get the syscall and correctly restart, independently of what tricks we do to checkpoint restart across configuration settings - if any. Best, Christoffer --
Implements architecture specific requirements for checkpoint/restart on ARM. The changes touch almost only c/r related code. Most of the work is done in arch/arm/checkpoint.c, which implements checkpointing of the CPU and necessary fields on the thread_info struct. The ISA version (given by __LINUX_ARM_ARCH__) is checkpointed and verified against the machine architecture on restart. If they differ, an error is raised and restart aborted. It should be possible to restart on newer architectures, but further investigation is warranted. Regarding ThumbEE, the thumbee_state field on the thread_info is stored in checkpoints when CONFIG_ARM_THUMBEE and 0 is stored otherwise. If a value different than 0 is checkpointed and CONFIG_ARM_THUMBEE is not set on the restore system, the restore is aborted. Feedback on this implementation is very welcome. We checkpoint whether the system is running with CONFIG_MMU or not and require the same configuration for the system on which we restore the process. It might be possible to allow something more fine-grained, if it's worth the energy. Input on this item is also very welcome, specifically from someone who knows the exact meaning of the end_brk field. Added support for syscall sys_checkpoint and sys_restart for ARM: __NR_checkpoint 367 __NR_restart 368 Cc: rmk@arm.linux.org.uk Signed-off-by: Christoffer Dall <christofferdall@christofferdall.dk> Acked-by: Oren Laadan <orenl@cs.columbia.edu> --- arch/arm/Kconfig | 4 + arch/arm/include/asm/checkpoint_hdr.h | 71 +++++++++ arch/arm/include/asm/ptrace.h | 1 + arch/arm/include/asm/unistd.h | 2 + arch/arm/kernel/Makefile | 1 + arch/arm/kernel/calls.S | 2 + arch/arm/kernel/checkpoint.c | 276 +++++++++++++++++++++++++++++++++ arch/arm/kernel/signal.c | 5 + arch/arm/kernel/sys_arm.c | 13 ++ include/linux/checkpoint_hdr.h | 2 + 10 ...
In terms of the cr api I don't see any problems. Two nits below, but in any case Acked-by: Serge Hallyn <serue@us.ibm.com> thanks, this is really cool, especially how minimal it is :) -serge will load_cpu_regs() ever be changed to return anything but 0? If not both fns can be simplified. Again ret doesn't seem needed here. -serge --
indeed it doesn't. -Christoffer --
I think you misunderstand what __LINUX_ARM_ARCH__ signifies. It is the build architecture for the kernel, and it indicates the lowest architecture version that the kernel will run on. That doesn't indicate what ISA version the system is running on, or even if the ABI is compatible (we have two ABIs - OABI and EABI). There's also the matter of FP implementation - whether it is VFP or FPA, and whether iwMMXt is available or not. (iwMMXt precludes the use of Processes which run on MMU and non-MMU CPUs are unlikely to be interchangable - the run time environments are quite different. I How do you safely obtain consistent information from a thread? Do you temporarily stop it? --
On Tue, Mar 23, 2010 at 09:18:43PM +0000, Russell King - ARM Linux wrote: It must be frozen with the cgroup freezer (which reuses the suspend freezer). sys_checkpoint moves the cgroup freezer into the CHECKPOINTING state which prevents tasks in that group from being thawed until just before checkpoint returns. Cheers, -Matt Helsley --
On Tue, Mar 23, 2010 at 10:18 PM, Russell King - ARM Linux Yes, clearly I didn't understand this fully. So is it in fact possible to compile the kernel with __LINUX_ARM_ARCH=6 and have CONFIG_CPU_32v7? Or is it a matter of running a v6 kernel with CONFIG_CPU_32v6 on a newer architecture? What I would like to accomplish is the best way to make sure that the restarted process will in fact be able to run. What is the best way to That's why I checkpointed CONFIG_OABI_COMPAT, but I realize that it's not sufficient. How about checkpointing CONFIG_AEABI and CONFIG_OABI_COMPAT and making sure that we either restore to the same setting of the two or restore I had a feeling this would be an issue, but I never dove into the workings of FP on ARM. Can you give me some concrete pointers as what to checkpoint and restart / check on restart for a process using FP to I encountered it when looking at struct thread_info in arch/arm/include/asm/thread_info.h, and have not seen et before. After looking into it a little more, it's included it 2.6.33 and defined in --
With CONFIG_OABI_COMPAT enabled, each process can be in either personality: OABI or EABI. Checkpointing will need to remember which one. With CONFIG_OABI_COMPAT disabled, it'll be fixed at one or the other, but there's no reason why a process should not be moved between kernels with different values of CONFIG_OABI_COMPAT, so long as the OABI or EABI personality is supported by the destination kernel. In other words, CONFIG_OABI_COMPAT shouldn't be in the checkpoint It's possible in principle to run many non-MMU binaries on MMU kernels, but I've never heard of anyone doing it. -- Jamie --
FDPIC supports running the same binaries with or without MMU depending on your ABI, it's not really that uncommon, even if it's mostly just used for prototyping. --
Thanks - I didn't know anyone actually did it :-) But I can see the value for product rollouts of some binaries onto a mixture of hardware, too. bFLT (flat) binaries should be runnable with an MMU too. -- Jamie --
I would imagine that the chance that a restart will fail anyway when restoring an MMU process on a non-MMU kernel. However, as you suggest, the other way around should be possible. Thanks for clearing that up. Specifically, do you know the meaning of the end_brk field on the mm_context_t struct and if I need to checkpoint it on restart for non-MMU systems (and potentially do something more clever during restart on an MMU kernel?) --
In addition to doing everything that clone() system call does, the eclone() system call: - allows additional clone flags (31 of 32 bits in the flags parameter to clone() are in use) - allows user to specify a pid for the child process in its active and ancestor pid namespaces. Eclone is needed for restarting a process from a checkpoint. See more in Documentation/eclone and refer to the original LKML posting: http://lkml.org/lkml/2009/11/11/361 The new system call for ARM has number 366. Cc: rmk@arm.linux.org.uk Cc: libc-ports <libc-ports@sourceware.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Christoffer Dall <christofferdall@christofferdall.dk> Acked-by: Oren Laadan <orenl@cs.columbia.edu> --- arch/arm/include/asm/unistd.h | 1 + arch/arm/kernel/calls.S | 1 + arch/arm/kernel/entry-common.S | 6 ++++++ arch/arm/kernel/sys_arm.c | 39 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 47 insertions(+), 0 deletions(-) diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h index cf9cdaa..f295a6c 100644 --- a/arch/arm/include/asm/unistd.h +++ b/arch/arm/include/asm/unistd.h @@ -392,6 +392,7 @@ #define __NR_rt_tgsigqueueinfo (__NR_SYSCALL_BASE+363) #define __NR_perf_event_open (__NR_SYSCALL_BASE+364) #define __NR_recvmmsg (__NR_SYSCALL_BASE+365) +#define __NR_eclone (__NR_SYSCALL_BASE+366) /* * The following SWIs are ARM private. diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S index 9314a2d..5ef0b03 100644 --- a/arch/arm/kernel/calls.S +++ b/arch/arm/kernel/calls.S @@ -375,6 +375,7 @@ CALL(sys_rt_tgsigqueueinfo) CALL(sys_perf_event_open) /* 365 */ CALL(sys_recvmmsg) + CALL(sys_eclone_wrapper) #ifndef syscalls_counted .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls #define syscalls_counted diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S index f694f4d..9ead15d 100644 --- ...
I'm curious why, if you want the entire set of registers, you don't just do: add r0, sp, #S_OFF b sys_eclone and load the syscall arguments out of regs->ARM_foo. This avoids the need __user on an integer type doesn't make any sense; integer types do not This will produce sparse errors. Is there a reason why 'clone_args' Hmm, so let me get this syscall interface right. We have some arguments passed in registers and others via a (variable sized?) structure. It seems really weird to have, eg, a pointer to the pids and the number of pids passed in two separate ways. The grouping between what's passed in registers and via this clone_args structure seems to be random. Can it be sanitized? --
Russell King - ARM Linux [linux@arm.linux.org.uk] wrote:
| > +asmlinkage int sys_eclone(unsigned flags_low, struct clone_args __user *uca,
| > + int args_size, pid_t __user *pids,
| > + struct pt_regs *regs)
| > +{
| > + int rc;
| > + struct clone_args kca;
| > + unsigned long flags;
| > + int __user *parent_tidp;
| > + int __user *child_tidp;
| > + unsigned long __user stack;
|
| __user on an integer type doesn't make any sense; integer types do not
| have address spaces.
Ah, will fix that for x86 32/64 bit implementations.
|
| > + unsigned long stack_size;
| > +
| > + rc = fetch_clone_args_from_user(uca, args_size, &kca);
| > + if (rc)
| > + return rc;
| > +
| > + /*
| > + * TODO: Convert 'clone-flags' to 64-bits on all architectures.
| > + * TODO: When ->clone_flags_high is non-zero, copy it in to the
| > + * higher word(s) of 'flags':
| > + *
| > + * flags = (kca.clone_flags_high << 32) | flags_low;
| > + */
| > + flags = flags_low;
| > + parent_tidp = (int *)(unsigned long)kca.parent_tid_ptr;
| > + child_tidp = (int *)(unsigned long)kca.child_tid_ptr;
|
| This will produce sparse errors. Is there a reason why 'clone_args'
| tid pointers aren't already pointers marked with __user ?
Making them pointers would make them 32-bit on some and 64-bit on other
architectures. We wanted the fields to be the same size on all architectures
for easier portability/extensibility. Given that we are copying-in a structure
from user-space, copying in a few extra bytes on the 32-bit architectures
would not be significant.
|
| > +
| > + stack_size = (unsigned long)kca.child_stack_size;
|
| Shouldn't this already be of integer type?
|
| > + if (stack_size)
| > + return -EINVAL;
|
| So the stack must have a zero size? Is this missing a '!' ?
Some architectures (IA64 ?) use the stack-size field. Those that don't
need it should error out. Again, its because we are trying to keep the
interface common across architectures.
|
| > +
| > ...On Tue, Mar 23, 2010 at 10:06 PM, Russell King - ARM Linux I simply copied the code from sys_clone. Do you prefer that I change Thanks for you feedback. I will let the people behind eclone deal with the eclone specifics. --
