Re: [C/R ARM][PATCH 2/3] ARM: Add the eclone system call

Previous thread: Webmail Verification Update! by Etzel, Linda on Sunday, March 21, 2010 - 5:31 pm. (1 message)

Next thread: [PATCH 03/10] x86: set nr_irqs_gsi only in probe_nr_irqs_gsi by Yinghai Lu on Sunday, March 21, 2010 - 6:36 pm. (1 message)
From: Christoffer Dall
Date: Sunday, March 21, 2010 - 6:06 pm

Following there will be two preparatory patches for an ARM port of the
checkpoint-restart code and finally a third patch implementing the
architecture-specific parts of c/r.

The preparatory patches consist of a systrace implementation for ARM
based on a previous patch from Roland McGrath and an eclone implementation
for ARM. The systrace implementation is partial and provides the needed
functionality for c/r.

There is a separate patch for the user space code, which supports
cross-compilation, extracting headers for ARM and an eclone implementation
for ARM.

The kernel patches presented here are based on the ckpt-v20 patch set.

Signed-off-by: Christoffer Dall <christofferdall@christofferdall.dk>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
--

From: Christoffer Dall
Date: Sunday, March 21, 2010 - 6:06 pm

This small commit introduces a global state of system calls for ARM
making it possible for a debugger or checkpointing to gain information
about another process' state with respect to system calls.

The patch is based on this proposal from Roland McGrath:
https://patchwork.kernel.org/patch/32101/

Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Christoffer Dall <christofferdall@christofferdall.dk>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/arm/include/asm/syscall.h |   31 +++++++++++++++++++++++++++++++
 arch/arm/kernel/asm-offsets.c  |    1 +
 arch/arm/kernel/entry-common.S |    8 +++++++-
 arch/arm/kernel/ptrace.c       |    2 --
 arch/arm/kernel/signal.c       |   14 +++++++-------
 5 files changed, 46 insertions(+), 10 deletions(-)
 create mode 100644 arch/arm/include/asm/syscall.h

diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
new file mode 100644
index 0000000..3b3248f
--- /dev/null
+++ b/arch/arm/include/asm/syscall.h
@@ -0,0 +1,31 @@
+/*
+ * syscalls.h - Linux syscall interfaces for ARM
+ *
+ * Copyright (c) 2010 Christoffer Dall
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifndef _ASM_ARM_SYSCALLS_H
+#define _ASM_ARM_SYSCALLS_H
+
+static inline int syscall_get_nr(struct task_struct *task,
+				 struct pt_regs *regs)
+{
+	return (int)(task_thread_info(task)->syscall);
+}
+
+static inline long syscall_get_return_value(struct task_struct *task,
+					    struct pt_regs *regs)
+{
+	return regs->ARM_r0;
+}
+
+static inline long syscall_get_error(struct task_struct *task,
+				     struct pt_regs *regs)
+{
+	return regs->ARM_r0;
+}
+
+#endif /* _ASM_ARM_SYSCALLS_H */
diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c
index 4a88125..726a0ad 100644
--- a/arch/arm/kernel/asm-offsets.c
+++ b/arch/arm/kernel/asm-offsets.c
@@ -48,6 +48,7 @@ int main(void)
   DEFINE(TI_CPU,		offsetof(struct thread_info, cpu));
   ...
From: Russell King - ARM Linux
Date: Tuesday, March 23, 2010 - 1:53 pm

I don't particularly like the idea that we always store the syscall
number to memory for every system call, whether the stored version is
used or not.

Since ARM caches are generally not write allocate, this means mostly
write-only variables can have a higher than expected expense.

Is there not some thread flag which can be checked to see if we need to
store the syscall number?
--

From: Matt Helsley
Date: Tuesday, March 23, 2010 - 7:03 pm

Perhaps before we freeze the task we can save the syscall number on ARM.
The patches suggest that the signal delivery path -- which the freezer
utilizes -- has the syscall number already.

Should work since the threads must be frozen first anyway.

Cheers,
	-Matt Helsley
--

From: Oren Laadan
Date: Tuesday, March 23, 2010 - 9:57 pm

I like the idea.

However, would it also work for those cases when the freezing does not 
occur from the signal delivery path - e.g. for vfork and ptraced tasks ?

Oren.
--

From: Matt Helsley
Date: Wednesday, March 24, 2010 - 7:02 am

We could just as easily set it before the vfork uninterruptible completion.
ptracing I'd don't know about though.

Cheers,
	-Matt Helsley
--

From: Oren Laadan
Date: Wednesday, March 24, 2010 - 8:53 am

Actually, the signal path doesn't have the syscall number, it has

vfork() uses freezer_do_not_count() to tell the freezer that it's
effectively frozen. It's also used by drivers/char/apm-emulation.c

Looking at calls to ptrace_notify(), ptrace_stop() and ptace_event(),
there are several places where a ptraced task can stop with TASK_TRACED
(which is good enough for the freezer), outside the signal handling
path.

This means that recording the syscall number for all these cases is
going to be tedious and intrusive.

I prefer to somehow figure out the syscall from the task's state or
pt_regs, or by (re)using the same assembly code that already does that.

Oren.

--

From: Christoffer Dall
Date: Wednesday, March 24, 2010 - 12:36 pm

Well, this could be changed to pass the syscall number through
registers along to try_to_freeze without any mentionable performance

Re-using the assembly code or factoring it out so that it can be used
from multiple places doesn't seem very pleasing to me, as the assembly
code is in the critical path and written specifically for the context
of a process entering the kernel. Please correct me if I'm wrong.

I imagine simply a function in C, more or less re-implementing the
logic that's already in entry-common.S, might do the trick. I wouldn't
worry much about the performance in this case as it will not be used
often. The following _untested_ snippet illustrates my idea:

---
 arch/arm/include/asm/syscall.h |   93 +++++++++++++++++++++++++++++++++++++++-
 1 files changed, 92 insertions(+), 1 deletions(-)

diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
index 3b3248f..a7f2615 100644
--- a/arch/arm/include/asm/syscall.h
+++ b/arch/arm/include/asm/syscall.h
@@ -10,10 +10,101 @@
 #ifndef _ASM_ARM_SYSCALLS_H
 #define _ASM_ARM_SYSCALLS_H

+static inline int get_swi_instruction(struct task_struct *task,
+				      struct pt_regs *regs,
+				      unsigned long *instr)
+{
+	struct page *page = NULL;
+	unsigned long instr_addr;
+	unsigned long *ptr;
+	int ret;
+
+	instr_addr = regs->ARM_pc - 4;
+
+	down_read(&task->mm->mmap_sem);
+	ret = get_user_pages(task, task->mm, instr_addr,
+			     1, 0, 0, &page, NULL);
+	up_read(&task->mm->mmap_sem);
+
+	if (ret < 0)
+		return ret;
+
+	ptr = (unsigned long *)kmap_atomic(page, KM_USER1);
+	memcpy(instr,
+	       ptr + (instr_addr >> PAGE_SHIFT),
+	       sizeof(unsigned long));
+	kunmap_atomic(ptr, KM_USER1);
+
+	page_cache_release(page);
+
+	return 0;
+}
+
+static inline int __syscall_get_nr(struct task_struct *task,
+				   struct pt_regs *regs)
+{
+	int ret;
+	int scno;
+	unsigned long instr;
+	bool config_oabi = false;
+	bool config_aeabi = false;
+	bool config_arm_thumb = ...
From: Matt Helsley
Date: Wednesday, March 24, 2010 - 6:11 pm

Yes, that's possible. I was thinking we could still use your thread info
field but only store to it when we know it will be useful for c/r rather
than for each syscall. Personally, I'd rather avoid passing the extra
			^shouldn't this be:

(again, not familiar with ARM so my understanding is:

I guess swi is "syscall word immediate".

The syscall nr is embedded in the instruction as an immediate
value and you're getting a copy of that instruction using the value of
the pc register just after the syscall instruction was executed.)

Perhaps I am missing or forgetting something. Why isn't this as simple
as calling get_user() or even copy_from_user() using instr_addr?

Cheers,
	-Matt Helsley
--

From: Matt Helsley
Date: Wednesday, March 24, 2010 - 6:17 pm

Oops, made my own mistake. I think the address of the kmap'd instruction
would be:

	ptr + (instr_addr & ~PAGE_MASK)

Cheers,
	-Matt Helsley
--

From: Christoffer Dall
Date: Thursday, March 25, 2010 - 3:29 am

Yes. Thanks for pointing it out.
--

From: Oren Laadan
Date: Wednesday, March 24, 2010 - 6:35 pm

In c/r, we only need it at restart when a task calls it on itself.

However the interface itself of get_syscall_nr() can be called by
any task on another task.

(In fact, I think that for the most part, saving the syscall number
at checkpoint time may be better than figuring out at restart time).

Oren.


--

From: Christoffer Dall
Date: Thursday, March 25, 2010 - 3:34 am

So, as Oren is saying, the point was to make the syscall_get_nr(..)
work according to the interface specified in
include/asm-generic/syscall.h.

Considering it's unknown how we will deal with checkpoint/restart
across CONFIG_ARM_THUMB, CONFIG_OABI_COMPAT etc., I also think it's a
better idea to checkpoint the syscall number at checkpoint and for the
restore, place architecture specific hooks to get the syscall number
instead of calling syscall_get_nr(...) directly. In this way we should
always be able to get the syscall and correctly restart, independently
of what tricks we do to checkpoint restart across configuration
settings - if any.

Best,
Christoffer
--

From: Christoffer Dall
Date: Sunday, March 21, 2010 - 6:06 pm

Implements architecture specific requirements for checkpoint/restart on
ARM. The changes touch almost only c/r related code. Most of the work is
done in arch/arm/checkpoint.c, which implements checkpointing of the CPU
and necessary fields on the thread_info struct.

The ISA version (given by __LINUX_ARM_ARCH__) is checkpointed and verified
against the machine architecture on restart. If they differ, an error is
raised and restart aborted. It should be possible to restart on newer
architectures, but further investigation is warranted.

Regarding ThumbEE, the thumbee_state field on the thread_info is stored
in checkpoints when CONFIG_ARM_THUMBEE and 0 is stored otherwise. If
a value different than 0 is checkpointed and CONFIG_ARM_THUMBEE is not
set on the restore system, the restore is aborted. Feedback on this
implementation is very welcome.

We checkpoint whether the system is running with CONFIG_MMU or not and
require the same configuration for the system on which we restore the
process. It might be possible to allow something more fine-grained,
if it's worth the energy. Input on this item is also very welcome,
specifically from someone who knows the exact meaning of the end_brk
field.

Added support for syscall sys_checkpoint and sys_restart for ARM:
__NR_checkpoint         367
__NR_restart            368


Cc: rmk@arm.linux.org.uk
Signed-off-by: Christoffer Dall <christofferdall@christofferdall.dk>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/arm/Kconfig                      |    4 +
 arch/arm/include/asm/checkpoint_hdr.h |   71 +++++++++
 arch/arm/include/asm/ptrace.h         |    1 +
 arch/arm/include/asm/unistd.h         |    2 +
 arch/arm/kernel/Makefile              |    1 +
 arch/arm/kernel/calls.S               |    2 +
 arch/arm/kernel/checkpoint.c          |  276 +++++++++++++++++++++++++++++++++
 arch/arm/kernel/signal.c              |    5 +
 arch/arm/kernel/sys_arm.c             |   13 ++
 include/linux/checkpoint_hdr.h        |    2 +
 10 ...
From: Serge E. Hallyn
Date: Tuesday, March 23, 2010 - 9:09 am

In terms of the cr api I don't see any problems.  Two nits below,
but in any case

Acked-by: Serge Hallyn <serue@us.ibm.com>

thanks, this is really cool, especially how minimal it is :)
-serge


will load_cpu_regs() ever be changed to return anything but 0?  If
not both fns can be simplified.


Again ret doesn't seem needed here.

-serge
--

From: Christoffer Dall
Date: Wednesday, March 24, 2010 - 12:46 pm

indeed it doesn't.

-Christoffer
--

From: Russell King - ARM Linux
Date: Tuesday, March 23, 2010 - 2:18 pm

I think you misunderstand what __LINUX_ARM_ARCH__ signifies.  It is the
build architecture for the kernel, and it indicates the lowest
architecture version that the kernel will run on.

That doesn't indicate what ISA version the system is running on, or even
if the ABI is compatible (we have two ABIs - OABI and EABI).

There's also the matter of FP implementation - whether it is VFP or FPA,
and whether iwMMXt is available or not.  (iwMMXt precludes the use of


Processes which run on MMU and non-MMU CPUs are unlikely to be
interchangable - the run time environments are quite different.  I

How do you safely obtain consistent information from a thread?  Do you
temporarily stop it?
--

From: Matt Helsley
Date: Tuesday, March 23, 2010 - 6:53 pm

On Tue, Mar 23, 2010 at 09:18:43PM +0000, Russell King - ARM Linux wrote:


It must be frozen with the cgroup freezer (which reuses the suspend freezer).
sys_checkpoint moves the cgroup freezer into the CHECKPOINTING state which
prevents tasks in that group from being thawed until just before checkpoint
returns.

Cheers,
	-Matt Helsley
--

From: Christoffer Dall
Date: Wednesday, March 24, 2010 - 1:48 pm

On Tue, Mar 23, 2010 at 10:18 PM, Russell King - ARM Linux

Yes, clearly I didn't understand this fully. So is it in fact possible
to compile the kernel with __LINUX_ARM_ARCH=6 and have
CONFIG_CPU_32v7? Or is it a matter of running a v6 kernel with
CONFIG_CPU_32v6 on a newer architecture?

What I would like to accomplish is the best way to make sure that the
restarted process will in fact be able to run. What is the best way to

That's why I checkpointed CONFIG_OABI_COMPAT, but I realize that it's
not sufficient.

How about checkpointing CONFIG_AEABI and CONFIG_OABI_COMPAT and making
sure that we either restore to the same setting of the two or restore

I had a feeling this would be an issue, but I never dove into the
workings of FP on ARM. Can you give me some concrete pointers as what
to checkpoint and restart / check on restart for a process using FP to

I encountered it when looking at struct thread_info in
arch/arm/include/asm/thread_info.h, and have not seen et before. After
looking into it a little more, it's included it 2.6.33 and defined in
--

From: Jamie Lokier
Date: Thursday, March 25, 2010 - 7:47 pm

With CONFIG_OABI_COMPAT enabled, each process can be in either
personality: OABI or EABI.  Checkpointing will need to remember which
one.

With CONFIG_OABI_COMPAT disabled, it'll be fixed at one or the other,
but there's no reason why a process should not be moved between
kernels with different values of CONFIG_OABI_COMPAT, so long as the
OABI or EABI personality is supported by the destination kernel.

In other words, CONFIG_OABI_COMPAT shouldn't be in the checkpoint

It's possible in principle to run many non-MMU binaries on MMU
kernels, but I've never heard of anyone doing it.

-- Jamie

--

From: Paul Mundt
Date: Thursday, March 25, 2010 - 8:02 pm

FDPIC supports running the same binaries with or without MMU depending on
your ABI, it's not really that uncommon, even if it's mostly just used
for prototyping.
--

From: Jamie Lokier
Date: Thursday, March 25, 2010 - 8:55 pm

Thanks - I didn't know anyone actually did it :-)

But I can see the value for product rollouts of some binaries onto a
mixture of hardware, too.

bFLT (flat) binaries should be runnable with an MMU too.

-- Jamie
--

From: Christoffer Dall
Date: Sunday, March 28, 2010 - 3:55 pm

I would imagine that the chance that a restart will fail anyway when
restoring an MMU process on a non-MMU kernel. However, as you suggest,
the other way around should be possible. Thanks for clearing that up.

Specifically, do you know the meaning of the end_brk field on the
mm_context_t struct and if I need to checkpoint it on restart for
non-MMU systems (and potentially do something more clever during
restart on an MMU kernel?)
--

From: Christoffer Dall
Date: Sunday, March 21, 2010 - 6:06 pm

In addition to doing everything that clone() system call does, the
eclone() system call:

	- allows additional clone flags (31 of 32 bits in the flags
	  parameter to clone() are in use)

	- allows user to specify a pid for the child process in its
	  active and ancestor pid namespaces.

Eclone is needed for restarting a process from a checkpoint. See more
in Documentation/eclone and refer to the original LKML posting:
http://lkml.org/lkml/2009/11/11/361

The new system call for ARM has number 366.

Cc: rmk@arm.linux.org.uk
Cc: libc-ports <libc-ports@sourceware.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Christoffer Dall <christofferdall@christofferdall.dk>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/arm/include/asm/unistd.h  |    1 +
 arch/arm/kernel/calls.S        |    1 +
 arch/arm/kernel/entry-common.S |    6 ++++++
 arch/arm/kernel/sys_arm.c      |   39 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index cf9cdaa..f295a6c 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -392,6 +392,7 @@
 #define __NR_rt_tgsigqueueinfo		(__NR_SYSCALL_BASE+363)
 #define __NR_perf_event_open		(__NR_SYSCALL_BASE+364)
 #define __NR_recvmmsg			(__NR_SYSCALL_BASE+365)
+#define __NR_eclone			(__NR_SYSCALL_BASE+366)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 9314a2d..5ef0b03 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -375,6 +375,7 @@
 		CALL(sys_rt_tgsigqueueinfo)
 		CALL(sys_perf_event_open)
 /* 365 */	CALL(sys_recvmmsg)
+		CALL(sys_eclone_wrapper)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index f694f4d..9ead15d 100644
--- ...
From: Russell King - ARM Linux
Date: Tuesday, March 23, 2010 - 2:06 pm

I'm curious why, if you want the entire set of registers, you don't just
do:
		add	r0, sp, #S_OFF
		b	sys_eclone

and load the syscall arguments out of regs->ARM_foo.  This avoids the need

__user on an integer type doesn't make any sense; integer types do not

This will produce sparse errors.  Is there a reason why 'clone_args'



Hmm, so let me get this syscall interface right.  We have some arguments
passed in registers and others via a (variable sized?) structure.  It seems
really weird to have, eg, a pointer to the pids and the number of pids
passed in two separate ways.

The grouping between what's passed in registers and via this clone_args
structure seems to be random.  Can it be sanitized?
--

From: Sukadev Bhattiprolu
Date: Wednesday, March 24, 2010 - 11:19 am

Russell King - ARM Linux [linux@arm.linux.org.uk] wrote:
| > +asmlinkage int sys_eclone(unsigned flags_low, struct clone_args __user *uca,
| > +			  int args_size, pid_t __user *pids,
| > +			  struct pt_regs *regs)
| > +{
| > +	int rc;
| > +	struct clone_args kca;
| > +	unsigned long flags;
| > +	int __user *parent_tidp;
| > +	int __user *child_tidp;
| > +	unsigned long __user stack;
| 
| __user on an integer type doesn't make any sense; integer types do not
| have address spaces.

Ah, will fix  that for x86 32/64 bit implementations.

| 
| > +	unsigned long stack_size;
| > +
| > +	rc = fetch_clone_args_from_user(uca, args_size, &kca);
| > +	if (rc)
| > +		return rc;
| > +
| > +	/*
| > +	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
| > +	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
| > +	 * 	 higher word(s) of 'flags':
| > +	 *
| > +	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
| > +	 */
| > +	flags = flags_low;
| > +	parent_tidp = (int *)(unsigned long)kca.parent_tid_ptr;
| > +	child_tidp = (int *)(unsigned long)kca.child_tid_ptr;
| 
| This will produce sparse errors.  Is there a reason why 'clone_args'
| tid pointers aren't already pointers marked with __user ?

Making them pointers would make them 32-bit on some and 64-bit on other
architectures. We wanted the fields to be the same size on all architectures
for easier portability/extensibility. Given that we are copying-in a structure
from user-space, copying in a few extra bytes on the 32-bit architectures
would not be significant.

| 
| > +
| > +	stack_size = (unsigned long)kca.child_stack_size;
| 
| Shouldn't this already be of integer type?
| 
| > +	if (stack_size)
| > +		return -EINVAL;
| 
| So the stack must have a zero size?  Is this missing a '!' ?

Some architectures (IA64 ?) use the stack-size field. Those that don't
need it should error out. Again, its because we are trying to keep the
interface common across architectures.

| 
| > +
| > ...
From: Christoffer Dall
Date: Wednesday, March 24, 2010 - 12:42 pm

On Tue, Mar 23, 2010 at 10:06 PM, Russell King - ARM Linux

I simply copied the code from sys_clone. Do you prefer that I change


Thanks for you feedback. I will let the people behind eclone deal with
the eclone specifics.
--

Previous thread: Webmail Verification Update! by Etzel, Linda on Sunday, March 21, 2010 - 5:31 pm. (1 message)

Next thread: [PATCH 03/10] x86: set nr_irqs_gsi only in probe_nr_irqs_gsi by Yinghai Lu on Sunday, March 21, 2010 - 6:36 pm. (1 message)