Re: Testing lxc 0.6.5 in Fedora 13

Previous thread: [v3] New ldisc for WiLink7.0 by pavan_savoy on Tuesday, March 23, 2010 - 1:41 pm. (1 message)

Next thread: [watchdog] combine nmi_watchdog and softlockup by Don Zickus on Tuesday, March 23, 2010 - 2:33 pm. (17 messages)
From: Matt Helsley
Date: Tuesday, March 23, 2010 - 2:28 pm

On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:


I'm suprised strace of ls works across pid namespaces. I've been looking
at strace and it seemed to me that one kernel change and a bunch of strace
changes are needed to make strace'ing in child pid namespaces work. Eric
Biederman's setns() patches also might help.

Can you get a little farther with the kernel fix below?

    Fix incorrect pid namespace used by ptrace during fork/vfork/clone
    
    pid namespaces are not used properly by ptrace in do_fork(). When tracing
    parent != real_parent because parent is the tracing task. Yet the pid in
    the real_parent's namespace is being used in do_fork():
    
    	nr = task_pid_vnr(p); /* uses real_parent's pid namespace */
    	if (clone_flags & CLONE_PARENT_SETTID)
    		put_user(nr, parent_tidptr); /* "real_parent_tidptr" */
    	...
    	tracehook_report_clone_complete(trace, regs,
    					clone_flags, nr, p); /* ptrace broken */
    
    	if (clone_flags & CLONE_VFORK) {
    		freezer_do_not_count();
    		wait_for_completion(&vfork);
    		freezer_count();
    		tracehook_report_vfork_done(p, nr); /* ptrace broken */
    
    In this case re-using the value in nr is wrong.
    
    This bug can be seen by attaching to an already-running task
    in a descendent namespace with strace -f. When the traced task forks
    strace won't attach to the new task properly because it sees the
    incorrect pid. For example, if root is running on two VTs and
    root@VTN# indicates switching to VT N:
    
    root@VT1# ns_exec -cp /bin/bash
    root@VT1# echo $$
    1
    root@VT2# strace -f -e fork,vfork,clone -p <pid of bash>
    Process 14518 attached - interrupt to quit
    root@VT1# /bin/bash
    <stops -- new bash shell does not respond to input>
    root@VT2#
    clone(Process 15 attached ... ) = 15
    Process 15044 attached (waiting for parent)
    Process 14518 suspended
    <no more output>
    <hit ctrl-c>
    root@VT1# echo $$
    ...
From: Greg Kurz
Date: Wednesday, March 24, 2010 - 2:25 am

Matt,

Grzegorz is telling you that 'strace -ff lxc-start ls' is broken but
'lxc-start strace -ff ls' works as expected => strace of ls doesn't work
across pid namespaces. No surprise.

Cheers.

-- 
Gregory Kurz                                     gkurz@fr.ibm.com
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)534 638 479                           Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.

--

From: Grzegorz Nosek
Date: Thursday, March 25, 2010 - 2:33 pm

No, not really. Attaching from outside to a shell running in a container
and running a command yields:

| rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
| pipe([3, 4])                            = 0
| clone(Process 2581 attached (waiting for parent)
| Process 190 attached

Without the patch the order of reported pids is reversed (and at least
with the patched kernel the outside pid is consistently reported first)

| child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7713708) = 190
| [pid  2549] setpgid(190, 190)           = 0
| [pid  2549] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
| [pid  2549] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
| [pid  2549] close(3)                    = 0
| [pid  2549] close(4)                    = 0
| [pid  2549] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
| [pid  2549] ioctl(255, TIOCSPGRP, [190]) = 0
| [pid  2549] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
| [pid  2549] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
| [pid  2549] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
| [pid  2549] waitpid(-1, Process 2549 suspended

(the shell hangs here)

^C <unfinished ...>
| Process 2549 detached
| Process 2581 detached
| Process 190 detached

(the command executes here normally).

Best regards,
 Grzegorz Nosek
--

From: Oleg Nesterov
Date: Friday, March 26, 2010 - 4:11 am

Yes, this is broken. More precisely, this wasn't even supposed to work.

Even stracing of the sub-init itself (or global init btw) has problems,

Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr,
as it seen inside the init's namespace. This is easy to fix, but I doubt this
can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax

which patch?

Oleg.

--

From: Grzegorz Nosek
Date: Friday, March 26, 2010 - 4:32 am

Is this impossible/very hard to do cleanly? I understand that container's
init becomes vulnerable to signals sent from root-owned processes in the
container. If so, the impact of this issue should be quite limited, no?

Strace'ing processes across pidns boundary would be really useful in day
to day administrative work but if it's unfixable, I guess at least
preventing strace from attaching to processes in a descendant pidns

The patch below posted by Matt. AIUI, it fixes the
tracehook_report_clone_complete() part, which results in an observable
change in strace's behaviour (not that it makes strace work, though).
Anyway, are there any remaining issues on the kernel side or does strace
have to be taught about pid namespaces?

Best regards,
 Grzegorz Nosek

diff --git a/kernel/fork.c b/kernel/fork.c
index 3a65513..7946ea6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1404,6 +1404,7 @@ long do_fork(unsigned long clone_flags,
         */
        if (!IS_ERR(p)) {
                struct completion vfork;
+               int ptrace_pid_vnr;

                trace_sched_process_fork(current, p);

@@ -1439,14 +1440,21 @@ long do_fork(unsigned long clone_flags,
                        wake_up_new_task(p, clone_flags);
                }

+               ptrace_pid_vnr = nr;
+               if (unlikely(p->parent != p->real_parent)) {
+                       rcu_read_lock();
+                       ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
+                       rcu_read_unlock();
+               }
                tracehook_report_clone_complete(trace, regs,
-                                               clone_flags, nr, p);
+                                               clone_flags,
+                                               ptrace_pid_vnr, p);

                if (clone_flags & CLONE_VFORK) {
                        freezer_do_not_count();
                        wait_for_completion(&vfork);
                        freezer_count();
-  ...
From: Oleg Nesterov
Date: Friday, March 26, 2010 - 5:00 am

I guess it doesn't work because we need to fix strace, see "strace doesn't

At first glance, I don't see other problems, except sometimes the reported

Yes, this is what I meant.

But we should not do this in do_fork().


But once again. This change fixes the value in "tracee->ptrace_message == newpid",
but a quick grep shows that strace-4.5.19 doesn't use PTRACE_GETEVENTMSG at all.

Oleg.

--

From: Matt Helsley
Date: Friday, March 26, 2010 - 5:46 am

I'm puzzled. If not here, where should we do this? Or are you saying
ptrace should take a reference to the pid, store it, and get the vnr during
PTRACE_GETEVENTMSG? (drop the reference at detach or when a new pid reference

You are correct. However strace and gdb aren't necessarily the only users
of ptrace so wouldn't it still be good to fix this?

Cheers,
	-Matt Helsley
--

From: Oleg Nesterov
Date: Friday, March 26, 2010 - 6:34 am

Ah, no, sorry.

I meant tracehook_report_clone_complete should do this under "if (trace)".
And we need a helper to get the right pid, it could be used

Yes, agreed.

Oh. The only problem is utrace patches in -mm. I mean the possible textual
conflicts...

Oleg.

--

From: Matt Helsley
Date: Friday, March 26, 2010 - 4:53 am

Yup. strace would need to be modified to use that. I tried that and it still
won't work -- I seem to recall it didn't work because strace uses pid values
obtained from the wait syscall too. To make it work we'd need to be able to
translate those pids in userspace. That's do-able from userspace if you trace
all forks descending from the pidns init task. But it's not do-able for
simple attaches. That's why I was thinking Eric's setns() might be able to
help if strace used it to enter the tracee's pid namespace whenever we need to.

gdb often doesn't use the same methods but has similar problems with pid
namespaces.

Cheers,
	-Matt Helsley
--

From: Grzegorz Nosek
Date: Friday, March 26, 2010 - 5:45 am

Hmm, is there a good reason why strace does not use the data explicitly
provided by the kernel but instead second-guesses it from syscall return
values? I don't know anything about ptrace, really, but I'd expect the
kernel to provide the tracer with out-of-band information otherwise
taken from clone/waitpid/other syscalls?

Best regards,
 Grzegorz Nosek
--

From: Matt Helsley
Date: Friday, March 26, 2010 - 5:54 am

strace isn't linux-only. Checking the syscall return values may be seen
as being more portable. At least that's my guess. That said there are
plenty of #ifdefs in strace and patching it to use GETEVENTMSG is quite
a small patch.

However, as I said, that still doesn't "fix" strace so that it can
be used to trace tasks in child pid namespaces. Especially when the

I don't think the kernel provides special out-of-band methods for fetching
pids related to traced tasks except during fork and clone. Not wait*(). The
rest of ptrace tends to be focused on reading/writing registers/memory and
managing signal delivery.

Cheers,
	-Matt Helsley
--

From: Oleg Nesterov
Date: Friday, March 26, 2010 - 6:56 am

Not really, unless I missed something. See my another email, strace
doesn't even use the notification from do_fork(), not only it doesn't

Yes. Looks like, the necessary change in kernel is simple (and btw,
it is well known fact the reported pid is not ns-friendly). What

Hmm... I guess, you mean setns() idea? Otherwise, I _think_ that the

Could you clarify? I think wait*() is already namespace-friendly?

Oleg.

--

From: Oleg Nesterov
Date: Friday, March 26, 2010 - 6:47 am

I know almost nothing too. But _iirc_, this all is even worse, strace
doesn't necessary gets the notification from do_fork(), at least
another quick grep shows it doesn't use PTRACE_O_TRACEFORK.

This is another reason why the change in do_fork() can't really help
right now.

Oleg.

--

From: Roland McGrath
Date: Monday, April 5, 2010 - 8:44 pm

(I've been away for a couple of weeks.)
I concur with the things Oleg's said in this thread.

As to what's "correct" for the kernel in theory, it would certainly make
sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
reporting any PID as such.  The wait and SIGCHLD code already does this, so
that would be consistent.  Off hand I don't see anything other than
tracehook_report_clone{,_complete}() that sees the wrong value now.

Fixing that requires a bit of hair.  The simple and clean approach is to
just have the tracehook calls (i.e. ptrace layer) extract the PID from the
task_struct using the desired pid_ns.  The trouble there is that the
tracehook_report_clone_complete() call is made when that task_struct is no
longer guaranteed to be valid.  The contrary approach of extracting the
appropriate value for the tracer earlier breaks the clean layering because
the fork.c code really should not know at all that ->parent->nsproxy is the
place to look for what values to pass to tracehook calls.  I guess the
simple and clean fix is to get_task_struct() before wake_up_new_task()
and put_task_struct() after tracehook_report_clone_complete().  That does
add some gratuitous atomic incr/decr overhead, though.

None of this has much of anything to do with strace, of course.  As I've
said, I don't see anything other than the PTRACE_GETEVENTMSG value for
PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel.  As
Oleg said, strace doesn't use that at all.  (This is not the place to
discuss the details of strace further.)


Thanks,
Roland
--

From: Matt Helsley
Date: Tuesday, April 6, 2010 - 6:53 am

It's also possible to take an extra reference to the struct pid and pass
that to the tracehook. That and the pid_ns of the tracer receiving the pid
is all we'll ever need inside the tracehook layer. The only advantage, I

Also true.

Of course my suggestion of holding the pid reference won't avoid adding

Also, looking at proposed changes (utrace and Eric Biederman's setns())
storing a pid nr rather than a reference to a task struct or struct pid
probably won't be correct.

In the case of Eric Biederman's setns(), if capable of changing pid namespace,
we could have:

	Traced				Tracer
	fork()
					... (an arbitrary amount of time passes)
					setns()
					ptrace(GETEVENTMSG)

At which point returning a static pid number held in the message field
produces the wrong pid. Also, if utrace allows multiple tracers and they each
exist in a different namespace then storing a pid nr isn't going to work.

So my hunch is, in the long run, we'll need to hold a reference there and
drop it when the last tracer detaches or the next event would have
overwritten the "message". The amount of time it would need to be held
suggests to me that we should refer to a struct pid and not a task struct
if possible.

Cheers,
	-Matt Helsley
--

From: Oleg Nesterov
Subject:
Date: Tuesday, April 6, 2010 - 7:36 am

Yes, but utrace is simple. ptrace_report_clone() does

	ctx->eventmsg = child->pid;

we should fix this line and that is all, afaics. Every tracer has a

Without utrace only one tracer is possible.

So, I think we should either change do_fork() to get the right tracee_pid_nr,
or add get/put into do_fork() and change tracehooks as Roland suggested.

Oleg.

--

From: Eric W. Biederman
Subject:
Date: Tuesday, April 6, 2010 - 8:17 am

For a unicast path where the is no danger of the destination process changing
I don't see why we can't compute the userspace pid_nr.  It only get's tricky
for things like broadcast signals (pgrp, session) when we don't who the
final recipient process will be.

I think that only gets truly bad in the case of unix domain sockets.

Eric
--

From: Eric W. Biederman
Subject:
Date: Tuesday, April 6, 2010 - 8:13 am

My setns work has demonstrated that even for entering a namespace we
never ever need to change the struct pid of a task.  setns has no other
bearing on this problem then to say there is no foreseeable reason to

Forget that.  The pid namespace was architected so that we can ptrace a process

No.  A processes always sees pids from the context of it's original pid
namespace.  All setns does is affect which pid namespace children will
be native in.


Eric
--

From: Matt Helsley
Subject:
Date: Tuesday, April 6, 2010 - 8:29 am

OK, good. So we can resolve the tasks/struct pids within the tracehook
and be done with it. Thanks Eric!

Cheers,
	-Matt Helsley
--

Previous thread: [v3] New ldisc for WiLink7.0 by pavan_savoy on Tuesday, March 23, 2010 - 1:41 pm. (1 message)

Next thread: [watchdog] combine nmi_watchdog and softlockup by Don Zickus on Tuesday, March 23, 2010 - 2:33 pm. (17 messages)