On Sun, Mar 21, 2010 at 08:50:44PM +0100, Grzegorz Nosek wrote:
I'm suprised strace of ls works across pid namespaces. I've been looking
at strace and it seemed to me that one kernel change and a bunch of strace
changes are needed to make strace'ing in child pid namespaces work. Eric
Biederman's setns() patches also might help.
Can you get a little farther with the kernel fix below?
Fix incorrect pid namespace used by ptrace during fork/vfork/clone
pid namespaces are not used properly by ptrace in do_fork(). When tracing
parent != real_parent because parent is the tracing task. Yet the pid in
the real_parent's namespace is being used in do_fork():
nr = task_pid_vnr(p); /* uses real_parent's pid namespace */
if (clone_flags & CLONE_PARENT_SETTID)
put_user(nr, parent_tidptr); /* "real_parent_tidptr" */
...
tracehook_report_clone_complete(trace, regs,
clone_flags, nr, p); /* ptrace broken */
if (clone_flags & CLONE_VFORK) {
freezer_do_not_count();
wait_for_completion(&vfork);
freezer_count();
tracehook_report_vfork_done(p, nr); /* ptrace broken */
In this case re-using the value in nr is wrong.
This bug can be seen by attaching to an already-running task
in a descendent namespace with strace -f. When the traced task forks
strace won't attach to the new task properly because it sees the
incorrect pid. For example, if root is running on two VTs and
root@VTN# indicates switching to VT N:
root@VT1# ns_exec -cp /bin/bash
root@VT1# echo $$
1
root@VT2# strace -f -e fork,vfork,clone -p <pid of bash>
Process 14518 attached - interrupt to quit
root@VT1# /bin/bash
<stops -- new bash shell does not respond to input>
root@VT2#
clone(Process 15 attached ... ) = 15
Process 15044 attached (waiting for parent)
Process 14518 suspended
<no more output>
<hit ctrl-c>
root@VT1# echo $$
...Matt, Grzegorz is telling you that 'strace -ff lxc-start ls' is broken but 'lxc-start strace -ff ls' works as expected => strace of ls doesn't work across pid namespaces. No surprise. Cheers. -- Gregory Kurz gkurz@fr.ibm.com Software Engineer @ IBM/Meiosys http://www.ibm.com Tel +33 (0)534 638 479 Fax +33 (0)561 400 420 "Anarchy is about taking complete responsibility for yourself." Alan Moore. --
No, not really. Attaching from outside to a shell running in a container and running a command yields: | rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0 | pipe([3, 4]) = 0 | clone(Process 2581 attached (waiting for parent) | Process 190 attached Without the patch the order of reported pids is reversed (and at least with the patched kernel the outside pid is consistently reported first) | child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7713708) = 190 | [pid 2549] setpgid(190, 190) = 0 | [pid 2549] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 | [pid 2549] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 | [pid 2549] close(3) = 0 | [pid 2549] close(4) = 0 | [pid 2549] rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0 | [pid 2549] ioctl(255, TIOCSPGRP, [190]) = 0 | [pid 2549] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0 | [pid 2549] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 | [pid 2549] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 | [pid 2549] waitpid(-1, Process 2549 suspended (the shell hangs here) ^C <unfinished ...> | Process 2549 detached | Process 2581 detached | Process 190 detached (the command executes here normally). Best regards, Grzegorz Nosek --
Yes, this is broken. More precisely, this wasn't even supposed to work. Even stracing of the sub-init itself (or global init btw) has problems, Yes. First of all, tracehook_report_clone_complete() reports the wrong pid nr, as it seen inside the init's namespace. This is easy to fix, but I doubt this can help. IIRC strace doesn't use PTRACE_GETEVENTMSG at all, it looks at eax which patch? Oleg. --
Is this impossible/very hard to do cleanly? I understand that container's
init becomes vulnerable to signals sent from root-owned processes in the
container. If so, the impact of this issue should be quite limited, no?
Strace'ing processes across pidns boundary would be really useful in day
to day administrative work but if it's unfixable, I guess at least
preventing strace from attaching to processes in a descendant pidns
The patch below posted by Matt. AIUI, it fixes the
tracehook_report_clone_complete() part, which results in an observable
change in strace's behaviour (not that it makes strace work, though).
Anyway, are there any remaining issues on the kernel side or does strace
have to be taught about pid namespaces?
Best regards,
Grzegorz Nosek
diff --git a/kernel/fork.c b/kernel/fork.c
index 3a65513..7946ea6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1404,6 +1404,7 @@ long do_fork(unsigned long clone_flags,
*/
if (!IS_ERR(p)) {
struct completion vfork;
+ int ptrace_pid_vnr;
trace_sched_process_fork(current, p);
@@ -1439,14 +1440,21 @@ long do_fork(unsigned long clone_flags,
wake_up_new_task(p, clone_flags);
}
+ ptrace_pid_vnr = nr;
+ if (unlikely(p->parent != p->real_parent)) {
+ rcu_read_lock();
+ ptrace_pid_vnr = task_pid_nr_ns(p, p->parent->nsproxy->pid_ns);
+ rcu_read_unlock();
+ }
tracehook_report_clone_complete(trace, regs,
- clone_flags, nr, p);
+ clone_flags,
+ ptrace_pid_vnr, p);
if (clone_flags & CLONE_VFORK) {
freezer_do_not_count();
wait_for_completion(&vfork);
freezer_count();
- ...I guess it doesn't work because we need to fix strace, see "strace doesn't At first glance, I don't see other problems, except sometimes the reported Yes, this is what I meant. But we should not do this in do_fork(). But once again. This change fixes the value in "tracee->ptrace_message == newpid", but a quick grep shows that strace-4.5.19 doesn't use PTRACE_GETEVENTMSG at all. Oleg. --
I'm puzzled. If not here, where should we do this? Or are you saying ptrace should take a reference to the pid, store it, and get the vnr during PTRACE_GETEVENTMSG? (drop the reference at detach or when a new pid reference You are correct. However strace and gdb aren't necessarily the only users of ptrace so wouldn't it still be good to fix this? Cheers, -Matt Helsley --
Ah, no, sorry. I meant tracehook_report_clone_complete should do this under "if (trace)". And we need a helper to get the right pid, it could be used Yes, agreed. Oh. The only problem is utrace patches in -mm. I mean the possible textual conflicts... Oleg. --
Yup. strace would need to be modified to use that. I tried that and it still won't work -- I seem to recall it didn't work because strace uses pid values obtained from the wait syscall too. To make it work we'd need to be able to translate those pids in userspace. That's do-able from userspace if you trace all forks descending from the pidns init task. But it's not do-able for simple attaches. That's why I was thinking Eric's setns() might be able to help if strace used it to enter the tracee's pid namespace whenever we need to. gdb often doesn't use the same methods but has similar problems with pid namespaces. Cheers, -Matt Helsley --
Hmm, is there a good reason why strace does not use the data explicitly provided by the kernel but instead second-guesses it from syscall return values? I don't know anything about ptrace, really, but I'd expect the kernel to provide the tracer with out-of-band information otherwise taken from clone/waitpid/other syscalls? Best regards, Grzegorz Nosek --
strace isn't linux-only. Checking the syscall return values may be seen as being more portable. At least that's my guess. That said there are plenty of #ifdefs in strace and patching it to use GETEVENTMSG is quite a small patch. However, as I said, that still doesn't "fix" strace so that it can be used to trace tasks in child pid namespaces. Especially when the I don't think the kernel provides special out-of-band methods for fetching pids related to traced tasks except during fork and clone. Not wait*(). The rest of ptrace tends to be focused on reading/writing registers/memory and managing signal delivery. Cheers, -Matt Helsley --
Not really, unless I missed something. See my another email, strace doesn't even use the notification from do_fork(), not only it doesn't Yes. Looks like, the necessary change in kernel is simple (and btw, it is well known fact the reported pid is not ns-friendly). What Hmm... I guess, you mean setns() idea? Otherwise, I _think_ that the Could you clarify? I think wait*() is already namespace-friendly? Oleg. --
I know almost nothing too. But _iirc_, this all is even worse, strace doesn't necessary gets the notification from do_fork(), at least another quick grep shows it doesn't use PTRACE_O_TRACEFORK. This is another reason why the change in do_fork() can't really help right now. Oleg. --
(I've been away for a couple of weeks.)
I concur with the things Oleg's said in this thread.
As to what's "correct" for the kernel in theory, it would certainly make
sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
reporting any PID as such. The wait and SIGCHLD code already does this, so
that would be consistent. Off hand I don't see anything other than
tracehook_report_clone{,_complete}() that sees the wrong value now.
Fixing that requires a bit of hair. The simple and clean approach is to
just have the tracehook calls (i.e. ptrace layer) extract the PID from the
task_struct using the desired pid_ns. The trouble there is that the
tracehook_report_clone_complete() call is made when that task_struct is no
longer guaranteed to be valid. The contrary approach of extracting the
appropriate value for the tracer earlier breaks the clean layering because
the fork.c code really should not know at all that ->parent->nsproxy is the
place to look for what values to pass to tracehook calls. I guess the
simple and clean fix is to get_task_struct() before wake_up_new_task()
and put_task_struct() after tracehook_report_clone_complete(). That does
add some gratuitous atomic incr/decr overhead, though.
None of this has much of anything to do with strace, of course. As I've
said, I don't see anything other than the PTRACE_GETEVENTMSG value for
PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel. As
Oleg said, strace doesn't use that at all. (This is not the place to
discuss the details of strace further.)
Thanks,
Roland
--
It's also possible to take an extra reference to the struct pid and pass that to the tracehook. That and the pid_ns of the tracer receiving the pid is all we'll ever need inside the tracehook layer. The only advantage, I Also true. Of course my suggestion of holding the pid reference won't avoid adding Also, looking at proposed changes (utrace and Eric Biederman's setns()) storing a pid nr rather than a reference to a task struct or struct pid probably won't be correct. In the case of Eric Biederman's setns(), if capable of changing pid namespace, we could have: Traced Tracer fork() ... (an arbitrary amount of time passes) setns() ptrace(GETEVENTMSG) At which point returning a static pid number held in the message field produces the wrong pid. Also, if utrace allows multiple tracers and they each exist in a different namespace then storing a pid nr isn't going to work. So my hunch is, in the long run, we'll need to hold a reference there and drop it when the last tracer detaches or the next event would have overwritten the "message". The amount of time it would need to be held suggests to me that we should refer to a struct pid and not a task struct if possible. Cheers, -Matt Helsley --
Yes, but utrace is simple. ptrace_report_clone() does ctx->eventmsg = child->pid; we should fix this line and that is all, afaics. Every tracer has a Without utrace only one tracer is possible. So, I think we should either change do_fork() to get the right tracee_pid_nr, or add get/put into do_fork() and change tracehooks as Roland suggested. Oleg. --
For a unicast path where the is no danger of the destination process changing I don't see why we can't compute the userspace pid_nr. It only get's tricky for things like broadcast signals (pgrp, session) when we don't who the final recipient process will be. I think that only gets truly bad in the case of unix domain sockets. Eric --
My setns work has demonstrated that even for entering a namespace we never ever need to change the struct pid of a task. setns has no other bearing on this problem then to say there is no foreseeable reason to Forget that. The pid namespace was architected so that we can ptrace a process No. A processes always sees pids from the context of it's original pid namespace. All setns does is affect which pid namespace children will be native in. Eric --
OK, good. So we can resolve the tasks/struct pids within the tracehook and be done with it. Thanks Eric! Cheers, -Matt Helsley --
