Re: Q: perf_event && event->owner

Previous thread: Warning Code: ID67565434. by Webmail © 2010 Team Upgrade on Monday, November 8, 2010 - 7:49 am. (1 message)

Next thread: [PATCH v1.2 1/4] lib: hex2bin converts ascii hexadecimal string to binary by Mimi Zohar on Monday, November 8, 2010 - 8:30 am. (1 message)
From: Oleg Nesterov
Date: Monday, November 8, 2010 - 7:56 am

Hello.

I am trying to understand the usage of hw-breakpoints in arch_ptrace().
ptrace_set_debugreg() and related code looks obviously racy. Nothing
protects us against flush_ptrace_hw_breakpoint() called by the dying
tracee. Afaics we can leak perf_event or use the already freed memory
or both.

Am I missed something?

Looking into the git history, I don't even know which patch should be
blamed (if I am right), there were too many changes. I noticed that
2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
search into helper" moved the PF_EXITING check from find_get_context().
This check coould help if sys_ptrace() races with SIGKILL, but it was
racy anyway.

It is not clear to me what should be done. Looking more, I do not
understand the scope of perf_event/ctx at all, sys_perf_event_open()
looks wrong too, see the next email I am going to send.

Oleg.

--

From: Oleg Nesterov
Date: Monday, November 8, 2010 - 7:57 am

I am puzzled by PF_EXITING check in find_lively_task_by_vpid().

How can it help? The task can call do_exit() right after the check.

And why do we need it? The comment only says "Can't attach events to
a dying task". Maybe it tries protect sys_perf_event_open() against
perf_event_exit_task_context(), but it can't.

c93f7669 "perf_counter: Fix race in attaching counters to tasks and
exiting" says:

    There is also a race between perf_counter_exit_task and
    find_get_context; this solves the race by moving the get_ctx that
    was in perf_counter_alloc into the locked region in find_get_context,
    so that once find_get_context has got the context for a task, it
    won't get freed even if the task calls perf_counter_exit_task.

OK, the code was changed since that commit, but afaics "it won't be
freed" is still true.

However,

    It
    doesn't matter if new top-level (non-inherited) counters get attached
    to the context after perf_counter_exit_task has detached the context
    from the task.  They will just stay there and never get scheduled in
    until the counters' fds get closed, and then perf_release will remove
    them from the context and eventually free the context.

This looks wrong. perf_release() does free_event()->put_ctx(), this pairs
get_ctx() after alloc_perf_context().

But __perf_event_init_context() sets ctx->refcount = 1, and I guess this
reference should be dropped by ctx->task ? If yes, then it is not OK to
attach the event after sys_perf_event_open().

No?


Hmm. jump_label_inc/dec looks obviously racy too. Say, free_event() races
with perf_event_alloc(). There is a window between atomic_xxx() and
jump_label_update(), afaics it is possible to call jump_label_disable()
when perf_task_events/perf_swevent_enabled != 0.

Oleg.

--

From: Oleg Nesterov
Date: Monday, November 8, 2010 - 7:57 am

Another thing I can't understand, event->owner/owner_entry.

Say, some thread calls sys_perf_event_open() and creates the event.
It becomes its owner. Now this thread exits, but fd/event are still
here, and event->owner refers to the dead task_struct.

ptrace looks even more strange. Debugger can attach the breakpoint
to the tracee and then exit/detach. ->ptrace_bps events still point
to the same (may be dead) task. Even if another debugger attaches
and reuses these events.

And for what? Afaics, this is only used by PR_TASK_PERF_EVENTS_xxABLE.
Looks like, tools/perf/ used prctl() in the past. Perhaps this API
can die now and we can kill ->owner/owner_entry?

Oleg.

--

From: Frederic Weisbecker
Date: Monday, November 8, 2010 - 1:11 pm

Hmm, it seems to me that the last reference to the events are
put in __perf_event_exit_task, and then free_event() is called
there, which rcu queues the event to be released.




Hmm, in this case ptrace_bps will continue to trigger on the task
they were applied.

On the other hand, you're right, I'm not sure that the debugger is
the correct owner for the breakpoints.
I think it works though, looking at perf_event_create_kernel_counter():

	event->owner = current;
	get_task_struct(current);

(current is the debugger)

On perf_event_release_kernel():

	put_task_struct(event->owner);

So even if the debugger dies, we keep a valid owner, it works but just makes
few sense as the debugger can change.
Perhaps the real owner should be the task on which we attach our breakpoint.

What do you think?

--

From: Peter Zijlstra
Date: Monday, November 8, 2010 - 1:41 pm

No the point of event->owner is to point to the task that creates the
event, not the task we possibly attach it to (that should be reachable
through event->ctx->task).

As to removing event->owner as Oleg suggests, its a published ABI and
there might be people using it.

The use-case is a monitor thread wanting to stop all monitoring it
initiated, for example because its wants to synchronize various counters
attached to different tasks etc..


--

From: Oleg Nesterov
Date: Tuesday, November 9, 2010 - 9:18 am

This was my main question.

It's a pity ;) This ABI doesn't look very nice, but OK.

Oleg.

--

From: Oleg Nesterov
Date: Tuesday, November 9, 2010 - 8:57 am

I think no, in this case sys_perf_event_open() owns the event. IOW,

I am not saying this is buggy. But it looks very strange to me.

If the creator of perf_event dies, nobody can use its ->perf_event_list
anyway. What is the point to keep the reference to the dead task_struct
and preserve this ->perf_event_list?

And why do we need ->owner at all? Afaics, it is _only_ needed to find
->perf_event_mutex in perf_event_release_kernel(). And this mutex only
protects ->perf_event_list (mostly for prctl).

Of course, I understand that it is not completely trivial to change this.
The exiting creator can clear its ->perf_event_list and set
event->owner = NULL, but then perf_event_release_kernel() should

Yes, it works, but I am not sure about "valid" above ;) Even if the previous
debugger doesn't exit.

And. Suppose that the new debugger attaches and reuses ->ptrace_bps[],
everything works.

Now, the former debugger does prctl(PR_TASK_PERF_EVENTS_DISABLE) and
suddenly bps stop working.

Not to mention this looks racy. Can't prctl() doing perf_event_disable/enable

Not sure... What for?

In any case, I don't think the tracee should "control" this event.

Oleg.

--

From: Peter Zijlstra
Date: Tuesday, November 9, 2010 - 9:56 am

But when the owner dies it will close all its fds, which means it will
clear its tsk->perf_event_list, no? (With exception of the case where
the fd was passed through a unix-socket to another process).

--

From: Oleg Nesterov
Date: Tuesday, November 9, 2010 - 9:58 am

fork(), pthread_create(). Only __fput() calls ->release, when the last
reference to file goes away.

And ptrace(), it doesn't use sys_perf_event_open() to create the event.

Oleg.

--

From: Peter Zijlstra
Date: Tuesday, November 9, 2010 - 10:07 am

Ah,.. quite so. So how about we explicitly destroy the list when the

Right, I guess it uses kernel based things, I guess we could not add
kernel based counters to the list.
--

From: Oleg Nesterov
Date: Tuesday, November 9, 2010 - 10:42 am

Yes, I think it makes sense to destroy the list and set ->owner = NULL.
If we reset the owner, we can also avoid get_task_struct().

The only problem is perf_event_release_kernel(), it can race with the
exiting event->owner. It can do get_task_struct() under rcu lock temporary,

Agreed, another case when event->owner should be NULL.



Hmm. With or without these changes. Shouldn't perf_event_release_kernel()
remove the event from list before anything else? Otherwise, afaics a thread
which does close(event_fd) can race with creator doing prctl(EVENTS_ENABLE),
no?

Oleg.

--

From: Peter Zijlstra
Date: Tuesday, November 9, 2010 - 11:01 am

I think you're right, how about something like this?

---
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2234,11 +2234,6 @@ int perf_event_release_kernel(struct per
 	raw_spin_unlock_irq(&ctx->lock);
 	mutex_unlock(&ctx->mutex);
 
-	mutex_lock(&event->owner->perf_event_mutex);
-	list_del_init(&event->owner_entry);
-	mutex_unlock(&event->owner->perf_event_mutex);
-	put_task_struct(event->owner);
-
 	free_event(event);
 
 	return 0;
@@ -2254,6 +2249,12 @@ static int perf_release(struct inode *in
 
 	file->private_data = NULL;
 
+	if (event->owner) {
+		mutex_lock(&event->owner->perf_event_mutex);
+		list_del_init(&event->owner_entry);
+		mutex_unlock(&event->owner->perf_event_mutex);
+	}
+
 	return perf_event_release_kernel(event);
 }
 
@@ -5677,7 +5678,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	mutex_unlock(&ctx->mutex);
 
 	event->owner = current;
-	get_task_struct(current);
+
 	mutex_lock(&current->perf_event_mutex);
 	list_add_tail(&event->owner_entry, &current->perf_event_list);
 	mutex_unlock(&current->perf_event_mutex);
@@ -5745,12 +5746,6 @@ perf_event_create_kernel_counter(struct 
 	++ctx->generation;
 	mutex_unlock(&ctx->mutex);
 
-	event->owner = current;
-	get_task_struct(current);
-	mutex_lock(&current->perf_event_mutex);
-	list_add_tail(&event->owner_entry, &current->perf_event_list);
-	mutex_unlock(&current->perf_event_mutex);
-
 	return event;
 
 err_free:
@@ -5901,8 +5896,16 @@ static void perf_event_exit_task_context
  */
 void perf_event_exit_task(struct task_struct *child)
 {
+	struct perf_event *event, *tmp;
 	int ctxn;
 
+	mutex_lock(&child->perf_event_mutex);
+	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
+				 owner_entry) {
+		list_del_init(&event->owner_entry);
+	}
+	mutex_unlock(&child->perf_event_mutex);
+
 	for_each_task_context_nr(ctxn)
 ...
From: Oleg Nesterov
Date: Tuesday, November 9, 2010 - 11:57 am

I need to read it with a fresh head ;)


Agreed, it is better to do this in perf_release().

But, this can use the already freed task_struct, event->owner.

Either sys_perf_open() should do get_task_struct() like we currently
do, or perf_event_exit_task() should clear event->owner and then
perf_release() should do something like

	rcu_read_lock();
	owner = event->owner;
	if (owner)
		get_task_struct(owner);
	rcu_read_unlock();

	if (owner) {
		mutex_lock(&event->owner->perf_event_mutex);
		list_del_init(&event->owner_entry);
		mutex_unlock(&event->owner->perf_event_mutex);
		put_task_struct(owner);
	}

Probably this can be simplified...

Oleg.

--

From: Peter Zijlstra
Date: Tuesday, November 9, 2010 - 12:16 pm

You're right, and I seem to suffer from a similar problem, will respin
tomorrow.
--

From: Peter Zijlstra
Date: Wednesday, November 10, 2010 - 8:17 am

I think that's still racy, suppose we do:

void perf_event_exit_task(struct task_struct *child)
{
	struct perf_event *event, *tmp;
	int ctxn;

	mutex_lock(&child->perf_event_mutex);
	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
				 owner_entry) {
		event->owner = NULL;
		list_del_init(&event->owner_entry);
	}
	mutex_unlock(&child->perf_event_mutex);

	for_each_task_context_nr(ctxn)
		perf_event_exit_task_context(child, ctxn);
}


and the close() races with an exit, then couldn't we observe
event->owner after the last put_task_struct()? In which case a
get_task_struct() will result in a double-free.


--

From: Oleg Nesterov
Date: Wednesday, November 10, 2010 - 8:44 am

I think no. Note that we do not just free task_struct via rcu callback.
Instead, delayed_put_task_struct() drops the (may be) last reference.

But the code is racy, yes. owner != NULL case is fine. But
perf_release() can see event->owner == NULL before list_del() was
completed. perf_event_exit_task() needs wmb() in between, I think.

Oleg.

--

From: Peter Zijlstra
Date: Friday, November 12, 2010 - 8:48 am

Utter paranoia took over and I'm still not sure its solid...


---
Subject: perf: Fix owner-list vs exit
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue, 09 Nov 2010 19:01:43 +0100

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1289325703.2191.60.camel@laptop>
---
 kernel/perf_event.c |   63 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 12 deletions(-)

Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2234,11 +2234,6 @@ int perf_event_release_kernel(struct per
 	raw_spin_unlock_irq(&ctx->lock);
 	mutex_unlock(&ctx->mutex);
 
-	mutex_lock(&event->owner->perf_event_mutex);
-	list_del_init(&event->owner_entry);
-	mutex_unlock(&event->owner->perf_event_mutex);
-	put_task_struct(event->owner);
-
 	free_event(event);
 
 	return 0;
@@ -2251,9 +2246,43 @@ EXPORT_SYMBOL_GPL(perf_event_release_ker
 static int perf_release(struct inode *inode, struct file *file)
 {
 	struct perf_event *event = file->private_data;
+	struct task_struct *owner;
 
 	file->private_data = NULL;
 
+	rcu_read_lock();
+	owner = ACCESS_ONCE(event->owner);
+	/*
+	 * Matches the smp_wmb() in perf_event_exit_task(). If we observe
+	 * !owner it means the list deletion is complete and we can indeed
+	 * free this event, otherwise we need to serialize on
+	 * owner->perf_event_mutex.
+	 */
+	smp_read_barrier_depends();
+	if (owner) {
+		/*
+		 * Since delayed_put_task_struct() also drops the last
+		 * task reference we can safely take a new reference
+		 * while holding the rcu_read_lock().
+		 */
+		get_task_struct(owner);
+	}
+	rcu_read_unlock();
+
+	if (owner) {
+		mutex_lock(&owner->perf_event_mutex);
+		/*
+		 * We have to re-check the event->owner field, if it is cleared
+		 * we raced with perf_event_exit_task(), acquiring the ...
From: Oleg Nesterov
Date: Friday, November 12, 2010 - 11:49 am

Thanks Peter!


--

From: tip-bot for Peter Zijlstra
Date: Thursday, November 18, 2010 - 7:09 am

Commit-ID:  8882135bcd332f294df5455747ea43ba9e6f77ad
Gitweb:     http://git.kernel.org/tip/8882135bcd332f294df5455747ea43ba9e6f77ad
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 9 Nov 2010 19:01:43 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 18 Nov 2010 13:18:46 +0100

perf: Fix owner-list vs exit

Oleg noticed that a perf-fd keeping a reference on the creating task
leads to a few funny side effects.

There's two different aspects to this:

  - kernel based perf-events, these should not take out
    a reference on the creating task and appear on the task's
    event list since they're not bound to fds nor visible
    to userspace.

  - fork() and pthread_create(), these can lead to the creating
    task dying (and thus the task's event-list becomming useless)
    but keeping the list and ref alive until the event is closed.

Combined they lead to malfunction of the ptrace hw_tracepoints.

Cure this by not considering kernel based perf_events for the
owner-list and destroying the owner-list when the owner dies.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Oleg Nesterov <oleg@redhat.com>
LKML-Reference: <1289576883.2084.286.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |   63 +++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index f818d9d..671f6c8 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2235,11 +2235,6 @@ int perf_event_release_kernel(struct perf_event *event)
 	raw_spin_unlock_irq(&ctx->lock);
 	mutex_unlock(&ctx->mutex);
 
-	mutex_lock(&event->owner->perf_event_mutex);
-	list_del_init(&event->owner_entry);
-	mutex_unlock(&event->owner->perf_event_mutex);
-	put_task_struct(event->owner);
-
 	free_event(event);
 
 	return 0;
@@ -2252,9 +2247,43 @@ ...
From: Frederic Weisbecker
Date: Monday, November 8, 2010 - 11:41 am

But I don't understand how ptrace_set_debugreg() and flush_old_exec() can
happen at the same time. The parent can only do the ptrace request when
the child is stopped, right? But it can't be stopped in flush_old_exec()...?

Not sure how any race can happen here. I am certainly missing something obvious.

Thanks.

--

From: Oleg Nesterov
Date: Monday, November 8, 2010 - 12:18 pm

Yes. But nothing can "pin" TASK_TRACED.

We know that a) the tracee was stopped() when sys_ptrace() was called
and b) its task_struct can't go away. That is all. The tracee can be
killed at any moment, and sys_ptrace() can race with with

Perhaps ;) Or, it is quite possible I missed something, I never read
this code before and it is certainly not trivial.

Oleg.

--

Previous thread: Warning Code: ID67565434. by Webmail © 2010 Team Upgrade on Monday, November 8, 2010 - 7:49 am. (1 message)

Next thread: [PATCH v1.2 1/4] lib: hex2bin converts ascii hexadecimal string to binary by Mimi Zohar on Monday, November 8, 2010 - 8:30 am. (1 message)