do_task_stat()->task_times() can race with getrusage(), they both can
try to update task->prev_Xtime at the same time.
Remove this bit of d180c5bc "sched: Introduce task_times() to replace
task_{u,s}time()".
See also the next patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/array.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- 34-rc1/fs/proc/array.c~PROC_4_DTS_TASK_TIMES_IS_RACY 2010-03-24 19:53:23.000000000 +0100
+++ 34-rc1/fs/proc/array.c 2010-03-24 19:57:37.000000000 +0100
@@ -449,7 +449,8 @@ static int do_task_stat(struct seq_file
if (!whole) {
min_flt = task->min_flt;
maj_flt = task->maj_flt;
- task_times(task, &utime, &stime);
+ utime = task->utime;
+ stime = task->stime;
gtime = task->gtime;
}
--
One of the reasons for adding this accuracy was to avoid sampling based noise and errors that occur with utime and stime. As long as there is no preemption during the assignment, I think we should be OK. I see two options 1. Disable preemption around assignment 2. Remove task_times() from getrusage() -- Three Cheers, Balbir --
We want more accurate and what more important monotonic values in getrusage. Stanislaw --
No. Preemption will not help here. Stanislaw --
I don't think preemp_disable() can help. Probably we can use task_lock(). As for do_task_stat()->thread_group_times(), I think we can make it rc-safe without breaking /bin/top. 1. add spin_lock_irqsave(&sig->cputimer.lock) around sig->prev_Xtime = max(...) 2. Add a couple of barriers into thread_group_cputime() and __exit_signal() so that without ->siglock we can never overestimate utime/stime if we race with exit. If we underestimate these values, this should be fine: - the error can't be "systematic", the next read from /prod/pid/stat will see the updated values - the prev_Xtime logic in thread_group_times() ensures the reported time can never go back. IOW: at worse, cat /proc/pid/stat can miss the time which the exited thread spent on CPU after the previous read of /proc/pid/stat. This looks absolutely harmless, the next read will see this time. Probably we can even detect this case if we look at sig->nr_threads and retry. I'll try to make patches unless someone has a better idea. I just can't accept the fact that we are doing while_each_thread() under ->siglock here ;) Oleg. --
The easiest way to avoid that races is move all calls to task_times() and thread_group_times() inside ->siglock, but that's a bit crappy. There is also another impossible race here. On 32-bit machines reading/writing sum_exec_runtime is not atomic, IIRC ->siglock Races with __exit_signal() can lead to count Xtime values twice, first: in tsk->Xtime, second: after task exits, in sig->Xtime. As now we have sig->prev_Xtime, this no longer can break times monotonic, but still accounting times twice can be problematic. For example let assume, we have many threads and one, which consume 90% cpu-time of the process, exits and is accounted twice. Then for long period sig->prev_Xtime will show value that is much bigger than the estimated Problem is not only in do_task_stat(). We have couple other places where we iterate over all threads with ->siglock taken. Maybe we can somehow redesign things to avoid that. Maybe increase tsk>-signal->Xtime together with tsk->Xtime is not so bad idea. I'm going to think about that. Stanislaw --
I don't think so. update_curr/etc updates t->se.sum_exec_runtime without Yes sure. I dislike the do_task_stat() case because we always do this, even if this info is not needed, say, for /bin/ps. Oleg. --
Note also that nobody else in /fs/proc needs ->siglock. Except do_io_accounting(), but in this case the user-space explicitly asks for this info. OK, This is V2. Still RFC, although I think 1/4 makes sense in any case. Please comment. Again, I am more or less sure these changes are "correct", but I don't know what /bin/top can think ;) I don't really like the fact thread_group_times() takes cputimer.lock, but imho lock_task_sighand() in do_task_stat() is much worse. Oleg. --
- With the recent changes tsk->signal is stable, we can read it
first and avoid the initialization from INIT_CPUTIME.
- Even if tsk->signal is always valid, we still have to check it
is safe to use next_thread() under rcu_read_lock(). Currently
the code checks ->sighand != NULL, change it to use pid_alive()
which is commonly used to ensure the task wasn't unhashed before
we take rcu_read_lock().
- Change the main loop to use the while_each_thread() helper.
This change itself doesn't add the functional changes, currently
it is always called under tasklist/siglock or when we know that
this thread group is already dead (wait, coredump).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/posix-cpu-timers.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)
--- 34-rc1/kernel/posix-cpu-timers.c~cpuacct_1_thread_group_cputime_cleanup_rcu 2010-03-29 18:08:14.000000000 +0200
+++ 34-rc1/kernel/posix-cpu-timers.c 2010-03-29 18:09:15.000000000 +0200
@@ -233,31 +233,24 @@ static int cpu_clock_sample(const clocki
void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
{
- struct sighand_struct *sighand;
- struct signal_struct *sig;
+ struct signal_struct *sig = tsk->signal;
struct task_struct *t;
- *times = INIT_CPUTIME;
+ times->utime = sig->utime;
+ times->stime = sig->stime;
+ times->sum_exec_runtime = sig->sum_sched_runtime;
rcu_read_lock();
- sighand = rcu_dereference(tsk->sighand);
- if (!sighand)
+ /* make sure we can trust tsk->thread_group list */
+ if (!likely(pid_alive(tsk)))
goto out;
- sig = tsk->signal;
-
t = tsk;
do {
times->utime = cputime_add(times->utime, t->utime);
times->stime = cputime_add(times->stime, t->stime);
times->sum_exec_runtime += t->se.sum_exec_runtime;
-
- t = next_thread(t);
- } while (t != tsk);
-
- times->utime = cputime_add(times->utime, sig->utime);
- times->stime = cputime_add(times->stime, ...- change __exit_signal() to do __unhash_process() before we accumulate
the counters in ->signal
- add a couple of barriers into thread_group_cputime() and __exit_signal()
to make sure thread_group_cputime() can never account the same thread
twice if it races with exit.
If any thread T was already accounted in ->signal, next_thread() or
pid_alive() must see the result of __unhash_process(T).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/exit.c | 14 +++++++++-----
kernel/posix-cpu-timers.c | 6 ++++++
2 files changed, 15 insertions(+), 5 deletions(-)
--- 34-rc1/kernel/exit.c~cpuacct_2_thread_group_cputime_rcu_safe 2010-03-29 18:03:17.000000000 +0200
+++ 34-rc1/kernel/exit.c 2010-03-29 18:29:35.000000000 +0200
@@ -88,6 +88,8 @@ static void __exit_signal(struct task_st
rcu_read_lock_held() ||
lockdep_is_held(&tasklist_lock));
spin_lock(&sighand->siglock);
+ __unhash_process(tsk, group_dead);
+ sig->nr_threads--;
posix_cpu_timers_exit(tsk);
if (group_dead) {
@@ -111,9 +113,14 @@ static void __exit_signal(struct task_st
* The group leader stays around as a zombie as long
* as there are other threads. When it gets reaped,
* the exit.c code will add its counts into these totals.
- * We won't ever get here for the group leader, since it
- * will have been the last reference on the signal_struct.
+ *
+ * Make sure that this thread can't be accounted twice
+ * by thread_group_cputime() under rcu. If it sees the
+ * the result of accounting below it must see the result
+ * of __unhash_process()->__list_del(thread_group) above.
*/
+ smp_wmb();
+
sig->utime = cputime_add(sig->utime, tsk->utime);
sig->stime = cputime_add(sig->stime, tsk->stime);
sig->gtime = cputime_add(sig->gtime, tsk->gtime);
@@ -127,9 +134,6 @@ static void __exit_signal(struct task_st
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
}
- sig->nr_threads--;
- __unhash_process(tsk, ...Memory barriers prevents to account times twice, but as we write
sig->{u,s}time and sig->sum_sched_runtime on one cpu and read them
on another cpu, without a lock, this patch make theoretically possible
that some accumulated values of struct task_cputime may include exited
task values and some not. In example times->utime will include values
from 10 threads, and times->{stime,sum_exec_runtime} values form 9
threads, because local cpu updates sig->utime but not two other values.
This can make scaling in thread_group_times() not correct. I'm not sure
how big drawback is that.
Stanislaw
--
Yes sure. From 4/4 changelog: This is a user visible change. Without ->siglock we can't get the "whole" info atomically, and if we race with exit() we can miss the exiting thread. this also means that it is possible that we don't miss the exiting thread Again, see above. The changelog explicitely explains that this patch assumes it is OK to return the info which is not atomic/accurate (as Neither me, that is why I am asking for comments. To me this looks acceptable, but I can't judge. Oleg. --
thread_group_times() relies on ->siglock hold by the caller when it updates sig->prev_Xtime members. Change the code to take cputimer.lock around the assignment. This makes it rcu-safe and fixes the theoretical race. wait_task_zombie() calls it lockless and thus can race with do_task_stat(). Note: we are taking cputimer.lock under ->siglock but this dependency is not new, thread_group_cputimer() already does this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/sched.c | 3 +++ 1 file changed, 3 insertions(+) --- 34-rc1/kernel/sched.c~cpuacct_3_thread_group_times_dont_rely_on_siglock 2010-03-29 19:44:25.000000000 +0200 +++ 34-rc1/kernel/sched.c 2010-03-29 19:46:21.000000000 +0200 @@ -3449,6 +3449,7 @@ void thread_group_times(struct task_stru struct signal_struct *sig = p->signal; struct task_cputime cputime; cputime_t rtime, utime, total; + unsigned long flags; thread_group_cputime(p, &cputime); @@ -3464,9 +3465,11 @@ void thread_group_times(struct task_stru } else utime = rtime; + spin_lock_irqsave(&sig->cputimer.lock, flags); sig->prev_utime = max(sig->prev_utime, utime); sig->prev_stime = max(sig->prev_stime, cputime_sub(rtime, sig->prev_utime)); + spin_unlock_irqrestore(&sig->cputimer.lock, flags); *ut = sig->prev_utime; *st = sig->prev_stime; --
Change do_task_stat() to walk through the ->thread_group list without
->siglock, we are doing while_each_thread() twice even if the "whole"
info is not necessarily needed, say, for /bin/ps.
We can rely on previous changes which made thread_group_times() rcu-
safe and move the "if (whole)" code from ->siglock to rcu_read_lock().
Note: do_task_stat() needs more cleanups, this series only cares about
thread_group_times() issues.
This is a user visible change. Without ->siglock we can't get the "whole"
info atomically, and if we race with exit() we can miss the exiting thread.
However, I hope this is OK for /bin/top. The next read from /proc/pid/stat
will see the updated info, we can never overestimate the reported numbers,
and they can never go back.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/array.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
--- 34-rc1/fs/proc/array.c~cpuacct_4_do_task_stat_walk_tg_under_rcu 2010-03-29 19:44:24.000000000 +0200
+++ 34-rc1/fs/proc/array.c 2010-03-29 19:47:03.000000000 +0200
@@ -421,8 +421,14 @@ static int do_task_stat(struct seq_file
cgtime = sig->cgtime;
rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);
+ sid = task_session_nr_ns(task, ns);
+ ppid = task_tgid_nr_ns(task->real_parent, ns);
+ pgid = task_pgrp_nr_ns(task, ns);
+ unlock_task_sighand(task, &flags);
+
/* add up live thread stats at the group level */
- if (whole) {
+ rcu_read_lock();
+ if (whole && pid_alive(task)) {
struct task_struct *t = task;
do {
min_flt += t->min_flt;
@@ -433,15 +439,11 @@ static int do_task_stat(struct seq_file
min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_times(task, &utime, &stime);
gtime = cputime_add(gtime, sig->gtime);
- }
-
- sid = task_session_nr_ns(task, ns);
- ppid = task_tgid_nr_ns(task->real_parent, ns);
- pgid = task_pgrp_nr_ns(task, ns);
- unlock_task_sighand(task, ...> [PATCH -mm 1/4] Oh, sorry, this one is 4/4. Oleg. --
