Re: scheduler bug: process running since 5124095h

Previous thread: [GIT PULL] perf fixes by Frederic Weisbecker on Saturday, March 27, 2010 - 10:11 pm. (15 messages)

Next thread: [patch] radeon: range check issues by Dan Carpenter on Sunday, March 28, 2010 - 4:23 am. (2 messages)
From: =?ISO-8859-1?Q?T=F6r=F6k_Edwin?=
Date: Sunday, March 28, 2010 - 1:49 am

I have a simple way to reproduce this:
1. Boot the system, run top, confirm everything is normal
2. Run latencytop, and quit (I used version 0.5)
3. Run top, see 5124095h in the TIME column

For example:
 6649 daemon    20   0 74500 8892 3796 S    6  0.2   0:00.03 debsecan
 4255 root      20   0  8908  360  260 S    2  0.0   0:00.04 irqbalance
    1 root      20   0 10332  692  580 S    0  0.0  5124095h init

The processes that get the 5124095h seem random, but there are plenty of
them (if you sort by 'T', the entire top display is filled with
processes running since 5124095h).

Best regards,
--Edwin
--

From: Mike Galbraith
Date: Monday, March 29, 2010 - 3:52 am

Indeed, and I don't even have CONFIG_LATENCYTOP set.  It bisected to...

761b1d26df542fd5eb348837351e4d2f3bc7bffe is the first bad commit
commit 761b1d26df542fd5eb348837351e4d2f3bc7bffe
Author: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Date:   Thu Nov 12 13:33:45 2009 +0900

    sched: Fix granularity of task_u/stime()
    
    Originally task_s/utime() were designed to return clock_t but
    later changed to return cputime_t by following commit:
    
      commit efe567fc8281661524ffa75477a7c4ca9b466c63
      Author: Christian Borntraeger <borntraeger@de.ibm.com>
      Date:   Thu Aug 23 15:18:02 2007 +0200
    
    It only changed the type of return value, but not the
    implementation. As the result the granularity of task_s/utime()
    is still that of clock_t, not that of cputime_t.
    
    So using task_s/utime() in __exit_signal() makes values
    accumulated to the signal struct to be rounded and coarse
    grained.
    
    This patch removes casts to clock_t in task_u/stime(), to keep
    granularity of cputime_t over the calculation.
    
    v2:
      Use div_u64() to avoid error "undefined reference to `__udivdi3`"
      on some 32bit systems.
    
    Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Cc: xiyou.wangcong@gmail.com
    Cc: Spencer Candland <spencer@bluehost.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Stanislaw Gruszka <sgruszka@redhat.com>
    LKML-Reference: <4AFB9029.9000208@jp.fujitsu.com>
git bisect start
# good: [7f5e918e62cbc9ac27c2f47d3c3dd4b86f67ff0e] Linux 2.6.32.9
git bisect good 7f5e918e62cbc9ac27c2f47d3c3dd4b86f67ff0e
# bad: [baac35c4155a8aa826c70acee6553368ca5243a2] security: fix error return path in ima_inode_alloc
git bisect bad baac35c4155a8aa826c70acee6553368ca5243a2
# bad: [baac35c4155a8aa826c70acee6553368ca5243a2] security: fix error return path in ima_inode_alloc
git bisect bad baac35c4155a8aa826c70acee6553368ca5243a2
# bad: ...
From: Peter Zijlstra
Date: Monday, March 29, 2010 - 3:54 am

Oh lovely, and reverting that cures the issue?

--

From: Mike Galbraith
Date: Monday, March 29, 2010 - 4:33 am

Haven't done so yet, will do later this afternoon.

	-Mike

--

From: Hidetoshi Seto
Date: Monday, March 29, 2010 - 5:04 am

Wow, it's easy to reproduce. I'll check it later... 


Thanks,
H.Seto

--

From: Hidetoshi Seto
Date: Monday, March 29, 2010 - 11:22 pm

Quick report:

The reason why this commit have bisected is because it changed
the type of time values from signed clock_t to unsigned cputime_t,
so that the following if-block become to be always taken:

   > -       stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
   > -                       cputime_to_clock_t(task_utime(p));
   > +       stime = nsecs_to_cputime(p->se.sum_exec_runtime) - task_utime(p);
   > -               p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
   > +               p->prev_stime = max(p->prev_stime, stime);
   > 
   >         return p->prev_stime;

From strace of latancytop, it does write to /proc/<pid>/sched:

   5891  open("/proc/1/sched", O_RDWR)     = 5
   5891  fstat(5, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
   5891  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
    0x7fc6668f3000
   5891  read(5, "init (1, #threads: 1)\n----------"..., 1024) = 776
   5891  close(5)                          = 0

It results in:

[kernel/sched_debug.c]
void proc_sched_set_task(struct task_struct *p)
{
 :
        p->se.sum_exec_runtime                  = 0;
        p->se.prev_sum_exec_runtime             = 0;
        p->nvcsw                                = 0;
        p->nivcsw                               = 0;
}

So soon some task will have great (in fact negative) stime.

There would be no doubt that this initialize in sched_debug.c
will break monotonicity of sum_exec_runtime.  I confirmed that
the issue is disappeared by comment-out of lines above.

Reverting the bisected commit is wrong solution, because it
will bring another issue, i.e. lost of runtime, and u/stime
seems to be frozen because these values restart from 0 so
prev_* is used for a while.

How to fix?  Is this a bug of latencytop? Kernel?
Please comment.


Thanks,
H.Seto

--

From: Mike Galbraith
Date: Tuesday, March 30, 2010 - 1:07 am

I looked at it in tip this morning (where task_stime doesn't exist), and
did the below to swat the underflow.  Seems ok, but I haven't yet
enabled CONFIG_LATENCYTOP.

diff --git a/kernel/sched.c b/kernel/sched.c
index 81575f1..cdf4fa4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3366,6 +3366,7 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	 * Compare with previous values, to keep monotonicity:
 	 */
 	p->prev_utime = max(p->prev_utime, utime);
+	rtime = max(rtime, p->prev_utime);
 	p->prev_stime = max(p->prev_stime, cputime_sub(rtime, p->prev_utime));
 
 	*ut = p->prev_utime;
@@ -3396,6 +3397,7 @@ void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
 		utime = rtime;
 
 	sig->prev_utime = max(sig->prev_utime, utime);
+	rtime = max(rtime, sig->prev_utime);
 	sig->prev_stime = max(sig->prev_stime,
 			      cputime_sub(rtime, sig->prev_utime));
 


--

From: Mike Galbraith
Date: Tuesday, March 30, 2010 - 2:09 am

I think the correct fix is to not allow resetting of those 4 fields, as
they're used elsewhere in kernel.  Latencytop doesn't seem to miss being
allowed to scribble.  Dunno why it wants to reset any of these though.

sched: fix proc_sched_set_task()

Latencytop clearing sum_exec_runtime via proc_sched_set_task() breaks
task_times().  Other places in kernel use nvcsw and nivcsw, which are
being cleared as well,  Clear task statistics only.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reported-by: Török Edwin <edwintorok@gmail.com>
LKML-Reference: <new-submission>

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 8a46a71..ea2c690 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -491,8 +491,4 @@ void proc_sched_set_task(struct task_struct *p)
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
-	p->se.sum_exec_runtime			= 0;
-	p->se.prev_sum_exec_runtime		= 0;
-	p->nvcsw				= 0;
-	p->nivcsw				= 0;
 }



--

From: Peter Zijlstra
Date: Tuesday, March 30, 2010 - 2:01 am

Urgh, that code should really die, bugger that latencytop relies on that
though. Arjan anything you can do to cure latencytop of this madness?
--

From: Mike Galbraith
Date: Tuesday, March 30, 2010 - 2:12 am

I killed those 4 fields without _seeming_ to upset it.

	-Mike

--

From: Peter Zijlstra
Date: Tuesday, March 30, 2010 - 2:34 am

OK, if Arjan doesn't complain I'll take it ;-)
--

Previous thread: [GIT PULL] perf fixes by Frederic Weisbecker on Saturday, March 27, 2010 - 10:11 pm. (15 messages)

Next thread: [patch] radeon: range check issues by Dan Carpenter on Sunday, March 28, 2010 - 4:23 am. (2 messages)