Re: [stable] 2.6.23 regression: top displaying 9999% CPU usage

Previous thread: [git patches] libata update by Jeff Garzik on Friday, October 12, 2007 - 3:40 pm. (6 messages)

Next thread: [PATCH RFC 1/2] paravirt: refactor struct paravirt_ops into smaller pv_*_ops by Jeremy Fitzhardinge on Friday, October 12, 2007 - 4:40 pm. (1 message)
To: <stable@...>
Cc: <linux-kernel@...>
Date: Friday, October 12, 2007 - 4:31 pm

Please consider this patch for 2.6.23.2
http://lkml.org/lkml/2007/10/4/389

tested by me in
http://lkml.org/lkml/2007/10/5/150

to fix the regression first reported in
http://lkml.org/lkml/2007/10/3/123

Cause of the wrong display in top is the value of stime in /proc/<pid>/stat
decreasing occasionally.

The issue is also registered as:
http://bugzilla.kernel.org/show_bug.cgi?id=9135

Thanks,
Frans Pop
-

To: Frans Pop <elendil@...>
Cc: <stable@...>, <linux-kernel@...>
Date: Friday, October 12, 2007 - 5:22 pm

Is it already in Linus's tree? If so, do you have a git commit id? If
not, please let us (stable@) know when it is, and what the id is, and
then we can add it to our tree.

thanks,

greg k-h
-

To: Greg KH <greg@...>
Cc: <stable@...>, <linux-kernel@...>, Christian Borntraeger <borntraeger@...>, Ingo Molnar <mingo@...>
Date: Saturday, October 13, 2007 - 3:53 am

Not AFAICT.
CCing Christian (as patch author) and Ingo (as author of the change that
caused the regression) so they can push it through the correct channels.
-

To: Frans Pop <elendil@...>
Cc: Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Sunday, October 14, 2007 - 4:36 pm

I dont know how to proceed with this issue. The more I think about it, the
more I am convinced that using sum_exec_runtime together with sampled utime
and stime will never guarantee monotonicity for utime and stime in proc.
Just imagine an process with 9 ticks for utime and 0 ticks for stime. If
we now sample one tick for stime (but having only a small increase in
sum_exec_runtime) the next utime value will only be 90% of the last value.
So returning to the 2.6.22 model seems to be the safest solution until
somebody else comes up with an idea that works proper.

Ingo, any opinion?

Christian
-

To: Balbir Singh <balbir@...>, Chuck Ebbert <cebbert@...>
Cc: Frans Pop <elendil@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Tuesday, October 16, 2007 - 4:29 am

Chuck, Balbir,

we still have a problem with stime occosionally going backwards. I stated
below that I think this is not fixable with the current utime/stime split
algorithm.

Balbir, you wrote this code, Chuck you tried to fix it. Any ideas how to
fix this properly? The only idea I have requires that we save the old value
of utime and stime and therefore requires additional locking.

Otherwise I suggest to apply my reversion patch from below:

--------- patch-----------

Fix stime going backwards

after 2.6.22 the accounting was changes to use sched_clock and to use stime
and utime only for the split. This is where task_utime and task_stime were
introduced. See the code in 2.6.23-rc1.
Unfortunately this broke the accouting on s390 which already has precise
numbers in utime and stime. So the code was partially reverted to use stime
and utime again where appropriate, which are of type  cputime_t.

It now seems, that the rest of the logic is still broken.

This patch basically reverts the rest of b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa
and should restore the 2.6.22 behavior. The process time is used from tasks
utime and stime instead of the scheduler clock. That means, in general after
a long period of time, it is less accurate than the sum_exec_runtime based
code, but it doesnt break top.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

---
fs/proc/array.c | 37 -------------------------------------
1 file changed, 37 deletions(-)

Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -323,7 +323,6 @@ int proc_pid_status(struct task_struct *
/*
* Use precise platform statistics if available:
*/
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
static cputime_t task_utime(struct task_struct *p)
{
return p->utime;
@@ -333,42 +332,6 @@ static cputime_t task_stime(struct task_
{
return p->stime;
}
-#else
-static ...

To: Christian Borntraeger <borntraeger@...>
Cc: Chuck Ebbert <cebbert@...>, Frans Pop <elendil@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Tuesday, October 16, 2007 - 5:30 am

Hi,

I missed seeing this problem before, sorry about that. Thanks for the

I am trying to think out loud as to what the root cause of the problem
might be. In one of the discussion threads, I saw utime going backwards,
which seemed very odd, I suspect that those are rounding errors.

I don't understand your explanation below

Initially utime = 9, stime = 0, sum_exec_runtime = S1

Later

utime = 9, stime = 1, sum_exec_runtime = S2

We can be sure that S >= (utime + stime)

If S2 = S1 + delta, then as per our calculation

Initially

utime_proc = (utime * (S1))/(utime + stime)
= nsec_to_clock_t(9 * S1 / 9)

later

utime_proc = nsec_to_clock_t(9 * S2/10)

Given that S >= (utime + stime), we should be fine.
The only problem I see is with rounding, like I mentioned before at two
places

1. Rounding at do_div() in task_utime()
2. Rounding in conversion from clock_t_to_cputime()

I have tried and not had any success reproducing the problem, could you
please help me with some pointers/steps to reproduce the problem?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-

To: <balbir@...>
Cc: Chuck Ebbert <cebbert@...>, Frans Pop <elendil@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Tuesday, October 16, 2007 - 6:34 am

I think here is the problem. How can we be sure? We cant. utime and stime
are sampled, so they can be largely off in any direction,if the program
sleeps often and manages to synchronize itself to the timer tick. Lets say
a program only does a simple system call and then sleeps. So sum_exec_runtime
is increased by lets say 1000 cycles on a 1Ghz box which means 1000ns. If now
the timer tick happens exactly at this moment, stime is increased by 1 tick
= 1000000ns.

Maybe there is some magic in the code which I did not see, but obviously
the problem exists and looking at Frans data (stime+utime) are not decreasing,
but stime isnt and utime is. If you look at Frans data you see:
Oct 16 11:54:48 8 10
Oct 16 11:54:49 6 12  <-- utime
Oct 16 11:54:50 6 12
Oct 16 11:54:51 6 12
Oct 16 11:54:52 8 10  <-- stime
Oct 16 11:54:53 8 10
Oct 16 11:54:54 8 10
Oct 16 11:54:55 8 12
Oct 16 11:54:56 8 12

(stime+utime) is constant. That means that S2-S1 is obviously smaller than
one tick (See the calculation in task_stime). I am quite sure it is caused

-

To: Christian Borntraeger <borntraeger@...>
Cc: Chuck Ebbert <cebbert@...>, Frans Pop <elendil@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Tuesday, October 16, 2007 - 8:59 am

Yes, I thought of that just after I sent out my email. In the case that
you mention, the utime and stime accounting is incorrect anyway :-)
I think we need to find a better solution. I was going to propose that
we round correctly in (the divisions in)

1. task_utime()
2. clock_t_to_cputime()

I suspect we'll need to round task_utime() to p->utime if the value of
task_utime() < p->utime and the same thing for task_stime(). I've tried
reproducing the problem on my UML setup without any success. Let me

Yes, very interesting observation.

[snip]

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-

To: <balbir@...>
Cc: Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Monday, October 29, 2007 - 8:05 am

Hello Balbir,

Any progress on this issue? I noticed that it's still there in current git.

If a better implementation is not expected any time soon, how about an ACK
on the reversion patch Christian proposed in
http://lkml.org/lkml/2007/10/16/76
so we can at least get rid of the regression?

Thanks,
Frans Pop
-

To: Frans Pop <elendil@...>
Cc: Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>
Date: Monday, October 29, 2007 - 8:31 am

We'll I initially thought of it and then I remembered that the
regression occurs only when the accounting itself is inaccurate.
I am tempted to ack the removal, but I would like to get input

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-

To: Balbir Singh <balbir@...>
Cc: Frans Pop <elendil@...>, Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 29, 2007 - 4:04 pm

i've got a patch from Peter queued up. (see below) This should fix the
main issue.

Ingo

-------------------->
Subject: sched: keep utime/stime monotonic
From: Peter Zijlstra <a.p.zijlstra@chello.nl>

keep utime/stime monotonic.

cpustats use utime/stime as a ratio against sum_exec_runtime, as a
consequence it can happen - when the ratio changes faster than time
accumulates - that either can be appear to go backwards.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
fs/proc/array.c | 3 ++-
include/linux/sched.h | 1 +
kernel/fork.c | 1 +
3 files changed, 4 insertions(+), 1 deletion(-)

Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c
+++ linux/fs/proc/array.c
@@ -358,7 +358,8 @@ static cputime_t task_utime(struct task_
}
utime = (clock_t)temp;

- return clock_t_to_cputime(utime);
+ p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+ return p->prev_utime;
}

static cputime_t task_stime(struct task_struct *p)
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1009,6 +1009,7 @@ struct task_struct {
unsigned int rt_priority;
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
+ cputime_t prev_utime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -1056,6 +1056,7 @@ static struct task_struct *copy_process(
p->gtime = cputime_zero;
p->utimescaled = cputime_zero;
p->stimescaled = cputime_zero;
+ p->prev_utime = cputime_zero...

To: Ingo Molnar <mingo@...>
Cc: Frans Pop <elendil@...>, Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 29, 2007 - 5:43 pm

We'll also need this additional patch (untested), but in the long run
I think the approach needs to be

1. Update stime and utime at the time of context switching -- keep it
in sync with p->sum_exec_runtime
2. Keep track of system/user context at system call entry points

Extend Peter's patch to fix accounting issues.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

fs/proc/array.c | 3 ++-
include/linux/sched.h | 2 +-
kernel/fork.c | 1 +
3 files changed, 4 insertions(+), 2 deletions(-)

diff -puN fs/proc/array.c~fix-accounting-issue-extended fs/proc/array.c
--- 2.6.24-rc1/fs/proc/array.c~fix-accounting-issue-extended 2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/fs/proc/array.c 2007-10-30 03:09:46.000000000 +0530
@@ -374,7 +374,8 @@ static cputime_t task_stime(struct task_
stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
cputime_to_clock_t(task_utime(p));

- return clock_t_to_cputime(stime);
+ p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+ return p->prev_stime;
}
#endif

diff -puN include/linux/sched.h~fix-accounting-issue-extended include/linux/sched.h
--- 2.6.24-rc1/include/linux/sched.h~fix-accounting-issue-extended 2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/include/linux/sched.h 2007-10-30 03:08:32.000000000 +0530
@@ -1004,7 +1004,7 @@ struct task_struct {
unsigned int rt_priority;
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
- cputime_t prev_utime;
+ cputime_t prev_utime, prev_stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
diff -puN kernel/fork.c~fix-accounting-issue-extended kernel/fork.c
--- 2.6.24-rc1/kernel/fork.c~fix-accounting-issue-extended 2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/kernel/fork.c 2007-10-30 03:08:42.000000000 +0530
@@ -1057,6 +1057,7 @@ static struc...

To: <balbir@...>
Cc: Ingo Molnar <mingo@...>, Frans Pop <elendil@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Tuesday, October 30, 2007 - 1:56 am

This is similar to what s390 and ppc64 do when CONFIG_VIRT_CPU_ACCOUNTING is
set. Whenever you do a patch, please make sure to not break the already
precise values for utime and stime on s390 and ppc. Feel free to CC me
whenever you have a patch for this.

Christian
-

To: Christian Borntraeger <borntraeger@...>
Cc: Ingo Molnar <mingo@...>, Frans Pop <elendil@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Tuesday, October 30, 2007 - 2:00 am

Yes, true. I'll cc everyone on this list and send out the patch as RFC.
Much of this code will be architecture specific and as you rightly

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-

To: <balbir@...>
Cc: Ingo Molnar <mingo@...>, Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 29, 2007 - 7:19 pm

OK. Both patches together do the trick. Gave it a nice long test run and got
no more weirdness.

Feel free to send me a patch for testing when you have one.

Cheers,
Frans
-

To: Frans Pop <elendil@...>
Cc: Ingo Molnar <mingo@...>, Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 29, 2007 - 7:24 pm

Thanks for testing it

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-

To: Frans Pop <elendil@...>
Cc: <balbir@...>, Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 29, 2007 - 7:22 pm

cool, thanks! I've queued it up for the next scheduler batch.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Frans Pop <elendil@...>, <balbir@...>, Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Tuesday, October 30, 2007 - 4:22 pm

It would be nice to put those two on stable point release too.

--
O T A V I O S A L V A D O R
---------------------------------------------
E-mail: otavio@debian.org UIN: 5906116
GNU/Linux User: 239058 GPG ID: 49A5F855
Home Page: http://otavio.ossystems.com.br
---------------------------------------------
"Microsoft sells you Windows ... Linux gives
you the whole house."
-

To: Ingo Molnar <mingo@...>
Cc: Balbir Singh <balbir@...>, Frans Pop <elendil@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 29, 2007 - 4:33 pm

[...]

I dont think it will work. It will make utime monotic, but stime can still
decrease. For example let sum_exec_runtime increase by a tiny little bit while
utime will get a full additional tick. stime is sum-utime. So stime can still
go backwards. So I think that we need this kind of logic for stime as well,
no?

Christian

-

To: Christian Borntraeger <borntraeger@...>
Cc: Balbir Singh <balbir@...>, Frans Pop <elendil@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Monday, October 29, 2007 - 4:41 pm

yeah, probably. Peter?

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Christian Borntraeger <borntraeger@...>, Balbir Singh <balbir@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Monday, October 29, 2007 - 5:22 pm

Yes, definitely :-)

With this patch stime is still all over the place.

Oct 29 22:12:39 314 64
Oct 29 22:12:40 392 68
Oct 29 22:12:41 408 67 <--
Oct 29 22:12:42 410 67
Oct 29 22:12:43 416 68
Oct 29 22:12:44 420 68
Oct 29 22:12:45 424 68
Oct 29 22:12:46 426 68
Oct 29 22:12:47 430 70
Oct 29 22:12:48 430 70
Oct 29 22:12:49 430 70
Oct 29 22:12:50 432 68 <--
Oct 29 22:12:51 432 69
Oct 29 22:12:52 432 69
Oct 29 22:12:53 432 69
Oct 29 22:12:54 432 69
Oct 29 22:12:55 432 69
Oct 29 22:12:56 433 70
Oct 29 22:12:57 434 69 <--
Oct 29 22:12:58 443 71

utime looks OK now, though I'd like to test it a bit more (when stime is
fixed too) before giving a final verdict on that.
-

To: Ingo Molnar <mingo@...>
Cc: Christian Borntraeger <borntraeger@...>, Balbir Singh <balbir@...>, Frans Pop <elendil@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 29, 2007 - 5:11 pm

/me dons the brown paper bag while mumbling an agreement of sorts.

I'll not attempt to come up with a patch as I fear I'll just make a
bigger mess in my current state, hope to feel better tomorrow..

-

To: <balbir@...>
Cc: Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Tuesday, October 16, 2007 - 6:11 am

I only remembered stime going backwards, but looking back at the mails
you are right, there has been one case of utime going backwards too.

I can reproduce the problem reliably for two KDE programs I have running
permanently on my system: kontact and amarok. Both of these wake up
regularly and use some (not very much) processor capacity before going to
sleep again. Any process that has that that characteristic should do.

/me thinks a bit and tries something

lol - I have just managed to reproduce this with 'top' itself :-P

Run 'top' in one terminal; then in another terminal:
$ ps ax | grep " [t]op"
17869 pts/0 S+ 0:00 top
$ while true; do awk '{print $14" "$15}' /proc/17869/stat; sleep 1; done | ts
[...]
Oct 16 11:54:48 8 10
Oct 16 11:54:49 6 12 <-- utime
Oct 16 11:54:50 6 12
Oct 16 11:54:51 6 12
Oct 16 11:54:52 8 10 <-- stime
Oct 16 11:54:53 8 10
Oct 16 11:54:54 8 10
Oct 16 11:54:55 8 12
Oct 16 11:54:56 8 12

This example happens to have both stime _and_ utime decreasing...

'ts' is a small utility that prints the timestamps before the readings;
the test will work just as well without it.
Note that it may take a while for the error to show up. In this case it
was 40 seconds.

Cheers,
Frans Pop
-

To: Frans Pop <elendil@...>
Cc: Christian Borntraeger <borntraeger@...>, Chuck Ebbert <cebbert@...>, Greg KH <greg@...>, <stable@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Tuesday, October 16, 2007 - 6:38 am

Thanks, I know why I am not seeing the problem. My box has
CONFIG_VIRT_CPU_ACCOUNTING set. I'll find another box and try.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-

Previous thread: [git patches] libata update by Jeff Garzik on Friday, October 12, 2007 - 3:40 pm. (6 messages)

Next thread: [PATCH RFC 1/2] paravirt: refactor struct paravirt_ops into smaller pv_*_ops by Jeremy Fitzhardinge on Friday, October 12, 2007 - 4:40 pm. (1 message)