Re: [stable] 2.6.23 regression: top displaying 9999% CPU usage

Previous thread: [git patches] libata update by Jeff Garzik on Friday, October 12, 2007 - 12:40 pm. (6 messages)

Next thread: [PATCH RFC 1/2] paravirt: refactor struct paravirt_ops into smaller pv_*_ops by Jeremy Fitzhardinge on Friday, October 12, 2007 - 1:40 pm. (1 message)
From: Frans Pop
Date: Friday, October 12, 2007 - 1:31 pm

Please consider this patch for 2.6.23.2
http://lkml.org/lkml/2007/10/4/389

tested by me in
http://lkml.org/lkml/2007/10/5/150

to fix the regression first reported in
http://lkml.org/lkml/2007/10/3/123

Cause of the wrong display in top is the value of stime in /proc/<pid>/stat 
decreasing occasionally.

The issue is also registered as:
http://bugzilla.kernel.org/show_bug.cgi?id=9135

Thanks,
Frans Pop
-

From: Greg KH
Date: Friday, October 12, 2007 - 2:22 pm

Is it already in Linus's tree?  If so, do you have a git commit id?  If
not, please let us (stable@) know when it is, and what the id is, and
then we can add it to our tree.

thanks,

greg k-h
-

From: Frans Pop
Date: Saturday, October 13, 2007 - 12:53 am

Not AFAICT.
CCing Christian (as patch author) and Ingo (as author of the change that 
caused the regression) so they can push it through the correct channels.
-

From: Christian Borntraeger
Date: Sunday, October 14, 2007 - 1:36 pm

I dont know how to proceed with this issue. The more I think about it, the
more I am convinced that using sum_exec_runtime together with sampled utime
and stime will never guarantee monotonicity for utime and stime in proc. 
Just imagine an process with 9 ticks for utime and 0 ticks for stime. If
we now sample one tick for stime (but having only a small increase in
sum_exec_runtime) the next utime value will only be 90% of the last value. 
So returning to the 2.6.22 model seems to be the safest solution until 
somebody else comes up with an idea that works proper.

Ingo, any opinion?

Christian
-

From: Christian Borntraeger
Date: Tuesday, October 16, 2007 - 1:29 am

Chuck, Balbir,

we still have a problem with stime occosionally going backwards. I stated
below that I think this is not fixable with the current utime/stime split
algorithm.

Balbir, you wrote this code, Chuck you tried to fix it. Any ideas how to
fix this properly? The only idea I have requires that we save the old value
of utime and stime and therefore requires additional locking.

Otherwise I suggest to apply my reversion patch from below:


--------- patch-----------

Fix stime going backwards

after 2.6.22 the accounting was changes to use sched_clock and to use stime 
and utime only for the split. This is where task_utime and task_stime were 
introduced. See the code in 2.6.23-rc1.
Unfortunately this broke the accouting on s390 which already has precise 
numbers in utime and stime. So the code was partially reverted to use stime 
and utime again where appropriate, which are of type  cputime_t. 

It now seems, that the rest of the logic is still broken.

This patch basically reverts the rest of b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa
and  should restore the 2.6.22 behavior. The process time is used from tasks
utime  and stime instead of the scheduler clock. That means, in general after
a long period of time, it is less accurate than the sum_exec_runtime based 
code, but it doesnt break top.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

---
 fs/proc/array.c |   37 -------------------------------------
 1 file changed, 37 deletions(-)

Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -323,7 +323,6 @@ int proc_pid_status(struct task_struct *
 /*
  * Use precise platform statistics if available:
  */
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
 static cputime_t task_utime(struct task_struct *p)
 {
 	return p->utime;
@@ -333,42 +332,6 @@ static cputime_t task_stime(struct task_
 {
 	return p->stime;
 }
-#else
-static cputime_t ...
From: Balbir Singh
Date: Tuesday, October 16, 2007 - 2:30 am

Hi,

I missed seeing this problem before, sorry about that. Thanks for the

I am trying to think out loud as to what the root cause of the problem
might be. In one of the discussion threads, I saw utime going backwards,
which seemed very odd, I suspect that those are rounding errors.

I don't understand your explanation below

Initially utime = 9, stime = 0, sum_exec_runtime = S1

Later

utime = 9, stime = 1, sum_exec_runtime = S2

We can be sure that S >= (utime + stime)

If S2 = S1 + delta, then as per our calculation

Initially

utime_proc = (utime * (S1))/(utime + stime)
           = nsec_to_clock_t(9 * S1 / 9)

later

utime_proc = nsec_to_clock_t(9 * S2/10)

Given that S >= (utime + stime), we should be fine.
The only problem I see is with rounding, like I mentioned before at two
places

1. Rounding at do_div() in task_utime()
2. Rounding in conversion from clock_t_to_cputime()

I have tried and not had any success reproducing the problem, could you
please help me with some pointers/steps to reproduce the problem?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Frans Pop
Date: Tuesday, October 16, 2007 - 3:11 am

I only remembered stime going backwards, but looking back at the mails
you are right, there has been one case of utime going backwards too.

I can reproduce the problem reliably for two KDE programs I have running
permanently on my system: kontact and amarok. Both of these wake up
regularly and use some (not very much) processor capacity before going to
sleep again. Any process that has that that characteristic should do.

/me thinks a bit and tries something

lol - I have just managed to reproduce this with 'top' itself :-P

Run 'top' in one terminal; then in another terminal:
$ ps ax | grep " [t]op"
17869 pts/0    S+     0:00 top
$ while true; do awk '{print $14" "$15}' /proc/17869/stat; sleep 1; done | ts
[...]
Oct 16 11:54:48 8 10
Oct 16 11:54:49 6 12  <-- utime
Oct 16 11:54:50 6 12
Oct 16 11:54:51 6 12
Oct 16 11:54:52 8 10  <-- stime
Oct 16 11:54:53 8 10
Oct 16 11:54:54 8 10
Oct 16 11:54:55 8 12
Oct 16 11:54:56 8 12

This example happens to have both stime _and_ utime decreasing...

'ts' is a small utility that prints the timestamps before the readings;
the test will work just as well without it.
Note that it may take a while for the error to show up. In this case it
was 40 seconds.

Cheers,
Frans Pop
-

From: Balbir Singh
Date: Tuesday, October 16, 2007 - 3:38 am

Thanks, I know why I am not seeing the problem. My box has
CONFIG_VIRT_CPU_ACCOUNTING set. I'll find another box and try.



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Christian Borntraeger
Date: Tuesday, October 16, 2007 - 3:34 am

I think here is the problem. How can we be sure? We cant. utime and stime
are sampled, so they can be largely off in any direction,if the program
sleeps often and manages to synchronize itself to the timer tick. Lets say
a program only does a simple system call and then sleeps. So sum_exec_runtime
is increased by lets say 1000 cycles on a 1Ghz box which means 1000ns. If now 
the timer tick happens exactly at this moment, stime is increased by 1 tick
= 1000000ns. 

Maybe there is some magic in the code which I did not see, but obviously
the problem exists and looking at Frans data (stime+utime) are not decreasing,
but stime isnt and utime is. If you look at Frans data you see:
Oct 16 11:54:48 8 10
Oct 16 11:54:49 6 12  <-- utime
Oct 16 11:54:50 6 12
Oct 16 11:54:51 6 12
Oct 16 11:54:52 8 10  <-- stime
Oct 16 11:54:53 8 10
Oct 16 11:54:54 8 10
Oct 16 11:54:55 8 12
Oct 16 11:54:56 8 12

(stime+utime) is constant. That means that S2-S1 is obviously smaller than
one tick (See the calculation in task_stime). I am quite sure it is caused



-

From: Balbir Singh
Date: Tuesday, October 16, 2007 - 5:59 am

Yes, I thought of that just after I sent out my email. In the case that
you mention, the utime and stime accounting is incorrect anyway :-)
I think we need to find a better solution. I was going to propose that
we round correctly in (the divisions in)

1. task_utime()
2. clock_t_to_cputime()

I suspect we'll need to round task_utime() to p->utime if the value of
task_utime() < p->utime and the same thing for task_stime(). I've tried
reproducing the problem on my UML setup without any success. Let me

Yes, very interesting observation.

[snip]


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Frans Pop
Date: Monday, October 29, 2007 - 5:05 am

Hello Balbir,


Any progress on this issue? I noticed that it's still there in current git.

If a better implementation is not expected any time soon, how about an ACK 
on the reversion patch Christian proposed in
http://lkml.org/lkml/2007/10/16/76
so we can at least get rid of the regression?

Thanks,
Frans Pop
-

From: Balbir Singh
Date: Monday, October 29, 2007 - 5:31 am

We'll I initially thought of it and then I remembered that the
regression occurs only when the accounting itself is inaccurate.
I am tempted to ack the removal, but I would like to get input


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Ingo Molnar
Date: Monday, October 29, 2007 - 1:04 pm

i've got a patch from Peter queued up. (see below) This should fix the 
main issue.

	Ingo

-------------------->
Subject: sched: keep utime/stime monotonic
From: Peter Zijlstra <a.p.zijlstra@chello.nl>

keep utime/stime monotonic.

cpustats use utime/stime as a ratio against sum_exec_runtime, as a
consequence it can happen - when the ratio changes faster than time
accumulates - that either can be appear to go backwards.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 fs/proc/array.c       |    3 ++-
 include/linux/sched.h |    1 +
 kernel/fork.c         |    1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c
+++ linux/fs/proc/array.c
@@ -358,7 +358,8 @@ static cputime_t task_utime(struct task_
 	}
 	utime = (clock_t)temp;
 
-	return clock_t_to_cputime(utime);
+	p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+	return p->prev_utime;
 }
 
 static cputime_t task_stime(struct task_struct *p)
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1009,6 +1009,7 @@ struct task_struct {
 	unsigned int rt_priority;
 	cputime_t utime, stime, utimescaled, stimescaled;
 	cputime_t gtime;
+	cputime_t prev_utime;
 	unsigned long nvcsw, nivcsw; /* context switch counts */
 	struct timespec start_time; 		/* monotonic time */
 	struct timespec real_start_time;	/* boot based time */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -1056,6 +1056,7 @@ static struct task_struct *copy_process(
 	p->gtime = cputime_zero;
 	p->utimescaled = cputime_zero;
 	p->stimescaled = cputime_zero;
+	p->prev_utime = cputime_zero;
 
 #ifdef CONFIG_TASK_XACCT
 ...
From: Christian Borntraeger
Date: Monday, October 29, 2007 - 1:33 pm

[...]

I dont think it will work. It will make utime monotic, but stime can still 
decrease. For example let sum_exec_runtime increase by a tiny little bit while
utime will get a full additional tick. stime is sum-utime. So stime can still
go backwards. So I think that we need this kind of logic for stime as well, 
no?

Christian



-

From: Ingo Molnar
Date: Monday, October 29, 2007 - 1:41 pm

yeah, probably. Peter?

	Ingo
-

From: Peter Zijlstra
Date: Monday, October 29, 2007 - 2:11 pm

/me dons the brown paper bag while mumbling an agreement of sorts.

I'll not attempt to come up with a patch as I fear I'll just make a
bigger mess in my current state, hope to feel better tomorrow..

-

From: Frans Pop
Date: Monday, October 29, 2007 - 2:22 pm

Yes, definitely :-)

With this patch stime is still all over the place.

Oct 29 22:12:39 314 64
Oct 29 22:12:40 392 68
Oct 29 22:12:41 408 67  <--
Oct 29 22:12:42 410 67
Oct 29 22:12:43 416 68
Oct 29 22:12:44 420 68
Oct 29 22:12:45 424 68
Oct 29 22:12:46 426 68
Oct 29 22:12:47 430 70
Oct 29 22:12:48 430 70
Oct 29 22:12:49 430 70
Oct 29 22:12:50 432 68  <--
Oct 29 22:12:51 432 69
Oct 29 22:12:52 432 69
Oct 29 22:12:53 432 69
Oct 29 22:12:54 432 69
Oct 29 22:12:55 432 69
Oct 29 22:12:56 433 70
Oct 29 22:12:57 434 69  <--
Oct 29 22:12:58 443 71

utime looks OK now, though I'd like to test it a bit more (when stime is 
fixed too) before giving a final verdict on that.
-

From: Balbir Singh
Date: Monday, October 29, 2007 - 2:43 pm

We'll also need this additional patch (untested), but in the long run
I think the approach needs to be

1. Update stime and utime at the time of context switching -- keep it
   in sync with p->sum_exec_runtime
2. Keep track of system/user context at system call entry points

 
Extend Peter's patch to fix accounting issues.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 fs/proc/array.c       |    3 ++-
 include/linux/sched.h |    2 +-
 kernel/fork.c         |    1 +
 3 files changed, 4 insertions(+), 2 deletions(-)

diff -puN fs/proc/array.c~fix-accounting-issue-extended fs/proc/array.c
--- 2.6.24-rc1/fs/proc/array.c~fix-accounting-issue-extended	2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/fs/proc/array.c	2007-10-30 03:09:46.000000000 +0530
@@ -374,7 +374,8 @@ static cputime_t task_stime(struct task_
 	stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
 			cputime_to_clock_t(task_utime(p));
 
-	return clock_t_to_cputime(stime);
+	p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+	return p->prev_stime;
 }
 #endif
 
diff -puN include/linux/sched.h~fix-accounting-issue-extended include/linux/sched.h
--- 2.6.24-rc1/include/linux/sched.h~fix-accounting-issue-extended	2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/include/linux/sched.h	2007-10-30 03:08:32.000000000 +0530
@@ -1004,7 +1004,7 @@ struct task_struct {
 	unsigned int rt_priority;
 	cputime_t utime, stime, utimescaled, stimescaled;
 	cputime_t gtime;
-	cputime_t prev_utime;
+	cputime_t prev_utime, prev_stime;
 	unsigned long nvcsw, nivcsw; /* context switch counts */
 	struct timespec start_time; 		/* monotonic time */
 	struct timespec real_start_time;	/* boot based time */
diff -puN kernel/fork.c~fix-accounting-issue-extended kernel/fork.c
--- 2.6.24-rc1/kernel/fork.c~fix-accounting-issue-extended	2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/kernel/fork.c	2007-10-30 03:08:42.000000000 +0530
@@ -1057,6 +1057,7 @@ static struct task_struct ...
From: Frans Pop
Date: Monday, October 29, 2007 - 4:19 pm

OK. Both patches together do the trick. Gave it a nice long test run and got 
no more weirdness.

Feel free to send me a patch for testing when you have one.

Cheers,
Frans
-

From: Ingo Molnar
Date: Monday, October 29, 2007 - 4:22 pm

cool, thanks! I've queued it up for the next scheduler batch.

	Ingo
-

From: Otavio Salvador
Date: Tuesday, October 30, 2007 - 1:22 pm

It would be nice to put those two on stable point release too.

-- 
        O T A V I O    S A L V A D O R
---------------------------------------------
 E-mail: otavio@debian.org      UIN: 5906116
 GNU/Linux User: 239058     GPG ID: 49A5F855
 Home Page: http://otavio.ossystems.com.br
---------------------------------------------
"Microsoft sells you Windows ... Linux gives
 you the whole house."
-

From: Balbir Singh
Date: Monday, October 29, 2007 - 4:24 pm

Thanks for testing it




-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Christian Borntraeger
Date: Monday, October 29, 2007 - 10:56 pm

This is similar to what s390 and ppc64 do when CONFIG_VIRT_CPU_ACCOUNTING is 
set. Whenever you do a patch, please make sure to not break the already 
precise values for utime and stime on s390 and ppc. Feel free to CC me 
whenever you have a patch for this.

Christian
-

From: Balbir Singh
Date: Monday, October 29, 2007 - 11:00 pm

Yes, true. I'll cc everyone on this list and send out the patch as RFC.
Much of this code will be architecture specific and as you rightly


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

Previous thread: [git patches] libata update by Jeff Garzik on Friday, October 12, 2007 - 12:40 pm. (6 messages)

Next thread: [PATCH RFC 1/2] paravirt: refactor struct paravirt_ops into smaller pv_*_ops by Jeremy Fitzhardinge on Friday, October 12, 2007 - 1:40 pm. (1 message)