Please consider this patch for 2.6.23.2
http://lkml.org/lkml/2007/10/4/389tested by me in
http://lkml.org/lkml/2007/10/5/150to fix the regression first reported in
http://lkml.org/lkml/2007/10/3/123Cause of the wrong display in top is the value of stime in /proc/<pid>/stat
decreasing occasionally.The issue is also registered as:
http://bugzilla.kernel.org/show_bug.cgi?id=9135Thanks,
Frans Pop
-
Is it already in Linus's tree? If so, do you have a git commit id? If
not, please let us (stable@) know when it is, and what the id is, and
then we can add it to our tree.thanks,
greg k-h
-
Not AFAICT.
CCing Christian (as patch author) and Ingo (as author of the change that
caused the regression) so they can push it through the correct channels.
-
I dont know how to proceed with this issue. The more I think about it, the
more I am convinced that using sum_exec_runtime together with sampled utime
and stime will never guarantee monotonicity for utime and stime in proc.
Just imagine an process with 9 ticks for utime and 0 ticks for stime. If
we now sample one tick for stime (but having only a small increase in
sum_exec_runtime) the next utime value will only be 90% of the last value.
So returning to the 2.6.22 model seems to be the safest solution until
somebody else comes up with an idea that works proper.Ingo, any opinion?
Christian
-
Chuck, Balbir,
we still have a problem with stime occosionally going backwards. I stated
below that I think this is not fixable with the current utime/stime split
algorithm.Balbir, you wrote this code, Chuck you tried to fix it. Any ideas how to
fix this properly? The only idea I have requires that we save the old value
of utime and stime and therefore requires additional locking.Otherwise I suggest to apply my reversion patch from below:
--------- patch-----------
Fix stime going backwards
after 2.6.22 the accounting was changes to use sched_clock and to use stime
and utime only for the split. This is where task_utime and task_stime were
introduced. See the code in 2.6.23-rc1.
Unfortunately this broke the accouting on s390 which already has precise
numbers in utime and stime. So the code was partially reverted to use stime
and utime again where appropriate, which are of type cputime_t.It now seems, that the rest of the logic is still broken.
This patch basically reverts the rest of b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa
and should restore the 2.6.22 behavior. The process time is used from tasks
utime and stime instead of the scheduler clock. That means, in general after
a long period of time, it is less accurate than the sum_exec_runtime based
code, but it doesnt break top.Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
fs/proc/array.c | 37 -------------------------------------
1 file changed, 37 deletions(-)Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -323,7 +323,6 @@ int proc_pid_status(struct task_struct *
/*
* Use precise platform statistics if available:
*/
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
static cputime_t task_utime(struct task_struct *p)
{
return p->utime;
@@ -333,42 +332,6 @@ static cputime_t task_stime(struct task_
{
return p->stime;
}
-#else
-static ...
Hi,
I missed seeing this problem before, sorry about that. Thanks for the
I am trying to think out loud as to what the root cause of the problem
might be. In one of the discussion threads, I saw utime going backwards,
which seemed very odd, I suspect that those are rounding errors.I don't understand your explanation below
Initially utime = 9, stime = 0, sum_exec_runtime = S1
Later
utime = 9, stime = 1, sum_exec_runtime = S2
We can be sure that S >= (utime + stime)
If S2 = S1 + delta, then as per our calculation
Initially
utime_proc = (utime * (S1))/(utime + stime)
= nsec_to_clock_t(9 * S1 / 9)later
utime_proc = nsec_to_clock_t(9 * S2/10)
Given that S >= (utime + stime), we should be fine.
The only problem I see is with rounding, like I mentioned before at two
places1. Rounding at do_div() in task_utime()
2. Rounding in conversion from clock_t_to_cputime()I have tried and not had any success reproducing the problem, could you
please help me with some pointers/steps to reproduce the problem?--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
I think here is the problem. How can we be sure? We cant. utime and stime
are sampled, so they can be largely off in any direction,if the program
sleeps often and manages to synchronize itself to the timer tick. Lets say
a program only does a simple system call and then sleeps. So sum_exec_runtime
is increased by lets say 1000 cycles on a 1Ghz box which means 1000ns. If now
the timer tick happens exactly at this moment, stime is increased by 1 tick
= 1000000ns.Maybe there is some magic in the code which I did not see, but obviously
the problem exists and looking at Frans data (stime+utime) are not decreasing,
but stime isnt and utime is. If you look at Frans data you see:
Oct 16 11:54:48 8 10
Oct 16 11:54:49 6 12 <-- utime
Oct 16 11:54:50 6 12
Oct 16 11:54:51 6 12
Oct 16 11:54:52 8 10 <-- stime
Oct 16 11:54:53 8 10
Oct 16 11:54:54 8 10
Oct 16 11:54:55 8 12
Oct 16 11:54:56 8 12(stime+utime) is constant. That means that S2-S1 is obviously smaller than
one tick (See the calculation in task_stime). I am quite sure it is caused-
Yes, I thought of that just after I sent out my email. In the case that
you mention, the utime and stime accounting is incorrect anyway :-)
I think we need to find a better solution. I was going to propose that
we round correctly in (the divisions in)1. task_utime()
2. clock_t_to_cputime()I suspect we'll need to round task_utime() to p->utime if the value of
task_utime() < p->utime and the same thing for task_stime(). I've tried
reproducing the problem on my UML setup without any success. Let meYes, very interesting observation.
[snip]
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
Hello Balbir,
Any progress on this issue? I noticed that it's still there in current git.
If a better implementation is not expected any time soon, how about an ACK
on the reversion patch Christian proposed in
http://lkml.org/lkml/2007/10/16/76
so we can at least get rid of the regression?Thanks,
Frans Pop
-
We'll I initially thought of it and then I remembered that the
regression occurs only when the accounting itself is inaccurate.
I am tempted to ack the removal, but I would like to get input--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
i've got a patch from Peter queued up. (see below) This should fix the
main issue.Ingo
-------------------->
Subject: sched: keep utime/stime monotonic
From: Peter Zijlstra <a.p.zijlstra@chello.nl>keep utime/stime monotonic.
cpustats use utime/stime as a ratio against sum_exec_runtime, as a
consequence it can happen - when the ratio changes faster than time
accumulates - that either can be appear to go backwards.Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
fs/proc/array.c | 3 ++-
include/linux/sched.h | 1 +
kernel/fork.c | 1 +
3 files changed, 4 insertions(+), 1 deletion(-)Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c
+++ linux/fs/proc/array.c
@@ -358,7 +358,8 @@ static cputime_t task_utime(struct task_
}
utime = (clock_t)temp;- return clock_t_to_cputime(utime);
+ p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+ return p->prev_utime;
}static cputime_t task_stime(struct task_struct *p)
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1009,6 +1009,7 @@ struct task_struct {
unsigned int rt_priority;
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
+ cputime_t prev_utime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -1056,6 +1056,7 @@ static struct task_struct *copy_process(
p->gtime = cputime_zero;
p->utimescaled = cputime_zero;
p->stimescaled = cputime_zero;
+ p->prev_utime = cputime_zero...
We'll also need this additional patch (untested), but in the long run
I think the approach needs to be1. Update stime and utime at the time of context switching -- keep it
in sync with p->sum_exec_runtime
2. Keep track of system/user context at system call entry pointsExtend Peter's patch to fix accounting issues.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---fs/proc/array.c | 3 ++-
include/linux/sched.h | 2 +-
kernel/fork.c | 1 +
3 files changed, 4 insertions(+), 2 deletions(-)diff -puN fs/proc/array.c~fix-accounting-issue-extended fs/proc/array.c
--- 2.6.24-rc1/fs/proc/array.c~fix-accounting-issue-extended 2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/fs/proc/array.c 2007-10-30 03:09:46.000000000 +0530
@@ -374,7 +374,8 @@ static cputime_t task_stime(struct task_
stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
cputime_to_clock_t(task_utime(p));- return clock_t_to_cputime(stime);
+ p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+ return p->prev_stime;
}
#endifdiff -puN include/linux/sched.h~fix-accounting-issue-extended include/linux/sched.h
--- 2.6.24-rc1/include/linux/sched.h~fix-accounting-issue-extended 2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/include/linux/sched.h 2007-10-30 03:08:32.000000000 +0530
@@ -1004,7 +1004,7 @@ struct task_struct {
unsigned int rt_priority;
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
- cputime_t prev_utime;
+ cputime_t prev_utime, prev_stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
diff -puN kernel/fork.c~fix-accounting-issue-extended kernel/fork.c
--- 2.6.24-rc1/kernel/fork.c~fix-accounting-issue-extended 2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/kernel/fork.c 2007-10-30 03:08:42.000000000 +0530
@@ -1057,6 +1057,7 @@ static struc...
This is similar to what s390 and ppc64 do when CONFIG_VIRT_CPU_ACCOUNTING is
set. Whenever you do a patch, please make sure to not break the already
precise values for utime and stime on s390 and ppc. Feel free to CC me
whenever you have a patch for this.Christian
-
Yes, true. I'll cc everyone on this list and send out the patch as RFC.
Much of this code will be architecture specific and as you rightly--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
OK. Both patches together do the trick. Gave it a nice long test run and got
no more weirdness.Feel free to send me a patch for testing when you have one.
Cheers,
Frans
-
Thanks for testing it
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
cool, thanks! I've queued it up for the next scheduler batch.
Ingo
-
It would be nice to put those two on stable point release too.
--
O T A V I O S A L V A D O R
---------------------------------------------
E-mail: otavio@debian.org UIN: 5906116
GNU/Linux User: 239058 GPG ID: 49A5F855
Home Page: http://otavio.ossystems.com.br
---------------------------------------------
"Microsoft sells you Windows ... Linux gives
you the whole house."
-
[...]
I dont think it will work. It will make utime monotic, but stime can still
decrease. For example let sum_exec_runtime increase by a tiny little bit while
utime will get a full additional tick. stime is sum-utime. So stime can still
go backwards. So I think that we need this kind of logic for stime as well,
no?Christian
-
yeah, probably. Peter?
Ingo
-
Yes, definitely :-)
With this patch stime is still all over the place.
Oct 29 22:12:39 314 64
Oct 29 22:12:40 392 68
Oct 29 22:12:41 408 67 <--
Oct 29 22:12:42 410 67
Oct 29 22:12:43 416 68
Oct 29 22:12:44 420 68
Oct 29 22:12:45 424 68
Oct 29 22:12:46 426 68
Oct 29 22:12:47 430 70
Oct 29 22:12:48 430 70
Oct 29 22:12:49 430 70
Oct 29 22:12:50 432 68 <--
Oct 29 22:12:51 432 69
Oct 29 22:12:52 432 69
Oct 29 22:12:53 432 69
Oct 29 22:12:54 432 69
Oct 29 22:12:55 432 69
Oct 29 22:12:56 433 70
Oct 29 22:12:57 434 69 <--
Oct 29 22:12:58 443 71utime looks OK now, though I'd like to test it a bit more (when stime is
fixed too) before giving a final verdict on that.
-
/me dons the brown paper bag while mumbling an agreement of sorts.
I'll not attempt to come up with a patch as I fear I'll just make a
bigger mess in my current state, hope to feel better tomorrow..-
I only remembered stime going backwards, but looking back at the mails
you are right, there has been one case of utime going backwards too.I can reproduce the problem reliably for two KDE programs I have running
permanently on my system: kontact and amarok. Both of these wake up
regularly and use some (not very much) processor capacity before going to
sleep again. Any process that has that that characteristic should do./me thinks a bit and tries something
lol - I have just managed to reproduce this with 'top' itself :-P
Run 'top' in one terminal; then in another terminal:
$ ps ax | grep " [t]op"
17869 pts/0 S+ 0:00 top
$ while true; do awk '{print $14" "$15}' /proc/17869/stat; sleep 1; done | ts
[...]
Oct 16 11:54:48 8 10
Oct 16 11:54:49 6 12 <-- utime
Oct 16 11:54:50 6 12
Oct 16 11:54:51 6 12
Oct 16 11:54:52 8 10 <-- stime
Oct 16 11:54:53 8 10
Oct 16 11:54:54 8 10
Oct 16 11:54:55 8 12
Oct 16 11:54:56 8 12This example happens to have both stime _and_ utime decreasing...
'ts' is a small utility that prints the timestamps before the readings;
the test will work just as well without it.
Note that it may take a while for the error to show up. In this case it
was 40 seconds.Cheers,
Frans Pop
-
Thanks, I know why I am not seeing the problem. My box has
CONFIG_VIRT_CPU_ACCOUNTING set. I'll find another box and try.--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
| Linus Torvalds | Linux 2.6.27-rc5 |
| Jared Hulbert | [PATCH 00/10] AXFS: Advanced XIP filesystem |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Linus Torvalds | Linux 2.6.27-rc8 |
git: | |
| David Miller | [GIT]: Networking |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Mark McLoughlin | [PATCH] bridge: make bridge-nf-call-*tables default configurable |
| Gerrit Renker | [PATCH 03/37] dccp: List management for new feature negotiation |
