Please consider this patch for 2.6.23.2 http://lkml.org/lkml/2007/10/4/389 tested by me in http://lkml.org/lkml/2007/10/5/150 to fix the regression first reported in http://lkml.org/lkml/2007/10/3/123 Cause of the wrong display in top is the value of stime in /proc/<pid>/stat decreasing occasionally. The issue is also registered as: http://bugzilla.kernel.org/show_bug.cgi?id=9135 Thanks, Frans Pop -
Is it already in Linus's tree? If so, do you have a git commit id? If not, please let us (stable@) know when it is, and what the id is, and then we can add it to our tree. thanks, greg k-h -
Not AFAICT. CCing Christian (as patch author) and Ingo (as author of the change that caused the regression) so they can push it through the correct channels. -
I dont know how to proceed with this issue. The more I think about it, the more I am convinced that using sum_exec_runtime together with sampled utime and stime will never guarantee monotonicity for utime and stime in proc. Just imagine an process with 9 ticks for utime and 0 ticks for stime. If we now sample one tick for stime (but having only a small increase in sum_exec_runtime) the next utime value will only be 90% of the last value. So returning to the 2.6.22 model seems to be the safest solution until somebody else comes up with an idea that works proper. Ingo, any opinion? Christian -
Chuck, Balbir,
we still have a problem with stime occosionally going backwards. I stated
below that I think this is not fixable with the current utime/stime split
algorithm.
Balbir, you wrote this code, Chuck you tried to fix it. Any ideas how to
fix this properly? The only idea I have requires that we save the old value
of utime and stime and therefore requires additional locking.
Otherwise I suggest to apply my reversion patch from below:
--------- patch-----------
Fix stime going backwards
after 2.6.22 the accounting was changes to use sched_clock and to use stime
and utime only for the split. This is where task_utime and task_stime were
introduced. See the code in 2.6.23-rc1.
Unfortunately this broke the accouting on s390 which already has precise
numbers in utime and stime. So the code was partially reverted to use stime
and utime again where appropriate, which are of type cputime_t.
It now seems, that the rest of the logic is still broken.
This patch basically reverts the rest of b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa
and should restore the 2.6.22 behavior. The process time is used from tasks
utime and stime instead of the scheduler clock. That means, in general after
a long period of time, it is less accurate than the sum_exec_runtime based
code, but it doesnt break top.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
fs/proc/array.c | 37 -------------------------------------
1 file changed, 37 deletions(-)
Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -323,7 +323,6 @@ int proc_pid_status(struct task_struct *
/*
* Use precise platform statistics if available:
*/
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
static cputime_t task_utime(struct task_struct *p)
{
return p->utime;
@@ -333,42 +332,6 @@ static cputime_t task_stime(struct task_
{
return p->stime;
}
-#else
-static cputime_t ...Hi,
I missed seeing this problem before, sorry about that. Thanks for the
I am trying to think out loud as to what the root cause of the problem
might be. In one of the discussion threads, I saw utime going backwards,
which seemed very odd, I suspect that those are rounding errors.
I don't understand your explanation below
Initially utime = 9, stime = 0, sum_exec_runtime = S1
Later
utime = 9, stime = 1, sum_exec_runtime = S2
We can be sure that S >= (utime + stime)
If S2 = S1 + delta, then as per our calculation
Initially
utime_proc = (utime * (S1))/(utime + stime)
= nsec_to_clock_t(9 * S1 / 9)
later
utime_proc = nsec_to_clock_t(9 * S2/10)
Given that S >= (utime + stime), we should be fine.
The only problem I see is with rounding, like I mentioned before at two
places
1. Rounding at do_div() in task_utime()
2. Rounding in conversion from clock_t_to_cputime()
I have tried and not had any success reproducing the problem, could you
please help me with some pointers/steps to reproduce the problem?
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
I only remembered stime going backwards, but looking back at the mails
you are right, there has been one case of utime going backwards too.
I can reproduce the problem reliably for two KDE programs I have running
permanently on my system: kontact and amarok. Both of these wake up
regularly and use some (not very much) processor capacity before going to
sleep again. Any process that has that that characteristic should do.
/me thinks a bit and tries something
lol - I have just managed to reproduce this with 'top' itself :-P
Run 'top' in one terminal; then in another terminal:
$ ps ax | grep " [t]op"
17869 pts/0 S+ 0:00 top
$ while true; do awk '{print $14" "$15}' /proc/17869/stat; sleep 1; done | ts
[...]
Oct 16 11:54:48 8 10
Oct 16 11:54:49 6 12 <-- utime
Oct 16 11:54:50 6 12
Oct 16 11:54:51 6 12
Oct 16 11:54:52 8 10 <-- stime
Oct 16 11:54:53 8 10
Oct 16 11:54:54 8 10
Oct 16 11:54:55 8 12
Oct 16 11:54:56 8 12
This example happens to have both stime _and_ utime decreasing...
'ts' is a small utility that prints the timestamps before the readings;
the test will work just as well without it.
Note that it may take a while for the error to show up. In this case it
was 40 seconds.
Cheers,
Frans Pop
-
Thanks, I know why I am not seeing the problem. My box has CONFIG_VIRT_CPU_ACCOUNTING set. I'll find another box and try. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
I think here is the problem. How can we be sure? We cant. utime and stime are sampled, so they can be largely off in any direction,if the program sleeps often and manages to synchronize itself to the timer tick. Lets say a program only does a simple system call and then sleeps. So sum_exec_runtime is increased by lets say 1000 cycles on a 1Ghz box which means 1000ns. If now the timer tick happens exactly at this moment, stime is increased by 1 tick = 1000000ns. Maybe there is some magic in the code which I did not see, but obviously the problem exists and looking at Frans data (stime+utime) are not decreasing, but stime isnt and utime is. If you look at Frans data you see: Oct 16 11:54:48 8 10 Oct 16 11:54:49 6 12 <-- utime Oct 16 11:54:50 6 12 Oct 16 11:54:51 6 12 Oct 16 11:54:52 8 10 <-- stime Oct 16 11:54:53 8 10 Oct 16 11:54:54 8 10 Oct 16 11:54:55 8 12 Oct 16 11:54:56 8 12 (stime+utime) is constant. That means that S2-S1 is obviously smaller than one tick (See the calculation in task_stime). I am quite sure it is caused -
Yes, I thought of that just after I sent out my email. In the case that you mention, the utime and stime accounting is incorrect anyway :-) I think we need to find a better solution. I was going to propose that we round correctly in (the divisions in) 1. task_utime() 2. clock_t_to_cputime() I suspect we'll need to round task_utime() to p->utime if the value of task_utime() < p->utime and the same thing for task_stime(). I've tried reproducing the problem on my UML setup without any success. Let me Yes, very interesting observation. [snip] -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
Hello Balbir, Any progress on this issue? I noticed that it's still there in current git. If a better implementation is not expected any time soon, how about an ACK on the reversion patch Christian proposed in http://lkml.org/lkml/2007/10/16/76 so we can at least get rid of the regression? Thanks, Frans Pop -
We'll I initially thought of it and then I remembered that the regression occurs only when the accounting itself is inaccurate. I am tempted to ack the removal, but I would like to get input -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
i've got a patch from Peter queued up. (see below) This should fix the
main issue.
Ingo
-------------------->
Subject: sched: keep utime/stime monotonic
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
keep utime/stime monotonic.
cpustats use utime/stime as a ratio against sum_exec_runtime, as a
consequence it can happen - when the ratio changes faster than time
accumulates - that either can be appear to go backwards.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
fs/proc/array.c | 3 ++-
include/linux/sched.h | 1 +
kernel/fork.c | 1 +
3 files changed, 4 insertions(+), 1 deletion(-)
Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c
+++ linux/fs/proc/array.c
@@ -358,7 +358,8 @@ static cputime_t task_utime(struct task_
}
utime = (clock_t)temp;
- return clock_t_to_cputime(utime);
+ p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+ return p->prev_utime;
}
static cputime_t task_stime(struct task_struct *p)
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1009,6 +1009,7 @@ struct task_struct {
unsigned int rt_priority;
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
+ cputime_t prev_utime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -1056,6 +1056,7 @@ static struct task_struct *copy_process(
p->gtime = cputime_zero;
p->utimescaled = cputime_zero;
p->stimescaled = cputime_zero;
+ p->prev_utime = cputime_zero;
#ifdef CONFIG_TASK_XACCT
...[...] I dont think it will work. It will make utime monotic, but stime can still decrease. For example let sum_exec_runtime increase by a tiny little bit while utime will get a full additional tick. stime is sum-utime. So stime can still go backwards. So I think that we need this kind of logic for stime as well, no? Christian -
/me dons the brown paper bag while mumbling an agreement of sorts. I'll not attempt to come up with a patch as I fear I'll just make a bigger mess in my current state, hope to feel better tomorrow.. -
Yes, definitely :-) With this patch stime is still all over the place. Oct 29 22:12:39 314 64 Oct 29 22:12:40 392 68 Oct 29 22:12:41 408 67 <-- Oct 29 22:12:42 410 67 Oct 29 22:12:43 416 68 Oct 29 22:12:44 420 68 Oct 29 22:12:45 424 68 Oct 29 22:12:46 426 68 Oct 29 22:12:47 430 70 Oct 29 22:12:48 430 70 Oct 29 22:12:49 430 70 Oct 29 22:12:50 432 68 <-- Oct 29 22:12:51 432 69 Oct 29 22:12:52 432 69 Oct 29 22:12:53 432 69 Oct 29 22:12:54 432 69 Oct 29 22:12:55 432 69 Oct 29 22:12:56 433 70 Oct 29 22:12:57 434 69 <-- Oct 29 22:12:58 443 71 utime looks OK now, though I'd like to test it a bit more (when stime is fixed too) before giving a final verdict on that. -
We'll also need this additional patch (untested), but in the long run
I think the approach needs to be
1. Update stime and utime at the time of context switching -- keep it
in sync with p->sum_exec_runtime
2. Keep track of system/user context at system call entry points
Extend Peter's patch to fix accounting issues.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
fs/proc/array.c | 3 ++-
include/linux/sched.h | 2 +-
kernel/fork.c | 1 +
3 files changed, 4 insertions(+), 2 deletions(-)
diff -puN fs/proc/array.c~fix-accounting-issue-extended fs/proc/array.c
--- 2.6.24-rc1/fs/proc/array.c~fix-accounting-issue-extended 2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/fs/proc/array.c 2007-10-30 03:09:46.000000000 +0530
@@ -374,7 +374,8 @@ static cputime_t task_stime(struct task_
stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
cputime_to_clock_t(task_utime(p));
- return clock_t_to_cputime(stime);
+ p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+ return p->prev_stime;
}
#endif
diff -puN include/linux/sched.h~fix-accounting-issue-extended include/linux/sched.h
--- 2.6.24-rc1/include/linux/sched.h~fix-accounting-issue-extended 2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/include/linux/sched.h 2007-10-30 03:08:32.000000000 +0530
@@ -1004,7 +1004,7 @@ struct task_struct {
unsigned int rt_priority;
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
- cputime_t prev_utime;
+ cputime_t prev_utime, prev_stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
diff -puN kernel/fork.c~fix-accounting-issue-extended kernel/fork.c
--- 2.6.24-rc1/kernel/fork.c~fix-accounting-issue-extended 2007-10-30 03:08:18.000000000 +0530
+++ 2.6.24-rc1-balbir/kernel/fork.c 2007-10-30 03:08:42.000000000 +0530
@@ -1057,6 +1057,7 @@ static struct task_struct ...OK. Both patches together do the trick. Gave it a nice long test run and got no more weirdness. Feel free to send me a patch for testing when you have one. Cheers, Frans -
cool, thanks! I've queued it up for the next scheduler batch. Ingo -
It would be nice to put those two on stable point release too.
--
O T A V I O S A L V A D O R
---------------------------------------------
E-mail: otavio@debian.org UIN: 5906116
GNU/Linux User: 239058 GPG ID: 49A5F855
Home Page: http://otavio.ossystems.com.br
---------------------------------------------
"Microsoft sells you Windows ... Linux gives
you the whole house."
-
Thanks for testing it -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
This is similar to what s390 and ppc64 do when CONFIG_VIRT_CPU_ACCOUNTING is set. Whenever you do a patch, please make sure to not break the already precise values for utime and stime on s390 and ppc. Feel free to CC me whenever you have a patch for this. Christian -
Yes, true. I'll cc everyone on this list and send out the patch as RFC. Much of this code will be architecture specific and as you rightly -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
