Hi,
I'm glad to announce a working prototype of the basic algorithm I
already suggested last time.
As I already tried to explain previously CFS has a considerable
algorithmic and computational complexity. This patch should now make it
clearer, why I could so easily skip over Ingo's long explanation of all
the tricks CFS uses to keep the computational overhead low - I simply
don't need them. The following numbers are based on a 2.6.23-rc3-git1 UP
kernel, the first 3 runs are without patch, the last 3 runs are with the
patch:
Xeon 1.86GHz:
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
suse Linux 2.6.23- 0.9500 1.3900 1.1700 1.6500 1.1700 1.68000 1.22000
suse Linux 2.6.23- 0.9700 1.3600 1.1400 1.6100 1.1500 1.67000 1.18000
suse Linux 2.6.23- 0.9800 1.4000 1.1500 1.6600 1.1400 1.70000 1.19000
suse Linux 2.6.23- 0.7500 1.2000 1.0000 1.4800 0.9800 1.52000 1.06000
suse Linux 2.6.23- 0.7800 1.2200 0.9800 1.4600 1.0100 1.53000 1.05000
suse Linux 2.6.23- 0.8300 1.2200 1.0000 1.4800 0.9800 1.52000 1.05000
Celeron 2.66GHz:
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
devel6 Linux 2.6.23- 3.1300 3.3800 5.9000 5.8300 18.1 9.33000 18.0
devel6 Linux 2.6.23- 3.1400 3.2800 7.0700 6.0900 17.6 9.18000 17.8
devel6 Linux 2.6.23- 3.1400 3.6000 3.5400 6.1300 17.2 9.02000 17.2
devel6 Linux 2.6.23- 2.7400 3.0300 5.5200 6.4900 ...(finding it difficult to resist the urge to go shopping, I
fast-forwarded to test drive... grep shopping arch/i386/kernel/tcs.c if
you're curious;)
I plunked it into 2.6.23-rc4 to see how it reacts to various sleeper
loads, and hit some starvation. If I got it in right (think so) there's
a bug lurking somewhere. taskset -c 1 fairtest2 resulted in the below.
It starts up running both tasks at about 60/40 for hog/sleeper, then
after a short while goes nuts. The hog component eats 100% cpu and
starves the sleeper (and events, forever).
runnable tasks:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------
events/1 8 13979193020350 -3984516524180 541606276813 2014 115 0 0 0 0 0
R fairtest2 7984 10027571241955 -7942765479096 5707836335486 294 120 0 0 0 0 0
fairtest2 7989 13539381091732 -4424328443109 8147144513458 286 120 0 0 0 0 0
taskset -c 1 massive_intr 3 9999 produces much saner looking numbers,
and is fair...
runnable tasks:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------
massive_intr 12808 29762662234003 21699 -506538 4650 120 0 0 0 0 0
R massive_intr 12809 29762662225939 ...Hi, Thanks for testing, although your test program does nothing unusual here. Can you please send me your .config? Were there some kernel messages while running it? bye, Roman -
I didn't look actually, was in rather a hurry. I'll try it again tomorrow. -Mike
hey, thanks for working on this, it's appreciated! In terms of merging your stuff, your patch looks a bit large and because you did not tell us that you were coding in this area, you probably missed Peter Zijlstra's excellent CFS work: http://programming.kicks-ass.net/kernel-patches/sched-cfs/ The following portion of Peter's series does much of the core math changes of what your patch provides (and which makes up for most of the conceptual delta in your patch), on a step by step basis: sched-update_weight_inv.patch sched-se_load.patch sched-se_load-2.patch sched-64-wait_runtime.patch sched-scaled-limit.patch sched-wait_runtime-scale.patch sched-calc_delta.patch So the most intrusive (math) aspects of your patch have been implemented already for CFS (almost a month ago), in a finegrained way. Peter's patches change the CFS calculations gradually over from 'normalized' to 'non-normalized' wait-runtime, to avoid the normalizing/denormalizing overhead and rounding error. Turn off sleeper fairness, remove the limit code and we should arrive to something quite close to the core math in your patch, with similar rounding properties and similar overhead/complexity. (there are some other small details in the math but this is the biggest item by far.) I find Peter's series very understandable and he outlined the math arguments in his replies to your review mails. (would be glad to re-open those discussions though if you still think there are disagreements.) ( your characterisation errs in that it makes it appear to be a common problem, while in practice it's only a corner-case limited to extreme negative nice levels and even there it needs a very high rate of scheduling and an artificially constructed workload: several hundreds of thousand of context switches per second with a yield-ing loop to be even measurable with unmodified CFS. So this is not a 2.6.23 issue at all - unless there's some testcase that proves the ...
Hi, Actually it changes wait-runtime to a normalized value and it changes nothing about the rounding error I was talking about. It addresses the conversion error between the different units I was Did you even try to understand what I wrote? I didn't say that it's a "common problem", it's a conceptual problem. The rounding has been improved lately, so it's not as easy to trigger with some simple busy loops. Peter's patches don't remove limit_wait_runtime() and AFAICT they can't, Everytime I'm amazed how quickly you get to your judgements... :-( Especially interesting is that you don't need to ask a single question for that, which would mean you actually understood what I wrote, OTOH your wild claims tell me something completely different. BTW who is "we" and how is it possible that this meta mind can come to such quick judgements? The basic concept is quite different enough, one can e.g. see that I have to calculate some of the key CFS variables for the debug output. The concepts are related, but they are definitively not "the same thing mathematically", the method of resolution is quite different, if you think otherwise then please _prove_ it. bye, Roman -
As i mentioned it in my first reply to you i really welcome your changes and your interest in the Linux scheduler, but i'm still somewhat surprised why you focus on rounding so much (and why you attack CFS's math implementation so vehemently and IMO so unfairly) - and we had this discussion before in the "CFS review" thread that you started. The kind of rounding error you seem to be worried about is very, very small. For normal nice-0 tasks it's in the "one part per a hundred million" range, or smaller. As a comparison: "top"'s CPU utilization statistics are accurate to "one part per thousand" - and that's roughly the precision range that humans care about. (Graphical tools are even coarser - one part per hundred or worse.) I suspect if rounding effects are measurable you will post numbers that prove it, correct? You did not do that so far. Or if they are not measurable for any workload we care about, why should we worry about it if it causes no problems in practice? (or if it causes problems, what are the actual effects and can they be addressed in a simpler way?) The main reason i'm interested in changing the fairness math under CFS (be that Peter's changes or your changes) is _not_ primarily to address any rounding behavior, but to potentially improve performance! Rounding errors are at most an academic issue, unless they are actually measurable in a workload we care about. (If rounding gets improved as a side-effect of a change that's an added, albeit lower-prio bonus.) But even more important is quality of scheduling - performance is secondary to that. (unless performance is so bad that it becomes a quality issue: such as an O(N) algorithm would do.) Your math is fairly simple (and that is _good_, just like CFS's existing math is simple), it can be summed up in essence as (without complicating it with nice-level weighting, for easy understandability): " use the already existing p->sum_exec_runtime 'task runtime' metric that CFS ...
Hi, The rounding is insofar a problem as it makes it very difficult to produce a correct mathematical model of CFS, which can be used to verify the working of code. With the recent rounding changes they have indeed little effect in the short term, especially as the error is distributed equally to the task, so every task gets relatively the same time, the error still exists nonetheless and adds up over time (although with the rounding changes it doesn't grow into one direction anymore). The problem is now the limiting, which you can't remove as long as the error exists. Part of this problem is that CFS doesn't really maintain a balanced sum of the available runtime, i.e. the sum of all updated wait_runtime values of all active tasks should be zero. This means under complex loads it's possible to hit the wait_runtime limits and the overflow/underflow time is lost in the calculation and this value can be easily in the millisecond range. To be honest I have no idea how real this problem in the praxis is right now, quite possibly people will only see it as a small glitch and in most cases they won't even notice. Previously it had been rather easy to trigger these problems due to the rounding problems. The main problem for me here is now that with any substantial change in CFS I'm running risk of making this worse and noticable again somehow. For example CFS relies on the 64bit math to have enough resolution so that the errors aren't noticable. That's why I needed a mathematical model I can work with and with it for example I can easily downscale the math to 32bit and I know exactly the I really wouldn't call CFS's existing math simple, the basic ideas are indeed simple, but that the math of the actual implementation is quite a Well, it's part of the math, but I wouldn't call it the "essence" of the _math_, as it leaves many questions unanswered, the essence of the math mainly involves how the problems are solved created by the weighting. If you want ...
So, its like i roll a dice repeatedly and subtract 3.5 every time? Even So, given the above example, if the result of the dice rolls ever exceeds +/- 1,000,000 i´ll get some ugly timing glitches? As the number of dice rolls grows towards infinity the chance of remaining within this boundary goes steadily towards 0%. What does this equate to in the real world? Weeks, month, years, If i understand the problem correctly these errors occur due to the fact that delta-values are used instead of recalculating the "fair" process time every "dice" roll? Somehow the dead simple solution would seem to "reset" or "reinitialise" the "dice" roll every million rolls or so. Or, to put this into context again, figure out how large the accumulated error would have to be to be noticable. Then figure out how long it would take for such an error occur in the real world and just reinitialise the damn thing. Should´nt be too complicated stochastics. Greetings, Syren -
Hi,
On Fri, 31 Aug 2007, Ingo Molnar wrote:
Maybe I should explain for everyone else (especially after seeing some of
the comments on kerneltrap), why I reacted somewhat irritated on what
looks like such an innocent mail.
The problem is without the necessary background one can't know how wrong
I'd expect Ingo to know better, but the more he refuses to answer my
questions, the more I doubt it, at least than it comes to the math part.
While Peter's patches are interesting, they are only a small step to what
I'm trying to achieve. With these patches the basic CFS math pretty much
looks like this:
sum_{t}^{T}(round(time_{t} * round(WMULT / weight_{t}) * WEIGTH0 / WMULT))
= sum_{r}^{R}(round(time_{r} * round(WMULT / weight_sum) * WEIGTH0 / WMULT))
It's based on this equation:
sum_{t}^{T}(time_{t} / weight_{t}) = sum_{r}^{R}(time_{r} / weight_sum)
This is the time a single task gets relative to the the runtime of all
tasks. These sums are incrementally added/substracted to wait_runtime and
it should stay around zero.
All Peter's wait_runtime-scale patch does is to move the weight_{t} from
one side to the other and that's it, it changes _nothing_ about the
rounding above - "the rounding corner-cases" are still there.
In my announcement I describe now in quite some detail, how I get rid of
this rounding effect. The only rounding from above equation which is still
left is "round(WMULT / weight_{t})", but this is luckily a constant and so
is the time one task gets relative to another (unless reniced).
Anyone who has actually read and understood what I wrote will hopefully
I'm not repeating again the whole math here, if anyone has questions about
it, I'll try my best to answer them. So instead here are some of the
intrusive aspects, which supposedly have been implemented already.
One key difference is that I don't maintain the global sum (fair_clock)
directly anymore, I can calculate it if needed, but it's not used to
update ...The "math part" is important if you're doing a thesis defense, but demonstrating better behavior under some realistic load would probably be a better starting point. Maybe Ingo missed something in your math, and maybe he just didn't find a benefit. The ck and sd schedulers developed over a long time of public use and redefinition of goals. The cfs developed faster, but still had months of testing and the benefit of rt experience. Your scheduler is more of a virgin birth, no slow public discussion before, no time yet for people to run it under real loads, you're seeing first impressions. You dropped this on the world two days before a major US holiday, at the end of the summer when those of us not on vacation may be covering for those who are, did you expect Ingo to drop his regular work to look at your stuff? And do you think there are many people who can look at your math, and look at your code, and then have any clue how well it works in practice? I bet there aren't ten people in the world who would even claim to do that, and half of them are kidding themselves. So give people a few weeks to see if the rounding errors you eliminated mean anything in practice. Faster context is nice, but it's not something most people count as their major problem. I did a *lot* of cfs vs. sd testing, not just all the glitch1 tests I posted here, lots of stuff I just send to Ingo, and lots of tests like IPCbench and NNTP server loading showed nothing. (That's good, it means cfs isn't worse than the old scheduler for those loads.) I played with the tuning, I even diddled the code a bit, without finding meaningful improvement, so your possibly better math may not change the overall behavior much. I will wait until I post actual test results before I say any more. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot -
Hi, That wasn't my design goal. The initial trigger for this was all the 64bit math in CFS and the question how can this be tuned using arch specific knowledge about the input values. For this I needed a correct and verifyable model to analyze, so I know where the limits are and how it reacts to different sets of inputs. This is where I got into trouble with all the rounding - I couldn't substantially change the math without convincing myself that it's still correct for all kinds of input data. That's why I completely redid the math from scratch - it's based on the same basic ideas, but the solution and thus the math is quite different. The main reason I didn't concentrate on the behaviour so much is that since both scheduler are conceptually not that far apart, it should be possible to apply any tuning done to CFS also to my model. But this requires someone actually understands what tuning was done and it wasn't done by random changes and seeing what happens, i.e. someone should be able to explain it to me. BTW in this context it's rather interesting to see that Ingo attacks me now for not having done this tuning yet and for which I explicitely asked Maybe he didn't understand it at all and is too proud to admit it? Did I demand anything like that? It had been fine if he had been asking for more time or if he had more questions first, but it took him only a few hours to come to his conclusion without any need for further I don't think the math is that complex, it may need some more explanations outside the math formulas though. But the math is needed to analyze what effect changes to scheduler have, otherwise there is a risk it becomes only guesswork with random changes which may help in some situations but break others. bye, Roman -
Out of curiosity I was reviewing your patch .. Since you create kernel/sched_norm.c as a copy of kernel/sched_fair.c it was hard to see what had changed .. So I re-diffed your patch to eliminate kernel/sched_norm.c and just make the changes to kernel/sched_fair.c .. The the patch is near the end of this email.. The most notable thing about the rediff is the line count, 4 files changed, 323 insertions(+), 729 deletions(-) That's impressive (assuming my rediff is correct). Although I don't know for certain how that line reduction is achieved.. I also ran hackbench (in a haphazard way) a few times on it vs. CFS in I read your description, but I was distracted by this latex style notation .. Could you walk through in english what these two equation are doing .. The patch below is on top of my tree (2.6.23-rc5-dw1) , ftp://source.mvista.com/pub/dwalker/rt/ Daniel --- include/linux/sched.h | 21 - kernel/sched.c | 186 ++++------- kernel/sched_debug.c | 26 - kernel/sched_fair.c | 819 +++++++++++++------------------------------------- 4 files changed, 323 insertions(+), 729 deletions(-) Index: linux-2.6.22/include/linux/sched.h =================================================================== --- linux-2.6.22.orig/include/linux/sched.h +++ linux-2.6.22/include/linux/sched.h @@ -1046,40 +1046,29 @@ struct load_weight { * * Current field usage histogram: * - * 4 se->block_start * 4 se->run_node - * 4 se->sleep_start - * 4 se->sleep_start_fair * 6 se->load.weight - * 7 se->delta_fair - * 15 se->wait_runtime */ struct sched_entity { - long wait_runtime; - unsigned long delta_fair_run; - unsigned long delta_fair_sleep; - unsigned long delta_exec; - s64 fair_key; + s64 time_key; struct load_weight load; /* for load-balancing */ struct rb_node run_node; - unsigned int on_rq; + unsigned int on_rq, queued;; + unsigned int weight_shift; u64 exec_start; ...
Yeah, at first glance i liked that too, then i looked into the diff and
noticed that a good chunk of the removal "win" comes from the removal of
~35 comment blocks while adding new code that has no comments at all
(!).
And if you look at the resulting code size/complexity, it actually
increases with Roman's patch (UP, nodebug, x86):
text data bss dec hex filename
13420 228 1204 14852 3a04 sched.o.rc5
13554 228 1228 15010 3aa2 sched.o.rc5-roman
Although it _should_ have been a net code size win, because if you look
at the diff you'll see that other useful things were removed as well:
sleeper fairness, CPU time distribution smarts, tunings, scheduler
here are some actual numbers for "hackbench 50" on -rc5, 10 consecutive
runs fresh after bootup, Core2Duo, UP:
-rc5(cfs) -rc5+rfs
-------------------------------
Time: 3.905 Time: 4.259
Time: 3.962 Time: 4.190
Time: 3.981 Time: 4.241
Time: 3.986 Time: 3.937
Time: 3.984 Time: 4.120
Time: 4.001 Time: 4.013
Time: 3.980 Time: 4.248
Time: 3.983 Time: 3.961
Time: 3.989 Time: 4.345
Time: 3.981 Time: 4.294
-------------------------------
Avg: 3.975 Avg: 4.160 (+4.6%)
Fluct: 0.138 Fluct: 1.671
so unmodified CFS is 4.6% faster on this box than with Roman's patch and
it's also more consistent/stable (10 times lower fluctuations).
At lower hackbench levels (hackbench 10) the numbers are closer - that
could be what you saw.
But, this measurement too is apples to oranges, given the amount of
useful code the patch removes - fact is that you can always speed up the
scheduler by removing stuff, just run hackbench as SCHED_FIFO (via "chrt
-f 90 ./hackbench 50") to see what a minimal scheduler can do.
It would ...To be fair to Roman, he probably started development off an earlier CFS, most probably 2.6.23-rc3-git1, if I guess correctly from his original posting. So it's likely he missed out on some of the tunings/comments(?) Again, it would be interesting to benchmark against 2.6.23-rc3-git1. And also probably rediff vs 2.6.23-rc3-git1 and compare how the code actually changed ... but admittedly, doing so would be utterly pointless, because much water has flowed down the Ganges since -rc3 (misc CFS improvements, Peter's patches that you mentioned). So a "look, I told you so" kind of Absolutely. And if there indeed are net improvements (be it for corner cases) over latest CFS-rc5, while maintaining performance for the common cases at the same time, well, that can only be a good thing. Just my Rs. 0.02, Satyam -
actually, here are the rc3->rc5 changes to CFS: sched.c | 89 ++++++++++++++++++++++++------------ sched_debug.c | 3 - sched_fair.c | 142 +++++++++++++++++++++++++++++++++++++++++++++------------- sched_rt.c | 11 +++- 4 files changed, 182 insertions(+), 63 deletions(-) so since -rc3 CFS's size _increased_ (a bit). and i just checked, the sched.o codesize still increases even when comparing rc4 against rc4+patch (his original patch) and there are no comments added by Roman's patch at all. (all the comments in sched_norm.c were inherited from sched_fair.c and none of the new code comes with comments - this can be seen in Daniel's rediffed patch.) (and it's still not apples to oranges, for the reasons i outlined - so this whole comparison is unfair to CFS on several levels.) also, note that CFS's modularity probably enabled Roman to do a fairly stable kernel/sched_norm.c (as most of the post-rc3 CFS changes were not to sched.c but to sched_fair.c) with easy porting. So with the CFS modular framework you can easily whip up and prototype a new scheduler and name it whatever you like. [ i expect the RCFS (Really Completely yeah - and as i said to Roman, i like for example the use of a ready-queue instead of a run-queue. (but these are independent of the math changes, obviously.) I also think that the core math changes should be split from the Breshenham optimizations. I.e. the Breshenham _fract code should be done as a "this improves performance and improves rounding, without changing behavior" add-on ontop of a fairly simple core math change. I think Roman will easily be able to do this with a few hours of effort which should present much reduced .text overhead in his next version of the patch, to demonstrate the simplicity of his implementation of the CFS fairness math - this really isnt hard to do. Ingo -
Hi,
That's pretty easy to explain due to differences in inlining:
text data bss dec hex filename
15092 228 1204 16524 408c kernel/sched.o
15444 224 1228 16896 4200 kernel/sched.o.rfs
14708 224 1228 16160 3f20 kernel/sched.o.rfs.noinline
Sorry, but I didn't spend as much time as you on tuning these numbers.
Index: linux-2.6/kernel/sched_norm.c
===================================================================
--- linux-2.6.orig/kernel/sched_norm.c 2007-09-02 16:58:05.000000000 +0200
+++ linux-2.6/kernel/sched_norm.c 2007-09-02 16:10:58.000000000 +0200
@@ -145,7 +145,7 @@ static inline struct task_struct *task_o
/*
* Enqueue an entity into the rb-tree:
*/
-static inline void
+static void
__enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
@@ -192,7 +192,7 @@ __enqueue_entity(struct cfs_rq *cfs_rq,
se->queued = 1;
}
-static inline void
+static void
__dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if (cfs_rq->rb_leftmost == se) {
@@ -240,7 +240,7 @@ static void verify_queue(struct cfs_rq *
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
-static inline void update_curr(struct cfs_rq *cfs_rq)
+static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
Well, these are things I'd like you to explain a little, for example I
repeatedly asked you about the sleeper fairness and I got no answer.
BTW you seemed to haved missed that I actually give a bonus to sleepers
Was SCHED_DEBUG enabled or disabled for these runs?
bye, Roman
-
no, when generating those numbers i used:
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
# CONFIG_FORCED_INLINING is not set
(but i also re-did it for all the other combinations of these build
flags and similar results can be seen - your patch, despite removing
some changes did slip into your patch that have no other purpose but to
reduce code size:
+#ifdef CONFIG_SMP
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+#endif
[...]
+#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
unsigned long weighted_cpuload(const int cpu)
{
return cpu_rq(cpu)->ls.load.weight;
}
+#endif
[...]
so i thought you must be aware of the problem - at least considering how
much you've criticised CFS's "complexity" both in your initial review of
CFS (which included object size comparisons) and in this patch
submission of yours (which did not include object size comparisons
debugging disabled of course. (your patch has a self-validity checking
function [verify_queue()] that is called on SCHED_DEBUG=y, it would have
been unfair to test your patch with that included.)
Ingo
-
Hi, Ingo as long as you freely mix algorithmic and code complexity we won't get very far. I did code size comparisons for your _stable_ code, which was merged into Linus' tree. I explicitly said that my patch is only a prototype, so I haven't done any cleanups and tuning in this direction yet, so making any conclusions based on code size comparisions is quite ridiculous at this point. The whole point of this patch was to demonstrate the algorithmic changes, I'll look into it next week. bye, Roman -
thanks. FYI, there's plenty of time - i'll be at the KS next week so i'll be quite unresponsive to emails. Would be nice if you could take a quick look at the trivial patch i posted today though. How close is it to your algorithm, have i missed any important details? (not counting nice levels and rounding, it's just a quick & dirty prototype to show the first layer of the core math and nothing more.) Ingo -
Hi,
On Sun, 2 Sep 2007, Ingo Molnar wrote:
Below is a patch updated against the latest git tree, no major changes.
For a split version I'm still waiting for some more explanation about the
CFS tuning parameter.
I added another check for the debug version so that any inbalances (as
e.g. Mike seen it) should be reported. I left some of the proc parameter
visible even in the non-debug version, so one can play with it without
You'll be happy to know, that with the patch even the SMP code size
decreases.
BTW if I made the impression that overall code size must decrease, than I
apologize, overall code size is of course only an indication, IMO more
important is where that code is, i.e. that regularly executed code is as
compact as possible.
64bit math simply adds a level of bloatness, which is hard to avoid, but
with the fixed math it's actually possible to further reduce the code size
by getting rid of the 64bit math, so that anyone prefering a really
hackbench apparently works best if one doesn't wake up the consumer too
early, so in the patch I don't preempt a just woken process, which
produces far better numbers.
bye, Roman
---
include/linux/sched.h | 26 -
kernel/sched.c | 173 +++++-----
kernel/sched_debug.c | 26 -
kernel/sched_norm.c | 841 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 30 -
5 files changed, 948 insertions(+), 148 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -884,40 +884,28 @@ struct load_weight {
*
* Current field usage histogram:
*
- * 4 se->block_start
* 4 se->run_node
- * 4 se->sleep_start
- * 4 se->sleep_start_fair
* 6 se->load.weight
- * 7 se->delta_fair
- * 15 se->wait_runtime
*/
struct sched_entity {
- long wait_runtime;
- unsigned long delta_fair_run;
- unsigned ...Interesting, I see major behavioral changes. I still see an aberration with fairtest2. On startup, the hog component will consume 100% cpu for a bit, then the sleeper shows up. This doesn't always happen, but happens quite often. The time it takes for the sleeper to show up varies, but is roughly as below. With the previous kernel, sleeper starvation was permanent once the hog ran for a bit. This behavior inverted itself, starvation now goes away after a bit. 6573 root 20 0 1568 468 384 R 51 0.0 0:08.30 1 fairtest2 6574 root 20 0 1568 112 28 R 47 0.0 0:01.07 1 fairtest2 Once it coughs up it's fairtest2 startup-furball, it does well on mixed load (various interval sleepers + hog) fairness. The previous kernel did not do at all well at mixed load, as soon as I added a hog, it was all over for sleepers, and even without a hog, fairness among the various interval sleepers was not at all good. On the interactivity front, your first cut was not really usable here, but this one seems fine <not heavily tested caveat>. One test I do here is Amarok song switch time under hefty kbuild load, and this kernel passes that test just fine. I didn't try this test with your first cut, because it was very ragged even under modest load. All in all, this one is behaving quite well, which is radically different than my experience with first cut. Debug messages triggered on boot, but haven't triggered since. [ 113.426575] audit(1189232695.670:2): audit_backlog_limit=256 old=64 by auid=4294967295 res=1 [ 113.953958] IA-32 Microcode Update Driver: v1.14a <tigran@aivazian.fsnet.co.uk> [ 115.851209] audit(1189232698.095:3): audit_pid=5597 old=0 by auid=4294967295 [ 116.707631] 2,f73035a0(5624): 1fa78027f5c,1fb37ff0000,f7d035a0(5626),1 [ 116.723091] WARNING: at kernel/sched_norm.c:243 verify_queue() [ 116.737979] [<c0105188>] show_trace_log_lvl+0x1a/0x30 [ 116.752191] [<c0105d83>] show_trace+0x12/0x14 [ 116.765544] [<c0105d9b>] ...
They weren't all repeats after all, the last few were... [ 120.267389] 2,f73035a0(5624): 1fa7e90b58c,1fb3b460000,f73035a0(5624),5 [ 120.281110] WARNING: at kernel/sched_norm.c:413 entity_tick() [ 120.294101] [<c0105188>] show_trace_log_lvl+0x1a/0x30 [ 120.306425] [<c0105d83>] show_trace+0x12/0x14 [ 120.317857] [<c0105d9b>] dump_stack+0x16/0x18 [ 120.329124] [<c011eb68>] task_tick_fair+0x15d/0x170 [ 120.340988] [<c01221f3>] scheduler_tick+0x18d/0x2d5 [ 120.352896] [<c012f91d>] update_process_times+0x44/0x63 [ 120.365177] [<c014161e>] tick_sched_timer+0x5c/0xbd [ 120.377146] [<c013ca61>] hrtimer_interrupt+0x12c/0x1a0 [ 120.389392] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88 [ 120.402114] [<c0104c50>] apic_timer_interrupt+0x28/0x30 [ 120.414491] [<c0123c1b>] wake_up_new_task+0x72/0xad [ 120.426477] [<c0125ddc>] do_fork+0x13c/0x205 [ 120.437795] [<c0102258>] sys_clone+0x33/0x39 [ 120.449062] [<c0104182>] sysenter_past_esp+0x5f/0x85 [ 120.461047] ======================= -
Hi,
I found the problem for this. What basically happened is that a task that
hasn't been running for a second is enqueued first on an idle queue and it
keeps that advantage compared to tasks that had been running more recently
until it catched up. The new version will now remember where the last task
left off and use that for that first task which restarts the queue. As a
side effect it also limits the bonus a task gets if multiple tasks are
woken at the same time.
This version also limits againsts runtime overruns, that means a task
exceeds its time slice, because the timer interrupt was disabled and so
the current task wasn't preempted, this usually happens during boot, but
also when using a serial console.
bye, Roman
---
include/linux/sched.h | 26 -
kernel/sched.c | 173 +++++----
kernel/sched_debug.c | 26 -
kernel/sched_norm.c | 870 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 30 -
5 files changed, 977 insertions(+), 148 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -884,40 +884,28 @@ struct load_weight {
*
* Current field usage histogram:
*
- * 4 se->block_start
* 4 se->run_node
- * 4 se->sleep_start
- * 4 se->sleep_start_fair
* 6 se->load.weight
- * 7 se->delta_fair
- * 15 se->wait_runtime
*/
struct sched_entity {
- long wait_runtime;
- unsigned long delta_fair_run;
- unsigned long delta_fair_sleep;
- unsigned long delta_exec;
- s64 fair_key;
+ s64 time_key;
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
- unsigned int on_rq;
+ unsigned int on_rq, queued;
+ unsigned int weight_shift;
u64 exec_start;
...I still see the fairtest2 sleeper startup anomaly. Sometimes it starts up normally, others the sleeper is a delayed. Seems to require idle time to trigger worst case startup delay. 14854 root 20 0 1568 468 384 R 52 0.0 0:07.50 1 fairtest2 14855 root 20 0 1568 112 28 R 45 0.0 0:00.46 1 fairtest2 Everything else still seems fine. Boot-time warnings: [ 113.504259] audit(1189488395.753:2): audit_pid=5403 old=0 by auid=4294967295 [ 114.077979] IA-32 Microcode Update Driver: v1.14a <tigran@aivazian.fsnet.co.uk> [ 116.281216] 4,f70355a0(5633,b,d1cef00): 32af5bc18d4,31b49210000,f70355a0(5633,b,d1cef00),2 [ 116.298004] WARNING: at kernel/sched_norm.c:271 verify_queue() [ 116.312380] [<c0105188>] show_trace_log_lvl+0x1a/0x30 [ 116.326193] [<c0105d83>] show_trace+0x12/0x14 [ 116.339270] [<c0105d9b>] dump_stack+0x16/0x18 [ 116.352199] [<c011e61a>] verify_queue+0x2f8/0x3fb [ 116.365406] [<c011e8a3>] enqueue_entity+0x186/0x21b [ 116.378571] [<c0123344>] enqueue_task_fair+0x2f/0x31 [ 116.391545] [<c011d043>] enqueue_task+0xd/0x18 [ 116.403833] [<c011dffe>] activate_task+0x20/0x2d [ 116.416174] [<c0120336>] __migrate_task+0x9a/0xc4 [ 116.428490] [<c0122d0b>] migration_thread+0x175/0x220 [ 116.441159] [<c01393f7>] kthread+0x37/0x59 [ 116.452824] [<c0104dd3>] kernel_thread_helper+0x7/0x14 [ 116.465582] ======================= [ 116.476497] 4,f70355a0(5633,b,d1cef00): 32af5bc18d4,31b496c0000,f7e1a5a0(4179,3b0,232aaf800),4 [ 116.492899] WARNING: at kernel/sched_norm.c:271 verify_queue() [ 116.506459] [<c0105188>] show_trace_log_lvl+0x1a/0x30 [ 116.519302] [<c0105d83>] show_trace+0x12/0x14 [ 116.531313] [<c0105d9b>] dump_stack+0x16/0x18 [ 116.543334] [<c011e61a>] verify_queue+0x2f8/0x3fb [ 116.555710] [<c011ea1e>] update_curr+0xe6/0x102 [ 116.567654] [<c011ebb9>] task_tick_fair+0x3f/0x1f2 [ 116.579657] [<c01223f3>] scheduler_tick+0x18d/0x2d5 [ 116.591808] [<c012fa9d>] update_process_times+0x44/0x63 [ ...
Hi,
Damn, I forgot that tasks which are reniced or migrate to another cpu
need some more initialization, so the small incremental patch does that.
Thanks again for testing.
bye, Roman
Index: linux-2.6/kernel/sched_norm.c
===================================================================
--- linux-2.6.orig/kernel/sched_norm.c 2007-09-11 13:15:00.000000000 +0200
+++ linux-2.6/kernel/sched_norm.c 2007-09-11 13:13:43.000000000 +0200
@@ -326,11 +326,14 @@ static void update_curr(struct cfs_rq *c
}
static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
{
verify_queue(cfs_rq, cfs_rq->curr != se, se);
cfs_rq->time_avg_min = kclock_max(cfs_rq->time_avg_min, get_time_avg(cfs_rq));
- se->time_norm = kclock_max(cfs_rq->time_avg_min - se->req_weight_inv, se->time_norm);
+ if (likely(wakeup))
+ se->time_norm = kclock_max(cfs_rq->time_avg_min - se->req_weight_inv, se->time_norm);
+ else
+ se->time_norm = cfs_rq->time_avg_min;
cfs_rq->nr_running++;
cfs_rq->weight_sum += 1 << se->weight_shift;
@@ -553,7 +556,7 @@ static void enqueue_task_fair(struct rq
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se);
+ enqueue_entity(cfs_rq, se, wakeup);
}
}
@@ -813,7 +816,7 @@ static void task_new_fair(struct rq *rq,
rq->curr->se.time_norm -= time;
se->time_norm = rq->curr->se.time_norm;
- enqueue_entity(cfs_rq, se);
+ enqueue_entity(cfs_rq, se, 1);
p->se.on_rq = 1;
cfs_rq->next = se;
-
Hi, I mainly did it this way to avoid annoying conflicts. I also started with a copy where I threw out everything but a basic skeleton and than just T is the number of tasks, so the first is the sum of the time every task gets over a period of time and the second is the sum of all weights of each task. bye, Roman -
I think I'm more puzzled now than I was before, (scratches head for minutes) .. I see .. It's a summation symbol .. For instance if there are three tasks in the system. Call them A,B, and C. then time equals "time of A" + "time of B" + "time of C" Right? Daniel -
Hi,
Ok, let's take a simple example. :)
If we have three task A, B, C, each with a weight of 1, 2, 3, so the
weight sum is 6. If we let these three tasks run over a period of 12s,
each task gets time/weight_sum*weight_{t} seconds, that is each task gets
2, 4, 6 seconds of runtime.
The basic idea of CFS is now that using this formula, the time is
distributed to the active tasks and accounted against the runtime they
get. Let's assume a time slice of 1s, where each task gets
1s/weight_sum*weight_{t} of eligible runtime every second, so in the
following table the task column contains the values (eligible runtime
- obtained runtime):
time current A B C fair_clock
1 C 1/6-0 2/6-0 3/6-1 1/6
2 C 2/6-0 4/6-0 6/6-2 2/6
3 C 3/6-0 6/6-0 9/6-3 3/6
4 B 4/6-0 8/6-1 12/6-3 4/6
5 B 5/6-0 10/6-2 15/6-3 5/6
6 A 6/6-1 12/6-2 18/6-3 6/6
7 C 7/6-1 14/6-2 21/6-4 7/6
8 C 8/6-1 16/6-2 24/6-5 8/6
9 C 9/6-1 18/6-2 27/6-6 9/6
10 B 10/6-1 20/6-3 30/6-6 10/6
11 B 11/6-1 22/6-4 33/6-6 11/6
12 A 12/6-2 24/6-4 36/6-6 12/6
Practically the eligible runtime part isn't updated constantly, but a
delta against the last update is used. If everything is working
correctly in the end the difference between the eligible and obtained
runtime is zero.
Peter's patches now make an interesting step, basically they change the
position of the weight_{t} variable, so every task get's now
1s/weight_sum per second and weight_{t} is now used to scale the
obtained runtime instead:
time current A B C fair_clock
1 C 1/6-0/1 1/6-0/2 1/6-1/3 1/6
2 C 2/6-0/1 2/6-0/2 2/6-2/3 2/6
3 C 3/6-0/1 3/6-0/2 3/6-3/3 3/6
4 B 4/6-0/1 4/6-1/2 4/6-3/3 4/6
5 B 5/6-0/1 5/6-2/2 5/6-3/3 5/6
6 A 6/6-1/1 6/6-2/2 6/6-3/3 6/6
7 C 7/6-1/1 7/6-2/2 7/6-4/3 7/6
8 C 8/6-1/1 8/6-2/2 8/6-5/3 8/6
9 C 9/6-1/1 9/6-2/2 9/6-6/3 9/6
10 B 10/6-1/1 10/6-3/2 10/6-6/3 10/6
11 B 11/6-1/1 11/6-4/2 11/6-6/3 11/6
12 A 12/6-2/1 12/6-4/2 12/6-6/3 12/6
As you can see here the ...It helps a tiny bit .. However, I appreciate that you took the time to write this .. Thanks you. Daniel -
