test.kernel.org found some idle time regressions in the latest update to the staircase deadline scheduler and Andy Whitcroft helped me track down the offending problem which was present in all previous RSDL schedulers but previously wouldn't be manifest without changes in nice. So here is a bugfix for the set_load_weight being incorrectly set and a few other minor improvements. Thanks Andy!

I'm cautiously optimistic that we're at the thin edge of the bugfix wedge now.

---
set_load_weight() should be performed after p->quota is set. This fixes a large SMP performance regression.

Make sure rr_interval is never set to less than one jiffy.

Some sanity checking in update_cpu_clock will prevent bogus sched_clock values.

SCHED_BATCH tasks should not set the rq->best_static_prio field.

Correct sysctl rr_interval description to describe the value in milliseconds.

Style fixes.

Signed-off-by: Con Kolivas <kernel@kolivas.org>

---
 Documentation/sysctl/kernel.txt | 8 ++--
 kernel/sched.c | 73 +++++++++++++++++++++++++++++-----------
 2 files changed, 58 insertions(+), 23 deletions(-)

Index: linux-2.6.21-rc5-mm2/kernel/sched.c
===================================================================
--- linux-2.6.21-rc5-mm2.orig/kernel/sched.c 2007-03-28 09:01:03.000000000 +1000
+++ linux-2.6.21-rc5-mm2/kernel/sched.c 2007-03-29 00:02:33.000000000 +1000
@@ -88,10 +88,13 @@ unsigned long long __attribute__((weak))
 #define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
 #define SCHED_PRIO(p) ((p)+MAX_RT_PRIO)
-/* Some helpers for converting to/from nanosecond timing */
+/* Some helpers for converting to/from various scales.*/
 #define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
-#define NS_TO_MS(TIME) ((TIME) / 1000000)
+#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
 #define MS_TO_NS(TIME) ((TIME) * 1000000)
+/* Can return 0 */
+#define MS_TO_JIFFIES(TIME) ((TIME) * HZ / 1000)
+#define JIFFIES_TO_MS(TIME) ((TIME) * 1000 / HZ)
 #define ...
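The "Can return 0" comment and the "never less than one jiffy" changelog entry are two sides of the same integer division. A minimal userspace sketch of that arithmetic follows. It is not code from the patch: HZ is passed as a parameter purely for illustration, and clamp_rr_interval_ms() is a hypothetical helper name.

```c
#include <assert.h>

/* Same arithmetic as MS_TO_JIFFIES in the hunk above, with HZ as a parameter. */
static unsigned long ms_to_jiffies(unsigned long ms, unsigned long hz)
{
	assert(hz > 0);
	return ms * hz / 1000;	/* integer division: can return 0 */
}

/* Same arithmetic as JIFFIES_TO_MS in the hunk above. */
static unsigned long jiffies_to_ms(unsigned long jiffies, unsigned long hz)
{
	assert(hz > 0);
	return jiffies * 1000 / hz;
}

/*
 * Hypothetical helper (name and shape are assumptions, not the patch's
 * code): express a requested interval in whole jiffies, never fewer
 * than one, and hand it back in milliseconds.
 */
static unsigned long clamp_rr_interval_ms(unsigned long ms, unsigned long hz)
{
	unsigned long jiffies = ms_to_jiffies(ms, hz);

	if (jiffies < 1)
		jiffies = 1;
	return jiffies_to_ms(jiffies, hz);
}
```

With HZ=100 a request of 8 ms converts to 0 jiffies, so the clamp raises it to one jiffy (10 ms). With HZ=1000 the same request is left at 8 ms.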
hm, how about the questions Mike raised (there were a couple of cases of friction between 'the design as documented and announced' and 'the code as implemented')? As far as i saw they were still largely unanswered - but let me know if they are all answered and addressed: http://marc.info/?l=linux-kernel&m=117465220309006&w=2 http://marc.info/?l=linux-kernel&m=117489673929124&w=2 http://marc.info/?l=linux-kernel&m=117489831930240&w=2 and the numbers he posted: http://marc.info/?l=linux-kernel&m=117448900626028&w=2 his test conclusion was that under CPU load, RSDL (SD) generally does not hold up to mainline's interactivity. Ingo -
I spent less time emailing and more time coding. I have been working on

There have been improvements since the earlier iterations but it's still a fairness based design. Mike's "sticking point" test case should be improved as well.

My call based on my own testing and feedback from users is:

Under niced loads it is 99% in favour of SD.
Under light loads it is 95% in favour of SD.
Under heavy loads it becomes proportionately in favour of mainline. The crossover is somewhere around a load of 4.

If the reluctance to renice X goes away I'd say it was 99% across the board

-- -ck -
That one's not fine.
+static void recalc_task_prio(struct task_struct *p, struct rq *rq)
+{
+ struct prio_array *array = rq->active;
+ int queue_prio;
+
+ update_if_moved(p, rq);
+ if (p->rotation == rq->prio_rotation) {
+ if (p->array == array) {
+ if (p->time_slice > 0)
+ return;
+ p->time_slice = p->quota;
+ } else if (p->array == rq->expired) {
You implemented nanosecond accounting, but here you give a task which
has either missed the tick often enough, or accumulated enough cross cpu
clock drift to have an I.O.U. in its wallet a shiny new $8 bill.
WRT clock drift/timewarps, your latest code concedes that these do occur,
but although these timewarps can be anywhere from minuscule with Intel
same package processors up to a tick elsewhere, it charges a tick.
- /* cpu scheduler quota accounting is performed here */
+ if (tick) {
+ /*
+ * Called from scheduler_tick() there should be less than two
+ * jiffies worth, and not negative/overflow.
+ */
+ if (time_diff > JIFFIES_TO_NS(2) || time_diff < min_diff)
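A userspace sketch of the range check quoted above may help when reading the objection. The quoted hunk does not show what happens when the check fires. Substituting exactly one tick is an assumption taken from the "charges a tick" remark, and sane_tick_diff() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Same arithmetic as JIFFIES_TO_NS, with HZ passed in for illustration. */
static int64_t jiffies_to_ns(int64_t jiffies, int64_t hz)
{
	assert(hz > 0);
	return jiffies * (NSEC_PER_SEC / hz);
}

/*
 * Hypothetical wrapper around the quoted condition. A delta of more
 * than two jiffies, or one below min_diff, is treated as bogus.
 */
static int64_t sane_tick_diff(int64_t time_diff, int64_t min_diff, int64_t hz)
{
	if (time_diff > jiffies_to_ns(2, hz) || time_diff < min_diff)
		return jiffies_to_ns(1, hz);	/* assumption: charge one full tick */
	return time_diff;
}
```

Under that assumption a small negative timewarp and a large one are billed identically, which is the behaviour being questioned.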
Hm. How, where?
I'm getting inconsistent results with current, but sleeping tasks still
don't _appear_ to be able to compete with hogs on an equal footing, and
I don't see how they really can.
What happens if a sleeper sleeps after using say half of its slice, and
the hog it's sharing the CPU with then sleeps briefly after using most
of its slice? That's the end of the rotation. They are put back on an
The behavior is different, and is less ragged, but I wouldn't say it's
really been improved. The below was added as a workaround.
+ * This contains a bitmap for each dynamic priority level with empty slots
+ * for the valid priorities each different nice level can have. It allows
+ * us to stagger the slots where differing priorities run in a way that
+ * keeps latency differences between different nice levels at a minimum.
+ * ie, where 0 means a slot ...

Suggestion: try the testcase that Satoru Takeuchi posted. The numbers I got with latest SD were no better than the numbers I got with the patch I posted to try to solve it. Seems to me the numbers with SD should have been much better, but they in fact were not.

Running that thing, mainline's GUI was not usable, even with my patch, but neither was it usable with SD. What's the difference between horrible with mainline and merely terrible with SD? In both, the GUI ends up doing round-robin with a slew of hogs. In mainline, this happens because the history logic can and does get it wrong sometimes, which this exploit deliberately triggers. With SD, it's by design.

-Mike -
Oh my, I'm on a roll here... somebody stop me ;-)

Some emphasis: The much maligned history mechanism in mainline didn't start its life as an interactivity estimator, that's a name it acquired later. What it was first put there for was to ensure fairness for sleeping tasks.

I found it most ironic that the numbers I posted showed that mechanism working perfectly, with an exploit that was designed specifically to expose its weakness, despite the deliberate tweaks that have gone in tweaking it very heavily in the unfair direction, and this went uncommented. If I had run more of them, it would have shown that weakness very well. We all know that weakness exists.

What the numbers clearly showed was that sleeping tasks did not get the fairness RSDL advertised with the particular test I ran, yet it went uncommented/uncontested. Anyone could have tested with the trivial proggy of their choice... but nobody did.

The history mechanism is not only about interactivity, and never was.

-Mike

I'm gonna go piddle around with code now, much more fun than yacking :) -
Rereading to make sure I wasn't unclear anywhere... Egad. Here I'm pondering the numbers and light load as I'm typing, and my fingers (seemingly independent when mind wanders off) typed < 95% as in not fully committed, instead of "light". -Mike -
95% of cases where load is less than 4; not 95% load. -- -ck -
While I don't know the _exact_ figure for this, my hunch is that a good ballpark figure is anything that is not a heavy load (less than 4, perhaps even lower, maybe <0.75 or <2?) and that is not a "niced" load. -- -- Michael Chang ~Just the crazy copy cat~ -
Try two instances of chew.c at _differing_ nice levels on one cpu on mainline, -- -ck
How about something more challenging instead :)

The numbers below are from my scheduler tree with massive_intr running at nice 0, and chew at nice 5. Below these numbers are 100 lines from the exact center of chew's output.

(interactivity remains intact with this rather heavy load)

root@Homer: ./massive_intr 30 180
005671 00001506
005657 00001506
005651 00001491
005647 00001466
005661 00001484
005660 00001475
005645 00001514
005668 00001384
005673 00001516
005656 00001449
005664 00001512
005659 00001507
005667 00001513
005663 00001521
005670 00001440
005649 00001522
005652 00001487
005648 00001405
005665 00001472
005669 00001418
005662 00001489
005674 00001523
005650 00001480
005655 00001476
005672 00001530
005653 00001463
005654 00001427
005646 00001499
005658 00001510
005666 00001476

100 sequential lines from the middle of chew's logged output.

pid 5642, prio 5, out for 2 ms, ran for 1 ms, load 34%
pid 5642, prio 5, out for 1268 ms, ran for 63 ms, load 4%
pid 5642, prio 5, out for 52 ms, ran for 0 ms, load 0%
pid 5642, prio 5, out for 8 ms, ran for 1 ms, load 14%
pid 5642, prio 5, out for 9 ms, ran for 1 ms, load 12%
pid 5642, prio 5, out for 8 ms, ran for 1 ms, load 17%
pid 5642, prio 5, out for 8 ms, ran for 1 ms, load 15%
pid 5642, prio 5, out for 9 ms, ran for 1 ms, load 17%
pid 5642, prio 5, out for 8 ms, ran for 1 ms, load 15%
pid 5642, prio 5, out for 8 ms, ran for 1 ms, load 12%
pid 5642, prio 5, out for 7 ms, ran for 1 ms, load 18%
pid 5642, prio 5, out for 8 ms, ran for 1 ms, load 11%
pid 5642, prio 5, out for 8 ms, ran for 1 ms, load 18%
pid 5642, prio 5, out for 4 ms, ran for 1 ms, load 22%
pid 5642, prio 5, out for 1395 ms, ran for 50 ms, load 3%
pid 5642, prio 5, out for 26 ms, ran for 0 ms, load 3%
pid 5642, prio 5, out for 8 ms, ran for 1 ms, load ...
Here are the numbers for 2.6.21-rc5 with only the earlier mentioned patch. Chew's log is only 20% as long as that from my other tree, and interactivity suffers badly while running this exploit, but as you can see, chew isn't dying of boredom.

-Mike

root@Homer: ./massive_intr 30 180
006701 00001509
006693 00001571
006707 00001072
006690 00001582
006691 00001547
006692 00001336
006695 00001759
006710 00001766
006699 00001531
006688 00001405
006709 00001907
006703 00001572
006705 00001501
006697 00001617
006686 00001344
006713 00001922
006714 00001885
006704 00001491
006694 00001482
006689 00001395
006711 00001176
006715 00001471
006708 00001527
006687 00001200
006706 00001451
006698 00001246
006702 00001495
006696 00001421
006712 00001414
006700 00001047

pid 6683, prio 5, out for 46 ms, ran for 0 ms, load 0%
pid 6683, prio 5, out for 7 ms, ran for 1 ms, load 17%
pid 6683, prio 5, out for 8 ms, ran for 1 ms, load 16%
pid 6683, prio 5, out for 6 ms, ran for 1 ms, load 18%
pid 6683, prio 5, out for 3527 ms, ran for 69 ms, load 1%
pid 6683, prio 5, out for 52 ms, ran for 1 ms, load 2%
pid 6683, prio 5, out for 15 ms, ran for 1 ms, load 6%
pid 6683, prio 5, out for 7 ms, ran for 1 ms, load 15%
pid 6683, prio 5, out for 7 ms, ran for 1 ms, load 13%
pid 6683, prio 5, out for 7 ms, ran for 1 ms, load 18%
pid 6683, prio 5, out for 8 ms, ran for 1 ms, load 18%
pid 6683, prio 5, out for 8 ms, ran for 1 ms, load 18%
pid 6683, prio 5, out for 8 ms, ran for 1 ms, load 17%
pid 6683, prio 5, out for 7 ms, ran for 1 ms, load 17%
pid 6683, prio 5, out for 3925 ms, ran for 56 ms, load 1%
pid 6683, prio 5, out for 30 ms, ran for 1 ms, load 3%
pid 6683, prio 5, out for 24 ms, ran for 1 ms, load 6%
pid 6683, prio 5, out for 7 ms, ran for 1 ms, load 18%
pid 6683, prio 5, out for 7 ...
Taking a little break from tinkering, I built/ran rsd-0.38 as well. While chew usually says "out for N < 500ms", I see spikes like those below the massive_intr numbers.

root@Homer: ./massive_intr 30 180 (nice 0)
006596 00001346
006613 00001475
006605 00001463
006606 00001423
006598 00001279
006609 00001458
006600 00001378
006591 00001491
006610 00001413
006588 00001361
006602 00001401
006601 00001412
006607 00001373
006604 00001449
006599 00001398
006608 00001269
006611 00001464
006593 00001349
006614 00001335
006612 00001512
006615 00001422
006589 00001363
006617 00001362
006597 00001435
006592 00001354
006595 00001425
006616 00001348
006603 00001308
006594 00001360
006590 00001397

(spikes from run above)
pid 6585, prio 0, out for 178 ms, ran for 12 ms, load 6%
pid 6585, prio 0, out for 175 ms, ran for 13 ms, load 7%
pid 6585, prio 0, out for 1901 ms, ran for 12 ms, load 0%
pid 6585, prio 0, out for 61 ms, ran for 12 ms, load 17%
...
pid 6585, prio 0, out for 148 ms, ran for 11 ms, load 7%
pid 6585, prio 0, out for 229 ms, ran for 13 ms, load 5%
pid 6585, prio 0, out for 182 ms, ran for 11 ms, load 6%
pid 6585, prio 0, out for 1306 ms, ran for 11 ms, load 0%
pid 6585, prio 0, out for 72 ms, ran for 12 ms, load 15%
pid 6585, prio 0, out for 252 ms, ran for 11 ms, load 4%
....

(spikes from massive_intr at nice 0 and chew at nice -20)
pid 6547, prio -20, out for 132 ms, ran for 119 ms, load 47%
pid 6547, prio -20, out for 52 ms, ran for 119 ms, load 69%
pid 6547, prio -20, out for 4 ms, ran for 96 ms, load 95%
pid 6547, prio -20, out for 1251 ms, ran for 24 ms, load 1%
pid 6547, prio -20, out for 78 ms, ran for 1561 ms, load 95%
pid 6547, prio -20, out for 89 ms, ran for 120 ms, load 57%
pid 6547, prio -20, out for 69 ms, ran for 119 ms, load 63%
pid 6547, prio -20, out for 4125 ms, ran for 119 ms, load 2%
pid 6547, prio ...
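For readers comparing the chew logs in this thread: the "load" column looks consistent with ran / (ran + out), truncated to a whole percent. That reading is inferred from the logged figures, not taken from chew.c, so the sketch below is only an aid for reading the logs. load_percent() is a hypothetical name, and the logged ms values are rounded, so short entries will not reproduce exactly.

```c
#include <assert.h>

/*
 * Assumed relationship between the logged columns: share of wall time
 * spent running, as a truncated percentage.
 */
static unsigned int load_percent(unsigned long ran, unsigned long out)
{
	unsigned long total = ran + out;

	assert(total >= ran);	/* guard against overflow of the sum */
	if (total == 0)
		return 0;
	return (unsigned int)(ran * 100 / total);
}
```

For example, "out for 132 ms, ran for 119 ms" gives 47 and "out for 1251 ms, ran for 24 ms" gives 1, matching the logged load figures for those lines.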
looks interesting - could you send the patch? Ingo -
Sorry, that tree is not _even_ ready for viewing yet. (and it's got an occasional oops bug i have to kill) -Mike -
Ok, this is looking/feeling pretty good in testing. Comments on fugliness etc much appreciated.

Below the numbers is a snapshot of my experimental tree. It's a mixture of my old throttling/anti-starvation tree and the task promotion patch, with the addition of a scheduling class for interactive tasks to dish out some of that targeted unfairness I mentioned. SCHED_INTERACTIVE is also targeted at the scenario where X or one of its clients uses enough CPU to end up in the expired array.

(note: Xorg was not set SCHED_INTERACTIVE during the test runs below)

-Mike

top - 12:31:34 up 16 min, 13 users, load average: 7.37, 8.74, 6.58

 PID USER PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ P COMMAND
6542 root 15  0  1568  108   24 S   43  0.0 0:58.98 1 fiftypercent
6540 root 17  0  1568  440  356 R   30  0.0 1:00.04 0 fiftypercent
6544 root 18  0  1568  108   24 R   28  0.0 0:58.36 0 fiftypercent
6541 root 20  0  1568  108   24 R   26  0.0 0:57.70 1 fiftypercent
6536 root 25  0  1436  356  296 R   24  0.0 0:45.76 1 chew
6538 root 25  0  1436  356  296 R   20  0.0 0:49.73 0 chew
6543 root 19  0  1568  108   24 R   19  0.0 0:58.04 1 fiftypercent
6409 root 15  0  154m  63m  27m R    2  6.3 0:13.09 0 amarokapp
6410 root 15  0  154m  63m  27m S    2  6.3 0:14.36 0 amarokapp
6376 root 15  0  2380 1092  764 R    2  0.1 0:15.63 0 top
5591 root 18  0  4736 1036  736 S    1  0.1 0:00.14 1 smpppd
5678 root 15  0  167m  24m 4848 S    1  2.4 0:19.37 0 Xorg
6202 root 15  0 32364  18m  12m S    1  1.8 0:04.25 1 konsole

50 lines from center of chew nailed to cpu0's log

pid 6538, prio 0, out for 27 ms, ran for 1 ms, load 6%
pid 6538, prio 0, out for 26 ms, ran for 4 ms, load 14%
pid 6538, prio 0, out for 27 ms, ran for 7 ms, load 20%
pid 6538, prio 0, out for 13 ms, ran for 5 ms, load 27%
pid 6538, prio 0, out for 8 ms, ran for ...
find a whitespace fix below.

Ingo

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1034,7 +1034,7 @@ static int recalc_task_prio(struct task_
 /*
 * Migration timestamp adjustment may induce negative time.
 * Ignore unquantifiable values as well as SCHED_BATCH tasks.
- */
+ */
 if (now < p->timestamp || batch_task(p))
 sleep_time = 0;
-
Thanks. (dang, i need to find that fifty "make it red" thingie for vi again) -Mike -
or just start using quilt, which warns about this :) Ingo -
put "let c_space_errors=1" in .vimrc HTH, Johannes -
Thanks. I received this link via private mail, and think it's worth posting. Who knows, it may save Maintainers an antacid tablet or two. http://www.pixelbeat.org/settings/.vimrc -Mike (may eventually get tired of the colors, but for now they're cooler than the plain black and white i'm used to, _and_ has "make it glow" feature) -
here's some test results, comparing SD-latest to Mike's-latest:

re-testing the weak points of the vanilla scheduler + Mike's:

 - thud.c: this workload has almost unnoticeable effect
 - fiftyp.c: noticeable, but a lot better than previously!

re-testing the weak points of SD:

 - hackbench: still unusable under such type of high load - no improvement.
 - make -j: still less interactive than Mike's - no improvement.

Ingo -
Hmm. Here fiftyp.c is utterly harmless. If you have a second, can you send me a top snapshot? If you're running many of them, it can take a bit for the throttle to catch them all. -Mike -
ah, indeed - i ran 10 of them and letting them run for a bit smoothes things out. Ingo -
Ok, I didn't try 10 of them. It can still get a bit ragged here, so I may have to latch the throttle for a bit to make sure they have to maintain improved behavior to get unleashed. 5 of them get instantly nailed, and stay nailed. -Mike -
Throttling to try to get to SD fairness? The mainline state machine becomes more complex than ever and fluctuates from interactive to fair by an as-yet

Nice -10 on mainline ruins the latency of nice 0 tasks unlike SD.

New scheduling class just for X? Sounds like a very complicated

Depends on how big your job number vs cpu is. The better the throttling gets with mainline the better SD gets in this comparison. At equal fairness mainline does not have the low latency interactivity SD has.

Nice -10 X with SD is a far better solution than an ever increasing complexity state machine and a userspace-changing scheduling policy just for X. Half

-- -ck -
I believe I've already met and surpassed SD fairness. Bold statement, but I believe it's true. I'm more worried about becoming _too_ fair. Show me your numbers. I showed you mine with both SD and my patches.

WRT magic and state machine complexity: If you read the patch, there is nothing "magical" about it. It doesn't do anything but monitor CPU usage and move a marker. It does nothing the least bit complicated, and what it does, it does in the slow path. The only thing it does in the fast path is to move the marker, and perhaps tag a targeted task. State machine? There is nothing there that resembles a state machine to me,

This patch makes massive nice -10 vs nice 0 latency history I believe. Testing welcome. WRT "nice -10 obfuscated", that's a load of high grade horse-hockey. There were very good reasons posted here as to why that is a very bad idea, perhaps you haven't read them. (you can find them if you choose)

Your criticism of SCHED_INTERACTIVE leaves me dumbfounded, since you were, and still are, specifically telling me that I should tell the scheduler that X is special. I did precisely that, and am also trying to tell it that its clients are special too, _without_ having to start each and

SD does not retain interactivity under any appreciable load for one, and secondly, I'm getting interactivity that SD cannot even get close to without renicing, and without any patches - in mainline right now.

(Speaking of low latency, how long can tasks forking off sleepers who overlap their wake times prevent an array switch with SD? Forever?)

I posted numbers that demonstrate the improvement in fairness while maintaining interactivity, and I'm not finished. I've solved the multiple fiftyp.c thing Ingo noticed, and in fact, I had 10 copies running that I had forgotten to terminate while I was working, and I didn't even notice until I finished, and saw my top window. Patch to follow as soon as I test some more (that's what takes much time, not creating the ...
i think you are missing the point. We _do not know in advance_ whether X should be prioritized or not. It's the behavior of X that determines it. When X is reniced to -10 it fixes a few corner cases, but it breaks many other cases. We found that out time and time again.

this is relative to how mainline+Mike's handles it. Users wont really

i often run make jobs with -j200 or larger, and SD gets worse than even mainline much sooner than that.

Ingo -
fiftyp.c seems to have been stumbled across by accident as having an effect when Xenofon was trying to recreate Mike's 50% x 3 test case. I suggest a ten percent version like the following would be more useful as a test for the harmful effect discovered in fiftyp.c. (/me throws in obligatory code style change). Starts 15 processes that sleep ten times longer than they run. Change forks to 15 times the number of cpus you have and it should work on any size hardware. -- -ck
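The test program attached to this message is not preserved in this copy. As a rough illustration only, a bounded sketch of the behaviour described (a number of forked processes that each sleep ten times longer than they run) might look like the following. This is not Con's tenp.c: run_sleepers(), burn() and rest() are hypothetical names, and the loop is limited to a fixed number of cycles so that it terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Spin on the CPU until roughly run_us microseconds of wall time pass. */
static void burn(long run_us)
{
	struct timespec start, now;
	long long elapsed_ns;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
		elapsed_ns = (long long)(now.tv_sec - start.tv_sec) * 1000000000LL
			     + (now.tv_nsec - start.tv_nsec);
	} while (elapsed_ns < (long long)run_us * 1000LL);
}

/* Sleep ten times as long as the run phase, giving roughly a 10% duty cycle. */
static void rest(long run_us)
{
	long long sleep_ns = (long long)run_us * 10LL * 1000LL;
	struct timespec ts;

	ts.tv_sec = (time_t)(sleep_ns / 1000000000LL);
	ts.tv_nsec = (long)(sleep_ns % 1000000000LL);
	nanosleep(&ts, NULL);
}

/*
 * Fork `forks` children that each do `cycles` run/sleep cycles, then
 * reap them. Returns the number of children reaped.
 */
static int run_sleepers(int forks, int cycles, long run_us)
{
	int started = 0, reaped = 0;

	assert(forks > 0 && cycles > 0 && run_us > 0);
	for (int i = 0; i < forks; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;		/* could not fork more; reap what we have */
		if (pid == 0) {
			for (int c = 0; c < cycles; c++) {
				burn(run_us);
				rest(run_us);
			}
			_exit(0);
		}
		started++;
	}
	while (reaped < started && wait(NULL) > 0)
		reaped++;
	return reaped;
}
```

Following the suggestion above, a caller would pass 15 times the number of CPUs as `forks`, for example run_sleepers(15, 1000, 1000) on a single-CPU box. The cycle count and run length here are arbitrary choices, not values from the thread.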
I was more focused on the general case, but all I should have to do to de-claw all of these sleep exploits is account rr time (only a couple of lines, done and building now). -Mike -
The more you try to "de-claw" these sleep exploits the less effective you make your precious interactive estimator. Feel free to keep adding endless tweaks to undo the other tweaks in order to try and achieve what SD has by design. You'll end up with an increasingly complex state machine design of interactivity tweaks and interactivity throttlers all fighting each other to the point where the interactivity estimator doesn't do anything. What's the point in that? Eventually you'll have an estimator throttled to the point it does nothing and you end up with something far less interactive than SD which is as interactive as fairness allows, unlike mainline. -- -ck -
I haven't seen SD achieve what its design docs claim yet, so yup, I'm going to keep right on trying to fix the corner cases in what we have that _does_ give me the interactivity I want. -Mike -
firstly, testing on various workloads Mike's tweaks work pretty well, while SD still doesnt handle the high-load case all that well. Note that it was you who raised this whole issue to begin with: everything was pretty quiet in scheduling interactivity land. (There was one person who reported wide-scale interactivity regressions against mainline but he didnt answer my followup posts to trace/debug the scenario.) SD has a built-in "interactivity estimator" as well, but hardcoded into its design. SD has its own set of ugly-looking tweaks as well - for example the prio_matrix. So it all comes down on 'what interactivity heuristics is enough', and which one is more tweakable. So far i've yet to see SD address the hackbench and make -j interactivity problems/regression for example, while Mike has been busy addressing the It comes down to defining interactivity by scheduling behavior, and making that definition flexible. SD's definition of interactivity is rigid (but it's still behavior-based, so not fundamentally different from an explicit 'interactivity estimator'), and currently it does not work well under high load. But ... i'm still entertaining the notion that it might be good enough, but you've got to demonstrate the design's flexibility. furthermore, your description does not match my experience when using Mike's tweaks and comparing it to SD on the same hardware. According to your claim i should have seen regressions popping up in various, already-fixed corners, but it didnt happen in practice. But ... i'm awaiting further SD and Mike tweaks, the race certainly looks interesting ;) Ingo -
<g> I think I lapped him, but since we're running in opposite directions, it's hard to tell. -Mike -
I'm terribly sorry but you have completely missed my intentions then. I was _not_ trying to improve mainline's interactivity at all. My desire was to fix the unfairness that mainline has, across the board without compromising fairness. You said yourself that an approach that fixed a lot and had a small number of regressions would be worth it. In a surprisingly ironic turnaround two bizarre things happened. People found SD fixed a lot of their interactivity corner cases which were showstoppers. That didn't surprise me because any unfair design will by its nature get it wrong sometimes. The even _more_ surprising thing is that you're now using interactivity as the argument against SD. I did not set out to create better interactivity, I set out to create widespread fairness without too much compromise to interactivity. As I said from the _very first email_, there would be cases of That was one user. As I mentioned in an earlier thread, the problem with email threads on drawn out issues on lkml is that all that people remember is the last one creating noise, and that has only been the noise from Mike for 2 weeks now. Has everyone forgotten the many many users who reported the advantages first up which generated the interest in the first place? Why have they stopped reporting? Well the answer is obvious; all the signs suggest that SD is slated for mainline. It is on the path, Linus has suggested it and now akpm is asking if it's ready for 2.6.22. So they figure there is no point testing and replying any further. SD is ready for prime time, finalised and does everything I intended it to. This is where I have to reveal to them the horrible truth. This is no guarantee it will go in. In fact, this one point that you (Ingo) go on and on about is not only a quibble, but you will call it an absolute showstopper. As maintainer of the cpu scheduler, in its current form you will flatly refuse it goes to mainline citing the 5% of cases where interactivity has regressed. ...
Con was scratching an itch, one we desktop users all have in a place we can't quite reach to scratch because we aren't quite the coding gods we should be. Con at least has the coding knowledge to walk in and start shoveling, which is more than I can say of the efforts to derail the SD

Sorry, this user got quiet to watch the cat fight. Obviously I should

Who gives a s*** about hackbench or a make -j 200?! Those are NOT, and NEVER WILL BE, REAL WORLD LOADS for the vast majority of us. For us SD

To be expected, there are after all, only so many cpu cycles to go around.

Here I sit, running 2.6.21-rc6 ATM, and since there is not an SD patch that applies cleanly to rc6, I am back to typing half or more of a sentence blind while I answer a posting such as this because of x starvation while kmail is sorting incoming stuff. All this while gkrellm, sitting on the right edge of my screen, is showing a 0 to 2% cpu usage in its graphic display! FWIW, also isn't suffering the same display update problems, nor is the system clock down on the kickstart bar. If that isn't prima facie evidence of an unfair scheduler, I don't know what is.

With the SD patch applied to a working kernel, I've pretty well got my machine back and I'm in command again, just as if I was running nitros9 on my trs-80 Color Computer while it was compiling a program in the background, or back when I was doing all this on an amiga. Both of these had, by their simplistic designs, schedulers that were fair, with (nitr)os9 having the ability to schedule the order that IRQ's were serviced with a priority setting on a per IRQ basis. If Amigados ever had the ability to fiddle with the scheduler other than niceing the process, it wasn't important enough for me to see if I could tweak it because generally it simply worked.

Con's earlier patches worked very well for this desktop user, but as Mike kept bitching about "production", (who the hell runs a 'make -j 200' or 50 while(1)'s in the ...
it would be really nice to analyze this. Does the latest -rt patch boot on your box so that we could trace this regression? (I can send you a standalone tracing patch if it doesnt.) IIRC you reported that one of the early patches from Mike made your system behave good (but still not as good as SD) - it would be nice to try a later patch too. basically, the current unfairness in the scheduler should be solved, one not many - and i dont think Mike tested any of these - Mike tested pretty low make -j values (Mike, can you confirm?). (I personally routinely run 'make -j 200' build jobs on my box [because it's the central server of a build cluster and high parallelism is needed to overcome network latencies], but i'm pretty special in that regard and i didnt use that workload as a test against any of these schedulers.) Ingo -
Yes. I don't test anything more than make -j5 when looking at interactivity, and make -j nr_cpus+1 is my must have yardstick. -Mike -
Somebody made that remark, maybe not you, and maybe they were being funny, but I didn't at the time, see any smileys. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Please remain calm, it's no use both of us being hysterical at the same time. -
I strongly suggest assembling a battery of cleanly and properly written, configurable testcases, and scripting a series of regression tests as opposed to just randomly running kernel compiles and relying on Braille. For instance, a program that spawns a set of tasks with some spectrum of interactive vs. noninteractive behaviors and maybe priorities too according to command-line flags and then measures and reports the distribution of CPU bandwidth between them, with some notion of success or failure and performance within the realm of success reported would be something to include in such a battery of testcases. Different sorts of cooperating processes attempting to defeat whatever sorts of guarantees the scheduler is intended to provide would also be good testcases, particularly if they're arranged so as to automatically report success or failure in their attempts to defeat the scheduler (which even irman2.c, while quite good otherwise, fails to do). IMHO the failure of these threads to converge to some clear conclusion is in part due to the lack of an agreed-upon set of standards for what the scheduler should achieve and overreliance on subjective criteria. how any of this is to be demonstrated, and what the status of various pathological cases are, these threads are a nightmare of subjective squishiness and a tug-of-war between testcases only ever considered one at a time needing Lindent to read that furthermore have all their parameters hardcoded. Scripting edits and recompiles is awkward. Just finding the testcases is also awkward; con has a collection of a few, but they've got the aforementioned flaws and others also go around that can only be dredged up from mailing list archive searches, plus there's nothing like LTP where they can be run in a script with pass/fail reports and/or performance metrics for each. One patch goes through for one testcase and regressions against the others are open questions. 
Scheduling does have a strong subjective component, but this is ...
there's interbench, written by Con (with the purpose of improving RSDL/SD), which does exactly that, but vanilla and SD performs quite the same in those tests. it's quite hard to test interactivity, because it's both subjective and because even for objective workloads, things depend so much on exact circumstances. So the best way is to wait for actual complaints, and/or actual testcases that trigger badness, and victims^H^H^H^H^H testers. (also note that often it needs _that precise_ workload to trigger some badness. For example make -j depends on the kind of X shell terminal that is used - gterm behaves differently from xterm, etc.) Ingo -
Interactivity will probably have to stay squishy. The DoS affairs like fiftyp.c, tenp.c, etc. are more of what I had in mind. There are also a number of instances where CPU bandwidth distributions are gauged by top(1) with noninteractive tests where the scriptable testcase affair should be coming into play. There are other, relatively obvious testcases for basic functionality missing, too. For instance, where is the testcase to prove that nice levels have the intended effect upon CPU bandwidth distribution between sets of CPU-bound tasks? Or one that gauges the CPU bandwidth distribution between a task that sleeps some (command-line configurable) percentage of the time and some (command-line configurable) number of competing CPU-bound tasks? Or one that gauges the CPU bandwidth distribution between sets of cooperating processes competing with ordinary CPU-bound processes? Can it be proven that any of this is staying constant across interactivity or other changes? Is any of it being changed as an unintended side-effect? Are the CPU bandwidth distributions among such sets of competing tasks even consciously decided? There should be readily-available answers to these questions, but they are not so. -- wli -
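As one possible shape for the scriptable pass/fail check asked for above, the comparison step could be as small as the sketch below. It assumes the per-task CPU times have already been measured by some means, and that the caller supplies the intended relative weights; what those weights should be for a given set of nice levels is exactly the open question raised here. shares_within_tolerance() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compare each task's observed share of CPU time with its intended
 * share. Returns 1 (pass) if every task is within `tol` of its
 * intended share, expressed as a fraction of the total, else 0 (fail).
 */
static int shares_within_tolerance(const double *cpu_secs,
				   const double *weights,
				   size_t n, double tol)
{
	double total_cpu = 0.0, total_weight = 0.0;
	size_t i;

	assert(n > 0 && tol >= 0.0);
	for (i = 0; i < n; i++) {
		total_cpu += cpu_secs[i];
		total_weight += weights[i];
	}
	if (total_cpu <= 0.0 || total_weight <= 0.0)
		return 0;	/* nothing measured or nothing intended: fail */

	for (i = 0; i < n; i++) {
		double diff = cpu_secs[i] / total_cpu
			      - weights[i] / total_weight;

		if (diff < 0.0)
			diff = -diff;
		if (diff > tol)
			return 0;
	}
	return 1;
}
```

A driver script could run the competing tasks for a fixed period, collect their CPU times, and exit non-zero when this check fails, which would give the kind of pass/fail report described above.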
Yes it would be Ingo, but so far, none of the recent -rt patches has booted on this machine, the last one I tried a few days ago failing to find /dev/root, whatever the heck that is. FWIW, I gave up on the rt stuff 6 months or more ago when the regressions I was reporting weren't ever acknowledged. I don't enjoy sitting through all these e2fsck's during the reboot just to have things I normally run in the background die, like tvtime, sitting there with some news channel muttering along in the background. I was even ignored when I suggested it might be a DMA problem, which I still think it could be. Nevertheless, the patch you sent is building as I type, intermittently And I'd wager a cool one that you don't gain more than a second or so in compile time between a make -j8 and a make -j200 unless your network is a pair of tomato juice cans & some string. Again, to me, the network thing is not something that's present in an everyday user's environment. My -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) If you would keep a secret from an enemy, tell it not to a friend. -
did you have a chance to try the yum kernel by any chance? The -testing one you can try on Fedora with little hassle, by doing this as root:
  cat > /etc/yum.repos.d/rt-testing.repo
  [rt-testing]
  name=Ingo's Real-Time (-rt) test-kernel for FC6
  baseurl=http://people.redhat.com/mingo/realtime-preempt/yum-testing/yum/
  enabled=1
  gpgcheck=0
  <Ctrl-D>
i did spend quite some time to debug your tv-tuner problem back then, and for that purpose alone i bought a tv tuner card to test this myself. (but it worked on my testbox) Ingo -
No, I couldn't seem to get that to show up in a yumex display, and I'm You didn't tell me this. That said, I am booted to the patch you sent me now, and this also is a very obvious improvement, one I could easily live with on a long term basis. I haven't tried a kernel build in the background yet, but I have sat here and played patience for about an hour, looking for the little stutters, but never saw them. So I could just as easily recommend this one for desktop use, it seems to be working. tvtime hasn't had any audio or video glitches that I've noted when I was on that screen to check on an interesting story, like the 102 year old lady who finally got her hole in one, on a very short hole, but after 90 years of golfing, she was beginning to wonder if she would ever get one. Not sure who bought at the 19th hole, HNN didn't cover that traditional part. So this patch also works. And if it gets into mainline, at least Con's efforts at prodding the fixes needed will not have been in vain. My question then, is why did it take a very public cat-fight to get this looked at and the code adjusted? It's been what, nearly 2 years since Linus himself made a comment that this thing needed fixed. The fixes then done were of very little actual effectiveness and the situation then has gradually deteriorated since. It's on the desktop that linux will win or lose the public's market share. After all, there are only so many 'servers' on the planet, a market where linux has pretty well demo'ed its superiority, if not in terms of speed, at least in security. To qualify that, I currently have 2 of yahoo's machines in my .procmailrc's /dev/null list as they are a source of a large number of little 1 to 3 line spams. I assume they are IIS machines, but the email headers aren't that explicit to my relatively untrained eyeballs. And I'd like to see Korea put on a permanent rbl black hole. I'm less than amused at watching the log coming out of my router as first ...
thanks for testing it! (for the record, Gene tested sched-mike-4.patch,
this is pretty hard to get right, and the most objective way to change
it is to do it testcase-driven. FYI, interactivity tweaking has been
gradual, the last bigger round of interactivity changes were done a year
ago:
commit 5ce74abe788a26698876e66b9c9ce7e7acc25413
Author: Mike Galbraith <efault@gmx.de>
Date: Mon Apr 10 22:52:44 2006 -0700
[PATCH] sched: fix interactive task starvation
(and a few smaller tweaks since then too.)
and that change from Mike responded to a testcase. Mike's latest changes
(the ones you just tested) were mostly driven by actual testcases too,
which measured long-term timeslice distribution fairness.
It's really hard to judge interactivity subjectively, so we rely on
things like interbench (written by Con) - in which testsuite the
upstream scheduler didnt fare all that badly, plus other testcases
(thud.c, game_sim.c, now massive_inter.c, fiftyp.c and chew.c) and all
the usual test-workloads. This is admittedly a slow process, but it
seems to be working too and it also ensures that we dont regress in the
future. (because testcases stick around and do get re-tested)
your system seems to also be a bit special because you 1) drive it to
the absolute max on the desktop but you do not overload it in obvious
ways (i.e. your workloads are pretty fairly structured) 2) it's a bit
under-powered (single-CPU 800 MHz CPU, right?) but not _too_
underpowered - so i think you /just/ managed to hit 'the worst' of the
current interactivity estimator: with important tasks both being just
above and just below 50%. Believe me, on all ~10 systems i use
regularly, Linux interactivity of the vanilla scheduler is stellar. (And
that includes a really old 500 MHz one too with FC6 on it.)
Ingo
-
Actually, it's an XP2800 Athlon, 333 fsb, gig of memory. And I was all enthusiastic about this until amanda's nightly run started, at which point I started losing control for quite long periods, 30+ seconds at a time. Up till then I thought we had it made. In this regard, Con's patches were enough better to notice it right away, lags were 1-2 seconds max. That seems to be the killer loading here, building a kernel (make -j3) doesn't seem to lag it all that bad. One session of gzip -best makes it fall plumb over though, which was a disappointment. But, I could live with this. Now if I could figure out a way to nail dm_mod down to a fixed LANANA approved address, I just got bit again, because enabling pktcdvd caused a MAJOR switch, only from 253 to 252 but tar thinks the whole 45GB is all new again. So since it, dm_mod, no longer carries the experimental label, let's put that patch back in and be done with this particular hassle once and for all. If I had known that using LVM2 was going to be such a pain in the ass just with this item alone, I wouldn't have touched it with a 50 foot fiberglass pole. Or does this SOB affect normal partition mountings too? I don't know, and the suggested fixes from David Dillow I put in /etc/modprobe.conf are ignored for dm_mod, and when extended to pktcdvd, cause pktcdvd to fail totally. Mmm??, can I pass an 'option dm_mod major=238' as a kernel argument & make -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Real Programmers don't write in PL/I. PL/I is for programmers who can't decide whether to write in COBOL or FORTRAN. -
Or at least send me a couple of 5 or 10 second top snapshots (which also show CPU usage of sleeping tasks) while the system is misbehaving? -Mike -
With what monitor utility? -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) "Microsoft technology" -- isn't that an oxymoron? -- Gareth Barnard -
This may not be so informative, it's almost behaving ATM.
29252 amanda    22   0  1856  572  220 R 76.4  0.1   1:07.24 gzip
29235 amanda    15   0  2992 1224  888 S  5.6  0.1   0:02.80 chunker
29500 root      18   0  2996 1164  788 S  4.0  0.1   0:02.40 tar
10459 amanda    15   0  3340 1052  832 S  3.0  0.1   0:49.04 amandad
10536 amanda    15   0  3276 1308 1004 S  2.3  0.1   0:40.92 dumper
29496 amanda    18   0  2808  472  280 S  2.0  0.0   0:01.73 sendbackup
 4057 gkrellmd  15   0 11568 1172  896 S  1.3  0.1   7:45.82 gkrellmd
29498 amanda    18   0  2396  780  656 S  1.0  0.1   0:00.60 tar
19183 root      15   0     0    0    0 S  0.7  0.0   0:01.92 pdflush
I also note with some disdain that I'm half a megabyte into swap, but I've had FF-2.0.0.3 busy for the last hour while amanda was trying to find a few cycles at the same time. Looking at a bunch of pdf's of circuit boards to see if I wanna build them for my milling machine. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Fatal Error: Found MS-Windows System -> Repartitioning Disk for Linux... -
Sure. Try 'tar czf nameofarchive.tar.gz /path/to-dir-to-be-backed-up' Or, from the runtar log from this morning, and this is all one line: runtar.20070408022016.debug:running: /bin/tar: 'gtar' '--create' '--file' '-' '--directory' '/usr/dlds-rpms' '--one-file-system' '--listed-incremental' '/usr/local/var/amanda/gnutar-lists/coyote_usr_dlds-rpms_1.new' '--sparse' '--ignore-failed-read' '--totals' '--exclude-from' '/tmp/amanda/sendbackup._usr_dlds-rpms.20070408022016.exclude' '.' and amanda will, if requested, pipe that output through a |gzip -best, and it's this process that brings the machine to the table begging for scraps like a puppy. Tar by itself can be felt but isn't bad. Even without the -best switch in effect, I'm sure you'll see the machine slow considerably. Please don't try to call amanda an unusual load as amanda itself is nothing but an intelligent manager, constructing the command lines passed to tar or dump, and gzip, which do the real work. Amdump, the manager my scripts wrap around, and my scripts themselves, will not use more than .01% of the cpu when averaged over the whole backup session. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) We are Microsoft. What you are experiencing is not a problem; it is an undocumented feature. -
That looks as if it should demo it pretty well if I understand correctly everything you're doing there. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) In /users3 did Kubla Kahn A stately pleasure dome decree, Where /bin, the sacred river ran Through Test Suites measureless to Man Down to a sunless C. -
Well, I let it process my ~250GB of data with my current tree, and it looked utterly harmless (and since I'm running SMP, was of course). I'll try building UP to make sure, and check mainline as well. -Mike -
Ok, I can't reproduce any bad interactivity here with that workload either with SMP or UP kernel. That said however, gzip does attain interactive status, which it really should not - that gives it an unfair advantage over its peers. With my throttled tree, it gets pushed back down to where it belongs. I'm going to try to tighten the tolerance on behavior to evict the riffraff who don't really belong in the elite interactive club sooner, and guarantee that even fast/light tasks can't dominate the CPU without paying heavily. (to close the many fast/light tasks wakeup scenario that the "untested" patch someone mentioned did, but was shown to be too painful to bear). -Mike -
and note that a year ago Mike did a larger patch too, not unlike his current patch - but we hoped that his smaller change would be sufficient - and nobody came along and said "i tested Mike's and the difference is significant on my system". Which seems to suggest that the number of problem-systems and worried users/developers isnt particularly large. Ingo -
May I suggest that while it may have been noticeable, it was not 'significant', so we didn't sing praises and bow to mecca at the time. I just thought that this is the way it was, till Con's patch proved otherwise for this 'desktop' user. We were then, and still are, looking for the magic that lets it all load up and slow down in a linear feeling fashion. Only those IRQ's that are fleeting and need serviced NOW should be exceptions to that rule. AFAIAC, gzip can take its turn in the queue, getting no more time in proportion than any other process that wakes up in its slice and finds it has something to do; if it has nothing to do it should yield the floor immediately, and in any event be put back at the far end of the queue when its timeslice is over. gzip in particular seems very reluctant to give up the cpu at what should be the end of its timeslice. As it is, the IRQ's are being serviced, so no keystrokes are being lost, or very few, unlike the situation 2 years ago when whole sentences typed blind were on the missing list when x finally did get a chance to play catchup. As a desktop user, I fail to understand any good reason why a keystroke typed can't be echoed to the screen within 200 milliseconds regardless of how many gzip -best's amdump may be running in the background. I have a coco3, running nitros9 at a cpu clock rate of 1.79 MHz with a 1/10th second context switch, in the basement that CAN do that while assembling an executable with a separate process printing the listing of that assembly as it progresses. Again, may I suggest that this sort of behavior on the desktop is a -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) The meek will inherit the earth -- if that's OK with you. -
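The queue behaviour described above (every runnable task takes one timeslice, then goes to the far end of the queue) can be put into numbers with a small sketch. This is an idealized strict round-robin model and not how either mainline or SD actually schedules: it assumes equal timeslices, no wakeup preemption, no priorities and no context-switch cost. The function names are made up for illustration.

```c
#include <assert.h>

/*
 * Worst-case wait, in milliseconds, for a task that has just joined the
 * back of a strict round-robin queue behind `hogs` CPU-bound tasks, each of
 * which runs for a full `slice_ms` timeslice before being requeued.
 */
unsigned long rr_worst_wait_ms(unsigned long hogs, unsigned long slice_ms)
{
    return hogs * slice_ms;
}

/*
 * Largest number of CPU hogs for which the worst-case wait above stays
 * within `bound_ms`, under the same idealized model.
 */
unsigned long rr_max_hogs(unsigned long bound_ms, unsigned long slice_ms)
{
    assert(slice_ms > 0);
    return bound_ms / slice_ms;
}
```

In this model the echo delay grows with both queue length and timeslice size, so a fixed bound such as 200 milliseconds can only be met by limiting one of the two or by letting the interactive task jump the queue.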
Actually, there was practically nil interest in testing. We made a couple of minor adjustments to the interactivity logic, and all went quiet, so I didn't think it was enough of a problem to require more intrusive countermeasures. -Mike -
Does one of these messages have a url so I can test the latest of your patches for -rc6? Or was the one Ingo sent the most recent? Putting that url in your sig would be nice, and might result in its getting a lot more exercise which should = more feedback. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Got a complaint about the Internal Revenue Service? Call the convenient toll-free "IRS Taxpayer Complaint Hot Line Number": 1-800-AUDITME -
No, my tree has a bugfix and some other adjustments that try to move the When I get it cleaned up and better tested, I'll post again. If you want, I'll CC you... willing victims are a highly valued commodity :) -Mike -
Ah yes, that one. Here's the next one in that series:
commit f1adad78dd2fc8edaa513e0bde92b4c64340245c
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date: Sun May 21 18:54:09 2006 -0700
Revert "[PATCH] sched: fix interactive task starvation"
It personally had me wonder if _anyone_ was testing this stuff...
Rene.
-
Well of course not. Making random untested changes, and reverting them later is half the fun of kernel development. -Mike -
The point of course is that the very example Molnar quoted as an example of responsible, testcase driven development was in fact hugely broken and sat in the tree that way for 4 rc's. To me, the example rather serves as confirmation of what Kolivas has been saying; endlessly tweaking the tweaks isn't going anywhere. The minute you tweak A, tweak B over there in corner C-Sharp falls flat on its face. Computers are horribly stupid and tend to fail most situations their smart human programmers didn't specifically tell them about. If, as in the case of a scheduler, the real-world demands on a piece of software are so diverse that you cannot tell them about all possible situations specifically, the only workable solution is to make them _predictable_ so that when hitting one of those special situations, the smart human using the computer at least gets to know how to intervene if he feels inclined to do so. This turned into an interactivity thing, and while interactivity is in fact better for a large majority of testers, that isn't what Kolivas' scheduler is about. It's about predictability and leaving the dead-end road of these endless tweaks, which then break previous tweaks, rinse, repeat. It's unfortunate that Kolivas is having health problems currently, but I certainly do hope that his scheduler finds its way into _a_ -rc1. He said it was done... Rene. -
To me, it's more than an interactivity thing. It is also about reacting Well, there I disagree with him quite strongly, but it's not my decision what gets integrated into any tree but my own ;-) -Mike -
Hi, The whole recent discussion/flamefest/... here makes me think that we're still heading towards actually introducing plugsched (most preferably by making mainline scheduler the builtin default and optionally building a plugsched kernel which then allows selection). There are fundamental behavioural differences between the various CPU scheduler types developed; while some people want a very interactive system with in most(!) cases good latency and exploit-less operation, several others want a scheduler which provides very predictable latency, low overhead and additionally as much interactivity as this strict model can provide for. And then there are people who have very specific SMP requirements which both characteristic scheduler types may have trouble satisfying properly. And I really don't see much difference whatsoever to the I/O scheduler area: some people want predictable latency, while others want maximum throughput or fastest operation for seek-less flash devices (noop). Hardware varies similarly greatly as well: Some people have huge disk arrays or NAS, others have a single flash disk. Some people have a decaying UP machine, others have huge SMP farms. IMHO both areas are too varied, thus runtime or compile-time selection is justified for both areas, not simply for I/O schedulers only. I don't think anybody would want to introduce new very similar scheduler types just for the fun of it; development would center around improving the at most 3 or 4 different scheduler implementations (as is the case with I/O schedulers, BTW: there hasn't been an explosion of different variants either!). I think the whole discussion went on the wrong track when people somehow had the notion of making RSDL (and its later variants) the main scheduler for desktop machines, not just server operation. 
And this target of course (and rightfully so) prompted people to ask for interactivity similar to what the current scheduler achieves which RSDL cannot fully provide within its ...
I do agree, and yes, I/O scheduling seems to not have suffered from the choice although I must say I'm not sure how much use each I/O scheduler individually sees. If one CPU scheduler can be good enough then it would be better to just have that one, but well, yes, maybe it can't. I certainly believe any one scheduler can't avoid breaking down under some condition. Demand is just too varied. I find it interesting that you see SD as a server scheduler and I guess deterministic behaviour does point in that direction somewhat. I would be enabling it on the desktop though, which probably is _some_ argument on having multiple schedulers. Rene. -
but ... SD clearly regresses in some areas, so by that logic SD isnt going anywhere either? note that i still like the basic idea about SD, that it is an experiment that if the only conceptual focus is on "scheduling fairness", we'll get a better scheduler. But for that to work out two things have to be done i think: - the code actually has to match that stated goal. Right now it diverges from it (it is not a "fair" scheduler), and it's not clear why. note that SD at the moment produces ~10% more code in sched.o, and the reason is that SD is more complex than the vanilla scheduler. People tend to get the impression that SD is simpler, partly because it is a net linecount win in sched.c, but many of the removed lines are comments. this "provide fairness" goal is quite important, because if SD's code is not only about providing fairness, what is the rest of the logic doing? Are they "tweaks", to achieve interactivity? If yes, why are they not marked as such? I.e. will we go down the _same_ road again, but this time with a much less clearly defined rule for what a "tweak" is? note that under the interactivity estimator it is not that hard to achieve forced "fairness". So _if_ we accept that scheduling must include a fair dose of heuristics (which i tend to think it has to), we are perhaps better off with an interactivity design that _accepts_ this fundamental fact and separates heuristics from core scheduling. Right now i dont see the SD proponents even _accepting_ that even the current SD code does include heuristics. the other one is: - the code has to demonstrate that it can flexibly react to various complaints of regressions. (I identified a few problem workloads that we tend to care about and i havent seen much progress with them - but i really reserve judgement about that, given Con's medical condition.) Ingo -
No. The logic isn't that (performance and other) characteristics must always be exactly the same between two schedulers, the logic is that having one of them turn into a contrived heap of heuristics where every progression on one front turns into a regression on another means that one is on a dead-end road. Now of course, while not needing to behave the same in all conceivable situations, any alternative like SD needs to behave _well_ and for me, I read most of the discussion centering around that specific point as well, and frankly, I mostly came away from it thinking "so what?". It seems this is largely an issue of you and Kolivas disagreeing on what needs to be called design and what needs to be called implementation, but more importantly I feel a solution is to just shy away from the inherently subjective word "fair". If you feel that some of the things SD does need to be called "unfair" as much as mainline, so be it, but do you think that SD is less _predictably_ fair or unfair than mainline? This is what I consider to be very important; if my retarded kid brother sometimes walks left and sometimes right when I tell him to walk forward, I can't go stand to the right and say "nono, forward I said". If on the right there's a highway, you can imagine what that means... All software One answer to that is that it's much less important what a tweak is as long as it's the same always. If I then don't like the definition I'll just define it the other way around privately and be done with it. I do believe that SDs objective is not fairness as such, it's predictability. Being "fair" was postulated as a condition for being so, but let's not put too much focus on that one point; it's a matter of definitions (and I agree that the demands on a (one) general purpose scheduler are so diverse that it's impossible to have one that doesn't break down under some set of conditions. The mainline scheduler does so, and SD does so. What SD does is take some of the ...
it's important due to what Mike mentioned in the previous mail too: SD seems to be quite rigid in certain aspects. So if we end up with that fundamental rigidity we might as well be _very_ sure that it makes sense. Because otherwise there might be no other way out but to "revert the whole thing again". Today we always have the "tweak the interactivity estimator" route, because that code is not rigid at the that's not what i found when testing Mike's latest patches - they visibly improved those testcases, part of which were written to "exploit" heuristics, without regressing others. Several people reported improvements with those patches. Why was that possible without spending years on writing a new scheduler? Because the interactivity estimator is fundamentally _tweakable_. What you flag with sometimes derogative sentences as a weakness of the interactivity estimator is also its strength: tweakability is flexibility. And no, despite what you claim to be a "patchwork" it makes quite some sense: reward certain scheduling behavior and punish other type of behavior. That's what SD does too in the end. Sure, if your "reward" fights against the "punishment", they cancel out each other, or if the metrics used are just arbitrary and make no independent sense it's bad, but that's just plain bad engineering. Why didnt much happen in the past year or so? Frankly, due to lack of demand for change - because most people were just happy about it, or just not upset enough. And i know the types of complaints first-hand, the -rt tree is a _direct answer_ to desktop-space complaints of Linux and it includes a fair bit of scheduler changes too. Now that we have actual new testcases and people with complaints and their willingness to i didnt say that, in fact my first lkml comment about RSDL on lkml was the exact opposite, but you SD advocates are _still_ bickering about (and not accepting) fundamental things like Mike's make -j5 workload and flagging it as ...
Mike's -j5 workload is AFAIAC, a very realistic workload for building a kernel. My own script I just discovered was using -j8, and that was noticeable, but by no means a killing hit on my poor old XP2800 Athlon. I pulled it back to 4 for this morning's build and the hit, while less, is still noticeable. Killer hit? No way. Using Mike's v4 patch I think it -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) When you jump for joy, beware that no-one moves the ground from beneath your feet. -- Stanislaw Lem, "Unkempt Thoughts" -
I suppose I'm lumped in with the "SD advocates" now but you will note that I haven't been bickering about make -j5 loads. You cut away the entire meat of my reply which was all that predictability harping. What I did say about make -j5 loads is that I do not think that they, under all circumstances, on all machines and at all cost, need to perform the same as currently if other situations improve. Do I want heuristics? Sure, I'm just saying the kernel is fundamentally incapable of getting it right all of the time and as such it should provide me with as many opportunities as possible at stepping in. That is, let me understand what it is and is going to be doing and then listen to me. I agree not a lot of progress is to be made if people keep ignoring each other like that but also while SD's author is offline. Let's just shelve it until he's back. Not bury though... Rene. -
yes - in hindsight i regret having asked Mike for a "simpler" patch, which turned out to be rushed and plain broke your setup: my bad. And i completely forgot about that episode, Mike did a stream of changes in yes, i certainly tried it and it broke nothing, and it was in fact acked so reverting it was justified. Basically, the approach was that the vanilla scheduler is working reasonably well, and that any improvement to it must not cause regression in areas where it already works well. (it obviously must have been working on your audio setup to a certain degree if reverting Mike's patch made the underruns go away) In any case, it would be very nice if you could try Mike's latest patch, how does it work on your setup? (i've attached it) Ingo
Can do. Note that "my setup" in that case consisted of browsing around eBay in firefox with ogg123 playing audio directly to ALSA in an xterm as the only other thing running. That is, just about as basic a Linux desktop as imaginable. Testing Mike's latest will have to wait a bit though; I'm currently testing the latest incarnation of SD (against 2.6.20.6). For people who've lost track of what and where, it's available as: http://ck.kolivas.org/patches/staircase-deadline/2.6.20.5-sd-0.39.patch and versus 2.6.21-rc5 as: http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc5-sd-0.39.patch For the moment it is giving me a snappy feeling desktop on this Duron 1300, with ogg123 playing in an xterm without audio underruns, with a make -j2 kernel compile running (not niced) and me browsing around in firefox. Mike's latest would probably also support this load without much problem. Given that I feel the basic idea of SD is better than mainline though, I'll be concentrating on using SD for a bit for now. Rene. -
Hi, I am one of those who have been happily testing Con's patches. They work better than mainline here. There seems to be a disconnect on what Con is trying to achieve with SD. They do not improve interactivity per se. Instead they make the scheduler predictable by removing the alchemy used by the interactivity estimator. Mike's patches may be better alchemy but they continue down the same path - from prior experience, we can say with fairly good confidence, that there will be new corner cases that trigger problems. With SD, if you ask too much of the machine it slows down. You can fix this, if required, by renicing some tasks - or by reducing the load on the box. If one really needs some sort of interactivity booster (I do not with SD), why not move it into user space? With SD it would be simple enough to export some info on estimated latency. With this user space could make a good attempt to keep latency within bounds for a set of tasks just by renicing.... Thanks Ed Tomlinson PS. Get well soon Con. -
(I tried a UP kernel yesterday, and even a single kernel build would I don't think you can have very much effect on latency using nice with SD once the CPU is fully utilized. See below.
/*
 * This contains a bitmap for each dynamic priority level with empty slots
 * for the valid priorities each different nice level can have. It allows
 * us to stagger the slots where differing priorities run in a way that
 * keeps latency differences between different nice levels at a minimum.
 * ie, where 0 means a slot for that priority, priority running from left to
 * right:
 * nice -20 0000000000000000000000000000000000000000
 * nice -10 1001000100100010001001000100010010001000
 * nice   0 0101010101010101010101010101010101010101
 * nice   5 1101011010110101101011010110101101011011
 * nice  10 0110111011011101110110111011101101110111
 * nice  15 0111110111111011111101111101111110111111
 * nice  19 1111111111111111111011111111111111111111
 */
Nice allocates bandwidth, but as long as the CPU is busy, tasks always proceed downward in priority until they hit the expired array. That's the design. If X gets busy and expires, and a nice 20 CPU hog wakes up after its previous rotation has ended, but before the current rotation is ended (ie there is 1 task running at wakeup time), X will take a guaranteed minimum 160ms latency hit (quite noticeable) independent of nice level. The only way to avoid it is to use a realtime class. A nice -20 task has maximum bandwidth allocated, but that also makes it a bigger target for preemption from tasks at all nice levels as it proceeds downward toward expiration. AFAICT, low latency scheduling just isn't possible once the CPU becomes 100% utilized, but it is bounded to runqueue length. In mainline OTOH, a nice -20 task will always preempt a nice 0 task, giving it instant gratification, and latency of lower priority tasks is bounded by the EXPIRED_STARVING(rq) safety net. -Mike -
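To make the quoted bitmap comment concrete, here is a small sketch in C that counts how many priority slots each listed nice level gets per rotation. The bitmap strings are copied from the comment above, where a 0 marks a slot; the struct, table and function names are invented for this illustration and are not the SD patch's own data structures.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the quoted comment: a nice level and its slot bitmap. */
struct nice_row {
    int nice;
    const char *bitmap;     /* '0' = slot available at that priority */
};

static const struct nice_row sd_rows[] = {
    { -20, "0000000000000000000000000000000000000000" },
    { -10, "1001000100100010001001000100010010001000" },
    {   0, "0101010101010101010101010101010101010101" },
    {   5, "1101011010110101101011010110101101011011" },
    {  10, "0110111011011101110110111011101101110111" },
    {  15, "0111110111111011111101111101111110111111" },
    {  19, "1111111111111111111011111111111111111111" },
};

/* Count the '0' characters, i.e. the slots this bitmap leaves open. */
int count_slots(const char *bitmap)
{
    int n = 0;

    assert(bitmap != NULL);
    for (; *bitmap != '\0'; bitmap++)
        if (*bitmap == '0')
            n++;
    return n;
}

/* Slots for a nice level listed in the table, or -1 if it is not listed. */
int slots_for_nice(int nice)
{
    size_t i;

    for (i = 0; i < sizeof(sd_rows) / sizeof(sd_rows[0]); i++)
        if (sd_rows[i].nice == nice)
            return count_slots(sd_rows[i].bitmap);
    return -1;
}
```

Comparing the counts for two nice levels gives a rough feel for the relative bandwidth the table hands out, which fits the remark that nice allocates bandwidth; it says nothing about latency once a task has used up its slots.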
There's another aspect of this that may require some thought - kernel threads. As load increases, so does rotation length. Would you really want CPU hogs routinely preempting house-keepers under load? -Mike -
SD has a schedule batch nice level. This is good for tasks that want lots of cpu when they can get it. If you overload your cpu I expect the box to slow down - including kernel threads. If really required they can be started with a higher priority... Ed -
Sure. Anything that is latency sensitive, and those kernel threads that are necessary for system function can be made RT to bypass the designed in latency. It's just another thing that should be considered before integration. Now if burst loads (only one of which is the desktop) would just cease to exist... -Mike -
Interesting. I run UP amd64, 1000HZ, 1.25G, preempt off (on causes kernel stalls with no messages - but that is another story). I do not notice a single make. When several are running the desktop slows down a bit. I do not have X niced. Wonder why we see such different results?

I am not saying that SD is perfect - I fully expect that more bugs will turn up in its code (some will affect mainline too). I do however like the idea of a scheduler that does not need alchemy to achieve good results. Nor do I necessarily expect it to be 100% transparent. If one changes something as basic as the scheduler some tweaking should be expected. IMO this

Mike, I made no mention of low latency. I did mention predictable latency. If you are 100% utilized, and have a nice -20 task cpu hog, I would expect it to run and that it _should_ affect other tasks - that's why it runs with -20...

This is why I suggest that user space may be a better place to boost interactive tasks. A daemon that posted a message telling me that the nice -20 cpu hog is causing 300ms delays for X would, IMHO, be a good thing. That same daemon could then propose a fix telling me the expected latencies and let me decide if I want to change priorities. It could also be set to automatically adjust nice levels...

Thanks
Ed -
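To make the daemon idea above a little more concrete, here is a rough C sketch of just the "adjust nice levels" step. It assumes the daemon has already measured a delay somehow (how to measure it is not covered here), and the policy in `propose_nice()` (back the hog off by one nice level whenever the observed delay exceeds a budget) is invented for illustration, as are both function names. Only `setpriority()` is a real POSIX call.

```c
#include <sys/types.h>
#include <sys/resource.h>

/*
 * Hypothetical policy: if the observed delay is over budget, make the
 * offending task one nice level less favoured, never going past 19.
 * Otherwise leave its nice level alone.
 */
int propose_nice(int current_nice, long observed_ms, long budget_ms)
{
    if (observed_ms <= budget_ms)
        return current_nice;
    return current_nice < 19 ? current_nice + 1 : 19;
}

/*
 * Apply a proposed nice level to a process. Returns 0 on success and
 * -1 on error, exactly as setpriority() does. Changing another user's
 * task, or lowering a nice value, normally needs privileges.
 */
int apply_nice(pid_t pid, int nice_value)
{
    return setpriority(PRIO_PROCESS, (id_t)pid, nice_value);
}
```

With the numbers from the message, `propose_nice(-20, 300, 100)` would suggest -19 for the hog; a real daemon would presumably report that suggestion first and only call `apply_nice()` if the user agreed or had opted in to automatic adjustment.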
Probably because with your processor, in general cc1 can get the job done faster, as can X. The big latency hit happens when you hit the end of the rotation. You simply don't hit it as often as I do. Anyone with an old PIII box should hit the wall very quickly indeed. I haven't had

You did say that Con's patch works better than mainline, and you seemed very much to be talking about the desktop. X very definitely is a latency sensitive application, and often a CPU hog to boot. The point I illustrated above is a salient point. If you don't want to hear about anything other than this idea about

I did above, that we are absolutely _going_ to take a 160ms + remaining task ticks latency hit. Nice -20 was used only to show clearly what SD trades away, and it's not only the desktop it's trading for mundane latency, it's trading any possibility of low latency, and dismissing burst loads as if they don't even exist. The current scheduler is dynamic. SD is utterly rigid.

Apply what I wrote to X at the recommended nice -10. It makes no difference what bandwidth you allocate if the latency sensitive application _will_ take a very major latency hit if it uses it. X does

Re-read what I wrote. You simply can't get there from here, by design. If I'm wrong, someone please show me where.

-Mike -
Hi!

I am running 2.6.20.7 + sd-0.44 on an IBM ThinkPad T23 that I use as my Amarok machine[1]. It has a Pentium 3 with 1.13 GHz using ondemand frequency scaling and XFS as filesystem.

So far music playback has been perfect even when I had it building kernel packages while wildly clicking around starting apps and then moving the Amarok window like mad while solid window moving is enabled. Amarok / xine continued to play the music, totally unimpressed by that. So for me, from a user's point of view who wants good music playback *no matter what*, this is already perfect.

Also the desktop feels quite snappy to me. It was only slow on anything I/O bound, but that's understandable IMHO when make-kpkg tar -bzips the kernel source while 20 KDE applications are starting and Amarok plays music.

Should I try any specific tests? This also goes out to anybody else, especially to you, Con. So if you want me to run some benchmarks, please tell me. I am not experienced in benchmarking, but if you tell me what to do, I can try it out. I prefer benchmarks that do not disrupt music playback, but can run more aggressive benchmarks overnight.

I think it might be good to use a benchmark that isn't I/O bound to really test the scheduler... but as said I am no expert on that and real life loads usually are I/O bound as well.

I have to keep a careful eye on the harddisk though...

Apr 22 11:51:06 deepdance smartd[3116]: Device: /dev/sda, SMART Prefailure Attribute: 3 Spin_Up_Time changed from 154 to 150

(well, the threshold is at 033, so still plenty to go, hope it will take some time till the next change)

[1] http://martin-steigerwald.de/amarok-machine/ ;)

Regards,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -
Thanks for the report. In your case, you've done the testing I require; that for your workloads everything works as you'd desire it without obvious problems. Just keeping an eye on newer versions if you have the time and inclination and making sure that everything stays as you expect it would be the most helpful thing you can do. Thanks! -- -ck -
Hi, I am using 2.6.21-rc5 with rsdl 0.37 and think I still see a regression with my Athlon X2. Namely using this ac3 encoder (http://aften.sourceforge.net/), which I parallelized in a simple way, with my test sample I remember having encoding times of ~5.4sec with vanilla and ~5.8 sec with rsdl - once the whole test wave is in cache. Otherwise you can easily I/O limit the encoder. ;-) You need to get sources from svn though. The current 0.06 release doesn't have threads support. Cheers, -- (°= =°) //\ Prakash Punnoor /\\ V_/ \_V
BTW, I confirmed this regression. With vanilla 2.6.21-rc5 I get back my 5.4 secs with the test sample and two threads. Furthermore for me vanilla actually feels nicer on my dual core, even with load - just subjectively that's why I ditched rsdl... Cheers, -- (°= =°) //\ Prakash Punnoor /\\ V_/ \_V
My neck condition got a lot worse today. I'm forced offline for a week and will be uncontactable. -- -ck -
OK, this is bizarre. I'm getting this:

[ 52.754522] RTNL: assertion failed at net/ipv4/devinet.c (1055)
[ 52.758258] [<c02cb6f7>] inetdev_event+0x46/0x2d8
[ 52.762041] [<c01049c9>] show_trace_log_lvl+0x28/0x2c
[ 52.765887] [<c0105482>] show_trace+0xf/0x13
[ 52.769627] [<c01054d7>] dump_stack+0x14/0x18
[ 52.773320] [<c029b22e>] rtnl_unlock+0xd/0x2f
[ 52.776999] [<c029f410>] fib_rules_event+0x3a/0xeb
[ 52.780678] [<c01236aa>] notifier_call_chain+0x2c/0x55
[ 52.784339] [<c012371a>] raw_notifier_call_chain+0x17/0x1b
[ 52.787975] [<c0295984>] dev_open+0x63/0x6b
[ 52.791587] [<c02944fd>] dev_change_flags+0x50/0x104
[ 52.795201] [<c02cbcf4>] devinet_ioctl+0x259/0x57b
[ 52.798798] [<c02955b2>] dev_ifsioc+0x113/0x3a0
[ 52.802408] [<c028b127>] sock_ioctl+0x1a1/0x1c4
[ 52.805966] [<c028af86>] sock_ioctl+0x0/0x1c4
[ 52.809475] [<c0165969>] do_ioctl+0x19/0x4d
[ 52.812977] [<c0165b99>] vfs_ioctl+0x1fc/0x216
[ 52.816478] [<c0165bff>] sys_ioctl+0x4c/0x65
[ 52.819944] [<c0103b68>] syscall_call+0x7/0xb
[ 52.823395] =======================
[ 52.826923] RTNL: assertion failed at net/ipv4/igmp.c (1358)
[ 52.830485] [<c02cf545>] ip_mc_up+0x35/0x59
[ 52.834034] [<c029b22e>] rtnl_unlock+0xd/0x2f
[ 52.837569] [<c02cb7ed>] inetdev_event+0x13c/0x2d8
[ 52.841123] [<c01049c9>] show_trace_log_lvl+0x28/0x2c
[ 52.844682] [<c0105482>] show_trace+0xf/0x13
[ 52.848227] [<c01054d7>] dump_stack+0x14/0x18
[ 52.851752] [<c029b22e>] rtnl_unlock+0xd/0x2f
[ 52.855242] [<c029f410>] fib_rules_event+0x3a/0xeb
[ 52.858734] [<c01236aa>] notifier_call_chain+0x2c/0x55
[ 52.862241] [<c012371a>] raw_notifier_call_chain+0x17/0x1b
[ 52.865759] [<c0295984>] dev_open+0x63/0x6b
[ 52.869191] [<c02944fd>] dev_change_flags+0x50/0x104
[ 52.872571] [<c02cbcf4>] devinet_ioctl+0x259/0x57b
[ 52.875998] [<c02955b2>] dev_ifsioc+0x113/0x3a0
[ 52.879399] [<c028b127>] sock_ioctl+0x1a1/0x1c4
[ 52.882741] [<c028af86>] ...
