This message is to announce the first general public release of the "Rotating Staircase DeadLine" cpu scheduler. Based on previous work from the staircase cpu scheduler I set out to design, from scratch, a new scheduling policy design which satisfies every requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER) task management. Available for download are: A full rollup of the patch for 2.6.20: http://ck.kolivas.org/patches/staircase-deadline/sched-rsdl-0.26.patch Split patches for 2.6.20(which will follow this email): http://ck.kolivas.org/patches/staircase-deadline/split-out/ The readme (which will also constitute the rest of this email): http://ck.kolivas.org/patches/staircase-deadline/rsdl_scheduler.readme The following readme is also included as documentation in Documentation/sched-design.txt Rotating Staircase Deadline cpu scheduler policy ================================================ Design summary ============== A novel design which incorporates a foreground-background descending priority system (the staircase) with runqueue managed minor and major epochs (rotation and deadline). Features ======== A starvation free, strict fairness O(1) scalable design with interactivity as good as the above restrictions can provide. There is no interactivity estimator, no sleep/run measurements and only simple fixed accounting. The design has strict enough a design and accounting that task behaviour can be modelled and maximum scheduling latencies can be predicted by the virtual deadline mechanism that manages runqueues. The prime concern in this design is to maintain fairness at all costs determined by nice level, yet to maintain as good interactivity as can be allowed within the constraints of strict fairness. Design description ================== RSDL works off the principle of providing each task a quota of runtime that it is allowed to run at each priority level equal to its static priority (ie. its nice level) and every ...
It's probably worth mentioning that this scheduler shows a not insignificant improvement in the mysql test case that recently received a lot of publicity. Why exactly that's the case I'm not sure but it may help track down what is actually responsible for the performance drop off as well. Note that the SMP balancing of this cpu scheduler is essentially unchanged from the mainline one. -- -ck -
Only preliminary other benchmarking has been done so far on RSDL, and so far the results are equal to or slightly better than mainline on scalability grounds. These are the sysbench graphs just for completion that Nishanth Aravamudan performed. We were actually trying to track down the cause of the mysql performance problem and I suggested trying different cpu schedulers as well. Needless to say I'm not unhappy with the results although they haven't told us exactly where the problem lies but it certainly does add evidence that perhaps it is a scheduling issue. -- -ck
I assume to test this, we select the deadline scheduler? -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) -
No, only the "deadline" in the name is shared. This is a cpu process scheduler whereas the deadline scheduler you're thinking of is an I/O scheduler. To test this you just patch it in and it replaces the current mainline cpu scheduler (the same way the staircase cpu scheduler in -ck replaces it). -- -ck -
Oh, then I tried to put the -ck1 patch on top of that, and blew the tree up. I'd built it the first time with the deadline scheduler as the default while I was waiting on your reply. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) -
Yes, sorry. This is mutually exclusive with the -ck1 patch. It is a standalone piece of code. -- -ck -
I just this instant got it booted, what with building a driver for nvidia and all. I'll let you know what I think in a few hours after I've gotten a feel for it. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) -
Great, thanks. Just to make it clear. The purpose of this scheduler is at all costs to maintain absolute fairness no matter what type of load it is put under. This means that if you heavily load up your machine without the use of 'nice' then your interactive tasks _will_ slow down proportionately to the amount of cpu you use. So doing make -j4 for example will make any other task started in taht presence run precisely 1/5th speed, but they will still be responsive, have low latency (and audio shouldn't skip for example). There will be times when the mainline scheduler feels more interactive than this scheduler, and that is because it has significant unfairness granted towards interactive tasks. This degree of unfairness in an effort to maintain interactivity has been criticised and causes problems in certain environments with both loss of fairness, relative starvation and is not entirely predictable. This was designed to be robust for any application since linux demands a general purpose scheduler design, while preserving interactivity, instead of optimising for one particular end use. -- -ck -
Well, in 20 minutes of playing, I am so far VERY impressed, the kmail composer is typing to the screen in unison to my keystrokes, where with the older version it often went away for 10 or more seconds at a time, then displayed the last sentence I had typed in one (or 2 sometimes) swell foop. Now I can back up and correct a typo in real time whereas before, it was often faster if the typo was half a line back, and the key repeat so darned slow it was often over a second per character cell moved, to go grab the mouse, position the cursor on the typo, click, wait for the indicator to move, and then fix it. Typing is now pleasurable again. Key repeats seem to remain at the set in the bios key repeat speed. Amazing. I do believe you have given me back my machine. I may find something to squawk about in due time, I think that Murphy guys laws may have a corollary on that subject, but it sure feels good right -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) -
XP-2800 Athlon on a biostar board whose model number I've forgotten. 1 GB of dual channel 400 mhz ram running at 333 because this athlon falls over at 400mhz fsb settings. 480GB of drives in 3 pieces. Running a jaton 3dForce 6200-256 video card with nvidia driver, which does the opengl stuff better than the ati 9200se I took out a week ago, but this nvidia card is several times harder on the cpu when tvtime is running than the ati was. There are other cards in this box, 1394, nic, usb expander, but that is the main list of power burners. Its dmesg clock is just short of 2100. and: Calibrating delay using timer specific routine.. 4177.82 BogoMIPS (lpj=2088910) In comparison to an XP-1400 I had before, this is nowhere near 2x faster. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) -
Hi Con !
Well, I haven't tested it yet, but your design choices please me. As you
know, I've been one of those encountering big starvation problems with
the original scheduler, making 2.6 unusable for me in many situations. I
welcome your work and want to thank you for the time you spend trying to
fix it.
Keep up the good work,
Willy
PS: I've looked at your graphs, I hope you're on the way to something really
better than the 21 first 2.6 releases !
-
Well, imho his current staircase scheduler already does a better job compar= ed=20 to mainline, but it won't make it in (or at least, it's not likely). So we= =20 can hope this WILL make it into mainline, but I wouldn't count on it. grtz Jos =2D-=20 Disclaimer: Alles wat ik doe denk en zeg is gebaseerd op het wereldbeeld wat ik nu heb.= =20 Ik ben niet verantwoordelijk voor wijzigingen van de wereld, of het beeld w= at=20 ik daarvan heb, noch voor de daaruit voortvloeiende gedragingen van mezelf.= =20 Alles wat ik zeg is aardig bedoeld, tenzij expliciet vermeld.
Wrong problem, what is really needed is to get CPU scheduler choice into mainline, just as i/o scheduler finally did. Con has noted that for some loads this will present suboptimal performance, as will his -ck patches, as will the default scheduler. Instead of trying to make ANY one size fit all, we should have a means to select, at runtime, between any of the schedulers, and preferably to define an interface by which a user can insert a new scheduler in the kernel (compile in, I don't mean plugable) with clear and well defined rules for how that can be done. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot -
Been there, done that. Wli wrote the infrastructure for plugsched; I took his code and got it booting and ported 3 or so different scheduler designs. It allowed you to build as few or as many different schedulers into the kernel and either boot the only one you built into your kernel, or choose a scheduler at boot time. That code got permavetoed by both Ingo and Linus. After that I gave up on that code and handed it over to Peter Williams who still maintains it. So please note that I pushed the plugsched barrow previously and still don't think it's a bad idea, but the maintainers think it's the wrong approach. RSDL is my attempt to create a cpu scheduler to try to do everything well instead of some things perfectly, to be best fit when trying to shoehorn it into any environment. -- -ck -
In a way, I think they are right. Let me explain. Pluggable schedulers are useful when you want to switch away from the default one. This is very useful during development of a new scheduler, as well as when you're not satisfied with the default scheduler. Having this feature will incitate many people to develop their own scheduler for their very specific workload, and nothing generic. It's a bit what happened after all : you, Peter, Nick, and Mike have worked a lot trying to provide alternative solutions. But when you think about it, there are other OSes which have only one scheduler and which behave very well with tens of thousands of tasks and scale very well with lots of CPUs (eg: solaris). So there is a real challenge here to try to provide something at least as good and universal because we know that it can exist. And this is what you finally did : work on a scheduler which ought to be good with any workload. Then, when we have a generic, good enough scheduler for most situations, I think that it could be good to get the plugsched for very specific usages. People working in HPC may prefer to allocate ressource differently for instance. There may also be people refusing to mix tasks from different users on two different siblings of one CPU for security reasons, etc... All those would justify a plugable scheduler. But it should not be an excuse to provide a set of bad schedulers and no good one. The CPU scheduler is often compared to the I/O schedulers while in fact this is a completely different story. The I/O schedulers are needed because the hardware and filesystems may lead to very different behaviours, and the workload may vary a lot (eg: news server, ftp server, cache, desktop, real time streaming, ...). But at least, the default I/O scheduler was good enough for most usages, and alternative ones are here to provide optimal solutions to specific needs. Regards, Willy -
Solaris has a pluggable scheduler framework (each policy -- OTHER/FIFO/RR/etc. -- is it's own separate component). -- Nicholas Miell <nmiell@comcast.net> -
Did that happen for I/O? There are a few schedulers, eg some for servers,=20 I can imagine a desktop can work optimally with another scheduler than a ti= ny=20 embedded OS in a phone or an 8-core system serving a website, or a=20 distributed 512 core system doing heavy scientific calculations?!? Optimizing for all at the same time involves some compromises, and thus lim= its=20 CFQ does pretty well at most workloads, that's why it's default, right? But= =20 there is choice, which is a good thing. And the current mainline CPU=20 Ok, for I/O, the diff could be pretty big. But still, there are workloads=20 which could be improved by a certain scheduler, right? And wouldn't it make sense then to have a choice in the default kernel at=20 boottime? If that wouldn't hurt performance, it would be an improvement for= =20 desktop distributions like (K)ubuntu who can set staircase by default, and= =20 server distro's offering RSDL... At least having a desktop/interactivity optimized scheduler like staircase = and=20 a fair, throughput-optimized scheduler like RSDL sounds sane. RSDL does=20 better at the msql testcase, staircase is better on the desktop... We're no= t=20 talking about huge amounts of code, or 10 schedulers, and the diff of a few= =20 percent and better scaling on many cpu's and processes versus better=20
The problem is not with "any workload," because that's not the issue, the issue is the definition of "good" matching the administrator's policy. And that's where the problem comes in. We have the default scheduler, which favors interactive jobs. We have Con's staircase scheduler which is part of an interactivity package. We have the absolutely fair scheduler which is, well... fair, and keeps things smooth and under reasonable load crisp. There are other schedulers in the pluggable package, I did a doorknob scheduler for 2.2 (everybody gets a turn, special case of round-robin). I'm sure people have quietly hacked many more, which have never been presented to public view. The point is that no one CPU scheduler will satisfy the policy needs of all users, any more than one i/o scheduler does so. We have realtime scheduling, preempt both voluntary and involuntary, why should we not have multiple CPU schedulers. If Linus has an objection to plugable schedulers, then let's identify what the problem is and address it. If that means one scheduler or the other must be compiled in, or all Unless you force the the definition of "good" to "what the default scheduler does," there can be no "one" good one. Choice is good, no one is calling for bizarre niche implementations, but we have at minimum three CPU schedulers which as "best" for a large number of users. And multiple schedulers are needed because the type of load, mix of loads, and admin preference all require decisions at the policy which can't be covered by a single solution. Or at least none of the existing solutions, and I think letting people tune the guts of scheduler policy is more dangerous than giving a selection of solutions. Linux has been about choice all along, I hope it's nearly time for a solution better than patches to be presented. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 -
Hi Bill, On Tue, Mar 06, 2007 at 04:37:37PM -0500, Bill Davidsen wrote: I'm not in Linus' head, but I think that he wanted the recurrent scheduler problems to be addressed first for most users before going further. Too much choice is often dangerous for quality. For instance, look at all the netfilter modules. Many of them were completely bogus in their early stages, and some of them even do mostly the same jobs, and many of them have never left the "extra" stage. Choice is good to detect users' needs, it's good for global evolution, but it's not as good when you want to have something By "good", I mean a scheduler that is not trivially DoSable, and which does not cause unexpected long pauses to some processes without any reason (processes which cannot get any time slice for tens of seconds, or ssh daemons which freeze under system load, to the point of totally preventing There's a difference between CPU and I/O scheduler though. With the CPU scheduler, you've always had the choice to assign per-process priorities with "nice". Don't get me wrong, I'm all for pluggable schedulers, as I'm an ever unsatisfied optimizer. It's just that I think it has been good to encourage people to focus on real issues before dispersing efforts on different needs. I hope that Con's work will eventually get merged and that the door will be opened towards pluggable schedulers. Best regards, Willy -
This patch has been queued at test.kernel.org (thanks Andy Whitcroft). Here is a summary of the first few benchmarks from those tests rsdl 0.26 vs mainline 2.6.20 (However much it is that you value these particular benchmarks...) on amd64 4x: Kernbench (lower elapsed is better); ========= 2.6.20: Elapsed: 103.058s User: 351.974s System: 36.474s CPU: 376.6% 2.6.20-rsdl: Elapsed: 102.848s User: 359.186s System: 36.666s CPU: 384.6% tbench (higher is better): ====== 2.6.20: Throughput 165.5 MB/sec 1 procs 2.6.20-rsdl: Throughput 261.418 MB/sec 1 procs reaim (higher is better): ===== 2.6.20: Max Jobs per Minute 500727.27 2.6.20-rsdl: Max Jobs per Minute 612000.00 -- -ck -
cool! I like this even more than i liked your original staircase scheduler from 2 years ago :) Lets try what we did back then: put it into -mm and see what breaks (if anything). But in general, it is becoming increasingly clear that the interactivity estimator is a more fragile concept than the built-in quota mechanism of the staircase scheduler, so if it works in practice i'm quite in favor of it, even if it regresses /some/ workloads. Ingo -
Great! Thanks for your support. After futzing around for all that time I've become sure that an approach without an interactive estimator is our only way forward. So far the throughput benchmarks are encouraging too so I suspect the estimator may be causing harm there too. Ensuring the different arches and cpuidle work properly I likely will need help with, though, so I'd appreciate any help from people if they see something obvious and can get a grip of my code. Thanks! -- -ck -
Hi Con It would be nice if you could rebase this patch to latest git or at least to 2.6.21-rc3. Regards, Fabio -
Check in http://ck.kolivas.org/patches/staircase-deadline/ There's an -rc3 patch there. -- -ck -
Well, downloaded - compiled - booted: initng measures 17.369 seconds to complete the boot process; without the patch the same kernel booted in 21.553 seconds. Very impressive. Many thanks for your work. Fabio -
Con, you've really outdone yourself this time ! :D As a long time user of the -ck patchset, RSDL is a welcome change, and a great piece of code to play around with, and USE! Booted up on my system perfectly, Pentium-D 830 3GHz, 1.5GB RAM. No problems whatsoever so far, using 0.26. I can launch up a bunch of encode jobs, in SCHED_NORMAL even, and still have low latency on my desktop (I know it's not low latency _specific_ code, but it works very well). I guess all I can say is.. wow. This code isn't "prime time" ready, yet.. But it can be, and would be a great addition to mainline. Hell, a little tuning and merging this with a few current ck patches could make a damn fine kernel, and probably beat out the original staircase in desktops. :) Keep up the good work ! -r -- Rodney "meff" Gordon II -*- meff@pobox.com Systems Administrator / Coder Geek -*- Open yourself to OpenSource -
