logo
Published on KernelTrap (http://kerneltrap.org)

Linux: The Completely Fair Scheduler

By Jeremy
Created Apr 18 2007 - 10:51

Ingo Molnar [interview [1]] released a new patchset titled the "Modular Scheduler Core and Completely Fair Scheduler". He explained, "this project is a complete rewrite of the Linux task scheduler. My goal is to address various feature requests and to fix deficiencies in the vanilla scheduler that were suggested/found in the past few years, both for desktop scheduling and for server scheduling workloads." The patchset introduces Scheduling Classes, "an extensible hierarchy of scheduler modules. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming about them too much." It also includes sched_fair.c with an implementation of the CFS desktop scheduler, "a replacement for the vanilla scheduler's SCHED_OTHER interactivity code," about which Ingo noted, "I'd like to give credit to Con Kolivas [interview [2]] for the general approach here: he has proven via RSDL/SD that 'fair scheduling' is possible and that it results in better desktop scheduling. Kudos Con!"

Regarding the actual implementation, Ingo explained, "CFS's design is quite radical: it does not use runqueues, it uses a time-ordered rbtree to build a 'timeline' of future task execution, and thus has no 'array switch' artifacts (by which both the vanilla scheduler and RSDL/SD are affected). CFS uses nanosecond granularity accounting and does not rely on any jiffies or other HZ detail. Thus the CFS scheduler has no notion of 'timeslices' and has no heuristics whatsoever. There is only one central tunable, /proc/sys/kernel/sched_granularity_ns, which can be used to tune the scheduler from 'desktop' (low latencies) to 'server' (good batching) workloads." He went on to note, "due to its design, the CFS scheduler is not prone to any of the 'attacks' that exist today against the heuristics of the stock scheduler".

During the followup discussion, Ingo explained that he wrote the 100K patch in 62 hours. In response to concerns that his efforts had not been discussed first on the Linux Kernel Mailing List, Ingo explained, "I prefer such early releases to lkml _alot_ more than any private review process. I released the CFS code about 6 hours after i thought 'okay, this looks pretty good" and i spent those final 6 hours on testing it (making sure it doesnt blow up on your box, etc.), in the final 2 hours i showed it to two folks i could reach on IRC (Arjan and Thomas) and on various finishing touches." He went on to add, "the 'design consultation' phase you are talking about is _NOW_! :)" Later in the discussion that touched on egos, Linux creator Linus Torvalds noted, "one of the most motivating things there *is* in open source is 'personal pride'," going on to add, "it's a really good thing, and it means that if somebody shows that your code is flawed in some way (by, for example, making a patch that people claim gets better behaviour or numbers), any *good* programmer that actually cares about his code will obviously suddenly be very motivated to out-do the out-doer!"


From: Ingo Molnar [3] [email blocked]
To:  linux-kernel
Subject: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
Date:	Fri, 13 Apr 2007 22:21:00 +0200

[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

i'm pleased to announce the first release of the "Modular Scheduler Core
and Completely Fair Scheduler [CFS]" patchset:

   http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch [4]

This project is a complete rewrite of the Linux task scheduler. My goal
is to address various feature requests and to fix deficiencies in the
vanilla scheduler that were suggested/found in the past few years, both
for desktop scheduling and for server scheduling workloads.

[ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
  new scheduler will be active by default and all tasks will default
  to the new SCHED_FAIR interactive scheduling class. ]

Highlights are:

 - the introduction of Scheduling Classes: an extensible hierarchy of
   scheduler modules. These modules encapsulate scheduling policy
   details and are handled by the scheduler core without the core
   code assuming about them too much.

 - sched_fair.c implements the 'CFS desktop scheduler': it is a
   replacement for the vanilla scheduler's SCHED_OTHER interactivity
   code.

   i'd like to give credit to Con Kolivas for the general approach here:
   he has proven via RSDL/SD that 'fair scheduling' is possible and that
   it results in better desktop scheduling. Kudos Con!

   The CFS patch uses a completely different approach and implementation
   from RSDL/SD. My goal was to make CFS's interactivity quality exceed
   that of RSDL/SD, which is a high standard to meet :-) Testing
   feedback is welcome to decide this one way or another. [ and, in any
   case, all of SD's logic could be added via a kernel/sched_sd.c module
   as well, if Con is interested in such an approach. ]

   CFS's design is quite radical: it does not use runqueues, it uses a
   time-ordered rbtree to build a 'timeline' of future task execution,
   and thus has no 'array switch' artifacts (by which both the vanilla
   scheduler and RSDL/SD are affected).

   CFS uses nanosecond granularity accounting and does not rely on any
   jiffies or other HZ detail. Thus the CFS scheduler has no notion of
   'timeslices' and has no heuristics whatsoever. There is only one
   central tunable:

         /proc/sys/kernel/sched_granularity_ns

   which can be used to tune the scheduler from 'desktop' (low
   latencies) to 'server' (good batching) workloads. It defaults to a
   setting suitable for desktop workloads. SCHED_BATCH is handled by the
   CFS scheduler module too.

   due to its design, the CFS scheduler is not prone to any of the
   'attacks' that exist today against the heuristics of the stock
   scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
   work fine and do not impact interactivity and produce the expected
   behavior.

   the CFS scheduler has a much stronger handling of nice levels and
   SCHED_BATCH: both types of workloads should be isolated much more
   agressively than under the vanilla scheduler.

   ( another rdetail: due to nanosec accounting and timeline sorting,
     sched_yield() support is very simple under CFS, and in fact under
     CFS sched_yield() behaves much better than under any other
     scheduler i have tested so far. )

 - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
   way than the vanilla scheduler does. It uses 100 runqueues (for all
   100 RT priority levels, instead of 140 in the vanilla scheduler)
   and it needs no expired array.

 - reworked/sanitized SMP load-balancing: the runqueue-walking
   assumptions are gone from the load-balancing code now, and
   iterators of the scheduling modules are used. The balancing code got
   quite a bit simpler as a result.

the core scheduler got smaller by more than 700 lines:

 kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
 1 file changed, 372 insertions(+), 1082 deletions(-)

and even adding all the scheduling modules, the total size impact is
relatively small:

 18 files changed, 1454 insertions(+), 1133 deletions(-)

most of the increase is due to extensive comments. The kernel size
impact is in fact a small negative:

   text    data     bss     dec     hex filename
  23366    4001      24   27391    6aff kernel/sched.o.vanilla
  24159    2705      56   26920    6928 kernel/sched.o.CFS

(this is mainly due to the benefit of getting rid of the expired array
and its data structure overhead.)

thanks go to Thomas Gleixner and Arjan van de Ven for review of this
patchset.

as usual, any sort of feedback, bugreports, fixes and suggestions are
more than welcome,

	Ingo


From: Ingo Molnar [5] [email blocked] Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Date: Fri, 13 Apr 2007 23:50:59 +0200 * Ingo Molnar [email blocked] wrote: > and even adding all the scheduling modules, the total size impact is > relatively small: > > 18 files changed, 1454 insertions(+), 1133 deletions(-) > > most of the increase is due to extensive comments. The kernel size > impact is in fact a small negative: > > text data bss dec hex filename > 23366 4001 24 27391 6aff kernel/sched.o.vanilla > 24159 2705 56 26920 6928 kernel/sched.o.CFS update: these were older numbers, here are the stats redone with the latest patch: text data bss dec hex filename 23366 4001 24 27391 6aff kernel/sched.o.vanilla 23671 4548 24 28243 6e53 kernel/sched.o.sd.v40 23349 2705 24 26078 65de kernel/sched.o.cfs so CFS is now a win both for text and for data size :) Ingo
From: Willy Tarreau <w@1wt.eu> Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Date: Sat, 14 Apr 2007 00:31:31 +0200 Hi Ingo, On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] (...) > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). I have a high confidence this will work better : I've been using time-ordered trees in userland projects for several years, and never found anything better. To be honnest, I never understood the concept behind the array switch, but as I never felt brave enough to hack something in this kernel area, I simply preferred to shut up (not enough knowledge and not enough time). However, I have been using a very fast struct timeval-ordered RADIX tree. I found generic rbtree code to generally be slower, certainly because of the call to a function with arguments on every node. Both trees are O(log(n)), the rbtree being balanced and the radix tree being unbalanced. If you're interested, I can try to see how that would fit (but not this week-end). Also, I had spent much time in the past doing paper work on how to improve fairness between interactive tasks and batch tasks. I came up with the conclusion that for perfectness, tasks should not be ordered by their expected wakeup time, but by their expected completion time, which automatically takes account of their allocated and used timeslice. It would also allow both types of workloads to share equal CPU time with better responsiveness for the most interactive one through the reallocation of a "credit" for the tasks which have not consumed all of their timeslices. I remember we had discussed this with Mike about one year ago when he fixed lots of problems in mainline scheduler. The downside is that I never found how to make this algo fit in O(log(n)). I always ended in something like O(n.log(n)) IIRC. But maybe this is overkill for real life anyway. Given that a basic two arrays switch (which I never understood) was sufficient for many people, probably that a basic tree will be an order of magnitude better. > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns > > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. I find this useful, but to be fair with Mike and Con, they both have proposed similar tuning knobs in the past and you said you did not want to add that complexity for admins. People can sometimes be demotivated by seeing their proposals finally used by people who first rejected them. And since both Mike and Con both have done a wonderful job in that area, we need their experience and continued active participation more than ever. > due to its design, the CFS scheduler is not prone to any of the > 'attacks' that exist today against the heuristics of the stock > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > work fine and do not impact interactivity and produce the expected > behavior. I'm very pleased to read this. Because as I have already said it, my major concern with 2.6 was the stock scheduler. Recently, RSDL fixed most of the basic problems for me to the point that I switched the default lilo entry on my notebook to 2.6 ! I hope that whatever the next scheduler will be, we'll definitely get rid of any heuristics. Heuristics are good in 95% of situations and extremely bad in the remaining 5%. I prefer something reasonably good in 100% of situations. > the CFS scheduler has a much stronger handling of nice levels and > SCHED_BATCH: both types of workloads should be isolated much more > agressively than under the vanilla scheduler. > > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) > > - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler > way than the vanilla scheduler does. It uses 100 runqueues (for all > 100 RT priority levels, instead of 140 in the vanilla scheduler) > and it needs no expired array. > > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. Will this have any impact on NUMA/HT/multi-core/etc... ? > the core scheduler got smaller by more than 700 lines: Well done ! Cheers, Willy
From: Ingo Molnar [6] [email blocked] Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Date: Sat, 14 Apr 2007 01:18:09 +0200 * Willy Tarreau <w@1wt.eu> wrote: > > central tunable: > > > > /proc/sys/kernel/sched_granularity_ns > > > > which can be used to tune the scheduler from 'desktop' (low > > latencies) to 'server' (good batching) workloads. It defaults to a > > setting suitable for desktop workloads. SCHED_BATCH is handled by the > > CFS scheduler module too. > > I find this useful, but to be fair with Mike and Con, they both have > proposed similar tuning knobs in the past and you said you did not > want to add that complexity for admins. [...] yeah. [ Note that what i opposed in the past was mostly the 'export all the zillion of sched.c knobs to /sys and let people mess with them' kind of patches which did exist and still exist. A _single_ knob, which represents basically the totality of parameters within sched_fair.c is less of a problem. I dont think i ever objected to this knob within staircase/SD. (If i did then i was dead wrong.) ] > [...] People can sometimes be demotivated by seeing their proposals > finally used by people who first rejected them. And since both Mike > and Con both have done a wonderful job in that area, we need their > experience and continued active participation more than ever. very much so! Both Con and Mike has contributed regularly to upstream sched.c: $ git-log kernel/sched.c | grep 'by: Con Kolivas' 1 | wc -l 19 $ git-log kernel/sched.c | grep 'by: Mike' | wc -l 6 and i'd very much like both counts to increase steadily in the future too :) > > - reworked/sanitized SMP load-balancing: the runqueue-walking > > assumptions are gone from the load-balancing code now, and > > iterators of the scheduling modules are used. The balancing code > > got quite a bit simpler as a result. > > Will this have any impact on NUMA/HT/multi-core/etc... ? it will inevitably have some sort of effect - and if it's negative, i'll try to fix it. I got rid of the explicit cache-hot tracking code and replaced it with a more natural pure 'pick the next-to-run task first, that is likely the most cache-cold one' logic. That just derives naturally from the rbtree approach. > > the core scheduler got smaller by more than 700 lines: > > Well done ! thanks :) Ingo
From: Con Kolivas [7] [email blocked] Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Date: Sun, 15 Apr 2007 13:27:13 +1000 On Saturday 14 April 2007 06:21, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch [8] > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. The casual observer will be completely confused by what on earth has happened here so let me try to demystify things for them. 1. I tried in vain some time ago to push a working extensable pluggable cpu scheduler framework (based on wli's work) for the linux kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he didn't like it) as being absolutely the wrong approach and that we should never do that. Oddly enough the linux-kernel-mailing list was -dead- at the time and the discussion did not make it to the mailing list. Every time I've tried to forward it to the mailing list the spam filter decided to drop it so most people have not even seen this original veto-forever discussion. 2. Since then I've been thinking/working on a cpu scheduler design that takes away all the guesswork out of scheduling and gives very predictable, as fair as possible, cpu distribution and latency while preserving as solid interactivity as possible within those confines. For weeks now, Ingo has said that the interactivity regressions were showstoppers and we should address them, never mind the fact that the so-called regressions were purely "it slows down linearly with load" which to me is perfectly desirable behaviour. While this was not perma-vetoed, I predicted pretty accurately your intent was to veto it based on this. People kept claiming scheduling problems were few and far between but what was really happening is users were terrified of lkml and instead used 1. windows and 2. 2.4 kernels. The problems were there. So where are we now? Here is where your latest patch comes in. As a solution to the many scheduling problems we finally all agree exist, you propose a patch that adds 1. a limited pluggable framework and 2. a fairness based cpu scheduler policy... o_O So I should be happy at last now that the things I was promoting you are also promoting, right? Well I'll fill in the rest of the gaps and let other people decide how I should feel. > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, In the last 4 weeks I've spent time lying in bed drugged to the eyeballs and having trips in and out of hospitals for my condition. I appreciate greatly the sympathy and patience from people in this regard. However at one stage I virtually begged for support with my attempts and help with the code. Dmitry Adamushko is the only person who actually helped me with the code in the interim, while others poked sticks at it. Sure the sticks helped at times but the sticks always seemed to have their ends kerosene doused and flaming for reasons I still don't get. No other help was forthcoming. Now that you're agreeing my direction was correct you've done the usual Linux kernel thing - ignore all my previous code and write your own version. Oh well, that I've come to expect; at least you get a copyright notice in the bootup and somewhere in the comments give me credit for proving it's possible. Let's give some other credit here too. William Lee Irwin provided the major architecture behind plugsched at my request and I simply finished the work and got it working. He is also responsible for many IRC discussions I've had about cpu scheduling fairness, designs, programming history and code help. Even though he did not contribute code directly to SD, his comments have been invaluable. So let's look at the code. kernel/sched.c kernel/sched_fair.c kernel/sched_rt.c It turns out this is not a pluggable cpu scheduler framework at all, and I guess you didn't really promote it as such. It's a "modular scheduler core". Which means you moved code from sched.c into sched_fair.c and sched_rt.c. This abstracts out each _scheduling policy's_ functions into struct sched_class and allows each scheduling policy's functions to be in a separate file etc. Ok so what it means is that instead of whole cpu schedulers being able to be plugged into this framework we can plug in only cpu scheduling policies.... hrm... So let's look on -#define SCHED_NORMAL 0 Ok once upon a time we rename SCHED_OTHER which every other unix calls the standard policy 99.9% of applications used into a more meaningful name, SCHED_NORMAL. That's fine since all it did was change the description internally for those reading the code. Let's see what you've done now: +#define SCHED_FAIR 0 You've renamed it again. This is, I don't know what exactly to call it, but an interesting way of making it look like there is now more choice. Well, whatever you call it, everything in linux spawned from init without specifying a policy still gets policy 0. This is SCHED_OTHER still, renamed SCHED_NORMAL and now SCHED_FAIR. You encouraged me to create a sched_sd.c to add onto your design as well. Well, what do I do with that? I need to create another scheduling policy for that code to even be used. A separate scheduling policy requires a userspace change to even benefit from it. Even if I make that sched_sd.c patch, people cannot use SD as their default scheduler unless they hack SCHED_FAIR 0 to read SCHED_SD 0 or similar. The same goes for original staircase cpusched, nicksched, zaphod, spa_ws, ebs and so on. So what you've achieved with your patch is - replaced the current scheduler with another one and moved it into another file. There is no choice, and no pluggability, just code trumping. Do I support this? In this form.... no. It's not that I don't like your new scheduler. Heck it's beautiful like most of your _serious_ code. It even comes with a catchy name that's bound to give people hard-ons (even though many schedulers aim to be completely fair, yours has been named that for maximum selling power). The complaint I have is that you are not providing quite what you advertise (on the modular front), or perhaps you're advertising it as such to make it look more appealing; I'm not sure. Since we'll just end up with your code, don't pretend SCHED_NORMAL is anything different, and that this is anything other than your NIH (Not Invented Here) cpu scheduling policy rewrite which will probably end up taking it's position in mainline after yet another truckload of regression/performance tests and so on. I haven't seen an awful lot of comparisons with SD yet, just people jumping on your bandwagon which is fine I guess. Maybe a few tiny tests showing less than 5% variation in their fairness from what I can see. Either way, I already feel you've killed off SD... like pretty much everything else I've done lately. At least I no longer have to try and support my code mostly by myself. In the interest of putting aside any ego concerns since this is about linux and not me... Because... You are a hair's breadth away from producing something that I would support, which _does_ do what you say and produces the pluggability we're all begging for with only tiny changes to the code you've already done. Make Kconfig let you choose which sched_*.c gets built into the kernel, and make SCHED_OTHER choose which SCHED_* gets chosen as the default from Kconfig and even choose one of the alternative built in ones with boot parametersyour code has more clout than mine will (ie do exactly what plugsched does). Then we can have 7 schedulers in linux kernel within a few weeks. Oh no! This is the very thing Linus didn't want in specialisation with the cpu schedulers! Does this mean this idea will be vetoed yet again? In all likelihood, yes. I guess I have lots to put into -ck still... sigh. > Ingo -- -ck
From: Ingo Molnar [9] [email blocked] Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Date: Sun, 15 Apr 2007 17:05:36 +0200 * Con Kolivas [email blocked] wrote: [ i'm quoting this bit out of order: ] > 2. Since then I've been thinking/working on a cpu scheduler design > that takes away all the guesswork out of scheduling and gives very > predictable, as fair as possible, cpu distribution and latency while > preserving as solid interactivity as possible within those confines. yeah. I think you were right on target with this call. I've applied the sched.c change attached at the bottom of this mail to the CFS patch, if you dont mind. (or feel free to suggest some other text instead.) > 1. I tried in vain some time ago to push a working extensable > pluggable cpu scheduler framework (based on wli's work) for the linux > kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he > didn't like it) as being absolutely the wrong approach and that we > should never do that. [...] i partially replied to that point to Will already, and i'd like to make it clear again: yes, i rejected plugsched 2-3 years ago (which already drifted away from wli's original codebase) and i would still reject it today. First and foremost, please dont take such rejections too personally - i had my own share of rejections (and in fact, as i mentioned it in a previous mail, i had a fair number of complete project throwaways: 4g:4g, in-kernel Tux, irqrate and many others). I know that they can hurt and can demoralize, but if i dont like something it's my job to tell that. Can i sum up your argument as: "you rejected plugsched, but then why on earth did you modularize portions of the scheduler in CFS? Isnt your position thus woefully inconsistent?" (i'm sure you would never put it this impolitely though, but i guess i can flame myself with impunity ;) While having an inconsistent position isnt a terminal sin in itself, please realize that the scheduler classes code in CFS is quite different from plugsched: it was a result of what i saw to be technological pressure for _internal modularization_. (This internal/policy modularization aspect is something that Will said was present in his original plugsched code, but which aspect i didnt see in the plugsched patches that i reviewed.) That possibility never even occured to me to until 3 days ago. You never raised it either AFAIK. No patches to simplify the scheduler that way were ever sent. Plugsched doesnt even touch the core load-balancer for example, and most of the time i spent with the modularization was to get the load-balancing details right. So it's really apples to oranges. My view about plugsched: first please take a look at the latest plugsched code: http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch [10] 26 files changed, 8951 insertions(+), 1495 deletions(-) As an experiment i've removed all the add-on schedulers (both the core and the include files, only kept the vanilla one) from the plugsched patch (and the makefile and kconfig complications, etc), to see the 'infrastructure cost', and it still gave: 12 files changed, 1933 insertions(+), 1479 deletions(-) that's the extra complication i didnt like 3 years ago and which i still dont like today. What the current plugsched code does is that it simplifies the adding of new experimental schedulers, but it doesnt really do what i wanted: to simplify the _scheduler itself_. Personally i'm still not primarily interested in having a large selection of schedulers, i'm mainly interested in a good and maintainable scheduler that works for people. so the rejection was on these grounds, and i still very much stand by that position here and today: i didnt want to see the Linux scheduler landscape balkanized and i saw no technological reasons for the complication that external modularization brings. the new scheding classes code in the CFS patch was not a result of "oh, i want to write a new scheduler, lets make schedulers pluggable" kind of thinking. That result was just a side-effect of it. (and as you correctly noted it, the CFS related modularization is incomplete). Btw., the thing that triggered the scheduling classes code wasnt even plugsched or RSDL/SD, it was Mike's patches. Mike had an itch and he fixed it within the framework of the existing scheduler, and the end result behaved quite well when i threw various testloads on it. But i felt a bit uncomfortable that it added another few hundred lines of code to an already complex sched.c. This felt unnatural so i mailed Mike that i'd attempt to clean these infrastructure aspects of sched.c up a bit so that it becomes more hackable to him. Thus 3 days ago, without having made up my mind about anything, i started this experiment (which ended up in the modularization and in the CFS scheduler) to simplify the code and to enable Mike to fix such itches in an easier way. By your logic Mike should in fact be quite upset about this: if the new code works out and proves to be useful then it obsoletes a whole lot of code of him! > For weeks now, Ingo has said that the interactivity regressions were > showstoppers and we should address them, never mind the fact that the > so-called regressions were purely "it slows down linearly with load" > which to me is perfectly desirable behaviour. [...] yes. For me the first thing when considering a large scheduler patch is: "does a patch do what it claims" and "does it work". If those goals are met (and if it's a complete scheduler i actually try it quite extensively) then i look at the code cleanliness issues. Mike's patch was the first one that seemed to meet that threshold in my own humble testing, and CFS was a direct result of that. note that i tried the same workloads with CFS and while it wasnt as good as mainline, it handled them better than SD. Mike reported the same, and Mark Lord (who too reported SD interactivity problems) reported success yesterday too. (but .. CFS is a mere 2 days old so we cannot really tell anything with certainty yet.) > [...] However at one stage I virtually begged for support with my > attempts and help with the code. Dmitry Adamushko is the only person > who actually helped me with the code in the interim, while others > poked sticks at it. Sure the sticks helped at times but the sticks > always seemed to have their ends kerosene doused and flaming for > reasons I still don't get. No other help was forthcoming. i'm really sorry you got that impression. in 2004 i had a good look at the staircase scheduler and said: http://www.uwsg.iu.edu/hypermail/linux/kernel/0408.0/1146.html [11] "But in general i'm quite positive about the staircase scheduler." and even tested it and gave you feedback: http://lwn.net/Articles/96562/ [12] i think i even told Andrew that i dont really like pluggable schedulers and if there's any replacement for the current scheduler then that would be a full replacement, and it would be the staircase scheduler. Hey, i told this to you as recently as 1 month ago as well: http://lkml.org/lkml/2007/3/8/54 [13] "cool! I like this even more than i liked your original staircase scheduler from 2 years ago :)" Ingo -----------> Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -16,6 +16,7 @@ * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. * 2004-04-02 Scheduler domains code by Nick Piggin + * 2007-04-15 Con Kolivas was dead right: fairness matters! :) */
From: Bill Huey (hui) [email blocked] Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Date: Sat, 14 Apr 2007 22:16:45 -0700 On Sun, Apr 15, 2007 at 01:27:13PM +1000, Con Kolivas wrote: ... > Now that you're agreeing my direction was correct you've done the usual Linux > kernel thing - ignore all my previous code and write your own version. Oh > well, that I've come to expect; at least you get a copyright notice in the > bootup and somewhere in the comments give me credit for proving it's > possible. Let's give some other credit here too. William Lee Irwin provided > the major architecture behind plugsched at my request and I simply finished > the work and got it working. He is also responsible for many IRC discussions > I've had about cpu scheduling fairness, designs, programming history and code > help. Even though he did not contribute code directly to SD, his comments > have been invaluable. Hello folks, I think the main failure I see here is that Con wasn't included in this design or privately in review process. There could have been better co-ownership of the code. This could also have been done openly on lkml (since this is kind of what this medium is about to significant degree) so that consensus can happen (Con can be reasoned with). It would have achieved the same thing but probably more smoothly if folks just listened, considered an idea and then, in this case, created something that would allow for experimentation from outsiders in a fluid fashion. If these issues aren't fixed, you're going to stuck with the same kind of creeping elitism that has gradually killed the FreeBSD project and other BSDs. I can't comment on the code implementation. I'm focus on other things now that I'm at NetApp and I can't help out as much as I could. Being former BSDi, I had a first hand account of these issues as they played out. A development process like this is likely to exclude smart people from wanting to contribute to Linux and folks should be conscious about this issues. It's basically a lot of code and concept that at least two individuals have worked on (wli and con) only to have it be rejected and then sudden replaced by code from a community gatekeeper. In this case, this results in both Con and Bill Irwin being woefully under utilized. If I were one of these people. I'd be mighty pissed. bill
From: Ingo Molnar [14] [email blocked] Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Date: Sun, 15 Apr 2007 10:44:47 +0200 * Bill Huey [email blocked] wrote: > Hello folks, > > I think the main failure I see here is that Con wasn't included in > this design or privately in review process. There could have been > better co-ownership of the code. This could also have been done openly > on lkml [...] Bill, you come from a BSD background and you are still relatively new to Linux development, so i dont at all fault you for misunderstanding this situation, and fortunately i have a really easy resolution for your worries: i did exactly that! :) i wrote the first line of code of the CFS patch this week, 8am Wednesday morning, and released it to lkml 62 hours later, 22pm on Friday. (I've listed the file timestamps of my backup patches further below, for all the fine details.) I prefer such early releases to lkml _alot_ more than any private review process. I released the CFS code about 6 hours after i thought "okay, this looks pretty good" and i spent those final 6 hours on testing it (making sure it doesnt blow up on your box, etc.), in the final 2 hours i showed it to two folks i could reach on IRC (Arjan and Thomas) and on various finishing touches. It doesnt get much faster than that and i definitely didnt want to sit on it even one day longer because i very much thought that Con and others should definitely see this work! And i very much credited (and still credit) Con for the whole fairness angle: || i'd like to give credit to Con Kolivas for the general approach here: || he has proven via RSDL/SD that 'fair scheduling' is possible and that || it results in better desktop scheduling. Kudos Con! the 'design consultation' phase you are talking about is _NOW_! :) I got the v1 code out to Con, to Mike and to many others ASAP. That's how you are able to comment on this thread and be part of the development process to begin with, in a 'private consultation' setup you'd not have had any opportunity to see _any_ of this. In the BSD space there seem to be more 'political' mechanisms for development, but Linux is truly about doing things out in the open, and doing it immediately. Okay? ;-) Here's the timestamps of all my backups of the patch, from its humble 4K beginnings to the 100K first-cut v1 result: -rw-rw-r-- 1 mingo mingo 4230 Apr 11 08:47 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 7653 Apr 11 09:12 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 7728 Apr 11 09:26 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 14416 Apr 11 10:08 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 24211 Apr 11 10:41 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 27878 Apr 11 10:45 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 33807 Apr 11 11:05 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 34524 Apr 11 11:09 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 39650 Apr 11 11:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 40231 Apr 11 11:34 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 40627 Apr 11 11:48 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 40638 Apr 11 11:54 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 42733 Apr 11 12:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 42817 Apr 11 12:31 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 43270 Apr 11 12:41 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 43531 Apr 11 12:48 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 44331 Apr 11 12:51 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45173 Apr 11 12:56 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45288 Apr 11 12:59 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45368 Apr 11 13:06 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45370 Apr 11 13:06 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45815 Apr 11 13:14 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45887 Apr 11 13:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45914 Apr 11 13:25 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45850 Apr 11 13:29 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 49196 Apr 11 13:39 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 64317 Apr 11 13:45 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 64403 Apr 11 13:52 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:03 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:07 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 68995 Apr 11 14:50 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 69919 Apr 11 15:23 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 71065 Apr 11 16:26 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 70642 Apr 11 16:28 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 72334 Apr 11 16:49 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 71624 Apr 11 17:01 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 71854 Apr 11 17:20 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 73571 Apr 11 17:42 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:49 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:51 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 75144 Apr 11 17:57 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 80722 Apr 11 18:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:41 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:59 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 89356 Apr 11 21:32 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 95278 Apr 12 08:36 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97749 Apr 12 10:49 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97687 Apr 12 10:58 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97722 Apr 12 11:06 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97933 Apr 12 11:22 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:04 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:09 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 100405 Apr 12 12:29 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 100380 Apr 12 12:50 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 101631 Apr 12 13:32 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102293 Apr 12 14:12 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102431 Apr 12 14:28 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102502 Apr 12 14:53 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102128 Apr 13 11:13 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102473 Apr 13 12:12 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102536 Apr 13 12:24 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102481 Apr 13 12:30 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 103408 Apr 13 13:08 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 103441 Apr 13 13:31 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104759 Apr 13 14:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104815 Apr 13 14:39 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104762 Apr 13 15:04 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 105978 Apr 13 16:18 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 105977 Apr 13 16:26 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 106761 Apr 13 17:08 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 106358 Apr 13 17:40 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 107802 Apr 13 19:04 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104427 Apr 13 19:35 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 103927 Apr 13 19:40 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 101867 Apr 13 20:30 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 101011 Apr 13 21:05 patches/sched-fair.patch i hope this helps :) Ingo
From: Linus Torvalds <torvalds@linux-foundation.org> Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Date: Sun, 15 Apr 2007 10:59:54 -0700 (PDT) On Sun, 15 Apr 2007, Mike Galbraith wrote: > On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote: > > > > He did exactly that and he did it with a patch. Nothing new here. This is > > how development on LKML proceeds when you have two or more competing > > designs. There's absolutely no need to get upset or hurt your feelings > > over it. It's not malicious, it's how we do Linux development. > > Yes. Exactly. This is what it's all about, this is what makes it work. I obviously agree, but I will also add that one of the most motivating things there *is* in open source is "personal pride". It's a really good thing, and it means that if somebody shows that your code is flawed in some way (by, for example, making a patch that people claim gets better behaviour or numbers), any *good* programmer that actually cares about his code will obviously suddenly be very motivated to out-do the out-doer! Does this mean that there will be tension and rivalry? Hell yes. But that's kind of the point. Life is a game, and if you aren't in it to win, what the heck are you still doing here? As long as it's reasonably civil (I'm not personally a huge believer in being too polite or "politically correct", so I think the "reasonably" is more important than the "civil" part!), and as long as the end result is judged on TECHNICAL MERIT, it's all good. We don't want to play politics. But encouraging peoples competitive feelings? Oh, yes. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [email blocked] More majordomo info at http://vger.kernel.org/majordomo-info.html [15] Please read the FAQ at http://www.tux.org/lkml/ [16]
From: Ingo Molnar [17] [email blocked] Subject: [patch] CFS (Completely Fair Scheduler), v2 Date: Tue, 17 Apr 2007 00:07:15 +0200 this is the second release of the CFS (Completely Fair Scheduler) patchset, against v2.6.21-rc7: http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch [18] i'd like to thank everyone for the tremendous amount of feedback and testing the v1 patch got - i could hardly keep up with just reading the mails! Some of the stuff people addressed i couldnt implement yet, i mostly concentrated on bugs, regressions and debuggability. there's a fair amount of churn: 15 files changed, 456 insertions(+), 241 deletions(-) But it's an encouraging sign that there was no crash bug found in v1, all the bugs were related to scheduling-behavior details. The code was tested on 3 architectures so far: i686, x86_64 and ia64. Most of the code size increase in -v2 is due to debugging helpers, they'll be removed later. (The new /proc/sched_debug file can be used to see the fine details of CFS scheduling.) Changes since -v1: - make nice levels less starvable. (reported by Willy Tarreau) - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first flag can be used to turn it on/off. (This might fix the Kaffeine bug reported by S.Çağlar Onur <) - changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas) - UP build fix. (reported by Gabriel C) - timer tick micro-optimization (Dmitry Adamushko) - preemption fix: sched_class->check_preempt_curr method to decide whether to preempt after a wakeup (or at a timer tick). (Found via a fairness-test-utility written for CFS by Mike Galbraith) - start forked children with neutral statistics instead of trying to inherit them from the parent: Willy Tarreau reported that this results in better behavior on extreme workloads, and it also simplifies the code quite nicely. Removed sched_exit() and the ->task_exit() methods. - make nice levels independent of the sched_granularity value - new /proc/sched_debug file listing runqueue details and the rbtree - new SCH-* fields in /proc/<NR>/status to see scheduling details - new cpu-hog feature (off by default) and sysctl tunable to set it: /proc/sys/kernel/sched_max_hog_history_ns tunable defaults to 0 (off). Positive values are meant the maximum 'memory' that the scheduler has of CPU hogs. - various code cleanups - added more statistics temporarily: sum_exec_runtime, sum_wait_runtime. - added -CFS-v2 to EXTRAVERSION as usual, any sort of feedback, bugreports, fixes and suggestions are more than welcome, Ingo



Related Links:


Source URL:
http://kerneltrap.org/node/8059