In an effort to fully understand the math proposed by Roman Zippel in his Really Fair Scheduler, Ingo Molnar implemented a simplified version of the logic on top of his Completely Fair Scheduler code which he then humorously labeled the Really Simple Really Fair Scheduler, "could you please confirm whether the math algorithm you are suggesting is implemented by this patch roughly correctly?" Ingo explained:
"As an addendum to my review, please find below a prototype patch I've just written that implements RSRFS (Really Simple Really Fair Scheduler) on top of CFS. It is intended to demonstrate the essence of the math you have presented via your patch. (it has no nice levels support yet, to make the math really apparent to everyone interested)"
Roman pointed out that things were oversimplified, "it simplifies the math too much, the nice level weighting is an essential part of the math and without it one can't really understand the problem I'm trying to solve." Ingo acknowledged this and explained, "in the first level review I'm mainly interested in your patch's behavior [with regards to] simple nice-0 tasks, how they run and how they sleep and wake up. Nothing else. Why? That's what 99% of the Linux tasks do after all." In another thread he explained that he would continue to review Roman's proposal, however also noting that he would be offline this week for the kernel summit and thus temporarily unresponsive to email.
From: Ingo Molnar [email blocked] Subject: [ANNOUNCE/RFC] Really Simple Really Fair Scheduler Date: Sun, 2 Sep 2007 14:01:54 +0200 * Ingo Molnar [email blocked] wrote: > Your math is fairly simple (and that is _good_, just like CFS's > existing math is simple), it can be summed up in essence as (without > complicating it with nice-level weighting, for easy > understandability): > > " use the already existing p->sum_exec_runtime 'task runtime' metric > that CFS maintains, and use that as the key into the rb-tree that > selects tasks that should be run next. To handle sleeping tasks: keep > a per-rq sum of all runnable task's ->sum_exec_runtime values and > start newly woken tasks at the average rq->sum/nr_running value. " > > Now your patch does not actually do it that way in a clearly > discernible manner because lots of changes are intermixed into one big > patch. > > ( please correct me if i got your math wrong. Your patch does not add > any comments at all to the new code and this slowed down my review > and analysis of your patch quite considerably. Lack of comments makes > it harder to see the purpose and makes it harder to notice the > benefits/tradeoffs involved in each change. ) Roman, as an addendum to my review, please find below a prototype patch i've just written that implements RSRFS (Really Simple Really Fair Scheduler) ontop of CFS. It is intended to demonstrate the essence of the math you have presented via your patch. (it has no nice levels support yet, to make the math really apparent to everyone interested) It's a truly simple patch: 3 files changed, 30 insertions(+), 23 deletions(-) and gives good fairness: $ ./massive_intr 10 10 002510 00000114 002515 00000114 002518 00000114 002514 00000114 002513 00000115 002509 00000115 002517 00000115 002511 00000115 002512 00000115 002516 00000115 Could you please confirm whether the math algorithm you are suggesting is implemented by this patch roughly correctly? (ignoring nice levels, i.e. only considering nice-0 tasks, ignoring rounding issues and Bresenham optimizations, CPU time slicing and other changes.) Thanks, Ingo
From: Roman Zippel [email blocked] Subject: Re: [ANNOUNCE/RFC] Really Simple Really Fair Scheduler Date: Mon, 3 Sep 2007 20:38:21 +0200 (CEST) Hi, On Sun, 2 Sep 2007, Ingo Molnar wrote: > Roman, as an addendum to my review, please find below a prototype patch > i've just written that implements RSRFS (Really Simple Really Fair > Scheduler) ontop of CFS. It is intended to demonstrate the essence of > the math you have presented via your patch. (it has no nice levels > support yet, to make the math really apparent to everyone interested) It simplifies the math too much, the nice level weighting is an essential part of the math and without it one can't really understand the problem I'm trying to solve. At http://lkml.org/lkml/2007/9/3/168 I provide an example of the math, which more acurately demonstrates the essential math. bye, Roman
From: Ingo Molnar [email blocked] Subject: Re: [ANNOUNCE/RFC] Really Simple Really Fair Scheduler Date: Mon, 3 Sep 2007 20:54:03 +0200 * Roman Zippel [email blocked] wrote: > Hi, > > On Sun, 2 Sep 2007, Ingo Molnar wrote: > > > Roman, as an addendum to my review, please find below a prototype patch > > i've just written that implements RSRFS (Really Simple Really Fair > > Scheduler) ontop of CFS. It is intended to demonstrate the essence of > > the math you have presented via your patch. (it has no nice levels > > support yet, to make the math really apparent to everyone interested) > > It simplifies the math too much, the nice level weighting is an > essential part of the math and without it one can't really understand > the problem I'm trying to solve. At http://lkml.org/lkml/2007/9/3/168 > I provide an example of the math, which more acurately demonstrates > the essential math. as i mentioned it above, in the first level review i'm mainly interested in your patch's behavior wrt. simple nice-0 tasks, how they run and how they sleep and wake up. Nothing else. Why? That's what 99% of the Linux tasks do after all. So could you please confirm whether that aspect of what i wrote (both in words and in code) is correct? If this basic model is correct, we can look further. If this basic model is flawed, no amount of weighting, nice level logic, rounding behavior can fix it. I'm sure you agree with that too. That's why i wrote this prototype, please confirm or deny whether i correctly understood how you intend to handle nice-0 tasks. (From your mails i read a part-acknowledgement of that but i'd like to see a more clear acknowledgement - or please mention any issues if you know about them.) Thanks! Ingo
From: Roman Zippel [email blocked] Subject: Re: [ANNOUNCE/RFC] Really Simple Really Fair Scheduler Date: Mon, 3 Sep 2007 21:13:33 +0200 (CEST) Hi, On Mon, 3 Sep 2007, Ingo Molnar wrote: > If this basic model is correct, we can look further. The basic model is correct insofar I use an absolute time instead of a relative time, but it's not the essence of my math, so I don't quite understand the point of this exercise. bye, Roman
From: Ingo Molnar [email blocked] Subject: Re: [ANNOUNCE/RFC] Really Simple Really Fair Scheduler Date: Mon, 3 Sep 2007 21:20:50 +0200 * Roman Zippel [email blocked] wrote: > On Mon, 3 Sep 2007, Ingo Molnar wrote: > > > If this basic model is correct, we can look further. > > The basic model is correct insofar I use an absolute time instead of a > relative time, but it's not the essence of my math, so I don't quite > understand the point of this exercise. thanks. (and i did not claim nor do i want to claim this to be the essence of your efforts - it is very clear from your mails where your focus is.) My next question then is about this code of yours in the wakeup path: +static void +enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ + kclock_t min_time; + + verify_queue(cfs_rq, cfs_rq->curr != se, se); + min_time = get_time_avg(cfs_rq) - se->req_weight_inv; + if ((kclock_t)(se->time_norm - min_time) < 0) + se->time_norm = min_time; why do you only use the "min_time" if the pre-sleep time_norm is smaller than the min_time? Here 'min_time' is close to the current average. Shouldnt here the woken up task be set to the average time, like i did it in the crude prototype: + se->exec_runtime = avg_exec_runtime(cfs_rq); (and lets again only consider the special case of only having nice-0 tasks.) Or is it set in a similar way as my prototype does, and i missed some detail why that branch is there? Ingo
From: Roman Zippel [email blocked] Subject: Re: [ANNOUNCE/RFC] Really Simple Really Fair Scheduler Date: Mon, 3 Sep 2007 21:55:20 +0200 (CEST) Hi, On Mon, 3 Sep 2007, Ingo Molnar wrote: > My next question then is about this code of yours in the wakeup path: > > +static void > +enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) > +{ > + kclock_t min_time; > + > + verify_queue(cfs_rq, cfs_rq->curr != se, se); > + min_time = get_time_avg(cfs_rq) - se->req_weight_inv; > + if ((kclock_t)(se->time_norm - min_time) < 0) > + se->time_norm = min_time; > > why do you only use the "min_time" if the pre-sleep time_norm is smaller > than the min_time? Here 'min_time' is close to the current average. It's a variation of the sleeper bonus. Let's assume two running tasks which have been running for 95ms and 105ms and a time slice of 10ms, the average is thus 100ms. If the new task has been sleeping for a while it starts at 90ms, if the task had been running lately it doesn't get this bonus again. > Shouldnt here the woken up task be set to the average time, like i did > it in the crude prototype: > > + se->exec_runtime = avg_exec_runtime(cfs_rq); That would be equivalent to simply clearing wait_runtime in CFS. bye, Roman
From: Ingo Molnar [email blocked] Subject: Re: [ANNOUNCE/RFC] Really Simple Really Fair Scheduler Date: Mon, 3 Sep 2007 22:04:05 +0200 * Roman Zippel [email blocked] wrote: > Hi, > > On Mon, 3 Sep 2007, Ingo Molnar wrote: > > > My next question then is about this code of yours in the wakeup path: > > > > +static void > > +enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) > > +{ > > + kclock_t min_time; > > + > > + verify_queue(cfs_rq, cfs_rq->curr != se, se); > > + min_time = get_time_avg(cfs_rq) - se->req_weight_inv; > > + if ((kclock_t)(se->time_norm - min_time) < 0) > > + se->time_norm = min_time; > > > > why do you only use the "min_time" if the pre-sleep time_norm is smaller > > than the min_time? Here 'min_time' is close to the current average. > > It's a variation of the sleeper bonus. [...] hm, where are its effects described in your explanation? Seems like a key item. > [...] Let's assume two running tasks which have been running for 95ms > and 105ms and a time slice of 10ms, the average is thus 100ms. If the > new task has been sleeping for a while it starts at 90ms, if the task > had been running lately it doesn't get this bonus again. what happens if there are lots of such tasks? What limits the total bonus? > > Shouldnt here the woken up task be set to the average time, like i > > did it in the crude prototype: > > > > + se->exec_runtime = avg_exec_runtime(cfs_rq); > > That would be equivalent to simply clearing wait_runtime in CFS. so my prototype patch is not an exact map of the nice-0 special-case of your code? Would this be the correct thing then perhaps: + se->exec_runtime = + max(avg_exec_runtime(cfs_rq), se->exec_runtime); Or if not, could you suggest a code-line at that place? Thanks, Ingo
classic Molnar
Another threat to his scheduler, another re-implementation!
sometimes the best way to
sometimes the best way to understand something is to build it up by hand so that you see how the parts work on their own and as a whole.
i think the mail explains why he did what he did, and to me it makes sense to do it, so as to get a clearer sense of what the code does on the most basic level.
its like stripping down a engine to just one cylinder and the basic support system, so that you dont have to deal with the interaction of multiple cylinders or maybe even a gearbox.
Or maybe
Just maybe, the choice of scheduler should be a kernel config, so that I can choose to use Molnar's, or anybody else's.
Right now, the earlier poster is right. There is One and only One who decides about scheduling, and that's Ingo. You've got a great idea? Wanna share your own scheduler with others? It seems very stable? Wanna get it into mainline kernel?
Forget it.
Since way back, Ingo is the authoer or at least the credited one for scheduling, and he simply won't take a better alternative for an answer. He wants the cred, and full control. So much for the "Linux community". I'd say, it wouldn't hurt to have at least a few choices here. CFS, SD and Zippel's scheduler.
man patch? If you can
man patch?
If you can compile a kernel, you can apply a patch - so what's your point again?
Or just maybe you should
Or just maybe you should actually read how Ingo responded to Roman's scheduler, before starting yet another flamewar against Ingo? Here you go: lkml.org/lkml/2007/8/31/97
The Really Fair Scheduler is not rejected; Ingo is claiming that the proposed enhancements were already implemented in CFS, and wrote this scheduler to convince Roman or to get his input, but it's not set in stone yet. If this is true then there simply is no good reason to throw away CFS -- it already has process containers integration pending which would otherwise need a rewrite, and possibly other features.
I cannot see where you came up with the claim that Roman's scheduler is necessarily better than his. And naturally you have to start a flamewar *before* they even get a chance to sort that out.
And I should remind you that Ingo was by far not the only person who ended up preferring CFS over SD.
I can rule someone down,
I can rule someone down, without actually badmouthing him. It's possible too with a determined saying, which leaves no room for other things.
More than one CPU scheduler
More than one CPU scheduler causes more problems than it solves. CFS is fine, you just seem to have a personal issue with Ingo. Take a lesson from the I/O Scheduler developers who have already said they wish they didn't have 3 different schedulers either.
If nothing else...
this can only lead to better written and documented code.
At least when the discussion is kept at a well-civilized manner, which seems to be the case right now.
Scheduler of XXX-Scheduler? User chooses sched's interaction.
Imagine one case of long timed FastCGI task or long uptimed inetd/swapd/X daemon.
The Roman's scheduler model for this above case fails precipitatously.
Sometimes, it's better to use the Ingo's scheduler instead of Roman's scheduler.
Sometimes, it's better to use the ConKolivas's scheduler instead of Ingo's scheduler.
Select your favorite CPU-priority-Scheduler with/without IO-Scheduler from this menu (historically ordered):
a) Con Kolivas (SD, Staircase Deadline scheduler).
b) Ingo Molnar (CFS, Completely Fair Scheduler).
c) Roman Zippel (RFS, Really Fair Scheduler).
d) Peter Zijlstra (RFS with his improved patches).
z) Round Robin Hood for stupids (extremely simple circular scheduler without priority).
The selection can picked in compilation-time (#ifdef .. #endif in the kernel) or in run-time (/proc/cpusched... or cpuschedctl or ioctl).
What "inversion of priority" matters in the CPU-priority-scheduler's IO-bottlenecks about?
Scheduler of XXX-Scheduler? User chooses sched's interaction.
"CPU-priority-Scheduler with IO-Scheduler"
I've a better idea for most general proposites:
formula (scores of task) = formula_global ( formula_cpu (scores of CPU's activity of task), formula_io (scores of I/O's activity of task), formula_user (scores assigned statically or dinamically by the user for this task) ).There are many formulas that the user can pick the most convenient formula_global, formula_cpu, formula_io and formula_user.
Not quite
Ingo's asked Roman to break his stuff up into small pieces that can be understood individually. Lacking that, Ingo is trying to do that himself, and then asking Roman at each step "If this is wrong, please tell me why it's wrong."
That seems like it ought to be a rather thorough way to ensure you've understood something.
This is a place where good communication skills come in, especially since there seems to be some difficulty communicating. One skill I've been taught a couple different times that helps is to say back to someone what you've understood them to say. (Not just repeat their words, but rather rephrase it in your own words to reflect your understanding--complete or incomplete.) This helps catch places where people talk past each other, and really forces both sides to listen more closely to each other.
It appears to me that this is what Ingo's doing with Roman. Paraphrasing... Ingo: "I am trying to understand your code, but the big monolithic thing is too much. Here's what I think the core does." Roman: "No, that's wrong for X, Y and Z. What I really meant was..." and so forth.
Ingo seems to be the trusted lieutenant in this space, and so his word does carry a lot of weight. But it looks to me he's being diligent in processing other people's criticisms and inputs. If he really didn't care about Roman's scheduler work, then why would Ingo bother with RSRFS and bother playing 20 questions with Roman?
--
Program Intellivision and play Space Patrol!
Linux is a mere joke.
Linux is a mere joke.
Yup, and that's why it runs
Yup, and that's why it runs on 10.000 CPU monsters, come again?
Yeah, with a very, very
Yeah, with a very, very customized version of Linux kernel. Together with some experts who can intercept every minute, if necessary.
Don't you have some better joke to amuse the public, instead of this childish propaganda?
Childish propaganda? Keep in
Childish propaganda? Keep in mind that a lot (not all!) of the features added to these "custom kernels" eventually end up being added to the mainstream kernel sooner or later. SGI and IBM are only a few examples of adding stuff into the mainstream kernel for big systems. Especially SGI has recently contributed a lot of code to the kernel for big systems, their latest addition being the SLUB allocator which deals with NUMA systems better than SLAB. If Linux was such a joke like you think, I wonder why more than half of the Top 500 companies run it on their (big) machines. There has to be something about Linux that makes it a real joke, if you get my point...
As for the CFS vs RSRFS schedulers, what both Ingo and Roman are trying to do here is discuss different scheduler implementations, tweak algos, patch, rewrite code, etc so in the end they can deliver the best possible implementation of the scheduler in mind. I think discussions and idea sharing in this case is a good thing.
There's a saying that fits here
"Those who like sausage should not watch it being made."
I'm an engineer for one of those top 500 companies, and I get to watch all manner of engineering debates and so on. What's happening with Ingo and Roman looks productive to me, and really is driving to the technical heart of the question.
If this were a more Dilbertian environment, you'd have the two engineers get their managers involved to "align" the two groups, and the group with more clout, resources, or revenue probably bends the result more. That is, it's no longer a technical argument as much as a political turf war.
I've watched that happen, and I don't think I'm seeing that here.
--
Program Intellivision and play Space Patrol!
It's good script for you.
This script is good for benchmarking of its responsiveness:
$ time { for f in $(seq 1 200); do \ ( xterm -e "dd if=/dev/zero of=/dev/null bs=4k count=1k" & ) \ done ; }It has given 23.72 seconds in my old A64 3200+ 2.0 GHz monocore linux-2.6.20.6
strange
real 0m1.262s
AMD Athlon(TM) XP 2600+
2.6.23-rc4
After the script has finished (5 secs later) there are a few xterms flashing up.
I'd just like to note that
I'd just like to note that the results of this script are definitely not comparable across different machines: it is more sensitive to your graphics drivers, X11 version/configuration, xterm and font configuration, than it is to your scheduler choice.
Also, there's a little flaw: the script terminates before all the child processes have actually exited; the parentheses should be removed and there should be a 'wait $?' after 'done;'