Linux: Comparing CFS to the EEVDF CPU Scheduler

Submitted by Jeremy
on May 3, 2007 - 2:29pm

Ingo Molnar [interview] announced the eighth version of his CFS CPU scheduler [story], "-v7 resolved a couple of important regressions while not introducing new regressions, so i felt it was time to step forward: -v8 tries to address one of the last (known) frontiers: 3D/OpenGL games^H^H^H applications 'smoothness'." Regarding the purpose of his scheduler work he explained, "the main goal of CFS is to implement 'desktop scheduling' with as high quality as technically possible."

Ting Yang, a University of Massachusetts Amherst graduate student, replied in a lengthy and insightful email comparing CFS to the Earliest Eligible Virtual Deadline First (EEVDF) scheduler proposed by Ion Stoica's paper, "A Proportional Share Resource Allocation Algorithm for Real-Time, Time-Shared Systems". Ting went on to explain:

"EEVDF uses exactly the same method as CFS to track the execution of each running task. The only difference between EEVDF and CFS is that EEVDF tries to be _deadline_ fair while CFS is _start-time_ fair. Scheduling based on deadline gives better reponse time bound and seems to more fair. In the following part of this email, I will try to explain the similarities and differences between EEVDF and CFS. Hopefully, this might provide you with some useful information w.r.t your current work on CFS."


From: Ingo Molnar [email blocked]
To:  linux-kernel
Subject: [patch] CFS scheduler, -v8
Date:	Tue, 1 May 2007 23:22:23 +0200


i'm pleased to announce release -v8 of the CFS scheduler patchset. (The 
main goal of CFS is to implement "desktop scheduling" with as high 
quality as technically possible.)

The CFS patch against v2.6.21.1 (or against v2.6.20.10) can be 
downloaded from the usual place:

    http://people.redhat.com/mingo/cfs-scheduler/

-v7 resolved a couple of important regresisons while not introducing new 
regressions, so i felt it was time to step forward: -v8 tries to address 
one of the last (known) frontiers: 3D/OpenGL games^H^H^H applications 
'smoothness'.

To achieve more scheduling smoothness, -v8 introduces a new 'precise 
load calculation and smoothing' feature. (A variant of this was 
suggested by Peter Williams in earlier CFS discussions - thanks Peter!)

i was able to reuse the rq->cpu_load[] load average calculation from 
Peter Williams's smpnice code, and it's thus now utilized on UP too. As 
a nice side-effect of CFS using a smoothed load metric, apps should also 
start up faster under load. CFS now utilizes the full range of smpnice 
metrics.

No other fundamental portion of CFS was touched, so the rate of change 
is moderate:

   7 files changed, 140 insertions(+), 79 deletions(-)

Changes since -v7:

 - powerpc debug output and build warning fixes (Balbir Singh)

 - documentation fixes (Zach Carter)

 - interactivity: precise load calculation and load smoothing

As usual, any sort of feedback, bugreport, fix and suggestion is more 
than welcome,

	Ingo


From: Ting Yang [email blocked] Subject: Re: [patch] CFS scheduler, -v8 Date: Tue, 01 May 2007 22:57:14 -0400 Hi, Ingo My name is Ting Yang, a graduate student from UMASS. I am currently studying the linux scheduler and virtual memory manager to solve some page swapping problems. I am very excited with the new scheduler CFS. After I read through your code, I think that you might be interested in reading this paper: "A Proportional Share REsource Allocation Algorithm for Real-Time, Time-Shared Systems", by Ion Stoica. You can find the paper here: http://citeseer.ist.psu.edu/37752.html Authors of this paper proposed a scheduler: Earlist Eligible Virtual Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to track the execution of each running task. The only difference between EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is _start-time_ fair. Scheduling based on deadline gives better reponse time bound and seems to more fair. In the following part of this email, I will try to explain the similarities and differences between EEVDF and CFS. Hopefully, this might provide you with some useful information w.r.t your current work on CFS. Similarities: 1. both EEVDF and CFS use virtual time to track the system's progress. CFS maintains rq->fair_clock for each cpu, which is updated every tick by: SCALE/sum(p_i->loadweight) where p_i->loadweight is the weight of each task mapped from its nice value in prio_to_load_shift[], given that the default weight is SCALE (1024) EEVDF maintains a virtual time, which is advanced every tick by: 1/sum(w_i) where w_i is the weight of each task, given that the default weight is 1. Therefore, EEVDF and CFS monitors system progress in the same way, except normalized to different scale. 2. EEVDF and CFS monitors the progress in the same way. CFS maintains a p->fair_key which indicating the amount of work done by this task. When p executes for a period t, then p->fair_key incremented by: t * SCALE/p->loadweight //the default weight is SCALE (based on solving equations in your code :-)) EEVDF does the same thing with assumption that the default weight is 1, it uses the same equation to calculate deadlines for running tasks. Differences: The main difference between CFS and EEVDF lies in the scheduling policy, although they follow the same model for tracking tasks. *** CFS: When a task starts, it gets p->fair_key from the current virtual time rq->fair_clock. It will not be scheduled for execution until all other running tasks' fair_key go beyond p->fair_key by certain virtual ticks (which is 5 ticks for the current setting of CFS). (I wish I could show examples with graphs, instead of my not-so-good english, but I am not sure if it appropriate to attach figures on this maillist) EXAMPLE: assume the system runs at 1000 tick/second, i.e. 1ms a tick, and the granularity of pre-exemption for CFS is 5 virtual ticks (the current setting). If, at time t=0, we start 2 tasks: p1 and p2, both have nice value 0 (weight 1024), and rq->fair_clock is initialized to 0. Now we have: p1->fair_key = p2->fair_key = rq->fair_clock = 0. CFS breaks the tie arbitrarily, say it executes p1. After 1 system tick (1ms later) t=1, we have: rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0. Suppose, a new task p3 starts with nice value -10 at this moment, that is p3->fair_key=1/2. In this case, CFS will not schedule p3 for execution until the fair_keys of p1 and p2 go beyond 5+1/2 (which translates to about 10ms later in this setting), _regardless_ the priority (weight) of p3. Further imagine, if we starts n tasks at time t=0 and then start a task p_(n+1) at time t = 1, the delay of task p_(n+1) actually is proportional to the number of running tasks n. Formally speaking, CFS can has a *O(n)* lag w.r.t the ideal proportional shared fluid-flow system, which can be arbitrarily fine granularity. The cause of this problem is that p->fair_key only reflects a fair point that a task should be started, but does not has any information about how urgent the task is (i.e. the priority or weight of the task). *** In EEVDF, each task p_i is specified by 2 parameters: weight w_i and timeslice length l_i. EEVDF tries to schedule tasks based on the virtual deadline vd_i where a timeslice l_i should be finished. EEVDF keeps a virtual start (ve_i) and virtual deadline (vd_i) for each tasks. When a tasks starts, its ve_i is initialized to be the current virtual time, and calculates its virtual deadline as: vd_i = ve_i + l_i/w_i //the same method as CFS updates fair_key. When l_i amount of work is done, the just finished vd_i becomes the new ve_i. That is the virtual start time of next l_i amount work is the deadline of previous finished timeslice. The virtual deadline vd_i is then updated using the above equation. EEVDF schedule policy: always executes the task with the _earliest_ virtual deadline. EXAMPLE: Assume the system has 1000 ticks per second. At time t = 0, we start 2 tasks: p1 and p2, such that w_1 = w_2 = 1 and l_1 = l_2 = 5 ticks, i.e 5ms (which reflects the similar setting in CFS case). Furthermore, the system virtual time vt is initialized to be 0. Now at time t = 0, we have vt = 0, ve_1 = 0, vd_1 = ve_1 + l_1/w_1 = 5 ve_2 = 0, vd_2 = vd_2 + l_2/w_2 = 5 As p1 and p2 have the same virtual deadline, EEVDF breaks the tie arbitrarily, say it executes p1. After 1 system tick (1ms later), we have: vt = 0 + 1 / (w_1 + w_2) = 1/2 //just as CFS updates rq->fair_clock ve_1 = 0, vd_1 = 5 //not changed ve_2 = 0, vd_1 = 5 //not changed Suppose, we starts a new task p2 at this moment, with weight w_3 = 2 and timeslice l_3 = 5 ticks (5ms), Then ve_3 = vt = 1/2 vd_3 = ve_3 + l_3/w_2 = 1/2 + 5/2 = 3 hmmm, the scheduler now will execute task p3 first since it has the earliest deadline. The deadline implicitly contains some very important information that the CFS fair_key does not have: how urgent this amount of work has to be done, which comes from the weight of a task. Formally speaking, EEVDF has a constant O(1) lag w.r.t the ideal proportional shared fluid-flow system. (Please look at the paper for detail proofs.) Under normal cases, for a task p_i, EEVDF ensures that it does not fall behind the ideal system by more than l_i time. Occasionally, the delay can be max(l_i), the maximum timeslice used by all active tasks, due to the dynamic entering and leaving of tasks. (Again, the paper give more detailed explanation on this). We can see that here the timeslice l_i used by a task p_i actually controls accuracy of scheduling p_i. The smaller l_i, the closer to the ideal system during p_i's execution. For example, in above case, if p3 has w_3 = 1 (the same as p1 and p2) and l_3 = 3 (3ms). Since vd_3 = 1/2 + l_3/w_3 = 7/2, the scheduler still executes p3 first, even though p1,p2 and p3 have the same weight. Smaller l_i leads the scheduler to handle p_i in finer granularity. Intuitively, it makes scheduler checks the deadline of p_i more frequently and ensures each small piece of work is done on time, while larger l_i does the reverse. The decouple of weight w_i and timeslice l_i is important. Generally speaking, weight determines throughput and timeslice determines the responsiveness of a task. In normal situation, high priority tasks usually need more cpu capacity within short period of time (bursty, such as keyboard, mouse move, X updates, daemons, etc), and need to be processed as quick as possible (responsiveness and interactiveness). Follow the analysis above, we know that for higher priority tasks we should give _higher weight_ to ensure its CPU throughput, and at the same time give _smaller timeslice_ to ensure better responsiveness. This is a bit counter-intuitive against the current linux implementation: smaller nice value leads to higher weight and larger timeslice. Now let's see what happens in the Real-Time world. Usually a real-time application is specified with (C_i, T_i), where C_i is the amount of work and T_i is the period of time that the work has to be finished. For example, (20ms, 50ms) says the scheduler has to do 20ms work within a window of 50ms for this task. Smaller T_i gives tighter response constraints. Note that Bi = Ci/Ti is really the CPU proportion for the task during its execution. In our proportional time share world, weight w_i decides the cpu proportion which maps to Bi, and timeslice l_i gives the amount works should be done each time, which maps Ci. Then using w_i and l_i, we can get a window size, which actually maps Ti. Therefore smaller l_i leads to smaller Ti which means tighter response constraints. It seems to me all these make a lot sense in both proportional time share and real-time world. Based on my understanding, adopting something like EEVDF in CFS should not be very difficult given their similarities, although I do not have any idea on how this impacts the load balancing for SMP. Does this worth a try? Sorry for such a long email :-) Ting
From: Willy Tarreau [email blocked] Subject: Re: [patch] CFS scheduler, -v8 Date: Wed, 2 May 2007 07:10:49 +0200 Hi Ting, On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote: > > Hi, Ingo > > My name is Ting Yang, a graduate student from UMASS. I am currently > studying the linux scheduler and virtual memory manager to solve some > page swapping problems. I am very excited with the new scheduler CFS. > After I read through your code, I think that you might be interested in > reading this paper: > > "A Proportional Share REsource Allocation Algorithm for Real-Time, > Time-Shared Systems", by Ion Stoica. You can find the paper here: > http://citeseer.ist.psu.edu/37752.html > > Authors of this paper proposed a scheduler: Earlist Eligible Virtual > Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to > track the execution of each running task. The only difference between > EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is > _start-time_ fair. Scheduling based on deadline gives better reponse > time bound and seems to more fair. > > In the following part of this email, I will try to explain the > similarities and differences between EEVDF and CFS. Hopefully, this > might provide you with some useful information w.r.t your current work > on CFS. (...) Thanks very much for this very clear explanation. Now I realize that some of the principles I've had in mind for a long time already exist and are documented ! That's what I called sorting by job completion time in the past, which might not have been clear for everyone. Now you have put words on all those concepts, it's more clear ;-) > The decouple of weight w_i and timeslice l_i is important. Generally > speaking, weight determines throughput and timeslice determines the > responsiveness of a task. I 100% agree. That's the problem we have with nice today. Some people want to use nice to assign more CPU to tasks (as has always been for years) and others want to use nice to get better interactivity (meaning nice as when you're in a queue and leaving the old woman go before you). IMHO, the two concepts are opposed. Either you're a CPU hog OR you get quick responsiveness. > In normal situation, high priority tasks > usually need more cpu capacity within short period of time (bursty, such > as keyboard, mouse move, X updates, daemons, etc), and need to be > processed as quick as possible (responsiveness and interactiveness). > Follow the analysis above, we know that for higher priority tasks we > should give _higher weight_ to ensure its CPU throughput, and at the > same time give _smaller timeslice_ to ensure better responsiveness. > This is a bit counter-intuitive against the current linux > implementation: smaller nice value leads to higher weight and larger > timeslice. We have an additional problem in Linux, and not the least : it already exists and is deployed everywhere, so we cannot break existing setups. More specifically, we don't want to play with nice values of processes such as X. That's why I think that monitoring the amount of the time-slice (l_i) consumed by the task is important. I proposed to conserve the unused part of l_i as a credit (and conversely the credit can be negative if the time-slice has been over-used). This credit would serve two purposes : - reassign the unused part of l_i on next time-slices to get the most fair share of CPU between tasks - use it as an interactivity key to sort the tasks. Basically, if we note u_i the unused CPU cycles, you can sort based on (d_i - u_i) instead of just d_i, and the less hungry tasks will reach the CPU faster than others. (...) > Based on my understanding, adopting something like EEVDF in CFS should > not be very difficult given their similarities, although I do not have > any idea on how this impacts the load balancing for SMP. Does this worth > a try? I think that if you have time to spend on this, everyone would like to see the difference. All the works on the scheduler are more or less experimental and several people are exchanging ideas right now, so it should be the right moment. You seem to understand very well both approaches and it's likely that it will not take you too much time :-) > Sorry for such a long email :-) It was worth it, thanks ! Willy
From: Ingo Molnar [email blocked] Subject: Re: [patch] CFS scheduler, -v8 Date: Wed, 2 May 2007 12:27:12 +0200 * Ting Yang [email blocked] wrote: > My name is Ting Yang, a graduate student from UMASS. I am currently > studying the linux scheduler and virtual memory manager to solve some > page swapping problems. I am very excited with the new scheduler CFS. > After I read through your code, I think that you might be interested > in reading this paper: thanks for your detailed analysis - it was very interesting! > Based on my understanding, adopting something like EEVDF in CFS > should not be very difficult given their similarities, although I do > not have any idea on how this impacts the load balancing for SMP. Does > this worth a try? It would definitely be interesting to try! I dont think it should negatively impact load balancing on SMP. The current fork-time behavior of CFS is really just a first-approximation thing, and what you propose seems to make more sense to me too because it preserves the fluidity of fairness. (I'd probably apply your patch even if there was no directly measurable impact on workloads, because the more natural approaches tend to be more maintainable in the long run.) So by all means, please feel free to do a patch for this. > Sorry for such a long email :-) it made alot of sense and was very useful :-) Ingo
From: Ingo Molnar [email blocked] Subject: Re: [patch] CFS scheduler, -v8 Date: Wed, 2 May 2007 08:45:10 +0200 * Mike Galbraith [email blocked] wrote: > > As usual, any sort of feedback, bugreport, fix and suggestion is > > more than welcome, > > Greetings, > > I noticed a (harmless) bounds warning triggered by the reduction in > size of array->bitmap. Patchlet below. thanks, applied! Your patch should also speed up task selection of RT tasks a bit. (the patch removes ~40 bytes of code). And on 64-bit we now fit into 2x 64-bit bitmap and thus are now down to two Find-First-Set instructions. Nice :) Ingo
From: Ingo Molnar [email blocked] Subject: Re: [patch] CFS scheduler, -v8 Date: Thu, 3 May 2007 15:02:02 +0200 * Zoltan Boszormenyi [email blocked] wrote: > I started up 12 glxgears to see the effect of CFS v8 on my Athlon64 X2 > 4200. > > Without this patch all the GL load was handled by the second core, the > system only stressed the first core if other processes were also > started, i.e. a kernel compilation. With this patch the load is evenly > balanced across the two cores all the time. ok, i didnt realize that it would affect x86_64 too. I'll do a -v9 release with this fix included. > [...] And while doing make -j4 on the kernel, the 12 gears are still > spinning about 185+ FPS and there are only slightly visible hiccups. > Switching between workspaces, i.e. refreshing the large windows of > Thunderbird and Firefox are done very quickly, the whole system is > exceptionally responsive. great! So it seems -v8 does improve OpenGL handling too :-) > Thanks for this great work! you are welcome :) Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [email blocked] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Ingo Molnar [email blocked] Subject: [patch] CFS scheduler, -v9 Date: Thu, 3 May 2007 16:09:11 +0200 i'm pleased to announce release -v9 of the CFS scheduler patchset. (The main goal of CFS is to implement "desktop scheduling" with as high quality as technically possible.) The CFS patch against v2.6.21.1 (or against v2.6.20.10) can be downloaded from the usual place: http://people.redhat.com/mingo/cfs-scheduler/ -v9 is a small fixes-only release: 6 files changed, 34 insertions(+), 29 deletions(-) it includes small fixlets, the most notable of which is the 64-bit SMP balancing fix from Balbir Singh. Things are clearly calming down and there's been no report of interactivity regression against -v8 so far, which is certainly promising. So please try to break -v9 :-) Changes since -v8: - SMP balancing fix for powerpc/64-bit. (Balbir Singh) - build warning fix and microoptimization: sched_find_first_bit() only needs to cover 100 priority levels. (Mike Galbraith) - fix long initial delay in tasks reniced in a positive direction. (reported by Al Boldi) - fix missing syscall_table.S entry for sys_sched_yield_to. (noticed by Vegard Nossum) - small tweaks to the /proc/sched_debug output As usual, any sort of feedback, bugreport, fix and suggestion is more than welcome, Ingo

Related Links: