Linux: CFS Group Level Fairness

Submitted by Jeremy
on May 24, 2007 - 8:02am

Following a review of Ingo Molnar's Completely Fair Scheduler, Srivatsa Vaddagiri posted a patch allowing the new scheduler to provide fairness at a per-group level rather than at a per-process level. He described the changes that he made and noted, "I have used 'uid' as the basis of grouping for timebeing (since that grouping concept is already in mainline today). The patch can be adapted to a more generic process grouping mechanism later."

Ingo reacted favorably to the patch: "yeah, i like this alot." He added, "the 'struct sched_entity' abstraction looks very clean, and that's the main thing that matters: it allows for a design that will only cost us performance if group scheduling is desired." He then asked, "if you could do a -v14 port and at least add minimal SMP support: i.e. it shouldnt crash on SMP, but otherwise no extra load-balancing logic is needed for the first cut - then i could try to pick all these core changes up for -v15. (I'll let you know about any other thoughts/details when i do the integration.)"
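
The two-step pick described in the patch summary (choose a group, then a task inside that group) can be illustrated with a small user-space sketch. This is not the kernel code: the `Entity` class, the `runtime` fairness key and the linear `min()` search are assumptions made for illustration, whereas the patch itself keeps entities in per-group, per-cpu rb-trees.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """Hypothetical schedulable entity: a task, or a group (one uid) of tasks."""
    name: str
    runtime: float = 0.0                          # CPU time consumed so far
    children: list = field(default_factory=list)  # empty for a plain task

def pick_next(groups):
    """Two-step pick: least-served group first, then least-served task in it."""
    group = min(groups, key=lambda g: g.runtime)
    task = min(group.children, key=lambda t: t.runtime)
    return group, task

def run(groups, ticks, slice_len=1.0):
    """Run `ticks` time slices, charging each slice to the task and its group."""
    for _ in range(ticks):
        group, task = pick_next(groups)
        task.runtime += slice_len
        group.runtime += slice_len

# Mirror the test setup: one user with 4 runnable tasks, one with 20.
vatsa = Entity("vatsa", children=[Entity(f"vatsa-{i}") for i in range(4)])
guest = Entity("guest", children=[Entity(f"guest-{i}") for i in range(20)])
run([vatsa, guest], ticks=240)
print(vatsa.runtime, guest.runtime)
```

In this toy model both users end up with the same total CPU time regardless of how many tasks each one runs, and that time is then split evenly among each user's own tasks.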


From: Srivatsa Vaddagiri [email blocked]
To: Ingo Molnar [email blocked]
Subject: [RFC] [PATCH 0/3] Add group fairness to CFS
Date:	Wed, 23 May 2007 22:18:59 +0530

Here's an attempt to extend CFS (v13) to be fair at a group level, rather than
just at task level. The patch is in a very premature state (passes
simple tests, smp load balance not supported yet) at this point. I am sending 
it out early to know if this is a good direction to proceed.

Salient points which needs discussion:

1. This patch reuses CFS core to achieve fairness at group level also.

   To make this possible, CFS core has been abstracted to deal with generic 
   schedulable "entities" (tasks, users etc).

2. The per-cpu rb-tree has been split to be per-group per-cpu.

   schedule() now becomes two step on every cpu : pick a group first (from
   group rb-tree) and a task within that group next (from that group's task
   rb-tree)

3. Grouping mechanism - I have used 'uid' as the basis of grouping for
   timebeing (since that grouping concept is already in mainline today).
   The patch can be adapted to a more generic process grouping mechanism
   (like http://lkml.org/lkml/2007/4/27/146) later.

Some results below, obtained on a 4way (with HT) Intel Xeon box. All 
number are reflective of single CPU performance (tests were forced to 
run on single cpu since load balance is not yet supported).


                                  uid "vatsa"              uid "guest"
                                  (make -s -j4 bzImage)    (make -s -j20 bzImage)

2.6.22-rc1                        772.02 sec               497.42 sec (real)
2.6.22-rc1+cfs-v13                780.62 sec               478.35 sec (real)
2.6.22-rc1+cfs-v13+this patch     776.36 sec               776.68 sec (real)

[ An exclusive cpuset containing only one CPU was created and the
compilation jobs of both users were run simultaneously in this cpuset ]

I also disabled CONFIG_FAIR_USER_SCHED and compared the results with
cfs-v13:

                                  uid "vatsa"
                                  (make -s -j4 bzImage)

2.6.22-rc1+cfs-v13                395.57 sec (real)
2.6.22-rc1+cfs-v13+this_patch     388.54 sec (real)

There is no regression I can see (rather some improvement, which I can't
understand atm). I will run more tests later to check this regression aspect.

Request your comments on the future direction to proceed!


-- 
Regards,
vatsa


From: Ingo Molnar [email blocked]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS
Date:	Wed, 23 May 2007 20:32:52 +0200

* Srivatsa Vaddagiri [email blocked] wrote:

> Here's an attempt to extend CFS (v13) to be fair at a group level,
> rather than just at task level. The patch is in a very premature state
> (passes simple tests, smp load balance not supported yet) at this
> point. I am sending it out early to know if this is a good direction
> to proceed.

cool patch! :-)

> Salient points which needs discussion:
>
> 1. This patch reuses CFS core to achieve fairness at group level also.
>
>    To make this possible, CFS core has been abstracted to deal with
>    generic schedulable "entities" (tasks, users etc).

yeah, i like this alot. The "struct sched_entity" abstraction looks very
clean, and that's the main thing that matters: it allows for a design
that will only cost us performance if group scheduling is desired.

If you could do a -v14 port and at least add minimal SMP support: i.e.
it shouldnt crash on SMP, but otherwise no extra load-balancing logic is
needed for the first cut - then i could try to pick all these core
changes up for -v15. (I'll let you know about any other thoughts/details
when i do the integration.)

> 2. The per-cpu rb-tree has been split to be per-group per-cpu.
>
>    schedule() now becomes two step on every cpu : pick a group first
>    (from group rb-tree) and a task within that group next (from that
>    group's task rb-tree)

yeah. It might even become more steps if someone wants to have a
different, deeper hierarchy (at the price of performance). Containers
will for example certainly want to use one more level.

> 3. Grouping mechanism - I have used 'uid' as the basis of grouping for
>    timebeing (since that grouping concept is already in mainline
>    today). The patch can be adapted to a more generic process grouping
>    mechanism (like http://lkml.org/lkml/2007/4/27/146) later.

yeah, agreed.

> Some results below, obtained on a 4way (with HT) Intel Xeon box. All
> number are reflective of single CPU performance (tests were forced to
> run on single cpu since load balance is not yet supported).
>
>                                   uid "vatsa"              uid "guest"
>                                   (make -s -j4 bzImage)    (make -s -j20 bzImage)
>
> 2.6.22-rc1                        772.02 sec               497.42 sec (real)
> 2.6.22-rc1+cfs-v13                780.62 sec               478.35 sec (real)
> 2.6.22-rc1+cfs-v13+this patch     776.36 sec               776.68 sec (real)
>
> [ An exclusive cpuset containing only one CPU was created and the
> compilation jobs of both users were run simultaneously in this cpuset
> ]

looks really promising!

> I also disabled CONFIG_FAIR_USER_SCHED and compared the results with
> cfs-v13:
>
>                                   uid "vatsa"
>                                   (make -s -j4 bzImage)
>
> 2.6.22-rc1+cfs-v13                395.57 sec (real)
> 2.6.22-rc1+cfs-v13+this_patch     388.54 sec (real)
>
> There is no regression I can see (rather some improvement, which I
> can't understand atm). I will run more tests later to check this
> regression aspect.

kernel builds dont really push scheduling micro-costs, rather try
something like 'hackbench.c' to measure that. (kernel builds are of
course one of our primary benchmarks.)

> Request your comments on the future direction to proceed!

full steam ahead please! =B-)

	Ingo

Improvement?

Anonymous (not verified)
on May 25, 2007 - 8:08pm


> 2.6.22-rc1+cfs-v13 780.62 sec 478.35 sec (real)
> 2.6.22-rc1+cfs-v13+this patch 776.36 sec 776.68 sec (real)

Total is 1258.97 sec without this patch and 1553.04 sec with.
How is a 23% slowdown an improvement?

Improvement.

on May 25, 2007 - 11:21pm

I believe he started both jobs at the same time, so the sum of their real times is probably not a useful metric.

Without the patch, the 20-process job finishes significantly sooner than the 4-process job. (Presumably the 4-process job got the processor around 1/5th as much, then sped up after the 20-process job completed.) With the patch, they finished at roughly the same time, which is more fair. In both cases, the same amount of work was finished at around 780 seconds, so there's no slowdown.

If he were to run the 20-process job in a loop, I'd expect you'd see the 4-process job take significantly longer to complete without this patch. One user who spawns a huge number of processes could basically monopolize the system.
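
As a rough illustration of the argument above, the sketch below computes each user's expected CPU share under per-task and per-group fairness. It assumes every task is always runnable and equally weighted, which a real `make -jN` build only approximates; the function names are hypothetical.

```python
from fractions import Fraction

def per_task_shares(task_counts):
    """Per-task fairness: a user's share grows with its number of tasks."""
    total = sum(task_counts.values())
    return {user: Fraction(n, total) for user, n in task_counts.items()}

def per_group_shares(task_counts):
    """Per-group (per-uid) fairness: each user gets an equal share."""
    return {user: Fraction(1, len(task_counts)) for user in task_counts}

jobs = {"vatsa": 4, "guest": 20}   # -j4 versus -j20, as in the posted test
task_fair = per_task_shares(jobs)
group_fair = per_group_shares(jobs)
print(task_fair)
print(group_fair)
```

Under these assumptions the 4-task user gets 1/6 of the CPU with per-task fairness (one fifth of what the 20-task user gets) and 1/2 with per-group fairness.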

Also remember that these

Anonymous (not verified)
on June 3, 2007 - 1:52pm

Also remember that these jobs are run in parallel: the total real time is the maximum of any one job, not the sum of their real times, which means there's actually a 4 second improvement in wall-clock performance with the introduction of the patch.
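
The difference between the two metrics can be checked directly from the posted timings. This small sketch only restates the numbers from the results table; the helper names are made up for illustration.

```python
# Real times in seconds, copied from the results table in the patch posting.
results = {
    "cfs-v13": {"vatsa": 780.62, "guest": 478.35},
    "cfs-v13+patch": {"vatsa": 776.36, "guest": 776.68},
}

def wall_clock(times):
    """Jobs ran concurrently, so elapsed time is the slowest job's real time."""
    return max(times.values())

def summed(times):
    """Sum of real times: counts overlapping time twice for concurrent jobs."""
    return sum(times.values())

for kernel, times in results.items():
    print(f"{kernel}: wall clock {wall_clock(times):.2f} s, sum {summed(times):.2f} s")

saved = wall_clock(results["cfs-v13"]) - wall_clock(results["cfs-v13+patch"])
print(f"wall-clock difference: {saved:.2f} s")
```

Measured this way, elapsed time drops from 780.62 to 776.68 seconds with the patch, even though the sum of the two real times goes up.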
