"After posting some benchmarks involving cfs, I got some feedback, so I decided to do a follow-up that'll hopefully fill in the gaps many people wanted to see filled," Rob Hussey began. He added, "this time around I've done the benchmarks against 2.6.21, 2.6.22-ck1, and 2.6.23-rc6-cfs-devel (latest git as of 12 hours ago)." Rob briefly summarized, "the only analysis I'll offer is that both sd and cfs are improvements, and I'm glad that there is a lot of work being done in this area of linux development. Much respect to Con Kolivas, Ingo Molnar, and Roman Zippel, as well all the others who have contributed."
Referring to a chart in which the blue line represented the CFS process scheduler and the green line represented the SD "staircase" process scheduler, Ingo Molnar noted, "heh - am i the only one impressed by the consistency of the blue line in this graph? :-) [ and the green line looks a bit like a .. staircase? ]" He acknowledged that CFS was slightly slower than SD in the single-core pipe-test benchmark: "-ck1 is 0.8% faster in this particular test." Ingo then explained, "many things happened between 2.6.22-ck1 and 2.6.23-cfs-devel that could affect performance of this test. My initial guess would be sched_clock() overhead." In further testing with hackbench, he applied a low-res-sched-clock patch together with a disable-startup-penalty patch, which improved CFS performance and led him to conclude, "the performance difference between -ck and -cfs-devel seems to be mostly down to the more precise (but slower) sched_clock() introduced in v2.6.23 and to the startup penalty of freshly created tasks." When asked if the low-res-sched-clock patch was likely to be merged, Ingo replied:
"I don't think so - we want precise/accurate scheduling before performance. (otherwise tasks working off the timer tick could steal away cycles without being accounted for them fairly, and could starve out all other tasks.) Unless the difference was really huge in real life - but it isn't."
[Graphs: Hackbench, Lat_ctx, Pipe test]
From: Rob Hussey [email blocked]
Subject: Scheduler benchmarks - a follow-up
Date: Mon, 17 Sep 2007 05:21:42 -0400

Hi all,

After posting some benchmarks involving cfs
(http://lkml.org/lkml/2007/9/13/385), I got some feedback, so I
decided to do a follow-up that'll hopefully fill in the gaps many
people wanted to see filled.

This time around I've done the benchmarks against 2.6.21, 2.6.22-ck1,
and 2.6.23-rc6-cfs-devel (latest git as of 12 hours ago). All three
.configs are attached. The benchmarks consist of lat_ctx and
hackbench, both with a growing number of processes, as well as
pipe-test. All benchmarks were also run bound to a single core.

Since this time there are hundreds of lines of data, I'll post a
reasonable amount here and attach the data files. There are graphs
again this time, which I'll post links to as well as attach.

I'll start with some selected numbers, which are preceded by the
command used for the benchmark.

for((i=2; i < 201; i++)); do lat_ctx -s 0 $i; done:
(the left most column is the number of processes ($i))

        2.6.21    2.6.22-ck1    2.6.23-rc6-cfs-devel

15      5.88      4.85          5.14
16      5.80      4.77          4.76
17      5.91      4.84          4.92
18      5.79      4.86          4.83
19      5.89      4.94          4.93
20      5.78      4.81          5.13
21      5.88      5.02          4.94
22      5.79      4.79          4.84
23      5.93      4.86          5.05
24      5.73      4.76          4.90
25      6.00      4.94          5.19

for((i=1; i < 100; i++)); do hackbench $i; done:

        2.6.21    2.6.22-ck1    2.6.23-rc6-cfs-devel

80      9.75      8.95          9.52
81      11.54     8.87          9.57
82      11.29     8.92          9.67
83      10.76     8.96          9.82
84      12.04     9.20          9.91
85      11.74     9.39          10.09
86      12.01     9.37          10.18
87      11.39     9.43          10.13
88      12.48     9.60          10.38
89      11.85     9.77          10.52
90      13.78     9.76          10.65

pipe-test:
(the left most column is the run #)

        2.6.21    2.6.22-ck1    2.6.23-rc6-cfs-devel

1       13.84     12.59         13.01
2       13.90     12.57         13.00
3       13.84     12.62         13.06
4       13.87     12.61         13.04
5       13.82     12.62         13.03
6       13.86     12.60         13.02
7       13.85     12.61         13.02
8       13.88     12.45         13.04
9       13.83     12.46         13.03
10      13.88     12.46         13.03

Bound to Single core:

for((i=2; i < 201; i++)); do lat_ctx -s 0 $i; done:

        2.6.21    2.6.22-ck1    2.6.23-rc6-cfs-devel

15      2.90      2.76          2.21
16      2.88      2.79          2.36
17      2.87      2.77          2.52
18      2.86      2.78          2.66
19      2.89      2.72          2.81
20      2.87      2.72          2.95
21      2.86      2.69          3.10
22      2.88      2.72          3.26
23      2.86      2.71          3.39
24      2.84      2.72          3.56
25      2.82      2.73          3.72

for((i=1; i < 100; i++)); do hackbench $i; done:

        2.6.21    2.6.22-ck1    2.6.23-rc6-cfs-devel

80      14.29     10.86         12.03
81      14.40     11.25         12.17
82      15.00     11.42         12.33
83      14.87     11.12         12.51
84      15.37     11.42         12.66
85      15.75     11.68         12.79
86      15.64     11.95         12.95
87      15.80     11.64         13.12
88      15.70     11.91         13.25
89      15.10     12.19         13.42
90      16.24     12.53         13.54

pipe-test:

        2.6.21    2.6.22-ck1    2.6.23-rc6-cfs-devel

1       9.27      8.50          8.55
2       9.27      8.47          8.55
3       9.28      8.47          8.54
4       9.28      8.48          8.54
5       9.28      8.48          8.54
6       9.29      8.46          8.54
7       9.28      8.47          8.55
8       9.29      8.47          8.55
9       9.29      8.45          8.54
10      9.28      8.46          8.54

Links to the graphs (the .dat files are in the same directory):
http://www.healthcarelinen.com/misc/benchmarks/lat_ctx_benchmark2.png
http://www.healthcarelinen.com/misc/benchmarks/hackbench_benchmark2.png
http://www.healthcarelinen.com/misc/benchmarks/pipe-test_benchmark2.png
http://www.healthcarelinen.com/misc/benchmarks/BOUND_lat_ctx_benchmark2.png
http://www.healthcarelinen.com/misc/benchmarks/BOUND_hackbench_benchmark2.png
http://www.healthcarelinen.com/misc/benchmarks/BOUND_pipe-test_benchmark2.png

The only analysis I'll offer is that both sd and cfs are improvements,
and I'm glad that there is a lot of work being done in this area of
linux development. Much respect to Con Kolivas, Ingo Molnar, and Roman
Zippel, as well all the others who have contributed.

Any feedback is welcome.

Regards,
Rob
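For readers who want to repeat the runs, the two shell loops quoted in the email can be sketched as a small Python driver. This is only an illustration under stated assumptions: the email does not say how the runs were bound to a single core, so the use of `os.sched_setaffinity` here is a guess at one possible method (Linux only), the helper names `benchmark_commands` and `run_bound` are hypothetical, and `pipe-test` is left out because its command line is not given.

```python
import os
import subprocess


def benchmark_commands():
    """Build the argv lists for the two loops quoted in the email."""
    cmds = []
    # for((i=2; i < 201; i++)); do lat_ctx -s 0 $i; done
    cmds += [["lat_ctx", "-s", "0", str(i)] for i in range(2, 201)]
    # for((i=1; i < 100; i++)); do hackbench $i; done
    cmds += [["hackbench", str(i)] for i in range(1, 100)]
    return cmds


def run_bound(argv, cpu=0):
    """Run one benchmark pinned to a single CPU.

    Assumption: pinning via sched_setaffinity in the child process.
    The email only says the runs were "bound to a single core".
    """
    def pin():
        os.sched_setaffinity(0, {cpu})  # 0 means "the calling process"

    return subprocess.run(argv, preexec_fn=pin, capture_output=True, text=True)


if __name__ == "__main__":
    # Only list the commands here; actually running them needs
    # lat_ctx and hackbench to be installed.
    print(len(benchmark_commands()), "benchmark invocations")
```

Calling `run_bound(cmd)` for each entry would execute the pinned variant; the unbound variant is a plain `subprocess.run(cmd, ...)` without the `preexec_fn`.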
From: Ingo Molnar [email blocked]
Subject: Re: Scheduler benchmarks - a follow-up
Date: Mon, 17 Sep 2007 13:27:07 +0200

* Rob Hussey [email blocked] wrote:

> Hi all,
>
> After posting some benchmarks involving cfs
> (http://lkml.org/lkml/2007/9/13/385), I got some feedback, so I
> decided to do a follow-up that'll hopefully fill in the gaps many
> people wanted to see filled.

thanks for the update!

> I'll start with some selected numbers, which are preceded by the
> command used for the benchmark.
>
> for((i=2; i < 201; i++)); do lat_ctx -s 0 $i; done:
> (the left most column is the number of processes ($i))
>
>         2.6.21    2.6.22-ck1    2.6.23-rc6-cfs-devel
>
> 15      5.88      4.85          5.14
> 16      5.80      4.77          4.76

the unbound results are harder to compare because CFS changed SMP
balancing to saturate multiple cores better - but this can result in a
micro-benchmark slowdown if the other core is idle (and one of the
benchmark tasks runs on one core and the other runs on the first
core). This affects lat_ctx and pipe-test. (I'll have a look at the
hackbench behavior.)

> Bound to Single core:

these are the more comparable (apples to apples) tests. Usually the
most stable of them is pipe-test:

> pipe-test:
>
>         2.6.21    2.6.22-ck1    2.6.23-rc6-cfs-devel
>
> 1       9.27      8.50          8.55
> 2       9.27      8.47          8.55
> 3       9.28      8.47          8.54
> 4       9.28      8.48          8.54
> 5       9.28      8.48          8.54

so -ck1 is 0.8% faster in this particular test. (but still, there can
be caching effects in either direction - so i usually run the test on
both cores/CPUs to see whether there's any systematic spread in the
results. The cache-layout related random spread can be as high as 10%
on some systems!)

many things happened between 2.6.22-ck1 and 2.6.23-cfs-devel that
could affect performance of this test. My initial guess would be
sched_clock() overhead. Could you send me your system's 'dmesg' output
when running a 2.6.22 (or -ck1) kernel? Chances are that your TSC got
marked unstable, this turns on a much less precise but also faster
sched_clock() implementation. CFS uses the TSC even if the time-of-day
code marked it as unstable - going for the more precise but slightly
slower variant.

To test this theory, could you apply the patch below to cfs-devel (if
you are interested in further testing this) - this changes the
cfs-devel version of sched_clock() to have a low-resolution fallback
like v2.6.22 does. Does this result in any measurable increase in
performance?

(there's also a new sched-devel.git tree out there - if you update to
it you'll need to re-pull it against a pristine Linus git head.)

	Ingo

---
 arch/i386/kernel/tsc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -110,9 +110,9 @@ unsigned long long native_sched_clock(vo
 	 * very important for it to be as fast as the platform
 	 * can achive it. )
 	 */
-	if (unlikely(!tsc_enabled && !tsc_unstable))
+	if (1 || unlikely(!tsc_enabled && !tsc_unstable))
 		/* No locking but a rare wrong value is not a big deal: */
-		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
+		return jiffies_64 * (1000000000 / HZ);
 
 	/* read the Time Stamp Counter: */
 	rdtscll(this_offset);
From: Ingo Molnar [email blocked]
Subject: Re: Scheduler benchmarks - a follow-up
Date: Mon, 17 Sep 2007 15:05:24 +0200

* Rob Hussey [email blocked] wrote:

> http://www.healthcarelinen.com/misc/benchmarks/BOUND_hackbench_benchmark2.png

heh - am i the only one impressed by the consistency of the blue line
in this graph? :-) [ and the green line looks a bit like a ..
staircase? ]

i've meanwhile tested hackbench 90 and the performance difference
between -ck and -cfs-devel seems to be mostly down to the more precise
(but slower) sched_clock() introduced in v2.6.23 and to the startup
penalty of freshly created tasks.

Putting back the 2.6.22 version and tweaking the startup penalty gives
this:

                    [hackbench 90, smaller is better]

  sched-devel.git       sched-devel.git+lowres-sched-clock+dsp
  ---------------       --------------------------------------
       5.555                  5.149
       5.641                  5.149
       5.572                  5.171
       5.583                  5.155
       5.532                  5.111
       5.540                  5.138
       5.617                  5.176
       5.542                  5.119
       5.587                  5.159
       5.553                  5.177
  --------------------------------------
  avg: 5.572            avg: 5.150 (-8.1%)

('lowres-sched-clock' is the patch i sent in the previous mail. 'dsp'
is a disable-startup-penalty patch that is in the latest
sched-devel.git)

i have used your .config to conduct this test.

can you reproduce this with the (very-) latest sched-devel git tree:

  git-pull git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

plus with the low-res-sched-clock patch (re-) attached below?

	Ingo

---
 arch/i386/kernel/tsc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -110,9 +110,9 @@ unsigned long long native_sched_clock(vo
 	 * very important for it to be as fast as the platform
 	 * can achive it. )
 	 */
-	if (unlikely(!tsc_enabled && !tsc_unstable))
+	if (1 || unlikely(!tsc_enabled && !tsc_unstable))
 		/* No locking but a rare wrong value is not a big deal: */
-		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
+		return jiffies_64 * (1000000000 / HZ);
 
 	/* read the Time Stamp Counter: */
 	rdtscll(this_offset);
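The averages in the hackbench 90 table above can be rechecked from the ten timings in each column. The sketch below is only an illustration: the function name `summarize` is made up for this example, and the email does not state which convention the "(-8.1%)" figure follows, so both common ways of expressing the change are computed.

```python
from statistics import mean

# hackbench 90 timings copied from the email (smaller is better)
baseline = [5.555, 5.641, 5.572, 5.583, 5.532,
            5.540, 5.617, 5.542, 5.587, 5.553]   # sched-devel.git
patched = [5.149, 5.149, 5.171, 5.155, 5.111,
           5.138, 5.176, 5.119, 5.159, 5.177]    # +lowres-sched-clock+dsp


def summarize(before, after):
    """Return both averages and the change under two conventions."""
    avg_before = mean(before)
    avg_after = mean(after)
    diff = avg_before - avg_after
    return {
        "avg_before": avg_before,
        "avg_after": avg_after,
        # reduction in run time, relative to the unpatched average
        "time_saved_pct": diff / avg_before * 100,
        # how much slower the unpatched tree is, relative to the patched one
        "slower_by_pct": diff / avg_after * 100,
    }


result = summarize(baseline, patched)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The two averages round to the 5.572 and 5.150 quoted in the email. The change works out to roughly 7.6% when measured against the unpatched average and roughly 8.2% when measured against the patched one, so the quoted figure is in the same range under either reading.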
From: Jos Poortvliet [email blocked]
Subject: Re: [ck] Re: Scheduler benchmarks - a follow-up
Date: Mon, 17 Sep 2007 16:01:31 +0200

On 9/17/07, Ingo Molnar [email blocked] wrote:
>
> * Rob Hussey [email blocked] wrote:
>
> > http://www.healthcarelinen.com/misc/benchmarks/BOUND_hackbench_benchmark2.png
>
> heh - am i the only one impressed by the consistency of the blue line in
> this graph? :-) [ and the green line looks a bit like a .. staircase? ]

Looks lovely, though as long as lower is better, that staircase does a
nice job ;-)

> i've meanwhile tested hackbench 90 and the performance difference
> between -ck and -cfs-devel seems to be mostly down to the more precise
> (but slower) sched_clock() introduced in v2.6.23 and to the startup
> penalty of freshly created tasks.
>
> Putting back the 2.6.22 version and tweaking the startup penalty gives
> this:
>
>                     [hackbench 90, smaller is better]
>
>   sched-devel.git       sched-devel.git+lowres-sched-clock+dsp
>   ---------------       --------------------------------------
>        5.555                  5.149
>        5.641                  5.149
>        5.572                  5.171
>        5.583                  5.155
>        5.532                  5.111
>        5.540                  5.138
>        5.617                  5.176
>        5.542                  5.119
>        5.587                  5.159
>        5.553                  5.177
>   --------------------------------------
>   avg: 5.572            avg: 5.150 (-8.1%)

Hmmm. So cfs was 0.8% slower compared to ck in the test by Rob, it
became 8% faster so... it should be faster than CK - provided these
results are valid over different tests.

But this is all microbenchmarks, which won't have much effect in real
life, right? Besides, will the lowres sched clock patch get in?
From: Ingo Molnar [email blocked]
Subject: Re: [ck] Re: Scheduler benchmarks - a follow-up
Date: Mon, 17 Sep 2007 16:12:17 +0200

* Jos Poortvliet [email blocked] wrote:

> On 9/17/07, Ingo Molnar [email blocked] wrote:
> >
> > * Rob Hussey [email blocked] wrote:
> >
> > > http://www.healthcarelinen.com/misc/benchmarks/BOUND_hackbench_benchmark2.png
> >
> > heh - am i the only one impressed by the consistency of the blue line in
> > this graph? :-) [ and the green line looks a bit like a .. staircase? ]
>
> Looks lovely, though as long as lower is better, that staircase does a
> nice job ;-)

lower is better, but you have to take the thing below into account:

> > i've meanwhile tested hackbench 90 and the performance difference
> > between -ck and -cfs-devel seems to be mostly down to the more precise
> > (but slower) sched_clock() introduced in v2.6.23 and to the startup
> > penalty of freshly created tasks.
> >
> > Putting back the 2.6.22 version and tweaking the startup penalty gives
> > this:
> >
> >                     [hackbench 90, smaller is better]
> >
> >   sched-devel.git       sched-devel.git+lowres-sched-clock+dsp
> >   ---------------       --------------------------------------
> >        5.555                  5.149
> >        5.641                  5.149
> >        5.572                  5.171
> >        5.583                  5.155
> >        5.532                  5.111
> >        5.540                  5.138
> >        5.617                  5.176
> >        5.542                  5.119
> >        5.587                  5.159
> >        5.553                  5.177
> >   --------------------------------------
> >   avg: 5.572            avg: 5.150 (-8.1%)
>
> Hmmm. So cfs was 0.8% slower compared to ck in the test by Rob, it
> became 8% faster so... it should be faster than CK - provided these
> results are valid over different tests.

on my box the TSC overhead has hit CFS quite hard, i'm not sure that's
true on Rob's box. So i'd expect them to be in roughly the same area.

> But this is all microbenchmarks, which won't have much effect in real
> life, right? [...]

yeah, it's much less pronounced in real life - a context-switch rate
above 10,000/sec is already excessive - while for example the lat_ctx
test generates close to a million context switches a second.

> [...] Besides, will the lowres sched clock patch get in?

i dont think so - we want precise/accurate scheduling before
performance. (otherwise tasks working off the timer tick could steal
away cycles without being accounted for them fairly, and could starve
out all other tasks.) Unless the difference was really huge in real
life - but it isnt.

	Ingo
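Ingo's point about tasks "working off the timer tick" can be illustrated with a toy model of a tick-resolution clock. This is a sketch under loud assumptions, not kernel code: `HZ = 1000` is an assumed configuration, `low_res_clock` only mimics the `jiffies_64 * (1000000000 / HZ)` arithmetic from the patch, and the two-task schedule (one task that always yields just before the tick, one that runs across it) is invented for the example.

```python
NSEC_PER_SEC = 1_000_000_000
HZ = 1000                      # assumed tick rate, not stated in the thread
TICK_NS = NSEC_PER_SEC // HZ   # 1 ms per tick at HZ=1000


def low_res_clock(true_ns):
    """Tick-granular clock: only advances when a timer tick has passed."""
    jiffies = true_ns // TICK_NS
    return jiffies * TICK_NS


def precise_clock(true_ns):
    """Idealized high-resolution clock (stands in for a TSC-based one)."""
    return true_ns


def charge(clock, periods=100, burst_ns=900_000):
    """Charge run time to two tasks using timestamps from `clock`.

    In every tick period the hypothetical "tick_dodger" task starts right
    after the tick, runs for burst_ns and yields before the next tick;
    "other" runs for the rest of the period, across the tick boundary.
    """
    charged = {"tick_dodger": 0, "other": 0}
    for p in range(periods):
        t_in = p * TICK_NS            # tick_dodger is switched in
        t_out = t_in + burst_ns       # tick_dodger yields, other runs
        t_end = (p + 1) * TICK_NS     # next tick, other is switched out
        charged["tick_dodger"] += clock(t_out) - clock(t_in)
        charged["other"] += clock(t_end) - clock(t_out)
    return charged


print("low-res:", charge(low_res_clock))
print("precise:", charge(precise_clock))
```

In this model the tick-granular clock charges the tick-dodging task nothing at all, even though it used 90% of the CPU, and bills the whole period to the other task. With the precise clock the charges match the time actually used, which is the fairness argument made in the email.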
Why are they so proud and patting each other on the back when the CFS is still slower than SD?
Because they think that CFS is the "right" choice and SD is "bad". That's absolutely political, not based on performance or anything. It's just that Ingo is a more popular and powerful developer.
Con? Is that you?
Can you read?
More fairness means using more clock resolution. More clock resolution means a little performance loss. You have to balance that. More performance or more fairness? Given that the performance impact is negligible, they chose more fairness. God. This isn't rocket science.
Dual Core
No dual-core tests?
Dual-core makes it almost impossible for one thread to eat all the CPU.
I want more tests on different CPUs.
TSC overhead?
I'm so glad the TSC on our DSPs is essentially zero overhead. :-)
(Ok, reading the 32-bit TSC takes one instruction slot, and the 64-bit TSC takes two on consecutive cycles. But, there are no stalls, and you can still run 7 or 14 other instructions in that time in parallel with it, so....)
I guess reading the TSC on an x86 implies letting all the in-flight state land or something? For it to be such a huge penalty it must have some rather strong synchronization semantics. Ours just reads a free-running counter. If you want synchronization, you actually need to go do it yourself before reading the count. For what CFS does, I don't think you need a synchronized TSC at all. Even if it's off by a full microsecond, that's still 1000x better precision than HZ ever gave us.
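The "1000x" figure can be checked with a line of arithmetic. The sketch below assumes the comparison is between a counter that is off by 1 microsecond and a clock that only advances once per timer tick; the `HZ` values tried are common kernel settings, not something stated in the comment, and the function names are made up for the example.

```python
NSEC_PER_SEC = 1_000_000_000


def tick_ns(hz):
    """Length of one timer tick in nanoseconds for a given HZ."""
    return NSEC_PER_SEC // hz


def precision_gain(hz, error_ns=1_000):
    """How many times finer a clock with error_ns is than one tick."""
    return tick_ns(hz) / error_ns


for hz in (100, 250, 1000):   # assumed HZ settings
    print(f"HZ={hz}: tick={tick_ns(hz)} ns, gain={precision_gain(hz):.0f}x")
```

With `HZ=1000` a tick lasts 1 ms, so a 1 microsecond error is 1000 times finer, matching the comment. At lower `HZ` settings the ratio is larger still.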
--
Program Intellivision and play Space Patrol!
Ahem.
Is product placement allowed here? ;-) :-P
Interesting
I did some googling around, and apparently RDTSC is not a serializing instruction, although it is often used with CPUID before and after in order to serialize it.
Even without serializing, though, it looks to be a slow instruction on Intel's x86s according to this page:
http://www.hostingforum.ca/694715-rdtsc-performance-different-x86-archs....
On AMDs, though, it appears to be much faster.
--
Edit: I found another neat article RE: RDTSC on one of Intel's pages. It's in one of their help forums. They confirm that RDTSC is 65 cycles on Core 2 and 80 cycles on Pentium 4.
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/302...
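To put those cycle counts in perspective, here is a rough back-of-the-envelope sketch. Everything except the 65 and 80 cycle figures is an assumption made for illustration: the 2.4 GHz clock speed, the two clock reads per context switch, and the function names are all hypothetical. The one million context switches per second is the approximate lat_ctx rate mentioned in the thread.

```python
def rdtsc_cost_ns(cycles, cpu_hz):
    """Wall-clock cost of one read, given its cycle count and CPU frequency."""
    return cycles / cpu_hz * 1e9


def overhead_fraction(cycles, cpu_hz, reads_per_sec):
    """Fraction of one CPU spent on clock reads at a given read rate."""
    return cycles * reads_per_sec / cpu_hz


CPU_HZ = 2.4e9                 # assumed clock speed
SWITCHES_PER_SEC = 1_000_000   # roughly the lat_ctx rate quoted in the thread
READS_PER_SWITCH = 2           # assumption, not taken from the kernel source

for name, cycles in (("Core 2", 65), ("Pentium 4", 80)):
    ns = rdtsc_cost_ns(cycles, CPU_HZ)
    frac = overhead_fraction(cycles, CPU_HZ, SWITCHES_PER_SEC * READS_PER_SWITCH)
    print(f"{name}: {ns:.1f} ns per read, {frac:.1%} of one CPU")
```

Under these assumptions the reads cost a few percent of one CPU at micro-benchmark switch rates. At the 10,000 switches per second that Ingo calls already excessive, the same arithmetic gives a figure a hundred times smaller.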
SD wins all?
So, the SD (Staircase Deadline) scheduler wins all?
SD is the best?
Sucks that we get CFS in the kernel, if SD is better.
I don't see how you come to that conclusion
SD certainly appears to have a slightly lower overhead than CFS at this point, that's for sure. There do seem to be a few oddities in the data, though. For instance, compare these two graphs:
http://www.healthcarelinen.com/misc/benchmarks/lat_ctx_benchmark2.png
http://www.healthcarelinen.com/misc/benchmarks/BOUND_lat_ctx_benchmark2.png
In the first one, SD and CFS cluster together and the old scheduler shows higher latency. In the second one, SD and the old scheduler cluster together and CFS shows higher latency. It's rather impressive. The differences in hackbench can also probably be attributed to the differences in scheduling latency between the two.
Ingo's right to point out the TSC latency, if that's indeed the cause for the difference. It's interesting that TSC should be such an overhead, though.
I honestly don't think they'd even bother to measure SD if they didn't care about what it achieved. If nothing else it gives them a target to beat.
Interesting that they compare a scheduler that hasn't been developed for a few months with one that's under active development and peer review right now. I think I know which one I'd bet my money on to win. If nothing else, it gives them bragging rights of some kind, I think.
Note that these benchmarks measure the speed and not the interactivity or fairness of these schedulers. While these numbers are interesting, they are just microbenchmarks and have little significance in practice.
The sole reason Linux needed a new scheduler was to attain better fairness and interactivity. Interactivity is difficult to measure, but prior threads on the mailing list have concluded that CFS is slightly better than SD in this regard (the CFS vs SD debate ended months ago).
The SD vs CFS debate ended when Ingo took all of the good ideas from SD and tried to integrate them into CFS. Many people still think SD is better for interactivity (a task that Con tackled years ahead of Ingo).
My take is that Con refused to play the politics game any more and left the kernel development fold, overall a loss for Linux.
A developer that can't work in a team is a liability for the team.
Why doesn't Linus replace Ingo?
Which makes you wonder why Torvalds hasn't replaced Ingo. His reputation is for being a decent manager. Molnar is obviously an open sore and an impediment to progress.
What are you talking about?
What are you talking about? He's maybe the most friendly and cooperative developer of them all, and tries to work closely even with walking flames like Roman.
Molnar is at best a stonewaller, at worst a plagiarist. You don't hear the same discord coming from the other subsystems.
Ingo has stonewalled nobody; quite the opposite, he's quite open to discussion.
He has also always been open about who and what inspired him.
The other subsystems don't have drama queens like Con and Roman wanting to get attention...
"a task that Con tackled years ahead of Ingo"
Bullshit. Ingo has been working on latency for longer than Con has been around by far.
The test shows that CFS has lower overhead with low to moderate process counts, lower latency in all cases, and much higher consistency. Based on that it would appear that it's server workloads where SD is arguably somewhat better right now.
You sure are reading the graph the wrong way. Lower is better, and -ck is at least equal to -rc6, AT LEAST.
Really, the only objection people can make about SD results is that micro benchmarks don't matter.
No, Con refused to play!
Some of his mindshare went into CFS. But because not all of it got merged, he decided to quit (instead of contributing more). He took it personally when he shouldn't have. The reasons why Linus didn't merge SD have been stated on the kernel mailing list, and they were valid.
Egos and drama queens work badly in kernel development.
Read the mailing list and get the real story instead of trusting some 'headlines'
Ingo is an outstanding developer. Luckily that still matters.
When some clueless people's political movements manage to push talented guys like Ingo out, that will be day one of the decline.
Try to get the facts straight. This CK vs Ingo (SD vs CFS) story has grown into some stinking myth, and it's still being kept alive by the ones without any clue.
Exceptional technical skill and working with others are key talents in these kernel circles. So I guess Ingo will be around for long...
amen