Hi all,

After posting some benchmarks involving cfs
(http://lkml.org/lkml/2007/9/13/385), I got some feedback, so I
decided to do a follow-up that'll hopefully fill in the gaps many
people wanted to see filled.

This time around I've done the benchmarks against 2.6.21, 2.6.22-ck1,
and 2.6.23-rc6-cfs-devel (latest git as of 12 hours ago). All three
.configs are attached. The benchmarks consist of lat_ctx and
hackbench, both with a growing number of processes, as well as
pipe-test. All benchmarks were also run bound to a single core.

Since this time there are hundreds of lines of data, I'll post a
reasonable amount here and attach the data files. There are graphs
again this time, which I'll post links to as well as attach.

I'll start with some selected numbers, which are preceded by the
command used for the benchmark.

for((i=2; i < 201; i++)); do lat_ctx -s 0 $i; done:
(the left most column is the number of processes ($i))
2.6.21 2.6.22-ck1 2.6.23-rc6-cfs-devel
15 5.88 4.85 5.14
16 5.80 4.77 4.76
17 5.91 4.84 4.92
18 5.79 4.86 4.83
19 5.89 4.94 4.93
20 5.78 4.81 5.13
21 5.88 5.02 4.94
22 5.79 4.79 4.84
23 5.93 4.86 5.05
24 5.73 4.76 4.90
25 6.00 4.94 5.19

for((i=1; i < 100; i++)); do hackbench $i; done:
2.6.21 2.6.22-ck1 2.6.23-rc6-cfs-devel
80 9.75 8.95 9.52
81 11.54 8.87 9.57
82 11.29 8.92 9.67
83 10.76 8.96 9.82
84 12.04 9.20 9.91
85 11.74 9.39 10.09
86 12.01 9.37 10.18
87 11.39 9.43 10.13
88 12.48 9.60 10.38
89 11.85 9.77 10.52
90 13.78 9.76 10.65

pipe-test:
(the left most column is the run #)
2.6.21 2.6.22-ck1 2.6.23-rc6-cfs-devel
1 13.84 12.59 13.01
2 13.90 12.57 13.00
3 13.84 12.62 13.06
4 13.87 12.61 13.04
5 13.82 12.62 13.03
6 13.86 12.60 13.02
7 13.85 12.61 13.02
8 13.88 12.45 13.04
9 13.83 12.46 13.03
10 13.88 12.46 13.03

Bound to Single core:
for((i=2; i < 201; i++)); do lat_ctx -s 0 $i; done:
2.6.21 2.6.22-ck1 2.6.23-rc6-cfs...
heh - am i the only one impressed by the consistency of the blue line in
this graph? :-) [ and the green line looks a bit like a .. staircase? ]

i've meanwhile tested hackbench 90 and the performance difference
between -ck and -cfs-devel seems to be mostly down to the more precise
(but slower) sched_clock() introduced in v2.6.23 and to the startup
penalty of freshly created tasks.

Putting back the 2.6.22 version and tweaking the startup penalty gives
this:

[hackbench 90, smaller is better]
sched-devel.git sched-devel.git+lowres-sched-clock+dsp
--------------- --------------------------------------
5.555 5.149
5.641 5.149
5.572 5.171
5.583 5.155
5.532 5.111
5.540 5.138
5.617 5.176
5.542 5.119
5.587 5.159
5.553 5.177
--------------------------------------
avg: 5.572 avg: 5.150 (-8.1%)

('lowres-sched-clock' is the patch i sent in the previous mail. 'dsp' is
a disable-startup-penalty patch that is in the latest sched-devel.git)

i have used your .config to conduct this test.
can you reproduce this with the (very-) latest sched-devel git tree:
git-pull git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
plus with the low-res-sched-clock patch (re-) attached below?
Ingo
---
arch/i386/kernel/tsc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -110,9 ...
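
The averages in the "hackbench 90" table above can be rechecked with a few lines of Python. This is only a sketch: the timings are copied from the table, and everything else (variable names, the two ways of expressing the percentage) is an assumption for illustration. A relative difference depends on which column is taken as the baseline, so both conventions are computed; the thread does not say which one was used for the quoted figure.

```python
from statistics import mean

# Timings in seconds, copied from the "hackbench 90" table in the message above.
sched_devel = [5.555, 5.641, 5.572, 5.583, 5.532,
               5.540, 5.617, 5.542, 5.587, 5.553]
patched = [5.149, 5.149, 5.171, 5.155, 5.111,
           5.138, 5.176, 5.119, 5.159, 5.177]

avg_devel = mean(sched_devel)
avg_patched = mean(patched)
delta = avg_devel - avg_patched

# Convention 1: time saved, relative to the unpatched tree.
saved_pct = delta / avg_devel * 100
# Convention 2: extra time of the unpatched tree, relative to the patched one.
extra_pct = delta / avg_patched * 100

print(f"avg sched-devel: {avg_devel:.3f}")
print(f"avg patched:     {avg_patched:.3f}")
print(f"saved vs. unpatched: {saved_pct:.1f}%")
print(f"extra vs. patched:   {extra_pct:.1f}%")
```

The two averages come out as in the table (5.572 and 5.150). The relative difference is roughly 7.6% under the first convention and roughly 8.2% under the second.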
Sorry it took so long for me to get back.
Ok, to start, the dmesg output for 2.6.22-ck1 is attached. The relevant
lines seem to be:
[ 27.691348] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
[ 27.995427] Time: tsc clocksource has been installed.

I've updated to the latest sched-devel git, and applied the patch
above. I ran it through the same tests, but this time only while bound
to a single core. Some selected numbers:

lat_ctx -s 0 $i (the left most number is $i):
15 3.09
16 3.09
17 3.11
18 3.07
19 2.99
20 3.09
21 3.05
22 3.11
23 3.05
24 3.08
25 3.06

hackbench $i:
80 11.720
81 11.698
82 11.888
83 12.094
84 12.232
85 12.351
86 12.512
87 12.680
88 12.736
89 12.861
90 13.103

pipe-test (the left most number is the run #):
1 8.85
2 8.80
3 8.84
4 8.82
5 8.82
6 8.80
7 8.82
8 8.82
9 8.85
10 8.83

Once again, graphs:
http://www.healthcarelinen.com/misc/benchmarks/BOUND_PATCHED_lat_ctx_ben...
http://www.healthcarelinen.com/misc/benchmarks/BOUND_PATCHED_hackbench_b...
http://www.healthcarelinen.com/misc/benchmarks/BOUND_PATCHED_pipe-test_b...

I saw in your other email that you'd like for me to try with
CONFIG_PREEMPT disabled. I should have a chance to try that very soon.

Regards,
Rob
Rob, another thing i just noticed in your .configs: you have
CONFIG_PREEMPT=y enabled. Would it be possible to get a testrun with
that disabled? That gives the best throughput and context-switch latency
numbers. (CONFIG_PREEMPT might also have preemption artifacts - there's
one report of it having _worse_ desktop latencies on certain hardware
than !CONFIG_PREEMPT.)

Ingo
-
I reverted the patch from before since it didn't seem to help. Do you
think it may have to do with my system having Hyper-Threading enabled?
I should have pointed out before that I don't really have a dual-core
system, just a P4 with Hyper-Threading (I loosely used core to refer
to processor).

Some new numbers for 2.6.23-rc6-cfs-devel (!CONFIG_PREEMPT and bound
to single processor)

lat_ctx:
15 2.73
16 2.74
17 2.81
18 2.74
19 2.74
20 2.73
21 2.60
22 2.74
23 2.72
24 2.74
25 2.74

hackbench:
80 11.578
81 11.991
82 11.914
83 12.026
84 12.226
85 12.347
86 12.552
87 12.655
88 13.011
89 12.941
90 13.237

pipe-test:
1 9.58
2 9.58
3 9.58
4 9.58
5 9.58
6 9.58
7 9.58
8 9.58
9 9.58
10 9.58

The obligatory graphs:
http://www.healthcarelinen.com/misc/benchmarks/BOUND_NOPREEMPT_lat_ctx_b...
http://www.healthcarelinen.com/misc/benchmarks/BOUND_NOPREEMPT_hackbench...
http://www.healthcarelinen.com/misc/benchmarks/BOUND_NOPREEMPT_pipe-test...

A cursory glance suggests that performance wrt lat_ctx and hackbench
has increased (lower numbers), but degraded quite a lot for pipe-test.
The numbers for pipe-test are extremely stable though, while the
numbers for hackbench are more erratic (which isn't saying much since
the original numbers gave nearly a straight line). I'm still willing
to try out any more ideas.

Regards,
Rob
-
btw., it's likely that if you turn off CONFIG_PREEMPT for .21 and for
.22-ck1 they'll improve a bit too - so it's not fair to put the .23
!PREEMPT numbers on the graph as the PREEMPT numbers of the other ...

the pipe-test behavior looks like an outlier. !PREEMPT only removes code
(which makes the code faster), so this could be a cache layout artifact.
(or perhaps we preempt at a different point which is disadvantageous to
caching?) Pipe-test is equivalent to "lat_ctx -s 0 2" so if there was a
genuine slowdown it would show up in the lat_ctx graph - but the graph
shows a speedup.

Ingo
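
For readers unfamiliar with the workload: a two-task pipe ping-pong of the kind described above can be sketched as follows. This is not the source of pipe-test or lat_ctx; the function name `pipe_ping_pong` and the `round_trips` parameter are made up for illustration. Interpreter overhead dominates the timings here, so the absolute numbers are not comparable with the ones in this thread.

```python
import os
import time


def pipe_ping_pong(round_trips=10000):
    """Bounce one byte between a parent and a child over two pipes.

    Returns the average wall-clock time per round trip in microseconds.
    Each round trip forces at least two context switches when both
    tasks share one CPU.
    """
    p2c_r, p2c_w = os.pipe()  # parent -> child
    c2p_r, c2p_w = os.pipe()  # child -> parent

    pid = os.fork()
    if pid == 0:
        # Child: echo each byte back until the parent closes its write end.
        os.close(p2c_w)
        os.close(c2p_r)
        while True:
            data = os.read(p2c_r, 1)
            if not data:  # EOF
                break
            os.write(c2p_w, data)
        os._exit(0)

    # Parent: send a byte, block until it comes back, repeat.
    os.close(p2c_r)
    os.close(c2p_w)
    start = time.perf_counter()
    for _ in range(round_trips):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start

    os.close(p2c_w)  # EOF lets the child leave its loop
    os.close(c2p_r)
    os.waitpid(pid, 0)
    return elapsed / round_trips * 1e6


if __name__ == "__main__":
    print(f"{pipe_ping_pong():.2f} usec per round trip")
```

To approximate the "bound to a single processor" runs, the script could be started under something like `taskset -c 0`; how the runs in this thread were actually bound is not stated.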
-
The graphs are really just to show where the new numbers fit in. Plus, ...

Interestingly, every set of lat_ctx -s 0 2 numbers I run on the
!PREEMPT kernel are on average higher than with PREEMPT (around 2.84
for !PREEMPT and 2.4 for PREEMPT). Anything higher than around 2 or 3
(such as lat_ctx -s 0 8) gives lower average numbers for !PREEMPT.

Regards,
Rob
-
yeah - the graphs are completely OK (and they are really nice and ...

perhaps this 2 task ping-pong is somehow special in that it manages to
fit into L1 cache much better under PREEMPT than under !PREEMPT.
(usually the opposite is true) At 3 tasks or more things dont fit
anymore (or the special alignment is gone) so the faster !PREEMPT code
wins.

Ingo
-
pipe-test is a very stable workload, and is thus quite sensitive to the
associativity of the CPU cache. Even killing the task and repeating the
same test isnt enough to get rid of the systematic skew that this can
cause. I've seen divergence of up to 10% in pipe-test. One way to test
it is to run pipe-test, then to stop it, then to "ssh localhost" (this
in itself uses up a couple of pipe objects and file objects and changes
the cache layout picture), then run pipe-test again, then again "ssh
localhost", etc. Via this trick one can often see cache-layout
artifacts. How much 'skew' does pipe-test have on your system if you try
this manually?

Ingo
-
I did 7 data sets of 5 runs each using this method. With pipe-test
bound to one sibling, there were 10 unique values in these 7 sets. The
lowest value was 9.22, the highest value was 9.62, and the median of
the unique values was 9.47. So the deviation from that median for the
lowest and highest values was {-0.25, 0.15}. The numbers were even
tighter for pipe-test not bound to a single sibling: {-0.07, 0.12}
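
The deviations quoted above follow directly from the three reported figures. The sketch below recomputes them and also expresses the total spread as a percentage, which makes it easier to compare with the "up to 10%" divergence mentioned earlier in the thread. Only the summary values given in the message are used, since the individual runs were not posted; the variable names are arbitrary.

```python
# Summary figures reported for pipe-test bound to one sibling.
lowest = 9.22
highest = 9.62
median_of_unique = 9.47

# Deviation of the extremes from the reported median.
dev_low = lowest - median_of_unique
dev_high = highest - median_of_unique

# Total spread (max - min) relative to the median, in percent.
spread_pct = (highest - lowest) / median_of_unique * 100

print(f"deviation: {{{dev_low:+.2f}, {dev_high:+.2f}}}")
print(f"spread: {spread_pct:.1f}% of the median")
```

The spread works out to a little over 4% of the median for the bound case.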
-
Hi Rob,
Just for reference, we call them "siblings", not "cores" on HT. I believe
that a line "Sibling:" appears in /proc/cpuinfo BTW.

Regards,
Willy
-
Thanks, I was searching for the right word but couldn't come up with it.
-
Looks lovely, though as long as lower is better, that staircase does a ...

Hmmm. So cfs was 0.8% slower compared to ck in the test by Rob, it
became 8% faster so... it should be faster than CK - provided these
results are valid over different tests.

But this is all microbenchmarks, which won't have much effect in real ...
-
on my box the TSC overhead has hit CFS quite hard, i'm not sure that's ...

yeah, it's much less pronounced in real life - a context-switch rate
above 10,000/sec is already excessive - while for example the lat_ctx ...

i dont think so - we want precise/accurate scheduling before
performance. (otherwise tasks working off the timer tick could steal
away cycles without being accounted for them fairly, and could starve
out all other tasks.) Unless the difference was really huge in real life
- but it isnt.

Ingo
-
the unbound results are harder to compare because CFS changed SMP
balancing to saturate multiple cores better - but this can result in a
micro-benchmark slowdown if the other core is idle (and one of the
benchmark tasks runs on one core and the other runs on the first core).
This affects lat_ctx and pipe-test. (I'll have a look at the hackbench ...

these are the more comparable (apples to apples) tests. Usually the most ...

so -ck1 is 0.8% faster in this particular test. (but still, there can be
caching effects in either direction - so i usually run the test on both
cores/CPUs to see whether there's any systematic spread in the results.
The cache-layout related random spread can be as high as 10% on some
systems!)

many things happened between 2.6.22-ck1 and 2.6.23-cfs-devel that could
affect performance of this test. My initial guess would be sched_clock()
overhead. Could you send me your system's 'dmesg' output when running a
2.6.22 (or -ck1) kernel? Chances are that your TSC got marked unstable,
this turns on a much less precise but also faster sched_clock()
implementation. CFS uses the TSC even if the time-of-day code marked it
as unstable - going for the more precise but slightly slower variant.

To test this theory, could you apply the patch below to cfs-devel (if
you are interested in further testing this) - this changes the cfs-devel
version of sched_clock() to have a low-resolution fallback like v2.6.22
does. Does this result in any measurable increase in performance?

(there's also a new sched-devel.git tree out there - if you update to it
you'll need to re-pull it against a pristine Linus git head.)

Ingo
---
arch/i386/kernel/tsc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -110,9 +110,9 @@ unsigned long long native_sched_clock(vo
* very im...
Rob,
I gather this was with the complete -ck patchset? It would be interesting to see if just SD
performed as well. If it does, CFS needs more work. If not, there are other things in -ck
that really do improve performance and should be looked into.

Thanks
Ed Tomlinson
-
yeah. The biggest item in -ck besides SD is swap-prefetch, but that
shouldnt have an effect in this case. I _think_ that most of the
measured difference is due to scheduler details though. Right now my
estimation is that with the patch i sent to Rob, and with latest
sched-devel.git, CFS should perform as good or better than SD, even in
these micro-benchmarks. (but i cannot tell what will happen on Rob's
machine - so i'm keeping an open mind towards any other fixables :-) I'm
curious about the next round of numbers (if Rob has time to do them).

Ingo
-
also see:
http://lkml.org/lkml/2007/9/17/172
i think at least part of the differences is due to the different
sched_clock() accuracy and performance in v2.6.22-ck versus v2.6.23-cfs.

Ingo
-
