Hi.

It was reported recently that tbench has a long history of regressions, starting at least from the 2.6.23 kernel. I verified that in my test environment tbench 'lost' more than 100 MB/s, from 470 down to 355, between at least 2.6.24 and 2.6.27. The 2.6.26-2.6.27 performance regression on my machines roughly corresponds to a drop from 375 to 355 MB/s.

I spent several days on various tests and bisections (unfortunately bisect cannot always point to the 'right' commit), and found the following problems.

First, related to the network, as lots of people expected: TSO/GSO over loopback with the tbench workload eats about 5-10 MB/s, since the TSO/GSO frame creation overhead is not paid back by the optimized super-frame processing gains. Since it brings a really impressive improvement in big-packet workloads, it was (likely) decided not to add a patch for this; instead one can disable TSO/GSO via ethtool. This patch was added in the 2.6.27 window, so it has its part in that regression.

The second part of the 26-27 window regression (I remind, it is about 20 MB/s) is related to the scheduler changes, which was expected by another group of people. I tracked it down to commit a7be37ac8e1565e00880531f4e2aff421a21c803, which, if reverted, returns 2.6.27 tbench performance to the highest (for 2.6.26-2.6.27) 365 MB/s mark. I also tested the tree stopped at the above commit itself, i.e. not 2.6.27, and got 373 MB/s, so likely other changes in that merge ate a couple of megs. Attached patch is against 2.6.27.

The curious reader can ask: where did we lose another 100 MB/s? This small issue was not detected (or at least not reported in netdev@ with a provocative enough subject), and it happened to live somewhere in the 2.6.24-2.6.25 changes. I was so lucky to 'guess' (just after a couple of hundred compilations) that it corresponds to commit 8f4d37ec073c17e2d4aa8851df5837d798606d6f about high-resolution timers; the attached patch against 2.6.25 brings tbench performance for the 2.6.25 kernel tree to 455 MB/s.
There are still somewhat ...
can you try

  echo NO_HRTICK > /debug/sched_features

on .27 like kernels?

Also, what clocksource do those machines use?

  cat /sys/devices/system/clocksource/clocksource0/current_clocksource

As to a7be37ac8e1565e00880531f4e2aff421a21c803, could you try tip/master? I reworked some of the wakeup preemption code in there.

Thanks for looking into this issue!
--
Hi Peter.

I've enabled the kernel hacking option and scheduler debugging, turned off hrticks, and performance jumped to 382 MB/s:

vanilla 27: 347.222
no TSO/GSO: 357.331
no hrticks: 382.983

I use the tsc clocksource; acpi_pm and jiffies are also available. With acpi_pm performance is even lower (I stopped the test after it dropped below the 340 MB/s mark). jiffies does not work at all: it looks like sockets get stuck in the time_wait state when this clock source is used, although that may be a different issue.

So I think hrticks are guilty, but the result is still not as good as the .25 tree without the mentioned changes (455 MB/s) and .24 (475 MB/s).

--
Evgeniy Polyakov
--
hi Evgeniy,

i'm glad that you are looking into this! That is an SMP box, right? If yes then could you try this sched-domains tuning utility i have written yesterday (incidentally):

  http://redhat.com/~mingo/cfs-scheduler/tune-sched-domains

just run it without options to see the current sched-domains options. On a testsystem i have it displays this:

# tune-sched-domains
usage: tune-sched-domains <val>
current val on cpu0/domain0:
SD flag: 47
+   1: SD_LOAD_BALANCE:     Do load balancing on this domain
+   2: SD_BALANCE_NEWIDLE:  Balance when about to become idle
+   4: SD_BALANCE_EXEC:     Balance on exec
+   8: SD_BALANCE_FORK:     Balance on fork, clone
-  16: SD_WAKE_IDLE:        Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:      Wake task to waking CPU
-  64: SD_WAKE_BALANCE:     Perform balancing at task wakeup

then could you check what effects it has if you turn off SD_BALANCE_NEWIDLE? On my box i did it via:

# tune-sched-domains $[47-2]
changed /proc/sys/kernel/sched_domain/cpu0/domain0/flags: 47 => 45
SD flag: 45
+   1: SD_LOAD_BALANCE:     Do load balancing on this domain
-   2: SD_BALANCE_NEWIDLE:  Balance when about to become idle
+   4: SD_BALANCE_EXEC:     Balance on exec
+   8: SD_BALANCE_FORK:     Balance on fork, clone
-  16: SD_WAKE_IDLE:        Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:      Wake task to waking CPU
-  64: SD_WAKE_BALANCE:     Perform balancing at task wakeup
changed /proc/sys/kernel/sched_domain/cpu0/domain1/flags: 1101 => 45
SD flag: 45
+   1: SD_LOAD_BALANCE:     Do load balancing on this domain
-   2: SD_BALANCE_NEWIDLE:  Balance when about to become idle
+   4: SD_BALANCE_EXEC:     Balance on exec
+   8: SD_BALANCE_FORK:     Balance on fork, clone
-  16: SD_WAKE_IDLE:        Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:      Wake task to waking CPU
-  64: SD_WAKE_BALANCE:     ...
Hi Ingo.

I've removed SD_BALANCE_NEWIDLE:

# ./tune-sched-domains $[191-2]
changed /proc/sys/kernel/sched_domain/cpu0/domain0/flags: 191 => 189
SD flag: 189
+   1: SD_LOAD_BALANCE:     Do load balancing on this domain
-   2: SD_BALANCE_NEWIDLE:  Balance when about to become idle
+   4: SD_BALANCE_EXEC:     Balance on exec
+   8: SD_BALANCE_FORK:     Balance on fork, clone
+  16: SD_WAKE_IDLE:        Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:      Wake task to waking CPU
-  64: SD_WAKE_BALANCE:     Perform balancing at task wakeup
+ 128: SD_SHARE_CPUPOWER:   Domain members share cpu power
changed /proc/sys/kernel/sched_domain/cpu0/domain1/flags: 47 => 189
SD flag: 189
+   1: SD_LOAD_BALANCE:     Do load balancing on this domain
-   2: SD_BALANCE_NEWIDLE:  Balance when about to become idle
+   4: SD_BALANCE_EXEC:     Balance on exec
+   8: SD_BALANCE_FORK:     Balance on fork, clone
+  16: SD_WAKE_IDLE:        Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:      Wake task to waking CPU
-  64: SD_WAKE_BALANCE:     Perform balancing at task wakeup
+ 128: SD_SHARE_CPUPOWER:   Domain members share cpu power

And got noticeable improvement (each new line has fixes from previous):

vanilla 27: 347.222
no TSO/GSO: 357.331
no hrticks: 382.983

Ok, I've started to pull it down, I will reply back when things are ready.

--
Evgeniy Polyakov
--
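The flags value printed by tune-sched-domains is a plain bitmask, so the "+"/"-" columns above can be reproduced by hand. The following is only a minimal sketch: the bit values and names are copied from the output quoted in this thread, while the helper names `decode_sd_flags` and `clear_flag` are hypothetical and are not part of the utility.

```python
# Bit values and names copied from the tune-sched-domains output quoted above.
SD_FLAGS = {
    1: "SD_LOAD_BALANCE",
    2: "SD_BALANCE_NEWIDLE",
    4: "SD_BALANCE_EXEC",
    8: "SD_BALANCE_FORK",
    16: "SD_WAKE_IDLE",
    32: "SD_WAKE_AFFINE",
    64: "SD_WAKE_BALANCE",
    128: "SD_SHARE_CPUPOWER",
}


def decode_sd_flags(value):
    """Return the names of the flags that are set in a sched-domain flags value."""
    return [name for bit, name in SD_FLAGS.items() if value & bit]


def clear_flag(value, bit):
    """Clear one flag bit. Unlike plain subtraction, this is a no-op if the bit is already clear."""
    return value & ~bit


# Same operation as "$[191-2]" in the message above: drop SD_BALANCE_NEWIDLE.
for value in (191, clear_flag(191, 2)):
    print(value, decode_sd_flags(value))
```

Masking instead of subtracting matters only when the bit might already be clear; for the values shown in the thread (47 and 191) both give the same result.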
make sure you have this fix in tip/master already: 5b7dba4: sched_clock: prevent scd->clock from moving backwards Note: Mike is 100% correct in suggesting that a very good cpu_clock() is needed for precise scheduling. i've also Cc:-ed Nick. Ingo --
The last commit is 5dc64a3442b98eaa and the aforementioned changeset was included. Result is quite bad:

vanilla 27: 347.222
no TSO/GSO: 357.331
no hrticks: 382.983
no balance: 389.802
tip: 365.576

--
Evgeniy Polyakov
--
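To put the table above in relative terms, here is a small sketch that computes each step's change against vanilla .27 and against the previous row. The numbers are copied from the message; the helper name `pct_change` is made up for illustration.

```python
# tbench results in MB/s, copied from the message above (each row includes the previous fixes,
# except "tip", which is the tip/master tree).
results = [
    ("vanilla 27", 347.222),
    ("no TSO/GSO", 357.331),
    ("no hrticks", 382.983),
    ("no balance", 389.802),
    ("tip", 365.576),
]


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


vanilla = results[0][1]
previous = vanilla
for name, mbps in results:
    print(f"{name:12s} {mbps:8.3f} MB/s  "
          f"{pct_change(vanilla, mbps):+6.2f}% vs vanilla  "
          f"{mbps - previous:+8.3f} MB/s vs previous row")
    previous = mbps
```

The last row is the interesting one: tip is still ahead of vanilla .27, but it is about 24 MB/s behind the "no balance" result.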
okay. The target is 470 MB/sec, right? (Assuming the workload is sane and 'fixing' it does not mean we have to schedule worse.) We are still way off from 470 MB/sec. Ingo --
Well, that's where I started/stopped, so maybe we will even move further? :) -- Evgeniy Polyakov --
Can anyone please tell me if there was any conclusion of this thread? Thanks, Rafael --
From: "Rafael J. Wysocki" <rjw@sisk.pl>

I made some more analysis in private with Ingo and Peter Z. and found that the tbench decreases correlate pretty much directly with the steadily increasing cpu cost of wake_up() and friends in the fair scheduler.

The largest increase in computational cost of wakeups came in 2.6.27 when the hrtimer bits got added; it more than tripled the cost of a wakeup. In 2.6.28-rc1 the hrtimer feature has been disabled, but I think that should be backported into the 2.6.27-stable branch.

Meanwhile I'm spending some time in the background trying to replace the fair scheduler's RB tree crud with something faster, so maybe at some point we can recover all of the regressions in this area caused by the CFS code.
--
My test data indicates (to me anyway) that there is another source of localhost throughput loss in .27. In that data, there is no hrtick overhead since I didn't have highres timers enabled, and computational costs added in .27 were removed. Dunno where it lives, but it does appear to exist. -Mike --
From: Mike Galbraith <efault@gmx.de> Disabling TSO on loopback doesn't fix that bit for you? --
No. Those numbers are with TSO/GSO disabled. I did a manual 100% sched and everything related revert to 26 scheduler, and had ~the same result as these numbers. 27 with 100% revert actually performed a bit _worse_ for me than 27 with its overhead.. which puzzles me greatly. -Mike --
Thanks a lot for the info. Could you please give me a pointer to the commit disabling the hrtimer feature? Rafael --
From: "Rafael J. Wysocki" <rjw@sisk.pl>
Here it is:
commit 0c4b83da58ec2e96ce9c44c211d6eac5f9dae478
Author: Ingo Molnar <mingo@elte.hu>
Date: Mon Oct 20 14:27:43 2008 +0200
sched: disable the hrtick for now
David Miller reported that hrtick update overhead has tripled the
wakeup overhead on Sparc64.
That is too much - disable the HRTICK feature for now by default,
until a faster implementation is found.
Reported-by: David Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 7c9e8f4..fda0162 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -5,7 +5,7 @@ SCHED_FEAT(START_DEBIT, 1)
 SCHED_FEAT(AFFINE_WAKEUPS, 1)
 SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 SCHED_FEAT(SYNC_WAKEUPS, 1)
-SCHED_FEAT(HRTICK, 1)
+SCHED_FEAT(HRTICK, 0)
 SCHED_FEAT(DOUBLE_TICK, 0)
 SCHED_FEAT(ASYM_GRAN, 1)
 SCHED_FEAT(LB_BIAS, 1)
--
Part of the .27 regression was added scheduler overhead going from .26
to .27. The scheduler overhead is now gone, but an unidentified source
of localhost throughput loss remains for both SMP and UP configs.
-Mike
My last test data, updated to reflect recent commits:
Legend:
clock = v2.6.26..5052696 + 5052696..v2.6.27-rc7 sched clock changes
weight = a7be37a + c9c294a + ced8aa1 (adds math overhead)
buddy = 103638d (adds math overhead)
buddy_overhead = b0aa51b (removes math overhead of buddy)
revert_to_per_rq_vruntime = f9c0b09 (+2 lines, removes math overhead of weight)
2.6.26.6-up virgin
ring-test - 1.169 us/cycle = 855 KHz 1.000
netperf - 130967.54 131143.75 130914.96 rr/s avg 131008.75 rr/s 1.000
tbench - 357.593 355.455 356.048 MB/sec avg 356.365 MB/sec 1.000
2.6.26.6-up + clock + buddy + weight (== .27 scheduler)
ring-test - 1.234 us/cycle = 810 KHz .947 [cmp1]
netperf - 128026.62 128118.48 127973.54 rr/s avg 128039.54 rr/s .977
tbench - 342.011 345.307 343.535 MB/sec avg 343.617 MB/sec .964
2.6.26.6-up + clock + buddy + weight + revert_to_per_rq_vruntime + buddy_overhead
ring-test - 1.174 us/cycle = 851 KHz .995 [cmp2]
netperf - 133928.03 134265.41 134297.06 rr/s avg 134163.50 rr/s 1.024
tbench - 358.049 359.529 358.342 MB/sec avg 358.640 MB/sec 1.006
versus .26 counterpart
2.6.27-up virgin
ring-test - 1.193 us/cycle = 838 KHz 1.034 [vs cmp1]
netperf - 121293.48 121700.96 120716.98 rr/s avg 121237.14 rr/s .946
tbench - 340.362 339.780 341.353 MB/sec avg 340.498 MB/sec .990
2.6.27-up + revert_to_per_rq_vruntime + buddy_overhead
ring-test - 1.122 us/cycle = 891 KHz 1.047 [vs cmp2]
netperf - 119353.27 118600.98 119719.12 rr/s avg ...
From: Mike Galbraith <efault@gmx.de>
It has to be the TSO thingy Evgeniy hit too, right? If not, please bisect this.
--
(oh my <fword> gawd:) I spent long day manweeks trying to bisect and whatnot. It's immune to my feeble efforts, and my git-foo. -Mike --
but..

(tbench/netperf numbers were tested with gcc-4.1 at this time in log, I went back and re-measured ring-test because I switched compilers)

2.6.22.19-up
ring-test - 1.204 us/cycle = 830 KHz (gcc-4.1)
ring-test - doorstop (gcc-4.3)
netperf - 147798.56 rr/s = 295 KHz (hmm, a bit unstable, 140K..147K rr/s)
tbench - 374.573 MB/sec

2.6.22.19-cfs-v24.1-up
ring-test - 1.098 us/cycle = 910 KHz (gcc-4.1)
ring-test - doorstop (gcc-4.3)
netperf - 140039.03 rr/s = 280 KHz = 3.57us - 1.10us sched = 2.47us/packet network
tbench - 364.191 MB/sec

2.6.23.17-up
ring-test - 1.252 us/cycle = 798 KHz (gcc-4.1)
ring-test - 1.235 us/cycle = 809 KHz (gcc-4.3)
netperf - 123736.40 rr/s = 247 KHz sb 268 KHZ / 134336.37 rr/s
tbench - 355.906 MB/sec

2.6.23.17-cfs-v24.1-up
ring-test - 1.100 us/cycle = 909 KHz (gcc-4.1)
ring-test - 1.074 us/cycle = 931 KHz (gcc-4.3)
netperf - 135847.14 rr/s = 271 KHz sb 280 KHz / 140039.03 rr/s
tbench - 364.511 MB/sec

2.6.24.7-up
ring-test - 1.100 us/cycle = 909 KHz (gcc-4.1)
ring-test - 1.068 us/cycle = 936 KHz (gcc-4.3)
netperf - 122300.66 rr/s = 244 KHz sb 280 KHz / 140039.03 rr/s
tbench - 341.523 MB/sec

2.6.25.17-up
ring-test - 1.163 us/cycle = 859 KHz (gcc-4.1)
ring-test - 1.129 us/cycle = 885 KHz (gcc-4.3)
netperf - 132102.70 rr/s = 264 KHz sb 275 KHz / 137627.30 rr/s
tbench - 361.71 MB/sec

..in 25, something happened that dropped my max context switch rate from ~930 KHz to ~885 KHz. Maybe I'll have better luck trying to find that. Added to to-do list. Benchmark mysteries I'm going to have to leave alone, they've kicked my little butt quite thoroughly ;-)

-Mike
--
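The "280 KHz = 3.57us - 1.10us sched = 2.47us/packet network" line above packs a short derivation. A minimal sketch of that arithmetic follows, assuming (as the "rr/s = KHz" conversions in the message suggest) that each netperf request/response round trip costs two context switches. The function name and the `switches_per_rr` parameter are hypothetical.

```python
def network_cost_us(rr_per_sec, sched_us_per_switch, switches_per_rr=2):
    """Split the time per context switch into scheduler and non-scheduler parts.

    Assumption: one request/response round trip costs `switches_per_rr` context switches.
    Returns (switch rate in Hz, microseconds per switch, microseconds left after
    subtracting the pure scheduling cost measured by ring-test).
    """
    switch_rate_hz = rr_per_sec * switches_per_rr
    us_per_switch = 1e6 / switch_rate_hz
    return switch_rate_hz, us_per_switch, us_per_switch - sched_us_per_switch


# 2.6.22.19-cfs-v24.1-up: netperf 140039.03 rr/s, ring-test 1.098 us/cycle (gcc-4.1).
rate, per_switch, remainder = network_cost_us(140039.03, 1.098)
print(f"{rate / 1e3:.0f} KHz, {per_switch:.2f} us per switch, "
      f"{remainder:.2f} us per packet outside the scheduler")
```

With the figures from the message this reproduces the quoted 280 KHz, 3.57 us and 2.47 us values.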
From: Mike Galbraith <efault@gmx.de>

But note that tbench performance improved a bit in 2.6.25. In my tests I noticed a similar effect, but from 2.6.23 to 2.6.24, weird.

Just for the public record here are the numbers I got in my testing. Each entry was run purely on the latest 2.6.X-stable tree for each release. First is the tbench score and then there are 40 numbers which are sparc64 cpu cycle counts of default_wake_function().

v2.6.22:
Throughput 173.677 MB/sec 2 clients 2 procs max_latency=38.192 ms
1636 1483 1552 1560 1534 1522 1472 1530 1518 1468
1534 1402 1468 1656 1383 1362 1516 1336 1392 1472
1652 1522 1486 1363 1430 1334 1382 1398 1448 1439
1662 1540 1526 1472 1539 1434 1452 1492 1502 1432

v2.6.23: This is when CFS got added to the tree.
Throughput 167.933 MB/sec 2 clients 2 procs max_latency=25.428 ms
3435 3363 3165 3304 3401 3189 3280 3243 3156 3295
3439 3375 2950 2945 2727 3383 3560 3417 3221 3271
3595 3293 3323 3283 3267 3279 3343 3293 3203 3341
3413 3268 3107 3361 3245 3195 3079 3184 3405 3191

v2.6.24:
Throughput 170.314 MB/sec 2 clients 2 procs max_latency=22.121 ms
2136 1886 2030 1929 2021 1941 2009 2067 1895 2019
2072 1985 1992 1986 2031 2085 2014 2103 1825 1705
2018 2034 1921 2079 1901 1989 1976 2035 2053 1971
2144 2059 2025 2024 2029 1932 1980 1947 1956 2008

v2.6.25:
Throughput 165.294 MB/sec 2 clients 2 procs max_latency=108.869 ms
2551 2707 2674 2771 2641 2727 2647 2865 2800 2796
2793 2745 2609 2753 2674 2618 2671 2668 2641 2744
2727 2616 2897 2720 2682 2737 2551 2677 2687 2603
2725 2717 2510 2682 2658 2581 2713 2608 2619 2586

v2.6.26:
Throughput 160.759 MB/sec 2 clients 2 procs max_latency=31.420 ms
2576 2492 2556 2517 2496 2473 2620 2464 2535 2494
2800 2297 2183 2634 2546 2579 2488 2455 2632 2540
2566 2540 2536 2496 2432 2453 2462 2568 2406 2522
2565 2620 2532 2416 2434 2452 2524 2440 2424 2412

v2.6.27:
Throughput 143.776 MB/sec 2 clients 2 procs ...
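Forty raw cycle counts per release are hard to compare by eye. The sketch below averages the default_wake_function() samples quoted above and expresses each release relative to v2.6.22. The sample values are copied from the message; v2.6.27 is left out because its samples are cut off in this archive, and the variable names are only illustrative.

```python
from statistics import mean

# sparc64 cycle counts of default_wake_function(), copied from the message above.
CYCLES = {
    "v2.6.22": """1636 1483 1552 1560 1534 1522 1472 1530 1518 1468
                  1534 1402 1468 1656 1383 1362 1516 1336 1392 1472
                  1652 1522 1486 1363 1430 1334 1382 1398 1448 1439
                  1662 1540 1526 1472 1539 1434 1452 1492 1502 1432""",
    "v2.6.23": """3435 3363 3165 3304 3401 3189 3280 3243 3156 3295
                  3439 3375 2950 2945 2727 3383 3560 3417 3221 3271
                  3595 3293 3323 3283 3267 3279 3343 3293 3203 3341
                  3413 3268 3107 3361 3245 3195 3079 3184 3405 3191""",
    "v2.6.24": """2136 1886 2030 1929 2021 1941 2009 2067 1895 2019
                  2072 1985 1992 1986 2031 2085 2014 2103 1825 1705
                  2018 2034 1921 2079 1901 1989 1976 2035 2053 1971
                  2144 2059 2025 2024 2029 1932 1980 1947 1956 2008""",
    "v2.6.25": """2551 2707 2674 2771 2641 2727 2647 2865 2800 2796
                  2793 2745 2609 2753 2674 2618 2671 2668 2641 2744
                  2727 2616 2897 2720 2682 2737 2551 2677 2687 2603
                  2725 2717 2510 2682 2658 2581 2713 2608 2619 2586""",
    "v2.6.26": """2576 2492 2556 2517 2496 2473 2620 2464 2535 2494
                  2800 2297 2183 2634 2546 2579 2488 2455 2632 2540
                  2566 2540 2536 2496 2432 2453 2462 2568 2406 2522
                  2565 2620 2532 2416 2434 2452 2524 2440 2424 2412""",
}

# Parse the whitespace-separated samples into lists of integers.
samples = {ver: [int(tok) for tok in text.split()] for ver, text in CYCLES.items()}
baseline = mean(samples["v2.6.22"])

for ver, vals in samples.items():
    avg = mean(vals)
    print(f"{ver}: n={len(vals)} mean={avg:.0f} cycles ({avg / baseline:.2f}x v2.6.22)")
```

On these samples the mean wakeup cost roughly doubles when CFS arrives in v2.6.23, falls back in v2.6.24, and rises again in v2.6.25 and v2.6.26 without returning to the v2.6.22 level.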
23->24 I can understand. In my testing, 23 CFS was not a wonderful Your numbers seem to ~agree with mine. And yeah, that hrtick is damned expensive. I didn't realize _how_ expensive until I trimmed my config way way down from distro. Just having highres timers enabled makes a very large difference here, even without hrtick enabled, and with the overhead of a disabled hrtick removed. -Mike --
I am currently looking at a very similar-looking issue. For the
public record, here are the numbers we have been able to come up with so
far (measured with dbench, so the absolute values are slightly different,
but they still show a similar pattern):
208.4 MB/sec -- vanilla 2.6.16.60
201.6 MB/sec -- vanilla 2.6.20.1
172.9 MB/sec -- vanilla 2.6.22.19
74.2 MB/sec -- vanilla 2.6.23
46.1 MB/sec -- vanilla 2.6.24.2
30.6 MB/sec -- vanilla 2.6.26.1
I.e. huge drop for 2.6.23 (this was with default configs for each
respective kernel).
2.6.23-rc1 shows 80.5 MB/s, i.e. a few % better than final 2.6.23, but
still pretty bad.
I have gone through the commits that went into -rc1 and tried to figure
out which one could be responsible. Here are the numbers:
85.3 MB/s for 2ba2d00363 (just before on-demand readahead has been merged)
82.7 MB/s for 45426812d6 (before cond_resched() has been added into page
invalidation code)
187.7 MB/s for c1e4fe711a4 (just before CFS scheduler has been merged)
So the current biggest suspect is CFS, but I don't have enough numbers yet
to be able to point a finger to it with 100% certainty. Hopefully soon.
Just my $0.02
--
Jiri Kosina
SUSE Labs
--
Hi,
High client count right?
I reproduced this on my Q6600 box. However, I also reproduced it with
2.6.22.19. What I think you're seeing is just dbench creating a massive
train wreck. With CFS, it appears to be more likely to start->end
_sustain_, but the wreckage is present in O(1) scheduler runs as well,
and will start->end sustain there as well.
2.6.22.19-smp            Throughput 967.933 MB/sec 16 procs   Throughput 147.879 MB/sec 160 procs
                         Throughput 950.325 MB/sec 16 procs   Throughput 349.959 MB/sec 160 procs
                         Throughput 953.382 MB/sec 16 procs   Throughput 126.821 MB/sec 160 procs <== massive jitter
2.6.22.19-cfs-v24.1-smp  Throughput 978.047 MB/sec 16 procs   Throughput 170.662 MB/sec 160 procs
                         Throughput 943.254 MB/sec 16 procs   Throughput 39.388 MB/sec 160 procs <== sustained train wreck
                         Throughput 934.042 MB/sec 16 procs   Throughput 239.574 MB/sec 160 procs
2.6.23.17-smp            Throughput 1173.97 MB/sec 16 procs   Throughput 100.996 MB/sec 160 procs
                         Throughput 1122.85 MB/sec 16 procs   Throughput 80.3747 MB/sec 160 procs
                         Throughput 1113.60 MB/sec 16 procs   Throughput 99.3723 MB/sec 160 procs
2.6.24.7-smp             Throughput 1030.34 MB/sec 16 procs   Throughput 256.419 MB/sec 160 procs
                         Throughput 970.602 MB/sec 16 procs   Throughput 257.008 MB/sec 160 procs
                         Throughput 1056.48 MB/sec 16 procs   Throughput 248.841 MB/sec 160 procs
2.6.25.19-smp            Throughput 955.874 MB/sec 16 procs   Throughput 40.5735 MB/sec 160 procs
                         Throughput 943.348 MB/sec 16 procs   Throughput 62.3966 MB/sec 160 procs
                         Throughput 937.595 MB/sec 16 procs   Throughput 17.4639 MB/sec 160 procs
2.6.26.7-smp             Throughput 904.564 MB/sec 16 procs   Throughput 118.364 MB/sec 160 procs
                         Throughput 891.824 MB/sec 16 procs   Throughput 34.2193 MB/sec 160 procs
...wasn't dbench one of those non-benchmarks that thrives on randomness and unfairness? Andrew said recently: "dbench is pretty chaotic and it could be that a good change causes dbench to get worse. That's happened plenty of times in the past." So I'm not inclined to worry too much about dbench in any way shape or form. --
Was this when we decreased the default value of Well. If there is a consistent change in dbench throughput, it is important that we at least understand the reasons for it. But we don't necessarily want to optimise for dbench throughput. --
Hi. Sorry, but such excuses do not deserve to be said. No matter how ugly, wrong, unusual or whatever else you might call some test, it shows a problem which has to be fixed. There is no 'dbench tune'; there is a fair number of problems, and dbench has already helped to narrow down and precisely locate at least several of them. The same regressions were also observed in other benchmarks, originally reported before I started this thread. -- Evgeniy Polyakov --
Not necessarily. There are times when we have made changes which we knew full well reduced dbench's throughput, because we believed them to You seem to be saying what I said. --
Hi Andrew. I suppose there were words about dbench not being a real-life test, so if it suddenly sucks, no one will care. Sigh, theorists... I'm not surprised there were no changes when I reported hrtimers to be the main guilty factor in my setup for dbench tests, and only when David showed that they also killed his sparks via wake_up(), something was done. Now this regression has even disappeared from the list. Good direction, we should always follow this. As a side note, is the hrtimer subsystem also used for the BH backend? I have not yet analyzed data about vanilla kernels only being able to accept clients at 20-30k accepts per second, while some other magical tree (not vanilla) around 2.6.18 was able to do that at 50k accepts per second. There are lots of CPUs, ram, bandwidth, which are effectively unused even behind a linux load balancer... -- Evgeniy Polyakov --
From: Evgeniy Polyakov <zbr@ioremap.net>

Yes, this situation was in my opinion a complete fucking joke. Someone like me shouldn't have to do all of the hard work for the scheduler folks in order for a bug like this to get seriously looked at.

Evgeniy's difficult work was effectively ignored except by other testers who could also see and reproduce the problem. No scheduler developer looked seriously into these reports other than to say "please try to reproduce with tip" (?!?!?!) I guess showing the developer the exact changeset(s) which add the regression isn't enough these days :-/

Did any scheduler developer try to run tbench ONCE and do even a tiny bit of analysis, like the kind I did? Answer honestly... Linus even asked you guys in the private thread to "please look into it". So, if none of you did, you should all be deeply ashamed of yourselves. People like me shouldn't have to do all of that work for you just to get something to happen.

Not until I went privately to Ingo and Linus with cycle counts and a full diagnosis (of every single release since 2.6.22, a whole 2 days of work for me) of the precise code eating up too many cycles and causing problems DID ANYTHING HAPPEN.

This is extremely and excruciatingly DISAPPOINTING and WRONG. We completely and absolutely suck if this is how we will handle any performance regression report. And although this case is specific to the scheduler, a lot of other areas handle well prepared bug reports similarly. So I'm not really picking on the scheduler folks, they just happen to be the current example :-)
--
yeah, that overhead was bad, and once it became clear that you had high-resolution timers enabled for your benchmarking runs (which is default-off and which is still rare for benchmarking runs - despite being a popular end-user feature) we immediately disabled the hrtick via this upstream commit:

  0c4b83d: sched: disable the hrtick for now

that commit is included in v2.6.28-rc1 so this particular issue should be resolved.

high-resolution timers are still default-disabled in the upstream kernel, so this never affected usual configs that folks keep benchmarking - it only affected those who decided they want higher resolution timers and more precise scheduling.

Anyway, the sched-hrtick is off now, and we won't turn it back on without making sure that it's really low cost in the hotpath.

Regarding tbench, a workload that context-switches in excess of 100,000 times per second is inevitably going to show scheduler overhead - so you'll get the best numbers if you eliminate all/most scheduler code from the hotpath. We are working on various patches to mitigate the cost some more - and your patches and feedback are welcome as well.

But it's a difficult call with no silver bullets. On one hand we have folks putting more and more stuff into the context-switching hotpath on the (mostly valid) point that the scheduler is a slowpath compared to most other things. On the other hand we've got folks doing high-context-switch ratio benchmarks and complaining about the overhead whenever something goes in that improves the quality of scheduling of a workload that does not context-switch as massively as tbench. It's a difficult balance and we cannot satisfy both camps.

Nevertheless, this is not a valid argument in favor of the hrtick overhead: that was clearly excessive overhead and we zapped it.

Ingo
--
From: Ingo Molnar <mingo@elte.hu> This I heavily disagree with. The scheduler should be so cheap that you cannot possibly notice that it is even there for a benchmark like tbench. If we now think it's ok that picking which task to run is more expensive than writing 64 bytes over a TCP socket and then blocking on a read, I'd like to stop using Linux. :-) That's "real work" and if the scheduler is more expensive than "real work" we lose. I do want to remind you of a thread you participated in, in April, where you complained about loopback TCP performance: http://marc.info/?l=linux-netdev&m=120696343707674&w=2 It might be fruitful for you to rerun your tests with CFS reverted (start with 2.6.22 and progressively run your benchmark on every We've always been proud of our scheduling overhead being extremely low, and you have to face the simple fact that starting in 2.6.23 it's been getting progressively more and more expensive. Consistently so. People even noticed it. --
Wow, indeed. I fired up an ext2 disk to take kjournald out of the picture (dunno, just a transient thought). Stock settings produced three perma-wrecks in a row. With it bumped to 50, three very considerably nicer results in a row appeared.

2.6.26.7-smp dirty_ratio = 10 (stock)
Throughput 36.3649 MB/sec 160 procs
Throughput 47.0787 MB/sec 160 procs
Throughput 88.2055 MB/sec 160 procs

2.6.26.7-smp dirty_ratio = 50
Throughput 1009.98 MB/sec 160 procs
Throughput 1101.57 MB/sec 160 procs
Throughput 943.205 MB/sec 160 procs

-Mike
--
2.6.28 gives 41.8 MB/s with /proc/sys/vm/dirty_ratio == 50. So small improvement, but still far far away from the throughput of pre-2.6.23 kernels. -- Jiri Kosina SUSE Labs --
How many clients?

dbench 160 -t 60

2.6.28-smp (git.today)
Throughput 331.718 MB/sec 160 procs (no logjam)
Throughput 309.85 MB/sec 160 procs (contains logjam)
Throughput 392.746 MB/sec 160 procs (contains logjam)

-Mike
--
Ok, so another important datapoint: with c1e4fe711a4 (just before CFS has been merged for 2.6.23), the dbench throughput measures 187.7 MB/s in our testing conditions (default config). With c31f2e8a42c4 (just after CFS has been merged for 2.6.23), the throughput measured by dbench is 82.3 MB/s. This is the huge drop we have been looking for. After this, the performance was still going down gradually, down to the ~45 MB/s we are measuring for 2.6.27. But the biggest drop (more than 50%) points directly to the CFS merge. -- Jiri Kosina SUSE Labs --
that is a well-known property of dbench: it rewards unfairness in IO, memory management and scheduling. The way to get the best possible dbench numbers in CPU-bound dbench runs, you have to throw away the scheduler completely, and do this instead: - first execute all requests of client 1 - then execute all requests of client 2 .... - execute all requests of client N the moment the clients are allowed to overlap, the moment their requests are executed more fairly, the dbench numbers drop. Ingo --
Rubbish. If you do that you'll not get enough I/O in parallel to schedule the disk well (not that most of our I/O schedulers are doing the job well, and the vm writeback threads then mess it up and the lack of Arjans Fairness isn't everything. Dbench is a fairly good tool for studying some real world workloads. If your fairness hurts throughput that much maybe your scheduler algorithm is just plain *wrong* as it isn't adapting to workload at all well. Alan --
Doesn't seem to be scheduler/fairness. 2.6.22.19 is O(1), and falls apart too, I posted the numbers and full dbench output yesterday. -Mike --
We'll need to look into this a little bit more I think. I have sent out some numbers too, and these indicate very clearly that there is more than 50% performance drop (measured by dbench) just after the very merge of CFS in 2.6.23-rc1 merge window. -- Jiri Kosina SUSE Labs --
Sure. Watching the per/sec output, every kernel I have sucks at high client count dbench, it's just a matter of how badly, and how long. BTW, the nice pretty 160 client numbers I posted yesterday for ext2 turned out to be because somebody adds _netdev mount option when I mount -a in order to mount my freshly hotplugged external drive (why? that ain't in my fstab). Without that switch, ext2 output is roughly as raggedy as ext3, and nowhere near the up to 1.4GB/sec I can get with dirty_ratio=50 + ext2 + (buy none, get one free) _netdev option. Free for the not asking option does nada for ext3. -Mike --
i've actually implemented that about a decade ago: i've tracked down what makes dbench tick, i've implemented the kernel heuristics for it to make dbench scale linearly with the number of clients - just to be the best dbench results come from systems that have enough RAM to cache the full working set, and a filesystem intelligent enough to not insert bogus IO serialization cycles (ext3 is not such a filesystem). The moment there's real IO it becomes harder to analyze but the same basic behavior remains: the more unfair the IO scheduler, the "better" dbench results we get. Ingo --
My test system has 8gb for 8 clients and its performance dropped by 30%. There is no IO load, since tbench uses only the network part while dbench itself uses only disk IO. What we see right now is that a usual network server which handles a mixed set of essentially small reads and writes from the socket from multiple (8) clients suddenly lost one third of
Right now there is no disk IO at all. Only quite usual network and process load. -- Evgeniy Polyakov --
From: Evgeniy Polyakov <zbr@ioremap.net> I think the hope is that by saying there isn't a problem enough times, it will become truth. :-) More seriously, Ingo, what in the world do we need to do in order to get you to start doing tbench runs and optimizing things (read as: fixing the regression you added)? I'm personally working on a test fibonacci heap implementation for the fair sched code, and I already did all of the cost analysis all the way back to the 2.6.22 pre-CFS days. But I'm NOT a scheduler developer, so it isn't my responsibility to do this crap for you. You added this regression, why do I have to get my hands dirty in order for there to be some hope that these regressions start to get fixed? --
I don't want to ruffle any feathers, but my box has a comment or two..

Has anyone looked at the numbers box emitted? Some of what I believe to be very interesting data-points may have been overlooked.

Here's a piece thereof again, for better or worse. One last post won't burn the last electron. If they don't agree with anyone else's numbers, that's ok, their numbers have meaning too, and speak for themselves.

Retest hrtick pain:

2.6.26.7-up virgin no highres timers enabled
ring-test - 1.155 us/cycle = 865 KHz 1.000
netperf - 130470.93 130771.00 129872.41 rr/s avg 130371.44 rr/s 1.000 (within jitter of previous tests)
tbench - 355.153 357.163 356.836 MB/sec avg 356.384 MB/sec 1.000

2.6.26.7-up virgin highres timers enabled, hrtick enabled
ring-test - 1.368 us/cycle = 730 KHz .843
netperf - 118959.08 118853.16 117761.42 rr/s avg 118524.55 rr/s .909
tbench - 340.999 338.655 340.005 MB/sec avg 339.886 MB/sec .953

OK, there's the hrtick regression in all its gory. Ouch, that hurt.

Remember those numbers, box muttered them again in 27 testing. These previously tested kernels don't even have highres timers enabled, so obviously hrtick is a non-issue for them.

2.6.26.6-up + clock + buddy + weight
ring-test - 1.234 us/cycle = 810 KHz .947 [cmp1]
netperf - 128026.62 128118.48 127973.54 rr/s avg 128039.54 rr/s .977
tbench - 342.011 345.307 343.535 MB/sec avg 343.617 MB/sec .964

2.6.26.6-up + clock + buddy + weight + revert_to_per_rq_vruntime + buddy_overhead
ring-test - 1.174 us/cycle = 851 KHz .995 [cmp2]
netperf - 133928.03 134265.41 134297.06 rr/s avg 134163.50 rr/s 1.024
tbench - 358.049 359.529 358.342 MB/sec avg 358.640 MB/sec 1.006

Note that I added all .27 additional scheduler overhead to .26, and then removed every last bit of it, ...
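Each row in the tables above is an average of three runs followed by a ratio against the virgin baseline. A minimal sketch of that bookkeeping, using the "Retest hrtick pain" numbers from the message, is shown below. The helper name `normalized` is hypothetical, and the ratios in the message appear to be truncated rather than rounded, so the last digit can differ.

```python
from statistics import mean


def normalized(runs, baseline_runs):
    """Return (average of runs, average relative to the baseline average)."""
    avg = mean(runs)
    return avg, avg / mean(baseline_runs)


# 2.6.26.7-up, numbers copied from the message above.
tbench_base = [355.153, 357.163, 356.836]        # no highres timers
tbench_hrtick = [340.999, 338.655, 340.005]      # highres timers + hrtick enabled
netperf_base = [130470.93, 130771.00, 129872.41]
netperf_hrtick = [118959.08, 118853.16, 117761.42]

for name, runs, base in (
    ("tbench", tbench_hrtick, tbench_base),
    ("netperf", netperf_hrtick, netperf_base),
):
    avg, ratio = normalized(runs, base)
    print(f"{name}: avg {avg:.3f}, {ratio:.4f} of baseline")
```

Read this way, enabling highres timers plus hrtick costs roughly 5% on tbench and roughly 9% on netperf on this box.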
thanks Mike for the _extensive_ testing and bug hunting session you've done in the past couple of weeks! All the relevant fixlets you found are now queued up properly in sched/urgent, correct? What's your gut feeling, is that remaining small regression scheduler or networking related? i'm cutting the ball in half and i'm passing over one half of it to the networking folks, because your numbers show _huge_ sensitivity in any scheduler micro-overhead detail is going to be a drop in the ocean, compared to such huge variations. We could change the scheduler to the old O(N) design of the 2.2 kernel and the impact of that would be a blip on the radar, compared to the overhead shown above. Ingo --
I don't know where it lives. I'm still looking, and the numbers are I strongly _suspect_ that the network folks have some things they could investigate, but given my utter failure at finding the smoking gun, I can't say one way or the other. IMHO, sharing with network folks would likely turn out to be a fair thing to do. Am I waffling? Me? You bet your a$$! My clock is already squeaky clean thank you very much :-) What I can say is that my box is quite certain that there are influences outside the scheduler which have more influence on benchmark results than the scheduler does through the life of testing. -Mike --
okay, that's an important observation. Ingo --
Hm. _Maybe_ someone needs to take a look at c7aceab. I took it to a 26 test tree yesterday, and it lowered my throughput, though I didn't repeat a lot, was too busy. I just backed it out of one of my 27 test trees, and the netperf number is 1.030, tbench is 1.040. I'll test this in virgin source later, but thought I should drop a note, so perhaps someone interested in this thread can confirm/deny loss. -Mike --
Bah, too much testing, must have done something stupid. -Mike --
Hi. Cooled down? Let's start the sched vs network fight over. Sorry for interrupting your conversation, Ingo, but before throwing a stone one should be clear himself, shouldn't he? When you asked me to test the -tip tree and it was shown to regress by about 20 MB/s in my test compared to the last tweaks you suggested, -tip was still merged and no work on this issue was performed. Now the previous tweaks (nohrticks and nobalance; although hrticks are now disabled, performance did not return to that of vanilla .27 with tweaks) do not help anymore, so apparently there is an additional problem.

So for the reference:

vanilla 27      : 347.222
no TSO/GSO      : 357.331
no hrticks      : 382.983
no balance      : 389.802
4403b4 commit   : 361.184
dirty_ratio-50  : 361.086
no-sched-tweaks : 361.367

So the scheduler _does_ regress, even right now while this thread is being discussed.

Now let's return to the network. Ilpo Järvinen showed a nasty modulo operation in the fast path, which David is thinking about how to resolve, but it happens that this change was introduced back in 2005, and although a naive change allows performance to increase up to 370 MB/s, i.e. it gained us 2.5%, this was never accounted for in previous changes.

So, probably, if we revert the -tip merge to vanilla .27, add the nohrtick patch and nobalance tweak _only_, and apply the naive TSO patch, we could bring the system to 400 MB/s. Note that .22 has 479.82 and .23 454.36 MB/s. -- Evgeniy Polyakov --
And now I have to admit that the very last -tip merge did noticeably improve the situation, up to 391.331 MB/s (189 in domains, with tso/gso off and naive tcp_tso_should_defer() change). So we are now essentially at the level of 24-25 trees in my tests. -- Evgeniy Polyakov --
That's good, and it makes me think whether it would be a good idea to add some performance numbers to each pull request that affects the core part of the kernel. Ciao, -- Paolo http://paolo.ciarrocchi.googlepages.com/ --
You can get good dbench results from dbench on tmpfs, which exercises the vm, vfs, scheduler etc. without IO or filesystems. --
Yeah, I was just curious. The switch rate of dbench isn't high enough for math to be an issue, so I wondered how the heck CFS could be such a huge problem for this load. Looks to me like all the math in the _world_ couldn't hurt.. or help. -Mike --
From: Mike Galbraith <efault@gmx.de> I understand, this is what happened to me when I tried to look into the gradual tbench regressions since 2.6.22 I guess the only way to attack these things is to analyze the code and make some debugging hacks to get some measurements and numbers. --
That's exactly what I've been trying to look into, but combined with netperf. The thing is an incredibly twisted maze of _this_ affects _that_... sometimes involving magic and/or mythical creatures. Very very annoying. -Mike --
I cannot guarantee it will help, but the global -T option to pin netperf or netserver to a specific CPU might help cut down the variables. FWIW netperf top of trunk omni tests can now also determine and report the state of SELinux. They also have code to accept or generate their own RFC4122-esque UUID. Define some canonical tests and then ever closer to just needing some database-fu and automagic testing I suppose... things I do not presently possess but am curious enough to follow some pointers. happy benchmarking, rick jones --
Yup, and how. Early on, the other variables drove me bat-shit frigging _nuts_. I eventually selected a UP config to test _because_ those other Not really, but I can't seem to give up ;-) -Mike --
http://www.netperf.org/svn/netperf2/trunk/src/netsec_linux.c Pointers to programmatic detection of AppArmor and a couple salient details about firewall (enabled, perhaps number of rules) from any Plot thickening, seems that autotest knows about some version of netperf2 already... i'll be trying to see if there is some benefit to autotest to netperf2's top of trunk having the keyval output format, and then I guess I'll close with successful benchmarking, if not necessarily happy :) rick jones --
There ya go, happy benchmarking is when they tell you what you want to hear. Successful is when you learn something. -Mike (not happy, but learning) --
For the reference, just pulled git tree (4403b4 commit): 361.184
and with dirty_ratio set to 50: 361.086
without scheduler domain tuning things are essentially the same: 361.367

So, things are getting worse with time, and previous tunes do not help anymore. -- Evgeniy Polyakov --
That's the picture, how we go on my hardware: 4 (2 physical, 2 logical hyper-threaded) 32-bit xeons 8gb of ram. We probably can be a little bit better for -rc1 kernel though, if I enable only 4gb via config. Better one time to see than 1000 times to read. One can scare children with our graphs... Picture attached. -- Evgeniy Polyakov
Has anyone looked into the impact of port randomization on this benchmark?
If it is generating lots of sockets quickly there could be an impact:
* port randomization causes available port space to get filled non-uniformly
and what was once a linear scan may have to walk over existing ports.
(This could be improved by a hint bitmap)
* port randomization adds at least one modulus operation per socket
creation. This could be optimized by using a loop instead.
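
As a rough illustration of the second point, a modulus can be replaced by a subtraction loop when the value is known to exceed the range by only a small factor. This is only a sketch under that assumption: `wrap_into_range()` and its parameters are hypothetical names chosen for illustration, not the kernel's port-selection code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): reduce value into [0, range)
 * by repeated subtraction instead of the '%' operator.  This only pays
 * off when value is at most a few multiples of range, e.g. when it is
 * the sum of a rover already inside the range and a bounded offset.
 */
static inline uint32_t wrap_into_range(uint32_t value, uint32_t range)
{
	assert(range > 0);	/* an empty port range makes no sense */

	while (value >= range)
		value -= range;

	return value;
}
```

For an arbitrary 32-bit random value the loop could run many times, so in that case the plain modulus would likely remain the better choice.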
--
In this benchmark only two sockets are created per client for the whole run, so this should not have any impact on performance. -- Evgeniy Polyakov --
tbench sets up one socket per client, then sends/receives lots of messages on this socket. Connection setup time can be ignored for the tbench regression analysis. --
Hum, re-reading your question, I feel you might have a valid point after all :) Not because of connection setup time, but because of the rwlocks used on the tcp hash table. tcp sessions used on this tbench test might now be on the same cache lines, because of port randomization or so. CPUS might do cache-line ping pongs on those rwlocks.

# netstat -tn|grep 7003
tcp        0     59 127.0.0.1:37248   127.0.0.1:7003    ESTABLISHED
tcp        0     71 127.0.0.1:7003    127.0.0.1:37252   ESTABLISHED
tcp        0      0 127.0.0.1:37251   127.0.0.1:7003    ESTABLISHED
tcp        0   4155 127.0.0.1:7003    127.0.0.1:37249   ESTABLISHED
tcp        0     55 127.0.0.1:7003    127.0.0.1:37248   ESTABLISHED
tcp        0      0 127.0.0.1:37252   127.0.0.1:7003    ESTABLISHED
tcp        0      0 127.0.0.1:37249   127.0.0.1:7003    ESTABLISHED
tcp        0     59 127.0.0.1:37246   127.0.0.1:7003    ESTABLISHED
tcp        0      0 127.0.0.1:37250   127.0.0.1:7003    ESTABLISHED
tcp       71      0 127.0.0.1:37245   127.0.0.1:7003    ESTABLISHED
tcp        0      0 127.0.0.1:37244   127.0.0.1:7003    ESTABLISHED
tcp        0     87 127.0.0.1:7003    127.0.0.1:37250   ESTABLISHED
tcp        0   4155 127.0.0.1:7003    127.0.0.1:37251   ESTABLISHED
tcp        0   4155 127.0.0.1:7003    127.0.0.1:37246   ESTABLISHED
tcp        0     71 127.0.0.1:7003    127.0.0.1:37245   ESTABLISHED
tcp        0   4155 127.0.0.1:7003    127.0.0.1:37244   ESTABLISHED

We use a jhash, so normally we could expect a really random split of hash values for all these sessions, but it would be worth checking :) You now understand why we want to avoid those rwlocks, Stephen, and switch to RCU... --
I did something with AIM9's tcp_test recently (1-2 days ago depending on how one calculates that, so I haven't yet had time to summarize the details in the AIM9 thread) by deterministically binding in userspace, and got much more sensible numbers than with randomized ports (2-4%/5-7% vs 25% variation; there is some difference in variation between kernel versions even with deterministic binding). Also, I have yet to actually oprofile and bisect the remaining ~4% regression (around 20% was reported by Christoph). For oprofiling I might have to change aim9 to do a predefined number of loops instead of a deadline, to get a more consistent view of changes in per-function runtime. AIM9 is one process only, so the scheduler has a bit less to do in that benchmark anyway. It would probably be nice to test just the port randomizer separately to see if there's some regression in that, but I don't expect it to happen any time soon unless I quickly come up with something in the bisection. -- i. --
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Yes, it looks like port selection cache and locking effects are a very real issue. Good find. --
On Fri, 31 Oct 2008, David Miller wrote: > From: "Ilpo J
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Not locks or ping-pongs perhaps, I guess. So it just sends and receives over a socket, implementing both ends of the communication in the same process? If hash chain conflicts do happen for those 2 sockets, just traversing the chain 2 entries deep could show up. --
On Fri, 31 Oct 2008, David Miller wrote: > From: "Ilpo J
tbench is very sensitive to cache line ping-pongs (on SMP machines of course)

Just to prove my point, I coded the following patch and tried it on a HP BL460c G1. This machine has 2 quad core cpus (Intel(R) Xeon(R) CPU E5450 @3.00GHz)

tbench 8 went from 2240 MB/s to 2310 MB/s after this patch applied

[PATCH] net: Introduce netif_set_last_rx() helper

On SMP machine, loopback device (and possibly other net devices) should try to avoid dirtying the memory cache line containing "last_rx" field.

Got 3% increase on tbench on a 8 cpus machine.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 drivers/net/loopback.c    |    2 +-
 include/linux/netdevice.h |   16 ++++++++++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)
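
The patch body is not reproduced in this message, so the following is only a userspace sketch of the general technique the changelog describes: read the field first and skip the store when the value is unchanged, so a cache line shared between CPUs is not dirtied on every packet. `struct sketch_net_device` and `set_last_rx_if_changed()` are hypothetical names, and the timestamp is passed in as a plain argument rather than read from jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one field of interest. */
struct sketch_net_device {
	unsigned long last_rx;	/* timestamp of the last received packet */
};

/*
 * Store 'now' only if it differs from the current value.  A read of an
 * unchanged line can stay shared in every CPU's cache, whereas an
 * unconditional write forces the line to bounce between CPUs.
 * Returns true if a store was actually performed.
 */
static inline bool set_last_rx_if_changed(struct sketch_net_device *dev,
					  unsigned long now)
{
	assert(dev != NULL);

	if (dev->last_rx == now)
		return false;	/* same tick: leave the cache line clean */

	dev->last_rx = now;
	return true;
}
```

With a tick-granularity timestamp and a high packet rate, most calls would take the early return, which is where any gain would come from.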
On Fri, 31 Oct 2008, Eric Dumazet wrote: > David Miller a
Well, before you added AIM9 on this topic, we were focusing on tbench :) Sorry to disappoint you :) --
On Fri, 31 Oct 2008, Eric Dumazet wrote: > Ilpo J
On Fri, 31 Oct 2008 11:45:33 +0100 Why bother with last_rx at all on loopback? I have been thinking we should figure out a way to get rid of last_rx altogether. It only seems to be used by bonding, and the bonding driver could do the calculation in its receive handling. --
Not related to the regression: the bug will just be papered over by these changes. Having bonding on loopback is a somewhat strange idea, but still this kind of change is an attempt to make a good play in a bad game: this loopback-only optimization does not fix the problem. -- Evgeniy Polyakov --
Just to be clear, this change was not meant to be committed. It already was rejected by David some years ago (2005, and 2006) http://www.mail-archive.com/netdev@vger.kernel.org/msg07382.html If you read my mail, I was *only* saying that tbench results can be sensitive to cache line ping pongs. tbench is a crazy benchmark, and only is a crazy benchmark. Optimizing linux for tbench sake would be .... crazy ? --
Hi Eric. No problem Eric, I just pointed out that this particular case is rather fluffy and does not really fix anything. It improves the case, but the way it does it is not the right one imho. We would definitely want to eliminate assignment of global, constantly updated variables in the paths where it is not required, but in a way which improves the design and implementation, not one which hides some other problem. Tbench is, well, as it is, quite a usual network server :) The dbench side is rather non-optimized, but still it is quite a common pattern of small-sized IO. Anyway, optimizing for one kind of workload tends to force the other side to become slower, so I agree of course that any narrow-viewed optimizations are bad, and instead we should focus on searching for more widespread error patterns. -- Evgeniy Polyakov --
From: Eric Dumazet <dada1@cosmosbay.com> However, I do like Stephen's suggestion that maybe we can get rid of this ->last_rx thing by encapsulating the logic completely in the Unlike dbench I think tbench is worth cranking up as much as possible. It doesn't have a huge memory working set, it just writes mostly small messages over a TCP socket back and forth, and does a lot of blocking And I think we'd like all of those operations to run as fast as possible. When Tridge first wrote tbench I would see the expected things at the top of the profiles. Things like tcp_ack(), copy to/from user, and perhaps SLAB. Things have changed considerably. --
On Fri, 31 Oct 2008 16:51:44 -0700 (PDT) Since the bonding driver doesn't actually see the rx packets, that isn't really possible. But it would be possible to change last_rx from a variable to a function pointer, so that devices could apply other logic to derive the last value. One example would be to keep it per cpu and then take the maximum. --
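
The per-cpu variant mentioned at the end could look roughly like the sketch below. It is a userspace illustration only: `struct sketch_dev`, `note_rx()` and `latest_rx()` are hypothetical names, the CPU count and the 64-byte cache line size are assumptions, and counter wraparound (which real jiffies comparisons must handle) is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS	8	/* assumed CPU count for the sketch */
#define SKETCH_CACHELINE 64	/* assumed cache line size in bytes */

/* One timestamp per CPU, each padded out to its own cache line. */
struct sketch_percpu_stamp {
	_Alignas(SKETCH_CACHELINE) unsigned long last_rx;
};

struct sketch_dev {
	struct sketch_percpu_stamp rx[SKETCH_NR_CPUS];
};

/* Fast path: each CPU writes only its own slot, so no line is shared. */
static inline void note_rx(struct sketch_dev *dev, unsigned int cpu,
			   unsigned long now)
{
	assert(cpu < SKETCH_NR_CPUS);
	dev->rx[cpu].last_rx = now;
}

/* Slow path (e.g. a monitor): derive the device-wide value on demand. */
static inline unsigned long latest_rx(const struct sketch_dev *dev)
{
	unsigned long max = 0;
	size_t i;

	for (i = 0; i < SKETCH_NR_CPUS; i++)
		if (dev->rx[i].last_rx > max)
			max = dev->rx[i].last_rx;

	return max;
}
```

The trade-off is memory (one cache line per CPU per device) and a more expensive read, in exchange for a receive path that never writes to a line another CPU also writes.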
I suspect it could also be tucked away in skb_bond_should_drop,
which is called both by the standard input path and the VLAN accelerated
path to see if the packet should be tossed (e.g., it arrived on an
inactive bonding slave).
Since last_rx is part of struct net_device, I don't think any
additional bonding internals knowledge would be needed. It could be
arranged to only update last_rx for devices that are actually bonding
slaves.
Just off the top of my head (haven't tested this), something
like this:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c8bcb59..ed1e58f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1743,22 +1743,24 @@ static inline int skb_bond_should_drop(struct sk_buff *skb)
 	struct net_device *dev = skb->dev;
 	struct net_device *master = dev->master;
-	if (master &&
-	    (dev->priv_flags & IFF_SLAVE_INACTIVE)) {
-		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
-		    skb->protocol == __constant_htons(ETH_P_ARP))
-			return 0;
-
-		if (master->priv_flags & IFF_MASTER_ALB) {
-			if (skb->pkt_type != PACKET_BROADCAST &&
-			    skb->pkt_type != PACKET_MULTICAST)
+	if (master) {
+		dev->last_rx = jiffies;
+		if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
+			if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
+			    skb->protocol == __constant_htons(ETH_P_ARP))
 				return 0;
-		}
-		if (master->priv_flags & IFF_MASTER_8023AD &&
-		    skb->protocol == __constant_htons(ETH_P_SLOW))
-			return 0;
-		return 1;
+			if (master->priv_flags & IFF_MASTER_ALB) {
+				if (skb->pkt_type != PACKET_BROADCAST &&
+				    skb->pkt_type != PACKET_MULTICAST)
+					return 0;
+			}
+			if (master->priv_flags & IFF_MASTER_8023AD &&
+			    skb->protocol == __constant_htons(ETH_P_SLOW))
+				return 0;
+
+			return 1;
+		}
 	}
 	return 0;
 }
That doesn't move the storage out of struct net_device, but it
does stop the updates for devices that aren't bonding slaves. It could
probably be refined ...

From: Jay Vosburgh <fubar@us.ibm.com> I like this very much. Jay, can you give this a quick test by just trying this patch and removing the ->last_rx setting in the driver you use for your test? Once you do that, I'll apply this to net-next-2.6 and do the leg work to zap all of the ->last_rx updates from the entire tree. Thanks! --
The only user of the net_device->last_rx field is bonding. This
patch adds a conditional update of last_rx to the bonding special logic
in skb_bond_should_drop, causing last_rx to only be updated when the ARP
monitor is running.
This frees network device drivers from the necessity of updating
last_rx, which can have cache line thrash issues.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56c823c..39575d7 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4564,6 +4564,8 @@ static int bond_init(struct net_device *bond_dev, struct bond_params *params)
 	bond_dev->tx_queue_len = 0;
 	bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
 	bond_dev->priv_flags |= IFF_BONDING;
+	if (bond->params.arp_interval)
+		bond_dev->priv_flags |= IFF_MASTER_ARPMON;
 	/* At first, we block adding VLANs. That's the only way to
 	 * prevent problems that occur when adding VLANs over an
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 296a865..e400d7d 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -620,6 +620,8 @@ static ssize_t bonding_store_arp_interval(struct device *d,
 	       ": %s: Setting ARP monitoring interval to %d.\n",
 	       bond->dev->name, new_value);
 	bond->params.arp_interval = new_value;
+	if (bond->params.arp_interval)
+		bond->dev->priv_flags |= IFF_MASTER_ARPMON;
 	if (bond->params.miimon) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: ARP monitoring cannot be used with MII monitoring. "
@@ -1039,6 +1041,7 @@ static ssize_t bonding_store_miimon(struct device *d,
 			       "ARP monitoring. Disabling ARP monitoring...\n",
 			       bond->dev->name);
 			bond->params.arp_interval = 0;
+			bond->dev->priv_flags &= ~IFF_MASTER_ARPMON;
 			if (bond->params.arp_validate) {
 				bond_unregister_arp(bond);
 				bond->params.arp_validate =
diff --git a/include/linux/if.h ...

From: Jay Vosburgh <fubar@us.ibm.com> Applied, thanks a lot Jay. --
a7be37a adds some math overhead, calls to calc_delta_mine() per wakeup/context switch for all weight tasks, whereas previously these calls were only made for tasks which were not nice 0. It also shifts performance a bit in favor of loads which dislike wakeup preemption, this effect lessens as task count increases. Per testing, overhead is not the primary factor in throughput loss. I believe clock accuracy to be a more important factor than overhead by a very large margin.

Reverting a7be37a (and the two asym fixes) didn't do a whole lot for me either. I'm still ~8% down from 2.6.26 for netperf, and ~3% for tbench, and the 2.6.26 numbers are gcc-4.1, which are a little lower than gcc-4.3. Along the way, I've reverted 100% of scheduler and ilk 26->27 and been unable to recover throughput. (Too bad I didn't know about that TSO/GSO thingy, would have been nice.)

I can achieve nearly the same improvement for tbench with a little tinker, and _more_ for netperf than reverting these changes delivers, see last log entry, experiment cut math overhead by less than 1/3.

For the full cfs history, even with those three reverts, I'm ~6% down on

I have highres timers disabled in my kernels because per testing it does cost a lot at high frequency, but primarily because it's not available throughout test group, same for nohz. A patchlet went into 2.6.27 to neutralize the cost of hrtick when it's not active.

Per re-test, I lost some at 24, got it back at 25 etc. Some of it is fairness / preemption differences, but there's a bunch I can't find, and massive amounts of time spent bisecting were a waste of time.

My annotated test log. File under fwiw.

Note: 2.6.23 cfs was apparently a bad-hair day for high frequency switchers. Anyone entering the way-back-machine to test 2.6.23 should probably use cfs-24.1, which is the 2.6.24 scheduler minus one zero-impact-for-nice-0-loads line.

-------------------------------------------------------------------------
UP config, no nohz ...
Hi Mike. In my tests it was not just overhead, it was a disaster. And stopping just before this commit gained 20 MB/s out of the 30 MB/s loss for the 26-27 window. No matter what accuracy it brings, it is just wrong to assume that such a performance drop in some workloads is justified.

Well, yes, disabling it should bring performance back, but since they are actually enabled everywhere and trick with debugfs is not widely

Yup, but since I slacked with bits of beer after POHMELFS release I did

UP actually may expect the difference in our results: I have 4-way (2 physical and 2 logical (HT enabled) CPUs) 32-bit old Xeons with highmem enabled. I also tried low-latency preemption and no preemption (server) without much difference. -- Evgeniy Polyakov --
a7be37a's purpose is group scheduling, where it provides a means to
calculate things in a uniform metric.
If you take the following scenario:
        R
       /|\
      A 1 B
     /|\  |
    2 3 4 5
Where letters denote supertasks/groups and digits are tasks.
We used to look at a single level only, so if you want to compute a
task's ideal runtime, you'd take:
runtime_i = period w_i / \Sum_i w_i
So, in the above example, assuming all entries have an equal weight,
we'd want to run A for p/3. But then we'd also want to run 2 for p/3.
IOW, all of A's tasks together would run in p time.
Which is contrary to the expectation that all tasks in the scenario
would run in p.
So what the patch does is change the calculation to:
period \Prod_l w_l,i / \Sum_i w_l,i
Which would, for 2 end up being: p 1/3 1/3 = p/9.
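
To make the product form concrete, here is a small userspace sketch that walks from a task up through its groups and multiplies the period by each level's weight fraction. It is not the scheduler's code: `struct entity`, its fields and `nested_slice()` are hypothetical names, and floating point is used purely for readability.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduling entity (task or group). */
struct entity {
	double weight;			/* this entity's own weight */
	double level_weight_sum;	/* sum of weights on its level, itself included */
	const struct entity *parent;	/* enclosing group, NULL directly under the root */
};

/*
 * period * Prod_l (w_l,i / Sum_i w_l,i): apply one weight fraction per
 * level, starting at the task and ending at the level below the root.
 */
static inline double nested_slice(const struct entity *se, double period)
{
	double slice = period;

	for (; se != NULL; se = se->parent) {
		assert(se->level_weight_sum > 0.0);
		slice = slice * se->weight / se->level_weight_sum;
	}

	return slice;
}
```

For the tree above with equal weights and p = 9, task 2 (parent A, three siblings on each level) gets 9 * 1/3 * 1/3 = 1, i.e. p/9, while tasks 1 and 5 each get 3; the five tasks then add up to 3 * 1 + 3 + 3 = 9 = p, which is the expectation stated earlier.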
Now the thing that causes the extra math in the !group case is that for
the single level case, we can avoid doing that division by the sum,
because that is equal for all tasks (we then compensate for it at some
other place).
However, for the nested case, we cannot do that.
That said, we can probably still avoid the division for the top level
stuff, because the sum of the top level weights is still invariant
between all tasks.
I'll have a stab at doing so... I initially didn't do this because my
first try gave some real ugly code, but we'll see - these numbers are a
very convincing reason to try again.
--
...but the numbers I get on Q6600 don't pin the tail on the math donkey.

Update to UP test log.

2.6.27-final-up
ring-test - 1.193 us/cycle = 838 KHz (gcc-4.3)
tbench    - 337.377 MB/sec    tso/gso on
tbench    - 340.362 MB/sec    tso/gso off
netperf   - 120751.30 rr/s    tso/gso on
netperf   - 121293.48 rr/s    tso/gso off

2.6.27-final-up patches/revert_weight_and_asym_stuff.diff
ring-test - 1.133 us/cycle = 882 KHz (gcc-4.3)
tbench    - 340.481 MB/sec    tso/gso on
tbench    - 343.472 MB/sec    tso/gso off
netperf   - 119486.14 rr/s    tso/gso on
netperf   - 121035.56 rr/s    tso/gso off

2.6.28-up
ring-test - 1.149 us/cycle = 870 KHz (gcc-4.3)
tbench    - 343.681 MB/sec    tso/gso off
netperf   - 122812.54 rr/s    tso/gso off

My SMP log, updated to account for TSO/GSO monkey-wrench. (<bleep> truckload of time <bleep> wasted chasing unbisectable <bleepity-bleep> tso gizmo. <bleep!>)

SMP config, same as UP kernels tested, except SMP.

tbench -t 60 4 localhost followed by four 60 sec netperf TCP_RR pairs, each pair on its own core of my Q6600.

2.6.22.19
Throughput 1250.73 MB/sec 4 procs              1.00
16384  87380  1  1  60.01  111272.55           1.00
16384  87380  1  1  60.00  104689.58
16384  87380  1  1  60.00  110733.05
16384  87380  1  1  60.00  110748.88

2.6.22.19-cfs-v24.1
Throughput 1213.21 MB/sec 4 procs              .970
16384  87380  1  1  60.01  108569.27           .992
16384  87380  1  1  60.01  108541.04
16384  87380  1  1  60.00  108579.63
16384  87380  1  1  60.01  108519.09

2.6.23.17
Throughput 1200.46 MB/sec 4 procs              .959
16384  87380  1  1  60.01   95987.66           .866
16384  87380  1  1  60.01   92819.98
16384  87380  1  1  60.01   95454.00
16384  87380  1  1 ...
Since I showed the rest of my numbers, I may as well show freshly
generated oltp numbers too. Chart attached. 2.6.27.rev is 2.6.27 with
weight/asym changes reverted.
Data:
read/write requests/sec per client count
                      1      2      4      8     16     32     64    128    256
2.6.26.6.mysql     7978  19856  37238  36652  34399  33054  31608  27983  23411
2.6.27.mysql       9618  18329  37128  36504  33590  31846  30719  27685  21299
2.6.27.rev.mysql  10944  19544  37349  36582  33793  31744  29161  25719  21026
2.6.28.git.mysql   9518  18031  30418  33571  33330  32797  31353  29139  25793
2.6.26.6.pgsql    14165  27516  53883  53679  51960  49694  44377  35361  32879
2.6.27.pgsql      14146  27519  53797  53739  52850  47633  39976  30552  28741
2.6.27.rev.pgsql  14168  27561  53973  54043  53150  47900  39906  31987  28034
2.6.28.git.pgsql  14404  28318  55124  55010  55002  54890  53745  53519  52215
P.S. all knobs stock, TSO/GSO off. --
