Re: [tbench regression fixes]: digging out smelly deadmen.

From: Evgeniy Polyakov
Date: Thursday, October 9, 2008 - 4:17 pm

Hi.

It was reported recently that tbench has a long history of regressions,
starting at least from the 2.6.23 kernel. I verified that in my test
environment tbench 'lost' more than 100 MB/s, from 470 down to 355,
between at least 2.6.24 and 2.6.27. The 2.6.26-2.6.27 performance regression
on my machines roughly corresponds to a drop from 375 to 355 MB/s.

I spent several days on various tests and bisections (unfortunately
bisect cannot always point to the 'right' commit), and found the following
problems.

First, related to the network, as lots of people expected: TSO/GSO over
loopback with the tbench workload eats about 5-10 MB/s, since the TSO/GSO frame
creation overhead is not paid back by the optimized super-frame processing
gains. Since it brings a really impressive improvement in big-packet
workloads, it was (likely) decided not to add a patch for this; instead
one can disable TSO/GSO via ethtool. This patch was added in the
2.6.27 window, so it has its part in that regression.

The second part of the 26-27 window regression (I remind you, it is about 20
MB/s) is related to the scheduler changes, which was expected by another
group of people. I tracked it down to commit
a7be37ac8e1565e00880531f4e2aff421a21c803, which, if
reverted, returns 2.6.27 tbench performance to the highest (for
2.6.26-2.6.27) 365 MB/s mark. I also tested the tree stopped at the above
commit itself, i.e. not 2.6.27, and got 373 MB/s, so likely other
changes in that merge ate a couple of megs. Attached patch against 2.6.27.

A curious reader can ask, where did we lose another 100 MB/s? This small
issue was not detected (or at least not reported on netdev@ with a provocative
enough subject), and it happened to live somewhere in the 2.6.24-2.6.25 changes.
I was so lucky to 'guess' (just after a couple of hundred compilations)
that it corresponds to commit 8f4d37ec073c17e2d4aa8851df5837d798606d6f about
high-resolution timers; the attached patch against 2.6.25 brings tbench
performance for the 2.6.25 kernel tree to 455 MB/s.

There are still somewhat ...
From: Peter Zijlstra
Date: Thursday, October 9, 2008 - 10:40 pm

can you try

echo NO_HRTICK > /debug/sched_features

on .27 like kernels?

Also, what clocksource do those machines use?

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

As for a7be37ac8e1565e00880531f4e2aff421a21c803, could you try
tip/master? I reworked some of the wakeup preemption code in there.

Thanks for looking into this issue!

--
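
The two checks requested above can be scripted. The following is a minimal sketch, assuming debugfs is mounted at /debug as in the message, and assuming the sched_features file lists disabled features with a NO_ prefix (the same syntax used when writing to it). The helper names are illustrative only.

```python
from pathlib import Path

# Paths as quoted in the message; /debug assumes debugfs is mounted there.
SCHED_FEATURES = Path("/debug/sched_features")
CLOCKSOURCE = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def feature_enabled(features_text: str, name: str) -> bool:
    """Return True if `name` appears without a NO_ prefix, False if disabled."""
    tokens = features_text.split()
    if name in tokens:
        return True
    if "NO_" + name in tokens:
        return False
    raise KeyError(name)


def report() -> None:
    """Print the raw feature list and the active clocksource, if readable."""
    for label, path in (("sched_features", SCHED_FEATURES),
                        ("clocksource", CLOCKSOURCE)):
        try:
            print(label, "=", path.read_text().strip())
        except OSError as exc:
            print(label, "unreadable:", exc)
```

Calling report() on the test machine prints both values; feature_enabled(text, "HRTICK") then tells whether the hrtick was really off for a given run.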

From: Evgeniy Polyakov
Date: Friday, October 10, 2008 - 1:09 am

Hi Peter.

I've enabled kernel hacking option and scheduler debugging and turned
off hrticks and performance jumped to 382 MB/s:

vanilla 27: 347.222
no TSO/GSO: 357.331
no hrticks: 382.983

I use the tsc clocksource; acpi_pm and jiffies are also available.
With acpi_pm performance is even lower (I stopped the test after it dropped
below the 340 MB/s mark), and jiffies does not work at all: it looks like sockets
get stuck in the time_wait state when this clock source is used, although that
may be some different issue.

So I think hrticks are guilty, but the result is still not as good as the .25 tree
without the mentioned changes (455 MB/s) and .24 (475 MB/s).

-- 
	Evgeniy Polyakov
--

From: Ingo Molnar
Date: Friday, October 10, 2008 - 2:15 am

hi Evgeniy,


i'm glad that you are looking into this! That is an SMP box, right? If 
yes then could you try this sched-domains tuning utility i have written 
yesterday (incidentally):

  http://redhat.com/~mingo/cfs-scheduler/tune-sched-domains

just run it without options to see the current sched-domains options. On 
a testsystem i have it displays this:

# tune-sched-domains
usage: tune-sched-domains <val>
current val on cpu0/domain0:
SD flag: 47
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
+   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
-  16: SD_WAKE_IDLE:             Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64: SD_WAKE_BALANCE:          Perform balancing at task wakeup

then could you check what effects it has if you turn off 
SD_BALANCE_NEWIDLE? On my box i did it via:

# tune-sched-domains $[47-2]
changed /proc/sys/kernel/sched_domain/cpu0/domain0/flags: 47 => 45
SD flag: 45
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
-   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
-  16: SD_WAKE_IDLE:             Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64: SD_WAKE_BALANCE:          Perform balancing at task wakeup
changed /proc/sys/kernel/sched_domain/cpu0/domain1/flags: 1101 => 45
SD flag: 45
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
-   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
-  16: SD_WAKE_IDLE:             Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64: SD_WAKE_BALANCE:          ...
From: Evgeniy Polyakov
Date: Friday, October 10, 2008 - 4:31 am

Hi Ingo.


I've removed SD_BALANCE_NEWIDLE:
# ./tune-sched-domains $[191-2]
changed /proc/sys/kernel/sched_domain/cpu0/domain0/flags: 191 => 189
SD flag: 189
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
-   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
+  16: SD_WAKE_IDLE:             Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64: SD_WAKE_BALANCE:          Perform balancing at task wakeup
+ 128: SD_SHARE_CPUPOWER:        Domain members share cpu power
changed /proc/sys/kernel/sched_domain/cpu0/domain1/flags: 47 => 189
SD flag: 189
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
-   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
+  16: SD_WAKE_IDLE:             Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64: SD_WAKE_BALANCE:          Perform balancing at task wakeup
+ 128: SD_SHARE_CPUPOWER:        Domain members share cpu power

And got a noticeable improvement (each new line has fixes from previous):

vanilla 27: 347.222
no TSO/GSO: 357.331
no hrticks: 382.983

Ok, I've started to pull it down, I will reply back when things are
ready.

-- 
	Evgeniy Polyakov
--
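
The flag values in the tune-sched-domains output are plain bitmasks, so 47 and 191 can be decoded by hand or with a few lines of code. This is a small sketch using only the bit values and names printed in the thread; the function names are made up for illustration, and bits above 128 are simply ignored.

```python
# Bit values and names as printed by tune-sched-domains in the thread.
SD_FLAGS = {
    1: "SD_LOAD_BALANCE",
    2: "SD_BALANCE_NEWIDLE",
    4: "SD_BALANCE_EXEC",
    8: "SD_BALANCE_FORK",
    16: "SD_WAKE_IDLE",
    32: "SD_WAKE_AFFINE",
    64: "SD_WAKE_BALANCE",
    128: "SD_SHARE_CPUPOWER",
}


def decode(value: int) -> list[str]:
    """List the names of all known flags set in `value`."""
    return [name for bit, name in SD_FLAGS.items() if value & bit]


def clear(value: int, name: str) -> int:
    """Return `value` with the named flag cleared."""
    bit = next(b for b, n in SD_FLAGS.items() if n == name)
    return value & ~bit


print(decode(47))
print(clear(47, "SD_BALANCE_NEWIDLE"), clear(191, "SD_BALANCE_NEWIDLE"))
```

Masking with `& ~bit` gives the same result as the `$[47-2]` subtraction when the flag is set, and it is harmless when the flag is already clear, where a subtraction would corrupt the value.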

From: Ingo Molnar
Date: Friday, October 10, 2008 - 4:40 am

make sure you have this fix in tip/master already:

  5b7dba4: sched_clock: prevent scd->clock from moving backwards

Note: Mike is 100% correct in suggesting that a very good cpu_clock() is 
needed for precise scheduling.

i've also Cc:-ed Nick.

	Ingo
--

From: Evgeniy Polyakov
Date: Friday, October 10, 2008 - 6:25 am

The last commit is 5dc64a3442b98eaa and aforementioned changeset was included.
Result is quite bad:

vanilla 27: 	347.222
no TSO/GSO:	357.331
no hrticks:	382.983
no balance:	389.802
tip:		365.576

-- 
	Evgeniy Polyakov
--
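
To see how much each cumulative fix contributes, the table above can be reduced to per-step gains. This is a rough sketch over the numbers quoted in the thread; the 470 MB/s target is the approximate 2.6.24-era figure from the first message, and the tip result is left out because it is a different tree rather than one more fix on top.

```python
# tbench throughput in MB/s as reported in the thread; each row includes
# the fixes of the rows before it.
RESULTS = [
    ("vanilla 27", 347.222),
    ("no TSO/GSO", 357.331),
    ("no hrticks", 382.983),
    ("no balance", 389.802),
]
TARGET = 470.0  # approximate figure quoted for the older kernels


def step_gains(results):
    """Yield (label, gain over the previous row) for each row after the first."""
    for (_, prev), (label, cur) in zip(results, results[1:]):
        yield label, cur - prev


for label, gain in step_gains(RESULTS):
    print(f"{label}: +{gain:.3f} MB/s")
print(f"remaining gap to target: {TARGET - RESULTS[-1][1]:.3f} MB/s")
```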

From: Ingo Molnar
Date: Friday, October 10, 2008 - 4:42 am

okay. The target is 470 MB/sec, right? (Assuming the workload is sane 
and 'fixing' it does not mean we have to schedule worse.)

We are still way off from 470 MB/sec.

	Ingo
--

From: Evgeniy Polyakov
Date: Friday, October 10, 2008 - 4:55 am

Well, that's where I started/stopped, so maybe we will even move
further? :)

-- 
	Evgeniy Polyakov
--

From: Ingo Molnar
Date: Friday, October 10, 2008 - 4:57 am

that's the right attitude ;)

	Ingo
--

From: Rafael J. Wysocki
Date: Friday, October 24, 2008 - 3:25 pm

Can anyone please tell me if there was any conclusion of this thread?

Thanks,
Rafael
--

From: David Miller
Date: Friday, October 24, 2008 - 4:31 pm

From: "Rafael J. Wysocki" <rjw@sisk.pl>

I made some more analysis in private with Ingo and Peter Z. and found
that the tbench decreases correlate pretty much directly with the
ongoing increasing cpu cost of wake_up() and friends in the fair
scheduler.

The largest increase in computational cost of wakeups came in 2.6.27
when the hrtimer bits got added; it more than tripled the cost of a wakeup.
In 2.6.28-rc1 the hrtimer feature has been disabled, but I think that
should be backported into the 2.6.27-stable branch.

So I think that should be backported, and meanwhile I'm spending some
time in the background trying to replace the fair scheduler's RB tree
crud with something faster so maybe at some point we can recover all
of the regressions in this area caused by the CFS code.
--

From: Mike Galbraith
Date: Friday, October 24, 2008 - 9:05 pm

My test data indicates (to me anyway) that there is another source of
localhost throughput loss in .27.  In that data, there is no hrtick
overhead since I didn't have highres timers enabled, and computational
costs added in .27 were removed.  Dunno where it lives, but it does
appear to exist.

	-Mike

--

From: David Miller
Date: Friday, October 24, 2008 - 10:15 pm

From: Mike Galbraith <efault@gmx.de>

Disabling TSO on loopback doesn't fix that bit for you?

--

From: Mike Galbraith
Date: Friday, October 24, 2008 - 10:53 pm

No.  Those numbers are with TSO/GSO disabled.

I did a manual 100% sched and everything related revert to 26 scheduler,
and had ~the same result as these numbers.  27 with 100% revert actually
performed a bit _worse_ for me than 27 with its overhead.. which
puzzles me greatly.

	-Mike

--

From: Rafael J. Wysocki
Date: Saturday, October 25, 2008 - 4:13 am

Thanks a lot for the info.

Could you please give me a pointer to the commit disabling the hrtimer feature?

Rafael
--

From: David Miller
Date: Saturday, October 25, 2008 - 8:55 pm

From: "Rafael J. Wysocki" <rjw@sisk.pl>

Here it is:

commit 0c4b83da58ec2e96ce9c44c211d6eac5f9dae478
Author: Ingo Molnar <mingo@elte.hu>
Date:   Mon Oct 20 14:27:43 2008 +0200

    sched: disable the hrtick for now
    
    David Miller reported that hrtick update overhead has tripled the
    wakeup overhead on Sparc64.
    
    That is too much - disable the HRTICK feature for now by default,
    until a faster implementation is found.
    
    Reported-by: David Miller <davem@davemloft.net>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 7c9e8f4..fda0162 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -5,7 +5,7 @@ SCHED_FEAT(START_DEBIT, 1)
 SCHED_FEAT(AFFINE_WAKEUPS, 1)
 SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 SCHED_FEAT(SYNC_WAKEUPS, 1)
-SCHED_FEAT(HRTICK, 1)
+SCHED_FEAT(HRTICK, 0)
 SCHED_FEAT(DOUBLE_TICK, 0)
 SCHED_FEAT(ASYM_GRAN, 1)
 SCHED_FEAT(LB_BIAS, 1)
--

From: Rafael J. Wysocki
Date: Sunday, October 26, 2008 - 4:33 am

Rafael
--

From: Mike Galbraith
Date: Friday, October 24, 2008 - 8:37 pm

Part of the .27 regression was added scheduler overhead going from .26
to .27.  The scheduler overhead is now gone, but an unidentified source
of localhost throughput loss remains for both SMP and UP configs.

	-Mike

My last test data, updated to reflect recent commits:

Legend:
clock  = v2.6.26..5052696 + 5052696..v2.6.27-rc7 sched clock changes
weight = a7be37a + c9c294a + ced8aa1 (adds math overhead)
buddy  = 103638d (adds math overhead)
buddy_overhead = b0aa51b (removes math overhead of buddy)
revert_to_per_rq_vruntime = f9c0b09 (+2 lines, removes math overhead of weight)

2.6.26.6-up virgin
ring-test   - 1.169 us/cycle  = 855 KHz                                 1.000
netperf     - 130967.54 131143.75 130914.96 rr/s    avg 131008.75 rr/s  1.000
tbench      - 357.593 355.455 356.048 MB/sec        avg 356.365 MB/sec  1.000

2.6.26.6-up + clock + buddy + weight (== .27 scheduler)
ring-test   - 1.234 us/cycle  = 810 KHz                                  .947 [cmp1]
netperf     - 128026.62 128118.48 127973.54 rr/s    avg 128039.54 rr/s   .977
tbench      - 342.011 345.307 343.535 MB/sec        avg 343.617 MB/sec   .964

2.6.26.6-up + clock + buddy + weight + revert_to_per_rq_vruntime + buddy_overhead
ring-test   - 1.174 us/cycle  = 851 KHz                                  .995 [cmp2]
netperf     - 133928.03 134265.41 134297.06 rr/s    avg 134163.50 rr/s  1.024
tbench      - 358.049 359.529 358.342 MB/sec        avg 358.640 MB/sec  1.006

                                                       versus .26 counterpart
2.6.27-up virgin
ring-test   - 1.193 us/cycle  = 838 KHz                                 1.034 [vs cmp1]
netperf     - 121293.48 121700.96 120716.98 rr/s    avg 121237.14 rr/s   .946
tbench      - 340.362 339.780 341.353 MB/sec        avg 340.498 MB/sec   .990

2.6.27-up + revert_to_per_rq_vruntime + buddy_overhead
ring-test   - 1.122 us/cycle  = 891 KHz                                 1.047 [vs cmp2]
netperf     - 119353.27 118600.98 119719.12 rr/s    avg ...
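
The derived columns in the table above (KHz from us/cycle, the three-run average, and the ratio against the 2.6.26.6 baseline) can be reproduced as follows. This is only a sketch of the arithmetic, using the first two tbench rows; the function names are illustrative.

```python
from statistics import mean


def khz_from_us(us_per_cycle: float) -> float:
    """Convert a ring-test cycle time in microseconds to a rate in KHz."""
    return 1e3 / us_per_cycle


def relative(runs, baseline_runs) -> float:
    """Average of `runs` divided by the average of `baseline_runs`."""
    return mean(runs) / mean(baseline_runs)


tbench_26 = [357.593, 355.455, 356.048]          # 2.6.26.6-up virgin
tbench_26_sched27 = [342.011, 345.307, 343.535]  # + clock + buddy + weight

print(round(khz_from_us(1.169)))
print(round(mean(tbench_26), 3))
print(round(relative(tbench_26_sched27, tbench_26), 3))
```
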
From: David Miller
Date: Friday, October 24, 2008 - 10:16 pm

From: Mike Galbraith <efault@gmx.de>

It has to be the TSO thingy Evgeniy hit too, right?

If not, please bisect this.
--

From: Mike Galbraith
Date: Friday, October 24, 2008 - 10:58 pm

(oh my <fword> gawd:)

I spent long day manweeks trying to bisect and whatnot.  It's immune to
my feeble efforts, and my git-foo.

	-Mike

--

From: Mike Galbraith
Date: Friday, October 24, 2008 - 11:53 pm

but..

(tbench/netperf numbers were tested with gcc-4.1 at this time in log, I
went back and re-measured ring-test because I switched compilers)

2.6.22.19-up
ring-test   - 1.204 us/cycle  = 830 KHz  (gcc-4.1)
ring-test   - doorstop                   (gcc-4.3)
netperf     - 147798.56 rr/s  = 295 KHz  (hmm, a bit unstable, 140K..147K rr/s)
tbench      - 374.573 MB/sec

2.6.22.19-cfs-v24.1-up
ring-test   - 1.098 us/cycle  = 910 KHz  (gcc-4.1)
ring-test   - doorstop                   (gcc-4.3)
netperf     - 140039.03 rr/s  = 280 KHz = 3.57us - 1.10us sched = 2.47us/packet network
tbench      - 364.191 MB/sec

2.6.23.17-up
ring-test   - 1.252 us/cycle  = 798 KHz  (gcc-4.1)
ring-test   - 1.235 us/cycle  = 809 KHz  (gcc-4.3)
netperf     - 123736.40 rr/s  = 247 KHz  sb 268 KHZ / 134336.37 rr/s
tbench      - 355.906 MB/sec

2.6.23.17-cfs-v24.1-up
ring-test   - 1.100 us/cycle  = 909 KHz  (gcc-4.1)
ring-test   - 1.074 us/cycle  = 931 KHz  (gcc-4.3)
netperf     - 135847.14 rr/s  = 271 KHz  sb 280 KHz / 140039.03 rr/s
tbench      - 364.511 MB/sec

2.6.24.7-up
ring-test   - 1.100 us/cycle  = 909 KHz  (gcc-4.1)
ring-test   - 1.068 us/cycle  = 936 KHz  (gcc-4.3)
netperf     - 122300.66 rr/s  = 244 KHz  sb 280 KHz / 140039.03 rr/s
tbench      - 341.523 MB/sec

2.6.25.17-up
ring-test   - 1.163 us/cycle  = 859 KHz  (gcc-4.1)
ring-test   - 1.129 us/cycle  = 885 KHz  (gcc-4.3)
netperf     - 132102.70 rr/s  = 264 KHz  sb 275 KHz / 137627.30 rr/s
tbench      - 361.71 MB/sec

..in 25, something happened that dropped my max context switch rate from
~930 KHz to ~885 KHz.  Maybe I'll have better luck trying to find that.
Added to to-do list.  Benchmark mysteries I'm going to have to leave
alone, they've kicked my little butt quite thoroughly ;-)

	-Mike

--
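
The per-packet breakdown on the 2.6.22.19-cfs-v24.1 netperf line can be retraced like this. It is a sketch under the assumption that each request/response transaction costs two context switches (which is how 140039 rr/s maps to 280 KHz) and that the ring-test cycle time stands in for the scheduler cost of one switch.

```python
def per_packet_network_us(rr_per_sec: float, sched_us: float) -> float:
    """Estimate the non-scheduler time per packet, in microseconds.

    Assumes two context switches per request/response transaction and
    treats `sched_us` (the ring-test cycle time) as the cost of one switch.
    """
    switches_per_sec = 2 * rr_per_sec
    us_per_switch = 1e6 / switches_per_sec
    return us_per_switch - sched_us


# 2.6.22.19-cfs-v24.1-up: netperf 140039.03 rr/s, ring-test 1.098 us/cycle
print(f"{per_packet_network_us(140039.03, 1.098):.2f} us/packet")
```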

From: David Miller
Date: Saturday, October 25, 2008 - 12:24 am

From: Mike Galbraith <efault@gmx.de>

But note that tbench performance improved a bit in 2.6.25.

In my tests I noticed a similar effect, but from 2.6.23 to 2.6.24,
weird.

Just for the public record here are the numbers I got in my testing.
Each entry was run purely on the latest 2.6.X-stable tree for each
release.  First is the tbench score and then there are 40 numbers
which are sparc64 cpu cycle counts of default_wake_function().

v2.6.22:

	Throughput 173.677 MB/sec  2 clients  2 procs  max_latency=38.192 ms

	1636 1483 1552 1560 1534 1522 1472 1530 1518 1468
	1534 1402 1468 1656 1383 1362 1516 1336 1392 1472
	1652 1522 1486 1363 1430 1334 1382 1398 1448 1439
	1662 1540 1526 1472 1539 1434 1452 1492 1502 1432

v2.6.23: This is when CFS got added to the tree.

	Throughput 167.933 MB/sec  2 clients  2 procs  max_latency=25.428 ms

	3435 3363 3165 3304 3401 3189 3280 3243 3156 3295
	3439 3375 2950 2945 2727 3383 3560 3417 3221 3271
	3595 3293 3323 3283 3267 3279 3343 3293 3203 3341
	3413 3268 3107 3361 3245 3195 3079 3184 3405 3191

v2.6.24:

	Throughput 170.314 MB/sec  2 clients  2 procs  max_latency=22.121 ms

	2136 1886 2030 1929 2021 1941 2009 2067 1895 2019
	2072 1985 1992 1986 2031 2085 2014 2103 1825 1705
	2018 2034 1921 2079 1901 1989 1976 2035 2053 1971
	2144 2059 2025 2024 2029 1932 1980 1947 1956 2008

v2.6.25:

	Throughput 165.294 MB/sec  2 clients  2 procs  max_latency=108.869 ms

	2551 2707 2674 2771 2641 2727 2647 2865 2800 2796
	2793 2745 2609 2753 2674 2618 2671 2668 2641 2744
	2727 2616 2897 2720 2682 2737 2551 2677 2687 2603
	2725 2717 2510 2682 2658 2581 2713 2608 2619 2586

v2.6.26:

	Throughput 160.759 MB/sec  2 clients  2 procs  max_latency=31.420 ms

	2576 2492 2556 2517 2496 2473 2620 2464 2535 2494
	2800 2297 2183 2634 2546 2579 2488 2455 2632 2540
	2566 2540 2536 2496 2432 2453 2462 2568 2406 2522
	2565 2620 2532 2416 2434 2452 2524 2440 2424 2412

v2.6.27:

	Throughput 143.776 MB/sec  2 clients  2 procs  ...
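
A quick way to compare the wakeup cost across releases is to average the cycle counts. The sketch below uses only the first row of ten samples for three of the releases listed above, so the averages are indicative rather than exact.

```python
from statistics import mean

# First row of ten default_wake_function() cycle counts per release,
# copied from the message; the remaining rows are omitted for brevity.
CYCLES = {
    "v2.6.22": [1636, 1483, 1552, 1560, 1534, 1522, 1472, 1530, 1518, 1468],
    "v2.6.23": [3435, 3363, 3165, 3304, 3401, 3189, 3280, 3243, 3156, 3295],
    "v2.6.24": [2136, 1886, 2030, 1929, 2021, 1941, 2009, 2067, 1895, 2019],
}

baseline = mean(CYCLES["v2.6.22"])
for release, samples in CYCLES.items():
    avg = mean(samples)
    print(f"{release}: {avg:.1f} cycles, {avg / baseline:.2f}x v2.6.22")
```
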
From: Mike Galbraith
Date: Saturday, October 25, 2008 - 12:52 am

23->24 I can understand.  In my testing, 23 CFS was not a wonderful

Your numbers seem to ~agree with mine.  And yeah, that hrtick is damned
expensive.  I didn't realize _how_ expensive until I trimmed my config
way way down from distro.  Just having highres timers enabled makes a
very large difference here, even without hrtick enabled, and with the
overhead of a disabled hrtick removed.

	-Mike

--

From: Jiri Kosina
Date: Saturday, October 25, 2008 - 4:10 pm

I have been looking at a very similar-looking issue. For the 
public record, here are the numbers we have been able to come up with so 
far (measured with dbench, so the absolute values are slightly different, 
but they still show a similar pattern)

208.4 MB/sec  -- vanilla 2.6.16.60
201.6 MB/sec  -- vanilla 2.6.20.1
172.9 MB/sec  -- vanilla 2.6.22.19
 74.2 MB/sec  -- vanilla 2.6.23
 46.1 MB/sec  -- vanilla 2.6.24.2
 30.6 MB/sec  -- vanilla 2.6.26.1

I.e. huge drop for 2.6.23 (this was with default configs for each 
respective kernel).
2.6.23-rc1 shows 80.5 MB/s, i.e. a few % better than final 2.6.23, but 
still pretty bad. 

I have gone through the commits that went into -rc1 and tried to figure 
out which one could be responsible. Here are the numbers:

 85.3 MB/s for 2ba2d00363 (just before on-demand readahead has been merged)
 82.7 MB/s for 45426812d6 (before cond_resched() has been added into page 
                           invalidation code)
187.7 MB/s for c1e4fe711a4 (just before CFS scheduler has been merged)

So the current biggest suspect is CFS, but I don't have enough numbers yet 
to be able to point a finger at it with 100% certainty. Hopefully soon.

Just my $0.02

-- 
Jiri Kosina
SUSE Labs

--

From: Mike Galbraith
Date: Sunday, October 26, 2008 - 1:46 am

Hi,

High client count right?

I reproduced this on my Q6600 box.  However, I also reproduced it with
2.6.22.19.  What I think you're seeing is just dbench creating a massive
train wreck.  With CFS, it appears to be more likely to start->end
_sustain_, but the wreckage is present in O(1) scheduler runs as well,
and will start->end sustain there as well.

2.6.22.19-smp           Throughput 967.933 MB/sec 16 procs Throughput 147.879 MB/sec 160 procs
                        Throughput 950.325 MB/sec 16 procs Throughput 349.959 MB/sec 160 procs
                        Throughput 953.382 MB/sec 16 procs Throughput 126.821 MB/sec 160 procs <== massive jitter
2.6.22.19-cfs-v24.1-smp Throughput 978.047 MB/sec 16 procs Throughput 170.662 MB/sec 160 procs
                        Throughput 943.254 MB/sec 16 procs Throughput 39.388 MB/sec 160 procs <== sustained train wreck
                        Throughput 934.042 MB/sec 16 procs Throughput 239.574 MB/sec 160 procs
2.6.23.17-smp           Throughput 1173.97 MB/sec 16 procs Throughput 100.996 MB/sec 160 procs
                        Throughput 1122.85 MB/sec 16 procs Throughput 80.3747 MB/sec 160 procs
                        Throughput 1113.60 MB/sec 16 procs Throughput 99.3723 MB/sec 160 procs
2.6.24.7-smp            Throughput 1030.34 MB/sec 16 procs Throughput 256.419 MB/sec 160 procs
                        Throughput 970.602 MB/sec 16 procs Throughput 257.008 MB/sec 160 procs
                        Throughput 1056.48 MB/sec 16 procs Throughput 248.841 MB/sec 160 procs
2.6.25.19-smp           Throughput 955.874 MB/sec 16 procs Throughput 40.5735 MB/sec 160 procs
                        Throughput 943.348 MB/sec 16 procs Throughput 62.3966 MB/sec 160 procs
                        Throughput 937.595 MB/sec 16 procs Throughput 17.4639 MB/sec 160 procs
2.6.26.7-smp            Throughput 904.564 MB/sec 16 procs Throughput 118.364 MB/sec 160 procs
                        Throughput 891.824 MB/sec 16 procs Throughput 34.2193 MB/sec 160 procs
                     ...
From: Peter Zijlstra
Date: Sunday, October 26, 2008 - 2:00 am

wasn't dbench one of those non-benchmarks that thrives on randomness and
unfairness?

Andrew said recently:
  "dbench is pretty chaotic and it could be that a good change causes
dbench to get worse.  That's happened plenty of times in the past."

So I'm not inclined to worry too much about dbench in any way shape or
form.



--

From: Andrew Morton
Date: Sunday, October 26, 2008 - 2:11 am

Was this when we decreased the default value of

Well.  If there is a consistent change in dbench throughput, it is
important that we at least understand the reasons for it.  But we
don't necessarily want to optimise for dbench throughput.

--

From: Evgeniy Polyakov
Date: Sunday, October 26, 2008 - 2:27 am

Hi.


Sorry, but such excuses do not deserve to be said. No matter how
ugly, wrong, unusual or whatever else you might call some test,
it shows the problem, which has to be fixed. There is no 'dbench tune',
there is a fair number of problems, and at least several of them dbench
already helped to narrow down and precisely locate. The same regressions
were also observed in other benchmarks, originally reported before I
started this thread.

-- 
	Evgeniy Polyakov
--

From: Andrew Morton
Date: Sunday, October 26, 2008 - 2:34 am

Not necessarily.  There are times when we have made changes which we
knew full well reduced dbench's throughput, because we believed them to

You seem to be saying what I said.
--

From: Evgeniy Polyakov
Date: Sunday, October 26, 2008 - 3:05 am

Hi Andrew.


I suppose there were words about dbench not being a real-life test, so if
it suddenly sucks, no one will care. Sigh, theorists...
I'm not surprised there were no changes when I reported hrtimers to be
the main guilty factor in my setup for dbench tests, and only when David
showed that they also killed his sparks via wake_up(), something was
done. Now this regression has even disappeared from the list.
Good direction, we should always follow this.

As a side note, is the hrtimer subsystem also used for the BH backend? I have
not yet analyzed data about vanilla kernels only being able to accept
clients at 20-30k accepts per second, while some other magical tree
(not vanilla) around 2.6.18 was able to do that at 50k accepts per
second. There are lots of CPUs, RAM and bandwidth which are effectively
unused even behind a Linux load balancer...

-- 
	Evgeniy Polyakov
--

From: David Miller
Date: Sunday, October 26, 2008 - 7:34 pm

From: Evgeniy Polyakov <zbr@ioremap.net>

Yes, this situation was in my opinion a complete fucking joke.  Someone
like me shouldn't have to do all of the hard work for the scheduler
folks in order for a bug like this to get seriously looked at.

Evgeniy's difficult work was effectively ignored except by other
testers who could also see and reproduce the problem.

No scheduler developer looked seriously into these reports other than
to say "please try to reproduce with tip" (?!?!?!)  I guess showing
the developer the exact changeset(s) which add the regression isn't
enough these days :-/

Did any scheduler developer try to run tbench ONCE and do even a tiny
bit of analysis, like the kind I did?  Answer honestly...  Linus even
asked you guys in the private thread to "please look into it".  So, if
none of you did, you should all be deeply ashamed of yourselves.

People like me shouldn't have to do all of that work for you just to
get something to happen.

Not until I went privately to Ingo and Linus with cycle counts and a
full diagnosis (of every single release since 2.6.22, a whole 2 days
of work for me) of the precise code eating up too many cycles and
causing problems DID ANYTHING HAPPEN.

This is extremely and excruciatingly DISAPPOINTING and WRONG.

We completely and absolutely suck if this is how we will handle any
performance regression report.

And although this case is specific to the scheduler, a lot of
other areas handle well prepared bug reports similarly.  So I'm not
really picking on the scheduler folks, they just happen to be the
current example :-)

--

From: Ingo Molnar
Date: Monday, October 27, 2008 - 2:30 am

yeah, that overhead was bad, and once it became clear that you had 
high-resolution timers enabled for your benchmarking runs (which is 
default-off and which is still rare for benchmarking runs - despite 
being a popular end-user feature) we immediately disabled the hrtick via 
this upstream commit:

  0c4b83d: sched: disable the hrtick for now

that commit is included in v2.6.28-rc1 so this particular issue should 
be resolved.

high-resolution timers are still default-disabled in the upstream 
kernel, so this never affected usual configs that folks keep 
benchmarking - it only affected those who decided they want higher 
resolution timers and more precise scheduling.

Anyway, the sched-hrtick is off now, and we won't turn it back on without 
making sure that it's really low cost in the hotpath.

Regarding tbench, a workload that context-switches in excess of 100,000 
per second is inevitably going to show scheduler overhead - so you'll 
get the best numbers if you eliminate all/most scheduler code from the 
hotpath. We are working on various patches to mitigate the cost some 
more - and your patches and feedback are welcome as well.

But it's a difficult call with no silver bullets. On one hand we have 
folks putting more and more stuff into the context-switching hotpath on 
the (mostly valid) point that the scheduler is a slowpath compared to 
most other things. On the other hand we've got folks doing 
high-context-switch ratio benchmarks and complaining about the overhead 
whenever something goes in that improves the quality of scheduling of a 
workload that does not context-switch as massively as tbench. It's a 
difficult balance and we cannot satisfy both camps.

Nevertheless, this is not a valid argument in favor of the hrtick 
overhead: that was clearly excessive overhead and we zapped it.

	Ingo
--

From: David Miller
Date: Monday, October 27, 2008 - 2:57 am

From: Ingo Molnar <mingo@elte.hu>

This I heavily disagree with.  The scheduler should be so cheap
that you cannot possibly notice that it is even there for a benchmark
like tbench.

If we now think it's ok that picking which task to run is more
expensive than writing 64 bytes over a TCP socket and then blocking on
a read, I'd like to stop using Linux. :-) That's "real work" and if
the scheduler is more expensive than "real work" we lose.

I do want to remind you of a thread you participated in, in April,
where you complained about loopback TCP performance:

	http://marc.info/?l=linux-netdev&m=120696343707674&w=2

It might be fruitful for you to rerun your tests with CFS reverted
(start with 2.6.22 and progressively run your benchmark on every

We've always been proud of our scheduling overhead being extremely
low, and you have to face the simple fact that starting in 2.6.23 it's
been getting progressively more and more expensive.

Consistently so.

People even noticed it.
--

From: Mike Galbraith
Date: Sunday, October 26, 2008 - 3:23 am

Wow, indeed.  I fired up an ext2 disk to take kjournald out of the
picture (dunno, just a transient thought).  Stock settings produced
three perma-wrecks in a row.  With it bumped to 50, three very
considerably nicer results in a row appeared.

2.6.26.7-smp dirty_ratio = 10 (stock)
Throughput 36.3649 MB/sec 160 procs
Throughput 47.0787 MB/sec 160 procs
Throughput 88.2055 MB/sec 160 procs

2.6.26.7-smp dirty_ratio = 50
Throughput 1009.98 MB/sec 160 procs
Throughput 1101.57 MB/sec 160 procs
Throughput 943.205 MB/sec 160 procs

	-Mike

--

From: Jiri Kosina
Date: Sunday, October 26, 2008 - 12:03 pm

2.6.28 gives 41.8 MB/s with /proc/sys/vm/dirty_ratio == 50. So small 
improvement, but still far far away from the throughput of pre-2.6.23 
kernels.

-- 
Jiri Kosina
SUSE Labs
--
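
For repeating the dirty_ratio comparison between runs, the tunable can be read and set through procfs. This is a minimal sketch, assuming the standard /proc/sys/vm/dirty_ratio location named above; the helper names are illustrative, and writing normally requires root.

```python
from pathlib import Path

# procfs location of the writeback threshold discussed above.
DIRTY_RATIO = Path("/proc/sys/vm/dirty_ratio")


def read_ratio(path: Path = DIRTY_RATIO) -> int:
    """Return the current dirty_ratio as an integer percentage."""
    return int(path.read_text().strip())


def write_ratio(percent: int, path: Path = DIRTY_RATIO) -> None:
    """Set dirty_ratio; rejects values that are not a valid percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("dirty_ratio is a percentage")
    path.write_text(f"{percent}\n")
```

Recording read_ratio() next to each dbench result makes it easier to tell which of the numbers in the thread were taken with the stock value of 10 and which with 50.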

From: Mike Galbraith
Date: Monday, October 27, 2008 - 2:29 am

How many clients?

dbench 160 -t 60

2.6.28-smp (git.today)
Throughput 331.718 MB/sec 160 procs (no logjam)
Throughput 309.85 MB/sec 160 procs (contains logjam)
Throughput 392.746 MB/sec 160 procs (contains logjam)

	-Mike

--

From: Jiri Kosina
Date: Monday, October 27, 2008 - 3:42 am

Ok, so another important datapoint:

with c1e4fe711a4 (just before CFS has been merged for 2.6.23), the dbench 
throughput measures

	187.7 MB/s

in our testing conditions (default config).

With c31f2e8a42c4 (just after CFS has been merged for 2.6.23), the 
throughput measured by dbench is

	82.3 MB/s

This is the huge drop we have been looking for. After this, the 
performance was still going down gradually, down to the ~45 MB/s we are measuring 
for 2.6.27. But the biggest drop (more than 50%) points directly to the CFS 
merge.

-- 
Jiri Kosina
SUSE Labs
--

From: Ingo Molnar
Date: Monday, October 27, 2008 - 4:27 am

that is a well-known property of dbench: it rewards unfairness in IO, 
memory management and scheduling.

To get the best possible dbench numbers in CPU-bound dbench 
runs, you have to throw away the scheduler completely, and do this 
instead:

 - first execute all requests of client 1
 - then execute all requests of client 2
 ....
 - execute all requests of client N

the moment the clients are allowed to overlap, the moment their requests 
are executed more fairly, the dbench numbers drop.

	Ingo
--

From: Alan Cox
Date: Monday, October 27, 2008 - 4:33 am

Rubbish. If you do that you'll not get enough I/O in parallel to schedule
the disk well (not that most of our I/O schedulers are doing the job
well, and the vm writeback threads then mess it up and the lack of Arjans

Fairness isn't everything. Dbench is a fairly good tool for studying some
real world workloads. If your fairness hurts throughput that much maybe
your scheduler algorithm is just plain *wrong* as it isn't adapting to
workload at all well.

Alan
--

From: Mike Galbraith
Date: Monday, October 27, 2008 - 5:06 am

Doesn't seem to be scheduler/fairness.  2.6.22.19 is O(1), and falls
apart too, I posted the numbers and full dbench output yesterday.

	-Mike

--

From: Jiri Kosina
Date: Monday, October 27, 2008 - 6:42 am

We'll need to look into this a little bit more I think. I have sent out 
some numbers too, and these indicate very clearly that there is more than 
50% performance drop (measured by dbench) just after the very merge of CFS 
in 2.6.23-rc1 merge window.

-- 
Jiri Kosina
SUSE Labs
--

From: Mike Galbraith
Date: Monday, October 27, 2008 - 7:17 am

Sure.  Watching the per/sec output, every kernel I have sucks at high
client count dbench, it's just a matter of how badly, and how long.

BTW, the nice pretty 160 client numbers I posted yesterday for ext2
turned out to be because somebody adds _netdev mount option when I mount
-a in order to mount my freshly hotplugged external drive (why?  that
ain't in my fstab).  Without that switch, ext2 output is roughly as
raggedy as ext3, and nowhere near the up to 1.4GB/sec I can get with
dirty_ratio=50 + ext2 + (buy none, get one free) _netdev option.  Free
for the not asking option does nada for ext3.

	-Mike

--

From: Ingo Molnar
Date: Monday, October 27, 2008 - 11:33 am

i've actually implemented that about a decade ago: i've tracked down 
what makes dbench tick, i've implemented the kernel heuristics for it 
to make dbench scale linearly with the number of clients - just to be 

the best dbench results come from systems that have enough RAM to 
cache the full working set, and a filesystem intelligent enough to not 
insert bogus IO serialization cycles (ext3 is not such a filesystem).

The moment there's real IO it becomes harder to analyze but the same 
basic behavior remains: the more unfair the IO scheduler, the "better" 
dbench results we get.

	Ingo
--

From: Evgeniy Polyakov
Date: Monday, October 27, 2008 - 12:39 pm

My test system has 8gb for 8 clients and its performance dropped by 30%.
There is no IO load since tbench uses only the network part while dbench
itself uses only disk IO. What we see right now is that usual network
server which handles mixed set of essentially small reads and writes
from the socket from multiple (8) clients suddenly lost one third of

Right now there is no disk IO at all. Only quite usual network and
process load.

-- 
	Evgeniy Polyakov
--

From: David Miller
Date: Monday, October 27, 2008 - 12:48 pm

From: Evgeniy Polyakov <zbr@ioremap.net>

I think the hope is that by saying there isn't a problem enough times,
it will become truth. :-)

More seriously, Ingo, what in the world do we need to do in order to get
you to start doing tbench runs and optimizing things (read as: fixing
the regression you added)?

I'm personally working on a test fibonacci heap implementation for
the fair sched code, and I already did all of the cost analysis all
the way back to the 2.6.22 pre-CFS days.

But I'm NOT a scheduler developer, so it isn't my responsibility to do
this crap for you.  You added this regression, why do I have to get my
hands dirty in order for there to be some hope that these regressions
start to get fixed?
--

From: Mike Galbraith
Date: Tuesday, October 28, 2008 - 3:24 am

I don't want to ruffle any feathers, but my box has a comment or two..

Has anyone looked at the numbers box emitted?  Some of what I believe to
be very interesting data-points may have been overlooked.

Here's a piece thereof again for better or worse.  One last post won't
burn the last electron.  If they don't agree with anyone else's numbers,
that's ok, their numbers have meaning too, and speak for themselves.

Retest hrtick pain:

2.6.26.7-up virgin no highres timers enabled
ring-test   - 1.155 us/cycle  = 865 KHz                                 1.000
netperf     - 130470.93 130771.00 129872.41 rr/s    avg 130371.44 rr/s  1.000 (within jitter of previous tests)
tbench      - 355.153 357.163 356.836 MB/sec        avg 356.384 MB/sec  1.000

2.6.26.7-up virgin highres timers enabled, hrtick enabled
ring-test   - 1.368 us/cycle  = 730 KHz                                  .843
netperf     - 118959.08 118853.16 117761.42 rr/s    avg 118524.55 rr/s   .909
tbench      - 340.999 338.655 340.005 MB/sec        avg 339.886 MB/sec   .953

OK, there's the hrtick regression in all its gory.  Ouch, that hurt.

Remember those numbers, box muttered them again in 27 testing.  These
previously tested kernels don't even have highres timers enabled, so
obviously hrtick is a non-issue for them.

2.6.26.6-up + clock + buddy + weight
ring-test   - 1.234 us/cycle  = 810 KHz                                  .947 [cmp1]
netperf     - 128026.62 128118.48 127973.54 rr/s    avg 128039.54 rr/s   .977
tbench      - 342.011 345.307 343.535 MB/sec        avg 343.617 MB/sec   .964

2.6.26.6-up + clock + buddy + weight + revert_to_per_rq_vruntime + buddy_overhead
ring-test   - 1.174 us/cycle  = 851 KHz                                  .995 [cmp2]
netperf     - 133928.03 134265.41 134297.06 rr/s    avg 134163.50 rr/s  1.024
tbench      - 358.049 359.529 358.342 MB/sec        avg 358.640 MB/sec  1.006

Note that I added all .27 additional scheduler overhead to .26, and then
removed every last bit of it, ...
From: Ingo Molnar
Date: Tuesday, October 28, 2008 - 3:37 am

thanks Mike for the _extensive_ testing and bug hunting session you've 
done in the past couple of weeks! All the relevant fixlets you found 
are now queued up properly in sched/urgent, correct?

What's your gut feeling, is that remaining small regression scheduler 
or networking related?

i'm cutting the ball in half and i'm passing over one half of it to 
the networking folks, because your numbers show _huge_ sensitivity in 

any scheduler micro-overhead detail is going to be a drop in the 
ocean, compared to such huge variations. We could change the scheduler 
to the old O(N) design of the 2.2 kernel and the impact of that would 
be a blip on the radar, compared to the overhead shown above.

	Ingo
--

From: Mike Galbraith
Date: Tuesday, October 28, 2008 - 3:57 am

I don't know where it lives.  I'm still looking, and the numbers are

I strongly _suspect_ that the network folks have some things they could
investigate, but given my utter failure at finding the smoking gun, I
can't say one way or the other.  IMHO, sharing with network folks would
likely turn out to be a fair thing to do.

Am I waffling?  Me?  You bet your a$$! My clock is already squeaky clean
thank you very much :-)

What I can say is that my box is quite certain that there are influences
outside the scheduler which have more influence on benchmark results than
the scheduler does through the life of testing.

	-Mike

--

From: Ingo Molnar
Date: Tuesday, October 28, 2008 - 4:02 am

okay, that's an important observation.

	Ingo
--

From: Mike Galbraith
Date: Tuesday, October 28, 2008 - 7:00 am

Hm.  _Maybe_ someone needs to take a look at c7aceab.  I took it to a 26
test tree yesterday, and it lowered my throughput, though I didn't
repeat a lot, was too busy.  I just backed it out of one of my 27 test
trees, and the netperf number is 1.030, tbench is 1.040.  I'll test this
in virgin source later, but thought I should drop a note, so perhaps
someone interested in this thread can confirm/deny loss.

	-Mike

--

From: Mike Galbraith
Date: Tuesday, October 28, 2008 - 8:22 am

Bah, too much testing, must have done something stupid.

	-Mike

--

From: Evgeniy Polyakov
Date: Wednesday, October 29, 2008 - 2:14 am

Hi.

Cooled down? Let's start over the sched vs network fight.


Sorry for interrupting your conversation, Ingo, but before throwing
a stone one should be clean oneself, shouldn't one? When you asked to
test the -tip tree and it was shown that it regressed about 20 MB/s in my
test against the last tweaks you suggested, -tip was still merged and no
work on this issue was performed. Now the previous tweaks (nohrticks and
nobalance; although hrticks are now disabled, performance did not
return to the vanilla .27 with tweaks) do not help anymore, so
apparently there is an additional problem.

So for the reference:
vanilla 27	: 347.222
no TSO/GSO	: 357.331
no hrticks	: 382.983
no balance	: 389.802
4403b4 commit	: 361.184
dirty_ratio-50	: 361.086
no-sched-tweaks	: 361.367

So scheduler _does_ regress even right now when this thread is being
discussed.

Now let's return to the network. Ilpo Järvinen showed a nasty modulo
operation in the fast path, which David is thinking about how to resolve,
but it turns out that this change was introduced back in 2005, and
although a naive change increases performance up to 370 MB/s, i.e. it
gains us 2.5%, this was never accounted for in previous changes.

So, probably, if we revert -tip merge to vanilla .27, add nohrtick patch
and nobalance tweak _only_, and apply naive TSO patch we could bring
system to 400 MB/s. Note, that .22 has 479.82 and .23 454.36 MB/s.

-- 
	Evgeniy Polyakov
--

From: Evgeniy Polyakov
Date: Wednesday, October 29, 2008 - 2:50 am

And now I have to admit that the very last -tip merge did noticeably
improve the situation, up to 391.331 MB/s (189 in domains, with tso/gso
off and the naive tcp_tso_should_defer() change).

So we are now essentially at the level of 24-25 trees in my tests.

-- 
	Evgeniy Polyakov
--

From: Paolo Ciarrocchi
Date: Saturday, November 1, 2008 - 5:51 am

That's good, and it makes me think whether it would be a good idea to add
some performance numbers to each pull request that affects the core part
of the kernel.

Ciao,
-- 
Paolo
http://paolo.ciarrocchi.googlepages.com/
--

From: Nick Piggin
Date: Wednesday, October 29, 2008 - 2:59 am

You can get good dbench results from dbench on tmpfs, which
exercises the vm, vfs, scheduler etc. without IO or filesystems.
--

From: Mike Galbraith
Date: Sunday, October 26, 2008 - 2:15 am

Yeah, I was just curious.  The switch rate of dbench isn't high enough
for math to be an issue, so I wondered how the heck CFS could be such a
huge problem for this load.  Looks to me like all the math in the
_world_ couldn't hurt.. or help.

	-Mike

--

From: David Miller
Date: Saturday, October 25, 2008 - 12:19 am

From: Mike Galbraith <efault@gmx.de>

I understand, this is what happened to me when I tried to look into
the gradual tbench regressions since 2.6.22

I guess the only way to attack these things is to analyze the code and
make some debugging hacks to get some measurements and numbers.
--

From: Mike Galbraith
Date: Saturday, October 25, 2008 - 12:33 am

That's exactly what I've been trying to look into, but combined with
netperf.  The thing is an incredibly twisted maze of _this_ affects
_that_... sometimes involving magic and/or mythical creatures.

Very very annoying.

	-Mike

--

From: Rick Jones
Date: Monday, October 27, 2008 - 10:26 am

I cannot guarantee it will help, but the global -T option to pin netperf 
or netserver to a specific CPU might help cut-down the variables.

FWIW netperf top of trunk omni tests can now also determine and report 
the state of SELinux.  They also have code to accept or generate their 
own RFC4122-esque UUID.  Define some canonical tests and then ever closer 
to just needing some database-fu and automagic testing I suppose... 
things I do not presently possess but am curious enough to follow some 
pointers.

happy benchmarking,

rick jones
--

From: Mike Galbraith
Date: Monday, October 27, 2008 - 12:11 pm

Yup, and how.  Early on, the other variables drove me bat-shit frigging
_nuts_.  I eventually selected a UP config to test _because_ those other


Not really, but I can't seem to give up ;-)

	-Mike

--

From: Rick Jones
Date: Monday, October 27, 2008 - 12:18 pm

http://www.netperf.org/svn/netperf2/trunk/src/netsec_linux.c

Pointers to programmatic detection of AppArmor and a couple salient 
details about firewall (enabled, perhaps number of rules) from any 

Plot thickening, seems that autotest knows about some version of 
netperf2 already...  i'll be trying to see if there is some benefit to 
autotest to netperf2's top of trunk having the keyval output format, and 

then I guess I'll close with

successful benchmarking,

if not necessarily happy :)

rick jones
--

From: Mike Galbraith
Date: Monday, October 27, 2008 - 12:44 pm

There ya go, happy benchmarking is when they tell you what you want to
hear.  Successful is when you learn something.

	-Mike  (not happy, but learning)

--

From: Evgeniy Polyakov
Date: Sunday, October 26, 2008 - 4:29 am

For the reference, just pulled git tree (4403b4 commit): 361.184
and with dirty_ratio set to 50: 361.086
without scheduler domain tuning things are essentially the same: 361.367

So, things are getting worse with time, and previous tunes do not help
anymore.

-- 
	Evgeniy Polyakov
--

From: Evgeniy Polyakov
Date: Sunday, October 26, 2008 - 5:23 am

That's the picture, how we go on my hardware:
4 (2 physical, 2 logical hyper-threaded) 32-bit xeons 8gb of ram.
We probably can be a little bit better for -rc1 kernel though,
if I enable only 4gb via config.

Better one time to see than 1000 times to read. One can scare children
with our graphs... Picture attached.

-- 
	Evgeniy Polyakov
From: Stephen Hemminger
Date: Thursday, October 30, 2008 - 11:15 am

Has anyone looked into the impact of port randomization on this benchmark?
If it is generating lots of sockets quickly there could be an impact:
  * port randomization causes available port space to get filled non-uniformly
    and what was once a linear scan may have to walk over existing ports.
    (This could be improved by a hint bitmap)

  * port randomization adds at least one modulus operation per socket
    creation. This could be optimized by using a loop instead.
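
As a rough illustration of the second bullet (a sketch only, not code
from any kernel tree): when a value is known to overshoot the range by a
small amount, repeated subtraction gives the same result as the modulus
without a divide. The name wrap_offset() is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: reduce 'offset' into [0, range) without using '%'.
 * Only worthwhile when offset is expected to be below a small multiple of
 * range; for arbitrary 32-bit input the loop could run many times and a
 * real modulus would be cheaper.
 */
static uint32_t wrap_offset(uint32_t offset, uint32_t range)
{
	assert(range > 0);

	while (offset >= range)
		offset -= range;

	return offset;
}
```

Whether this actually beats the divide depends on how far the input can
overshoot, so it would need measuring on the workload in question.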
--

From: Evgeniy Polyakov
Date: Thursday, October 30, 2008 - 11:40 am

In this benchmark only two sockets are created per client for the whole
run, so this should not have any impact on performance.

-- 
	Evgeniy Polyakov
--

From: Eric Dumazet
Date: Thursday, October 30, 2008 - 11:43 am

tbench sets up one socket per client, then sends/receives a lot of messages on this socket.

Connection setup time can be ignored for the tbench regression analysis


--

From: Eric Dumazet
Date: Thursday, October 30, 2008 - 11:56 am

Hum, re-reading your question, I feel you might have a valid point after all :)

Not because of connection setup time, but because the rwlocks used on tcp hash table.

tcp sessions used on this tbench test might now be on the same cache lines,
because of port randomization or so.

CPUS might do cache-line ping pongs on those rwlocks.

# netstat -tn|grep 7003
tcp        0     59 127.0.0.1:37248         127.0.0.1:7003          ESTABLISHED
tcp        0     71 127.0.0.1:7003          127.0.0.1:37252         ESTABLISHED
tcp        0      0 127.0.0.1:37251         127.0.0.1:7003          ESTABLISHED
tcp        0   4155 127.0.0.1:7003          127.0.0.1:37249         ESTABLISHED
tcp        0     55 127.0.0.1:7003          127.0.0.1:37248         ESTABLISHED
tcp        0      0 127.0.0.1:37252         127.0.0.1:7003          ESTABLISHED
tcp        0      0 127.0.0.1:37249         127.0.0.1:7003          ESTABLISHED
tcp        0     59 127.0.0.1:37246         127.0.0.1:7003          ESTABLISHED
tcp        0      0 127.0.0.1:37250         127.0.0.1:7003          ESTABLISHED
tcp       71      0 127.0.0.1:37245         127.0.0.1:7003          ESTABLISHED
tcp        0      0 127.0.0.1:37244         127.0.0.1:7003          ESTABLISHED
tcp        0     87 127.0.0.1:7003          127.0.0.1:37250         ESTABLISHED
tcp        0   4155 127.0.0.1:7003          127.0.0.1:37251         ESTABLISHED
tcp        0   4155 127.0.0.1:7003          127.0.0.1:37246         ESTABLISHED
tcp        0     71 127.0.0.1:7003          127.0.0.1:37245         ESTABLISHED
tcp        0   4155 127.0.0.1:7003          127.0.0.1:37244         ESTABLISHED

We use a jhash, so normally we could expect a really random split of hash values
for all these sessions, but it would be worth to check :)

You now understand why we want to avoid those rwlocks, Stephen, and switch to RCU...

--

From: Ilpo Järvinen
Date: Thursday, October 30, 2008 - 12:01 pm

I did something with AIM9's tcp_test recently (1-2 days ago depending on 
how one calculates that, so I didn't yet have time to summarize the details 
in the AIM9 thread) by deterministically binding in userspace and got much more 
sensible numbers than with randomized ports (2-4%/5-7% vs 25% variation 
some difference in variation in different kernel versions even with 
deterministic binding). Also, I'm still to actually oprofile and bisect 
the remaining ~4% regression (around 20% was reported by Christoph). For 
oprofiling I might have to change aim9 to do predefined number of loops 
instead of a deadline to get more consistent view on changes in per func 
runtime.

AIM9 is one process only so scheduler has a bit less to do in that 
benchmark anyway.

It would probably be nice to test just the port randomizer separately to 
see if there's some regression in that but I don't expect it to happen
any time soon unless I quickly come up with something in the bisection.


-- 
 i.
--

From: David Miller
Date: Friday, October 31, 2008 - 12:52 am

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>

Yes, it looks like port selection cache and locking effects are
a very real issue.

Good find.
--

From: Ilpo Järvinen
Date: Friday, October 31, 2008 - 2:40 am

On Fri, 31 Oct 2008, David Miller wrote:

> From: "Ilpo J
From: David Miller
Date: Friday, October 31, 2008 - 2:51 am

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>

Not locks or ping-pongs perhaps, I guess.  So it just sends and
receives over a socket, implementing both ends of the communication
in the same process?

If hash chain conflicts do happen for those 2 sockets, just traversing
the chain 2 entries deep could show up.
--

From: Ilpo Järvinen
Date: Friday, October 31, 2008 - 3:42 am

On Fri, 31 Oct 2008, David Miller wrote:

> From: "Ilpo J
From: Eric Dumazet
Date: Friday, October 31, 2008 - 3:45 am

tbench is very sensitive to cache line ping-pongs (on SMP machines of course)

Just to prove my point, I coded the following patch and tried it
on a HP BL460c G1. This machine has 2 quad core cpus
(Intel(R) Xeon(R) CPU E5450  @3.00GHz)

tbench 8 went from 2240 MB/s to 2310 MB/s after this patch applied

[PATCH] net: Introduce netif_set_last_rx() helper

On SMP machines, the loopback device (and possibly other net devices)
should try to avoid dirtying the memory cache line containing the "last_rx"
field. Got a 3% increase on tbench on an 8 cpu machine.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 drivers/net/loopback.c    |    2 +-
 include/linux/netdevice.h |   16 ++++++++++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)
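
The body of the patch is not reproduced in this archive. Purely as a
guess at the idea described in the changelog, a helper of this kind
could skip the store when the timestamp has not changed, so the cache
line is only dirtied once per tick. The struct and function names below
are invented for a userspace sketch and are not the posted code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the one field of struct net_device that matters here. */
struct fake_net_device {
	unsigned long last_rx;
};

/*
 * Hypothetical helper: write last_rx only when the value changes.
 * Returns 1 if a store happened, 0 if the cache line was left clean.
 * 'now' plays the role of jiffies in this sketch.
 */
static int fake_set_last_rx(struct fake_net_device *dev, unsigned long now)
{
	assert(dev != NULL);

	if (dev->last_rx == now)
		return 0;	/* read-only access, line can stay shared */

	dev->last_rx = now;
	return 1;
}
```

With many packets arriving inside the same tick, all but the first call
per tick would take the read-only path.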

From: Ilpo Järvinen
Date: Friday, October 31, 2008 - 4:01 am

On Fri, 31 Oct 2008, Eric Dumazet wrote:

> David Miller a 
From: Eric Dumazet
Date: Friday, October 31, 2008 - 4:10 am

Well, before you added AIM9 on this topic, we were focusing on tbench :)

Sorry to disappoint you :)

--

From: Ilpo Järvinen
Date: Friday, October 31, 2008 - 4:15 am

On Fri, 31 Oct 2008, Eric Dumazet wrote:

> Ilpo J
From: Stephen Hemminger
Date: Friday, October 31, 2008 - 12:57 pm

On Fri, 31 Oct 2008 11:45:33 +0100

Why bother with last_rx at all on loopback?  I have been thinking
we should figure out a way to get rid of last_rx altogether. It only
seems to be used by bonding, and the bonding driver could do the calculation
in its receive handling.
--

From: Evgeniy Polyakov
Date: Friday, October 31, 2008 - 1:10 pm

Not related to the regression: the bug will just be papered over by these
changes. Having bonding on loopback is a somewhat strange idea, but still
this kind of change is an attempt to make a good play in a bad game:
this loopback-only optimization does not fix the problem.

-- 
	Evgeniy Polyakov
--

From: Eric Dumazet
Date: Friday, October 31, 2008 - 2:03 pm

Just to be clear, this change was not meant to be committed.
It already was rejected by David some years ago (2005, and 2006)

http://www.mail-archive.com/netdev@vger.kernel.org/msg07382.html

If you read my mail, I was *only* saying that tbench results can be sensitive to
cache line ping pongs. tbench is a crazy benchmark, and is only a crazy benchmark.

Optimizing linux for tbench sake would be .... crazy ?

--

From: Evgeniy Polyakov
Date: Friday, October 31, 2008 - 2:18 pm

Hi Eric.


No problem Eric, I just pointed out that this particular case is rather
fluffy and really does not fix anything. It improves the case, but
the way it does it is not the right one imho.
We would definitely want to eliminate assignment of global, constantly
updated variables in the paths where it is not required, but in a
way which improves the design and implementation, not one that
hides some other problem.

Tbench is, well, as it is, a quite usual network server :)
The dbench side is rather non-optimized, but still it is a quite common
pattern of small-sized IO. Anyway, optimizing for one kind of
workload tends to force the other side to become slower, so I agree of
course that any narrow-viewed optimizations are bad, and instead we
should focus on searching for more widespread error patterns.

-- 
	Evgeniy Polyakov
--

From: David Miller
Date: Friday, October 31, 2008 - 4:51 pm

From: Eric Dumazet <dada1@cosmosbay.com>

However, I do like Stephen's suggestion that maybe we can get rid of
this ->last_rx thing by encapsulating the logic completely in the

Unlike dbench I think tbench is worth cranking up as much as possible.

It doesn't have a huge memory working set, it just writes mostly small
messages over a TCP socket back and forth, and does a lot of blocking

And I think we'd like all of those operations to run as fast as possible.

When Tridge first wrote tbench I would see the expected things at the
top of the profiles.  Things like tcp_ack(), copy to/from user, and
perhaps SLAB.

Things have changed considerably.

--

From: Stephen Hemminger
Date: Friday, October 31, 2008 - 4:56 pm

On Fri, 31 Oct 2008 16:51:44 -0700 (PDT)

Since bonding driver doesn't actually see the rx packets, that isn't
really possible.  But it would be possible to change last_rx from a variable
to a function pointer, so that devices could apply other logic to derive
the last value.  One example would be to keep it per cpu and then take the
maximum.
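
A minimal userspace sketch of the "per cpu, then take the maximum" idea
follows. Everything in it is hypothetical (fixed CPU count, plain array,
invented names). Real kernel code would use per-cpu variables padded to
separate cache lines and would have to handle jiffies wraparound, which
this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 8	/* assumption: fixed, small CPU count */

/* One timestamp slot per CPU; each CPU writes only its own slot. */
struct percpu_last_rx {
	unsigned long stamp[SKETCH_NR_CPUS];
};

/* Receive path: record the time on the local CPU's slot only. */
static void note_rx(struct percpu_last_rx *p, size_t cpu, unsigned long now)
{
	assert(p != NULL && cpu < SKETCH_NR_CPUS);
	p->stamp[cpu] = now;
}

/* Slow path (e.g. a monitor): derive last_rx as the newest slot. */
static unsigned long read_last_rx(const struct percpu_last_rx *p)
{
	unsigned long newest = 0;

	assert(p != NULL);
	for (size_t cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		if (p->stamp[cpu] > newest)
			newest = p->stamp[cpu];

	return newest;
}
```

The trade is a cheap, local write on every packet against a scan over
all CPUs on the rare read.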
--

From: Jay Vosburgh
Date: Friday, October 31, 2008 - 5:16 pm

I suspect it could also be tucked away in skb_bond_should_drop,
which is called both by the standard input path and the VLAN accelerated
path to see if the packet should be tossed (e.g., it arrived on an
inactive bonding slave).

	Since last_rx is part of struct net_device, I don't think any
additional bonding internals knowledge would be needed.  It could be
arranged to only update last_rx for devices that are actually bonding
slaves.

	Just off the top of my head (haven't tested this), something
like this:

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c8bcb59..ed1e58f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1743,22 +1743,24 @@ static inline int skb_bond_should_drop(struct sk_buff *skb)
 	struct net_device *dev = skb->dev;
 	struct net_device *master = dev->master;
 
-	if (master &&
-	    (dev->priv_flags & IFF_SLAVE_INACTIVE)) {
-		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
-		    skb->protocol == __constant_htons(ETH_P_ARP))
-			return 0;
-
-		if (master->priv_flags & IFF_MASTER_ALB) {
-			if (skb->pkt_type != PACKET_BROADCAST &&
-			    skb->pkt_type != PACKET_MULTICAST)
+	if (master) {
+		dev->last_rx = jiffies;
+		if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
+			if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
+			    skb->protocol == __constant_htons(ETH_P_ARP))
 				return 0;
-		}
-		if (master->priv_flags & IFF_MASTER_8023AD &&
-		    skb->protocol == __constant_htons(ETH_P_SLOW))
-			return 0;
 
-		return 1;
+			if (master->priv_flags & IFF_MASTER_ALB) {
+				if (skb->pkt_type != PACKET_BROADCAST &&
+				    skb->pkt_type != PACKET_MULTICAST)
+					return 0;
+			}
+			if (master->priv_flags & IFF_MASTER_8023AD &&
+			    skb->protocol == __constant_htons(ETH_P_SLOW))
+				return 0;
+
+			return 1;
+		}
 	}
 	return 0;
 }


	That doesn't move the storage out of struct net_device, but it
does stop the updates for devices that aren't bonding slaves.  It could
probably be refined ...
From: David Miller
Date: Saturday, November 1, 2008 - 9:40 pm

From: Jay Vosburgh <fubar@us.ibm.com>

I like this very much.

Jay can you give this a quick test by just trying this patch
and removing the ->last_rx setting in the driver you use for
your test?

Once you do that, I'll apply this to net-next-2.6 and do the
leg work to zap all of the ->last_rx updates from the entire tree.

Thanks!
--

From: Jay Vosburgh
Date: Monday, November 3, 2008 - 7:13 pm

The only user of the net_device->last_rx field is bonding.  This
patch adds a conditional update of last_rx to the bonding special logic
in skb_bond_should_drop, causing last_rx to only be updated when the ARP
monitor is running.

	This frees network device drivers from the necessity of updating
last_rx, which can have cache line thrash issues.

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56c823c..39575d7 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4564,6 +4564,8 @@ static int bond_init(struct net_device *bond_dev, struct bond_params *params)
 	bond_dev->tx_queue_len = 0;
 	bond_dev->flags |= IFF_MASTER|IFF_MULTICAST;
 	bond_dev->priv_flags |= IFF_BONDING;
+	if (bond->params.arp_interval)
+		bond_dev->priv_flags |= IFF_MASTER_ARPMON;
 
 	/* At first, we block adding VLANs. That's the only way to
 	 * prevent problems that occur when adding VLANs over an
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 296a865..e400d7d 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -620,6 +620,8 @@ static ssize_t bonding_store_arp_interval(struct device *d,
 	       ": %s: Setting ARP monitoring interval to %d.\n",
 	       bond->dev->name, new_value);
 	bond->params.arp_interval = new_value;
+	if (bond->params.arp_interval)
+		bond->dev->priv_flags |= IFF_MASTER_ARPMON;
 	if (bond->params.miimon) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: ARP monitoring cannot be used with MII monitoring. "
@@ -1039,6 +1041,7 @@ static ssize_t bonding_store_miimon(struct device *d,
 			       "ARP monitoring. Disabling ARP monitoring...\n",
 			       bond->dev->name);
 			bond->params.arp_interval = 0;
+			bond->dev->priv_flags &= ~IFF_MASTER_ARPMON;
 			if (bond->params.arp_validate) {
 				bond_unregister_arp(bond);
 				bond->params.arp_validate =
diff --git a/include/linux/if.h ...
From: David Miller
Date: Monday, November 3, 2008 - 7:17 pm

From: Jay Vosburgh <fubar@us.ibm.com>

Applied, thanks a lot Jay.
--

From: Mike Galbraith
Date: Friday, October 10, 2008 - 3:13 am

a7be37a adds some math overhead, calls to calc_delta_mine() per
wakeup/context switch for all weight tasks, whereas previously these
calls were only made for tasks which were not nice 0.  It also shifts
performance a bit in favor of loads which dislike wakeup preemption,
this effect lessens as task count increases.  Per testing, overhead is
not the primary factor in throughput loss.  I believe clock accuracy to
be a more important factor than overhead by a very large margin.

Reverting a7be37a (and the two asym fixes) didn't do a whole lot for me
either.  I'm still ~8% down from 2.6.26 for netperf, and ~3% for tbench,
and the 2.6.26 numbers are gcc-4.1, which are a little lower than
gcc-4.3.  Along the way, I've reverted 100% of scheduler and ilk 26->27
and been unable to recover throughput.  (Too bad I didn't know about
that TSO/GSO thingy, would have been nice.)

I can achieve nearly the same improvement for tbench with a little
tinker, and _more_ for netperf than reverting these changes delivers,
see last log entry, experiment cut math overhead by less than 1/3.

For the full cfs history, even with those three reverts, I'm ~6% down on

I have highres timers disabled in my kernels because per testing it does
cost a lot at high frequency, but primarily because it's not available
throughout test group, same for nohz.  A patchlet went into 2.6.27 to
neutralize the cost of hrtick when it's not active.  Per re-test,

I lost some at 24, got it back at 25 etc.  Some of it is fairness /
preemption differences, but there's a bunch I can't find, and massive
amounts of time spent bisecting were a waste of time.

My annotated test log.  File under fwiw.

Note:  2.6.23 cfs was apparently a bad-hair day for high frequency
switchers.  Anyone entering the way-back-machine to test 2.6.23, should
probably use cfs-24.1, which is 2.6.24 scheduler minus on zero impact
for nice 0 loads line.

-------------------------------------------------------------------------
UP config, no nohz ...
From: Evgeniy Polyakov
Date: Saturday, October 11, 2008 - 6:13 am

Hi Mike.



In my tests it was not just overhead, it was a disaster.
And stopping just before this commit gained 20 MB/s out of the 30 MB/s loss
for the 26-27 window. No matter what accuracy it brings, it is just wrong
to assume that such a performance drop in some workloads is justified.


Well, yes, disabling it should bring performance back, but since they
are actually enabled everywhere and trick with debugfs is not widely

Yup, but since I slacked with bits of beer after POHMELFS release I did

UP actually may explain the difference in our results: I have 4-way (2
physical and 2 logical (HT enabled) CPUs) 32-bit old Xeons with highmem
enabled. I also tried low-latency preemption and no preemption (server)
without much difference.

-- 
	Evgeniy Polyakov
--

From: Peter Zijlstra
Date: Saturday, October 11, 2008 - 7:39 am

a7be37a's purpose is for group scheduling, where it provides a means to
calculate things in a uniform metric.

If you take the following scenario:

    R
   /|\
  A 1 B
 /|\  |
2 3 4 5

Where letters denote supertasks/groups and digits are tasks.

We used to look at a single level only, so if you want to compute a
task's ideal runtime, you'd take:

  runtime_i = period w_i / \Sum_i w_i

So, in the above example, assuming all entries have an equal weight,
we'd want to run A for p/3. But then we'd also want to run 2 for p/3.
IOW. all of A's tasks would run in p time.

Which is contrary to the expectation that all tasks in the scenario
would run in p.

So what the patch does is change the calculation to:

  period \Prod_l w_l,i / \Sum_i w_l,i

Which would, for 2 end up being: p 1/3 1/3 = p/9.
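
To make the arithmetic concrete, here is a small userspace sketch of the
product formula above. It is an illustration under the stated
equal-weight assumption, not scheduler code; the struct and function
names are invented, and floating point is used only for readability.

```c
#include <assert.h>
#include <stddef.h>

/* One level on the path from the root to a task: the entity's own
 * weight and the sum of weights of all entities queued at that level. */
struct level {
	double weight;
	double total;
};

/* runtime = period * product over levels of (weight / total) */
static double ideal_runtime(double period, const struct level *path,
			    size_t depth)
{
	double slice = period;

	for (size_t l = 0; l < depth; l++) {
		assert(path[l].total > 0.0);
		slice *= path[l].weight / path[l].total;
	}
	return slice;
}
```

For task 2 the path is A within R (1 of 3) and then 2 within A (1 of 3),
giving p/9. For task 5 it is B within R (1 of 3) and then 5 within B
(1 of 1), giving p/3. Summed over tasks 1 through 5 the slices come to
3 * p/9 + p/3 + p/3 = p, which is the expectation the single-level
calculation violated.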

Now the thing that causes the extra math in the !group case is that for
the single level case, we can avoid doing that division by the sum,
because that is equal for all tasks (we then compensate for it at some
other place).

However, for the nested case, we cannot do that.

That said, we can probably still avoid the division for the top level
stuff, because the sum of the top level weights is still invariant
between all tasks.

I'll have a stab at doing so... I initially didn't do this because my
first try gave some real ugly code, but we'll see - these numbers are a
very convincing reason to try again.

--

From: Mike Galbraith
Date: Saturday, October 11, 2008 - 11:13 am

...but the numbers I get on Q6600 don't pin the tail on the math donkey.

Update to UP test log.

2.6.27-final-up
ring-test   - 1.193 us/cycle  = 838 KHz  (gcc-4.3)
tbench      - 337.377 MB/sec           tso/gso on
tbench      - 340.362 MB/sec           tso/gso off
netperf     - 120751.30 rr/s           tso/gso on
netperf     - 121293.48 rr/s           tso/gso off

2.6.27-final-up
patches/revert_weight_and_asym_stuff.diff
ring-test   - 1.133 us/cycle  = 882 KHz  (gcc-4.3)
tbench      - 340.481 MB/sec           tso/gso on
tbench      - 343.472 MB/sec           tso/gso off
netperf     - 119486.14 rr/s           tso/gso on
netperf     - 121035.56 rr/s           tso/gso off

2.6.28-up
ring-test   - 1.149 us/cycle  = 870 KHz  (gcc-4.3)
tbench      - 343.681 MB/sec           tso/gso off
netperf     - 122812.54 rr/s           tso/gso off

My SMP log, updated to account for TSO/GSO monkey-wrench.

(<bleep> truckload of time <bleep> wasted chasing unbisectable
<bleepity-bleep> tso gizmo. <bleep!>)

SMP config, same as UP kernels tested, except SMP.

tbench -t 60 4 localhost followed by four 60 sec netperf
TCP_RR pairs, each pair on its own core of my Q6600.

2.6.22.19

Throughput 1250.73 MB/sec 4 procs                  1.00

16384  87380  1        1       60.01    111272.55  1.00
16384  87380  1        1       60.00    104689.58
16384  87380  1        1       60.00    110733.05
16384  87380  1        1       60.00    110748.88

2.6.22.19-cfs-v24.1

Throughput 1213.21 MB/sec 4 procs                  .970

16384  87380  1        1       60.01    108569.27  .992
16384  87380  1        1       60.01    108541.04
16384  87380  1        1       60.00    108579.63
16384  87380  1        1       60.01    108519.09

2.6.23.17

Throughput 1200.46 MB/sec 4 procs                  .959

16384  87380  1        1       60.01    95987.66   .866
16384  87380  1        1       60.01    92819.98
16384  87380  1        1       60.01    95454.00
16384  87380  1        1       ...
From: Mike Galbraith
Date: Saturday, October 11, 2008 - 11:02 pm

Since I showed the rest of my numbers, I may as well show freshly
generated oltp numbers too.  Chart attached.  2.6.27.rev is 2.6.27 with
weight/asym changes reverted.

Data:

read/write requests/sec per client count
                        1      2      4      8     16     32     64    128    256
2.6.26.6.mysql       7978  19856  37238  36652  34399  33054  31608  27983  23411
2.6.27.mysql         9618  18329  37128  36504  33590  31846  30719  27685  21299
2.6.27.rev.mysql    10944  19544  37349  36582  33793  31744  29161  25719  21026
2.6.28.git.mysql     9518  18031  30418  33571  33330  32797  31353  29139  25793

2.6.26.6.pgsql      14165  27516  53883  53679  51960  49694  44377  35361  32879
2.6.27.pgsql        14146  27519  53797  53739  52850  47633  39976  30552  28741
2.6.27.rev.pgsql    14168  27561  53973  54043  53150  47900  39906  31987  28034
2.6.28.git.pgsql    14404  28318  55124  55010  55002  54890  53745  53519  52215

From: Mike Galbraith
Date: Saturday, October 11, 2008 - 11:33 pm

P.S.  all knobs stock, TSO/GSO off.

--
