Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

To: lkml <linux-kernel@...>
Date: Wednesday, May 21, 2008 - 1:34 pm

PostgreSQL ships with a simple database benchmarking tool named pgbench,
in what's labeled the contrib section (in many distributions it's a
separate package from the main server/client ones). I see there's been
some work done already improving how the PostgreSQL server works under the
new scheduler (the "Poor PostgreSQL scaling on Linux 2.6.25-rc5" thread).
I wanted to provide you a different test case using pgbench that has taken
a sharp dive starting with 2.6.23, and the server improvement changes in
2.6.25 actually made this problem worse.

I think it will be easy for someone else to replicate my results and I'll
go over the exact procedure below. To start with a view of how bad the
regression is, here's a summary of the results on one system, an AMD X2
4600+ running at 2.4GHz, with a few interesting kernels. I threw in
results from Solaris 10 on this system as a nice independent reference
point. The numbers here are transactions/second (TPS) from a simple
read-only test over a 160MB data set; I took the median of 3 test runs:

Clients   2.6.9  2.6.22  2.6.24  2.6.25  Solaris
      1   11173   11052   10526   10700     9656
      2   18035   16352   14447   10370    14518
      3   19365   15414   17784    9403    14062
      4   18975   14290   16832    8882    14568
      5   18652   14211   16356    8527    15062
      6   17830   13291   16763    9473    15314
      8   15837   12374   15343    9093    15164
     10   14829   11218   10732    9057    14967
     15   14053   11116    7460    7113    13944
     20   13713   11412    7171    7017    13357
     30   13454   11191    7049    6896    12987
     40   13103   11062    7001    6820    12871
     50   12311   11255    6915    6797    12858

That's the CentOS 4 2.6.9 kernel there, while the rest are stock ones I
compiled with a minimum of fiddling from the defaults (just adding support
for my SATA RAID card). You can see a major drop with the recent kernels
at high client loads, and the changes in 2.6.25 seem to have really hurt
even the low client count ones.
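
To put numbers on the drop described above, the relative change between two columns of the table can be computed per client count. The following is only an illustrative sketch: the helper name `pct_change` and the dictionary layout are made up for this example, and only the 2.6.22 and 2.6.25 columns from the table are used.

```python
# TPS medians copied from the table above (AMD X2 4600+), keyed by client count.
TPS_2_6_22 = {1: 11052, 2: 16352, 3: 15414, 4: 14290, 5: 14211, 6: 13291,
              8: 12374, 10: 11218, 15: 11116, 20: 11412, 30: 11191,
              40: 11062, 50: 11255}
TPS_2_6_25 = {1: 10700, 2: 10370, 3: 9403, 4: 8882, 5: 8527, 6: 9473,
              8: 9093, 10: 9057, 15: 7113, 20: 7017, 30: 6896,
              40: 6820, 50: 6797}


def pct_change(old, new):
    """Relative change from old to new, in percent (negative = slower)."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    for clients in sorted(TPS_2_6_22):
        change = pct_change(TPS_2_6_22[clients], TPS_2_6_25[clients])
        print(f"{clients:>3} clients: {change:+6.1f}%")
```

With these figures 2.6.25 is roughly 3% slower than 2.6.22 at one client and roughly 40% slower at 50 clients.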

The other recent hardware I have here, an Intel Q6600 based system, gives
even more maddening results. On successive benchmark runs, you can watch
it break...

To: Greg Smith <gsmith@...>
Cc: lkml <linux-kernel@...>, Peter Zijlstra <peterz@...>, Ingo Molnar <mingo@...>
Date: Thursday, May 22, 2008 - 3:10 am

Yup, I can reproduce. Running the test with 2.6.25.4, everything is
waking/running on one CPU, leaving my box 75% idle. Not good.

-Mike

--

To: Mike Galbraith <efault@...>
Cc: Greg Smith <gsmith@...>, lkml <linux-kernel@...>, Peter Zijlstra <peterz@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Thursday, May 22, 2008 - 4:28 am

Can you try with 2.6.26-rc? There is minimal load balancing for group
scheduling till 25, which might explain the lack of scalability.

--
regards,
Dhaval
--

To: Dhaval Giani <dhaval@...>
Cc: Greg Smith <gsmith@...>, lkml <linux-kernel@...>, Peter Zijlstra <peterz@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Thursday, May 22, 2008 - 5:05 am

I'm playing with it now, it's tweakable with migration cost. This
testcase is funky. It can't generate enough work to keep CPUs busy for
spit, and can't saturate my little quad with any kernel I've tried.

-Mike

--

To: Dhaval Giani <dhaval@...>
Cc: Greg Smith <gsmith@...>, lkml <linux-kernel@...>, Peter Zijlstra <peterz@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Thursday, May 22, 2008 - 6:34 am

Heh, watch this. No tweaking.

(Nadia's ipc/idr patches are applied though, to see if the high end
improves over previous runs with various kernels, and it does seem to.)

2.6.26-smp x86_64
1 10014.774797
2 9791.395302
3 10575.369296
4 9763.183251
5 10160.274262
6 9893.174179
8 9566.978464
10 10294.456456
15 9444.100540
20 9137.878618
30 8277.795499
40 7925.824428
50 7646.644285

nail postgres to CPUs1-3
nail pgbench to CPU0

2.6.26-smp x86_64
1 10900.959982
2 15976.870604
3 24661.322669
4 25347.141780
5 25893.815676
6 26756.414839
8 25399.018582
10 26172.878669
15 25542.082746
20 25090.381828
30 24270.301103
40 23405.867336
50 21926.223083
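
The message does not say how the processes were nailed down (presumably with taskset). As a rough sketch of the same idea, the snippet below pins a list of PIDs to a CPU set using Python's `os.sched_setaffinity`, which is available on Linux. The helper names `parse_cpu_list` and `pin` are hypothetical, and the PIDs would have to be looked up separately.

```python
import os


def parse_cpu_list(spec):
    """Turn a taskset-style list such as "1-3" or "0,2-3" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin(pids, spec):
    """Restrict each pid in pids to the CPUs named by spec (Linux only)."""
    cpus = parse_cpu_list(spec)
    for pid in pids:
        os.sched_setaffinity(pid, cpus)
```

For the split used here, that would be something like `pin(postgres_pids, "1-3")` followed by `pin([pgbench_pid], "0")`, where both PID variables are placeholders.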

--

To: Dhaval Giani <dhaval@...>
Cc: Greg Smith <gsmith@...>, lkml <linux-kernel@...>, Peter Zijlstra <peterz@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Thursday, May 22, 2008 - 7:25 am

Disregard the above, no they don't. (now removed again)

However. The problem with 2.6.26.git running this testcase appears to
be SYNC_WAKEUPS. No taskset, nada except echo 863 > sched_features

2.6.26.git
1 8173.538610
2 15738.206889
3 23399.356839
4 21401.182501
5 21682.839897
6 26396.301413
8 29910.334798
10 29953.625797
15 29535.740343
20 28950.900431
30 27159.733949
40 24163.344207
50 23258.496794

vs

2.6.22.17-0.1-default (opensuse 10.3 stock kernel)
1 7693.501369
2 15669.304960
3 25340.818410
4 24445.932930
5 22807.019544
6 24051.387364
8 22406.392813
10 22631.510576
15 21225.243584
20 20382.232075
30 18834.814588
40 17799.906622
50 17305.274561
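
The value 863 written to sched_features for the 2.6.26.git run can be compared bit by bit with the default of 895 that Greg quotes later in the thread. This sketch assumes Mike's tree had that same default; the function name `changed_bits` is made up for the example.

```python
DEFAULT_FEATURES = 895   # default quoted later in the thread (binary 1101111111)
TESTED_FEATURES = 863    # value written to sched_features for the run above


def changed_bits(old, new):
    """Return the bit positions whose value differs between two masks."""
    diff = old ^ new
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


if __name__ == "__main__":
    print(format(DEFAULT_FEATURES, "010b"))
    print(format(TESTED_FEATURES, "010b"))
    print(changed_bits(DEFAULT_FEATURES, TESTED_FEATURES))
```

Exactly one bit differs (position 5, value 32), and it is cleared in 863. That matches the message's statement that a single feature, SYNC_WAKEUPS, was switched off. Which bit position carries which name depends on the kernel tree.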

--

To: Mike Galbraith <efault@...>
Cc: Dhaval Giani <dhaval@...>, Greg Smith <gsmith@...>, lkml <linux-kernel@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Thursday, May 22, 2008 - 7:44 am

Makes sense - I took a look at pgbench.c (and only thereafter took the
time to find the initial mail on lkml where Greg rather nicely explained
its workings) - the thing with sync wakeups is that they try to pull
tasks together, but as this one task (pgbench) serves a number of
postgresql server tasks it will cluster everything.

Humm,.. how to fix this.. we'd need to somehow detect the 1:n nature of
its operation - I'm sure there are other scenarios that could benefit
from this.

--

To: Peter Zijlstra <peterz@...>
Cc: Dhaval Giani <dhaval@...>, Greg Smith <gsmith@...>, lkml <linux-kernel@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Thursday, May 22, 2008 - 8:09 am

Maybe simple (minded): cache waker's last non-interrupt context wakee,
if the wakee != cached, ignore SYNC_WAKEUP unless sync was requested at
call time?
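
A toy model of that suggestion, written as plain Python rather than kernel code, may make the intent clearer. Everything here (the `Waker` class, `treat_as_sync`, the task labels) is hypothetical and only mirrors the rule as stated: remember the last wakee, and honour sync treatment only if the same task is woken again or the caller asked for sync explicitly.

```python
class Waker:
    """Toy stand-in for a task that wakes others; not kernel code."""

    def __init__(self):
        self.last_wakee = None  # most recent task this waker woke

    def treat_as_sync(self, wakee, explicit_sync=False):
        """Decide whether a wakeup should get sync (pull-together) treatment."""
        repeated = wakee == self.last_wakee
        self.last_wakee = wakee
        return explicit_sync or repeated


def sync_decisions(wakees):
    """Run a sequence of wakeups through one waker and collect the decisions."""
    waker = Waker()
    return [waker.treat_as_sync(w) for w in wakees]
```

A 1:N waker cycling through backends, as in `sync_decisions(["a", "b", "c", "a", "b", "c"])`, never hits the cache and so never gets sync treatment, while a 1:1 pair such as `sync_decisions(["a", "a", "a"])` does from the second wakeup on.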

-Mike

--

To: Mike Galbraith <efault@...>
Cc: Dhaval Giani <dhaval@...>, Greg Smith <gsmith@...>, lkml <linux-kernel@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Thursday, May 22, 2008 - 8:24 am

Yeah, something like so - or perhaps like you say cache the wakee.

I picked the wake_affine() condition, because I think that is the
biggest factor in this behaviour. You could of course also disable all
of sync.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c86c5c5..856c2a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -950,6 +950,8 @@ struct sched_entity {
 	u64 last_wakeup;
 	u64 avg_overlap;

+	struct sched_entity *waker;
+
 #ifdef CONFIG_SCHEDSTATS
 	u64 wait_start;
 	u64 wait_max;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 894a702..8971044 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1036,7 +1036,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && curr->sched_class == &fair_sched_class &&
+	    p->se.waker == curr->se->waker) {
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 		    p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1210,6 +1211,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
 	if (unlikely(se == pse))
 		return;

+	se->waker = pse;
 	cfs_rq_of(pse)->next = pse;

 	/*

--

To: Peter Zijlstra <peterz@...>
Cc: Mike Galbraith <efault@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, May 23, 2008 - 3:13 am

I tested out Peter's patch (updated version against -rc3 with a typo fix
from Mike below) and it's a big step in the right direction. Here are
updated results from my benchmark script, adding 2.6.26-rc3 and that rev
with this patch applied:

Clients  2.6.22  2.6.24  2.6.25    -rc3   patch
      1   11052   10526   10700   10193   10439
      2   16352   14447   10370    9817   13289
      3   15414   17784    9403    9428   13678
      4   14290   16832    8882    9533   13033
      5   14211   16356    8527    9558   12790
      6   13291   16763    9473    9367   12660
      8   12374   15343    9093    9159   12357
     10   11218   10732    9057    8711   11839
     15   11116    7460    7113    7620   11267
     20   11412    7171    7017    7707   10531
     30   11191    7049    6896    7195    9766
     40   11062    7001    6820    7079    9668
     50   11255    6915    6797    7202    9588

Exact versions I tested because I think it may start mattering now:
2.6.22.19, 2.6.24.3, 2.6.25. I didn't save 2.6.23 results but recall them
being similar to 2.6.24.

On this dual-core system, without this patch there's an average 33%
regression in -rc3 compared to 2.6.22. With it that's dropped to 8%; some
cases (around 10 clients) even improve a touch (that's close enough to the
margin of error here that I wouldn't conclude too much from it). The big
jump in high client count cases is the first I've seen since CFS was
introduced. It seems a bit odd to me that there's still such a large
regression in the 2-8 client cases compared with not only 2.6.22 but
2.6.24, which owned this benchmark in that area.

With this feedback, any ideas on where to go next? There seems to be
some room for improvement still left here.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5395a61..e160f71 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -965,6 +965,8 @@ struct sched_entity {
 	u64 last_wakeup;
 	u64 avg_overlap;

+	struct sched_entity *waker;
+
 #ifdef CONFIG_SCHEDSTATS
 	u64 wait_start;
 	u64 wait_max;
diff --git a/kernel/sched_fair....

To: Greg Smith <gsmith@...>
Cc: Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, May 23, 2008 - 6:00 am

Dunno. This load is very highly tweakable, and doesn't seem to like
preemption much at all. You can see below what preemption is doing to
2.6.22.18 by looking at the batch numbers. SCHED_BATCH turns the O(1)
scheduler into a pathetic little round-robin scheduler, and this load
loves pathetic :-) After seeing the batch numbers, I tweaked .git to
make it as round-robin as I could.

My take on the numbers is that both kernels preempt too frequently for
_this_ load.. but what to do, many many loads desperately need
preemption to perform.

Clients     2.6.22.18  2.6.22.18-batch    2.6.26.git  2.6.26.git.batch
      1   7487.115236      7643.563512   9999.400036       9915.823582
      2  17074.869889     15360.150210  14042.644140      14958.375329
      3  25073.139078     24802.446538  15621.206938      25047.032536
      4  24236.413612     26126.482482  16436.055117      25007.183313
      5  26367.198572     28298.293443  19926.550734      27853.081679
      6  24695.827843     30786.651975  22375.916107      28119.474302
      8  21020.949689     31973.674156  25825.292413      31070.664011
     10  22792.204610     31775.164023  26754.471274      31596.415197
     15  21202.173186     30388.559630  28711.761083      30963.050265
     20  21204.041830     29317.044783  28512.269685      30127.614550
     30  18519.965964     27252.739106  26682.613791      28185.244056
     40  17936.447579     25670.803773  24964.936746      26282.369366
     50  16247.605712     25089.154310  21078.604858      25356.750461

-Mike

--

To: Mike Galbraith <efault@...>
Cc: Greg Smith <gsmith@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, May 23, 2008 - 6:10 am

was 2.6.26.git.batch running the load with SCHED_BATCH, or did you do
other tweaks as well?

if it's other tweaks as well then could you perhaps try to make
SCHED_BATCH batch more aggressively?

I.e. i think it's a perfectly fine answer to say "if your workload needs
batch scheduling, run it under SCHED_BATCH".

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Greg Smith <gsmith@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, May 23, 2008 - 9:05 am

Running SCHED_BATCH with only the below put a large dent in the problem.

You can have tl <= current->se.load.weight. Nothing good happens in
either case, at least with this load.

--- kernel/sched_fair.c.org 2008-05-23 14:59:39.000000000 +0200
+++ kernel/sched_fair.c 2008-05-23 14:49:05.000000000 +0200
@@ -1081,7 +1081,7 @@ wake_affine(struct rq *rq, struct sched_
 	 * effect of the currently running task from the load
 	 * of the current CPU:
 	 */
-	if (sync)
+	if (sync && tl > current->se.load.weight)
 		tl -= current->se.load.weight;

 	if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||

2.6.26-smp x86_64
1 9209.503213
2 15792.406916
3 23369.199181
4 23140.108032
5 24556.515470
6 24926.457776
8 26896.607558
10 27350.988396
15 29005.426298
20 28558.267290
30 27002.328374
40 25809.202374
50 24589.478654
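
One way to read the remark that tl can be less than or equal to current->se.load.weight: tl is an unsigned long in wake_affine() (its declaration is visible in a later patch in this thread), so an unguarded subtraction wraps around instead of going negative. The sketch below emulates that in Python with a 64-bit mask, which assumes an x86_64 unsigned long. The function names and the sample weight of 1024 are illustrative only.

```python
ULONG_MASK = (1 << 64) - 1  # assumes a 64-bit unsigned long, as on x86_64


def sub_unsigned(a, b):
    """Subtract the way C does for unsigned long: wrap modulo 2**64."""
    return (a - b) & ULONG_MASK


def adjusted_load_unguarded(tl, curr_weight, sync):
    """Original code path: always subtract on a sync wakeup."""
    return sub_unsigned(tl, curr_weight) if sync else tl


def adjusted_load_guarded(tl, curr_weight, sync):
    """Patched code path: only subtract when tl is larger than the weight."""
    if sync and tl > curr_weight:
        return sub_unsigned(tl, curr_weight)
    return tl
```

With tl = 512 and a weight of 1024, the unguarded version yields 2**64 - 512, a huge apparent load, while the guarded version leaves tl at 512. With tl equal to the weight, the unguarded version yields 0 and the guarded one leaves tl unchanged.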

--

To: Ingo Molnar <mingo@...>
Cc: Greg Smith <gsmith@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, May 23, 2008 - 9:35 am

And without SCHED_BATCH

2.6.26-smp x86_64
1 8417.511252
2 15559.741472
3 23417.911087
4 21982.631084
5 24212.518114
6 21870.640050
8 25178.186022
10 27350.449792
15 27958.758943
20 28011.989131
30 26668.779045
40 24871.625107
50 23687.757456

So the primary low end problem is sync affine wakeups, it seems.

-Mike

--

To: Ingo Molnar <mingo@...>
Cc: Greg Smith <gsmith@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, May 23, 2008 - 6:15 am

That's what I was thinking, because it needed features=0 as well to

Yes, and this appears to be such a case.

-Mike

--

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, May 23, 2008 - 7:18 pm

I figured out how to run pgbench with chrt in order to get SCHED_BATCH
behavior, but I don't understand what you mean by features=0 here. Since
I didn't see the same magnitude of difference just using batch, that seems
important; where does that get set?

I'm also curious what hardware your results are coming from, to fit them
into my larger pgbench results context space.

Got my 4-core system back on-line again today (found some bad RAM) and
wanted to try another round of tests on that. Looks like you've defined 5
test sets I should replicate:

2.6.22
2.6.22, batch
2.6.26.git
2.6.26.git, batch
2.6.26.git, batch + se.load.weight patch

Should I still be trying Peter's se.waker patch as well in this mix
somewhere?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
--

To: Greg Smith <gsmith@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, May 23, 2008 - 7:46 pm

/proc/sys/kernel/sched_features. You need CONFIG_SCHED_DEBUG to have

Yeah.

-Mike

--

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, May 26, 2008 - 8:28 pm

After spending a whole day testing various scheduler options, I've got a
pretty good idea how possible improvements here might map out. Let's
start with Mike's results (slightly reformatted), from his "grocery store
Q6600 box" similar to the one my results in this message come from:

Clients  .22.18  .22.18b  .26.git  .26.git.batch
      1    7487     7644     9999           9916
      2   17075    15360    14043          14958
      3   25073    24802    15621          25047
      4   24236    26126    16436          25007
      5   26367    28298    19927          27853
      6   24696    30787    22376          28119
      8   21021    31974    25825          31071
     10   22792    31775    26754          31596
     15   21202    30389    28712          30963
     20   21204    29317    28512          30128
     30   18520    27253    26683          28185
     40   17936    25671    24965          26282
     50   16248    25089    21079          25357

I couldn't replicate that batch mode improvement in 2.6.22 or 2.6.26.git,
so I asked Mike for some clarification about how he did the batch testing

Which explains the difference: I was just running pgbench as "chrt -b cmd
pgbench ..." which doesn't help at all. I am uncomfortable with the idea
of running the database server itself as a batch process. While it may be
effective for optimizing this benchmark, I think it's in general a bad
idea because it may de-tune it for more real-world workloads like web
applications. Also, that requires being intrusive into people's setup
scripts, which bothers me a lot more than doing a bit of kernel tuning at
system startup.
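
For reference, what chrt -b does for a command can be sketched in a few lines with Python's `os.sched_setscheduler` (Linux only). The wrapper name `exec_as_batch` is made up for this example. Because the scheduling policy is inherited by child processes, only what is started through such a wrapper (and its children) runs as SCHED_BATCH, which is why wrapping pgbench alone leaves the server backends untouched.

```python
import os

# Linux value of SCHED_BATCH; os.SCHED_BATCH is only defined on Linux builds.
SCHED_BATCH = getattr(os, "SCHED_BATCH", 3)


def exec_as_batch(argv):
    """Switch the calling process to SCHED_BATCH, then replace it with argv."""
    # pid 0 means "the calling process"; SCHED_BATCH requires priority 0.
    os.sched_setscheduler(0, SCHED_BATCH, os.sched_param(0))
    # The new program image keeps the policy, and so do its children.
    os.execvp(argv[0], argv)
```

Batching the database itself would mean starting the server through a wrapper like this, which is the intrusive change to setup scripts objected to above.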

Mike also suggested a patch that adjusted se.load.weight. That didn't
seem helpful in any of the cases I tested; presumably it helps with the
all batch-mode setup I didn't try properly.

I did again get useful results here with the stock 2.6.26.git kernel and
default parameters using Peter's small patch to adjust se.waker.

What I found most interesting was how the results changed when I set
/proc/sys/kernel/sched_features = 0, without doing anything with batch
mode. The default for that is 1101111111 in binary, which is 895. What I then did was run
through setting each of those bits off one by one to see which feature(s)
were getting in the way here. T...

To: Greg Smith <gsmith@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Tuesday, May 27, 2008 - 1:59 am

Care to give the below a whirl? It fixes the over-enthusiastic affinity
bug in a less restrictive way. It doesn't attempt to address the needs
of any particular load though, that needs more thought (tricky issue).

With default features, I get the below.

2.6.26-smp x86_64
1 10121.600913
2 14360.229517
3 17048.770371
4 18748.777814
5 22086.493358
6 24913.416187
8 27976.026783
10 29346.503261
15 29157.239431
20 28392.257204
30 26590.199787
40 24422.481578
50 23305.981434

(I can get a bit more by disabling HR_TICK along with a dinky patchlet
to reduce overhead when it's disabled. Bottom line is that the bug is
fixed though, maximizing performance is a separate issue imho)

Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.

Signed-off-by: Mike Galbraith <efault@gmx.de>

kernel/sched_fair.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)

Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
 	struct task_struct *curr = this_rq->curr;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	int bad_imbalance;

-	if (!(this_sd->flags & SD_WAKE_AFFINE))
+	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;

 	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync && tl)
+		tl -= curr->se.load.weight;
+
+	bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->s...

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, June 6, 2008 - 1:03 am

Sorry I didn't get back to you until now, got distracted for a bit.
Here's my table now updated with this patched version and with your
numbers for comparison, since we have the same basic processor setup:

Clients  .22.19  .26.git   patch    Mike
      1    7660    11043   11003   10122
      2   17798    11452   16868   14360
      3   29612    13231   20381   17049
      4   25584    13053   22222   18749
      6   25295    12263   23546   24913
      8   24344    11748   23895   27976
     10   23963    11612   22492   29347
     15   23026    11414   21896   29157
     20   22549    11332   21015   28392
     30   22074    10743   18411   26590
     40   21495    10406   17982   24422
     50   20051    10534   17009   23306

So this is a huge win for this patch compared with the stock 2.6.26.git
(I'm still using the daily snapshot from 2008-05-26) and a nice
improvement over the earlier, smaller patches I tested in this thread
(which peaked at 19537 for 10 clients for me with default features, vs. a
peak of 23895 @ 8 here).
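
The peak quoted above can be checked mechanically against the table. This is just a small sketch over the four columns as posted; the names `CLIENTS`, `RESULTS` and `peak` are made up for the example.

```python
# Columns copied from the table above; CLIENTS gives the row labels.
CLIENTS = [1, 2, 3, 4, 6, 8, 10, 15, 20, 30, 40, 50]
RESULTS = {
    ".22.19": [7660, 17798, 29612, 25584, 25295, 24344,
               23963, 23026, 22549, 22074, 21495, 20051],
    ".26.git": [11043, 11452, 13231, 13053, 12263, 11748,
                11612, 11414, 11332, 10743, 10406, 10534],
    "patch": [11003, 16868, 20381, 22222, 23546, 23895,
              22492, 21896, 21015, 18411, 17982, 17009],
    "Mike": [10122, 14360, 17049, 18749, 24913, 27976,
             29347, 29157, 28392, 26590, 24422, 23306],
}


def peak(column):
    """Return (tps, clients) for the best row of one column."""
    return max(zip(RESULTS[column], CLIENTS))


if __name__ == "__main__":
    for name in RESULTS:
        tps, clients = peak(name)
        print(f"{name:>8}: {tps} TPS at {clients} clients")
```

The patched column peaks at 8 clients and Mike's at 10, which is one concrete form of the pattern mismatch discussed in the next paragraph.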

I think I might not be testing exactly the same thing you did, though,
because the pattern doesn't match. I think that my Q6600 system runs a
little bit faster than yours, which is the case for small numbers of
clients here. But once we get above 8 clients your setup is way faster,
with the difference at 15 clients being the largest. Were you perhaps
using batch mode when you generated these results? Only thing I could
think of that would produce this pattern. If it's not something simple
like that, I may have to dig into whether there was some change in the git
snapshot between what you tested and what I did.

Regardless, clearly your patch reduces the regression with the default
parameters to a mild one instead of the gigantic one we started with.
Considering how generally incompatible this benchmark is with this
scheduler, and that there are clear workarounds (feature disabling) I can
document in PostgreSQL land to "fix" the problem defined for me now, I'd
be happy if all that came from this investigation was this change. I'd
hope that being strengthened against this workload improves the
sch...

To: Greg Smith <gsmith@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, June 6, 2008 - 2:13 am

Unfortunately, after the recent reverts, we're right back to huge :-/

I'm trying to come up with a dirt simple solution that doesn't harm
other load types. I've found no clear reason why we regressed so badly,
it seems to be a luck of the draw run order thing. As soon as the load
starts jamming up a bit, it avalanches into a serialized mess again. I

I consider pgbench to be a pretty excellent testcase. Getting this
fixed properly will certainly benefit similar loads, Xorg being one

It's committed, but I don't think a back-port is justified. It does
what it's supposed to do, but there's a part 2. I suspect that your
results differ from mine due to that luck of the run order draw thing.

-Mike

--

To: Greg Smith <gsmith@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 7:38 am

The below doesn't hurt my volanomark numbers of the day, helps pgbench
considerably, and improves the higher client end of mysql+oltp a wee
bit. It may hurt the low end a wee bit, but the low end is always
pretty unstable, so it's hard to tell with only three runs.

pgbench
                2.6.26-rc5                 2.6.26-rc5+
      1  10213.768037  10237.990274  10165.511814  10183.705908
      2  15885.949053  15519.005195  14994.697875  15204.900479
      3  15663.233356  16043.733087  16554.371722  17279.376443
      4  14193.807355  15799.792612  18447.345925  18088.861169
      5  17239.456219  17326.938538  20119.250823  18537.351094
      6  15293.624093  14272.208159  21439.841579  22634.887824
      8  12483.727461  13486.991527  25579.379337  25908.373483
     10  11919.023584  12058.503518  23876.035623  22403.867804
     15  10128.724654  11253.959398  23276.797649  23595.597093
     20   9645.056147   9980.465235  23603.315133  23256.506240
     30   9288.747962   8801.059613  23633.448266  23229.286697
     40   8494.705123   8323.107702  22925.552706  23081.526954
     50   8357.781935   8239.867147  19102.481374  19558.624434

volanomark
2.6.26-rc5
test-1.log:Average throughput = 101768 messages per second
test-2.log:Average throughput = 99124 messages per second
test-3.log:Average throughput = 99821 messages per second
test-1.log:Average throughput = 101362 messages per second
test-2.log:Average throughput = 98891 messages per second
test-3.log:Average throughput = 99164 messages per second

2.6.26-rc5+
test-1.log:Average throughput = 103275 messages per second
test-2.log:Average throughput = 100034 messages per second
test-3.log:Average throughput = 99434 messages per second
test-1.log:Average throughput = 100460 messages per second
test-2.log:Average throughput = 100188 messages per second
test-3.log:Average throughput = 99617 messages per second

Index: linux-2.6.26.git/kernel/sched_fair.c
=================================================================...

To: Mike Galbraith <efault@...>
Cc: Greg Smith <gsmith@...>, Ingo Molnar <mingo@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 9:08 am

On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:

--

To: Peter Zijlstra <peterz@...>
Cc: Greg Smith <gsmith@...>, Ingo Molnar <mingo@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 10:54 am

In that case it _might_ fly, so needs changelog and blame line.

Tasks which awaken many clients can suffer terribly due to affine wakeup
preemption. This can (does for pgbench) lead to serialization of the entire
load on one CPU due to ever lowering throughput of the preempted waker and
constant affine wakeup of many preempters. Prevent this by noticing when
multi-task preemption is going on, ensure that the 1:N waker can always do
a reasonable batch of work, and temporarily restrict further affine wakeups.

Signed-off-by: Mike Galbraith <efault@gmx.de>

include/linux/sched.h | 1 +
kernel/sched.c | 1 +
kernel/sched_fair.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae0be3c..73b7d23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -963,6 +963,7 @@ struct sched_entity {

 	u64 last_wakeup;
 	u64 avg_overlap;
+	struct sched_entity *last_preempter;

 #ifdef CONFIG_SCHEDSTATS
 	u64 wait_start;
diff --git a/kernel/sched.c b/kernel/sched.c
index bfb8ad8..deb30e9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2176,6 +2176,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.prev_sum_exec_runtime = 0;
 	p->se.last_wakeup = 0;
 	p->se.avg_overlap = 0;
+	p->se.last_preempter = NULL;

 #ifdef CONFIG_SCHEDSTATS
 	p->se.wait_start = 0;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 08ae848..4539a79 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -664,6 +664,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)

 	update_stats_dequeue(cfs_rq, se);
 	if (sleep) {
+		se->last_preempter = NULL;
 		update_avg_stats(cfs_rq, se);
 #ifdef CONFIG_SCHEDSTATS
 		if (entity_is_task(se)) {
@@ -692,8 +693,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)

 	ideal_runtime = sched_slice(cf...

To: Mike Galbraith <efault@...>
Cc: Greg Smith <gsmith@...>, Ingo Molnar <mingo@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 12:12 pm

Just wondering, how much effect does the last_preempter stuff have? It
seems to me the minimum runtime check ought to throttle these wakeups
quite a bit as well.

--

To: Peter Zijlstra <peterz@...>
Cc: Greg Smith <gsmith@...>, Ingo Molnar <mingo@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 1:53 pm

Without last_preempter, you'd have all tasks having a minimum runtime.
That would harm the single cpu starve.c testcase for sure, and anything
like it. I wanted to target this pretty accurately to 1:N type loads.

If you mean not trying to disperse preempters, I can test without it.

-Mike

--

To: Peter Zijlstra <peterz@...>
Cc: Greg Smith <gsmith@...>, Ingo Molnar <mingo@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 2:19 pm

pgbench
               2.6.26-rc5+         2.6.26-rc5+ with no disperse
      1  10165.511814  10183.705908  10191.865953  10186.995546
      2  14994.697875  15204.900479  15209.856474  15239.639522
      3  16554.371722  17279.376443  16431.588533  15828.812843
      4  18447.345925  18088.861169  15967.533533  16827.107528
      5  20119.250823  18537.351094  17890.057368  18829.423686
      6  21439.841579  22634.887824  18562.389387  18907.807327
      8  25579.379337  25908.373483  19527.104304  19687.221241
     10  23876.035623  22403.867804  22635.429472  20627.666899
     15  23276.797649  23595.597093  22695.938882  22233.399329
     20  23603.315133  23256.506240  22623.205980  22637.340746
     30  23633.448266  23229.286697  22736.523283  22691.638135
     40  22925.552706  23081.526954  20037.610595  22174.404351
     50  19102.481374  19558.624434  21459.370223  21664.820102

--

To: Ingo Molnar <mingo@...>, Greg Smith <gsmith@...>
Cc: Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 8:50 am

Since I tested mysql+oltp and made the dang pdf of the results, I may
as well actually attach the thing <does that before continuing...>.

BTW, I have a question wrt avg_overlap. When a wakeup causes the current
task to begin sharing CPU with a freshly awakened task, the current task
is tagged.. but the wakee isn't. How come? If one is sharing, so is
the other.

-Mike

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Greg Smith <gsmith@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 9:07 am

avg_overlap is about measuring how long we'll run after waking someone
else. The other measure, how long our waker shares the cpu with us,
hasn't proven to be relevant so far.

--

To: Peter Zijlstra <peterz@...>
Cc: Ingo Molnar <mingo@...>, Greg Smith <gsmith@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 10:16 am

Yeah wrt relevance, I've been playing with making it mean this and that,
with approx 0 success ;-) If it's a measure of how long we run after
waking though, don't we need to make sure it's not a cross CPU wakeup?

-Mike

--

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Greg Smith <gsmith@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 12:16 pm

The idea was to dynamically detect sync wakeups, whose defining property
is that the waker will sleep after waking the wakee, and whose effect is
pulling tasks together on wakeups - so that we might have the most
benefit of cache sharing.

So if we were to exclude cross cpu wakeups from this measurement we'd
handicap the whole scheme, because then we'd never measure that it's
actually a sync wakeup and wants to run on the same cpu.
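
The detection described here can be pictured as a per-task running average of "time spent running after waking someone", compared against the migration cost as in the wake_affine() condition quoted earlier in the thread. The sketch below is a simplified userland model, not the kernel code: the 1/8 smoothing step, the 500000 ns threshold and both function names are assumptions made for illustration.

```python
MIGRATION_COST_NS = 500_000  # assumed stand-in for sysctl_sched_migration_cost


def update_avg(avg, sample):
    """Move avg one eighth of the way toward sample (integer arithmetic)."""
    return avg + ((sample - avg) >> 3)


def looks_sync(waker_overlap, wakee_overlap, cost=MIGRATION_COST_NS):
    """Both tasks tend to sleep soon after waking someone: treat as sync."""
    return waker_overlap < cost and wakee_overlap < cost


if __name__ == "__main__":
    avg = 0
    # A waker that keeps running for about 2 ms after each wakeup.
    for _ in range(50):
        avg = update_avg(avg, 2_000_000)
    print(avg, looks_sync(avg, 80_000))
```

A waker that keeps running long after each wakeup drifts above the threshold and stops looking like a sync waker. If cross-CPU wakeups were never sampled, the average would never get the chance to show that two tasks belong together.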

--

To: Peter Zijlstra <peterz@...>
Cc: Ingo Molnar <mingo@...>, Greg Smith <gsmith@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, June 7, 2008 - 1:56 pm

Ah, I get it now, thanks.

-Mike

--

To: Greg Smith <gsmith@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Tuesday, May 27, 2008 - 4:20 am

Hm, pgbench's extreme dislike of preemption, and the starvation testcase
I sent earlier having an absolute requirement of preemption kinda argues
that some knobs and dials should be per task or task group (or, or... or
scheduler should be all knowing all seeing;)

Clients  2.6.25.4-feat=45  2.6.25.4-feat=111  2.6.25.4-feat=47
      1      11385.471887       10292.721924       9551.157672
      2      16709.515434       15540.399522      16283.968970
      3      25456.658841       20187.320016      24562.735943
      4      24453.435157       24975.037450      23391.583053
      5      25504.302958       23102.131056      23671.860667
      6      27076.359200       24688.791507      25947.592071
      8      31758.200682       29462.639752      29700.144372
     10      32190.081142       30428.413809      27439.441838
     15      31175.074906       11097.668025      20344.284129
     20      30513.974332       10742.166624      19256.695409
     30      28307.399275       10233.708047      17535.423344
     40      26720.463867       10037.856773      16104.895695
     50      24899.945793        9907.624283      15768.746911

Anyway, if patchlet flies, and Ingo concurs, I'll submit the below.

Prevent short-running wakers of short-running threads from overloading a
single cpu via wakeup affinity, and provide affinity related debug/tuning
options.

Signed-off-by: Mike Galbraith <efault@gmx.de>

kernel/sched.c | 9 ++++++++-
kernel/sched_fair.c | 25 ++++++++++++++-----------
2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..d6d70a8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,8 @@ enum {
 	SCHED_FEAT_START_DEBIT = 4,
 	SCHED_FEAT_HRTICK = 8,
 	SCHED_FEAT_DOUBLE_TICK = 16,
+	SCHED_FEAT_AFFINE_WAKEUPS = 32,
+	SCHED_FEAT_SYNC_WAKEUPS = 64,
 };

 const_debug unsigned int sysctl_sched_features =
@@ -603,7 +605,9 @@ const_debug unsigned int sysctl_sched_features =
 		SCHED_FEAT_WAKEUP_PREEMPT * 1 |
 		SCHED_FEAT_START_DEBIT * 1 |
 		SCHED_FEAT_HRTICK * 1 |...
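
The feat= values in the table above can be decoded against the enum in this patch. The excerpt only shows the entries from 4 upward, so the names attached to values 1 and 2 below are an assumption based on the 2.6.25-era source. The mapping applies only to this patched 2.6.25.4 tree, not to the 2.6.26.git values quoted elsewhere in the thread, and `decode` is a made-up helper name.

```python
# Bit values from the enum in the patch above. The entries for 1 and 2 are
# assumed from the 2.6.25-era source; the quoted excerpt starts at 4.
SCHED_FEAT = {
    1: "NEW_FAIR_SLEEPERS",  # assumed
    2: "WAKEUP_PREEMPT",     # value assumed; the name appears in the patch
    4: "START_DEBIT",
    8: "HRTICK",
    16: "DOUBLE_TICK",
    32: "AFFINE_WAKEUPS",
    64: "SYNC_WAKEUPS",
}


def decode(features):
    """List the feature names whose bit is set in a sched_features value."""
    return [name for bit, name in SCHED_FEAT.items() if features & bit]


if __name__ == "__main__":
    for value in (45, 111, 47):
        print(value, decode(value))
```

Under that assumed mapping, feat=45 runs without wakeup preemption and without sync wakeups, feat=111 has everything except DOUBLE_TICK enabled, and feat=47 differs from 111 only in having sync wakeups off.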

To: Greg Smith <gsmith@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Tuesday, May 27, 2008 - 4:35 am

(to somewhat solidify the random thought i'm sharing...)

Perhaps a SCHED_PREEMPT class so such things can co-exist:

SCHED_BATCH == I never preempt.
SCHED_NORMAL == I preempt sometimes.
SCHED_PREEMPT == I always preempt my waker.

(end of random synaptic firing;)

-Mike

--

To: Greg Smith <gsmith@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Dhaval Giani <dhaval@...>, lkml <linux-kernel@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, May 24, 2008 - 4:08 am

btw, the problem with 2.6.25.4 and this load is one and the same. With
a 1:N load, you really don't want work generator waking all worker-bees
on its CPU. The patchlet below lets you turn it off.

diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..5641eb8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,7 @@ enum {
 	SCHED_FEAT_START_DEBIT = 4,
 	SCHED_FEAT_HRTICK = 8,
 	SCHED_FEAT_DOUBLE_TICK = 16,
+	SCHED_FEAT_SYNC_WAKEUPS = 32,
 };

 const_debug unsigned int sysctl_sched_features =
@@ -603,7 +604,8 @@ const_debug unsigned int sysctl_sched_features =
 		SCHED_FEAT_WAKEUP_PREEMPT * 1 |
 		SCHED_FEAT_START_DEBIT * 1 |
 		SCHED_FEAT_HRTICK * 1 |
-		SCHED_FEAT_DOUBLE_TICK * 0;
+		SCHED_FEAT_DOUBLE_TICK * 0 |
+		SCHED_FEAT_SYNC_WAKEUPS * 0;

 #define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x)

@@ -1902,6 +1904,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 	long old_state;
 	struct rq *rq;

+	if (!sched_feat(SYNC_WAKEUPS))
+		sync = 0;
+
 	smp_wmb();
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;

--

To: Peter Zijlstra <peterz@...>
Cc: Dhaval Giani <dhaval@...>, Greg Smith <gsmith@...>, lkml <linux-kernel@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Thursday, May 22, 2008 - 9:16 am

Works fine (modulo booboo).

--
