Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

From: Greg Smith
Date: Wednesday, May 21, 2008 - 10:34 am

PostgreSQL ships with a simple database benchmarking tool named pgbench, 
in what's labeled the contrib section (in many distributions it's a 
separate package from the main server/client ones).  I see there's been 
some work done already improving how the PostgreSQL server works under the 
new scheduler (the "Poor PostgreSQL scaling on Linux 2.6.25-rc5" thread). 
I wanted to provide you a different test case using pgbench that has taken 
a sharp dive starting with 2.6.23, and the server improvement changes in 
2.6.25 actually made this problem worse.

I think it will be easy for someone else to replicate my results and I'll 
go over the exact procedure below.  To start with a view of how bad the 
regression is, here's a summary of the results on one system, an AMD X2 
4600+ running at 2.4GHz, with a few interesting kernels.  I threw in 
results from Solaris 10 on this system as a nice independent reference 
point.  The numbers here are transactions/second (TPS) running a simple 
read-only test over a 160MB data set; I took the median from 3 test runs:

Clients	2.6.9	2.6.22	2.6.24	2.6.25	Solaris
1	11173	11052	10526	10700	9656
2	18035	16352	14447	10370	14518
3	19365	15414	17784	9403	14062
4	18975	14290	16832	8882	14568
5	18652	14211	16356	8527	15062
6	17830	13291	16763	9473	15314
8	15837	12374	15343	9093	15164
10	14829	11218	10732	9057	14967
15	14053	11116	7460	7113	13944
20	13713	11412	7171	7017	13357
30	13454	11191	7049	6896	12987
40	13103	11062	7001	6820	12871
50	12311	11255	6915	6797	12858

That's the CentOS 4 2.6.9 kernel there, while the rest are stock ones I 
compiled with a minimum of fiddling from the defaults (just adding support 
for my SATA RAID card).  You can see a major drop with the recent kernels 
at high client loads, and the changes in 2.6.25 seem to have really hurt 
even the low client count ones.

The other recent hardware I have here, an Intel Q6600 based system, gives 
even more maddening results.  On successive benchmark runs, you can watch 
it ...
From: Mike Galbraith
Date: Thursday, May 22, 2008 - 12:10 am

Yup, I can reproduce.  Running the test with 2.6.25.4, everything is
waking/running on one CPU, leaving my box 75% idle.  Not good.

	-Mike

--

From: Dhaval Giani
Date: Thursday, May 22, 2008 - 1:28 am

Can you try with 2.6.26-rc? There is minimal load balancing for group
scheduling till 25, which might explain the lack of scalability.

-- 
regards,
Dhaval
--

From: Mike Galbraith
Date: Thursday, May 22, 2008 - 2:05 am

I'm playing with it now, it's tweakable with migration cost.  This
testcase is funky.  It can't generate enough work to keep CPUs busy for
spit, and can't saturate my little quad with any kernel I've tried.

	-Mike

--

From: Mike Galbraith
Date: Thursday, May 22, 2008 - 3:34 am

Heh, watch this.  No tweaking.  

(Nadia's ipc/idr patches are applied though, to see if the high end
improves over previous runs with various kernels, and it does seem to.)

2.6.26-smp x86_64
1 10014.774797
2 9791.395302
3 10575.369296
4 9763.183251
5 10160.274262
6 9893.174179
8 9566.978464
10 10294.456456
15 9444.100540
20 9137.878618
30 8277.795499
40 7925.824428
50 7646.644285

nail postgres to CPUs1-3
nail pgbench to CPU0

2.6.26-smp x86_64
1 10900.959982
2 15976.870604
3 24661.322669
4 25347.141780
5 25893.815676
6 26756.414839
8 25399.018582
10 26172.878669
15 25542.082746
20 25090.381828
30 24270.301103
40 23405.867336
50 21926.223083



--

From: Mike Galbraith
Date: Thursday, May 22, 2008 - 4:25 am

Disregard the above, no they don't.  (now removed again)

However.  The problem with 2.6.26.git running this testcase appears to
be SYNC_WAKEUPS. No taskset, nada except echo 863 > sched_features
 
2.6.26.git
1 8173.538610
2 15738.206889
3 23399.356839
4 21401.182501
5 21682.839897
6 26396.301413
8 29910.334798
10 29953.625797
15 29535.740343
20 28950.900431
30 27159.733949
40 24163.344207
50 23258.496794

vs

2.6.22.17-0.1-default (opensuse 10.3 stock kernel)
1 7693.501369
2 15669.304960
3 25340.818410
4 24445.932930
5 22807.019544
6 24051.387364
8 22406.392813
10 22631.510576
15 21225.243584
20 20382.232075
30 18834.814588
40 17799.906622
50 17305.274561
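
The 863 written to sched_features above differs from the 895 that Greg later quotes as the default by exactly 32, i.e. one cleared bit.  Below is a minimal sketch of that arithmetic.  The helper names are made up for illustration, and treating bit 32 as SYNC_WAKEUPS is an assumption drawn from Mike's remark; the bit layout varies between kernel versions (the patchlets later in this thread assign 32 and 64 differently for 2.6.25).

```c
#include <assert.h>

/* Assumed bit value: 895 - 863 = 32.  Which feature this bit maps to
 * depends on the kernel version; the name is for illustration only. */
#define FEAT_SYNC_WAKEUPS_BIT 32u

/* Return mask with one feature bit switched off. */
unsigned int clear_feature(unsigned int mask, unsigned int bit)
{
	/* Precondition: exactly one bit is set in `bit`. */
	assert(bit != 0 && (bit & (bit - 1)) == 0);
	return mask & ~bit;
}

/* Non-zero if the feature bit is set in mask. */
int feature_enabled(unsigned int mask, unsigned int bit)
{
	return (mask & bit) != 0;
}
```

With these helpers, clear_feature(895, FEAT_SYNC_WAKEUPS_BIT) yields the 863 that was echoed into sched_features.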

--

From: Peter Zijlstra
Date: Thursday, May 22, 2008 - 4:44 am

Makes sense - I took a look at pgbench.c (and only thereafter took the
time to find the initial mail on lkml where Greg rather nicely explained
its workings) - the thing with sync wakeups is that they try to pull
tasks together, but as this one task (pgbench) serves a number of
postgresql server tasks it will cluster everything.

Humm,.. how to fix this.. we'd need to somehow detect the 1:n nature of
its operation - I'm sure there are other scenarios that could benefit
from this.





--

From: Mike Galbraith
Date: Thursday, May 22, 2008 - 5:09 am

Maybe simple (minded): cache waker's last non-interrupt context wakee,
if the wakee != cached, ignore SYNC_WAKEUP unless sync was requested at
call time?
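
A user-space sketch of the idea in the paragraph above, to make the decision rule concrete.  Everything here (struct task, effective_sync, the two flag parameters) is hypothetical and is not kernel code; it assumes "sync was requested at call time" can be passed in as a separate flag from the scheduler's own sync hint.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler entity. */
struct task {
	const struct task *last_wakee;	/* cached wakee from the previous wakeup */
};

/*
 * Decide whether a wakeup should still be treated as sync.
 *   sync_hint      - the scheduler would like to treat this wakeup as sync
 *   sync_requested - the caller explicitly asked for a sync wakeup
 * The cache is updated on every call.
 */
bool effective_sync(struct task *waker, const struct task *wakee,
		    bool sync_hint, bool sync_requested)
{
	bool same_wakee = (waker->last_wakee == wakee);

	waker->last_wakee = wakee;

	if (sync_requested)
		return true;		/* explicit requests are always honoured */

	/* A 1:N waker keeps changing wakee, so the hint is dropped. */
	return sync_hint && same_wakee;
}
```

A waker that ping-pongs with a single partner keeps its sync hint, while one that cycles through many wakees (as pgbench does with its backends) loses it after each switch.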

	-Mike

--

From: Peter Zijlstra
Date: Thursday, May 22, 2008 - 5:24 am

Yeah, something like so - or perhaps like you say cache the wakee.

I picked the wake_affine() condition, because I think that is the
biggest factor in this behaviour. You could of course also disable all
of sync.



diff --git a/include/linux/sched.h b/include/linux/sched.h
index c86c5c5..856c2a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -950,6 +950,8 @@ struct sched_entity {
 	u64			last_wakeup;
 	u64			avg_overlap;
 
+	struct sched_entity 	*waker;
+
 #ifdef CONFIG_SCHEDSTATS
 	u64			wait_start;
 	u64			wait_max;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 894a702..8971044 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1036,7 +1036,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && curr->sched_class == &fair_sched_class &&
+	    p->se.waker == curr->se->waker) {
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 				p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1210,6 +1211,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
 	if (unlikely(se == pse))
 		return;
 
+	se->waker = pse;
 	cfs_rq_of(pse)->next = pse;
 
 	/*



--

From: Mike Galbraith
Date: Thursday, May 22, 2008 - 6:16 am

Works fine (modulo booboo).


--

From: Greg Smith
Date: Friday, May 23, 2008 - 12:13 am

I tested out Peter's patch (updated version against -rc3 with a typo fix 
from Mike below) and it's a big step in the right direction.  Here are 
updated results from my benchmark script, adding 2.6.26-rc3 and that rev 
with this patch applied:

Clients	2.6.22	2.6.24	2.6.25	-rc3	patch
1	11052	10526	10700	10193	10439
2	16352	14447	10370	9817	13289
3	15414	17784	9403	9428	13678
4	14290	16832	8882	9533	13033
5	14211	16356	8527	9558	12790
6	13291	16763	9473	9367	12660
8	12374	15343	9093	9159	12357
10	11218	10732	9057	8711	11839
15	11116	7460	7113	7620	11267
20	11412	7171	7017	7707	10531
30	11191	7049	6896	7195	9766
40	11062	7001	6820	7079	9668
50	11255	6915	6797	7202	9588

Exact versions I tested because I think it may start mattering now: 
2.6.22.19, 2.6.24.3, 2.6.25.  I didn't save 2.6.23 results but recall them 
being similar to 2.6.24.

On this dual-core system, without this patch there's an average 33% 
regression in -rc3 compared to 2.6.22.  With it that's dropped to 8%; some 
cases (around 10 clients) even improve a touch (it's close enough to the 
margin of error here that I wouldn't conclude too much from that).  The big 
jump in high client count cases is the first I've seen since CFS was 
introduced.  It seems a bit odd to me that there's still such a large 
regression in the 2-8 client cases compared with not only 2.6.22 but 
2.6.24, which owned this benchmark in that area.
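
For anyone wanting to derive a single regression figure from a table like the one above, here is one possible way to do it: compare total throughput across all client counts.  This is only a sketch; the function name is made up, and the message does not say how its averages were computed, so the result need not match the quoted percentages exactly.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Percentage by which the summed `candidate` throughput falls short of
 * the summed `baseline` throughput.  Positive means a regression,
 * negative means an improvement.
 */
double aggregate_regression_pct(const double *baseline,
				const double *candidate, size_t n)
{
	double base_sum = 0.0, cand_sum = 0.0;

	assert(baseline != NULL && candidate != NULL);

	for (size_t i = 0; i < n; i++) {
		base_sum += baseline[i];
		cand_sum += candidate[i];
	}
	if (base_sum == 0.0)
		return 0.0;	/* nothing to compare against */

	return 100.0 * (base_sum - cand_sum) / base_sum;
}
```

Averaging the per-row ratios instead of the sums weights every client count equally and will give a somewhat different number.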

With this feedback, any ideas on where to go next?  It seems like there's 
still some room for improvement left here.


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5395a61..e160f71 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -965,6 +965,8 @@ struct sched_entity {
         u64                     last_wakeup;
         u64                     avg_overlap;

+       struct sched_entity     *waker;
+
  #ifdef CONFIG_SCHEDSTATS
         u64                     wait_start;
         u64                     wait_max;
diff --git ...
From: Mike Galbraith
Date: Friday, May 23, 2008 - 3:00 am

Dunno.  This load is very highly tweakable, and doesn't seem to like
preemption much at all.  You can see below what preemption is doing to
2.6.22.18 by looking at the batch numbers.  SCHED_BATCH turns the O(1)
scheduler into a pathetic little round-robin scheduler, and this load
loves pathetic :-)  After seeing the batch numbers, I tweaked .git to
make it as round-robin as I could.

My take on the numbers is that both kernels preempt too frequently for
_this_ load.. but what to do, many many loads desperately need
preemption to perform.

      2.6.22.18     2.6.22.18-batch          2.6.26.git    2.6.26.git.batch
1   7487.115236         7643.563512         9999.400036         9915.823582
2  17074.869889        15360.150210        14042.644140        14958.375329
3  25073.139078        24802.446538        15621.206938        25047.032536
4  24236.413612        26126.482482        16436.055117        25007.183313
5  26367.198572        28298.293443        19926.550734        27853.081679
6  24695.827843        30786.651975        22375.916107        28119.474302
8  21020.949689        31973.674156        25825.292413        31070.664011
10 22792.204610        31775.164023        26754.471274        31596.415197
15 21202.173186        30388.559630        28711.761083        30963.050265
20 21204.041830        29317.044783        28512.269685        30127.614550
30 18519.965964        27252.739106        26682.613791        28185.244056
40 17936.447579        25670.803773        24964.936746        26282.369366
50 16247.605712        25089.154310        21078.604858        25356.750461

	-Mike

--

From: Ingo Molnar
Date: Friday, May 23, 2008 - 3:10 am

was 2.6.26.git.batch running the load with SCHED_BATCH, or did you do 
other tweaks as well?

if it's other tweaks as well then could you perhaps try to make 
SCHED_BATCH batch more aggressively?

I.e. i think it's a perfectly fine answer to say "if your workload needs 
batch scheduling, run it under SCHED_BATCH".

	Ingo
--

From: Mike Galbraith
Date: Friday, May 23, 2008 - 3:15 am

That's what I was thinking, because it needed features=0 as well to

Yes, and this appears to be such a case.

	-Mike

--

From: Greg Smith
Date: Friday, May 23, 2008 - 4:18 pm

I figured out how to run pgbench with chrt in order to get SCHED_BATCH 
behavior, but I don't understand what you mean by features=0 here.  Since 
I didn't see the same magnitude of difference just using batch that seems 
important, where does that get set?

I'm also curious what hardware your results are coming from, to fit them 
into my larger pgbench results context space.

Got my 4-core system back on-line again today (found some bad RAM) and 
wanted to try another round of tests on that.  Looks like you've defined 5 
test sets I should replicate:

2.6.22
2.6.22, batch
2.6.26.git
2.6.26.git, batch
2.6.26.git, batch + se.load.weight patch

Should I still be trying Peter's se.waker patch as well in this mix 
somewhere?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
--

From: Mike Galbraith
Date: Friday, May 23, 2008 - 4:46 pm

/proc/sys/kernel/sched_features.  You need CONFIG_SCHED_DEBUG to have


Yeah.

	-Mike

--

From: Mike Galbraith
Date: Saturday, May 24, 2008 - 1:08 am

btw, the problem with 2.6.25.4 and this load is one and the same.  With
a 1:N load, you really don't want work generator waking all worker-bees
on its CPU.  The patchlet below lets you turn it off.

diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..5641eb8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,7 @@ enum {
 	SCHED_FEAT_START_DEBIT		= 4,
 	SCHED_FEAT_HRTICK		= 8,
 	SCHED_FEAT_DOUBLE_TICK		= 16,
+	SCHED_FEAT_SYNC_WAKEUPS		= 32,
 };
 
 const_debug unsigned int sysctl_sched_features =
@@ -603,7 +604,8 @@ const_debug unsigned int sysctl_sched_features =
 		SCHED_FEAT_WAKEUP_PREEMPT	* 1 |
 		SCHED_FEAT_START_DEBIT		* 1 |
 		SCHED_FEAT_HRTICK		* 1 |
-		SCHED_FEAT_DOUBLE_TICK		* 0;
+		SCHED_FEAT_DOUBLE_TICK		* 0 |
+		SCHED_FEAT_SYNC_WAKEUPS         * 0;
 
 #define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x)
 
@@ -1902,6 +1904,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 	long old_state;
 	struct rq *rq;
 
+	if (!sched_feat(SYNC_WAKEUPS))
+		sync = 0;
+
 	smp_wmb();
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;


--

From: Greg Smith
Date: Monday, May 26, 2008 - 5:28 pm

After spending a whole day testing various scheduler options, I've got a 
pretty good idea how possible improvements here might map out.  Let's 
start with Mike's results (slightly reformatted), from his "grocery store 
Q6600 box" similar to the one my results in this message come from:

 	.22.18	.22.18b	.26.git	.26.git.batch
1	7487	7644	9999	9916
2	17075	15360	14043	14958
3	25073	24802	15621	25047
4	24236	26126	16436	25007
5	26367	28298	19927	27853
6	24696	30787	22376	28119
8	21021	31974	25825	31071
10	22792	31775	26754	31596
15	21202	30389	28712	30963
20	21204	29317	28512	30128
30	18520	27253	26683	28185
40	17936	25671	24965	26282
50	16248	25089	21079	25357

I couldn't replicate that batch mode improvement in 2.6.22 or 2.6.26.git, 
so I asked Mike for some clarification about how he did the batch testing 

Which explains the difference:  I was just running pgbench as "chrt -b cmd 
pgbench ..." which doesn't help at all.  I am uncomfortable with the idea 
of running the database server itself as a batch process.  While it may be 
effective for optimizing this benchmark, I think it's in general a bad 
idea because it may de-tune it for more real-world workloads like web 
applications.  Also, that requires being intrusive into people's setup 
scripts, which bothers me a lot more than doing a bit of kernel tuning at 
system startup.
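
For reference, what "chrt -b" does can also be done from inside a program with sched_setscheduler().  A minimal sketch follows, assuming Linux; the wrapper name is made up, and the fallback #define is there because glibc only exposes SCHED_BATCH when _GNU_SOURCE is defined.

```c
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3	/* Linux policy number, used if the header hides it */
#endif

/*
 * Switch `pid` (0 means the calling process) to SCHED_BATCH.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_batch_policy(pid_t pid)
{
	/* SCHED_BATCH requires a static priority of 0. */
	struct sched_param param = { .sched_priority = 0 };

	return sched_setscheduler(pid, SCHED_BATCH, &param);
}
```

The policy is inherited across fork(), so which process gets switched (the benchmark client or the server's parent process) decides what ends up running as batch.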

Mike also suggested a patch that adjusted se.load.weight.  That didn't 
seem helpful in any of the cases I tested; presumably it helps with the 
all batch-mode setup I didn't try properly.

I did again get useful results here with the stock 2.6.26.git kernel and 
default parameters using Peter's small patch to adjust se.waker.

What I found most interesting was how the results changed when I set 
/proc/sys/kernel/sched_features = 0, without doing anything with batch 
mode.  The default for that is 1101111111=895. What I then did was run 
through setting each of those bits off one by one to see which feature(s) 
were getting in the way ...
From: Mike Galbraith
Date: Monday, May 26, 2008 - 10:59 pm

Care to give the below a whirl?  It fixes the over-enthusiastic affinity
bug in a less restrictive way.  It doesn't attempt to address the needs
of any particular load though, that needs more thought (tricky issue).

With default features, I get the below.

2.6.26-smp x86_64
1 10121.600913
2 14360.229517
3 17048.770371
4 18748.777814
5 22086.493358
6 24913.416187
8 27976.026783
10 29346.503261
15 29157.239431
20 28392.257204
30 26590.199787
40 24422.481578
50 23305.981434

(I can get a bit more by disabling HR_TICK along with a dinky patchlet
to reduce overhead when it's disabled.  Bottom line is that the bug is
fixed though, maximizing performance is a separate issue imho) 

Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.

Signed-off-by: Mike Galbraith <efault@gmx.de>

 kernel/sched_fair.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
 	struct task_struct *curr = this_rq->curr;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	int bad_imbalance;
 
-	if (!(this_sd->flags & SD_WAKE_AFFINE))
+	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;
 
 	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync && tl)
+		tl -= curr->se.load.weight;
+
+	bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync ...
From: Mike Galbraith
Date: Tuesday, May 27, 2008 - 1:20 am

Hm, pgbench's extreme dislike of preemption, and the starvation testcase
I sent earlier having an absolute requirement of preemption kinda argues
that some knobs and dials should be per task or task group (or, or... or
scheduler should be all knowing all seeing;) 
 
2.6.25.4-feat=45  2.6.25.4-feat=111    2.6.25.4-feat=47
1  11385.471887        10292.721924         9551.157672
2  16709.515434        15540.399522        16283.968970
3  25456.658841        20187.320016        24562.735943
4  24453.435157        24975.037450        23391.583053
5  25504.302958        23102.131056        23671.860667
6  27076.359200        24688.791507        25947.592071
8  31758.200682        29462.639752        29700.144372
10 32190.081142        30428.413809        27439.441838
15 31175.074906        11097.668025        20344.284129
20 30513.974332        10742.166624        19256.695409
30 28307.399275        10233.708047        17535.423344
40 26720.463867        10037.856773        16104.895695
50 24899.945793         9907.624283        15768.746911

Anyway, if patchlet flies, and Ingo concurs, I'll submit the below.

Prevent short-running wakers of short-running threads from overloading a
single cpu via wakeup affinity, and provide affinity related debug/tuning
options.

Signed-off-by: Mike Galbraith <efault@gmx.de>

 kernel/sched.c      |    9 ++++++++-
 kernel/sched_fair.c |   25 ++++++++++++++-----------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..d6d70a8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,8 @@ enum {
 	SCHED_FEAT_START_DEBIT		= 4,
 	SCHED_FEAT_HRTICK		= 8,
 	SCHED_FEAT_DOUBLE_TICK		= 16,
+	SCHED_FEAT_AFFINE_WAKEUPS	= 32,
+	SCHED_FEAT_SYNC_WAKEUPS		= 64,
 };
 
 const_debug unsigned int sysctl_sched_features =
@@ -603,7 +605,9 @@ const_debug unsigned int sysctl_sched_features =
 		SCHED_FEAT_WAKEUP_PREEMPT	* 1 |
 		SCHED_FEAT_START_DEBIT		* 1 |
 		SCHED_FEAT_HRTICK		* 1 ...
From: Mike Galbraith
Date: Tuesday, May 27, 2008 - 1:35 am

(to somewhat solidify the random thought i'm sharing...)

Perhaps a SCHED_PREEMPT class so such things can co-exist:

SCHED_BATCH == I never preempt.
SCHED_NORMAL == I preempt sometimes.
SCHED_PREEMPT == I always preempt my waker.

(end of random synaptic firing;)

	-Mike

--

From: Greg Smith
Date: Thursday, June 5, 2008 - 10:03 pm

Sorry I didn't get back to you until now, got distracted for a bit. 
Here's my table now updated with this patched version and with your 
numbers for comparison, since we have the same basic processor setup:

Clients	.22.19	.26.git	patch	Mike
1	7660	11043	11003	10122
2	17798	11452	16868	14360
3	29612	13231	20381	17049
4	25584	13053	22222	18749
6	25295	12263	23546	24913
8	24344	11748	23895	27976
10	23963	11612	22492	29347
15	23026	11414	21896	29157
20	22549	11332	21015	28392
30	22074	10743	18411	26590
40	21495	10406	17982	24422
50	20051	10534	17009	23306

So this is a huge win for this patch compared with the stock 2.6.26.git 
(I'm still using the daily snapshot from 2008-05-26) and a nice 
improvement over the earlier, smaller patches I tested in this thread 
(which peaked at 19537 for 10 clients for me with default features, vs. a 
peak of 23895 @ 8 here).

I think I might not be testing exactly the same thing you did, though, 
because the pattern doesn't match.  I think that my Q6600 system runs a 
little bit faster than yours, which is the case for small numbers of 
clients here.  But once we get above 8 clients your setup is way faster, 
with the difference at 15 clients being the largest.  Were you perhaps 
using batch mode when you generated these results?  Only thing I could 
think of that would produce this pattern.  If it's not something simple 
like that, I may have to dig into whether there was some change in the git 
snapshot between what you tested and what I did.

Regardless, clearly your patch reduces the regression with the default 
parameters to a mild one instead of the gigantic one we started with. 
Considering how generally incompatible this benchmark is with this 
scheduler, and that there are clear workarounds (feature disabling) I can 
document in PostgreSQL land to "fix" the problem defined for me now, I'd 
be happy if all that came from this investigation was this change.  I'd 
hope that being strengthened against this workload improves the ...
From: Mike Galbraith
Date: Thursday, June 5, 2008 - 11:13 pm

Unfortunately, after the recent reverts, we're right back to huge :-/

I'm trying to come up with a dirt simple solution that doesn't harm
other load types.  I've found no clear reason why we regressed so badly,
it seems to be a luck of the draw run order thing.  As soon as the load
starts jamming up a bit, it avalanches into a serialized mess again.  I

I consider pgbench to be a pretty excellent testcase.  Getting this
fixed properly will certainly benefit similar loads, Xorg being one

It's committed, but I don't think a back-port is justified.  It does
what it's supposed to do, but there's a part 2.  I suspect that your
results differ from mine due to that luck of the run order draw thing.

	-Mike

--

From: Mike Galbraith
Date: Saturday, June 7, 2008 - 4:38 am

The below doesn't hurt my volanomark numbers of the day, helps pgbench
considerably, and improves the higher client end of mysql+oltp a wee
bit.  It may hurt the low end a wee bit, but the low end is always
pretty unstable, so it's hard to tell with only three runs.

pgbench
2.6.26-rc5                         2.6.26-rc5+
1  10213.768037    10237.990274    10165.511814    10183.705908
2  15885.949053    15519.005195    14994.697875    15204.900479
3  15663.233356    16043.733087    16554.371722    17279.376443
4  14193.807355    15799.792612    18447.345925    18088.861169
5  17239.456219    17326.938538    20119.250823    18537.351094
6  15293.624093    14272.208159    21439.841579    22634.887824
8  12483.727461    13486.991527    25579.379337    25908.373483
10 11919.023584    12058.503518    23876.035623    22403.867804
15 10128.724654    11253.959398    23276.797649    23595.597093
20  9645.056147     9980.465235    23603.315133    23256.506240
30  9288.747962     8801.059613    23633.448266    23229.286697
40  8494.705123     8323.107702    22925.552706    23081.526954
50  8357.781935     8239.867147    19102.481374    19558.624434

volanomark
2.6.26-rc5
test-1.log:Average throughput = 101768 messages per second
test-2.log:Average throughput = 99124 messages per second
test-3.log:Average throughput = 99821 messages per second
test-1.log:Average throughput = 101362 messages per second
test-2.log:Average throughput = 98891 messages per second
test-3.log:Average throughput = 99164 messages per second

2.6.26-rc5+
test-1.log:Average throughput = 103275 messages per second
test-2.log:Average throughput = 100034 messages per second
test-3.log:Average throughput = 99434 messages per second
test-1.log:Average throughput = 100460 messages per second
test-2.log:Average throughput = 100188 messages per second
test-3.log:Average throughput = 99617 messages per second


Index: ...
From: Mike Galbraith
Date: Saturday, June 7, 2008 - 5:50 am

Since I tested mysql+oltp and made the dang pdf of the results, I may
as well actually attach the thing <does that before continuing...>.

BTW, I have a question wrt avg_overlap.  When a wakeup causes the current
task to begin sharing CPU with a freshly awakened task, the current task
is tagged.. but the wakee isn't.  How come?  If one is sharing, so is
the other.

	-Mike
From: Peter Zijlstra
Date: Saturday, June 7, 2008 - 6:07 am

avg_overlap is about measuring how long we'll run after waking someone
else. The other measure, how long our waker shares the cpu with us,
hasn't proven to be relevant so far.

--

From: Mike Galbraith
Date: Saturday, June 7, 2008 - 7:16 am

Yeah wrt relevance, I've been playing with making it mean this and that,
with approx 0 success ;-)  If it's a measure of how long we run after
waking though, don't we need to make sure it's not a cross CPU wakeup?

	-Mike

--

From: Peter Zijlstra
Date: Saturday, June 7, 2008 - 9:16 am

The idea was to dynamically detect sync wakeups, whose defining property
is that the waker will sleep after waking the wakee. And whose effect is
pulling tasks together on wakeups - so that we might have the most
benefit of cache sharing.

So if we were to exclude cross cpu wakeups from this measurement we'd
handicap the whole scheme, because then we'd never measure that it's
actually a sync wakeup and wants to run on the same cpu.
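
To make the measurement concrete, here is a user-space sketch of how such an overlap average could be kept and tested.  The struct and function names are illustrative and the 1/8 weighting is an assumption, not taken from this thread.  Only the final comparison (both averages below the migration cost) mirrors the wake_affine() condition shown in the patches earlier in the thread.

```c
#include <stdint.h>

/* Illustrative stand-in for the two sched_entity fields involved. */
struct entity {
	uint64_t last_wakeup;	/* time this entity last woke another task */
	uint64_t avg_overlap;	/* running average of run time after a wakeup */
};

/* Move the average one eighth of the way towards the new sample. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += (uint64_t)(diff / 8);
}

/* The entity wakes somebody at time `now`. */
void note_wakeup(struct entity *se, uint64_t now)
{
	se->last_wakeup = now;
}

/* The entity goes to sleep at `now`: sample how long it kept running. */
void note_sleep(struct entity *se, uint64_t now)
{
	update_avg(&se->avg_overlap, now - se->last_wakeup);
}

/* Both sides overlap briefly: treat the pair as a sync wakeup candidate. */
int looks_sync(const struct entity *waker, const struct entity *wakee,
	       uint64_t migration_cost)
{
	return waker->avg_overlap < migration_cost &&
	       wakee->avg_overlap < migration_cost;
}
```

Note that the sample is taken regardless of which CPU the wakee landed on, which is the point made above: excluding cross-CPU wakeups would hide exactly the pairs that ought to be pulled together.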

--

From: Mike Galbraith
Date: Saturday, June 7, 2008 - 10:56 am

Ah, I get it now, thanks.

	-Mike

--

From: Peter Zijlstra
Date: Saturday, June 7, 2008 - 6:08 am

On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:


--

From: Mike Galbraith
Date: Saturday, June 7, 2008 - 7:54 am

In that case it _might_ fly, so needs changelog and blame line.

Tasks which awaken many clients can suffer terribly due to affine wakeup
preemption.  This can (does for pgbench) lead to serialization of the entire
load on one CPU due to ever lowering throughput of the preempted waker and
constant affine wakeup of many preempters.  Prevent this by noticing when
multi-task preemption is going on, ensure that the 1:N waker can always do
a reasonable batch of work, and temporarily restrict further affine wakeups.

Signed-off-by: Mike Galbraith <efault@gmx.de>


 include/linux/sched.h |    1 +
 kernel/sched.c        |    1 +
 kernel/sched_fair.c   |   49 ++++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae0be3c..73b7d23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -963,6 +963,7 @@ struct sched_entity {
 
 	u64			last_wakeup;
 	u64			avg_overlap;
+	struct sched_entity	*last_preempter;
 
 #ifdef CONFIG_SCHEDSTATS
 	u64			wait_start;
diff --git a/kernel/sched.c b/kernel/sched.c
index bfb8ad8..deb30e9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2176,6 +2176,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.last_wakeup		= 0;
 	p->se.avg_overlap		= 0;
+	p->se.last_preempter		= NULL;
 
 #ifdef CONFIG_SCHEDSTATS
 	p->se.wait_start		= 0;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 08ae848..4539a79 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -664,6 +664,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
 
 	update_stats_dequeue(cfs_rq, se);
 	if (sleep) {
+		se->last_preempter = NULL;
 		update_avg_stats(cfs_rq, se);
 #ifdef CONFIG_SCHEDSTATS
 		if (entity_is_task(se)) {
@@ -692,8 +693,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 
 	ideal_runtime = sched_slice(cfs_rq, curr);
 ...
From: Peter Zijlstra
Date: Saturday, June 7, 2008 - 9:12 am

Just wondering, how much effect does the last_preempter stuff have?, it
seems to me the minimum runtime check ought to throttle these wakeups
quite a bit as well.



--

From: Mike Galbraith
Date: Saturday, June 7, 2008 - 10:53 am

Without last_preempter, you'd have all tasks having a minimum runtime.
That would harm the single cpu starve.c testcase for sure, and anything
like it.  I wanted to target this pretty accurately to 1:N type loads.

If you mean not trying to disperse preempters, I can test without it.

	-Mike

--

From: Mike Galbraith
Date: Saturday, June 7, 2008 - 11:19 am

pgbench
    2.6.26-rc5+                    2.6.26-rc5+ with no disperse
1  10165.511814    10183.705908    10191.865953    10186.995546
2  14994.697875    15204.900479    15209.856474    15239.639522
3  16554.371722    17279.376443    16431.588533    15828.812843
4  18447.345925    18088.861169    15967.533533    16827.107528
5  20119.250823    18537.351094    17890.057368    18829.423686
6  21439.841579    22634.887824    18562.389387    18907.807327
8  25579.379337    25908.373483    19527.104304    19687.221241
10 23876.035623    22403.867804    22635.429472    20627.666899
15 23276.797649    23595.597093    22695.938882    22233.399329
20 23603.315133    23256.506240    22623.205980    22637.340746
30 23633.448266    23229.286697    22736.523283    22691.638135
40 22925.552706    23081.526954    20037.610595    22174.404351
50 19102.481374    19558.624434    21459.370223    21664.820102


--

From: Mike Galbraith
Date: Friday, May 23, 2008 - 6:05 am

Running SCHED_BATCH with only the below put a large dent in the problem.

You can have tl <= current->se.load.weight.  Nothing good happens in
either case, at least with this load.

--- kernel/sched_fair.c.org	2008-05-23 14:59:39.000000000 +0200
+++ kernel/sched_fair.c	2008-05-23 14:49:05.000000000 +0200
@@ -1081,7 +1081,7 @@ wake_affine(struct rq *rq, struct sched_
 	 * effect of the currently running task from the load
 	 * of the current CPU:
 	 */
-	if (sync)
+	if (sync && tl > current->se.load.weight)
 		tl -= current->se.load.weight;
 
 	if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
 


2.6.26-smp x86_64
1 9209.503213
2 15792.406916
3 23369.199181
4 23140.108032
5 24556.515470
6 24926.457776
8 26896.607558
10 27350.988396
15 29005.426298
20 28558.267290
30 27002.328374
40 25809.202374
50 24589.478654

--

From: Mike Galbraith
Date: Friday, May 23, 2008 - 6:35 am

And without SCHED_BATCH

2.6.26-smp x86_64
1 8417.511252
2 15559.741472
3 23417.911087
4 21982.631084
5 24212.518114
6 21870.640050
8 25178.186022
10 27350.449792
15 27958.758943
20 28011.989131
30 26668.779045
40 24871.625107
50 23687.757456

So the primary low end problem is sync affine wakeups it seems.

	-Mike

--
