PostgreSQL ships with a simple database benchmarking tool named pgbench, in what's labeled the contrib section (in many distributions it's a separate package from the main server/client ones). I see there's already been some work on improving how the PostgreSQL server works under the new scheduler (the "Poor PostgreSQL scaling on Linux 2.6.25-rc5" thread). I wanted to provide you a different test case using pgbench that has taken a sharp dive starting with 2.6.23, and the server improvement changes in 2.6.25 actually made this problem worse. I think it will be easy for someone else to replicate my results and I'll go over the exact procedure below.

To start with a view of how bad the regression is, here's a summary of the results on one system, an AMD X2 4600+ running at 2.4GHz, with a few interesting kernels. I threw in results from Solaris 10 on this system as a nice independent reference point. The numbers here are transactions/second (TPS) running a simple read-only test over a 160MB data set; I took the median from 3 test runs:

Clients  2.6.9  2.6.22  2.6.24  2.6.25  Solaris
 1       11173  11052   10526   10700    9656
 2       18035  16352   14447   10370   14518
 3       19365  15414   17784    9403   14062
 4       18975  14290   16832    8882   14568
 5       18652  14211   16356    8527   15062
 6       17830  13291   16763    9473   15314
 8       15837  12374   15343    9093   15164
10       14829  11218   10732    9057   14967
15       14053  11116    7460    7113   13944
20       13713  11412    7171    7017   13357
30       13454  11191    7049    6896   12987
40       13103  11062    7001    6820   12871
50       12311  11255    6915    6797   12858

That's the CentOS 4 2.6.9 kernel there, while the rest are stock ones I compiled with a minimum of fiddling from the defaults (just adding support for my SATA RAID card). You can see a major drop with the recent kernels at high client loads, and the changes in 2.6.25 seem to have really hurt even the low client count ones.

The other recent hardware I have here, an Intel Q6600 based system, gives even more maddening results. On successive benchmark runs, you can watch it ...
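The table above reports the median of 3 runs per client count. As a minimal sketch of that reduction step (the function name and the 16-sample cap are assumptions made for illustration, not part of the original benchmark script), the median can be taken like this:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Median of n TPS samples. The thread uses n = 3; the cap of 16 is an
 * arbitrary limit chosen for this sketch so no allocation is needed.
 */
double median_tps(const double *runs, size_t n)
{
    assert(n > 0 && n <= 16);
    double tmp[16];
    memcpy(tmp, runs, n * sizeof tmp[0]);      /* leave caller's data untouched */
    qsort(tmp, n, sizeof tmp[0], cmp_double);
    return (n % 2) ? tmp[n / 2]
                   : (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;
}
```

Using the median rather than the mean keeps a single outlier run from skewing a row, which matters for a load whose results vary as much from run to run as this one.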
Yup, I can reproduce. Running the test with 2.6.25.4, everything is waking/running on one CPU, leaving my box 75% idle. Not good. -Mike --
Can you try with 2.6.26-rc? There is minimal load balancing for group scheduling till 25, which might explain the lack of scalability. -- regards, Dhaval --
I'm playing with it now, it's tweakable with migration cost. This testcase is funky. It can't generate enough work to keep CPUs busy for spit, and can't saturate my little quad with any kernel I've tried. -Mike --
Heh, watch this. No tweaking. (Nadia's ipc/idr patches are applied though, to see if the high end improves over previous runs with various kernels, and it does seem to.)

2.6.26-smp x86_64
1 10014.774797
2 9791.395302
3 10575.369296
4 9763.183251
5 10160.274262
6 9893.174179
8 9566.978464
10 10294.456456
15 9444.100540
20 9137.878618
30 8277.795499
40 7925.824428
50 7646.644285

nail postgres to CPUs1-3
nail pgbench to CPU0

2.6.26-smp x86_64
1 10900.959982
2 15976.870604
3 24661.322669
4 25347.141780
5 25893.815676
6 26756.414839
8 25399.018582
10 26172.878669
15 25542.082746
20 25090.381828
30 24270.301103
40 23405.867336
50 21926.223083
--
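The message above does not say exactly how the pinning was done. One common way is taskset(1), which accepts a hexadecimal CPU mask. As a rough sketch (the helper name is made up for this example), the masks for "CPUs 1-3" and "CPU 0" can be computed as follows:

```c
#include <assert.h>

/*
 * Build a bitmask with CPUs first..last (inclusive) set, in the form
 * taskset accepts as a hex mask. Bit 0 is CPU 0.
 */
unsigned long cpu_range_mask(unsigned first, unsigned last)
{
    assert(first <= last && last < 8 * sizeof(unsigned long));
    unsigned long mask = 0;
    for (unsigned cpu = first; cpu <= last; cpu++)
        mask |= 1UL << cpu;
    return mask;
}
```

On a quad-core box this gives 0xe for the postgres backends and 0x1 for pgbench, which could then be applied with something like `taskset -p 0xe <pid>` for each server process.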
Disregard the above, no they don't. (now removed again)

However. The problem with 2.6.26.git running this testcase appears to be SYNC_WAKEUPS. No taskset, nada except echo 863 > sched_features

2.6.26.git
1 8173.538610
2 15738.206889
3 23399.356839
4 21401.182501
5 21682.839897
6 26396.301413
8 29910.334798
10 29953.625797
15 29535.740343
20 28950.900431
30 27159.733949
40 24163.344207
50 23258.496794

vs 2.6.22.17-0.1-default (opensuse 10.3 stock kernel)
1 7693.501369
2 15669.304960
3 25340.818410
4 24445.932930
5 22807.019544
6 24051.387364
8 22406.392813
10 22631.510576
15 21225.243584
20 20382.232075
30 18834.814588
40 17799.906622
50 17305.274561
--
Makes sense - I took a look at pgbench.c (and only thereafter took the time to find the initial mail on lkml where Greg rather nicely explained its workings) - the thing with sync wakeups is that they try to pull tasks together, but as this one task (pgbench) serves a number of postgresql server tasks it will cluster everything. Humm,.. how to fix this.. we'd need to somehow detect the 1:n nature of its operation - I'm sure there are other scenarios that could benefit from this. --
Maybe simple (minded): cache waker's last non-interrupt context wakee, if the wakee != cached, ignore SYNC_WAKEUP unless sync was requested at call time? -Mike --
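The suggestion above can be modelled in user space. The following is a minimal sketch of one reading of it, not kernel code: the struct, the field name `last_wakee`, and both flag parameters are hypothetical, and the "non-interrupt context" part of the idea is not modelled at all.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    struct task *last_wakee;   /* hypothetical: last task this one woke */
};

/*
 * Decide whether a wakeup should be treated as sync.
 *   sync_requested: the caller explicitly asked for a sync wakeup
 *   sync_feature:   a SYNC_WAKEUPS-style default that would otherwise
 *                   treat every wakeup as sync
 * A waker that keeps waking different tasks (the 1:N pattern) never
 * matches its cached wakee, so the feature-driven sync is ignored.
 */
int wakeup_is_sync(struct task *waker, struct task *wakee,
                   int sync_requested, int sync_feature)
{
    assert(waker != NULL && wakee != NULL);
    int same_wakee = (waker->last_wakee == wakee);
    waker->last_wakee = wakee;          /* remember for the next wakeup */
    if (sync_requested)
        return 1;
    return sync_feature && same_wakee;
}
```

With pgbench as the waker cycling through several backends, consecutive wakees differ, so under this model the backends would no longer be pulled onto pgbench's CPU by default.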
Yeah, something like so - or perhaps like you say cache the wakee.
I picked the wake_affine() condition, because I think that is the
biggest factor in this behaviour. You could of course also disable all
of sync.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c86c5c5..856c2a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -950,6 +950,8 @@ struct sched_entity {
u64 last_wakeup;
u64 avg_overlap;
+ struct sched_entity *waker;
+
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
u64 wait_max;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 894a702..8971044 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1036,7 +1036,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
* a reasonable amount of time then attract this newly
* woken task:
*/
- if (sync && curr->sched_class == &fair_sched_class) {
+ if (sync && curr->sched_class == &fair_sched_class &&
+ p->se.waker == curr->se->waker) {
if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
p->se.avg_overlap < sysctl_sched_migration_cost)
return 1;
@@ -1210,6 +1211,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
if (unlikely(se == pse))
return;
+ se->waker = pse;
cfs_rq_of(pse)->next = pse;
/*
--
I tested out Peter's patch (updated version against -rc3 with a typo fix
from Mike below) and it's a big step in the right direction. Here are
updated results from my benchmark script, adding 2.6.26-rc3 and that rev
with this patch applied:
Clients 2.6.22 2.6.24 2.6.25 -rc3 patch
1 11052 10526 10700 10193 10439
2 16352 14447 10370 9817 13289
3 15414 17784 9403 9428 13678
4 14290 16832 8882 9533 13033
5 14211 16356 8527 9558 12790
6 13291 16763 9473 9367 12660
8 12374 15343 9093 9159 12357
10 11218 10732 9057 8711 11839
15 11116 7460 7113 7620 11267
20 11412 7171 7017 7707 10531
30 11191 7049 6896 7195 9766
40 11062 7001 6820 7079 9668
50 11255 6915 6797 7202 9588
Exact versions I tested because I think it may start mattering now:
2.6.22.19, 2.6.24.3, 2.6.25. I didn't save 2.6.23 results but recall them
being similar to 2.6.24.
On this dual-core system, without this patch there's an average of a 33%
regression in -rc3 compared to 2.6.22. With it that's dropped to 8%; some
cases (around 10 clients) even improve a touch (it's enough within the
margin of error here I wouldn't conclude too much from that). The big
jump in high client count cases is the first I've seen that since CFS was
introduced. It seems a bit odd to me that there's still such a large
regression in the 2-8 client cases compared with not only 2.6.22 but
2.6.24, which owned this benchmark in that area.
With this feedback, any ideas on where to go next? It seems like there's
still some room for improvement left here.
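The message does not say how the average regression figures were computed. One plausible definition, sketched below under that assumption, is the mean of the per-row relative drops across all client counts. This need not reproduce the quoted 33% and 8% exactly, since the original method may differ.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Mean per-row regression of `after` relative to `before`, in percent.
 * Positive means `after` is slower; rows that improved contribute
 * negative terms.
 */
double mean_regression_pct(const double *before, const double *after, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(before[i] > 0.0);
        sum += (before[i] - after[i]) / before[i];
    }
    return 100.0 * sum / (double)n;
}
```

Feeding it the 2.6.22 column as `before` and either the -rc3 or the patch column as `after` gives one number per kernel to compare.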
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5395a61..e160f71 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -965,6 +965,8 @@ struct sched_entity {
u64 last_wakeup;
u64 avg_overlap;
+ struct sched_entity *waker;
+
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
u64 wait_max;
diff --git ...
--
Dunno. This load is very highly tweakable, and doesn't seem to like
preemption much at all. You can see below what preemption is doing to
2.6.22.18 by looking at the batch numbers. SCHED_BATCH turns the O(1)
scheduler into a pathetic little round-robin scheduler, and this load
loves pathetic :-) After seeing the batch numbers, I tweaked .git to
make it as round-robin as I could.
My take on the numbers is that both kernels preempt too frequently for
_this_ load.. but what to do, many many loads desperately need
preemption to perform.
2.6.22.18 2.6.22.18-batch 2.6.26.git 2.6.26.git.batch
1 7487.115236 7643.563512 9999.400036 9915.823582
2 17074.869889 15360.150210 14042.644140 14958.375329
3 25073.139078 24802.446538 15621.206938 25047.032536
4 24236.413612 26126.482482 16436.055117 25007.183313
5 26367.198572 28298.293443 19926.550734 27853.081679
6 24695.827843 30786.651975 22375.916107 28119.474302
8 21020.949689 31973.674156 25825.292413 31070.664011
10 22792.204610 31775.164023 26754.471274 31596.415197
15 21202.173186 30388.559630 28711.761083 30963.050265
20 21204.041830 29317.044783 28512.269685 30127.614550
30 18519.965964 27252.739106 26682.613791 28185.244056
40 17936.447579 25670.803773 24964.936746 26282.369366
50 16247.605712 25089.154310 21078.604858 25356.750461
-Mike
--
was 2.6.26.git.batch running the load with SCHED_BATCH, or did you do other tweaks as well? if it's other tweaks as well then could you perhaps try to make SCHED_BATCH batch more aggressively?

I.e. i think it's a perfectly fine answer to say "if your workload needs batch scheduling, run it under SCHED_BATCH".

Ingo
--
That's what I was thinking, because it needed features=0 as well to
Yes, and this appears to be such a case.
-Mike
--
I figured out how to run pgbench with chrt in order to get SCHED_BATCH behavior, but I don't understand what you mean by features=0 here. Since I didn't see the same magnitude of difference just using batch that seems important, where does that get set at?

I'm also curious what hardware your results are coming from, to fit them into my larger pgbench results context space.

Got my 4-core system back on-line again today (found some bad RAM) and wanted to try another round of tests on that. Looks like you've defined 5 test sets I should replicate:

2.6.22
2.6.22, batch
2.6.26.git
2.6.26.git, batch
2.6.26.git, batch + se.load.weight patch

Should I still be trying Peter's se.waker patch as well in this mix somewhere?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
--
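For readers who want the programmatic equivalent of running something under `chrt -b`, the following is a minimal Linux-only sketch. The helper name is an assumption for this example; the underlying call is the standard sched_setscheduler().

```c
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3   /* Linux value; glibc hides it without _GNU_SOURCE */
#endif

/*
 * Put `pid` (0 means the calling process) under SCHED_BATCH.
 * Returns 0 on success, -1 on failure with errno set.
 */
int set_batch_policy(pid_t pid)
{
    assert(pid >= 0);
    struct sched_param param = { .sched_priority = 0 };  /* must be 0 for batch */
    return sched_setscheduler(pid, SCHED_BATCH, &param);
}
```

The scheduling policy is inherited across fork(), so which process gets the call matters: applying it only to the pgbench client leaves the server backends under the default policy.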
/proc/sys/kernel/sched_features. You need CONFIG_SCHED_DEBUG to have
Yeah.
-Mike
--
btw, the problem with 2.6.25.4 and this load is one and the same. With
a 1:N load, you really don't want work generator waking all worker-bees
on its CPU. The patchlet below lets you turn it off.
diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..5641eb8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,7 @@ enum {
SCHED_FEAT_START_DEBIT = 4,
SCHED_FEAT_HRTICK = 8,
SCHED_FEAT_DOUBLE_TICK = 16,
+ SCHED_FEAT_SYNC_WAKEUPS = 32,
};
const_debug unsigned int sysctl_sched_features =
@@ -603,7 +604,8 @@ const_debug unsigned int sysctl_sched_features =
SCHED_FEAT_WAKEUP_PREEMPT * 1 |
SCHED_FEAT_START_DEBIT * 1 |
SCHED_FEAT_HRTICK * 1 |
- SCHED_FEAT_DOUBLE_TICK * 0;
+ SCHED_FEAT_DOUBLE_TICK * 0 |
+ SCHED_FEAT_SYNC_WAKEUPS * 0;
#define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x)
@@ -1902,6 +1904,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
long old_state;
struct rq *rq;
+ if (!sched_feat(SYNC_WAKEUPS))
+ sync = 0;
+
smp_wmb();
rq = task_rq_lock(p, &flags);
old_state = p->state;
--
After spending a whole day testing various scheduler options, I've got a pretty good idea how possible improvements here might map out. Let's start with Mike's results (slightly reformatted), from his "grocery store Q6600 box" similar to the one my results in this message come from:

     .22.18  .22.18b  .26.git  .26.git.batch
 1     7487     7644     9999     9916
 2    17075    15360    14043    14958
 3    25073    24802    15621    25047
 4    24236    26126    16436    25007
 5    26367    28298    19927    27853
 6    24696    30787    22376    28119
 8    21021    31974    25825    31071
10    22792    31775    26754    31596
15    21202    30389    28712    30963
20    21204    29317    28512    30128
30    18520    27253    26683    28185
40    17936    25671    24965    26282
50    16248    25089    21079    25357

I couldn't replicate that batch mode improvement in 2.6.22 or 2.6.26.git, so I asked Mike for some clarification about how he did the batch testing

Which explains the difference: I was just running pgbench as "chrt -b cmd pgbench ..." which doesn't help at all. I am uncomfortable with the idea of running the database server itself as a batch process. While it may be effective for optimizing this benchmark, I think it's in general a bad idea because it may de-tune it for more real-world workloads like web applications. Also, that requires being intrusive into people's setup scripts, which bothers me a lot more than doing a bit of kernel tuning at system startup.

Mike also suggested a patch that adjusted se.load.weight. That didn't seem helpful in any of the cases I tested, presumably it helps with the all batch-mode setup I didn't try properly.

I did again get useful results here with the stock 2.6.26.git kernel and default parameters using Peter's small patch to adjust se.waker.

What I found most interesting was how the results changed when I set /proc/sys/kernel/sched_features = 0, without doing anything with batch mode. The default for that is 1101111111=895. What I then did was run through setting each of those bits off one by one to see which feature(s) were getting in the way ...
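The bit-by-bit sweep described above is simple mask arithmetic. As a small sketch (the helper name is invented here, and no claim is made about which scheduler feature each bit maps to in a given kernel), the candidate values to write to sched_features can be generated like this:

```c
#include <assert.h>
#include <stddef.h>

/*
 * For every bit set in `features`, write one mask to out[] that has
 * exactly that bit cleared. Returns the number of masks written.
 */
size_t single_bit_off_masks(unsigned features, unsigned *out, size_t cap)
{
    size_t n = 0;
    for (unsigned bit = 1; bit != 0 && bit <= features; bit <<= 1) {
        if (!(features & bit))
            continue;               /* bit already off in the default */
        assert(n < cap);
        out[n++] = features & ~bit;
    }
    return n;
}
```

For the default of 895 (binary 1101111111) this yields nine values, starting at 894 and ending at 383. One of them is 863, which is 895 with the 32 bit cleared, the same value echoed into sched_features earlier in the thread.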
Care to give the below a whirl? It fixes the over-enthusiastic affinity
bug in a less restrictive way. It doesn't attempt to address the needs
of any particular load though, that needs more thought (tricky issue).
With default features, I get the below.
2.6.26-smp x86_64
1 10121.600913
2 14360.229517
3 17048.770371
4 18748.777814
5 22086.493358
6 24913.416187
8 27976.026783
10 29346.503261
15 29157.239431
20 28392.257204
30 26590.199787
40 24422.481578
50 23305.981434
(I can get a bit more by disabling HR_TICK along with a dinky patchlet
to reduce overhead when it's disabled. Bottom line is that the bug is
fixed though, maximizing performance is separate issue imho)
Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.
Signed-off-by: Mike Galbraith <efault@gmx.de>
kernel/sched_fair.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)
Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
struct task_struct *curr = this_rq->curr;
unsigned long tl = this_load;
unsigned long tl_per_task;
+ int bad_imbalance;
- if (!(this_sd->flags & SD_WAKE_AFFINE))
+ if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
return 0;
/*
+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:
+ */
+ if (sync && tl)
+ tl -= curr->se.load.weight;
+
+ bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+ /*
* If the currently running task will sleep within
* a reasonable amount of time then attract this newly
* woken task:
*/
- if (sync && curr->sched_class == &fair_sched_class) {
+ if (sync ...
--
Hm, pgbench's extreme dislike of preemption, and the starvation testcase
I sent earlier having an absolute requirement of preemption kinda argues
that some knobs and dials should be per task or task group (or, or... or
scheduler should be all knowing all seeing;)
2.6.25.4-feat=45 2.6.25.4-feat=111 2.6.25.4-feat=47
1 11385.471887 10292.721924 9551.157672
2 16709.515434 15540.399522 16283.968970
3 25456.658841 20187.320016 24562.735943
4 24453.435157 24975.037450 23391.583053
5 25504.302958 23102.131056 23671.860667
6 27076.359200 24688.791507 25947.592071
8 31758.200682 29462.639752 29700.144372
10 32190.081142 30428.413809 27439.441838
15 31175.074906 11097.668025 20344.284129
20 30513.974332 10742.166624 19256.695409
30 28307.399275 10233.708047 17535.423344
40 26720.463867 10037.856773 16104.895695
50 24899.945793 9907.624283 15768.746911
Anyway, if patchlet flies, and Ingo concurs, I'll submit the below.
Prevent short-running wakers of short-running threads from overloading a
single cpu via wakeup affinity, and provide affinity related debug/tuning
options.
Signed-off-by: Mike Galbraith <efault@gmx.de>
kernel/sched.c | 9 ++++++++-
kernel/sched_fair.c | 25 ++++++++++++++-----------
2 files changed, 22 insertions(+), 12 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 1e4596c..d6d70a8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -596,6 +596,8 @@ enum {
SCHED_FEAT_START_DEBIT = 4,
SCHED_FEAT_HRTICK = 8,
SCHED_FEAT_DOUBLE_TICK = 16,
+ SCHED_FEAT_AFFINE_WAKEUPS = 32,
+ SCHED_FEAT_SYNC_WAKEUPS = 64,
};
const_debug unsigned int sysctl_sched_features =
@@ -603,7 +605,9 @@ const_debug unsigned int sysctl_sched_features =
SCHED_FEAT_WAKEUP_PREEMPT * 1 |
SCHED_FEAT_START_DEBIT * 1 |
SCHED_FEAT_HRTICK * 1 ...
--
(to somewhat solidify the random thought i'm sharing...)
Perhaps a SCHED_PREEMPT class so such things can co-exist:
SCHED_BATCH == I never preempt.
SCHED_NORMAL == I preempt sometimes.
SCHED_PREEMPT == I always preempt my waker.
(end of random synaptic firing;)
-Mike
--
Sorry I didn't get back to you until now, got distracted for a bit. Here's my table now updated with this patched version and with your numbers for comparison, since we have the same basic processor setup:

Clients  .22.19  .26.git  patch   Mike
 1         7660   11043   11003  10122
 2        17798   11452   16868  14360
 3        29612   13231   20381  17049
 4        25584   13053   22222  18749
 6        25295   12263   23546  24913
 8        24344   11748   23895  27976
10        23963   11612   22492  29347
15        23026   11414   21896  29157
20        22549   11332   21015  28392
30        22074   10743   18411  26590
40        21495   10406   17982  24422
50        20051   10534   17009  23306

So this is a huge win for this patch compared with the stock 2.6.26.git (I'm still using the daily snapshot from 2008-05-26) and a nice improvement over the earlier, smaller patches I tested in this thread (which peaked at 19537 for 10 clients for me with default features, vs. a peak of 23895 @ 8 here).

I think I might not be testing exactly the same thing you did, though, because the pattern doesn't match. I think that my Q6600 system runs a little bit faster than yours, which is the case for small numbers of clients here. But once we get above 8 clients your setup is way faster, with the difference at 15 clients being the largest. Were you perhaps using batch mode when you generated these results? Only thing I could think of that would produce this pattern. If it's not something simple like that, I may have to dig into whether there was some change in the git snapshot between what you tested and what I did.

Regardless, clearly your patch reduces the regression with the default parameters to a mild one instead of the gigantic one we started with. Considering how generally incompatible this benchmark is with this scheduler, and that there are clear workarounds (feature disabling) I can document in PostgreSQL land to "fix" the problem defined for me now, I'd be happy if all that came from this investigation was this change. I'd hope that being strengthened against this workload improves the ...
Unfortunately, after the recent reverts, we're right back to huge :-/

I'm trying to come up with a dirt simple solution that doesn't harm other load types. I've found no clear reason why we regressed so badly, it seems to be a luck of the draw run order thing. As soon as the load starts jamming up a bit, it avalanches into a serialized mess again. I

I consider pgbench to be a pretty excellent testcase. Getting this fixed properly will certainly benefit similar loads, Xorg being one

It's committed, but I don't think a back-port is justified. It does what it's supposed to do, but there's a part 2.

I suspect that your results differ from mine due to that luck of the run order draw thing.

-Mike
--
The below doesn't hurt my volanomark numbers of the day, helps pgbench considerably, and improves the higher client end of mysql+oltp a wee bit. It may hurt the low end a wee bit, but the low end is always pretty unstable, so it's hard to tell with only three runs.

pgbench
    2.6.26-rc5                  2.6.26-rc5+
 1  10213.768037  10237.990274  10165.511814  10183.705908
 2  15885.949053  15519.005195  14994.697875  15204.900479
 3  15663.233356  16043.733087  16554.371722  17279.376443
 4  14193.807355  15799.792612  18447.345925  18088.861169
 5  17239.456219  17326.938538  20119.250823  18537.351094
 6  15293.624093  14272.208159  21439.841579  22634.887824
 8  12483.727461  13486.991527  25579.379337  25908.373483
10  11919.023584  12058.503518  23876.035623  22403.867804
15  10128.724654  11253.959398  23276.797649  23595.597093
20   9645.056147   9980.465235  23603.315133  23256.506240
30   9288.747962   8801.059613  23633.448266  23229.286697
40   8494.705123   8323.107702  22925.552706  23081.526954
50   8357.781935   8239.867147  19102.481374  19558.624434

volanomark
2.6.26-rc5
test-1.log:Average throughput = 101768 messages per second
test-2.log:Average throughput = 99124 messages per second
test-3.log:Average throughput = 99821 messages per second
test-1.log:Average throughput = 101362 messages per second
test-2.log:Average throughput = 98891 messages per second
test-3.log:Average throughput = 99164 messages per second
2.6.26-rc5+
test-1.log:Average throughput = 103275 messages per second
test-2.log:Average throughput = 100034 messages per second
test-3.log:Average throughput = 99434 messages per second
test-1.log:Average throughput = 100460 messages per second
test-2.log:Average throughput = 100188 messages per second
test-3.log:Average throughput = 99617 messages per second

Index: ...
Since I tested mysql+oltp and made the dang pdf of the results, I may as well actually attach the thing <does that before continuing...>.

BTW, I have a question wrt avg_overlap. When a wakeup causes the current task to begin sharing CPU with a freshly awakened task, the current task is tagged.. but the wakee isn't. How come? If one is sharing, so is the other.

-Mike
avg_overlap is about measuring how long we'll run after waking someone else. The other measure, how long our waker shares the cpu with us, hasn't proven to be relevant so far. --
Yeah wrt relevance, I've been playing with making it mean this and that, with approx 0 success ;-) If it's a measure of how long we run after waking though, don't we need to make sure it's not a cross CPU wakeup? -Mike --
The idea was to dynamically detect sync wakeups, whose defining property is that the waker will sleep after waking the wakee. And whose effect is pulling tasks together on wakeups - so that we might have the most benefit of cache sharing. So if we were to exclude cross cpu wakeups from this measurement we'd handicap the whole scheme, because then we'd never measure that it's actually a sync wakeup and wants to run on the same cpu. --
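To make the avg_overlap idea concrete, here is a small user-space sketch, not the kernel implementation. The 1/8 weighting in the running average is an assumption of this example, and both function names are illustrative. The comparison against a migration-cost threshold mirrors the wake_affine() condition quoted in the patches earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Move *avg one eighth of the way toward `sample`: a cheap running
 * average of how long a task keeps running after it wakes another.
 */
void update_avg(int64_t *avg, int64_t sample)
{
    assert(avg != 0);
    *avg += (sample - *avg) / 8;
}

/*
 * A wakeup looks sync-like when both waker and wakee historically run
 * for less than the migration cost before sleeping again.
 */
int looks_sync(int64_t waker_overlap, int64_t wakee_overlap,
               int64_t migration_cost)
{
    return waker_overlap < migration_cost && wakee_overlap < migration_cost;
}
```

Because the average is updated on every wakeup regardless of which CPU the wakee lands on, a waker that really does sleep right after waking its partner can still be recognised, which is the point made in the reply above.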
On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote: --
In that case it _might_ fly, so needs changelog and blame line.
Tasks which awaken many clients can suffer terribly due to affine wakeup
preemption. This can (does for pgbench) lead to serialization of the entire
load on one CPU due to ever lowering throughput of the preempted waker and
constant affine wakeup of many preempters. Prevent this by noticing when
multi-task preemption is going on, ensure that the 1:N waker can always do
a reasonable batch of work, and temporarily restrict further affine wakeups.
Signed-off-by: Mike Galbraith <efault@gmx.de>
include/linux/sched.h | 1 +
kernel/sched.c | 1 +
kernel/sched_fair.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
3 files changed, 44 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae0be3c..73b7d23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -963,6 +963,7 @@ struct sched_entity {
u64 last_wakeup;
u64 avg_overlap;
+ struct sched_entity *last_preempter;
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
diff --git a/kernel/sched.c b/kernel/sched.c
index bfb8ad8..deb30e9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2176,6 +2176,7 @@ static void __sched_fork(struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.last_wakeup = 0;
p->se.avg_overlap = 0;
+ p->se.last_preempter = NULL;
#ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 08ae848..4539a79 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -664,6 +664,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
update_stats_dequeue(cfs_rq, se);
if (sleep) {
+ se->last_preempter = NULL;
update_avg_stats(cfs_rq, se);
#ifdef CONFIG_SCHEDSTATS
if (entity_is_task(se)) {
@@ -692,8 +693,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
ideal_runtime = sched_slice(cfs_rq, curr);
...
--
Just wondering, how much effect does the last_preempter stuff have? It seems to me the minimum runtime check ought to throttle these wakeups quite a bit as well. --
Without last_preempter, you'd have all tasks having a minimum runtime. That would harm the single cpu starve.c testcase for sure, and anything like it. I wanted to target this pretty accurately to 1:N type loads. If you mean not trying to disperse preempters, I can test without it. -Mike --
pgbench
2.6.26-rc5+ 2.6.26-rc5+ with no disperse
1 10165.511814 10183.705908 10191.865953 10186.995546
2 14994.697875 15204.900479 15209.856474 15239.639522
3 16554.371722 17279.376443 16431.588533 15828.812843
4 18447.345925 18088.861169 15967.533533 16827.107528
5 20119.250823 18537.351094 17890.057368 18829.423686
6 21439.841579 22634.887824 18562.389387 18907.807327
8 25579.379337 25908.373483 19527.104304 19687.221241
10 23876.035623 22403.867804 22635.429472 20627.666899
15 23276.797649 23595.597093 22695.938882 22233.399329
20 23603.315133 23256.506240 22623.205980 22637.340746
30 23633.448266 23229.286697 22736.523283 22691.638135
40 22925.552706 23081.526954 20037.610595 22174.404351
50 19102.481374 19558.624434 21459.370223 21664.820102
--
Running SCHED_BATCH with only the below put a large dent in the problem. You can have tl <= current->se.load.weight. Nothing good happens in either case, at least with this load.

--- kernel/sched_fair.c.org 2008-05-23 14:59:39.000000000 +0200
+++ kernel/sched_fair.c 2008-05-23 14:49:05.000000000 +0200
@@ -1081,7 +1081,7 @@ wake_affine(struct rq *rq, struct sched_
* effect of the currently running task from the load
* of the current CPU:
*/
- if (sync)
+ if (sync && tl > current->se.load.weight)
tl -= current->se.load.weight;
if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||

2.6.26-smp x86_64
1 9209.503213
2 15792.406916
3 23369.199181
4 23140.108032
5 24556.515470
6 24926.457776
8 26896.607558
10 27350.988396
15 29005.426298
20 28558.267290
30 27002.328374
40 25809.202374
50 24589.478654
--
And without SCHED_BATCH

2.6.26-smp x86_64
1 8417.511252
2 15559.741472
3 23417.911087
4 21982.631084
5 24212.518114
6 21870.640050
8 25178.186022
10 27350.449792
15 27958.758943
20 28011.989131
30 26668.779045
40 24871.625107
50 23687.757456

So the primary low end problem is sync affine wakeups it seems.

-Mike
--
