On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:
Care to give the below a whirl? If fixes the over-enthusiastic affinity
bug in a less restrictive way. It doesn't attempt to addresss the needs
of any particular load though, that needs more thought (tricky issue).
With default features, I get the below.
2.6.26-smp x86_64
1 10121.600913
2 14360.229517
3 17048.770371
4 18748.777814
5 22086.493358
6 24913.416187
8 27976.026783
10 29346.503261
15 29157.239431
20 28392.257204
30 26590.199787
40 24422.481578
50 23305.981434
(I can get a bit more by disabling HR_TICK along with a dinky patchlet
to reduce overhead when it's disabled. Bottom line is that the bug is
fixed though, maximizing performance is separate issue imho)
Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.
Signed-off-by: Mike Galbraith <efault@gmx.de>
kernel/sched_fair.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)
Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
struct task_struct *curr = this_rq->curr;
unsigned long tl = this_load;
unsigned long tl_per_task;
+ int bad_imbalance;
- if (!(this_sd->flags & SD_WAKE_AFFINE))
+ if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
return 0;
/*
+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:
+ */
+ if (sync && tl)
+ tl -= curr->se.load.weight;
+
+ bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+ /*
* If the currently running task will sleep within
* a reasonable amount of time then attract this newly
* woken task:
*/
- if (sync && curr->sched_class == &fair_sched_class) {
+ if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class)
{
if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
p->se.avg_overlap < sysctl_sched_migration_cost)
return 1;
@@ -1075,16 +1086,8 @@ wake_affine(struct rq *rq, struct sched_
schedstat_inc(p, se.nr_wakeups_affine_attempts);
tl_per_task = cpu_avg_load_per_task(this_cpu);
- /*
- * If sync wakeup then subtract the (maximum possible)
- * effect of the currently running task from the load
- * of the current CPU:
- */
- if (sync)
- tl -= current->se.load.weight;
-
if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
- 100*(tl + p->se.load.weight) <= imbalance*load) {
+ !bad_imbalance) {
/*
* This domain has SD_WAKE_AFFINE and
* p is cache cold in this domain, and
Because those were git _with_ Peter's patch. I was looking at the
difference between a non-broken 2.6.26 with and without minimizing
preemption, to match the way I tested 2.6.22.
If the patch above flies, imho, that should be folded into the backport
to stable. The heart of the patch is a bugfix, so definitely stable
material. Whether to enable the debugging/tweaking knobs as well is a
different question. (I would do it, but I ain't the maintainer;)
(it doesn't fully address the 1:N needs, that needs much more thought)
I suspect very many, certainly any load where latency is of major
importance. Peak performance of the mysql+oltp benchmark for sure is
injured with your settings. As a really good demonstration of how
important wakeup preemption can be, try running the attached testcase,
which was distilled from a real world load and posted to lkml a few
years ago, without wakeup preemption and nailed to one cpu.
See patch :)
-Mike