Comparing with 2.6.27-rc4, oltp has ~10% regression with 2.6.27-rc5 on 8-core stoakley machine. Run oltp with 8 threads 120 seconds, vmstat shows much more idle time, about ~30% procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 10 0 0 7822824 42240 123740 0 0 312 47 442 1613 3 2 88 6 0 9 0 0 7822312 42240 123764 0 0 0 16 26691 232566 56 14 30 0 0 13 0 0 7821940 42240 123764 0 0 0 16 26661 228689 54 14 32 0 0 8 0 0 7821320 42240 123764 0 0 0 16 31508 263765 61 17 23 0 0 12 0 0 7820948 42240 123764 0 0 16 16 28666 242402 57 15 28 0 0 9 0 0 7820584 42240 123780 0 0 0 16 27107 230804 56 14 30 0 0 10 0 0 7819964 42240 123796 0 0 16 612 27599 244037 55 16 29 0 0 11 0 0 7819356 42240 123796 0 0 0 64 23540 209713 51 13 36 0 0 10 0 0 7819212 42240 123796 0 0 0 32 25674 224205 54 13 32 0 0 10 0 0 7818716 42240 123796 0 0 0 20 30106 257161 59 16 25 0 0 7 0 0 7818468 42240 123796 0 0 0 16 28356 241551 57 14 29 0 0 10 0 0 7818096 42240 123796 0 0 0 16 39174 273656 64 16 20 0 0 12 0 0 7817724 42240 123796 0 0 0 20 39688 276936 63 16 20 0 0 11 0 0 7817352 42240 123796 0 0 0 16 42543 285192 66 16 18 0 0 9 0 0 7817352 42240 123796 0 0 0 16 37083 259830 62 14 24 0 0 8 0 0 7817104 42240 123796 0 0 0 16 37450 259160 61 15 23 0 0 10 0 0 7816516 42240 123796 0 0 0 64 37425 261870 61 16 23 0 0 11 0 0 7815896 42240 123812 0 0 16 16 41558 279320 66 16 18 0 0 9 0 0 7815648 42240 123812 0 0 0 16 34017 235741 59 14 28 0 0 10 0 0 7815152 42240 123812 0 0 0 16 35642 ...
Thats bizarre... that just indicates the better clock, which should give better (read fairer) scheduling hurts your workload. Is there anything I can run to see if we can fix the scheduler perhaps? --
I observed schedstats of sysbench, there's more "nr_failed_migrations_hot" 2.6.27-rc4: se.nr_failed_migrations_hot 11 2.6.27-rc5: se.nr_failed_migrations_hot 95 task migration failed because of task_hot, the system is un-balanced? --
would be nice to get a representative (==steady state) scheduler trace from the critical portions of that workload. See: http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt Ingo --
A huge trace result file, see the attachment for more.
# tracer: sched_switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
mysqld-3799 [07] 151.411009: 3799:120:R ==> 3807:120:R
mysqld-3796 [00] 151.411015: 3796:120:R + 3804:120:S
mysqld-3796 [00] 151.411018: 3796:120:R ==> 3804:120:R
sysbench-3807 [07] 151.411020: 3807:120:S ==> 3799:120:R
sysbench-3804 [00] 151.411020: 3804:120:S ==> 3796:120:R
mysqld-3799 [07] 151.411028: 3799:120:R + 3807:120:S
mysqld-3796 [00] 151.411029: 3796:120:R + 3804:120:S
mysqld-3799 [07] 151.411030: 3799:120:R ==> 3807:120:R
mysqld-3796 [00] 151.411031: 3796:120:R ==> 3804:120:R
sysbench-3807 [07] 151.411032: 3807:120:S ==> 3799:120:R
sysbench-3804 [00] 151.411049: 3804:120:S ==> 3796:120:R
mysqld-3799 [07] 151.411052: 3799:120:R + 3807:120:S
mysqld-3799 [07] 151.411053: 3799:120:R ==> 3807:120:R
mysqld-3796 [00] 151.411055: 3796:120:R + 3804:120:S
sysbench-3807 [07] 151.411063: 3807:120:S ==> 3799:120:R
mysqld-3796 [00] 151.411069: 3796:120:S ==> 3804:120:R
mysqld-3799 [07] 151.411070: 3799:120:R + 3807:120:S
mysqld-3799 [07] 151.411071: 3799:120:R ==> 3807:120:R
sysbench-3807 [07] 151.411073: 3807:120:S ==> 3799:120:R
sysbench-3804 [00] 151.411078: 3804:120:S ==> 3798:120:R
mysqld-3798 [00] 151.411093: 3798:120:R + 3806:120:S
mysqld-3799 [07] 151.411093: 3799:120:R + 3807:120:S
mysqld-3799 [07] 151.411095: 3799:120:R ==> 3807:120:R
sysbench-3807 [07] 151.411104: 3807:120:S ==> 3799:120:R
mysqld-3799 [07] 151.411114: 3799:120:R + 3807:120:S
mysqld-3799 [07] 151.411115: ...ok, this only spans 6 milliseconds, which is probably a bit too light. Could you please increase the trace size 32-fold (via /debug/tracing/trace_entries), and send us the (large!) trace file compressed via bzip2 -9, off-list (or via a web link)? [anyone on-list can have a look at the smaller trace file already] Thanks, Ingo --
The large trace file also available at below link, --
Ah, that makes sense, a more accurate clock could indeed make more tasks hot. Can you try fiddling with: /proc/sys/kernel/sched_migration_cost ? Also, we used to have some auto-tuning in there, which dissapeared some time ago, gregory brought it back to live recently, perhaps he likes to share? :-) --
sched_migration_cost regression ---------------------- ------------- 50000 ~6% 0 ~8% 500000 (default) ~10% 5000000 ~14% --
at 50000 (~6%), is the predominant difference in schedstats still the nr_failed_migrations_hot? --
Yes, it's strange that nr_failed_migrations_hot (cost=50000) is larger than (cost=500000) sched_migration_cost = 50000 ---------------------------------------------------------- se.exec_start : 3475158.689868 se.vruntime : 935215.526989 se.sum_exec_runtime : 14358.431942 se.avg_overlap : 0.000000 se.wait_start : 0.000000 se.sleep_start : 3475158.689868 se.block_start : 0.000000 se.sleep_max : 37.910779 se.block_max : 0.033796 se.exec_max : 0.121028 se.slice_max : 0.000000 se.wait_max : 6.007209 se.wait_sum : 25029.649588 se.wait_count : 1954948 sched_info.bkl_count : 0 se.nr_migrations : 2521 se.nr_migrations_cold : 0 se.nr_failed_migrations_affine : 0 se.nr_failed_migrations_running : 147 se.nr_failed_migrations_hot : 123 se.nr_forced_migrations : 0 se.nr_forced2_migrations : 1236 se.nr_wakeups : 1952827 se.nr_wakeups_sync : 1944785 se.nr_wakeups_migrate : 2479 se.nr_wakeups_local : 1930122 se.nr_wakeups_remote : 22705 se.nr_wakeups_affine : 13252 se.nr_wakeups_affine_attempts : 825950 se.nr_wakeups_passive : 0 se.nr_wakeups_idle : 0 avg_atom ...
Hi Peter,
Per your request, I brought those "revert" patches I had in the PQ branch
forward to 2.6.27-rc5. They are completely unbuilt and untested, but here
for your experimentation.
-Greg
---
Gregory Haskins (4):
sched: make task_hot() once again use sd->cache_hot_time
Revert "sched: zap the migration init / cache-hot balancing code"
Revert "[PATCH] sched: remove cache_hot_time"
revert "sched: sched_cacheflush is now unused"
Documentation/kernel-parameters.txt | 43 +++
arch/ia64/kernel/setup.c | 15 +
arch/mips/kernel/smp.c | 11 +
arch/sparc/kernel/smp.c | 10 +
arch/sparc64/kernel/smp.c | 27 ++
arch/x86/kernel/smpboot.c | 12 +
include/asm-m32r/system.h | 10 +
include/asm-mips/system.h | 10 +
include/asm-parisc/system.h | 11 +
include/asm-x86/system.h | 9 +
include/linux/sched.h | 7
include/linux/topology.h | 1
kernel/sched.c | 499 ++++++++++++++++++++++++++++++++++-
13 files changed, 656 insertions(+), 9 deletions(-)
--
Signature
--
revert c41917df8a1adde34864116ce2231a7fe308d2ff
Author: Ralf Baechle <ralf@linux-mips.org>
Date: Thu Jul 19 21:28:35 2007 +0200
[PATCH] sched: sched_cacheflush is now unused
Since Ingo's recent scheduler rewrite which was merged as commit
0437e109e1841607f2988891eaa36c531c6aa6ac sched_cacheflush is unused.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We want to restore sd->cache_hot_time for better migration decisions.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
arch/ia64/kernel/setup.c | 9 +++++++++
include/asm-m32r/system.h | 10 ++++++++++
include/asm-mips/system.h | 10 ++++++++++
include/asm-parisc/system.h | 11 +++++++++++
include/asm-x86/system.h | 9 +++++++++
5 files changed, 49 insertions(+), 0 deletions(-)
diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index c27d5b2..c0050ab 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -1051,6 +1051,15 @@ cpu_init (void)
pm_idle = default_idle;
}
+/*
+ * On SMP systems, when the scheduler does migration-cost autodetection,
+ * it needs a way to flush as much of the CPU's caches as possible.
+ */
+void sched_cacheflush(void)
+{
+ ia64_sal_cache_flush(3);
+}
+
void __init
check_bugs (void)
{
diff --git a/include/asm-m32r/system.h b/include/asm-m32r/system.h
index 70a57c8..d2f9559 100644
--- a/include/asm-m32r/system.h
+++ b/include/asm-m32r/system.h
@@ -54,6 +54,16 @@
); \
} while(0)
+/*
+ * On SMP systems, when the scheduler does migration-cost autodetection,
+ * it needs a way to flush as much of the CPU's caches as possible.
+ *
+ * TODO: fill this in!
+ */
+static inline void sched_cacheflush(void)
+{
+}
+
/* Interrupt Control */
#if !defined(CONFIG_CHIP_M32102) && !defined(CONFIG_CHIP_M32104)
#define local_irq_enable() \
diff --git a/include/asm-mips/system.h b/include/asm-mips/system.h
index a944eda..aaba821 ...This reverts commit 362a7016637648c6aefc98b706298baedfaa1543.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
include/linux/sched.h | 1 +
include/linux/topology.h | 1 +
kernel/sched.c | 8 +++++---
3 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfb0d87..5619f3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -777,6 +777,7 @@ struct sched_domain {
unsigned long max_interval; /* Maximum balance interval ms */
unsigned int busy_factor; /* less balancing by factor if busy */
unsigned int imbalance_pct; /* No balance until over watermark */
+ unsigned long long cache_hot_time; /* Task considered cache hot (ns) */
unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
unsigned int busy_idx;
unsigned int idle_idx;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 2158fc0..f6a7928 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -162,6 +162,7 @@ void arch_update_cpu_topology(void);
.max_interval = 64*num_online_cpus(), \
.busy_factor = 128, \
.imbalance_pct = 133, \
+ .cache_hot_time = (10*1000000), \
.cache_nice_tries = 1, \
.busy_idx = 3, \
.idle_idx = 3, \
diff --git a/kernel/sched.c b/kernel/sched.c
index 9a1ddb8..0ca5218 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6223,7 +6223,7 @@ set_table_entry(struct ctl_table *entry,
static struct ctl_table *
sd_alloc_ctl_domain_table(struct sched_domain *sd)
{
- struct ctl_table *table = sd_alloc_ctl_entry(12);
+ struct ctl_table *table = sd_alloc_ctl_entry(13);
if (table == NULL)
return NULL;
@@ -6249,9 +6249,11 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
set_table_entry(&table[9], "cache_nice_tries",
&sd->cache_nice_tries,
sizeof(int), 0644, proc_dointvec_minmax);
- set_table_entry(&table[10], "flags", &sd->flags,
+ set_table_entry(&table[10], "cache_hot_time", ...>From commit: 0437e109e1841607f2988891eaa36c531c6aa6ac We want to restore the concept of using empirical cache_hot data for managing migrations. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- Documentation/kernel-parameters.txt | 43 +++ arch/ia64/kernel/setup.c | 6 arch/mips/kernel/smp.c | 11 + arch/sparc/kernel/smp.c | 10 + arch/sparc64/kernel/smp.c | 27 ++ arch/x86/kernel/smpboot.c | 12 + include/linux/sched.h | 6 kernel/sched.c | 483 +++++++++++++++++++++++++++++++++++ 8 files changed, 598 insertions(+), 0 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 1150444..6e7b78f 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1247,6 +1247,49 @@ and is between 256 and 4096 characters. It is defined in the file mga= [HW,DRM] + migration_cost= + [KNL,SMP] debug: override scheduler migration costs + Format: <level-1-usecs>,<level-2-usecs>,... + This debugging option can be used to override the + default scheduler migration cost matrix. The numbers + are indexed by 'CPU domain distance'. + E.g. migration_cost=1000,2000,3000 on an SMT NUMA + box will set up an intra-core migration cost of + 1 msec, an inter-core migration cost of 2 msecs, + and an inter-node migration cost of 3 msecs. + + WARNING: using the wrong values here can break + scheduler performance, so it's only for scheduler + development purposes, not production environments. + + migration_debug= + [KNL,SMP] migration cost auto-detect verbosity + Format=<0|1|2> + If a system's migration matrix reported at bootup + seems erroneous then this option can be used to + increase verbosity of the detection process. + We default to 0 (no extra messages), 1 will print + some more information, and 2 will be really + verbose (probably only ...
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched.c | 8 ++------
1 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index fd28b64..b5f8640 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1778,6 +1778,7 @@ static int
task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
{
s64 delta;
+ s64 migration_cost = (s64)(sd ? sd->cache_hot_time : sysctl_sched_migration_cost);
/*
* Buddy candidates are cache hot:
@@ -1788,14 +1789,9 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
if (p->sched_class != &fair_sched_class)
return 0;
- if (sysctl_sched_migration_cost == -1)
- return 1;
- if (sysctl_sched_migration_cost == 0)
- return 0;
-
delta = now - p->se.exec_start;
- return delta < (s64)sysctl_sched_migration_cost;
+ return delta < migration_cost;
}
--
It does not help after these 4 patches applied.
Gregory Haskins (4):
sched: make task_hot() once again use sd->cache_hot_time
Revert "sched: zap the migration init / cache-hot balancing code"
Revert "[PATCH] sched: remove cache_hot_time"
revert "sched: sched_cacheflush is now unused"
--
Ming, how does this work for you?
---
Subject: sched: wakeup preempt when small overlap
Aggresively preempt a task if its avg overlap is very small, this should
avoid the task going to sleep and find it still running when we schedule
back to it - saving a wakeup.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7b4592c..cb44774 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -897,7 +897,7 @@ struct sched_class {
void (*yield_task) (struct rq *rq);
int (*select_task_rq)(struct task_struct *p, int sync);
- void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
+ void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int sync);
struct task_struct * (*pick_next_task) (struct rq *rq);
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 32d56d6..e17d506 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -609,9 +609,9 @@ struct rq {
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
+static inline void check_preempt_curr(struct rq *rq, struct task_struct *p, int sync)
{
- rq->curr->sched_class->check_preempt_curr(rq, p);
+ rq->curr->sched_class->check_preempt_curr(rq, p, sync);
}
static inline int cpu_of(struct rq *rq)
@@ -2299,7 +2299,7 @@ out_activate:
out_running:
trace_sched_wakeup(rq, p);
- check_preempt_curr(rq, p);
+ check_preempt_curr(rq, p, sync);
p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
@@ -2432,7 +2432,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
inc_nr_running(rq);
}
trace_sched_wakeup_new(rq, p);
- check_preempt_curr(rq, p);
+ check_preempt_curr(rq, p, 0);
#ifdef CONFIG_SMP
if (p->sched_class->task_wake_up)
p->sched_class->task_wake_up(rq, p);
@@ -2889,7 +2889,7 @@ static void pull_task(struct rq ...This patch restores ~5% performance. kernel peformance ----------------------------------- 27-rc4 100% 27-rc5 89.6% 27-rc5+peter's patch 94.5% Thanks, Lin Ming --
