[PATCH 3/4] Revert "sched: zap the migration init / cache-hot balancing code"

Previous thread: by Wu Shuwei on Thursday, September 4, 2008 - 1:47 am. (1 message)

Next thread: [PATCH] Make taint bit reliable v3 by Andi Kleen on Thursday, September 4, 2008 - 2:16 am. (1 message)
From: Lin Ming
Date: Thursday, September 4, 2008 - 1:51 am

Comparing with 2.6.27-rc4, oltp has ~10% regression with 2.6.27-rc5 on
8-core stoakley machine.

Run oltp with 8 threads 120 seconds, vmstat shows much more idle time, about ~30%

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
10  0      0 7822824  42240 123740    0    0   312    47  442 1613  3  2 88  6  0
 9  0      0 7822312  42240 123764    0    0     0    16 26691 232566 56 14 30  0  0
13  0      0 7821940  42240 123764    0    0     0    16 26661 228689 54 14 32  0  0
 8  0      0 7821320  42240 123764    0    0     0    16 31508 263765 61 17 23  0  0
12  0      0 7820948  42240 123764    0    0    16    16 28666 242402 57 15 28  0  0
 9  0      0 7820584  42240 123780    0    0     0    16 27107 230804 56 14 30  0  0
10  0      0 7819964  42240 123796    0    0    16   612 27599 244037 55 16 29  0  0
11  0      0 7819356  42240 123796    0    0     0    64 23540 209713 51 13 36  0  0
10  0      0 7819212  42240 123796    0    0     0    32 25674 224205 54 13 32  0  0
10  0      0 7818716  42240 123796    0    0     0    20 30106 257161 59 16 25  0  0
 7  0      0 7818468  42240 123796    0    0     0    16 28356 241551 57 14 29  0  0
10  0      0 7818096  42240 123796    0    0     0    16 39174 273656 64 16 20  0  0
12  0      0 7817724  42240 123796    0    0     0    20 39688 276936 63 16 20  0  0
11  0      0 7817352  42240 123796    0    0     0    16 42543 285192 66 16 18  0  0
 9  0      0 7817352  42240 123796    0    0     0    16 37083 259830 62 14 24  0  0
 8  0      0 7817104  42240 123796    0    0     0    16 37450 259160 61 15 23  0  0
10  0      0 7816516  42240 123796    0    0     0    64 37425 261870 61 16 23  0  0
11  0      0 7815896  42240 123812    0    0    16    16 41558 279320 66 16 18  0  0
 9  0      0 7815648  42240 123812    0    0     0    16 34017 235741 59 14 28  0  0
10  0      0 7815152  42240 123812    0    0     0    16 35642 ...
From: Peter Zijlstra
Date: Thursday, September 4, 2008 - 2:03 am

Thats bizarre... that just indicates the better clock, which should give
better (read fairer) scheduling hurts your workload.

Is there anything I can run to see if we can fix the scheduler perhaps?

--

From: Lin Ming
Date: Thursday, September 4, 2008 - 3:52 am

I observed schedstats of sysbench, there's more
"nr_failed_migrations_hot"

2.6.27-rc4: se.nr_failed_migrations_hot 11
2.6.27-rc5: se.nr_failed_migrations_hot 95

task migration failed because of task_hot, the system is un-balanced?

--

From: Ingo Molnar
Date: Thursday, September 4, 2008 - 4:09 am

would be nice to get a representative (==steady state) scheduler trace 
from the critical portions of that workload. See:

 http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt

	Ingo
--

From: Lin Ming
Date: Thursday, September 4, 2008 - 4:30 am

A huge trace result file, see the attachment for more.


# tracer: sched_switch
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
          mysqld-3799  [07]   151.411009:   3799:120:R ==>  3807:120:R
          mysqld-3796  [00]   151.411015:   3796:120:R   +  3804:120:S
          mysqld-3796  [00]   151.411018:   3796:120:R ==>  3804:120:R
        sysbench-3807  [07]   151.411020:   3807:120:S ==>  3799:120:R
        sysbench-3804  [00]   151.411020:   3804:120:S ==>  3796:120:R
          mysqld-3799  [07]   151.411028:   3799:120:R   +  3807:120:S
          mysqld-3796  [00]   151.411029:   3796:120:R   +  3804:120:S
          mysqld-3799  [07]   151.411030:   3799:120:R ==>  3807:120:R
          mysqld-3796  [00]   151.411031:   3796:120:R ==>  3804:120:R
        sysbench-3807  [07]   151.411032:   3807:120:S ==>  3799:120:R
        sysbench-3804  [00]   151.411049:   3804:120:S ==>  3796:120:R
          mysqld-3799  [07]   151.411052:   3799:120:R   +  3807:120:S
          mysqld-3799  [07]   151.411053:   3799:120:R ==>  3807:120:R
          mysqld-3796  [00]   151.411055:   3796:120:R   +  3804:120:S
        sysbench-3807  [07]   151.411063:   3807:120:S ==>  3799:120:R
          mysqld-3796  [00]   151.411069:   3796:120:S ==>  3804:120:R
          mysqld-3799  [07]   151.411070:   3799:120:R   +  3807:120:S
          mysqld-3799  [07]   151.411071:   3799:120:R ==>  3807:120:R
        sysbench-3807  [07]   151.411073:   3807:120:S ==>  3799:120:R
        sysbench-3804  [00]   151.411078:   3804:120:S ==>  3798:120:R
          mysqld-3798  [00]   151.411093:   3798:120:R   +  3806:120:S
          mysqld-3799  [07]   151.411093:   3799:120:R   +  3807:120:S
          mysqld-3799  [07]   151.411095:   3799:120:R ==>  3807:120:R
        sysbench-3807  [07]   151.411104:   3807:120:S ==>  3799:120:R
          mysqld-3799  [07]   151.411114:   3799:120:R   +  3807:120:S
          mysqld-3799  [07]   151.411115:   ...
From: Ingo Molnar
Date: Thursday, September 4, 2008 - 4:35 am

ok, this only spans 6 milliseconds, which is probably a bit too light.

Could you please increase the trace size 32-fold (via 
/debug/tracing/trace_entries), and send us the (large!) trace file 
compressed via bzip2 -9, off-list (or via a web link)? [anyone on-list 
can have a look at the smaller trace file already]

Thanks,

	Ingo
--

From: Lin Ming
Date: Thursday, September 4, 2008 - 5:19 am

The large trace file also available at below link,

--

From: Peter Zijlstra
Date: Thursday, September 4, 2008 - 4:06 am

Ah, that makes sense, a more accurate clock could indeed make more tasks
hot.

Can you try fiddling with: /proc/sys/kernel/sched_migration_cost ?

Also, we used to have some auto-tuning in there, which dissapeared some
time ago, gregory brought it back to live recently, perhaps he likes to
share? :-)

--

From: Lin Ming
Date: Thursday, September 4, 2008 - 5:12 am

sched_migration_cost		regression
----------------------          -------------
50000                           ~6%
0				~8%
500000 (default)		~10%
5000000                         ~14%

--

From: Peter Zijlstra
Date: Thursday, September 4, 2008 - 5:26 am

at 50000 (~6%), is the predominant difference in schedstats still the
nr_failed_migrations_hot?

--

From: Lin Ming
Date: Thursday, September 4, 2008 - 5:42 am

Yes, it's strange that nr_failed_migrations_hot (cost=50000) is larger
than (cost=500000)

sched_migration_cost = 50000
----------------------------------------------------------
se.exec_start                      :       3475158.689868
se.vruntime                        :        935215.526989
se.sum_exec_runtime                :         14358.431942
se.avg_overlap                     :             0.000000
se.wait_start                      :             0.000000
se.sleep_start                     :       3475158.689868
se.block_start                     :             0.000000
se.sleep_max                       :            37.910779
se.block_max                       :             0.033796
se.exec_max                        :             0.121028
se.slice_max                       :             0.000000
se.wait_max                        :             6.007209
se.wait_sum                        :         25029.649588
se.wait_count                      :              1954948
sched_info.bkl_count               :                    0
se.nr_migrations                   :                 2521
se.nr_migrations_cold              :                    0
se.nr_failed_migrations_affine     :                    0
se.nr_failed_migrations_running    :                  147
se.nr_failed_migrations_hot        :                  123
se.nr_forced_migrations            :                    0
se.nr_forced2_migrations           :                 1236
se.nr_wakeups                      :              1952827
se.nr_wakeups_sync                 :              1944785
se.nr_wakeups_migrate              :                 2479
se.nr_wakeups_local                :              1930122
se.nr_wakeups_remote               :                22705
se.nr_wakeups_affine               :                13252
se.nr_wakeups_affine_attempts      :               825950
se.nr_wakeups_passive              :                    0
se.nr_wakeups_idle                 :                    0
avg_atom                    ...
From: Gregory Haskins
Date: Thursday, September 4, 2008 - 6:50 am

Hi Peter,


Per your request, I brought those "revert" patches I had in the PQ branch
forward to 2.6.27-rc5.  They are completely unbuilt and untested, but here
for your experimentation.

-Greg

---

Gregory Haskins (4):
      sched: make task_hot() once again use sd->cache_hot_time
      Revert "sched: zap the migration init / cache-hot balancing code"
      Revert "[PATCH] sched: remove cache_hot_time"
      revert "sched: sched_cacheflush is now unused"


 Documentation/kernel-parameters.txt |   43 +++
 arch/ia64/kernel/setup.c            |   15 +
 arch/mips/kernel/smp.c              |   11 +
 arch/sparc/kernel/smp.c             |   10 +
 arch/sparc64/kernel/smp.c           |   27 ++
 arch/x86/kernel/smpboot.c           |   12 +
 include/asm-m32r/system.h           |   10 +
 include/asm-mips/system.h           |   10 +
 include/asm-parisc/system.h         |   11 +
 include/asm-x86/system.h            |    9 +
 include/linux/sched.h               |    7 
 include/linux/topology.h            |    1 
 kernel/sched.c                      |  499 ++++++++++++++++++++++++++++++++++-
 13 files changed, 656 insertions(+), 9 deletions(-)

-- 
Signature
--

From: Gregory Haskins
Date: Thursday, September 4, 2008 - 6:50 am

revert c41917df8a1adde34864116ce2231a7fe308d2ff

	Author: Ralf Baechle <ralf@linux-mips.org>
	Date:   Thu Jul 19 21:28:35 2007 +0200

	    [PATCH] sched: sched_cacheflush is now unused

	    Since Ingo's recent scheduler rewrite which was merged as commit
	    0437e109e1841607f2988891eaa36c531c6aa6ac sched_cacheflush is unused.

	    Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
	    Signed-off-by: Ingo Molnar <mingo@elte.hu>

We want to restore sd->cache_hot_time for better migration decisions.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/ia64/kernel/setup.c    |    9 +++++++++
 include/asm-m32r/system.h   |   10 ++++++++++
 include/asm-mips/system.h   |   10 ++++++++++
 include/asm-parisc/system.h |   11 +++++++++++
 include/asm-x86/system.h    |    9 +++++++++
 5 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index c27d5b2..c0050ab 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -1051,6 +1051,15 @@ cpu_init (void)
 	pm_idle = default_idle;
 }
 
+/*
+ * On SMP systems, when the scheduler does migration-cost autodetection,
+ * it needs a way to flush as much of the CPU's caches as possible.
+ */
+void sched_cacheflush(void)
+{
+	ia64_sal_cache_flush(3);
+}
+
 void __init
 check_bugs (void)
 {
diff --git a/include/asm-m32r/system.h b/include/asm-m32r/system.h
index 70a57c8..d2f9559 100644
--- a/include/asm-m32r/system.h
+++ b/include/asm-m32r/system.h
@@ -54,6 +54,16 @@
 	); \
 } while(0)
 
+/*
+ * On SMP systems, when the scheduler does migration-cost autodetection,
+ * it needs a way to flush as much of the CPU's caches as possible.
+ *
+ * TODO: fill this in!
+ */
+static inline void sched_cacheflush(void)
+{
+}
+
 /* Interrupt Control */
 #if !defined(CONFIG_CHIP_M32102) && !defined(CONFIG_CHIP_M32104)
 #define local_irq_enable() \
diff --git a/include/asm-mips/system.h b/include/asm-mips/system.h
index a944eda..aaba821 ...
From: Gregory Haskins
Date: Thursday, September 4, 2008 - 6:50 am

This reverts commit 362a7016637648c6aefc98b706298baedfaa1543.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h    |    1 +
 include/linux/topology.h |    1 +
 kernel/sched.c           |    8 +++++---
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfb0d87..5619f3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -777,6 +777,7 @@ struct sched_domain {
 	unsigned long max_interval;	/* Maximum balance interval ms */
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
+	unsigned long long cache_hot_time; /* Task considered cache hot (ns) */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
 	unsigned int busy_idx;
 	unsigned int idle_idx;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 2158fc0..f6a7928 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -162,6 +162,7 @@ void arch_update_cpu_topology(void);
 	.max_interval		= 64*num_online_cpus(),	\
 	.busy_factor		= 128,			\
 	.imbalance_pct		= 133,			\
+	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 3,			\
diff --git a/kernel/sched.c b/kernel/sched.c
index 9a1ddb8..0ca5218 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6223,7 +6223,7 @@ set_table_entry(struct ctl_table *entry,
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
-	struct ctl_table *table = sd_alloc_ctl_entry(12);
+	struct ctl_table *table = sd_alloc_ctl_entry(13);
 
 	if (table == NULL)
 		return NULL;
@@ -6249,9 +6249,11 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
 	set_table_entry(&table[9], "cache_nice_tries",
 		&sd->cache_nice_tries,
 		sizeof(int), 0644, proc_dointvec_minmax);
-	set_table_entry(&table[10], "flags", &sd->flags,
+	set_table_entry(&table[10], "cache_hot_time", ...
From: Gregory Haskins
Date: Thursday, September 4, 2008 - 6:50 am

>From commit: 0437e109e1841607f2988891eaa36c531c6aa6ac

We want to restore the concept of using empirical cache_hot data for
managing migrations.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 Documentation/kernel-parameters.txt |   43 +++
 arch/ia64/kernel/setup.c            |    6 
 arch/mips/kernel/smp.c              |   11 +
 arch/sparc/kernel/smp.c             |   10 +
 arch/sparc64/kernel/smp.c           |   27 ++
 arch/x86/kernel/smpboot.c           |   12 +
 include/linux/sched.h               |    6 
 kernel/sched.c                      |  483 +++++++++++++++++++++++++++++++++++
 8 files changed, 598 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1150444..6e7b78f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1247,6 +1247,49 @@ and is between 256 and 4096 characters. It is defined in the file
 
 	mga=		[HW,DRM]
 
+	migration_cost=
+			[KNL,SMP] debug: override scheduler migration costs
+			Format: <level-1-usecs>,<level-2-usecs>,...
+			This debugging option can be used to override the
+			default scheduler migration cost matrix. The numbers
+			are indexed by 'CPU domain distance'.
+			E.g. migration_cost=1000,2000,3000 on an SMT NUMA
+			box will set up an intra-core migration cost of
+			1 msec, an inter-core migration cost of 2 msecs,
+			and an inter-node migration cost of 3 msecs.
+
+			WARNING: using the wrong values here can break
+			scheduler performance, so it's only for scheduler
+			development purposes, not production environments.
+
+	migration_debug=
+			[KNL,SMP] migration cost auto-detect verbosity
+			Format=<0|1|2>
+			If a system's migration matrix reported at bootup
+			seems erroneous then this option can be used to
+			increase verbosity of the detection process.
+			We default to 0 (no extra messages), 1 will print
+			some more information, and 2 will be really
+			verbose (probably only ...
From: Gregory Haskins
Date: Thursday, September 4, 2008 - 6:50 am

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index fd28b64..b5f8640 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1778,6 +1778,7 @@ static int
 task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 {
 	s64 delta;
+	s64 migration_cost = (s64)(sd ? sd->cache_hot_time : sysctl_sched_migration_cost);
 
 	/*
 	 * Buddy candidates are cache hot:
@@ -1788,14 +1789,9 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	if (p->sched_class != &fair_sched_class)
 		return 0;
 
-	if (sysctl_sched_migration_cost == -1)
-		return 1;
-	if (sysctl_sched_migration_cost == 0)
-		return 0;
-
 	delta = now - p->se.exec_start;
 
-	return delta < (s64)sysctl_sched_migration_cost;
+	return delta < migration_cost;
 }
 
 

--

From: Lin Ming
Date: Thursday, September 4, 2008 - 6:26 pm

It does not help after these 4 patches applied.

Gregory Haskins (4):
     sched: make task_hot() once again use sd->cache_hot_time
     Revert "sched: zap the migration init / cache-hot balancing code"
     Revert "[PATCH] sched: remove cache_hot_time"
     revert "sched: sched_cacheflush is now unused"

--

From: Peter Zijlstra
Date: Saturday, September 20, 2008 - 2:38 pm

Ming, how does this work for you?

---
Subject: sched: wakeup preempt when small overlap

Aggresively preempt a task if its avg overlap is very small, this should
avoid the task going to sleep and find it still running when we schedule
back to it - saving a wakeup.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7b4592c..cb44774 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -897,7 +897,7 @@ struct sched_class {
 	void (*yield_task) (struct rq *rq);
 	int  (*select_task_rq)(struct task_struct *p, int sync);
 
-	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
+	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int sync);
 
 	struct task_struct * (*pick_next_task) (struct rq *rq);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 32d56d6..e17d506 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -609,9 +609,9 @@ struct rq {
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
-static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
+static inline void check_preempt_curr(struct rq *rq, struct task_struct *p, int sync)
 {
-	rq->curr->sched_class->check_preempt_curr(rq, p);
+	rq->curr->sched_class->check_preempt_curr(rq, p, sync);
 }
 
 static inline int cpu_of(struct rq *rq)
@@ -2299,7 +2299,7 @@ out_activate:
 
 out_running:
 	trace_sched_wakeup(rq, p);
-	check_preempt_curr(rq, p);
+	check_preempt_curr(rq, p, sync);
 
 	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
@@ -2432,7 +2432,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 		inc_nr_running(rq);
 	}
 	trace_sched_wakeup_new(rq, p);
-	check_preempt_curr(rq, p);
+	check_preempt_curr(rq, p, 0);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_wake_up)
 		p->sched_class->task_wake_up(rq, p);
@@ -2889,7 +2889,7 @@ static void pull_task(struct rq ...
From: Lin Ming
Date: Thursday, September 25, 2008 - 7:00 pm

This patch restores ~5% performance.

kernel                  peformance
-----------------------------------
27-rc4                  100%
27-rc5                  89.6%
27-rc5+peter's patch    94.5%

Thanks,
Lin Ming


--

Previous thread: by Wu Shuwei on Thursday, September 4, 2008 - 1:47 am. (1 message)

Next thread: [PATCH] Make taint bit reliable v3 by Andi Kleen on Thursday, September 4, 2008 - 2:16 am. (1 message)