Re: VolanoMark regression with 2.6.27-rc1

From: Zhang, Yanmin
Date: Wednesday, July 30, 2008 - 8:20 pm

Ingo,

volanoMark regresses with 2.6.27-rc1:
1) 70% on 16-core tigerton;
2) 18% on a new multi-core+HT machine.

I tried to use git bisect to locate the root cause, but git bisect always went
back to 2.6.26. Then I ran my mechanical bisect script linearly and located the commit below:

commit 82638844d9a8581bbf33201cc209a14876eca167
Merge: 9982fbf... 63cf13b...
Author: Ingo Molnar <mingo@elte.hu>
Date:   Wed Jul 16 00:29:07 2008 +0200

    Merge branch 'linus' into cpus4096

    Conflicts:

        arch/x86/xen/smp.c
        kernel/sched_rt.c
        net/iucv/iucv.c

    Signed-off-by: Ingo Molnar <mingo@elte.hu>



But 'git show 82638844d9a8581bbf33201cc209a14876eca167' seems to give me only
a part of the patch, while the web interface shows a big patch for the same commit. As it's
a merge, what are the commit IDs of the sub-patches?
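
A note on the merge question above, offered as a sketch rather than an authoritative answer: for a merge commit, `git show` prints a combined diff, which lists only files that were modified relative to every parent (typically the conflict resolutions), so it can look like a partial patch. The commits a merge brought in are those reachable from its second parent but not from its first, which git writes as `<merge>^1..<merge>^2`. The helper names below are hypothetical; the snippet assumes Python 3 and a `git` binary on PATH.

```python
import subprocess


def merged_range(merge_sha):
    """Revision range selecting the commits on the merged-in side of a merge."""
    # <sha>^1 is the branch that was merged into, <sha>^2 the branch merged in.
    return f"{merge_sha}^1..{merge_sha}^2"


def list_merged_commits(merge_sha, repo="."):
    """Return 'git log --oneline' lines for the commits a merge brought in."""
    cmd = ["git", "-C", repo, "log", "--oneline", merged_range(merge_sha)]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.splitlines()
```

Run inside a kernel checkout, `list_merged_commits("82638844d9a8581bbf33201cc209a14876eca167")` would list the commits that came in from 'linus', each of which can then be tested on its own.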


BTW, sysbench+mysql(oltp readonly) has about 15% regression, but git bisect looks crazy again.

-yanmin



--

From: Zhang, Yanmin
Date: Thursday, July 31, 2008 - 12:31 am

Oh, it looks like these are the old issues from 2.6.26-rc1; the 2 patches were reverted before 2.6.26.
New patches were merged into 2.6.27-rc1, but the issues are still not fully resolved.
http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html

yanmin


--

From: Peter Zijlstra
Date: Thursday, July 31, 2008 - 12:39 am

The new smp-group stuff doesn't remotely look like what was in .26

Also, on my quad (admittedly smaller than your machines) both volano and
sysbench didn't regress anymore - where they clearly did with the code
reverted from .26.


--

From: Zhang, Yanmin
Date: Thursday, July 31, 2008 - 12:49 am

The regression I reported exists on:
1) 8-core+HT (16 logical processors in total) tulsa: 40% regression with volano, 8% with oltp;
2) 8-core+HT Montvale Itanium: 9% regression with volano, 8% with oltp;
3) 16-core tigerton: 70% with volano, 18% with oltp;
4) 8-core stoakley: 15% with oltp; testing failed with volanoMark.

So the issues show up across different architectures.

yanmin


--

From: Zhang, Yanmin
Date: Thursday, July 31, 2008 - 5:39 pm

I know the kernel needs these features, and it might not be a good idea to reject them over and over again.
I will collect more data on tigerton and try to optimize it.

-yanmin


--

From: Miao Xie
Date: Thursday, July 31, 2008 - 7:35 pm

on 2008-8-1 8:39 Zhang, Yanmin wrote:

Could you tell me the exact config options? The same as last time?

as follows:

# CONFIG_CGROUPS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y


--

From: Zhang, Yanmin
Date: Thursday, July 31, 2008 - 8:08 pm

From: Dhaval Giani
Date: Thursday, July 31, 2008 - 10:14 pm

Hi Yanmin,

Would it be possible for you to switch off the group scheduling feature
and see if the regression still exists? In all our testing, we did not
see a regression. I would like to eliminate it from your testing as
well.

The option to switch off would be CONFIG_GROUP_SCHED, that should disable
all the group scheduling features.

Thanks,
-- 
regards,
Dhaval
--

From: Zhang, Yanmin
Date: Sunday, August 3, 2008 - 10:04 pm

I tested with CONFIG_GROUP_SCHED=n. To test faster, I simplified the benchmark parameters.

volanoMark:
kernel				| 	result
----------------------------------------------------------
2.6.27-rc1_group		|	205901
----------------------------------------------------------
2.6.27-rc1_nogroup		|	303377
----------------------------------------------------------
2.6.26_group			|	529388


sysbench+mysql(readonly oltp):
kernel				|	result
-----------------------------------------------------------
2.6.27-rc1_group		|	560636
-----------------------------------------------------------
2.6.27-rc1_nogroup		|	604937
-----------------------------------------------------------

--

From: Dhaval Giani
Date: Sunday, August 3, 2008 - 10:22 pm

There seem to be two different regressions here. One in the user group
scheduling (which I do remember did have problems) and something totally
unrelated to group scheduling. In some of the runs I tried here, I got
similar results for 2.6.27-rc1_nogroup and 2.6.27-rc1_cgroup but had bad
results for user. Anyway, we will need to fix both the regressions.
Would it be possible for you to see what causes the regression between
2.6.26 and 2.6.27-rc1 for the non group scheduling case?

thanks,
-- 
regards,
Dhaval
--

From: Zhang, Yanmin
Date: Sunday, August 3, 2008 - 10:37 pm

Does cgroup here mean CONFIG_CGROUPS? Or just a typo?

I will check it. But git bisect doesn't work on this issue. Most likely, it's still
caused by the scheduler. The old emails about 2.6.26-rc1 show that the
major scheduler issues were related to 2 patches, although I'm not sure the
current regression is still caused by them.

yanmin


--

From: Dhaval Giani
Date: Sunday, August 3, 2008 - 10:53 pm

The current set of patches affect group scheduling. From your results,
there is a big performance regression between the 2.6.26 group
scheduling and 2.6.27-rc1 non group scheduling case (where normally non
group scheduling case should have performed better). (I don't recall any
major changes to the scheduler which would explain this regression).
Peter, vatsa, any ideas?

Thanks,
-- 
regards,
Dhaval
--

From: Peter Zijlstra
Date: Sunday, August 3, 2008 - 11:26 pm

---

Patches in tip/sched/clock

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5270d44..ea436bc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1572,28 +1572,13 @@ static inline void sched_clock_idle_sleep_event(void)
 static inline void sched_clock_idle_wakeup_event(u64 delta_ns)
 {
 }
-
-#ifdef CONFIG_NO_HZ
-static inline void sched_clock_tick_stop(int cpu)
-{
-}
-
-static inline void sched_clock_tick_start(int cpu)
-{
-}
-#endif
-
-#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
+#else
 extern void sched_clock_init(void);
 extern u64 sched_clock_cpu(int cpu);
 extern void sched_clock_tick(void);
 extern void sched_clock_idle_sleep_event(void);
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
-#ifdef CONFIG_NO_HZ
-extern void sched_clock_tick_stop(int cpu);
-extern void sched_clock_tick_start(int cpu);
 #endif
-#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
 
 /*
  * For kernel-internal use: high-speed (but slightly incorrect) per-cpu
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 382dd5a..94fabd5 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -55,4 +55,4 @@ config HZ
 	default 1000 if HZ_1000
 
 config SCHED_HRTICK
-	def_bool HIGH_RES_TIMERS && USE_GENERIC_SMP_HELPERS
+	def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)
diff --git a/kernel/sched.c b/kernel/sched.c
index 21f7da9..9a76e92 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -834,7 +834,7 @@ static inline u64 global_rt_period(void)
 
 static inline u64 global_rt_runtime(void)
 {
-	if (sysctl_sched_rt_period < 0)
+	if (sysctl_sched_rt_runtime < 0)
 		return RUNTIME_INF;
 
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 22ed55d..074edc9 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -32,14 +32,18 @@
 #include <linux/ktime.h>
 #include <linux/module.h>
 
+/*
+ * Scheduler clock - returns current time in nanosec ...

--

From: Peter Zijlstra
Date: Sunday, August 3, 2008 - 11:26 pm

---

Revert:
  a7be37ac8e1565e00880531f4e2aff421a21c803  sched: revert the revert of: weight calculations
  c9c294a630e28eec5f2865f028ecfc58d45c0a5a  sched: fix calc_delta_asym()
  ced8aa16e1db55c33c507174c1b1f9e107445865  sched: fix calc_delta_asym, #2

---
diff --git a/kernel/sched.c b/kernel/sched.c
index 21f7da9..7afb0fc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1244,9 +1244,6 @@ static void resched_task(struct task_struct *p)
  */
 #define SRR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))
 
-/*
- * delta *= weight / lw
- */
 static unsigned long
 calc_delta_mine(unsigned long delta_exec, unsigned long weight,
 		struct load_weight *lw)
@@ -1274,6 +1271,12 @@ calc_delta_mine(unsigned long delta_exec, unsigned long weight,
 	return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
 }
 
+static inline unsigned long
+calc_delta_fair(unsigned long delta_exec, struct load_weight *lw)
+{
+	return calc_delta_mine(delta_exec, NICE_0_LOAD, lw);
+}
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index cf2cd6c..593af05 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -334,34 +334,6 @@ int sched_nr_latency_handler(struct ctl_table *table, int write,
 #endif
 
 /*
- * delta *= w / rw
- */
-static inline unsigned long
-calc_delta_weight(unsigned long delta, struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		delta = calc_delta_mine(delta,
-				se->load.weight, &cfs_rq_of(se)->load);
-	}
-
-	return delta;
-}
-
-/*
- * delta *= rw / w
- */
-static inline unsigned long
-calc_delta_fair(unsigned long delta, struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		delta = calc_delta_mine(delta,
-				cfs_rq_of(se)->load.weight, &se->load);
-	}
-
-	return delta;
-}
-
-/*
  * The idea is to set a period in which each task runs once.
  *
  * When there are too many tasks (sysctl_sched_nr_latency) we have to ...

--

From: Dhaval Giani
Date: Monday, August 4, 2008 - 12:05 am

Did we not fix those? :) 
-- 
regards,
Dhaval
--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 12:12 am

Works for me,.. just guessing here.

--

From: Zhang, Yanmin
Date: Tuesday, August 5, 2008 - 8:26 pm

I did more investigation on 16-core tigerton.

Firstly, let's focus on CONFIG_GROUP_SCHED=n. With 2.6.26, the result differs little
with or without CONFIG_GROUP_SCHED.

1) I tried different sched_features and found AFFINE_WAKEUPS has a big impact on volanoMark. Other
features have little impact.

2) With kernel 2.6.26, if AFFINE_WAKEUPS is disabled, the result is 260000; if AFFINE_WAKEUPS is enabled,
the result is 515000, so the improvement caused by AFFINE_WAKEUPS is about 100%. With kernel 2.6.27-rc1,
the improvement is only about 25%.

3) I turned on CONFIG_SCHEDSTATS in the kernel and collected ttwu_move_affine: sample the counters,
sample them again after 30 seconds, and calculate the difference. With 2.6.26, I got the data below:
domain0 279521 142332 0
domain1 184589 22823 0
domain0 289170 142168 0
domain1 185491 23778 0
domain0 291842 139687 0
domain1 187807 23174 0
domain0 292426 144879 0
domain1 179721 22122 0
domain0 287669 137756 0
domain1 201236 25156 0
domain0 268374 139532 0
domain1 210145 25268 0
domain0 292002 144530 0
domain1 196146 24669 0
domain0 298406 145023 0
domain1 178381 22743 0
domain0 275685 141086 0
domain1 203797 25686 0
domain0 285818 140260 0
domain1 180506 23002 0
domain0 290562 139757 0
domain1 186669 23086 0
domain0 296466 142084 0
domain1 186346 24161 0
domain0 283394 137930 0
domain1 195596 23895 0
domain0 269296 142978 0
domain1 210648 25682 0
domain0 281672 144002 0
domain1 189959 23685 0
domain0 301834 145922 0
domain1 172737 22351 0


The 3rd column is the ttwu_move_affine difference.

With 2.6.27-rc1:
domain0 39054 302678 0
domain1 315384 245684 0
domain0 39142 304117 0
domain1 312896 244796 0
domain0 38636 304438 0
domain1 310687 244409 0
domain0 39534 304167 0
domain1 313746 245381 0
domain0 39082 304231 0
domain1 312592 245219 0
domain0 39057 305460 0
domain1 311395 245195 0
domain0 38224 301351 0
domain1 314482 244448 0
domain0 38016 300573 0
domain1 309031 241127 0
domain0 40285 306397 ...

--

From: Peter Zijlstra
Date: Friday, August 8, 2008 - 12:30 am

I'm a bit puzzled, but you're right - I too noticed that volanomark is
_very_ sensitive to affine wakeups.

I'll try and find what changed in that code for GROUP=n.
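
The per-domain samples quoted earlier in this thread can be summarised mechanically. The sketch below assumes each line has the form `domainN a b c`, takes the third whitespace-separated field (`b`) as the ttwu_move_affine delta as stated there, and uses only the first two sample pairs from each kernel; the helper name is hypothetical.

```python
from collections import defaultdict


def mean_third_column(sample_text):
    """Average the 3rd whitespace-separated field per domain label."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in sample_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("domain"):
            continue  # skip blank or truncated lines
        totals[fields[0]] += int(fields[2])
        counts[fields[0]] += 1
    return {name: totals[name] / counts[name] for name in totals}


# First two 30-second samples from each kernel, copied from the data above.
SAMPLES_2_6_26 = """\
domain0 279521 142332 0
domain1 184589 22823 0
domain0 289170 142168 0
domain1 185491 23778 0
"""

SAMPLES_2_6_27_RC1 = """\
domain0 39054 302678 0
domain1 315384 245684 0
domain0 39142 304117 0
domain1 312896 244796 0
"""

old = mean_third_column(SAMPLES_2_6_26)
new = mean_third_column(SAMPLES_2_6_27_RC1)
for name in sorted(old):
    print(f"{name}: 2.6.26 {old[name]:.0f} -> 2.6.27-rc1 {new[name]:.0f}")
```

On these samples the domain1 value grows roughly tenfold between the two kernels, while the domain0 value roughly doubles.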

--

From: Zhang, Yanmin
Date: Wednesday, August 13, 2008 - 1:50 am

I collected more data and found that the CPU_NEWLY_IDLE balance schedstat looks abnormal.
Compared with 2.6.26, 2.6.27-rc1 has more successful move_tasks among cpu runqueues. I
instrumented the kernel and found that, with 2.6.26, the task is mostly hot when the kernel tries to
move it to another cpu. But with 2.6.27-rc1, the task is often moved successfully.
If I set /proc/sys/kernel/sched_migration_cost=1500000 (default is 500000), the volanoMark
result improves significantly, close to the result of 2.6.26. The above testing set
CONFIG_GROUP_SCHED=n. So perhaps some key data structures changed in 2.6.27-rc1
and create more cache misses. With 2.6.26, cpu idle is about 6~7%. With 2.6.27-rc1, cpu idle
is about 1%. I compared the 2 kernels and couldn't find which data structure change causes it.
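
The effect of sched_migration_cost described above can be illustrated with a toy model of the cache-hot test the load balancer applies before moving a task. This is a simplification and an assumption about the kernel's logic, not a copy of it: a task counts as hot if it last started running less than sched_migration_cost nanoseconds ago. The function name and the sample timestamps are invented for illustration.

```python
def is_cache_hot(now_ns, exec_start_ns, migration_cost_ns):
    """Toy cache-hot test: True if the task ran more recently than the cost."""
    # Assumption: hotness is judged purely by time since the task last started running.
    return (now_ns - exec_start_ns) < migration_cost_ns


# A task that last started running 1 ms ago (hypothetical timestamps).
now_ns = 10_000_000
exec_start_ns = 9_000_000

for cost in (500_000, 1_500_000):
    hot = is_cache_hot(now_ns, exec_start_ns, cost)
    print(f"sched_migration_cost={cost}: hot={hot}")
```

In this model the task is not hot at the default of 500000 and may be migrated, while at 1500000 it is hot and stays put, which is consistent with fewer successful moves at the larger setting.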

As for CONFIG_GROUP_SCHED=y, oprofile shows tg_shares_up consumes about 8% cpu utilization
on my 16-core tigerton. Enlarging /proc/sys/kernel/sched_shares_ratelimit doesn't help the
volanoMark result. I checked the group scheduling code and got an idea to improve it: add
share_percent, a new variable in task_group->sched_entity[i], to record the percentage this task group
occupies in the parent group. share_percent is updated in walk_tg_tree. In account_entity_enqueue,
if the task entity has a parent, we could just use share_percent and se->load.weight to calculate
a new weight and add the new weight to the parent entity weight, and in the end to the runqueue load weight.
So when sched_shares_ratelimit is enlarged, the various load balances could still work well. I think
volanoMark could benefit from it.

BTW, with CONFIG_GROUP_SCHED=y, hackbench has about an 80% regression on my 8-core+multi-thread
Montvale Itanium machine and Tulsa machines. It seems multi-thread machines have the regression.

-yanmin


--

From: Peter Zijlstra
Date: Sunday, August 3, 2008 - 11:54 pm

---
Subject: sched: scale sysctl_sched_shares_ratelimit with nr_cpus

David reported that his Niagara spends a little too much time in
tg_shares_up(), which considering he has a large cpu count makes sense.

So scale the ratelimit value with the number of cpus like we do for
other controls as well.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/kernel/sched.c b/kernel/sched.c
index 9a76e92..7eddaea 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -809,9 +809,9 @@ const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
 /*
  * ratelimit for updating the group shares.
- * default: 0.5ms
+ * default: 0.25ms
  */
-const_debug unsigned int sysctl_sched_shares_ratelimit = 500000;
+const_debug unsigned int sysctl_sched_shares_ratelimit = 250000;
 
 /*
  * period over which we measure -rt task cpu usage in us.
@@ -5732,6 +5732,8 @@ static inline void sched_init_granularity(void)
 		sysctl_sched_latency = limit;
 
 	sysctl_sched_wakeup_granularity *= factor;
+
+	sysctl_sched_shares_ratelimit *= factor;
 }
 
 #ifdef CONFIG_SMP
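
As a worked example of the scaling this patch introduces: the hunk multiplies sysctl_sched_shares_ratelimit by `factor`, whose definition is not shown in the diff. The sketch below assumes `factor = 1 + ilog2(number of online cpus)`; treat that as an assumption to check against sched_init_granularity(). Function names are illustrative.

```python
def ilog2(n):
    """Integer floor of log2(n) for n >= 1."""
    return n.bit_length() - 1


def scaled_shares_ratelimit(nr_cpus, base_ns=250_000):
    """Ratelimit in ns after scaling, assuming factor = 1 + ilog2(nr_cpus)."""
    factor = 1 + ilog2(nr_cpus)
    return base_ns * factor


for cpus in (1, 4, 16):
    print(f"{cpus:2d} cpus: {scaled_shares_ratelimit(cpus)} ns")
```

Under that assumption a 16-cpu machine ends up with 1250000 ns (1.25 ms), compared with the previous fixed default of 500000 ns.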


--

From: Ingo Molnar
Date: Friday, August 15, 2008 - 8:37 am

i've queued this up in tip/sched/urgent as it makes sense - but i'm also 
wondering, does this impact the volano numbers?

	Ingo
--

From: Hugh Dickins
Date: Friday, August 1, 2008 - 5:25 am

I'm no git expert, but didn't see anyone else comment on this:
you need to trust git more, it's like the Tour de France,
occasionally venturing into other countries for a little while.

Work which got merged into Linus's 2.6.26-git for 2.6.27-rc1 may
well have been developed on a 2.6.26-rcN base in someone else's
tree, and so the bisection may take you back there.

I think this is getting commoner now, since Linus spoke out
against rebasing: bisecting a net issue took me back to rc6-git
and rc4-git, but did end up at the right commit.

It can be a nuisance if you don't notice at "make install" time,
and reboot another kernel than the one you just built to test.

Hugh
--

From: Zhang, Yanmin
Date: Sunday, August 3, 2008 - 5:54 pm

Sometimes git bisect can locate the culprit, but it didn't this time.
I'm used to keeping quiet when things are good and making noise when
something is wrong. :)

Thanks,
Yanmin

