Ingo,
volanoMark regresses with 2.6.27-rc1:
1) 70% on a 16-core tigerton;
2) 18% on a new multi-core+HT machine.
I tried to use git bisect to locate the root cause, but git bisect always went
back to 2.6.26. I then stepped through the commits linearly with my own bisect script, which located the commit below:
commit 82638844d9a8581bbf33201cc209a14876eca167
Merge: 9982fbf... 63cf13b...
Author: Ingo Molnar <mingo@elte.hu>
Date: Wed Jul 16 00:29:07 2008 +0200
Merge branch 'linus' into cpus4096
Conflicts:
arch/x86/xen/smp.c
kernel/sched_rt.c
net/iucv/iucv.c
Signed-off-by: Ingo Molnar <mingo@elte.hu>
But if I run 'git show 82638844d9a8581bbf33201cc209a14876eca167', it looks like I only
get part of the patch. If I view the commit on the web, I get a big patch. As it's
a merge, what are the commit IDs of the sub-patches?
BTW, sysbench+mysql (oltp readonly) shows about a 15% regression, but git bisect behaves strangely there as well.
-yanmin
--
Oh, it looks like these are the old issues from 2.6.26-rc1, where the 2 patches were reverted before 2.6.26. New patches were merged into 2.6.27-rc1, but the issues are still not clearly resolved:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html
yanmin
--
The new smp-group stuff doesn't remotely look like what was in .26.
Also, on my quad (admittedly smaller than your machines) neither volano nor sysbench regressed anymore, whereas they clearly did with the code reverted from .26.
--
The regression I reported exists on:
1) 8-core+HT (16 logical processors in total) tulsa: 40% regression with volano, 8% with oltp;
2) 8-core+HT Montvale Itanium: 9% regression with volano, 8% with oltp;
3) 16-core tigerton: 70% with volano, 18% with oltp;
4) 8-core stoakley: 15% with oltp; testing failed with volanoMark.
So the issues show up widely across different architectures.
yanmin
--
I know the kernel needs these features, and it might not be a good idea to reject them over and over again. I will collect more data on tigerton and try to optimize it.
-yanmin
--
On 2008-8-1 8:39, Zhang, Yanmin wrote:
Could you tell me the exact config options? The same as last time? As follows:
# CONFIG_CGROUPS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
--
Hi Yanmin,
Would it be possible for you to switch off the group scheduling feature and see if the regression still exists? In all our testing, we did not see a regression. I would like to eliminate it from your testing as well. The option to switch off would be CONFIG_GROUP_SCHED; that should disable all the group scheduling features.
Thanks,
--
regards,
Dhaval
--
I tested with CONFIG_GROUP_SCHED=n. To test faster, I simplified the benchmark parameters.

volanoMark:
kernel             | result
----------------------------------------------------------
2.6.27-rc1_group   | 205901
----------------------------------------------------------
2.6.27-rc1_nogroup | 303377
----------------------------------------------------------
2.6.26_group       | 529388

sysbench+mysql(readonly oltp):
kernel             | result
-----------------------------------------------------------
2.6.27-rc1_group   | 560636
-----------------------------------------------------------
2.6.27-rc1_nogroup | 604937
-----------------------------------------------------------
--
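The size of the drop implied by the volanoMark table can be checked with a small helper. This is only an illustrative sketch: regression_pct is a hypothetical name, and it assumes that higher benchmark scores are better.

```c
#include <assert.h>

/*
 * Percentage by which `result` falls short of `baseline`.
 * Assumes higher scores are better; a negative return value
 * would mean an improvement.
 */
double regression_pct(double baseline, double result)
{
    assert(baseline > 0.0);
    return (baseline - result) / baseline * 100.0;
}
```

Taking 2.6.26_group (529388) as the baseline, regression_pct(529388, 205901) comes to roughly 61% for 2.6.27-rc1_group and regression_pct(529388, 303377) to roughly 43% for 2.6.27-rc1_nogroup.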
There seem to be two different regressions here. One in the user group scheduling (which I do remember did have problems) and something totally unrelated to group scheduling. In some of the runs I tried here, I got similar results for 2.6.27-rc1_nogroup and 2.6.27-rc1_cgroup but had bad results for user. Anyway, we will need to fix both regressions. Would it be possible for you to see what causes the regression between 2.6.26 and 2.6.27-rc1 for the non-group-scheduling case?
thanks,
--
regards,
Dhaval
--
Does cgroup here mean CONFIG_CGROUPS, or is it just a typo?
I will check it. But git bisect doesn't work on this issue. Most likely it's still caused by the scheduler. Looking back at the old emails about 2.6.26-rc1, the major scheduler issues were related to 2 patches, although I'm not sure the current regression is still caused by them.
yanmin
--
The current set of patches affects group scheduling. From your results, there is a big performance regression between the 2.6.26 group scheduling case and the 2.6.27-rc1 non-group-scheduling case (where normally the non-group-scheduling case should have performed better). I don't recall any major changes to the scheduler which would explain this regression. Peter, vatsa, any ideas?
Thanks,
--
regards,
Dhaval
--
---
Patches in tip/sched/clock
---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5270d44..ea436bc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1572,28 +1572,13 @@ static inline void sched_clock_idle_sleep_event(void)
static inline void sched_clock_idle_wakeup_event(u64 delta_ns)
{
}
-
-#ifdef CONFIG_NO_HZ
-static inline void sched_clock_tick_stop(int cpu)
-{
-}
-
-static inline void sched_clock_tick_start(int cpu)
-{
-}
-#endif
-
-#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
+#else
extern void sched_clock_init(void);
extern u64 sched_clock_cpu(int cpu);
extern void sched_clock_tick(void);
extern void sched_clock_idle_sleep_event(void);
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
-#ifdef CONFIG_NO_HZ
-extern void sched_clock_tick_stop(int cpu);
-extern void sched_clock_tick_start(int cpu);
#endif
-#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
/*
* For kernel-internal use: high-speed (but slightly incorrect) per-cpu
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 382dd5a..94fabd5 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -55,4 +55,4 @@ config HZ
default 1000 if HZ_1000
config SCHED_HRTICK
- def_bool HIGH_RES_TIMERS && USE_GENERIC_SMP_HELPERS
+ def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)
diff --git a/kernel/sched.c b/kernel/sched.c
index 21f7da9..9a76e92 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -834,7 +834,7 @@ static inline u64 global_rt_period(void)
static inline u64 global_rt_runtime(void)
{
- if (sysctl_sched_rt_period < 0)
+ if (sysctl_sched_rt_runtime < 0)
return RUNTIME_INF;
return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 22ed55d..074edc9 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -32,14 +32,18 @@
#include <linux/ktime.h>
#include <linux/module.h>
+/*
+ * Scheduler clock - returns current time in nanosec ...
---
Revert:
a7be37ac8e1565e00880531f4e2aff421a21c803 sched: revert the revert of: weight calculations
c9c294a630e28eec5f2865f028ecfc58d45c0a5a sched: fix calc_delta_asym()
ced8aa16e1db55c33c507174c1b1f9e107445865 sched: fix calc_delta_asym, #2
---
diff --git a/kernel/sched.c b/kernel/sched.c
index 21f7da9..7afb0fc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1244,9 +1244,6 @@ static void resched_task(struct task_struct *p)
*/
#define SRR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))
-/*
- * delta *= weight / lw
- */
static unsigned long
calc_delta_mine(unsigned long delta_exec, unsigned long weight,
struct load_weight *lw)
@@ -1274,6 +1271,12 @@ calc_delta_mine(unsigned long delta_exec, unsigned long weight,
return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
}
+static inline unsigned long
+calc_delta_fair(unsigned long delta_exec, struct load_weight *lw)
+{
+ return calc_delta_mine(delta_exec, NICE_0_LOAD, lw);
+}
+
static inline void update_load_add(struct load_weight *lw, unsigned long inc)
{
lw->weight += inc;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index cf2cd6c..593af05 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -334,34 +334,6 @@ int sched_nr_latency_handler(struct ctl_table *table, int write,
#endif
/*
- * delta *= w / rw
- */
-static inline unsigned long
-calc_delta_weight(unsigned long delta, struct sched_entity *se)
-{
- for_each_sched_entity(se) {
- delta = calc_delta_mine(delta,
- se->load.weight, &cfs_rq_of(se)->load);
- }
-
- return delta;
-}
-
-/*
- * delta *= rw / w
- */
-static inline unsigned long
-calc_delta_fair(unsigned long delta, struct sched_entity *se)
-{
- for_each_sched_entity(se) {
- delta = calc_delta_mine(delta,
- cfs_rq_of(se)->load.weight, &se->load);
- }
-
- return delta;
-}
-
-/*
* The idea is to set a period in which each task runs once.
*
 * When there are too many tasks (sysctl_sched_nr_latency) we have to ...
Did we not fix those? :)
--
regards,
Dhaval
--
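For readers following the revert, the removed comment "delta *= weight / lw" describes a proportional scaling of the executed time. A minimal sketch in plain C follows, assuming an ordinary integer division rather than whatever arithmetic the kernel really uses. The names scale_delta and delta_fair are hypothetical, and the nice-0 weight of 1024 is an assumption that the quoted diff does not show.

```c
#include <assert.h>

/* Assumed value of the nice-0 weight; not shown in the quoted diff. */
#define NICE_0_LOAD_ASSUMED 1024UL

/*
 * delta_exec * weight / lw_weight.
 * The product is widened to unsigned long long so that it does not
 * wrap on platforms where unsigned long is only 32 bits wide.
 */
unsigned long scale_delta(unsigned long delta_exec, unsigned long weight,
                          unsigned long lw_weight)
{
    assert(lw_weight > 0);
    unsigned long long tmp = (unsigned long long)delta_exec * weight;
    return (unsigned long)(tmp / lw_weight);
}

/* Same scaling with the weight fixed at the assumed nice-0 value. */
unsigned long delta_fair(unsigned long delta_exec, unsigned long lw_weight)
{
    return scale_delta(delta_exec, NICE_0_LOAD_ASSUMED, lw_weight);
}
```

In this model, an entity of weight 1024 on a queue whose total weight is 2048 is credited half of the elapsed time: scale_delta(1000000, 1024, 2048) gives 500000.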
I did more investigation on the 16-core tigerton. Firstly, let's focus on CONFIG_GROUP_SCHED=n. With 2.6.26, the result differs little with and without CONFIG_GROUP_SCHED.

1) I tried different sched_features and found that AFFINE_WAKEUPS has a big impact on volanoMark. Other features have little impact.
2) With kernel 2.6.26, the result is 260000 with AFFINE_WAKEUPS disabled and 515000 with it enabled, so the improvement from AFFINE_WAKEUPS is about 100%. With kernel 2.6.27-rc1, the improvement is only about 25%.
3) I turned on CONFIG_SCHEDSTATS in the kernel and collected ttwu_move_affine. Basically: collect ttwu_move_affine, collect it again after 30 seconds, and calculate the difference.

With 2.6.26, I got the data below:
domain0 279521 142332 0
domain1 184589 22823 0
domain0 289170 142168 0
domain1 185491 23778 0
domain0 291842 139687 0
domain1 187807 23174 0
domain0 292426 144879 0
domain1 179721 22122 0
domain0 287669 137756 0
domain1 201236 25156 0
domain0 268374 139532 0
domain1 210145 25268 0
domain0 292002 144530 0
domain1 196146 24669 0
domain0 298406 145023 0
domain1 178381 22743 0
domain0 275685 141086 0
domain1 203797 25686 0
domain0 285818 140260 0
domain1 180506 23002 0
domain0 290562 139757 0
domain1 186669 23086 0
domain0 296466 142084 0
domain1 186346 24161 0
domain0 283394 137930 0
domain1 195596 23895 0
domain0 269296 142978 0
domain1 210648 25682 0
domain0 281672 144002 0
domain1 189959 23685 0
domain0 301834 145922 0
domain1 172737 22351 0
The 3rd column is the ttwu_move_affine difference.

With 2.6.27-rc1:
domain0 39054 302678 0
domain1 315384 245684 0
domain0 39142 304117 0
domain1 312896 244796 0
domain0 38636 304438 0
domain1 310687 244409 0
domain0 39534 304167 0
domain1 313746 245381 0
domain0 39082 304231 0
domain1 312592 245219 0
domain0 39057 305460 0
domain1 311395 245195 0
domain0 38224 301351 0
domain1 314482 244448 0
domain0 38016 300573 0
domain1 309031 241127 0
domain0 40285 306397 ...
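The sampling method described in step 3 (read the counters, read them again 30 seconds later, subtract) can be sketched as below. The struct layout and the names domain_sample and sample_delta are hypothetical; the sketch only assumes that each domain row carries three monotonically increasing counters, as in the rows quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_COUNTERS 3

/* One snapshot of a domain's counters; the layout is illustrative only. */
struct domain_sample {
    uint64_t counter[NR_COUNTERS];
};

/*
 * Per-counter difference between two snapshots taken some interval apart.
 * Unsigned subtraction also gives the right answer if a counter wrapped
 * once between the two reads.
 */
struct domain_sample sample_delta(const struct domain_sample *before,
                                  const struct domain_sample *after)
{
    struct domain_sample d;

    assert(before != NULL && after != NULL);
    for (int i = 0; i < NR_COUNTERS; i++)
        d.counter[i] = after->counter[i] - before->counter[i];
    return d;
}
```

Each row in the listings above would then correspond to one such delta for one domain of one cpu.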
I'm a bit puzzled, but you're right - I too noticed that volanomark is _very_ sensitive to affine wakeups. I'll try and find what changed in that code for GROUP=n.
--
I collected more data and found that the CPU_NEWLY_IDLE balance schedstat looks abnormal. Compared with 2.6.26, 2.6.27-rc1 has more successful move_tasks among cpu runqueues. I instrumented the kernel and found that, with 2.6.26, the task is mostly hot when the kernel tries to move it to another cpu. But with 2.6.27-rc1, the task is often moved successfully. If I set /proc/sys/kernel/sched_migration_cost=1500000 (default is 500000), the volanoMark result improves significantly, close to the result of 2.6.26. The above testing used CONFIG_GROUP_SCHED=n.

So perhaps some key data structures changed in 2.6.27-rc1 and create more cache misses. With 2.6.26, cpu idle is about 6~7%. With 2.6.27-rc1, cpu idle is about 1%. I compared the 2 kernels and couldn't find which data structure change causes it.

As for CONFIG_GROUP_SCHED=y, oprofile shows tg_shares_up consumes about 8% cpu utilization on my 16-core tigerton. Enlarging /proc/sys/kernel/sched_shares_ratelimit doesn't help the volanoMark result.

I checked the group scheduling code and got an idea to improve it. Add share_percent, a new variable in task_group->sched_entity[i], to record the percentage this task group occupies in the parent group. share_percent is updated in walk_tg_tree. In account_entity_enqueue, if the task entity has a parent, we could just use share_percent and se->load.weight to calculate a new weight and add the new weight to the parent entity weight, and in the end to the runqueue load weight. So when sched_shares_ratelimit is enlarged, the various load balancers could still work well. I think volanoMark could benefit from it.

BTW, with CONFIG_GROUP_SCHED=y, hackbench has about an 80% regression on my 8-core+multi-thread Montvale Itanium machine and on the Tulsa machines. It seems multi-thread machines have the regression.
-yanmin
--
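One way to see why raising sched_migration_cost changes the outcome: assume the balancer treats a task as cache-hot, and declines to move it, when the task last started running less than sched_migration_cost nanoseconds ago. The sketch below models only that comparison. It is a simplification under that assumption, not the kernel's code, and task_is_hot and its parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model: a task counts as cache-hot if it started executing
 * less than migration_cost_ns ago. All times are in nanoseconds.
 */
bool task_is_hot(uint64_t now_ns, uint64_t exec_start_ns,
                 uint64_t migration_cost_ns)
{
    int64_t delta = (int64_t)(now_ns - exec_start_ns);

    return delta < (int64_t)migration_cost_ns;
}
```

Under this model, a task that last ran 1 ms ago counts as hot with a cost of 1500000 but not with the default of 500000, so the larger value makes a successful migration less likely.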
---
Subject: sched: scale sysctl_sched_shares_ratelimit with nr_cpus

David reported that his Niagara spends a little too much time in
tg_shares_up(), which considering he has a large cpu count makes sense.

So scale the ratelimit value with the number of cpus like we do for
other controls as well.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/kernel/sched.c b/kernel/sched.c
index 9a76e92..7eddaea 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -809,9 +809,9 @@ const_debug unsigned int sysctl_sched_nr_migrate = 32;
 /*
  * ratelimit for updating the group shares.
- * default: 0.5ms
+ * default: 0.25ms
  */
-const_debug unsigned int sysctl_sched_shares_ratelimit = 500000;
+const_debug unsigned int sysctl_sched_shares_ratelimit = 250000;
 /*
  * period over which we measure -rt task cpu usage in us.
@@ -5732,6 +5732,8 @@ static inline void sched_init_granularity(void)
 sysctl_sched_latency = limit;
 sysctl_sched_wakeup_granularity *= factor;
+
+ sysctl_sched_shares_ratelimit *= factor;
 }
 #ifdef CONFIG_SMP
--
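To get a feel for what the patch does to the effective ratelimit, here is a rough sketch. It assumes that factor is 1 + log2(number of online cpus); the quoted hunk does not show how factor is computed, so treat that formula and the helper names ilog2_floor and scaled_ratelimit as assumptions.

```c
#include <assert.h>

/* Floor of log2(n) for n >= 1. */
unsigned int ilog2_floor(unsigned int n)
{
    unsigned int log = 0;

    assert(n > 0);
    while (n >>= 1)
        log++;
    return log;
}

/*
 * Ratelimit in nanoseconds after scaling by the assumed factor
 * 1 + log2(ncpus).
 */
unsigned int scaled_ratelimit(unsigned int base_ns, unsigned int ncpus)
{
    return base_ns * (1 + ilog2_floor(ncpus));
}
```

Under that assumption, a 16-cpu machine ends up with 250000 * 5 = 1250000 ns, while a single-cpu machine keeps the new 250000 ns default.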
i've queued this up in tip/sched/urgent as it makes sense - but i'm also wondering, does this impact the volano numbers?
Ingo
--
I'm no git expert, but didn't see anyone else comment on this: you need to trust git more, it's like the Tour de France, occasionally venturing into other countries for a little while.

Work which got merged into Linus's 2.6.26-git for 2.6.27-rc1 may well have been developed on a 2.6.26-rcN base in someone else's tree, and so the bisection may take you back there. I think this is getting commoner now, since Linus spoke out against rebasing: bisecting a net issue took me back to rc6-git and rc4-git, but did end up at the right commit.

It can be a nuisance if you don't notice at "make install" time, and reboot a different kernel from the one you just built to test.
Hugh
--
Sometimes git bisect can locate the culprit, but it didn't this time. I tend to keep quiet when things are good and make noise when something is wrong. :)
Thanks,
Yanmin
--
