Compared with 2.6.25, volanoMark shows a big regression with kernel 2.6.26-rc1.
It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito.
Bisecting pointed to the patch below.
18d95a2832c1392a2d63227a7a6d433cb9f2037e is first bad commit
commit 18d95a2832c1392a2d63227a7a6d433cb9f2037e
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Apr 19 19:45:00 2008 +0200
sched: fair-group: SMP-nice for group scheduling
Implement SMP nice support for the full group hierarchy.
If I revert the patch (after resolving some conflicts), the volanoMark result is
restored completely.
AIM7 (using tmpfs) also shows a regression of more than 40% on my 8-core stoakley, 16-core tigerton,
and Itanium Montecito, but I verified that the aim7 regression isn't caused by the above
patch. I am running a new bisect to check aim7 now.
-yanmin
--
The AIM7 regression is caused by another patch, the semaphore restructuring. I will send emails about aim7 in a separate thread. -yanmin --
ok, that's bad. Let's get vatsa and Ingo also involved. -- regards, Dhaval --
Just to confirm, do you still have a performance regression with !group_sched? -- regards, Dhaval --
I just tried it with CONFIG_GROUP_SCHED=n a moment ago. The regression becomes less than 3%. --
Hmm. On another machine I am seeing a 10% regression, with and without group sched. Let us work on fixing this one. -- regards, Dhaval --
One more thing you could try: please set the shares for all other users to 2, except for the one running the benchmark. You can set it at /sys/kernel/uids/<uid>/cpu_share -- regards, Dhaval --
I might try. There are only 2 users active on my system: root for background processes and mine for the testing. On the other hand, I kill most background services before starting a test, so it might not help. --
The other combination that I am interested to know is when: CONFIG_FAIR_GROUP_SCHED=y and CONFIG_CGROUP_SCHED=y [i.e cgroup based scheduling rather than uid based scheduling. Former should result in only one group at bootup] I will also try to get some numbers with this combination. -- Regards, vatsa --
I ran with that combination and here are some results:

2.6.25 (with CONFIG_USER_SCHED)
    Volanomark perf = 20436.6 (Avg of 10 runs)
2.6.26-rc1 + patches in Ingo's tree [1] as of Fri morning IST (abt 8 hrs before) (with CONFIG_CGROUP_SCHED)
    Volanomark perf = 21529.6

i.e. CGROUP based grouping in 2.6.26-rc1 gives the same (if not somewhat better) results as UID-based scheduling in 2.6.25.

Yanmin, could you validate this as well? i.e. just turn on cgroup-based grouping (CONFIG_CGROUP_SCHED) and compare the resulting performance with the 2.6.25 numbers you already have (using CONFIG_USER_SCHED).

A) In 2.6.25, with UID based scheduling, CPU load = summation of task load
B) In 2.6.26-rc1, with UID based scheduling, CPU load = summation of group weights
C) In 2.6.26-rc1, with CGROUP based scheduling, CPU load = summation of task weights

This change in the definition of cpu load is affecting the load balance routines (find_busiest_group et al). As a result, threads of the volanomark benchmark aren't quickly spread across the cpus, resulting in slower performance. In case of B), cpu load can be low numbers (100 or 200), while in A or C, cpu loads are large numbers. I think find_busiest_group() and related routines need to be "educated" to deal with such low numbers ..
-- Regards, vatsa --
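To make the difference between definitions A/C and B concrete, here is a minimal sketch in plain C. It is not kernel code: the function names and the proportional split of a group's shares across CPUs are simplifying assumptions chosen for illustration, and 1024 is assumed as the weight of a nice-0 task and as the group's shares.

```c
#include <stddef.h>

#define NICE_0_LOAD 1024UL	/* assumed weight of one nice-0 task */

/* Definitions A and C: CPU load is the sum of its runnable tasks' weights. */
unsigned long cpu_load_from_tasks(const unsigned long *task_weight, size_t ntasks)
{
	unsigned long load = 0;

	for (size_t i = 0; i < ntasks; i++)
		load += task_weight[i];
	return load;
}

/*
 * Definition B (simplified assumption): the group's shares are divided
 * among CPUs in proportion to the task weight the group has on each CPU.
 */
unsigned long cpu_load_from_group(unsigned long group_shares,
				  unsigned long weight_on_cpu,
				  unsigned long weight_all_cpus)
{
	if (weight_all_cpus == 0)
		return 0;
	return group_shares * weight_on_cpu / weight_all_cpus;
}
```

With one user running 10 nice-0 threads on each of 8 CPUs, the task-based load is 10240 per CPU, while the group-based load under this model is 1024 * 10240 / 81920 = 128 per CPU, in the "100 or 200" range mentioned above.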
Reference: 1. git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git Yanmin, I request that you just compare 2.6.25 (CONFIG_USER_SCHED) performance with 2.6.26-rc1 (CONFIG_CGROUP_SCHED) performance. They should be the same (as they use the same definition of cpu load). -- Regards, vatsa --
I'm confused by these concepts. Could you tell me the exact config options you want me to turn on? Options in my config file (both 2.6.25 and 2.6.26-rc1):
# CONFIG_CGROUPS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
--
This is fine for 2.6.25. For 2.6.26-rc1, can you turn off USER_SCHED and turn on CGROUP_SCHED? -- regards, Dhaval --
Retain this config for 2.6.25. For 2.6.26-rc1, turn OFF CONFIG_USER_SCHED and turn ON CONFIG_CGROUP_SCHED i.e # CONFIG_USER_SCHED is not set CONFIG_CGROUP_SCHED=y [Note that above options are mutually exclusive i.e you cannot set both of them to y] From my experiments, results of 2.6.25 (with CONFIG_USER_SCHED=y) are same as 2.6.26-rc1 (with CONFIG_CGROUP_SCHED=y). It'd be great if you could confirm the same in your environment. -- Regards, vatsa --
I tested it with the config below against 2.6.26-rc1.
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
# CONFIG_CGROUP_DEVICE is not set
# CONFIG_CPUSETS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_RESOURCE_COUNTERS is not set
To make the testing faster, I changed some parameters of volanoMark. The result of 2.6.26-rc1 (CONFIG_CGROUP_SCHED=y) is about 2%~3% less than the one of 2.6.25 (CONFIG_USER_SCHED=y). -yanmin --
Thanks for confirming my observation. It seems much better than the 50% regression reported earlier (with 2.6.26-rc1 and CONFIG_USER_SCHED). Ideally we should get the same results with CONFIG_USER_SCHED also (in 2.6.26-rc1). That needs some work in the load balance code. Till that is tackled, IMHO we can retain all the current code by either:
1. Disabling CONFIG_GROUP_SCHED (or better)
2. Enabling CONFIG_GROUP_SCHED and CONFIG_CGROUP_SCHED
Ingo/Peter, what's your opinion? -- Regards, vatsa --
A quick update: With 2.6.26-rc2 (CONFIG_USER_SCHED=y), volanoMark result on my 8-core stoakley --
One more testing: volanoMark result of 2.6.25 (CONFIG_CGROUP_SCHED=y) is about 6%~7% better than the one --
Here (Q6600), 2.6.26-rc2 CONFIG_USER_SCHED=y regression culprit for volanomark is the same one identified for mysql+oltp. (i have yet to figure out where the buglet lies, but there is definitely one in there somewhere)

2.6.25.3-smp (baseline, no group scheduling)
test-1.log:Average throughput = 102412 messages per second
test-2.log:Average throughput = 99636 messages per second
test-3.log:Average throughput = 99373 messages per second

CONFIG_CGROUPS=n CONFIG_USER_SCHED=y

2.6.26-rc2 - 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 102341 messages per second
test-2.log:Average throughput = 101710 messages per second
test-3.log:Average throughput = 100572 messages per second

2.6.26-rc2 + 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 79506 messages per second
test-2.log:Average throughput = 78168 messages per second
test-3.log:Average throughput = 78200 messages per second

CONFIG_CGROUPS=y CONFIG_USER_SCHED=y

2.6.26-rc2 - 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 103494 messages per second
test-2.log:Average throughput = 100832 messages per second
test-3.log:Average throughput = 98840 messages per second

2.6.26-rc2 + 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 80132 messages per second
test-2.log:Average throughput = 79410 messages per second
test-3.log:Average throughput = 79609 messages per second

CONFIG_CGROUPS=y CONFIG_CGROUP_SCHED=y

2.6.26-rc2 - 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 103026 messages per second
test-2.log:Average throughput = 101152 messages per second
test-3.log:Average throughput = 102616 messages per second

2.6.26-rc2 + 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 104174 messages per second
test-2.log:Average throughput = 101390 messages per second
test-3.log:Average throughput = 101212 messages per second

(but there are no task groups set ...
Yeah, I expect that when you create some groups and move everything down 1 level you'll get into the same problems as with user grouping. The thing seems to be that rq weights shrink to < 1 task level in these situations - because it's spreading 1 task's (well, group's) worth of load over the various CPUs. We're going through the load balance code atm to find out where the small load numbers would affect decisions. It looks like things like find_busiest_group() just think everything is peachy when the imbalance is < 1 task - which with all this grouping stuff is not necessarily true. --
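The "imbalance < 1 task" effect can be illustrated with a small standalone sketch. This is not find_busiest_group(); the helper names and the threshold of one nice-0 task weight (assumed to be 1024) are hypothetical, picked only to show why small group-scaled loads can hide a real imbalance.

```c
#define NICE_0_LOAD 1024L	/* assumed weight of one nice-0 task */

/* Half the load difference: roughly what would have to move to even out. */
long imbalance(long busiest_load, long this_load)
{
	return (busiest_load - this_load) / 2;
}

/* Hypothetical check: call it balanced if less than one task would move. */
int looks_balanced(long busiest_load, long this_load)
{
	return imbalance(busiest_load, this_load) < NICE_0_LOAD;
}
```

Take 12 tasks on one CPU and 4 on another. With task-based loads (12288 vs 4096) the imbalance is 4096 and the check fires. If the same 16 tasks belong to one group whose 1024 shares are spread proportionally (768 vs 256), the imbalance is only 256, so the same check reports that everything is fine.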
So that I don't mislead you about the find_busiest_group path, I did more testing
and collected data on both hackbench and volanoMark.
I reran hackbench against 2.6.25, 2.6.26-rc2 and 2.6.26-rc2+slub_reverse, because
2.6.26-rc includes Christoph's SLUB patch for handling multiple page sizes, which could improve
hackbench. The testing machine is an 8-core stoakley.
All kernels are compiled with these options:
CONFIG_LOG_BUF_SHIFT=17
# CONFIG_CGROUPS is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
CONFIG_SYSFS_DEPRECATED=y
              | hackbench 100 process 2000 | hackbench 100 process 10000
-------------------------------------------------------------------------
2.6.25        | 35 seconds                 | 182 seconds
-------------------------------------------------------------------------
2.6.26-rc2    | 28.5 seconds               | 140 seconds
-------------------------------------------------------------------------
2.6.26-rc2    |                            |
+reverse_slub | 32 seconds                 | 160 seconds
-------------------------------------------------------------------------
So even if we don't count the SLUB patch improvement, 2.6.26-rc2 still shows some improvement
on hackbench. I'm not sure whether that improvement is related to the scheduler.
Then, I collected the schedule() caller information during volanoMark testing. Data
was collected for 20 seconds during the test.
Below is the gprof output with kernel 2.6.25 using the above config options.
0.00 0.00 2962/19804016 retint_careful [16339]
0.00 0.00 3234/19804016 sys_rt_sigsuspend [20024]
0.00 0.00 4960/19804016 lock_sock_nested [11240]
0.00 0.00 8957/19804016 sysret_careful [20253]
0.00 0.00 28507/19804016 cpu_idle [4340]
0.00 0.00 2137406/19804016 futex_wait [8065]
0.00 0.00 4400980/19804016 ...
fwiw, the following hack seems to help bring down regression to ~5%
(b/n 2.6.25 and 2.6.26-rc1 with USER_SCHED):
Also in the patch GROUP_SCALE can probably be less than what I used (and
needs an ifdef GROUP_SCHED) ..
Not-signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 4 ++++
kernel/sched_fair.c | 14 +++++---------
kernel/sched_features.h | 2 +-
3 files changed, 10 insertions(+), 10 deletions(-)
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1551,13 +1551,17 @@ static void cpuacct_charge(struct task_s
 static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 #endif
+#define GROUP_SCALE (2*1024)
+
 static inline void inc_cpu_load(struct rq *rq, unsigned long load)
 {
+	load *= GROUP_SCALE;
 	update_load_add(&rq->load, load);
 }
 static inline void dec_cpu_load(struct rq *rq, unsigned long load)
 {
+	load *= GROUP_SCALE;
 	update_load_sub(&rq->load, load);
 }
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1393,24 +1393,20 @@ load_balance_fair(struct rq *this_rq, in
 		unsigned long this_weight, busiest_weight;
 		long rem_load, max_load, moved_load;
+		busiest_weight = tg->cfs_rq[busiest_cpu]->task_weight;
 		/*
 		 * empty group
 		 */
-		if (!aggregate(tg, sd)->task_weight)
+		if (!aggregate(tg, sd)->task_weight || !busiest_weight)
 			continue;
 		rem_load = rem_load_move * aggregate(tg, sd)->rq_weight;
 		rem_load /= aggregate(tg, sd)->load + 1;
-		this_weight = tg->cfs_rq[this_cpu]->task_weight;
-		busiest_weight = tg->cfs_rq[busiest_cpu]->task_weight;
-
-		imbalance = (busiest_weight - this_weight) / 2;
-
-		if (imbalance < 0)
-			imbalance = busiest_weight;
+		if (!rem_load)
+			continue;
-		max_load = ...
On my little Q6600 with git.today, it made no real difference, whether NORMALIZED_SLEEPERS was enabled or not. I see roughly 10-15% idle time w/wo this patch, whereas pre-regression, it's < 2%. -Mike --
Yeah, this bit makes a huge difference; I do that by:
mkdir /cgroup/foo
for i in `cat /cgroup/tasks`; do echo $i > /cgroup/foo/tasks; done
echo $((1024*1024)) > /cgroup/foo/cpu.shares
I'm still pulling my hair out on why this makes a difference though - I
eliminated all direct assumptions on SCHED_LOAD_SCALE(_FUZZ) with an
average of the weight per task - but all that doesn't help (much)
A few other things I found that make a significant difference:
+static void update_aggregate(int cpu, struct sched_domain *sd)
+{
+	aggregate_walk_tree(aggregate_get_down, aggregate_get_nop, cpu, sd);
+}
@@ -3224,6 +3189,8 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
 	unsigned long total_load_moved = 0;
 	int this_best_prio = this_rq->curr->prio;
+	update_aggregate(this_cpu, sd);
+
 	do {
 		total_load_moved +=
 			class->load_balance(this_rq, this_cpu, busiest,
and
@@ -1169,7 +1168,10 @@ static unsigned long wakeup_gran(struct sched_entity *se)
 	 * More easily preempt - nice tasks, while not making it harder for
 	 * + nice tasks.
 	 */
-	gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
+	if (sched_feat(ASYM_GRAN))
+		gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
+	else
+		gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);
 	return gran;
 }
the asym logic is wrong wrt shares - it should look at tg->weight
--
One more observation: accesses to aggregate()->rq_weight etc. aren't correctly synchronized, i.e. while a cpu is doing an aggregate_walk_tree() in a domain, and thus possibly modifying rq_weight, load etc., other cpus could be concurrently accessing the same data. As a result, it's possible to see an inconsistent rq_weight, load, task_weight combination? -- Regards, vatsa --
Yes - and that should not be too big an issue as long as we can deal with it. Any number we'll put to it will be based on a snapshot of the state so we're wrong no matter what we do. The trick is trying to keep sane. My current stack on top of sched-devel: http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/ I've found that: http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/sched-agg-update... was sufficient to deal with all the anomalies I've found so far. --
More than staleness, interspersing of writes/reads is my concern.
Let's say that CPU0 is updating tg->cfs_rq[0]->aggregate.load,shares,rq_weight
at the CPU domain (comprising cpus 0-7). That will result in writes in this order:
->rq_weight = ?
->task_weight = ?
->shares = ?
->load = ?
At the same time, CPU1 could be doing a load_balance_fair() in SMT
domain, reading the above same words in this order:
->rq_weight
->load
->task_weight
->shares
What if the writes (on cpu0) and reads (on cpu1) are interspersed? Won't
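The interleaving asked about above can be replayed deterministically in a single thread. The sketch below is only an illustration: the struct mirrors the four field names from the message, but the type, the helpers and the numeric values are hypothetical, and a real race would also depend on the reader's own access order.

```c
/* Hypothetical stand-in for the four aggregate fields discussed above. */
struct aggregate_snapshot {
	unsigned long rq_weight;
	unsigned long task_weight;
	unsigned long shares;
	unsigned long load;
};

/* Perform the writer's first `steps` stores, in the order listed above. */
void write_steps(struct aggregate_snapshot *agg,
		 const struct aggregate_snapshot *next, int steps)
{
	if (steps > 0)
		agg->rq_weight = next->rq_weight;
	if (steps > 1)
		agg->task_weight = next->task_weight;
	if (steps > 2)
		agg->shares = next->shares;
	if (steps > 3)
		agg->load = next->load;
}

/* Return 1 if both snapshots hold exactly the same four values. */
int same_state(const struct aggregate_snapshot *a,
	       const struct aggregate_snapshot *b)
{
	return a->rq_weight == b->rq_weight &&
	       a->task_weight == b->task_weight &&
	       a->shares == b->shares &&
	       a->load == b->load;
}
```

If a reader looks at the data after the writer has completed only two of its four stores, it sees the new rq_weight next to the old load, a combination that matches neither the state before the update nor the state after it.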
Doesn't improve things here (8-way Intel Xeon with SCHED_SMT set):
2.6.25 : 21762.4
2.6.26-rc1 + sched_devel : 17937.5 (-17.6%)
2.6.26-rc1 + sched_devel + your patches : 17047 (-21.6%)
2.6.26-rc1 + sched_devel + patch_below : 18368.9 (-15.6%)
2.6.26-rc1 + sched_devel + patch_below + ^NORMALIZED_SLEEPER : 19589.6 (-9.9%)
I will check if patch_below + your patches help close down the gap (tomorrow):
---
kernel/sched.c | 98 ++++++++++++++++++++++++++--------------------------
kernel/sched_fair.c | 26 +++++--------
2 files changed, 61 insertions(+), 63 deletions(-)
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1568,12 +1568,12 @@ static int task_hot(struct task_struct *
  */
 static inline struct aggregate_struct *
-aggregate(struct task_group *tg, struct sched_domain *sd)
+aggregate(struct task_group *tg, int this_cpu)
 {
-	return &tg->cfs_rq[sd->first_cpu]->aggregate;
+	return &tg->cfs_rq[this_cpu]->aggregate;
 }
-typedef void (*aggregate_func)(struct task_group *, struct sched_domain *);
+typedef void (*aggregate_func)(struct task_group *, struct sched_domain *, int);
 /*
  * Iterate the full tree, calling @down when first entering a node and @up when
@@ -1581,14 +1581,14 @@ typedef void (*aggregate_func)(struct ta
  */
 static
 void ...
Even if one service is running as root, it is guaranteed 2/3 the CPU (since root's shares are double that of other users by default). It would cause issues for sure. Thanks, -- regards, Dhaval --
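The 2/3 figure follows directly from the share ratio. A minimal sketch, assuming root competes with exactly one other busy user and that an ordinary user has 1024 shares (only the 2:1 ratio comes from the message above; the absolute numbers and the function name are illustrative):

```c
/* Fraction of CPU time a group is entitled to when every group is busy. */
double cpu_fraction(unsigned long group_shares, unsigned long total_shares)
{
	if (total_shares == 0)
		return 0.0;
	return (double)group_shares / (double)total_shares;
}
```

With root at 2048 shares and one other user at 1024, cpu_fraction(2048, 2048 + 1024) is 2/3 for root, leaving 1/3 for the user running the benchmark.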
thanks Yanmin, i've queued up your patch that reverts this change. Ingo --
Is this really needed now that GROUP_SCHED defaults to 'n' ? Yanmin, this is with GROUP_SCHED=y, right or is this without? --
It's a long shot, but does the below help?
---
Subject: sched: fixup SMP load-balance
Keeping the aggregate on the first cpu of the sched domain has two problems:
- it could collide between different sched domains on different cpus
- it could slow things down because of the remote accesses
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 1
kernel/sched.c | 113 +++++++++++++++++++++++---------------------------
kernel/sched_fair.c | 12 ++---
3 files changed, 60 insertions(+), 66 deletions(-)
Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -766,7 +766,6 @@ struct sched_domain {
 	struct sched_domain *child;	/* bottom domain must be null terminated */
 	struct sched_group *groups;	/* the balancing groups of the domain */
 	cpumask_t span;			/* span of all CPUs in this domain */
-	int first_cpu;			/* cache of the first cpu in this domain */
 	unsigned long min_interval;	/* Minimum balance interval ms */
 	unsigned long max_interval;	/* Maximum balance interval ms */
 	unsigned int busy_factor;	/* less balancing by factor if busy */
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -1539,12 +1539,12 @@ static int task_hot(struct task_struct *
  */
 static inline struct aggregate_struct *
-aggregate(struct task_group *tg, struct sched_domain *sd)
+aggregate(struct task_group *tg, int cpu)
 {
-	return &tg->cfs_rq[sd->first_cpu]->aggregate;
+	return &tg->cfs_rq[cpu]->aggregate;
 }
-typedef void (*aggregate_func)(struct task_group *, struct sched_domain *);
+typedef void (*aggregate_func)(struct task_group *, int, struct sched_domain *);
 /*
  * Iterate the full tree, calling @down when first entering a node and @up when
@@ -1552,14 +1552,14 ...
With GROUP_SCHED=y. I remember a similar patch was merged into 2.6.25-rc1 and I found a similar volanoMark regression, then you reverted it. Why add it back in 2.6.26-rc1? -yanmin --
The implementation has been changed extensively. As of now, without proper load balancing, group scheduling is not yet fully fair, but such a performance regression is serious, and we need to figure out why the regression is taking place. -- regards, Dhaval --
The only thing similar is that it tries to do SMP load balancing for groups; other than that there is nothing similar - it's a total rewrite with a whole different approach. And we _need_ an SMP load-balancer for groups - otherwise group scheduling is just not complete. --
What's the hardware/software configuration? And kernel .config? --
2 CPU 64bit Xeon Processors running SLES 10 SP1
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.26-rc1
# Wed May 7 11:13:46 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_DEFCONFIG_LIST="arch/x86/configs/x86_64_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-default"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not ...
Is the CPU dual-core or quad-core? Or just hyper-threading? I found that the 16-core tigerton has a bigger regression than the 8-core stoakley. perhaps the ...
