Re: volanoMark regression with kernel 2.6.26-rc1

Previous thread: [PATCH 3/3] PNP: add AD1815 and AD1816 quirks by Rene Herman on Monday, May 5, 2008 - 6:08 pm. (6 messages)

Next thread: [PATCH -mm][v2] ratelimit rewrite by Dave Young on Monday, May 5, 2008 - 7:25 pm. (4 messages)
From: Zhang, Yanmin
Date: Monday, May 5, 2008 - 7:06 pm

Compared with 2.6.25, volanoMark shows a big regression with kernel 2.6.26-rc1:
about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito.

By bisecting, I located the patch below.

18d95a2832c1392a2d63227a7a6d433cb9f2037e is first bad commit
commit 18d95a2832c1392a2d63227a7a6d433cb9f2037e
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Sat Apr 19 19:45:00 2008 +0200

    sched: fair-group: SMP-nice for group scheduling
    
    Implement SMP nice support for the full group hierarchy.

If I revert the patch (after resolving some conflicts), the volanoMark result is
restored completely.


AIM7 (using tmpfs) also shows a regression of more than 40% on my 8-core stoakley,
16-core tigerton, and Itanium Montecito, but I verified that the aim7 regression isn't
caused by the above patch. I am doing a new bisect to check aim7 now.

-yanmin


--

From: Zhang, Yanmin
Date: Monday, May 5, 2008 - 10:41 pm

The AIM7 regression is caused by another patch, the semaphore restructuring. I will
send mail about aim7 in a separate thread.

-yanmin



--

From: Dhaval Giani
Date: Tuesday, May 6, 2008 - 4:52 am

ok, that's bad. Let's get vatsa and Ingo also involved.

-- 
regards,
Dhaval
--

From: Dhaval Giani
Date: Wednesday, May 7, 2008 - 10:33 am

Just to confirm, do you still have a performance regression with
!group_sched?

-- 
regards,
Dhaval
--

From: Zhang, Yanmin
Date: Wednesday, May 7, 2008 - 10:18 pm

I just tried it with CONFIG_GROUP_SCHED=n a moment ago. The regression becomes less than 3%.



--

From: Dhaval Giani
Date: Wednesday, May 7, 2008 - 10:32 pm

Hmm. On another machine I am seeing a 10% regression, with and without
group sched. Let us work on fixing this one.

-- 
regards,
Dhaval
--

From: Dhaval Giani
Date: Wednesday, May 7, 2008 - 10:40 pm

One more thing to try, if you can: please set the shares for all other users
to 2, except for the one which is running the benchmark. You can set it
at /sys/kernel/uids/<uid>/cpu_share

-- 
regards,
Dhaval
--

From: Zhang, Yanmin
Date: Wednesday, May 7, 2008 - 10:53 pm

I might try. There are only 2 users active on my system: root for background processes and mine
for the testing. On the other hand, I kill most background services when starting the tests, so
it might not help.


--

From: Srivatsa Vaddagiri
Date: Wednesday, May 7, 2008 - 11:11 pm

The other combination that I am interested to know is when:

CONFIG_FAIR_GROUP_SCHED=y and CONFIG_CGROUP_SCHED=y

[i.e cgroup based scheduling rather than uid based scheduling. Former
should result in only one group at bootup]

I will also try to get some numbers with this combination.

-- 
Regards,
vatsa
--

From: Srivatsa Vaddagiri
Date: Friday, May 9, 2008 - 8:52 am

I ran with that combination and here are some results:

2.6.25 (with CONFIG_USER_SCHED) 

	Volanomark perf = 20436.6 (Avg of 10 runs)

2.6.26-rc1 + patches in Ingo's tree [1] as of Fri morning IST (abt 8 hrs
before) (with CONFIG_CGROUP_SCHED)
	
	Volanomark perf = 21529.6

i.e CGROUP based grouping in 2.6.26-rc1 gives same (if not somewhat
better) results as UID-based scheduling in 2.6.25.

Yanmin,
	Could you validate this as well? i.e. just turn on cgroup-based
grouping (CONFIG_CGROUP_SCHED) and compare the resulting performance with the
2.6.25 numbers you already have (using CONFIG_USER_SCHED).


A) In 2.6.25, with UID based scheduling,
	CPU load = summation of task load

B) In 2.6.26-rc1, with UID based scheduling,
	CPU load = summation of group weights

C) In 2.6.26-rc1, with CGROUP based scheduling,
	CPU load = summation of task weights


This change in definition of cpu load is affecting load balance routines
(find_busiest_group et al). As a result, threads of volanomark benchmark
aren't quickly spread across the cpus, resulting in slower performance.

In case B), cpu load can be a low number (100 or 200), while in A) or
C), cpu load is a large number. I think find_busiest_group() and related
routines need to be "educated" to deal with such low numbers ..
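
To make the difference in magnitude concrete, here is a minimal sketch of the two ways of summing cpu load described in A)-C). It is an illustration only, not kernel code: the function names are made up, all tasks are assumed to be nice-0 (weight 1024), and the per-cpu group weights are hypothetical values chosen to land in the "100 or 200" range mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Weight of a single nice-0 task in CFS. */
#define NICE_0_WEIGHT 1024UL

/* Cases A and C: cpu load is the sum of the weights of its runnable tasks
 * (all tasks assumed to be nice-0 here, for simplicity). */
static unsigned long cpu_load_from_tasks(size_t nr_tasks)
{
	return nr_tasks * NICE_0_WEIGHT;
}

/* Case B: cpu load is the sum of the per-cpu weights of the groups that
 * have tasks on this cpu. The weights passed in are hypothetical. */
static unsigned long cpu_load_from_groups(const unsigned long *group_weight,
					  size_t nr_groups)
{
	unsigned long sum = 0;

	assert(group_weight != NULL || nr_groups == 0);
	for (size_t i = 0; i < nr_groups; i++)
		sum += group_weight[i];
	return sum;
}
```

With four nice-0 tasks on a cpu, the task-based sum is 4096, while two groups contributing hypothetical per-cpu weights of 128 and 64 give a load of only 192 for the same cpu.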


-- 
Regards,
vatsa
--

From: Srivatsa Vaddagiri
Date: Friday, May 9, 2008 - 8:54 am

Reference:
	
1.  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git

Yanmin,
	I request that you just compare 2.6.25 (CONFIG_USER_SCHED)
performance with 2.6.26-rc1 (CONFIG_CGROUP_SCHED) performance. They
should be the same (as they use the same definition of cpu load).

-- 
Regards,
vatsa
--

From: Zhang, Yanmin
Date: Sunday, May 11, 2008 - 6:39 pm

I'm confused by these concepts. Could you tell me the exact config options
you want me to turn on?

Options in my config file(both 2.6.25 and 2.6.26-rc1):

# CONFIG_CGROUPS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y

--

From: Dhaval Giani
Date: Sunday, May 11, 2008 - 7:04 pm

This is fine for 2.6.25. For 2.6.26-rc1, can you turn off USER_SCHED and
turn on CGROUP_SCHED?

-- 
regards,
Dhaval
--

From: Srivatsa Vaddagiri
Date: Sunday, May 11, 2008 - 7:37 pm

Retain this config for 2.6.25.

For 2.6.26-rc1, turn OFF CONFIG_USER_SCHED and turn ON CONFIG_CGROUP_SCHED i.e

# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y

[Note that above options are mutually exclusive i.e you cannot set both
of them to y]

From my experiments, results of 2.6.25 (with CONFIG_USER_SCHED=y) are
same as 2.6.26-rc1 (with CONFIG_CGROUP_SCHED=y).

It'd be great if you could confirm the same in your environment.

-- 
Regards,
vatsa
--

From: Zhang, Yanmin
Date: Sunday, May 11, 2008 - 8:33 pm

I tested it with the config below against 2.6.26-rc1.

CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
# CONFIG_CGROUP_DEVICE is not set
# CONFIG_CPUSETS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_RESOURCE_COUNTERS is not set


To make the testing faster, I changed some parameters of volanoMark.
The result of 2.6.26-rc1 (CONFIG_CGROUP_SCHED=y) is about 2%~3% lower than that of
2.6.25 (CONFIG_USER_SCHED=y).

-yanmin


--

From: Srivatsa Vaddagiri
Date: Sunday, May 11, 2008 - 9:52 pm

Thanks for confirming my observation. It seems much better than the 50% regression
reported earlier (with 2.6.26-rc1 and CONFIG_USER_SCHED).

Ideally we should get same results with CONFIG_USER_SCHED also (in
2.6.26-rc1). That needs some work in load balance code. Till that is
tackled, IMHO we can retain all the current code by either:

1. Disabling CONFIG_GROUP_SCHED, or (better)
2. Enabling CONFIG_GROUP_SCHED and CONFIG_CGROUP_SCHED

Ingo/Peter, What's your opinion?

-- 
Regards,
vatsa
--

From: Zhang, Yanmin
Date: Sunday, May 11, 2008 - 10:02 pm

A quick update:
With 2.6.26-rc2 (CONFIG_USER_SCHED=y), volanoMark result on my 8-core stoakley

--

From: Zhang, Yanmin
Date: Sunday, May 11, 2008 - 10:43 pm

One more test:
volanoMark result of 2.6.25 (CONFIG_CGROUP_SCHED=y) is about 6%~7% better than the one


--

From: Mike Galbraith
Date: Monday, May 12, 2008 - 2:04 am

Here (Q6600), 2.6.26-rc2 CONFIG_USER_SCHED=y regression culprit for
volanomark is the same one identified for mysql+oltp.

(i have yet to figure out where the buglet lies, but there is definitely
one in there somewhere)

2.6.25.3-smp (baseline, no group scheduling)
test-1.log:Average throughput = 102412 messages per second
test-2.log:Average throughput = 99636 messages per second
test-3.log:Average throughput = 99373 messages per second

CONFIG_CGROUPS=n
CONFIG_USER_SCHED=y

2.6.26-rc2 - 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 102341 messages per second
test-2.log:Average throughput = 101710 messages per second
test-3.log:Average throughput = 100572 messages per second


2.6.26-rc2 + 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 79506 messages per second
test-2.log:Average throughput = 78168 messages per second
test-3.log:Average throughput = 78200 messages per second

CONFIG_CGROUPS=y
CONFIG_USER_SCHED=y

2.6.26-rc2 - 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 103494 messages per second
test-2.log:Average throughput = 100832 messages per second
test-3.log:Average throughput = 98840 messages per second

2.6.26-rc2 + 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 80132 messages per second
test-2.log:Average throughput = 79410 messages per second
test-3.log:Average throughput = 79609 messages per second

CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=y

2.6.26-rc2 - 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 103026 messages per second
test-2.log:Average throughput = 101152 messages per second
test-3.log:Average throughput = 102616 messages per second

2.6.26-rc2 + 18d95a2832c1392a2d63227a7a6d433cb9f2037e
test-1.log:Average throughput = 104174 messages per second
test-2.log:Average throughput = 101390 messages per second
test-3.log:Average throughput = 101212 messages per second

(but there are no task groups set ...

--

From: Peter Zijlstra
Date: Monday, May 12, 2008 - 2:20 am

Yeah, I expect that when you create some groups and move everything down
1 level you'll get into the same problems as with user grouping.

The thing seems to be that rq weights shrink to < 1 task level in these
situations - because it's spreading 1 task's (well, group's) worth of load
over the various CPUs.

We're going through the load balance code atm to find out where the
small load numbers would affect decisions.

It looks like things like find_busiest_group() just think everything is
peachy when the imbalance is < 1 task - which with all this grouping
stuff is not necessarily true.
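
A minimal sketch of how an imbalance can fall below one task once a group's weight is spread over the CPUs. This is an illustration under stated assumptions, not the kernel's find_busiest_group(): the helper names are hypothetical, all tasks are nice-0 (weight 1024), the group's 1024 shares are assumed to be split in proportion to the number of its tasks on each CPU, and the "(busiest - this) / 2" imbalance is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

/* Weight of a single nice-0 task in CFS. */
#define NICE_0_WEIGHT 1024L

/* Simplified stand-in for the balancer's decision: an imbalance smaller
 * than one nice-0 task is treated as "nothing to do". */
static bool imbalance_ignored(long busiest_load, long this_load)
{
	long imbalance = (busiest_load - this_load) / 2;

	return imbalance < NICE_0_WEIGHT;
}

/* Slice of a group's shares seen on a CPU that runs nr_here of the
 * group's nr_total equal-weight tasks (assumed proportional split). */
static long group_slice(long shares, long nr_here, long nr_total)
{
	assert(nr_total > 0);
	return shares * nr_here / nr_total;
}
```

With six tasks on one CPU and two on another, task-level loads are 6144 and 2048, so the imbalance of 2048 is acted on. If the same eight tasks belong to one group with 1024 shares, the CPUs see 768 and 256; the imbalance of 256 is below one task and is ignored, even though the task placement is identical.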

--

From: Zhang, Yanmin
Date: Wednesday, May 14, 2008 - 2:22 am

In case I misled you about the find_busiest_group path, I did more testing
and collected data on both hackbench and volanoMark.

I reran hackbench against 2.6.25, 2.6.26-rc2 and 2.6.26-rc2+slub_reverse, because
2.6.26-rc includes Christoph's SLUB patch for handling multiple page sizes, which could
improve hackbench. The testing machine is an 8-core stoakley.

All kernels are compiled with these options:
CONFIG_LOG_BUF_SHIFT=17
# CONFIG_CGROUPS is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
CONFIG_SYSFS_DEPRECATED=y

                          | hackbench 100 process 2000 | hackbench 100 process 10000
--------------------------+----------------------------+----------------------------
2.6.25                    | 35 seconds                 | 182 seconds
2.6.26-rc2                | 28.5 seconds               | 140 seconds
2.6.26-rc2 +reverse_slub  | 32 seconds                 | 160 seconds

So even if we discount the SLUB patch improvement, 2.6.26-rc2 still shows some improvement
on hackbench. I'm not sure whether the improvement is related to the scheduler.


Then I collected the schedule caller information during volanoMark testing. Data
was collected for 20 seconds during the test.

Below is the gprof output with kernel 2.6.25 using the above config options.
                0.00    0.00    2962/19804016     retint_careful [16339]
                0.00    0.00    3234/19804016     sys_rt_sigsuspend [20024]
                0.00    0.00    4960/19804016     lock_sock_nested [11240]
                0.00    0.00    8957/19804016     sysret_careful [20253]
                0.00    0.00   28507/19804016     cpu_idle [4340]
                0.00    0.00 2137406/19804016     futex_wait [8065]
                0.00    0.00 4400980/19804016 ...

--

From: Srivatsa Vaddagiri
Date: Wednesday, May 14, 2008 - 6:44 am

fwiw, the following hack seems to help bring the regression down to ~5%
(between 2.6.25 and 2.6.26-rc1 with USER_SCHED):

Also, GROUP_SCALE in the patch can probably be smaller than what I used (and
it needs an ifdef GROUP_SCHED) ..

Not-signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 kernel/sched.c          |    4 ++++
 kernel/sched_fair.c     |   14 +++++---------
 kernel/sched_features.h |    2 +-
 3 files changed, 10 insertions(+), 10 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1551,13 +1551,17 @@ static void cpuacct_charge(struct task_s
 static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 #endif
 
+#define GROUP_SCALE	(2*1024)
+
 static inline void inc_cpu_load(struct rq *rq, unsigned long load)
 {
+	load *= GROUP_SCALE;
 	update_load_add(&rq->load, load);
 }
 
 static inline void dec_cpu_load(struct rq *rq, unsigned long load)
 {
+	load *= GROUP_SCALE;
 	update_load_sub(&rq->load, load);
 }
 
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1393,24 +1393,20 @@ load_balance_fair(struct rq *this_rq, in
 		unsigned long this_weight, busiest_weight;
 		long rem_load, max_load, moved_load;
 
+		busiest_weight = tg->cfs_rq[busiest_cpu]->task_weight;
 		/*
 		 * empty group
 		 */
-		if (!aggregate(tg, sd)->task_weight)
+		if (!aggregate(tg, sd)->task_weight || !busiest_weight)
 			continue;
 
 		rem_load = rem_load_move * aggregate(tg, sd)->rq_weight;
 		rem_load /= aggregate(tg, sd)->load + 1;
 
-		this_weight = tg->cfs_rq[this_cpu]->task_weight;
-		busiest_weight = tg->cfs_rq[busiest_cpu]->task_weight;
-
-		imbalance = (busiest_weight - this_weight) / 2;
-
-		if (imbalance < 0)
-			imbalance = busiest_weight;
+		if (!rem_load)
+			continue;
 
-		max_load = ...

--

From: Mike Galbraith
Date: Wednesday, May 14, 2008 - 7:50 am

On my little Q6600 with git.today, it made no real difference, whether
NORMALIZED_SLEEPERS was enabled or not.  I see roughly 10-15% idle time
w/wo this patch, whereas pre-regression, it's < 2%.

	-Mike

--

From: Peter Zijlstra
Date: Wednesday, May 14, 2008 - 8:12 am

Yeah, this bit makes a huge difference; I do that by:

mkdir /cgroup/foo
for i in `cat /cgroup/tasks`; do echo $i > /cgroup/foo/tasks; done
echo $((1024*1024)) > /cgroup/foo/cpu.shares

I'm still pulling my hair out over why this makes a difference though - I
replaced all direct assumptions about SCHED_LOAD_SCALE(_FUZZ) with an
average of the weight per task - but all that doesn't help (much).

A few other things I found that make a significant difference:



+static void update_aggregate(int cpu, struct sched_domain *sd)
+{
+       aggregate_walk_tree(aggregate_get_down, aggregate_get_nop, cpu, sd);
+}


@@ -3224,6 +3189,8 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
        unsigned long total_load_moved = 0;
        int this_best_prio = this_rq->curr->prio;

+       update_aggregate(this_cpu, sd);
+
        do {
                total_load_moved +=
                        class->load_balance(this_rq, this_cpu, busiest,



and

@@ -1169,7 +1168,10 @@ static unsigned long wakeup_gran(struct sched_entity *se)
         * More easily preempt - nice tasks, while not making it harder for
         * + nice tasks.
         */
-       gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
+       if (sched_feat(ASYM_GRAN))
+               gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
+       else
+               gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);

        return gran;
 }


the asym logic is wrong wrt shares - it should look at tg->weight

--

From: Srivatsa Vaddagiri
Date: Thursday, May 15, 2008 - 1:20 am

One more observation: accesses to aggregate()->rq_weight etc. aren't
correctly synchronized, i.e. while a cpu is doing an aggregate_walk_tree()
in a domain, and thus possibly modifying rq_weight, load etc., other cpus could
be concurrently accessing the same data. As a result, is it possible to
see an inconsistent rq_weight, load, task_weight combination?

-- 
Regards,
vatsa
--

From: Peter Zijlstra
Date: Thursday, May 15, 2008 - 1:41 am

Yes - and that should not be too big an issue as long as we can deal
with it.

Any number we'll put to it will be based on a snapshot of the state so
we're wrong no matter what we do. The trick is trying to keep sane.

My current stack on top of sched-devel:

http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/

I've found that:
http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/sched-agg-update...

was sufficient to deal with all the anomalies I've found so far.

--

From: Srivatsa Vaddagiri
Date: Thursday, May 15, 2008 - 10:10 am

More than staleness, interspersing of writes/reads is my concern.

Let's say that CPU0 is updating tg->cfs_rq[0]->aggregate.load,shares,rq_weight
at the CPU domain (comprising cpus 0-7). That will result in writes in this order:

	->rq_weight = ?
	->task_weight = ?
	->shares = ?
	->load = ?

At the same time, CPU1 could be doing a load_balance_fair() in SMT
domain, reading the above same words in this order:

	->rq_weight
	->load
	->task_weight
	->shares

What if the writes (on cpu0) and reads (on cpu1) are interspersed? Won't
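
A minimal sketch of why such interleaving matters, using the rem_load scaling that appears in load_balance_fair() in the patch earlier in this thread (rem_load_move * rq_weight / (load + 1)). The struct and function names are hypothetical and the numbers are invented for illustration; the interleaving is simulated by hand rather than with real concurrency.

```c
#include <assert.h>
#include <limits.h>

/* The two aggregate fields a reader combines; the layout is hypothetical. */
struct aggregate_snapshot {
	unsigned long rq_weight;
	unsigned long load;
};

/* Same scaling as in load_balance_fair():
 * rem_load = rem_load_move * rq_weight / (load + 1) */
static unsigned long scaled_rem_load(unsigned long rem_load_move,
				     struct aggregate_snapshot a)
{
	assert(a.load < ULONG_MAX);	/* keep load + 1 from wrapping to 0 */
	return rem_load_move * a.rq_weight / (a.load + 1);
}
```

Suppose the writer moves the aggregate from {rq_weight = 1024, load = 2047} to {rq_weight = 4096, load = 8191}. Either consistent pair scales a rem_load_move of 1000 to 500. A reader that picks up the new rq_weight together with the old load, {4096, 2047}, computes 2000 instead, four times the intended value.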

Doesn't improve things here (8-way Intel Xeon with SCHED_SMT set):

2.6.25 							     : 21762.4
2.6.26-rc1 + sched_devel 				     : 17937.5 (-17.6%)
2.6.26-rc1 + sched_devel + your patches 		     : 17047   (-21.6%)
2.6.26-rc1 + sched_devel + patch_below  		     : 18368.9 (-15.6%)
2.6.26-rc1 + sched_devel + patch_below + ^NORMALIZED_SLEEPER : 19589.6 (-9.9%)

I will check if patch_below + your patches help close down the gap (tomorrow):


---
 kernel/sched.c      |   98 ++++++++++++++++++++++++++--------------------------
 kernel/sched_fair.c |   26 +++++--------
 2 files changed, 61 insertions(+), 63 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1568,12 +1568,12 @@ static int task_hot(struct task_struct *
  */
 
 static inline struct aggregate_struct *
-aggregate(struct task_group *tg, struct sched_domain *sd)
+aggregate(struct task_group *tg, int this_cpu)
 {
-	return &tg->cfs_rq[sd->first_cpu]->aggregate;
+	return &tg->cfs_rq[this_cpu]->aggregate;
 }
 
-typedef void (*aggregate_func)(struct task_group *, struct sched_domain *);
+typedef void (*aggregate_func)(struct task_group *, struct sched_domain *, int);
 
 /*
  * Iterate the full tree, calling @down when first entering a node and @up when
@@ -1581,14 +1581,14 @@ typedef void (*aggregate_func)(struct ta
  */
 static
void ...

--

From: Dhaval Giani
Date: Wednesday, May 7, 2008 - 11:04 pm

Even if one service is running as root, it is guaranteed 2/3 the CPU
(since root's shares are double that of other users by default). It
would cause issues for sure.
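
A minimal sketch of the arithmetic behind the 2/3 figure, assuming a group's CPU entitlement under contention is its shares divided by the sum of the shares of all busy groups. The function name is hypothetical, and the values 2048 and 1024 are assumptions chosen only to reflect the 2:1 default ratio described above.

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of CPU (rounded down) that group `who` is entitled to when
 * every group in shares[] is busy: shares[who] / sum(shares). */
static unsigned long cpu_percent(const unsigned long *shares, size_t n,
				 size_t who)
{
	unsigned long total = 0;

	assert(shares != NULL && who < n);
	for (size_t i = 0; i < n; i++)
		total += shares[i];
	assert(total > 0);
	return shares[who] * 100 / total;
}
```

With root at 2048 and the benchmark user at 1024, root is entitled to 66% and the user to 33% whenever both have runnable work. Setting root's shares to 2, as suggested earlier in the thread, leaves the benchmark user with 99%.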

Thanks,
-- 
regards,
Dhaval
--

From: Andrew Morton
Date: Wednesday, May 7, 2008 - 12:04 am

--

From: Ingo Molnar
Date: Wednesday, May 7, 2008 - 2:17 am

thanks Yanmin, i've queued up your patch that reverts this change.

	Ingo
--

From: Zhang, Yanmin
Date: Wednesday, May 7, 2008 - 2:33 am

--

From: Peter Zijlstra
Date: Wednesday, May 7, 2008 - 10:34 am

Is this really needed now that GROUP_SCHED defaults to 'n' ?

Yanmin, this is with GROUP_SCHED=y, right or is this without?

--

From: Peter Zijlstra
Date: Wednesday, May 7, 2008 - 11:58 am

It's a long shot, but does the below help?

---
Subject: sched: fixup SMP load-balance 

Keeping the aggregate on the first cpu of the sched domain has two problems:
 - it could collide between different sched domains on different cpus
 - it could slow things down because of the remote accesses

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    1 
 kernel/sched.c        |  113 +++++++++++++++++++++++---------------------------
 kernel/sched_fair.c   |   12 ++---
 3 files changed, 60 insertions(+), 66 deletions(-)

Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -766,7 +766,6 @@ struct sched_domain {
 	struct sched_domain *child;	/* bottom domain must be null terminated */
 	struct sched_group *groups;	/* the balancing groups of the domain */
 	cpumask_t span;			/* span of all CPUs in this domain */
-	int first_cpu;			/* cache of the first cpu in this domain */
 	unsigned long min_interval;	/* Minimum balance interval ms */
 	unsigned long max_interval;	/* Maximum balance interval ms */
 	unsigned int busy_factor;	/* less balancing by factor if busy */
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -1539,12 +1539,12 @@ static int task_hot(struct task_struct *
  */
 
 static inline struct aggregate_struct *
-aggregate(struct task_group *tg, struct sched_domain *sd)
+aggregate(struct task_group *tg, int cpu)
 {
-	return &tg->cfs_rq[sd->first_cpu]->aggregate;
+	return &tg->cfs_rq[cpu]->aggregate;
 }
 
-typedef void (*aggregate_func)(struct task_group *, struct sched_domain *);
+typedef void (*aggregate_func)(struct task_group *, int, struct sched_domain *);
 
 /*
  * Iterate the full tree, calling @down when first entering a node and @up when
@@ -1552,14 +1552,14 ...

--

From: Zhang, Yanmin
Date: Wednesday, May 7, 2008 - 11:07 pm

--

From: Zhang, Yanmin
Date: Wednesday, May 7, 2008 - 10:20 pm

With GROUP_SCHED=y.

I remember a similar patch was merged into 2.6.25-rc1 and I found a similar volanoMark
regression, after which you reverted it. Why add it back in 2.6.26-rc1?

-yanmin


--

From: Dhaval Giani
Date: Wednesday, May 7, 2008 - 10:34 pm

The implementation has been changed extensively. As of now, without
proper load balancing, group scheduling is not yet fully fair, but such a
performance regression is serious, and we need to figure out why the
regression is taking place.

-- 
regards,
Dhaval
--

From: Peter Zijlstra
Date: Wednesday, May 7, 2008 - 11:43 pm

The only thing similar is that it tries to do SMP load balancing for
groups; other than that there is nothing similar - it's a total rewrite
with a whole different approach.

And we _need_ an SMP load-balancer for groups - otherwise group
scheduling is just not complete.

--

From: Dhaval Giani
Date: Wednesday, May 7, 2008 - 10:42 am

[Empty message]

--

From: Zhang, Yanmin
Date: Wednesday, May 7, 2008 - 10:21 pm

What's the hardware/software configuration? And kernel .config?

--

From: Dhaval Giani
Date: Wednesday, May 7, 2008 - 10:39 pm

2 CPU 64bit Xeon Processors running SLES 10 SP1

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.26-rc1
# Wed May  7 11:13:46 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_DEFCONFIG_LIST="arch/x86/configs/x86_64_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-default"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not ...

--

From: Zhang, Yanmin
Date: Wednesday, May 7, 2008 - 11:03 pm

Is the cpu dual-core or quad-core? Or just hyper-threading?

I found 16-core tigerton has bigger regression than 8-core stoakley. perhaps the
> CONFIG_USB_ZC0301=m