Re: fair group scheduler not so fair?

From: Srivatsa Vaddagiri
Date: Tuesday, May 27, 2008 - 10:15 am

On Wed, May 21, 2008 at 05:59:22PM -0600, Chris Friesen wrote:

This is definitely not the expected behavior, and I think I understand why
it is happening.

But first, note that Groups "a" and "b" share bandwidth with all tasks
in /dev/cgroup/tasks. Let's say that /dev/cgroup/tasks has T0 and T1,
/dev/cgroup/a/tasks has TA1 and /dev/cgroup/b/tasks has
TB1 (all tasks of weight 1024).

Then TA1 is expected to get 1/(1+1+2) = 25% of the bandwidth (one share
each for Groups "a" and "b", plus two for T0 and T1).

Similarly, T0, T1 and TB1 each get 25% of the bandwidth.

IOW, Groups "a" and "b" are peers of each task in /dev/cgroup/tasks.
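
To make the arithmetic concrete, here is a minimal user-space sketch of the
share calculation. It is not kernel code: the helper names are made up for
illustration, and it assumes Groups "a" and "b" are left at a weight of
1024, the same as each task.

```c
#include <assert.h>

/* Weight of each task in the example; the groups are assumed to match. */
#define EXAMPLE_WEIGHT 1024UL

/*
 * Expected bandwidth, in percent, of one entity at the root level.
 * Groups and root-level tasks are treated as peers, so the share is
 * weight / (sum of all root-level weights).
 */
static unsigned long root_share_percent(unsigned long weight,
					const unsigned long *peers, int n)
{
	unsigned long total = 0;
	int i;

	for (i = 0; i < n; i++)
		total += peers[i];
	assert(total > 0);
	return weight * 100 / total;
}

/* The example from the text: T0, T1, Group "a" and Group "b". */
static unsigned long example_share_of_group_a(void)
{
	const unsigned long peers[] = {
		EXAMPLE_WEIGHT,	/* T0 */
		EXAMPLE_WEIGHT,	/* T1 */
		EXAMPLE_WEIGHT,	/* Group "a" (holds TA1) */
		EXAMPLE_WEIGHT,	/* Group "b" (holds TB1) */
	};

	return root_share_percent(EXAMPLE_WEIGHT, peers, 4);
}
```

Since TA1 is the only task in Group "a", it receives the whole of that
group's share.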

Having said that, here's what I do for my testing:

	# mkdir /cgroup
	# mount -t cgroup -ocpu none /cgroup
	# cd /cgroup

	# #Move all tasks to 'sys' group and give it low shares
	# mkdir sys
	# cd sys
	# for i in `cat ../tasks`
	  do
		echo $i > tasks
	  done
	# echo 100 > cpu.shares

	# #Back to /cgroup so that 'a' and 'b' are created as peers of 'sys'
	# cd ..

	# mkdir a
	# mkdir b

	# echo <pid> > a/tasks
	..

Now, why did Group "a" get less than what it deserved? Here's what was
happening:

	CPU0		CPU1

	  a0		b0
	  b1

cpu0.load = 1024 (Grp a load) + 512 (Grp b load)
cpu1.load = 512 (Grp b load)

imbalance = 1024

max_load_move = 512 (to equalize load)
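
The numbers above can be reproduced with a small user-space sketch. This is
only a two-CPU simplification with made-up helper names; the real
find_busiest_group() takes more into account than a plain difference.

```c
#include <assert.h>

struct two_cpu_load {
	unsigned long busiest;	/* load on the busier CPU */
	unsigned long other;	/* load on the CPU doing the pull */
};

/* Difference between the two run-queue loads. */
static unsigned long imbalance(struct two_cpu_load l)
{
	assert(l.busiest >= l.other);
	return l.busiest - l.other;
}

/* Load to move so that both CPUs end up equal: half the difference. */
static unsigned long max_load_move(struct two_cpu_load l)
{
	return imbalance(l) / 2;
}

/* The layout from the text: CPU0 runs a0 and b1, CPU1 runs b0. */
static struct two_cpu_load example_layout(void)
{
	struct two_cpu_load l = {
		.busiest = 1024 + 512,	/* cpu0: Grp a load + Grp b load */
		.other   = 512,		/* cpu1: Grp b load */
	};

	return l;
}
```

Moving exactly 512 from CPU0 to CPU1 would leave both CPUs at a load of 1024.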

load_balance_fair() is invoked on CPU1 with this max_load_move target of 512.
Ideally it would move b1 to CPU1, which would achieve perfect balance. This
does not happen because:

load_balance_fair() iterates through the task group list in the order the
groups were created. So it first examines what tasks it can pull from
Group "a".

	It invokes __load_balance_fair() to see if it can pull any tasks
	worth a maximum weight of 512 (rem_load). Ideally, since a0's
	weight is 1024, it should not pull a0. However, balance_tasks() is
	eager to pull at least one task (because of SCHED_LOAD_SCALE_FUZZ)
	and ends up pulling a0. This results in more load being moved
	(1024) than the required target.

Next, when CPU0 tries pulling a load of 512, it ends up pulling a0 again.

Thus a0 ping-pongs between the two CPUs.
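
The effect can be sketched in user space as follows. This is a simplified
model, not the kernel code: skip_for_load() below mirrors only the load
comparison in balance_tasks() and leaves out the priority check and
can_migrate_task(). The "fuzz" and "inclusive" parameters and the
count_a0_migrations() helper are made up so that the same test can be tried
with different settings.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A task is skipped when half of its weight exceeds the remaining load
 * to move plus the fuzz. "inclusive" selects >= instead of >.
 */
static bool skip_for_load(unsigned long weight, unsigned long rem_load_move,
			  unsigned long fuzz, bool inclusive)
{
	unsigned long half = weight >> 1;
	unsigned long limit = rem_load_move + fuzz;

	return inclusive ? half >= limit : half > limit;
}

/*
 * Count how often a0 (load 1024) migrates over a number of balance
 * passes. Each pass asks for 512; Group "a" is examined first, so a0
 * moves whenever it is not skipped. After a0 moves the layout is
 * mirrored, so the next pass asks for 512 again. If a0 is skipped it
 * stays put (b1 can be pulled instead) and is never counted.
 */
static int count_a0_migrations(int passes, unsigned long fuzz, bool inclusive)
{
	const unsigned long a0_load = 1024, target = 512;
	int moves = 0;
	int i;

	for (i = 0; i < passes; i++) {
		if (!skip_for_load(a0_load, target, fuzz, inclusive))
			moves++;
	}
	return moves;
}
```

With a fuzz of 1024 (SCHED_LOAD_SCALE) and the strict comparison, a0 is
never skipped and migrates on every pass. With a fuzz of 0 and the inclusive
comparison, a0 is skipped, while a task with a load of 512 such as b1 is
still eligible.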


The following experimental patch (on top of 2.6.26-rc3 +
http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/) seems 
to fix the problem.

Note that this works only when /dev/cgroup/sys/cpu.shares = 100 (or some
other low number). Otherwise top (or whatever command you run to observe the
load distribution) contributes some load to the /dev/cgroup/sys group, which
skews the results. IMHO, find_busiest_group() needs to use CPU utilization,
rather than task/group load, as the metric to balance across CPUs.

Can you check if this makes a difference for you as well?


Not-yet-Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 include/linux/sched.h |    4 ++++
 init/Kconfig          |    2 +-
 kernel/sched.c        |    5 ++++-
 kernel/sched_debug.c  |    2 +-
 4 files changed, 10 insertions(+), 3 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -698,7 +698,11 @@ enum cpu_idle_type {
 #define SCHED_LOAD_SHIFT	10
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define SCHED_LOAD_SCALE_FUZZ	0
+#else
 #define SCHED_LOAD_SCALE_FUZZ	SCHED_LOAD_SCALE
+#endif
 
 #ifdef CONFIG_SMP
 #define SD_LOAD_BALANCE		1	/* Do load balancing on this domain. */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -349,7 +349,7 @@ config RT_GROUP_SCHED
 	  See Documentation/sched-rt-group.txt for more information.
 
 choice
-	depends on GROUP_SCHED
+	depends on GROUP_SCHED && (FAIR_GROUP_SCHED || RT_GROUP_SCHED)
 	prompt "Basis for grouping tasks"
 	default USER_SCHED
 
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1534,6 +1534,9 @@ tg_shares_up(struct task_group *tg, int 
 	unsigned long shares = 0;
 	int i;
 
+	if (!tg->parent)
+		return;
+
 	for_each_cpu_mask(i, sd->span) {
 		rq_weight += tg->cfs_rq[i]->load.weight;
 		shares += tg->cfs_rq[i]->shares;
@@ -2919,7 +2922,7 @@ next:
 	 * skip a task if it will be the highest priority task (i.e. smallest
 	 * prio value) on its new queue regardless of its load weight
 	 */
-	skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
+	skip_for_load = (p->se.load.weight >> 1) >= rem_load_move +
 							 SCHED_LOAD_SCALE_FUZZ;
 	if ((skip_for_load && p->prio >= *this_best_prio) ||
 	    !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -119,7 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
 	struct sched_entity *last;
 	unsigned long flags;
 
-#if !defined(CONFIG_CGROUP_SCHED) || !defined(CONFIG_USER_SCHED)
+#ifndef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
 #else
 	char path[128] = "";

-- 
Regards,
vatsa