[tip:sched/core] sched: Drop group_capacity to 1 only if local group has extra capacity

From: Nikhil Rao
Date: Friday, October 15, 2010 - 1:12 pm

Hi all,

Please find attached a series of patches that improve load balancing
when there is a large weight differential between tasks (such as when nicing a
task or when using SCHED_IDLE). These patches are based off feedback given by
Peter Zijlstra and Mike Galbraith in earlier posts.

Previous versions:
-v0: http://thread.gmane.org/gmane.linux.kernel/1015966
     Large weight differential leads to inefficient load balancing

-v1: http://thread.gmane.org/gmane.linux.kernel/1041721
     Improve load balancing when tasks have large weight differential

-v2: http://thread.gmane.org/gmane.linux.kernel/1048073
     Improve load balancing when tasks have large weight differential -v2

Changes from -v2:
- Swap patches 3 and 4, which allows us to reuse sds->this_has_capacity to
  check if the local group has extra capacity.
- Drop this_group_capacity from sd_lb_stats
- Update comments and changelog descriptions to describe the patches better.
- Add an unlikely() hint to the SCHED_IDLE policy check in task_hot() based on
  feedback from Satoru Takeuchi.

These patches can be applied to v2.6.36-rc7 or -tip without conflicts. Below
are some tests that highlight the improvements with this patchset.

1. 16 SCHED_IDLE soakers, 1 SCHED_NORMAL task on a 16-cpu machine.
Tested on a quad-core, quad-socket machine. Steps to reproduce:
- spawn 16 SCHED_IDLE tasks
- spawn one nice 0 task
- system utilization immediately drops to 80% on v2.6.36-rc7
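
The steps above could be scripted along these lines. This is only a minimal
sketch, not the harness used for the numbers below: the helper names
(become_sched_idle, spawn_soakers, stop_soakers) are made up for illustration,
and the fallback value for SCHED_IDLE assumes Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* expose SCHED_IDLE in <sched.h> on glibc */
#endif
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SCHED_IDLE
#define SCHED_IDLE 5		/* assumption: value from the Linux headers */
#endif

/* Put the calling process into SCHED_IDLE. Returns 0 on success, -1 on error. */
int become_sched_idle(void)
{
	struct sched_param param = { .sched_priority = 0 };

	return sched_setscheduler(0, SCHED_IDLE, &param);
}

/*
 * Fork up to n cpu-bound children and store their pids in pids[].
 * If idle is non-zero, each child first switches itself to SCHED_IDLE.
 * Returns the number of children actually started.
 */
int spawn_soakers(int n, int idle, pid_t *pids)
{
	int i;

	for (i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0) {
			if (idle && become_sched_idle() != 0)
				_exit(1);
			for (;;)
				;	/* soak the cpu until killed */
		}
		pids[i] = pid;
	}
	return i;
}

/* Kill and reap the children started by spawn_soakers(). */
void stop_soakers(int n, const pid_t *pids)
{
	int i;

	for (i = 0; i < n; i++) {
		kill(pids[i], SIGKILL);
		waitpid(pids[i], NULL, 0);
	}
}
```

Calling spawn_soakers(16, 1, idle_pids) followed by spawn_soakers(1, 0,
&normal_pid) approximates the setup; utilization can then be sampled once per
second with a per-cpu statistics tool.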

v2.6.36-rc7

10:38:46 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:38:47 AM  all   80.69    0.00    0.50    0.00    0.00    0.00    0.00   18.82  14008.00
10:38:48 AM  all   85.09    0.06    0.50    0.00    0.00    0.00    0.00   14.35  14690.00
10:38:49 AM  all   86.83    0.06    0.44    0.00    0.00    0.00    0.00   12.67  14314.85
10:38:50 AM  all   79.89    0.00    0.37    0.00    0.00    0.00    0.00   19.74  14035.35
10:38:51 AM  all   87.94    0.06    0.44    0.00    0.00    0.00    0.00   11.56  ...
From: Nikhil Rao
Date: Friday, October 15, 2010 - 1:12 pm

When cycling through sched groups to determine the busiest group, set
group_imb only if the busiest cpu has more than 1 runnable task. This patch
fixes the case where two cpus in a group have one runnable task each, but there
is a large weight differential between these two tasks. The load balancer is
unable to migrate any task from this group, and hence we should not consider
this group to be imbalanced.
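
The decision described above can be modelled outside the kernel roughly as
follows. This is a sketch under assumptions: group_is_imbalanced() is a
hypothetical helper, its parameters only mirror the locals of
update_sg_lb_stats(), and the exact form of the condition in the patch may
differ.

```c
/*
 * Hypothetical stand-alone model of the group_imb decision. The parameter
 * names mirror the locals in update_sg_lb_stats(); the function itself is
 * not kernel code.
 */
int group_is_imbalanced(unsigned long max_cpu_load,
			unsigned long min_cpu_load,
			unsigned long avg_load_per_task,
			unsigned long max_nr_running)
{
	/* Existing check: the load spread within the group must be large ... */
	if ((max_cpu_load - min_cpu_load) <= 2 * avg_load_per_task)
		return 0;

	/* ... new check: and the busiest cpu must have a task we can pull. */
	return max_nr_running > 1;
}
```

As an illustration, take a four-cpu group with one task per cpu, one task at
nice -15 (weight 29154 in prio_to_weight[]) and three at nice 0 (weight 1024).
Then avg_load_per_task is 32226 / 4 = 8056 and the spread 29154 - 1024 = 28130
exceeds 2 * 8056, so the load check alone would flag the group. With
max_nr_running == 1 the model returns 0, since no cpu has a second task that
could be migrated.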

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 kernel/sched_fair.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..0dd1021 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2378,7 +2378,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			int local_group, const struct cpumask *cpus,
 			int *balance, struct sg_lb_stats *sgs)
 {
-	unsigned long load, max_cpu_load, min_cpu_load;
+	unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
 	int i;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
@@ -2389,6 +2389,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
+	max_nr_running = 0;
 
 	for_each_cpu_and(i, sched_group_cpus(group), cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -2406,8 +2407,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			load = target_load(i, load_idx);
 		} else {
 			load = source_load(i, load_idx);
-			if (load > max_cpu_load)
+			if (load > max_cpu_load) {
 				max_cpu_load = load;
+				max_nr_running = rq->nr_running;
+			}
 			if (min_cpu_load > load)
 				min_cpu_load = load;
 		}
@@ -2447,7 +2450,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	if (sgs->sum_nr_running)
 		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
-	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
+	if ((max_cpu_load - ...
From: Nikhil Rao
Date: Friday, October 15, 2010 - 1:12 pm

When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. The extra check prevents the case
where you always pull from the heaviest group when it is already under-utilized
(possible when a large-weight task outweighs the other tasks on the system).

For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
scheduling domains. Let's say we spawn 15 nice 0 tasks and one nice -15 task,
and each task is running on one core. In this case, we observe the following
events when balancing at the NUMA domain:

- find_busiest_group() will always pick the sched group containing the niced
  task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running the
  nice 0 task (never picks the cpu with the nice -15 task since
  weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task
  and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5,
  at which point it kicks off the active load balancer, wakes up the migration
  thread and kicks the nice 0 task off the cpu.

The load balancer doesn't stop until we kick out all nice 0 tasks from
the sched group, leaving you with 3 idle cpus and one cpu running the
nice -15 task.
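
To see why the group with the niced task always looks busiest, it may help to
add up the weights. The sketch below is a simplified model: it treats group
load as the plain sum of task weights and ignores cpu_power scaling.
group_load() and heaviest_group() are hypothetical helpers; the two weights
are the nice 0 and nice -15 entries of the kernel's prio_to_weight[] table.

```c
#include <stddef.h>

/* Load weights from the kernel's prio_to_weight[] table. */
#define WEIGHT_NICE_0		1024UL	/* nice 0 */
#define WEIGHT_NICE_MINUS_15	29154UL	/* nice -15 */

/* Simplified group load: sum of the weights of its runnable tasks. */
unsigned long group_load(const unsigned long *weights, size_t nr_tasks)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < nr_tasks; i++)
		sum += weights[i];
	return sum;
}

/* Index of the heaviest group, a crude stand-in for find_busiest_group(). */
size_t heaviest_group(const unsigned long *loads, size_t nr_groups)
{
	size_t i, busiest = 0;

	for (i = 1; i < nr_groups; i++)
		if (loads[i] > loads[busiest])
			busiest = i;
	return busiest;
}
```

In this model the group holding the niced task weighs 29154 + 3 * 1024 = 32226,
against 4 * 1024 = 4096 for each of the other groups. Even with all three
nice 0 tasks removed it still weighs 29154, so it remains the heaviest group
at every step.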

When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
not relevant because the niced task has a very large weight.

In this patch, we add an extra condition to the "if(prefer_sibling)" check in
update_sd_lb_stats(). We drop the capacity of a group only if the local group
has extra capacity, i.e. nr_running < group_capacity. This patch preserves the
original intent of the prefer_sibling check (to spread tasks across the system
in low utilization scenarios) and fixes the case above.
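
A user-space model of the new condition might look like the sketch below. It
is written under assumptions: struct group_summary and both helpers are
hypothetical names, and the real check in update_sd_lb_stats() may carry
additional conditions that are not modelled here.

```c
/* Hypothetical per-group summary, loosely modelled on struct sg_lb_stats. */
struct group_summary {
	unsigned long nr_running;	/* runnable tasks in the group */
	unsigned long group_capacity;	/* tasks the group can run at once */
};

/* "Extra capacity" as described above: nr_running < group_capacity. */
int has_extra_capacity(const struct group_summary *g)
{
	return g->nr_running < g->group_capacity;
}

/*
 * Capacity to use for a candidate group. With the patch, the clamp to 1
 * applies only when the local group still has room for more tasks.
 */
unsigned long effective_capacity(const struct group_summary *candidate,
				 const struct group_summary *local,
				 int prefer_sibling)
{
	if (prefer_sibling && has_extra_capacity(local))
		return candidate->group_capacity < 1UL ?
			candidate->group_capacity : 1UL;
	return candidate->group_capacity;
}
```

With a local group that already runs 4 tasks on a capacity of 4, a candidate
group keeps its capacity of 4 and is not treated as over capacity merely
because of the clamp. With a local group running 1 task on a capacity of 4,
the candidate's capacity drops to 1, which keeps the spreading behaviour for
lightly loaded systems.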

It helps in the following ways:
- In low utilization cases (where nr_tasks << ...
From: tip-bot for Nikhil Rao
Date: Monday, October 18, 2010 - 12:24 pm

Commit-ID:  75dd321d79d495a0ee579e6249ebc38ddbb2667f
Gitweb:     http://git.kernel.org/tip/75dd321d79d495a0ee579e6249ebc38ddbb2667f
Author:     Nikhil Rao <ncrao@google.com>
AuthorDate: Fri, 15 Oct 2010 13:12:30 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:19 +0200

sched: Drop group_capacity to 1 only if local group has extra capacity

From: Peter Zijlstra
Date: Friday, October 15, 2010 - 1:44 pm

Thanks, I've queued them up, we'll see what happens :-)