Hi all,

Please find attached a series of patches that improve load balancing when
there is a large weight differential between tasks (such as when nicing a task
or when using SCHED_IDLE). These patches are based off feedback given by Peter
Zijlstra and Mike Galbraith in earlier posts.

Previous versions:
-v0: http://thread.gmane.org/gmane.linux.kernel/1015966
     Large weight differential leads to inefficient load balancing
-v1: http://thread.gmane.org/gmane.linux.kernel/1041721
     Improve load balancing when tasks have large weight differential
-v2: http://thread.gmane.org/gmane.linux.kernel/1048073
     Improve load balancing when tasks have large weight differential -v2

Changes from -v2:
- Swap patches 3 and 4, which allows us to reuse sds->this_has_capacity to
  check if the local group has extra capacity.
- Drop this_group_capacity from sd_lb_stats.
- Update comments and changelog descriptions to describe the patches better.
- Add an unlikely() hint to the SCHED_IDLE policy check in task_hot() based on
  feedback from Satoru Takeuchi.

These patches can be applied to v2.6.36-rc7 or -tip without conflicts. Below
are some tests that highlight the improvements with this patchset.

1. 16 SCHED_IDLE soakers, 1 SCHED_NORMAL task on 16 cpu machine.
   Tested on a quad-cpu, quad-socket machine.

Steps to reproduce:
- spawn 16 SCHED_IDLE tasks
- spawn one nice 0 task
- system utilization immediately drops to 80% on v2.6.36-rc7

v2.6.36-rc7

10:38:46 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:38:47 AM  all   80.69    0.00    0.50    0.00    0.00    0.00    0.00   18.82  14008.00
10:38:48 AM  all   85.09    0.06    0.50    0.00    0.00    0.00    0.00   14.35  14690.00
10:38:49 AM  all   86.83    0.06    0.44    0.00    0.00    0.00    0.00   12.67  14314.85
10:38:50 AM  all   79.89    0.00    0.37    0.00    0.00    0.00    0.00   19.74  14035.35
10:38:51 AM  all   87.94    0.06    0.44    0.00    0.00    0.00    0.00   11.56  ...
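To make the weight differential in test 1 concrete, here is a rough sketch of
the arithmetic. The two constants are assumptions based on kernel/sched.c from
around v2.6.36 (a nice 0 task weighing 1024 and a SCHED_IDLE task weighing 3),
so please check them against your tree. rq_weighted_load() is a hypothetical
helper, not a kernel function.

```c
/* Assumed load weights from kernel/sched.c around v2.6.36; verify locally. */
#define NICE_0_WEIGHT   1024UL  /* nice 0 SCHED_NORMAL task */
#define WEIGHT_IDLEPRIO 3UL     /* SCHED_IDLE task */

/*
 * Hypothetical helper: weighted load of a runqueue that holds nr_normal
 * nice 0 tasks and nr_idle SCHED_IDLE tasks.
 */
static unsigned long rq_weighted_load(unsigned int nr_normal,
				      unsigned int nr_idle)
{
	return nr_normal * NICE_0_WEIGHT + nr_idle * WEIGHT_IDLEPRIO;
}
```

Under these assumptions, rq_weighted_load(0, 16) is 48 while
rq_weighted_load(1, 0) is 1024, so a single nice 0 task outweighs all 16
soakers combined. That is the kind of differential the tests above exercise.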
When cycling through sched groups to determine the busiest group, set
group_imb only if the busiest cpu has more than 1 runnable task. This patch
fixes the case where two cpus in a group have one runnable task each, but there
is a large weight differential between these two tasks. The load balancer is
unable to migrate any task from this group, and hence we do not consider this
group to be imbalanced.

Signed-off-by: Nikhil Rao <ncrao@google.com>
---
kernel/sched_fair.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)
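
For illustration only (not part of the patch): a minimal, user-space sketch of
the rule described in the changelog above. struct group_summary and both
helpers are hypothetical simplifications of what update_sg_lb_stats() tracks,
and the exact form of the new condition is an assumption drawn from the
changelog text rather than from the diff.

```c
#include <stdbool.h>

/* Hypothetical, simplified per-group summary; not the kernel's sg_lb_stats. */
struct group_summary {
	unsigned long max_cpu_load;      /* weighted load of the busiest cpu */
	unsigned long min_cpu_load;      /* weighted load of the least busy cpu */
	unsigned long max_nr_running;    /* nr_running on the busiest cpu */
	unsigned long avg_load_per_task; /* sum_weighted_load / sum_nr_running */
};

/* Old rule: a large load spread alone marks the group as imbalanced. */
static bool group_imb_old(const struct group_summary *g)
{
	return (g->max_cpu_load - g->min_cpu_load) > 2 * g->avg_load_per_task;
}

/*
 * Rule from the changelog (assumed form): additionally require that the
 * busiest cpu has more than one runnable task, i.e. something can move.
 */
static bool group_imb_new(const struct group_summary *g)
{
	return group_imb_old(g) && g->max_nr_running > 1;
}
```

As a worked case, take a 4-cpu group with one task per cpu: one nice -15 task
(assumed weight 29154) and three nice 0 tasks (assumed weight 1024 each). Then
avg_load_per_task is 32226 / 4 = 8056 and the spread is 28130, so the old rule
flags the group even though no task can be migrated, while the new rule does
not because max_nr_running is 1.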
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..0dd1021 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2378,7 +2378,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			int local_group, const struct cpumask *cpus,
 			int *balance, struct sg_lb_stats *sgs)
 {
-	unsigned long load, max_cpu_load, min_cpu_load;
+	unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
 	int i;
 	unsigned int balance_cpu = -1, first_idle_cpu = 0;
 	unsigned long avg_load_per_task = 0;
@@ -2389,6 +2389,7 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	/* Tally up the load of all CPUs in the group */
 	max_cpu_load = 0;
 	min_cpu_load = ~0UL;
+	max_nr_running = 0;
 	for_each_cpu_and(i, sched_group_cpus(group), cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -2406,8 +2407,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 			load = target_load(i, load_idx);
 		} else {
 			load = source_load(i, load_idx);
-			if (load > max_cpu_load)
+			if (load > max_cpu_load) {
 				max_cpu_load = load;
+				max_nr_running = rq->nr_running;
+			}
 			if (min_cpu_load > load)
 				min_cpu_load = load;
 		}
@@ -2447,7 +2450,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 	if (sgs->sum_nr_running)
 		avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
-	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
+	if ((max_cpu_load - ...

When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. The extra check prevents the case
where you always pull from the heaviest group when it is already
under-utilized (possible when a large weight task outweighs the other tasks on
the system).

For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
scheduling domains. Let's say we spawn 15 nice 0 tasks and one nice -15 task,
and each task is running on one core. In this case, we observe the following
events when balancing at the NUMA domain:

- find_busiest_group() will always pick the sched group containing the niced
  task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running a
  nice 0 task (it never picks the cpu with the nice -15 task since
  weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task
  and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5,
  at which point it kicks off the active load balancer, wakes up the migration
  thread and kicks the nice 0 task off the cpu.

The load balancer doesn't stop until we kick out all nice 0 tasks from the
sched group, leaving you with 3 idle cpus and one cpu running the nice -15
task.

When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the
child domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load
checks are not relevant because the niced task has a very large weight.

In this patch, we add an extra condition to the "if(prefer_sibling)" check in
update_sd_lb_stats(). We drop the capacity of a group only if the local group
has extra capacity, i.e. nr_running < group_capacity. This patch preserves the
original intent of the prefer_siblings check (to spread tasks across the
system in low utilization scenarios) and fixes the case above.

It helps in the following ways: - In low utilization cases (where nr_tasks << ...
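
For illustration only (not the patch itself): a small user-space sketch of the
extra condition described above. struct group_stats, has_extra_capacity() and
effective_capacity() are hypothetical names. The real check lives in
update_sd_lb_stats() and may test more than is shown here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-group fields used by the check. */
struct group_stats {
	unsigned long sum_nr_running; /* runnable tasks in the group */
	unsigned long group_capacity; /* tasks the group can run at full power */
};

/* The local group has extra capacity when nr_running < group_capacity. */
static bool has_extra_capacity(const struct group_stats *local)
{
	return local->sum_nr_running < local->group_capacity;
}

/*
 * Capacity used when judging a group during balancing. With prefer_sibling
 * set, clamp it to 1 only if the local group could actually take more work;
 * otherwise leave it untouched.
 */
static unsigned long effective_capacity(const struct group_stats *sgs,
					bool prefer_sibling,
					bool this_has_capacity)
{
	if (prefer_sibling && this_has_capacity)
		return sgs->group_capacity < 1 ? sgs->group_capacity : 1;
	return sgs->group_capacity;
}
```

In the 16-task example, every 4-cpu group runs 4 tasks with a capacity of 4,
so has_extra_capacity() is false for the local group and the niced task's
group keeps a capacity of 4 instead of looking overloaded. In a low
utilization case (say 1 task in the local group), the clamp to 1 still
applies and tasks keep spreading across the system.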
Commit-ID:  75dd321d79d495a0ee579e6249ebc38ddbb2667f
Gitweb:     http://git.kernel.org/tip/75dd321d79d495a0ee579e6249ebc38ddbb2667f
Author:     Nikhil Rao <ncrao@google.com>
AuthorDate: Fri, 15 Oct 2010 13:12:30 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 20:52:19 +0200

sched: Drop group_capacity to 1 only if local group has extra capacity

When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. The extra check prevents the case
where you always pull from the heaviest group when it is already
under-utilized (possible when a large weight task outweighs the other tasks on
the system).

For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
scheduling domains. Let's say we spawn 15 nice 0 tasks and one nice -15 task,
and each task is running on one core. In this case, we observe the following
events when balancing at the NUMA domain:

- find_busiest_group() will always pick the sched group containing the niced
  task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running a
  nice 0 task (it never picks the cpu with the nice -15 task since
  weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task
  and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5,
  at which point it kicks off the active load balancer, wakes up the migration
  thread and kicks the nice 0 task off the cpu.

The load balancer doesn't stop until we kick out all nice 0 tasks from the
sched group, leaving you with 3 idle cpus and one cpu running the nice -15
task.

When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the
child domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load
checks are not relevant because the niced task has a very large weight.

In this patch, we add an extra condition to the "if(prefer_sibling)" check ...

Thanks, I've queued them up, we'll see what happens :-)
