Hi all, We have observed that a large weight differential between tasks on a runqueue leads to sub-optimal machine utilization and poor load balancing. For example, if you have lots of SCHED_IDLE tasks (sufficient number to keep the machine 100% busy) and a few SCHED_NORMAL soaker tasks, we see that the machine has significant idle time. The data below highlights this problem. The test machine is a 4 socket quad-core box (16 cpus). These experiemnts were done with v2.6.25-rc6. We spawn 16 SCHED_IDLE soaker threads (one per-cpu) to completely fill up the machine. CPU utilization numbers gathered from mpstat for 10s are: 03:30:24 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 03:30:25 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16234.65 03:30:26 PM all 99.88 0.06 0.06 0.00 0.00 0.00 0.00 0.00 16374.00 03:30:27 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16392.00 03:30:28 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16612.12 03:30:29 PM all 99.88 0.00 0.12 0.00 0.00 0.00 0.00 0.00 16375.00 03:30:30 PM all 99.94 0.06 0.00 0.00 0.00 0.00 0.00 0.00 16440.00 03:30:31 PM all 99.81 0.00 0.19 0.00 0.00 0.00 0.00 0.00 16237.62 03:30:32 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16360.00 03:30:33 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 16405.00 03:30:34 PM all 99.38 0.06 0.50 0.00 0.00 0.00 0.00 0.06 18881.82 Average: all 99.86 0.02 0.12 0.00 0.00 0.00 0.00 0.01 16628.20 We then spawn one SCHED_NORMAL while-1 task (the absolute number does not matter so long as we introduce some large weight differential). 03:40:57 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 03:40:58 PM all 83.06 0.00 0.06 0.00 0.00 ...
This patch adds a load balancer for SCHED_IDLE tasks (sched_idle_load_balance).
The metric used to balance SCHED_IDLE tasks is calculated as:
load = (idle_nr_runnig * WEIGHT_IDLEPRIO / idle power)
The metric used is a ratio of the load contributed by SCHED_IDLE tasks to the
available power for running SCHED_IDLE tasks. We determine available power
similar to the RT power scaling calculations, i.e. we scale a CPU's available
idle power based on the average SCHED_NORMAL/SCHED_BATCH activity over a given
period.
The SCHED_IDLE load balancer is called at the end of rebalance domain. It runs
only when the SCHED_NORMAL/SCHED_BATCH balancer runs (i.e. it follows the same
rate limit).
Signed-off-by: Nikhil Rao <ncrao@google.com>
---
kernel/sched_fair.c | 140 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 140 insertions(+), 0 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index cb270e8..134ddbf 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1874,6 +1874,20 @@ balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
pinned = 1;
list_for_each_entry_safe(p, n, &busiest_cfs_rq->tasks, se.group_node) {
+ /*
+ * We skip this task if the following conditions are satisfied:
+ * 1. This sched domain has SD_IDLE_LOAD_BALANCE set
+ * 2. If sched_idle_balance is not set (i.e. we are doing a
+ * SCHED_NORMAL/SCHED_BATCH balance) and this is a SCHED_IDLE
+ * task
+ * 3. If sched_idle_balance is set (i.e. we are doing a
+ * SCHED_IDLE balance) and this is not a SCHED_IDLE task
+ */
+ if (sd->flags & SD_IDLE_LOAD_BALANCE &&
+ ((sched_idle_balance && p->policy != SCHED_IDLE) ||
+ (!sched_idle_balance && p->policy == SCHED_IDLE)))
+ continue;
+
if (loops++ > sysctl_sched_nr_migrate)
break;
@@ -3097,6 +3111,119 @@ out_unlock:
return 0;
}
+/*
+ * SCHED_IDLE balancing functions
+ */
+unsigned long scale_norm_power(int cpu)
+{
+ struct rq *rq = ...This patch adds an extra sched_idle_balance argument to move_tasks,
load_balance_fair and balance_tasks. This argument is required to differentiate
between SCHED_NORMAL/SCHED_BATCH load balancing and SCHED_IDLE balancing.
Signed-off-by: Nikhil Rao <ncrao@google.com>
---
kernel/sched_fair.c | 18 ++++++++++--------
1 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0e25e51..cb270e8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1861,7 +1861,8 @@ static unsigned long
balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move, struct sched_domain *sd,
enum cpu_idle_type idle, int *all_pinned,
- int *this_best_prio, struct cfs_rq *busiest_cfs_rq)
+ int *this_best_prio, struct cfs_rq *busiest_cfs_rq,
+ int sched_idle_balance)
{
int loops = 0, pulled = 0, pinned = 0;
long rem_load_move = max_load_move;
@@ -1923,7 +1924,7 @@ static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
- int *all_pinned, int *this_best_prio)
+ int *all_pinned, int *this_best_prio, int sched_idle_balance)
{
long rem_load_move = max_load_move;
int busiest_cpu = cpu_of(busiest);
@@ -1949,7 +1950,7 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
moved_load = balance_tasks(this_rq, this_cpu, busiest,
rem_load, sd, idle, all_pinned, this_best_prio,
- busiest_cfs_rq);
+ busiest_cfs_rq, sched_idle_balance);
if (!moved_load)
continue;
@@ -1970,11 +1971,11 @@ static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
- int *all_pinned, int *this_best_prio)
+ int *all_pinned, int *this_best_prio, int sched_idle_balance)
{
return ...Enable SD_IDLE_LOAD_BALANCE on MC, CPU and NUMA for x86. This sched flag enables the idle load balancer on these scheduling domains. Signed-off-by: Nikhil Rao <ncrao@google.com> --- arch/x86/include/asm/topology.h | 2 +- include/linux/topology.h | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 9f29b4e..8b27b3f 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -150,7 +150,7 @@ extern unsigned long node_remap_size[]; | 0*SD_SHARE_PKG_RESOURCES \ | 1*SD_SERIALIZE \ | 0*SD_PREFER_SIBLING \ - | 0*SD_IDLE_LOAD_BALANCE \ + | 1*SD_IDLE_LOAD_BALANCE \ , \ .last_balance = jiffies, \ .balance_interval = 1, \ diff --git a/include/linux/topology.h b/include/linux/topology.h index 97fbb1b..cef87aa 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -135,7 +135,7 @@ int arch_update_cpu_topology(void); | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ | 0*SD_SERIALIZE \ - | 0*SD_IDLE_LOAD_BALANCE \ + | 1*SD_IDLE_LOAD_BALANCE \ | sd_balance_for_mc_power() \ | sd_power_saving_flags() \ , \ @@ -169,7 +169,7 @@ int arch_update_cpu_topology(void); | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ | 0*SD_SERIALIZE \ - | 0*SD_IDLE_LOAD_BALANCE \ + | 1*SD_IDLE_LOAD_BALANCE \ | sd_balance_for_package_power() \ | sd_power_saving_flags() \ , \ -- 1.7.1 --
This patch adds a moving average of time spent servicing SCHED_NORMAL tasks.
This metric is maintained at rq->norm_avg and is similar to rq->rt_avg. This is
metric is used to provide a measure of cpu power available to run SCHED_IDLE
tasks.
Signed-off-by: Nikhil Rao <ncrao@google.com>
---
kernel/sched.c | 12 ++++++++++++
kernel/sched_fair.c | 3 +++
2 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 26e74e9..d037b6c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -514,6 +514,7 @@ struct rq {
unsigned long avg_load_per_task;
u64 rt_avg;
+ u64 norm_avg;
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
@@ -1263,6 +1264,7 @@ static void sched_avg_update(struct rq *rq)
asm("" : "+rm" (rq->age_stamp));
rq->age_stamp += period;
rq->rt_avg /= 2;
+ rq->norm_avg /= 2;
}
}
@@ -1272,6 +1274,12 @@ static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
sched_avg_update(rq);
}
+static void sched_norm_avg_update(struct rq *rq, u64 norm_delta)
+{
+ rq->norm_avg += norm_delta;
+ sched_avg_update(rq);
+}
+
#else /* !CONFIG_SMP */
static void resched_task(struct task_struct *p)
{
@@ -1282,6 +1290,10 @@ static void resched_task(struct task_struct *p)
static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
}
+
+static void sched_norm_avg_update(struct rq *rq, u64 norm_delta)
+{
+}
#endif /* CONFIG_SMP */
#if BITS_PER_LONG == 32
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a878b53..0e25e51 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -543,6 +543,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
+
+ if (curtask->policy != SCHED_IDLE)
+ sched_norm_avg_update(task_rq(curtask), delta_exec);
}
}
--
1.7.1
--
What happens with s/SCHED_IDLE/nice 19? -Mike --
We see the same result with nice 19 as well. w/ 16 nice-19 soakers: 10:15:16 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 10:15:17 AM all 0.06 99.94 0.00 0.00 0.00 0.00 0.00 0.00 16296.04 10:15:18 AM all 0.00 99.94 0.06 0.00 0.00 0.00 0.00 0.00 16379.00 10:15:19 AM all 0.00 99.94 0.06 0.00 0.00 0.00 0.00 0.00 16414.00 10:15:20 AM all 0.00 99.94 0.06 0.00 0.00 0.00 0.00 0.00 16413.00 10:15:21 AM all 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 16402.00 10:15:22 AM all 0.00 99.88 0.06 0.00 0.00 0.06 0.00 0.00 16419.00 10:15:23 AM all 0.00 99.94 0.06 0.00 0.00 0.00 0.00 0.00 16406.00 10:15:24 AM all 0.19 99.69 0.12 0.00 0.00 0.00 0.00 0.00 16613.13 10:15:25 AM all 0.38 99.31 0.31 0.00 0.00 0.00 0.00 0.00 16313.86 10:15:26 AM all 0.50 99.31 0.19 0.00 0.00 0.00 0.00 0.00 16623.23 Average: all 0.11 99.79 0.09 0.00 0.00 0.01 0.00 0.00 16427.30 w/ adding a SCHED_NORMAL soaker to the mix: 10:17:44 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 10:17:45 AM all 6.20 74.38 0.06 0.00 0.00 0.00 0.00 19.35 14419.80 10:17:46 AM all 6.25 74.89 0.06 0.00 0.00 0.00 0.00 18.80 14619.00 10:17:47 AM all 6.30 74.84 0.06 0.00 0.00 0.00 0.00 18.79 14590.00 10:17:48 AM all 6.25 80.57 0.06 0.00 0.00 0.00 0.00 13.12 15511.00 10:17:49 AM all 6.51 80.33 0.07 0.00 0.00 0.00 0.00 13.09 14904.00 10:17:50 AM all 6.06 72.62 0.06 0.00 0.00 0.00 0.00 21.26 14564.00 10:17:51 AM all 6.21 74.47 0.06 0.00 0.00 0.00 0.00 19.25 14584.00 10:17:52 AM all 6.47 77.67 0.12 0.00 0.00 0.00 0.00 15.73 15295.96 10:17:53 AM all 6.27 ...
So this is also a concern of mine that I wanted to raise.
The problem observed above is a function of balancing entities which
can't meaningfully contribute to improving the observed imbalance.
While this does occur with {nice 19, SCHED_IDLE} threads, it is a more
general problem and actually much more prevalent in the group
scheduling case where the group entity weight is more arbitrary. This
is compounded by the fact that we iterate over the task_groups in load
balance and don't have a good way of differentiating in-group high/low
weight entities.
Something I have been investigating is how can we fix this problem
more generally: e.g. how can we effectively load-balance a low-weight
entity in the presence of high-weight competition. This could just as
easily be a {1000, 2} share split as it could be a {64000, 2000} share
split -- the latter won't benefit from IDLE_SCHEDULING optimizations
but is a realistic test case if someone is trying to provide minimal
latencies for the high-weight task.
What I've been considering in avenue is instead of load-balancing the
low weight entities when their h_load relative to the imbalance is
'not meaningful' -- by some fitness function, tag them for a separate
load-balancing pass. This secondary pass would only need to be
periodic in nature, and consider the 'tagged' task_groups from the
generic load_balancing operations. At this point in time it can be
evaluated whether the load is still 'not meaningful', and distribution
towards the most idle cpu (by utilization) can be evaluated.
The one converse here is that while something like the above is more
general for the low weight-group entity case, it does not extend
particularly cleanly to the non group-scheduling case. I haven't yet
got a particularly good answer for this since without a group entity
representing the low-weight threads tracking them for a second pass
becomes more cumbersome.
--
No its terrible. I've already explained how to solve this on several occasions (although my google skillz seem to fail me to find the latest occasion a few months ago). The thing is that from a fairness point of view 1 nice-0 (weight=1024) on one CPU and 512 SCHED_IDLE (weight=2) tasks on another CPU and all other CPUs idle is correct. It just doesn't seem to be the thing that most people expect. Special casing things like you've done is utterly the wrong thing to do. This problem comes in two forms and its name is infeasible weight distribution. The load-balancer tries to ensure W_k ~= W_l, k,l elem_of CPUs, where W_k = \Sum_i w_i^k, where w_i^k is the i-th task on CPU k. The two cases are statically infeasible, and dynamically infeasible. We say the task-set is statically infeasible if for a task set of n tasks there is no way to statically distribute them on N <= n CPUs such that each task gets equal service (assuming the scheduling on each CPU is fair). We say the task-set is dynamically infeasible if for the given scenario there is no way to rotate the tasks to obtain equality. Lets assume 2 CPUs. Ex.1: 2 tasks of different weight. Ex.2: 3 tasks of equal weight. The first example is both statically and dynamically infeasible as there is no way to occupy both CPUs such that each task gets the proportional correct service. The second example is statically infeasible, but dynamically feasible, for if we rotate one task, such that we alternate between 2:1 and 1:2 in equal measures, each task will receive its correct 2/3rd CPU service. The current load-balancer isn't particularly skilled at either issue. The proper solution is to 'fix' find_busiest_group() so that it will: - pick the heaviest cpu with more than 1 task on it - slowly over-balance things The first thing will solve your issue. --
Peter, Thanks for the feedback! I see your point here, and yes I agree having 1 nice-0 on one cpu, 512 SCHED_IDLE tasks on another cpu and all other cpus idle is correct if we only considered fairness. However, we would also like to maximize machine utilization. The fitness function we would ideally like to optimize for is a combination of both fairness and utilization. I'm going to take a step back here and explain the bigger problem we are trying to solve. We have two types of workloads -- high priority tasks that need to run for short periods of time and can consume as much cpu as required when they run; and low priority tasks that ideally consume slack cpu. Let's say we had a machine with enough tasks to soak up all the cpus. There are two cases to implement the priority scheme. The first case is to run all tasks run as SCHED_NORMAL. The machine is fully utilized but this has other side-effects. Low priority tasks consume much more cpu than we would like, they get to preempt high priority tasks, and this results in poor performance of high priority tasks. The second case is to run high priority tasks as SCHED_NORMAL and low priority tasks as SCHED_IDLE. We choose SCHED_IDLE to minimize interference with high priority tasks -- we don't want low priority tasks to preempt high priority or to be favored over high priority tasks in any way. When there are high priority tasks, we want low priority tasks to get as little cpu as possible; thus giving them the minimum possible weight works out well. In this case, we are able to isolate high prio tasks from low prio tasks reasonably well. However, sub-optimal machine utilization defeats the purpose of packing a machine with lots of low priority tasks as we are not able to consume the slack cpu. This RFC is really meant to explain the problem that we are facing. We presented one possible solution to fix this but we are also open to Thanks for your suggestions; I explored the first one a bit and I added a check into ...
Sure, I see (and agree with) the fact that we want to optimize utilization as well (although I bet the power management people might You might also need some changes to find_busiest_group(), suppose you have a 4 cpu machine, with 2 groups of 2, now also assume you have 4 tasks, 2 of nice-0 and 2 idle, if both nice-0 are in the same group, each on their own cpu, then f_b_g() could select that group as being the busiest (its got W=2048, against W=4 of the other group after all). Right, so wakeup/sleep are indeed more interesting. For wakeup we also have select_task_rq() to consider, it is responsible to choosing where to run the newly woken task. For sleeps we have new idle balancing, which is a lot like the regular responsible for the thing you see (although under-utilization suggests the new-idle balancer), you can use perf/ftrace to look at what your tasks are doing and how they could be doing it better (Arjan's timechart might be a good help). If they get woken to the wrong CPU, its select_task_rq(), if they leave a CPU idle too long, its new idle balancing -- or possibly its something I overlooked all together :-) --
