When calculating capacity we use the following calculation:
capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
In SMT2, power will be 1178/2 (provided we are not scaling power with
freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
being 1.
With SMT4 however, power will be 1178/4, hence capacity will end up as
0.
Fix this by ensuring capacity is always at least 1 after this
calculation.
Signed-off-by: Michael Neuling <mikey@neuling.org>
---
I'm not sure this is the correct fix but this works for me.
Original post here:
http://lkml.org/lkml/2010/3/30/884
---
kernel/sched_fair.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -1482,6 +1482,7 @@ static int select_task_rq_fair(struct ta
}
capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+ capacity = max(capacity, 1UL);
if (tmp->flags & SD_POWERSAVINGS_BALANCE)
nr_running /= 2;
@@ -2488,6 +2489,7 @@ static inline void update_sg_lb_stats(st
sgs->group_capacity =
DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+ sgs->group_capacity = max(sgs->group_capacity, 1UL);
}
/**
@@ -2795,9 +2797,11 @@ find_busiest_queue(struct sched_group *g
for_each_cpu(i, sched_group_cpus(group)) {
unsigned long power = power_of(i);
- unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+ unsigned long capacity;
unsigned long wl;
+ capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL);
+
if (!cpumask_test_cpu(i, cpus))
continue;
--
Right, so I suspect this will indeed break some things. We initially allowed 0 capacity for when a cpu is consumed by an RT task and there simply isn't much capacity left, in that case you really want to try and move load to your sibling cpus if possible. However you're right that this goes awry in your case. One thing to look at is if that 15% increase is indeed representative for the power7 cpu, it having 4 SMT threads seems to suggest there was significant gains, otherwise they'd not have wasted the silicon. (The broken x86 code was meant to actually compute the SMT gain, so that we'd not have to guess the 15%) Now, increasing this will only marginally fix the issue, since if you end up with 512 per thread it only takes a very tiny amount of RT workload to drop below and end up at 0 again. One thing we could look at is using the cpu base power to compute capacity from. We'd have to add another field to sched_group and store power before we do the scale_rt_power() stuff. --
Changing the CPU power based on what tasks are running on them seems a There are certainly, for most workloads, per core gains for SMT4 over SMT2 on P7. My kernels certainly compile faster and that's the only I tried initialled to make smt_gain programmable and at 2048 per core Separating capacity from what RT tasks are running seems like a good idea to me. This would fix the RT issue, but it's not clear to me how you are suggesting fixing the rounding down to 0 SMT4 issue. Are you suggesting we bump smt_gain to say 2048 + 15%? Or are you suggesting we separate the RT tasks out from capacity and keep the max(1, capacity) that I've added? Or something else? Would another possibility be changing capacity a scaled value (like cpu_power is now) rather than a small integer as it is now. For example, a scaled capacity of 1024 would be equivalent to a capacity of 1 now. This might enable us to handle partial capacities better? We'd probably have to scale a bunch of nr_running too. Mikey --
Well the thing cpu_power represents is a ratio of compute capacity available to this cpu as compared to other cpus. By normalizing the runqueue weights with this we end up with a fair balance. The thing to realize here is that this is solely about SCHED_NORMAL tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about fairness and available compute capacity. So if we were to ignore RT tasks, you'd end up with a situation where, assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the load-balancer would give each cpu 2 NORMAL tasks, but the tasks that would end up on the cpu the RT tasks would be running on would not run as fast -- is that fair? Since RT tasks do not have a weight (FIFO/RR have no limit at all, DEADLINE would have something equivalent to a max weight), it is impossible to account them in the normal weight sense. Therefore the current model takes them into account by lowering the compute capacity according to their (avg) cpu usage. So if the RT task would consume 66% cputime, we'd end up with a situation where the cpu running the RT task would get 1 NORMAL task, and other cpu would have For sure ;-) Are there any numbers available on how much they gain? It might be worth I would think that 4 SMT threads are still slower than two full cores, Right, so my proposal was to scale down the capacity divider (currently 1024) to whatever would be the base capacity for that cpu. Trouble seems to be that that makes group capacity a lot more complex, as you would end up needing to average all the cpu's their base capacity. Hrmm, my brain seems muddled but I might have another solution, let me ponder this for a bit.. --
Thanks for the explanation. Your last example makes perfect sense for I get some gain numbers but obviously the workloads makes a huge difference. From a scheduler perspective, I assume an average/representative gain is best rather than an optimistic or pessimistic one? We'll have different gains for SMT2 and SMT4, so we could change the gain dynamically based on which SMT mode we are in. Does that seem like Yes for an average workload, 4 SMT threads are slower than 2 full Great! Mikey --
That's the sort of thing you can use arch_scale_smt_power() for. But be weary to not fall into the same trap I did with x86, where I confused actual gain with capacity (When idle the actual gain is 0, but the capacity is not). --
Oops, yes of course :-) Let me know if/when you come up this solution or if I can help. Mikey --
Peter, Did you manage to get anywhere on this capacity issue? Mikey --
Right, so the thing I was thinking about is taking the group capacity
into account when determining the capacity for a single cpu.
Say the group contains all the SMT siblings, then use the group capacity
(usually larger than 1024) and then distribute the capacity over the
group members, preferring CPUs with higher individual cpu_power over
those with less.
So suppose you've got 4 siblings with cpu_power=294 each, then we assign
capacity 1 to the first member, and the remaining 153 is insufficient,
and thus we stop and the rest lives with 0 capacity.
Now take the example that the first sibling would be running a heavy RT
load, and its cpu_power would be reduced to say, 50, then we still got
nearly 933 left over the others, which is still sufficient for one
capacity, but because the first sibling is low, we'll assign it 0 and
instead assign 1 to the second, again, leaving the third and fourth 0.
If the group were a core group, the total would be much higher and we'd
likely end up assigning 1 to each before we'd run out of capacity.
For power savings, we can lower the threshold and maybe use the maximal
individual cpu_power in the group to base 1 capacity from.
So, suppose the second example, where sibling0 has 50 and the others
have 294, you'd end up with a capacity distribution of: {0,1,1,1}.
--
Hi Peter, This is a tricky case because we are depending upon the DIV_ROUND_CLOSEST to decide whether to flag capacity to 0 or 1. We will not have any task movement until capacity is depleted to quite low value due to RT task. Having a threshold to flag 0/1 instead of DIV_ROUND_CLOSEST just like you have suggested in the power savings One challenge here is that if RT tasks run on more that one thread in this group, we will have slightly different cpu powers. Arranging them from max to min and having a cutoff threshold should work. Should we keep the RT scaling as a separate entity along with cpu_power to simplify these thresholds. Whenever we need to scale group load with cpu power can take the product of cpu_power and scale_rt_power but in these cases where we compute capacity, we can mark a 0 or 1 just based on whether scale_rt_power was less than SCHED_LOAD_SCALE or not. Alternatively we can keep cpu_power as a product of all scaling factors as it is today but save the component scale factors also like scale_rt_power() and arch_scale_freq_power() so that it can be used in load balance decisions. Basically in power save balance we would give all threads a capacity '1' unless the cpu_power was reduced due to RT task. Similarly in the non-power save case, we can have flag 1,0,0,0 unless first thread had a RT scaling during the last interval. I am suggesting to distinguish the reduction is cpu_power due to architectural (hardware DVFS) reasons from RT tasks so that it is easy to decide if moving tasks to sibling thread or core can help or not. --Vaidy --
Right, well we could put the threshold higher than the 50%, say 90% or Right, so the question is, do we only care about RT or should capacity reflect the full asymmetric MP case. I don't quite see why RT is special from any of the other scale factors, if someone pegged one core at half the frequency of the others you'd still want it to get 0 capacity so that we only try to populate it on For power savings such a special heuristic _might_ make sense. --
Peter,
We are exploring an alternate solution which seems to be working as
expected. Basically allow capacity of 1 for SMT threads provided there is
no significant influence by RT tasks or freq scaling. Note that at core level,
capacity is unchanged and hence this affects only how tasks are distributed
within a core.
Mike Neuling should post an updated patchset containing this patch
(with more comments added ofcourse!).
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
include/linux/sched.h | 2 +-
kernel/sched_fair.c | 30 +++++++++++++++++++++++-------
2 files changed, 24 insertions(+), 8 deletions(-)
Index: linux-2.6-ozlabs/include/linux/sched.h
===================================================================
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -860,7 +860,7 @@ struct sched_group {
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
* single CPU.
*/
- unsigned int cpu_power;
+ unsigned int cpu_power, cpu_power_orig;
/*
* The CPUs this group covers.
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2285,13 +2285,6 @@ static void update_cpu_power(struct sche
unsigned long power = SCHED_LOAD_SCALE;
struct sched_group *sdg = sd->groups;
- if (sched_feat(ARCH_POWER))
- power *= arch_scale_freq_power(sd, cpu);
- else
- power *= default_scale_freq_power(sd, cpu);
-
- power >>= SCHED_LOAD_SHIFT;
-
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
if (sched_feat(ARCH_POWER))
power *= arch_scale_smt_power(sd, cpu);
@@ -2301,6 +2294,15 @@ static void update_cpu_power(struct sche
power >>= SCHED_LOAD_SHIFT;
}
+ sdg->cpu_power_orig = power;
+
+ if (sched_feat(ARCH_POWER))
+ power *= arch_scale_freq_power(sd, cpu);
+ else
+ power *= default_scale_freq_power(sd, ...