Re: [PATCH 1/5] sched: fix capacity calculations for SMT4

Previous thread: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing by Michael Neuling on Thursday, April 8, 2010 - 11:21 pm. (4 messages)

Next thread: [PATCH 4/5] sched: Mark the balance type for use in need_active_balance() by Michael Neuling on Thursday, April 8, 2010 - 11:21 pm. (3 messages)
From: Michael Neuling
Date: Thursday, April 8, 2010 - 11:21 pm

When calculating capacity we use the following calculation:

       capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);

In SMT2, power will be 1178/2 (provided we are not scaling power with
freq say) and SCHED_LOAD_SCALE will be 1024, resulting in capacity
being 1.

With SMT4 however, power will be 1178/4, hence capacity will end up as
0.

Fix this by ensuring capacity is always at least 1 after this
calculation.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
I'm not sure this is the correct fix but this works for me.  
Original post here: 
  http://lkml.org/lkml/2010/3/30/884

---

 kernel/sched_fair.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -1482,6 +1482,7 @@ static int select_task_rq_fair(struct ta
 			}
 
 			capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+			capacity = max(capacity, 1UL);
 
 			if (tmp->flags & SD_POWERSAVINGS_BALANCE)
 				nr_running /= 2;
@@ -2488,6 +2489,7 @@ static inline void update_sg_lb_stats(st
 
 	sgs->group_capacity =
 		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
+	sgs->group_capacity = max(sgs->group_capacity, 1UL);
 }
 
 /**
@@ -2795,9 +2797,11 @@ find_busiest_queue(struct sched_group *g
 
 	for_each_cpu(i, sched_group_cpus(group)) {
 		unsigned long power = power_of(i);
-		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
+		unsigned long capacity;
 		unsigned long wl;
 
+		capacity = max(DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE), 1UL);
+
 		if (!cpumask_test_cpu(i, cpus))
 			continue;
 
--

From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 5:29 am

Right, so I suspect this will indeed break some things.

We initially allowed 0 capacity for when a cpu is consumed by an RT task
and there simply isn't much capacity left, in that case you really want
to try and move load to your sibling cpus if possible.

However you're right that this goes awry in your case.

One thing to look at is if that 15% increase is indeed representative
for the power7 cpu, it having 4 SMT threads seems to suggest there was
significant gains, otherwise they'd not have wasted the silicon.

(The broken x86 code was meant to actually compute the SMT gain, so that
we'd not have to guess the 15%)

Now, increasing this will only marginally fix the issue, since if you
end up with 512 per thread it only takes a very tiny amount of RT
workload to drop below and end up at 0 again.

One thing we could look at is using the cpu base power to compute
capacity from. We'd have to add another field to sched_group and store
power before we do the scale_rt_power() stuff.


--

From: Michael Neuling
Date: Tuesday, April 13, 2010 - 9:28 pm

Changing the CPU power based on what tasks are running on them seems a

There are certainly, for most workloads, per core gains for SMT4 over
SMT2 on P7.  My kernels certainly compile faster and that's the only

I tried initialled to make smt_gain programmable and at 2048 per core

Separating capacity from what RT tasks are running seems like a good
idea to me.

This would fix the RT issue, but it's not clear to me how you are
suggesting fixing the rounding down to 0 SMT4 issue.  Are you suggesting
we bump smt_gain to say 2048 + 15%?  Or are you suggesting we separate
the RT tasks out from capacity and keep the max(1, capacity) that I've
added?  Or something else?

Would another possibility be changing capacity a scaled value (like
cpu_power is now) rather than a small integer as it is now.  For
example, a scaled capacity of 1024 would be equivalent to a capacity of
1 now.  This might enable us to handle partial capacities better?  We'd
probably have to scale a bunch of nr_running too.  

Mikey

--

From: Peter Zijlstra
Date: Friday, April 16, 2010 - 6:58 am

Well the thing cpu_power represents is a ratio of compute capacity
available to this cpu as compared to other cpus. By normalizing the
runqueue weights with this we end up with a fair balance.

The thing to realize here is that this is solely about SCHED_NORMAL
tasks, SCHED_FIFO/RR (or the proposed DEADLINE) tasks do not care about
fairness and available compute capacity.

So if we were to ignore RT tasks, you'd end up with a situation where,
assuming 2 cpus and 4 equally weighted NORMAL tasks, and 1 RT task, the
load-balancer would give each cpu 2 NORMAL tasks, but the tasks that
would end up on the cpu the RT tasks would be running on would not run
as fast -- is that fair?

Since RT tasks do not have a weight (FIFO/RR have no limit at all,
DEADLINE would have something equivalent to a max weight), it is
impossible to account them in the normal weight sense.

Therefore the current model takes them into account by lowering the
compute capacity according to their (avg) cpu usage. So if the RT task
would consume 66% cputime, we'd end up with a situation where the cpu
running the RT task would get 1 NORMAL task, and other cpu would have

For sure ;-)

Are there any numbers available on how much they gain? It might be worth


I would think that 4 SMT threads are still slower than two full cores,

Right, so my proposal was to scale down the capacity divider (currently
1024) to whatever would be the base capacity for that cpu. Trouble seems
to be that that makes group capacity a lot more complex, as you would
end up needing to average all the cpu's their base capacity.


Hrmm, my brain seems muddled but I might have another solution, let me
ponder this for a bit..



--

From: Michael Neuling
Date: Sunday, April 18, 2010 - 2:34 pm

Thanks for the explanation.  Your last example makes perfect sense for

I get some gain numbers but obviously the workloads makes a huge
difference.  From a scheduler perspective, I assume an
average/representative gain is best rather than an optimistic or
pessimistic one?

We'll have different gains for SMT2 and SMT4, so we could change the
gain dynamically based on which SMT mode we are in.  Does that seem like


Yes for an average workload, 4 SMT threads are slower than 2 full

Great!

Mikey
--

From: Peter Zijlstra
Date: Monday, April 19, 2010 - 7:49 am

That's the sort of thing you can use arch_scale_smt_power() for. But be
weary to not fall into the same trap I did with x86, where I confused
actual gain with capacity (When idle the actual gain is 0, but the
capacity is not).




--

From: Michael Neuling
Date: Monday, April 19, 2010 - 1:45 pm

Oops, yes of course :-)


Let me know if/when you come up this solution or if I can help.  

Mikey
--

From: Michael Neuling
Date: Wednesday, April 28, 2010 - 11:55 pm

Peter,

Did you manage to get anywhere on this capacity issue?

Mikey
--

From: Peter Zijlstra
Date: Monday, May 31, 2010 - 1:33 am

Right, so the thing I was thinking about is taking the group capacity
into account when determining the capacity for a single cpu.

Say the group contains all the SMT siblings, then use the group capacity
(usually larger than 1024) and then distribute the capacity over the
group members, preferring CPUs with higher individual cpu_power over
those with less.

So suppose you've got 4 siblings with cpu_power=294 each, then we assign
capacity 1 to the first member, and the remaining 153 is insufficient,
and thus we stop and the rest lives with 0 capacity.

Now take the example that the first sibling would be running a heavy RT
load, and its cpu_power would be reduced to say, 50, then we still got
nearly 933 left over the others, which is still sufficient for one
capacity, but because the first sibling is low, we'll assign it 0 and
instead assign 1 to the second, again, leaving the third and fourth 0.

If the group were a core group, the total would be much higher and we'd
likely end up assigning 1 to each before we'd run out of capacity.


For power savings, we can lower the threshold and maybe use the maximal
individual cpu_power in the group to base 1 capacity from.

So, suppose the second example, where sibling0 has 50 and the others
have 294, you'd end up with a capacity distribution of: {0,1,1,1}.


--

From: Vaidyanathan Srinivasan
Date: Tuesday, June 1, 2010 - 3:52 pm

Hi Peter,


This is a tricky case because we are depending upon the
DIV_ROUND_CLOSEST to decide whether to flag capacity to 0 or 1.  We
will not have any task movement until capacity is depleted to quite
low value due to RT task.  Having a threshold to flag 0/1 instead of
DIV_ROUND_CLOSEST just like you have suggested in the power savings

One challenge here is that if RT tasks run on more that one thread in
this group, we will have slightly different cpu powers.  Arranging
them from max to min and having a cutoff threshold should work.

Should we keep the RT scaling as a separate entity along with
cpu_power to simplify these thresholds.  Whenever we need to scale
group load with cpu power can take the product of cpu_power and
scale_rt_power but in these cases where we compute capacity, we can
mark a 0 or 1 just based on whether scale_rt_power was less than
SCHED_LOAD_SCALE or not.  Alternatively we can keep cpu_power as
a product of all scaling factors as it is today but save the component
scale factors also like scale_rt_power() and arch_scale_freq_power()
so that it can be used in load balance decisions.

Basically in power save balance we would give all threads a capacity
'1' unless the cpu_power was reduced due to RT task.  Similarly in
the non-power save case, we can have flag 1,0,0,0 unless first thread
had a RT scaling during the last interval.

I am suggesting to distinguish the reduction is cpu_power due to
architectural (hardware DVFS) reasons from RT tasks so that it is easy
to decide if moving tasks to sibling thread or core can help or not.

--Vaidy

--

From: Peter Zijlstra
Date: Thursday, June 3, 2010 - 1:56 am

Right, well we could put the threshold higher than the 50%, say 90% or


Right, so the question is, do we only care about RT or should capacity
reflect the full asymmetric MP case.

I don't quite see why RT is special from any of the other scale factors,
if someone pegged one core at half the frequency of the others you'd
still want it to get 0 capacity so that we only try to populate it on

For power savings such a special heuristic _might_ make sense.
--

From: Srivatsa Vaddagiri
Date: Monday, June 7, 2010 - 8:06 am

Peter,
	We are exploring an alternate solution which seems to be working as
expected. Basically allow capacity of 1 for SMT threads provided there is
no significant influence by RT tasks or freq scaling. Note that at core level,
capacity is unchanged and hence this affects only how tasks are distributed
within a core.

Mike Neuling should post an updated patchset containing this patch
(with more comments added ofcourse!).


Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 include/linux/sched.h |    2 +-
 kernel/sched_fair.c   |   30 +++++++++++++++++++++++-------
 2 files changed, 24 insertions(+), 8 deletions(-)

Index: linux-2.6-ozlabs/include/linux/sched.h
===================================================================
--- linux-2.6-ozlabs.orig/include/linux/sched.h
+++ linux-2.6-ozlabs/include/linux/sched.h
@@ -860,7 +860,7 @@ struct sched_group {
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
 	 */
-	unsigned int cpu_power;
+	unsigned int cpu_power, cpu_power_orig;
 
 	/*
 	 * The CPUs this group covers.
Index: linux-2.6-ozlabs/kernel/sched_fair.c
===================================================================
--- linux-2.6-ozlabs.orig/kernel/sched_fair.c
+++ linux-2.6-ozlabs/kernel/sched_fair.c
@@ -2285,13 +2285,6 @@ static void update_cpu_power(struct sche
 	unsigned long power = SCHED_LOAD_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	if (sched_feat(ARCH_POWER))
-		power *= arch_scale_freq_power(sd, cpu);
-	else
-		power *= default_scale_freq_power(sd, cpu);
-
-	power >>= SCHED_LOAD_SHIFT;
-
 	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
 		if (sched_feat(ARCH_POWER))
 			power *= arch_scale_smt_power(sd, cpu);
@@ -2301,6 +2294,15 @@ static void update_cpu_power(struct sche
 		power >>= SCHED_LOAD_SHIFT;
 	}
 
+	sdg->cpu_power_orig = power;
+
+	if (sched_feat(ARCH_POWER))
+		power *= arch_scale_freq_power(sd, cpu);
+	else
+		power *= default_scale_freq_power(sd, ...
Previous thread: [PATCH 5/5] sched: make fix_small_imbalance work with asymmetric packing by Michael Neuling on Thursday, April 8, 2010 - 11:21 pm. (4 messages)

Next thread: [PATCH 4/5] sched: Mark the balance type for use in need_active_balance() by Michael Neuling on Thursday, April 8, 2010 - 11:21 pm. (3 messages)