[RFC PATCH v2 6/7] sched: bias task wakeups to preferred semi-idle packages

Previous thread: [patch 1/3] s3cmci -- support odd block transfers by Christer Weinigel on Monday, September 8, 2008 - 5:48 am. (1 message)

Next thread: [GIT PULL] avr32 fixes for 2.6.27 by Haavard Skinnemoen on Monday, September 8, 2008 - 6:18 am. (1 message)
From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:14 am

Hi,

The existing power saving loadbalancer CONFIG_SCHED_MC attempts to run
the workload in the system on minimum number of CPU packages and tries
to keep rest of the CPU packages idle for longer duration. Thus
consolidating workloads to fewer packages help other packages to be in
idle state and save power.  The current implementation is very
conservative and does not work effectively across different workloads.
Initial idea of tunable sched_mc_power_savings=n was proposed to
enable tuning of the power saving load balancer based on the system
configuration, workload characteristics and end user requirements.

Please refer to the following discussions and article for details.

Making power policy just work
http://lwn.net/Articles/287924/ 
[RFC v1] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/287882/

The following series of patch demonstrates the basic framework for
tunable sched_mc_power_savings.

The power savings and performance of the given workload in an under
utilised system can be controlled by setting values of 0, 1 or 2 to
/sys/devices/system/cpu/sched_mc_power_savings with 0 being highest
performance and least power savings and level 2 indicating maximum
power savings even at the cost of slight performance degradation.

enum powersavings_balance_level {
	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
	POWERSAVINGS_BALANCE_BASIC,	/* Fill one thread/core/package 
					 * first for long running threads 
					 */ 
	POWERSAVINGS_BALANCE_WAKEUP,	/* Also bias task wakeups to semi-idle
					 * cpu package for power savings
					 */
	MAX_POWERSAVINGS_BALANCE_LEVELS
};

sched_mc_power_savings values of 0 and 1 are implemented and available
in the default kernel.  The new level of 2 support task wakeup biasing
that helps in consolidating very short running bursty jobs in an
almost idle system.  This level of task consolidation packs most
workloads to one cpu package and extends the tickless idle time for
unused cpu packages.

Changes ...
From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:16 am

From: Max Krasnyansky <maxk@qualcomm.com>

What I realized recently is that calling rebuild_sched_domains() in
arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
partition_sched_domains() code is trying to avoid unnecessary domain rebuilds
and will not actually rebuild anything if new domain masks match the old ones.

What this means is that doing
     echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
on a system with cpusets enabled will not take affect untill something changes
in the cpuset setup (ie new sets created or deleted).

This patch fixes restore correct behaviour where domains must be rebuilt in
order to enable MC powersaving flags.

Test on quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS.
Also tested on dual-core Core2 laptop. Lockdep is happy and things are working
as expected.

Ingo, please apply.
btw We also need to push my other cpuset patch into mainline. We currently
calling rebuild_sched_domains() without cgroup lock which is bad. When I made
original sched: changes the assumption was that cpuset patch will also go in.
I'm talking about
	"cpuset: Rework sched domains and CPU hotplug handling"
It's been ACKed by Paul has been in the -tip for awhile now.

Reference LKML threads:

http://lkml.org/lkml/2008/8/29/191
http://lkml.org/lkml/2008/8/29/343

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: svaidy@linux.vnet.ibm.com
Cc: peterz@infradead.org
Cc: mingo@elte.hu
Tested-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 include/linux/cpuset.h |    2 +-
 kernel/sched.c         |   19 +++++++++++++------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e8f450c..2691926 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -160,7 +160,7 @@ static inline int current_cpuset_is_being_rebound(void)
 
 static inline void rebuild_sched_domains(void)
 {
-	partition_sched_domains(0, NULL, ...
From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:17 am

From: Gautham R Shenoy <ego@in.ibm.com>

The __load_balance_iterator() returns a NULL when there's only one
sched_entity which is a task. It is caused by the following code-path.


	/* Skip over entities that are not tasks */
	do {
		se = list_entry(next, struct sched_entity, group_node);
		next = next->next;
	} while (next != &cfs_rq->tasks && !entity_is_task(se));

	if (next == &cfs_rq->tasks)
		return NULL;
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      This will return NULL even when se is a task.

As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
since iter_move_one_task() when it calls load_balance_start_fair(),
would not get any tasks to move!

Fix this by checking if the last entity was a task or not.

Reference:
http://lkml.org/lkml/2008/9/5/135

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
---

 kernel/sched_fair.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index fb8994c..f1c96e3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1451,7 +1451,7 @@ __load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
 		next = next->next;
 	} while (next != &cfs_rq->tasks && !entity_is_task(se));
 
-	if (next == &cfs_rq->tasks)
+	if (next == &cfs_rq->tasks && !entity_is_task(se))
 		return NULL;
 
 	cfs_rq->balance_iterator = next;

--

From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:18 am

From: Gautham R Shenoy <ego@in.ibm.com>

***  RFC patch of work in progress and not for inclusion. ***

Currently the sched_mc/smt_power_savings variable is a boolean, which either
enables or disables topology based power savings. This extends the behaviour of
the variable from boolean to multivalued, such that based on the value, we
decide how aggressively do we want to perform topology based powersavings
balance.

Variable levels of power saving tunable would benefit end user to match the
required level of power savings vs performance trade off depending on the
system configuration and workloads.

This initial version makes the sched_mc_power_savings global variable to take
more values (0,1,2).

Later version is expected to add new member sd->powersavings_level at the multi
core CPU level sched_domain. This make all sd->flags check for
SD_POWERSAVINGS_BALANCE into a different macro that will check for
powersavings_level.

The power savings level setting should be in one place either in the
sched_mc_power_savings global variable or contained within the appropriate
sched_domain structure.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 include/linux/sched.h |   11 +++++++++++
 kernel/sched.c        |   16 +++++++++++++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfb0d87..7c12240 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -721,6 +721,17 @@ enum cpu_idle_type {
 #define SD_SERIALIZE		1024	/* Only a single load balancing instance */
 #define SD_WAKE_IDLE_FAR	2048	/* Gain latency sacrificing cache hit */
 
+enum powersavings_balance_level {
+	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
+	POWERSAVINGS_BALANCE_BASIC,	/* Fill one thread/core/package
+					 * first for long running threads
+					 */
+	POWERSAVINGS_BALANCE_WAKEUP,	/* Also bias task wakeups to semi-idle
+					 * ...
From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:20 am

Just in case two groups have identical load, prefer to move load to lower
logical cpu number rather than the present logic of moving to higher logical
number.

find_busiest_group() tries to look for a group_leader that has spare capacity
to take more tasks and freeup an appropriate least loaded group.  Just in case
there is a tie and the load is equal, then the group with higher logical number
is favoured.  This conflicts with user space irqbalance daemon that will move
interrupts to lower logical number if the system utilisation is very low.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index e535f98..569fc8d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3237,7 +3237,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 		 */
 		if ((sum_nr_running < min_nr_running) ||
 		    (sum_nr_running == min_nr_running &&
-		     first_cpu(group->cpumask) <
+		     first_cpu(group->cpumask) >
 		     first_cpu(group_min->cpumask))) {
 			group_min = group;
 			min_nr_running = sum_nr_running;
@@ -3253,7 +3253,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 		if (sum_nr_running <= group_capacity - 1) {
 			if (sum_nr_running > leader_nr_running ||
 			    (sum_nr_running == leader_nr_running &&
-			     first_cpu(group->cpumask) >
+			     first_cpu(group->cpumask) <
 			      first_cpu(group_leader->cpumask))) {
 				group_leader = group;
 				leader_nr_running = sum_nr_running;

--

From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:21 am

When the system utilisation is low and more cpus are idle,
then the process waking up from sleep should prefer to
wakeup an idle cpu from semi-idle cpu package (multi core
package) rather than a completely idle cpu package which
would waste power.

Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.

This info can be sored in appropriate sched_domain, but
updating this info in all copies of sched_domain is not
practical.  For now lets try with a global variable.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 569fc8d..4ae79f5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3380,6 +3380,9 @@ out_balanced:
 
 	if (this == group_leader && group_leader != group_min) {
 		*imbalance = min_load_per_task;
+		if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
+			sched_mc_preferred_wakeup_cpu =
+					first_cpu(group_leader->cpumask);
 		return group_min;
 	}
 #endif
@@ -6911,6 +6914,13 @@ static void sched_domain_node_span(int node, cpumask_t *span)
 int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
 
 /*
+ * Preferred wake up cpu nominated by sched_mc balance that will be used when
+ * most cpus are idle in the system indicating overall very low system
+ * utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP (2).
+ */
+unsigned int sched_mc_preferred_wakeup_cpu;
+
+/*
  * SMT sched-domains:
  */
 #ifdef CONFIG_SCHED_SMT

--

From: Peter Zijlstra
Date: Monday, September 8, 2008 - 6:21 am

This cannot be a global variable, what happens when we have two disjoint

--

From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:43 am

Agreed this is certainly a problem.  I tried adding this to the
sched_domain, but accessing the correct 'copy' for sched_domain that
holds this variable from any cpu is not fast.  

Thank you for pointing this out.  I will find a alternative
implementation.

--Vaidy
--

From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:22 am

Preferred wakeup cpu (from a semi idle package) has been
nominated in find_busiest_group() in the previous patch.  Use
this information in sched_mc_preferred_wakeup_cpu in function
select_task_rq_fair() to bias task wakeups if the following
conditions are satisfied:
	- The present cpu that is trying to wakeup the process is
	  idle and waking the target process on this cpu will
	  potentially wakeup a completely idle package
	- The previous cpu on which the target process ran is
	  also idle and hence selecting the previous cpu may
	  wakeup a semi idle cpu package
	- The task being woken up is allowed to run in the
	  nominated cpu (cpu affinity and restrictions)

Basically if both the current cpu and the previous cpu on
which the task ran is idle, select the nominated cpu from semi
idle cpu package for running the new task that is waking up.

Cache hotness and system utilisation could also be considered
before the decision is made.  (Not currently implemented)

This technique will effectively move short running bursty jobs in
an mostly idle system.

Wakeup biasing for power savings gets automatically disabled if
system utilisation increases due to the fact that the probability
of finding both this_cpu and prev_cpu idle decreases.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched_fair.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f1c96e3..1301521 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -73,6 +73,9 @@ unsigned int sysctl_sched_wakeup_granularity = 5000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
+/* Preferred wakeup cpu nominated by sched_mc load balancing logic */
+extern unsigned int sched_mc_preferred_wakeup_cpu;
+
 /**************************************************************
  * CFS operations on generic schedulable entities:
  */
@@ -1232,6 +1235,16 @@ static int ...
From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:23 am

Active load balancing is a process by which migration thread
is woken up on the target CPU in order to pull current
running task on another package into this newly idle
package.

This method is already in use with normal load_balance(),
this patch introduces this method to new idle cpus when
sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.

This logic provides effective consolidation of short running
daemon jobs in a almost idle system

The side effect of this patch may be ping-ponging of tasks
if the system is moderately utilised. May need to adjust the
iterations before triggering.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4ae79f5..e47e8e8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3656,10 +3656,40 @@ redo:
 	}
 
 	if (!ld_moved) {
+		int active_balance;
+		unsigned long flags;
+
 		schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
 		if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
 		    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
 			return -1;
+
+		if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+			return -1;
+
+		if (sd->nr_balance_failed++ < 1)
+			return -1;
+
+		spin_lock_irqsave(&busiest->lock, flags);
+
+		/* don't kick the migration_thread, if the curr
+		 * task on busiest cpu can't be moved to this_cpu
+		 */
+		if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+			spin_unlock_irqrestore(&busiest->lock, flags);
+			all_pinned = 1;
+			return ld_moved;
+		}
+
+		if (!busiest->active_balance) {
+			busiest->active_balance = 1;
+			busiest->push_cpu = this_cpu;
+			active_balance = 1;
+		}
+		spin_unlock_irqrestore(&busiest->lock, flags);
+		if (active_balance)
+			wake_up_process(busiest->migration_thread);
+
 	} else
 		sd->nr_balance_failed = 0;
 

--

From: Peter Zijlstra
Date: Monday, September 8, 2008 - 6:25 am

May I again ask to first clean up the current power saving code before
stacking more on top of it?

--

From: Vaidyanathan Srinivasan
Date: Monday, September 8, 2008 - 6:48 am

current power save balance code:

1. Detailed documentation
2. Cleanup the group_min and group_leader stuff in
   find_busiest_group()

Did I get the requirements correct?

--Vaidy

--

From: Andi Kleen
Date: Monday, September 8, 2008 - 6:58 am

> 1. Detailed documentation

Messy code cannot be really made good with documentation. It's
not that your patches are that messy, it's more that it makes

I think one issue is that there are general too many special cases
that completely change the algorithm especially for power saving.
Perhaps it would make sense to refactor the code a bit and then
use different high level code paths for those?  I assume that
would make it all simpler and easier to understand.

The other alternative would be to dynamically change the domains
so that a generic graph walker without knowledge of power savings
could DTRT in all cases. But I assume that would be much harder.

-Andi


--

From: Vaidyanathan Srinivasan
Date: Wednesday, September 10, 2008 - 6:45 am

Hi Andi,

I will try to refactor the code and see if it can look cleaner.  Power
saving balance is actually just a corner case in general load_balance.
We will have to do default load_balance for performance each time we
enter this section of the scheduler, except when certain complex
conditions are met and we see an opportunity to save power. When that
window of opportunity to save power exist, the code will take
a different path and do things that will normally not be done by the
default load balancer logic.

I think we can split the statistics (load) computation part and the
decision part to separate functions and call the power save balance
conditions first and if there is no opportunity to save power, then

This option may not work because we will have to change the decision
only when a power saving opportunity exist.  Hence more often the
power save balance code should just fall through into the default
balancer.  Splitting them this way may add duplicate code in both the
paths.

--Vaidy
--

From: Peter Zijlstra
Date: Monday, September 8, 2008 - 6:56 am

Preferably in the form of in-code comments and code structure, this

That would be much appreciated. 

But I also prefer to get rid of that power savings tweak in
cpu_coregroup_map(). 

But above all, readable code ;-)

find_busiest_group() is the stuff of nightmares.

--

From: Suresh Siddha
Date: Monday, September 8, 2008 - 6:20 pm

Peter, Almost every if() stmt/basic block  in the power savings code has
comments around it. And also power-savings code is 50 lines (mostly comments)

Why? Based on the power vs perf, we wanted to construct topologies
differently. Reason for the complexity is, in some of the Intel cpu's,
while the cores share the same package they have different last level caches.
So for performance, we want to differentiate based on last level caches

power-savings code is very small part of that nightmare :) That code
became complex over years with HT, smp-nice etc.

I haven't been following recent sched changes. I can take a look at it
and see what I can do to better organize find_busiest_group()

thanks,
suresh
--

From: Peter Zijlstra
Date: Monday, September 8, 2008 - 11:18 pm

Sure, and those comments only tell me what it does, not why it does it.

But then why not add a domain level that represents the packages and
power schedule on that? That way you dont have to change the domain
structure depending on the load balance goals.

Also, that justification is just wrong - AMD has similar constructs in
its cpus, and god knows what other architectures do, so hiding this in

Yes I understand, but we really need to do something about it - and
everybody with interests in it should be looking at ways to reduce the

You and me both then :-)

I've been looking at the history of that function - it started out quite
readable - but has, over the years, grown into a monstrosity.

One of the things I'm currently looking at is getting rid of the
small_imbalance stuff - that made sense when we balanced based on
nr_running, but now that we're weight based and willing to over-balance
a little I'm not sure that it still does.

Also, it looks to me that Christoph Lameters fix
0a2966b48fb784e437520e400ddc94874ddbd4e8 introduced a bug. By skipping
the cpu in the group accumulation the group average computation below
needs to be adjusted, because it uses the group power stuff, which
assumes all cpus contributed.

Then there is this whole sched_group stuff, which I intent to have a
hard look at, afaict its unneeded and we can iterate over the
sub-domains just as well.

Finally, we should move all this stuff into sched_fair and get rid of
that iterator interface and fix up all nr_running etc.. usages to refer
to cfs.nr_running and similar.

Then there is the idea Andi proposed, splitting up the performance and
power balancer into two separate functions, something that is worth
looking into imho.

--

From: Nick Piggin
Date: Monday, September 8, 2008 - 11:31 pm

I agree it is terrible, and subsequent "features" weren't really properly

What sub-domains? The domains-minus-groups are just a graph (in existing
setup code AFAIK just a line) of cpumasks. You have to group because you
want enough control for example not to pull load from an unusually busy
CPU from one group if it's load should actually be spread out over a
smaller domain (ie. probably other CPUs within the group we're looking at).

It would be nice if you could make it simpler of course, but I just don't
understand you or maybe you thought of some other way to solve this or

That's what *I* suggested. Before it even went in. Of course there was no
--

From: Peter Zijlstra
Date: Monday, September 8, 2008 - 11:54 pm

Right, I get the domain stuff - that's good stuff.

But, let my try and confuse you with ASCII-art ;-)

             Domain [0-7]
       group [0-3]  group [4-7]

     Domain [0-3]      
  group[0-1]  [group2-3]

Domain [0-1]
group 0 group 1

(right hand side not drawn due to lack of space etc...)

So we have this tree of domains, which is cool stuff. But then we have
these groups in there, which closely match up with the domain's child
domains.

So my idea was to ditch the groups and just iterate over the child

Even more reason to make it happen.

--

From: Nick Piggin
Date: Tuesday, September 9, 2008 - 12:59 am

But it's all per-cpu, so you'd have to iterate down other CPU's child
domains. Which may get dirtied by that CPU. So you get cacheline
bounces.

You also lose flexibility (although nobody really takes full advantage

I'm not saying you couldn't do it (reasonably well -- cacheline bouncing
might be a problem if you propose to traverse other CPU's domains), but

Yes it would be great if it happens.
--

From: Peter Zijlstra
Date: Tuesday, September 9, 2008 - 1:25 am

Humm, are you saying each cpu has its own domain tree? My understanding
was that its a global structure, eg. given:

   domain[0-1]

domain[0] domain[1]


Afaict the only flexibility you loose is that you cannot make groups
larger/smaller than the child domain - which given that the whole
premesis of the groups existence is that the inner-group balancing

Those cacheline bounces could be mitigated by splitting sched_domain
into two parts with a cacheline aligned dummy and keep the rarely
modified data separate from the frequently modified data.

As to the gains - a graph walk with a single type seems more elegant to
me.

--

From: Nick Piggin
Date: Tuesday, September 9, 2008 - 2:03 am

It's fundamentally two different things anyway though. I don't
see any theoretical improvement, and it definitely wouldn't
improve the practical side much if any because the biggest problem
I don't think is the simple walks themselves but the calculations
and stuff.

If it can yield something clearly better that is impossible using
domains and groups, I could change my mind.

--

Previous thread: [patch 1/3] s3cmci -- support odd block transfers by Christer Weinigel on Monday, September 8, 2008 - 5:48 am. (1 message)

Next thread: [GIT PULL] avr32 fixes for 2.6.27 by Haavard Skinnemoen on Monday, September 8, 2008 - 6:18 am. (1 message)