Hi, The existing power saving loadbalancer CONFIG_SCHED_MC attempts to run the workload in the system on minimum number of CPU packages and tries to keep rest of the CPU packages idle for longer duration. Thus consolidating workloads to fewer packages help other packages to be in idle state and save power. The current implementation is very conservative and does not work effectively across different workloads. Initial idea of tunable sched_mc_power_savings=n was proposed to enable tuning of the power saving load balancer based on the system configuration, workload characteristics and end user requirements. Please refer to the following discussions and article for details. Making power policy just work http://lwn.net/Articles/287924/ [RFC v1] Tunable sched_mc_power_savings=n http://lwn.net/Articles/287882/ The following series of patch demonstrates the basic framework for tunable sched_mc_power_savings. The power savings and performance of the given workload in an under utilised system can be controlled by setting values of 0, 1 or 2 to /sys/devices/system/cpu/sched_mc_power_savings with 0 being highest performance and least power savings and level 2 indicating maximum power savings even at the cost of slight performance degradation. enum powersavings_balance_level { POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */ POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package * first for long running threads */ POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle * cpu package for power savings */ MAX_POWERSAVINGS_BALANCE_LEVELS }; sched_mc_power_savings values of 0 and 1 are implemented and available in the default kernel. The new level of 2 support task wakeup biasing that helps in consolidating very short running bursty jobs in an almost idle system. This level of task consolidation packs most workloads to one cpu package and extends the tickless idle time for unused cpu packages. Changes ...
From: Max Krasnyansky <maxk@qualcomm.com>
What I realized recently is that calling rebuild_sched_domains() in
arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
partition_sched_domains() code is trying to avoid unnecessary domain rebuilds
and will not actually rebuild anything if new domain masks match the old ones.
What this means is that doing
echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
on a system with cpusets enabled will not take affect untill something changes
in the cpuset setup (ie new sets created or deleted).
This patch fixes restore correct behaviour where domains must be rebuilt in
order to enable MC powersaving flags.
Test on quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS.
Also tested on dual-core Core2 laptop. Lockdep is happy and things are working
as expected.
Ingo, please apply.
btw We also need to push my other cpuset patch into mainline. We currently
calling rebuild_sched_domains() without cgroup lock which is bad. When I made
original sched: changes the assumption was that cpuset patch will also go in.
I'm talking about
"cpuset: Rework sched domains and CPU hotplug handling"
It's been ACKed by Paul has been in the -tip for awhile now.
Reference LKML threads:
http://lkml.org/lkml/2008/8/29/191
http://lkml.org/lkml/2008/8/29/343
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: svaidy@linux.vnet.ibm.com
Cc: peterz@infradead.org
Cc: mingo@elte.hu
Tested-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
include/linux/cpuset.h | 2 +-
kernel/sched.c | 19 +++++++++++++------
2 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e8f450c..2691926 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -160,7 +160,7 @@ static inline int current_cpuset_is_being_rebound(void)
static inline void rebuild_sched_domains(void)
{
- partition_sched_domains(0, NULL, ...From: Gautham R Shenoy <ego@in.ibm.com>
The __load_balance_iterator() returns a NULL when there's only one
sched_entity which is a task. It is caused by the following code-path.
/* Skip over entities that are not tasks */
do {
se = list_entry(next, struct sched_entity, group_node);
next = next->next;
} while (next != &cfs_rq->tasks && !entity_is_task(se));
if (next == &cfs_rq->tasks)
return NULL;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This will return NULL even when se is a task.
As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
since iter_move_one_task() when it calls load_balance_start_fair(),
would not get any tasks to move!
Fix this by checking if the last entity was a task or not.
Reference:
http://lkml.org/lkml/2008/9/5/135
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
---
kernel/sched_fair.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index fb8994c..f1c96e3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1451,7 +1451,7 @@ __load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
next = next->next;
} while (next != &cfs_rq->tasks && !entity_is_task(se));
- if (next == &cfs_rq->tasks)
+ if (next == &cfs_rq->tasks && !entity_is_task(se))
return NULL;
cfs_rq->balance_iterator = next;
--
From: Gautham R Shenoy <ego@in.ibm.com>
*** RFC patch of work in progress and not for inclusion. ***
Currently the sched_mc/smt_power_savings variable is a boolean, which either
enables or disables topology based power savings. This extends the behaviour of
the variable from boolean to multivalued, such that based on the value, we
decide how aggressively do we want to perform topology based powersavings
balance.
Variable levels of power saving tunable would benefit end user to match the
required level of power savings vs performance trade off depending on the
system configuration and workloads.
This initial version makes the sched_mc_power_savings global variable to take
more values (0,1,2).
Later version is expected to add new member sd->powersavings_level at the multi
core CPU level sched_domain. This make all sd->flags check for
SD_POWERSAVINGS_BALANCE into a different macro that will check for
powersavings_level.
The power savings level setting should be in one place either in the
sched_mc_power_savings global variable or contained within the appropriate
sched_domain structure.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
include/linux/sched.h | 11 +++++++++++
kernel/sched.c | 16 +++++++++++++---
2 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfb0d87..7c12240 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -721,6 +721,17 @@ enum cpu_idle_type {
#define SD_SERIALIZE 1024 /* Only a single load balancing instance */
#define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */
+enum powersavings_balance_level {
+ POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
+ POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package
+ * first for long running threads
+ */
+ POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
+ * ...Just in case two groups have identical load, prefer to move load to lower
logical cpu number rather than the present logic of moving to higher logical
number.
find_busiest_group() tries to look for a group_leader that has spare capacity
to take more tasks and freeup an appropriate least loaded group. Just in case
there is a tie and the load is equal, then the group with higher logical number
is favoured. This conflicts with user space irqbalance daemon that will move
interrupts to lower logical number if the system utilisation is very low.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
kernel/sched.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index e535f98..569fc8d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3237,7 +3237,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
*/
if ((sum_nr_running < min_nr_running) ||
(sum_nr_running == min_nr_running &&
- first_cpu(group->cpumask) <
+ first_cpu(group->cpumask) >
first_cpu(group_min->cpumask))) {
group_min = group;
min_nr_running = sum_nr_running;
@@ -3253,7 +3253,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (sum_nr_running <= group_capacity - 1) {
if (sum_nr_running > leader_nr_running ||
(sum_nr_running == leader_nr_running &&
- first_cpu(group->cpumask) >
+ first_cpu(group->cpumask) <
first_cpu(group_leader->cpumask))) {
group_leader = group;
leader_nr_running = sum_nr_running;
--
When the system utilisation is low and more cpus are idle,
then the process waking up from sleep should prefer to
wakeup an idle cpu from semi-idle cpu package (multi core
package) rather than a completely idle cpu package which
would waste power.
Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.
This info can be sored in appropriate sched_domain, but
updating this info in all copies of sched_domain is not
practical. For now lets try with a global variable.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
kernel/sched.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 569fc8d..4ae79f5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3380,6 +3380,9 @@ out_balanced:
if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
+ sched_mc_preferred_wakeup_cpu =
+ first_cpu(group_leader->cpumask);
return group_min;
}
#endif
@@ -6911,6 +6914,13 @@ static void sched_domain_node_span(int node, cpumask_t *span)
int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
/*
+ * Preferred wake up cpu nominated by sched_mc balance that will be used when
+ * most cpus are idle in the system indicating overall very low system
+ * utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP (2).
+ */
+unsigned int sched_mc_preferred_wakeup_cpu;
+
+/*
* SMT sched-domains:
*/
#ifdef CONFIG_SCHED_SMT
--
This cannot be a global variable, what happens when we have two disjoint --
Agreed this is certainly a problem. I tried adding this to the sched_domain, but accessing the correct 'copy' for sched_domain that holds this variable from any cpu is not fast. Thank you for pointing this out. I will find a alternative implementation. --Vaidy --
Preferred wakeup cpu (from a semi idle package) has been nominated in find_busiest_group() in the previous patch. Use this information in sched_mc_preferred_wakeup_cpu in function select_task_rq_fair() to bias task wakeups if the following conditions are satisfied: - The present cpu that is trying to wakeup the process is idle and waking the target process on this cpu will potentially wakeup a completely idle package - The previous cpu on which the target process ran is also idle and hence selecting the previous cpu may wakeup a semi idle cpu package - The task being woken up is allowed to run in the nominated cpu (cpu affinity and restrictions) Basically if both the current cpu and the previous cpu on which the task ran is idle, select the nominated cpu from semi idle cpu package for running the new task that is waking up. Cache hotness and system utilisation could also be considered before the decision is made. (Not currently implemented) This technique will effectively move short running bursty jobs in an mostly idle system. Wakeup biasing for power savings gets automatically disabled if system utilisation increases due to the fact that the probability of finding both this_cpu and prev_cpu idle decreases. Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> --- kernel/sched_fair.c | 13 +++++++++++++ 1 files changed, 13 insertions(+), 0 deletions(-) diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index f1c96e3..1301521 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -73,6 +73,9 @@ unsigned int sysctl_sched_wakeup_granularity = 5000000UL; const_debug unsigned int sysctl_sched_migration_cost = 500000UL; +/* Preferred wakeup cpu nominated by sched_mc load balancing logic */ +extern unsigned int sched_mc_preferred_wakeup_cpu; + /************************************************************** * CFS operations on generic schedulable entities: */ @@ -1232,6 +1235,16 @@ static int ...
Active load balancing is a process by which migration thread
is woken up on the target CPU in order to pull current
running task on another package into this newly idle
package.
This method is already in use with normal load_balance(),
this patch introduces this method to new idle cpus when
sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
This logic provides effective consolidation of short running
daemon jobs in a almost idle system
The side effect of this patch may be ping-ponging of tasks
if the system is moderately utilised. May need to adjust the
iterations before triggering.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
kernel/sched.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 4ae79f5..e47e8e8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3656,10 +3656,40 @@ redo:
}
if (!ld_moved) {
+ int active_balance;
+ unsigned long flags;
+
schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;
+
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ return -1;
+
+ if (sd->nr_balance_failed++ < 1)
+ return -1;
+
+ spin_lock_irqsave(&busiest->lock, flags);
+
+ /* don't kick the migration_thread, if the curr
+ * task on busiest cpu can't be moved to this_cpu
+ */
+ if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+ spin_unlock_irqrestore(&busiest->lock, flags);
+ all_pinned = 1;
+ return ld_moved;
+ }
+
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->push_cpu = this_cpu;
+ active_balance = 1;
+ }
+ spin_unlock_irqrestore(&busiest->lock, flags);
+ if (active_balance)
+ wake_up_process(busiest->migration_thread);
+
} else
sd->nr_balance_failed = 0;
--
May I again ask to first clean up the current power saving code before stacking more on top of it? --
current power save balance code: 1. Detailed documentation 2. Cleanup the group_min and group_leader stuff in find_busiest_group() Did I get the requirements correct? --Vaidy --
> 1. Detailed documentation Messy code cannot be really made good with documentation. It's not that your patches are that messy, it's more that it makes I think one issue is that there are general too many special cases that completely change the algorithm especially for power saving. Perhaps it would make sense to refactor the code a bit and then use different high level code paths for those? I assume that would make it all simpler and easier to understand. The other alternative would be to dynamically change the domains so that a generic graph walker without knowledge of power savings could DTRT in all cases. But I assume that would be much harder. -Andi --
Hi Andi, I will try to refactor the code and see if it can look cleaner. Power saving balance is actually just a corner case in general load_balance. We will have to do default load_balance for performance each time we enter this section of the scheduler, except when certain complex conditions are met and we see an opportunity to save power. When that window of opportunity to save power exist, the code will take a different path and do things that will normally not be done by the default load balancer logic. I think we can split the statistics (load) computation part and the decision part to separate functions and call the power save balance conditions first and if there is no opportunity to save power, then This option may not work because we will have to change the decision only when a power saving opportunity exist. Hence more often the power save balance code should just fall through into the default balancer. Splitting them this way may add duplicate code in both the paths. --Vaidy --
Preferably in the form of in-code comments and code structure, this That would be much appreciated. But I also prefer to get rid of that power savings tweak in cpu_coregroup_map(). But above all, readable code ;-) find_busiest_group() is the stuff of nightmares. --
Peter, Almost every if() stmt/basic block in the power savings code has comments around it. And also power-savings code is 50 lines (mostly comments) Why? Based on the power vs perf, we wanted to construct topologies differently. Reason for the complexity is, in some of the Intel cpu's, while the cores share the same package they have different last level caches. So for performance, we want to differentiate based on last level caches power-savings code is very small part of that nightmare :) That code became complex over years with HT, smp-nice etc. I haven't been following recent sched changes. I can take a look at it and see what I can do to better organize find_busiest_group() thanks, suresh --
Sure, and those comments only tell me what it does, not why it does it. But then why not add a domain level that represents the packages and power schedule on that? That way you dont have to change the domain structure depending on the load balance goals. Also, that justification is just wrong - AMD has similar constructs in its cpus, and god knows what other architectures do, so hiding this in Yes I understand, but we really need to do something about it - and everybody with interests in it should be looking at ways to reduce the You and me both then :-) I've been looking at the history of that function - it started out quite readable - but has, over the years, grown into a monstrosity. One of the things I'm currently looking at is getting rid of the small_imbalance stuff - that made sense when we balanced based on nr_running, but now that we're weight based and willing to over-balance a little I'm not sure that it still does. Also, it looks to me that Christoph Lameters fix 0a2966b48fb784e437520e400ddc94874ddbd4e8 introduced a bug. By skipping the cpu in the group accumulation the group average computation below needs to be adjusted, because it uses the group power stuff, which assumes all cpus contributed. Then there is this whole sched_group stuff, which I intent to have a hard look at, afaict its unneeded and we can iterate over the sub-domains just as well. Finally, we should move all this stuff into sched_fair and get rid of that iterator interface and fix up all nr_running etc.. usages to refer to cfs.nr_running and similar. Then there is the idea Andi proposed, splitting up the performance and power balancer into two separate functions, something that is worth looking into imho. --
I agree it is terrible, and subsequent "features" weren't really properly What sub-domains? The domains-minus-groups are just a graph (in existing setup code AFAIK just a line) of cpumasks. You have to group because you want enough control for example not to pull load from an unusually busy CPU from one group if it's load should actually be spread out over a smaller domain (ie. probably other CPUs within the group we're looking at). It would be nice if you could make it simpler of course, but I just don't understand you or maybe you thought of some other way to solve this or That's what *I* suggested. Before it even went in. Of course there was no --
Right, I get the domain stuff - that's good stuff.
But, let my try and confuse you with ASCII-art ;-)
Domain [0-7]
group [0-3] group [4-7]
Domain [0-3]
group[0-1] [group2-3]
Domain [0-1]
group 0 group 1
(right hand side not drawn due to lack of space etc...)
So we have this tree of domains, which is cool stuff. But then we have
these groups in there, which closely match up with the domain's child
domains.
So my idea was to ditch the groups and just iterate over the child
Even more reason to make it happen.
--
But it's all per-cpu, so you'd have to iterate down other CPU's child domains. Which may get dirtied by that CPU. So you get cacheline bounces. You also lose flexibility (although nobody really takes full advantage I'm not saying you couldn't do it (reasonably well -- cacheline bouncing might be a problem if you propose to traverse other CPU's domains), but Yes it would be great if it happens. --
Humm, are you saying each cpu has its own domain tree? My understanding was that its a global structure, eg. given: domain[0-1] domain[0] domain[1] Afaict the only flexibility you loose is that you cannot make groups larger/smaller than the child domain - which given that the whole premesis of the groups existence is that the inner-group balancing Those cacheline bounces could be mitigated by splitting sched_domain into two parts with a cacheline aligned dummy and keep the rarely modified data separate from the frequently modified data. As to the gains - a graph walk with a single type seems more elegant to me. --
It's fundamentally two different things anyway though. I don't see any theoretical improvement, and it definitely wouldn't improve the practical side much if any because the biggest problem I don't think is the simple walks themselves but the calculations and stuff. If it can yield something clearly better that is impossible using domains and groups, I could change my mind. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer |
