Hi, The existing power saving loadbalancer CONFIG_SCHED_MC attempts to run the workload in the system on minimum number of CPU packages and tries to keep rest of the CPU packages idle for longer duration. Thus consolidating workloads to fewer packages help other packages to be in idle state and save power. The current implementation is very conservative and does not work effectively across different workloads. Initial idea of tunable sched_mc_power_savings=n was proposed to enable tuning of the power saving load balancer based on the system configuration, workload characteristics and end user requirements. The power savings and performance of the given workload in an under utilised system can be controlled by setting values of 0, 1 or 2 to /sys/devices/system/cpu/sched_mc_power_savings with 0 being highest performance and least power savings and level 2 indicating maximum power savings even at the cost of slight performance degradation. Please refer to the following discussions and article for details. [1]Making power policy just work http://lwn.net/Articles/287924/ [2][RFC v1] Tunable sched_mc_power_savings=n http://lwn.net/Articles/287882/ [3][RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n http://lwn.net/Articles/297306/ [4][RFC PATCH v3 0/5] Tunable sched_mc_power_savings=n http://lkml.org/lkml/2008/11/10/260 The following series of patch demonstrates the basic framework for tunable sched_mc_power_savings. This version of the patch incorporates comments and feedback received on the previous post. Thanks to Peter Zijlstra, Gregory Haskins, and Vatsa for the review and comments. Changes from v3: ---------------- * Fixed the locking code with double_lock_balance() in active-balance-newidle.patch * Moved sched_mc_preferred_wakeup_cpu to ...
commit 9fcd18c9e63e325dbd2b4c726623f760788d5aa8
Author: Ingo Molnar <mingo@elte.hu>
Date: Wed Nov 5 16:52:08 2008 +0100
Reverting this patch since this affects consolidation
and power savings. Will rework this tuning in the next
iteration.
Impact: improve wakeup affinity on NUMA systems, tweak SMP systems
Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
balancing defaults on NUMA and SMP systems.
Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
why we would not want to have wakeup affinity across nodes as well.
(we already do this in the standard NUMA template.)
lat_ctx on a NUMA box is particularly happy about this change:
before:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.60
| 2 5.70
after:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.65
| 2 2.07
a 2.75x speedup.
pipe-test is similarly happy about it too:
| phoenix:~/sched-tests> ./pipe-test
| 18.26 usecs/loop.
| 14.70 usecs/loop.
| 14.38 usecs/loop.
| 10.55 usecs/loop. # +WAKE_AFFINE on domain0+domain1
| 8.63 usecs/loop.
| 8.59 usecs/loop.
| 9.03 usecs/loop.
| 8.94 usecs/loop.
| 8.96 usecs/loop.
| 8.63 usecs/loop.
Also:
- disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
- enable SD_WAKE_BALANCE on SMP domains
Sysbench+postgresql improves all around the board, quite significantly:
.28-rc3-11474e2c .28-rc3-11474e2c-tune
-------------------------------------------------
1: 571 688 +17.08%
2: 1236 1206 -2.55%
4: 2381 2642 +9.89%
8: 4958 5164 +3.99%
16: 9580 9574 -0.07%
32: 7128 8118 ...commit 14800984706bf6936bbec5187f736e928be5c218
Author: Mike Galbraith <efault@gmx.de>
Date: Fri Nov 7 15:26:50 2008 +0100
Reverting this patch for now since this affects
consolidation and power savings. Will rework
this tuning in the next iteration.
Tune SD_MC_INIT the same way as SD_CPU_INIT:
unset SD_BALANCE_NEWIDLE, and set SD_WAKE_BALANCE.
This improves vmark by 5%:
vmark 132102 125968 125497 messages/sec avg 127855.66 .984
vmark 139404 131719 131272 messages/sec avg 134131.66 1.033
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
include/linux/topology.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 2565f4a..b1551d4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -120,10 +120,10 @@ void arch_update_cpu_topology(void);
.wake_idx = 1, \
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
+ | SD_BALANCE_NEWIDLE \
| SD_BALANCE_FORK \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
- | SD_WAKE_BALANCE \
| SD_SHARE_PKG_RESOURCES\
| BALANCE_FOR_MC_POWER, \
.last_balance = jiffies, \
--
From: Gautham R Shenoy <ego@in.ibm.com>
*** RFC patch of work in progress and not for inclusion. ***
Currently the sched_mc/smt_power_savings variable is a boolean, which either
enables or disables topology based power savings. This extends the behaviour of
the variable from boolean to multivalued, such that based on the value, we
decide how aggressively do we want to perform topology based powersavings
balance.
Variable levels of power saving tunable would benefit end user to match the
required level of power savings vs performance trade off depending on the
system configuration and workloads.
This initial version makes the sched_mc_power_savings global variable to take
more values (0,1,2).
Later version is expected to add new member sd->powersavings_level at the multi
core CPU level sched_domain. This make all sd->flags check for
SD_POWERSAVINGS_BALANCE into a different macro that will check for
powersavings_level.
The power savings level setting should be in one place either in the
sched_mc_power_savings global variable or contained within the appropriate
sched_domain structure.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
include/linux/sched.h | 11 +++++++++++
kernel/sched.c | 16 +++++++++++++---
2 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 644ffbd..d862837 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -760,6 +760,17 @@ enum cpu_idle_type {
#define SD_SERIALIZE 1024 /* Only a single load balancing instance */
#define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */
+enum powersavings_balance_level {
+ POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
+ POWERSAVINGS_BALANCE_BASIC, /* Fill one thread/core/package
+ * first for long running threads
+ */
+ POWERSAVINGS_BALANCE_WAKEUP, /* Also bias task wakeups to semi-idle
+ * ...Might I suggest a dimensioned number rather than a relative one? One might say that 100 represents the full power of a system, meaning that all chips/cores are running at full speed, whereas 50 means that the power system would attempt to halve the resources available, and would return a value that represents the value that the power system believes it has achieved. For example, if it could only reduce the clock speed by 10%, on a old uniprocessor, it would return 90. An additional, second value it might return might be the power reduction it believed it had achieved. These, by the way, are what my Tadpole GUI shows (;-)) so I'm just following someone else's lead. -- David Collier-Brown | Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest davecb@sun.com | -- Mark Twain cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191# --
Ideally we would like to have such a metric :) However practically the power savings and performance tradeoff depends on 1) System configuration (topology, cpu type, cpu features) 2) Workload -- memory bound, IO bound, cpu bound 3) Environment -- system temperature and many more dimensions that we have not considered yet! What you are asking for may have been possible if the CPUs were very simple and performance and power directly corresponded to operating frequency. Modern CPUs have very widely varying operating characteristics that greatly depend on workload type. Hence deriving a metric for power vs performance tradeoff will be very inaccurate and useless. We may be able to design the framework in such a way that each level of settings will provide increasing levels of power savings with little or no performance impact (depending on the workload). Power consumption at sched_mc=0 > sched_mc=1 > sched_mc=2 but not in linear scale. Also, this is only one of the component of the power saving tunables. There are various governor settings and platform setting that may Measuring the actual power being consumed will be useful for sys admins to choose the correct settings. However this is platform dependent and best left to the platform management tools as compared to generic scheduler. We would expect the end user to use the platform management tools to collect and trend the power consumption data and correlate with the power saving tunables to decide the best power vs performance Can you please provide more details. I could not get a google hit on the Tadpole GUI which is relevant to this discussion. Thanks, Vaidy --
Active load balancing is a process by which migration thread
is woken up on the target CPU in order to pull current
running task on another package into this newly idle
package.
This method is already in use with normal load_balance(),
this patch introduces this method to new idle cpus when
sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.
This logic provides effective consolidation of short running
daemon jobs in a almost idle system
The side effect of this patch may be ping-ponging of tasks
if the system is moderately utilised. May need to adjust the
iterations before triggering.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
kernel/sched.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 54 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index d28cd98..e4aff6c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3694,10 +3694,64 @@ redo:
}
if (!ld_moved) {
+ int active_balance;
+
schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
!test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
return -1;
+
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ return -1;
+
+ if (sd->nr_balance_failed++ < 2)
+ return -1;
+
+ /*
+ * The only task running in a non-idle cpu can be moved to this
+ * cpu in an attempt to completely freeup the other CPU
+ * package. The same method used to move task in load_balance()
+ * have been extended for load_balance_newidle() to speedup
+ * consolidation at sched_mc=POWERSAVINGS_BALANCE_WAKEUP (2)
+ *
+ * The package power saving logic comes from
+ * find_busiest_group(). If there are no imbalance, then
+ * f_b_g() will return NULL. However when sched_mc={1,2} then
+ * f_b_g() will select a group from which a running task may be
+ * pulled to this cpu in order to make the other package idle.
+ * If there is no opportunity to make a package idle ...When the system utilisation is low and more cpus are idle,
then the process waking up from sleep should prefer to
wakeup an idle cpu from semi-idle cpu package (multi core
package) rather than a completely idle cpu package which
would waste power.
Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.
This info can be sored in appropriate sched_domain, but
updating this info in all copies of sched_domain is not
practical. Hence this information is stored in root_domain
struct which is one copy per partitioned sched domain.
The root_domain can be accessed from each cpu's runqueue
and there is one copy per partitioned sched domain.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
kernel/sched.c | 15 +++++++++++++++
1 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 79b71f3..d28cd98 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -493,6 +493,17 @@ struct root_domain {
#ifdef CONFIG_SMP
struct cpupri cpupri;
#endif
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+
+ /*
+ * Preferred wake up cpu nominated by sched_mc balance that will be
+ * used when most cpus are idle in the system indicating overall very
+ * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+ */
+ unsigned int sched_mc_preferred_wakeup_cpu;
+
+#endif
+
};
/*
@@ -3090,6 +3101,7 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
return 0;
}
+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
@@ -3406,6 +3418,9 @@ out_balanced:
if (this == group_leader && group_leader != group_min) {
*imbalance = min_load_per_task;
+ if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
+ cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+ first_cpu(group_leader->cpumask);
return ...While not strictly needed, I prefer braces around multi-line statments. --
Hi Peter, Thanks for the detailed review. I will fix these issues in the next iteration. --Vaidy --
Preferred wakeup cpu (from a semi idle package) has been nominated in find_busiest_group() in the previous patch. Use this information in sched_mc_preferred_wakeup_cpu in function wake_idle() to bias task wakeups if the following conditions are satisfied: - The present cpu that is trying to wakeup the process is idle and waking the target process on this cpu will potentially wakeup a completely idle package - The previous cpu on which the target process ran is also idle and hence selecting the previous cpu may wakeup a semi idle cpu package - The task being woken up is allowed to run in the nominated cpu (cpu affinity and restrictions) Basically if both the current cpu and the previous cpu on which the task ran is idle, select the nominated cpu from semi idle cpu package for running the new task that is waking up. Cache hotness is considered since the actual biasing happens in wake_idle() only if the application is cache cold. This technique will effectively move short running bursty jobs in a mostly idle system. Wakeup biasing for power savings gets automatically disabled if system utilisation increases due to the fact that the probability of finding both this_cpu and prev_cpu idle decreases. Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> --- kernel/sched_fair.c | 17 +++++++++++++++++ 1 files changed, 17 insertions(+), 0 deletions(-) diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 98345e4..939f2a1 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -1027,6 +1027,23 @@ static int wake_idle(int cpu, struct task_struct *p) cpumask_t tmp; struct sched_domain *sd; int i; + unsigned int chosen_wakeup_cpu; + int this_cpu; + + /* + * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu + * are idle and this is not a kernel thread and this task's affinity + * allows it to be moved to preferred cpu, then just move! + */ + + this_cpu = ...
Just in case two groups have identical load, prefer to move load to lower
logical cpu number rather than the present logic of moving to higher logical
number.
find_busiest_group() tries to look for a group_leader that has spare capacity
to take more tasks and freeup an appropriate least loaded group. Just in case
there is a tie and the load is equal, then the group with higher logical number
is favoured. This conflicts with user space irqbalance daemon that will move
interrupts to lower logical number if the system utilisation is very low.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
kernel/sched.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index ea33446..79b71f3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3263,7 +3263,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
*/
if ((sum_nr_running < min_nr_running) ||
(sum_nr_running == min_nr_running &&
- first_cpu(group->cpumask) <
+ first_cpu(group->cpumask) >
first_cpu(group_min->cpumask))) {
group_min = group;
min_nr_running = sum_nr_running;
@@ -3279,7 +3279,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (sum_nr_running <= group_capacity - 1) {
if (sum_nr_running > leader_nr_running ||
(sum_nr_running == leader_nr_running &&
- first_cpu(group->cpumask) >
+ first_cpu(group->cpumask) <
first_cpu(group_leader->cpumask))) {
group_leader = group;
leader_nr_running = sum_nr_running;
--
Looking forward to the next version. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DI |
