Re: [PATCH 0/3] [idled]: Idle Cycle Injector for power capping

From: Salman
Date: Tuesday, April 13, 2010 - 5:08 pm

As we discussed earlier this year, Google has an implementation that it
would like to share.  I have finally gotten around to porting it to
v2.6.33 and cleaning up the interfaces.  It is provided in the following
messages for your review.  I realize that when we first discussed this
idea, a lot of ideas were presented for enhancing it.  Thanks a lot for
your suggestions.  I haven't gotten around to implementing any of them.

The ones that I still find appealing are:

0. Providing approximate synchronization between cores, regardless
of their independent settings, in order to improve power savings.  We have
to balance this with eager injection (i.e. avoiding injection when
an interactive task needs to run).

A stricter synchronization between cores is needed to make the idle cycle
injector work on hyperthreaded systems.  This is a somewhat separate issue, as
there should only be one idle cycle injector minimum idle setting per
physical core.

1. It's not possible to directly use hard limits to implement the
type of assurance that we need.  However, we could do something similar to
CPU hard limits to implement a global power cap.  It is not strictly
necessary for Google's purposes.  The outcome of the trade-offs is not
immediately clear to me.  I need to do some prototyping.

Now, back to the current set of patches.

Testing:

The patches were tested using the following program.  The output was:

# /export/hda3/kidled_test /dev/cgroup/
Latency Test:

Count without injection: 9441
Count with 80% injection (batch) 1805 (idle 8099305661)
Count with 80% injection (interactive): 9439 (idle 8054796135)
Lost wake ups (batch): 7636
Lost wake ups (interactive): 2
Priority Test:

Low priority got:  26197453ns
High priority got: 1971369919ns
Idle Time:         8021629325ns
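
As a sanity check on the numbers above, the figures can be reproduced with
simple arithmetic.  The sketch below assumes that "Lost wake ups" is the count
without injection minus the count with injection (the reported values are
consistent with that), and that the three Priority Test times together account
for the whole measurement window.  The helper names are made up for
illustration and are not part of the test program.

```c
#include <stdint.h>

/* Hypothetical helper: wake-ups lost relative to the run without injection. */
static uint64_t lost_wakeups(uint64_t baseline, uint64_t with_injection)
{
	return baseline > with_injection ? baseline - with_injection : 0;
}

/* Hypothetical helper: share of `part` in `total`, in percent. */
static double share_percent(uint64_t part, uint64_t total)
{
	return total ? 100.0 * (double)part / (double)total : 0.0;
}
```

With the output above, lost_wakeups(9441, 1805) gives 7636 and
lost_wakeups(9441, 9439) gives 2, matching the report.  The Priority Test
times sum to 10019196697ns, of which the idle time is a little over 80%, and
the high priority task received roughly 98.7% of the remaining busy time.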

Test program follows:


/*
 *  A set of tests for the idle cycle injector.
 */

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include ...
From: Salman
Date: Tuesday, April 13, 2010 - 5:08 pm

From: Salman Qazi <sqazi@google.com>

kidled is a kernel thread that implements idle cycle injection for
the purposes of power capping.  It measures the naturally occurring
idle time as necessary to avoid injecting idle cycles when the
CPU is already sufficiently idle.  The actual idle cycle injection
takes place in a realtime kernel thread, whereas the measurements
take place in hrtimer callback functions.
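
To make the "avoid injecting when the CPU is already sufficiently idle" rule
concrete, here is a minimal sketch of the per-interval arithmetic.  It is not
the code from kernel/kidled.c; the function name is hypothetical, and the
parameters borrow the interval and min_idle_percent names from
Documentation/kidled.txt in this patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from kernel/kidled.c.
 * Returns how many nanoseconds of idle time still have to be injected
 * in one interval, given the idle time that occurred naturally.
 */
static uint64_t idle_to_inject_ns(uint64_t interval_ns,
				  unsigned int min_idle_percent,
				  uint64_t natural_idle_ns)
{
	/* Divide first so that a very long interval cannot overflow. */
	uint64_t quota_ns = interval_ns / 100 * min_idle_percent
			  + interval_ns % 100 * min_idle_percent / 100;

	/* Already idle enough: nothing to inject. */
	if (natural_idle_ns >= quota_ns)
		return 0;

	return quota_ns - natural_idle_ns;
}
```

For a 1 s interval with min_idle_percent set to 80, a CPU that was naturally
idle for 300 ms would need another 500 ms injected, while one that was
naturally idle for 900 ms would need none.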

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 Documentation/kidled.txt     |   40 +++
 arch/x86/Kconfig             |    1 
 arch/x86/include/asm/idle.h  |    1 
 arch/x86/kernel/process_64.c |    2 
 drivers/misc/Gconfig.ici     |    1 
 include/linux/kidled.h       |   45 +++
 kernel/Kconfig.ici           |    6 
 kernel/Makefile              |    1 
 kernel/kidled.c              |  547 ++++++++++++++++++++++++++++++++++++++++++
 kernel/softirq.c             |   15 +
 kernel/sysctl.c              |   11 +
 11 files changed, 664 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/kidled.txt
 create mode 100644 drivers/misc/Gconfig.ici
 create mode 100644 include/linux/kidled.h
 create mode 100644 kernel/Kconfig.ici
 create mode 100644 kernel/kidled.c

diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
new file mode 100644
index 0000000..1149e3f
--- /dev/null
+++ b/Documentation/kidled.txt
@@ -0,0 +1,40 @@
+Idle Cycle Injector:
+====================
+
+Overview:
+
+Provides a kernel interface for causing the CPUs to have some
+minimum percentage of the idle time.
+
+Interfaces:
+
+Under /proc/sys/kernel/kidled/, we can find the following files:
+
+cpu/*/interval
+cpu/*/min_idle_percent
+cpu/*/stats
+
+interval specifies the period of time over which we attempt to make the
+CPU min_idle_percent idle.  stats provides three fields.  The first is
+the naturally occurring idle time.  The second is the busy time, and the last
+is the injected idle time.  All three values are reported in the units of
+nanoseconds.
+
+** VERY ...
From: Andi Kleen
Date: Wednesday, April 14, 2010 - 2:49 am

Haven't read the whole thing, but do any of these stats really
need to execute on the target CPU? They seem to be just readable
fields.

Or does it simply not matter because this proc call is too infrequent?

Anyway, global broadcasts are discouraged; there is typically
always someone who feels their RT latency is messed up by them.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Salman Qazi
Date: Wednesday, April 14, 2010 - 8:41 am

To capture all the quantities for a CPU atomically, they must be read
on the CPU.  Basically, reading them on that CPU prevents them from
changing as we read them.

Also, if the CPU is idle (injected or otherwise), the quantities won't

It should be infrequent.  The idle cycle injector does all the hard


I will look at it one more time to see if there is something else that
--

From: Peter Zijlstra
Date: Thursday, April 15, 2010 - 12:46 am

Who cares? By the time they reach userspace they've changed anyway.

--

From: Salman
Date: Tuesday, April 13, 2010 - 5:08 pm

From: Salman Qazi <sqazi@google.com>

We add the concept of a "power interactive" task group.  This is a task
group that, for the purposes of power capping, will receive special treatment.

When there are no power interactive tasks on the runqueue, we inject idle
cycles unless we have already met the quota.  However, when there are
power interactive tasks on the runqueue, we only inject idle cycles if we
would otherwise fail to meet the quota.  As a result, we try our very best
to not hit the interactive tasks with the idle cycles.  The power
interactivity status of a task group is determined by the boolean value
in cpu.power_interactive.
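
The two cases above can be sketched as a single decision function.  This is
only an illustration under assumptions, not the logic in kernel/kidled.c: the
struct, its fields and the function name are hypothetical, and "would
otherwise fail to meet the quota" is interpreted as "the time left in the
interval is no longer than the idle time still owed".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU view of the current interval (illustration only). */
struct interval_state {
	uint64_t remaining_ns;		/* time left until the interval ends */
	uint64_t idle_ns;		/* idle time so far, natural or injected */
	uint64_t quota_ns;		/* minimum idle time required per interval */
	bool interactive_runnable;	/* a power interactive task is runnable */
};

static bool should_inject_now(const struct interval_state *s)
{
	/* Quota already met: never inject. */
	if (s->idle_ns >= s->quota_ns)
		return false;

	/* No power interactive task waiting: inject eagerly. */
	if (!s->interactive_runnable)
		return true;

	/* Otherwise inject lazily, only once the quota would be missed. */
	return s->remaining_ns <= s->quota_ns - s->idle_ns;
}
```

Under this reading, interactive tasks are only displaced at the tail end of an
interval, and only when the CPU has not been idle enough by then.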

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 Documentation/kidled.txt |   15 ++++
 include/linux/kidled.h   |   34 +++++++++
 include/linux/sched.h    |    3 +
 kernel/kidled.c          |  166 +++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched.c           |   80 ++++++++++++++++++++++
 5 files changed, 285 insertions(+), 13 deletions(-)

diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
index 1149e3f..564aa00 100644
--- a/Documentation/kidled.txt
+++ b/Documentation/kidled.txt
@@ -25,7 +25,7 @@ injected idle cycles are by convention reported as busy time, attributed to
 kidled.
 
 
-Operation:
+Basic Operation:
 
 The injecting component of the idle cycle injector is the kernel thread
 kidled.  The measurements to determine when to inject idle cycles is done
@@ -38,3 +38,16 @@ quota.  If that's the case, then we inject idle cycles until the end of the
 interval.
 
 
+Eager Injection:
+
+The above is true when there is at least one task marked "interactive" on
+the CPU runqueue for the duration of the interval.  Marking a task
+interactive involves setting power_interactive to 1 in its parent CPU
+cgroup.  When no such task is runnable and when we have not achieved
+the minimum idle percentage for the interval, we eagerly inject idle cycles.
+The purpose for doing so is to inject as many of the ...
From: Salman
Date: Tuesday, April 13, 2010 - 5:08 pm

From: Salman Qazi <sqazi@google.com>

0) Power Capping Priority:

After we finish a lazy injection, we look at the task groups in the order
of increasing priority.  For each task group, we attempt to assign
as much vruntime as possible, to cover the time that was spent doing
the lazy injection.  Within each priority, we round-robin between the
task groups across different invocations to make sure that we don't
consistently penalize the same one.

The priorities themselves are specified through the value
cpu.power_capping_priority in the parent CPU cgroup of the tasks.

1) Load balancer awareness

The idle cycle injector is an RT thread.  A consequence is that from the load
balancer's point of view, it is a particularly heavy thread.  While
we appreciate the ability to preempt any CFS threads, it is useful
to have a lesser weight, as a heavy weight makes an injected CPU
disproportionately less desirable than other CPUs.  We provide this
by faking the weight of the idle cycle injector to be equivalent to
a CFS thread of a user-controllable nice value.
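
For reference, CFS derives a task's load weight from its nice value through a
lookup table, so "a CFS thread of a given nice value" corresponds to one table
entry.  The sketch below reproduces the values of the prio_to_weight array
from kernel/sched.c; the wrapper name kidled_fake_weight is hypothetical and
is not the name used in the patch.

```c
/* Load weights for nice -20..19, as in prio_to_weight in kernel/sched.c. */
static const int nice_weight[40] = {
	/* -20 */ 88761, 71755, 56483, 46273, 36291,
	/* -15 */ 29154, 23254, 18705, 14949, 11916,
	/* -10 */  9548,  7620,  6100,  4904,  3906,
	/*  -5 */  3121,  2501,  1991,  1586,  1277,
	/*   0 */  1024,   820,   655,   526,   423,
	/*   5 */   335,   272,   215,   172,   137,
	/*  10 */   110,    87,    70,    56,    45,
	/*  15 */    36,    29,    23,    18,    15,
};

/* Hypothetical wrapper: weight the load balancer would see for the injector. */
static int kidled_fake_weight(int nice)
{
	/* Clamp to the valid nice range before indexing the table. */
	if (nice < -20)
		nice = -20;
	if (nice > 19)
		nice = 19;

	return nice_weight[nice + 20];
}
```

Each nice step changes the weight by roughly 25%, so the chosen nice value
gives fairly fine control over how loaded an injected CPU appears to the
balancer.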

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 Documentation/kidled.txt |   38 ++++++++++++++++++++++-
 include/linux/kidled.h   |    6 ++++
 kernel/kidled.c          |    2 +
 kernel/sched.c           |   75 +++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_fair.c      |   77 +++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 192 insertions(+), 6 deletions(-)

diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
index 564aa00..400b97b 100644
--- a/Documentation/kidled.txt
+++ b/Documentation/kidled.txt
@@ -6,7 +6,7 @@ Overview:
 Provides a kernel interface for causing the CPUs to have some
 minimum percentage of the idle time.
 
-Interfaces:
+Basic Interfaces:
 
 Under /proc/sys/kernel/kidled/, we can find the following files:
 
@@ -51,3 +51,39 @@ tasks become runnable, they are more likely to fall in an interval when we
 aren't forcing the CPU idle.
 
 
+Power Capping ...
From: Peter Zijlstra
Date: Thursday, April 15, 2010 - 12:51 am

.33 is way too old to submit patches against.

That said, I really really dislike this approach, I would much rather
see it tie in with power aware scheduling.

--

From: tytso
Date: Saturday, April 17, 2010 - 10:08 am

But it's not too old for review purposes; as Salman said, they were
sent to LKML for comments and review.  I think it's well understood
that when these patches are ready to be merged, they need to be
submitted right before the merge window opens, against a recent -rc
kernel.

					- Ted
--

From: Arjan van de Ven
Date: Saturday, April 17, 2010 - 10:57 am

On Sat, 17 Apr 2010 13:08:08 -0400

s/submitted/refreshed/ ;-)




-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Peter Zijlstra
Date: Saturday, April 17, 2010 - 12:51 pm

No, they need to be in the relevant subsystem tree by then, patch
submissions to subsystem trees right before the merge window opens are
bound to get delayed another cycle.

--

From: Salman Qazi
Date: Monday, April 19, 2010 - 10:20 am

I think I can see your point:  there is potentially better information
about the power consumption of the CPU beyond the time it was busy.
But please clarify: is your complaint the lack of use of this
information or are you arguing for a deeper integration into the
scheduler (i.e. implementing it as part of the scheduler rather than
--

From: Peter Zijlstra
Date: Monday, April 19, 2010 - 12:01 pm

Right, so the IBM folks who were looking at power aware scheduling were
working on an interface to quantify the amount of power to save.

But their approach was an extension of the regular power aware
load-balancer, which basically groups tasks onto sockets so that whole
sockets can go idle.

However Arjan explained to me that your approach, which idles the whole
machine, has the advantage that also memory banks can go into idle mode
and save power.

Still, in the interest of cutting back on power-saving interfaces, it would
be nice to see if there is anything we can do to merge these things, but I
really haven't thought much about that yet.

--

From: Arjan van de Ven
Date: Monday, April 19, 2010 - 6:00 pm

On Mon, 19 Apr 2010 21:01:41 +0200

one correction, this is not about power *saving*, it is about power
*capping*. Power capping is pretty much energy inefficient by
definition (and surely in practice), but it's about dealing with
reality about underdimensioned airconditioning or voltage rails....

Due to the reality that socket offlining isn't as good as idle
insertion.. I'd rather focus on the latter...



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Vaidyanathan Srinivasan
Date: Monday, April 19, 2010 - 10:00 pm

The power reduction benefit is architecture and topology dependent.
For example, on the POWER platform, socket offlining could provide better
power reduction than idle injection.

As mentioned by Arjan, these approaches help reduce average power
consumption to meet power and cooling limitation over a short
interval.  These are not general optimizations to improve operating
efficiency; however, when used at certain workload and utilization
levels, these can potentially provide overall energy savings.

Having the SMP load balancer pull jobs away from a core or socket to
allow it to remain idle for short bursts of time would be a good
implementation.

--Vaidy

--

From: Vaidyanathan Srinivasan
Date: Monday, April 19, 2010 - 9:50 pm

Indicating required system capacity to the loadbalance and using that
information to evacuate cores or socket was the basic idea.  

Ref: http://lkml.org/lkml/2009/5/13/173

The challenge with that approach is the predictable evacuation or

Integrating with the load balancer will make the design cleaner and
avoid forcefully running an idle thread.  The scheduler should
schedule 'nothing' so that idleness can happen and cpuidle governor

Well, this is an ideal goal.  Injecting some amount of idle time
across all cores/threads, preferably with overlapping time windows, will
save quite a lot of power on x86.  But at least overlapping idle times
among sibling threads are required to get any power savings.

This proposed approach does not yet have the ability to do overlapping

At least integrating this with the ACPI CPU aggregation driver can be a good
first step.  Both the drivers and code are for the same power capping
purpose, using idle time injection and running a high-priority idle
thread for a short duration.

ACPI Processor Aggregator Driver for 2.6.32-rc1
Ref: http://lkml.org/lkml/2009/10/3/13

--Vaidy

--

From: Salman Qazi
Date: Tuesday, April 20, 2010 - 10:52 am

On Mon, Apr 19, 2010 at 9:50 PM, Vaidyanathan Srinivasan

I am actually not sure which one would be more aesthetically pleasing.
Putting it into the scheduler would also place a lot of complexity
(basically, the same set of timers) in

Agreed.  For sibling threads, we need a hard guarantee of simultaneous
injection, which is best achieved by using a single timer for all the
siblings.  It is in my list of things to do.  Is it necessary for the
first cut of idle cycle injector?

For improving power savings in the non-SMT case, as Arjan suggested, I
will make the changes for heuristically aligning the injection on
multiple cores.  This will not be perfect, but then because it's a
power optimization, it doesn't have to always work.  I presume that
this works best when done according to the CPU hierarchy?  That is, it
is more beneficial to idle an entire socket than the same number of

This is reasonable.  I could merge the two implementations.  Are there
features in that implementation that our implementation is missing?
From a cursory glance, the driver is a naive idle cycle injector, in
that it doesn't take existing idle time or scheduler issues into
--

From: Arjan van de Ven
Date: Tuesday, April 20, 2010 - 10:08 pm

On Tue, 20 Apr 2010 10:52:58 -0700

not really; at least not for Intel CPUs.
The problem is that due to the cache coherency, as long as one cpu in
the system is awake, the memory controllers etc cannot go into a sleep
mode...

I would not be surprised if AMD has the same behavior... or anyone else
with an integrated memory controller for that matter.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Mike Chan
Date: Wednesday, April 21, 2010 - 6:32 pm

I may have missed this on lkml, but are there any ongoing community
efforts on power-aware scheduling?

--

From: Peter Zijlstra
Date: Thursday, April 22, 2010 - 1:21 am

Well, mostly targeting load-balancing, which I gather is kinda useless
for android seeing that it runs on UP hardware.

But yeah, both IBM and Intel have contributed significant work in this
area.

--

From: Vaidyanathan Srinivasan
Date: Thursday, April 22, 2010 - 12:02 pm

Yes, mostly in power aware task placement and task consolidation in
large SMP systems and also some timer consolidation to improve low
power idle residency.

There is also some tuning and optimization in the cpuidle governor that is
related to power management but not to the core scheduler.

As Peter mentioned, most of them may not apply to uniprocessor
systems.

--Vaidy
--

From: Arjan van de Ven
Date: Saturday, April 17, 2010 - 9:40 am

On Tue, 13 Apr 2010 17:08:18 -0700

again I'll chime in to support this effort; it's the right thing to do
for power limiting (as opposed to taking cores offline), and I'm happy
to see progress being made.

I'll start playing with your patches and use timechart to see how well

I still would like to see this ;-)
It's a *HUGE* instant power delta.

But it does not have to be perfect. As long as "on average" we align
we're good enough.

the easiest way is to round the time of the start of idle injection
up to, say, double the duration of the injection period...
and maybe to whole seconds or some round value of jiffies as well.
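
The rounding idea can be written down in a few lines.  A minimal sketch,
assuming timestamps in nanoseconds on a clock shared by all CPUs; the function
name and the choice of exactly twice the injection period as the grid are
illustrative, taken from the "say, double" suggestion above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: round the start of an idle injection up to the next
 * multiple of twice the injection period, so that CPUs which pick their
 * start times independently still land on the same grid.
 */
static uint64_t aligned_injection_start(uint64_t now_ns, uint64_t period_ns)
{
	uint64_t grid_ns = 2 * period_ns;
	uint64_t rem;

	if (grid_ns == 0)
		return now_ns;	/* no period configured: nothing to align to */

	rem = now_ns % grid_ns;
	return rem ? now_ns + (grid_ns - rem) : now_ns;
}
```

With a 100 ns period, a CPU that is ready to inject at t=1050 would wait until
t=1200, and so would any other CPU that becomes ready between t=1001 and
t=1200.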

It could even be done by "creeping" towards an aligned situation...
rather than forcing instant alignment, as long as each time we inject
idle time we get a step closer to being aligned.. very soon we WILL be
aligned.
(for example, if a cpu notices it's on the late side of an alignment
window, it could inject a little shorter than usual, while if it

actually... while the HT case is clearly required to be solved to get
actual power limits, ideally we can solve it using the same tricks we
use for the above, just with a stronger bias...

I don't think we need to force the admin to set the same value per se,
it's something that's just a matter of having the policy guy do this
right... (but if you want to do "effective injection %age is minimum of


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
