As we discussed earlier this year, Google has an implementation that it
would like to share. I have finally gotten around to porting it to
v2.6.33 and cleaning up the interfaces. It is provided in the following
messages for your review.

I realize that when we first discussed this idea, a lot of ideas were
presented for enhancing it. Thanks a lot for your suggestions. I haven't
gotten around to implementing any of them. The ones that I still find
appealing are:

0. Providing approximate synchronization between cores, regardless of
their independent settings, in order to improve power savings. We have to
balance this with eager injection (i.e. avoiding injection when an
interactive task needs to run).

A stricter synchronization between cores is needed to make the idle cycle
injector work on hyperthreaded systems. This is a somewhat separate
issue, as there should only be one idle cycle injector minimum idle
setting per physical core.

1. It's not possible to directly use hard limits to implement the type of
assurance that we need. However, we could do something similar to CPU hard
limits to implement a global power cap. It is not strictly necessary for
Google's purposes. The outcome of the trade-offs is not immediately clear
to me. I need to do some prototyping.

Now, back to the current set of patches.

Testing:

The patches were tested using the following program. The output was:

# /export/hda3/kidled_test /dev/cgroup/
Latency Test:
Count without injection: 9441
Count with 80% injection (batch) 1805 (idle 8099305661)
Count with 80% injection (interactive): 9439 (idle 8054796135)
Lost wake ups (batch): 7636
Lost wake ups (interactive): 2
Priority Test:
Low priority got: 26197453ns
High priority got: 1971369919ns
Idle Time: 8021629325ns

Test program follows:

/*
 * A set of tests for the idle cycle injector.
 */
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include ...
From: Salman Qazi <sqazi@google.com>

kidled is a kernel thread that implements idle cycle injection for the
purposes of power capping. It measures the naturally occurring idle time
as necessary to avoid injecting idle cycles when the CPU is already
sufficiently idle. The actual idle cycle injection takes place in a
realtime kernel thread, whereas the measurements take place in hrtimer
callback functions.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 Documentation/kidled.txt     |   40 +++
 arch/x86/Kconfig             |    1
 arch/x86/include/asm/idle.h  |    1
 arch/x86/kernel/process_64.c |    2
 drivers/misc/Gconfig.ici     |    1
 include/linux/kidled.h       |   45 +++
 kernel/Kconfig.ici           |    6
 kernel/Makefile              |    1
 kernel/kidled.c              |  547 ++++++++++++++++++++++++++++++++++++++++++
 kernel/softirq.c             |   15 +
 kernel/sysctl.c              |   11 +
 11 files changed, 664 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/kidled.txt
 create mode 100644 drivers/misc/Gconfig.ici
 create mode 100644 include/linux/kidled.h
 create mode 100644 kernel/Kconfig.ici
 create mode 100644 kernel/kidled.c

diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
new file mode 100644
index 0000000..1149e3f
--- /dev/null
+++ b/Documentation/kidled.txt
@@ -0,0 +1,40 @@
+Idle Cycle Injector:
+====================
+
+Overview:
+
+Provides a kernel interface for causing the CPUs to have some
+minimum percentage of the idle time.
+
+Interfaces:
+
+Under /proc/sys/kernel/kidled/, we can find the following files:
+
+cpu/*/interval
+cpu/*/min_idle_percent
+cpu/*/stats
+
+interval specifies the period of time over which we attempt to make the
+CPU min_idle_percent idle. stats provides three fields. The first is
+the naturally occuring idle time. The second is the busy time, and the last
+is the injected idle time. All three values are reported in the units of
+nanoseconds.
+
+** VERY ...
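To make the interval / min_idle_percent / stats relationship above concrete, here is a minimal user-space sketch in C. It is not taken from kernel/kidled.c: the struct `kidled_sample`, the helpers `idle_deficit_ns()` and `must_inject_now()`, and the exact "inject once the time left no longer exceeds the idle time still owed" rule are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the three fields reported by cpu/N/stats (ns). */
struct kidled_sample {
	uint64_t natural_idle_ns;   /* idle time that occurred on its own */
	uint64_t busy_ns;           /* time spent running tasks */
	uint64_t injected_idle_ns;  /* idle time forced by the injector */
};

/* Idle time still owed in the current interval; 0 once the quota is met. */
static uint64_t idle_deficit_ns(const struct kidled_sample *s,
				uint64_t interval_ns,
				unsigned int min_idle_percent)
{
	uint64_t quota = interval_ns * min_idle_percent / 100;
	uint64_t idle = s->natural_idle_ns + s->injected_idle_ns;

	return idle >= quota ? 0 : quota - idle;
}

/*
 * Assumed lazy policy: start injecting only when the time left in the
 * interval is no longer than the idle time still owed.
 */
static bool must_inject_now(const struct kidled_sample *s,
			    uint64_t interval_ns,
			    unsigned int min_idle_percent)
{
	uint64_t elapsed = s->natural_idle_ns + s->busy_ns +
			   s->injected_idle_ns;
	uint64_t left = elapsed >= interval_ns ? 0 : interval_ns - elapsed;
	uint64_t deficit = idle_deficit_ns(s, interval_ns, min_idle_percent);

	return deficit > 0 && left <= deficit;
}
```

With a 100 ms interval and min_idle_percent of 80, a CPU that has been idle for 10 ms and busy for 10 ms still owes 70 ms with 80 ms left, so nothing is injected yet; after 20 ms of busy time the two are equal and injection would have to start.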
Haven't read the whole thing, but do any of these stats really need to execute on the target CPU? They seem to be just readable fields. Or does it simply not matter because this proc call is too infrequent? Anyways, global broadcasts are discouraged; there is typically always someone who feels their RT latency is being messed up by them. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
To capture all the quantities for a CPU atomically, they must be read on the CPU. Basically, reading them on that CPU prevents them from changing as we read them. Also, if the CPU is idle (injected or otherwise), the quantities won't

It should be infrequent. The idle cycle injector does all the hard

I will look at it one more time to see if there is something else that
--
Who cares? by the time they reach userspace they've changed anyway. --
From: Salman Qazi <sqazi@google.com>

We add the concept of a "power interactive" task group. This is a task
group that, for the purposes of power capping, will receive special
treatment.

When there are no power interactive tasks on the runqueue, we inject idle
cycles unless we have already met the quota. However, when there are
power interactive tasks on the runqueue, we only inject idle cycles if we
would otherwise fail to meet the quota. As a result, we try our very best
to not hit the interactive tasks with the idle cycles.

The power interactivity status of a task group is determined by the
boolean value in cpu.power_interactive.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 Documentation/kidled.txt |   15 ++++
 include/linux/kidled.h   |   34 +++++++++
 include/linux/sched.h    |    3 +
 kernel/kidled.c          |  166 +++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched.c           |   80 ++++++++++++++++++++++
 5 files changed, 285 insertions(+), 13 deletions(-)

diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
index 1149e3f..564aa00 100644
--- a/Documentation/kidled.txt
+++ b/Documentation/kidled.txt
@@ -25,7 +25,7 @@ injected idle cycles are by convention reported as busy time, attributed to kidled.
-Operation:
+Basic Operation:
 The injecting component of the idle cycle injector is the kernel thread kidled. The measurements to determine when to inject idle cycles is done
@@ -38,3 +38,16 @@ quota. If that's the case, then we inject idle cycles until the end of the interval.
+Eager Injection:
+
+Above is true, when there is at least one tasks marked "interactive" on
+the CPU runqueue for the duration of the interval. Marking a task
+interactive involves setting power_interactive to 1 in its parent CPU
+cgroup. When such no such task is runnable and when we have not achieved
+the minimum idle percentage for the interval, we eagerly inject idle cycles.
+The purpose for doing so is to inject as many of the ...
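The eager versus lazy rule described in this patch can be summarised as a small decision function. The following is only a sketch in C under stated assumptions: the enum `inject_decision`, the function `decide_injection()`, and its parameters are hypothetical names, and the real patch makes this decision from hrtimer callbacks and scheduler hooks rather than from a single pure function.

```c
#include <stdint.h>

/* Hypothetical outcome of the per-CPU injection decision. */
enum inject_decision {
	INJECT_NONE,   /* let tasks run */
	INJECT_EAGER,  /* no interactive task runnable: inject right away */
	INJECT_LAZY    /* interactive task runnable, but quota is at risk */
};

/*
 * deficit_ns:     idle time still owed in this interval (0 = quota met)
 * time_left_ns:   time remaining until the interval ends
 * nr_interactive: runnable tasks in "power interactive" groups on this CPU
 */
static enum inject_decision decide_injection(uint64_t deficit_ns,
					     uint64_t time_left_ns,
					     unsigned int nr_interactive)
{
	if (deficit_ns == 0)
		return INJECT_NONE;     /* quota already met */
	if (nr_interactive == 0)
		return INJECT_EAGER;    /* nobody interactive to disturb */
	if (time_left_ns <= deficit_ns)
		return INJECT_LAZY;     /* otherwise the quota would be missed */
	return INJECT_NONE;             /* defer in favour of interactive tasks */
}
```

The ordering of the checks reflects the description: interactive tasks are only hit with idle cycles when deferring any further would make the quota unreachable.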
From: Salman Qazi <sqazi@google.com>

0) Power Capping Priority:

After we finish a lazy injection, we look at the task groups in the order
of increasing priority. For each task group, we attempt to assign as much
vruntime as possible, to cover the time that was spent doing the lazy
injection. Within each priority, we round-robin among the task groups
across different invocations to make sure that we don't consistently
penalize the same one.

The priorities themselves are specified through the value
cpu.power_capping_priority in the parent CPU cgroup of the tasks.

1) Load balancer awareness

Idle cycle injector is an RT thread. A consequence is that from the load
balancer's point of view, it is a particularly heavy thread. While we
appreciate the ability to preempt any CFS threads, it is useful to have a
lesser weight, as a heavy weight makes an injected CPU disproportionately
less desirable than other CPUs. We provide this by faking the weight of
the idle cycle injector to be equivalent to a CFS thread of a user
controllable nice value.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
 Documentation/kidled.txt |   38 ++++++++++++++++++++++-
 include/linux/kidled.h   |    6 ++++
 kernel/kidled.c          |    2 +
 kernel/sched.c           |   75 +++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_fair.c      |   77 +++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 192 insertions(+), 6 deletions(-)

diff --git a/Documentation/kidled.txt b/Documentation/kidled.txt
index 564aa00..400b97b 100644
--- a/Documentation/kidled.txt
+++ b/Documentation/kidled.txt
@@ -6,7 +6,7 @@ Overview:
 Provides a kernel interface for causing the CPUs to have some
 minimum percentage of the idle time.
 
-Interfaces:
+Basic Interfaces:
 
 Under /proc/sys/kernel/kidled/, we can find the following files:
 
@@ -51,3 +51,39 @@ tasks become runnable, they are more likely to fall in an interval when we aren't forcing the CPU idle.
+Power Capping ...
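The priority-ordered, round-robin charging described in item 0) can be sketched as follows. This is an illustration in C, not the code from kernel/sched_fair.c: the struct `tg_charge`, the function `charge_injection()`, the per-group `headroom_ns` cap standing in for "as much vruntime as possible", and the convention that a smaller `prio` value means lower priority are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task-group accounting used only for this sketch. */
struct tg_charge {
	int prio;              /* assumed: smaller value = lower priority */
	uint64_t headroom_ns;  /* assumed cap on what the group can absorb */
	uint64_t charged_ns;   /* injected time billed to the group so far */
};

/*
 * Bill injected_ns to the groups in g[], which must be sorted by
 * ascending prio. Groups with equal prio form a band; rr is a counter
 * that the caller advances between invocations so that a different
 * member of each band is billed first each time.
 * Returns the injected time that no group could absorb.
 */
static uint64_t charge_injection(struct tg_charge *g, size_t n,
				 uint64_t injected_ns, unsigned int rr)
{
	size_t start = 0;

	while (start < n && injected_ns > 0) {
		size_t end = start;
		size_t band, i;

		/* Find the end of the current priority band. */
		while (end < n && g[end].prio == g[start].prio)
			end++;
		band = end - start;

		for (i = 0; i < band && injected_ns > 0; i++) {
			struct tg_charge *t = &g[start + (rr + i) % band];
			uint64_t c = t->headroom_ns < injected_ns ?
				     t->headroom_ns : injected_ns;

			t->charged_ns += c;
			t->headroom_ns -= c;
			injected_ns -= c;
		}
		start = end;
	}
	return injected_ns;
}
```

Calling it with rr = 0 and then rr = 1 on the same two equal-priority groups bills a different group first each time, which is the "don't consistently penalize the same one" property from the description.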
.33 is way too old to submit patches against. That said, I really really dislike this approach, I would much rather see it tie in with power aware scheduling. --
But it's not too old for review purposes; as Salman said, they were sent to LKML for comments and review. I think it's well understood that when these patches are ready to be merged, they need to be submitted right before the merge window opens, against a recent -rc kernel. - Ted --
On Sat, 17 Apr 2010 13:08:08 -0400 s/submitted/refreshed/ ;-) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
No, they need to be in the relevant subsystem tree by then, patch submissions to subsystem trees right before the merge window opens are bound to get delayed another cycle. --
I think I can see your point: there is potentially better information about the power consumption of the CPU beyond the time it was busy. But please clarify: is your complaint the lack of use of this information, or are you arguing for a deeper integration into the scheduler (i.e. implementing it as part of the scheduler rather than --
Right, so the IBM folks who were looking at power aware scheduling were working on an interface to quantify the amount of power to save. But their approach was an extension of the regular power aware load-balancer, which basically groups tasks onto sockets so that whole sockets can go idle. However, Arjan explained to me that your approach, which idles the whole machine, has the advantage that memory banks can also go into idle mode and save power. Still, in the interest of cutting back on power-saving interfaces, it would be nice to see if there is anything we can do to merge these things, but I really haven't thought much about that yet. --
On Mon, 19 Apr 2010 21:01:41 +0200 one correction, this is not about power *saving*, it is about power *capping*. Power capping is pretty much energy inefficient by definition (and surely in practice), but it's about dealing with the reality of underdimensioned air conditioning or voltage rails.... Since socket offlining isn't as good as idle insertion, I'd rather focus on the latter... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
The power reduction benefit is architecture and topology dependent. On the POWER platform, for example, socket offlining could provide better power reduction than idle injection. As mentioned by Arjan, these approaches help reduce average power consumption to meet power and cooling limitations over a short interval. These are not general optimizations to improve operating efficiency; however, when used at certain workload and utilization levels, they can potentially provide overall energy savings. Having the SMP load balancer pull jobs away from a core or socket to allow it to remain idle for short bursts of time would be a good implementation. --Vaidy --
Indicating required system capacity to the load balancer and using that information to evacuate cores or sockets was the basic idea. Ref: http://lkml.org/lkml/2009/5/13/173

The challenge with that approach is the predictable evacuation or

Integrating with the load balancer will make the design cleaner and avoid forcefully running an idle thread. The scheduler should schedule 'nothing' so that idleness can happen and cpuidle governor

Well, this is an ideal goal. Injecting some amount of idle time across all cores/threads, preferably with overlapping time windows, will save quite a lot of power on x86. But at least overlapping idle times among sibling threads are required to get any power savings. This proposed approach does not yet have the ability to do overlapping

At least integrating this with the ACPI cpu aggregation driver can be a good first step. Both the drivers and the code serve the same power capping purpose, using idle time injection and running a high priority idle thread for a short duration.

ACPI Processor Aggregator Driver for 2.6.32-rc1
Ref: http://lkml.org/lkml/2009/10/3/13

--Vaidy
--
On Mon, Apr 19, 2010 at 9:50 PM, Vaidyanathan Srinivasan

I am actually not sure which one would be more aesthetically pleasing. Putting it into the scheduler would also place a lot of complexity (basically, the same set of timers) in

Agreed. For sibling threads, we need a hard guarantee of simultaneous injection, which is best achieved by using a single timer for all the siblings. It is in my list of things to do. Is it necessary for the first cut of idle cycle injector?

For improving power savings in the non-SMT case, as Arjan suggested, I will make the changes for heuristically aligning the injection on multiple cores. This will not be perfect, but then because it's a power optimization, it doesn't have to always work. I presume that this works best when done according to the CPU hierarchy? That is, it is more beneficial to idle an entire socket than the same number of

This is reasonable. I could merge the two implementations. Are there features in that implementation that our implementation is missing? From a cursory glance, the driver is a naive idle cycle injector, in that it doesn't take existing idle time or scheduler issues into
--
On Tue, 20 Apr 2010 10:52:58 -0700 not really; at least not for Intel CPUs. The problem is that due to the cache coherency, as long as one cpu in the system is awake, the memory controllers etc cannot go into a sleep mode... I would not be surprised if AMD has the same behavior... or anyone else with an integrated memory controller for that matter. -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
I may have missed this on lkml, but are there any ongoing community efforts on power aware scheduling? --
Well, mostly targeting load-balancing, which I gather is kinda useless for android seeing that it runs on UP hardware. But yeah, both IBM and Intel have contributed significant work in this area. --
Yes, mostly in power aware task placement and task consolidation in large SMP systems, and also some timer consolidation to improve low power idle residency. There is some tuning and optimization in the cpuidle governor that is related to power management but not to the core scheduler. As Peter mentioned, most of these may not apply to uniprocessor systems. --Vaidy --
On Tue, 13 Apr 2010 17:08:18 -0700

again I'll chime in to support this effort; it's the right thing to do for power limiting (as opposed to taking cores offline), and I'm happy to see progress being made. I'll start playing with your patches and use timechart to see how well

I still would like to see this ;-) It's a *HUGE* instant power delta. But it does not have to be perfect. As long as "on average" we align we're good enough. the easiest way is to round the time of the start of idle injection up to, say, double the duration of the injection period... and maybe to whole seconds or some round value of jiffies as well. It could even be done by "creeping" towards an aligned situation... rather than forcing instant alignment, as long as each time we inject idle time we get a step closer to being aligned.. very soon we WILL be aligned. (for example, if a cpu notices it's on the late side of an alignment window, it could inject a little shorter than usual, while if it actually...

while the HT case is clearly required to be solved to get actual power limits, ideally we can solve it using the same tricks we use for the above, just with a stronger bias...

I don't think we need to force the admin to set the same value per se, it's something that's just a matter of having the policy guy do this right... (but if you want to do "effective injection %age is minimum of

-- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
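The alignment idea above (round the injection start up to a multiple of twice the injection duration, or creep toward that boundary a bit at a time) can be sketched in C as below. This is an illustration only: `align_injection_start()`, `creep_injection_start()`, and the `max_step_ns` bound are hypothetical names, and the creeping variant delays the start by a bounded step rather than shortening the injection as the message suggests, which is a simplification.

```c
#include <stdint.h>

/*
 * Round the start of an injection up to the next multiple of twice the
 * injection duration, so that CPUs using the same duration tend to
 * start at the same instants without talking to each other.
 */
static uint64_t align_injection_start(uint64_t now_ns, uint64_t inject_ns)
{
	uint64_t window = 2 * inject_ns;
	uint64_t rem;

	if (window == 0)
		return now_ns;          /* nothing to align against */
	rem = now_ns % window;
	return rem == 0 ? now_ns : now_ns + (window - rem);
}

/*
 * "Creeping" variant: move toward the aligned start by at most
 * max_step_ns per injection instead of forcing instant alignment.
 */
static uint64_t creep_injection_start(uint64_t now_ns, uint64_t inject_ns,
				      uint64_t max_step_ns)
{
	uint64_t target = align_injection_start(now_ns, inject_ns);
	uint64_t delta = target - now_ns;

	return now_ns + (delta < max_step_ns ? delta : max_step_ns);
}
```

Because every CPU computes the boundary from the same clock, no cross-CPU communication is needed; the alignment is only approximate, which matches the "does not have to be perfect" requirement.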
