[PATCH 3/3] sched trace updated with normalized clock info.

From: Harald Gustafsson
Date: Friday, December 17, 2010 - 6:02 am

This is a request for comments on additions to sched deadline v3 patches.
Deadline scheduler is the first scheduler (I think) we introduce in Linux that
specifies the runtime in time and not only as a weight or a relation.
I have introduced a normalized runtime clock dependent on the CPU frequency.
This is used, in [PATCH 2/3], to calculate the deadline thread's runtime
so that approximately the same number of cycles are given to the thread
independent of the CPU frequency.

I suggest that it is important for users of hard reservation based schedulers
that the intended amount of work can be accomplished independent of the CPU frequency.
The usage of CPU frequency scaling is important on mobile devices and hence
the combination of deadline scheduler and cpufreq should be solved.

This patch series applies on a backported sched deadline v3 to a 2.6.34 kernel.
That backport can be made available if anyone is interested. It also runs on
my dual core ARM system.

So before I do this for the linux tip I would welcome a discussion about whether
this is a good idea, and also suggestions on how to improve it.

This first patch introduces the normalized runtime clock; this could be made
lockless instead if requested.

/Harald

Change-Id: Ie0d9b8533cf4e5720eefd3af860d3a8577101907

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
---
 kernel/sched.c |  103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 103 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c075664..2816371 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -72,6 +72,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/cpufreq.h>
 #include <linux/cgroup_cpufreq.h>
 
 #include <asm/tlb.h>
@@ -596,6 +597,16 @@ struct rq {
 
 	u64 clock;
 
+        /* Need to keep track of clock cycles since
+	 * dl need to work with cpufreq, is derived based
+	 * on rq clock and cpufreq.
+	 */
+       ...
From: Harald Gustafsson
Date: Friday, December 17, 2010 - 6:02 am

This patch does the actual changes to sched deadline v3 to
utilize the normalized runtime clock. Note that the
deadline/periods still use the regular runtime clock.

Change-Id: I75c88676e9e18a71d94d6c4e779b376a7ac0615f

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
---
 include/linux/sched.h |    6 +++
 kernel/sched.c        |    2 +
 kernel/sched_dl.c     |   82 +++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 89a158e..167771c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1301,6 +1301,12 @@ struct sched_dl_entity {
 	u64 deadline;		/* absolute deadline for this instance	*/
 	unsigned int flags;	/* specifying the scheduler behaviour   */
 
+        /*
+	 * CPU frequency normalized start time.
+	 * Put it inside DL since only one using it.
+	 */
+        u64 exec_start_norm;
+
 	/*
 	 * Some bool flags:
 	 *
diff --git a/kernel/sched.c b/kernel/sched.c
index 2816371..ddb18d2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2671,6 +2671,7 @@ static void __sched_fork(struct task_struct *p)
 	p->dl.dl_deadline = p->dl.deadline = 0;
 	p->dl.dl_period = 0;
 	p->dl.flags = 0;
+	p->dl.exec_start_norm = 0;
 
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->se.on_rq = 0;
@@ -8475,6 +8476,7 @@ void normalize_rt_tasks(void)
 			continue;
 
 		p->se.exec_start		= 0;
+		p->dl.exec_start_norm		= 0;
 #ifdef CONFIG_SCHEDSTATS
 		p->se.wait_start		= 0;
 		p->se.sleep_start		= 0;
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 5aa5a52..049c001 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -333,6 +333,40 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
 }
 
 /*
+ * A cpu freq normalized overflow check, see dl_entity_overflow
+ * function for details. Check against current cpu frequency.
+ * For this to hold, we must check if:
+ *   runtime / (norm_factor * (deadline - t)) < ...
From: Harald Gustafsson
Date: Friday, December 17, 2010 - 6:02 am

Updated the sched deadline v3 traces with the normalized runtime clock.
The delta execution runtime and the last start of execution are also
using the normalized clock.

Change-Id: I6f05a76ad876e8895f3f24940f3ee07f1cb0e8b8

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
---
 include/trace/events/sched.h |   25 +++++++++++++++++--------
 kernel/sched.c               |    2 +-
 kernel/sched_dl.c            |    2 +-
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 3307353..3c766eb 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -379,16 +379,17 @@ TRACE_EVENT(sched_stat_runtime,
  */
 TRACE_EVENT(sched_switch_dl,
 
-	TP_PROTO(u64 clock,
+	TP_PROTO(u64 clock, u64 clock_norm,
 		 struct task_struct *prev,
 		 struct task_struct *next),
 
-	TP_ARGS(clock, prev, next),
+	TP_ARGS(clock, clock_norm, prev, next),
 
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	prev_pid			)
 		__field(	u64,	clock				)
+		__field(	u64,	clock_norm			)
 		__field(	s64,	prev_rt				)
 		__field(	u64,	prev_dl				)
 		__field(	long,	prev_state			)
@@ -402,6 +403,7 @@ TRACE_EVENT(sched_switch_dl,
 		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->prev_pid	= prev->pid;
 		__entry->clock		= clock;
+		__entry->clock_norm	= clock_norm;
 		__entry->prev_rt	= prev->dl.runtime;
 		__entry->prev_dl	= prev->dl.deadline;
 		__entry->prev_state	= prev->state;
@@ -412,7 +414,7 @@ TRACE_EVENT(sched_switch_dl,
 	),
 
 	TP_printk("prev_comm=%s prev_pid=%d prev_rt=%Ld [ns] prev_dl=%Lu [ns] prev_state=%s ==> "
-		  "next_comm=%s next_pid=%d next_rt=%Ld [ns] next_dl=%Lu [ns] clock=%Lu [ns]",
+		  "next_comm=%s next_pid=%d next_rt=%Ld [ns] next_dl=%Lu [ns] clock=%Lu (%Lu) [ns]",
 		  __entry->prev_comm, __entry->prev_pid, (long long)__entry->prev_rt,
 		  (unsigned long long)__entry->prev_dl, __entry->prev_state ?
 		    ...
From: Peter Zijlstra
Date: Friday, December 17, 2010 - 7:29 am

I'm thinking this is going about it totally wrong..

Solving the CPUfreq problem involves writing a SCHED_DEADLINE aware
CPUfreq governor. The governor must know about the constraints placed on
the system by the task-set. You simply cannot lower the frequency when
your system is at u=1.

Once you have a governor that keeps the freq such that: freq/max_freq >=
utilization (which is only sufficient for deadline == period systems),
then you need to frob the SCHED_DEADLINE runtime accounting.
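
The governor constraint above (freq/max_freq >= utilization) could be checked
roughly as follows. This is only an illustrative sketch, not code from any
posted patch: the function name dl_min_freq, the BW_SHIFT fixed-point scale and
the ascending frequency table are all assumptions made for the example, and the
condition is only sufficient when deadline == period.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-point scale: a utilization of 1.0 is 1 << BW_SHIFT. */
#define BW_SHIFT 20

/*
 * Return the lowest frequency (kHz) from an ascending table such that
 * freq / max_freq >= total_bw, falling back to the maximum frequency.
 * total_bw is the summed runtime/period of all deadline tasks, scaled by
 * 1 << BW_SHIFT. The table must hold at least one entry (n > 0).
 */
static uint32_t dl_min_freq(const uint32_t *freqs, size_t n, uint64_t total_bw)
{
	uint32_t max_freq = freqs[n - 1];
	size_t i;

	for (i = 0; i < n; i++) {
		/* freq / max_freq >= bw  <=>  (freq << SHIFT) >= bw * max_freq */
		if (((uint64_t)freqs[i] << BW_SHIFT) >= total_bw * max_freq)
			return freqs[i];
	}
	return max_freq;
}
```

With a table of 200, 400, 800 and 1000 MHz, a task set using 30% of the CPU
would allow 400 MHz but not 200 MHz, and a task set at u=1 pins the CPU at the
maximum frequency.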

Adding a complete normalized clock to the system like you've done is a
total no-go, it adds overhead even for the !SCHED_DEADLINE case.

The simple solution would be to slow down the runtime accounting of
SCHED_DEADLINE tasks by freq/max_freq. So instead of having:

  dl_se->runtime -= delta;

you do something like:

  dl_se->runtime -= (freq * delta) / max_freq;

Which auto-magically grows the actual bandwidth, and since the deadlines
are wall-time already it all works out nicely. It also keeps the
overhead inside SCHED_DEADLINE.
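
As a self-contained illustration of the scaled accounting described above, the
following sketch charges a task only the fraction freq/max_freq of the elapsed
wall time. The struct dl_entity_sketch type and the dl_charge_runtime() helper
are hypothetical stand-ins for this example; they are not the sched deadline v3
structures.

```c
#include <stdint.h>

/* Hypothetical stand-in for the scheduler's per-task deadline entity. */
struct dl_entity_sketch {
	int64_t runtime;	/* remaining budget in ns, expressed at max frequency */
};

/*
 * Charge delta_ns of wall-clock execution, scaled by freq / max_freq.
 * At half speed a task is charged half the elapsed time, so its budget
 * lasts twice as long in wall time and covers about the same cycle count.
 */
static void dl_charge_runtime(struct dl_entity_sketch *dl_se, uint64_t delta_ns,
			      uint32_t freq_khz, uint32_t max_freq_khz)
{
	dl_se->runtime -= (int64_t)(((uint64_t)freq_khz * delta_ns) / max_freq_khz);
}
```

For example, a task with 10 ms of budget that runs for 4 ms at half the maximum
frequency is charged 2 ms and keeps 8 ms.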

 
--

From: Peter Zijlstra
Date: Friday, December 17, 2010 - 7:32 am

This is all assuming lowering the frequency is sensible in the first
place... but that's all part of the CPUfreq governor, it needs
to find a way to lower energy usage while conforming to the system
constraints.


--

From: Harald Gustafsson
Date: Friday, December 17, 2010 - 8:06 am

Yes, you and I have already suggested the safe way: not lowering it below
the total dl bandwidth. But for softer use cases it might be possible to
e.g. exclude threads with longer periods than cpufreq change periods when
computing the minimum frequency.
--

From: Peter Zijlstra
Date: Friday, December 17, 2010 - 8:16 am

I was more hinting at the fact that CPUfreq is at best a controversial
approach to power savings. I much prefer the whole race-to-idle
approach, it's much simpler.
--

From: Harald Gustafsson
Date: Friday, December 17, 2010 - 8:36 am

That depends to a large degree on architecture, chip technology node
and deployed user space applications. I don't agree that race-to-idle
is a good idea for some/many combinations, at least for embedded
systems. Of course race-to-idle is simpler, but it does not
necessarily give the lowest energy.

/Harald
--

From: Thomas Gleixner
Date: Friday, December 17, 2010 - 8:43 am

There's that and I have yet to see a proof that running code with
lower frequency and not going idle saves more power than running full
speed and going into low power states for longer time.

Also if you want to have your deadline scheduler aware of cpu
frequency changes, then simply limit the total bandwidth based on the
lowest possible frequency and it works always. This whole dynamic
bandwidth expansion is more an academic exercise than a practical
necessity.

Thanks,

	tglx
--

From: Harald Gustafsson
Date: Friday, December 17, 2010 - 8:54 am

This would severely limit the bandwidth available to deadline tasks. Which
then also reduces the use cases that could benefit from using sched deadline.
Also it would imply an over-reservation of the system, e.g. if you need 10%
BW of the total system and the lowest speed is at 20%, you basically need to
set a BW of 50%, to always be guaranteed that you get your 10% when cpufreq
clocks down. If you use cpufreq and want to use sched deadline this has
strong practical implications and is definitely not academic only.
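
The over-reservation arithmetic in this example can be written out as a small
helper. This is just a sketch of the calculation; the function name and the use
of integer percentages are assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Bandwidth (in percent of wall time) that must be reserved statically so
 * that needed_pct of the full-speed CPU is still delivered when the CPU
 * runs at min_speed_pct of its maximum frequency. Rounded up so the
 * guarantee is never short. A result above 100 means the reservation
 * cannot be admitted at all.
 */
static uint32_t dl_static_reservation_pct(uint32_t needed_pct,
					  uint32_t min_speed_pct)
{
	return (needed_pct * 100 + min_speed_pct - 1) / min_speed_pct;
}
```

With needed_pct = 10 and min_speed_pct = 20 this gives the 50% from the example
above. A task needing 30% of the full-speed CPU would require a 150%
reservation under the same rule, i.e. it could not be admitted.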
--

From: Dario Faggioli
Date: Friday, December 17, 2010 - 11:44 am

I was expecting a reply like this right from you! :-P

BTW, I mostly agree that race to idle is better. The point here is that
you might end in a situation where frequency scaling is enabled and/or
a particular frequency is statically selected for whatever reason. In
that case, making the scheduler aware of such could be needed to get the
expected behaviour out of it, independently from the fact it is probably
going to be worse than race-to-idle for power saving purposes... How
That could be a solution as well, although you're limiting a lot the
bandwidth available for deadline tasks. But something similar could be
Well, despite the fact that Harald is with Ericsson and has not much to
do with academia. :-D

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://retis.sssup.it/people/faggioli -- dario.faggioli@jabber.org
From: Pavel Machek
Date: Monday, January 3, 2011 - 7:17 am

That depends on cpu. Look at early athlon64s that could not even run
at full speed at battery power, and where cpu sleep states were not
saving much power. Race-to-idle does not work there. It works on
recent x86 cpus.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Harald Gustafsson
Date: Friday, December 17, 2010 - 8:02 am

I agree that this is the other part of the solution, which I have in a separate
ondemand governor, but that code is not ready for public review yet, since that
code also incorporates other ondemand changes I'm playing with. Such a
change to ondemand is quite simple: it just picks a frequency that at
least supports the total dl bandwidth. It might get tricky for systems which
support individual/clusters frequency for the cores on the system together with

I suspected this, it works as a proof of concept, but not good for mainline.
I will rework this part, if we in general think having the dl runtime
accounting

OK, I can do that. My thought from the beginning was that
the clock is read more often than it is updated, but I agree that
it has a negative impact on non-dl threads.

/Harald
--

From: Dario Faggioli
Date: Friday, December 17, 2010 - 11:48 am

... We can at least integrate this (done in the proper way, as Peter
suggests, i.e., _inside_ SCHED_DEADLINE) in the next release of the
patchset, can't we?

Regards,
Dario

-- 
From: Dario Faggioli
Date: Friday, December 17, 2010 - 11:56 am

We already did the very same thing (for another EU Project called
FRESCOR), although it was done in a userspace sort of daemon. It was
also able to consider other "high level" parameters like some estimation
of the QoS of each application and of the global QoS of the system.

However, converting the basic mechanism into a CPUfreq governor should
And, at least for the meantime, this seems a very very nice solution.
The only thing I don't like is that division which would end up in being
performed at each tick/update_curr_dl(), but we can try to find out a
way to mitigate this, what do you think Harald?

Regards,
Dario

-- 
From: Peter Zijlstra
Date: Friday, December 17, 2010 - 11:59 am

A simple mult and shift-right should do. You can either pre-compute for
a platform, or compute the inv multiplier in the cpufreq notifier thing.
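
A minimal sketch of the multiply-and-shift idea: the ratio freq/max_freq is
turned into a fixed-point multiplier once per frequency change, so the
accounting path needs no division. The names and the 10-bit precision
(DL_FREQ_SHIFT) are assumptions chosen for this example only.

```c
#include <stdint.h>

#define DL_FREQ_SHIFT 10	/* hypothetical fixed-point precision */

/*
 * Slow path: computed once per frequency change, e.g. from a cpufreq
 * notifier. Returns (freq / max_freq) scaled by 1 << DL_FREQ_SHIFT.
 */
static uint32_t dl_freq_mult(uint32_t freq_khz, uint32_t max_freq_khz)
{
	return (uint32_t)(((uint64_t)freq_khz << DL_FREQ_SHIFT) / max_freq_khz);
}

/* Hot path: one multiplication and one shift, no division. */
static uint64_t dl_scale_delta(uint64_t delta_ns, uint32_t mult)
{
	return (delta_ns * mult) >> DL_FREQ_SHIFT;
}
```

The multiplier is truncated, so ratios that are not exact in 10 bits slightly
undercharge the task; a larger shift reduces the error at the cost of headroom
in the 64-bit product.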
--

From: Dario Faggioli
Date: Friday, December 17, 2010 - 12:16 pm

Yeah, I was thinking about something like the last solution you just
proposed, but we'll consider all of them.

Thanks and Regards,
Dario

-- 
From: Harald Gustafsson
Date: Friday, December 17, 2010 - 12:31 pm

Yes, I don't mind doing that. Could you point me to the right part of
the FRESCOR code, Dario?
I will then compare that with what I already have.
--

From: Tommaso Cucinotta
Date: Sunday, December 19, 2010 - 5:11 pm

Hi there,

I'm sorry to join this discussion so late, but the unprecedented 20cm of
snow in Pisa had some non-negligible drawbacks on my return flight from
Perth :-).

Let me try to briefly recap what the outcomes of FRESCOR were, w.r.t. 
power management (but usually I'm not that brief :-) ):

1. from a requirements analysis phase, it comes out that it should be 
possible to specify the individual runtimes for each possible frequency, 
as it is well-known that the way computation times scale to CPU 
frequency is application-dependent (and platform-dependent); this 
assumes that as a developer I can specify the possible configurations of 
my real-time app, then the OS will be free to pick the CPU frequency 
that best suits its power management logic (i.e., keeping the minimum
frequency by which I can meet all the deadlines).

   Requirements Analysis:
   
http://www.frescor.org/index.php?mact=Uploads,cntnt01,getfile,0&cntnt01showtemplat...

   Proposed API:
   
http://www.frescor.org/index.php?mact=Uploads,cntnt01,getfile,0&cntnt01showtemplat...

   I also attach the API we implemented, however consider it is a mix of 
calls for doing both what I wrote above, and building an OS-independent 
abstraction layer for dealing with CPU frequency scaling (and not only) 
on the heterogeneous OSes we had in FRESCOR;

2. this was also assuming, at an API level, quite static settings
(typical of hard RT), in which I configure the system and don't change 
its frequency too often; for example, implications of power switches on 
hard real-time requirements (i.e., time windows in which the CPU is not 
operating during the switch, and limits on the max sustainable switching 
frequencies by apps and the like) have not been stated through the API;

3. for soft real-time contexts and Linux (consider FRESCOR targeted both 
hard RT on RT OSes and soft RT on Linux), we played with a much simpler ...
From: Harald Gustafsson
Date: Monday, December 20, 2010 - 2:44 am

I think this makes perfect sense, and I have explored related ideas,
but for the Linux kernel and softer realtime use cases I think it is
likely too much, at least if

I would not worry too much about switch transition effects. They are
in the same order of magnitude as other disturbances from timers and
interrupts and can easily be set to a certain smallest periodicity.
But if I was designing a system that needed real hard RT tasks I would
probably not enable cpufreq

Totally agree on this as well, and it would not be that difficult to
implement in Linux. For example, not just use the frequency as the
normalization but have a different architecture dependent normalization.
This would capture the general normalization but not on an application
level. But others might think this is complicating matters too much.
The other solution is that the deadline task does some over-reservation,
which is going to be

You mean this on an application level? I think we should test the
trivial rescaling first

If we use the trivial rescaling is this a problem? In my implementation
the runtime accounting is correct even when the frequency switch happens
during a period. Also with Peter's suggested implementation the runtime
will be correct

Don't you think that this was because you did it from user space?
I actually change the scheduler's accounting for the rest of the
runtime, i.e. can deal with


I was also of this impression for a while, that cpufreq scaling would
be of less importance. But when I looked at complex use cases, which are
common on embedded devices, and also new chip technology nodes, I had to
reconsider. Unfortunately I don't have any information that I can share
publicly. What is true is that the whole system energy needs to be
considered,
Sure, more data is always of interest.
--

From: Tommaso Cucinotta
Date: Monday, January 3, 2011 - 1:25 pm

That's why we proposed a user-space daemon taking care of this (see
our paper at the last RTLWS in Kenya). This way, the kernel only sees
the minimal information it needs to have, and all the rest is handled
from the user-space (i.e., awareness of different budgets for the various
CPU speeds, extra complexity due the mode-change protocol, power
management logic). However, this is compatible with a user-space
power-management logic. Instead, if we wanted a kernel-space one
(e.g., the current governors), then we would have to pass all the
This is what has always been done. However, there's an interesting thread
on the Jack mailing list in these weeks about the support for power
management (Jack may be considered to a certain extent hard RT due to
its professional usage [ audio glitches cannot be tolerated at all ],
even if it is definitely not safety critical). Interestingly, there they
proposed jackfreqd:

I was referring to the possibility to both specify (from within the app) the
additional budgets for the additional power modes, or not. In the former
case, the kernel would use the app-supplied values, in the latter case the
This is independent of how the budgets for the various CPU speeds are
computed. It is simply a matter of how to dynamically change the runtime
assigned to a reservation. The change cannot be instantaneous, and the
easiest thing to implement is that, at the next recharge, the new value is
applied. If you try to simply "reset" the current reservation without
precautions, you put at risk schedulability of other reservations.
CPU frequency changes make things slightly more complex: if you reduce
the runtimes and increase the speed, you need to be sure the frequency
increase already occurred before recharging with a halved runtime.
Similarly, if you increase the runtimes and decrease the speed, you need
to ensure runtimes are already incremented when the frequency switch
actually occurs, and this takes time because the increase in runtimes
cannot be ...
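
The "apply the new value at the next recharge" rule and the ordering constraint
for a slow-down can be sketched as below. This is an illustration only, not
FRESCOR or sched deadline code; struct reservation_sketch and the resv_*
helpers are hypothetical names, and a pending budget of 0 is assumed to mean
"no change requested".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reservation state for this example. */
struct reservation_sketch {
	uint64_t budget;		/* runtime per period currently in force (ns) */
	uint64_t pending_budget;	/* value to apply at next recharge, 0 if none */
	int64_t runtime;		/* budget left in the current period (ns) */
};

/* Request a new budget; the current period is left untouched. */
static void resv_request_budget(struct reservation_sketch *r, uint64_t new_budget)
{
	r->pending_budget = new_budget;
}

/* Called at the start of each period: only here does a new budget take effect. */
static void resv_recharge(struct reservation_sketch *r)
{
	if (r->pending_budget) {
		r->budget = r->pending_budget;
		r->pending_budget = 0;
	}
	r->runtime = (int64_t)r->budget;
}

/*
 * Before lowering the frequency, every enlarged budget must already be
 * in force, i.e. no reservation may still have a change pending.
 */
static bool resv_safe_to_slow_down(const struct reservation_sketch *r, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (r[i].pending_budget)
			return false;
	return true;
}
```

The opposite direction (raising the frequency, then shrinking budgets) would
need the mirror-image check: only request the smaller budgets once the
frequency increase has actually taken place.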
From: Harald Gustafsson
Date: Tuesday, January 4, 2011 - 5:16 am

Being an embedded audio engineer for many years I know that we audio
people take audio quality and realtime performance seriously. If I
understand correctly, what jackfreqd does is make sure that the CPU
frequency is controlled by the JACK DSP-load, which is sort of a CPU time
percentage devoted to JACK over an audio frame period. With sched
deadline and a resource manager knowing about JACK's needs this should be
possible to handle in an ondemand governor aware of sched deadline
bandwidths. The RM would set the periods and runtime budgets based on
JACK's DSP load,
OK, basically specifying the normalization values per power state for each
thread with the default being a linear scaling. I'll make sure that the default
normalization can be changed then but default initialized to linear based on
frequency in each freq state. Maybe a separate patch with a new prctl
But we don't change the runtime assigned to a reservation, think of it more
as the runtime is specified in "cycles". This is done either as in my patch that
the scheduler's runtime clock is running slower at lower clock speeds
or as Peter suggests that during runtime accounting the delta execution is
Right now I only act on the post cpu frequency change notification. I think
that on most systems the error due to the time it takes to change the
actual frequency of the core is on par with other errors like context switches,
See previous comment about the change of the runtime vs accounting
It is simple, basically two things are introduced.
1) At every post cpufreq notification the factor between the current
frequency and the maximum frequency is calculated, i.e. the linear scaling.
I also keep track of the time this happens so that the runtime clock progress
is done with the right factor also between sched clock updates. Hence I introduce
a clock that progresses approximately proportional to the CPU clock frequency.
(On some systems this could actually be obtained directly, so that is
a potential optimization by ...
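
A rough sketch of a clock of the kind described in 1): it advances in
proportion to the current frequency, and elapsed time is accounted with the
old factor before a new one is installed. This is not the code from the patch
(which is elided above); struct norm_clock_sketch, the function names and the
10-bit fixed-point factor are assumptions made for this example.

```c
#include <stdint.h>

#define NORM_SHIFT 10	/* hypothetical fixed-point precision */

/* Hypothetical per-runqueue state for a frequency-normalized clock. */
struct norm_clock_sketch {
	uint64_t clock_norm;	/* normalized time accumulated so far (ns) */
	uint64_t last_update;	/* regular clock value at the last update (ns) */
	uint32_t factor;	/* (freq / max_freq) << NORM_SHIFT */
};

/* Advance the normalized clock up to 'now' using the current factor. */
static void norm_clock_update(struct norm_clock_sketch *nc, uint64_t now)
{
	nc->clock_norm += ((now - nc->last_update) * nc->factor) >> NORM_SHIFT;
	nc->last_update = now;
}

/*
 * Post frequency change: first account the time that elapsed at the old
 * factor, then switch to the linear scaling for the new frequency.
 */
static void norm_clock_freq_changed(struct norm_clock_sketch *nc, uint64_t now,
				    uint32_t freq_khz, uint32_t max_freq_khz)
{
	norm_clock_update(nc, now);
	nc->factor = (uint32_t)(((uint64_t)freq_khz << NORM_SHIFT) / max_freq_khz);
}
```

Starting at full speed, 2000 ns of wall time followed by a switch to half speed
and another 2000 ns yields 3000 ns of normalized time.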
From: Harald Gustafsson
Date: Friday, December 17, 2010 - 12:27 pm

I'm a bit choked before the holidays, but I can fix this in the
beginning of next year.
At the same time as I do a new version of the current patches that takes

Yes, I will do something like this instead, need to make sure that
everything is considered first though.
--
