Re: [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation.

Previous thread: [PULL] virtio by Michael S. Tsirkin on Sunday, February 28, 2010 - 11:43 am. (1 message)

Next thread: [PATCH 1/1] sound/pci/hda/hda_codec.c: various coding style fixes by Norberto Lopes on Sunday, February 28, 2010 - 12:16 pm. (2 messages)
From: Raistlin
Date: Sunday, February 28, 2010 - 12:06 pm

Hi everybody,

Here it is the new version of the SCHED_DEADLINE patchset.

For the ones that missed the first posting[*], it is basically a new
scheduling policy (implemented inside its own scheduling class) aiming
at introducing deadline scheduling for Linux tasks, and it is being
developed by Evidence S.r.l. (http://www.evidence.eu.com) in the context
of the EU-Funded project ACTORS (http://www.actors-project.eu/).

From the previous version, mainly:
 - I followed all the suggestions --about bug-fixing and code
   restyling-- that Peter kindly gave us in his review;
 - in particular, I rewrote from scratch the group part, i.e.,
   bandwidth is now accounted for in a per-CPU fashion, such that
   the interface is the same we already have for RT-throttling;
 - I added a first drafted implementation of deadline inheritance, even
   though it surely needs thorough testing and refinements.

The official page of the project is:
  http://www.evidence.eu.com/sched_deadline.html

while the development is taking place at[**]:
  http://gitorious.org/sched_deadline/pages/Home
  http://gitorious.org/sched_deadline

Still Missing parts:
 - bandwidth reclaiming mechanisms, i.e., methods that avoid stopping
   the tasks until their next deadline when overrunning. There are at
   least three of them that are very simple, and patches are on their
   way;
 - porting to PREEMPT-RT is almost done --thanks to Nicola and Luca--
   and separate patches will be available on the project website soon;
 - migration of tasks throughout push and pull (as in -rt), to make it
   possible to deploy global-EDF scheduling. Patches are ready, they're
   just being tested and adapted to this last version;
 - refinements in deadline inheritance, especially regarding the
   possibility of retaining bandwidth isolation among non-interacting
   tasks. This is being studied from both theoretical and practical
   points of view, and hopefully we can have some demonstrative code
   soon.

As Fabio said ...
From: Raistlin
Date: Sunday, February 28, 2010 - 12:15 pm

Add a new function to the scheduling class interface. It is called
at the end of a cointext switch, if the going out task is in
TASK_DEAD state.

It might be of help in case the scheduling class wants to be
notified when one of its task dies, e.g. to perform some
cleanup actions.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78efe7c..d1de995 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1119,6 +1119,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork) (struct task_struct *p);
+	void (*task_dead) (struct task_struct *p);
 
 	void (*switched_from) (struct rq *this_rq, struct task_struct *task,
 			       int running);
diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..532dcf1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2801,6 +2801,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		if (prev->sched_class->task_dead)
+			prev->sched_class->task_dead(prev);
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org
From: Raistlin
Date: Sunday, February 28, 2010 - 12:17 pm

Add a scheduling class, in sched_dl.c and a new policy called
SCHED_DEADLINE. It basically is an implementation of the Earliest
Deadline First (EDF) scheduling algorithm, augmented with a mechanism
(called Constant Bandwidth Server, CBS) that make it possible to
isolate the behaviour of tasks between each other.

The typical -deadline task will be made up of a computation phase
(instance) which is activated on a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance need to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.

The EDF algorithms selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures each
task to run for at most the its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.

This patch:
 - defines the new scheduling policy and all the new data structures
   needed for its core implementation (e.g., task parameters storage,
   runqueues, etc.), and takes care of their initialization;
 - implements the core logic of the scheduling algorithm in the new
   scheduling class file;
 - provides all the glue code between the new scheduling class and
   the core scheduler and narrows the interactions between sched_dl
   and the other existing scheduling classes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
---
 include/linux/sched.h |   61 +++++
 kernel/fork.c         |   12 +
 kernel/hrtimer.c      |    2 +-
 kernel/sched.c        |   58 ...
From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 11:22 am

This is sad, but I'm afraid a reality we have to live with..
--

From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 11:22 am

This is not about lock/inheritance related overrun, right? But simply a
task that got its WCET wrong.


--

From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 11:22 am

We could write that as:

  if (dl_time_before(dl_se, rq->clock) ||
      !dl_check_bandwidth(dl_se, rq->clock)) {
	/* reset parameters */
  }

Also, I was wondering about a more descriptive name for
dl_check_bandwidth(), check _what_ about the bandwidth!?

dl_bandwidth_overflow() perhaps, that would also remove that negation.
--

From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 11:22 am

So what happens when we overflow u64?
--

From: Steven Rostedt
Date: Tuesday, April 13, 2010 - 11:55 am

Is the resolution in nanosecs starting from zero? If so, then we don't
need to worry about overflow for 583 years? And that is only if the
difference in time is greater than 292 years since dl_time_before() does
a:

  (s64)(a - b) < 0

The (s64)(a - b) returns the difference even on overflow as long as the
difference is not greater than 2^63


-- Steve


--

From: Peter Zijlstra
Date: Thursday, April 15, 2010 - 12:34 am

Its a multiplication of two u64, that's a lot easier to overflow.

--

From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 11:22 am

Why doesn't that live in/use sched_fork()?
--

From: Raistlin
Date: Sunday, February 28, 2010 - 12:18 pm

Add the interface bits needed for supporting scheduling algorithms
with extended parameters (e.g., SCHED_DEADLINE).
In fact, specifying a periodic/sporadic task that executes for a
given amount of runtime at each instance, and that is scheduled
according to the usrgency of their own timing constraints needs,
in general:
 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

In order of this model to be useable, both the data structure that
holds the scheduling parameter of tasks and the system calls that
deal with them have to be extended.
Unfortunately, modifying the existing struct sched_param would
break the ABI and result in potentially serious compatibility
issues with legacy binary code.

For these reasons, this patch:
 - defines the new struct sched_param_ex, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;
 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setscheduler_ex(), sched_setparam_ex()
   and sched_getparam_ex().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. However,
making them available on other archs is straightforward.

The SCHED_DEADLINE policy is, as of now, the only user of this
extended interface.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 arch/arm/include/asm/unistd.h      |    3 +
 arch/arm/kernel/calls.S            |    3 +
 arch/x86/ia32/ia32entry.S          |    3 +
 arch/x86/include/asm/unistd_32.h   |    5 +-
 arch/x86/include/asm/unistd_64.h   |    6 ++
 arch/x86/kernel/syscall_table_32.S |    3 +
 include/linux/sched.h              |   54 +++++++++++
 include/linux/syscalls.h           |    7 ++
 kernel/sched.c                     |  176 +++++++++++++++++++++++++++++++++++-
 9 files changed, 254 insertions(+), 6 ...
From: Raistlin
Date: Sunday, February 28, 2010 - 12:19 pm

Add resource limits for non-root tasks in using the SCHED_DEADLINE
policy, very similarly to what already exists for RT policies.

In fact, this patch:
 - adds the resource limit RLIMIT_DLDLINE, which is the minimum value
   a user task can use as its own deadline;
 - adds the resource limit RLIMIT_DLRTIME, which is the maximum value
   a user task can use as it own runtime.

Notice that to exploit these, a modified version of the ulimit
utility and a modified resource.h header file are needed. They
both will be available on the website of the project.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/asm-generic/resource.h |    7 ++++++-
 kernel/sched.c                 |   25 +++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 587566f..4a1d0e2 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -45,7 +45,10 @@
 					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
 #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
-#define RLIM_NLIMITS		16
+
+#define RLIMIT_DLDLINE		16	/* minimum deadline in us */
+#define RLIMIT_DLRTIME		17	/* maximum runtime in us */
+#define RLIM_NLIMITS		18
 
 /*
  * SuS says limits have to be unsigned.
@@ -87,6 +90,8 @@
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
+	[RLIMIT_DLDLINE]	= { ULONG_MAX, ULONG_MAX },		\
+	[RLIMIT_DLRTIME]	= { 0, 0 },				\
 }
 
 #endif	/* __KERNEL__ */
diff --git a/kernel/sched.c b/kernel/sched.c
index 19e90fc..61b1561 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6407,6 +6407,31 @@ recheck:
 	 * Allow unprivileged RT tasks to decrease priority:
 	 */
 	if (user && !capable(CAP_SYS_NICE)) {
+		if (dl_policy(policy)) {
+			u64 rlim_dline, rlim_rtime;
+			u64 dline, rtime;
+
+			if (!lock_task_sighand(p, ...
From: Raistlin
Date: Sunday, February 28, 2010 - 12:20 pm

Introduce sched_wait_interval() syscall (and scheduling class
interface call). In general, this aims at providing each scheduling
class with a mean of making one of its own task sleep for some time
according to some specific rule of the scheduling class itself.

As of now, the sched_dl scheduling class is the only one that needs
this kind of service, and thus the only one that implements the
class-specific logic. For other classes, calling it will result in
the same effect than calling clock_nanosleep with CLOCK_MONOTONIC
clockid and the TIMER_ABSTIME flag on.

For -deadline task, the idea is to give them the possibility of
notifying the scheduler a periodic/sporadic instance just ended and
ask it to wake up them at the beginning of the next one, with:
 - fully replenished runtime and
 - the absolute deadline set just one relative deadline interval
   away from the wakeup time.
This is an effective mean of synchronizing the task's behaviour with
the scheduler one, which might be useful in some situations.

This patch:
 - adds the new syscall (x83-32, x86-64 and ARM, but extension to all
   archs is strightforward);
 - implements the class-specific logic for -deadline tasks, making it
   impossible for them to exploit this call to use more bandwidth than
   they are given.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 arch/arm/include/asm/unistd.h      |    1 +
 arch/arm/kernel/calls.S            |    1 +
 arch/x86/ia32/ia32entry.S          |    1 +
 arch/x86/include/asm/unistd_32.h   |    3 +-
 arch/x86/include/asm/unistd_64.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/sched.h              |    2 +
 include/linux/syscalls.h           |    2 +
 kernel/sched.c                     |   39 +++++++++++++++++++
 kernel/sched_dl.c                  |   74 +++++++++++++++++++++++++++++++++++-
 10 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index ...
From: Raistlin
Date: Sunday, February 28, 2010 - 12:22 pm

Add the typical debugging output provided by sched_deug (if enabled)
also to the sched_dl class.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 kernel/sched_debug.c |   33 +++++++++++++++++++++++++++++++++
 kernel/sched_dl.c    |   26 ++++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 67f95aa..407a761 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -246,6 +246,38 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 #undef P
 }
 
+void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+	s64 min_deadline = -1, max_deadline = -1;
+	struct rq *rq = &per_cpu(runqueues, cpu);
+	struct sched_dl_entity *last;
+	unsigned long flags;
+
+	SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	if (dl_rq->rb_leftmost)
+		min_deadline = (rb_entry(dl_rq->rb_leftmost,
+					 struct sched_dl_entity,
+					 rb_node))->deadline;
+	last = __pick_dl_last_entity(dl_rq);
+	if (last)
+		max_deadline = last->deadline;
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+#define P(x) \
+	SEQ_printf(m, "  .%-30s: %Ld\n", #x, (long long)(x))
+#define PN(x) \
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(x))
+
+	P(dl_rq->dl_nr_running);
+	PN(min_deadline);
+	PN(max_deadline);
+
+#undef PN
+#undef P
+}
+
 static void print_cpu(struct seq_file *m, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -305,6 +337,7 @@ static void print_cpu(struct seq_file *m, int cpu)
 #endif
 	print_cfs_stats(m, cpu);
 	print_rt_stats(m, cpu);
+	print_dl_stats(m, cpu);
 
 	print_rq(m, rq, cpu);
 }
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index a1202ac..dee2668 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -44,6 +44,9 @@ static inline int dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+#define for_each_leaf_dl_rq(dl_rq, rq) \
+	for (dl_rq = &rq->dl; dl_rq; dl_rq = NULL)
+
 ...
From: Raistlin
Date: Sunday, February 28, 2010 - 12:23 pm

It is very likely that systems that wants/needs to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
the -deadline tasks under control.

For this reason a new version of the scheduling wakeup latency,
called "wakeup_dl", is introduced.

As a consequence of applying this patch there will be three wakeup
latency tracer:
 * "wakeup", that deals with all tasks in the system;
 * "wakeup_rt", that deals with -rt and -deadline tasks only;
 * "wakeup_dl", that deals with -deadline tasks only.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 kernel/trace/trace_sched_wakeup.c |   44 +++++++++++++++++++++++++++++++++---
 kernel/trace/trace_selftest.c     |   30 +++++++++++++++----------
 2 files changed, 58 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 0271742..a02ce04 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -27,6 +27,7 @@ static int			wakeup_cpu;
 static int			wakeup_current_cpu;
 static unsigned			wakeup_prio = -1;
 static int			wakeup_rt;
+static int			wakeup_dl;
 
 static arch_spinlock_t wakeup_lock =
 	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
@@ -214,9 +215,17 @@ probe_wakeup(struct rq *rq, struct task_struct *p, int success)
 	tracing_record_cmdline(p);
 	tracing_record_cmdline(current);
 
-	if ((wakeup_rt && !rt_task(p)) ||
-			p->prio >= wakeup_prio ||
-			p->prio >= current->prio)
+	/*
+	 * Semantic is like this:
+	 *  - wakeup tracer handles all tasks in the system, independently
+	 *    from their scheduling class;
+	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
+	 *    sched_rt class;
+	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
+	 */
+	if ((wakeup_dl && !dl_task(p)) ||
+	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
+	    (p->prio >= wakeup_prio || p->prio >= current->prio))
 		return;
 
 	pc = preempt_count();
@@ -228,7 +237,7 @@ probe_wakeup(struct rq ...
From: Raistlin
Date: Sunday, February 28, 2010 - 12:24 pm

Add to the scheduler the capability of notifying when -deadline tasks
overrun their maximum runtime and/or overcome their scheduling
deadline.

Runtime overruns might be quite common, e.g., due to coarse granularity
execution time accounting resolution or wrong assignment of tasks'
parameters (especially runtime). However, since the scheduler enforces
bandwidth isolation among tasks, this is not at all a threat to other
tasks' schedulability. For this reason, it is not common that a task
wants to be notified about that. Moreover, if we are using the
SCHED_DEADLINE policy with sporadic tasks, or to limit the bandwidth
of not periodic nor sporadic ones, runtime overruns are very likely
to occur at each and every instance, and again they should not be
considered a problem.

On the other hand, a deadline miss in any task means that, even if we
are trying at our best to keep each task isolated and to avoid
reciprocal interference among them, something went very, very bad,
and one task did not manage in consuming its runtime by its deadline.
This is something that should happen only on an oversubscribed
system, and thus being notified when it occurs could be very useful.

The user can specify the signals he wants to be sent to his task
during sched_setscheduler_ex(), raising two specific flags in the
sched_flags field of struct sched_param_ex:
 * SCHED_SIG_RORUN (if he wants to be signaled on runtime overrun),
 * SCHED_SIG_DMISS (if he wants to be signaled on deadline misses).

This patch:
 - adds the logic needed to send SIGXCPU signal to a -deadline task
   in case its actual runtime becomes negative;
 - adds the logic needed to send SIGXCPU signal to a -deadline task
   in case it is still being scheduled while its absolute deadline
   passes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   18 ++++++++++++++++++
 kernel/sched_dl.c     |   24 ++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git ...
From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 11:22 am

Can't this be a regular vararg function?

Also, I wonder, would we want to put some information in the sigaction
struct to allow the user to distinguish between the various SIGXCPU

--

From: Oleg Nesterov
Date: Tuesday, April 13, 2010 - 12:32 pm

Without ->siglock?

This is racy even if dl_task_of(se) == current, but I guess it can
be !current. For example, we must never set TIF_SIGPENDING without
wake_up_state(). A fatal signal should kill the whole process, etc.
Even sigaddset() itself can race with tkill, it is not atomic.

Oleg.

--

From: Raistlin
Date: Sunday, February 28, 2010 - 12:26 pm

Some method to deal with rt-mutexes and make sched_dl interact with
the current PI-coded is needed. The issues this raises are not at all
trivial, but a simple enough implementation is possible, avoiding
completely restructuring any of the two.
Such solution, that we are proposing here, is based on two key
concepts: (i) relative deadline inheritance, and (ii) inheriting
task's bandwidth boosting.

In some more details.

  (i) relative deadline is representative of how "urgent" each
      instance of a task is and, while executing, the absolute
      deadline --according to which all scheduling decision are
      undertaken-- is calculated right by adding such value to the
      instance activation time. Therefore, a task inheriting a
      relative deadline smaller than its original one will likely
      have a smaller absolute deadline too, which all looks very
      similar to what happens when priority inheritance (e.g.,
      because of rt-mutexes) take place among RT tasks. Moreover,
      relative deadline are meaningful on a per task basis
      independently from the CPU tasks are running on (and from
      any kind of clock synchronization issues among the different
      CPUs).
 (ii) in order of having the inheriting task finishing quickly its
      "operation" we also want it to be temporarily insensitive from
      the bandwidth enforcing mechanism, that would slow it down
      and provides another mean of priority inversion.
      It is true that this opens the door to situations during
      which the system is more loaded than we expected (and perhaps
      even oversubscribed), but if the fundamental assumption of
      these time intervals being small enough, this won't be
      anything unbearable.

This is not 100% theoretically correct solution, although some kind
of analysis can be tried. The main flaw is that it breaks the
bandwidth isolation property the sched_dl policy (also) strives for.
Different solutions exist, that do a better job in preserving ...
From: Peter Zijlstra
Date: Wednesday, April 14, 2010 - 1:25 am

Right, except that this makes the plist stuff O(INT_MAX) [ or rather
O(nr_dl_tasks) ].

But I guess it serves for testing.

I think it would be relatively straight forward to modify the existing
PI chain code to work using an RB-tree instead of the plist stuff.

An RB-tree would also make the whole ->prio mess much easier to solve,
we could simply make a more complex comparison function, like:

int rt_mutex_prio_less(struct task_struct *left, struct task_struct *right)
{
	if (left->prio < right->prio)
		return 1;

	if (left->prio == right->prio && left->prio == -1) {
		if (left->deadline < right->deadline)
			return 1;
	}

	return 0;
}

Yeah, good enough to start with ;-)

--

From: Peter Zijlstra
Date: Wednesday, April 14, 2010 - 2:45 am

Something like the below (totally untested)

---
 include/linux/init_task.h |   10 ++++
 include/linux/rtmutex.h   |   13 +----
 include/linux/sched.h     |    3 +-
 kernel/fork.c             |    3 +-
 kernel/rtmutex-debug.c    |    8 +--
 kernel/rtmutex.c          |  121 +++++++++++++++++++++++++++++++++++---------
 kernel/rtmutex_common.h   |   21 ++++----
 kernel/sched.c            |    4 --
 8 files changed, 125 insertions(+), 58 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index b1ed1cd..2ee9259 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/securebits.h>
+#include <linux/rbtree.h>
 #include <net/net_namespace.h>
 
 extern struct files_struct init_files;
@@ -103,6 +104,14 @@ extern struct cred init_cred;
 # define INIT_PERF_EVENTS(tsk)
 #endif
 
+#ifdef CONFIG_RT_MUTEXES
+# define INIT_RT_MUTEXES						\
+	.pi_waiters = RB_ROOT,						\
+	.pi_waiters_leftmost = NULL,
+#else
+# define INIT_RT_MUTEXES
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -172,6 +181,7 @@ extern struct cred init_cred;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_RT_MUTEXES							\
 }
 
 
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 8d522ff..8a68b29 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -13,7 +13,7 @@
 #define __LINUX_RT_MUTEX_H
 
 #include <linux/linkage.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/spinlock_types.h>
 
 extern int max_lock_depth; /* for sysctl */
@@ -27,7 +27,8 @@ extern int max_lock_depth; /* for sysctl */
  */
 struct rt_mutex {
 	raw_spinlock_t		wait_lock;
-	struct plist_head	wait_list;
+	struct rb_root		waiters;
+	struct rb_node		*waiters_leftmost;
 	struct ...
From: Raistlin
Date: Sunday, February 28, 2010 - 12:27 pm

In order of -deaadline scheduling to be effective and useful, it is
important that some method of having the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.

Since when RT-throttling has been introduced each task group have a
bandwidth associated to itself, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distrubution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.

The main differences between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones doesn't!), and thus we don't need a throttling
mechanism in the groups, which can be used nothing more than for
admission control of tasks.

This patch:
 - adds system wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
   that determine (i.e., runtime / period) the total bandwidth
   available on each CPU for -deadline tasks and task groups;

 - adds system wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_total_bw,
   that --after writing to it the index of an online CPU-- tells
   how much of the total available bandwidth of that CPU is
   currently allocated.

 - adds per-group deadline bandwidth management by means of:
    * /cgroup/<group>/cpu.dl_runtime_us,
    * /cgroup/<group>/cpu.dl_period_us,
   (same as above, but per-group);

 - adds per-group deadline bandwidth management by means of:
    * ...
From: Peter Zijlstra
Date: Wednesday, April 14, 2010 - 3:09 am

Yikes!!, I'm not sure we can sanely deal with set_task_cpu() doing that.

I'd much rather see us never attempting set_task_cpu() when we know its
not going to be possible.

That also means that things like set_cpus_allowed_ptr() /
sys_sched_setaffinity() will need to propagate the error back to their
users, which in turn will need to be able to cope.

--

From: Raistlin
Date: Sunday, February 28, 2010 - 12:28 pm

Add in Documentation/scheduler/ some hints about the design
choices, the usage and the future possible developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 Documentation/scheduler/sched-deadline.txt |  188 ++++++++++++++++++++++++++++
 init/Kconfig                               |    1 +
 2 files changed, 189 insertions(+), 0 deletions(-)

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..1ff0e1e
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,188 @@
+			Deadline Task and Group Scheduling
+			----------------------------------
+
+CONTENTS
+========
+
+0. WARNING
+1. Overview
+  1.1 Task scheduling
+  1.2 Group scheduling
+2. The interface
+  2.1 System wide settings
+  2.2 Task interface
+  2.3 Group interface
+  2.4 Default behavior
+3. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in an unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root
+ knows what he is doing.
+
+
+1. Overview
+===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that make it possible to isolate the behaviour of tasks between each other.
+
+
+1.1 Task scheduling
+-------------------
+
+ The typical -deadline task will be made up of a computation phase (instance)
+ which is activated on a periodic or sporadic fashion. The expected (maximum)
+ duration of such computation is called the task's runtime; the time interval
+ by which each instance need to be completed is called the task's relative
+ deadline. The task's absolute deadline is dynamically calculated as the
+ time instant a task (better, an ...
From: Peter Zijlstra
Date: Wednesday, April 14, 2010 - 3:17 am

Overall very nice code, awesome!



Right, I hope that with changing the PI code to use RB trees this will
become much less of a kludge than fudging the ->prio field.

--

Previous thread: [PULL] virtio by Michael S. Tsirkin on Sunday, February 28, 2010 - 11:43 am. (1 message)

Next thread: [PATCH 1/1] sound/pci/hda/hda_codec.c: various coding style fixes by Norberto Lopes on Sunday, February 28, 2010 - 12:16 pm. (2 messages)