Hi everybody, Here it is the new version of the SCHED_DEADLINE patchset. For the ones that missed the first posting[*], it is basically a new scheduling policy (implemented inside its own scheduling class) aiming at introducing deadline scheduling for Linux tasks, and it is being developed by Evidence S.r.l. (http://www.evidence.eu.com) in the context of the EU-Funded project ACTORS (http://www.actors-project.eu/). From the previous version, mainly: - I followed all the suggestions --about bug-fixing and code restyling-- that Peter kindly gave us in his review; - in particular, I rewrote from scratch the group part, i.e., bandwidth is now accounted for in a per-CPU fashion, such that the interface is the same we already have for RT-throttling; - I added a first drafted implementation of deadline inheritance, even though it surely needs thorough testing and refinements. The official page of the project is: http://www.evidence.eu.com/sched_deadline.html while the development is taking place at[**]: http://gitorious.org/sched_deadline/pages/Home http://gitorious.org/sched_deadline Still Missing parts: - bandwidth reclaiming mechanisms, i.e., methods that avoid stopping the tasks until their next deadline when overrunning. There are at least three of them that are very simple, and patches are on their way; - porting to PREEMPT-RT is almost done --thanks to Nicola and Luca-- and separate patches will be available on the project website soon; - migration of tasks throughout push and pull (as in -rt), to make it possible to deploy global-EDF scheduling. Patches are ready, they're just being tested and adapted to this last version; - refinements in deadline inheritance, especially regarding the possibility of retaining bandwidth isolation among non-interacting tasks. This is being studied from both theoretical and practical points of view, and hopefully we can have some demonstrative code soon. As Fabio said ...
Add a new function to the scheduling class interface. It is called
at the end of a cointext switch, if the going out task is in
TASK_DEAD state.
It might be of help in case the scheduling class wants to be
notified when one of its task dies, e.g. to perform some
cleanup actions.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
include/linux/sched.h | 1 +
kernel/sched.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78efe7c..d1de995 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1119,6 +1119,7 @@ struct sched_class {
void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
void (*task_fork) (struct task_struct *p);
+ void (*task_dead) (struct task_struct *p);
void (*switched_from) (struct rq *this_rq, struct task_struct *task,
int running);
diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..532dcf1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2801,6 +2801,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ if (prev->sched_class->task_dead)
+ prev->sched_class->task_dead(prev);
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
--
1.7.0
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)
http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org
Add a scheduling class, in sched_dl.c and a new policy called SCHED_DEADLINE. It basically is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that make it possible to isolate the behaviour of tasks between each other. The typical -deadline task will be made up of a computation phase (instance) which is activated on a periodic or sporadic fashion. The expected (maximum) duration of such computation is called the task's runtime; the time interval by which each instance need to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithms selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures each task to run for at most the its runtime every (relative) deadline length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, also tasks that do not strictly comply with the computational model sketched above can effectively use the new policy. This patch: - defines the new scheduling policy and all the new data structures needed for its core implementation (e.g., task parameters storage, runqueues, etc.), and takes care of their initialization; - implements the core logic of the scheduling algorithm in the new scheduling class file; - provides all the glue code between the new scheduling class and the core scheduler and narrows the interactions between sched_dl and the other existing scheduling classes. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it> Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it> --- include/linux/sched.h | 61 +++++ kernel/fork.c | 12 + kernel/hrtimer.c | 2 +- kernel/sched.c | 58 ...
This is sad, but I'm afraid a reality we have to live with.. --
This is not about lock/inheritance related overrun, right? But simply a task that got its WCET wrong. --
We could write that as:
if (dl_time_before(dl_se, rq->clock) ||
!dl_check_bandwidth(dl_se, rq->clock)) {
/* reset parameters */
}
Also, I was wondering about a more descriptive name for
dl_check_bandwidth(), check _what_ about the bandwidth!?
dl_bandwidth_overflow() perhaps, that would also remove that negation.
--
So what happens when we overflow u64? --
Is the resolution in nanosecs starting from zero? If so, then we don't need to worry about overflow for 583 years? And that is only if the difference in time is greater than 292 years since dl_time_before() does a: (s64)(a - b) < 0 The (s64)(a - b) returns the difference even on overflow as long as the difference is not greater than 2^63 -- Steve --
Its a multiplication of two u64, that's a lot easier to overflow. --
Why doesn't that live in/use sched_fork()? --
Add the interface bits needed for supporting scheduling algorithms with extended parameters (e.g., SCHED_DEADLINE). In fact, specifying a periodic/sporadic task that executes for a given amount of runtime at each instance, and that is scheduled according to the usrgency of their own timing constraints needs, in general: - a (maximum/typical) instance execution time, - a minimum interval between consecutive instances, - a time constraint by which each instance must be completed. In order of this model to be useable, both the data structure that holds the scheduling parameter of tasks and the system calls that deal with them have to be extended. Unfortunately, modifying the existing struct sched_param would break the ABI and result in potentially serious compatibility issues with legacy binary code. For these reasons, this patch: - defines the new struct sched_param_ex, containing all the fields that are necessary for specifying a task in the computational model described above; - defines and implements the new scheduling related syscalls that manipulate it, i.e., sched_setscheduler_ex(), sched_setparam_ex() and sched_getparam_ex(). Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a proof of concept and for developing and testing purposes. However, making them available on other archs is straightforward. The SCHED_DEADLINE policy is, as of now, the only user of this extended interface. Signed-off-by: Dario Faggioli <raistlin@linux.it> --- arch/arm/include/asm/unistd.h | 3 + arch/arm/kernel/calls.S | 3 + arch/x86/ia32/ia32entry.S | 3 + arch/x86/include/asm/unistd_32.h | 5 +- arch/x86/include/asm/unistd_64.h | 6 ++ arch/x86/kernel/syscall_table_32.S | 3 + include/linux/sched.h | 54 +++++++++++ include/linux/syscalls.h | 7 ++ kernel/sched.c | 176 +++++++++++++++++++++++++++++++++++- 9 files changed, 254 insertions(+), 6 ...
Add resource limits for non-root tasks in using the SCHED_DEADLINE
policy, very similarly to what already exists for RT policies.
In fact, this patch:
- adds the resource limit RLIMIT_DLDLINE, which is the minimum value
a user task can use as its own deadline;
- adds the resource limit RLIMIT_DLRTIME, which is the maximum value
a user task can use as it own runtime.
Notice that to exploit these, a modified version of the ulimit
utility and a modified resource.h header file are needed. They
both will be available on the website of the project.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
include/asm-generic/resource.h | 7 ++++++-
kernel/sched.c | 25 +++++++++++++++++++++++++
2 files changed, 31 insertions(+), 1 deletions(-)
diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 587566f..4a1d0e2 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -45,7 +45,10 @@
0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
#define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
-#define RLIM_NLIMITS 16
+
+#define RLIMIT_DLDLINE 16 /* minimum deadline in us */
+#define RLIMIT_DLRTIME 17 /* maximum runtime in us */
+#define RLIM_NLIMITS 18
/*
* SuS says limits have to be unsigned.
@@ -87,6 +90,8 @@
[RLIMIT_NICE] = { 0, 0 }, \
[RLIMIT_RTPRIO] = { 0, 0 }, \
[RLIMIT_RTTIME] = { RLIM_INFINITY, RLIM_INFINITY }, \
+ [RLIMIT_DLDLINE] = { ULONG_MAX, ULONG_MAX }, \
+ [RLIMIT_DLRTIME] = { 0, 0 }, \
}
#endif /* __KERNEL__ */
diff --git a/kernel/sched.c b/kernel/sched.c
index 19e90fc..61b1561 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6407,6 +6407,31 @@ recheck:
* Allow unprivileged RT tasks to decrease priority:
*/
if (user && !capable(CAP_SYS_NICE)) {
+ if (dl_policy(policy)) {
+ u64 rlim_dline, rlim_rtime;
+ u64 dline, rtime;
+
+ if (!lock_task_sighand(p, ...Introduce sched_wait_interval() syscall (and scheduling class interface call). In general, this aims at providing each scheduling class with a mean of making one of its own task sleep for some time according to some specific rule of the scheduling class itself. As of now, the sched_dl scheduling class is the only one that needs this kind of service, and thus the only one that implements the class-specific logic. For other classes, calling it will result in the same effect than calling clock_nanosleep with CLOCK_MONOTONIC clockid and the TIMER_ABSTIME flag on. For -deadline task, the idea is to give them the possibility of notifying the scheduler a periodic/sporadic instance just ended and ask it to wake up them at the beginning of the next one, with: - fully replenished runtime and - the absolute deadline set just one relative deadline interval away from the wakeup time. This is an effective mean of synchronizing the task's behaviour with the scheduler one, which might be useful in some situations. This patch: - adds the new syscall (x83-32, x86-64 and ARM, but extension to all archs is strightforward); - implements the class-specific logic for -deadline tasks, making it impossible for them to exploit this call to use more bandwidth than they are given. Signed-off-by: Dario Faggioli <raistlin@linux.it> --- arch/arm/include/asm/unistd.h | 1 + arch/arm/kernel/calls.S | 1 + arch/x86/ia32/ia32entry.S | 1 + arch/x86/include/asm/unistd_32.h | 3 +- arch/x86/include/asm/unistd_64.h | 2 + arch/x86/kernel/syscall_table_32.S | 1 + include/linux/sched.h | 2 + include/linux/syscalls.h | 2 + kernel/sched.c | 39 +++++++++++++++++++ kernel/sched_dl.c | 74 +++++++++++++++++++++++++++++++++++- 10 files changed, 124 insertions(+), 2 deletions(-) diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h index ...
Add the typical debugging output provided by sched_deug (if enabled)
also to the sched_dl class.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
kernel/sched_debug.c | 33 +++++++++++++++++++++++++++++++++
kernel/sched_dl.c | 26 ++++++++++++++++++++++++++
2 files changed, 59 insertions(+), 0 deletions(-)
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 67f95aa..407a761 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -246,6 +246,38 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
#undef P
}
+void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+ s64 min_deadline = -1, max_deadline = -1;
+ struct rq *rq = &per_cpu(runqueues, cpu);
+ struct sched_dl_entity *last;
+ unsigned long flags;
+
+ SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ if (dl_rq->rb_leftmost)
+ min_deadline = (rb_entry(dl_rq->rb_leftmost,
+ struct sched_dl_entity,
+ rb_node))->deadline;
+ last = __pick_dl_last_entity(dl_rq);
+ if (last)
+ max_deadline = last->deadline;
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+#define P(x) \
+ SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(x))
+#define PN(x) \
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(x))
+
+ P(dl_rq->dl_nr_running);
+ PN(min_deadline);
+ PN(max_deadline);
+
+#undef PN
+#undef P
+}
+
static void print_cpu(struct seq_file *m, int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -305,6 +337,7 @@ static void print_cpu(struct seq_file *m, int cpu)
#endif
print_cfs_stats(m, cpu);
print_rt_stats(m, cpu);
+ print_dl_stats(m, cpu);
print_rq(m, rq, cpu);
}
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index a1202ac..dee2668 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -44,6 +44,9 @@ static inline int dl_time_before(u64 a, u64 b)
return (s64)(a - b) < 0;
}
+#define for_each_leaf_dl_rq(dl_rq, rq) \
+ for (dl_rq = &rq->dl; dl_rq; dl_rq = NULL)
+
...It is very likely that systems that wants/needs to use the new SCHED_DEADLINE policy also want to have the scheduling latency of the -deadline tasks under control. For this reason a new version of the scheduling wakeup latency, called "wakeup_dl", is introduced. As a consequence of applying this patch there will be three wakeup latency tracer: * "wakeup", that deals with all tasks in the system; * "wakeup_rt", that deals with -rt and -deadline tasks only; * "wakeup_dl", that deals with -deadline tasks only. Signed-off-by: Dario Faggioli <raistlin@linux.it> --- kernel/trace/trace_sched_wakeup.c | 44 +++++++++++++++++++++++++++++++++--- kernel/trace/trace_selftest.c | 30 +++++++++++++++---------- 2 files changed, 58 insertions(+), 16 deletions(-) diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c index 0271742..a02ce04 100644 --- a/kernel/trace/trace_sched_wakeup.c +++ b/kernel/trace/trace_sched_wakeup.c @@ -27,6 +27,7 @@ static int wakeup_cpu; static int wakeup_current_cpu; static unsigned wakeup_prio = -1; static int wakeup_rt; +static int wakeup_dl; static arch_spinlock_t wakeup_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; @@ -214,9 +215,17 @@ probe_wakeup(struct rq *rq, struct task_struct *p, int success) tracing_record_cmdline(p); tracing_record_cmdline(current); - if ((wakeup_rt && !rt_task(p)) || - p->prio >= wakeup_prio || - p->prio >= current->prio) + /* + * Semantic is like this: + * - wakeup tracer handles all tasks in the system, independently + * from their scheduling class; + * - wakeup_rt tracer handles tasks belonging to sched_dl and + * sched_rt class; + * - wakeup_dl handles tasks belonging to sched_dl class only. + */ + if ((wakeup_dl && !dl_task(p)) || + (wakeup_rt && !dl_task(p) && !rt_task(p)) || + (p->prio >= wakeup_prio || p->prio >= current->prio)) return; pc = preempt_count(); @@ -228,7 +237,7 @@ probe_wakeup(struct rq ...
Add to the scheduler the capability of notifying when -deadline tasks overrun their maximum runtime and/or overcome their scheduling deadline. Runtime overruns might be quite common, e.g., due to coarse granularity execution time accounting resolution or wrong assignment of tasks' parameters (especially runtime). However, since the scheduler enforces bandwidth isolation among tasks, this is not at all a threat to other tasks' schedulability. For this reason, it is not common that a task wants to be notified about that. Moreover, if we are using the SCHED_DEADLINE policy with sporadic tasks, or to limit the bandwidth of not periodic nor sporadic ones, runtime overruns are very likely to occur at each and every instance, and again they should not be considered a problem. On the other hand, a deadline miss in any task means that, even if we are trying at our best to keep each task isolated and to avoid reciprocal interference among them, something went very, very bad, and one task did not manage in consuming its runtime by its deadline. This is something that should happen only on an oversubscribed system, and thus being notified when it occurs could be very useful. The user can specify the signals he wants to be sent to his task during sched_setscheduler_ex(), raising two specific flags in the sched_flags field of struct sched_param_ex: * SCHED_SIG_RORUN (if he wants to be signaled on runtime overrun), * SCHED_SIG_DMISS (if he wants to be signaled on deadline misses). This patch: - adds the logic needed to send SIGXCPU signal to a -deadline task in case its actual runtime becomes negative; - adds the logic needed to send SIGXCPU signal to a -deadline task in case it is still being scheduled while its absolute deadline passes. Signed-off-by: Dario Faggioli <raistlin@linux.it> --- include/linux/sched.h | 18 ++++++++++++++++++ kernel/sched_dl.c | 24 ++++++++++++++++++++++++ 2 files changed, 42 insertions(+), 0 deletions(-) diff --git ...
Can't this be a regular vararg function? Also, I wonder, would we want to put some information in the sigaction struct to allow the user to distinguish between the various SIGXCPU --
Without ->siglock? This is racy even if dl_task_of(se) == current, but I guess it can be !current. For example, we must never set TIF_SIGPENDING without wake_up_state(). A fatal signal should kill the whole process, etc. Even sigaddset() itself can race with tkill, it is not atomic. Oleg. --
Some method to deal with rt-mutexes and make sched_dl interact with
the current PI-coded is needed. The issues this raises are not at all
trivial, but a simple enough implementation is possible, avoiding
completely restructuring any of the two.
Such solution, that we are proposing here, is based on two key
concepts: (i) relative deadline inheritance, and (ii) inheriting
task's bandwidth boosting.
In some more details.
(i) relative deadline is representative of how "urgent" each
instance of a task is and, while executing, the absolute
deadline --according to which all scheduling decision are
undertaken-- is calculated right by adding such value to the
instance activation time. Therefore, a task inheriting a
relative deadline smaller than its original one will likely
have a smaller absolute deadline too, which all looks very
similar to what happens when priority inheritance (e.g.,
because of rt-mutexes) take place among RT tasks. Moreover,
relative deadline are meaningful on a per task basis
independently from the CPU tasks are running on (and from
any kind of clock synchronization issues among the different
CPUs).
(ii) in order of having the inheriting task finishing quickly its
"operation" we also want it to be temporarily insensitive from
the bandwidth enforcing mechanism, that would slow it down
and provides another mean of priority inversion.
It is true that this opens the door to situations during
which the system is more loaded than we expected (and perhaps
even oversubscribed), but if the fundamental assumption of
these time intervals being small enough, this won't be
anything unbearable.
This is not 100% theoretically correct solution, although some kind
of analysis can be tried. The main flaw is that it breaks the
bandwidth isolation property the sched_dl policy (also) strives for.
Different solutions exist, that do a better job in preserving ...Right, except that this makes the plist stuff O(INT_MAX) [ or rather
O(nr_dl_tasks) ].
But I guess it serves for testing.
I think it would be relatively straight forward to modify the existing
PI chain code to work using an RB-tree instead of the plist stuff.
An RB-tree would also make the whole ->prio mess much easier to solve,
we could simply make a more complex comparison function, like:
int rt_mutex_prio_less(struct task_struct *left, struct task_struct *right)
{
if (left->prio < right->prio)
return 1;
if (left->prio == right->prio && left->prio == -1) {
if (left->deadline < right->deadline)
return 1;
}
return 0;
}
Yeah, good enough to start with ;-)
--
Something like the below (totally untested)
---
include/linux/init_task.h | 10 ++++
include/linux/rtmutex.h | 13 +----
include/linux/sched.h | 3 +-
kernel/fork.c | 3 +-
kernel/rtmutex-debug.c | 8 +--
kernel/rtmutex.c | 121 +++++++++++++++++++++++++++++++++++---------
kernel/rtmutex_common.h | 21 ++++----
kernel/sched.c | 4 --
8 files changed, 125 insertions(+), 58 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index b1ed1cd..2ee9259 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
#include <linux/securebits.h>
+#include <linux/rbtree.h>
#include <net/net_namespace.h>
extern struct files_struct init_files;
@@ -103,6 +104,14 @@ extern struct cred init_cred;
# define INIT_PERF_EVENTS(tsk)
#endif
+#ifdef CONFIG_RT_MUTEXES
+# define INIT_RT_MUTEXES \
+ .pi_waiters = RB_ROOT, \
+ .pi_waiters_leftmost = NULL,
+#else
+# define INIT_RT_MUTEXES
+#endif
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -172,6 +181,7 @@ extern struct cred init_cred;
INIT_FTRACE_GRAPH \
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
+ INIT_RT_MUTEXES \
}
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 8d522ff..8a68b29 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -13,7 +13,7 @@
#define __LINUX_RT_MUTEX_H
#include <linux/linkage.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
#include <linux/spinlock_types.h>
extern int max_lock_depth; /* for sysctl */
@@ -27,7 +27,8 @@ extern int max_lock_depth; /* for sysctl */
*/
struct rt_mutex {
raw_spinlock_t wait_lock;
- struct plist_head wait_list;
+ struct rb_root waiters;
+ struct rb_node *waiters_leftmost;
struct ...In order of -deaadline scheduling to be effective and useful, it is
important that some method of having the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since when RT-throttling has been introduced each task group have a
bandwidth associated to itself, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distrubution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.
The main differences between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones doesn't!), and thus we don't need a throttling
mechanism in the groups, which can be used nothing more than for
admission control of tasks.
This patch:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU for -deadline tasks and task groups;
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_total_bw,
that --after writing to it the index of an online CPU-- tells
how much of the total available bandwidth of that CPU is
currently allocated.
- adds per-group deadline bandwidth management by means of:
* /cgroup/<group>/cpu.dl_runtime_us,
* /cgroup/<group>/cpu.dl_period_us,
(same as above, but per-group);
- adds per-group deadline bandwidth management by means of:
* ...Yikes!!, I'm not sure we can sanely deal with set_task_cpu() doing that. I'd much rather see us never attempting set_task_cpu() when we know its not going to be possible. That also means that things like set_cpus_allowed_ptr() / sys_sched_setaffinity() will need to propagate the error back to their users, which in turn will need to be able to cope. --
Add in Documentation/scheduler/ some hints about the design choices, the usage and the future possible developments of the sched_dl scheduling class and of the SCHED_DEADLINE policy. Signed-off-by: Dario Faggioli <raistlin@linux.it> --- Documentation/scheduler/sched-deadline.txt | 188 ++++++++++++++++++++++++++++ init/Kconfig | 1 + 2 files changed, 189 insertions(+), 0 deletions(-) diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt new file mode 100644 index 0000000..1ff0e1e --- /dev/null +++ b/Documentation/scheduler/sched-deadline.txt @@ -0,0 +1,188 @@ + Deadline Task and Group Scheduling + ---------------------------------- + +CONTENTS +======== + +0. WARNING +1. Overview + 1.1 Task scheduling + 1.2 Group scheduling +2. The interface + 2.1 System wide settings + 2.2 Task interface + 2.3 Group interface + 2.4 Default behavior +3. Future plans + + +0. WARNING +========== + + Fiddling with these settings can result in an unpredictable or even unstable + system behavior. As for -rt (group) scheduling, it is assumed that root + knows what he is doing. + + +1. Overview +=========== + + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is + basically an implementation of the Earliest Deadline First (EDF) scheduling + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) + that make it possible to isolate the behaviour of tasks between each other. + + +1.1 Task scheduling +------------------- + + The typical -deadline task will be made up of a computation phase (instance) + which is activated on a periodic or sporadic fashion. The expected (maximum) + duration of such computation is called the task's runtime; the time interval + by which each instance need to be completed is called the task's relative + deadline. The task's absolute deadline is dynamically calculated as the + time instant a task (better, an ...
Overall very nice code, awesome! Right, I hope that with changing the PI code to use RB trees this will become much less of a kludge than fudging the ->prio field. --
