Re: Deadline scheduling (was: Re: Rearranging layout of code in the scheduler)

Previous thread: [patch 4/7] mm: cpuset aware reclaim writeout by David Rientjes on Tuesday, October 28, 2008 - 9:08 am. (29 messages)

Next thread: RE: [PATCH] Disable ioat channel only on platforms where ile driver can load by Sosnowski, Maciej on Tuesday, October 28, 2008 - 9:30 am. (1 message)
From: Henrik Austad
Date: Tuesday, October 28, 2008 - 8:34 am

Hello,=20

Before I dive in, I should probably justify my motivations for writing
this email. I'm working away on implementing an EDF scheduler for real
time tasks in the kernel. This again leads to hacking at the existing
source as I'm not about to toss out the entire scheduler - just replace
(by some Kconfig switch) the RR/FIFO classes. As to why I'm looking at
EDF, I think the answer to that is a bit too long (and not appropriate
for this email anyway) so I'll leave that part out.

However, what I do mean to discuss is the current state of the scheduler
code. Working through the code, I must say I'm really impressed. The
code is clean, it is well thought out and the new approach with
sched_class and sched_entity makes it very modular. However, digging
deeper, I find myself turning more and more desperate, looking over my
shoulder for a way out.

Now, I'm in no doubt that the code *is* modular, that it *is* clean and
tidy, but coming from outside, it is not that easy to grasp it all. And,
it is not just the sheer size and complexity of the scheduler itself,
but also a lot with how the code is arranged.

=46or instance, functions like free_fair_sched_group,
alloc_fair_sched_group etc - does IMHO belong in sched_fair.c and not in
sched.c. The same goes for several rt-functions and structs.

So, if one drew up a list over all events that would cause functions in
sched.c to be called, this could be used to make a minimized 'interface'
and then let the scheduler call the appropriate function for the given
class in the same fashion sched_tick is used today.=20

What I would like, is to rip out all the *actual* scheduling logic and
put this in sched_[fair|rt].c and let sched be purely event-driven
(which would be a nice design goal in itself). This would also lead to
sched_[fair|rt].h, where the structs, macros, defines etc can be
defined. Today these are defined in just about everywhere, making the
code unnecessary complicated (I'm not going to say messy since I'm not
*that* ...
From: Peter Zijlstra
Date: Tuesday, October 28, 2008 - 9:50 am

You and a few other folks. The most interesting part of EDF is not the
actual scheduler itself (although there are fun issues with that as
well), but extending the Priority Inheritance framework to deal with all

I'd start out small by moving the functions to the right file. After

You might need to be careful there, or introduce sched_(fair|rt).h for

Sounds good, its been on the agenda for a while, but nobody ever got
around to it.

Other cleanups that can be done are:
 - get rid of all the load balance iterator stuff and move
   that all into sched_fair

 - extract the common sched_entity members and create:

   struct {
	struct sched_entity_common common;
	union {
		struct sched_entity fair;
		struct sched_rt_entity rt;
	}

Well, adding a sched_class, no need to replace anything besides that.



--

From: Henrik Austad
Date: Tuesday, October 28, 2008 - 12:41 pm

well, yes, you trade in priority inversion with deadline inversion. Probabl=
y a=20

Yes, I was thinking something along those lines, slowly moving things out. =
I=20
just wanted to clear the air before I got started in case someone had some=
=20
real issues with this approach. After all, if it's never going to get merge=
d,=20



I wasn't really aiming at replacing anything (besides the rt-stuff, but as=
=20
said earlier, that's not the issuen ow). I just want to move things to make=
=20
it a bit clearer.=20

=2D-=20
med Vennlig Hilsen - Yours Sincerely
Henrik Austad
From: faggioli
Date: Thursday, October 30, 2008 - 9:49 am

Yes, here we are! :-)

We also have some code, but it still is highly experimental and we are  
The main problem is that, especially to deal with SMP systems, we also  
need to investigate theoretical issues and find out what the best  
I'm not saying anything in possible sched.c and sched_{fair|rt}.c code  
rearranging, I also only wonder why replacing fixed priority RT  
scheduling with EDF.

I think they both could be in... Maybe we can discuss on where, I mean  
on which position, in the linked list of scheduling classes, to put  
each of them.

Regards,
Dario

Thanks for the Cc. Peter, I also added Fabio and Michael that, you  
know, are working to this thing. :-)

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
--

From: Peter Zijlstra
Date: Thursday, October 30, 2008 - 10:17 am

Right, ideally I'd like to see 2 EDF classes on top of FIFO, so that we
end up with the following classes

  hedf     - hard EDF
  sedf     - soft EDF (bounded tardiness)
  fifo/rr  - the current static priority scheduler
  fair     - the current proportional fair scheduler
  idle     - the idle scheduler

The two edf classes must share some state, so that the sedf class knows
about the utilisation consumed by hedf, and the main difference between
these two classes is the schedulability test.

[ NOTE: read EDF as any deadline scheduler, so it might as well be
        pfair PD^2 scheduler. ]

The few problems this gives are things like kstopmachine and the
migration threads, which should run at the max priority available on the
system.

[ NOTE: although possibly we could make an exception for the migration
        threads, as we generally don't need to migrate running RT tasks]

Perhaps we can introduce another class on top of hedf which will run
just these two tasks and is not exposed to userspace (yes, I understand
it will ruin just about any schedulability analysis).

Which leaves us with the big issue of priority inversion ;-)

We can do deadline inheritance and bandwidth inheritance by changing
plist to a rb-tree/binary heap and mapping the static priority levels
somewhere at the back and also propagating the actual task responsible
for the boost down the chain (so as to be able to do bandwidth
reported that DI between disjoint schedule domains (partitions) posed an
interesting problem.

Personally I'd like to see the full priority inversion issue solved by
something like the proxy execution protocol, however the SMP extension
thereof seems to be a tad expensive - found a book on graph theory, all
that remains is finding time to read it :-)

The advantage of proxy execution is that its fully invariant to the
schedule function and thus even works for proportional fair schedulers
and any kind of scheduler hierarchy.


--

From: Peter Zijlstra
Date: Thursday, October 30, 2008 - 10:48 am

Basically the problem that I'm looking at is:

Given a directed acyclic graph G, with entry nodes E (those who don't
have an inbound edge) and exit nodes X (those without an outbound edge)
then given an exit node x of X, split the graph into G1 and G2 so that
G1 contains x and all paths leading to it.


--

From: Henrik Austad
Date: Thursday, October 30, 2008 - 2:44 pm

ok, simply put:
* give each task a relative deadline (will probably introduce a new syscall, 
please don't shoot me).
* when the task enters TASK_RUNNING, set abosolute deadline to time_now + 
rel_deadline.
* insert task in rq, sorted by abs_deadline
* pick leftmost task and run it
* when task is done, pick next task

that's it.


I have a very clear idea about *what* the scheduler should do, the problem is 
how to get it to work :-)

Things would be a lot easier if code in the scheduler was a bit 'more 
separated. I have started separating things and moving it to separate files. 


Yes, well, EDF is not optimal for SMP systems, only for single core. However, 
you can do a pretty good attempt by assigning tasks to cores in a greedy 
fashion (simply put the next task at the CPU with the lowest load).

As a further optimization, I guess you could do the whole sced-domain thing to 

No. You should have *either* FIFO/RR *or* EDF, not both at the same time. If 
you absolutely require both, you should at least separate them on a per-core 
basis. If you mix them, they need to be aware of the other in order to make 

Well.. why not just treat *all* RT-tasks as *either* FIFO/RR or EDF? Having 
fifo and edf together will complicate things. And, people looking for edf, 
will not use fifo/rr anyway (famous last words).

Furthermore, hard/firm/soft could be treated as one class, but it should be 
treated differently at missed deadlines. 
* Soft: nothing. Scheduled at best effort, when deadline passes, prioritize 
other tasks to avoid cascading effect of deadlinemissies
* Firm: keep some statistics that the user can modify and a possible 
event-handler when limits are exceeded
* Hard: immediatly call a user registred function, or send signal to notify 

Well, the nice thing about EDF, is that it is provable optimal for any 
feasible schedule,  IOW, if all the tasks are schedulable, you can have 100% 
utilization of the CPU.

On a multicore, EDF is not optimal, as rearranging ...
From: Peter Zijlstra
Date: Friday, October 31, 2008 - 2:03 am

Well, yes, I know how to do EDF, and its trivially simple - on UP.

Sure, but implementing an EDF class isn't really all that hard - esp if
you only want UP.



The problem with greedy binpacking heuristics is that your schedulablity

We _have_ to have both. Its that simple.

POSIX mandates we have SCHED_FIFO/RR, there is tons and tons of
userspace that uses it, we cannot just replace it with a deadline
scheduler.

Also, start a linux-rt kernel and look at all the RT threads you have
(and that's only going to get worse when we go to threads per interrupt
handler instead of threads per interrupt source).

Who is going to assign useful and meaningful periods, deadlines and
execution times to all those threads?

So the only other option is to add a deadline scheduler and allow people
to use it.

When you do that, you'll see that you have to add it above the FIFO/RR

errr, not so, see above. FIFO/RR are static priority schedulers, whereas
deadline schedulers are job-static or dynamic priority schedulers. You
cannot simply map FIFO onto them, you need more information and who is
going to provide that?

Also, with the above mentioned requirement that we must have FIFO/RR,

Thing is, you have to run hard tasks first, and scheduler weaker forms
in its slack time, otherwise you cannot guarantee anything.

And the easiest way to do that slack time stealing, is with such a
hierarchy. Sure, you can share a lot of code etc. but you need separate

On UP - which is not interesting on a general purpose kernel that runs

That is partitioned EDF or PEDF for short. There are many other EDF
variants who do not suffer this, for example global EDF (GEDF).

GEDF can guarantee a utilization bound of 50% for hard-rt and is soft-rt
up to 100%.

Then there are the pfair class of scheduling algorithms which can


Yes, its brilliant, on UP :-)

But it should work on SMP as well with a bit more effort. But you really
need bandwidth inheritance as well, which is yet another bit of ...
From: Henrik Austad
Date: Friday, October 31, 2008 - 5:09 am

As a start, that is the approach at least I would like to take. Once you have a 
proven, functional EDF on a single core, you can extend that to handle several 


Well, not really. I mean, to be optimal, you should also consider WCET, but 
then, that's really not interesting as IMHO that's the userspace-programmer's 
responsibility. If the user wants to add tasks that sum up to 210% utilization, 
it's really not much we can do anyway. You certainly wouldn't want the kernel to 
stop accepting new jobs.

So, keep the kernel logic as simple as possible and move the job to the user. 
By keeping the kernel logic simple - we make the job easier for the end-users. A 
very complex EDF-scheduler will make the testing very difficult.

If, on the other hand, we *know* that the scheduler is not optimal, but that it 
behaves in a predictable manner, the end users have a simpler task of finding 
out why something bad happened.

Because, no matter *what* you do, and *how* you implement it, with *whatever* 
features, there will be cases when things fall apart, and  having a simple, 


I didn't mean to rip the whole fifo/rr out of the kernel, but adding a switch at 
compile-time so that you could choose *either* normal, static RT *or* EDF. Then 
we could, at least for the first few versions, have it depend on !SMP to avoid 

yes, that would certainly be a problem. But, if we can configure in either EDF 

well, periods is not that difficult. Basically you treat everything as 
asynchronus events and put them in the runqueue when they become runnable. And 
the application itself should set a relative deadline via the syscall to tell 
the kernel "when I become runnable, I must finish before time_now + 
rel_deadline"

RT-tasks that runs forever will certainly be a problem, but what about something 
like mark it as soft/firm and for every time a deadline is missed, have the 
trigger function set a new deadline? It would mean you have to rewrite a whole 

I'm not going to map anything ...
From: Peter Zijlstra
Date: Friday, October 31, 2008 - 6:30 am

We prefer not to, but sometimes there just isn't any other option.

Well, you're of course free to do so, but I don't think its a very

I agree that the scheduler should be simple, and even something like
PD^2 is relatively simple.

But I disagree that we should not do schedulability tests. Doing those,
and esp. enforcing tasks to their given limits increases the QoS for
others in the presence of faulty/malicious tasks.

Also, WCET is still the users responsibility.

If for each deadline task you specify a period, a deadline and a budget.
Then the WCET computation is reflected in the budget.

By enforcing the schedulability test and execution budget you raise the
quality of service, because even in the presence of a mis-behaving task
only that task will be impacted. The other tasks will still meet their

But _why_? why not leave FIFO/RR in? There is absolutely no downside to

This is, if you make the soft-deadline class aware of the hard-deadline
class's tasks and schedulability contraints, then you can keep the
soft-rt class schedulable too.

So srt is in no way less important, its just has less restrictions on
the schedule, therefore we can run it in the hrt slack/idle time.

And adding the schedulability test in the kernel avoids these starvation

Not that large indeed, but people are interested in running RT workloads
on machines in the 32/64 scale.

And even the embedded folks are now staring quad core arm11 chips in the






Yeah, the idea was to send SIGXCPU to tasks who exceed their budget (and

I'm thinking there's little useful left after all that ;-)
--

From: Henrik Austad
Date: Friday, October 31, 2008 - 8:05 am

My motivation was the increased complexity in meeting deadlines etc. By having 
only EDF (or only RR/FIFO) things would be a lot simpler. Life is not simple 



bah, you just picked all my arguments to shreds and convinced me I'm wrong. I 
guess it's back to the drawing board then, but now I have a better 
understanding at least.

Thans for all the feedback guys, especially Peter! I really appreciate this!

-- 
med Vennlig Hilsen - Yours Sincerely
Henrik Austad
--

From: Dario Faggioli
Date: Friday, October 31, 2008 - 11:08 am

Ok, that is how EDF work, and I know it... I was asking something
I agree, EDF is very simple and has a lot of very nice properties...
Problem is do decide how to assign a deadline to a task, if it is not a
classical soft or hard real-time one! :-O

I definitely agree that hard real time workloads are better handled by
partitioned EDF, but for soft ones, it was sad to suffer from the possible
CPU utilization loss it entails.

Also, what about resources shared by different tasks in different
CPU/partitions? And if you avoid sharing resources between tasks in
different partitions (is that acceptable?), what about system resources,
that are shared by _all_ tasks in the system by definition?

Sorry about asking so much questions, but these are the issues we are
trying to address, and I'm quite interested in knowing if you have any
Well, obvioulsy it's something that we have to think carefully, but I
can't see any harmful situation in having a deadline based and a fixed
priority based scheduling from where to pick task in (sorry) priority
Ok, maybe it's a matter of personal feelings, but I think that such a
Well, I personally think that partitioning _raises_ issues about resource

Regards,
Dario Faggioli

PS. Sorry for the webmail... I'm abroad and I've not my laptop with me :-(

--

From: Dario Faggioli
Date: Friday, October 31, 2008 - 10:49 am

Yep. Actually I think that schedulability test could be an issue as well,
especially if we like group/hierarchical approach, since the known
hierarchical admission tests are quite complex to implement and, probably,
time consuming if performed on-line in an highly dynamic (with respec to
Yeah, that's exactly what we was thinking too.

Actually, I was also thinking that having fixed priority scheduling
_before_ EDF could be of some benefit if you have to be sure that a task
is going to be executed at a very precise instant in time, but have not a
Well, could be a solution... And if this is only used for such kind of
special tasks, maybe it is not impossible to bound or account for their
scheduling contribution... But I'm just shooting inthe dark here, sorry
Yes, that's right, this is what we are investigating and trying to do in
Wow... So, good luck for that! :-)

Maybe it's my fault, but I see some issues with proxy execution and
similar protocols.
That is, if you have, let's say, task A blocked on task B, blocked on task
C, and you are using proxy execution, that means that you have not
dequeued A and B when they blocked, but that you, for example, filled a
pointer that reminds you, when you schedule them, that you have to
actually run C, am I right?

Now, what happens if C blocks on a non-rt mutex lock, or if it simply go
to sleep? Is that acceptable to track the blocking chain in order to
actually dequeue also A and B, and to requeue them again when C will wake
up as well?

Yes, I agree and I like it very much too. If you go for it, you could also
add bandwidth inheritance (e.g., for group scheduling) and things like
that almost for free (if wanted! :-))

Regards,
Dario Faggioli


--

Previous thread: [patch 4/7] mm: cpuset aware reclaim writeout by David Rientjes on Tuesday, October 28, 2008 - 9:08 am. (29 messages)

Next thread: RE: [PATCH] Disable ioat channel only on platforms where ile driver can load by Sosnowski, Maciej on Tuesday, October 28, 2008 - 9:30 am. (1 message)