Balancing Real Time Threads

Submitted by Jeremy
on October 20, 2007 - 2:42pm

"Currently in mainline the balancing of multiple RT threads is quite broken. That is to say that a high priority thread that is scheduled on a CPU with a higher priority thread, may need to unnecessarily wait while it can easily run on another CPU that's running a lower priority thread," began Steven Rostedt, describing his patchset to introduce improved real time task balancing. He explained:

"Balancing (or migrating) tasks in general is an art. Lots of considerations must be taken into account. Cache lines, NUMA and more. This is true with general processes which expect high through put and migration can be done in batch. But when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing. Right now an RT task can wait several milliseconds before it gets scheduled to run. And perhaps even longer. The migration thread is not fast enough to take care of RT tasks."

Steven described his test cases and numerous issues he noticed with the current balancing code, noting, "to solve this issue, I've changed the RT task balancing from a passive method (migration thread) to an active method. This new method is to actively push or pull RT tasks when they are woken up or scheduled."


From: Steven Rostedt
Subject: [patch 0/8] New RT Task Balancing
Date: Oct 19, 11:42 am 2007

Currently in mainline the balancing of multiple RT threads is quite broken.
That is to say that a high priority thread that is scheduled on a CPU
with a higher priority thread, may need to unnecessarily wait while it
can easily run on another CPU that's running a lower priority thread.

Balancing (or migrating) tasks in general is an art. Lots of considerations
must be taken into account. Cache lines, NUMA and more. This is true
with general processes which expect high through put and migration can
be done in batch.  But when it comes to RT tasks, we really need to
put them off to a CPU that they can run on as soon as possible. Even
if it means a bit of cache line flushing.

Right now an RT task can wait several milliseconds before it gets scheduled
to run. And perhaps even longer. The migration thread is not fast enough
to take care of RT tasks.

To demonstrate this, I wrote a simple test.
 
  http://rostedt.homelinux.com/rt/rt-migrate-test.c

  (gcc -o rt-migrate-test rt-migrate-test.c -lpthread)

This test expects a parameter to pass in the number of threads to create.
If you add the '-c' option (check) it will terminate if the test fails
one of the iterations. If you add this, pass in +1 threads.

For example, on a 4 way box, I used

  rt-migrate-test -c 5

What this test does is to create the number of threads specified (in this
case 5). Each thread is set as an RT FIFO task starting at a specified
prio (default 2) and each thread being one priority higher. So with this
example the 5 threads created are at priorities 2, 3, 4, 5, and 6.

The parent thread sets its priority to one higher than the highest of
the children (this example 7). It uses pthread_barrier_wait to synchronize
the threads.  Then it takes a time stamp and starts all the threads.
The threads when woken up take a time stamp and compares it to the parent
thread to see how long it took to be awoken. It then runs for an
interval (20ms default) in a busy loop. The busy loop ends when it reaches
the interval delta from the start time stamp. So if it is preempted, it
may not actually run for the full interval. This is expected behavior
of the test.

The numbers recorded are the delta from the thread's time stamp from the
parent time stamp. The number of iterations it ran the busy loop for, and
the delta from a thread time stamp taken at the end of the loop to the
parent time stamp.

Sometimes a lower priority task might wake up before a higher priority,
but this is OK, as long as the higher priority process gets the CPU when
it is awoken.

At the end of the test, the iteration data is printed to stdout. If a
higher priority task had to wait for a lower one to finish running, then
this is considered a failure. Here's an example of the output from
a run against git commit 4fa4d23fa20de67df919030c1216295664866ad7.

   1:       36      33   20041      39      33
 len:    20036   20033   40041   20039   20033
 loops: 167789  167693  227167  167829  167814

On iteration 1 (starts at 0) the third task started at 20ms after the parent
woke it up. We can see here that the first two tasks ran to completion
before the higher priority task was even able to start. That is a
20ms latency for the higher priority task!!!

So people who think that their audio would lose most latencies by upping 
the priority, may be in for a surprise. Since some kernel threads (like
the migration thread itself) may cause this latency.

To solve this issue, I've changed the RT task balancing from a passive
method (migration thread) to an active method.  This new method is
to actively push or pull RT tasks when they are woken up or scheduled.

On wake up of a task if it is an RT task, and there's already an RT task
of higher priority running on its runqueue, we initiate a push_rt_tasks
algorithm. This algorithm looks at the highest non-running RT task
and tries to find a CPU where it can run on. It only migrates the RT
task if it finds a CPU (of lowest priority) where the RT task
can run on and can preempt the currently running task on that CPU.
We continue pushing RT tasks until we can't push anymore.

If a RT task fails to be migrated we stop the pushing. This is possible
because we are always looking at the highest priority RT task on the
run queue. And if it can't migrate, then most likely the lower RT tasks
can not either.

There is one case that is not covered by this patch set. That is that
when the highest priority non running RT task has its CPU affinity
in such a way that it can not preempt any tasks on the CPUs running
on CPUs of its affinity. But a lower priority task has a larger affinity
to CPUs that it can run on. This is a case where the lower priority task
will not be migrated to those CPUS (although those CPUs may pull that task).
Currently this patch set ignores this scenario.

Another case where we push RT tasks is in the finish_task_switch. This is
done since the running RT task can not be migrated while it is running.
So if an RT task is preempted by a higher priority RT task, we can
migrate the RT task being preempted at that moment.

We also actively pull RT tasks. Whenever a runqueue is about to lower its
priority (schedule a lower priority task) a check is done to see if that
runqueue can pull RT tasks to it to run instead. A search is made of all
overloaded runqueues (runqueues with more than one RT task scheduled on it)
and checked to see if they have an RT task that can run on the runqueue
(affinity matches) and is of higher priority than the task the runqueue
is about to schedule. The pull algorithm pulls all RT tasks that match
this case.

With this patch set, I ran the rt-migrate-test over night in a while
loop with the -c option (which terminates upon failure) and it passed
over 6500 tests (each doing 50 wakeups each).

Here's an example of the output from the patched kernel. This is just to
explain it a bit more.

   1:    20060      61      55      56      61
 len:    40060   20061   20055   20056   20061
 loops: 227255  146050  168104  145965  168144

   2:       40      46      31      35      42
 len:    20040   20046   20031   20035   20042
 loops:     28  167769  167781  167668  167804

The first iteration (really 2cd, since we start at zero), is a typical run.
The lowest prio task didn't start executing until the other 4 tasks finished
and gave up the CPU.

The second iteration seems at first like a failure. But this is actually fine.
The lowest priority task just happen to schedule onto a CPU before the
higher priority tasks were woken up. But as you can see from this example,
the higher priority tasks still were able to get scheduled right away and
in doing so preempted the lower priority task. This can be seen by the
number of loops that the lower priority task was able to complete. Only 28.
This is because the busy loop terminates when the time stamp reaches the
time interval (20ms here) from the start time stamp. Since the lower priority
task was able to sneak in and start, it's time stamp was low. So after it
got preempted, and rescheduled, it was already past the run time interval
so it simply ended the loop.

Finally, the CFS RT balancing had to be removed in order for this to work.
Testing showed that the CFS RT balancing would actually pull RT tasks
from runqueues already assigned to the proper runqueues, and again cause
latencies.  With this new approach, the CFS RT balancing is not needed,
and I suggest that these patches replace the current CFS RT balancing.

Also, let me stress, that I made a great attempt to have this cause
as little overhead (practically none) to the non RT cases. Most of these
algorithms only take place when more than one RT task is scheduled on the
same runqueue.

Special thanks goes to Gregory Haskins for his advice and back and forth
on IRC with ideas. Although I didn't use his actual patches (his were
against -rt) it did help me clean up some of my code.  Also, thanks go to
Ingo Molnar himself for taking some ideas from his RT balance code in
the -rt patch.


Comments welcomed!

-- Steve

-

From: Steven Rostedt Subject: [patch 3/8] push RT tasks Date: Oct 19, 11:42 am 2007 This patch adds an algorithm to push extra RT tasks off a run queue to other CPU runqueues. When more than one RT task is added to a run queue, this algorithm takes an assertive approach to push the RT tasks that are not running onto other run queues that have lower priority. The way this works is that the highest RT task that is not running is looked at and we examine the runqueues on the CPUS for that tasks affinity mask. We find the runqueue with the lowest prio in the CPU affinity of the picked task, and if it is lower in prio than the picked task, we push the task onto that CPU runqueue. We continue pushing RT tasks off the current runqueue until we don't push any more. The algorithm stops when the next highest RT task can't preempt any other processes on other CPUS. TODO: The algorithm may stop when there are still RT tasks that can be migrated. Specifically, if the highest non running RT task CPU affinity is restricted to CPUs that are running higher priority tasks, there may be a lower priority task queued that has an affinity with a CPU that is running a lower priority task that it could be migrated to. This patch set does not address this issue. Note: checkpatch reveals two over 80 charater instances. I'm not sure that breaking them up will help visually, so I left them as is. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- kernel/sched.c | 8 + kernel/sched_rt.c | 227 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 234 insertions(+), 1 deletion(-) Index: linux-test.git/kernel/sched.c =================================================================== --- linux-test.git.orig/kernel/sched.c 2007-10-19 12:34:32.000000000 -0400 +++ linux-test.git/kernel/sched.c 2007-10-19 12:34:46.000000000 -0400 @@ -1855,6 +1855,8 @@ static void finish_task_switch(struct rq prev_state = prev->state; finish_arch_switch(prev); finish_lock_switch(rq, prev); + schedule_tail_balance_rt(rq); + fire_sched_in_preempt_notifiers(current); if (mm) mmdrop(mm); @@ -2088,11 +2090,13 @@ static void double_rq_unlock(struct rq * /* * double_lock_balance - lock the busiest runqueue, this_rq is locked already. */ -static void double_lock_balance(struct rq *this_rq, struct rq *busiest) +static int double_lock_balance(struct rq *this_rq, struct rq *busiest) __releases(this_rq->lock) __acquires(busiest->lock) __acquires(this_rq->lock) { + int ret = 0; + if (unlikely(!irqs_disabled())) { /* printk() doesn't work good under rq->lock */ spin_unlock(&this_rq->lock); @@ -2103,9 +2107,11 @@ static void double_lock_balance(struct r spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); + ret = 1; } else spin_lock(&busiest->lock); } + return ret; } /* Index: linux-test.git/kernel/sched_rt.c =================================================================== --- linux-test.git.orig/kernel/sched_rt.c 2007-10-19 12:33:23.000000000 -0400 +++ linux-test.git/kernel/sched_rt.c 2007-10-19 12:34:46.000000000 -0400 @@ -111,6 +111,12 @@ static struct task_struct *pick_next_tas } #ifdef CONFIG_SMP +/* Only try algorithms three times */ +#define RT_MAX_TRIES 3 + +static int double_lock_balance(struct rq *this_rq, struct rq *busiest); +static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep); + static inline void rq_prio_add_task(struct rq *rq, struct task_struct *p) { if (unlikely(rt_task(p)) && p->prio < rq->highest_prio) @@ -133,6 +139,227 @@ static inline void rq_prio_remove_task(s rq->highest_prio = MAX_RT_PRIO; } } + +/* Return the second highest RT task, NULL otherwise */ +static struct task_struct *pick_next_highest_task_rt(struct rq *rq) +{ + struct rt_prio_array *array = &rq->rt.active; + struct task_struct *next; + struct list_head *queue; + int idx; + + assert_spin_locked(&rq->lock); + + if (likely(rq->rt_nr_running < 2)) + return NULL; + + idx = sched_find_first_bit(array->bitmap); + if (unlikely(idx >= MAX_RT_PRIO)) { + WARN_ON(1); /* rt_nr_running is bad */ + return NULL; + } + + queue = array->queue + idx; + next = list_entry(queue->next, struct task_struct, run_list); + if (unlikely(next != rq->curr)) + return next; + + if (queue->next->next != queue) { + /* same prio task */ + next = list_entry(queue->next->next, struct task_struct, run_list); + return next; + } + + /* slower, but more flexible */ + idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1); + if (unlikely(idx >= MAX_RT_PRIO)) { + WARN_ON(1); /* rt_nr_running was 2 and above! */ + return NULL; + } + + queue = array->queue + idx; + next = list_entry(queue->next, struct task_struct, run_list); + + return next; +} + +/* Will lock the rq it finds */ +static struct rq *find_lock_lowest_rq(struct task_struct *task, + struct rq *this_rq) +{ + struct rq *lowest_rq = NULL; + cpumask_t cpu_mask; + int dst_cpu = -1; + int cpu; + int tries; + + cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed); + + for (tries = 0; tries < RT_MAX_TRIES; tries++) { + /* + * Scan each rq for the lowest prio. + */ + for_each_cpu_mask(cpu, cpu_mask) { + struct rq *rq = &per_cpu(runqueues, cpu); + + if (cpu == this_rq->cpu) + continue; + + /* We look for lowest RT prio or non-rt CPU */ + if (rq->highest_prio >= MAX_RT_PRIO) { + lowest_rq = rq; + dst_cpu = cpu; + break; + } + + /* no locking for now */ + if (rq->highest_prio > task->prio && + (!lowest_rq || rq->highest_prio > lowest_rq->highest_prio)) { + dst_cpu = cpu; + lowest_rq = rq; + } + } + + if (!lowest_rq) + break; + + /* if the prio of this runqueue changed, try again */ + if (double_lock_balance(this_rq, lowest_rq)) { + /* + * We had to unlock the run queue. In + * the mean time, task could have + * migrated already or had its affinity changed. + * Also make sure that it wasn't scheduled on its rq. + */ + if (unlikely(task_rq(task) != this_rq || + !cpu_isset(dst_cpu, task->cpus_allowed) || + task_running(this_rq, task) || + !task->se.on_rq)) { + spin_unlock(&lowest_rq->lock); + lowest_rq = NULL; + break; + } + } + + /* If this rq is still suitable use it. */ + if (lowest_rq->highest_prio > task->prio) + break; + + /* try again */ + spin_unlock(&lowest_rq->lock); + lowest_rq = NULL; + } + + return lowest_rq; +} + +/* + * If the current CPU has more than one RT task, see if the non + * running task can migrate over to a CPU that is running a task + * of lesser priority. + */ +static int push_rt_task(struct rq *this_rq) +{ + struct task_struct *next_task; + struct rq *lowest_rq; + int dst_cpu; + int ret = 0; + int paranoid = RT_MAX_TRIES; + + assert_spin_locked(&this_rq->lock); + + next_task = pick_next_highest_task_rt(this_rq); + if (!next_task) + return 0; + + retry: + if (unlikely(next_task == this_rq->curr)) + return 0; + + /* + * It's possible that the next_task slipped in of + * higher priority than current. If that's the case + * just reschedule current. + */ + if (unlikely(next_task->prio < this_rq->curr->prio)) { + resched_task(this_rq->curr); + return 0; + } + + /* We might release this_rq lock */ + get_task_struct(next_task); + + /* find_lock_lowest_rq locks the rq if found */ + lowest_rq = find_lock_lowest_rq(next_task, this_rq); + if (!lowest_rq) { + struct task_struct *task; + /* + * find lock_lowest_rq releases this_rq->lock + * so it is possible that next_task has changed. + * If it has, then try again. + */ + task = pick_next_highest_task_rt(this_rq); + if (unlikely(task != next_task) && task && paranoid--) { + put_task_struct(next_task); + next_task = task; + goto retry; + } + goto out; + } + + dst_cpu = lowest_rq->cpu; + + assert_spin_locked(&lowest_rq->lock); + + deactivate_task(this_rq, next_task, 0); + set_task_cpu(next_task, dst_cpu); + activate_task(lowest_rq, next_task, 0); + + resched_task(lowest_rq->curr); + + spin_unlock(&lowest_rq->lock); + + ret = 1; +out: + put_task_struct(next_task); + + return ret; +} + +/* + * TODO: Currently we just use the second highest prio task on + * the queue, and stop when it can't migrate (or there's + * no more RT tasks). There may be a case where a lower + * priority RT task has a different affinity than the + * higher RT task. In this case the lower RT task could + * possibly be able to migrate where as the higher priority + * RT task could not. We currently ignore this issue. + * Enhancements are welcome! + */ +static void push_rt_tasks(struct rq *rq) +{ + /* push_rt_task will return true if it moved an RT */ + while (push_rt_task(rq)) + ; +} + +static void schedule_tail_balance_rt(struct rq *rq) +{ + /* + * If we have more than one rt_task queued, then + * see if we can push the other rt_tasks off to other CPUS. + * Note we may release the rq lock, and since + * the lock was owned by prev, we need to release it + * first via finish_lock_switch and then reaquire it here. + */ + if (unlikely(rq->rt_nr_running > 1)) { + spin_lock_irq(&rq->lock); + push_rt_tasks(rq); + spin_unlock_irq(&rq->lock); + } +} +#else /* CONFIG_SMP */ +# define schedule_tail_balance_rt(rq) do { } while (0) #endif /* CONFIG_SMP */ static void put_prev_task_rt(struct rq *rq, struct task_struct *p) -- -

I would be interested to

muwlgr
on
October 20, 2007 - 11:53pm

I would be interested to read what Ingo Molnar said on this.
Only he decides what CFS improvements are reasonable and what are not.

I would be interested too.

Anonymous (not verified)
on
October 21, 2007 - 4:13am

I would be interested too.

Ingo wrote this RT

Anonymous (not verified)
on
October 21, 2007 - 5:14am

Ingo wrote this RT scheduling feature for -rt originally (Steve Rostedt picked it up and improved it), so I would be surprised if he had any objections.

rt scheduler to inforce the latency

Nicolas (not verified)
on
October 22, 2007 - 1:25am

I don't really understand the need of such rt scheme. If you have more than 2 thread per cpu running on a rt system, with a kind of static priority scheduler, you could never predict what will happen ! You can't ever garanty any latency. You could do it only if you know the behavior of each task.

It existe some scheduler algorithme that garanty a maximum latency. It's a rotating priority scheduler. You define time slot for each task, like in a fixed
scheduler. But instead of managing task direclty, you managed the priority.

For example, you could have 4 slots. Each slot means 1 ms. There is 3 tasks, TA, TB, TC. Each slot is filled like that:
- TA
- TB
- TA
- TC

So you have a 4 ms of latency for task TB and TC, and 2 ms for TA, what ever they do.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.