This patch adds the algorithm to pull tasks from RT overloaded runqueues.
When a pull RT is initiated, all overloaded runqueues are examined for
a RT task that is higher in prio than the highest prio task queued on the
target runqueue. If another runqueue holds a RT task that is of higher
prio than the highest prio task on the target runqueue is found it is pulled
to the target runqueue.Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/sched.c | 8 ++
kernel/sched_rt.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 157 insertions(+), 12 deletions(-)Index: linux-test.git/kernel/sched.c
===================================================================
--- linux-test.git.orig/kernel/sched.c 2007-10-19 12:34:46.000000000 -0400
+++ linux-test.git/kernel/sched.c 2007-10-19 12:36:05.000000000 -0400
@@ -2927,6 +2927,13 @@ static void idle_balance(int this_cpu, s
int pulled_task = -1;
unsigned long next_balance = jiffies + HZ;+ /*
+ * pull_rt_task returns true if the run queue changed.
+ * But this does not mean we have a task to run.
+ */
+ if (unlikely(pull_rt_task(this_rq)) && this_rq->nr_running)
+ return;
+
for_each_domain(this_cpu, sd) {
unsigned long interval;@@ -3614,6 +3621,7 @@ need_resched_nonpreemptible:
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);+ schedule_balance_rt(rq, prev);
prev->sched_class->put_prev_task(rq, prev);
next = pick_next_task(rq, prev);Index: linux-test.git/kernel/sched_rt.c
===================================================================
--- linux-test.git.orig/kernel/sched_rt.c 2007-10-19 12:36:02.000000000 -0400
+++ linux-test.git/kernel/sched_rt.c 2007-10-19 12:36:05.000000000 -0400
@@ -158,8 +158,17 @@ static inline void rq_prio_remove_task(s
}
}+static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+ if (!task_running(rq, p) &&
+ (cpu < 0 || cpu_isset(cpu, p->c...
we do pull_rt_task() in idle_balance() so, I think, there is no need
to do it twice.
i.e.
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);
+ else
+ schedule_balance_rt(rq, prev);hum?
moreover (continuing my previous idea on "don't pull more than 1 task
at once"), I wonder whether you really see cases when more than 1 task
have been successfully _pushed_ over to other run-queues at once...I'd expect the push/pull algorithm to naturally avoid such a
possibility. Let's say we have a few RT tasks on our run-queue that
are currently runnable (but not running)... the question is 'why do
they still here?'(1) because the previous attempt to _push_ them failed;
(2) because they were not _pulled_ from other run-queues.both cases should mean that other run-queues have tasks with higher
prios running at the moment.yes, there is a tiny window in schedule() between deactivate_task() [
which can make this run-queue to look like we can push over to it ]
and idle_balance() -> pull_rt_task() _or_ schedule_balance_rt() ->
pull_rt_task() [ where this run-queue will try to pull tasks on its
own ]_but_ the run-queue is locked in this case so we wait in
double_lock_balance() (from push_rt_task()) and run into the
competition with 'src_rq' (which is currently in the 'tiny window' as
described above trying to run pull_rt_task()) for getting both self_rq
and src_rq locks...this way, push_rt_task() always knows the task to be pushed (so it can
be a bit optimized) --- as it's either a newly woken up RT task with
(p->prio > rq->curr->prio) _or_ a preempted RT task (so we know 'task'
for both cases).To sum it up, I think that the pull/push algorithm should be able to
naturally accomplish the proper job pushing/pulling 1 task at once (as
described above)... any additional actions are just overhead or there--
Best regards,
Dmitry Adamushko
-
Hi Steven,
I guess, what may happen is that while we are running push_rt_tasks()
on CPU-k (say, as a result on try_to_wake_up(task_1))
and as this_rq->lock may get released (in double_lock_balance()) , we
may get in a 'race'
with try_to_wake_up(task_2) from (another) CPU-m.
It places a woken up task on the same run-queue (for which
push_rt_task() is already running on CPU-k) and, actually, run
push_rt_task() for the same rq again!So it may happen that both task_1 and task_2 will be pushed from the same CPU...
Do you see an error in my description? (it's a late hour so I can miss
something again ... sure, otherwise I'm almost perfect :-/ ) Can it
correlate with what you have observed in your tests?Otherwise, there is 1:1 relation : push_rt_task() is called for every
new (single) task activated by try_to_wake_up() and for a preempted
task... so it's not like a few tasks are placed on the run-queue and
then push_rt_tasks() is called once.btw., if this scenario may take place... maybe it would make sense to
have something like RTLB_INPROGRESS/PENDING and to avoid competing
push_rt_tasks() calls for the same 'rq' from different CPUs?
(although, there can be some disadvantages here as well. e.g. we would
likely need to remove 'max 3 tasks at once' limit and get,
theoretically, unbounded time spent in push_rt_tasks() on a single--
Best regards,
Dmitry Adamushko
-
--
I think I see what you're saying now. That we really should do the push on
wakeup, since only the one rt task that is woken up will be able to be
pushed.My code may seem a bit aggressive, but from logging, I see that we seldom
push more than one task. But if we don't go ahead and push more than one,
the rt-migrate-test will sometimes fail. There's funny cases when we need
to push more than one. I can't remember exactly what they were, but
sometimes because of races between the pushes and pulls where one will
miss the fact that an RT process is being queued while its searching to
pull a task, but the push will catch it. Or vice versa.The max tries is just a paranoid case where I don't want heavy scheduling
to cause an unbounded trying to push or pull tasks. According to the
logging while running the rt-migrate-tests, the loop there repeated
approximately 1 in 20, and iterated twice 1 in 100 and hit the third loop
twice in all my tests (over a 1000 iterations being logged). But logging
slows it down and modifies the results itself, and perhaps I'll add
statistics soon.I'm about to post a second batch of patches.
Thanks,
-- Steve
-
I think, 2 things should be accomplished here :
(1) be sure to pull the _highest_ prio task ;
i.e. the _highest_ prio task amongst all runnable (but not running) RT
tasks across all the run-queues which is capable of running on
this_cpu ;(2) don't pull more than 1 task at once.
that said, just pull the highest prio task and run it.
---
why (2)? Just to avoid situations when tasks are being pulled/pushed
back and forth between run-queues.Let's say we have 4 cpu system:
0: task(10) , task(92)
1: task(10), task(91)
2: task(10), task(90)
3: task(10)when task(10) on cpu#3 is inactive, we pull task(92), task(91),
task(90) and then run task(90)... in the mean time, some of cpu[0..2]
becomes inactive and pull task(91) and task(92) back and run them...
that may repeat again and again depending on when/how long task(10)
run on their corresponding cpus...so it seems to me that the more optimal behavior would be "don't pull
more than you can run at the moment -- that's 1".to this goal, something like find_lock_highest_rq() would be necessary.
and I guess, {get,put}_task_struct() should be used in pull_rt_task()
--
Best regards,
Dmitry Adamushko
-
Hi Dmitry,
I've thought about this, and played a little. The problem we have is
that we don't take locks when searching each run queue. So by the time
you got the "highest rq" to pull from, it may no longer be the highest.
So what do we do in that case? search again?If we are lucky and pull the highest first, then we will not pull
another one. Since we are not comparing the rq prio to current, but to
the highest task on the current rq. So once we pull a high prio task,
we only pull another one if we find a even higher prio task.Yes, it's a little inefficient, and can cause shuffling of task around.
But these tasks haven't actually run yet, so it isn't too much harm. But
the benefit is that when we finish, we have the highest task that canI'm not sure that's too much of an issue, since we are just switch tasks
on lists. As long has they haven't run yet, then it shouldn't be too
much harm. We do have a little bit of cache hits in the queues
themselves, but alternatives to fix this would probably cause the sameThe problem is that we have too many races to find the highest rq at the
moment. But by pulling the highest at the time, we should end up withNo, we don't need to do the get task on pull. With push, we start with a
task and we want to push it somewhere. On the double_lock_balance, we
might lose our rq lock. Which means that next could have been put on
another run queue, run to completion, and then exited destroying the
task. We still reference that task after the double_lock_balance, and if
it is not active anymore, we finish the push.With the pull, we are focusing on the run queue. If we had to release
the rq lock on double_lock_balance, we just pick the highest "next" on
the current run queue (which could have changed) and take the current
task on the src rq, which also could have changed. If the current task
on the src rq is higher in priority we move it over to the dst rq. So no
get/put_task_struct is needed.-- Steve
-
This seems to be the only usage of rt_overload. I'm not sure its worth
-
Ingo just brought up a good point. With large smp (where large is >64)
this will all suck chunks.rt_overload will bounce around the system, and the rto_cpumask updates
might already hurt.The idea would be to do this per cpuset, these naturally limit the
migraiton posibilities of tasks and would thus be the natural locality-
--
That sounds like a good idea. RT balancing on >64 CPUs should be limited.
Having a bounding cpuset would help.I'll try to come up with something.
Thanks!
-- Steve
-
| Greg Kroah-Hartman | [PATCH 005/196] Chinese: add translation of SubmittingDrivers |
| Linus Torvalds | Linux 2.6.25-rc4 |
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
| Andrew Morton | 2.6.23-rc6-mm1 |
git: | |
| Arjan van de Ven | Re: [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Andrew Morton | Re: [BUG] New Kernel Bugs |
| Radu Rendec | htb parallelism on multi-core platforms |
