Hi All,

The first two patches are from Mike and Steven on LKML, and the rest of my series depends on them. Patch #4 is a resend from earlier.

Series Summary:

1) Send IPI on overload regardless of whether prev is an RT task
2) Set the NEEDS_RESCHED flag on reception of RESCHED_IPI
3) Fix a mistargeted IPI on overload
4) Track which CPUs are in overload for efficiency
5) Track which CPUs are eligible for rebalancing for efficiency

These have been built and boot-tested on a 4-core Intel system.

Regards,
-Greg
-
From: Mike Kravetz <kravetz@us.ibm.com>
RESCHED_IPIs can be missed if more than one RT task is awoken simultaneously
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 93fd6de..3e75c62 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2207,7 +2207,7 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
* If we pushed an RT task off the runqueue,
* then kick other CPUs, they might run it:
*/
- if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
+ if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
schedstat_inc(rq, rto_schedule);
smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
}
-
From: Mike Kravetz <kravetz@us.ibm.com>
x86_64 based RESCHED_IPIs fail to set the reschedule flag
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
arch/x86_64/kernel/smp.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86_64/kernel/smp.c b/arch/x86_64/kernel/smp.c
index a5bf746..3ce6cad 100644
--- a/arch/x86_64/kernel/smp.c
+++ b/arch/x86_64/kernel/smp.c
@@ -505,13 +505,13 @@ void smp_send_stop(void)
}
/*
- * Reschedule call back. Nothing to do,
- * all the work is done automatically when
- * we return from the interrupt.
+ * Reschedule call back. Trigger a reschedule pass so that
+ * RT-overload balancing can pass tasks around.
*/
asmlinkage void smp_reschedule_interrupt(void)
{
ack_APIC_irq();
+ set_tsk_need_resched(current);
}
asmlinkage void smp_call_function_interrupt(void)
-
Any number of tasks could be queued behind the current task, so direct the
balance IPI at all CPUs (other than current)
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Mike Kravetz <kravetz@us.ibm.com>
CC: Peter W. Morreale <pmorreale@novell.com>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3e75c62..551629b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2209,7 +2209,7 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
*/
if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
schedstat_inc(rq, rto_schedule);
- smp_send_reschedule_allbutself_cpumask(current->cpus_allowed);
+ smp_send_reschedule_allbutself();
}
#endif
prev_state = prev->state;
-
The system currently evaluates all online CPUs whenever one or more enters
an rt_overload condition. This suffers from scalability limitations as
the # of online CPUs increases. So we introduce a cpumask to track
exactly which CPUs need RT balancing.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Peter W. Morreale <pmorreale@novell.com>
---
kernel/sched.c | 12 +++++++++---
1 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 551629b..a28ca9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -631,6 +631,7 @@ static inline struct rq *this_rq_lock(void)
#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
static __cacheline_aligned_in_smp atomic_t rt_overload;
+static cpumask_t rto_cpus;
#endif
static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -639,8 +640,11 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
if (rt_task(p)) {
rq->rt_nr_running++;
# ifdef CONFIG_SMP
- if (rq->rt_nr_running == 2)
+ if (rq->rt_nr_running == 2) {
+ cpu_set(rq->cpu, rto_cpus);
+ smp_wmb();
atomic_inc(&rt_overload);
+ }
# endif
}
#endif
@@ -653,8 +657,10 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
WARN_ON(!rq->rt_nr_running);
rq->rt_nr_running--;
# ifdef CONFIG_SMP
- if (rq->rt_nr_running == 1)
+ if (rq->rt_nr_running == 1) {
atomic_dec(&rt_overload);
+ cpu_clear(rq->cpu, rto_cpus);
+ }
# endif
}
#endif
@@ -1503,7 +1509,7 @@ static void balance_rt_tasks(struct rq *this_rq, int this_cpu)
*/
next = pick_next_task(this_rq, this_rq->curr);
- for_each_online_cpu(cpu) {
+ for_each_cpu_mask(cpu, rto_cpus) {
if (cpu == this_cpu)
continue;
src_rq = cpu_rq(cpu);
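
The bookkeeping in this patch can be modelled in userspace. The following is a minimal sketch, not kernel code: `struct rq_model`, `model_inc_rt()`, `model_dec_rt()` and `model_scan_candidates()` are hypothetical names, a plain 64-bit word stands in for `cpumask_t`, and the memory barrier and atomics of the real patch are left out because the model is single-threaded.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8               /* assumption: small fixed CPU count for the model */

struct rq_model {               /* hypothetical stand-in for struct rq */
    int cpu;
    unsigned int rt_nr_running;
};

static uint64_t rto_cpus;       /* bit n set => CPU n holds more than one RT task */
static int rt_overload;         /* number of overloaded CPUs */

/* An RT task is queued: the 1 -> 2 transition marks the CPU as overloaded. */
static void model_inc_rt(struct rq_model *rq)
{
    assert(rq->cpu >= 0 && rq->cpu < NR_CPUS);
    if (++rq->rt_nr_running == 2) {
        rto_cpus |= UINT64_C(1) << rq->cpu;
        rt_overload++;
    }
}

/* An RT task is dequeued: the 2 -> 1 transition clears the overload mark. */
static void model_dec_rt(struct rq_model *rq)
{
    assert(rq->rt_nr_running > 0);
    if (--rq->rt_nr_running == 1) {
        rt_overload--;
        rto_cpus &= ~(UINT64_C(1) << rq->cpu);
    }
}

/* Count the runqueues a balancer on this_cpu would visit: only marked CPUs. */
static int model_scan_candidates(int this_cpu)
{
    int visited = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == this_cpu || !(rto_cpus & (UINT64_C(1) << cpu)))
            continue;
        visited++;
    }
    return visited;
}
```

With two of eight CPUs overloaded, the scan touches at most two runqueues rather than all the online ones, which is the scalability point the changelog makes.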
-
The code currently blindly fires IPIs out whenever an overload occurs.
However, there are strict events that govern when an rt-overload exists
(e.g. RT task added to a RQ, or an RT task preempted). Therefore, we
attempt to efficiently track which CPUs are eligible for rebalancing, and we
only IPI those affected units.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Peter W. Morreale <pmorreale@novell.com>
---
kernel/sched.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index a28ca9d..6ca5f4f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -359,6 +359,8 @@ struct rq {
unsigned long rto_pulled;
#endif
struct lock_class_key rq_lock_key;
+
+ cpumask_t rto_resched; /* Which of our peers needs rescheduling */
};
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -645,6 +647,9 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
smp_wmb();
atomic_inc(&rt_overload);
}
+
+ cpus_or(rq->rto_resched, rq->rto_resched, p->cpus_allowed);
+ cpu_clear(rq->cpu, rq->rto_resched);
# endif
}
#endif
@@ -2213,9 +2218,15 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
* If we pushed an RT task off the runqueue,
* then kick other CPUs, they might run it:
*/
- if (unlikely(rt_task(current) && rq->rt_nr_running > 1)) {
+ if (unlikely(rt_task(current) && prev->se.on_rq && rt_task(prev))) {
+ cpus_or(rq->rto_resched, rq->rto_resched, prev->cpus_allowed);
+ cpu_clear(rq->cpu, rq->rto_resched);
+ }
+
+ if (unlikely(rq->rt_nr_running > 1 && !cpus_empty(rq->rto_resched))) {
schedstat_inc(rq, rto_schedule);
- smp_send_reschedule_allbutself();
+ smp_send_reschedule_allbutself_cpumask(rq->rto_resched);
+ cpus_clear(rq->rto_resched);
}
#endif
prev_state = prev->state;
-
Ok, I'm not liking these. I really hate setting TIF_NEED_RESCHED from the IPI handler. Also, I don't see how doing a resched pulls tasks to begin with.

How about keeping a per-rq variable that indicates the highest priority of runnable tasks, and on forced preemption looking for a target rq to send your last highest task to?

There is no need to broadcast a rebalance; that will only serialise on the local rq lock again. So pick a target rq, and stick with that.

Also, I think you meant to use cpus_and() with the rto and allowed masks.
-
-- Peter Zijlstra and I have been discussing this IPI resched change a bit. It seems to be overkill for what is needed. That is, send_reschedule is used elsewhere, in places where we do not actually want to do a schedule.

I'm thinking about trying out a method where each rq holds the priority of the task currently running on it. In the case where we get an rt overload (as in finish_task_switch) we do a scan of all CPUs (not taking any locks) and find the CPU with the lowest priority. If that CPU has a lower priority than a task waiting to run on the current CPU, then we grab the lock for that rq, check that the priority is still lower, and then push the rt task over to that CPU.

If, after taking the rq lock, a schedule has taken place and a higher-priority RT task is running, then we would try again, up to two more times. If this phenomenon happens two more times, we punt and don't do anything else (paranoid attempt to fall into trying over and over on a high context

The above three may be obsoleted by this new algorithm.
-- Steve
-
That is a good point. We definitely need a good "kick+resched" kind of mechanism here, but perhaps it should be RTO-specific instead of in the primary data path. I guess a rq-lock + set(NEEDS_RESCHED) + IPI works too. On the flip side: perhaps sending a reschedule IPI that doesn't reschedule is simply a misuse, and the misuse should be cleaned up

Great minds think alike ;) See attached for a patch I have been working on in this area. It currently addresses the "wake_up" path. It would also need to address the "preempted" path if we were to eliminate RTO outright. I wasn't going to share it quite yet, since it's still a work in

My patch doesn't address this yet, but I have been thinking about it for the last day or so. I was wondering if perhaps an RCU

On the same page with you, here.

Regards,
-Greg
It basically does TIF_WORK_MASK and TIF_NEED_RESCHED is one of the most frequently used of those. Using it for any other bit in that mask is IMHO not abuse. -
This has been compile tested (and no more ;-)
The idea here is to handle the situation where we have just scheduled in an
RT task and either we pushed a lesser RT task away, or more than one RT
task was scheduled on this CPU before scheduling occurred.
This patch answers that with an O(n) search of CPUs for the
CPU with the lowest-prio task running. When that CPU is found, the next
highest RT task is pushed to that CPU.
Some notes:
1) no lock is taken while looking for the lowest priority CPU. When one
is found, only that CPU's lock is taken and after that a check is made
to see if it is still a candidate to push the RT task over. If not, we
try the search again, for a max of 3 tries.
2) I only do this for the second highest RT task on the CPU queue. This
can be easily changed to do it for all RT tasks until no more can be
pushed off to other CPUs.
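
The search-then-recheck loop from note 1 can be sketched in userspace as below. This is a model under stated assumptions, not the patch itself: `struct rq_model`, `find_lowest_cpu()` and `push_rt_task_model()` are hypothetical names, a C11 `atomic_flag` stands in for the rq spinlock, priorities follow the kernel convention that a larger value means a lower priority, and "pushing" is modelled by simply overwriting the target's `curr_prio`.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS   4     /* assumption: fixed CPU count for the model */
#define MAX_TRIES 3     /* "for a max of 3 tries" */

struct rq_model {               /* hypothetical stand-in for struct rq */
    atomic_flag lock;           /* toy spinlock in place of rq->lock */
    atomic_int curr_prio;       /* prio of the running task; larger = lower priority */
};

static void rq_init(struct rq_model *rq, int prio)
{
    atomic_flag_clear(&rq->lock);
    atomic_init(&rq->curr_prio, prio);
}

static void rq_lock(struct rq_model *rq)
{
    while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
        ;       /* spin until the flag was previously clear */
}

static void rq_unlock(struct rq_model *rq)
{
    atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Lockless scan: CPU running the lowest-priority task that is still lower
 * than task_prio, or -1 if no such CPU exists. */
static int find_lowest_cpu(struct rq_model *rqs, int this_cpu, int task_prio)
{
    int best = -1;
    int best_prio = task_prio;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == this_cpu)
            continue;
        int prio = atomic_load_explicit(&rqs[cpu].curr_prio, memory_order_relaxed);
        if (prio > best_prio) {
            best = cpu;
            best_prio = prio;
        }
    }
    return best;
}

/* Returns the CPU the task was pushed to, or -1 if nothing could be done. */
static int push_rt_task_model(struct rq_model *rqs, int this_cpu, int task_prio)
{
    assert(this_cpu >= 0 && this_cpu < NR_CPUS);

    for (int tries = 0; tries < MAX_TRIES; tries++) {
        int cpu = find_lowest_cpu(rqs, this_cpu, task_prio);
        if (cpu < 0)
            return -1;          /* nobody is running anything lower */

        struct rq_model *dst = &rqs[cpu];
        rq_lock(dst);
        /* Recheck under the lock: the scan result may already be stale. */
        if (atomic_load(&dst->curr_prio) > task_prio) {
            atomic_store(&dst->curr_prio, task_prio);   /* model: task now runs there */
            rq_unlock(dst);
            return cpu;
        }
        rq_unlock(dst);         /* lost the race; search again */
    }
    return -1;                  /* punt after MAX_TRIES */
}
```

Only the chosen target's lock is ever taken, so a failed recheck costs one lock round trip before the next search.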
This is a simple approach right now, and is only being posted for
comments. I'm sure more can be done to make this more efficient or just
simply better.
-- Steve
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Index: linux-2.6.23-rc9-rt2/kernel/sched.c
===================================================================
--- linux-2.6.23-rc9-rt2.orig/kernel/sched.c
+++ linux-2.6.23-rc9-rt2/kernel/sched.c
@@ -304,6 +304,7 @@ struct rq {
#ifdef CONFIG_PREEMPT_RT
unsigned long rt_nr_running;
unsigned long rt_nr_uninterruptible;
+ int curr_prio;
#endif
unsigned long switch_timestamp;
@@ -1485,6 +1486,87 @@ next_in_queue:
static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
/*
+ * If the current CPU has more than one RT task, see if the non
+ * running task can migrate over to a CPU that is running a task
+ * of lesser priority.
+ */
+static int push_rt_task(struct rq *this_rq)
+{
+ struct task_struct *next_task;
+ struct rq *lowest_rq = NULL;
+ int tries;
+ int cpu;
+ int dst_cpu = -1;
+ int ret = 0;
+
+ BUG_ON(!spin_is_locked(&this_rq->lock));
+
+ next_task ...
-- I don't want that O(n) to scare anyone. It really is O(1), but with K = NR_CPUS. I was saying that if you grow NR_CPUS, the search grows too.
-- Steve
-
-- I need to add a comment at the top to state that this function can do this. Now, is it OK with the current caller? I need to look more closely. I might need to change where this is actually called. As stated, this hasn't been tested. But you are right, this needs to be OK, will do.

Note that this is where we need to see if it is ok to

Thanks for taking the time to look it over.
-- Steve
-
I did something like this a while ago for another scheduling project. A couple of 'possible' optimizations to think about are:

1) Only scan the remote runqueues once and keep a local copy of the remote priorities for subsequent 'scans'. Accessing the remote runqueues (CPU-specific cache lines) can be expensive.
2) When verifying priorities, just perform spin_trylock() on the remote runqueue. If you can immediately get it, great. If not, it implies someone else is messing with the runqueue and there is a good chance the data you pre-fetched (curr->Priority) is invalid. In this case it might be faster to just 'move on' to the next candidate runqueue/CPU, i.e. the next highest priority that the new task can preempt.

Of course, these 'optimizations' would change the algorithm. Trying to make any decision based on data that is changing is always a crap shoot. :)
-- Mike
-
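
The two optimizations above can be sketched together in userspace. This is an illustration under assumptions rather than a proposed implementation: `struct rq_model`, `rq_trylock()` and `push_with_trylock()` are hypothetical names, a C11 `atomic_flag` plays the role of the runqueue lock that `spin_trylock()` would try to take, and a larger priority value means a lower priority.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4               /* assumption: fixed CPU count for the model */

struct rq_model {               /* hypothetical stand-in for struct rq */
    atomic_flag lock;           /* toy lock in place of rq->lock */
    atomic_int curr_prio;       /* larger value = lower priority */
};

static void rq_init(struct rq_model *rq, int prio)
{
    atomic_flag_clear(&rq->lock);
    atomic_init(&rq->curr_prio, prio);
}

/* Non-blocking acquire: true only if the lock was free. */
static bool rq_trylock(struct rq_model *rq)
{
    return !atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire);
}

static void rq_unlock(struct rq_model *rq)
{
    atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Returns the CPU the task was pushed to, or -1 if no candidate worked. */
static int push_with_trylock(struct rq_model *rqs, int this_cpu, int task_prio)
{
    int snap[NR_CPUS];
    bool tried[NR_CPUS] = { false };

    assert(this_cpu >= 0 && this_cpu < NR_CPUS);

    /* Optimization 1: read each remote priority once into a local copy. */
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        snap[cpu] = atomic_load_explicit(&rqs[cpu].curr_prio, memory_order_relaxed);
    tried[this_cpu] = true;

    for (;;) {
        /* Next-best untried candidate according to the snapshot. */
        int best = -1;
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            if (tried[cpu] || snap[cpu] <= task_prio)
                continue;
            if (best < 0 || snap[cpu] > snap[best])
                best = cpu;
        }
        if (best < 0)
            return -1;
        tried[best] = true;

        /* Optimization 2: a contended lock hints the snapshot is stale; move on. */
        if (!rq_trylock(&rqs[best]))
            continue;

        bool ok = atomic_load(&rqs[best].curr_prio) > task_prio;
        if (ok)
            atomic_store(&rqs[best].curr_prio, task_prio);  /* model the push */
        rq_unlock(&rqs[best]);
        if (ok)
            return best;
    }
}
```

The trade-off matches the caveat above: the task may land on a CPU that is not the lowest-priority one, in exchange for never spinning on a remote lock.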
-- I was a bit scared of grabbing the lock anyway, because that's another cache hit (write side). So only grabbing the lock when needed would save

Yes indeed. The aim for now is to solve the latencies that you've been seeing. But really, there are still holes (small ones) that can cause a latency if a schedule happens "just right". Hopefully the final result of this work will close them too.
-- Steve
-
Yes. But with #2 below, your next try is the runqueue/CPU that is the next best candidate (after the trylock fails). The 'hope' is that there is more than one candidate CPU to push the task to.

Of course, you always want to try and find the 'best' candidate. My thought was that if you could find ANY CPU to take the task, that would be better than sending the IPI everywhere. With multiple runqueues/locks there is no way you can be guaranteed of making the 'best' placement. So, a good
-- Mike
-
It can be extended: search for a CPU that is running the lowest priority, or the same priority as the highest RT task (the one we are trying to push). If any CPU is found to be running a lower-priority task (the lowest among the CPUs), push the task to that CPU as above. Otherwise, if no CPU was found with a lower priority, find a CPU that runs a task of the same priority ....

Here there are two cases:

case 1: if the task currently running on this CPU has a higher priority than the task running (i.e. the active task priority), then the RT task can be pushed to the CPU (where it competes with the task of similar priority in round-robin fashion).
case 2: if the priority of the task that is running and of the task we are trying to push are the same (from the same queue .. queue->next->next....), then the balancing has to be done on the number of tasks that are running on these CPUs, making them run equal (or
-- Thanks
Giri
-
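
One possible reading of this extension, as a target-selection function. Everything here is an assumption made for illustration: `struct cpu_state`, its `nr_running` field and `pick_push_target()` are hypothetical, a larger priority value means a lower priority, and the "move only if it improves the balance" rule is a guess at where the truncated sentence was heading.

```c
#include <assert.h>

struct cpu_state {      /* hypothetical per-CPU summary */
    int curr_prio;      /* prio of the running task; larger value = lower priority */
    int nr_running;     /* RT tasks queued at that priority on this CPU */
};

/*
 * Pick a CPU to push a task of task_prio to, or -1 for none.
 * Step 1: prefer the CPU running the lowest-priority task below task_prio.
 * Step 2: otherwise consider CPUs running the same priority, and move only
 *         if that leaves the two queues more evenly loaded.
 */
static int pick_push_target(const struct cpu_state *cpus, int ncpus,
                            int this_cpu, int task_prio)
{
    int lower = -1;     /* best strictly-lower-priority candidate */
    int same = -1;      /* least-loaded same-priority candidate */

    assert(this_cpu >= 0 && this_cpu < ncpus);

    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (cpu == this_cpu)
            continue;
        if (cpus[cpu].curr_prio > task_prio) {
            if (lower < 0 || cpus[cpu].curr_prio > cpus[lower].curr_prio)
                lower = cpu;
        } else if (cpus[cpu].curr_prio == task_prio) {
            if (same < 0 || cpus[cpu].nr_running < cpus[same].nr_running)
                same = cpu;
        }
    }

    if (lower >= 0)
        return lower;
    /* Assumed balance rule: the target must stay below the source's count. */
    if (same >= 0 && cpus[same].nr_running + 1 < cpus[this_cpu].nr_running)
        return same;
    return -1;
}
```

The strict inequality in the balance rule keeps a task from bouncing between two CPUs whose queue lengths differ by one.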
