This is a mini-release of my series, rebased on -rt2. I have more changes downstream which are not quite ready for primetime, but I need to work on some other unrelated issues right now and I wanted to get what works out there. Changes since v5 *) Rebased to rt2 - Many of the functions of the original series are now included in base -rt so they are dropped out -
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/sched_rt.c | 10 ++--------
1 files changed, 2 insertions(+), 8 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 55da7d0..b59dc20 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -292,7 +292,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
{
struct rq *lowest_rq = NULL;
cpumask_t cpu_mask;
- int dst_cpu = -1;
int cpu;
int tries;
@@ -311,14 +310,12 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
lowest_rq = rq;
- dst_cpu = cpu;
break;
}
/* no locking for now */
if (rq->rt.highest_prio > task->prio &&
(!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
- dst_cpu = cpu;
lowest_rq = rq;
}
}
@@ -335,7 +332,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
* Also make sure that it wasn't scheduled on its rq.
*/
if (unlikely(task_rq(task) != this_rq ||
- !cpu_isset(dst_cpu, task->cpus_allowed) ||
+ !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
task_running(this_rq, task) ||
!task->se.on_rq)) {
spin_unlock(&lowest_rq->lock);
@@ -365,7 +362,6 @@ static int push_rt_task(struct rq *this_rq)
{
struct task_struct *next_task;
struct rq *lowest_rq;
- int dst_cpu;
int ret = 0;
int paranoid = RT_MAX_TRIES;
@@ -412,12 +408,10 @@ static int push_rt_task(struct rq *this_rq)
goto out;
}
- dst_cpu = lowest_rq->cpu;
-
assert_spin_locked(&lowest_rq->lock);
deactivate_task(this_rq, next_task, 0);
- set_task_cpu(next_task, dst_cpu);
+ set_task_cpu(next_task, lowest_rq->cpu);
activate_task(lowest_rq, next_task, 0);
resched_task(lowest_rq->curr);
-
Some RT tasks (particularly kthreads) are bound to one specific CPU. It is fairly common for one or more bound tasks to get queued up at the same time. Consider, for instance, softirq_timer and softirq_sched. A timer goes off in an ISR which schedules softirq_thread to run at RT50. Then during the handling of the timer, the system determines that it's time to smp-rebalance the system so it schedules softirq_sched to run from within the softirq_timer kthread context. So we are in a situation where we have two RT50 tasks queued, and the system will go into rt-overload condition to request other CPUs for help. The problem is that these tasks cannot ever be pulled away since they are already running on their one and only valid RQ. However, the other CPUs cannot determine that the tasks are unpullable without going through expensive checks/locking. Therefore the helping CPUS experience unecessary overhead/latencies regardless as they ineffectively try to process the overload condition. This patch tries to optimize the situation by utilizing the hamming weight of the task->cpus_allowed mask. A weight of 1 indicates that the task cannot be migrated, which may be utilized by the overload handling code to eliminate uncessary rebalance attempts. We also introduce a per-rq variable to count the number of migratable tasks that are currently running. We only go into overload if we have more than one rt task, AND at least one of them is migratable. Calculating the weight is probably relatively expensive, so it is only done when the cpus_allowed mask is updated (which should be relatively infrequent, especially compared to scheduling frequency) and cached in the task_struct. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- include/linux/sched.h | 2 + kernel/fork.c | 1 kernel/sched.c | 9 +++- kernel/sched_rt.c | 116 +++++++++++++++++++++++++++++++++++++++++-------- 4 files changed, 107 insertions(+), 21 deletions(-) diff --git ...
-- Not sure we need this optimization (not doing the nr_cpus_allowed calculation). Since, due to priority boosting, we will need to calculate I don't like this encapsulating of the doubl_lock_balance. There's a reason I kept it out like this. Mainly because this is the source of most (if not all) of the race condititions I fought. So this is a very sensitive area to touch. In fact, I see a few races already introduced by this patch. One is that you took out the "!task->se.or_rq" test. Which means that a task could have ran and deactivated itself (TASK_UNINTERRUPTIBLE) and you just added it to a run queue. (I hit that race ;-) Hmm, I must have missed where you update the mask at time of boosting. Anyway, we shouldn't. The set_cpus_allowed should be done for all tasks. Now you can have a handler to call for when a task changes class and changes weight. -- Steve -
I think you are misunderstanding the code here. The only optimization is that I didn't want to force every sched_class to define a default set_cpus_allowed member-fn. So instead, it first checks if its defined and invokes it if true. Else, the default behavior is to assign the mask and calculate the weight. If you look at the one and only implementation of this function (sched_rt.c:set_cpus_allowed_rt()), it also performs the assignment and calcs the mask. I disagree, but I admit it is not apparent at this level of the series why. The reason has to do with optimizing the wakeup path. Unless I am missing something, I think this placement is optimal, and that will Reducing code duplication is an effort to mitigate error propagation, This was intentional, but perhaps I should have been more explicit on the reason why. This is again related to future optimizations that I have yet to reveal. Namely, this code path can be invoked before the task has been woken up, and it is therefore normal for it to not be on a Hmm.. This is a good point. Perhaps we should check to see that the You aren't the first person to say something like that, and I always scratch my head at that one. It *is* open..its right there -40 lines from where it used to be. Its not like I gave you a obfuscated .o Agreed. You weren't aware of my issue, as I wasn't aware of yours. I ? I don't understand what you mean. Are you referring to PI boosting? What does that have to do with the mask? Are you referring to when we change the mask? We already invoke this handler from the sched.c:set_cpus_allowed(). Can you clarify this It is. Im just lazy and had the default case handled inside the sched_c:set_cpus_allowed() call. Perhaps I should just let all Well, at least for what I am designing here, we are already covered. If the task changes class, it would be dequeued in one class, and enqueued in the new one. This would properly update the migration state. Likewise, if it had its ...
-- Actually, I say we do the calculation for all tasks regardless of class. But you can have a function for each class (or NULL) that will do something with the old and new values. Reason why is that we don't want the calculation to happen during the Then move the code there then. One can't be analyzing patches when code isn't apparent that determines changes. Those changes need to be together. Especially since they may not be applied. I would be happy to apply most of this patch but because of changes that are here to help out to-be-announced patches, will keep this patch from going into the tree. IOW, break the patch up to the fixes that this patch is to do on its own. It has plenty. Leave out the stuff that will help out later patches. Add the stuff that is being recommended, and you can make a separate patch Actually that's not the point. The side effects of the double_lock_balance are unique to each location. So it doesn't really make sense to modulize Hence why I said the balance before activate task is very racy. Looking at the state is not enough. A RT task may be in the TASK_UNINTERRUPTIBLE state and about to check a condition that will have it wake itself up. So just looking at the state is not enough to determine if we should skip it My point is that the side effect that the double_lock_balance causes is No but a boosting can change classes. And going from fair to rt we don't That is what I think I'm missing. I have to go back and look at the patch (I admit I may have missed it). Switching the migration state will update the processes nr_cpus_allowed? Of course I think that's not a good idea. /me looks. -- Steve -
The correctness of this particular change (IMO) is not predicated on the later patches, otherwise I would have done just that. It accomplishes what I intended as is here in this patch. Why do you think moving the logic to pick_next_highest is a better design? To be honest, I haven't really studied your new logic in push_rt_tasks to understand why you might feel this way. If you can make the case that it is better in the other location then I agree with you that we should move it there in this patch, and potentially adjust I think the only one real solid case of that is the removal of on_rq from the double_lock conditions. I agree with your point that it should come later. I'll put it back in and adjust it later. I'll wait to hear ? I generalized the core functionality so it worked as I believed it should in the two places that I used it. I don't follow your objection here. Did I break something (besides the on_rq thing?). My feeling is that both need the same locking semantics, but we don't need the check for the priority shift in the optimized case as we do in the loop. So If that's the case then you are already broken if we hit the wake_idle() case, no? We can drop the lock there too. But I am not convinced it cannot be done in a race free way. I can be Fair enough...I'll fix it. But my point still stands that a decent Sure we do! (unless I introduced a bug) task->cpus_allowed and task->nr_cpus_allowed is updated for *all* classes. It is only acted upon to change the migration state if the No, other way around. (de)queuing an RT task, or changing an RT tasks mask will change the migration state. So if you have a normal task, its tracking its hamming-weight...and then you convert it to RT, it would update migration state on the enqueue. Likewise, if you alter the mask of an RT task, it would update the migration state right away. It never would flow the other way. Regards, -Greg
-- Ah, after reading the comment in your code, I might know where our miscommunication is from. When you hit a task that can't migrate, you simply stop and don't bother looking for a lowest rq to place it on. I'm saying to do one better. Put the code in the pick_next_highest_task_rt and _skip_ rt tasks with nr_cpus_allowed == 1. So we can then look to migrate another RT task that is lower in priority than a bounded RT task. Does this clear up what I'm trying to say? BTW, stop looking for a lowest_rq isn't really an optimization here. Since we only look at cpus that are in the tasks cpu affinity mask and we skip the cpu that it currently is on. So we don't even take a lock in that case. -- Steve -
OK, forget everything I said on this topic. I must have replied while low on caffeine. I see here that you do the set_cpus_allowed code if defined _ELSE_ you just update the weight. So you do do what I asked. /me slaps himself to wakeup. -- Steve -
This code tracks the priority of each CPU so that global migration
decisions are easy to calculate. Each CPU can be in a state as follows:
(INVALID), IDLE, NORMAL, RT1, ... RT99
going from the lowest priority to the highest. CPUs in the INVALID state
are not eligible for routing. The system maintains this state with
a 2 dimensional bitmap (the first for priority class, the second for cpus
in that class). Therefore a typical application without affinity
restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
searches). For tasks with affinity restrictions, the algorithm has a
worst case complexity of O(min(102, NR_CPUS)), though the scenario that
yields the worst case search is fairly contrived.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
kernel/Makefile | 2
kernel/sched.c | 4 +
kernel/sched_cpupri.c | 201 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_cpupri.h | 10 ++
kernel/sched_rt.c | 34 ++------
5 files changed, 224 insertions(+), 27 deletions(-)
diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..d9d1351 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o \
- utsname.o
+ utsname.o sched_cpupri.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
diff --git a/kernel/sched.c b/kernel/sched.c
index 6c90093..acfc75d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -68,6 +68,8 @@
#include <asm/tlb.h>
+#include "sched_cpupri.h"
+
/*
* Scheduler clock - returns current time in nanosec units.
* This is default implementation.
@@ -6955,6 +6957,8 @@ void __init sched_init(void)
fair_sched_class.next = &idle_sched_class;
...The cpu_priority and the cp->lock will be aboslutely horrible for cacheline bouncing. Ironically, this will kill performance for the very machines this code is to help with. The larger the number of CPUs you have the more cacheline bouncing this code will create. I still don't see the benefit from the cpupri code. -- Steve -
Oh crap. I just realized this is an older version of the patch..mustv'e forgot to refresh...grr. Ill send out the refreshed one. But anyway, I digress. =20 In the form presented here in this email, perhaps. I think you will see some significant improvements in the refreshed version. The big change Don't forget: The same is precisely true for the current -rt2 algorithm. For instance, the -rt2 algorithm aside from being linear in general, scales cacheline bouncing linearly as well. Each cpu is going to trash rq->rt.highest_prio and then we will walk them for each scan. The fact is, you can't maintain a global dynamic policy without bouncing cachelines, period. But hopefully we can minimize it, and I just want I still owe you timing data, but at this juncture I think I can beat linear (especially as we start throwing in big-iron) ;) I originally got involved in this scheduler rework from observations of poor scaling on our 8/16-ways, so I want it to scale as much as you ;) If this alg doesn't pan out, that's cool. But I think it will in the end. Linear algs in the fast path just make my skin crawl. Perhaps it will still be the best design in the end, but I am not giving up that easy until I prove it to myself. Regards, -Greg
-- Yep, -rt2 (and -rt3) are both horrible too. That's why I'm working on a Well, we need a balance between cacheline hits, and where to pull from. As performance goes, the two are inverse perportional. The secret is to It's only linear in respect to the size of the domain or cpus allowed. Any 8/16 way boxes worried about cacheline bouncing need to reign in on the cpus allowed mask. Otherwise, we would simply have bouncing anyway with the tasks themselves. -- Steve -
Excellent. I'm not 100% sure I've got the "mingo lingo" ;) down enough to know if sched_domains are the best fit, but I think in concept that makes a lot of sense (as we have discussed on IRC). I am thinking that we want to treat the rt.highest_prio/cpupri stuff to the "cpuset" mentality, just like Ingo suggested for the rto masks. If sched_domains True, but as of now its a singular domain ;) And even so, I still think a cpupri like algorithm can be potentially useful even if we utilize finer grained domains. That is, unless the membership becomes sufficiently small (maybe < 4 CPUs?) where the lower-overhead linear Agreed. Looking forward to reviewing your work here. :) Regards, -Greg
