[PATCH 0/3] RT: balance rt tasks enhancements v6

Previous thread: Re: Quad core CPU detected but shows as single core in 2.6.23.1 by Zurk Tech on Thursday, October 25, 2007 - 7:50 am. (3 messages)

Next thread: [PATCH]fs: Fix to correct the mbcache entries counter by Ram Gupta on Thursday, October 25, 2007 - 8:03 am. (1 message)
From: Gregory Haskins
Date: Thursday, October 25, 2007 - 7:46 am

This is a mini-release of my series, rebased on -rt2.  I have more changes
downstream which are not quite ready for primetime, but I need to work on some
other unrelated issues right now and I wanted to get what works out there. 

Changes since v5

*) Rebased to rt2 - Many of the functions of the original series are now
 included in base -rt so they are dropped out

-

From: Gregory Haskins
Date: Thursday, October 25, 2007 - 7:46 am

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   10 ++--------
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 55da7d0..b59dc20 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -292,7 +292,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 {
 	struct rq *lowest_rq = NULL;
 	cpumask_t cpu_mask;
-	int dst_cpu = -1;
 	int cpu;
 	int tries;
 
@@ -311,14 +310,12 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 			/* We look for lowest RT prio or non-rt CPU */
 			if (rq->rt.highest_prio >= MAX_RT_PRIO) {
 				lowest_rq = rq;
-				dst_cpu = cpu;
 				break;
 			}
 
 			/* no locking for now */
 			if (rq->rt.highest_prio > task->prio &&
 			    (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
-				dst_cpu = cpu;
 				lowest_rq = rq;
 			}
 		}
@@ -335,7 +332,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 			 * Also make sure that it wasn't scheduled on its rq.
 			 */
 			if (unlikely(task_rq(task) != this_rq ||
-				     !cpu_isset(dst_cpu, task->cpus_allowed) ||
+				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed) ||
 				     task_running(this_rq, task) ||
 				     !task->se.on_rq)) {
 				spin_unlock(&lowest_rq->lock);
@@ -365,7 +362,6 @@ static int push_rt_task(struct rq *this_rq)
 {
 	struct task_struct *next_task;
 	struct rq *lowest_rq;
-	int dst_cpu;
 	int ret = 0;
 	int paranoid = RT_MAX_TRIES;
 
@@ -412,12 +408,10 @@ static int push_rt_task(struct rq *this_rq)
 		goto out;
 	}
 
-	dst_cpu = lowest_rq->cpu;
-
 	assert_spin_locked(&lowest_rq->lock);
 
 	deactivate_task(this_rq, next_task, 0);
-	set_task_cpu(next_task, dst_cpu);
+	set_task_cpu(next_task, lowest_rq->cpu);
 	activate_task(lowest_rq, next_task, 0);
 
 	resched_task(lowest_rq->curr);

-

From: Gregory Haskins
Date: Thursday, October 25, 2007 - 7:46 am

Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for one or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then during the handling of the timer, the system determines that it's
time to smp-rebalance the system so it schedules softirq_sched to run
from within the softirq_timer kthread context. So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.

The problem is that these tasks cannot ever be pulled away since they
are already running on their one and only valid RQ.  However, the other
CPUs cannot determine that the tasks are unpullable without going
through expensive checks/locking.  Therefore the helping CPUS
experience unecessary overhead/latencies regardless as they
ineffectively try to process the overload condition.

This patch tries to optimize the situation by utilizing the hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated, which may be utilized by the overload
handling code to eliminate uncessary rebalance attempts.  We also
introduce a per-rq variable to count the number of migratable tasks
that are currently running.  We only go into overload if we have more
than one rt task, AND at least one of them is migratable.

Calculating the weight is probably relatively expensive, so it is only
done when the cpus_allowed mask is updated (which should be relatively
infrequent, especially compared to scheduling frequency) and cached in
the task_struct.


Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h |    2 +
 kernel/fork.c         |    1 
 kernel/sched.c        |    9 +++-
 kernel/sched_rt.c     |  116 +++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 107 insertions(+), 21 deletions(-)

diff --git ...
From: Steven Rostedt
Date: Thursday, October 25, 2007 - 8:48 am

--

Not sure we need this optimization (not doing the nr_cpus_allowed
calculation). Since, due to priority boosting, we will need to calculate


I don't like this encapsulating of the doubl_lock_balance. There's a
reason I kept it out like this. Mainly because this is the source of most
(if not all) of the race condititions I fought. So this is a very
sensitive area to touch. In fact, I see a few races already introduced by
this patch. One is that you took out the "!task->se.or_rq" test. Which
means that a task could have ran and deactivated itself
(TASK_UNINTERRUPTIBLE) and you just added it to a run queue. (I hit that
race ;-)


Hmm, I must have missed where you update the mask at time of boosting.
Anyway, we shouldn't. The set_cpus_allowed should be done for all tasks.
Now you can have a handler to call for when a task changes class and
changes weight.

-- Steve

-

From: Gregory Haskins
Date: Thursday, October 25, 2007 - 11:51 am

I think you are misunderstanding the code here.  The only optimization
is that I didn't want to force every sched_class to define a default
set_cpus_allowed member-fn.  So instead, it first checks if its defined
and invokes it if true.  Else, the default behavior is to assign the
mask and calculate the weight.

If you look at the one and only implementation of this function
(sched_rt.c:set_cpus_allowed_rt()), it also performs the assignment and
calcs the mask.


I disagree, but I admit it is not apparent at this level of the series
why.  The reason has to do with optimizing the wakeup path.  Unless I am
missing something, I think this placement is optimal, and that will

Reducing code duplication is an effort to mitigate error propagation,

This was intentional, but perhaps I should have been more explicit on
the reason why.  This is again related to future optimizations that I
have yet to reveal.  Namely, this code path can be invoked before the
task has been woken up, and it is therefore normal for it to not be on a

Hmm.. This is a good point.  Perhaps we should check to see that the

You aren't the first person to say something like that, and I always
scratch my head at that one.  It *is* open..its right there -40 lines
from where it used to be.  Its not like I gave you a obfuscated .o

Agreed.  You weren't aware of my issue, as I wasn't aware of yours.  I

?  I don't understand what you mean.  Are you referring to PI boosting?
What does that have to do with the mask?

Are you referring to when we change the mask?  We already invoke this
handler from the sched.c:set_cpus_allowed().  Can you clarify this

It is.  Im just lazy and had the default case handled inside the
sched_c:set_cpus_allowed() call.  Perhaps I should just let all

Well, at least for what I am designing here, we are already covered.  If
the task changes class, it would be dequeued in one class, and enqueued
in the new one.  This would properly update the migration state.
Likewise, if it had its ...
From: Steven Rostedt
Date: Thursday, October 25, 2007 - 12:52 pm

--

Actually, I say we do the calculation for all tasks regardless of class.
But you can have a function for each class (or NULL) that will do
something with the old and new values.

Reason why is that we don't want the calculation to happen during the

Then move the code there then. One can't be analyzing patches when code
isn't apparent that determines changes.  Those changes need to be
together.  Especially since they may not be applied.

I would be happy to apply most of this patch but because of changes that
are here to help out to-be-announced patches, will keep this patch from
going into the tree.

IOW, break the patch up to the fixes that this patch is to do on its own.
It has plenty. Leave out the stuff that will help out later patches.
Add the stuff that is being recommended, and you can make a separate patch

Actually that's not the point. The side effects of the double_lock_balance
are unique to each location. So it doesn't really make sense to modulize

Hence why I said the balance before activate task is very racy.  Looking
at the state is not enough. A RT task may be in the TASK_UNINTERRUPTIBLE
state and about to check a condition that will have it wake itself up. So
just looking at the state is not enough to determine if we should skip it

My point is that the side effect that the double_lock_balance causes is


No but a boosting can change classes. And going from fair to rt we don't

That is what I think I'm missing. I have to go back and look at the patch
(I admit I may have missed it). Switching the migration state will update
the processes nr_cpus_allowed? Of course I think that's not a good idea.

/me looks.

-- Steve

-

From: Gregory Haskins
Date: Thursday, October 25, 2007 - 2:12 pm

The correctness of this particular change (IMO) is not predicated on the
later patches, otherwise I would have done just that.  It accomplishes
what I intended as is here in this patch.

Why do you think moving the logic to pick_next_highest is a better
design?  To be honest, I haven't really studied your new logic in
push_rt_tasks to understand why you might feel this way.  If you can
make the case that it is better in the other location then I agree with
you that we should move it there in this patch, and potentially adjust

I think the only one real solid case of that is the removal of on_rq
from the double_lock conditions.  I agree with your point that it should
come later.  I'll put it back in and adjust it later.  I'll wait to hear

?

I generalized the core functionality so it worked as I believed it
should in the two places that I used it.  I don't follow your objection
here.  Did I break something (besides the on_rq thing?).  My feeling is
that both need the same locking semantics, but we don't need the check
for the priority shift in the optimized case as we do in the loop.  So

If that's the case then you are already broken if we hit the wake_idle()
case, no?  We can drop the lock there too.

But I am not convinced it cannot be done in a race free way.  I can be


Fair enough...I'll fix it.  But my point still stands that a decent

Sure we do! (unless I introduced a bug)

task->cpus_allowed and task->nr_cpus_allowed is updated for *all*
classes.  It is only acted upon to change the migration state if the

No, other way around.  (de)queuing an RT task, or changing an RT tasks
mask will change the migration state.  So if you have a normal task, its
tracking its hamming-weight...and then you convert it to RT, it would
update migration state on the enqueue.

Likewise, if you alter the mask of an RT task, it would update the
migration state right away.  It never would flow the other way.

Regards,
-Greg


From: Steven Rostedt
Date: Thursday, October 25, 2007 - 5:03 pm

--

Ah, after reading the comment in your code, I might know where our
miscommunication is from.  When you hit a task that can't migrate, you
simply stop and don't bother looking for a lowest rq to place it on.

I'm saying to do one better. Put the code in the pick_next_highest_task_rt
and _skip_ rt tasks with nr_cpus_allowed == 1. So we can then look to
migrate another RT task that is lower in priority than a bounded RT task.

Does this clear up what I'm trying to say?

BTW, stop looking for a lowest_rq isn't really an optimization here. Since
we only look at cpus that are in the tasks cpu affinity mask and we skip
the cpu that it currently is on. So we don't even take a lock in that
case.

-- Steve

-

From: Steven Rostedt
Date: Thursday, October 25, 2007 - 12:58 pm

OK, forget everything I said on this topic. I must have replied while low
on caffeine.

I see here that you do the set_cpus_allowed code if defined _ELSE_ you
just update the weight.

So you do do what I asked.

/me slaps himself to wakeup.

-- Steve

-

From: Gregory Haskins
Date: Thursday, October 25, 2007 - 7:46 am

This code tracks the priority of each CPU so that global migration
  decisions are easy to calculate.  Each CPU can be in a state as follows:

                 (INVALID), IDLE, NORMAL, RT1, ... RT99

  going from the lowest priority to the highest.  CPUs in the INVALID state
  are not eligible for routing.  The system maintains this state with
  a 2 dimensional bitmap (the first for priority class, the second for cpus
  in that class).  Therefore a typical application without affinity
  restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
  searches).  For tasks with affinity restrictions, the algorithm has a
  worst case complexity of O(min(102, NR_CPUS)), though the scenario that
  yields the worst case search is fairly contrived.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/Makefile       |    2 
 kernel/sched.c        |    4 +
 kernel/sched_cpupri.c |  201 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_cpupri.h |   10 ++
 kernel/sched_rt.c     |   34 ++------
 5 files changed, 224 insertions(+), 27 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..d9d1351 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
 	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o \
-	    utsname.o
+	    utsname.o sched_cpupri.o
 
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
diff --git a/kernel/sched.c b/kernel/sched.c
index 6c90093..acfc75d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -68,6 +68,8 @@
 
 #include <asm/tlb.h>
 
+#include "sched_cpupri.h"
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -6955,6 +6957,8 @@ void __init sched_init(void)
 	fair_sched_class.next = &idle_sched_class;
 ...
From: Steven Rostedt
Date: Thursday, October 25, 2007 - 8:27 am

The cpu_priority and the cp->lock will be aboslutely horrible for
cacheline bouncing.  Ironically, this will kill performance for the very
machines this code is to help with.  The larger the number of CPUs you
have the more cacheline bouncing this code will create.

I still don't see the benefit from the cpupri code.

-- Steve

-

From: Gregory Haskins
Date: Thursday, October 25, 2007 - 10:26 am

Oh crap.  I just realized this is an older version of the patch..mustv'e
forgot to refresh...grr.  Ill send out the refreshed one.

But anyway, I digress. =20


In the form presented here in this email, perhaps.  I think you will see
some significant improvements in the refreshed version.  The big change

Don't forget:  The same is precisely true for the current -rt2
algorithm.  For instance, the -rt2 algorithm aside from being linear in
general, scales cacheline bouncing linearly as well.  Each cpu is going
to trash rq->rt.highest_prio and then we will walk them for each scan.

The fact is, you can't maintain a global dynamic policy without bouncing
cachelines, period.  But hopefully we can minimize it, and I just want

I still owe you timing data, but at this juncture I think I can beat
linear (especially as we start throwing in big-iron) ;)  I originally
got involved in this scheduler rework from observations of poor scaling
on our 8/16-ways, so I want it to scale as much as you ;)  If this alg
doesn't pan out, that's cool.  But I think it will in the end.  Linear
algs in the fast path just make my skin crawl.  Perhaps it will still be
the best design in the end, but I am not giving up that easy until I
prove it to myself.

Regards,
-Greg
From: Steven Rostedt
Date: Thursday, October 25, 2007 - 10:36 am

--


Yep, -rt2 (and -rt3) are both horrible too. That's why I'm working on a

Well, we need a balance between cacheline hits, and where to pull from.
As performance goes, the two are inverse perportional. The secret is to

It's only linear in respect to the size of the domain or cpus allowed. Any
8/16 way boxes worried about cacheline bouncing need to reign in on the
cpus allowed mask. Otherwise, we would simply have bouncing anyway with
the tasks themselves.

-- Steve

-

From: Gregory Haskins
Date: Thursday, October 25, 2007 - 10:55 am

Excellent.  I'm not 100% sure I've got the "mingo lingo" ;) down enough
to know if sched_domains are the best fit, but I think in concept that
makes a lot of sense (as we have discussed on IRC).   I am thinking that
we want to treat the rt.highest_prio/cpupri stuff to the "cpuset"
mentality, just like Ingo suggested for the rto masks.  If sched_domains


True, but as of now its a singular domain ;)  And even so, I still think
a cpupri like algorithm can be potentially useful even if we utilize
finer grained domains.  That is, unless the membership becomes
sufficiently small (maybe < 4 CPUs?) where the lower-overhead linear

Agreed.   Looking forward to reviewing your work here. :)

Regards,
-Greg
Previous thread: Re: Quad core CPU detected but shows as single core in 2.6.23.1 by Zurk Tech on Thursday, October 25, 2007 - 7:50 am. (3 messages)

Next thread: [PATCH]fs: Fix to correct the mbcache entries counter by Ram Gupta on Thursday, October 25, 2007 - 8:03 am. (1 message)