[PATCH 05/23] Subject: SCHED - pull RT tasks

Previous thread: none

Next thread: NFS Killable tasks request comments on patch by Liam Howlett on Tuesday, December 4, 2007 - 5:25 pm. (4 messages)
To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:44 pm

Ingo,

This series applies on GIT commit 2254c2e0184c603f92fc9b81016ff4bb53da622d
(2.6.24-rc4 (ish) git HEAD)

These patches replace v6 + v6b announced here:

http://lkml.org/lkml/2007/11/20/613

and

http://lkml.org/lkml/2007/11/21/226

The series is a collaborative effort between myself and Steven Rostedt.

Changes since v6:

*) cleaned up patch headers
*) fixed some checkpatch errors
*) incorporated v6<x> add-ons to one unified series

Synopsis:
---------------------

Most people don't care about RT tasks, but some of us do, and this series
helps us fix a problem with the current design. We designed carefully to have
negligible impact on non RT code paths, and we believe we have achieved this
goal.

These tests against a 64-way SMP architecture show that the series doesn't
hurt those that don't care about RT tasks, but it helps out those that do.

http://people.redhat.com/srostedt/rt-benchmarks/

you can find a tarball of the entire series for your convenience here:

ftp://ftp.novell.com/dev/ghaskins/rt-balance-v7.tar.bz2

also, most of these patches have been incorporated into the later versions of
-rt which you can find in patch form here:

http://www.kernel.org/pub/linux/kernel/projects/rt/

or in prebuilt form here from opensuse-factory:

http://download.opensuse.org/distribution/SL-OSS-factory/inst-source/sus...

Please consider for inclusion in the next convenient merge window.

Regards,
-Steven Rostedt, and Gregory Haskins
--

To: Gregory Haskins <ghaskins@...>
Cc: <rostedt@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 5:27 pm

please post patches against sched-devel.git - it has part of your
previous patches included already, plus some cleanups i did to them, so
this series of yours wont apply. sched-devel.git is at:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

Ingo
--

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 10:55 pm

>>> On Tue, Dec 4, 2007 at 4:27 PM, in message <20071204212705.GA10360@elte.hu>,
Hi Ingo,
I have rebased to your sched-devel.git. You were right, most of the patches
were already there (1-20, in fact), so there remain only the last three.
These constitute the "cpupri" moniker in Steven's testing. Let me know if
you have any questions. Comments/review by anyone are of course welcome.

Regards,
-Greg

--

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 10:55 pm

The current code use a linear algorithm which causes scaling issues
on larger SMP machines. This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
---

kernel/Kconfig.preempt | 11 +++
kernel/Makefile | 1
kernel/sched.c | 8 ++
kernel/sched_cpupri.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_cpupri.h | 37 ++++++++++
kernel/sched_rt.c | 36 +++++++++-
6 files changed, 265 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c64ce9c..f78ab80 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -63,3 +63,14 @@ config PREEMPT_BKL
Say Y here if you are building a kernel for a desktop system.
Say N if you are unsure.

+config CPUPRI
+ bool "Optimize lowest-priority search"
+ depends on SMP && EXPERIMENTAL
+ default n
+ help
+ This option attempts to reduce latency in the kernel by replacing
+ the linear lowest-priority search algorithm with a 2-d bitmap.
+
+ Say Y here if you want to try this experimental algorithm.
+ Say N if you are unsure.
+
diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..78a385e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_CPUPRI) += sched_cpupri.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index 9a09f82..1c173c1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -69,6 +69,8 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>

+#include "sched_cpupri.h"
+
/*
...

To: Gregory Haskins <ghaskins@...>
Cc: <rostedt@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Wednesday, December 5, 2007 - 5:34 am

hm, what kind of scaling issues - do you have any numbers? What workload
did you measure and on what hardware?

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: <rostedt@...>, <linux-kernel@...>, <linux-rt-users@...>
Date: Wednesday, December 5, 2007 - 6:19 am

>>> On Wed, Dec 5, 2007 at 4:34 AM, in message <20071205093412.GA17911@elte.hu>,

The vast majority of my own work was on 4-way and 8-way systems with -rt which I can dig up and share with you if you like. Generally speaking, I would present various loads (make -j 128 being the defacto standard) and simultaneously measure RT latency performance with tools such as cyclictest and/or preempt-test. Typically, I would let these loads/measurements run overnight to really seek out the worst-case maximums.

As a synopsis, the 2-d alg helps even on the 4-way, but even more so on the 8-way. I expect that trend to continue as we add more cpus. The general numbers I recall from memory are that the 2-d alg reduces max latencies by about 25-40% on the 8-way.

However, that said, Steven's testing work on the mainline port of our series sums it up very nicely, so I will present that in lieu of digging up my -rt numbers unless you specifically want them too. Here they are:

http://people.redhat.com/srostedt/rt-benchmarks/

As I believe you are aware, these numbers were generated on a 64-way SMP beast. To break down what we are looking at, lets start with the first row of graphs (under Kernel and System Information). What we see here are scatterplots of wake-latencies generated from Steven's "rt-migrate" test. We start with the vanilla "rc3" kernel, and make our way through "sdr" (the first 7-8 patches from v7), "gh" (patches 8-20), and finally "cpupri" (patches 21-23). (Ignore the "noavoid" runs, that was an experiment that Steven and I agree was a flop so its dropped from the series).

The test has a default run-interval of 20ms, which is employed here in this test. What we expect to see is that the red points (high-priority) should be allowed to start immediately (close to 0 latency), whereas green points (low-priority) may sometimes happen to start first, or may start after the 20ms interval depending on luck of the draw.

You can see that "rc3" has issues keeping the red-points nea...

To: Gregory Haskins <ghaskins@...>
Cc: <rostedt@...>, <linux-kernel@...>, <linux-rt-users@...>
Date: Wednesday, December 5, 2007 - 7:44 am

i'm well aware of Steve's benchmarking efforts, but i dont think he's
finished with it and i'll let him present the results once he wants to
announce them. I asked about the effects of the "2-d" patch in isolation
and i'm not sure the numbers show that individual patch in action.

in any case, you are preaching to the choir, i wrote the first
rt-overload code and it's been in -rt forever so it's not like you need
to sell me the concept ;-) But upstream quality requirements are
different from -rt and we need to examine all aspects of scheduling, not
just latency. In any case, i'll wait for the rest of Steve's numbers.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: <rostedt@...>, <linux-kernel@...>, <linux-rt-users@...>
Date: Wednesday, December 5, 2007 - 9:41 am

>>> On Wed, Dec 5, 2007 at 6:44 AM, in message <20071205114438.GC6143@elte.hu>,

Ah, sorry if I was not clear. What I was trying to show was that you can compare "gh" to "cpupri" to see the effects of the 2-d patch in isolation (*) in Steven's tests. I believe it shows a positive impact on some tests, and a negligible impact on some tests. As long as we dont have a regression somewhere, I am happy :)

Sounds good. Ill try to dig up my 4/8-way numbers as well for another data point.

Thanks for taking the time to review all this stuff. I know you are swamped these days.

-Greg

--

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 10:55 pm

We move the rt-overload data as the first global to per-domain
reclassification. This limits the scope of overload related cache-line
bouncing to stay with a specified partition instead of affecting all
cpus in the system.

Finally, we limit the scope of find_lowest_cpu searches to the domain
instead of the entire system. Note that we would always respect domain
boundaries even without this patch, but we first would scan potentially
all cpus before whittling the list down. Now we can avoid looking at
RQs that are out of scope, again reducing cache-line hits.

Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain. However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution. If it can be optimized later, so be it.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
---

kernel/sched.c | 7 +++++++
kernel/sched_rt.c | 57 +++++++++++++++++++++++++++++------------------------
2 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index dfb939f..9a09f82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -367,6 +367,13 @@ struct root_domain {
atomic_t refcount;
cpumask_t span;
cpumask_t online;
+
+ /*
+ * The "RT overload" flag: it gets set if a CPU has more than
+ * one runnable RT task.
+ */
+ cpumask_t rto_mask;
+ atomic_t rto_count;
};

static struct root_domain def_root_domain;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 9dcf522..f9728d0 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -5,22 +5,14 @@

#ifdef CONFIG_SMP

-/*
- * The "RT overload" flag: it gets set if a CPU has more than
- * one runnable RT task.
- */
-static cpumask_t rt_overload_mask;
-static atomic_t rto_count;
-
-static inline int rt_overloaded(void)
+static inline int rt_overloaded(struct rq *rq)
{
- return atomic_read(&...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 10:55 pm

We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables. Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset. However, we currently still maintain some
policy/state as global variables which transcend all cpusets. Consider,
for instance, rt-overload state.

Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).

We add some plumbing for storing class specific data in our root-domain.
Whenever a RQ is switching root-domains (because of repartitioning) we
give each sched_class the opportunity to remove any state from its old
domain and add state to the new one. This logic doesn't have any clients
yet but it will later in the series.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
CC: Paul Jackson <pj@sgi.com>
CC: Simon Derr <simon.derr@bull.net>
---

include/linux/sched.h | 3 +
kernel/sched.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0bb7033..b6885ee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -854,6 +854,9 @@ struct sched_class {
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
+
+ void (*join_domain)(struct rq *rq);
+ void (*leave_domain)(struct rq *rq);
};

struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index 4849c72..dfb939f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -351,6 +351,28 @@ struct rt_rq {
int overloaded;
};

+#ifdef CONFIG_...

To: Ingo Molnar <mingo@...>
Cc: Gregory Haskins <ghaskins@...>, <rostedt@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 5:35 pm

Ah, will do. Thanks!

-Greg
--

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:46 pm

The current code use a linear algorithm which causes scaling issues
on larger SMP machines. This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
---

kernel/Kconfig.preempt | 11 +++
kernel/Makefile | 1
kernel/sched.c | 8 ++
kernel/sched_cpupri.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_cpupri.h | 37 ++++++++++
kernel/sched_rt.c | 36 +++++++++-
6 files changed, 265 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c64ce9c..f78ab80 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -63,3 +63,14 @@ config PREEMPT_BKL
Say Y here if you are building a kernel for a desktop system.
Say N if you are unsure.

+config CPUPRI
+ bool "Optimize lowest-priority search"
+ depends on SMP && EXPERIMENTAL
+ default n
+ help
+ This option attempts to reduce latency in the kernel by replacing
+ the linear lowest-priority search algorithm with a 2-d bitmap.
+
+ Say Y here if you want to try this experimental algorithm.
+ Say N if you are unsure.
+
diff --git a/kernel/Makefile b/kernel/Makefile
index dfa9695..78a385e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_CPUPRI) += sched_cpupri.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/sched.c b/kernel/sched.c
index 0c7e5e4..2ace03f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -70,6 +70,8 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>

+#include "sched_cpupri.h"
+
/*
...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:46 pm

We move the rt-overload data as the first global to per-domain
reclassification. This limits the scope of overload related cache-line
bouncing to stay with a specified partition instead of affecting all
cpus in the system.

Finally, we limit the scope of find_lowest_cpu searches to the domain
instead of the entire system. Note that we would always respect domain
boundaries even without this patch, but we first would scan potentially
all cpus before whittling the list down. Now we can avoid looking at
RQs that are out of scope, again reducing cache-line hits.

Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain. However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution. If it can be optimized later, so be it.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
---

kernel/sched.c | 2 ++
kernel/sched_rt.c | 57 ++++++++++++++++++++++++++++++++---------------------
2 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 79f3eba..0c7e5e4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -292,6 +292,8 @@ struct root_domain {
atomic_t refcount;
cpumask_t span;
cpumask_t online;
+ cpumask_t rto_mask;
+ atomic_t rto_count;
};

static struct root_domain def_root_domain;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 039be04..9e8a59d 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -4,20 +4,18 @@
*/

#ifdef CONFIG_SMP
-static cpumask_t rt_overload_mask;
-static atomic_t rto_count;
-static inline int rt_overloaded(void)
+
+static inline int rt_overloaded(struct rq *rq)
{
- return atomic_read(&rto_count);
+ return atomic_read(&rq->rd->rto_count);
}
-static inline cpumask_t *rt_overload(void)
+static inline cpumask_t *rt_overload(struct rq *rq)
{
- return &rt_overload_mask;
+ return &r...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:46 pm

We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables. Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset. However, we currently still maintain some
policy/state as global variables which transcend all cpusets. Consider,
for instance, rt-overload state.

Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).

We add some plumbing for storing class specific data in our root-domain.
Whenever a RQ is switching root-domains (because of repartitioning) we
give each sched_class the opportunity to remove any state from its old
domain and add state to the new one. This logic doesn't have any clients
yet but it will later in the series.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
CC: Paul Jackson <pj@sgi.com>
CC: Simon Derr <simon.derr@bull.net>
---

include/linux/sched.h | 3 +
kernel/sched.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 809658c..b891d81 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -849,6 +849,9 @@ struct sched_class {
void (*task_tick) (struct rq *rq, struct task_struct *p);
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask);
+
+ void (*join_domain)(struct rq *rq);
+ void (*leave_domain)(struct rq *rq);
};

struct load_weight {
diff --git a/kernel/sched.c b/kernel/sched.c
index ba9eadb..79f3eba 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -276,6 +276,28 @@ struct rt_rq {
int overloaded;
};

+#ifdef CONFIG_...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:46 pm

From: Steven Rostedt <srostedt@redhat.com>

Run the RT balancing code on wake up to an RT task.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ed031bd..ba9eadb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1656,6 +1656,7 @@ void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
inc_nr_running(p, rq);
}
check_preempt_curr(rq, p);
+ wakeup_balance_rt(rq, p);
task_rq_unlock(rq, &flags);
}

--

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

From: Steven Rostedt <srostedt@redhat.com>

This patch changes the searching for a run queue by a waking RT task
to try to pick another runqueue if the currently running task
is an RT task.

The reason is that RT tasks behave different than normal
tasks. Preempting a normal task to run a RT task to keep
its cache hot is fine, because the preempted non-RT task
may wait on that same runqueue to run again unless the
migration thread comes along and pulls it off.

RT tasks behave differently. If one is preempted, it makes
an active effort to continue to run. So by having a high
priority task preempt a lower priority RT task, that lower
RT task will then quickly try to run on another runqueue.
This will cause that lower RT task to replace its nice
hot cache (and TLB) with a completely cold one. This is
for the hope that the new high priority RT task will keep
its cache hot.

Remeber that this high priority RT task was just woken up.
So it may likely have been sleeping for several milliseconds,
and will end up with a cold cache anyway. RT tasks run till
they voluntarily stop, or are preempted by a higher priority
task. This means that it is unlikely that the woken RT task
will have a hot cache to wake up to. So pushing off a lower
RT task is just killing its cache for no good reason.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched_rt.c | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 19db3a9..e007d2b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -158,11 +158,23 @@ static int select_task_rq_rt(struct task_struct *p, int sync)
struct rq *rq = task_rq(p);

/*
- * If the task will not preempt the RQ, try to find a better RQ
- * before we even activate the task
+ * If the current task is an RT task, then
+ * try to see if we can wake this RT task up on another
+ * runqueue. ...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

The current code base assumes a relatively flat CPU/core topology and will
route RT tasks to any CPU fairly equally. In the real world, there are
various toplogies and affinities that govern where a task is best suited to
run with the smallest amount of overhead. NUMA and multi-core CPUs are
prime examples of topologies that can impact cache performance.

Fortunately, linux is already structured to represent these topologies via
the sched_domains interface. So we change our RT router to consult a
combination of topology and affinity policy to best place tasks during
migration.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

kernel/sched.c | 1 +
kernel/sched_rt.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6fa511d..651270e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -24,6 +24,7 @@
* 2007-07-01 Group scheduling enhancements by Srivatsa Vaddagiri
* 2007-10-22 RT overload balancing by Steven Rostedt
* (with thanks to Gregory Haskins)
+ * 2007-11-05 RT migration/wakeup tuning by Gregory Haskins
*/

#include <linux/mm.h>
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index ea40851..67daa66 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -279,35 +279,111 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
}

static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
+static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);

-static int find_lowest_rq(struct task_struct *task)
+static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
{
- int cpu;
- cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask);
- struct rq *lowest_rq = NULL;
+ int cpu;
+ cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
+ int lowest_prio = -1;
+ int ret = 0;

- cpus...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:46 pm

From: Steven Rostedt <srostedt@redhat.com>

This patch removes several cpumask operations by keeping track
of the first of the CPUS that is of the lowest priority. When
the search for the lowest priority runqueue is completed, all
the bits up to the first CPU with the lowest priority runqueue
is cleared.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched_rt.c | 49 ++++++++++++++++++++++++++++++++++++-------------
1 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0514b27..039be04 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -294,29 +294,36 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
}

static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);
-static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask);

static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
{
- int cpu;
- cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int lowest_prio = -1;
+ int lowest_cpu = -1;
int count = 0;
+ int cpu;

- cpus_clear(*lowest_mask);
- cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
+ cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed);

/*
* Scan each rq for the lowest prio.
*/
- for_each_cpu_mask(cpu, *valid_mask) {
+ for_each_cpu_mask(cpu, *lowest_mask) {
struct rq *rq = cpu_rq(cpu);

/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- if (count)
+ /*
+ * if we already found a low RT queue
+ * and now we found this non-rt queue
+ * clear the mask and set our bit.
+ * Otherwise just return the queue as is
+ * and the count==1 will cause the algorithm
+ * to use the first bit found.
+ */
+ if (lowest_cpu != -1) {
cpus_clear(*lowest_mask);
- cpu_set(rq->cpu, *lowest_mask);
+ cpu_set(rq->cpu...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

We can cheaply track the number of bits set in the cpumask for the lowest
priority CPUs. Therefore, compute the mask's weight and use it to skip
the optimal domain search logic when there is only one CPU available.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched_rt.c | 25 ++++++++++++++++++-------
1 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index fe0b43f..0514b27 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -301,7 +301,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
int cpu;
cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask);
int lowest_prio = -1;
- int ret = 0;
+ int count = 0;

cpus_clear(*lowest_mask);
cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed);
@@ -314,7 +314,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)

/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- if (ret)
+ if (count)
cpus_clear(*lowest_mask);
cpu_set(rq->cpu, *lowest_mask);
return 1;
@@ -326,14 +326,17 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
if (rq->rt.highest_prio > lowest_prio) {
/* new low - clear old data */
lowest_prio = rq->rt.highest_prio;
- cpus_clear(*lowest_mask);
+ if (count) {
+ cpus_clear(*lowest_mask);
+ count = 0;
+ }
}
cpu_set(rq->cpu, *lowest_mask);
- ret = 1;
+ count++;
}
}

- return ret;
+ return count;
}

static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
@@ -357,9 +360,17 @@ static int find_lowest_rq(struct task_struct *task)
cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask);
int this_cpu = smp_processor_id();
int cpu = task_cpu(task);
+ int count = find_lowest_cpus(task, lowest_mask);

- if (!find_lowest_cpu...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away. So we add support
for considering this new information.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

kernel/sched.c | 2 ++
kernel/sched_rt.c | 10 ++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 651270e..ed031bd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -273,6 +273,7 @@ struct rt_rq {
unsigned long rt_nr_migratory;
/* highest queued rt task prio */
int highest_prio;
+ int overloaded;
};

/*
@@ -6692,6 +6693,7 @@ void __init sched_init(void)
rq->migration_thread = NULL;
INIT_LIST_HEAD(&rq->migration_queue);
rq->rt.highest_prio = MAX_RT_PRIO;
+ rq->rt.overloaded = 0;
#endif
atomic_set(&rq->nr_iowait, 0);

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 67daa66..19db3a9 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -16,6 +16,7 @@ static inline cpumask_t *rt_overload(void)
}
static inline void rt_set_overload(struct rq *rq)
{
+ rq->rt.overloaded = 1;
cpu_set(rq->cpu, rt_overload_mask);
/*
* Make sure the mask is visible before we set
@@ -32,6 +33,7 @@ static inline void rt_clear_overload(struct rq *rq)
/* the order here really doesn't matter */
atomic_dec(&rto_count);
cpu_clear(rq->cpu, rt_overload_mask);
+ rq->rt.overloaded = 0;
}

static void update_rt_migration(struct rq *rq)
@@ -446,6 +448,9 @@ static int push_rt_task(struct rq *rq)

assert_spin_locked(&rq->lock);

+ if (!rq->rt.overloaded)
+ return 0;
+
next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;
@@ -673,7 +678,7 @@ static void schedule_tail_balance_rt(struct rq *rq)
* the lock was owned by prev, we need to release it
* first via finish_lock_switch and ...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

We don't need to bother searching if the task cannot be migrated

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

kernel/sched_rt.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index e007d2b..fe0b43f 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -174,7 +174,8 @@ static int select_task_rq_rt(struct task_struct *p, int sync)
* that is just being woken and probably will have
* cold cache anyway.
*/
- if (unlikely(rt_task(rq->curr))) {
+ if (unlikely(rt_task(rq->curr)) &&
+ (p->nr_cpus_allowed > 1)) {
int cpu = find_lowest_rq(p);

return (cpu == -1) ? task_cpu(p) : cpu;

--

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

In the original patch series that Steven Rostedt and I worked on together,
we both took different approaches to low-priority wakeup path. I utilized
"pre-routing" (push the task away to a less important RQ before activating)
approach, while Steve utilized a "post-routing" approach. The advantage of
my approach is that you avoid the overhead of a wasted activate/deactivate
cycle and peripherally related burdens. The advantage of Steve's method is
that it neatly solves an issue preventing a "pull" optimization from being
deployed.

In the end, we ended up deploying Steve's idea. But it later dawned on me
that we could get the best of both worlds by deploying both ideas together,
albeit slightly modified.

The idea is simple: Use a "light-weight" lookup for pre-routing, since we
only need to approximate a good home for the task. And we also retain the
post-routing push logic to clean up any inaccuracies caused by a condition
of "priority mistargeting" caused by the lightweight lookup. Most of the
time, the pre-routing should work and yield lower overhead. In the cases
where it doesnt, the post-router will bat cleanup.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

kernel/sched_rt.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 7e444f4..ea40851 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -149,8 +149,27 @@ yield_task_rt(struct rq *rq)
}

#ifdef CONFIG_SMP
+static int find_lowest_rq(struct task_struct *task);
+
static int select_task_rq_rt(struct task_struct *p, int sync)
{
+ struct rq *rq = task_rq(p);
+
+ /*
+ * If the task will not preempt the RQ, try to find a better RQ
+ * before we even activate the task
+ */
+ if ((p->prio >= rq->rt.highest_prio)
+ && (p->nr_cpus_allowed > 1)) {
+ int cpu = find_lowest_rq(p);
+
+ return (cpu == -1) ? task_cpu...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

The current wake-up code path tries to determine if it can optimize the
wake-up to "this_cpu" by computing load calculations. The problem is that
these calculations are only relevant to CFS tasks where load is king. For
RT tasks, priority is king. So the load calculation is completely wasted
bandwidth.

Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions and move the load calculation as a function
of CFS task's class.

(Note that there is one checkpatch error due to braces, but this code was
simply moved from kernel/sched.c to kernel/sched_fail.c so I don't want to
modify it)

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

include/linux/sched.h | 1
kernel/sched.c | 167 ++++++++---------------------------------------
kernel/sched_fair.c | 148 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched_idletask.c | 9 +++
kernel/sched_rt.c | 10 +++
5 files changed, 195 insertions(+), 140 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a3dc53b..809658c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -827,6 +827,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq);
+ int (*select_task_rq)(struct task_struct *p, int sync);

void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);

diff --git a/kernel/sched.c b/kernel/sched.c
index 70f08de..6fa511d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -866,6 +866,13 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
#endif

+#ifdef CONFIG_SMP
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigne...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

It doesn't hurt if we allow the current CPU to be included in the
search. We will just simply skip it later if the current CPU turns out
to be the lowest.

We will use this later in the series

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

kernel/sched_rt.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 7e26c2c..7e444f4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -275,9 +275,6 @@ static int find_lowest_rq(struct task_struct *task)
for_each_cpu_mask(cpu, *cpu_mask) {
struct rq *rq = cpu_rq(cpu);

- if (cpu == rq->cpu)
- continue;
-
/* We look for lowest RT prio or non-rt CPU */
if (rq->rt.highest_prio >= MAX_RT_PRIO) {
lowest_rq = rq;
@@ -305,7 +302,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
for (tries = 0; tries < RT_MAX_TRIES; tries++) {
cpu = find_lowest_rq(task);

- if (cpu == -1)
+ if ((cpu == -1) || (cpu == rq->cpu))
break;

lowest_rq = cpu_rq(cpu);

--

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

Isolate the search logic into a function so that it can be used later
in places other than find_locked_lowest_rq().

(Checkpatch error is inherited from moved code)

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

kernel/sched_rt.c | 66 +++++++++++++++++++++++++++++++----------------------
1 files changed, 39 insertions(+), 27 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index da21c7a..7e26c2c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -261,54 +261,66 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq,

static DEFINE_PER_CPU(cpumask_t, local_cpu_mask);

-/* Will lock the rq it finds */
-static struct rq *find_lock_lowest_rq(struct task_struct *task,
- struct rq *this_rq)
+static int find_lowest_rq(struct task_struct *task)
{
- struct rq *lowest_rq = NULL;
int cpu;
- int tries;
cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask);
+ struct rq *lowest_rq = NULL;

cpus_and(*cpu_mask, cpu_online_map, task->cpus_allowed);

- for (tries = 0; tries < RT_MAX_TRIES; tries++) {
- /*
- * Scan each rq for the lowest prio.
- */
- for_each_cpu_mask(cpu, *cpu_mask) {
- struct rq *rq = &per_cpu(runqueues, cpu);
+ /*
+ * Scan each rq for the lowest prio.
+ */
+ for_each_cpu_mask(cpu, *cpu_mask) {
+ struct rq *rq = cpu_rq(cpu);

- if (cpu == this_rq->cpu)
- continue;
+ if (cpu == rq->cpu)
+ continue;

- /* We look for lowest RT prio or non-rt CPU */
- if (rq->rt.highest_prio >= MAX_RT_PRIO) {
- lowest_rq = rq;
- break;
- }
+ /* We look for lowest RT prio or non-rt CPU */
+ if (rq->rt.highest_prio >= MAX_RT_PRIO) {
+ lowest_rq = rq;
+ break;
+ }

- /* no locking for now */
- if (rq->rt.highest_prio > task->prio &&
- (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) {
- lowest_rq = rq;
- }
...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

"this_rq" is normally used to denote the RQ on the current cpu
(i.e. "cpu_rq(this_cpu)"). So clean up the usage of this_rq to be
more consistent with the rest of the code.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

kernel/sched_rt.c | 22 +++++++++++-----------
1 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index e5bce6a..d8f8738 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -326,21 +326,21 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
* running task can migrate over to a CPU that is running a task
* of lesser priority.
*/
-static int push_rt_task(struct rq *this_rq)
+static int push_rt_task(struct rq *rq)
{
struct task_struct *next_task;
struct rq *lowest_rq;
int ret = 0;
int paranoid = RT_MAX_TRIES;

- assert_spin_locked(&this_rq->lock);
+ assert_spin_locked(&rq->lock);

- next_task = pick_next_highest_task_rt(this_rq, -1);
+ next_task = pick_next_highest_task_rt(rq, -1);
if (!next_task)
return 0;

retry:
- if (unlikely(next_task == this_rq->curr)) {
+ if (unlikely(next_task == rq->curr)) {
WARN_ON(1);
return 0;
}
@@ -350,24 +350,24 @@ static int push_rt_task(struct rq *this_rq)
* higher priority than current. If that's the case
* just reschedule current.
*/
- if (unlikely(next_task->prio < this_rq->curr->prio)) {
- resched_task(this_rq->curr);
+ if (unlikely(next_task->prio < rq->curr->prio)) {
+ resched_task(rq->curr);
return 0;
}

- /* We might release this_rq lock */
+ /* We might release rq lock */
get_task_struct(next_task);

/* find_lock_lowest_rq locks the rq if found */
- lowest_rq = find_lock_lowest_rq(next_task, this_rq);
+ lowest_rq = find_lock_lowest_rq(next_task, rq);
if (!lowest_rq) {
struct task_struct *task;
/*
- * find lock_lowest_rq releases this_rq...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time. Consider, for instance, softirq_timer and softirq_sched. A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then the timer handler determines that it's time to smp-rebalance the
system so it schedules softirq_sched to run. So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.

This causes two problems in the current code:

1) If a high-priority bound task and a low-priority unbounded task queue
up behind the running task, we will fail to ever relocate the unbounded
task because we terminate the search on the first unmovable task.

2) We spend precious futile cycles in the fast-path trying to pull
overloaded tasks over. It is therefore optimial to strive to avoid the
overhead all together if we can cheaply detect the condition before
overload even occurs.

This patch tries to achieve this optimization by utilizing the hamming
weight of the task->cpus_allowed mask. A weight of 1 indicates that
the task cannot be migrated. We will then utilize this information to
skip non-migratable tasks and to eliminate uncessary rebalance attempts.

We introduce a per-rq variable to count the number of migratable tasks
that are currently running. We only go into overload if we have more
than one rt task, AND at least one of them is migratable.

In addition, we introduce a per-task variable to cache the cpus_allowed
weight, since the hamming calculation is probably relatively expensive.
We only update the cached value when the mask is updated which should be
relatively infrequent, especially compared to scheduling frequency
in the fast path.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---

include/linux/init_task.h | 1 +
include/...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:44 pm

From: Steven Rostedt <srostedt@redhat.com>

This patch adds an algorithm to push extra RT tasks off a run queue to
other CPU runqueues.

When more than one RT task is added to a run queue, this algorithm takes
an assertive approach to push the RT tasks that are not running onto other
run queues that have lower priority. The way this works is that the highest
RT task that is not running is looked at and we examine the runqueues on
the CPUS for that tasks affinity mask. We find the runqueue with the lowest
prio in the CPU affinity of the picked task, and if it is lower in prio than
the picked task, we push the task onto that CPU runqueue.

We continue pushing RT tasks off the current runqueue until we don't push any
more. The algorithm stops when the next highest RT task can't preempt any
other processes on other CPUS.

TODO: The algorithm may stop when there are still RT tasks that can be
migrated. Specifically, if the highest non running RT task CPU affinity
is restricted to CPUs that are running higher priority tasks, there may
be a lower priority task queued that has an affinity with a CPU that is
running a lower priority task that it could be migrated to. This
patch set does not address this issue.

Note: checkpatch reveals two over 80 character instances. I'm not sure
that breaking them up will help visually, so I left them as is.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched.c | 8 ++
kernel/sched_rt.c | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 231 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 90c04fd..7748d14 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1883,6 +1883,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
prev_state = prev->state;
finish_arch_switch(prev);
finish_lock_switch(rq, prev);
+ schedule_tail_balance_rt(rq);
+
fire_sch...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:44 pm

From: Steven Rostedt <srostedt@redhat.com>

This patch adds pushing of overloaded RT tasks from a runqueue that is
having tasks (most likely RT tasks) added to the run queue.

TODO: We don't cover the case of waking of new RT tasks (yet).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched.c | 3 +++
kernel/sched_rt.c | 10 ++++++++++
2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a30147e..ebd114b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -22,6 +22,8 @@
* by Peter Williams
* 2007-05-06 Interactivity improvements to CFS by Mike Galbraith
* 2007-07-01 Group scheduling enhancements by Srivatsa Vaddagiri
+ * 2007-10-22 RT overload balancing by Steven Rostedt
+ * (with thanks to Gregory Haskins)
*/

#include <linux/mm.h>
@@ -1641,6 +1643,7 @@ out_activate:

out_running:
p->state = TASK_RUNNING;
+ wakeup_balance_rt(rq, p);
out:
task_rq_unlock(rq, &flags);

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a2f1057..a0b05ff 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -556,6 +556,15 @@ static void schedule_tail_balance_rt(struct rq *rq)
}
}

+
+static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
+{
+ if (unlikely(rt_task(p)) &&
+ !task_running(rq, p) &&
+ (p->prio >= rq->curr->prio))
+ push_rt_tasks(rq);
+}
+
/*
* Load-balancing iterator. Note: while the runqueue stays locked
* during the whole iteration, the current task might be
@@ -663,6 +672,7 @@ move_one_task_rt(struct rq *this_rq, int this_cpu, struct rq *busiest,
#else /* CONFIG_SMP */
# define schedule_tail_balance_rt(rq) do { } while (0)
# define schedule_balance_rt(rq, prev) do { } while (0)
+# define wakeup_balance_rt(rq, p) do { } while (0)
#endif /* CONFIG_SMP */

static void task_tic...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:45 pm

From: Steven Rostedt <srostedt@redhat.com>

Since we now take an active approach to load balancing, we don't need to
balance RT tasks via CFS. In fact, this code was found to pull RT tasks
away from CPUS that the active movement performed, resulting in
large latencies.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched_rt.c | 95 ++---------------------------------------------------
1 files changed, 4 insertions(+), 91 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a0b05ff..ea07ffa 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -565,109 +565,22 @@ static void wakeup_balance_rt(struct rq *rq, struct task_struct *p)
push_rt_tasks(rq);
}

-/*
- * Load-balancing iterator. Note: while the runqueue stays locked
- * during the whole iteration, the current task might be
- * dequeued so the iterator has to be dequeue-safe. Here we
- * achieve that by always pre-iterating before returning
- * the current task:
- */
-static struct task_struct *load_balance_start_rt(void *arg)
-{
- struct rq *rq = arg;
- struct rt_prio_array *array = &rq->rt.active;
- struct list_head *head, *curr;
- struct task_struct *p;
- int idx;
-
- idx = sched_find_first_bit(array->bitmap);
- if (idx >= MAX_RT_PRIO)
- return NULL;
-
- head = array->queue + idx;
- curr = head->prev;
-
- p = list_entry(curr, struct task_struct, run_list);
-
- curr = curr->prev;
-
- rq->rt.rt_load_balance_idx = idx;
- rq->rt.rt_load_balance_head = head;
- rq->rt.rt_load_balance_curr = curr;
-
- return p;
-}
-
-static struct task_struct *load_balance_next_rt(void *arg)
-{
- struct rq *rq = arg;
- struct rt_prio_array *array = &rq->rt.active;
- struct list_head *head, *curr;
- struct task_struct *p;
- int idx;
-
- idx = rq->rt.rt_load_balance_idx;
- head = rq->rt.rt_load_balance_head;
- curr = rq->rt.rt_load_balance_curr;
-
- /*
...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:44 pm

From: Steven Rostedt <srostedt@redhat.com>

This patch adds the algorithm to pull tasks from RT overloaded runqueues.

When a pull RT is initiated, all overloaded runqueues are examined for
a RT task that is higher in prio than the highest prio task queued on the
target runqueue. If another runqueue holds a RT task that is of higher
prio than the highest prio task on the target runqueue is found it is pulled
to the target runqueue.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched.c | 2 +
kernel/sched_rt.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 178 insertions(+), 11 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7748d14..a30147e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3652,6 +3652,8 @@ need_resched_nonpreemptible:
switch_count = &prev->nvcsw;
}

+ schedule_balance_rt(rq, prev);
+
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index b8c758a..a2f1057 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -177,8 +177,17 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep);

+static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+ if (!task_running(rq, p) &&
+ (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)))
+ return 1;
+ return 0;
+}
+
/* Return the second highest RT task, NULL otherwise */
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq)
+static struct task_struct *pick_next_highest_task_rt(struct rq *rq,
+ int cpu)
{
struct rt_prio_array *array = &rq->rt.active;
struct task_struct *next;
@@ -197,26 +206,36 @@ static struct task_struct *pick_next_highest_task_rt(struct...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:44 pm

From: Steven Rostedt <srostedt@redhat.com>

This patch adds an RT overload accounting system. When a runqueue has
more than one RT task queued, it is marked as overloaded. That is that it
is a candidate to have RT tasks pulled from it.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched_rt.c | 36 ++++++++++++++++++++++++++++++++++++
1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index b5ef4b8..b8c758a 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -3,6 +3,38 @@
* policies)
*/

+#ifdef CONFIG_SMP
+static cpumask_t rt_overload_mask;
+static atomic_t rto_count;
+static inline int rt_overloaded(void)
+{
+ return atomic_read(&rto_count);
+}
+static inline cpumask_t *rt_overload(void)
+{
+ return &rt_overload_mask;
+}
+static inline void rt_set_overload(struct rq *rq)
+{
+ cpu_set(rq->cpu, rt_overload_mask);
+ /*
+ * Make sure the mask is visible before we set
+ * the overload count. That is checked to determine
+ * if we should look at the mask. It would be a shame
+ * if we looked at the mask, but the mask was not
+ * updated yet.
+ */
+ wmb();
+ atomic_inc(&rto_count);
+}
+static inline void rt_clear_overload(struct rq *rq)
+{
+ /* the order here really doesn't matter */
+ atomic_dec(&rto_count);
+ cpu_clear(rq->cpu, rt_overload_mask);
+}
+#endif /* CONFIG_SMP */
+
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
@@ -33,6 +65,8 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
#ifdef CONFIG_SMP
if (p->prio < rq->rt.highest_prio)
rq->rt.highest_prio = p->prio;
+ if (rq->rt.rt_nr_running > 1)
+ rt_set_overload(rq);
#endif /* CONFIG_SMP */
}

@@ -54,6 +88,8 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
} /* otherwise ...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:44 pm

From: Steven Rostedt <srostedt@redhat.com>

This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks
its priority will be considered MAX_RT_PRIO.

This information will be used for later patches.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched.c | 3 +++
kernel/sched_rt.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4751c2f..90c04fd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -267,6 +267,8 @@ struct rt_rq {
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
unsigned long rt_nr_running;
+ /* highest queued rt task prio */
+ int highest_prio;
};

/*
@@ -6783,6 +6785,7 @@ void __init sched_init(void)
rq->cpu = i;
rq->migration_thread = NULL;
INIT_LIST_HEAD(&rq->migration_queue);
+ rq->rt.highest_prio = MAX_RT_PRIO;
#endif
atomic_set(&rq->nr_iowait, 0);

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a6271f4..4d5b9b2 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -30,6 +30,10 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
{
WARN_ON(!rt_task(p));
rq->rt.rt_nr_running++;
+#ifdef CONFIG_SMP
+ if (p->prio < rq->rt.highest_prio)
+ rq->rt.highest_prio = p->prio;
+#endif /* CONFIG_SMP */
}

static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -37,6 +41,20 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
WARN_ON(!rt_task(p));
WARN_ON(!rq->rt.rt_nr_running);
rq->rt.rt_nr_running--;
+#ifdef CONFIG_SMP
+ if (rq->rt.rt_nr_running) {
+ struct rt_prio_array *array;
+
+ WARN_ON(p->prio < rq->rt.highest_prio);
+ if (p->p...

To: <mingo@...>
Cc: <rostedt@...>, <ghaskins@...>, <linux-rt-users@...>, <linux-kernel@...>
Date: Tuesday, December 4, 2007 - 4:44 pm

From: Steven Rostedt <srostedt@redhat.com>

This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

kernel/sched.c | 1 +
kernel/sched_rt.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index b062856..4751c2f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -266,6 +266,7 @@ struct rt_rq {
struct rt_prio_array active;
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;
+ unsigned long rt_nr_running;
};

/*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index ee9c8b6..a6271f4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -26,12 +26,27 @@ static void update_curr_rt(struct rq *rq)
cpuacct_charge(curr, delta_exec);
}

+static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+ WARN_ON(!rt_task(p));
+ rq->rt.rt_nr_running++;
+}
+
+static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
+{
+ WARN_ON(!rt_task(p));
+ WARN_ON(!rq->rt.rt_nr_running);
+ rq->rt.rt_nr_running--;
+}
+
static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
{
struct rt_prio_array *array = &rq->rt.active;

list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
+
+ inc_rt_tasks(p, rq);
}

/*
@@ -46,6 +61,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
+
+ dec_rt_tasks(p, rq);
}

/*

--

Previous thread: none

Next thread: NFS Killable tasks request comments on patch by Liam Howlett on Tuesday, December 4, 2007 - 5:25 pm. (4 messages)