[PATCH 01/13] RT: push-rt

From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:50 am

This is version 5 of the patch series against 23-rt1.

There have been numerous fixes/tweaks since v4, though we still are based on
the global rto_cpumask logic instead of Steve/Ingo's cpuset logic.  Otherwise,
it's in pretty good shape.

Without the series applied, the following test will fail:

ftp://ftp.novell.com/dev/ghaskins/preempt-test-latest.tar.bz2

After it is applied, it will pass.

NOTE: it appears that the series also introduces wake-latency spikes that
are not present in the baseline code, so this is still "RFC" quality.
However, the baseline scheduler also violates priority order, so it's hard to
determine whether the numbers compare apples to apples.  These issues are still
under investigation, but I am sharing the series now so that Steven Rostedt
and Darren Hart can have access to my current tree.  The issues appear to be
caused by some other strange scheduling decisions (such as running the idle
thread while we are busy).  TBD 
-

From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:50 am

The system currently evaluates all online CPUs whenever one or more enters
an rt_overload condition.  This suffers from scalability limitations as
the # of online CPUs increases.  So we introduce a cpumask to track
exactly which CPUs need RT balancing.
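
As a rough user-space illustration of the idea (not the kernel code), the
bookkeeping can be modeled with a plain 64-bit word standing in for the
cpumask.  Every name below (rto_rq_model, rto_mask_model, the model_*
helpers, MODEL_NR_CPUS) is hypothetical and exists only for this sketch; the
real patch uses cpumask_t, cpu_set()/cpu_clear() and a write barrier, none of
which are modeled here.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64	/* assumption: all CPUs fit in one 64-bit word */

struct rto_rq_model {
	int cpu;			/* CPU that owns this runqueue */
	unsigned int rt_nr_running;	/* RT tasks queued here */
};

/* Bit n set => CPU n currently has two or more RT tasks queued. */
static uint64_t rto_mask_model;

static void model_inc_rt_tasks(struct rto_rq_model *rq)
{
	assert(rq->cpu >= 0 && rq->cpu < MODEL_NR_CPUS);
	if (++rq->rt_nr_running == 2)
		rto_mask_model |= UINT64_C(1) << rq->cpu;
}

static void model_dec_rt_tasks(struct rto_rq_model *rq)
{
	assert(rq->rt_nr_running > 0);
	if (--rq->rt_nr_running == 1)
		rto_mask_model &= ~(UINT64_C(1) << rq->cpu);
}

/* Count the overloaded CPUs a balancer on this_cpu would have to visit. */
static int model_count_rto_candidates(int this_cpu)
{
	uint64_t m;
	int n = 0;

	assert(this_cpu >= 0 && this_cpu < MODEL_NR_CPUS);
	m = rto_mask_model & ~(UINT64_C(1) << this_cpu);
	while (m) {
		m &= m - 1;	/* drop the lowest set bit */
		n++;
	}
	return n;
}
```

The work done by the balancer in this model scales with the number of
overloaded CPUs rather than with the number of online CPUs, which is the
scalability point the changelog makes.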

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Peter W. Morreale <pmorreale@novell.com>
---

 kernel/sched.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index daeb8ed..e22eec7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -632,6 +632,7 @@ static inline struct rq *this_rq_lock(void)
 
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
 static __cacheline_aligned_in_smp atomic_t rt_overload;
+static cpumask_t rto_cpus;
 #endif
 
 static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
@@ -640,8 +641,11 @@ static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq)
 	if (rt_task(p)) {
 		rq->rt_nr_running++;
 # ifdef CONFIG_SMP
-		if (rq->rt_nr_running == 2)
+		if (rq->rt_nr_running == 2) {
+			cpu_set(rq->cpu, rto_cpus);
+			smp_wmb();
 			atomic_inc(&rt_overload);
+		}
 # endif
 	}
 #endif
@@ -654,8 +658,10 @@ static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq)
 		WARN_ON(!rq->rt_nr_running);
 		rq->rt_nr_running--;
 # ifdef CONFIG_SMP
-		if (rq->rt_nr_running == 1)
+		if (rq->rt_nr_running == 1) {
 			atomic_dec(&rt_overload);
+			cpu_clear(rq->cpu, rto_cpus);
+		}
 # endif
 	}
 #endif
@@ -1622,7 +1628,7 @@ static void balance_rt_tasks(struct rq *this_rq, int this_cpu)
 	 */
 	next = pick_next_task(this_rq, this_rq->curr);
 
-	for_each_online_cpu(cpu) {
+	for_each_cpu_mask(cpu, rto_cpus) {
 		if (cpu == this_cpu)
 			continue;
 		src_rq = cpu_rq(cpu);

-

From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:50 am

We inadvertently added a redundant function (rt_next_highest_task), so remove
it and use pick_rt_task() instead.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c    |    9 +++++----
 kernel/sched_rt.c |   44 --------------------------------------------
 2 files changed, 5 insertions(+), 48 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0dabf89..daeb8ed 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1471,9 +1471,10 @@ next_in_queue:
 	 * Return the highest-prio non-running RT task (if task
 	 * may run on this CPU):
 	 */
-	if (!task_running(src_rq, tmp) &&
-				cpu_isset(this_cpu, tmp->cpus_allowed))
-		return tmp;
+	if (!task_running(src_rq, tmp)) {
+		if ((this_cpu == -1) || cpu_isset(this_cpu, tmp->cpus_allowed))
+			return tmp;
+	}
 
 	curr = curr->next;
 	if (curr != head)
@@ -1569,7 +1570,7 @@ static int push_rt_task(struct rq *this_rq)
 
 	assert_spin_locked(&this_rq->lock);
 
-	next_task = rt_next_highest_task(this_rq);
+	next_task = pick_rt_task(this_rq, -1);
 	if (!next_task)
 		return 0;
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 8d59e62..369827b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -96,50 +96,6 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
 	return next;
 }
 
-#ifdef CONFIG_PREEMPT_RT
-static struct task_struct *rt_next_highest_task(struct rq *rq)
-{
-	struct rt_prio_array *array = &rq->rt.active;
-	struct task_struct *next;
-	struct list_head *queue;
-	int idx;
-
-	if (likely (rq->rt_nr_running < 2))
-		return NULL;
-
-	idx = sched_find_first_bit(array->bitmap);
-	if (idx >= MAX_RT_PRIO) {
-		WARN_ON(1); /* rt_nr_running is bad */
-		return NULL;
-	}
-
-	queue = array->queue + idx;
-	next = list_entry(queue->next, struct task_struct, run_list);
-	if (unlikely(next != current))
-		return next;
-
-	if (queue->next->next != queue) {
-		/* same prio task */
-		next = list_entry(queue->next->next, struct task_struct, run_list);
-		goto out;
-	}
-
-	/* slower, but more flexible ...
From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:50 am

From: Steven Rostedt <rostedt@goodmis.org>

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---

 kernel/sched.c    |  141 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_rt.c |   44 +++++++++++++++++
 2 files changed, 178 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3e75c62..0dabf89 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -304,6 +304,7 @@ struct rq {
 #ifdef CONFIG_PREEMPT_RT
 	unsigned long rt_nr_running;
 	unsigned long rt_nr_uninterruptible;
+	int curr_prio;
 #endif
 
 	unsigned long switch_timestamp;
@@ -1484,6 +1485,123 @@ next_in_queue:
 
 static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 
+/* Only try this algorithm three times */
+#define RT_PUSH_MAX_TRIES 3
+
+/* Will lock the rq it finds */
+static struct rq *find_lock_lowest_rq(cpumask_t *cpu_mask,
+				      struct task_struct *task,
+				      struct rq *this_rq)
+{
+	struct rq *lowest_rq = NULL;
+	int dst_cpu = -1;
+	int cpu;
+	int tries;
+
+	for (tries = 0; tries < RT_PUSH_MAX_TRIES; tries++) {
+		/*
+		 * Scan each rq for the lowest prio.
+		 */
+		for_each_cpu_mask(cpu, *cpu_mask) {
+			struct rq *rq = &per_cpu(runqueues, cpu);
+
+			if (cpu == smp_processor_id())
+				continue;
+
+			/* We look for lowest RT prio or non-rt CPU */
+			if (rq->curr_prio >= MAX_RT_PRIO) {
+				lowest_rq = rq;
+				dst_cpu = cpu;
+				break;
+			}
+
+			/* no locking for now */
+			if (rq->curr_prio > task->prio &&
+			    (!lowest_rq || rq->curr_prio < lowest_rq->curr_prio)) {
+				lowest_rq = rq;
+				dst_cpu = cpu;
+			}
+		}
+
+		if (!lowest_rq)
+			break;
+
+		/* if the prio of this runqueue changed, try again */
+		if (double_lock_balance(this_rq, lowest_rq)) {
+			/*
+			 * We had to unlock the run queue. In
+			 * the mean time, task could have
+			 * migrated already or had its affinity changed.
+			 */
+			if (unlikely(task_rq(task) != this_rq ||
+				     !cpu_isset(dst_cpu, ...
From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:50 am

We should init the base value of the current RQ priority to "IDLE"

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index dfd0b92..7c4fba8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7386,6 +7386,8 @@ void __init sched_init(void)
 		highest_cpu = i;
 		/* delimiter for bitsearch: */
 		__set_bit(MAX_RT_PRIO, array->bitmap);
+
+		set_rq_prio(rq, MAX_PRIO);
 	}
 
 	set_load_weight(&init_task);

-

From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:50 am

A little cleanup to avoid #ifdef proliferation later in the series

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   16 +++++++++++++---
 1 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index e22eec7..dfd0b92 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -365,6 +365,16 @@ struct rq {
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 static DEFINE_MUTEX(sched_hotcpu_mutex);
 
+#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
+static inline void set_rq_prio(struct rq *rq, int prio)
+{
+	rq->curr_prio = prio;
+}
+
+#else
+#define set_rq_prio(rq, prio) do { } while(0)
+#endif
+
 static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
 {
 	rq->curr->sched_class->check_preempt_curr(rq, p);
@@ -2329,9 +2339,9 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 */
 	prev_state = prev->state;
 	_finish_arch_switch(prev);
-#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
-	rq->curr_prio = current->prio;
-#endif
+
+	set_rq_prio(rq, current->prio);
+
 	finish_lock_switch(rq, prev);
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
 	/*

-

From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:50 am

This is an implementation of Steve's idea that we should update the RQ's
notion of priority to reflect the highest-priority queued task, even if that
task is not (yet) running.  This prevents us from pushing multiple tasks to
the RQ before it gets a chance to reschedule.
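
A minimal user-space sketch of that behavior, assuming a two-word priority
bitmap with a delimiter bit at the first non-RT slot.  The prio_rq_model
struct, the model_* helpers and MODEL_MAX_RT_PRIO are hypothetical names for
illustration only; the per-priority task lists of the real runqueue are
reduced to counters.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_RT_PRIO 100	/* lower number = higher priority */

struct prio_rq_model {
	uint64_t bitmap[2];		/* bit p set => a task is queued at prio p */
	unsigned int nr_queued[MODEL_MAX_RT_PRIO];
	int highest_prio;		/* updated at queue time, not at switch time */
};

/* Index of the first set bit; the delimiter bit guarantees one exists. */
static int model_find_first_bit(const uint64_t *bitmap)
{
	int i;

	for (i = 0; i < 128; i++)
		if (bitmap[i / 64] & (UINT64_C(1) << (i % 64)))
			return i;
	return 128;
}

static void model_update_rq_prio(struct prio_rq_model *rq)
{
	rq->highest_prio = model_find_first_bit(rq->bitmap);
}

static void model_rq_init(struct prio_rq_model *rq)
{
	memset(rq, 0, sizeof(*rq));
	/* delimiter bit: an empty RT queue reports MODEL_MAX_RT_PRIO */
	rq->bitmap[MODEL_MAX_RT_PRIO / 64] |=
		UINT64_C(1) << (MODEL_MAX_RT_PRIO % 64);
	model_update_rq_prio(rq);
}

static void model_enqueue(struct prio_rq_model *rq, int prio)
{
	assert(prio >= 0 && prio < MODEL_MAX_RT_PRIO);
	rq->nr_queued[prio]++;
	rq->bitmap[prio / 64] |= UINT64_C(1) << (prio % 64);
	model_update_rq_prio(rq);
}

static void model_dequeue(struct prio_rq_model *rq, int prio)
{
	assert(prio >= 0 && prio < MODEL_MAX_RT_PRIO);
	assert(rq->nr_queued[prio] > 0);
	if (--rq->nr_queued[prio] == 0)
		rq->bitmap[prio / 64] &= ~(UINT64_C(1) << (prio % 64));
	model_update_rq_prio(rq);
}
```

Because highest_prio changes as soon as a task is queued, a second pusher
looking at this runqueue sees the new, higher priority immediately and is
less likely to pick the same target again.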

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   34 +++++++++++++++++++++++++---------
 1 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7c4fba8..c17e2e4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -304,7 +304,7 @@ struct rq {
 #ifdef CONFIG_PREEMPT_RT
 	unsigned long rt_nr_running;
 	unsigned long rt_nr_uninterruptible;
-	int curr_prio;
+	int highest_prio;
 #endif
 
 	unsigned long switch_timestamp;
@@ -368,11 +368,20 @@ static DEFINE_MUTEX(sched_hotcpu_mutex);
 #if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
 static inline void set_rq_prio(struct rq *rq, int prio)
 {
-	rq->curr_prio = prio;
+	rq->highest_prio = prio;
+}
+
+static inline void update_rq_prio(struct rq *rq)
+{
+	struct rt_prio_array *array = &rq->rt.active;
+	int                   prio  = sched_find_first_bit(array->bitmap);
+
+	set_rq_prio(rq, prio);
 }
 
 #else
 #define set_rq_prio(rq, prio) do { } while(0)
+#define update_rq_prio(rq)    do { } while(0)
 #endif
 
 static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
@@ -1023,12 +1032,14 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
 	sched_info_queued(p);
 	p->sched_class->enqueue_task(rq, p, wakeup);
 	p->se.on_rq = 1;
+	update_rq_prio(rq);
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
 {
 	p->sched_class->dequeue_task(rq, p, sleep);
 	p->se.on_rq = 0;
+	update_rq_prio(rq);
 }
 
 /*
@@ -1526,15 +1537,15 @@ static struct rq *find_lock_lowest_rq(cpumask_t *cpu_mask,
 				continue;
 
 			/* We look for lowest RT prio or non-rt CPU */
-			if (rq->curr_prio >= MAX_RT_PRIO) {
+			if (rq->highest_prio ...
From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:51 am

There are three events that require consideration for redistributing RT
tasks:

1) When one or more higher-priority tasks preempt a lower-priority one on
   an RQ
2) When a lower-priority task is woken up on an RQ
3) When an RQ downgrades its current priority

Steve Rostedt's push_rt patch addresses (1).  It hooks in right after
a new task has been switched-in.  If this was the result of an RT
preemption, or if more than one task was awoken at the same time, we
can try to push some of those other tasks away.

This patch addresses (2).  When we wake up a task, we check to see
if it would preempt the current task on the queue.  If it will not, we
attempt to find a better suited CPU (e.g. one running something lower
priority than the task being woken) and try to activate the task there.

Finally, we have (3).  In theory, we only need to balance_rt_tasks() if
the following conditions are met:
   1) One or more CPUs are in overload, AND
   2) We are about to switch to a task that lowers our priority.

(3) will be addressed in a later patch.
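
The wake-up decision for case (2) might be sketched in user space roughly as
follows.  This is an assumption-laden model, not the patch itself:
model_pick_wake_cpu, MODEL_NR_CPUS and the flat cpu_prio[] array are
hypothetical, locking and retries are ignored, and the tie-breaking rule
(prefer the CPU running the lowest-priority work) is a choice made for this
sketch rather than a statement about the comparison used in the series.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_RT_PRIO 100	/* values >= this mean "not running RT" */
#define MODEL_NR_CPUS 8

/*
 * Decide where a newly woken RT task should go.  Lower number = higher
 * priority.  Returns home_cpu when the task would preempt there or when no
 * better CPU is found.
 */
static int model_pick_wake_cpu(const int cpu_prio[MODEL_NR_CPUS],
			       int home_cpu, int task_prio, uint32_t allowed)
{
	int best = -1;
	int cpu;

	assert(home_cpu >= 0 && home_cpu < MODEL_NR_CPUS);

	/* The task would run immediately on its own RQ: nothing to do. */
	if (task_prio < cpu_prio[home_cpu])
		return home_cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (cpu == home_cpu || !(allowed & (UINT32_C(1) << cpu)))
			continue;

		/* A CPU running no RT work at all is always good enough. */
		if (cpu_prio[cpu] >= MODEL_MAX_RT_PRIO)
			return cpu;

		/* Otherwise remember the lowest-priority CPU we can preempt. */
		if (cpu_prio[cpu] > task_prio &&
		    (best < 0 || cpu_prio[cpu] > cpu_prio[best]))
			best = cpu;
	}

	return best < 0 ? home_cpu : best;
}
```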

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |  109 ++++++++++++++++++++++++++++++++------------------------
 1 files changed, 62 insertions(+), 47 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0600062..e536142 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1626,6 +1626,13 @@ out:
 	return ret;
 }
 
+/* Push all tasks that we can to other CPUs */
+static void push_rt_tasks(struct rq *this_rq)
+{
+	while (push_rt_task(this_rq))
+		;
+}
+
 /*
  * Pull RT tasks from other CPUs in the RT-overload
  * case. Interrupts are disabled, local rq is locked.
@@ -1986,6 +1993,46 @@ out_set_cpu:
 		this_cpu = smp_processor_id();
 		cpu = task_cpu(p);
 	}
+	
+#if defined(CONFIG_PREEMPT_RT)
+       /*
+        * If a newly woken up RT task will not run immediately on its affined
+        * RQ, try to find another CPU it can preempt:
+        */
+	if (rt_task(p) && (p->prio > ...
From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:51 am

We can avoid dirtying an rq-related cacheline by comparing before we store,
so do that.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index e536142..1058a1f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -376,7 +376,8 @@ static inline void update_rq_prio(struct rq *rq)
 	struct rt_prio_array *array = &rq->rt.active;
 	int                   prio  = sched_find_first_bit(array->bitmap);
 
-	set_rq_prio(rq, prio);
+	if (rq->highest_prio != prio)
+		set_rq_prio(rq, prio);
 }
 
 #else

-

From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:50 am

Get rid of the superfluous dst_cpu, and move the cpu_mask inside the search
function.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   18 +++++++-----------
 1 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c17e2e4..0600062 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1517,20 +1517,21 @@ static int double_lock_balance(struct rq *this_rq, struct rq *busiest);
 #define RT_PUSH_MAX_TRIES 3
 
 /* Will lock the rq it finds */
-static struct rq *find_lock_lowest_rq(cpumask_t *cpu_mask,
-				      struct task_struct *task,
+static struct rq *find_lock_lowest_rq(struct task_struct *task,
 				      struct rq *this_rq)
 {
 	struct rq *lowest_rq = NULL;
-	int dst_cpu = -1;
 	int cpu;
 	int tries;
+	cpumask_t cpu_mask;
+
+	cpus_and(cpu_mask, cpu_online_map, task->cpus_allowed);
 
 	for (tries = 0; tries < RT_PUSH_MAX_TRIES; tries++) {
 		/*
 		 * Scan each rq for the lowest prio.
 		 */
-		for_each_cpu_mask(cpu, *cpu_mask) {
+		for_each_cpu_mask(cpu, cpu_mask) {
 			struct rq *rq = &per_cpu(runqueues, cpu);
 
 			if (cpu == smp_processor_id())
@@ -1539,7 +1540,6 @@ static struct rq *find_lock_lowest_rq(cpumask_t *cpu_mask,
 			/* We look for lowest RT prio or non-rt CPU */
 			if (rq->highest_prio >= MAX_RT_PRIO) {
 				lowest_rq = rq;
-				dst_cpu = cpu;
 				break;
 			}
 
@@ -1547,7 +1547,6 @@ static struct rq *find_lock_lowest_rq(cpumask_t *cpu_mask,
 			if (rq->highest_prio > task->prio &&
 			    (!lowest_rq || rq->highest_prio < lowest_rq->highest_prio)) {
 				lowest_rq = rq;
-				dst_cpu = cpu;
 			}
 		}
 
@@ -1562,7 +1561,7 @@ static struct rq *find_lock_lowest_rq(cpumask_t *cpu_mask,
 			 * migrated already or had its affinity changed.
 			 */
 			if (unlikely(task_rq(task) != this_rq ||
-				     !cpu_isset(dst_cpu, task->cpus_allowed))) {
+				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed))) {
 				spin_unlock(&lowest_rq->lock);
 				lowest_rq = ...
From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:51 am

We only need to track whether the CPU is in a non-RT state, as opposed to its
priority within the non-RT state.  So flatten non-RT priorities to a single
value in an effort to reduce cache thrashing.
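
A small user-space sketch of the flattening combined with the
compare-before-store check from earlier in the series.  The rq_prio_model
struct, its writes counter and the model_* helpers are hypothetical and exist
only to make the effect visible; the constants mirror the usual MAX_RT_PRIO
and MAX_PRIO values, which is an assumption about this tree.

```c
#include <assert.h>

#define MODEL_MAX_RT_PRIO 100
#define MODEL_MAX_PRIO    140	/* used here as the "idle" marker */

struct rq_prio_model {
	int highest_prio;
	unsigned int writes;	/* how often the shared field was dirtied */
};

/* Collapse every normal (non-RT, non-idle) priority into one value. */
static int model_flatten_prio(int prio)
{
	assert(prio >= 0 && prio <= MODEL_MAX_PRIO);
	if (prio != MODEL_MAX_PRIO && prio > MODEL_MAX_RT_PRIO)
		return MODEL_MAX_RT_PRIO;
	return prio;
}

/* Store only when the flattened value actually changes. */
static void model_update_prio(struct rq_prio_model *rq, int prio)
{
	prio = model_flatten_prio(prio);
	if (rq->highest_prio != prio) {
		rq->highest_prio = prio;
		rq->writes++;
	}
}
```

In this model, switching between two normal tasks of different nice levels
leaves highest_prio untouched, so other CPUs reading it do not see the line
bounce.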

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a1f1d92..4abe738 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -371,11 +371,19 @@ static inline void set_rq_prio(struct rq *rq, int prio)
 	rq->highest_prio = prio;
 }
 
+/*
+ * We dont care what the exact normal priority is.  We only care about
+ * RT-priority, vs non-RT (normal or idle).  So flatten the priority if its a
+ * non-RT variety. This will reduce cache-thrashing on the rq->highest_prio.
+ */
 static inline void update_rq_prio(struct rq *rq)
 {
 	struct rt_prio_array *array = &rq->rt.active;
 	int                   prio  = sched_find_first_bit(array->bitmap);
 
+	if ((prio != MAX_PRIO) && (prio > MAX_RT_PRIO))
+		prio = MAX_RT_PRIO;
+
 	if (rq->highest_prio != prio)
 		set_rq_prio(rq, prio);
 }

-

From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:51 am

From: Steven Rostedt <rostedt@goodmis.org>

Steve found these errors in the original patch.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c    |   15 ++++++++-
 kernel/sched_rt.c |   90 +----------------------------------------------------
 2 files changed, 15 insertions(+), 90 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1058a1f..a1f1d92 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1535,7 +1535,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 		for_each_cpu_mask(cpu, cpu_mask) {
 			struct rq *rq = &per_cpu(runqueues, cpu);
 
-			if (cpu == smp_processor_id())
+			if (cpu == this_rq->cpu)
 				continue;
 
 			/* We look for lowest RT prio or non-rt CPU */
@@ -1561,7 +1561,8 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task,
 			 * the mean time, task could have
 			 * migrated already or had its affinity changed.
 			 */
-			if (unlikely(task_rq(task) != this_rq ||
+			if (unlikely(task_running(this_rq, task) ||
+				     task_rq(task) != this_rq ||
 				     !cpu_isset(lowest_rq->cpu, task->cpus_allowed))) {
 				spin_unlock(&lowest_rq->lock);
 				lowest_rq = NULL;
@@ -2380,6 +2381,7 @@ static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	}
 
 #endif
+
 	fire_sched_in_preempt_notifiers(current);
 	trace_stop_sched_switched(current);
 	/*
@@ -4102,6 +4104,15 @@ asmlinkage void __sched __schedule(void)
 		context_switch(rq, prev, next); /* unlocks the rq */
 		__preempt_enable_no_resched();
 	} else {
+#if defined(CONFIG_PREEMPT_RT) && defined(CONFIG_SMP)
+		/*
+		 * If we hit the condition where we do not need to actually
+		 * reschedule, we need to check if there are any tasks that
+		 * should be pushed away
+		 */
+		if (unlikely(rq->rt_nr_running > 1))
+			push_rt_tasks(rq);
+#endif
 		__preempt_enable_no_resched();
 		spin_unlock(&rq->lock);
 		trace_stop_sched_switched(next);
diff --git a/kernel/sched_rt.c ...
From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:51 am

This code tracks the priority of each CPU so that global migration
decisions are easy to calculate.  Each CPU can be in a state as follows:

                 (INVALID), IDLE, NORMAL, RT1, ... RT99

going from the lowest priority to the highest.  CPUs in the INVALID state
are not eligible for routing.  The system maintains this state with
a two-dimensional bitmap (the first dimension for priority class, the second
for CPUs in that class).  Therefore a typical application without affinity
restrictions can find a suitable CPU with O(1) complexity (e.g. two bit
searches).  For tasks with affinity restrictions, the algorithm has a
worst-case complexity of O(min(102, NR_CPUS)), though the scenario that
yields the worst-case search is fairly contrived.
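
Since sched_cpupri.c is not reproduced in full here, the following is only a
user-space guess at the shape of such a structure, not the patch's code.  All
names (cpupri_model, MODEL_CPUPRI_*, the cpupri_model_* helpers) and the class
numbering (IDLE=0, NORMAL=1, RT1..RT99 = 2..100, INVALID=-1) are assumptions
made for this sketch, and a plain loop stands in for the bit searches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_CPUPRI_INVALID (-1)
#define MODEL_CPUPRI_IDLE    0
#define MODEL_CPUPRI_NORMAL  1
#define MODEL_CPUPRI_NR      101	/* IDLE, NORMAL, RT1..RT99 */
#define MODEL_NR_CPUS        64

struct cpupri_model {
	uint64_t class_active[2];		/* dimension 1: non-empty classes */
	uint64_t class_cpus[MODEL_CPUPRI_NR];	/* dimension 2: CPUs per class */
	int cpu_class[MODEL_NR_CPUS];
};

static int model_lowest_bit(uint64_t w)	/* w must be non-zero */
{
	int i = 0;

	while (!(w & 1)) {
		w >>= 1;
		i++;
	}
	return i;
}

static void cpupri_model_init(struct cpupri_model *cp)
{
	int i;

	memset(cp, 0, sizeof(*cp));
	for (i = 0; i < MODEL_NR_CPUS; i++)
		cp->cpu_class[i] = MODEL_CPUPRI_INVALID;
}

/* Move a CPU from its current class to cls, keeping both levels in sync. */
static void cpupri_model_set(struct cpupri_model *cp, int cpu, int cls)
{
	int old;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	assert(cls >= MODEL_CPUPRI_INVALID && cls < MODEL_CPUPRI_NR);
	old = cp->cpu_class[cpu];
	if (old == cls)
		return;
	if (old != MODEL_CPUPRI_INVALID) {
		cp->class_cpus[old] &= ~(UINT64_C(1) << cpu);
		if (!cp->class_cpus[old])
			cp->class_active[old / 64] &= ~(UINT64_C(1) << (old % 64));
	}
	if (cls != MODEL_CPUPRI_INVALID) {
		cp->class_cpus[cls] |= UINT64_C(1) << cpu;
		cp->class_active[cls / 64] |= UINT64_C(1) << (cls % 64);
	}
	cp->cpu_class[cpu] = cls;
}

/* Lowest-class allowed CPU strictly below task_cls, or -1 if none. */
static int cpupri_model_find(const struct cpupri_model *cp, int task_cls,
			     uint64_t allowed)
{
	int cls;

	for (cls = 0; cls < task_cls && cls < MODEL_CPUPRI_NR; cls++) {
		uint64_t hit;

		if (!(cp->class_active[cls / 64] & (UINT64_C(1) << (cls % 64))))
			continue;
		hit = cp->class_cpus[cls] & allowed;
		if (hit)
			return model_lowest_bit(hit);
	}
	return -1;
}
```

With no affinity restriction the first non-empty class always yields a hit,
which corresponds to the "two bit searches" case; a restrictive mask can
force the search to walk through several classes before it finds an allowed
CPU or gives up.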

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/Makefile       |    2 
 kernel/sched.c        |   37 +++------
 kernel/sched_cpupri.c |  200 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_cpupri.h |   10 ++
 4 files changed, 222 insertions(+), 27 deletions(-)

diff --git a/kernel/Makefile b/kernel/Makefile
index e4e2acf..d9d1351 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
 	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o \
-	    utsname.o
+	    utsname.o sched_cpupri.o
 
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
diff --git a/kernel/sched.c b/kernel/sched.c
index 4abe738..bdb6be0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -67,6 +67,8 @@
 
 #include <asm/tlb.h>
 
+#include "sched_cpupri.h"
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  * This is default implementation.
@@ -384,8 +386,10 @@ static inline void update_rq_prio(struct rq *rq)
 	if ((prio != MAX_PRIO) && (prio > MAX_RT_PRIO))
 		prio = MAX_RT_PRIO;
 ...
From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 9:51 am

Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for one or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then during the handling of the timer, the system determines that it's
time to smp-rebalance the system so it schedules softirq_sched to run
from within the softirq_timer kthread context. So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.

The problem is that these tasks cannot ever be pulled away since they
are already running on their one and only valid RQ.  However, the other
CPUs cannot determine that the tasks are unpullable without going
through expensive checks/locking.  Therefore the helping CPUs
experience unnecessary overhead/latencies regardless as they
ineffectively try to process the overload condition.

This patch tries to optimize the situation by utilizing the hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated, which may be utilized by the overload
handling code to eliminate unnecessary rebalance attempts.  We also
introduce a per-rq variable to count the number of migratable tasks
that are currently running.  We only go into overload if we have more
than one rt task, AND at least one of them is migratable. 

Calculating the weight is probably relatively expensive, so it is only
done when the cpus_allowed mask is updated (which should be relatively
infrequent, especially compared to scheduling frequency) and cached in
the task_struct.
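
A user-space sketch of the cached-weight idea and the resulting overload
test.  The diff is truncated here, so the field and function names below
(nr_cpus_allowed, rt_nr_migratory, the model_* helpers) are hypothetical
stand-ins rather than the names the patch uses, and the case of changing
affinity while a task is already queued is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

struct mig_task_model {
	uint64_t cpus_allowed;
	unsigned int nr_cpus_allowed;	/* cached Hamming weight of the mask */
};

struct mig_rq_model {
	unsigned int rt_nr_running;	/* RT tasks queued on this RQ */
	unsigned int rt_nr_migratory;	/* ...of which can run elsewhere */
};

static unsigned int model_hweight64(uint64_t w)
{
	unsigned int n = 0;

	while (w) {
		w &= w - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* The weight is computed only here, when the mask changes. */
static void model_set_cpus_allowed(struct mig_task_model *t, uint64_t mask)
{
	assert(mask != 0);
	t->cpus_allowed = mask;
	t->nr_cpus_allowed = model_hweight64(mask);
}

static void model_enqueue_rt(struct mig_rq_model *rq,
			     const struct mig_task_model *t)
{
	rq->rt_nr_running++;
	if (t->nr_cpus_allowed > 1)
		rq->rt_nr_migratory++;
}

static void model_dequeue_rt(struct mig_rq_model *rq,
			     const struct mig_task_model *t)
{
	assert(rq->rt_nr_running > 0);
	rq->rt_nr_running--;
	if (t->nr_cpus_allowed > 1) {
		assert(rq->rt_nr_migratory > 0);
		rq->rt_nr_migratory--;
	}
}

/* Overloaded only if some queued RT task could actually be moved. */
static int model_rt_overloaded(const struct mig_rq_model *rq)
{
	return rq->rt_nr_running > 1 && rq->rt_nr_migratory > 0;
}
```

Two CPU-bound kthreads queued together leave model_rt_overloaded() false in
this model, so no other CPU is asked to help with tasks it could never pull.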


Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h |    1 
 kernel/fork.c         |    1 
 kernel/sched.c        |  121 +++++++++++++++++++++++++++++++++++++------------
 3 files changed, 94 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h ...
From: Ingo Oeser
Date: Tuesday, October 23, 2007 - 5:19 pm

Hi Gregory,


Why not make it a task flag, since according to your code, you are only
interested in whether this is <= 1 or > 1?  Since !(x <= 1) <=> (x > 1)
for any given unsigned integer x, the required data structure is
a "boolean" or a flag.


Best Regards

Ingo Oeser
-

From: Gregory Haskins
Date: Tuesday, October 23, 2007 - 8:20 pm

Hi Ingo,
  You are correct that the data is in fact interpreted as a boolean.  I
also had considered using a more boolean-like notation at one point.
However, I then figured that since I went through the expense of computing
it, I might as well store the actual value as an integer in case it can be
used in another way.  But to be honest, I cannot really think of any
other potential uses, so perhaps we would be best to follow your
suggestion.  It could always be changed if such a need ever arises.

Thank you for the feedback!

Regards,
-Greg

