[PATCH 1/2] RT: Remove comment that is no longer true

Previous thread: [PATCH 01/39] KVM: VMX: Clean up magic number 0x66 in init_rmode_tss by Avi Kivity on Thursday, September 25, 2008 - 4:54 am. (40 messages)

Next thread: [PATCH] - UV fix for size of hub mappings by Jack Steiner on Thursday, September 25, 2008 - 5:52 am. (4 messages)
From: Chirag Jog
Date: Thursday, September 25, 2008 - 5:32 am

Hi Gregory,
We see the following BUG followed by a hang on the latest kernel 2.6.26.5-rt9 on a Power6 blade (PPC64)
It is easily recreated by running the async_handler or sbrk_mutex (realtime tests from ltp) tests.

login: cpu 0x2: Vector: 700 (Program Check) at [c0000000e8e875d0]
    pc: c00000000005110c: .pick_next_pushable_task+0x54/0x9c
    lr: c000000000059f50: .push_rt_task+0x44/0x2b4
    sp: c0000000e8e87850
   msr: 8000000000021032
  current = 0xc0000000ea5bb2e0
  paca    = 0xc000000000608700
    pid   = 2811, comm = async_handler
kernel BUG at kernel/sched_rt.c:1041! <---------------------
enter ? for help
[link register   ] c000000000059f50 .push_rt_task+0x44/0x2b4
[c0000000e8e87850] c0000000e8e878f0 (unreliable)
[c0000000e8e87900] c00000000005a1dc .push_rt_tasks+0x1c/0x38
[c0000000e8e87980] c00000000005a21c .post_schedule_rt+0x24/0x44
[c0000000e8e87a10] c000000000057cbc .finish_task_switch+0xd0/0x180
[c0000000e8e87ab0] c0000000003b6e88 .__schedule+0x6e0/0x798
[c0000000e8e87b90] c0000000003b7148 .schedule+0xec/0x11c
[c0000000e8e87c10] c0000000003b7a40 .do_nanosleep+0x6c/0xcc
[c0000000e8e87c90] c000000000080738 .hrtimer_nanosleep+0x7c/0x100
[c0000000e8e87d90] c000000000080830 .sys_nanosleep+0x74/0x94
[c0000000e8e87e30] c0000000000086ac syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 0000008026449844
SP (400014185f0) is in userspace


This is generated by the BUG_ON lines in the pick_next_pushable function
introduced by the sched-only-push-once-per-queue.patch .

The -rt kernel prior to this patch didnot give such BUGes.

All this was tried with
CONFIG_GROUP_SCHED=N
CONFIG_RT_GROUP_SCHED=N


Setting the options
CONFIG_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=Y,
seems to solve the problem.



-- 
-Thanks,Chirag
--

From: Gregory Haskins
Date: Monday, September 29, 2008 - 11:13 am

Hi Chirag

FYI I am looking at this now.  I suspect a dequeue_pushable_task()
probably found its way inside a conditional for GROUP_SCHED and
inadventently gets compiled away if you disable the feature.=20
Investigating now..



From: Gregory Haskins
Date: Monday, September 29, 2008 - 2:18 pm

Hi Chirag


Call me an LTP newbie, but where can I get the sbrk_mutex/async_handler
tests.  I installed the LTP rpm and I dont seem to have tests by that
name.  I have a 26.5-rt9 kernel all configured and ready to go but no
way to reproduce this issue.  Please advise.

Regards,


From: Gregory Haskins
Date: Monday, September 29, 2008 - 2:34 pm

Ok, I figured out this part.  I needed a newer version of the .rpm from
a different repo.  However, both async_handler and sbrk_mutex seem to
segfault for me.  Hmm

---

ghaskins@test:~> uname -a
Linux test 2.6.26.5-rt9-rt #1 SMP PREEMPT RT Mon Sep 29 14:26:45 EDT
2008 x86_64 x86_64 x86_64 GNU/Linux

test:/home/ghaskins #
/usr/lib64/ltp/testcases/realtime/func/async_handler/async_handler

-----------------------------------
Asynchronous Event Handling Latency
-----------------------------------

jvmsim disabled
Running 1000000 iterations
Segmentation fault

---

Any ideas?



From: Gregory Haskins
Date: Monday, September 29, 2008 - 3:00 pm

Thanks to help from Darren I got around this issue.  Unfortunately both
tests pass so I cannot reproduce this issue, nor do I see the problem
via code inspection.  Ill keep digging but I am currently at a loss.  I
may need to send you some diagnostic patches to find this, if that is ok
with you Chirag?



From: Chirag Jog
Date: Monday, September 29, 2008 - 9:43 pm

Hi Gregory,
This particular bug is not producible on the x86 boxes, i have access
to. Only on ppc64.
Please send the diagnostic patches across. 
I'll try them out! :)

-- 
-Thanks,Chirag
--

From: Gilles Carry
Date: Monday, September 29, 2008 - 11:47 pm

Hi,

I have access to Power6 and x86_64 boxes and so far I could only
reproduce the bug on PPC64.

The bug arised from 2.6.26.3-rt6 since sched-only-push-if-pushable.patch
and sched-only-push-once-per-queue.patch.

Whereas sbrk_mutex definetly shows up the problem, it also can occur
randomly, sometimes during the boot period.

At the beginning, I had system hangs or this (very similar to Chirag's and
not necessarly in sbrk_mutex):

cpu 0x3: Vector: 700 (Program Check) at [c0000000ee30b600]
     pc: c0000000001b9bac: .__list_add+0x70/0xa0
     lr: c0000000001b9ba8: .__list_add+0x6c/0xa0
     sp: c0000000ee30b880
    msr: 8000000000021032
   current = 0xc0000000ee2b1830
   paca    = 0xc0000000005c3980
     pid   = 51, comm = sirq-sched/3
kernel BUG at lib/list_debug.c:33!
enter ? for help
[c0000000ee30b900] c0000000001b8ec0 .plist_del+0x6c/0xcc
[c0000000ee30b9a0] c00000000004d500 .dequeue_pushable_task+0x24/0x3c
[c0000000ee30ba20] c00000000004ec18 .push_rt_task+0x1f0/0x2c0
[c0000000ee30bae0] c00000000004ed0c .push_rt_tasks+0x24/0x44
[c0000000ee30bb70] c00000000004ed58 .post_schedule_rt+0x2c/0x50
[c0000000ee30bc00] c0000000000527c4 .finish_task_switch+0x100/0x1a8
[c0000000ee30bcb0] c0000000002cd1e0 .__schedule+0x688/0x744
[c0000000ee30bd90] c0000000002cd4ec .schedule+0xf4/0x128
[c0000000ee30be20] c000000000061634 .ksoftirqd+0x124/0x37c
[c0000000ee30bf00] c000000000076cf0 .kthread+0x84/0xd4
[c0000000ee30bf90] c000000000029368 .kernel_thread+0x4c/0x68
3:mon>

So I suspected a memory corruption but adding padding fields around
the pointers and extra checks did not reveal any data trashing.


Playing with xmon, I finally found out that when hanging, the system
was stuck in an infinite loop in plist_check_list.
Si I modified lib/plist.c:
I supposed that no list holds more than 100 000 000 elements in
the system. ;-)

  static void plist_check_list(struct list_head *top)
  {
         struct list_head *prev = top, *next = top->next;
+       unsigned long long i = ...
From: Gregory Haskins
Date: Wednesday, October 1, 2008 - 7:22 am

Hi Chirag,
  Please apply this patch to your 26.5-rt9 tree for ppc64, enable
  CONFIG_PROVE_LOCKING (which enables CONFIG_STACKTRACE) and give it a whirl.
  If you get an oops, please post the console output.

Thanks!

-Greg

------------------
sched: add a stacktrace on enqueue_pushable error

NOT FOR INCLUSION!

This is to help debug an issue discovered by Chirag Jog in the thread

http://lkml.org/lkml/2008/9/25/189

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h |    6 ++++++
 kernel/sched_rt.c     |   20 +++++++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 67da014..53e8a6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
 #include <linux/task_io_accounting.h>
 #include <linux/kobject.h>
 #include <linux/latencytop.h>
+#include <linux/stacktrace.h>
 
 #include <asm/processor.h>
 
@@ -1170,6 +1171,11 @@ struct task_struct {
 
 	struct list_head tasks;
 	struct plist_node pushable_tasks;
+	struct {
+		unsigned long data[15];
+		struct stack_trace trace;
+		pid_t pid;
+	} pushable_stack;
 
 	/*
 	 * ptrace_list/ptrace_children forms the list of my children
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 57a0c0d..0e6a88c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -55,7 +55,25 @@ static void update_rt_migration(struct rq *rq)
 
 static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
 {
-	plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
+	if (plist_node_empty(&p->pushable_tasks)) {
+		struct stack_trace *trace = &p->pushable_stack.trace;
+
+		trace->nr_entries = 0;
+		trace->max_entries = sizeof(p->pushable_stack.data)/sizeof(p->pushable_stack.data[0]);
+		trace->entries = &p->pushable_stack.data[0];
+		trace->skip = 0;
+
+		save_stack_trace(trace);
+
+		p->pushable_stack.pid = current->pid;
+	} else {
+		printk(KERN_CRIT ...
From: Gilles Carry
Date: Thursday, October 2, 2008 - 2:42 am

Hi,

I could reproduce the bug on an intel architecture.
I found where the problem is.
I'll post the patch in a few hours.

Gilles.


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gilles.Carry
Linux Project team
mailto: gilles.carry@bull.net
Phone: +33 (0)4 76 29 74 27
Addr.: BULL S.A.  1 rue de Provence, B.P. 208 38432 Echirolles Cedex
http://www.bull.com
http://www.bullopensource.org/
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--

From: Gilles Carry
Date: Thursday, October 2, 2008 - 4:18 am

Hi,

I could reproduce the bug on intel x86_64 with LTP's sbrk_mutex:

kernel BUG at kernel/sched_rt.c:1044!
invalid opcode: 0000 [1] PREEMPT SMP
CPU 5
Modules linked in: mptsas scsi_transport_sas
Pid: 27577, comm: sbrk_mutex Not tainted 2.6.26.5-rt9-00002-g3b27927 #23
RIP: 0010:[<ffffffff80227f95>]  [<ffffffff80227f95>] pick_next_pushable_task+0x6
1/0x77
RSP: 0018:ffff81007713fd28  EFLAGS: 00010046
RAX: 0000000000000005 RBX: ffff810083a4e280 RCX: ffff81013dcee458
RDX: ffff8100771f8000 RSI: ffff81013dcee2c0 RDI: ffff810083a4e280
RBP: ffff81007713fd28 R08: ffff81007713e000 R09: 0000000000000000
R10: 000000004bbbc9e0 R11: ffff81007dc3bde8 R12: ffff81023ff7c910
R13: ffff8101bf4ad0c0 R14: 0000000000000001 R15: ffff810083a4e280
FS:  000000004d3bf940(0063) GS:ffff81013f4458c0(0000) knlGS:00000000f7f216c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000389d495770 CR3: 000000007c11a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sbrk_mutex (pid: 27577, threadinfo ffff81007713e000, task ffff8100771d61
c0)
Stack:  ffff81007713fd68 ffffffff8022b1ce ffff81007713fdc8 ffff810083a4e280
  ffff81023ff7c910 ffff8101bf4ad0c0 0000000000000001 0000000000000000
  ffff81007713fd88 ffffffff8022b3c3 ffff81007713fda8 ffff810083a4e280
Call Trace:
  [<ffffffff8022b1ce>] push_rt_task+0x26/0x207
  [<ffffffff8022b3c3>] push_rt_tasks+0x14/0x1c
  [<ffffffff8022b3e4>] post_schedule_rt+0x19/0x25
  [<ffffffff8022d7e9>] finish_task_switch+0x73/0x121
  [<ffffffff805bbe3d>] thread_return+0x4f/0xdc
  [<ffffffff805bc066>] schedule+0xd4/0xf0
  [<ffffffff805bc686>] do_nanosleep+0x5c/0x9c
  [<ffffffff80248350>] ? hrtimer_nanosleep+0x54/0xbd
  [<ffffffff80247c9d>] ? hrtimer_wakeup+0x0/0x21
  [<ffffffff805bc66b>] ? do_nanosleep+0x41/0x9c
  [<ffffffff8022e9f4>] ? schedule_tail+0x43/0x97
  [<ffffffff80248405>] ? sys_nanosleep+0x4c/0x62
  [<ffffffff8020b32a>] ? ...
From: Gregory Haskins
Date: Friday, October 3, 2008 - 5:42 am

Hi Chirag,

 Please try the following patches applied to 26.5-rt9 and let me know what you
 find.

Hi Steve,
  If these look good to everyone, please consider them for inclusion in -rt10. 

Regards,
-Greg

---

Gregory Haskins (2):
      RT: remove "paranoid" limit in push_rt_task
      RT: Remove comment that is no longer true


 kernel/sched_rt.c |   15 +++------------
 1 files changed, 3 insertions(+), 12 deletions(-)

--

From: Gregory Haskins
Date: Friday, October 3, 2008 - 5:43 am

We fixed the condition noted in the comment with the "pushable_tasks"
logic, but forgot to remove this comment.  Lets clean it up.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 57a0c0d..59ead84 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1125,16 +1125,6 @@ out:
 	return 1;
 }
 
-/*
- * TODO: Currently we just use the second highest prio task on
- *       the queue, and stop when it can't migrate (or there's
- *       no more RT tasks).  There may be a case where a lower
- *       priority RT task has a different affinity than the
- *       higher RT task. In this case the lower RT task could
- *       possibly be able to migrate where as the higher priority
- *       RT task could not.  We currently ignore this issue.
- *       Enhancements are welcome!
- */
 static void push_rt_tasks(struct rq *rq)
 {
 	/* push_rt_task will return true if it moved an RT */

--

From: Gregory Haskins
Date: Friday, October 3, 2008 - 5:43 am

A panic was discovered by Chirag Jog and investigated by Gilles Carry
to be originating in the fact that a task being pushed away
may get migrated away during a double_lock_balance.  The result was
that the pushable_tasks list may become corrupted.

The root cause is that the "paranoid" retry limit could cause us to
bail out of a retry, but still try to remove the item from the (now
potentially incorrect) list.  There are numerous ways to correct the
condition, but the paranoid feature is no longer relevant with the new
pushable logic (since pushable naturally limits the loop anyway), so
lets just remove it.

Reported By: Chirag Jog <chirag@linux.vnet.ibm.com>
Found-by: Gilles Carry <gilles.carry@bull.net>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 59ead84..5a754fe 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1056,7 +1056,6 @@ static int push_rt_task(struct rq *rq)
 {
 	struct task_struct *next_task;
 	struct rq *lowest_rq;
-	int paranoid = RT_MAX_TRIES;
 
 	if (!rq->rt.overloaded)
 		return 0;
@@ -1094,12 +1093,14 @@ static int push_rt_task(struct rq *rq)
 		 * If it has, then try again.
 		 */
 		task = pick_next_pushable_task(rq);
-		if (unlikely(task != next_task) && task && paranoid--) {
+		if (unlikely(task != next_task) && task) {
 			put_task_struct(next_task);
 			next_task = task;
 			goto retry;
 		}
 
+		BUG_ON(task_cpu(next_task) != rq->cpu);
+
 		/*
 		 * Once we have failed to push this task, we will not
 		 * try again, since the other cpus will pull from us

--

From: Gilles Carry
Date: Friday, October 3, 2008 - 6:46 am

Sorry Greg,

Neither PPC64 nor Intel64 make it with this patch.
At boot time, it stops at the BUG_ON you added:
0xc00000000004eca4 is in push_rt_task (kernel/sched_rt.c:1102)

I let you do more investigations.
Have a good week-end in you garage ;)

Gilles.


PPC64:
cpu 0x2: Vector: 700 (Program Check) at [c0000000ee2877b0]
     pc: c00000000004eca4: .push_rt_task+0x1f4/0x2d0
     lr: c00000000004ec24: .push_rt_task+0x174/0x2d0
     sp: c0000000ee287a30
    msr: 8000000000021032
   current = 0xc0000000ee276fe0
   paca    = 0xc0000000005c3780
     pid   = 36, comm = sirq-block/2
kernel BUG at kernel/sched_rt.c:1102!
enter ? for help
[c0000000ee287a30] c00000000004ec78 .push_rt_task+0x1c8/0x2d0 (unreliable)
[c0000000ee287ae0] c00000000004eda4 .push_rt_tasks+0x24/0x44
[c0000000ee287b70] c00000000004edf0 .post_schedule_rt+0x2c/0x50
[c0000000ee287c00] c000000000052864 .finish_task_switch+0x100/0x1a8
[c0000000ee287cb0] c0000000002cdbd0 .__schedule+0x6a0/0x75c
[c0000000ee287d90] c0000000002cdedc .schedule+0xf4/0x128
[c0000000ee287e20] c000000000061700 .ksoftirqd+0x124/0x37c
[c0000000ee287f00] c000000000076dc0 .kthread+0x84/0xd4
[c0000000ee287f90] c000000000029368 .kernel_thread+0x4c/0x68
2:mon>

Intel64:
kernel BUG at kernel/sched_rt.c:1102!
invalid opcode: 0000 [1] PREEMPT SMP
CPU 4
Modules linked in: mptsas scsi_transport_sas
Pid: 61, comm: sirq-block/4 Not tainted 2.6.26.5-rt9-00002-g3b27927-dirty #26
RIP: 0010:[<ffffffff8022b307>]  [<ffffffff8022b307>] push_rt_task+0x15f/0x20b
RSP: 0018:ffff81007f4d5d70  EFLAGS: 00010097
RAX: 0000000000000000 RBX: ffff81007edf09d0 RCX: 000000000822b765
RDX: 000000000822b765 RSI: 0000000000000000 RDI: ffff81000103f280
RBP: ffff81007f4d5da0 R08: ffff81007f4d4000 R09: ffff81007edcbe20
R10: 00000000ffffffff R11: ffffffff8021fa2c R12: 0000000000000000
R13: ffff810001034280 R14: ffff81007edf09e0 R15: ffff81000103f280
FS:  00007f2f26e776f0(0000) GS:ffff81007fc0ccc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: ...
From: Chirag Jog
Date: Friday, October 3, 2008 - 8:45 am

I am also confirming this issue gilles reported.

Although, i have a question:
When we enable group scheduling (CONFIG_RT_GROUP_SCHED),
all the problems disappear.
After skimming through the rt scheduler code, I don't feel group 
scheduling alters the behavior of push/pull strategies in any way.

So I am wonder whether enabling group scheduling
actually solves the problem or just makes it tough 
-- 
-Thanks,Chirag
--

From: Gregory Haskins
Date: Friday, October 3, 2008 - 10:27 am

Hi Chirag,


The issue that Gilles pointed me at is a race condition.  I do not
suspect GROUP_SCHED itself has anything to do with the problem other
than it changes the timing.  HTH!

-Greg



From: Gregory Haskins
Date: Friday, October 3, 2008 - 10:26 am

Indeed.  Your report has revealed the problem to me.

The issue is that there are three conditions embedded in that if(!lower_rq)
code, but two are buried in the !retry case.  This was the mistake I was making.

We basically need to 

a) dequeue if the task hasnt moved
b) retry if the task *has* moved AND there are more tasks left
c) stop of the task *has* moved AND there are no more tasks

I was missing logic to handle (c).  "v2" should fix this so it is handled.
Please give it a try.  Thanks again, Gilles!

(Again, only build-tested)

Regards,
-Greg


---

Gregory Haskins (2):
      RT: remove "paranoid" limit in push_rt_task
      RT: Remove comment that is no longer true


 kernel/sched_rt.c |   44 ++++++++++++++++++++++----------------------
 1 files changed, 22 insertions(+), 22 deletions(-)

-- 
Signature
--

From: Gregory Haskins
Date: Friday, October 3, 2008 - 10:26 am

We fixed the condition noted in the comment with the "pushable_tasks"
logic, but forgot to remove this comment.  Lets clean it up.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 57a0c0d..59ead84 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1125,16 +1125,6 @@ out:
 	return 1;
 }
 
-/*
- * TODO: Currently we just use the second highest prio task on
- *       the queue, and stop when it can't migrate (or there's
- *       no more RT tasks).  There may be a case where a lower
- *       priority RT task has a different affinity than the
- *       higher RT task. In this case the lower RT task could
- *       possibly be able to migrate where as the higher priority
- *       RT task could not.  We currently ignore this issue.
- *       Enhancements are welcome!
- */
 static void push_rt_tasks(struct rq *rq)
 {
 	/* push_rt_task will return true if it moved an RT */

--

From: Gregory Haskins
Date: Friday, October 3, 2008 - 10:26 am

A panic was discovered by Chirag Jog and investigated by Gilles Carry
to be originating in the fact that a task being pushed away
may get migrated away during a double_lock_balance.  The result was
that the pushable_tasks list may become corrupted.

The root cause is that the "paranoid" retry limit could cause us to
bail out of a retry, but still try to remove the item from the (now
potentially incorrect) list.  There are numerous ways to correct the
condition, but the paranoid feature is no longer relevant with the new
pushable logic (since pushable naturally limits the loop anyway), so
lets just remove it.

Reported By: Chirag Jog <chirag@linux.vnet.ibm.com>
Found-by: Gilles Carry <gilles.carry@bull.net>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   34 ++++++++++++++++++++++------------
 1 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 59ead84..201bd97 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1056,7 +1056,6 @@ static int push_rt_task(struct rq *rq)
 {
 	struct task_struct *next_task;
 	struct rq *lowest_rq;
-	int paranoid = RT_MAX_TRIES;
 
 	if (!rq->rt.overloaded)
 		return 0;
@@ -1090,23 +1089,34 @@ static int push_rt_task(struct rq *rq)
 		struct task_struct *task;
 		/*
 		 * find lock_lowest_rq releases rq->lock
-		 * so it is possible that next_task has changed.
-		 * If it has, then try again.
+		 * so it is possible that next_task has migrated.
+		 *
+		 * We need to make sure that the task is still on the same
+		 * run-queue and is also still the next task eligible for
+		 * pushing.
 		 */
 		task = pick_next_pushable_task(rq);
-		if (unlikely(task != next_task) && task && paranoid--) {
-			put_task_struct(next_task);
-			next_task = task;
-			goto retry;
+		if (task_cpu(next_task) == rq->cpu && task == next_task) {
+			/*
+			 * If we get here, the task hasnt moved it all, but
+			 * it has failed to push.  We will not try ...
From: Gregory Haskins
Date: Friday, October 3, 2008 - 5:54 am

I meant to add: this are build-tested only (now back to vacation time
for me, which means slave labor in the garage!).

I will catch up with you guys on Monday.

-Greg


From: Gregory Haskins
Date: Monday, October 6, 2008 - 8:14 am

Hi Steve,
  Chirag reported (via IRC) that v2 fixed his issue.  V3 is identical to v2
  except I cleaned up the patch description and fixed a comment typo.  This
  applies to 26.5-rt9.  Please consider it an urgent fix for -rt10.

Regards,
-Greg

---

Gregory Haskins (2):
      RT: fix push_rt_task() to handle dequeue_pushable properly
      RT: Remove comment that is no longer true


 kernel/sched_rt.c |   44 ++++++++++++++++++++++----------------------
 1 files changed, 22 insertions(+), 22 deletions(-)

-- 

--

From: Gregory Haskins
Date: Monday, October 6, 2008 - 8:14 am

We fixed the condition noted in the comment with the "pushable_tasks"
logic, but forgot to remove this comment.  Lets clean it up.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 57a0c0d..59ead84 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1125,16 +1125,6 @@ out:
 	return 1;
 }
 
-/*
- * TODO: Currently we just use the second highest prio task on
- *       the queue, and stop when it can't migrate (or there's
- *       no more RT tasks).  There may be a case where a lower
- *       priority RT task has a different affinity than the
- *       higher RT task. In this case the lower RT task could
- *       possibly be able to migrate where as the higher priority
- *       RT task could not.  We currently ignore this issue.
- *       Enhancements are welcome!
- */
 static void push_rt_tasks(struct rq *rq)
 {
 	/* push_rt_task will return true if it moved an RT */

--

From: Gregory Haskins
Date: Monday, October 6, 2008 - 8:14 am

A panic was discovered by Chirag Jog where a BUG_ON sanity check
in the new "pushable_task" logic would trigger a panic under
certain circumstances:

http://lkml.org/lkml/2008/9/25/189

Gilles Carry discovered that the root cause was attributed to the
pushable_tasks list getting corrupted in the push_rt_task logic.
This was the result of a dropped rq lock in double_lock_balance
allowing a task in the process of being pushed to potentially migrate
away, and thus corrupt the pushable_tasks() list.

I traced back the problem as introduced by the pushable_tasks patch
that went in recently.   There is a "retry" path in push_rt_task()
that actually had a compound conditional to decide whether to
retry or exit.  I missed the meaning behind the rationale for the
virtual "if(!task) goto out;" portion of the compound statement and
thus did not handle it properly.  The new pushable_tasks logic
actually creates three distinct conditions:

1) an untouched and unpushable task should be dequeued
2) a migrated task where more pushable tasks remain should be retried
3) a migrated task where no more pushable tasks exist should exit

The original logic mushed (1) and (3) together, resulting in the
system dequeuing a migrated task (against an unlocked foreign run-queue
nonetheless).

To fix this, we get rid of the notion of "paranoid" and we support the
three unique conditions properly.  The paranoid feature is no longer
relevant with the new pushable logic (since pushable naturally limits
the loop) anyway, so lets just remove it.

Reported-By: Chirag Jog <chirag@linux.vnet.ibm.com>
Found-by: Gilles Carry <gilles.carry@bull.net>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched_rt.c |   34 ++++++++++++++++++++++------------
 1 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 59ead84..05a1d4a 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1056,7 +1056,6 @@ static int push_rt_task(struct rq ...
From: Gilles Carry
Date: Monday, October 6, 2008 - 11:04 pm

Hi,

After 24 hours of testing on PPC64, I confirm that this patch fixes the issue.

Gilles.
--

Previous thread: [PATCH 01/39] KVM: VMX: Clean up magic number 0x66 in init_rmode_tss by Avi Kivity on Thursday, September 25, 2008 - 4:54 am. (40 messages)

Next thread: