[snip]
I tested with Dmitry's patch and found that migrate_live_tasks(), called from
migration_call(), did migrate all the tasks on the offline cpu to an online cpu.
But some tasks (klogd, for example) were moved back to the offline cpu just
before the BUG_ON(rq->nr_running != 0) check, even before rq's lock was
acquired.

static int __cpuinit
migration_call(struct notifier_block *nfb, unsigned long action, void *
{
	...
	switch (action) {
	...
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		cpuset_lock();
		migrate_live_tasks(cpu);
		rq = cpu_rq(cpu);
		...
		spin_lock_irq(&rq->lock);
		...
		migrate_dead_tasks(cpu);
		spin_unlock_irq(&rq->lock);
		cpuset_unlock();
		migrate_nr_uninterruptible(rq);
		BUG_ON(rq->nr_running != 0);
		...
		break;
	}
	...
}

By debugging, I found that this bug is caused by select_task_rq_fair().

After the tasks on the offline cpu are migrated to an online cpu, the kernel
soon wakes these tasks up via try_to_wake_up(). try_to_wake_up() invokes
select_task_rq_fair() to find a less-loaded cpu in the sched domains for them.
But the sched domains have not been updated yet and the offline cpu is still in
them, so select_task_rq_fair() may return the offline cpu's id, and then the
bug occurs.

Fix the bug by checking select_task_rq_fair()'s return value in
try_to_wake_up() and falling back to the original cpu if the selected cpu is
offline.
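
For illustration only, the race and the fallback can be modelled in user space.
This is a rough sketch, not kernel code: NR_CPUS, the three arrays, and every
model_* function are hypothetical stand-ins invented here for the online map,
the stale sched-domain span, the per-cpu load, select_task_rq_fair() and the
patched step in try_to_wake_up(). It assumes the selector simply picks the
least-loaded cpu in the domain span, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical size for this model */

/* Hypothetical user-space stand-ins for kernel state (not kernel APIs). */
static bool cpu_online[NR_CPUS];	/* models the online cpu map */
static bool in_sched_domain[NR_CPUS];	/* models a possibly stale domain span */
static unsigned int cpu_load[NR_CPUS];	/* models per-runqueue load */

/*
 * Simplified model of the selector: return the least-loaded cpu in the
 * domain span. Like the behaviour described above, it never consults the
 * online map. Returns -1 if the span is empty.
 */
static int model_select_task_rq(void)
{
	int best = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!in_sched_domain[cpu])
			continue;
		if (best < 0 || cpu_load[cpu] < cpu_load[best])
			best = cpu;
	}
	return best;
}

/* Model of the patched wake-up step: reject an offline selection. */
static int model_wake_cpu(int orig_cpu)
{
	int cpu = model_select_task_rq();

	assert(orig_cpu >= 0 && orig_cpu < NR_CPUS);
	if (cpu < 0 || !cpu_online[cpu])
		cpu = orig_cpu;	/* fall back, as the patch does */
	return cpu;
}

/*
 * Recreate the window described above: cpu 1 has gone offline and its
 * tasks were migrated away (load 0), but the domain span still lists it.
 */
static void model_setup_stale_domain(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		cpu_online[cpu] = true;
		in_sched_domain[cpu] = true;
		cpu_load[cpu] = 2;
	}
	cpu_online[1] = false;
	cpu_load[1] = 0;
}
```

After model_setup_stale_domain(), the unguarded model_select_task_rq() picks
the empty offline cpu precisely because it is the least loaded, while
model_wake_cpu(0) keeps the task on cpu 0.
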
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
kernel/sched.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 94ead43..15b5ddf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2103,6 +2103,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 		goto out_activate;
 
 	cpu = p->sched_class->select_task_rq(p, sync);
+	if (unlikely(cpu_is_offline(cpu)))
+		cpu = orig_cpu;
+
 	if (cpu != orig_cpu) {
 		set_task_cpu(p, cpu);
 		task_rq_unlock(rq, &flags);
--
1.5.4.rc3