[tip:sched/core] sched: Simplify cpu-hot-unplug task migration

Previous thread: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation by Raistlin on Thursday, October 28, 2010 - 11:30 pm. (17 messages)

Next thread: [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic by Raistlin on Thursday, October 28, 2010 - 11:32 pm. (5 messages)
From: Raistlin
Date: Thursday, October 28, 2010 - 11:31 pm

There sometimes is the need of executing a task as if it would
have the maximum possible priority in the entire system, i.e.,
whenever it gets ready it must run! This is for example the case
for some maintainance kernel thread like migration and (sometimes)
watchdog or ksoftirq.

Since SCHED_DEADLINE is now the highest priority scheduling class
these tasks have to be handled therein, but it is not obvious how
to choose a runtime and a deadline that guarantee what explained
above. Therefore, we need a mean of recognizing system tasks inside
the -deadline class and always run them as soon as possible, without
any kind of runtime and bandwidth limitation.

This patch:
 - adds the SF_HEAD flag, which identify a special task that need
   absolute prioritization among any other task;
 - ensures that special tasks always preempt everyone else (and,
   obviously, are not preempted by non special tasks);
 - disables runtime and bandwidth checking for such tasks, hoping
   that the interference they cause is small enough.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   13 ++++++++++
 kernel/sched.c        |   59 ++++++++++++++++++++++++++++++++++++++----------
 kernel/sched_dl.c     |   27 ++++++++++++++++++++--
 kernel/softirq.c      |    6 +----
 kernel/watchdog.c     |    3 +-
 5 files changed, 85 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f94da51..f25d3a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -154,6 +154,18 @@ struct sched_param_ex {
 	struct timespec curr_deadline;
 };
 
+/*
+ * Scheduler flags.
+ *
+ *  @SF_HEAD    tells us that the task has to be considered one of the
+ *              maximum priority tasks in the system. This means it
+ *              always enqueued with maximum priority in the runqueue
+ *              of the highest priority scheduling class. In case it
+ *              it sched_deadline, the task also ignore runtime and
+ *       ...
From: Peter Zijlstra
Date: Thursday, November 11, 2010 - 7:31 am

From: Dario Faggioli
Date: Thursday, November 11, 2010 - 7:50 am

Yep. And (as said on IRC) this needs serious cleanup, in favour of
stop_task!

I'll completely drop patch 06 for next releases.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org
From: Peter Zijlstra
Date: Thursday, November 11, 2010 - 7:34 am

Ingo, happen to know if this is really needed these days? hotplug should
have migrated all other tasks away, leaving only the idle task to run.
--

From: Oleg Nesterov
Date: Thursday, November 11, 2010 - 8:27 am

This is called before CPU_DEAD stage which migrates all tasks away.


Sorry, can't resist, off-topic quiestion. Do we really need
migration_call()->migrate_live_tasks() ?

With the recent changes, try_to_wake_up() can never choose
the dead (!cpu_online) cpu if the task was deactivated.

Looks like we should only worry about the running tasks, and
migrate_dead_tasks()->pick_next_task() loop seems to all work
we need.

(Of course, we can't just remove migrate_live_tasks(), at leat
 migrate_dead() needs simple changes).

What do you think?

Oleg.

--

From: Peter Zijlstra
Date: Thursday, November 11, 2010 - 8:43 am

Yes, I think we can make that work, we could even move that
migrate_live_tasks() into CPU_DYING, which is before this point.
--

From: Oleg Nesterov
Date: Thursday, November 11, 2010 - 9:32 am

Hmm, I think you are right. In this case we can also simplify
migrate-from-dead-cpu paths, we know that nobody can touch rq->lock,
every CPU runs stop_machine_cpu_stop() with irqs disabled. No need
to get/put task_struct, etc.

Oleg.

--

From: Peter Zijlstra
Date: Saturday, November 13, 2010 - 11:35 am

Something like so?.. hasn't even seen a compiler yet but one's got to do
something to keep the worst bore of saturday night telly in check ;-)


---
Subject: sched: Simplify cpu-hot-unplug task migration
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Nov 13 19:32:29 CET 2010


Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/sched.h |    3 
 kernel/cpu.c          |    7 -
 kernel/sched.c        |  181 ++++++++++++--------------------------------------
 3 files changed, 46 insertions(+), 145 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1872,14 +1872,11 @@ extern void sched_clock_idle_sleep_event
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
 #ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
 #else
 static inline void idle_task_exit(void) {}
 #endif
 
-extern void sched_idle_next(void);
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
 #else
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int c
 }
 
 struct take_cpu_down_param {
-	struct task_struct *caller;
 	unsigned long mod;
 	void *hcpu;
 };
@@ -208,11 +207,6 @@ static int __ref take_cpu_down(void *_pa
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
 
-	if (task_cpu(param->caller) == cpu)
-		move_task_off_dead_cpu(cpu, param->caller);
-	/* Force idle task to run as soon as we yield: it should
-	   immediately notice cpu is offline and die quickly. */
-	sched_idle_next();
 	return 0;
 }
 
@@ -223,7 +217,6 @@ static int __ref ...
From: Oleg Nesterov
Date: Saturday, November 13, 2010 - 12:58 pm

Yes, I _think_ this all can work (and imho makes a lot of sense
if it works).



Hmm. I was really puzzled until I realized this is just cleanup,

Probably we do not need any checks. This task was picked by
->pick_next_task(), it should have task_cpu(p) == dead_cpu ?

But. I think there is a problem. We should not migrate current task,
stop thread, which does the migrating. At least, sched_stoptask.c
doesn't implement ->enqueue_task() and we can never wake it up later
for kthread_stop().

Perhaps migrate_tasks() should do for_each_class() by hand to
ignore stop_sched_class. But then _cpu_down() should somewhow
ensure the stop thread on the dead CPU is already parked in

Probably we don't really need rq->lock. All cpus run stop threads.

I am not sure about rq->idle, perhaps it should be deactivated.
I don't think we should migrate it.


What I never understood is the meaning of play_dead/etc. If we
remove sched_idle_next(), who will do that logic? And how the
idle thread can call idle_task_exit() ?

Oleg.

--

From: Peter Zijlstra
Date: Saturday, November 13, 2010 - 1:31 pm

Right, I think we should replace that with something like BUG_ON(!
idle_cpu(cpu)); Since we migrated everything away during the stop

Right.. Noticed that when I read that code, though I might as well fix


Hrm, right, so while the migration thread isn't actually on any rq
structure as such, pick_next_task() will return it.. need to come up
with a way to skip it.

As to current, take_cpu_down() is actually migrating current away before
this patch, so I simply included current in the
CPU_DYING->migrate_tasks() loop and removed the special case from

Well, since we're in stop_machine all cpus but the cpu that is executing
is stuck in the stop_machine_cpu_stop() loop, in both cases we could
simply fudge the pick_next_task_stop() condition (eg. set rq->stop =
NULL) while doing that loop, and restore it afterwards, nothing will hit

Right, but I was worried about stuff that relied on lockdep state like

Ah, I think the !nr_running check will bail before we end up selecting

Well, since we'll have migrated every task on that runqueue (except the
migration thread), the only runnable task left (once the migration
thread stops running) is the idle thread, so it should be implicit.

As to play_dead():

  cpu_idle()
    if (cpu_is_offline(smp_processor_id()))
      play_dead()
        native_play_dead() /* did I already say I detest paravirt? */
          play_dead_common()
            idle_task_exit();
            local_irq_disable();
          tboot/mwait/hlt

It basically puts the cpu to sleep with IRQs disabled, needs special
magic to wake it back up.
          

--

From: Peter Zijlstra
Date: Saturday, November 13, 2010 - 1:51 pm

Latest version, I actually compiled it too ! 

---
Subject: sched: Simplify cpu-hot-unplug task migration
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Nov 13 19:32:29 CET 2010

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    3 
 kernel/cpu.c          |   17 +---
 kernel/sched.c        |  201 ++++++++++++++------------------------------------
 3 files changed, 63 insertions(+), 158 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1872,14 +1872,11 @@ extern void sched_clock_idle_sleep_event
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
 #ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
 #else
 static inline void idle_task_exit(void) {}
 #endif
 
-extern void sched_idle_next(void);
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
 #else
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int c
 }
 
 struct take_cpu_down_param {
-	struct task_struct *caller;
 	unsigned long mod;
 	void *hcpu;
 };
@@ -198,7 +197,6 @@ struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	unsigned int cpu = (unsigned long)param->hcpu;
 	int err;
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
@@ -208,11 +206,6 @@ static int __ref take_cpu_down(void *_pa
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
 
-	if (task_cpu(param->caller) == cpu)
-		move_task_off_dead_cpu(cpu, param->caller);
-	/* Force idle task to run as soon as we yield: it ...
From: Peter Zijlstra
Date: Saturday, November 13, 2010 - 4:31 pm

Locks up solid though,.. even the NMI watchdog doesn't seem to trigger
anymore (or its broken too)...

/me goes prod at it a bit more
--

From: Peter Zijlstra
Date: Monday, November 15, 2010 - 1:06 pm

Subject: sched: Simplify cpu-hot-unplug task migration
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Nov 13 19:32:29 CET 2010

While discussing the need for sched_idle_next(), Oleg remarked that
since try_to_wake_up() ensures sleeping tasks will end up running on a
sane cpu, we can do away with migrate_live_tasks().

If we then extend the existing hack of migrating current from
CPU_DYING to migrating the full rq worth of tasks from CPU_DYING, the
need for the sched_idle_next() abomination disappears as well, since
idle will be the only possible thread left after the migration thread
stops.

This greatly simplifies the hot-unplug task migration path, as can be
seen from the resulting code reduction (and about half the new lines
are comments).

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    3 
 kernel/cpu.c          |   16 +--
 kernel/sched.c        |  210 +++++++++++++++-----------------------------------
 3 files changed, 69 insertions(+), 160 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1872,14 +1872,11 @@ extern void sched_clock_idle_sleep_event
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
 #ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
 #else
 static inline void idle_task_exit(void) {}
 #endif
 
-extern void sched_idle_next(void);
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
 #else
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int c
 }
 
 struct take_cpu_down_param {
-	struct task_struct ...
From: Oleg Nesterov
Date: Wednesday, November 17, 2010 - 12:27 pm

Peter, sorry for delay.

I was going to read this patch carefully today, but due to the holiday
in the Czech Republic I have to drink (too much) beer instead ;)



I was very much confused, and I was going to say this is wrong.
However, now I think this is correct, just the comment is not
right.

There is another running thread we should not migrate, rq->idle.
If nothing else, dequeue_task_idle() should be never called.

But, if I understand correctly, ->nr_running does not account
the idle thread, and this is what makes this correct.

Correct?

Oleg.

--

From: Peter Zijlstra
Date: Wednesday, November 17, 2010 - 12:42 pm

Ah, you missed a patch that made pick_next_task_stop() look like:

static struct task_struct *pick_next_task_stop(struct rq *rq)
{
        struct task_struct *stop = rq->stop;

        if (stop && stop->se.on_rq)
                return stop;

        return NULL;


Right, I can add: (the idle thread is not counted in nr_running), if
that makes things clearer for you; however its a quite fundamental
property, we don't consider the idle task a proper runnable entity, its
simply the thing we do when there's nothing else to do.
--

From: Oleg Nesterov
Date: Thursday, November 18, 2010 - 7:05 am

Yes, I see now.

OK, this also explains my previous questions. I greatly misunderstood
this "small detail", starting from your initial patch. Every time I
thought you are trying to migrate rq->idle as well.


I am not sure.

Yes, we know for sure rhat the only runnable task is rq->idle.
But only after migration thread calls schedule() and switches to the
idle thread.

However, I see nothing which can guarantee this. Migration thread
running on the dead cpu wakes up the caller of stop_cpus() before
it calls schedule(), _cpu_down() can check rq->curr before it was
changed.

No?



Hmm. In fact, I think it is possible that cpu_stopper_thread() can
have more cpu_stop_work's queued when __stop_machine() returns.
This has nothing to do with this patch, but I think it makes sense
to clear stopper->enabled at CPU_DYING stage as well (of course,
this needs a separate patch).

Oleg.

--

From: Peter Zijlstra
Date: Thursday, November 18, 2010 - 7:24 am

Hmm, I think you're right, although I haven't hit that case during
testing.

There is no firm guarantee the dying cpu actually got to running the
idle thread (there's a guarantee it will at some point), so we ought to
maintain that wait-loop, possibly using cpu_relax(), I don't see the
point in calling yield() here.
--

From: Oleg Nesterov
Date: Thursday, November 18, 2010 - 8:32 am

Agreed. But do we need to wait at all?

With or without this change, even if we know that rq->idle is running
we can't know if it (say) already started play_dead_common() or not.

We are going to call __cpu_die(), afaics it should do the necessary
synchronization in any case.

For example, native_cpu_die() waits for cpu_state == CPU_DEAD in a
loop. Of course it should work in practice (it also does msleep),
but in theory there is no guarantee.

So. Can't we just remove this wait-loop? We know that rq->idle
will be scheduled "soon", I don't understand why it is necessary
to ensure that context_switch() has already happened.

Oleg.

--

From: tip-bot for Peter Zijlstra
Date: Thursday, November 18, 2010 - 7:09 am

Commit-ID:  48c5ccae88dcd989d9de507e8510313c6cbd352b
Gitweb:     http://git.kernel.org/tip/48c5ccae88dcd989d9de507e8510313c6cbd352b
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Sat, 13 Nov 2010 19:32:29 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 18 Nov 2010 13:27:46 +0100

sched: Simplify cpu-hot-unplug task migration

While discussing the need for sched_idle_next(), Oleg remarked that
since try_to_wake_up() ensures sleeping tasks will end up running on a
sane cpu, we can do away with migrate_live_tasks().

If we then extend the existing hack of migrating current from
CPU_DYING to migrating the full rq worth of tasks from CPU_DYING, the
need for the sched_idle_next() abomination disappears as well, since
idle will be the only possible thread left after the migration thread
stops.

This greatly simplifies the hot-unplug task migration path, as can be
seen from the resulting code reduction (and about half the new lines
are comments).

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1289851597.2109.547.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    3 -
 kernel/cpu.c          |   16 ++---
 kernel/sched.c        |  206 +++++++++++++++----------------------------------
 3 files changed, 67 insertions(+), 158 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3cd70cf..29d953a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1871,14 +1871,11 @@ extern void sched_clock_idle_sleep_event(void);
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
 #ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
 #else
 static inline void idle_task_exit(void) {}
 #endif
 
-extern void sched_idle_next(void);
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
 ...
From: Peter Zijlstra
Date: Thursday, November 11, 2010 - 7:46 am

So this comes from 1c6b4aa94576, which is something I wouldn't have
bothered merging in the first place, if you pin a cpu with RT tasks like
that you get to keep the pieces, hotplug isn't the only thing that will
go wonky.

Anyway, if you leave the code as is you'll be fine, it'll me above any
FIFO task, but still below deadline tasks and the stop task, neither of

I'd be inclined to drop this too, if people get watchdog timeouts it
means the system is really over-commited on deadline tasks and the
watchdog FIFO thread didn't get around to running, something which I
think we both agree shouldn't be happening.
--

Previous thread: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation by Raistlin on Thursday, October 28, 2010 - 11:30 pm. (17 messages)

Next thread: [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic by Raistlin on Thursday, October 28, 2010 - 11:32 pm. (5 messages)