There sometimes is the need of executing a task as if it would
have the maximum possible priority in the entire system, i.e.,
whenever it gets ready it must run! This is for example the case
for some maintainance kernel thread like migration and (sometimes)
watchdog or ksoftirq.
Since SCHED_DEADLINE is now the highest priority scheduling class
these tasks have to be handled therein, but it is not obvious how
to choose a runtime and a deadline that guarantee what explained
above. Therefore, we need a mean of recognizing system tasks inside
the -deadline class and always run them as soon as possible, without
any kind of runtime and bandwidth limitation.
This patch:
- adds the SF_HEAD flag, which identify a special task that need
absolute prioritization among any other task;
- ensures that special tasks always preempt everyone else (and,
obviously, are not preempted by non special tasks);
- disables runtime and bandwidth checking for such tasks, hoping
that the interference they cause is small enough.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
include/linux/sched.h | 13 ++++++++++
kernel/sched.c | 59 ++++++++++++++++++++++++++++++++++++++----------
kernel/sched_dl.c | 27 ++++++++++++++++++++--
kernel/softirq.c | 6 +----
kernel/watchdog.c | 3 +-
5 files changed, 85 insertions(+), 23 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f94da51..f25d3a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -154,6 +154,18 @@ struct sched_param_ex {
struct timespec curr_deadline;
};
+/*
+ * Scheduler flags.
+ *
+ * @SF_HEAD tells us that the task has to be considered one of the
+ * maximum priority tasks in the system. This means it
+ * always enqueued with maximum priority in the runqueue
+ * of the highest priority scheduling class. In case it
+ * it sched_deadline, the task also ignore runtime and
+ * ...Yep. And (as said on IRC) this needs serious cleanup, in favour of stop_task! I'll completely drop patch 06 for next releases. Thanks, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org
Ingo, happen to know if this is really needed these days? hotplug should have migrated all other tasks away, leaving only the idle task to run. --
This is called before CPU_DEAD stage which migrates all tasks away. Sorry, can't resist, off-topic quiestion. Do we really need migration_call()->migrate_live_tasks() ? With the recent changes, try_to_wake_up() can never choose the dead (!cpu_online) cpu if the task was deactivated. Looks like we should only worry about the running tasks, and migrate_dead_tasks()->pick_next_task() loop seems to all work we need. (Of course, we can't just remove migrate_live_tasks(), at leat migrate_dead() needs simple changes). What do you think? Oleg. --
Yes, I think we can make that work, we could even move that migrate_live_tasks() into CPU_DYING, which is before this point. --
Hmm, I think you are right. In this case we can also simplify migrate-from-dead-cpu paths, we know that nobody can touch rq->lock, every CPU runs stop_machine_cpu_stop() with irqs disabled. No need to get/put task_struct, etc. Oleg. --
Something like so?.. hasn't even seen a compiler yet but one's got to do
something to keep the worst bore of saturday night telly in check ;-)
---
Subject: sched: Simplify cpu-hot-unplug task migration
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Nov 13 19:32:29 CET 2010
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
include/linux/sched.h | 3
kernel/cpu.c | 7 -
kernel/sched.c | 181 ++++++++++++--------------------------------------
3 files changed, 46 insertions(+), 145 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1872,14 +1872,11 @@ extern void sched_clock_idle_sleep_event
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
extern void idle_task_exit(void);
#else
static inline void idle_task_exit(void) {}
#endif
-extern void sched_idle_next(void);
-
#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
extern void wake_up_idle_cpu(int cpu);
#else
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int c
}
struct take_cpu_down_param {
- struct task_struct *caller;
unsigned long mod;
void *hcpu;
};
@@ -208,11 +207,6 @@ static int __ref take_cpu_down(void *_pa
cpu_notify(CPU_DYING | param->mod, param->hcpu);
- if (task_cpu(param->caller) == cpu)
- move_task_off_dead_cpu(cpu, param->caller);
- /* Force idle task to run as soon as we yield: it should
- immediately notice cpu is offline and die quickly. */
- sched_idle_next();
return 0;
}
@@ -223,7 +217,6 @@ static int __ref ...Yes, I _think_ this all can work (and imho makes a lot of sense if it works). Hmm. I was really puzzled until I realized this is just cleanup, Probably we do not need any checks. This task was picked by ->pick_next_task(), it should have task_cpu(p) == dead_cpu ? But. I think there is a problem. We should not migrate current task, stop thread, which does the migrating. At least, sched_stoptask.c doesn't implement ->enqueue_task() and we can never wake it up later for kthread_stop(). Perhaps migrate_tasks() should do for_each_class() by hand to ignore stop_sched_class. But then _cpu_down() should somewhow ensure the stop thread on the dead CPU is already parked in Probably we don't really need rq->lock. All cpus run stop threads. I am not sure about rq->idle, perhaps it should be deactivated. I don't think we should migrate it. What I never understood is the meaning of play_dead/etc. If we remove sched_idle_next(), who will do that logic? And how the idle thread can call idle_task_exit() ? Oleg. --
Right, I think we should replace that with something like BUG_ON(!
idle_cpu(cpu)); Since we migrated everything away during the stop
Right.. Noticed that when I read that code, though I might as well fix
Hrm, right, so while the migration thread isn't actually on any rq
structure as such, pick_next_task() will return it.. need to come up
with a way to skip it.
As to current, take_cpu_down() is actually migrating current away before
this patch, so I simply included current in the
CPU_DYING->migrate_tasks() loop and removed the special case from
Well, since we're in stop_machine all cpus but the cpu that is executing
is stuck in the stop_machine_cpu_stop() loop, in both cases we could
simply fudge the pick_next_task_stop() condition (eg. set rq->stop =
NULL) while doing that loop, and restore it afterwards, nothing will hit
Right, but I was worried about stuff that relied on lockdep state like
Ah, I think the !nr_running check will bail before we end up selecting
Well, since we'll have migrated every task on that runqueue (except the
migration thread), the only runnable task left (once the migration
thread stops running) is the idle thread, so it should be implicit.
As to play_dead():
cpu_idle()
if (cpu_is_offline(smp_processor_id()))
play_dead()
native_play_dead() /* did I already say I detest paravirt? */
play_dead_common()
idle_task_exit();
local_irq_disable();
tboot/mwait/hlt
It basically puts the cpu to sleep with IRQs disabled, needs special
magic to wake it back up.
--
Latest version, I actually compiled it too !
---
Subject: sched: Simplify cpu-hot-unplug task migration
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Nov 13 19:32:29 CET 2010
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 3
kernel/cpu.c | 17 +---
kernel/sched.c | 201 ++++++++++++++------------------------------------
3 files changed, 63 insertions(+), 158 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1872,14 +1872,11 @@ extern void sched_clock_idle_sleep_event
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
extern void idle_task_exit(void);
#else
static inline void idle_task_exit(void) {}
#endif
-extern void sched_idle_next(void);
-
#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
extern void wake_up_idle_cpu(int cpu);
#else
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int c
}
struct take_cpu_down_param {
- struct task_struct *caller;
unsigned long mod;
void *hcpu;
};
@@ -198,7 +197,6 @@ struct take_cpu_down_param {
static int __ref take_cpu_down(void *_param)
{
struct take_cpu_down_param *param = _param;
- unsigned int cpu = (unsigned long)param->hcpu;
int err;
/* Ensure this CPU doesn't handle any more interrupts. */
@@ -208,11 +206,6 @@ static int __ref take_cpu_down(void *_pa
cpu_notify(CPU_DYING | param->mod, param->hcpu);
- if (task_cpu(param->caller) == cpu)
- move_task_off_dead_cpu(cpu, param->caller);
- /* Force idle task to run as soon as we yield: it ...Locks up solid though,.. even the NMI watchdog doesn't seem to trigger anymore (or its broken too)... /me goes prod at it a bit more --
Subject: sched: Simplify cpu-hot-unplug task migration
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Nov 13 19:32:29 CET 2010
While discussing the need for sched_idle_next(), Oleg remarked that
since try_to_wake_up() ensures sleeping tasks will end up running on a
sane cpu, we can do away with migrate_live_tasks().
If we then extend the existing hack of migrating current from
CPU_DYING to migrating the full rq worth of tasks from CPU_DYING, the
need for the sched_idle_next() abomination disappears as well, since
idle will be the only possible thread left after the migration thread
stops.
This greatly simplifies the hot-unplug task migration path, as can be
seen from the resulting code reduction (and about half the new lines
are comments).
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 3
kernel/cpu.c | 16 +--
kernel/sched.c | 210 +++++++++++++++-----------------------------------
3 files changed, 69 insertions(+), 160 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1872,14 +1872,11 @@ extern void sched_clock_idle_sleep_event
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
extern void idle_task_exit(void);
#else
static inline void idle_task_exit(void) {}
#endif
-extern void sched_idle_next(void);
-
#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
extern void wake_up_idle_cpu(int cpu);
#else
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int c
}
struct take_cpu_down_param {
- struct task_struct ...Peter, sorry for delay. I was going to read this patch carefully today, but due to the holiday in the Czech Republic I have to drink (too much) beer instead ;) I was very much confused, and I was going to say this is wrong. However, now I think this is correct, just the comment is not right. There is another running thread we should not migrate, rq->idle. If nothing else, dequeue_task_idle() should be never called. But, if I understand correctly, ->nr_running does not account the idle thread, and this is what makes this correct. Correct? Oleg. --
Ah, you missed a patch that made pick_next_task_stop() look like:
static struct task_struct *pick_next_task_stop(struct rq *rq)
{
struct task_struct *stop = rq->stop;
if (stop && stop->se.on_rq)
return stop;
return NULL;
Right, I can add: (the idle thread is not counted in nr_running), if
that makes things clearer for you; however its a quite fundamental
property, we don't consider the idle task a proper runnable entity, its
simply the thing we do when there's nothing else to do.
--
Yes, I see now. OK, this also explains my previous questions. I greatly misunderstood this "small detail", starting from your initial patch. Every time I thought you are trying to migrate rq->idle as well. I am not sure. Yes, we know for sure rhat the only runnable task is rq->idle. But only after migration thread calls schedule() and switches to the idle thread. However, I see nothing which can guarantee this. Migration thread running on the dead cpu wakes up the caller of stop_cpus() before it calls schedule(), _cpu_down() can check rq->curr before it was changed. No? Hmm. In fact, I think it is possible that cpu_stopper_thread() can have more cpu_stop_work's queued when __stop_machine() returns. This has nothing to do with this patch, but I think it makes sense to clear stopper->enabled at CPU_DYING stage as well (of course, this needs a separate patch). Oleg. --
Hmm, I think you're right, although I haven't hit that case during testing. There is no firm guarantee the dying cpu actually got to running the idle thread (there's a guarantee it will at some point), so we ought to maintain that wait-loop, possibly using cpu_relax(), I don't see the point in calling yield() here. --
Agreed. But do we need to wait at all? With or without this change, even if we know that rq->idle is running we can't know if it (say) already started play_dead_common() or not. We are going to call __cpu_die(), afaics it should do the necessary synchronization in any case. For example, native_cpu_die() waits for cpu_state == CPU_DEAD in a loop. Of course it should work in practice (it also does msleep), but in theory there is no guarantee. So. Can't we just remove this wait-loop? We know that rq->idle will be scheduled "soon", I don't understand why it is necessary to ensure that context_switch() has already happened. Oleg. --
Commit-ID: 48c5ccae88dcd989d9de507e8510313c6cbd352b Gitweb: http://git.kernel.org/tip/48c5ccae88dcd989d9de507e8510313c6cbd352b Author: Peter Zijlstra <a.p.zijlstra@chello.nl> AuthorDate: Sat, 13 Nov 2010 19:32:29 +0100 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Thu, 18 Nov 2010 13:27:46 +0100 sched: Simplify cpu-hot-unplug task migration While discussing the need for sched_idle_next(), Oleg remarked that since try_to_wake_up() ensures sleeping tasks will end up running on a sane cpu, we can do away with migrate_live_tasks(). If we then extend the existing hack of migrating current from CPU_DYING to migrating the full rq worth of tasks from CPU_DYING, the need for the sched_idle_next() abomination disappears as well, since idle will be the only possible thread left after the migration thread stops. This greatly simplifies the hot-unplug task migration path, as can be seen from the resulting code reduction (and about half the new lines are comments). Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1289851597.2109.547.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 3 - kernel/cpu.c | 16 ++--- kernel/sched.c | 206 +++++++++++++++---------------------------------- 3 files changed, 67 insertions(+), 158 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 3cd70cf..29d953a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1871,14 +1871,11 @@ extern void sched_clock_idle_sleep_event(void); extern void sched_clock_idle_wakeup_event(u64 delta_ns); #ifdef CONFIG_HOTPLUG_CPU -extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p); extern void idle_task_exit(void); #else static inline void idle_task_exit(void) {} #endif -extern void sched_idle_next(void); - #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP) extern void wake_up_idle_cpu(int cpu); ...
So this comes from 1c6b4aa94576, which is something I wouldn't have bothered merging in the first place, if you pin a cpu with RT tasks like that you get to keep the pieces, hotplug isn't the only thing that will go wonky. Anyway, if you leave the code as is you'll be fine, it'll me above any FIFO task, but still below deadline tasks and the stop task, neither of I'd be inclined to drop this too, if people get watchdog timeouts it means the system is really over-commited on deadline tasks and the watchdog FIFO thread didn't get around to running, something which I think we both agree shouldn't be happening. --
