This is more of a starting point than a patch, but it is something I've
been meaning to look at for a long time. Many different workloads end
up hammering very hard on try_to_wake_up, to the point where the
runqueue locks dominate CPU profiles.
This includes benchmarking really fast IO subsystems, fast networking,
fast pipes...well anywhere that we lots of procs on lots of cpus waiting
for short periods on lots of different things.
I think we probably want to add a way to wait just for a little while as
a more lightweight operation (less balancing etc) but this patch doesn't
do that. It only tries to make the act of waking someone up less
expensive by avoiding the runqueue lock when we're on a different CPU
than the process we want to wake.
I do this with a per-runqueue list and some cmpxchg tricks. Basically
all the other CPUs will toss a given process they want to wakeup onto
the destination per-runqueue list. Later on, when that runqueue is
finding a new task to run, it processes the list in bulk.
From a performance point of view, this adds about 10% to a TPC
simulation on a large numa machine. It doubles the speed of my IPC semaphore
benchmark (which really just hammers on wakeup in a loop).
This is probably broken in less than subtle ways, so it includes a new
schedule feature to (FAST_WAKEUP) to control the new mode.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..9c2f188 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1187,6 +1187,8 @@ struct task_struct {
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
+ struct task_struct *fast_wakeup_next;
+ unsigned long fast_wakeup_mode;
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* list of struct preempt_notifier: */
diff --git a/kernel/sched.c b/kernel/sched.c
index 3c2a54f..3667e89 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -530,6 +530,7 @@ struct rq {
unsigned long nr_uninterruptible;
struct task_struct ...Right, so one of the things that I considered was to make p->state an
atomic_t and replace the initial stage of try_to_wake_up() with
something like:
int try_to_wake_up(struct task *p, unsigned int mask, wake_flags)
{
int state = atomic_read(&p->state);
do {
if (!(state & mask))
return 0;
state = atomic_cmpxchg(&p->state, state, TASK_WAKING);
} while (state != TASK_WAKING);
/* do this pending queue + ipi thing */
return 1;
}
Also, I think we might want to put that atomic single linked list thing
into some header (using atomic_long_t or so), because I have a similar
thing living in kernel/perf_event.c, that needs to queue things from NMI
context.
The advantage of doing basically the whole enqueue on the remote cpu is
less cacheline bouncing of the runqueue structures.
--
I tried not to set the task waking, since I was worried about races with us getting queued somewhere else. But, I don't have a good handle on all of that so I kind of chickened out. That's why my code falls back to the full ttwu in a few cases. Do you think the above could be an addition to my patch or that it's required for my patch to work well? -chris --
So I've done three of these cmpxchg lists recently...but they have all been a little different. I went back and forth a bunch of times about using a list_head based thing instead to avoid the walk for list append. I really don't like the walk. But, what makes this one unique is that I'm using a cmpxchg on the list pointer in the in task struct to take ownership of this task struct. It is how I avoid concurrent lockless enqueues. Your fiddling with the p->state above would let me avoid that. -chris --
I actually have the reverse lying around somewhere (even more broken, probably) to allow nested wakeups from the scheduler. The issue I want to tackle is waking up processes when others go to sleep. This means try_to_wake_up() from inside the runqueue lock. I used a simple per_cpu taskqueue where tasks are put on when waking during schedule(). At the end of schedule() I empty the list and reschedule as one of the newly woken tasks may be higher prio. I'm wondering if both approaches can be merged by checking this list before and after every schedule(). Stijn --
Tejun did something for that: http://lkml.org/lkml/2010/5/13/136 --
The difference with what I have is that Tejun's threads are guaranteed to be local and on the same CPU/runqueue avoiding runqueue deadlocks. My code can be used to wake up any thread as I basically defer waking up untill after the runqueue lock is released. --
To return the favour, here's a scary patch that renders your machine a
doorstop :-) Sadly it just hangs, no splat, no sysrq, no nmi, no tripple
fault.
It looses the ttwu task_running() check, as I must admit I'm not quite
sure what it does.. Ingo?
It makes the whole TASK_WAKING thing very interesting again :-)
It also re-introduces a bunch of races because we now need to run
->select_task_rq() without holding the rq->lock. We cannot defer running
that because it really wants to know the cpu the wakeup originated from
(affine wakeups and the like).
---
arch/x86/kernel/smp.c | 1 +
include/linux/sched.h | 7 +-
kernel/sched.c | 225 ++++++++++++++++++++++++-----------------------
kernel/sched_fair.c | 10 +--
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 4 +-
6 files changed, 127 insertions(+), 122 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}
void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
}
#endif /* !CONFIG_SMP */
+void sched_ttwu_pending(void);
struct io_context; /* See blkdev.h */
@@ -1046,12 +1047,11 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) ...Progress, this one gave spectacular fireworks ;-)
---
arch/x86/kernel/smp.c | 1 +
include/linux/sched.h | 7 +-
kernel/sched.c | 247 +++++++++++++++++++++++++----------------------
kernel/sched_fair.c | 10 +-
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 4 +-
6 files changed, 143 insertions(+), 128 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}
void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
}
#endif /* !CONFIG_SMP */
+void sched_ttwu_pending(void);
struct io_context; /* See blkdev.h */
@@ -1046,12 +1047,11 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
- void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+ void (*task_waking) (struct task_struct *task);
void (*task_woken) (struct rq *this_rq, struct task_struct *task);
void (*set_cpus_allowed)(struct task_struct *p,
@@ -1174,6 +1174,7 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
#ifdef CONFIG_SMP
+ struct task_struct *wake_entry;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
diff --git a/kernel/sched.c ...I think I figured out what its for, its for when p is prev in schedule() after deactivate_task(), we have to call activate_task() it again, but we cannot migrate the task because the CPU its on is still referencing it. I added it back, but still fireworks :-) --
So this one boots and builds a kernel on a dual-socket nehalem.
there's still quite a number of XXXs to fix, but I don't think any of
the races are crashing potential, mostly wrong accounting and scheduling
iffies like.
But give it a go.. see what it does for you (x86 only for now).
Ingo, any comments other than, eew, scary? :-)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/x86/kernel/smp.c | 1 +
include/linux/sched.h | 7 +-
kernel/sched.c | 271 ++++++++++++++++++++++++++++++++---------------
kernel/sched_fair.c | 10 +-
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 4 +-
6 files changed, 198 insertions(+), 97 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}
void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
}
#endif /* !CONFIG_SMP */
+void sched_ttwu_pending(void);
struct io_context; /* See blkdev.h */
@@ -1046,12 +1047,11 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
- void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+ void (*task_waking) (struct task_struct *task);
void (*task_woken) (struct ...None, other than a question: which future kernel do you aim it for? I'd prefer v2.6.50 or later ;-) This is a truly scary patch. Ingo --
Well, assuming it all works out and actually reduces runqueue lock contention we still need to sort out all those XXXs in there, I'd say at the soonest somewhere near .38/.39 or so. Its definitely not something we should rush in. --
This thread was started by Chris Mason back in May: http://lkml.indiana.edu/hypermail/linux/kernel/1005.2/02329.html Chris provided some code as a starting point for a solution. Peter Zijlstra had some good ideas, and came up with some alternate code, culminating with: http://lkml.indiana.edu/hypermail/linux/kernel/1006.2/02381.html Building on this previous work, I have another patch to try to address the problem. I have taken some of Peter's code (the cmpxchg() based queueing and unqueueing, plus the cross cpu interrupt), but taken a simpler (and hopefully less scary) approach otherwise: If the task to be woken is on a run queue on a different cpu then use cmpxchg() to put it onto a pending try_to_wake_up list on the different cpu. Then send an interrupt to the different cpu to cause that cpu to call try_to_wake_up() for each process on the try_to_wake_up list. The result is that the initial run queue lock acquired by try_to_wake_up() will be on the cpu we are currently executing on, not a different cpu. This patch does not address the run queue lock contention that may occur if try_to_wake_up() chooses to move the waking process to another cpu, based on the result returned by select_task_rq(). The patch was created on the -top tree. Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> Chris, can you check the performance of this patch on your large system? Limitations x86 only Tests - tested on 2 cpu x86_64 - very simplistic workload - results: rq->lock contention count reduced by ~ 95% rq->lock contention wait time reduced by ~ 90% test duration reduced by ~ 0.5% - 4% (in the noise) Review goals: (1) performance results (2) architectural comments Review non-goal: code style, etc (but will be a goal in a future review round) Todo: - add support for additional architectures - polish code style - add a schedule feature to control whether to use the new algorithm - ...
On 12/01/10 15:13, Frank Rowand wrote: Fat-fingered typing. That should be the -tip tree.... -Frank --
Without having looked at the actual code, the described thing cannot work, try_to_wake_up() has a return value that needs to be passed back. Also, try_to_wake_up() does load-balancing, you really want to do that before queueing it on a remote cpu. --
I have not been able to make sense of the task_running() check in
try_to_wake_up(), even with that clue. The try_to_wake_up() code in
question is:
rq = task_rq_lock(p, &flags);
if (!(p->state & state))
goto out;
if (p->se.on_rq)
goto out_running;
cpu = task_cpu(p);
orig_cpu = cpu;
#ifdef CONFIG_SMP
if (unlikely(task_running(rq, p)))
goto out_activate;
The relevent code in schedule() executes with the rq lock held (many
lines left out to emphasize the key lines):
raw_spin_lock_irq(&rq->lock);
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
deactivate_task(rq, prev, DEQUEUE_SLEEP);
if (likely(prev != next)) {
rq->curr = next;
context_switch(rq, prev, next); /* unlocks the rq */
} else
raw_spin_unlock_irq(&rq->lock);
If (p->se.on_rq) can becomes false due to deactivate_task()
then task_running() will also become false while the rq lock is still
held (either via "rq->curr = next" or via context_switch() updating
p->oncpu -- which one matters depends on #ifdef __ARCH_WANT_UNLOCKED_CTXSW).
I haven't been able to find any case where task_running() can be true
when (p->se.on_rq) is false, while the rq lock is not being held. Thus
I don't think the branch to out_activate will ever be taken.
What am I missing, or is the task_running() test not needed?
Thanks!
Frank
--
Say the last runnable task passes through schedule(), is deactivated. We hit idle_balance(), which drops/retakes rq->lock _before_ the task schedules off. ttwu() can acquire rq->lock in that window, p->se.on_rq is false, p->state is true, as is task_current(rq, p). We have to check whether the task is still current, but not enqueued, lest the wakeup be a noop, and the wakee possibly then sleep forever. -Mike --
}
pre_schedule(rq, prev);
if (unlikely(!rq->nr_running))
Thanks Mike, that's just the cluebat I needed!
And for the lkml archives, in case anyone has this question again in the
future, with Mike's clue in hand I found another case in this window where
the rq->lock can be dropped then reacquired. Just before idle_balance()
is called, pre_schedule() is called:
pre_schedule()
prev->sched_class->pre_schedule(rq, prev)
[pre_schedule_rt()]
pull_rt_task(rq)
pull_rt_task[this_rq]
for_each_cpu(cpu, this_rq->rd->rto_mask)
double_lock_balance(this_rq, src_rq)
raw_spin_unlock(&this_rq->lock) <-----
double_rq_lock(this_rq, busiest)
-Frank
--
I am afraid I can misuderstood this all, including the question ;) But, with __ARCH_WANT_UNLOCKED_CTXSW task_running() checks ->oncpu. However, schedule() drops rq->lock after prev was deactivated but before it clears prev->oncpu. Oleg. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
