Hi, here a new posting of my scary patch(es) ;-) These actually survive a sembench run (and everything else I threw at it). The discussion between Mike and Frank over the task_running() check made me realize what was wrong with the previous one. As it turns out, what was needed (p->oncpu) was something Thomas wanted me to do for an entirely different reason (see patch #2). Frank's patch, while encouraging me to poke at it again, has a number of very fundamental problems with it, the most serious one being that it completely wrecks the wake-up load-balancing. I'll try and come up with a way to unwreck the 32bit select_task_rq_fair() problem, but for now inspiration in that departments seems lacking, yet I still wanted to share these patches so that others can have a go at them. If all you care about is the numbers, skip to patch #5. --
Reduce rq->lock contention on try_to_wake_up() by changing the task state using a cmpxchg loop. Once the task is set to TASK_WAKING we're guaranteed the only one poking at it, then proceed to pick a new cpu without holding the rq->lock (XXX this opens some races). Then instead of locking the remote rq and activating the task, place the task on a remote queue, again using cmpxchg, and notify the remote cpu per IPI if this queue was empty to start processing its wakeups. This avoids (in most cases) having to lock the remote runqueue (and therefore the exclusive cacheline transfer thereof) but also touching all the remote runqueue data structures needed for the actual activation. As measured using: http://oss.oracle.com/~mason/sembench.c $ echo 4096 32000 64 128 > /proc/sys/kernel/sem $ ./sembench -t 2048 -w 1900 -o 0 unpatched: run time 30 seconds 537953 worker burns per second patched: run time 30 seconds 657336 worker burns per second Still need to sort out all the races marked XXX (non-trivial), and its x86 only for the moment. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- arch/x86/kernel/smp.c | 1 include/linux/sched.h | 7 - kernel/sched.c | 241 ++++++++++++++++++++++++++++++++++-------------- kernel/sched_fair.c | 5 kernel/sched_features.h | 3 kernel/sched_idletask.c | 2 kernel/sched_rt.c | 4 kernel/sched_stoptask.c | 3 8 files changed, 190 insertions(+), 76 deletions(-) Index: linux-2.6/arch/x86/kernel/smp.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/smp.c +++ linux-2.6/arch/x86/kernel/smp.c @@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_ /* * KVM uses this interrupt to force a cpu out of guest mode */ + sched_ttwu_pending(); } void smp_call_function_interrupt(struct pt_regs *regs) Index: ...
Great, that's going to greatly simplify and lower the overhead of the remote tick restart I'm doing on wake up for the nohz task thing. --
Interesting... I didn't actually read this patch yet, just a very Somehow I was confused by initial "next = NULL", perhaps struct rq *rq = cpu_rq(cpu); struct task_struct *next = rq->wake_list; makes a bit more sense. Oleg. --
I think it can. Yes, we've set TASK_WAKING. But, at least the task itself can change its state back to TASK_RUNNING without calling Well, this surely breaks the code like CONDITION = true; wake_up_process(p); At least we need mb() before we check task_state the first time. Oleg. --
Oh crud, you're right, that's going to make all this cmpxchg stuff lots You're right (wmb, at least), I left that out because I had the cmpxchg in there that provides a mb, but didn't notice I read the state before that.. /me goes put the smp_wmb() back. --
Hrmph, we can add a task_is_waking() test to the rq->lock in schedule(),
like we have for __task_rq_lock():
local_irq_save(flags);
again:
while (task_is_waking(current))
cpu_relax();
raw_spin_lock(rq->lock);
if (task_is_waking(current)) {
raw_spin_unlock(rq->lock);
goto again;
}
But that's not particularly pretty...
--
OK, I'm just not thinking straight here, if we're not passing through
schedule() the above clearly won't help any...
Damn, that's a nasty case to solve...
current->state = TASK_UNINTERRUPTIBLE;
while (!true)
schedule();
current->state = TASK_RUNNING;
... /me goes make a strong pot of tea
--
Just when the tea was setting my brain sparked (could be insanity, dunno), anyway, how about the below? It does the state and on_rq checks first, if we find on_rq, we do a forced remote wakeup (rq->lock, recheck state, recheck on_rq, set p->state to TASK_RUNNING), otherwise, we do the cmpxchg thing to TASK_WAKING, (since !on_rq @p passed through schedule()) and proceed like before. --- Subject: sched: Reduce ttwu rq->lock contention From: Peter Zijlstra <peterz@infradead.org> Date: Tue, 22 Jun 2010 15:33:06 +0200 Reduce rq->lock contention on try_to_wake_up() by changing the task state using a cmpxchg loop. Once the task is set to TASK_WAKING we're guaranteed the only one poking at it, then proceed to pick a new cpu without holding the rq->lock (XXX this opens some races). Then instead of locking the remote rq and activating the task, place the task on a remote queue, again using cmpxchg, and notify the remote cpu per IPI if this queue was empty to start processing its wakeups. This avoids (in most cases) having to lock the remote runqueue (and therefore the exclusive cacheline transfer thereof) but also touching all the remote runqueue data structures needed for the actual activation. As measured using: http://oss.oracle.com/~mason/sembench.c $ echo 4096 32000 64 128 > /proc/sys/kernel/sem $ ./sembench -t 2048 -w 1900 -o 0 unpatched: run time 30 seconds 537953 worker burns per second patched: run time 30 seconds 657336 worker burns per second Still need to sort out all the races marked XXX (non-trivial), and its x86 only for the moment. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1277213586.1875.704.camel@laptop> --- arch/x86/kernel/smp.c | 1 include/linux/sched.h | 7 - kernel/sched.c | 226 +++++++++++++++++++++++++++++++++--------------- kernel/sched_fair.c | 5 - kernel/sched_features.h | 2 kernel/sched_idletask.c | 2 kernel/sched_rt.c | 4 ...
Looks like nothing prevents ttwu() from changing task's CPU while some one else is holding task_rq_lock(). Is this OK? Thanks Yan, Zheng --
Ah, crud, good catch. No that is not quite OK ;-)
I'm starting to think adding a per-task scheduler lock isn't such a bad
idea after all :-)
How does something like the below look, it waits for the current
task_rq(p)->lock owner to go away after we flip p->state to TASK_WAKING.
It also optimizes the x86 spinlock code a bit, no need to wait for all
pending owners to go away, just the current one.
This also solves the p->cpus_allowed race..
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2518,6 +2518,8 @@ try_to_wake_up(struct task_struct *p, un
break;
}
+ raw_spin_unlock_wait(&task_rq(p)->lock);
+
ret = 1; /* we qualify as a proper wakeup now */
if (load) // XXX racy
@@ -2536,10 +2538,7 @@ try_to_wake_up(struct task_struct *p, un
if (p->sched_class->task_waking)
p->sched_class->task_waking(p);
- /*
- * XXX: by having set TASK_WAKING outside of rq->lock, there
- * could be an in-flight change to p->cpus_allowed..
- */
+
cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
#endif
ttwu_queue(p, cpu);
Index: linux-2.6/arch/x86/include/asm/spinlock.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/spinlock.h
+++ linux-2.6/arch/x86/include/asm/spinlock.h
@@ -158,18 +158,34 @@ static __always_inline void __ticket_spi
}
#endif
+#define TICKET_MASK ((1 << TICKET_SHIFT) - 1)
+
static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
{
int tmp = ACCESS_ONCE(lock->slock);
- return !!(((tmp >> TICKET_SHIFT) ^ tmp) & ((1 << TICKET_SHIFT) - 1));
+ return !!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK);
}
static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
{
int tmp = ACCESS_ONCE(lock->slock);
- return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
+ return (((tmp >> TICKET_SHIFT) - tmp) & ...The problem is, somehow we should check both on_rq and state Yes, we need the full mb(). without subsequent spin_lock(), wmb() Suppose that we have a task T sleeping in TASK_INTERRUPTIBLE state, and this cpu does try_to_wake_up(TASK_INTERRUPTIBLE). on_rq == false. try_to_wake_up() starts the "for (;;)" loop. However, in the WINDOW above, it is possible that somebody else wakes it up, and then this task changes its state to TASK_INTERRUPTIBLE again. Then we set ->state = TASK_WAKING, but this (still running) T restores TASK_RUNNING after us. Oleg. --
See, there's a reason I CC'ed you ;-) Hrmph, so is it only about serializing concurrent wakeups? If so, we could possibly hold p->pi_lock over the wakeup. --
Something like the below.. except it still suffers from the
__migrate_task() hole you identified in your other email.
By fully serializing all wakeups using ->pi_lock it becomes a lot
simpler (although I just realized we might have a problem with
try_to_wake_up_local).
static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
unsigned long flags;
int cpu, ret = 0;
smp_wmb();
raw_spin_lock_irqsave(&p->pi_lock, flags);
if (!(p->state & state))
goto unlock;
ret = 1; /* we qualify as a proper wakeup now */
if (p->se.on_rq && ttwu_force(p, state, wake_flags))
goto unlock;
p->sched_contributes_to_load = !!task_contributes_to_load(p);
/*
* In order to serialize against other tasks wanting to task_rq_lock()
* we need to wait until the current task_rq(p)->lock holder goes away,
* so that the next might observe TASK_WAKING.
*/
p->state = TASK_WAKING;
smp_wmb();
raw_spin_unlock_wait(&task_rq(p)->lock);
/*
* Stable, now that TASK_WAKING is visible.
*/
cpu = task_cpu(p);
#ifdef CONFIG_SMP
/*
* Catch the case where schedule() has done the dequeue but hasn't yet
* scheduled to a new task, in that case p is still being referenced
* by that cpu so we cannot wake it to any other cpu.
*
* Here we must either do a full remote enqueue, or simply wait for
* the remote cpu to finish the schedule(), the latter was found to
* be cheapest.
*/
while (p->oncpu)
cpu_relax();
if (p->sched_class->task_waking)
p->sched_class->task_waking(p);
cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
#endif
ttwu_queue(p, cpu);
ttwu_stat(p, cpu, wake_flags);
unlock:
...Well. All I can say, I'll try to re-read this code with the fresh head ;) This needs smp_mb(), unlock_wait() reads the memory. Oleg. --
Right, good plan ;-) I'll try and clean up my current mess and try to post a new series this weekend (although it looks to be a busy weekend, with the holidays coming up and all). --
Could below happen in this __window__?
p is going through wake_event and it first set TASK_UNINTERRUPTIBLE,
then waker see that and above if (!(p->state & state)) passed.
But at this time condition == true for p, and p return to run and
intend to sleep:
p->state == XXX;
sleep;
--
Only stand for myself.
--
I don't think this can happen with wait_event/wake_up/etc, wait_queue_head_t->lock adds the necessary synchronization. I think this is possible, and this is possible whatever we do. Afaics, this patch changes nothing in this sense. Consider: set_current_state(TASK_INTERRUPTIBLE); set_current_state(TASK_UNINTERRUPTIBLE); schedule(); wake_up_state(TASK_INTERRUPTIBLE) in between can in fact wakeup this task in TASK_UNINTERRUPTIBLE state. I do not think this is the problem. The user of wake_up_process() should take care and write the correct code ;) And in any case, any wait-event-like code should handle the spurious wakeups correctly. Or I missed your point? Oleg. --
Actually I don't take different sight into wait_event/wake_up and sleep/wake_up_process, beause nothing prevent the user from using wake_up_process on an added-to-wait_queue sleeper though we know that it's not recommended. And you're right wait_queue_head_t->lock privide necessary Yup. Thanks, Yong --
I __think__ that it is possible to solve other problems, but Hmm, yes, this looks right. Oleg. --
Even simpler. This can race with, say, __migrate_task() which does deactivate_task + activate_task and temporary clears on_rq. Although this is simple to fix, I think. Also. Afaics, without rq->lock, we can't trust "while (p->oncpu)", at least we need rmb() after that. Interestingly, I can't really understand the current meaning of smp_wmb() in finish_lock_switch(). Do you know what exactly is buys? In any case, task_running() (or its callers) do not have the corresponding rmb(). Say, currently try_to_wake_up()->task_waking() can miss all changes starting from prepare_lock_switch(). Hopefully this is OK, but I am confused ;) Oleg. --
I think Linus once argued that loops like that should be fine without a rmb(), at worst they'll have to spin a few more times to observe the 1->0 switch (we don't care about the 0->1 switch in this case because I _think_ its meant to ensure the full contest switch happened and we've stored all changes to the rq structure (destroying all references to So I thought I saw how we are OK there, but then I got myself confused too :-) My argument was something along the lines of there must be some serialization between the task going to sleep and another task waking it (the task setting TASK_UNINTERRUPTIBLE and enqueuing it on a waitqueue, and the waker finding it on the waitqueue), this should be sufficient to make ->state visible to the waker. If the waker observes a !TASK_RUNNING ->state, then by definition it must see all the changes previous to it (including the ->oncpu 0->1 transition). But like said, got my brain in a twist too. --
__migrate_task() is a tad special in that it should only migrate runnable tasks, but of course we can have it preempted with ->state != TASK_RUNNING, which makes life interesting, but we can fix that by including a PREEMPT_ACTIVE test in ttwu. Other such callers are more straight fwd though: rt_mutex_setprio() set_user_nice() sched_move_task() __sched_setscheduler() normalize_task() none of those are fast-path operations, so it is quite doable to also take the ->pi_lock there and fully serialize them. --
Collect all ttwu stat code into a single function and ensure its
always called for an actual wakeup (changing p->state to
TASK_RUNNING).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 66 ++++++++++++++++++++++++++-------------------------------
1 file changed, 31 insertions(+), 35 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2310,21 +2310,35 @@ static void update_avg(u64 *avg, u64 sam
}
#endif
-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
- bool is_sync, bool is_migrate, bool is_local,
- unsigned long en_flags)
+static void
+ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
{
+#ifdef CONFIG_SCHEDSTATS
+
+ schedstat_inc(rq, ttwu_count);
schedstat_inc(p, se.statistics.nr_wakeups);
- if (is_sync)
+
+ if (wake_flags & WF_SYNC)
schedstat_inc(p, se.statistics.nr_wakeups_sync);
- if (is_migrate)
+
+ if (cpu != task_cpu(p))
schedstat_inc(p, se.statistics.nr_wakeups_migrate);
- if (is_local)
+
+ if (cpu == smp_processor_id()) {
+ schedstat_inc(rq, ttwu_local);
schedstat_inc(p, se.statistics.nr_wakeups_local);
- else
- schedstat_inc(p, se.statistics.nr_wakeups_remote);
+ } else {
+ struct sched_domain *sd;
- activate_task(rq, p, en_flags);
+ schedstat_inc(p, se.statistics.nr_wakeups_remote);
+ for_each_domain(smp_processor_id(), sd) {
+ if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+ schedstat_inc(sd, ttwu_wake_remote);
+ break;
+ }
+ }
+ }
+#endif /* CONFIG_SCHEDSTATS */
}
static void
@@ -2384,12 +2398,12 @@ static int try_to_wake_up(struct task_st
if (!(p->state & state))
goto out;
+ cpu = task_cpu(p);
+
if (p->se.on_rq)
goto out_running;
- cpu = task_cpu(p);
orig_cpu = cpu;
-
#ifdef CONFIG_SMP
if (unlikely(task_running(rq, p)))
goto out_activate;
@@ ...Looks good to me. Nice cleanup... A couple of nits inline if you want them. --
try_to_wake_up() would only return a success when it would have to
place a task on a rq, change that to every time we change p->state to
TASK_RUNNING, because that's the real measure of wakeups.
This results in that success is always true for the tracepoint, so
remove its success argument.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/trace/events/sched.h | 18 ++++++++----------
kernel/sched.c | 18 ++++++++----------
kernel/trace/trace_sched_wakeup.c | 3 +--
3 files changed, 17 insertions(+), 22 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2327,10 +2327,10 @@ static inline void ttwu_activate(struct
activate_task(rq, p, en_flags);
}
-static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
- int wake_flags, bool success)
+static void
+ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
{
- trace_sched_wakeup(p, success);
+ trace_sched_wakeup(p);
check_preempt_curr(rq, p, wake_flags);
p->state = TASK_RUNNING;
@@ -2350,7 +2350,7 @@ static inline void ttwu_post_activation(
}
#endif
/* if a worker is waking up, notify workqueue */
- if ((p->flags & PF_WQ_WORKER) && success)
+ if (p->flags & PF_WQ_WORKER)
wq_worker_waking_up(p, cpu_of(rq));
}
@@ -2449,9 +2449,9 @@ static int try_to_wake_up(struct task_st
#endif /* CONFIG_SMP */
ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
cpu == this_cpu, en_flags);
- success = 1;
out_running:
- ttwu_post_activation(p, rq, wake_flags, success);
+ ttwu_post_activation(p, rq, wake_flags);
+ success = 1;
out:
task_rq_unlock(rq, &flags);
put_cpu();
@@ -2470,7 +2470,6 @@ static int try_to_wake_up(struct task_st
static void try_to_wake_up_local(struct task_struct *p)
{
struct rq *rq = task_rq(p);
- bool success = false;
...Note we'll need to fix some perf scripts after that. And also perf sched, probably perf timechart and so on... --
Do any of those actually use the success parameter? If not, then me removing it shouldn't break those tools since they're supposed to parse the format stuff and not notice it missing ;-) --
Alternatively we could always call trace_sched_wakeup() and make success reflect actual wakeup success. --
So, that would work very well for sched-migration.py builtin-sched.c would continue to work fine but would notice that as a bug if a wake up occur on a task already running. But it will only trigger a warning, besides that it will just continue to work normally. It's ok I think, if that old tool triggers a spurious warning. I suspect very few people use it anyway. --
Yep, builtin-sched.c and sched-migration.py do. --
Looks good to me, so far. But I still I have a few more places that call try_to_wake_up() to validate. Acked-by: Frank Rowand <frank.rowand@am.sony.com> --
Since we now have p->oncpu unconditionally available, use it to spin
on.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mutex.h | 2 -
include/linux/sched.h | 2 -
kernel/mutex-debug.c | 2 -
kernel/mutex-debug.h | 2 -
kernel/mutex.c | 2 -
kernel/mutex.h | 2 -
kernel/sched.c | 52 +++++++++++++-------------------------------------
7 files changed, 20 insertions(+), 44 deletions(-)
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -51,7 +51,7 @@ struct mutex {
spinlock_t wait_lock;
struct list_head wait_list;
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
- struct thread_info *owner;
+ struct task_struct *owner;
#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -75,7 +75,7 @@ void debug_mutex_unlock(struct mutex *lo
return;
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
- DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+ DEBUG_LOCKS_WARN_ON(lock->owner != current);
DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
mutex_clear_owner(lock);
}
Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -29,7 +29,7 @@ extern void debug_mutex_init(struct mute
static inline void mutex_set_owner(struct mutex *lock)
{
- lock->owner = current_thread_info();
+ lock->owner = current;
}
static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ ...It seems, this owner->oncpu dereference needs probe_kernel_address() too. Of course, only in theory. OTOH, I do not understand why we need the first "if (!oncpu)" check (and "int oncpu" at all). Oleg. --
Probably because I got my head in a twist.. I wanted to keep the
probe_kernel_address() stuff there so that we wouldn't dereference a
dead owner pointer, however..
I then later figured that the: lock->owner == owner, test is already
sufficient for that, if owner died and went away, it would have released
the mutex already (unless something really bad happened, in which case
this wouldn't be the first explosion).
So yea, I think we can simplify all this even further, but we do need a
smp_rmb() in between those two checks, hmm, and RCU.
How about this?
---
Subject: mutex: Use p->oncpu for the adaptive spin
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Dec 15 18:37:21 CET 2010
Since we no have p->oncpu unconditionally available, use it to spin
on.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
include/linux/mutex.h | 2 -
include/linux/sched.h | 2 -
kernel/mutex-debug.c | 2 -
kernel/mutex-debug.h | 2 -
kernel/mutex.c | 2 -
kernel/mutex.h | 2 -
kernel/sched.c | 87 +++++++++++++++++++++-----------------------------
7 files changed, 43 insertions(+), 56 deletions(-)
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -51,7 +51,7 @@ struct mutex {
spinlock_t wait_lock;
struct list_head wait_list;
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
- struct thread_info *owner;
+ struct task_struct *owner;
#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -75,7 +75,7 @@ void debug_mutex_unlock(struct mutex *lo
return;
DEBUG_LOCKS_WARN_ON(lock->magic != lock);
- DEBUG_LOCKS_WARN_ON(lock->owner != ...I am not sure we need rmb() or even read_barrier_depends(), I think the simple barrier() should be enough. Although rcu_dereference() may be more clean. We don't really care about the "correct" result of *owner. All we need, we should ensure it is safe to dereference this address. If CPU itself does something like "readahead" before we verified "lock->owner == owner" under RCU lock, this shouldn't lead to the fault. Oleg. --
Always provide p->oncpu so that we can determine if its on a cpu
without having to lock the rq.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 2 --
kernel/sched.c | 36 ++++++++++++++++++++++++------------
2 files changed, 24 insertions(+), 14 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -850,18 +850,39 @@ static inline int task_current(struct rq
return rq->curr == p;
}
-#ifndef __ARCH_WANT_UNLOCKED_CTXSW
static inline int task_running(struct rq *rq, struct task_struct *p)
{
+#ifdef CONFIG_SMP
+ return p->oncpu;
+#else
return task_current(rq, p);
+#endif
}
+#ifndef __ARCH_WANT_UNLOCKED_CTXSW
static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
{
+#ifdef CONFIG_SMP
+ /*
+ * We can optimise this out completely for !SMP, because the
+ * SMP rebalancing from interrupt is the only thing that cares
+ * here.
+ */
+ next->oncpu = 1;
+#endif
}
static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
{
+#ifdef CONFIG_SMP
+ /*
+ * After ->oncpu is cleared, the task can be moved to a different CPU.
+ * We must ensure this doesn't happen until the switch is completely
+ * finished.
+ */
+ smp_wmb();
+ prev->oncpu = 0;
+#endif
#ifdef CONFIG_DEBUG_SPINLOCK
/* this is a valid case when another task releases the spinlock */
rq->lock.owner = current;
@@ -877,15 +898,6 @@ static inline void finish_lock_switch(st
}
#else /* __ARCH_WANT_UNLOCKED_CTXSW */
-static inline int task_running(struct rq *rq, struct task_struct *p)
-{
-#ifdef CONFIG_SMP
- return p->oncpu;
-#else
- return task_current(rq, p);
-#endif
-}
-
static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
{
#ifdef CONFIG_SMP
@@ -2589,7 +2601,7 @@ void sched_fork(struct task_struct *p, i
if ...Looks good to me. Acked-by: Frank Rowand <frank.rowand@am.sony.com> --
And also as Peter pointed out when I posted the patch (thank you Peter),
I did not properly handle the return value for try_to_wake_up() - a rather
fatal flaw.
By coincidence, I was about to post a new version of my scary patch when
this email arrived. I'll post my patches as a reply to this email, then
read through Peter's.
Frank's patch, Version 2
Changes from Version 1:
- Ensure return value of try_to_wake_up() is correct, even when queueing
wake up on a different cpu.
- rq->lock contention reduction not as good as first version
patch 1
The core changes. All the scary lock related stuff.
select_task_rq() uses the smp_processor_id() of the old task_cpu(p) instead
of the waking smp_processor_id().
patch 2
select_task_rq() uses the smp_processor_id() of the waking smp_processor_id()
Limitations
x86 only
Tests
- tested on 2 cpu x86_64
- very simplistic workload
- results:
rq->lock contention count reduced by ~ 75%
rq->lock contention wait time reduced by ~ 70%
test duration reduction is in the noise
rq->lock contention improvement is slightly better with just patch 1
applied, but the difference is in the noise
Todo
- handle cpu being offlined
-Frank
--
patch 1 of 2
Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
---
arch/x86/kernel/smp.c | 1 1 + 0 - 0 !
include/linux/sched.h | 5 5 + 0 - 0 !
kernel/sched.c | 105 99 + 6 - 0 !
3 files changed, 105 insertions(+), 6 deletions(-)
Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}
void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1038,6 +1038,7 @@ struct sched_domain;
*/
#define WF_SYNC 0x01 /* waker goes to sleep after wakup */
#define WF_FORK 0x02 /* child wakeup after fork */
+#define WF_LOAD 0x04 /* for queued try_to_wake_up() */
#define ENQUEUE_WAKEUP 1
#define ENQUEUE_WAKING 2
@@ -1193,6 +1194,8 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
#ifdef CONFIG_SMP
+ struct task_struct *ttwu_queue_wake_entry;
+ int ttwu_queue_wake_flags;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
@@ -2017,6 +2020,7 @@ extern void release_uids(struct user_nam
extern void do_timer(unsigned long ticks);
+extern void sched_ttwu_pending(void);
extern int wake_up_state(struct task_struct *tsk, unsigned int state);
extern int wake_up_process(struct task_struct *tsk);
extern void wake_up_new_task(struct task_struct *tsk,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -515,6 +515,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif
...So first you bounce the wakeup to the previous task cpu, when you do select_task_rq() and then you bounce it to the new target? --
Yes. (This is the answer when only the first patch is applied.) The wakeup is bounced to the previous task cpu. The bounce adds extra overhead (including an IRQ on the previous task cpu) in return for a reduction in the number of rq->lock contentions. Then the try_to_wake_up() on the previous task cpu may wake the task on a different cpu, but the bias should be the cpu that try_to_wake_up() is now running on. The second patch in my series instead feeds the waking cpu number to select_task_rq(), so the bias will be to wake the task on the waking cpu. -Frank --
The email that explains the context for this does not seem to have gotten
through to the list. Here is the email that this patch should have been
a reply to:
And also as Peter pointed out when I posted the patch (thank you Peter),
I did not properly handle the return value for try_to_wake_up() - a rather
fatal flaw.
By coincidence, I was about to post a new version of my scary patch when
this email arrived. I'll post my patches as a reply to this email, then
read through Peter's.
Frank's patch, Version 2
Changes from Version 1:
- Ensure return value of try_to_wake_up() is correct, even when queueing
wake up on a different cpu.
- rq->lock contention reduction not as good as first version
patch 1
The core changes. All the scary lock related stuff.
select_task_rq() uses the smp_processor_id() of the old task_cpu(p) instead
of the waking smp_processor_id().
patch 2
select_task_rq() uses the smp_processor_id() of the waking smp_processor_id()
Limitations
x86 only
Tests
- tested on 2 cpu x86_64
- very simplistic workload
- results:
rq->lock contention count reduced by ~ 75%
rq->lock contention wait time reduced by ~ 70%
test duration reduction is in the noise
rq->lock contention improvement is slightly better with just patch 1
applied, but the difference is in the noise
Todo
- handle cpu being offlined
-Frank
--
patch 2 of 2
Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
---
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1060,7 +1060,7 @@ struct sched_class {
#ifdef CONFIG_SMP
int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int sd_flag, int flags, int waking_cpu);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
@@ -1196,6 +1196,7 @@ struct task_struct {
#ifdef CONFIG_SMP
struct task_struct *ttwu_queue_wake_entry;
int ttwu_queue_wake_flags;
+ int ttwu_waking_cpu;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2262,9 +2262,11 @@ static int select_fallback_rq(int cpu, s
* The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags,
+ int wake_flags, int waking_cpu)
{
- int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+ int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags,
+ waking_cpu);
/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -2335,12 +2337,14 @@ static inline void ttwu_post_activation(
}
#ifdef CONFIG_SMP
-static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags)
+static void ttwu_queue_wake_up(struct task_struct *p, int this_cpu, int p_cpu,
+ int wake_flags)
{
struct task_struct *next = NULL;
- struct rq *rq = cpu_rq(cpu);
+ struct rq *rq = cpu_rq(p_cpu);
p->ttwu_queue_wake_flags = ...