Re: [RFC][PATCH 3/5] sched: Change the ttwu success details

Previous thread: [PATCH] replace idr in block layer SCSI generic (bsg) driver with kzalloced buffer by Hillf Danton on Thursday, December 16, 2010 - 8:01 am. (1 message)

Next thread: [PATCH 2/3] mbp_nvidia_bl: Check that the backlight control functions by Matthew Garrett on Thursday, December 16, 2010 - 8:13 am. (1 message)
From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 7:56 am

Hi, here a new posting of my scary patch(es) ;-)

These actually survive a sembench run (and everything else I threw at it).
The discussion between Mike and Frank over the task_running() check made me
realize what was wrong with the previous one.

As it turns out, what was needed (p->oncpu) was something Thomas wanted me
to do for an entirely different reason (see patch #2).

Frank's patch, while encouraging me to poke at it again, has a number of
very fundamental problems with it, the most serious one being that it
completely wrecks the wake-up load-balancing.

I'll try and come up with a way to unwreck the 32bit select_task_rq_fair()
problem, but for now inspiration in that departments seems lacking, yet I
still wanted to share these patches so that others can have a go at them.

If all you care about is the numbers, skip to patch #5.

--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 7:56 am

Reduce rq->lock contention on try_to_wake_up() by changing the task
state using a cmpxchg loop.

Once the task is set to TASK_WAKING we're guaranteed the only one
poking at it, then proceed to pick a new cpu without holding the
rq->lock (XXX this opens some races).

Then instead of locking the remote rq and activating the task, place
the task on a remote queue, again using cmpxchg, and notify the remote
cpu per IPI if this queue was empty to start processing its wakeups.

This avoids (in most cases) having to lock the remote runqueue (and
therefore the exclusive cacheline transfer thereof) but also touching
all the remote runqueue data structures needed for the actual
activation.

As measured using: http://oss.oracle.com/~mason/sembench.c

$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0

unpatched: run time 30 seconds 537953 worker burns per second
patched:   run time 30 seconds 657336 worker burns per second

Still need to sort out all the races marked XXX (non-trivial), and its
x86 only for the moment.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/smp.c   |    1 
 include/linux/sched.h   |    7 -
 kernel/sched.c          |  241 ++++++++++++++++++++++++++++++++++--------------
 kernel/sched_fair.c     |    5 
 kernel/sched_features.h |    3 
 kernel/sched_idletask.c |    2 
 kernel/sched_rt.c       |    4 
 kernel/sched_stoptask.c |    3 
 8 files changed, 190 insertions(+), 76 deletions(-)

Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
Index: ...
From: Frederic Weisbecker
Date: Thursday, December 16, 2010 - 8:31 am

Great, that's going to greatly simplify and lower the overhead of
the remote tick restart I'm doing on wake up for the nohz task thing.
--

From: Oleg Nesterov
Date: Thursday, December 16, 2010 - 10:58 am

Interesting... I didn't actually read this patch yet, just a very

Somehow I was confused by initial "next = NULL", perhaps

	struct rq *rq = cpu_rq(cpu);
	struct task_struct *next = rq->wake_list;

makes a bit more sense.

Oleg.

--

From: Oleg Nesterov
Date: Thursday, December 16, 2010 - 11:42 am

I think it can. Yes, we've set TASK_WAKING. But, at least the task
itself can change its state back to TASK_RUNNING without calling

Well, this surely breaks the code like

	CONDITION = true;
	wake_up_process(p);

At least we need mb() before we check task_state the first time.

Oleg.

--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 11:58 am

Oh crud, you're right, that's going to make all this cmpxchg stuff lots

You're right (wmb, at least), I left that out because I had the cmpxchg
in there that provides a mb, but didn't notice I read the state before
that.. /me goes put the smp_wmb() back.
--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 12:03 pm

Hrmph, we can add a task_is_waking() test to the rq->lock in schedule(),
like we have for __task_rq_lock():

  local_irq_save(flags);
again:
  while (task_is_waking(current))
	cpu_relax();
  raw_spin_lock(rq->lock);
  if (task_is_waking(current)) {
	raw_spin_unlock(rq->lock);
	goto again;
  }


But that's not particularly pretty...
--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 12:47 pm

OK, I'm just not thinking straight here, if we're not passing through
schedule() the above clearly won't help any...

Damn, that's a nasty case to solve...

  current->state = TASK_UNINTERRUPTIBLE;
  while (!true)
    schedule();
  current->state = TASK_RUNNING;

... /me goes make a strong pot of tea
--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 1:32 pm

Just when the tea was setting my brain sparked (could be insanity,
dunno), anyway, how about the below?

It does the state and on_rq checks first, if we find on_rq, we do a
forced remote wakeup (rq->lock, recheck state, recheck on_rq, set
p->state to TASK_RUNNING), otherwise, we do the cmpxchg thing to
TASK_WAKING, (since !on_rq @p passed through schedule()) and proceed
like before.



---
Subject: sched: Reduce ttwu rq->lock contention
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 22 Jun 2010 15:33:06 +0200

Reduce rq->lock contention on try_to_wake_up() by changing the task
state using a cmpxchg loop.

Once the task is set to TASK_WAKING we're guaranteed the only one
poking at it, then proceed to pick a new cpu without holding the
rq->lock (XXX this opens some races).

Then instead of locking the remote rq and activating the task, place
the task on a remote queue, again using cmpxchg, and notify the remote
cpu per IPI if this queue was empty to start processing its wakeups.

This avoids (in most cases) having to lock the remote runqueue (and
therefore the exclusive cacheline transfer thereof) but also touching
all the remote runqueue data structures needed for the actual
activation.

As measured using: http://oss.oracle.com/~mason/sembench.c

$ echo 4096 32000 64 128 > /proc/sys/kernel/sem
$ ./sembench -t 2048 -w 1900 -o 0

unpatched: run time 30 seconds 537953 worker burns per second
patched:   run time 30 seconds 657336 worker burns per second

Still need to sort out all the races marked XXX (non-trivial), and its
x86 only for the moment.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1277213586.1875.704.camel@laptop>
---
 arch/x86/kernel/smp.c   |    1 
 include/linux/sched.h   |    7 -
 kernel/sched.c          |  226 +++++++++++++++++++++++++++++++++---------------
 kernel/sched_fair.c     |    5 -
 kernel/sched_features.h |    2 
 kernel/sched_idletask.c |    2 
 kernel/sched_rt.c       |    4 
 ...
From: Yan, Zheng
Date: Thursday, December 16, 2010 - 8:06 pm

Looks like nothing prevents ttwu() from changing task's CPU while
some one else is holding task_rq_lock(). Is this OK?

Thanks
Yan, Zheng
--

From: Peter Zijlstra
Date: Friday, December 17, 2010 - 6:23 am

Ah, crud, good catch. No that is not quite OK ;-)

I'm starting to think adding a per-task scheduler lock isn't such a bad
idea after all :-)

How does something like the below look, it waits for the current
task_rq(p)->lock owner to go away after we flip p->state to TASK_WAKING.

It also optimizes the x86 spinlock code a bit, no need to wait for all
pending owners to go away, just the current one.

This also solves the p->cpus_allowed race..

---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2518,6 +2518,8 @@ try_to_wake_up(struct task_struct *p, un
 			break;
 	}
 
+	raw_spin_unlock_wait(&task_rq(p)->lock);
+
 	ret = 1; /* we qualify as a proper wakeup now */
 
 	if (load) // XXX racy
@@ -2536,10 +2538,7 @@ try_to_wake_up(struct task_struct *p, un
 
 	if (p->sched_class->task_waking)
 		p->sched_class->task_waking(p);
-	/*
-	 * XXX: by having set TASK_WAKING outside of rq->lock, there
-	 * could be an in-flight change to p->cpus_allowed..
-	 */
+
 	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
 #endif
 	ttwu_queue(p, cpu);
Index: linux-2.6/arch/x86/include/asm/spinlock.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/spinlock.h
+++ linux-2.6/arch/x86/include/asm/spinlock.h
@@ -158,18 +158,34 @@ static __always_inline void __ticket_spi
 }
 #endif
 
+#define TICKET_MASK ((1 << TICKET_SHIFT) - 1)
+
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
-	return !!(((tmp >> TICKET_SHIFT) ^ tmp) & ((1 << TICKET_SHIFT) - 1));
+	return !!(((tmp >> TICKET_SHIFT) ^ tmp) & TICKET_MASK);
 }
 
 static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
 {
 	int tmp = ACCESS_ONCE(lock->slock);
 
-	return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
+	return (((tmp >> TICKET_SHIFT) - tmp) & ...
From: Oleg Nesterov
Date: Friday, December 17, 2010 - 9:54 am

The problem is, somehow we should check both on_rq and state

Yes, we need the full mb(). without subsequent spin_lock(), wmb()


Suppose that we have a task T sleeping in TASK_INTERRUPTIBLE state,
and this cpu does try_to_wake_up(TASK_INTERRUPTIBLE). on_rq == false.
try_to_wake_up() starts the "for (;;)" loop.

However, in the WINDOW above, it is possible that somebody else wakes
it up, and then this task changes its state to TASK_INTERRUPTIBLE again.

Then we set ->state = TASK_WAKING, but this (still running) T restores
TASK_RUNNING after us.

Oleg.

--

From: Peter Zijlstra
Date: Friday, December 17, 2010 - 10:43 am

See, there's a reason I CC'ed you ;-)

Hrmph, so is it only about serializing concurrent wakeups? If so, we
could possibly hold p->pi_lock over the wakeup.
--

From: Peter Zijlstra
Date: Friday, December 17, 2010 - 11:15 am

Something like the below.. except it still suffers from the
__migrate_task() hole you identified in your other email.

By fully serializing all wakeups using ->pi_lock it becomes a lot
simpler (although I just realized we might have a problem with
try_to_wake_up_local).

static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
        unsigned long flags;
        int cpu, ret = 0;

        smp_wmb();
        raw_spin_lock_irqsave(&p->pi_lock, flags);

        if (!(p->state & state))
                goto unlock;

        ret = 1; /* we qualify as a proper wakeup now */

        if (p->se.on_rq && ttwu_force(p, state, wake_flags))
                goto unlock;

        p->sched_contributes_to_load = !!task_contributes_to_load(p);

        /*
         * In order to serialize against other tasks wanting to task_rq_lock()
         * we need to wait until the current task_rq(p)->lock holder goes away,
         * so that the next might observe TASK_WAKING.
         */
        p->state = TASK_WAKING;
        smp_wmb();
        raw_spin_unlock_wait(&task_rq(p)->lock);

        /*
         * Stable, now that TASK_WAKING is visible.
         */
        cpu = task_cpu(p);

#ifdef CONFIG_SMP
        /*
         * Catch the case where schedule() has done the dequeue but hasn't yet
         * scheduled to a new task, in that case p is still being referenced
         * by that cpu so we cannot wake it to any other cpu.
         *
         * Here we must either do a full remote enqueue, or simply wait for
         * the remote cpu to finish the schedule(), the latter was found to
         * be cheapest.
         */
        while (p->oncpu)
                cpu_relax();

        if (p->sched_class->task_waking)
                p->sched_class->task_waking(p);

        cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
#endif
        ttwu_queue(p, cpu);
        ttwu_stat(p, cpu, wake_flags);
unlock:
        ...
From: Oleg Nesterov
Date: Friday, December 17, 2010 - 12:28 pm

Well. All I can say, I'll try to re-read this code with the fresh head ;)

This needs smp_mb(), unlock_wait() reads the memory.

Oleg.

--

From: Peter Zijlstra
Date: Friday, December 17, 2010 - 2:02 pm

Right, good plan ;-)

I'll try and clean up my current mess and try to post a new series this
weekend (although it looks to be a busy weekend, with the holidays
coming up and all).

--

From: Yong Zhang
Date: Saturday, December 18, 2010 - 7:49 am

Could below happen in this __window__?

p is going through wake_event and it first set TASK_UNINTERRUPTIBLE,
then waker see that and above if (!(p->state & state)) passed.
But at this time condition == true for p, and p return to run and
intend to sleep:
          p->state == XXX;
          sleep;




-- 
Only stand for myself.
--

From: Oleg Nesterov
Date: Saturday, December 18, 2010 - 1:08 pm

I don't think this can happen with wait_event/wake_up/etc,
wait_queue_head_t->lock adds the necessary synchronization.


I think this is possible, and this is possible whatever we do.
Afaics, this patch changes nothing in this sense. Consider:

	set_current_state(TASK_INTERRUPTIBLE);

	set_current_state(TASK_UNINTERRUPTIBLE);
	schedule();

wake_up_state(TASK_INTERRUPTIBLE) in between can in fact wakeup
this task in TASK_UNINTERRUPTIBLE state.

I do not think this is the problem. The user of wake_up_process()
should take care and write the correct code ;) And in any case,
any wait-event-like code should handle the spurious wakeups
correctly.

Or I missed your point?

Oleg.

--

From: Yong Zhang
Date: Sunday, December 19, 2010 - 4:20 am

Actually I don't take different sight into wait_event/wake_up
and sleep/wake_up_process, beause nothing prevent the user
from using wake_up_process on an added-to-wait_queue sleeper
though we know that it's not recommended.

And you're right wait_queue_head_t->lock privide necessary



Yup.

Thanks,
Yong
--

From: Oleg Nesterov
Date: Friday, December 17, 2010 - 11:21 am

I __think__ that it is possible to solve other problems, but

Hmm, yes, this looks right.

Oleg.

--

From: Oleg Nesterov
Date: Friday, December 17, 2010 - 10:50 am

Even simpler. This can race with, say, __migrate_task() which does
deactivate_task + activate_task and temporary clears on_rq. Although
this is simple to fix, I think.



Also. Afaics, without rq->lock, we can't trust "while (p->oncpu)", at
least we need rmb() after that.

Interestingly, I can't really understand the current meaning of smp_wmb()
in finish_lock_switch(). Do you know what exactly is buys? In any case,
task_running() (or its callers) do not have the corresponding rmb().
Say, currently try_to_wake_up()->task_waking() can miss all changes
starting from prepare_lock_switch(). Hopefully this is OK, but I am
confused ;)

Oleg.

--

From: Peter Zijlstra
Date: Friday, December 17, 2010 - 11:24 am

I think Linus once argued that loops like that should be fine without a
rmb(), at worst they'll have to spin a few more times to observe the
1->0 switch (we don't care about the 0->1 switch in this case because

I _think_ its meant to ensure the full contest switch happened and we've
stored all changes to the rq structure (destroying all references to

So I thought I saw how we are OK there, but then I got myself confused
too :-)

My argument was something along the lines of there must be some
serialization between the task going to sleep and another task waking it
(the task setting TASK_UNINTERRUPTIBLE and enqueuing it on a waitqueue,
and the waker finding it on the waitqueue), this should be sufficient to
make ->state visible to the waker.

If the waker observes a !TASK_RUNNING ->state, then by definition it
must see all the changes previous to it (including the ->oncpu 0->1
transition).

But like said, got my brain in a twist too.
--

From: Peter Zijlstra
Date: Friday, December 17, 2010 - 11:41 am

__migrate_task() is a tad special in that it should only migrate
runnable tasks, but of course we can have it preempted with ->state !=
TASK_RUNNING, which makes life interesting, but we can fix that by
including a PREEMPT_ACTIVE test in ttwu.

Other such callers are more straight fwd though:

  rt_mutex_setprio()
  set_user_nice()
  sched_move_task()
  __sched_setscheduler()
  normalize_task()

none of those are fast-path operations, so it is quite doable to also
take the ->pi_lock there and fully serialize them.
  
--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 7:56 am

Collect all ttwu stat code into a single function and ensure its
always called for an actual wakeup (changing p->state to
TASK_RUNNING).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   66 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 31 insertions(+), 35 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2310,21 +2310,35 @@ static void update_avg(u64 *avg, u64 sam
 }
 #endif
 
-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
-				 bool is_sync, bool is_migrate, bool is_local,
-				 unsigned long en_flags)
+static void
+ttwu_stat(struct rq *rq, struct task_struct *p, int cpu, int wake_flags)
 {
+#ifdef CONFIG_SCHEDSTATS
+
+	schedstat_inc(rq, ttwu_count);
 	schedstat_inc(p, se.statistics.nr_wakeups);
-	if (is_sync)
+
+	if (wake_flags & WF_SYNC)
 		schedstat_inc(p, se.statistics.nr_wakeups_sync);
-	if (is_migrate)
+
+	if (cpu != task_cpu(p))
 		schedstat_inc(p, se.statistics.nr_wakeups_migrate);
-	if (is_local)
+
+	if (cpu == smp_processor_id()) {
+		schedstat_inc(rq, ttwu_local);
 		schedstat_inc(p, se.statistics.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+	} else {
+		struct sched_domain *sd;
 
-	activate_task(rq, p, en_flags);
+		schedstat_inc(p, se.statistics.nr_wakeups_remote);
+		for_each_domain(smp_processor_id(), sd) {
+			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+				schedstat_inc(sd, ttwu_wake_remote);
+				break;
+			}
+		}
+	}
+#endif /* CONFIG_SCHEDSTATS */
 }
 
 static void
@@ -2384,12 +2398,12 @@ static int try_to_wake_up(struct task_st
 	if (!(p->state & state))
 		goto out;
 
+	cpu = task_cpu(p);
+
 	if (p->se.on_rq)
 		goto out_running;
 
-	cpu = task_cpu(p);
 	orig_cpu = cpu;
-
 #ifdef CONFIG_SMP
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
@@ ...
From: Frank Rowand
Date: Friday, December 17, 2010 - 6:09 pm

Looks good to me.  Nice cleanup...

A couple of nits inline if you want them.





--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 7:56 am

try_to_wake_up() would only return a success when it would have to
place a task on a rq, change that to every time we change p->state to
TASK_RUNNING, because that's the real measure of wakeups.

This results in that success is always true for the tracepoint, so
remove its success argument.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/trace/events/sched.h      |   18 ++++++++----------
 kernel/sched.c                    |   18 ++++++++----------
 kernel/trace/trace_sched_wakeup.c |    3 +--
 3 files changed, 17 insertions(+), 22 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2327,10 +2327,10 @@ static inline void ttwu_activate(struct 
 	activate_task(rq, p, en_flags);
 }
 
-static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
-					int wake_flags, bool success)
+static void
+ttwu_post_activation(struct task_struct *p, struct rq *rq, int wake_flags)
 {
-	trace_sched_wakeup(p, success);
+	trace_sched_wakeup(p);
 	check_preempt_curr(rq, p, wake_flags);
 
 	p->state = TASK_RUNNING;
@@ -2350,7 +2350,7 @@ static inline void ttwu_post_activation(
 	}
 #endif
 	/* if a worker is waking up, notify workqueue */
-	if ((p->flags & PF_WQ_WORKER) && success)
+	if (p->flags & PF_WQ_WORKER)
 		wq_worker_waking_up(p, cpu_of(rq));
 }
 
@@ -2449,9 +2449,9 @@ static int try_to_wake_up(struct task_st
 #endif /* CONFIG_SMP */
 	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
 		      cpu == this_cpu, en_flags);
-	success = 1;
 out_running:
-	ttwu_post_activation(p, rq, wake_flags, success);
+	ttwu_post_activation(p, rq, wake_flags);
+	success = 1;
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();
@@ -2470,7 +2470,6 @@ static int try_to_wake_up(struct task_st
 static void try_to_wake_up_local(struct task_struct *p)
 {
 	struct rq *rq = task_rq(p);
-	bool success = false;
 
 ...
From: Frederic Weisbecker
Date: Thursday, December 16, 2010 - 8:23 am

Note we'll need to fix some perf scripts after that. And also perf sched,
probably perf timechart and so on...
--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 8:27 am

Do any of those actually use the success parameter? If not, then me
removing it shouldn't break those tools since they're supposed to parse
the format stuff and not notice it missing ;-)
--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 8:30 am

Alternatively we could always call trace_sched_wakeup() and make success
reflect actual wakeup success.
--

From: Frederic Weisbecker
Date: Thursday, December 16, 2010 - 8:45 am

So, that would work very well for sched-migration.py

builtin-sched.c would continue to work fine but would notice that as
a bug if a wake up occur on a task already running.

But it will only trigger a warning, besides that it will just continue
to work normally.

It's ok I think, if that old tool triggers a spurious warning. I suspect
very few people use it anyway.
--

From: Frederic Weisbecker
Date: Thursday, December 16, 2010 - 8:35 am

Yep, builtin-sched.c and sched-migration.py do.
--

From: Frank Rowand
Date: Friday, December 17, 2010 - 6:05 pm

Looks good to me, so far.  But I still I have a few more places that call
try_to_wake_up() to validate.

Acked-by: Frank Rowand <frank.rowand@am.sony.com>

--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 7:56 am

Since we now have p->oncpu unconditionally available, use it to spin
on.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mutex.h |    2 -
 include/linux/sched.h |    2 -
 kernel/mutex-debug.c  |    2 -
 kernel/mutex-debug.h  |    2 -
 kernel/mutex.c        |    2 -
 kernel/mutex.h        |    2 -
 kernel/sched.c        |   52 +++++++++++++-------------------------------------
 7 files changed, 20 insertions(+), 44 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -51,7 +51,7 @@ struct mutex {
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
 #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
-	struct thread_info	*owner;
+	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -75,7 +75,7 @@ void debug_mutex_unlock(struct mutex *lo
 		return;
 
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	DEBUG_LOCKS_WARN_ON(lock->owner != current);
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
 	mutex_clear_owner(lock);
 }
Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -29,7 +29,7 @@ extern void debug_mutex_init(struct mute
 
 static inline void mutex_set_owner(struct mutex *lock)
 {
-	lock->owner = current_thread_info();
+	lock->owner = current;
 }
 
 static inline void mutex_clear_owner(struct mutex *lock)
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ ...
From: Oleg Nesterov
Date: Thursday, December 16, 2010 - 10:34 am

It seems, this owner->oncpu dereference needs probe_kernel_address() too.
Of course, only in theory.

OTOH, I do not understand why we need the first "if (!oncpu)" check
(and "int oncpu" at all).

Oleg.

--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 12:29 pm

Probably because I got my head in a twist.. I wanted to keep the
probe_kernel_address() stuff there so that we wouldn't dereference a
dead owner pointer, however..

I then later figured that the: lock->owner == owner, test is already
sufficient for that, if owner died and went away, it would have released
the mutex already (unless something really bad happened, in which case
this wouldn't be the first explosion).

So yea, I think we can simplify all this even further, but we do need a
smp_rmb() in between those two checks, hmm, and RCU.

How about this?


---
Subject: mutex: Use p->oncpu for the adaptive spin
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Dec 15 18:37:21 CET 2010

Since we no have p->oncpu unconditionally available, use it to spin
on.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/mutex.h |    2 -
 include/linux/sched.h |    2 -
 kernel/mutex-debug.c  |    2 -
 kernel/mutex-debug.h  |    2 -
 kernel/mutex.c        |    2 -
 kernel/mutex.h        |    2 -
 kernel/sched.c        |   87 +++++++++++++++++++++-----------------------------
 7 files changed, 43 insertions(+), 56 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -51,7 +51,7 @@ struct mutex {
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
 #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
-	struct thread_info	*owner;
+	struct task_struct	*owner;
 #endif
 #ifdef CONFIG_DEBUG_MUTEXES
 	const char 		*name;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -75,7 +75,7 @@ void debug_mutex_unlock(struct mutex *lo
 		return;
 
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != ...
From: Oleg Nesterov
Date: Friday, December 17, 2010 - 12:17 pm

I am not sure we need rmb() or even read_barrier_depends(), I think
the simple barrier() should be enough. Although rcu_dereference()
may be more clean.

We don't really care about the "correct" result of *owner. All we need,
we should ensure it is safe to dereference this address. If CPU itself
does something like "readahead" before we verified "lock->owner == owner"
under RCU lock, this shouldn't lead to the fault.

Oleg.

--

From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 7:56 am

Always provide p->oncpu so that we can determine if its on a cpu
without having to lock the rq.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    2 --
 kernel/sched.c        |   36 ++++++++++++++++++++++++------------
 2 files changed, 24 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -850,18 +850,39 @@ static inline int task_current(struct rq
 	return rq->curr == p;
 }
 
-#ifndef __ARCH_WANT_UNLOCKED_CTXSW
 static inline int task_running(struct rq *rq, struct task_struct *p)
 {
+#ifdef CONFIG_SMP
+	return p->oncpu;
+#else
 	return task_current(rq, p);
+#endif
 }
 
+#ifndef __ARCH_WANT_UNLOCKED_CTXSW
 static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
 {
+#ifdef CONFIG_SMP
+	/*
+	 * We can optimise this out completely for !SMP, because the
+	 * SMP rebalancing from interrupt is the only thing that cares
+	 * here.
+	 */
+	next->oncpu = 1;
+#endif
 }
 
 static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 {
+#ifdef CONFIG_SMP
+	/*
+	 * After ->oncpu is cleared, the task can be moved to a different CPU.
+	 * We must ensure this doesn't happen until the switch is completely
+	 * finished.
+	 */
+	smp_wmb();
+	prev->oncpu = 0;
+#endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
 	rq->lock.owner = current;
@@ -877,15 +898,6 @@ static inline void finish_lock_switch(st
 }
 
 #else /* __ARCH_WANT_UNLOCKED_CTXSW */
-static inline int task_running(struct rq *rq, struct task_struct *p)
-{
-#ifdef CONFIG_SMP
-	return p->oncpu;
-#else
-	return task_current(rq, p);
-#endif
-}
-
 static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
 {
 #ifdef CONFIG_SMP
@@ -2589,7 +2601,7 @@ void sched_fork(struct task_struct *p, i
 	if ...
From: Frank Rowand
Date: Friday, December 17, 2010 - 6:03 pm

Looks good to me.

Acked-by: Frank Rowand <frank.rowand@am.sony.com>

--

From: Frank Rowand
Date: Thursday, December 16, 2010 - 12:12 pm

And also as Peter pointed out when I posted the patch (thank you Peter),
I did not properly handle the return value for try_to_wake_up() - a rather
fatal flaw.

By coincidence, I was about to post a new version of my scary patch when
this email arrived.  I'll post my patches as a reply to this email, then
read through Peter's.


Frank's patch, Version 2

Changes from Version 1:
  - Ensure return value of try_to_wake_up() is correct, even when queueing
    wake up on a different cpu.
  - rq->lock contention reduction not as good as first version

patch 1

  The core changes.  All the scary lock related stuff.

  select_task_rq() uses the smp_processor_id() of the old task_cpu(p) instead
  of the waking smp_processor_id().

patch 2

  select_task_rq() uses the smp_processor_id() of the waking smp_processor_id()
  
Limitations
  x86 only

Tests
  - tested on 2 cpu x86_64
  - very simplistic workload
  - results:
     rq->lock contention count reduced by ~ 75%
     rq->lock contention wait time reduced by ~ 70%
     test duration reduction is in the noise
     rq->lock contention improvement is slightly better with just patch 1
       applied, but the difference is in the noise

Todo
  - handle cpu being offlined


-Frank
--

From: Frank Rowand
Date: Thursday, December 16, 2010 - 12:36 pm

patch 1 of 2

Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>

---
 arch/x86/kernel/smp.c |    1 	1 +	0 -	0 !
 include/linux/sched.h |    5 	5 +	0 -	0 !
 kernel/sched.c        |  105 	99 +	6 -	0 !
 3 files changed, 105 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1038,6 +1038,7 @@ struct sched_domain;
  */
 #define WF_SYNC		0x01		/* waker goes to sleep after wakup */
 #define WF_FORK		0x02		/* child wakeup after fork */
+#define WF_LOAD		0x04		/* for queued try_to_wake_up() */
 
 #define ENQUEUE_WAKEUP		1
 #define ENQUEUE_WAKING		2
@@ -1193,6 +1194,8 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
+	struct task_struct *ttwu_queue_wake_entry;
+	int ttwu_queue_wake_flags;
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
 	int oncpu;
 #endif
@@ -2017,6 +2020,7 @@ extern void release_uids(struct user_nam
 
 extern void do_timer(unsigned long ticks);
 
+extern void sched_ttwu_pending(void);
 extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -515,6 +515,8 @@ struct rq {
 	u64 age_stamp;
 	u64 idle_stamp;
 	u64 avg_idle;
+
+	struct task_struct *wake_list;
 #endif
 ...
From: Peter Zijlstra
Date: Thursday, December 16, 2010 - 12:42 pm

So first you bounce the wakeup to the previous task cpu, when you do
select_task_rq() and then you bounce it to the new target?


--

From: Frank Rowand
Date: Thursday, December 16, 2010 - 1:45 pm

Yes.  (This is the answer when only the first patch is applied.)

The wakeup is bounced to the previous task cpu.  The bounce adds extra
overhead (including an IRQ on the previous task cpu) in return for a
reduction in the number of rq->lock contentions.

Then the try_to_wake_up() on the previous task cpu may wake the task
on a different cpu, but the bias should be the cpu that try_to_wake_up()
is now running on.

The second patch in my series instead feeds the waking cpu number to
select_task_rq(), so the bias will be to wake the task on the waking
cpu.

-Frank

--

From: Frank Rowand
Date: Thursday, December 16, 2010 - 12:39 pm

The email that explains the context for this does not seem to have gotten
through to the list.  Here is the email that this patch should have been
a reply to:


And also as Peter pointed out when I posted the patch (thank you Peter),
I did not properly handle the return value for try_to_wake_up() - a rather
fatal flaw.

By coincidence, I was about to post a new version of my scary patch when
this email arrived.  I'll post my patches as a reply to this email, then
read through Peter's.


Frank's patch, Version 2

Changes from Version 1:
  - Ensure return value of try_to_wake_up() is correct, even when queueing
    wake up on a different cpu.
  - rq->lock contention reduction not as good as first version

patch 1

  The core changes.  All the scary lock related stuff.

  select_task_rq() uses the smp_processor_id() of the old task_cpu(p) instead
  of the waking smp_processor_id().

patch 2

  select_task_rq() uses the smp_processor_id() of the waking smp_processor_id()
  
Limitations
  x86 only

Tests
  - tested on 2 cpu x86_64
  - very simplistic workload
  - results:
     rq->lock contention count reduced by ~ 75%
     rq->lock contention wait time reduced by ~ 70%
     test duration reduction is in the noise
     rq->lock contention improvement is slightly better with just patch 1
       applied, but the difference is in the noise

Todo
  - handle cpu being offlined


-Frank
--

From: Frank Rowand
Date: Thursday, December 16, 2010 - 12:36 pm

patch 2 of 2

Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>

---
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1060,7 +1060,7 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+			       int sd_flag, int flags, int waking_cpu);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
@@ -1196,6 +1196,7 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct task_struct *ttwu_queue_wake_entry;
 	int ttwu_queue_wake_flags;
+	int ttwu_waking_cpu;
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
 	int oncpu;
 #endif
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2262,9 +2262,11 @@ static int select_fallback_rq(int cpu, s
  * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags,
+		   int wake_flags, int waking_cpu)
 {
-	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags,
+						 waking_cpu);
 
 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -2335,12 +2337,14 @@ static inline void ttwu_post_activation(
 }
 
 #ifdef CONFIG_SMP
-static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags)
+static void ttwu_queue_wake_up(struct task_struct *p, int this_cpu, int p_cpu,
+			       int wake_flags)
 {
 	struct task_struct *next = NULL;
-	struct rq *rq = cpu_rq(cpu);
+	struct rq *rq = cpu_rq(p_cpu);
 
 	p->ttwu_queue_wake_flags = ...
Previous thread: [PATCH] replace idr in block layer SCSI generic (bsg) driver with kzalloced buffer by Hillf Danton on Thursday, December 16, 2010 - 8:01 am. (1 message)

Next thread: [PATCH 2/3] mbp_nvidia_bl: Check that the backlight control functions by Matthew Garrett on Thursday, December 16, 2010 - 8:13 am. (1 message)