Re: [PATCH RFC] reduce runqueue lock contention

Previous thread: [PATCH]pacmcia:yena_socket.c Remove extra #ifdef CONFIG_YENTA_TI by Justin P. Mattock on Thursday, May 20, 2010 - 1:40 pm. (3 messages)

Next thread: [PATCH 1/2] IPS driver: add GPU busy and turbo checking by Jesse Barnes on Thursday, May 20, 2010 - 2:27 pm. (1 message)
From: Chris Mason
Date: Thursday, May 20, 2010 - 1:48 pm

This is more of a starting point than a patch, but it is something I've
been meaning to look at for a long time.  Many different workloads end
up hammering very hard on try_to_wake_up, to the point where the
runqueue locks dominate CPU profiles.

This includes benchmarking really fast IO subsystems, fast networking,
fast pipes...well anywhere that we lots of procs on lots of cpus waiting
for short periods on lots of different things.

I think we probably want to add a way to wait just for a little while as
a more lightweight operation (less balancing etc) but this patch doesn't
do that.  It only tries to make the act of waking someone up less
expensive by avoiding the runqueue lock when we're on a different CPU
than the process we want to wake.

I do this with a per-runqueue list and some cmpxchg tricks.  Basically
all the other CPUs will toss a given process they want to wakeup onto
the destination per-runqueue list.  Later on, when that runqueue is
finding a new task to run, it processes the list in bulk.

From a performance point of view, this adds about 10% to a TPC
simulation on a large numa machine.  It doubles the speed of my IPC semaphore
benchmark (which really just hammers on wakeup in a loop).

This is probably broken in less than subtle ways, so it includes a new
schedule feature to (FAST_WAKEUP) to control the new mode.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..9c2f188 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1187,6 +1187,8 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+	struct task_struct *fast_wakeup_next;
+	unsigned long fast_wakeup_mode;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
diff --git a/kernel/sched.c b/kernel/sched.c
index 3c2a54f..3667e89 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -530,6 +530,7 @@ struct rq {
 	unsigned long nr_uninterruptible;
 
 	struct task_struct ...
From: Peter Zijlstra
Date: Thursday, May 20, 2010 - 2:09 pm

Right, so one of the things that I considered was to make p->state an
atomic_t and replace the initial stage of try_to_wake_up() with
something like:

int try_to_wake_up(struct task *p, unsigned int mask, wake_flags)
{
  int state = atomic_read(&p->state);

  do {
    if (!(state & mask))
      return 0;

    state = atomic_cmpxchg(&p->state, state, TASK_WAKING);
  } while (state != TASK_WAKING);

  /* do this pending queue + ipi thing */

  return 1;
}

Also, I think we might want to put that atomic single linked list thing
into some header (using atomic_long_t or so), because I have a similar
thing living in kernel/perf_event.c, that needs to queue things from NMI
context.

The advantage of doing basically the whole enqueue on the remote cpu is
less cacheline bouncing of the runqueue structures.

--

From: Peter Zijlstra
Date: Thursday, May 20, 2010 - 2:23 pm

cpu = select_task_rq()



--

From: Chris Mason
Date: Thursday, May 20, 2010 - 3:17 pm

I tried not to set the task waking, since I was worried about races with
us getting queued somewhere else.  But, I don't have a good handle on
all of that so I kind of chickened out.  That's why my code falls back
to the full ttwu in a few cases.

Do you think the above could be an addition to my patch or that it's
required for my patch to work well?

-chris

--

From: Chris Mason
Date: Thursday, May 20, 2010 - 3:21 pm

So I've done three of these cmpxchg lists recently...but they have all
been a little different.  I went back and forth a bunch of times about
using a list_head based thing instead to avoid the walk for list append.
I really don't like the walk.

But, what makes this one unique is that I'm using a cmpxchg on the list
pointer in the in task struct to take ownership of this task struct.
It is how I avoid concurrent lockless enqueues.

Your fiddling with the p->state above would let me avoid that.

-chris
--

From: Stijn Devriendt
Date: Friday, June 4, 2010 - 3:56 am

I actually have the reverse lying around somewhere (even more broken, probably)
to allow nested wakeups from the scheduler.

The issue I want to tackle is waking up processes when others go to sleep.
This means try_to_wake_up() from inside the runqueue lock.

I used a simple per_cpu taskqueue where tasks are put on when waking
during schedule(). At the end of schedule() I empty the list and reschedule
as one of the newly woken tasks may be higher prio.

I'm wondering if both approaches can be merged by checking this list before
and after every schedule().

Stijn
--

From: Peter Zijlstra
Date: Friday, June 4, 2010 - 5:00 am

Tejun did something for that:

  http://lkml.org/lkml/2010/5/13/136
  

--

From: Stijn Devriendt
Date: Saturday, June 5, 2010 - 2:37 am

The difference with what I have is that Tejun's threads are guaranteed
to be local
and on the same CPU/runqueue avoiding runqueue deadlocks. My code can be
used to wake up any thread as I basically defer waking up untill after
the runqueue
lock is released.
--

From: Peter Zijlstra
Date: Monday, June 21, 2010 - 3:02 am

To return the favour, here's a scary patch that renders your machine a
doorstop :-) Sadly it just hangs, no splat, no sysrq, no nmi, no tripple
fault.

It looses the ttwu task_running() check, as I must admit I'm not quite
sure what it does.. Ingo?

It makes the whole TASK_WAKING thing very interesting again :-)

It also re-introduces a bunch of races because we now need to run
->select_task_rq() without holding the rq->lock. We cannot defer running
that because it really wants to know the cpu the wakeup originated from
(affine wakeups and the like).


---
 arch/x86/kernel/smp.c   |    1 +
 include/linux/sched.h   |    7 +-
 kernel/sched.c          |  225 ++++++++++++++++++++++++-----------------------
 kernel/sched_fair.c     |   10 +--
 kernel/sched_idletask.c |    2 +-
 kernel/sched_rt.c       |    4 +-
 6 files changed, 127 insertions(+), 122 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 }
 #endif	/* !CONFIG_SMP */
 
+void sched_ttwu_pending(void);
 
 struct io_context;			/* See blkdev.h */
 
@@ -1046,12 +1047,11 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) ...
From: Peter Zijlstra
Date: Monday, June 21, 2010 - 3:54 am

Progress, this one gave spectacular fireworks ;-)

---
 arch/x86/kernel/smp.c   |    1 +
 include/linux/sched.h   |    7 +-
 kernel/sched.c          |  247 +++++++++++++++++++++++++----------------------
 kernel/sched_fair.c     |   10 +-
 kernel/sched_idletask.c |    2 +-
 kernel/sched_rt.c       |    4 +-
 6 files changed, 143 insertions(+), 128 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 }
 #endif	/* !CONFIG_SMP */
 
+void sched_ttwu_pending(void);
 
 struct io_context;			/* See blkdev.h */
 
@@ -1046,12 +1047,11 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
-	void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
 
 	void (*set_cpus_allowed)(struct task_struct *p,
@@ -1174,6 +1174,7 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */
 
 #ifdef CONFIG_SMP
+	struct task_struct *wake_entry;
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
 	int oncpu;
 #endif
diff --git a/kernel/sched.c ...
From: Peter Zijlstra
Date: Monday, June 21, 2010 - 6:04 am

I think I figured out what its for, its for when p is prev in schedule()
after deactivate_task(), we have to call activate_task() it again, but
we cannot migrate the task because the CPU its on is still referencing
it.

I added it back, but still fireworks :-)
--

From: Peter Zijlstra
Date: Tuesday, June 22, 2010 - 6:33 am

So this one boots and builds a kernel on a dual-socket nehalem.

there's still quite a number of XXXs to fix, but I don't think any of
the races are crashing potential, mostly wrong accounting and scheduling
iffies like.

But give it a go.. see what it does for you (x86 only for now).

Ingo, any comments other than, eew, scary? :-)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/smp.c   |    1 +
 include/linux/sched.h   |    7 +-
 kernel/sched.c          |  271 ++++++++++++++++++++++++++++++++---------------
 kernel/sched_fair.c     |   10 +-
 kernel/sched_idletask.c |    2 +-
 kernel/sched_rt.c       |    4 +-
 6 files changed, 198 insertions(+), 97 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }
 
 void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 }
 #endif	/* !CONFIG_SMP */
 
+void sched_ttwu_pending(void);
 
 struct io_context;			/* See blkdev.h */
 
@@ -1046,12 +1047,11 @@ struct sched_class {
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
-	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
-	void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct ...
From: Ingo Molnar
Date: Tuesday, June 22, 2010 - 2:11 pm

None, other than a question: which future kernel do you aim it for? I'd prefer 
v2.6.50 or later ;-)

This is a truly scary patch.

	Ingo
--

From: Peter Zijlstra
Date: Wednesday, June 23, 2010 - 2:10 am

Well, assuming it all works out and actually reduces runqueue lock
contention we still need to sort out all those XXXs in there, I'd say at
the soonest somewhere near .38/.39 or so.

Its definitely not something we should rush in.
--

From: Frank Rowand
Date: Wednesday, December 1, 2010 - 4:13 pm

This thread was started by Chris Mason back in May:
   http://lkml.indiana.edu/hypermail/linux/kernel/1005.2/02329.html


Chris provided some code as a starting point for a solution.

Peter Zijlstra had some good ideas, and came up with some alternate code,
culminating with:

   http://lkml.indiana.edu/hypermail/linux/kernel/1006.2/02381.html

Building on this previous work, I have another patch to try to address
the problem.  I have taken some of Peter's code (the cmpxchg() based
queueing and unqueueing, plus the cross cpu interrupt), but taken a
simpler (and hopefully less scary) approach otherwise:

  If the task to be woken is on a run queue on a different cpu then use
  cmpxchg() to put it onto a pending try_to_wake_up list on the different
  cpu.  Then send an interrupt to the different cpu to cause that cpu to
  call try_to_wake_up() for each process on the try_to_wake_up list.

  The result is that the initial run queue lock acquired by try_to_wake_up()
  will be on the cpu we are currently executing on, not a different cpu.

  This patch does not address the run queue lock contention that may occur
  if try_to_wake_up() chooses to move the waking process to another cpu,
  based on the result returned by select_task_rq().

  The patch was created on the -top tree.

Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>


Chris, can you check the performance of this patch on your large system?


Limitations
  x86 only

Tests
  - tested on 2 cpu x86_64
  - very simplistic workload
  - results:
     rq->lock contention count reduced by ~ 95%
     rq->lock contention wait time reduced by ~ 90%
     test duration reduced by ~ 0.5% - 4% (in the noise)

Review goals:
  (1) performance results
  (2) architectural comments

Review non-goal:
  code style, etc (but will be a goal in a future review round)

Todo:
  - add support for additional architectures
  - polish code style
  - add a schedule feature to control whether to use the new algorithm
  - ...
From: Frank Rowand
Date: Wednesday, December 1, 2010 - 6:17 pm

On 12/01/10 15:13, Frank Rowand wrote:


Fat-fingered typing.  That should be the -tip tree....

-Frank

--

From: Peter Zijlstra
Date: Thursday, December 2, 2010 - 12:36 am

Without having looked at the actual code, the described thing cannot
work, try_to_wake_up() has a return value that needs to be passed back.

Also, try_to_wake_up() does load-balancing, you really want to do that
before queueing it on a remote cpu.
--

From: Frank Rowand
Date: Monday, December 13, 2010 - 7:41 pm

I have not been able to make sense of the task_running() check in
try_to_wake_up(), even with that clue.  The try_to_wake_up() code in
question is:

        rq = task_rq_lock(p, &flags);
        if (!(p->state & state))
                goto out;

        if (p->se.on_rq)
                goto out_running;

        cpu = task_cpu(p);
        orig_cpu = cpu;

#ifdef CONFIG_SMP
        if (unlikely(task_running(rq, p)))
                goto out_activate;


The relevent code in schedule() executes with the rq lock held (many
lines left out to emphasize the key lines):

        raw_spin_lock_irq(&rq->lock);

        if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

                       deactivate_task(rq, prev, DEQUEUE_SLEEP);

        if (likely(prev != next)) {
                rq->curr = next;
                context_switch(rq, prev, next); /* unlocks the rq */
        } else
                raw_spin_unlock_irq(&rq->lock);


If (p->se.on_rq) can becomes false due to deactivate_task()
then task_running() will also become false while the rq lock is still
held (either via "rq->curr = next" or via context_switch() updating
p->oncpu -- which one matters depends on #ifdef __ARCH_WANT_UNLOCKED_CTXSW).

I  haven't been able to find any case where task_running() can be true
when (p->se.on_rq) is false, while the rq lock is not being held.  Thus
I don't think the branch to out_activate will ever be taken.

What am I missing, or is the task_running() test not needed?

Thanks!

Frank

--

From: Mike Galbraith
Date: Monday, December 13, 2010 - 8:42 pm

Say the last runnable task passes through schedule(), is deactivated.
We hit idle_balance(), which drops/retakes rq->lock _before_ the task
schedules off.  ttwu() can acquire rq->lock in that window, p->se.on_rq
is false, p->state is true, as is task_current(rq, p).

We have to check whether the task is still current, but not enqueued,
lest the wakeup be a noop, and the wakee possibly then sleep forever.

	-Mike

--

From: Frank Rowand
Date: Tuesday, December 14, 2010 - 2:42 pm

}
           pre_schedule(rq, prev);
           if (unlikely(!rq->nr_running))

Thanks Mike, that's just the cluebat I needed!

And for the lkml archives, in case anyone has this question again in the
future, with Mike's clue in hand I found another case in this window where
the rq->lock can be dropped then reacquired.  Just before idle_balance()
is called, pre_schedule() is called:

pre_schedule()
   prev->sched_class->pre_schedule(rq, prev)
   [pre_schedule_rt()]
      pull_rt_task(rq)
      pull_rt_task[this_rq]
         for_each_cpu(cpu, this_rq->rd->rto_mask)
            double_lock_balance(this_rq, src_rq)
               raw_spin_unlock(&this_rq->lock)        <-----
               double_rq_lock(this_rq, busiest)

-Frank

--

From: Oleg Nesterov
Date: Wednesday, December 15, 2010 - 11:59 am

I am afraid I can misuderstood this all, including the question ;)

But, with __ARCH_WANT_UNLOCKED_CTXSW task_running() checks ->oncpu.
However, schedule() drops rq->lock after prev was deactivated but
before it clears prev->oncpu.

Oleg.

--

Previous thread: [PATCH]pacmcia:yena_socket.c Remove extra #ifdef CONFIG_YENTA_TI by Justin P. Mattock on Thursday, May 20, 2010 - 1:40 pm. (3 messages)

Next thread: [PATCH 1/2] IPS driver: add GPU busy and turbo checking by Jesse Barnes on Thursday, May 20, 2010 - 2:27 pm. (1 message)