[PATCH 3/4 UPDATED] scheduler: replace migration_thread with cpu_stop

Previous thread: [PATCH 2/4] stop_machine: reimplement using cpu_stop by Tejun Heo on Tuesday, May 4, 2010 - 6:47 am. (1 message)

Next thread: Re: [PATCH 0/8] Suspend block api (version 6) by Alan Stern on Tuesday, May 4, 2010 - 6:51 am. (5 messages)
From: Tejun Heo
Date: Tuesday, May 4, 2010 - 6:47 am

Hello,

This is the second take of cpu_stop.  There have been only two minor
changes from the last take[L].

* work_buf initialization in stop_one_cpu_nowait() simplified.

* Comment to clarify active_balance_work synchronization added.

Peter, if you ack the series, I'll add refresh the patches w/ your ACK
and request pull into sched/core to Ingo.

This patchset contains the following four patches.

 0001-cpu_stop-implement-stop_cpu-s.patch
 0002-stop_machine-reimplement-using-cpu_stop.patch
 0003-scheduler-replace-migration_thread-with-cpu_stop.patch
 0004-scheduler-kill-paranoia-check-in-synchronize_sched_e.patch

The patches are against the current linux-2.6-tip/sched/core
(99bd5e2f245d8cd17d040c82d40becdb3efd9b69) and are available in the
following git tree.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git cpu_stop

diffstat follows.

 Documentation/RCU/torture.txt |   10 
 arch/s390/kernel/time.c       |    1 
 drivers/xen/manage.c          |   14 -
 include/linux/rcutiny.h       |    2 
 include/linux/rcutree.h       |    1 
 include/linux/stop_machine.h  |   59 ++--
 kernel/cpu.c                  |    8 
 kernel/module.c               |   14 -
 kernel/rcutorture.c           |    2 
 kernel/sched.c                |  271 +++------------------
 kernel/sched_fair.c           |   48 ++-
 kernel/stop_machine.c         |  530 ++++++++++++++++++++++++++++++++----------
 12 files changed, 525 insertions(+), 435 deletions(-)

Thanks.

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/976691
--

From: Tejun Heo
Date: Tuesday, May 4, 2010 - 6:47 am

Currently migration_thread is serving three purposes - migration
pusher, context to execute active_load_balance() and forced context
switcher for expedited RCU synchronize_sched.  All three roles are
hardcoded into migration_thread() and determining which job is
scheduled is slightly messy.

This patch kills migration_thread and replaces all three uses with
cpu_stop.  The three different roles of migration_thread() are
splitted into three separate cpu_stop callbacks -
migration_cpu_stop(), active_load_balance_cpu_stop() and
synchronize_sched_expedited_cpu_stop() - and each use case now simply
asks cpu_stop to execute the callback as necessary.

synchronize_sched_expedited() was implemented with private
preallocated resources and custom multi-cpu queueing and waiting
logic, both of which are provided by cpu_stop.
synchronize_sched_expedited_count is made atomic and all other shared
resources along with the mutex are dropped.

synchronize_sched_expedited() also implemented a check to detect cases
where not all the callback got executed on their assigned cpus and
fall back to synchronize_sched().  If called with cpu hotplug blocked,
cpu_stop already guarantees that and the condition cannot happen;
otherwise, stop_machine() would break.  However, this patch preserves
the paranoid check using a cpumask to record on which cpus the stopper
ran so that it can serve as a bisection point if something actually
goes wrong theree.

Because the internal execution state is no longer visible,
rcu_expedited_torture_stats() is removed.

This patch also renames cpu_stop threads to from "stopper/%d" to
"migration/%d".  The names of these threads ultimately don't matter
and there's no reason to make unnecessary userland visible changes.

With this patch applied, stop_machine() and sched now share the same
resources.  stop_machine() is faster without wasting any resources and
sched migration users are much cleaner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar ...
From: Paul E. McKenney
Date: Tuesday, May 4, 2010 - 6:33 pm

Nice!!!

I do have one concern, though.  Can the following sequence of events
occur?

o	CPU 0 calls synchronize_sched_expedited().

o	CPU 1 is already in cpu_stopper_thread(), perhaps being
	interrupted or NMIed just after returning from cpu_stop_signal_done()
	in cpu_stopper_thread().

o	Therefore, when CPU 0 queues the work for CPU 1, CPU 1
	loops right around and processes it.  There will be no
	context switch on CPU 1.

	At first glance, this looks safe because:

	1.	Although there is no context switch, there (presumably)
		can be no RCU read-side critical sections on CPU 1 that
		span this sequence of events.  (As far as I can see,
		any such RCU read-side critical section would be due
		to abuse of rcu_read_lock() and friends.)

	2.	CPU 1 will acquire and release stopper->lock, and
		further more will do an atomic_dec_and_test() in
		cpu_stop_signal_done().  The former is a weak
		guarantee, but the latter guarantees a memory
		barrier, so that any subsequent code on CPU 1 will
		be guaranteed to see changes on CPU 0 prior to the
		call to synchronize_sched_expedited().

		The guarantee required is that there will be a
		full memory barrier on each affected CPU between
		the time that try_stop_cpus() is called and the
		time that it returns.

Both guarantees (1) and (2) are absolutely required for us to expect
that synchronize_sched_expedited() will operate correctly.  If my
reading is correct, could you please document guarantee (2) in the
docbook header for try_stop_cpus()?  Otherwise, a fix is needed.

Or am I misunderstanding the code?

--

From: Tejun Heo
Date: Wednesday, May 5, 2010 - 12:28 am

Hello,



AFAICS, this must hold; otherwise, synchronize_sched_expedited()
wouldn't have worked in the first place.  On entry to any cpu_stop

Ah, right.  I think it would be dangerous to depend on the implicit
barriers there.  It might work today but it can easily break with
later implementation detail changes which seem completely unrelated.
Adding smp_mb() in the cpu_stop function should suffice, right?  It's
not like the cost of smp_mb() there would mean anything anyway.

Thank you.

-- 
tejun
--

From: Paul E. McKenney
Date: Wednesday, May 5, 2010 - 10:47 am

Makes sense to me!

The actual requirement is that, on each CPU, there must have been a
context switch between the end of the last RCU read-side critical
section and the end of a successful return from try_stop_cpus().

For CONFIG_TREE_PREEMPT_RCU, the guarantee required is a bit different:
on each CPU, either that CPU must not have been in an RCU read-side
critical section, or, if it was, there must have been a context switch
between the time that CPU entered its RCU read-side critical section
and the memory barrier executed within a successful try_stop_cpus().

As near as I can tell, the current implementation does meet these

If I understand the code correctly, this would be very good!

							Thanx, Paul
--

From: Tejun Heo
Date: Wednesday, May 5, 2010 - 11:10 am

Currently migration_thread is serving three purposes - migration
pusher, context to execute active_load_balance() and forced context
switcher for expedited RCU synchronize_sched.  All three roles are
hardcoded into migration_thread() and determining which job is
scheduled is slightly messy.

This patch kills migration_thread and replaces all three uses with
cpu_stop.  The three different roles of migration_thread() are
splitted into three separate cpu_stop callbacks -
migration_cpu_stop(), active_load_balance_cpu_stop() and
synchronize_sched_expedited_cpu_stop() - and each use case now simply
asks cpu_stop to execute the callback as necessary.

synchronize_sched_expedited() was implemented with private
preallocated resources and custom multi-cpu queueing and waiting
logic, both of which are provided by cpu_stop.
synchronize_sched_expedited_count is made atomic and all other shared
resources along with the mutex are dropped.

synchronize_sched_expedited() also implemented a check to detect cases
where not all the callback got executed on their assigned cpus and
fall back to synchronize_sched().  If called with cpu hotplug blocked,
cpu_stop already guarantees that and the condition cannot happen;
otherwise, stop_machine() would break.  However, this patch preserves
the paranoid check using a cpumask to record on which cpus the stopper
ran so that it can serve as a bisection point if something actually
goes wrong theree.

Because the internal execution state is no longer visible,
rcu_expedited_torture_stats() is removed.

This patch also renames cpu_stop threads to from "stopper/%d" to
"migration/%d".  The names of these threads ultimately don't matter
and there's no reason to make unnecessary userland visible changes.

With this patch applied, stop_machine() and sched now share the same
resources.  stop_machine() is faster without wasting any resources and
sched migration users are much cleaner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra ...
From: Paul E. McKenney
Date: Wednesday, May 5, 2010 - 1:31 pm

Almost.  With the following patch, I am good with it.

							Thanx, Paul

------------------------------------------------------------------------

commit ab3e147752b1edab4588abd7a66f2505b7b433ed
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Wed May 5 13:28:37 2010 -0700

    sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
    
    The memory barriers must be in the SMP case, not in the !SMP case.
    Also add a barrier after the atomic_inc() in order to ensure that other
    CPUs see post-synchronize_sched_expedited() actions as following the
    expedited grace period.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index e9c6d79..155a16d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8932,6 +8932,15 @@ struct cgroup_subsys cpuacct_subsys = {
 
 void synchronize_sched_expedited(void)
 {
+}
+EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
+
+#else /* #ifndef CONFIG_SMP */
+
+static atomic_t synchronize_sched_expedited_count = ATOMIC_INIT(0);
+
+static int synchronize_sched_expedited_cpu_stop(void *data)
+{
 	/*
 	 * There must be a full memory barrier on each affected CPU
 	 * between the time that try_stop_cpus() is called and the
@@ -8943,16 +8952,7 @@ void synchronize_sched_expedited(void)
 	 * necessary.  Do smp_mb() anyway for documentation and
 	 * robustness against future implementation changes.
 	 */
-	smp_mb();
-}
-EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
-
-#else /* #ifndef CONFIG_SMP */
-
-static atomic_t synchronize_sched_expedited_count = ATOMIC_INIT(0);
-
-static int synchronize_sched_expedited_cpu_stop(void *data)
-{
+	smp_mb(); /* See above comment block. */
 	return 0;
 }
 
@@ -8990,6 +8990,7 @@ void synchronize_sched_expedited(void)
 		get_online_cpus();
 	}
 	atomic_inc(&synchronize_sched_expedited_count);
+	smp_mb__after_atomic_inc(); /* ensure post-GP actions seen after GP. */
 ...
From: Tejun Heo
Date: Thursday, May 6, 2010 - 9:30 am

Ah, right.  What was I thinking?  I'll include your patch and push it
to Ingo.  Thanks.

-- 
tejun
--

From: Paul E. McKenney
Date: Thursday, May 6, 2010 - 11:42 am

And of course, I overdid it in my patch when I removed smp_mb() from
the !SMP version of synchronize_sched_expedited().  :-/

If synchronize_sched_expedited() is ever to be invoked from within
kernel/sched.c in a UP kernel, there needs to be a "barrier()" in the
!SMP version.  So please see the following patch.

							Thanx, Paul

commit 0eab52c0eeeb8507a83bdb37cc8f690b1c4f4a2c
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Thu May 6 11:40:17 2010 -0700

    rcu: need barrier() in UP synchronize_sched_expedited()
    
    If synchronize_sched_expedited() is ever to be called from within
    kernel/sched.c in a !SMP PREEMPT kernel, the !SMP implementation needs
    a barrier().
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index 155a16d..fbaf312 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8932,6 +8932,7 @@ struct cgroup_subsys cpuacct_subsys = {
 
 void synchronize_sched_expedited(void)
 {
+	barrier();
 }
 EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
 
--

From: Tejun Heo
Date: Thursday, May 6, 2010 - 10:24 pm

Thanks.  Pushed.

-- 
tejun
--

From: Tejun Heo
Date: Tuesday, May 4, 2010 - 6:47 am

The paranoid check which verifies that the cpu_stop callback is
actually called on all online cpus is completely superflous.  It's
guaranteed by cpu_stop facility and if it didn't work as advertised
other things would go horribly wrong and trying to recover using
synchronize_sched() wouldn't be very meaningful.

Kill the paranoid check.  Removal of this feature is done as a
separate step so that it can serve as a bisection point if something
actually goes wrong.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
---
 kernel/sched.c |   40 +++-------------------------------------
 1 files changed, 3 insertions(+), 37 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8a198f1..c65ebbd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8941,14 +8941,6 @@ static atomic_t synchronize_sched_expedited_count = ATOMIC_INIT(0);
 
 static int synchronize_sched_expedited_cpu_stop(void *data)
 {
-	static DEFINE_SPINLOCK(done_mask_lock);
-	struct cpumask *done_mask = data;
-
-	if (done_mask) {
-		spin_lock(&done_mask_lock);
-		cpumask_set_cpu(smp_processor_id(), done_mask);
-		spin_unlock(&done_mask_lock);
-	}
 	return 0;
 }
 
@@ -8964,55 +8956,29 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
  */
 void synchronize_sched_expedited(void)
 {
-	cpumask_var_t done_mask_var;
-	struct cpumask *done_mask = NULL;
 	int snap, trycount = 0;
 
-	/*
-	 * done_mask is used to check that all cpus actually have
-	 * finished running the stopper, which is guaranteed by
-	 * stop_cpus() if it's called with cpu hotplug blocked.  Keep
-	 * the paranoia for now but it's best effort if cpumask is off
-	 * stack.
-	 */
-	if (zalloc_cpumask_var(&done_mask_var, GFP_ATOMIC))
-		done_mask = ...
From: Peter Zijlstra
Date: Tuesday, May 4, 2010 - 11:52 am

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Thanks Tejun!

--

From: Tejun Heo
Date: Wednesday, May 5, 2010 - 12:30 am

Great, thank you.  As soon as Paul's comment is resolved, I'll push it
to Ingo.

-- 
tejun
--

Previous thread: [PATCH 2/4] stop_machine: reimplement using cpu_stop by Tejun Heo on Tuesday, May 4, 2010 - 6:47 am. (1 message)

Next thread: Re: [PATCH 0/8] Suspend block api (version 6) by Alan Stern on Tuesday, May 4, 2010 - 6:51 am. (5 messages)