Hello, This is the second take of cpu_stop. There have been only two minor changes from the last take[L]. * work_buf initialization in stop_one_cpu_nowait() simplified. * Comment to clarify active_balance_work synchronization added. Peter, if you ack the series, I'll add refresh the patches w/ your ACK and request pull into sched/core to Ingo. This patchset contains the following four patches. 0001-cpu_stop-implement-stop_cpu-s.patch 0002-stop_machine-reimplement-using-cpu_stop.patch 0003-scheduler-replace-migration_thread-with-cpu_stop.patch 0004-scheduler-kill-paranoia-check-in-synchronize_sched_e.patch The patches are against the current linux-2.6-tip/sched/core (99bd5e2f245d8cd17d040c82d40becdb3efd9b69) and are available in the following git tree. git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git cpu_stop diffstat follows. Documentation/RCU/torture.txt | 10 arch/s390/kernel/time.c | 1 drivers/xen/manage.c | 14 - include/linux/rcutiny.h | 2 include/linux/rcutree.h | 1 include/linux/stop_machine.h | 59 ++-- kernel/cpu.c | 8 kernel/module.c | 14 - kernel/rcutorture.c | 2 kernel/sched.c | 271 +++------------------ kernel/sched_fair.c | 48 ++- kernel/stop_machine.c | 530 ++++++++++++++++++++++++++++++++---------- 12 files changed, 525 insertions(+), 435 deletions(-) Thanks. -- tejun [L] http://thread.gmane.org/gmane.linux.kernel/976691 --
Currently migration_thread is serving three purposes - migration pusher, context to execute active_load_balance() and forced context switcher for expedited RCU synchronize_sched. All three roles are hardcoded into migration_thread() and determining which job is scheduled is slightly messy. This patch kills migration_thread and replaces all three uses with cpu_stop. The three different roles of migration_thread() are splitted into three separate cpu_stop callbacks - migration_cpu_stop(), active_load_balance_cpu_stop() and synchronize_sched_expedited_cpu_stop() - and each use case now simply asks cpu_stop to execute the callback as necessary. synchronize_sched_expedited() was implemented with private preallocated resources and custom multi-cpu queueing and waiting logic, both of which are provided by cpu_stop. synchronize_sched_expedited_count is made atomic and all other shared resources along with the mutex are dropped. synchronize_sched_expedited() also implemented a check to detect cases where not all the callback got executed on their assigned cpus and fall back to synchronize_sched(). If called with cpu hotplug blocked, cpu_stop already guarantees that and the condition cannot happen; otherwise, stop_machine() would break. However, this patch preserves the paranoid check using a cpumask to record on which cpus the stopper ran so that it can serve as a bisection point if something actually goes wrong theree. Because the internal execution state is no longer visible, rcu_expedited_torture_stats() is removed. This patch also renames cpu_stop threads to from "stopper/%d" to "migration/%d". The names of these threads ultimately don't matter and there's no reason to make unnecessary userland visible changes. With this patch applied, stop_machine() and sched now share the same resources. stop_machine() is faster without wasting any resources and sched migration users are much cleaner. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar ...
Nice!!! I do have one concern, though. Can the following sequence of events occur? o CPU 0 calls synchronize_sched_expedited(). o CPU 1 is already in cpu_stopper_thread(), perhaps being interrupted or NMIed just after returning from cpu_stop_signal_done() in cpu_stopper_thread(). o Therefore, when CPU 0 queues the work for CPU 1, CPU 1 loops right around and processes it. There will be no context switch on CPU 1. At first glance, this looks safe because: 1. Although there is no context switch, there (presumably) can be no RCU read-side critical sections on CPU 1 that span this sequence of events. (As far as I can see, any such RCU read-side critical section would be due to abuse of rcu_read_lock() and friends.) 2. CPU 1 will acquire and release stopper->lock, and further more will do an atomic_dec_and_test() in cpu_stop_signal_done(). The former is a weak guarantee, but the latter guarantees a memory barrier, so that any subsequent code on CPU 1 will be guaranteed to see changes on CPU 0 prior to the call to synchronize_sched_expedited(). The guarantee required is that there will be a full memory barrier on each affected CPU between the time that try_stop_cpus() is called and the time that it returns. Both guarantees (1) and (2) are absolutely required for us to expect that synchronize_sched_expedited() will operate correctly. If my reading is correct, could you please document guarantee (2) in the docbook header for try_stop_cpus()? Otherwise, a fix is needed. Or am I misunderstanding the code? --
Hello, AFAICS, this must hold; otherwise, synchronize_sched_expedited() wouldn't have worked in the first place. On entry to any cpu_stop Ah, right. I think it would be dangerous to depend on the implicit barriers there. It might work today but it can easily break with later implementation detail changes which seem completely unrelated. Adding smp_mb() in the cpu_stop function should suffice, right? It's not like the cost of smp_mb() there would mean anything anyway. Thank you. -- tejun --
Makes sense to me! The actual requirement is that, on each CPU, there must have been a context switch between the end of the last RCU read-side critical section and the end of a successful return from try_stop_cpus(). For CONFIG_TREE_PREEMPT_RCU, the guarantee required is a bit different: on each CPU, either that CPU must not have been in an RCU read-side critical section, or, if it was, there must have been a context switch between the time that CPU entered its RCU read-side critical section and the memory barrier executed within a successful try_stop_cpus(). As near as I can tell, the current implementation does meet these If I understand the code correctly, this would be very good! Thanx, Paul --
Currently migration_thread is serving three purposes - migration pusher, context to execute active_load_balance() and forced context switcher for expedited RCU synchronize_sched. All three roles are hardcoded into migration_thread() and determining which job is scheduled is slightly messy. This patch kills migration_thread and replaces all three uses with cpu_stop. The three different roles of migration_thread() are splitted into three separate cpu_stop callbacks - migration_cpu_stop(), active_load_balance_cpu_stop() and synchronize_sched_expedited_cpu_stop() - and each use case now simply asks cpu_stop to execute the callback as necessary. synchronize_sched_expedited() was implemented with private preallocated resources and custom multi-cpu queueing and waiting logic, both of which are provided by cpu_stop. synchronize_sched_expedited_count is made atomic and all other shared resources along with the mutex are dropped. synchronize_sched_expedited() also implemented a check to detect cases where not all the callback got executed on their assigned cpus and fall back to synchronize_sched(). If called with cpu hotplug blocked, cpu_stop already guarantees that and the condition cannot happen; otherwise, stop_machine() would break. However, this patch preserves the paranoid check using a cpumask to record on which cpus the stopper ran so that it can serve as a bisection point if something actually goes wrong theree. Because the internal execution state is no longer visible, rcu_expedited_torture_stats() is removed. This patch also renames cpu_stop threads to from "stopper/%d" to "migration/%d". The names of these threads ultimately don't matter and there's no reason to make unnecessary userland visible changes. With this patch applied, stop_machine() and sched now share the same resources. stop_machine() is faster without wasting any resources and sched migration users are much cleaner. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra ...
Almost. With the following patch, I am good with it.
Thanx, Paul
------------------------------------------------------------------------
commit ab3e147752b1edab4588abd7a66f2505b7b433ed
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Wed May 5 13:28:37 2010 -0700
sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
The memory barriers must be in the SMP case, not in the !SMP case.
Also add a barrier after the atomic_inc() in order to ensure that other
CPUs see post-synchronize_sched_expedited() actions as following the
expedited grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
diff --git a/kernel/sched.c b/kernel/sched.c
index e9c6d79..155a16d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8932,6 +8932,15 @@ struct cgroup_subsys cpuacct_subsys = {
void synchronize_sched_expedited(void)
{
+}
+EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
+
+#else /* #ifndef CONFIG_SMP */
+
+static atomic_t synchronize_sched_expedited_count = ATOMIC_INIT(0);
+
+static int synchronize_sched_expedited_cpu_stop(void *data)
+{
/*
* There must be a full memory barrier on each affected CPU
* between the time that try_stop_cpus() is called and the
@@ -8943,16 +8952,7 @@ void synchronize_sched_expedited(void)
* necessary. Do smp_mb() anyway for documentation and
* robustness against future implementation changes.
*/
- smp_mb();
-}
-EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
-
-#else /* #ifndef CONFIG_SMP */
-
-static atomic_t synchronize_sched_expedited_count = ATOMIC_INIT(0);
-
-static int synchronize_sched_expedited_cpu_stop(void *data)
-{
+ smp_mb(); /* See above comment block. */
return 0;
}
@@ -8990,6 +8990,7 @@ void synchronize_sched_expedited(void)
get_online_cpus();
}
atomic_inc(&synchronize_sched_expedited_count);
+ smp_mb__after_atomic_inc(); /* ensure post-GP actions seen after GP. */
...Ah, right. What was I thinking? I'll include your patch and push it to Ingo. Thanks. -- tejun --
And of course, I overdid it in my patch when I removed smp_mb() from
the !SMP version of synchronize_sched_expedited(). :-/
If synchronize_sched_expedited() is ever to be invoked from within
kernel/sched.c in a UP kernel, there needs to be a "barrier()" in the
!SMP version. So please see the following patch.
Thanx, Paul
commit 0eab52c0eeeb8507a83bdb37cc8f690b1c4f4a2c
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Thu May 6 11:40:17 2010 -0700
rcu: need barrier() in UP synchronize_sched_expedited()
If synchronize_sched_expedited() is ever to be called from within
kernel/sched.c in a !SMP PREEMPT kernel, the !SMP implementation needs
a barrier().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
diff --git a/kernel/sched.c b/kernel/sched.c
index 155a16d..fbaf312 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8932,6 +8932,7 @@ struct cgroup_subsys cpuacct_subsys = {
void synchronize_sched_expedited(void)
{
+ barrier();
}
EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
--
The paranoid check which verifies that the cpu_stop callback is
actually called on all online cpus is completely superflous. It's
guaranteed by cpu_stop facility and if it didn't work as advertised
other things would go horribly wrong and trying to recover using
synchronize_sched() wouldn't be very meaningful.
Kill the paranoid check. Removal of this feature is done as a
separate step so that it can serve as a bisection point if something
actually goes wrong.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
---
kernel/sched.c | 40 +++-------------------------------------
1 files changed, 3 insertions(+), 37 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 8a198f1..c65ebbd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8941,14 +8941,6 @@ static atomic_t synchronize_sched_expedited_count = ATOMIC_INIT(0);
static int synchronize_sched_expedited_cpu_stop(void *data)
{
- static DEFINE_SPINLOCK(done_mask_lock);
- struct cpumask *done_mask = data;
-
- if (done_mask) {
- spin_lock(&done_mask_lock);
- cpumask_set_cpu(smp_processor_id(), done_mask);
- spin_unlock(&done_mask_lock);
- }
return 0;
}
@@ -8964,55 +8956,29 @@ static int synchronize_sched_expedited_cpu_stop(void *data)
*/
void synchronize_sched_expedited(void)
{
- cpumask_var_t done_mask_var;
- struct cpumask *done_mask = NULL;
int snap, trycount = 0;
- /*
- * done_mask is used to check that all cpus actually have
- * finished running the stopper, which is guaranteed by
- * stop_cpus() if it's called with cpu hotplug blocked. Keep
- * the paranoia for now but it's best effort if cpumask is off
- * stack.
- */
- if (zalloc_cpumask_var(&done_mask_var, GFP_ATOMIC))
- done_mask = ...Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Thanks Tejun! --
Great, thank you. As soon as Paul's comment is resolved, I'll push it to Ingo. -- tejun --
