Hello, all. This patchset implements cpuhog which is a simplistic cpu monopolization mechanism and reimplements stop_machine() and replaces migration_thread with it. This allows stop_machine() to be simpler and much more efficient on very large machines without using more resources while also making the rather messy overloaded migration_thread usages cleaner. This should solve the slow boot problem[1] caused by repeated stop_machine workqueue creation/destruction reported by Dimitri Sivanich. The patchset is currently on top of v2.6.33 and contains the following patches. 0001-cpuhog-implement-cpuhog.patch 0002-stop_machine-reimplement-using-cpuhog.patch 0003-scheduler-replace-migration_thread-with-cpuhog.patch 0004-scheduler-kill-paranoia-check-in-synchronize_sched_e.patch 0001 implements cpuhog. 0002 converts stop_machine. 0003 converts migration users and 0004 removes paranoia checks in synchronize_sched_expedited(). 0004 is done separately so that 0003 can serve as a debug/bisection point. Tested cpu on/offlining, shutdown, all migration usage paths including RCU torture test at 0003 and 004 and everything seems to work fine here. Dimitri, can you please test whether this solves the problem you're seeing there? The tree is available in the following git tree. git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git cpuhog diffstat follows. Documentation/RCU/torture.txt | 10 - arch/s390/kernel/time.c | 1 drivers/xen/manage.c | 14 - include/linux/cpuhog.h | 24 ++ include/linux/rcutiny.h | 2 include/linux/rcutree.h | 1 include/linux/stop_machine.h | 20 -- kernel/Makefile | 2 kernel/cpu.c | 8 kernel/cpuhog.c | 362 ++++++++++++++++++++++++++++++++++++++++++ kernel/module.c | 14 - kernel/rcutorture.c | 2 kernel/sched.c | 327 +++++++++---------------------------- ...
Implement a simplistic per-cpu maximum priority cpu hogging mechanism
named cpuhog. A callback can be scheduled to run on one or multiple
cpus with maximum priority monopolozing those cpus. This is primarily
to replace and unify RT workqueue usage in stop_machine and scheduler
migration_thread which currently is serving multiple purposes.
Four functions are provided - hog_one_cpu(), hog_one_cpu_nowait(),
hog_cpus() and try_hog_cpus().
This is to allow clean sharing of resources among stop_cpu and all the
migration thread users. One cpuhog thread per cpu is created which is
currently named "hog/CPU". This will eventually replace the migration
thread and take on its name.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
---
include/linux/cpuhog.h | 24 +++
kernel/Makefile | 2 +-
kernel/cpuhog.c | 362 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 387 insertions(+), 1 deletions(-)
create mode 100644 include/linux/cpuhog.h
create mode 100644 kernel/cpuhog.c
diff --git a/include/linux/cpuhog.h b/include/linux/cpuhog.h
new file mode 100644
index 0000000..5252884
--- /dev/null
+++ b/include/linux/cpuhog.h
@@ -0,0 +1,24 @@
+/*
+ * linux/cpuhog.h - CPU hogs to monopolize CPUs
+ *
+ * Copyright (C) 2010 SUSE Linux Products GmbH
+ *
+ * This file is released under the GPLv2.
+ */
+#include <linux/cpumask.h>
+#include <linux/list.h>
+
+typedef int (*cpuhog_fn_t)(void *arg);
+
+struct cpuhog_work {
+ struct list_head list; /* cpuhog->works */
+ cpuhog_fn_t fn;
+ void *arg;
+ struct cpuhog_done *done;
+};
+
+int hog_one_cpu(unsigned int cpu, cpuhog_fn_t fn, void *arg);
+void hog_one_cpu_nowait(unsigned int cpu, cpuhog_fn_t fn, void *arg,
+ struct cpuhog_work *work_buf);
+int hog_cpus(const struct cpumask *cpumask, cpuhog_fn_t fn, void *arg);
+int try_hog_cpus(const struct cpumask *cpumask, cpuhog_fn_t fn, void *arg);
diff --git ...Heh. In no way I can ack (or even review) the changes in sched.c, but personally I like this idea. And I think cpuhog can have more users. Say, wait_task_context_switch() could use hog_one_cpu() to force the context switch instead of looping, perhaps. Is this really right? I mean, perhaps it makes more sense if ->executed was set only if _all_ CPUs from cpumask "ack" this call? I guess, currently this doesn't matter, stop_machine() uses cpu_online_mask and we can't race with hotplug. Oleg. --
Hello, Oleg. cpuhog itself doesn't care whether the cpus go on or offline and ensuring cpus stay online is the caller's responsibility. The return value check is mostly added for hog_one_cpu() users which is much more light weight than hog_cpus() and thus is much more likely to be used w/o disabling cpu hotplugging. For hog_cpus(), I wanted it not to care about cpu onliness itself. Requests will be executed for online ones while ignored for others, so the only exceptional one is the case where none gets executed. Thanks. -- tejun --
Currently migration_thread is serving three purposes - migration pusher, context to execute active_load_balance() and forced context switcher for expedited RCU synchronize_sched. All three roles are hardcoded into migration_thread() and determining which job is scheduled is slightly messy. This patch kills migration_thread and replaces all three uses with cpuhog. The three different roles of migration_thread() are splitted into three separate cpuhog callbacks - migration_hog(), active_load_balance_hog() and synchronize_sched_expedited_hog() - and each use case now simply asks cpuhog to execute the callback as necessary. synchronize_sched_expedited() was implemented with private preallocated resources and custom multi-cpu queueing and waiting logic, both of which are provided by cpuhog. synchronize_sched_expedited_count is made atomic and all other shared resources along with the mutex are dropped. synchronize_sched_expedited() also implemented a check to detect cases where not all the callback got executed on their assigned cpus and fall back to synchronize_sched(). If called with cpu hotplug blocked, cpuhog already guarantees that and the condition cannot happen; otherwise, stop_machine() would break. However, this patch preserves the paranoid check using a cpumask to record on which cpus the hog ran so that it can serve as a bisection point if something actually goes wrong theree. Because the internal execution state is no longer visible, rcu_expedited_torture_stats() is removed. This patch also renames cpuhog threads to from "hog/%d" to "migration/%d". The names of these threads ultimately don't matter and there's no reason to make unnecessary userland visible changes. With this patch applied, stop_machine() and sched now share the same resources. stop_machine() is faster without wasting any resources and sched migration users are much cleaner. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra ...
Reimplement stop_machine using cpuhog. As cpuhogs are guaranteed to be available for all online cpus, stop_machine_create/destroy() are no longer necessary and removed. With resource management and synchronization handled by cpuhog, the new implementation is much simpler. Asking the cpuhog to execute the stop_cpu() state machine on all online cpus with cpu hotplug disabled is enough. stop_machine itself doesn't need to manage any global resources anymore, so all per-instance information is rolled into struct stop_machine_data and the mutex and all static data variables are removed. The previous implementation created and destroyed RT workqueues as necessary which made stop_machine() calls highly expensive on very large machines. According to Dimitri Sivanich, preventing the dynamic creation/destruction makes booting faster more than twice on very large machines. cpuhog resources are preallocated for all online cpus and should have the same effect. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com> --- arch/s390/kernel/time.c | 1 - drivers/xen/manage.c | 14 +--- include/linux/stop_machine.h | 20 ----- kernel/cpu.c | 8 -- kernel/module.c | 14 +--- kernel/stop_machine.c | 162 ++++++++++-------------------------------- 6 files changed, 42 insertions(+), 177 deletions(-) diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c index 65065ac..afe429e 100644 --- a/arch/s390/kernel/time.c +++ b/arch/s390/kernel/time.c @@ -397,7 +397,6 @@ static void __init time_init_wq(void) if (time_sync_wq) return; time_sync_wq = create_singlethread_workqueue("timesync"); - stop_machine_create(); } /* diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c index 5d42d55..888114a 100644 --- a/drivers/xen/manage.c +++ b/drivers/xen/manage.c @@ -79,12 +79,6 @@ static void ...
On Tue, 9 Mar 2010 00:53:21 +0900 question; stop_machine pretty much also stops interrupts (it has to, at least for the users of stop_machine).. do cpu_hogs do this too? (if they don't, cpu_hogs aren't safe for some of the users of stop_machine, like code patching etc) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Hello, That is and has always been done by stop_machine cpuhog callback stop_cpu(). Only the thread pool is changed. The logic stays the same. Thanks. -- tejun --
The reason we introduced stop_machine_create/destroy was to have a non-failing variant that doesn't rely on I/O. If we ever see a timesync machine check no I/O will succeed (it blocks) until clocks have been synchronized. That means also that we rely on the non-blocking semantics that those functions must have that are called via stop_machine. This isn't true anymore with the cpu hog infrastructure: if passed a blocking function that could wait on I/O we won't see any progress anymore and the machine is dead. --
Could you please spell? How cpuhog can make a difference? Afaics, we shouldn't pass a blocking callback to hog_cpus/hog_one_cpu. Oleg. --
Well, it might me true that this shouldn't be done. But I don't see a reason why in general it wouldn't work to pass a function that would block. So it's just a matter of time until somebody uses it for such a purpose. For the current stop_machine implementation it would be broken to pass a blocking function (preemption disabled, interrupts disabled). --
Hello, Well, all current users don't block and it definitely can be enforced by turning off preemption around the callback. stop_machine() uses busy waiting for every state transition so something else blocking on a cpu could waste a lot of cpu cycles on other cpus even if the wait is guaranteed to be finite. Would that sooth your concern? Thanks. -- tejun --
Yes, enforcing non blocking functions would be good. Thanks! --
Alright, will refresh and post the second round. Thanks. -- tejun --
Could you please confirm this is correct? I am not sure I understand how the code looks with the patch applied, but the lockless set_state() above can confuse another stop_machine() in progress? Oleg. --
Hello, set_state() now modifies smdata->state and smdata is now local variable of __stop_machine(). stop_machine instances don't have any shared resource anymore. Synchronization and resource sharing are all cpuhog's responsibilities. Thanks. -- tejun --
The paranoid check which verifies that the hog callback is actually
called on all online cpus is completely superflous. It's guaranteed
by cpuhog facility and if it didn't work as advertised other things
would go horribly wrong and trying to recover using
synchronize_sched() wouldn't be very meaningful.
Kill the paranoid check. Removal of this feature is done as a
separate step so that it can serve as a bisection point if something
actually goes wrong.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
---
kernel/sched.c | 39 +++------------------------------------
1 files changed, 3 insertions(+), 36 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 900088d..600ab5d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10829,14 +10829,6 @@ static atomic_t synchronize_sched_expedited_count = ATOMIC_INIT(0);
static int synchronize_sched_expedited_hog(void *data)
{
- static DEFINE_SPINLOCK(done_mask_lock);
- struct cpumask *done_mask = data;
-
- if (done_mask) {
- spin_lock(&done_mask_lock);
- cpumask_set_cpu(smp_processor_id(), done_mask);
- spin_unlock(&done_mask_lock);
- }
return 0;
}
@@ -10852,53 +10844,28 @@ static int synchronize_sched_expedited_hog(void *data)
*/
void synchronize_sched_expedited(void)
{
- cpumask_var_t done_mask_var;
- struct cpumask *done_mask = NULL;
int snap, trycount = 0;
- /*
- * done_mask is used to check that all cpus actually have
- * finished running the hog, which is guaranteed by hog_cpus()
- * if it's called with cpu hotplug blocked. Keep the paranoia
- * for now but it's best effort if cpumask is off stack.
- */
- if (zalloc_cpumask_var(&done_mask_var, GFP_ATOMIC))
- done_mask = done_mask_var;
-
...cpuhog as a name doesn't work for me, stop-machine had a name that described its severity and impact, cpuhog makes me think of while(1);. Can't we keep the stop_machine name and make that a workqueue interface like you propose? That way we'd end up with something like: kernel/stop_machine.c int stop_cpu(int cpu, stop_fn_t fn, void *arg) int stop_machine(struct cpumask *mask, stop_fn_t fn, void *arg) alternatively, something like schedule_primary_work*() might work I guess. --
Hello, Peter. The distinction would be diabling of IRQ on each CPU. hog_[one_]cpu[s]() schedule highest priority task to, well, hog the cpu but doesn't affect contextless part of the cpu (irq, bh, whatnot). In that sense, it is the lowest bottom of upper half but not quite stopping the cpu and I think the distinction is rather important to make. With the proposed preemption disabling around the callback, it I wanted to avoid verbs associatffed with the traditional workqueue - schedule and queue, while emphasizing that this is something that you don't want to abuse - so the verb hog. monopolize_cpu() was the second choice but hog is shorter, sweeter and can also be used as a noun as-is, so I chose hog. So, those were my rationales. What do you think? Thanks. -- tejun --
I rather like the name. And the stop_machine name is still there; it's just using cpuhog rather than workqueues. Ugly things should have ugly names. For Patch 2/4 at least: Acked-by: Rusty Russell <rusty@rustcorp.com.au> Great work Tejun! Rusty. --
Its a pretty minor difference, shouldn't we simply audit all existing kstopmachine users and fix that up, having two similar but not quite Still don't like the name fwiw. --
Hello, Peter. Yeap, sure. I don't think naming one way or the other is a problem logistics-wise. These aren't very widely used APIs anyway. I've been thinking quite a while about it and visible interface like the following would probably fit your suggestion. * stop_cpu() - identical to hog_cpu() * stop_cpus() - identical to hog_cpus() * stop_machine() It's just that stop_cpu[s]() don't look like good names because they don't really stop cpus. This distinction is visible in implementation. stop_machine()'s per-cpu callback is currently named stop_cpu() and it adds quite a bit more restrictions on top of just hogging the cpu. To me, the following visible API seems better. * hog_cpu() * hog_cpus() * stop_machine() - uses stop_cpu() internally for implementation Oh well, I guess it's a matter of taste. Given that other people don't dislike the current naming too much, I'll try to push it forward to Ingo w/ your objection to naming noted. Thank you for reviewing. -- tejun --
