Re: [PATCHSET] cpuhog: implement and use cpuhog

Previous thread: [PATCH] fs, binfmt_aout: Fix pointer warnings by Borislav Petkov on Monday, March 8, 2010 - 8:46 am. (2 messages)

Next thread: [PATCH] asm-generic/io.h: add big endian versions of io{read,write}{16,32} by Mike Frysinger on Monday, March 8, 2010 - 9:34 am. (2 messages)
From: Tejun Heo
Date: Monday, March 8, 2010 - 8:53 am

Hello, all.

This patchset implements cpuhog which is a simplistic cpu
monopolization mechanism and reimplements stop_machine() and replaces
migration_thread with it.

This allows stop_machine() to be simpler and much more efficient on
very large machines without using more resources while also making the
rather messy overloaded migration_thread usages cleaner.

This should solve the slow boot problem[1] caused by repeated
stop_machine workqueue creation/destruction reported by Dimitri
Sivanich.

The patchset is currently on top of v2.6.33 and contains the following
patches.

 0001-cpuhog-implement-cpuhog.patch
 0002-stop_machine-reimplement-using-cpuhog.patch
 0003-scheduler-replace-migration_thread-with-cpuhog.patch
 0004-scheduler-kill-paranoia-check-in-synchronize_sched_e.patch

0001 implements cpuhog.  0002 converts stop_machine.  0003 converts
migration users and 0004 removes paranoia checks in
synchronize_sched_expedited().  0004 is done separately so that 0003
can serve as a debug/bisection point.

Tested cpu on/offlining, shutdown, all migration usage paths including
RCU torture test at 0003 and 004 and everything seems to work fine
here.  Dimitri, can you please test whether this solves the problem
you're seeing there?

The tree is available in the following git tree.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git cpuhog

diffstat follows.

 Documentation/RCU/torture.txt |   10 -
 arch/s390/kernel/time.c       |    1 
 drivers/xen/manage.c          |   14 -
 include/linux/cpuhog.h        |   24 ++
 include/linux/rcutiny.h       |    2 
 include/linux/rcutree.h       |    1 
 include/linux/stop_machine.h  |   20 --
 kernel/Makefile               |    2 
 kernel/cpu.c                  |    8 
 kernel/cpuhog.c               |  362 ++++++++++++++++++++++++++++++++++++++++++
 kernel/module.c               |   14 -
 kernel/rcutorture.c           |    2 
 kernel/sched.c                |  327 +++++++++----------------------------
 ...
From: Tejun Heo
Date: Monday, March 8, 2010 - 8:53 am

Implement a simplistic per-cpu maximum priority cpu hogging mechanism
named cpuhog.  A callback can be scheduled to run on one or multiple
cpus with maximum priority monopolozing those cpus.  This is primarily
to replace and unify RT workqueue usage in stop_machine and scheduler
migration_thread which currently is serving multiple purposes.

Four functions are provided - hog_one_cpu(), hog_one_cpu_nowait(),
hog_cpus() and try_hog_cpus().

This is to allow clean sharing of resources among stop_cpu and all the
migration thread users.  One cpuhog thread per cpu is created which is
currently named "hog/CPU".  This will eventually replace the migration
thread and take on its name.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
---
 include/linux/cpuhog.h |   24 +++
 kernel/Makefile        |    2 +-
 kernel/cpuhog.c        |  362 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 387 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/cpuhog.h
 create mode 100644 kernel/cpuhog.c

diff --git a/include/linux/cpuhog.h b/include/linux/cpuhog.h
new file mode 100644
index 0000000..5252884
--- /dev/null
+++ b/include/linux/cpuhog.h
@@ -0,0 +1,24 @@
+/*
+ * linux/cpuhog.h - CPU hogs to monopolize CPUs
+ *
+ * Copyright (C) 2010		SUSE Linux Products GmbH
+ *
+ * This file is released under the GPLv2.
+ */
+#include <linux/cpumask.h>
+#include <linux/list.h>
+
+typedef int (*cpuhog_fn_t)(void *arg);
+
+struct cpuhog_work {
+	struct list_head	list;		/* cpuhog->works */
+	cpuhog_fn_t		fn;
+	void			*arg;
+	struct cpuhog_done	*done;
+};
+
+int hog_one_cpu(unsigned int cpu, cpuhog_fn_t fn, void *arg);
+void hog_one_cpu_nowait(unsigned int cpu, cpuhog_fn_t fn, void *arg,
+			struct cpuhog_work *work_buf);
+int hog_cpus(const struct cpumask *cpumask, cpuhog_fn_t fn, void *arg);
+int try_hog_cpus(const struct cpumask *cpumask, cpuhog_fn_t fn, void *arg);
diff --git ...
From: Oleg Nesterov
Date: Monday, March 8, 2010 - 12:01 pm

Heh. In no way I can ack (or even review) the changes in sched.c, but
personally I like this idea.

And I think cpuhog can have more users. Say, wait_task_context_switch()
could use hog_one_cpu() to force the context switch instead of looping,
perhaps.



Is this really right?

I mean, perhaps it makes more sense if ->executed was set only if _all_
CPUs from cpumask "ack" this call?


I guess, currently this doesn't matter, stop_machine() uses cpu_online_mask
and we can't race with hotplug.

Oleg.

--

From: Tejun Heo
Date: Monday, March 8, 2010 - 4:18 pm

Hello, Oleg.


cpuhog itself doesn't care whether the cpus go on or offline and
ensuring cpus stay online is the caller's responsibility.  The return
value check is mostly added for hog_one_cpu() users which is much more
light weight than hog_cpus() and thus is much more likely to be used
w/o disabling cpu hotplugging.  For hog_cpus(), I wanted it not to
care about cpu onliness itself.  Requests will be executed for online
ones while ignored for others, so the only exceptional one is the case
where none gets executed.

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Monday, March 8, 2010 - 8:53 am

Currently migration_thread is serving three purposes - migration
pusher, context to execute active_load_balance() and forced context
switcher for expedited RCU synchronize_sched.  All three roles are
hardcoded into migration_thread() and determining which job is
scheduled is slightly messy.

This patch kills migration_thread and replaces all three uses with
cpuhog.  The three different roles of migration_thread() are splitted
into three separate cpuhog callbacks - migration_hog(),
active_load_balance_hog() and synchronize_sched_expedited_hog() - and
each use case now simply asks cpuhog to execute the callback as
necessary.

synchronize_sched_expedited() was implemented with private
preallocated resources and custom multi-cpu queueing and waiting
logic, both of which are provided by cpuhog.
synchronize_sched_expedited_count is made atomic and all other shared
resources along with the mutex are dropped.

synchronize_sched_expedited() also implemented a check to detect cases
where not all the callback got executed on their assigned cpus and
fall back to synchronize_sched().  If called with cpu hotplug blocked,
cpuhog already guarantees that and the condition cannot happen;
otherwise, stop_machine() would break.  However, this patch preserves
the paranoid check using a cpumask to record on which cpus the hog ran
so that it can serve as a bisection point if something actually goes
wrong theree.

Because the internal execution state is no longer visible,
rcu_expedited_torture_stats() is removed.

This patch also renames cpuhog threads to from "hog/%d" to
"migration/%d".  The names of these threads ultimately don't matter
and there's no reason to make unnecessary userland visible changes.

With this patch applied, stop_machine() and sched now share the same
resources.  stop_machine() is faster without wasting any resources and
sched migration users are much cleaner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra ...
From: Tejun Heo
Date: Monday, March 8, 2010 - 8:53 am

Reimplement stop_machine using cpuhog.  As cpuhogs are guaranteed to
be available for all online cpus, stop_machine_create/destroy() are no
longer necessary and removed.

With resource management and synchronization handled by cpuhog, the
new implementation is much simpler.  Asking the cpuhog to execute the
stop_cpu() state machine on all online cpus with cpu hotplug disabled
is enough.

stop_machine itself doesn't need to manage any global resources
anymore, so all per-instance information is rolled into struct
stop_machine_data and the mutex and all static data variables are
removed.

The previous implementation created and destroyed RT workqueues as
necessary which made stop_machine() calls highly expensive on very
large machines.  According to Dimitri Sivanich, preventing the dynamic
creation/destruction makes booting faster more than twice on very
large machines.  cpuhog resources are preallocated for all online cpus
and should have the same effect.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
---
 arch/s390/kernel/time.c      |    1 -
 drivers/xen/manage.c         |   14 +---
 include/linux/stop_machine.h |   20 -----
 kernel/cpu.c                 |    8 --
 kernel/module.c              |   14 +---
 kernel/stop_machine.c        |  162 ++++++++++--------------------------------
 6 files changed, 42 insertions(+), 177 deletions(-)

diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index 65065ac..afe429e 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -397,7 +397,6 @@ static void __init time_init_wq(void)
 	if (time_sync_wq)
 		return;
 	time_sync_wq = create_singlethread_workqueue("timesync");
-	stop_machine_create();
 }
 
 /*
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 5d42d55..888114a 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -79,12 +79,6 @@ static void ...
From: Arjan van de Ven
Date: Monday, March 8, 2010 - 9:32 am

On Tue,  9 Mar 2010 00:53:21 +0900

question;
stop_machine pretty much also stops interrupts (it has to, at least for
the users of stop_machine).. do cpu_hogs do this too?

(if they don't, cpu_hogs aren't safe for some of the users of
stop_machine, like code patching etc)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Tejun Heo
Date: Monday, March 8, 2010 - 4:21 pm

Hello,


That is and has always been done by stop_machine cpuhog callback
stop_cpu().  Only the thread pool is changed.  The logic stays the
same.

Thanks.

-- 
tejun
--

From: Heiko Carstens
Date: Monday, March 8, 2010 - 10:10 am

The reason we introduced stop_machine_create/destroy was to have a non-failing
variant that doesn't rely on I/O.
If we ever see a timesync machine check no I/O will succeed (it blocks) until
clocks have been synchronized. That means also that we rely on the non-blocking
semantics that those functions must have that are called via stop_machine.
This isn't true anymore with the cpu hog infrastructure:
if passed a blocking function that could wait on I/O we won't see any progress
anymore and the machine is dead.
--

From: Oleg Nesterov
Date: Monday, March 8, 2010 - 11:27 am

Could you please spell?

How cpuhog can make a difference? Afaics, we shouldn't pass a blocking callback
to hog_cpus/hog_one_cpu.

Oleg.

--

From: Heiko Carstens
Date: Monday, March 8, 2010 - 12:37 pm

Well, it might me true that this shouldn't be done. But I don't see a reason why
in general it wouldn't work to pass a function that would block.
So it's just a matter of time until somebody uses it for such a purpose.
For the current stop_machine implementation it would be broken to pass a blocking
function (preemption disabled, interrupts disabled).
--

From: Tejun Heo
Date: Monday, March 8, 2010 - 4:39 pm

Hello,


Well, all current users don't block and it definitely can be enforced
by turning off preemption around the callback.  stop_machine() uses
busy waiting for every state transition so something else blocking on
a cpu could waste a lot of cpu cycles on other cpus even if the wait
is guaranteed to be finite.  Would that sooth your concern?

Thanks.

-- 
tejun
--

From: Heiko Carstens
Date: Tuesday, March 9, 2010 - 12:09 am

Yes, enforcing non blocking functions would be good.

Thanks!
--

From: Tejun Heo
Date: Tuesday, March 9, 2010 - 12:16 am

Alright, will refresh and post the second round.

Thanks.

-- 
tejun
--

From: Oleg Nesterov
Date: Monday, March 8, 2010 - 12:06 pm

Could you please confirm this is correct?

I am not sure I understand how the code looks with the patch applied,
but the lockless set_state() above can confuse another stop_machine()
in progress?

Oleg.

--

From: Tejun Heo
Date: Monday, March 8, 2010 - 4:22 pm

Hello,


set_state() now modifies smdata->state and smdata is now local
variable of __stop_machine().  stop_machine instances don't have any
shared resource anymore.  Synchronization and resource sharing are all
cpuhog's responsibilities.

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Monday, March 8, 2010 - 8:53 am

The paranoid check which verifies that the hog callback is actually
called on all online cpus is completely superflous.  It's guaranteed
by cpuhog facility and if it didn't work as advertised other things
would go horribly wrong and trying to recover using
synchronize_sched() wouldn't be very meaningful.

Kill the paranoid check.  Removal of this feature is done as a
separate step so that it can serve as a bisection point if something
actually goes wrong.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
---
 kernel/sched.c |   39 +++------------------------------------
 1 files changed, 3 insertions(+), 36 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 900088d..600ab5d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10829,14 +10829,6 @@ static atomic_t synchronize_sched_expedited_count = ATOMIC_INIT(0);
 
 static int synchronize_sched_expedited_hog(void *data)
 {
-	static DEFINE_SPINLOCK(done_mask_lock);
-	struct cpumask *done_mask = data;
-
-	if (done_mask) {
-		spin_lock(&done_mask_lock);
-		cpumask_set_cpu(smp_processor_id(), done_mask);
-		spin_unlock(&done_mask_lock);
-	}
 	return 0;
 }
 
@@ -10852,53 +10844,28 @@ static int synchronize_sched_expedited_hog(void *data)
  */
 void synchronize_sched_expedited(void)
 {
-	cpumask_var_t done_mask_var;
-	struct cpumask *done_mask = NULL;
 	int snap, trycount = 0;
 
-	/*
-	 * done_mask is used to check that all cpus actually have
-	 * finished running the hog, which is guaranteed by hog_cpus()
-	 * if it's called with cpu hotplug blocked.  Keep the paranoia
-	 * for now but it's best effort if cpumask is off stack.
-	 */
-	if (zalloc_cpumask_var(&done_mask_var, GFP_ATOMIC))
-		done_mask = done_mask_var;
-
 ...
From: Peter Zijlstra
Date: Wednesday, March 10, 2010 - 12:25 pm

cpuhog as a name doesn't work for me, stop-machine had a name that
described its severity and impact, cpuhog makes me think of while(1);.

Can't we keep the stop_machine name and make that a workqueue interface
like you propose?

That way we'd end up with something like:

kernel/stop_machine.c
  int stop_cpu(int cpu, stop_fn_t fn, void *arg)
  int stop_machine(struct cpumask *mask, stop_fn_t fn, void *arg)

alternatively, something like schedule_primary_work*() might work I
guess.



--

From: Tejun Heo
Date: Thursday, March 11, 2010 - 8:13 pm

Hello, Peter.


The distinction would be diabling of IRQ on each CPU.
hog_[one_]cpu[s]() schedule highest priority task to, well, hog the
cpu but doesn't affect contextless part of the cpu (irq, bh, whatnot).
In that sense, it is the lowest bottom of upper half but not quite
stopping the cpu and I think the distinction is rather important to
make.  With the proposed preemption disabling around the callback, it

I wanted to avoid verbs associatffed with the traditional workqueue -
schedule and queue, while emphasizing that this is something that you
don't want to abuse - so the verb hog.  monopolize_cpu() was the
second choice but hog is shorter, sweeter and can also be used as a
noun as-is, so I chose hog.

So, those were my rationales.  What do you think?

Thanks.

-- 
tejun
--

From: Rusty Russell
Date: Sunday, March 28, 2010 - 11:46 pm

I rather like the name.  And the stop_machine name is still there; it's just
using cpuhog rather than workqueues.

Ugly things should have ugly names.

For Patch 2/4 at least:

	Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Great work Tejun!
Rusty.
--

From: Peter Zijlstra
Date: Monday, March 29, 2010 - 2:11 am

Its a pretty minor difference, shouldn't we simply audit all existing
kstopmachine users and fix that up, having two similar but not quite

Still don't like the name fwiw.

--

From: Tejun Heo
Date: Thursday, April 1, 2010 - 10:45 pm

Hello, Peter.


Yeap, sure.  I don't think naming one way or the other is a problem
logistics-wise.  These aren't very widely used APIs anyway.  I've been
thinking quite a while about it and visible interface like the
following would probably fit your suggestion.

* stop_cpu()		- identical to hog_cpu()
* stop_cpus()		- identical to hog_cpus()
* stop_machine()

It's just that stop_cpu[s]() don't look like good names because they
don't really stop cpus.  This distinction is visible in
implementation.  stop_machine()'s per-cpu callback is currently named
stop_cpu() and it adds quite a bit more restrictions on top of just
hogging the cpu.  To me, the following visible API seems better.

* hog_cpu()
* hog_cpus()
* stop_machine()	- uses stop_cpu() internally for implementation

Oh well, I guess it's a matter of taste.  Given that other people
don't dislike the current naming too much, I'll try to push it forward
to Ingo w/ your objection to naming noted.

Thank you for reviewing.

-- 
tejun
--

Previous thread: [PATCH] fs, binfmt_aout: Fix pointer warnings by Borislav Petkov on Monday, March 8, 2010 - 8:46 am. (2 messages)

Next thread: [PATCH] asm-generic/io.h: add big endian versions of io{read,write}{16,32} by Mike Frysinger on Monday, March 8, 2010 - 9:34 am. (2 messages)