Re: [RFC 07/13] sched: Reduce stack size requirements in kernel/sched.c

Previous thread: [RFC 08/13] cpufreq: Reduce stack size requirements in acpi-cpufreq.c by Mike Travis on Saturday, September 6, 2008 - 4:50 pm. (1 message)

Next thread: [RFC 09/13] genapic: reduce stack pressuge in io_apic.c step 1 temp cpumask_ts by Mike Travis on Saturday, September 6, 2008 - 4:50 pm. (3 messages)
From: Mike Travis
Date: Saturday, September 6, 2008 - 4:50 pm

* Make the following changes to kernel/sched.c functions:

    - use node_to_cpumask_ptr in place of node_to_cpumask
    - use get_cpumask_var for temporary cpumask_t variables
    - use alloc_cpumask_ptr where available

  * Remove special code for SCHED_CPUMASK_ALLOC and use CPUMASK_ALLOC
    from linux/cpumask.h.

  * The resultant stack savings are:

    ====== Stack (-l 100)

	1 - initial
	2 - stack-hogs-kernel_sched_c
	'.' is less than the limit(100)

       .1.    .2.    ..final..
      2216  -1536 680   -69%  __build_sched_domains
      1592  -1592   .  -100%  move_task_off_dead_cpu
      1096  -1096   .  -100%  sched_balance_self
      1032  -1032   .  -100%  sched_setaffinity
       616   -616   .  -100%  rebalance_domains
       552   -552   .  -100%  free_sched_groups
       512   -512   .  -100%  cpu_to_allnodes_group
      7616  -6936 680   -91%  Totals


Applies to linux-2.6.tip/master.

Signed-off-by: Mike Travis <travis@sgi.com>
---
 kernel/sched.c |  151 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 81 insertions(+), 70 deletions(-)

--- linux-2.6.tip.orig/kernel/sched.c
+++ linux-2.6.tip/kernel/sched.c
@@ -70,6 +70,7 @@
 #include <linux/bootmem.h>
 #include <linux/debugfs.h>
 #include <linux/ctype.h>
+#include <linux/cpumask_ptr.h>
 #include <linux/ftrace.h>
 #include <trace/sched.h>
 
@@ -117,6 +118,12 @@
  */
 #define RUNTIME_INF	((u64)~0ULL)
 
+/*
+ * temp cpumask variables
+ */
+static DEFINE_PER_CPUMASK(temp_cpumask_1);
+static DEFINE_PER_CPUMASK(temp_cpumask_2);
+
 #ifdef CONFIG_SMP
 /*
  * Divide a load by a sched group cpu_power : (load / sg->__cpu_power)
@@ -2141,7 +2148,11 @@ static int sched_balance_self(int cpu, i
 {
 	struct task_struct *t = current;
 	struct sched_domain *tmp, *sd = NULL;
+	cpumask_ptr span;
+	cpumask_ptr tmpmask;
 
+	get_cpumask_var(span, temp_cpumask_1);
+	get_cpumask_var(tmpmask, temp_cpumask_2);
 	for_each_domain(cpu, tmp) {
 		/*
 		 * If power savings ...
From: Peter Zijlstra
Date: Sunday, September 7, 2008 - 3:24 am

Yuck, that relies on turning preemption off everywhere you want to use

BUG!

get_online_cpus() can sleep, but you just disabled preemption with those
get_cpumask_var() horribles!

Couldn't be arsed to look through the rest, but I really hate this
cpumask_ptr() stuff that relies on disabling preemption.

NAK

--

From: Andrew Morton
Date: Sunday, September 7, 2008 - 4:00 am

that's harder to fix ;)
--

From: Peter Zijlstra
Date: Sunday, September 7, 2008 - 6:05 am

A much easier fix is just re-ordering those operations and do the
get_online_cpus() before disabling preemption. But it does indicate this

Looking at more patches than just the sched one convinced me more that
this approach isn't a good one. It seems to make code much more
fragile. 

See patch 9, there it was needed to draw out the callgraph in order to
map stuff to these global variables - we're adding global dependencies
to code that didn't have any, increasing complexity.



--

From: Mike Travis
Date: Monday, September 8, 2008 - 7:56 am

Again, yes, as I got farther into that one, it became clear that having
static cpumask_t temps over too large a range was ending up very messy.

Thanks,
Mike

--

From: Peter Zijlstra
Date: Sunday, September 7, 2008 - 1:28 pm

That might actually be a worthwhile idea, but it will not make
get_online_cpus() atomic. The whole point of get_online_cpus() is to
serialize against actual hotplug operations, so it will have to sleep at
some point.

Now, turning cpu_hotplug.refcount into an atomic_t might be worthwhile
because it will reduce the amount of atomic operations in its fastpath
from 2 to 1.

You'd have to make recount==1 the stable situation and use
atomic_inc_unless() and atomic_dec_and_test() in get_online_cpus() and
put_online_cpus() resp. that way !refcount can signify a hotplug
operation and we'd fall back into the slow paths.

--

From: Mike Travis
Date: Monday, September 8, 2008 - 7:54 am

Yeah, I really agree as well.  But I wanted to start playing with using
cpumask_t pointers in some fairly straight forward manner.  Linus's and
Ingo's suggestion to just bite the bullet and redefine the cpumask_t 
would force a lot of changes to be made, but perhaps that's really the
way to go.

As to obtaining temp cpumask_t's (both early and late), perhaps a pool of
them would be better?  I believe it could be done similar to alloc_bootmem
(but much simpler), and I don't think there's enough nesting to require a
very large pool.  (4 was the largest depth I could find in io_apic.c.)  Of
course, with preemption enabled then other problems arise...

One other really big use was for the "allbutself" cpumask in the send_IPI
functions.  I think here, preemption is ok because the ownership of the
cpumask temp is very short lived.

But thanks for pointing out the get_online_cpus problem.  I did try and
chase down as many call trees as I could, but I obviously missed one
important one.

And thanks for looking it over!
Mike

--

From: Peter Zijlstra
Date: Monday, September 8, 2008 - 8:05 am

The thing is, you add serialization requirements (be it preempt_disable,
or a lock for some preemptable form) to code that didn't had any for a
case that hardly anyone will ever encounter in real life - I mean,
really, who has 4096 cpus?

Stuffing the cpumap_t in an already existing structure that has suitable
serialization requirements is of course the preferred situation, but
lacking that a dynamic cpumap_t is best, since it keeps the references
local, and thus doesn't add requirements to the existing code.

You could also consider adding 1 cpumap_t to task_struct and use that as
temporary scratch pad - but seeing you needed at least 4 that might not
be a feasible solution either.

--

From: Ingo Molnar
Date: Monday, September 8, 2008 - 11:38 am

seconded. Mike, since none of this is v2.6.27 material, lets do it right 
with a v2.6.28 target. You know all the cpumask_t using code sites 
inside out already, so the know-how is all available already :-) Please 
make it finegrained series of patches so that we can resolve conflicts 
with other trees more easily.

perhaps propose the new cpumask_t API early (in this thread?), so that 
people can comment on it before NAKs come flying against a full patchset 
;-)

	Ingo
--

From: Mike Travis
Date: Wednesday, September 10, 2008 - 3:47 pm

Here's an initial proposal for abstracting cpumask_t to be either
an array of 1 or a pointer to an array...   Hopefully this will
minimize the amount of code changes while providing the capabilities
this change is attempting to do.

Comments most welcome. ;-)

Thanks,
Mike
--

Basically, linux/cpumask.h has the following defines:

	typedef struct { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_data;

	#if NR_CPUS > BITS_PER_LONG
	typedef const cpumask_data	*cpumask_t;
	typedef cpumask_data		*cpumask_var;
	typedef cpumask_t		cpumask_val;
	#define	_NR_CPUS		nr_cpu_ids
	#else
	typedef const cpumask_data	cpumask_t[1];
	typedef cpumask_data		cpumask_var[1];
	typedef cpumask_data		cpumask_val;
	#define	_NR_CPUS		NR_CPUS
	#endif

So in function prototypes:

	cpumask_t function(const cpumask_t *A,
			   cpumask_t *B,
			   cpumask_t cpumask_C)

becomes:

	cpumask_val function(cpumask_t A,
			     cpumask_var B,
			     cpumask_t cpumask_C)

And in local variables:

	cpumask_t ==> cpumask_var  IFF variable is to be written.

This:

	cpumask_t mask = cpu_online_map 
	<change mask>

becomes:

	#include <linux/cpumask_alloc.h>

	cpumask_var mask;

	alloc_cpumask(&mask);
	*mask = *cpu_online_map;
	<change mask>
	free_cpumask(&mask);

Currently, alloc_cpumask is:

	#define BYTES_PER_CPUMASK (BITS_TO_LONGS(nr_cpu_ids)/sizeof(long))

	static inline bool no_cpumask(cpumask_t *m)
	{
		return (*m == NULL);
	}

	static inline void alloc_cpumask(cpumask_t *m)
	{
		cpumask_t d = kmalloc(BYTES_PER_CPUMASK, GFP_KERNEL);
		if (no_cpumask(&d))
			BUG();
		*m = d;
	}

	static inline void alloc_cpumask_nopanic(cpumask_t *m)
	{
		cpumask_t d = kmalloc(BYTES_PER_CPUMASK, GFP_KERNEL);

		*m = d;
	}

	static inline void free_cpumask(cpumask_t *m)
	{
		kfree(*m);
	}

Other means of obtaining a temporary cpumask_t variable will be provided
for those cases where kmalloc() is not available.

Furthermore, system-wide maps become:

	extern cpumask_data ...
From: Andi Kleen
Date: Wednesday, September 10, 2008 - 3:53 pm

> Here's an initial proposal for abstracting cpumask_t to be either

At least for some cases I don't think you'll get around defining
a "nearby subset of CPUs that can be handled together" type. Handling 1K 
objects all the time in one piece is simply not a good idea.

-Andi
--

From: Mike Travis
Date: Wednesday, September 10, 2008 - 4:33 pm

Every time I stop to think about this, the problems with the cpu
operators come to mind.  Should there be a separate set?  Or simply
conversion functions to/from a "cpumask_subset" type?  

Thanks,
Mike
--

From: Andi Kleen
Date: Wednesday, September 10, 2008 - 10:21 pm

A subset would be hopefully enough (set/isset etc.) plus conversion
operators.

-Andi

-- 
ak@linux.intel.com
--

From: Peter Zijlstra
Date: Thursday, September 11, 2008 - 2:00 am

I guess we have to stick the const into the typedef otherwise we get a
const pointer instead of a const array member, right?

In which case I much prefer the following names:

 cpumask_data_t  - value

 const_cpumask_t - pointer to constant value



gah - at the very least you got the naming wrong, methinks the one
panic-ing should have panic in its name - if you really want to persist

--

From: Mike Travis
Date: Thursday, September 11, 2008 - 8:04 am

Peter Zijlstra wrote:

There were some comments previously such that we should "imply" that the
incoming cpumask_t args are const, so the compiler would flag those

In a perfect world, no... ;-)

Yeah, I rather rushed through the allocation part (yuck indeed ;-).

There are some other alternatives:

	- reserve one or more of these in the task struct
	- reserve one or more in a per-cpu area
	- setup some kind of allocation pool similar to alloc_bootmem
	- ???

Thanks,
Mike
--

From: Rusty Russell
Date: Thursday, September 11, 2008 - 9:55 pm

I think this is still "wrong way go back".

I'm yet to be convinced that we really need to allocate cpumasks in any fast 
paths.  And if not, we should simply allocate them everywhere.  I'd rather 
see one #ifdef around a place where we can show a perf issue.

Get rid of CPU_MASK_ALL et al in favour of cpu_mask_all.  And cpu_mask_any_one 
instead of CPU_MASK_CPU0 since that's usually what they want.

API looks like so (look Ma, no typedefs!)

	struct cpumask *cpus;

	cpus = cpumask_alloc();
	if (!cpus)
		return -ENOMEM;

	cpumask_init_single(cpunum);
	OR
	cpumask_init(cpu_mask_all);
	...
	cpumask_free(cpus);

Unmistakable and really hard to screw up.  You can even be clever and not 
reveal the struct cpumask definition so noone can declare one by accident...

Cheers,
Rusty.
--

From: Mike Travis
Date: Friday, September 12, 2008 - 7:28 am

Using a typedef came from Linus, and the idea is basically if NR_CPUS fits
into a long, then it's carried as an array of one (ie., local variable).
If it's bigger, then it's a pointer to a remote array.  The references can
all be pointers (*cpumask), though most of the references use the cpu_XXX
operators which already treat the references correctly (in my proposal,
that is).  That way, small systems can optimize out the indirect reference
and the overhead becomes zero.

Also, cpumask_alloc/free() becomes nop's for small systems.

But I like the idea of dumping some of the initializers.  I should have
made CPU0 "cpumask_of_cpu(0)".  I'll have to look at where they are used to
see if this is feasible.

Thanks!
Mike
--

From: Rusty Russell
Date: Friday, September 12, 2008 - 3:02 pm

Sure it's clever.  ie. nice and confusing.

But do we have any code paths where we care?  Unless we do, let's just keep it 
simple...

Cheers,
Rusty.

--

From: Mike Travis
Date: Friday, September 12, 2008 - 3:50 pm

Here's the thread:

	http://marc.info/?l=linux-kernel&m=121976977901223&w=4

It doesn't seem worthwhile to force all systems to deal with large cpumask's
if they don't need to.  Passing the value on the stack (actually usually in a
reg) if it fits into a long makes a lot of sense.

And I don't think it's that abstract, but I'm willing to hear other opinions.

Btw, most likely only distros that distribute an "Enterprise" edition of
Linux will ever set NR_CPUS so high, so the actual number of systems making
use of this will be a very small percentage (big $$-wise though... ;-) 

I even think that perhaps BITS_PER_LONG might be too low a threshold to kick
in this extra code.  A Larabee chip will have 128 cpus so maybe 128 or 256 is
a better metric...?

As soon as I get a working kernel with the proposed changes, I will definitely
be doing perf testing.

Thanks,
Mike
--

From: H. Peter Anvin
Date: Friday, September 12, 2008 - 3:58 pm

If the performance difference isn't significant, then there is a major 
advantage to getting rid of a configuration option.  At that point we 
can basically scale to an arbitrary number of processors in a stock 
configuration.

	-hpa

--

Previous thread: [RFC 08/13] cpufreq: Reduce stack size requirements in acpi-cpufreq.c by Mike Travis on Saturday, September 6, 2008 - 4:50 pm. (1 message)

Next thread: [RFC 09/13] genapic: reduce stack pressuge in io_apic.c step 1 temp cpumask_ts by Mike Travis on Saturday, September 6, 2008 - 4:50 pm. (3 messages)