* Make the following changes to kernel/sched.c functions:
- use node_to_cpumask_ptr in place of node_to_cpumask
- use get_cpumask_var for temporary cpumask_t variables
- use alloc_cpumask_ptr where available
* Remove special code for SCHED_CPUMASK_ALLOC and use CPUMASK_ALLOC
from linux/cpumask.h.
* The resultant stack savings are:
====== Stack (-l 100)
1 - initial
2 - stack-hogs-kernel_sched_c
'.' is less than the limit(100)
.1. .2. ..final..
2216 -1536 680 -69% __build_sched_domains
1592 -1592 . -100% move_task_off_dead_cpu
1096 -1096 . -100% sched_balance_self
1032 -1032 . -100% sched_setaffinity
616 -616 . -100% rebalance_domains
552 -552 . -100% free_sched_groups
512 -512 . -100% cpu_to_allnodes_group
7616 -6936 680 -91% Totals
Applies to linux-2.6.tip/master.
Signed-off-by: Mike Travis <travis@sgi.com>
---
kernel/sched.c | 151 ++++++++++++++++++++++++++++++---------------------------
1 file changed, 81 insertions(+), 70 deletions(-)
--- linux-2.6.tip.orig/kernel/sched.c
+++ linux-2.6.tip/kernel/sched.c
@@ -70,6 +70,7 @@
#include <linux/bootmem.h>
#include <linux/debugfs.h>
#include <linux/ctype.h>
+#include <linux/cpumask_ptr.h>
#include <linux/ftrace.h>
#include <trace/sched.h>
@@ -117,6 +118,12 @@
*/
#define RUNTIME_INF ((u64)~0ULL)
+/*
+ * temp cpumask variables
+ */
+static DEFINE_PER_CPUMASK(temp_cpumask_1);
+static DEFINE_PER_CPUMASK(temp_cpumask_2);
+
#ifdef CONFIG_SMP
/*
* Divide a load by a sched group cpu_power : (load / sg->__cpu_power)
@@ -2141,7 +2148,11 @@ static int sched_balance_self(int cpu, i
{
struct task_struct *t = current;
struct sched_domain *tmp, *sd = NULL;
+ cpumask_ptr span;
+ cpumask_ptr tmpmask;
+ get_cpumask_var(span, temp_cpumask_1);
+ get_cpumask_var(tmpmask, temp_cpumask_2);
for_each_domain(cpu, tmp) {
/*
* If power savings ...Yuck, that relies on turning preemption off everywhere you want to use BUG! get_online_cpus() can sleep, but you just disabled preemption with those get_cpumask_var() horribles! Couldn't be arsed to look through the rest, but I really hate this cpumask_ptr() stuff that relies on disabling preemption. NAK --
A much easier fix is just re-ordering those operations and do the get_online_cpus() before disabling preemption. But it does indicate this Looking at more patches than just the sched one convinced me more that this approach isn't a good one. It seems to make code much more fragile. See patch 9, there it was needed to draw out the callgraph in order to map stuff to these global variables - we're adding global dependencies to code that didn't have any, increasing complexity. --
Again, yes, as I got farther into that one, it became clear that having static cpumask_t temps over too large a range was ending up very messy. Thanks, Mike --
That might actually be a worthwhile idea, but it will not make get_online_cpus() atomic. The whole point of get_online_cpus() is to serialize against actual hotplug operations, so it will have to sleep at some point. Now, turning cpu_hotplug.refcount into an atomic_t might be worthwhile because it will reduce the amount of atomic operations in its fastpath from 2 to 1. You'd have to make recount==1 the stable situation and use atomic_inc_unless() and atomic_dec_and_test() in get_online_cpus() and put_online_cpus() resp. that way !refcount can signify a hotplug operation and we'd fall back into the slow paths. --
Yeah, I really agree as well. But I wanted to start playing with using cpumask_t pointers in some fairly straight forward manner. Linus's and Ingo's suggestion to just bite the bullet and redefine the cpumask_t would force a lot of changes to be made, but perhaps that's really the way to go. As to obtaining temp cpumask_t's (both early and late), perhaps a pool of them would be better? I believe it could be done similar to alloc_bootmem (but much simpler), and I don't think there's enough nesting to require a very large pool. (4 was the largest depth I could find in io_apic.c.) Of course, with preemption enabled then other problems arise... One other really big use was for the "allbutself" cpumask in the send_IPI functions. I think here, preemption is ok because the ownership of the cpumask temp is very short lived. But thanks for pointing out the get_online_cpus problem. I did try and chase down as many call trees as I could, but I obviously missed one important one. And thanks for looking it over! Mike --
The thing is, you add serialization requirements (be it preempt_disable, or a lock for some preemptable form) to code that didn't had any for a case that hardly anyone will ever encounter in real life - I mean, really, who has 4096 cpus? Stuffing the cpumap_t in an already existing structure that has suitable serialization requirements is of course the preferred situation, but lacking that a dynamic cpumap_t is best, since it keeps the references local, and thus doesn't add requirements to the existing code. You could also consider adding 1 cpumap_t to task_struct and use that as temporary scratch pad - but seeing you needed at least 4 that might not be a feasible solution either. --
seconded. Mike, since none of this is v2.6.27 material, lets do it right with a v2.6.28 target. You know all the cpumask_t using code sites inside out already, so the know-how is all available already :-) Please make it finegrained series of patches so that we can resolve conflicts with other trees more easily. perhaps propose the new cpumask_t API early (in this thread?), so that people can comment on it before NAKs come flying against a full patchset ;-) Ingo --
Here's an initial proposal for abstracting cpumask_t to be either
an array of 1 or a pointer to an array... Hopefully this will
minimize the amount of code changes while providing the capabilities
this change is attempting to do.
Comments most welcome. ;-)
Thanks,
Mike
--
Basically, linux/cpumask.h has the following defines:
typedef struct { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_data;
#if NR_CPUS > BITS_PER_LONG
typedef const cpumask_data *cpumask_t;
typedef cpumask_data *cpumask_var;
typedef cpumask_t cpumask_val;
#define _NR_CPUS nr_cpu_ids
#else
typedef const cpumask_data cpumask_t[1];
typedef cpumask_data cpumask_var[1];
typedef cpumask_data cpumask_val;
#define _NR_CPUS NR_CPUS
#endif
So in function prototypes:
cpumask_t function(const cpumask_t *A,
cpumask_t *B,
cpumask_t cpumask_C)
becomes:
cpumask_val function(cpumask_t A,
cpumask_var B,
cpumask_t cpumask_C)
And in local variables:
cpumask_t ==> cpumask_var IFF variable is to be written.
This:
cpumask_t mask = cpu_online_map
<change mask>
becomes:
#include <linux/cpumask_alloc.h>
cpumask_var mask;
alloc_cpumask(&mask);
*mask = *cpu_online_map;
<change mask>
free_cpumask(&mask);
Currently, alloc_cpumask is:
#define BYTES_PER_CPUMASK (BITS_TO_LONGS(nr_cpu_ids)/sizeof(long))
static inline bool no_cpumask(cpumask_t *m)
{
return (*m == NULL);
}
static inline void alloc_cpumask(cpumask_t *m)
{
cpumask_t d = kmalloc(BYTES_PER_CPUMASK, GFP_KERNEL);
if (no_cpumask(&d))
BUG();
*m = d;
}
static inline void alloc_cpumask_nopanic(cpumask_t *m)
{
cpumask_t d = kmalloc(BYTES_PER_CPUMASK, GFP_KERNEL);
*m = d;
}
static inline void free_cpumask(cpumask_t *m)
{
kfree(*m);
}
Other means of obtaining a temporary cpumask_t variable will be provided
for those cases where kmalloc() is not available.
Furthermore, system-wide maps become:
extern cpumask_data ...> Here's an initial proposal for abstracting cpumask_t to be either At least for some cases I don't think you'll get around defining a "nearby subset of CPUs that can be handled together" type. Handling 1K objects all the time in one piece is simply not a good idea. -Andi --
Every time I stop to think about this, the problems with the cpu operators come to mind. Should there be a separate set? Or simply conversion functions to/from a "cpumask_subset" type? Thanks, Mike --
A subset would be hopefully enough (set/isset etc.) plus conversion operators. -Andi -- ak@linux.intel.com --
I guess we have to stick the const into the typedef otherwise we get a const pointer instead of a const array member, right? In which case I much prefer the following names: cpumask_data_t - value const_cpumask_t - pointer to constant value gah - at the very least you got the naming wrong, methinks the one panic-ing should have panic in its name - if you really want to persist --
Peter Zijlstra wrote: There were some comments previously such that we should "imply" that the incoming cpumask_t args are const, so the compiler would flag those In a perfect world, no... ;-) Yeah, I rather rushed through the allocation part (yuck indeed ;-). There are some other alternatives: - reserve one or more of these in the task struct - reserve one or more in a per-cpu area - setup some kind of allocation pool similar to alloc_bootmem - ??? Thanks, Mike --
I think this is still "wrong way go back". I'm yet to be convinced that we really need to allocate cpumasks in any fast paths. And if not, we should simply allocate them everywhere. I'd rather see one #ifdef around a place where we can show a perf issue. Get rid of CPU_MASK_ALL et al in favour of cpu_mask_all. And cpu_mask_any_one instead of CPU_MASK_CPU0 since that's usually what they want. API looks like so (look Ma, no typedefs!) struct cpumask *cpus; cpus = cpumask_alloc(); if (!cpus) return -ENOMEM; cpumask_init_single(cpunum); OR cpumask_init(cpu_mask_all); ... cpumask_free(cpus); Unmistakable and really hard to screw up. You can even be clever and not reveal the struct cpumask definition so noone can declare one by accident... Cheers, Rusty. --
Using a typedef came from Linus, and the idea is basically if NR_CPUS fits into a long, then it's carried as an array of one (ie., local variable). If it's bigger, then it's a pointer to a remote array. The references can all be pointers (*cpumask), though most of the references use the cpu_XXX operators which already treat the references correctly (in my proposal, that is). That way, small systems can optimize out the indirect reference and the overhead becomes zero. Also, cpumask_alloc/free() becomes nop's for small systems. But I like the idea of dumping some of the initializers. I should have made CPU0 "cpumask_of_cpu(0)". I'll have to look at where they are used to see if this is feasible. Thanks! Mike --
Sure it's clever. ie. nice and confusing. But do we have any code paths where we care? Unless we do, let's just keep it simple... Cheers, Rusty. --
Here's the thread: http://marc.info/?l=linux-kernel&m=121976977901223&w=4 It doesn't seem worthwhile to force all systems to deal with large cpumask's if they don't need to. Passing the value on the stack (actually usually in a reg) if it fits into a long makes a lot of sense. And I don't think it's that abstract, but I'm willing to hear other opinions. Btw, most likely only distros that distribute an "Enterprise" edition of Linux will ever set NR_CPUS so high, so the actual number of systems making use of this will be a very small percentage (big $$-wise though... ;-) I even think that perhaps BITS_PER_LONG might be too low a threshold to kick in this extra code. A Larabee chip will have 128 cpus so maybe 128 or 256 is a better metric...? As soon as I get a working kernel with the proposed changes, I will definitely be doing perf testing. Thanks, Mike --
If the performance difference isn't significant, then there is a major advantage to getting rid of a configuration option. At that point we can basically scale to an arbitrary number of processors in a stock configuration. -hpa --
