* Cleanup cpumask_t usages in smp_call_function_mask function chain
to prevent stack overflow problem when NR_CPUS=4096.
* Reduce the number of passed cpumask_t variables in the following
call chain for x86_64:
smp_call_function_mask -->
arch_send_call_function_ipi->
smp_ops.send_call_func_ipi -->
genapic->send_IPI_mask
Since the smp_call_function_mask() is an EXPORTED function, we
cannot change it's calling interface for a patch to 2.6.27.
The smp_ops.send_call_func_ipi interface is internal only and
has two arch provided functions:
arch/x86/kernel/smp.c: .send_call_func_ipi = native_send_call_func_ipi
arch/x86/xen/smp.c: .send_call_func_ipi = xen_smp_send_call_function_ipi
arch/x86/mach-voyager/voyager_smp.c: (uses native_send_call_func_ipi)
Therefore modifying the internal interface to use a cpumask_t pointer
is straight-forward.
The changes to genapic are much more extensive and are affected by the
recent additions of the x2apic modes, so they will be done for 2.6.28 only.
Based on 2.6.27-rc5-git6.
Applies to linux-2.6.tip/master (with FUZZ).
Signed-off-by: Mike Travis <travis@sgi.com>
---
--
applied to tip/cpus4096, thanks Mike. I'm still wondering whether we should get rid of non-reference based cpumask_t altogether ... Did you have a chance to look at the ftrace/stacktrace tracer in latest tip/master, which will show the maximum stack footprint that can occur? Also, i've applied the patch below as well to restore MAXSMP in a muted form - with big warning signs added as well. Ingo --------------> From 363a5e3d7b4b69371f21bcafd7fc76e68c73733a Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Sat, 6 Sep 2008 15:24:52 +0200 Subject: [PATCH] x86: add MAXSMP restore MAXSMP, it's a nice debugging helper to trigger various crashes and problems with maximum sized x86 systems. Make it depend on EXPERIMENTAL and DEBUG_KERNEL, and inform the user about the effects (stacksize, overhead, memory usage) of this flag. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/Kconfig | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index ed97f2b..91212c1 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -580,10 +580,15 @@ config IOMMU_HELPER config MAXSMP bool "Configure Maximum number of SMP Processors and NUMA Nodes" - depends on X86_64 && SMP && BROKEN - default n + depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL help - Configure maximum number of CPUS and NUMA Nodes for this architecture. + Configure maximum number of CPUS and NUMA Nodes for this + architecture (up to 4096!). + + This can increase memory usage, bigger stack footprint and can + add some runtime overhead as well so unless you want a generic + distro kernel you likely want to say N. + If unsure, say N. config NR_CPUS --
I've got a whole slew of "get-ready-to-remove-cpumask_t's" coming soon. There are two phases, one completely within the x86 arch and the 2nd hits the generic smp_call_function_mask ABI (won't be doable as a back-ported Hmm, no. I'm using a default config right now as I can boot that pretty The main thing is to allow the distros to set it manually for their QA testing of 2.6.27. I'm sure I'll get back bugs because of just that. (Is there a way to have them know to assign bugzilla's to me if NR_CPUS=4k is the root of the problem? This is an extremely serious issue for SGI and I'd like to avoid any delays in me finding out about problems.) Thanks again, --
the commits are: 363a5e3: x86: add MAXSMP 01f569c: x86: restore 4096 limit for NR_CPUS ae74da3: x86: reduce stack requirements for send_call_func_ipi 562d8c2: smp: reduce stack requirements for smp_call_function_mask the merge into tip/master is: | commit 7f5d26f9425851e20ca9774acbd13d0e3b96d9dd | Merge: da5e209... 363a5e3... | Author: Ingo Molnar <mingo@elte.hu> | Date: Sat Sep 6 15:29:18 2008 +0200 | | Merge branch 'cpus4096' That merge commit will go away on the next integration run though. your changes seem to be largely problem-free so far - with two dozen ok. None of this can go into v2.6.27 obviously - the stack corruptions were rather nasty. But it's looking good for v2.6.28 - especially if you i dont think there's any easy mapping. Ingo --
Considering that, unless I'm mistaken, you want to run production systems with 4096 CPUs at some point, then I would say you should really consider increasing NR_CPUS _further_ than that in QA efforts, so that we might be a bit more confident of running production kernels with 4096. Is that being tried? Setting it to 8192 or even higher during QA seems like a good idea to me. --
That's a good idea. I do occasionally set it to 16k (and 64k) for experimental reasons (and to really highlight where cpumask_t space hogs reside), but I hadn't thought to do it in the QA environment. Thanks, Mike --
From: Nick Piggin <nickpiggin@yahoo.com.au> This is a great idea, especially since it will make it even more painfully obvious that essentially any function local cpumask_t variable is a bug. Really, it seems sensible to do something like: 1) Make cpumask_t a pointer. 2) Add cpumask_data_t which is what cpumask_t is now. This gets used when for the actual storage, and will only get applied to datastructures that are dynamically allocated. For example, for the cpu_vm_mask in mm_struct. 3) Type make and fix build failures until they are all gone. --
Yes, that's what I have done in the past ... but putting it into the QA I was wondering if we'd need to be able to default a cpumask_t pointer argument to be a const and then use a different method for those cases where it shouldn't be? This would strengthen the compiler type checking of functions calls. For example: proto(cpumask_t mask) would imply that *mask is a const, whereas proto(cpumask_var mask) would indicate it to be non-const? But then we couldn't use "cpumask_t" as a local declarator... So perhaps we need something completely different for declaring cpumask arguments? (I'm trying to figure out how to structure this with the least amount of source editing.) Thanks! Mike --
From: Mike Travis <travis@sgi.com> Yes, of course, the pointer should be const. --
Cool, I think we should, it's like a ticking bomb waiting to explode on us eventually. IMHO it was a big mistake to allow cpumask_t being passed by value in the first place. Cheers, Jes --
Linus's idea of defining cpumask_t to be a simple long[1] or a pointer to a cpumask is a good one. Unfortunately, the amount (and breadth) of the code changes required is daunting, to say the least. In my source tree there are 892 references to cpumask_t. But I'll start looking into it asap. I don't know however if "NR_CPUS > BITS_PER_LONG" is the correct metric to decide when to use pointers. There must be a better "pain" indicator... ;-) Thanks, Mike --
