The core portion of the cpu allocator.
The per cpu allocator allows dynamic allocation of memory on all processors simultaneously. A bitmap is used to track used areas. The allocator implements tight packing to reduce the cache footprint and increase speed since cacheline contention is typically not a concern for memory mainly used by a single cpu. Small objects will fill up gaps left by larger allocations that required alignments.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/cpu_alloc.h | 56 ++++++
include/linux/mm.h | 13 +
include/linux/vmstat.h | 2
mm/Kconfig | 33 +++
mm/Makefile | 3
mm/cpu_alloc.c | 407 ++++++++++++++++++++++++++++++++++++++++++++++
mm/vmstat.c | 1
7 files changed, 512 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/cpu_alloc.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/cpu_alloc.c 2007-11-06 06:05:06.000000000 +0000
@@ -0,0 +1,407 @@
+/*
+ * Cpu allocator - Manage objects allocated for each processor
+ *
+ * (C) 2007 SGI, Christoph Lameter <clameter@sgi.com>
+ * Basic implementation with allocation and free from a dedicated per
+ * cpu area.
+ *
+ * The per cpu allocator allows dynamic allocation of memory on all
+ * processor simultaneously. A bitmap is used to track used areas.
+ * The allocator implements tight packing to reduce the cache footprint
+ * and increase speed since cacheline contention is typically not a concern
+ * for memory mainly used by a single cpu. Small objects will fill up gaps
+ * left by larger allocations that required alignments.
+ */
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/module.h>
+#include <linux/cpu_alloc.h>
+#include <linux/bitmap.h>
+#include <linux/vmalloc.h>
+#include <linux/bootmem.h>
+#include <linux/sched.h> /* i386 definition of init_mm */
+#include ...
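The packing behaviour described in the changelog can be illustrated with a small userspace sketch. This is not the code from mm/cpu_alloc.c (which is elided above); the names toy_alloc, toy_free, TOY_UNIT_SIZE and TOY_UNITS are made up for illustration, the unit size of 8 bytes is an assumption, and offsets are returned instead of pointers. It only shows how a first-fit search over a bitmap lets a small object land in the gap that an aligned allocation left behind.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_UNIT_SIZE  8u    /* bytes covered by one bitmap bit (assumed) */
#define TOY_UNITS      64u   /* units in the toy area, one uint64_t of bits */

static uint64_t toy_map;     /* bit i set means unit i is in use */

static int toy_range_free(unsigned start, unsigned units)
{
    for (unsigned i = start; i < start + units; i++)
        if (toy_map & (UINT64_C(1) << i))
            return 0;
    return 1;
}

/*
 * First-fit allocation. align must be a power of two. Returns the byte
 * offset of the object inside the area, or -1 if nothing fits.
 */
static long toy_alloc(size_t size, size_t align)
{
    unsigned units, step;

    if (size == 0 || size > TOY_UNITS * TOY_UNIT_SIZE)
        return -1;
    if (align > TOY_UNITS * TOY_UNIT_SIZE)
        align = TOY_UNITS * TOY_UNIT_SIZE;   /* only offset 0 can satisfy it */
    units = (unsigned)((size + TOY_UNIT_SIZE - 1) / TOY_UNIT_SIZE);
    step = align > TOY_UNIT_SIZE ? (unsigned)(align / TOY_UNIT_SIZE) : 1;

    for (unsigned start = 0; start + units <= TOY_UNITS; start += step) {
        if (!toy_range_free(start, units))
            continue;
        for (unsigned i = start; i < start + units; i++)
            toy_map |= UINT64_C(1) << i;     /* mark the units as used */
        return (long)(start * TOY_UNIT_SIZE);
    }
    return -1;
}

/* offset and size must be the values used in the matching toy_alloc(). */
static void toy_free(long offset, size_t size)
{
    unsigned start = (unsigned)offset / TOY_UNIT_SIZE;
    unsigned units = (unsigned)((size + TOY_UNIT_SIZE - 1) / TOY_UNIT_SIZE);

    for (unsigned i = start; i < start + units; i++)
        toy_map &= ~(UINT64_C(1) << i);
}
```

With an 8-byte object at offset 0 and a 32-byte object aligned to 32 bytes at offset 32, the next 8-byte request is placed at offset 8, inside the alignment gap, rather than after the large object.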
Why a bitmap allocator and not a heap allocator?
Also, looking at the lock usage, this thing is not IRQ safe, so it
As I said in the previous mail (which, due to creative mailing from your end, never made it out to the lists), I dislike those shouting macros. Please lowercase them. -
sed -i -e 's/CPU_OFFSET/cpu_offset/g' -e 's/CPU_PTR/cpu_ptr/' -e 's/CPU_ALLOC/cpu_alloc_type/g' -e 's/cpu_free/__cpu_free/g' -e 's/CPU_FREE/cpu_free/' -e 's/THIS_CPU/this_cpu/g' patches/*.patch should get you there. -
Well I went the other way and made it work like the slab allocators.
cpu_alloc: Make it irq safe
Use the same method as used in SLAB/SLUB to make the allocator interrupt safe. disable interrupts when allocator metadata is processed. Reenable interrupts during page allocator calls if __GFP_WAIT is set in the flags passed to the allocator.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/cpu_alloc.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
Index: linux-2.6/mm/cpu_alloc.c
===================================================================
--- linux-2.6.orig/mm/cpu_alloc.c 2007-11-07 16:49:40.069701326 -0800
+++ linux-2.6/mm/cpu_alloc.c 2007-11-08 10:45:43.172294260 -0800
@@ -186,6 +186,8 @@ static int expand_cpu_area(gfp_t flags)
goto out;
spin_unlock(&cpu_alloc_map_lock);
+ if (flags & __GFP_WAIT)
+ local_irq_enable();
/*
* Determine the size of the bit map needed
@@ -212,6 +214,8 @@ static int expand_cpu_area(gfp_t flags)
goto out;
}
+ if (flags & __GFP_WAIT)
+ local_irq_disable();
spin_lock(&cpu_alloc_map_lock);
/*
@@ -312,10 +316,11 @@ void *cpu_alloc(unsigned long size, gfp_
void *ptr;
int first;
unsigned long map_size;
+ unsigned long flags;
BUG_ON(gfpflags & ~(GFP_RECLAIM_MASK | __GFP_ZERO));
- spin_lock(&cpu_alloc_map_lock);
+ spin_lock_irqsave(&cpu_alloc_map_lock, flags);
restart:
map_size = PAGE_SIZE << cpu_alloc_map_order;
@@ -358,7 +363,7 @@ restart:
units_free -= units;
__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
- spin_unlock(&cpu_alloc_map_lock);
+ spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
ptr = cpu_area + start * UNIT_SIZE;
@@ -372,7 +377,7 @@ restart:
return ptr;
out_of_memory:
- spin_unlock(&cpu_alloc_map_lock);
+ spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
return NULL;
}
EXPORT_SYMBOL(cpu_alloc);
@@ -386,13 +391,14 @@ void cpu_free(void *start, unsigned long
int units = ...
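The ordering this patch establishes can be modelled in plain userspace C. The sketch below is only a model under stated assumptions: two boolean flags stand in for the local interrupt state and for cpu_alloc_map_lock, TOY_GFP_WAIT stands in for __GFP_WAIT, and all model_* names are hypothetical. It does not call any kernel API; it just checks that a page allocation that may sleep never runs with interrupts off or with the lock held.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_GFP_WAIT 0x1u          /* hypothetical stand-in for __GFP_WAIT */

static bool model_irqs_off;        /* models the local interrupt state */
static bool model_lock_held;       /* models cpu_alloc_map_lock */
static int  model_bad_sleeps;      /* sleeps attempted with irqs off */

/* Models a page allocator call that may sleep when TOY_GFP_WAIT is set. */
static void model_page_alloc(unsigned flags)
{
    assert(!model_lock_held);      /* never allocate pages under the lock */
    if ((flags & TOY_GFP_WAIT) && model_irqs_off)
        model_bad_sleeps++;
}

/* Entered with lock held and irqs off; returns in the same state. */
static void model_expand(unsigned flags)
{
    model_lock_held = false;       /* spin_unlock() */
    if (flags & TOY_GFP_WAIT)
        model_irqs_off = false;    /* local_irq_enable() */

    model_page_alloc(flags);

    if (flags & TOY_GFP_WAIT)
        model_irqs_off = true;     /* local_irq_disable() */
    model_lock_held = true;        /* spin_lock() */
}

/* Models the cpu_alloc() path; returns the number of bad sleeps seen. */
static int model_alloc_path(unsigned flags)
{
    bool saved = model_irqs_off;   /* spin_lock_irqsave() */

    model_irqs_off = true;
    model_lock_held = true;

    model_expand(flags);           /* metadata is handled with irqs off */

    model_lock_held = false;       /* spin_unlock_irqrestore() */
    model_irqs_off = saved;
    return model_bad_sleeps;
}
```

A caller that passes no wait flag keeps interrupts disabled across the whole path, which matches the patch: interrupts are only reenabled when the flags allow the page allocator to sleep.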
From: Christoph Lameter <clameter@sgi.com> Unfortunately, sparc64 fails to boot even with just this patch applied. The problem is that for the non-virtualized case this patch bloats up the BSS section to be more than 8MB in size. Sparc64 kernel images cannot be more than 8MB in size total due to various boot loader and firmware limitations. I have NR_CPUS set to 64, but it can be up to 4096 on sparc64. Yes, I could add virtualized area support to sparc64, but we cannot impose this on every platform. One thing you could do is simply use a vmalloc allocation in the non-virtualized case. -
Yuck. Meaning to add more crappy code. The BSS limitation to 8M is a bit strange though. Do other platforms have the same issues? -
Maybe not so crappy, because even for i386 code, you might use not a strict vmalloc() implementation but at least reserving percpu space inside the vmalloc range (i.e. not use a dedicated area as your current patchset does).
This is because NR_CPUS is defaulted to 32 on i386 (with a limit of 256), so reserving 256*256KB = 64 MB of virtual space might be too much (this is half the typical vmalloc area).
The idea would be:
- Reserving an area of NR_CPUS*256KB inside vmalloc() space (but of course not allocating pages)
- Then for each non possible cpu, 'release' its 256KB area and give it back to vmalloc free areas pool.
Once you add in mm/vmalloc.c all needed helpers, no need to use BSS Megablob anymore? -
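The arithmetic in the message above is easy to check. The helper below is a hypothetical illustration, not kernel code; it uses only the figures quoted in that message (a 256KB area per cpu, NR_CPUS of 32 or 256) and shows how much virtual space stays reserved before and after the areas of non possible cpus are released.

```c
#include <stddef.h>

#define TOY_KB 1024ul
#define TOY_MB (1024ul * TOY_KB)

/* Virtual space reserved when each of nr_cpus slots gets area_bytes. */
static unsigned long toy_reserved_bytes(unsigned long nr_cpus,
                                        unsigned long area_bytes)
{
    return nr_cpus * area_bytes;
}

/*
 * Space still reserved after the areas of non possible cpus have been
 * handed back: only the possible cpus keep their area.
 */
static unsigned long toy_reserved_after_release(unsigned long nr_possible,
                                                unsigned long area_bytes)
{
    return toy_reserved_bytes(nr_possible, area_bytes);
}
```

For NR_CPUS=256 this gives 64MB up front; if only 2 cpus turn out to be possible, 512KB remains reserved once the other 254 areas are released.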
Well I think all of this can be avoided by simply copying the existing vmemmap helper functions and providing a virtual address for sparc64. -
From: Christoph Lameter <clameter@sgi.com> I intend to do that in the end, but you miss my point. Requiring this is unreasonable. And nobody is going to do the virt stuff for platforms like sparc32. And I do mean nobody. -
It defaults to 8 except if you use a NUMA system.
config NR_CPUS
int "Maximum number of CPUs (2-255)"
range 2 255
depends on SMP
default "32" if X86_NUMAQ || X86_SUMMIT || X86_BIGSMP || X86_ES7000
default "8"
-
From: Christoph Lameter <clameter@sgi.com> sparc32 has the same limit. I'm surprised this is your reaction instead of "oh damn, sorry I bloated up the kernel image size by 8mb, I'll find a way to fix that." -
Well I found a way to fix that and it's in the patch... -
Other platforms do not have the 8MB restriction nor do they have so many
processors.
Here is the draft of a virtual cpu area implementation for sparc64. Uses
the VMEMMAP chunks:
---
arch/sparc64/Kconfig | 12 ++++++++++++
arch/sparc64/mm/init.c | 34 ++++++++++++++++++++++++++++++++++
include/asm-sparc64/pgtable.h | 1 +
3 files changed, 47 insertions(+)
Index: linux-2.6/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/init.c 2007-11-13 14:09:44.619500290 -0800
+++ linux-2.6/arch/sparc64/mm/init.c 2007-11-13 14:17:49.794860210 -0800
@@ -1697,6 +1697,40 @@ int __meminit vmemmap_populate(struct pa
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+int cpu_area_populate(void *start, unsigned long size, gfp_t flags, int node)
+{
+ unsigned long vstart = (unsigned long) start;
+ unsigned long vend = (unsigned long) (start + size);
+ unsigned long phys_start = (vstart - CPU_AREA_BASE);
+ unsigned long phys_end = (vend - CPU_AREA_BASE);
+ unsigned long addr = phys_start & VMEMMAP_CHUNK_MASK;
+ unsigned long end = VMEMMAP_ALIGN(phys_end);
+ unsigned long pte_base;
+
+ pte_base = (_PAGE_VALID | _PAGE_SZ4MB_4U |
+ _PAGE_CP_4U | _PAGE_CV_4U |
+ _PAGE_P_4U | _PAGE_W_4U);
+ if (tlb_type == hypervisor)
+ pte_base = (_PAGE_VALID | _PAGE_SZ4MB_4V |
+ _PAGE_CP_4V | _PAGE_CV_4V |
+ _PAGE_P_4V | _PAGE_W_4V);
+
+ for (; addr < end; addr += VMEMMAP_CHUNK) {
+ unsigned long *vmem_pp =
+ vmemmap_table + (addr >> VMEMMAP_CHUNK_SHIFT);
+ void *block;
+
+ if (!(*vmem_pp & _PAGE_VALID)) {
+ block = vmemmap_alloc_block(1UL << 22, flags, node);
+ if (!block)
+ return -ENOMEM;
+
+ *vmem_pp = pte_base | __pa(block);
+ }
+ }
+ return 0;
+}
+
static void prot_init_common(unsigned long page_none,
unsigned long page_shared,
unsigned long page_copy,
Index: ...
From: Christoph Lameter <clameter@sgi.com> sparc32 has the same limitations, nobody is going to implement
This doesn't avoid the core problem. Bloating up the BSS like that is bad, and enforcing a virt implementation to avoid that is an anti-social way to go about implementing this feature. -
BTW, I'm going to stop testing your patches on sparc64 for a while until you start to make me feel like you understand that ignoring the BSS bloat issue is bad. -
Well this is just the fallback. How can I avoid this and still keep a constant? Add a new segment to vmlinux.lds.S? -
From: Christoph Lameter <clameter@sgi.com> I'm not so sure. The idea about doling out vmalloc space seemed the most promising. -
Well that is basically the same as the virtual mode. Just ditch the fallback mode? vmalloc directly does not guarantee a fixed address. -
I recently ran into a similar problem with x86-64 and large BSS from lockdep conflicting with a 16MB kdump kernel. Solution was to do another early allocator before bootmem and then move the tables into there. -Andi -
From: Andi Kleen <andi@firstfloor.org> Yes, I've run into similar problems with lockdep as well. I had to build an ultra minimalized kernel to get it to boot on my Niagara boxes. I think I even looked at the same lockdep code, and I'd appreciate it if you'd submit your fix for this if you haven't already. -
Hmmmm. cpu_alloc really does not need zeroed data. Just an address fixed by the compiler where stuff can be put. Can the loader do that somehow? -
From: Christoph Lameter <clameter@sgi.com> Yes, and I think IA64 uses such a scheme for its 64KB fixed per-cpu TLB mapping thing, doesn't it? -
The per cpu TLB mapping is virtually mapped. The real memory allocation behind it occurs dynamically from bootmem. -
ftp://firstfloor.org/pub/ak/x86_64/quilt/patches/early-reserve
ftp://firstfloor.org/pub/ak/x86_64/quilt/patches/early-alloc
ftp://firstfloor.org/pub/ak/x86_64/quilt/patches/lockdep-early-alloc
I didn't plan to submit it for .24, just .25. Or do you need it urgently? Also it would require you to write a sparc specific arch_early_alloc() of course. I've only done the x86-64 version. -Andi -
From: Andi Kleen <andi@firstfloor.org> I'll be sure to take care of that when it hits .25 Thanks Andi. -
Would've been nice to have heard about this lockdep problem. Anyway,
thanks for tackling it.
How about moving this bit:
+#ifndef ARCH_HAS_EARLY_ALLOC
+#define LARGEVAR(x,y) { static typeof(*x) __ ## x[y]; x = __ ## x; }
+#else
+#define LARGEVAR(x,y) x = arch_early_alloc(sizeof(*x) * y)
+#endif
out of the lockdep code and into the generic early alloc code?
-
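To make the LARGEVAR() idea concrete, here is a userspace sketch of both variants. It is an approximation under stated assumptions: arch_early_alloc() is replaced by a calloc() stand-in because the real one is arch specific and not shown in this thread, the macro is wrapped in do/while so it can be used as a statement, __typeof__ is the GCC/Clang spelling of typeof, and the static array is named with a _store suffix instead of the __ prefix. The table and entry names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Stand-in only: a real arch_early_alloc() would hand out memory from an
 * early boot allocator. calloc() is used so the sketch runs in userspace.
 */
static inline void *arch_early_alloc(size_t bytes)
{
    return calloc(1, bytes);
}

#ifndef ARCH_HAS_EARLY_ALLOC
/* Fallback: back the pointer with a static array, which lands in BSS. */
#define LARGEVAR(x, y) \
    do { static __typeof__(*(x)) x##_store[y]; (x) = x##_store; } while (0)
#else
/* Early allocator available: take the memory at boot, keep BSS small. */
#define LARGEVAR(x, y) \
    do { (x) = arch_early_alloc(sizeof(*(x)) * (y)); } while (0)
#endif

/* Hypothetical large table, standing in for the lockdep tables. */
struct toy_entry {
    int key;
    long val;
};

static struct toy_entry *toy_table;

static void toy_table_init(void)
{
    LARGEVAR(toy_table, 1024);
}
```

Callers only ever see the pointer, so moving the macro into generic code would let other users with large static tables pick up the early allocator without changes at the use sites. Compiling with -DARCH_HAS_EARLY_ALLOC switches the sketch to the allocating variant.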
Hmmm.. I have v2 in preparation here that puts the pda and the per cpu data into the cpu_alloc area. Thus gs: can be used to access all per cpu data. Any ideas how to abstract out the pda operations? Wasn't local_t supposed to be able to do atomic ops on cpu data? Is there a segment register version of local_t? I also want to have cmpxchg, xchg etc. that are all atomic without requiring any interrupt disable or preempt disable.
cpu_alloc allows pointer arithmetic on cpu area pointers. The segment prefix can then be used to select the appropriate area.
Guess I also need to add an arch configuration guide to V2 so that the other arches can do similar tricks, and emphasize that the static default that requires bss is only suitable for small systems. -
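The pointer arithmetic mentioned above can be sketched in userspace as follows. The definitions of CPU_PTR and CPU_OFFSET are not shown in this thread, so toy_cpu_offset() and toy_cpu_ptr() are hypothetical stand-ins, and the layout (one equally sized area per cpu, laid out back to back) is an assumption. The segment prefix trick is not modelled; on x86_64 the offset addition would be done by the gs: prefix instead of explicit code.

```c
#include <stddef.h>

#define TOY_NR_CPUS     4
#define TOY_AREA_LONGS  32          /* longs per cpu area in this toy */

/* One area per cpu, back to back, so a fixed stride separates them. */
static long toy_areas[TOY_NR_CPUS][TOY_AREA_LONGS];

/* Byte distance from cpu 0's area to the area of the given cpu. */
static size_t toy_cpu_offset(int cpu)
{
    return (size_t)cpu * sizeof(toy_areas[0]);
}

/*
 * Translate a pointer into cpu 0's area into the matching object in
 * another cpu's area by adding that cpu's offset.
 */
static long *toy_cpu_ptr(long *p, int cpu)
{
    return (long *)((char *)p + toy_cpu_offset(cpu));
}
```

An allocation is represented by a single pointer, here into cpu 0's area, and every cpu reaches its own instance by adding its offset, which is what makes a segment register a natural fit for holding that offset.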
From: Christoph Lameter <clameter@sgi.com> I'm going to be against your changes until you implement a real fix for the BSS bloat problems. It's worse than the per-cpu allocator we have now, much worse. -
BTW linux-mm@vger.kernel.org does not exist, please remove it from the CC: in the future :-) Thanks. -
You need to configure cpu_alloc for your arch, and so far you seem not to have had the time to understand how it works; otherwise you would not repeat these statements and ask me to implement what the patch already provides. The only problem that I see so far is a communication problem. Thus we need more documentation. -
From: Christoph Lameter <clameter@sgi.com> Fair enough, I'll look more closely at the next rev of your patch set. -
Well there is an LWN article now that also claims that the cpu_alloc patchset requires a large bss space. Sigh. See http://lwn.net/Articles/257828/
Not true! 44 bytes is reasonable.
christoph@stapp:~/linux-2.6$ size mm/cpu_alloc.o
   text    data     bss     dec     hex filename
   5625      36      44    5705    1649 mm/cpu_alloc.o
Need to separate out the virtualization into a separate patch in V2 to make clear that it is there. -
I am running the same version that you also ran. The problem is that you did not configure the stuff properly for your box and I did not include a configuration for sparc64 since I did not know how it needed to be configured for sparc64. You ignored the patch for sparc64 that I provided to correct the problem. -
From: Christoph Lameter <clameter@sgi.com> If you're talking about the VMEMMAP thing, that patch didn't remove the problem, it simply added optimizations for sparc64 so that you could sweep the problem under the rug. Sparc32 is still broken, as just one of several possible examples. The BSS usage is still there for platforms that don't use VMEMMAP. So again, the lwn.net report is accurate. -
The virtual mapping of the cpu areas is used by the patch I
I have not looked at sparc32 sorry. If you simply set up a couple of
All MMU platforms can use the virtual mappings. The main use of the static configuration is for embedded systems. -
There is no assembly code required. I overdid it in the patch that I sent you trying to make sparc64 use large mappings like x86_64 NUMA. You really do not need that. Look at the IA64 and i386 configurations.
There is no C code required. The x86_64 code only adds some special C code for the NUMA
VMEMMAP is something different from the cpu allocator. All MMU platforms have vmalloc support and you even suggested the use of vmalloc. -
From: Christoph Lameter <clameter@sgi.com> Ok, and like I said last time, I'll examine this more closely when you spin your next version of these patches. Please post them soon as I'm eager to test this stuff out for you. Thanks. -
All the above statements are about the version of the patch that I
Ok. Thanks. (You could test it with the current version.... Just edit Kconfig and try a few things and I could include the settings in the next release.) -
