Re: [patch 01/28] cpu alloc: The allocator

Previous thread: [patch 05/28] cpu alloc: Use in SLUB by Christoph Lameter on Tuesday, November 6, 2007 - 12:51 pm. (1 message)

Next thread: [patch 00/28] cpu alloc v1: Optimize by removing arrays of pointers to per cpu objects by Christoph Lameter on Tuesday, November 6, 2007 - 12:51 pm. (3 messages)
From: Christoph Lameter
Date: Tuesday, November 6, 2007 - 12:51 pm

The core portion of the cpu allocator.

The per cpu allocator allows dynamic allocation of memory on all
processors simultaneously. A bitmap is used to track used areas.
The allocator implements tight packing to reduce the cache footprint
and increase speed since cacheline contention is typically not a concern
for memory mainly used by a single cpu. Small objects will fill up gaps
left by larger allocations that required alignments.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
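
A minimal user-space sketch of the idea described above: a bitmap tracks used units, allocations are packed first-fit, and a later small object lands in the gap that an aligned allocation left behind. This is not the code from mm/cpu_alloc.c. The names toy_cpu_alloc, toy_cpu_free, TOY_UNIT_SIZE and TOY_NR_UNITS are invented for illustration, and the sizes are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes, chosen only to keep the example small. */
#define TOY_UNIT_SIZE 8		/* bytes tracked per map entry */
#define TOY_NR_UNITS  64	/* units in the toy area */

/* One byte per unit for readability; a real map would use one bit. */
static unsigned char toy_unit_used[TOY_NR_UNITS];

/* Round a byte count up to whole units, with a minimum of one unit. */
static int toy_units(size_t bytes)
{
	int units = (int)((bytes + TOY_UNIT_SIZE - 1) / TOY_UNIT_SIZE);

	return units ? units : 1;
}

/*
 * First-fit search: return the byte offset of the first free run that
 * is large enough and starts on the requested alignment, or -1.
 */
static long toy_cpu_alloc(size_t size, size_t align)
{
	int units = toy_units(size);
	int align_units = toy_units(align);
	int start, i;

	for (start = 0; start + units <= TOY_NR_UNITS; start += align_units) {
		for (i = 0; i < units; i++)
			if (toy_unit_used[start + i])
				break;
		if (i == units) {
			memset(toy_unit_used + start, 1, (size_t)units);
			return (long)start * TOY_UNIT_SIZE;
		}
	}
	return -1;
}

/* Mark the units behind an earlier allocation as free again. */
static void toy_cpu_free(long offset, size_t size)
{
	memset(toy_unit_used + offset / TOY_UNIT_SIZE, 0,
	       (size_t)toy_units(size));
}
```

Allocating 8 bytes, then 32 bytes aligned to 32, leaves three free units between the two objects; the next 8 byte request is placed in that gap rather than after the large object.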


---
 include/linux/cpu_alloc.h |   56 ++++++
 include/linux/mm.h        |   13 +
 include/linux/vmstat.h    |    2 
 mm/Kconfig                |   33 +++
 mm/Makefile               |    3 
 mm/cpu_alloc.c            |  407 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmstat.c               |    1 
 7 files changed, 512 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/cpu_alloc.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/cpu_alloc.c	2007-11-06 06:05:06.000000000 +0000
@@ -0,0 +1,407 @@
+/*
+ * Cpu allocator - Manage objects allocated for each processor
+ *
+ * (C) 2007 SGI, Christoph Lameter <clameter@sgi.com>
+ * 	Basic implementation with allocation and free from a dedicated per
+ * 	cpu area.
+ *
+ * The per cpu allocator allows dynamic allocation of memory on all
+ * processor simultaneously. A bitmap is used to track used areas.
+ * The allocator implements tight packing to reduce the cache footprint
+ * and increase speed since cacheline contention is typically not a concern
+ * for memory mainly used by a single cpu. Small objects will fill up gaps
+ * left by larger allocations that required alignments.
+ */
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/module.h>
+#include <linux/cpu_alloc.h>
+#include <linux/bitmap.h>
+#include <linux/vmalloc.h>
+#include <linux/bootmem.h>
+#include <linux/sched.h>	/* i386 definition of init_mm */
+#include ...
From: Peter Zijlstra
Date: Thursday, November 8, 2007 - 5:34 am

Why a bitmap allocator and not a heap allocator?

Also, looking at the lock usage, this thing is not IRQ safe, so it

As I said in the previous mail (which due to creative mailing from your
end never made it out to the lists), I dislike those shouting macros.
Please lowercase them.


-

From: Peter Zijlstra
Date: Thursday, November 8, 2007 - 5:37 am

sed -i -e 's/CPU_OFFSET/cpu_offset/g' -e 's/CPU_PTR/cpu_ptr/g' \
    -e 's/CPU_ALLOC/cpu_alloc_type/g' -e 's/cpu_free/__cpu_free/g' \
    -e 's/CPU_FREE/cpu_free/g' -e 's/THIS_CPU/this_cpu/g' patches/*.patch

should get you there.

-

From: Christoph Lameter
Date: Thursday, November 8, 2007 - 11:33 am

From: Christoph Lameter
Date: Thursday, November 8, 2007 - 11:50 am

Well I went the other way and made it work like the slab allocators.


cpu_alloc: Make it irq safe

Use the same method as used in SLAB/SLUB to make the allocator interrupt safe.
Disable interrupts while allocator metadata is processed. Re-enable interrupts
during page allocator calls if __GFP_WAIT is set in the flags passed to the
allocator.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
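
The locking pattern described above can be modelled in user space as a rough sketch. Nothing here is kernel code: the interrupt flag and the lock are plain booleans, and toy_lock_irqsave, toy_unlock_irqrestore, toy_expand, toy_page_alloc and TOY_GFP_WAIT are invented stand-ins for spin_lock_irqsave(), spin_unlock_irqrestore(), expand_cpu_area(), the page allocator and __GFP_WAIT.

```c
#include <stdbool.h>

#define TOY_GFP_WAIT 0x1u	/* hypothetical stand-in for __GFP_WAIT */

static bool toy_irqs_enabled = true;	/* models the CPU interrupt flag */
static bool toy_map_locked;		/* models cpu_alloc_map_lock */

/* State observed by the simulated page allocator call. */
static bool toy_seen_irqs_enabled;
static bool toy_seen_locked;

/* Save the interrupt state, disable interrupts, take the lock. */
static unsigned long toy_lock_irqsave(void)
{
	unsigned long flags = toy_irqs_enabled;

	toy_irqs_enabled = false;
	toy_map_locked = true;
	return flags;
}

/* Drop the lock and put the interrupt state back as it was. */
static void toy_unlock_irqrestore(unsigned long flags)
{
	toy_map_locked = false;
	toy_irqs_enabled = flags != 0;
}

/* Stand-in for a page allocator call that may sleep. */
static void toy_page_alloc(void)
{
	toy_seen_irqs_enabled = toy_irqs_enabled;
	toy_seen_locked = toy_map_locked;
}

/* Called with the lock held and interrupts off, returns the same way. */
static void toy_expand(unsigned int gfp)
{
	toy_map_locked = false;			/* spin_unlock */
	if (gfp & TOY_GFP_WAIT)
		toy_irqs_enabled = true;	/* local_irq_enable */

	toy_page_alloc();

	if (gfp & TOY_GFP_WAIT)
		toy_irqs_enabled = false;	/* local_irq_disable */
	toy_map_locked = true;			/* spin_lock */
}

/* Allocation path: metadata is only touched with interrupts disabled. */
static void toy_alloc(unsigned int gfp)
{
	unsigned long flags = toy_lock_irqsave();

	toy_expand(gfp);
	toy_unlock_irqrestore(flags);
}
```

The point of saving and restoring the flags, rather than unconditionally enabling interrupts on exit, is that a caller that already had interrupts disabled gets them back disabled.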

---
 mm/cpu_alloc.c |   16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/cpu_alloc.c
===================================================================
--- linux-2.6.orig/mm/cpu_alloc.c	2007-11-07 16:49:40.069701326 -0800
+++ linux-2.6/mm/cpu_alloc.c	2007-11-08 10:45:43.172294260 -0800
@@ -186,6 +186,8 @@ static int expand_cpu_area(gfp_t flags)
 		goto out;
 
 	spin_unlock(&cpu_alloc_map_lock);
+	if (flags & __GFP_WAIT)
+		local_irq_enable();
 
 	/*
 	 * Determine the size of the bit map needed
@@ -212,6 +214,8 @@ static int expand_cpu_area(gfp_t flags)
 			goto out;
 	}
 
+	if (flags & __GFP_WAIT)
+		local_irq_disable();
 	spin_lock(&cpu_alloc_map_lock);
 
 	/*
@@ -312,10 +316,11 @@ void *cpu_alloc(unsigned long size, gfp_
 	void *ptr;
 	int first;
 	unsigned long map_size;
+	unsigned long flags;
 
 	BUG_ON(gfpflags & ~(GFP_RECLAIM_MASK | __GFP_ZERO));
 
-	spin_lock(&cpu_alloc_map_lock);
+	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
 
 restart:
 	map_size = PAGE_SIZE << cpu_alloc_map_order;
@@ -358,7 +363,7 @@ restart:
 	units_free -= units;
 	__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
 
-	spin_unlock(&cpu_alloc_map_lock);
+	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
 
 	ptr = cpu_area + start * UNIT_SIZE;
 
@@ -372,7 +377,7 @@ restart:
 	return ptr;
 
 out_of_memory:
-	spin_unlock(&cpu_alloc_map_lock);
+	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
 	return NULL;
 }
 EXPORT_SYMBOL(cpu_alloc);
@@ -386,13 +391,14 @@ void cpu_free(void *start, unsigned long
 	int units = ...
From: Peter Zijlstra
Date: Thursday, November 8, 2007 - 1:19 pm

Nice.

-

From: David Miller
Date: Tuesday, November 13, 2007 - 4:15 am

From: Christoph Lameter <clameter@sgi.com>

Unfortunately, sparc64 fails to boot even with just this patch
applied.

The problem is that for the non-virtualized case this patch bloats up
the BSS section to be more than 8MB in size.  Sparc64 kernel images
cannot be more than 8MB in size total due to various boot loader and
firmware limitations.

I have NR_CPUS set to 64, but it can be up to 4096 on sparc64.

Yes, I could add virtualized area support to sparc64, but we cannot
impose this on every platform.

One thing you could do is simply use a vmalloc allocation in the
non-virtualized case.
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 2:40 pm

Yuck. That means adding more crappy code. The BSS limit of 8M is a bit 
strange though. Do other platforms have the same issue? 
-

From: Eric Dumazet
Date: Tuesday, November 13, 2007 - 2:58 pm

Maybe not so crappy, because even for i386 you might not use a strict 
vmalloc() implementation but at least reserve percpu space inside the 
vmalloc range (i.e. not use a dedicated area as your current patchset does).

This is because NR_CPUS defaults to 32 on i386 (with a limit of 256), so 
reserving 256*256KB = 64 MB of virtual space might be too much (this is half 
the typical vmalloc area).

The idea would be:

- Reserve an area of NR_CPUS*256KB inside vmalloc() space (but of course 
without allocating pages).

- Then for each non-possible cpu, 'release' its 256KB area and give it back to 
the vmalloc free areas pool.

Once you add all the needed helpers to mm/vmalloc.c, there is no need to use 
the BSS megablob anymore?
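
The arithmetic behind the proposal above can be sketched as follows. This only counts bytes of virtual address space; it does not touch vmalloc. The constants come from the message (256 CPUs, 256KB each), while the function names and the byte-per-cpu "possible" array are assumptions made for the example.

```c
#include <stdint.h>

#define TOY_NR_CPUS       256			/* i386 limit quoted above */
#define TOY_PER_CPU_BYTES (256u * 1024u)	/* 256KB per cpu */

/* Virtual space reserved up front for every cpu slot. */
static uint64_t toy_reserved_bytes(void)
{
	return (uint64_t)TOY_NR_CPUS * TOY_PER_CPU_BYTES;
}

/*
 * Virtual space still reserved after the areas of all non-possible
 * cpus have been handed back. possible[cpu] is nonzero for a cpu
 * that can ever come online.
 */
static uint64_t toy_bytes_after_release(const unsigned char *possible)
{
	uint64_t bytes = 0;
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (possible[cpu])
			bytes += TOY_PER_CPU_BYTES;
	return bytes;
}
```

With all 256 slots reserved the total is 64 MB, as stated above; on a machine where only four cpus are possible, releasing the rest leaves 1 MB reserved.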
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 3:00 pm

Well I think all of this can be avoided by simply copying the existing 
vmemmap helper functions and providing a virtual address for sparc64.

-

From: David Miller
Date: Tuesday, November 13, 2007 - 6:33 pm

From: Christoph Lameter <clameter@sgi.com>

I intend to do that in the end, but you miss my point.
Requiring this is unreasonable.

And nobody is going to do the virt stuff for platforms like sparc32.
And I do mean nobody.
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 3:02 pm

It defaults to 8 except if you use a NUMA system.

config NR_CPUS
        int "Maximum number of CPUs (2-255)"
        range 2 255
        depends on SMP
        default "32" if X86_NUMAQ || X86_SUMMIT || X86_BIGSMP || X86_ES7000
        default "8"

-

From: David Miller
Date: Tuesday, November 13, 2007 - 6:30 pm

From: Christoph Lameter <clameter@sgi.com>

sparc32 has the same limit.

I'm surprised this is your reaction instead of "oh damn, sorry
I bloated up the kernel image size by 8mb, I'll find a way to
fix that."
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 6:48 pm

Well I found a way to fix that and it's in the patch...

-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 3:20 pm

Other platforms do not have the 8MB restriction nor do they have so many 
processors.

Here is the draft of a virtual cpu area implementation for sparc64. Uses 
the VMEMMAP chunks:

---
 arch/sparc64/Kconfig          |   12 ++++++++++++
 arch/sparc64/mm/init.c        |   34 ++++++++++++++++++++++++++++++++++
 include/asm-sparc64/pgtable.h |    1 +
 3 files changed, 47 insertions(+)

Index: linux-2.6/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/init.c	2007-11-13 14:09:44.619500290 -0800
+++ linux-2.6/arch/sparc64/mm/init.c	2007-11-13 14:17:49.794860210 -0800
@@ -1697,6 +1697,40 @@ int __meminit vmemmap_populate(struct pa
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+int cpu_area_populate(void *start, unsigned long size, gfp_t flags, int node)
+{
+	unsigned long vstart = (unsigned long) start;
+	unsigned long vend = (unsigned long) (start + size);
+	unsigned long phys_start = (vstart - CPU_AREA_BASE);
+	unsigned long phys_end = (vend - CPU_AREA_BASE);
+	unsigned long addr = phys_start & VMEMMAP_CHUNK_MASK;
+	unsigned long end = VMEMMAP_ALIGN(phys_end);
+	unsigned long pte_base;
+
+	pte_base = (_PAGE_VALID | _PAGE_SZ4MB_4U |
+		    _PAGE_CP_4U | _PAGE_CV_4U |
+		    _PAGE_P_4U | _PAGE_W_4U);
+	if (tlb_type == hypervisor)
+		pte_base = (_PAGE_VALID | _PAGE_SZ4MB_4V |
+			    _PAGE_CP_4V | _PAGE_CV_4V |
+			    _PAGE_P_4V | _PAGE_W_4V);
+
+	for (; addr < end; addr += VMEMMAP_CHUNK) {
+		unsigned long *vmem_pp =
+			vmemmap_table + (addr >> VMEMMAP_CHUNK_SHIFT);
+		void *block;
+
+		if (!(*vmem_pp & _PAGE_VALID)) {
+			block = vmemmap_alloc_block(1UL << 22, flags, node);
+			if (!block)
+				return -ENOMEM;
+
+			*vmem_pp = pte_base | __pa(block);
+		}
+	}
+	return 0;
+}
+
 static void prot_init_common(unsigned long page_none,
 			     unsigned long page_shared,
 			     unsigned long page_copy,
Index: ...
From: David Miller
Date: Tuesday, November 13, 2007 - 6:36 pm

From: Christoph Lameter <clameter@sgi.com>

sparc32 has the same limitations, nobody is going to implement

This doesn't avoid the core problem.  Bloating up the BSS like
that is bad, and enforcing a virt implementation to avoid that
is an anti-social way to go about implementing this feature.
-

From: David Miller
Date: Tuesday, November 13, 2007 - 6:37 pm

BTW, I'm going to stop testing your patches on sparc64 for
a while until you start to make me feel like you understand
that ignoring the BSS bloat issue is bad.
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 6:50 pm

Well this is just the fallback. How can I avoid this and still keep a 
constant? Add a new segment to vmlinux.lds.S?
 
-

From: David Miller
Date: Tuesday, November 13, 2007 - 7:00 pm

From: Christoph Lameter <clameter@sgi.com>

I'm not so sure.

The idea about doling out vmalloc space seemed the most promising.
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 7:05 pm

Well that is basically the same as the virtual mode. Just ditch the 
fallback mode? vmalloc directly does not guarantee a fixed address.

-

From: Andi Kleen
Date: Tuesday, November 13, 2007 - 6:06 pm

I recently ran into a similar problem with x86-64 and large BSS from
lockdep conflicting with a 16MB kdump kernel. Solution was to do
another early allocator before bootmem and then move the tables into
there.

-Andi

-

From: David Miller
Date: Tuesday, November 13, 2007 - 6:52 pm

From: Andi Kleen <andi@firstfloor.org>

Yes, I've run into similar problems with lockdep as well.
I had to build an ultra minimalized kernel to get it to
boot on my Niagara boxes.

I think I even looked at the same lockdep code, and I'd
appreciate it if you'd submit your fix for this if you
haven't already.
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 6:57 pm

Hmmmm. cpu_alloc really does not need zeroed data. Just an address fixed 
by the compiler where stuff can be put. Can the loader do that somehow?


-

From: David Miller
Date: Tuesday, November 13, 2007 - 7:01 pm

From: Christoph Lameter <clameter@sgi.com>

Yes, and I think IA64 uses such a scheme for its 64KB fixed
per-cpu TLB mapping thing, doesn't it?
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 7:03 pm

The per cpu TLB mapping is virtually mapped. The real memory allocation 
behind it occurs dynamically from bootmem.

-

From: Andi Kleen
Date: Tuesday, November 13, 2007 - 7:28 pm

ftp://firstfloor.org/pub/ak/x86_64/quilt/patches/early-reserve
ftp://firstfloor.org/pub/ak/x86_64/quilt/patches/early-alloc 
ftp://firstfloor.org/pub/ak/x86_64/quilt/patches/lockdep-early-alloc

I didn't plan to submit it for .24, just .25. Or do you need it 
urgently?

Also it would require you to write a sparc specific arch_early_alloc()
of course.  I've only done the x86-64 version.

-Andi
-

From: David Miller
Date: Tuesday, November 13, 2007 - 8:48 pm

From: Andi Kleen <andi@firstfloor.org>


I'll be sure to take care of that when it hits .25

Thanks Andi.
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 8:49 pm

[Empty message]
From: Peter Zijlstra
Date: Friday, November 16, 2007 - 3:23 am

Would've been nice to have heard about this lockdep problem. Anyway,
thanks for tackling it.

How about moving this bit:

+#ifndef ARCH_HAS_EARLY_ALLOC
+#define LARGEVAR(x,y) { static typeof(*x) __ ## x[y];  x = __ ## x; }
+#else
+#define LARGEVAR(x,y) x = arch_early_alloc(sizeof(*x) * y)
+#endif

out of the lockdep code and into the generic early alloc code?
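
For readers unfamiliar with the macro, here is a small user-space sketch of the two variants. It is an illustration under assumptions, not the code from the early-alloc patches: toy_early_alloc is an invented bump allocator standing in for arch_early_alloc(), TOY_HAS_EARLY_ALLOC stands in for ARCH_HAS_EARLY_ALLOC, and the static backing array is named x_storage instead of __x to stay out of the reserved identifier namespace.

```c
#include <stddef.h>

/* Hypothetical early pool; the union forces suitable alignment. */
static union {
	unsigned char bytes[4096];
	long double align;
} toy_early_pool;
static size_t toy_early_used;

/* Bump allocator: hand out 16-byte aligned chunks, never free them. */
static void *toy_early_alloc(size_t bytes)
{
	size_t aligned = (bytes + 15) & ~(size_t)15;
	void *p;

	if (aligned > sizeof(toy_early_pool.bytes) - toy_early_used)
		return NULL;
	p = toy_early_pool.bytes + toy_early_used;
	toy_early_used += aligned;
	return p;
}

#ifdef TOY_HAS_EARLY_ALLOC
/* Table memory comes from the early allocator, not from BSS. */
#define TOY_LARGEVAR(x, y) \
	do { (x) = toy_early_alloc(sizeof(*(x)) * (y)); } while (0)
#else
/* Fallback: a static array, which is what inflates BSS. */
#define TOY_LARGEVAR(x, y) \
	do { static __typeof__(*(x)) x ## _storage[y]; (x) = x ## _storage; } while (0)
#endif

/* A table that used to be a plain static array. */
static int *toy_table;

static void toy_init(void)
{
	TOY_LARGEVAR(toy_table, 128);
}
```

Callers only see the pointer toy_table, so moving the macro into generic code would let any large static table switch between the two backing stores without touching its users.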

-

From: Andi Kleen
Date: Friday, November 16, 2007 - 4:44 am

Will do.

-Andi
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 9:15 pm

Hmmm.. I have v2 in preparation here that puts the pda and the per cpu 
data into the cpu_alloc area. Thus gs: can be used to access all per cpu 
data.

Any ideas how to abstract out the pda operations? Wasn't local_t supposed 
to be able to do atomic ops on cpu data? Is there a segment register 
version of local_t? I also want to have cmpxchg, xchg etc. that are all 
atomic without requiring any interrupt disable or preempt disable.

cpu_alloc allows pointer arithmetic on cpu area pointers. The segment 
prefix can then be used to select the appropriate area.
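
A user-space sketch of what pointer arithmetic on cpu area pointers means. The per cpu areas are modelled as equally sized, contiguous blocks, and a pointer into cpu 0's block is turned into the matching pointer for another cpu by adding that cpu's offset. On x86 the segment prefix would supply this offset in hardware; here it is added in software. All names and sizes (toy_cpu_area, toy_cpu_ptr, TOY_AREA_SIZE) are assumptions for the example.

```c
#include <stddef.h>

#define TOY_NR_CPUS   4
#define TOY_AREA_SIZE 1024	/* hypothetical bytes per cpu area */

/* One block per cpu, laid out back to back. */
static long toy_cpu_area[TOY_NR_CPUS][TOY_AREA_SIZE / sizeof(long)];

/* Translate a pointer into cpu 0's area to the same object on 'cpu'. */
static void *toy_cpu_ptr(void *cpu0_ptr, int cpu)
{
	return (unsigned char *)cpu0_ptr + (size_t)cpu * TOY_AREA_SIZE;
}

/* Add up one per cpu counter across all cpus. */
static long toy_counter_sum(long *cpu0_counter)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += *(long *)toy_cpu_ptr(cpu0_counter, cpu);
	return sum;
}
```

A single pointer value therefore names one object per cpu, which is why no array of per cpu pointers is needed.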

Guess I also need to add an arch configuration guide to V2 so that 
the other arches can do similar tricks, and emphasize that the static 
default that requires bss is only suitable for small systems.
-

From: David Miller
Date: Tuesday, November 13, 2007 - 9:18 pm

From: Christoph Lameter <clameter@sgi.com>

I'm going to be against your changes until you implement
a real fix for the BSS bloat problems.

It's worse than the per-cpu allocator we have now, much
worse.
-

From: David Miller
Date: Tuesday, November 13, 2007 - 9:21 pm

BTW linux-mm@vger.kernel.org does not exist, please remove
it from the CC: in the future :-)

Thanks.
-

From: Christoph Lameter
Date: Tuesday, November 13, 2007 - 9:26 pm

You need to configure cpu_alloc for your arch. So far you seem not 
to have had the time to understand how it works; otherwise you would 
not repeat these statements and ask me to implement what the 
patch already provides.

The only problem that I see so far is a communication problem. Thus we 
need more documentation.
-

From: David Miller
Date: Tuesday, November 13, 2007 - 10:53 pm

From: Christoph Lameter <clameter@sgi.com>

Fair enough, I'll look more closely at the next rev of
your patch set.
-

From: Christoph Lameter
Date: Thursday, November 15, 2007 - 11:49 am

Well there is an LWN article now that also claims that the cpu_alloc 
patchset requires a large bss space. Sigh. See

http://lwn.net/Articles/257828/

Not true! 44 bytes is reasonable.

christoph@stapp:~/linux-2.6$ size mm/cpu_alloc.o
   text    data     bss     dec     hex filename
   5625      36      44    5705    1649 mm/cpu_alloc.o

Need to separate out the virtualization into a separate patch in V2 to 
make clear that it is there.


-

From: Christoph Lameter
Date: Thursday, November 15, 2007 - 7:19 pm

I am running the same version that you also ran. The problem is that you 
did not configure the stuff properly for your box and I did not include a 
configuration for sparc64 since I did not know how it needed to be 
configured for sparc64. You ignored the patch for sparc64 that I provided 
to correct the problem.




-

From: David Miller
Date: Thursday, November 15, 2007 - 7:50 pm

From: Christoph Lameter <clameter@sgi.com>

If you're talking about the VMEMMAP thing, that patch didn't remove
the problem, it simply added optimizations for sparc64 so that you could
sweep the problem under the rug.

Sparc32 is still broken, as just one of several possible examples.
The BSS usage is still there for platforms that don't use VMEMMAP.

So again, the lwn.net report is accurate.
-

From: Christoph Lameter
Date: Thursday, November 15, 2007 - 7:55 pm

The virtual mapping of the cpu areas is used by the patch I 

I have not looked at sparc32 sorry. If you simply set up a couple of 

All MMU platforms can use the virtual mappings. The main use of the static 
configuration is for embedded systems.
-

From: Christoph Lameter
Date: Thursday, November 15, 2007 - 8:10 pm

There is no assembly code required. I overdid it in the patch that I sent 
you trying to make sparc64 use large mappings like x86_64 NUMA. You really 
do not need that. Look at the IA64 and i386 configurations. There is no C 
code required. The x86_64 code only adds some special C code for the NUMA 

VMEMMAP is something different from the cpu allocator. All MMU platforms 
have vmalloc support and you even suggested the use of vmalloc.
-

From: David Miller
Date: Thursday, November 15, 2007 - 8:17 pm

From: Christoph Lameter <clameter@sgi.com>

Ok, and like I said last time, I'll examine this more closely
when you spin your next version of these patches.

Please post them soon as I'm eager to test this stuff out for
you.

Thanks.
-

From: Christoph Lameter
Date: Thursday, November 15, 2007 - 8:19 pm

All the above statements are about the version of the patch that I 

Ok. Thanks. (You could test it with the current version.... Just edit 
Kconfig and try a few things and I could include the settings in the next 
release.)


-
