Hi all,

After merging the final tree, today's linux-next build (powerpc allnoconfig) failed like this:

mm/slub.c: In function 'alloc_kmem_cache_cpus':
mm/slub.c:2094: error: 'PERCPU_DYNAMIC_EARLY_SIZE' undeclared (first use in this function)
mm/slub.c:2094: error: (Each undeclared identifier is reported only once
mm/slub.c:2094: error: for each function it appears in.)
mm/slub.c:2094: error: bit-field '<anonymous>' width not an integer constant

Caused by commit e18d65f0500b95d8724b17d8ea9f1116cf390bbe ("slub: Remove static kmem_cache_cpu array for boot"). PERCPU_DYNAMIC_EARLY_SIZE is only defined for SMP (and only in linux/percpu.h which is not explicitly included). This build does not have CONFIG_SMP set.

I have used the version of the slab tree from next-20100823 for today.
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
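As a side note on the last error line: the "bit-field width" message suggests that the check at mm/slub.c:2094 expands to a bit-field whose width is computed from the condition, so an undeclared identifier breaks it twice. Below is a minimal userspace sketch of that style of compile-time assertion, guarded the way an SMP-only constant would need to be. `DEMO_BUILD_BUG_ON`, `DEMO_SMP`, `DEMO_EARLY_SIZE` and `struct demo_cpu` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical compile-time assertion: a true condition yields a
 * bit-field of width -1, which the compiler rejects. A condition that
 * is not an integer constant (for example one that uses an undeclared
 * identifier) is rejected as well.
 */
#define DEMO_BUILD_BUG_ON(cond) \
	((void)sizeof(struct { char pad; int : -!!(cond); }))

/* Stand-in for CONFIG_SMP; remove this define to mimic a UP build. */
#define DEMO_SMP 1

#ifdef DEMO_SMP
/* Only visible on "SMP", like the constant named in the report. */
#define DEMO_EARLY_SIZE (12 << 10)
#endif

struct demo_cpu {
	void *freelist;
	void *page;
};

/* Returns 1 when the size check was compiled in, 0 when it was skipped. */
static int demo_reserve_checked(void)
{
#ifdef DEMO_SMP
	DEMO_BUILD_BUG_ON(DEMO_EARLY_SIZE < 16 * sizeof(struct demo_cpu));
	return 1;
#else
	return 0;
#endif
}
```

Without the `#ifdef` around the assertion, removing `DEMO_SMP` would make the sketch fail to compile in much the same way as the report above.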
Thanks for the report. The problem should be fixed by this commit: http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=5792949c... --
It's not that easy. __alloc_percpu falls back to kzalloc() on UP and this can result in unique bootstrap problems with UP since the bootstrap array is no longer there. Does the UP kernel boot? Why did this ever build on my UP configuration tests? Hmmm... I only tested x86_64 UP. This was 32 bit UP I guess. --
No, I get this under kvm:

[ 0.000000] Linux version 2.6.36-rc2+ (penberg@tiger) (gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9) ) #103 Tue Aug 24 21:27:28 EEST 2010
[ 0.000000] Command line: notsc nolapic nosmp noacpi pci=conf1 earlyprintk=ttyS0,keep
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[ 0.000000] BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000f0000 - 00000000000fffff (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 0000000004000000 (usable)
[ 0.000000] console [earlyser0] enabled
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] DMI not present or invalid.
[ 0.000000] No AGP bridge found
[ 0.000000] last_pfn = 0x4000 max_arch_pfn = 0x400000000
[ 0.000000] x86 PAT enabled: cpu 0, old 0x0, new 0x7010600070106
[ 0.000000] CPU MTRRs all blank - virtualized system.
[ 0.000000] Scanning 1 areas for low memory corruption
[ 0.000000] modified physical RAM map:
[ 0.000000] modified: 0000000000000000 - 0000000000010000 (reserved)
[ 0.000000] modified: 0000000000010000 - 000000000009fc00 (usable)
[ 0.000000] modified: 000000000009fc00 - 00000000000a0000 (reserved)
[ 0.000000] modified: 00000000000f0000 - 00000000000fffff (reserved)
[ 0.000000] modified: 0000000000100000 - 0000000004000000 (usable)
[ 0.000000] init_memory_mapping: 0000000000000000-0000000004000000
[ 0.000000] ACPI Error: A valid RSDP was not found (20100702/tbxfroot-219)
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 0:1945621, boot clock
[ 0.000000] Zone PFN ranges:
[ 0.000000] DMA 0x00000010 -> 0x00001000
[ 0.000000] DMA32 0x00001000 -> 0x00100000
[ 0.000000] Normal empty
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[2] active PFN ranges
[ 0.000000] 0: 0x00000010 -> 0x0000009f
[ ...
alloc per cpu results in kmalloc, which fails. Tejun: Is there some way we could get a reserved per cpu area under UP instead of falling back to slab allocations during bootup? --
Hello, Eh... nasty. Maybe we can create an alloc_percpu_early() function which doesn't allow freeing of allocated memory and just redirects to bootmem on UP? -- tejun --
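To make the "no freeing" idea concrete, here is a minimal userspace sketch of an allocate-only (bump) allocator that carves aligned pieces out of a fixed reserve. It is only an illustration of the shape such an early allocator could take; `early_alloc`, `early_reserve` and `EARLY_RESERVE_SIZE` are hypothetical names, and a static array stands in for bootmem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size of the early reserve; a static array stands in for bootmem. */
#define EARLY_RESERVE_SIZE 4096

static unsigned char early_reserve[EARLY_RESERVE_SIZE];
static size_t early_used;	/* bytes handed out so far, including padding */

/*
 * Allocate-only: returns an aligned chunk from the reserve, or NULL when
 * the reserve is exhausted. There is deliberately no matching free.
 * align must be a power of two.
 */
static void *early_alloc(size_t size, size_t align)
{
	uintptr_t base = (uintptr_t)early_reserve;
	uintptr_t start;
	size_t offset;

	assert(align != 0 && (align & (align - 1)) == 0);

	/* Round the next free address up to the requested alignment. */
	start = (base + early_used + align - 1) & ~(uintptr_t)(align - 1);
	offset = (size_t)(start - base);

	if (offset > EARLY_RESERVE_SIZE || size > EARLY_RESERVE_SIZE - offset)
		return NULL;

	early_used = offset + size;
	return (void *)start;
}
```

Because nothing is ever returned to the reserve, the only state is a single offset, which is what makes this kind of allocator usable before any other allocator is up.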
Yeah, I was thinking about that too. Pekka --
Another solution is to allocate an order 1 (compound) page for each early cache. Then resize the kmalloc array after everything is up. kfree() will then redirect to the page allocator. But this is a slab allocator specific solution. We now have the situation that alloc_percpu can only be used on early boot on SMP machines. A general solution would be better I think. If the early alloc_percpu stuff would work consistently then it could also be used to avoid the boot_pageset in the page allocator f.e. Can we just get rid of the special UP case and just run the percpu subsystem even for UP? --
Hello, Yeah, maybe. Then we also can guarantee that percpu allocator always honors alignment (which wq code currently requires and papers over with similarly ugly workaround). It would add a mostly redundant allocator code tho. I'll look into how easily it can be done. Thanks. -- tejun --
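For context, the kind of alignment workaround mentioned above usually means over-allocating and aligning by hand when the underlying allocator does not honor the requested alignment. The following is a userspace sketch of that general technique, not the workqueue code itself; `alloc_aligned_workaround` and `free_aligned_workaround` are hypothetical names and `malloc()` stands in for the fallback allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Over-allocate, round the address up to the requested alignment, and
 * stash the original pointer just below the aligned block so it can be
 * freed later. align must be a power of two.
 */
static void *alloc_aligned_workaround(size_t size, size_t align)
{
	unsigned char *raw;
	uintptr_t start;

	assert(align != 0 && (align & (align - 1)) == 0);

	/* Keep the stashed pointer itself suitably aligned. */
	if (align < sizeof(void *))
		align = sizeof(void *);

	/* Payload + worst-case padding + room for the saved base pointer. */
	raw = malloc(size + align - 1 + sizeof(void *));
	if (!raw)
		return NULL;

	start = ((uintptr_t)raw + sizeof(void *) + align - 1) &
		~(uintptr_t)(align - 1);
	((void **)start)[-1] = raw;
	return (void *)start;
}

/* Free a block obtained from alloc_aligned_workaround(). */
static void free_aligned_workaround(void *ptr)
{
	if (ptr)
		free(((void **)ptr)[-1]);
}
```

Every caller that needs alignment has to carry this extra bookkeeping, which is the cost an allocator that honors alignment itself would remove.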
These functions are used only by percpu memory allocator on SMP.
Don't build them on UP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@kernel.dk>
---
So, something like these three patches.
Nick, can I route this one through percpu tree?
Thanks.
include/linux/vmalloc.h | 2 ++
mm/vmalloc.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index de05e96..3d510e8 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -115,10 +115,12 @@ extern rwlock_t vmlist_lock;
extern struct vm_struct *vmlist;
extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
+#ifdef CONFIG_SMP
struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
const size_t *sizes, int nr_vms,
size_t align, gfp_t gfp_mask);
void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
+#endif
#endif /* _LINUX_VMALLOC_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b7e314b..eb57a63 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2052,6 +2052,7 @@ void free_vm_area(struct vm_struct *area)
}
EXPORT_SYMBOL_GPL(free_vm_area);
+#ifdef CONFIG_SMP
static struct vmap_area *node_to_va(struct rb_node *n)
{
return n ? rb_entry(n, struct vmap_area, rb_node) : NULL;
@@ -2332,6 +2333,7 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
free_vm_area(vms[i]);
kfree(vms);
}
+#endif /* CONFIG_SMP */
#ifdef CONFIG_PROC_FS
static void *s_start(struct seq_file *m, loff_t *pos)
--
1.7.1
--
Reviewed-by: Christoph Lameter <cl@linux.com> --
In preparation of enabling percpu allocator for UP, reduce PCPU_MIN_UNIT_SIZE to 32k. On UP, the first chunk doesn't have to include static percpu variables and chunk size can be smaller which is important as UP percpu allocator will use contiguous kernel memory to populate chunks. PCPU_MIN_UNIT_SIZE also determines the maximum supported allocation size but 32k should still be enough.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/percpu.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 49466b1..fc8130a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -42,7 +42,7 @@
#ifdef CONFIG_SMP
/* minimum unit size, also is the maximum supported allocation size */
-#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(64 << 10)
+#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(32 << 10)
/*
* Percpu allocator can serve percpu allocations before slab is
--
1.7.1
--
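The arithmetic behind the one-line change can be sketched as follows, assuming PFN_ALIGN() rounds its argument up to the next page boundary. `demo_pfn_align` is a hypothetical userspace helper that takes the page size as a parameter instead of using the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of page_size (which must be a power of two). */
static unsigned long demo_pfn_align(unsigned long x, unsigned long page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (x + page_size - 1) & ~(page_size - 1);
}
```

Under that assumption, `demo_pfn_align(32 << 10, 4096)` gives 32768, half of the previous `demo_pfn_align(64 << 10, 4096)` value of 65536. With a 64k page size the rounded result would still be 65536, since the minimum cannot drop below one page.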
Reviewed-by: Christoph Lameter <cl@linux.com> --
On UP, percpu allocations were redirected to kmalloc. This has the following problems.

* For certain amount of allocations (determined by PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), percpu allocator can be used before the usual kernel memory allocator is brought online. On SMP, this is used to initialize the kernel memory allocator.

* percpu allocator honors alignment up to PAGE_SIZE but kmalloc() doesn't. For example, workqueue makes use of larger alignments for cpu_workqueues.

Currently, users of percpu allocators need to handle UP differently, which is somewhat fragile and ugly. Other than a small amount of memory, there isn't much to lose by enabling percpu allocator on UP. It can simply use kernel memory based chunk allocation which was added for SMP archs w/o MMUs.

This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and makes UP build use percpu-km. As percpu addresses and kernel addresses are always identity mapped and static percpu variables don't need any special treatment, nothing is arch dependent and mm/percpu.c implements generic setup_per_cpu_areas() for UP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
---
include/linux/percpu.h | 29 ++++-------------------
mm/Kconfig | 8 ++++++
mm/Makefile | 7 +----
mm/percpu-km.c | 2 +-
mm/percpu.c | 60 ++++++++++++++++++++++++++++++++++++++++++++---
mm/percpu_up.c | 30 ------------------------
6 files changed, 71 insertions(+), 65 deletions(-)
delete mode 100644 mm/percpu_up.c
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index fc8130a..aeeeef1 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -39,8 +39,6 @@
preempt_enable(); \
} while (0)
-#ifdef CONFIG_SMP
-
/* minimum unit size, also is the maximum supported allocation size */
#define ...
As much as I can review looks okay. Reviewed-by: Christoph Lameter <cl@linux.com> --
Acked-by: Pekka Enberg <penberg@kernel.org> Is this going into some public append-only branch I could cherry-pick the changeset from to my 'slub/cleanups' branch? Pekka --
Hello, I'll put it into percpu#for-next once Linus pulls in the currently pending set of fixes. I'll let you know. Thanks. -- tejun --
Alright, now in percpu#for-next tree. git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next The branch currently contains only those three patches, so please feel free to pull from it. Thanks. -- tejun --
Branch updated to fix build for s390 and add missing memory clearing for km-allocator. You'll at least want to pull up to commit fc1481a956181d0360d3eb129965302489895a1b. Thanks. -- tejun --
I cherry-picked patches from the branch and reverted SLUB bandaid. SLUB works fine here on UP now so I'm putting it in linux-next. Thanks Tejun! Pekka --
These functions are used only by percpu memory allocator on SMP.
Don't build them on UP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@kernel.dk>
---
(resending w/ Nick's new email address)
So, something like these three patches. Also available in the
following git tree.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git review-up
Nick, can I route this one through percpu tree?
Thanks.
include/linux/vmalloc.h | 2 ++
mm/vmalloc.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index de05e96..3d510e8 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -115,10 +115,12 @@ extern rwlock_t vmlist_lock;
extern struct vm_struct *vmlist;
extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
+#ifdef CONFIG_SMP
struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
const size_t *sizes, int nr_vms,
size_t align, gfp_t gfp_mask);
void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
+#endif
#endif /* _LINUX_VMALLOC_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b7e314b..eb57a63 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2052,6 +2052,7 @@ void free_vm_area(struct vm_struct *area)
}
EXPORT_SYMBOL_GPL(free_vm_area);
+#ifdef CONFIG_SMP
static struct vmap_area *node_to_va(struct rb_node *n)
{
return n ? rb_entry(n, struct vmap_area, rb_node) : NULL;
@@ -2332,6 +2333,7 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
free_vm_area(vms[i]);
kfree(vms);
}
+#endif /* CONFIG_SMP */
#ifdef CONFIG_PROC_FS
static void *s_start(struct seq_file *m, loff_t *pos)
--
1.7.1
--
The whole early alloc stuff also does not go over too well with the page allocator (never had trouble on KVM):

Decompressing Linux... Parsing ELF... done.
Booting the kernel.
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.36-rc2 (root@rd-rsync) (gcc version 4.4.4 (Debian 4.4.4-5) ) #1 SMP Wed Aug 25 13:26:59 CDT 2010
Command line: ro root=/dev/mapper/vgubuntu-root console=tty0 console=ttyS1,57600 idle=mwait earlyprintk=ttyS1,57600
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
BIOS-e820: 0000000000100000 - 00000000cf699000 (usable)
BIOS-e820: 00000000cf699000 - 00000000cf6af000 (reserved)
BIOS-e820: 00000000cf6af000 - 00000000cf6ce000 (ACPI data)
BIOS-e820: 00000000cf6ce000 - 00000000d0000000 (reserved)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fe000000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 00000001b0000000 (usable)
bootconsole [earlyser0] enabled
NX (Execute Disable) protection: active
DMI 2.6 present.
No AGP bridge found
last_pfn = 0x1b0000 max_arch_pfn = 0x400000000
last_pfn = 0xcf699 max_arch_pfn = 0x400000000
found SMP MP-table at [ffff8800000fe710] fe710
init_memory_mapping: 0000000000000000-00000000cf699000
init_memory_mapping: 0000000100000000-00000001b0000000
RAMDISK: 37b98000 - 37ff0000
ACPI: RSDP 00000000000f1630 00024 (v02 DELL )
ACPI: XSDT 00000000000f1734 0009C (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: FACP 00000000cf6c3f9c 000F4 (v03 DELL PE_SC3 00000001 DELL 00000001)
ACPI: DSDT 00000000cf6af000 0320F (v01 DELL PE_SC3 00000001 INTL 20050624)
ACPI: FACS 00000000cf6c6000 00040
ACPI: APIC 00000000cf6c3478 0015E (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: SPCR 00000000cf6c35d8 00050 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: HPET 00000000cf6c362c 00038 (v01 DELL PE_SC3 00000001 DELL 00000001)
ACPI: DM__ 00000000cf6c3668 001A8 (v01 DELL PE_SC3 00000001 ...
Err... Sorry, there was a patch under testing in there (remove the boot pageset from the page allocator and use early allocpercpu) which resulted in the reserve area becoming a bit tight. Sorry. --
Hi Pekka, A small point: I did *not* review that patch. I *did* report the bug (and seemingly provided the commit message). Thanks for the build fix. -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/
I dropped slub cleanups from -next until we've resolved the UP boot time crash problem. --
It works here with this patch. I'd rather see a general solution so that
we can use allocpercpu for bootstrapping any allocator without such
special bandaids.
Subject: Slub: UP bandaid
Since the percpu allocator does not provide early allocation in UP
mode (only in SMP configurations) use __get_free_page() to improvise
a compound page allocation that can be later freed via kfree().
Compound pages will be released when the cpu caches are resized.
Signed-off-by: Christoph Lameter <cl@linux.com>
---
mm/slub.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-08-24 20:07:12.766010774 -0500
+++ linux-2.6/mm/slub.c 2010-08-24 20:15:46.304130417 -0500
@@ -2064,8 +2064,24 @@
static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
{
+#ifdef CONFIG_SMP
+ /*
+ * Will use reserve that does not require slab operation during
+ * early boot.
+ */
BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
+#else
+ /*
+ * Special hack for UP mode. allocpercpu() falls back to kmalloc
+ * operations. So we cannot use that before the slab allocator is up
+ * Simply get the smallest possible compound page. The page will be
+ * released via kfree() when the cpu caches are resized later.
+ */
+ if (slab_state < UP)
+ s->cpu_slab = (__percpu void *)__get_free_page(GFP_NOWAIT, 1);
+ else
+#endif
s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
--
Right. Patch not refreshed after the last tinzy winzy change. Seems that I
got a bit rusty.
Subject: Slub: UP bandaid
Since the percpu allocator does not provide early allocation in UP
mode (only in SMP configurations) use __get_free_page() to improvise
a compound page allocation that can be later freed via kfree().
Compound pages will be released when the cpu caches are resized.
Signed-off-by: Christoph Lameter <cl@linux.com>
---
mm/slub.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-08-25 09:00:59.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-08-25 10:59:30.000000000 -0500
@@ -2091,8 +2091,24 @@ init_kmem_cache_node(struct kmem_cache_n
static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
{
+#ifdef CONFIG_SMP
+ /*
+ * Will use reserve that does not require slab operation during
+ * early boot.
+ */
BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
+#else
+ /*
+ * Special hack for UP mode. allocpercpu() falls back to kmalloc
+ * operations. So we cannot use that before the slab allocator is up
+ * Simply get the smallest possible compound page. The page will be
+ * released via kfree() when the cpu caches are resized later.
+ */
+ if (slab_state < UP)
+ s->cpu_slab = (__percpu void *)__get_free_pages(GFP_NOWAIT, 1);
+ else
+#endif
s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
--
Acked-by: David Rientjes <rientjes@google.com> I'm really hoping that we can remove this hack soon when the percpu allocator can handle these allocations on UP without any specialized slab behavior. --
So do I. Here is a slightly less hacky version through using
kmalloc_large instead:
Subject: Slub: UP bandaid
Since the percpu allocator does not provide early allocation in UP
mode (only in SMP configurations) use __get_free_page() to improvise
a compound page allocation that can be later freed via kfree().
Compound pages will be released when the cpu caches are resized.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
---
mm/slub.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-08-26 09:19:35.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-08-26 09:36:29.000000000 -0500
@@ -2103,8 +2103,24 @@ init_kmem_cache_node(struct kmem_cache_n
static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
{
+#ifdef CONFIG_SMP
+ /*
+ * Will use reserve that does not require slab operation during
+ * early boot.
+ */
BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
+#else
+ /*
+ * Special hack for UP mode. allocpercpu() falls back to kmalloc
+ * operations. So we cannot use that before the slab allocator is up
+ * Simply get the smallest possible compound page. The page will be
+ * released via kfree() when the cpu caches are resized later.
+ */
+ if (slab_state < UP)
+ s->cpu_slab = (__percpu void *)kmalloc_large(PAGE_SIZE << 1, GFP_NOWAIT);
+ else
+#endif
s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
--
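A note on the sizes used across the three versions of the bandaid: an order-1 request covers two pages, which is the same amount as the `PAGE_SIZE << 1` passed to kmalloc_large() in the last version, and a compound page needs at least two pages. The size-to-order relationship can be sketched in userspace as below; `demo_get_order` is a hypothetical helper that takes the page size as a parameter and is not the kernel's get_order().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) covers size bytes.
 * An order-N allocation spans 2^N contiguous pages.
 */
static unsigned int demo_get_order(size_t size, size_t page_size)
{
	unsigned int order = 0;

	assert(size != 0 && page_size != 0);

	while ((page_size << order) < size)
		order++;
	return order;
}
```

With 4k pages, `demo_get_order(4096 << 1, 4096)` evaluates to 1, so the second and third versions of the patch request the same amount of memory and differ only in the call used to get it.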
