our x86.git randconfig auto-qa found a mm/slab.c early-bootup crash in mainline that got introduced since v2.6.24.

 http://redhat.com/~mingo/misc/log-Thu_Apr_10_10_41_16_CEST_2008.bad
 http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad

Note, the very same bzImage does not crash on other testboxes - only on this 8-way box with 4GB of RAM.

i tried a "use v2.6.24's slab.c" revert (with a few API fixes needed for it to build on .25) but that didn't solve the problem either.

i tried a bisection yesterday but it didn't work out too well - a combination of block layer (?) and networking regressions made it impossible. Here's the list of "good" bisection points since v2.6.24 (from multiple bisection runs):

 0773769191d943358a8392fa86abd756d004c4b6
 21af0297c7e56024a5ccc4d8ad2a590f9ec371ba
 26b8256e2bb930a8e4d4d10aa74950d8921376b8
 2a10e7c41254941cac87be1eccdcb6379ce097f5
 3aa88cdf6bcc9e510c0707581131b821a7d3b7cb
 49914084e797530d9baaf51df9eda77babc98fa8
 53a6e2342d73d509318836e320f70cd286acd69c
 5be3bda8987b12a87863c89b74b136fdb1f072db
 6d5f718a497375f853d90247f5f6963368e89803
 7272dcd31d56580dee7693c21e369fd167e137fe
 77de2c590ec72828156d85fa13a96db87301cc68
 82cfbb008572b1a953091ef78f767aa3ca213092
 b75f53dba8a4a61fda1ff7e0fb0fe3b0d80e0c64
 c087567d3ffb2c7c61e091982e6ca45478394f1a
 d4b37ff73540ab90bee57b882a10b21e2f97939f
 fde1b3fa947c2512e3715962ebb1d3a6a9b9bb7d

the "bad" bisection points where i saw a slab.c crash were:

 7180c4c9e09888db0a188f729c96c6d7bd61fa83
 7fa2ac3728ce828070fa3d5846c08157fe5ef431

this still leaves a rather large set of commits:

 Bisecting: 1874 revisions left to test after this

and the mm/ bits alone look voluminous:

 $ git-bisect visualize -p -- mm | diffstat | tail -1
 106 files changed, 67759 insertions(+), 20852 deletions(-)

Ingo

---------------->
Subject: slab: revert
From: Ingo Molnar <mingo@elte.hu>
Date: Thu Apr 10 11:04:16 CEST 2008
Signed-off-by: Ingo Molnar ...
Hi Ingo, As mentioned privately, I suspect it's the page allocator changes that went into 2.6.24. Mel, Christoph, any ideas? --
So I'm thinking it's probably related to this patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=523b94... As kmalloc_node() in setup_cpu_cache() returns NULL, it seems likely to be due to the use of GFP_THISNODE in cache_alloc_refill() when calling cache_grow() and that the semantics changed. No idea why page allocator would think your UMA "local node" has no memory though. Pekka --
but ... as i said it in my report, this is a regression since v2.6.24 - v2.6.24 (and a whole bunch of commits since then, i listed the IDs) booted up fine. The commit ID you mention is: v2.6.23-4345-g523b945, way earlier than the good commit IDs. so this is a recent regression. Ingo --
Hi Ingo. Right. Then you probably want to look into any changes in arch/x86/ related to setting up the zonelists. I'm fairly certain this is not a slab bug and I don't see any recent changes to the page allocator either that would explain this. --
I'd be willing to put some money on this: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b7ad14... --
And I'd lose as you're 32-bit. Oh well, that's the price to pay for pretending to know x86 arch internals. --
yeah, sorry - we are working hard to unify generic bits like that, but it's a huge architecture.

btw., i always felt that the zone/memory setup is rather fragile and ad-hoc in places and it trusts the architecture code too much. Just in the .25 cycle i've seen about a dozen bugs all around that thing.

I believe we should work on making the info that an architecture feeds to the MM "fool proof" - i.e. sanity-check for overlaps and other common setup errors. It is easy for an architecture to mess up those things... Especially on oddball systems that are too large or too small to be normally tested. It's a common, recurring bug pattern that we could avoid by being a bit more resilient.

if this is a zone setup bug then a sanity-check could catch it right where it happens - not much later in the slab code or so.

Ingo --
BTW. I think I'm seeing some problems perhaps related to the change page attr stuff for DEBUG_PAGEALLOC on x86-64. And, I don't know if it is the same thing, some general instability around either the page allocator or slab allocator.

The debug pagealloc problem seems to be that a thread suddenly gets stuck in the kernel spinning in cpa (usually on one of the locks) and never seems to recover. Once it seemed to be spinning in clear_page_... too, but perhaps it could be messing up the page attributes and running so slowly that it just appears to be hanging? I'll try to get more info here but it is hard to reproduce.

The general instability -- I've just seen an oops or two in the page allocation path in slub recently. Nothing reportable because I've been running my own patches and/or been unable to reproduce... but it is a bit unusual and I'll keep an eye out.

Anyway, I'd suggest cooking this kernel a bit longer before release... --
Could you post the zone setup of the system that fails? A memory map would be useful and full dmesg output up to the failure. --
Pekka pointed out the orig post that had the boot info. One commonality in the two failures inhouse and yours is that both had more than 4G memory and had CONFIG_HIGHMEM4G set. Possibly a wraparound of some limit? --
Ingo, does your old 8 socket system work with 64bit kernel? YH --
it's all in my bugreport you are replying to: http://lkml.org/lkml/2008/4/11/34 it's a full dmesg up to the failure, which starts with the memory map of the system ... Ingo --
I hadn't realised that such setup errors were common. It should be already able to handle some overlapping problems in add_active_range(). I'm playing catch-up here but looking at your dmesg output, I see the following snippets. [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f800 (usable) [ 0.000000] BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved) [ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved) [ 0.000000] BIOS-e820: 0000000000100000 - 00000000efff8000 (usable) [ 0.000000] BIOS-e820: 00000000efff8000 - 00000000f0000000 (ACPI data) There are two portions of usable memory with a few holes there. [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved) [ 0.000000] BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved) [ 0.000000] BIOS-e820: 0000000100000000 - 0000000110000000 (usable) And is memory over the 4GB boundary but.... [ 0.000000] Warning only 4GB will be used. [ 0.000000] Use a HIGHMEM64G enabled kernel. [ 0.000000] Entering add_active_range(0, 0, 1048576) 0 entries of 256 used It's recognised and only memory below 4GB is registered and it's all on node 0. However, I do note that it also registers all the holes as valid memory. The memory should never get freed because it should be reserved during boot by reserve_bootmem() but it still raises an eyebrow. [ 0.000000] early_node_map[1] active PFN ranges [ 0.000000] 0: 0 -> 1048576 [ 0.000000] On node 0 totalpages: 1048576 [ 0.000000] DMA zone: 32 pages used for memmap [ 0.000000] DMA zone: 0 pages reserved [ 0.000000] DMA zone: 4064 pages, LIFO batch:0 [ 0.000000] Normal zone: 1760 pages used for memmap [ 0.000000] Normal zone: 223520 pages, LIFO batch:31 [ 0.000000] HighMem zone: 6400 pages used for memmap [ 0.000000] HighMem zone: 812800 pages, LIFO batch:31 [ 0.000000] ...
32-bit does memory_present() calls to register all RAM - and those calls are correct (they do not include holes) and the resulting sparse memory section layout looks correct, and all the mem_map[] chunk allocations succeed as well. furthermore, when freeing memory from bootmem allocator into the buddy allocator we consult the e820 map again via a page_is_ram() call, so we make sure holes do not end up in the memory map and in the free page yep. I did a few extra printouts to make sure, but came to the same conclusion. The system boots fine with the same config on v2.6.24. Ingo --
you asked me to run with the debug patch attached below. I just tried vanilla -rc9 (head 120dd64cacd4fb7) and it still crashes with this config:

 http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9

debug output is:

 http://redhat.com/~mingo/misc/log-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9

so it's probably the first few page allocations (setup_cpu_cache()) going wrong already - suggesting some fundamental borkage in SLAB?

note, when i change SLAB to SLUB (and keep the config unchanged otherwise), i get a similar early crash:

 http://redhat.com/~mingo/misc/log-Tue_Apr_15_07_24_59_CEST_2008.bad
 http://redhat.com/~mingo/misc/config-Tue_Apr_15_07_24_59_CEST_2008.bad

i've also uploaded a bzImage (SLUB, debug patch not applied) that you can pick up and run on any 32-bit test-system:

 http://redhat.com/~mingo/misc/bzImage-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9

it's a relatively generic bzImage that should boot on most whitebox PCs on most distros as long as you use a pure ext3 setup and might even give you networking (no modules or initrd is needed). It boots fine on two other 32-bit PCs i have (an Intel laptop and an AMD desktop).

Ingo

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1485,6 +1485,7 @@ restart:
* Happens if we have an empty zonelist as a result of
* GFP_THISNODE being used on a memoryless node
*/
+ WARN_ON(1);
return NULL;
}
Index: linux/mm/slab.c
===================================================================
--- linux.orig/mm/slab.c
+++ linux/mm/slab.c
@@ -1682,6 +1682,7 @@ static void *kmem_getpages(struct kmem_c
flags |= __GFP_RECLAIMABLE;
page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+ WARN_ON(!page);
if (!page)
return NULL;
@@ -2620,6 +2621,7 @@ static struct slab *alloc_slabmgmt(struc
/* Slab management obj is off-slab. */
...
I think it's still pointing to the page allocator and/or setting up
...especially considering you have similar crash with SLUB as well.
Now this:
[ 0.000999] ------------[ cut here ]------------
[ 0.000999] WARNING: at mm/slab.c:1685 cache_alloc_refill+0x2a6/0x4a3()
[ 0.000999] Pid: 0, comm: swapper Not tainted 2.6.25-rc9 #924
[ 0.000999] [<c0121b6f>] warn_on_slowpath+0x3c/0x4c
[ 0.000999] [<c0781873>] ? _spin_unlock_irqrestore+0xf/0x13
[ 0.000999] [<c02941ad>] ? delay_tsc+0x2e/0x4e
[ 0.000999] [<c029414d>] ? __delay+0x9/0xb
[ 0.000999] [<c0353db3>] ? serial8250_console_putchar+0x80/0x86
[ 0.000999] [<c0148822>] ? get_page_from_freelist+0x230/0x345
[ 0.000999] [<c0121eb1>] ? __call_console_drivers+0x56/0x63
[ 0.000999] [<c01489bb>] ? __alloc_pages+0x6e/0x2be
[ 0.000999] [<c015bd2e>] cache_alloc_refill+0x2a6/0x4a3
[ 0.000999] [<c015ba3f>] kmem_cache_alloc+0x5b/0xa4
Says that alloc_pages_node() returned NULL early on in the boot.
However, GFP_THISNODE is ruled out as this:
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1485,6 +1485,7 @@ restart:
* Happens if we have an empty zonelist as a result of
* GFP_THISNODE being used on a memoryless node
*/
+ WARN_ON(1);
return NULL;
}
does not trigger. Hmm...
--
i did a .config bisection and it pinpointed CONFIG_SPARSEMEM=y as the culprit. Changing it to FLATMEM gives a correctly booting system.

if you look at the good versus bad bootup log:

 http://redhat.com/~mingo/misc/log-Tue_Apr_15_07_24_59_CEST_2008.good
 http://redhat.com/~mingo/misc/log-Tue_Apr_15_07_24_59_CEST_2008.bad

(both SLUB) you'll see that the zone layout provided by the architecture code is _exactly_ the same and looks sane as well. So this is not an architecture zone layout bug, this is probably sparsemem setup (and/or the page allocator) getting confused by something.

why are there no good debug logs possible in this area? To debug such bugs we'd need an early dump of the precise layout of all memory maps, what points where, how large it is, where it is allocated - and then compare it with how the rest of the system is laid out - looking at possible overlaps or other bugs.

This 8-way box is a pain to debug on, it takes a long time to boot it up, etc. etc.

Ingo --
so the same config on 64 bit with SLUB works and only 32bit is broken? or is 2.6.24 with 32bit + sparse + slub broken already? YH --
i've done a revert of the page allocator to v2.6.24 status (with fixes
on top to make it work on .25 infrastructure), via the patch below - but
this didn't change the problem.
i also doubled the sparse mem_map[] allocations on the theory that they
might overflow - but that didn't solve the crash either.
Ingo
------------------------>
Subject: revert: page alloc
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Apr 15 10:44:34 CEST 2008
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/gfp.h | 2
include/linux/mmzone.h | 2
mm/page_alloc.c | 169 ++++++++++++++++++++++---------------------------
mm/vmstat.c | 61 ++++++++---------
4 files changed, 110 insertions(+), 124 deletions(-)
Index: linux/include/linux/gfp.h
===================================================================
--- linux.orig/include/linux/gfp.h
+++ linux/include/linux/gfp.h
@@ -227,7 +227,5 @@ extern void free_cold_page(struct page *
void page_alloc_init(void);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
-void drain_all_pages(void);
-void drain_local_pages(void *dummy);
#endif /* __LINUX_GFP_H */
Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -113,7 +113,7 @@ struct per_cpu_pages {
};
struct per_cpu_pageset {
- struct per_cpu_pages pcp;
+ struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */
#ifdef CONFIG_NUMA
s8 expire;
#endif
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -19,7 +19,6 @@
#include <linux/swap.h>
#include <linux/interrupt.h>
#include <linux/pagemap.h>
-#include <linux/jiffies.h>
#include <linux/bootmem.h>
#include <linux/compiler.h>
#include <linux/kernel.h>
@@ -44,7 +43,6 @@
#include <linux/backing-dev.h>
#include ...

Well, I think it suggests some fundamental borkage in the page allocator.

That first warn-on is from the "alloc_pages_node()" returning NULL at bootup. Sure, it could be that the arguments are bogus, but that sounds unlikely since none of that is dependent on any kconfig stuff. The fact that it happens with both SLUB/SLAB makes that even more obvious.

Now, you don't have fault injection on, so it can't be that, and your debug entry for *z == NULL didn't trigger in alloc_pages, so it's not that one either.

However, if __alloc_pages() failed, I would have expected to see the "memory allocation failed" printk. Why didn't it? Is printk_ratelimit() broken at boot (last_msg starts out as zero - maybe it should start out as a negative number)?

Linus --
btw., now with a second full day spent on this regression, i have figured out a workaround the hard way: increasing SECTION_SIZE_BITS in include/asm-x86/sparsemem.h from 26 to 27 makes it go away. (i.e. we use section chunks of 128 MB instead of 64 MB before) I've given up on analyzing the crash site - it seems rather random and uninformative and just suggests page allocator borkage. So this seems like a general sparsemem borkage. PAE uses a shift of 30 due to page->flags shortage (which masks this bug), 64-bit uses 27 which too probably masks this bug. Since this is a !NUMA config and !PAE as well, NODES_SHIFT is 0, ZONES_SHIFT is 2, so the theory of running out of bits in page->flags is wrong as well. I also tried a hack to double the size of all sparsemem mem_map allocations (on the theory of an overflow there) - but it didnt help. So i think we need to go down further into the page allocator. Perhaps the buddy bitmaps are wrongly sized somewhere. I'm grasping at straws. Btw., Mel Gorman has reproduced crashes with my bzImage on his box (and a hang with my config, using his build), so i think we can eliminate hw and build environment specialities as a cause. Ingo --
Interesting. I wonder.. So since you don't have NUMA, you have NODES_SHIFT == 0. That in turn means that NODE_NOT_IN_PAGE_FLAGS is _not_ set. That, in turn, means that ZONEID_SHIFT does *not* contain SECTIONS_SHIFT. Is that really what is supposed to happen? Because then "page_is_buddy()" will not even test the section, as far as I can tell. But I'm probably missing something. Why would we not need to test the section in page_zone_id() when the node ID is in the page flags (but has zero size)? Linus --
still crashes with the patch below - find the crash-log further below. (the kernel has a few more non-destructive debug printouts and debug checks included as well, which you can see in the log, but it's a vanilla kernel otherwise.) Ingo -----------------------> Subject: nodes: shift fix From: Ingo Molnar <mingo@elte.hu> Date: Tue Apr 15 21:15:21 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/mm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -424,7 +424,7 @@ static inline void set_compound_order(st * We are going to use the flags for the page to node mapping if its in * there. This includes the case where there is no node, so it is implicit. */ -#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0) +#if NODES_WIDTH <= 0 || NODES_SHIFT == 0 #define NODE_NOT_IN_PAGE_FLAGS #endif [ 0.000000] Linux version 2.6.25-rc9 (mingo@dione) (gcc version 4.2.2) #960 SMP Tue Apr 15 21:16:23 CEST 2008 [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f800 (usable) [ 0.000000] BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved) [ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved) [ 0.000000] BIOS-e820: 0000000000100000 - 00000000efff8000 (usable) [ 0.000000] BIOS-e820: 00000000efff8000 - 00000000f0000000 (ACPI data) [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved) [ 0.000000] BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved) [ 0.000000] BIOS-e820: 0000000100000000 - 0000000110000000 (usable) [ 0.000000] console [earlyser0] enabled [ 0.000000] Warning only 4GB will be used. [ 0.000000] Use a HIGHMEM64G enabled kernel. [ 0.000000] 3200MB HIGHMEM ...
Peter "radar eye" Zijlstra noticed an ugly and annoying typo in mm.h: -#ifdef NODE_NOT_IN_PAGEFLAGS +#ifdef NODE_NOT_IN_PAGE_FLAGS but even with the full fix (see below) the same crash remains. i think getting NODE_NOT_IN_PAGEFLAGS wrong seems to result in non-optimal but still correct code - by virtue of NODES_MASK ending up zero. Ingo -----------------------> Subject: nodes: shift fix From: Ingo Molnar <mingo@elte.hu> Date: Tue Apr 15 21:15:21 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/mm.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -424,7 +424,7 @@ static inline void set_compound_order(st * We are going to use the flags for the page to node mapping if its in * there. This includes the case where there is no node, so it is implicit. */ -#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0) +#if NODES_WIDTH <= 0 || NODES_SHIFT == 0 #define NODE_NOT_IN_PAGE_FLAGS #endif @@ -442,7 +442,7 @@ static inline void set_compound_order(st #define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0)) /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allcator */ -#ifdef NODE_NOT_IN_PAGEFLAGS +#ifdef NODE_NOT_IN_PAGE_FLAGS #define ZONEID_SHIFT (SECTIONS_SHIFT + ZONES_SHIFT) #define ZONEID_PGOFF ((SECTIONS_PGOFF < ZONES_PGOFF)? \ SECTIONS_PGOFF : ZONES_PGOFF) --
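To make the two forms of the preprocessor condition from the patches above easier to compare, here is a minimal sketch that evaluates them as ordinary C functions. The function names are hypothetical and exist only for illustration; the real code is an #if in include/linux/mm.h, and as noted above the crash remained even with the changed condition.

```c
#include <assert.h>

/* Original condition: #if !(NODES_WIDTH > 0 || NODES_SHIFT == 0) */
int node_not_in_page_flags_original(int nodes_width, int nodes_shift)
{
    assert(nodes_width >= 0 && nodes_shift >= 0);
    return !(nodes_width > 0 || nodes_shift == 0);
}

/* Condition from the "nodes: shift fix" patch:
 * #if NODES_WIDTH <= 0 || NODES_SHIFT == 0 */
int node_not_in_page_flags_patched(int nodes_width, int nodes_shift)
{
    assert(nodes_width >= 0 && nodes_shift >= 0);
    return nodes_width <= 0 || nodes_shift == 0;
}
```

For the !NUMA case discussed here (NODES_SHIFT == 0, and assuming NODES_WIDTH is 0 as well) the original form yields 0 and the patched form yields 1. For a nonzero shift the two forms agree, so they differ only in the zero-shift case.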
Hmmmm. SECTION_SIZE_BITS == 26 means SECTIONS_SHIFT == 6. Increasing SECTION_SIZE_BITS to 27 reduces SECTIONS_SHIFT to 5. Thereby the number of sparsemem sections (NR_MEM_SECTIONS) is reduced to half (64 to 32). --
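The section arithmetic can be checked with a small sketch. It assumes a 32-bit physical address space for !PAE (the 4GB figure mentioned in this thread) and that SECTIONS_SHIFT is the difference between the physical address bits and SECTION_SIZE_BITS; the helper names are illustrative and not the kernel's macros.

```c
#include <assert.h>

/* Assumption: 32 physical address bits on 32-bit x86 without PAE. */
enum { MAX_PHYSMEM_BITS_NOPAE = 32 };

/* Number of bits needed to index a section. */
unsigned int sections_shift(unsigned int max_physmem_bits,
                            unsigned int section_size_bits)
{
    assert(section_size_bits <= max_physmem_bits);
    return max_physmem_bits - section_size_bits;
}

/* Total number of sparsemem sections covering the address space. */
unsigned long nr_mem_sections(unsigned int max_physmem_bits,
                              unsigned int section_size_bits)
{
    return 1UL << sections_shift(max_physmem_bits, section_size_bits);
}

/* Size of one section in megabytes. */
unsigned long section_size_mb(unsigned int section_size_bits)
{
    assert(section_size_bits >= 20);
    return 1UL << (section_size_bits - 20);
}
```

With MAX_PHYSMEM_BITS_NOPAE, a SECTION_SIZE_BITS of 26 gives 64 sections of 64 MB each, and 27 gives 32 sections of 128 MB each, matching the numbers quoted above.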
yes, as i said in this thread already earlier today, the sparse chunking goes from 64MB to 128MB. (and hence, by virtue of !PAE having a 4GB physical address space, the # of sparse sections goes from 64 to 32 - you can see the full sparse sections printout in my latest crashlog in my previous mail, including the NR_MEM_SECTIONS printout.) Pretty please, could you pay more than cursory attention to this bug i already spent two full days on and which is blocking the v2.6.25 release? Your commits are all over the place in this code, and you are one of the maintainers as well. We've got 5000 lines of flux in mm/* in v2.6.25. I'm just guessing my way around, but right now my impression is that the current early memory setup code is unrobust, over-complex, occasionally butt-ugly to read code in high need of cleanups, simplifications and debug facilities, visibly plagued by hit-and-run changes with frequent typos and everything else you normally dont want to see in the core kernel. (Did i get your attention now? ;-) Ingo --
Yeah trying to get to understand how exactly sparsemem works and how the 32 bit highmem stuff interacts with it... Sorry not code that I am an expert in nor the platform that I am familiar with. Code mods there required heavy review from multiple parties with expertise in various I thought the NR_SECTIONS stuff would trigger some memories. Adding apw who seemed to be most familiar with the material in the past (AFAICT NODE_NOT_IN_PAGE_FLAGS is there for IBM NUMAQ etc) and Kame-san. Andy, Kame-san could you have a look at the sparsemem config issue with 32 bit !PAE? This is SPARSEMEM_STATIC. --
yeah - sorry about that impatient flame. And it could still be anything from the page allocator to bootmem - or some completely unrelated piece of code corrupting some key data structure. sparsemem is supposed to work roughly like this on x86 (32-bit): - the x86 memory map comes from the bios via e820. - those individual chunks of e820-enumerated memory get registered with mm/sparse.c's data structures via memory_present() callbacks. [btw., this should be renamed to register_memory_present() or register_sparse_range() - something less opaque.] - there's really just 3 RAM areas that matter on this box, and the last one is unusable for !PAE, which leaves 2. - there's a 256 MB PCI aperture hole at 0xf0000000. - out of the 64 sparse memory chunk the first 60 get filled in (all have at least partially some RAM content) - the last 4 [the PCI aperture hole] remains !present. - we pass in an array of 3 zones to free_area_init_nodes(). - we free the lowmem pages into the buddy allocator via the usual generic setup - we have a special loop for highmem pages in arch/x86/mm/init_32.c, set_highmem_pages_init(). This just goes through the PFNs one by one and does an explicit __free_page() on all RAM pages that are in the mem_map[] and which are non-reserved. and that's it roughly. my current guess would have been some bootmem regression/interaction that messes up the buddy bitmaps - but i just reverted to the v2.6.24 version of bootmem.c and that crashes too ... Ingo --
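The "first 60 of 64 sections present" figure in the description above can be reproduced from the e820 map posted in this thread. The sketch below is a user-space stand-in, not the kernel's memory_present(): the struct and function names are hypothetical, and it assumes 64 MB sections (SECTION_SIZE_BITS == 26) and ignores the usable range above 4GB, which is unreachable without PAE.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SIZE_BITS 26   /* 64 MB sections */
#define NR_SECTIONS       64   /* 4GB / 64 MB */

struct ram_range { uint64_t start, end; };   /* half-open: [start, end) */

/* Mark every section that overlaps the given RAM range as present. */
static void mark_present(unsigned char present[NR_SECTIONS], struct ram_range r)
{
    assert(r.start < r.end);
    uint64_t first = r.start >> SECTION_SIZE_BITS;
    uint64_t last = (r.end - 1) >> SECTION_SIZE_BITS;
    for (uint64_t s = first; s <= last && s < NR_SECTIONS; s++)
        present[s] = 1;
}

/* Build the section map from the two usable e820 ranges below 4GB. */
static void build_section_map(unsigned char present[NR_SECTIONS])
{
    const struct ram_range usable[] = {
        { 0x00000000, 0x0009f800 },
        { 0x00100000, 0xefff8000 },
    };
    for (unsigned int i = 0; i < NR_SECTIONS; i++)
        present[i] = 0;
    for (unsigned int i = 0; i < sizeof(usable) / sizeof(usable[0]); i++)
        mark_present(present, usable[i]);
}

/* Returns 1 if section nr holds at least some RAM, 0 otherwise. */
int section_present(unsigned int nr)
{
    unsigned char present[NR_SECTIONS];
    assert(nr < NR_SECTIONS);
    build_section_map(present);
    return present[nr];
}

/* Count the sections that hold at least some RAM. */
unsigned int count_present(void)
{
    unsigned char present[NR_SECTIONS];
    unsigned int n = 0;
    build_section_map(present);
    for (unsigned int i = 0; i < NR_SECTIONS; i++)
        n += present[i];
    return n;
}
```

Sections 0 through 59 come out present and sections 60 through 63, which cover the 256 MB hole starting at 0xf0000000, stay not present.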
btw., highmem shouldnt matter because it does not influence how we allocate our key data structures. i confirmed that by turning set_highmem_pages_init() into a NOP - the kernel still crashed with just lowmem memory being around. Ingo --
and booting with NOHIGHMEM gives a crash too - updated config attached. (in case anyone wonders about the CONFIG_M486=y - it crashes with CONFIG_M686=y too) Ingo # # Automatically generated make config: don't edit # Linux kernel version: 2.6.25-rc9 # Tue Apr 15 22:14:51 2008 # # CONFIG_64BIT is not set CONFIG_X86_32=y # CONFIG_X86_64 is not set CONFIG_X86=y # CONFIG_GENERIC_LOCKBREAK is not set CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_FAST_CMPXCHG_LOCAL=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y # CONFIG_GENERIC_GPIO is not set CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y # CONFIG_RWSEM_GENERIC_SPINLOCK is not set CONFIG_RWSEM_XCHGADD_ALGORITHM=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y CONFIG_GENERIC_CALIBRATE_DELAY=y # CONFIG_GENERIC_TIME_VSYSCALL is not set CONFIG_ARCH_HAS_CPU_RELAX=y # CONFIG_HAVE_SETUP_PER_CPU_AREA is not set CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y # CONFIG_ZONE_DMA32 is not set CONFIG_ARCH_POPULATES_NODE_MAP=y # CONFIG_AUDIT_ARCH is not set CONFIG_ARCH_SUPPORTS_AOUT=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_X86_SMP=y CONFIG_X86_32_SMP=y CONFIG_X86_HT=y CONFIG_X86_BIOS_REBOOT=y CONFIG_X86_TRAMPOLINE=y CONFIG_KTIME_SCALAR=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y # CONFIG_SWAP is not set # CONFIG_SYSVIPC is not set CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not ...
changing the .config to UP makes it boot up fine. Config and bootlog attached. Ingo # # Automatically generated make config: don't edit # Linux kernel version: 2.6.25-rc9 # Tue Apr 15 22:20:58 2008 # # CONFIG_64BIT is not set CONFIG_X86_32=y # CONFIG_X86_64 is not set CONFIG_X86=y # CONFIG_GENERIC_LOCKBREAK is not set CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_FAST_CMPXCHG_LOCAL=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y # CONFIG_GENERIC_GPIO is not set CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y # CONFIG_RWSEM_GENERIC_SPINLOCK is not set CONFIG_RWSEM_XCHGADD_ALGORITHM=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y CONFIG_GENERIC_CALIBRATE_DELAY=y # CONFIG_GENERIC_TIME_VSYSCALL is not set CONFIG_ARCH_HAS_CPU_RELAX=y # CONFIG_HAVE_SETUP_PER_CPU_AREA is not set CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y # CONFIG_ZONE_DMA32 is not set CONFIG_ARCH_POPULATES_NODE_MAP=y # CONFIG_AUDIT_ARCH is not set CONFIG_ARCH_SUPPORTS_AOUT=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_X86_BIOS_REBOOT=y CONFIG_KTIME_SCALAR=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y # CONFIG_SWAP is not set # CONFIG_SYSVIPC is not set CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set # CONFIG_TASKSTATS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=20 CONFIG_CGROUPS=y # CONFIG_CGROUP_DEBUG is not ...
Vexing. The failure in the slabs suggests that no lowmem pages were freed during the walk of the bootmem bitmaps. Could you call show_mem before kmem_cache_init() runs? --
sure - find the crashlog below.
but it seems there's plenty of free RAM in the buddy:
[ 0.000999] DMA: 3*4kB 2*8kB 4*16kB 2*32kB 3*64kB 1*128kB 1*256kB
0*512kB 1*1024kB 1*2048kB 0*4096kB = 3804kB
[ 0.000999] Normal: 54*4kB 54*8kB 54*16kB 54*32kB 54*64kB 60*128kB
60*256kB 0*512kB 1*1024kB 0*2048kB 197*4096kB =
837672kB
and the bug pattern seems to be memory corruption - not memory
exhaustion.
i.e. we allocated RAM but it got corrupted after allocation.
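As a cross-check of the two free-list lines quoted above, the per-order block counts can be summed directly. This is only a sketch: it assumes 4kB base pages and 11 orders (4kB up to 4096kB) as shown in the printout, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS 11   /* 4kB ... 4096kB, as in the printout */

/* Sum free memory in kB: nr_free[order] blocks of (4kB << order) each. */
unsigned long free_kb(const unsigned long nr_free[NR_ORDERS])
{
    unsigned long total = 0;
    assert(nr_free != NULL);
    for (size_t order = 0; order < NR_ORDERS; order++)
        total += nr_free[order] * (4UL << order);
    return total;
}

/* Block counts copied from the DMA line of the log. */
unsigned long dma_free_kb(void)
{
    static const unsigned long dma[NR_ORDERS] =
        { 3, 2, 4, 2, 3, 1, 1, 0, 1, 1, 0 };
    return free_kb(dma);
}

/* Block counts copied from the Normal line of the log. */
unsigned long normal_free_kb(void)
{
    static const unsigned long normal[NR_ORDERS] =
        { 54, 54, 54, 54, 54, 60, 60, 0, 1, 0, 197 };
    return free_kb(normal);
}
```

Both sums agree with the totals in the log (3804kB and 837672kB), so the free lists are at least internally consistent at this point in boot.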
Ingo
Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -609,6 +609,7 @@ asmlinkage void __init start_kernel(void
mem_init();
enable_debug_pagealloc();
cpu_hotplug_init();
+ show_mem();
kmem_cache_init();
setup_per_cpu_pageset();
numa_policy_init();
[ 0.000000] Linux version 2.6.25-rc9 (mingo@dione) (gcc version 4.2.2) #968 SMP Tue Apr 15 22:39:35 CEST 2008
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[ 0.000000] BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000efff8000 (usable)
[ 0.000000] BIOS-e820: 00000000efff8000 - 00000000f0000000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000110000000 (usable)
[ 0.000000] console [earlyser0] enabled
[ 0.000000] Warning only 896MB will be used.
[ 0.000000] Use a HIGHMEM64G enabled kernel.
[ 0.000000] 896MB LOWMEM available.
[ 0.000000] Scan SMP from c0000000 for 1024 bytes.
[ 0.000000] Scan SMP from c009fc00 for ...

SLUB does not do a memory allocation where it fails here but simply

In some situations we are screwing up the per cpu data handling on 32 bit x86? Adding Mike. This looks like the per cpu area overlaps with something else? --
yep, that was my other theory - and i doubled CONFIG_NR_CPUS to reduce that chance. in hindsight ... that wont save us from any overlap, right? what's the best way to artificially increase the size of the allocated per cpu area? (say double it) Ingo --
Add a big per cpu declaration? static DEFINE_PER_CPU(char, dummy)[10000]; --
what's the guarantee that it's at the end of the section? I'd like to pad the per cpu areas at their end. (doubling their size is a good way to achieve that) Ingo --
No guarantee. Its up to the linker. Sorry. We could add a new percpu.last section but that requires a number of changes to linking. --
ah. Then the patch below should do the trick, right? Ingo -------------> Subject: larger: percpu From: Ingo Molnar <mingo@elte.hu> Date: Tue Apr 15 23:13:18 CEST 2008 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/vmlinux_32.lds.S | 1 + 1 file changed, 1 insertion(+) Index: linux/arch/x86/kernel/vmlinux_32.lds.S =================================================================== --- linux.orig/arch/x86/kernel/vmlinux_32.lds.S +++ linux/arch/x86/kernel/vmlinux_32.lds.S @@ -186,6 +186,7 @@ SECTIONS __per_cpu_start = .; *(.data.percpu) *(.data.percpu.shared_aligned) + . = . + 65536; __per_cpu_end = .; } . = ALIGN(PAGE_SIZE); --
Hopefully. The linker sometimes reacts in funky ways and we do some strange magic with the offsets through gcc memory models (at least on x86_64 not sure what 32 bit does). --
Or you could try this: --- linux-2.6.x86.sched-last-0415.orig/include/linux/percpu.h +++ linux-2.6.x86.sched-last-0415/include/linux/percpu.h @@ -38,10 +38,7 @@ /* Enough to cover all DEFINE_PER_CPUs in kernel, including modules. */ #ifndef PERCPU_ENOUGH_ROOM -#ifdef CONFIG_MODULES -#define PERCPU_MODULE_RESERVE 8192 -#else -#define PERCPU_MODULE_RESERVE 0 +#define PERCPU_MODULE_RESERVE 65536 #endif #define PERCPU_ENOUGH_ROOM \ --
this seems to have the intended effect of +0x10000 padding at the end of the percpu area:

 c0b927d0 D per_cpu__cpu_info
 c0b92880 d per_cpu__runqueues
 c0ba2d00 D __per_cpu_end
 c0ba3000 B __bss_start

it still crashes though, with a very similar crash pattern to the previous ones.

Ingo --
I don't know that there is a boot option. If modules are defined it adds an extra 8k. The size is defined in include/linux/percpu.h (PERCPU_ENOUGH_ROOM). Otherwise define a really large per_cpu variable...? -Mike --
I know this is a bit of hand-waving but have you noticed how all the interesting sparsemem changes that one would expect to have caused the breakage happened _before_ v2.6.24? So sorry for asking this again but are we 110% sure the problem does not trigger with any of the v2.6.24-rcN kernels? Pekka --
quite. Here are all the successful bootups from my (failed) bisection attempt:
0773769191d943358a8392fa86abd756d004c4b6
21af0297c7e56024a5ccc4d8ad2a590f9ec371ba
26b8256e2bb930a8e4d4d10aa74950d8921376b8
2a10e7c41254941cac87be1eccdcb6379ce097f5
3aa88cdf6bcc9e510c0707581131b821a7d3b7cb
49914084e797530d9baaf51df9eda77babc98fa8
53a6e2342d73d509318836e320f70cd286acd69c
5be3bda8987b12a87863c89b74b136fdb1f072db
6d5f718a497375f853d90247f5f6963368e89803
7272dcd31d56580dee7693c21e369fd167e137fe
77de2c590ec72828156d85fa13a96db87301cc68
82cfbb008572b1a953091ef78f767aa3ca213092
b75f53dba8a4a61fda1ff7e0fb0fe3b0d80e0c64
c087567d3ffb2c7c61e091982e6ca45478394f1a
d4b37ff73540ab90bee57b882a10b21e2f97939f
fde1b3fa947c2512e3715962ebb1d3a6a9b9bb7d
or, via git-describe:
v2.6.24-3908-g0773769
v2.6.24-2392-g21af029
v2.6.24-3868-g26b8256
v2.6.24-4463-g2a10e7c
v2.6.24-4457-g3aa88cd
v2.6.24
v2.6.24-3522-g53a6e23
v2.6.24-3131-g5be3bda
v2.6.24-4461-g6d5f718
v2.6.24-3891-g7272dcd
v2.6.24-3902-g77de2c5
v2.6.24-3613-g82cfbb0
v2.6.24-4449-gb75f53d
v2.6.24-3911-gc087567
v2.6.24-3913-gd4b37ff
v2.6.24-4464-gfde1b3f
i.e. vanilla v2.6.24 and a whole bunch of commits after it were booting just fine. (the problem might have been masked up to a certain point in theory, but given how resilient it is to offset changes in my testing i find that not very probable [but not impossible] )
Ingo
--
Ok, can you try this script
git bisect start
git bisect bad 7fa2ac3728ce828070fa3d5846c08157fe5ef431
git bisect good 0773769191d943358a8392fa86abd756d004c4b6
git bisect good 21af0297c7e56024a5ccc4d8ad2a590f9ec371ba
git bisect good 26b8256e2bb930a8e4d4d10aa74950d8921376b8
git bisect good 2a10e7c41254941cac87be1eccdcb6379ce097f5
git bisect good 3aa88cdf6bcc9e510c0707581131b821a7d3b7cb
git bisect good 49914084e797530d9baaf51df9eda77babc98fa8
git bisect good 53a6e2342d73d509318836e320f70cd286acd69c
git bisect good 5be3bda8987b12a87863c89b74b136fdb1f072db
git bisect good 6d5f718a497375f853d90247f5f6963368e89803
git bisect good 7272dcd31d56580dee7693c21e369fd167e137fe
git bisect good 77de2c590ec72828156d85fa13a96db87301cc68
git bisect good 82cfbb008572b1a953091ef78f767aa3ca213092
git bisect good b75f53dba8a4a61fda1ff7e0fb0fe3b0d80e0c64
git bisect good c087567d3ffb2c7c61e091982e6ca45478394f1a
git bisect good d4b37ff73540ab90bee57b882a10b21e2f97939f
git bisect good fde1b3fa947c2512e3715962ebb1d3a6a9b9bb7d
and then you'll apparently hit that commit you had compile problems with. HOWEVER, at that point, just do
git bisect visualize
and pick a commit somewhere roughly half-way that you suspect is a good point of testing, but not near the range that you had problems with. If you have compile problems in the middle, pick something that is just one third down, for example. It will make the bisection slower, but considering how unable we've been to make much progress other ways, if we can narrow it down from 1874 commits to something smaller, I suspect we'll be much happier.
Then you just do
git checkout <sha-you-picked-out-here>
and compile that one, and check.
Besides, while the _optimal_ point is half-way, even if you only remove a third or a quarter of the commits at each stage, it's still going to be an exponential thing.
Linus
--
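The "still going to be an exponential thing" remark can be made concrete with a rough, idealised model. The sketch below assumes a linear range of commits and that every test build leaves a fixed fraction of it still to be tested; real git history is a DAG, so actual step counts will differ. The function names are hypothetical and not part of git.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Count how many test builds it takes to shrink a candidate range of
 * `revisions` commits down to a single commit, if each build leaves
 * the fraction `keep` of the range still to be tested.
 * (Idealised model: assumes a linear history.)
 */
int bisect_steps(double revisions, double keep)
{
	int steps = 0;

	assert(keep > 0.0 && keep < 1.0);
	while (revisions > 1.0) {
		revisions *= keep;
		steps++;
	}
	return steps;
}

/* Print the cost of optimal and sub-optimal split points. */
void print_bisect_costs(double revisions)
{
	printf("keep 1/2 per step: %d builds\n", bisect_steps(revisions, 1.0 / 2.0));
	printf("keep 2/3 per step: %d builds\n", bisect_steps(revisions, 2.0 / 3.0));
	printf("keep 3/4 per step: %d builds\n", bisect_steps(revisions, 3.0 / 4.0));
}
```

Under this model, print_bisect_costs(1874) reports 11, 19 and 27 builds respectively: picking a worse split point costs a constant factor, not a linear walk through the 1874 revisions.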
ok, will try that now. The 'bad' points i posted are definitely well-established as i post-validated them by looking for the slab.c crash pattern in the serial logs and looking at the git commit in the bootup signature. (this is more reliable than looking at bisection logs - i tried 4 different bisection runs and not all were reliable.) Ingo --
btw., as i progress with that bisection effort, i triggered new crash patterns, which happen later during bootup:
[ 0.775886] initcall 0xc0b00559 ran for 0 msecs: ksysfs_init+0x0/0x96()
[ 0.777885] Calling initcall 0xc0b01eb8: filelock_init+0x0/0x27()
[ 0.780137] BUG: unable to handle kernel NULL pointer dereference at 00000001
[ 0.782883] IP: [<c0293981>] strlen+0xb/0x15
[ 0.784884] *pde = 00000000
[ 0.786889] Oops: 0000 [#1] SMP
[ 0.787880]
[ 0.787880] Pid: 1, comm: swapper Not tainted (2.6.24-05281-g6232665 #3)
[ 0.787880] EIP: 0060:[<c0293981>] EFLAGS: 00010286 CPU: 0
[ 0.787880] EIP is at strlen+0xb/0x15
[ 0.787880] EAX: 00000000 EBX: 00040000 ECX: ffffffff EDX: 00040000
[ 0.787880] ESI: c0915320 EDI: 00000001 EBP: f7c23f08 ESP: f7c23f04
[ 0.787880] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 0.787880] Process swapper (pid: 1, ti=f7c22000 task=f7c20000 task.ti=f7c22000)
[ 0.787880] Stack: f7c1f540 f7c23f18 c015b138 00000094 c016b45a f7c23f48 c015c92c c016b45a
[ 0.787880] 00000025 c0915320 f7c1f540 000000d0 f7c23f58 00000282 f7c1f540 00000000
[ 0.787880] 00000000 f7c23f84 c015cbc8 00000060 00000000 00040000 c016b45a c0133e81
[ 0.787880] Call Trace:
[ 0.787880] [<c015b138>] ? kmem_cache_flags+0x3d/0x5b
[ 0.787880] [<c016b45a>] ? init_once+0x0/0xc
[ 0.787880] [<c015c92c>] ? kmem_cache_open+0x64/0x128
[ 0.787880] [<c016b45a>] ? init_once+0x0/0xc
[ 0.787880] [<c015cbc8>] ? kmem_cache_create+0x14e/0x1d6
[ 0.787880] [<c016b45a>] ? init_once+0x0/0xc
[ 0.787880] [<c0133e81>] ? ktime_get_ts+0x3b/0x3f
[ 0.787880] [<c0133e96>] ? ktime_get+0x11/0x2f
[ 0.787880] [<c0b01ed6>] ? filelock_init+0x1e/0x27
[ 0.787880] [<c016b45a>] ? init_once+0x0/0xc
[ 0.787880] [<c0af176d>] ? kernel_init+0x13f/0x298
[ 0.787880] [<c0103b46>] ? ret_from_fork+0x6/0x20
[ 0.787880] [<c0af162e>] ? kernel_init+0x0/0x298
[ 0.787880] [<c0af162e>] ? kernel_init+0x0/0x298
[ ...
It may help to run with slub_debug (or CONFIG_SLUB_DEBUG_ON) then to detect the corruption as early as possible. Only works if you get by kmem_cache_init() though. Should give us some informative dumps of what is exactly corrupted if it hits a slab object. --
grumble. Do you read my bugreports? CONFIG_SLUB_DEBUG=y CONFIG_SLUB=y CONFIG_SLUB_DEBUG_ON=y Ingo --
Cannot memorize everything sorry. This looked like slab object corruption and there were no slub diagnostics in the log. Trying to duplicate the issue here. Boot with PAE support on a machine with 8G RAM 8 processors here works fine. Also booting without highmem support (SMP) works fine. On how many machines does the problem occur? # # Automatically generated make config: don't edit # Linux kernel version: 2.6.25-rc9 # Tue Apr 15 21:59:36 2008 # # CONFIG_64BIT is not set CONFIG_X86_32=y # CONFIG_X86_64 is not set CONFIG_X86=y # CONFIG_GENERIC_LOCKBREAK is not set CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_FAST_CMPXCHG_LOCAL=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y # CONFIG_GENERIC_GPIO is not set CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y # CONFIG_RWSEM_GENERIC_SPINLOCK is not set CONFIG_RWSEM_XCHGADD_ALGORITHM=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y CONFIG_GENERIC_CALIBRATE_DELAY=y # CONFIG_GENERIC_TIME_VSYSCALL is not set CONFIG_ARCH_HAS_CPU_RELAX=y # CONFIG_HAVE_SETUP_PER_CPU_AREA is not set CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y # CONFIG_ZONE_DMA32 is not set CONFIG_ARCH_POPULATES_NODE_MAP=y # CONFIG_AUDIT_ARCH is not set CONFIG_ARCH_SUPPORTS_AOUT=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_X86_SMP=y CONFIG_X86_32_SMP=y CONFIG_X86_HT=y CONFIG_X86_BIOS_REBOOT=y CONFIG_X86_TRAMPOLINE=y CONFIG_KTIME_SCALAR=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # General ...
have you tried my config on that box? Check: http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9 Ingo --
Yup that config fails here too... --
great. Note that it's randconfig generated - so watch out for weird config combinations. randconfig, besides finding build-bugs, is also good at finding various runtime bugs: it is great at finding weird alignment and boundary-condition bugs in generic code, and it's also great at finding races (by virtue of introducing random delays between various functions, via random enabling/disabling of debug facilities and other options that impact the generated code's layout and timing). Ingo --
Hmmm... If one enables CONFIG_X86_PAE (even with no highmem) then everything is fine. For PAE to be enabled some other things also fall by the wayside. Diff to your failing config follows. Will try to minimize the diff even further: --- config-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9 2008-04-15 06:02:13.000000000 +0000 +++ .config 2008-04-15 23:15:53.000000000 +0000 @@ -1,7 +1,7 @@ # # Automatically generated make config: don't edit # Linux kernel version: 2.6.25-rc9 -# Tue Apr 15 07:37:33 2008 +# Tue Apr 15 23:15:53 2008 # # CONFIG_64BIT is not set CONFIG_X86_32=y @@ -169,11 +169,11 @@ # CONFIG_X86_VSMP is not set # CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER is not set CONFIG_PARAVIRT_GUEST=y +# CONFIG_XEN is not set CONFIG_VMI=y -# CONFIG_LGUEST_GUEST is not set CONFIG_PARAVIRT=y # CONFIG_M386 is not set -CONFIG_M486=y +# CONFIG_M486 is not set # CONFIG_M586 is not set # CONFIG_M586TSC is not set # CONFIG_M586MMX is not set @@ -196,20 +196,23 @@ # CONFIG_MVIAC3_2 is not set # CONFIG_MVIAC7 is not set # CONFIG_MPSC is not set -# CONFIG_MCORE2 is not set +CONFIG_MCORE2=y # CONFIG_GENERIC_CPU is not set # CONFIG_X86_GENERIC is not set CONFIG_X86_CMPXCHG=y -CONFIG_X86_L1_CACHE_SHIFT=4 +CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_XADD=y -# CONFIG_X86_PPRO_FENCE is not set -CONFIG_X86_F00F_BUG=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y -CONFIG_X86_ALIGNMENT_16=y -CONFIG_X86_MINIMUM_CPU_FAMILY=4 +CONFIG_X86_GOOD_APIC=y +CONFIG_X86_INTEL_USERCOPY=y +CONFIG_X86_USE_PPRO_CHECKSUM=y +CONFIG_X86_P6_NOP=y +CONFIG_X86_TSC=y +CONFIG_X86_MINIMUM_CPU_FAMILY=6 +CONFIG_X86_DEBUGCTLMSR=y # CONFIG_HPET_TIMER is not set # CONFIG_IOMMU_HELPER is not set CONFIG_NR_CPUS=8 @@ -229,11 +232,11 @@ CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=y -# CONFIG_NOHIGHMEM is not set -CONFIG_HIGHMEM4G=y +CONFIG_NOHIGHMEM=y +# CONFIG_HIGHMEM4G is not set # CONFIG_HIGHMEM64G is not set ...
... in the thread i've already explained that it's because on PAE we use 1GB sparse chunks (shift 30) which masks the bug. (on PAE we cannot go below a shift of 29 due to shortage of page->flags) Ingo --
Ahh. Right. That is the same situation as HIGHMEM_64G. I was able to enable PAE without HIGHMEM_64G. Thought that would keep things as is. --
Added some printks to the initialization of slub and I see 0x1 double words written over global variables that should be zero. The cpu mask to track processors that are initialized is screwed up (kmem_cach_cpu_free_init_once).
[ 0.000999] 0xc0cb320c: 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 ................
[ 0.000999] 0xc0cb321c: 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 ................
[ 0.000999] 0xc0cb322c: 00 00 00 00 01 00 00 00 ........
c0cb3010 B mem_section
c0cb3210 b lock.25923
c0cb3214 b shmem_inode_cachep
c0cb3218 b shm_mnt
c0cb321c b slab_state
c0cb3220 b kmem_cach_cpu_free_init_once
c0cb3224 b slub_debug
c0cb3228 b slub_debug_slabs
mem_section is 512 bytes long. Array overrun?
--
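The array-overrun hypothesis can be checked with a little arithmetic on the symbol addresses in the message above. The sketch below is an editorial illustration, not code from the thread. It assumes the !PAE 32-bit values PAGE_SHIFT = 12, SECTION_SIZE_BITS = 26 (the value quoted elsewhere in this thread) and MAX_PHYSMEM_BITS = 32, and an 8-byte mem_section entry, which is what a 512-byte array of 64 entries implies. The helper names are hypothetical.

```c
#include <assert.h>

/* Assumed values for the failing !PAE 32-bit config (see lead-in). */
#define PAGE_SHIFT          12
#define SECTION_SIZE_BITS   26   /* 64 MB per section */
#define MAX_PHYSMEM_BITS    32   /* 4 GB addressable without PAE */
#define NR_MEM_SECTIONS     (1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))
#define MEM_SECTION_BYTES   8UL  /* 512-byte array / 64 entries */

/* Section index whose entry would start at `addr`, given the array base. */
unsigned long section_index_at(unsigned long base, unsigned long addr)
{
	assert(addr >= base);
	assert((addr - base) % MEM_SECTION_BYTES == 0);
	return (addr - base) / MEM_SECTION_BYTES;
}

/* First physical megabyte covered by a given section index. */
unsigned long section_start_mb(unsigned long section)
{
	return section << (SECTION_SIZE_BITS - 20);
}
```

With mem_section at 0xc0cb3010, the entry that would land on slub_debug_slabs (0xc0cb3228) has index 67, three past the last valid index of 63, and section 67 starts at physical 4288 MB, i.e. just above 4 GB. The 0x1 words in the dump sit at 8-byte intervals starting right at the end of the array, which fits this picture.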
same .config with -rc9 on one system with 128g 4 sockets with quad core crashed here too. YH --
finally found it ... the patch below solves the sparsemem crash and the
testsystem boots up fine now:
mars:~> uname -a
Linux mars 2.6.25-rc9-sched-devel.git-x86-latest.git #985 SMP Wed Apr 16
01:37:37 CEST 2008 i686 i686 i386 GNU/Linux
yay! :-)
Ingo
ps. anyone who can correctly guess the method with which i found the
exact place that corrupted memory will get a free beer next time we
meet :-)
------------------------->
Subject: mm: sparsemem memory_present() memory corruption fix
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 16 01:40:00 CEST 2008
fix memory corruption and crash on 32-bit x86 systems.
if a !PAE x86 kernel is booted on a 32-bit system with more than
4GB of RAM, then we call memory_present() with a start/end that
goes outside the scope of MAX_PHYSMEM_BITS.
that causes this loop to happily walk over the limit of the
sparse memory section map:
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
		unsigned long section = pfn_to_section_nr(pfn);
		struct mem_section *ms;
		sparse_index_init(section, nid);
		set_section_nid(section, nid);
		ms = __nr_to_section(section);
		if (!ms->section_mem_map)
			ms->section_mem_map = sparse_encode_early_nid(nid) |
							SECTION_MARKED_PRESENT;
'ms' will be out of bounds and we'll corrupt a small amount of memory by
encoding the node ID. Depending on what that memory is, we might crash,
misbehave or just not notice the bug.
the fix is to sanity check anything the architecture passes to sparsemem.
this bug seems to be rather old (as old as sparsemem support itself),
but the exact incarnation depended on random details like configs,
which made this bug more prominent in v2.6.25-to-be.
an additional enhancement might be to print a warning about ignored
or trimmed memory ranges.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/sparse.c | 10 ++++++++++
1 file changed, 10 ...
--
Yes that fixes it here too. And the corruption that I saw of slab variables is explained by your analysis. Thanks! Tested-by: Christoph Lameter <clameter@sgi.com> --
i re-checked the original SLAB config too and that boots fine as well now - so i'm confident that the regression has been sufficiently cured. it's getting quite late here (or rather, it's getting early :-/ ) so it would be nice if others could double-check this calculation (with an eye on all possible architectures): + unsigned long max_arch_pfn = 1ULL << (MAX_PHYSMEM_BITS-PAGE_SHIFT); and also check my analysis whether it is correct and whether it matches the reported bug patterns. But otherwise the fix looks like a safe fix for v2.6.25-final to me - it only filters out values from sparsemem input that are nonsensical in the sparsemem framework anyway. Ingo --
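For readers who want to double-check the calculation, here is a minimal userspace model of the clamping idea. It is a sketch only, not the text of the patch (whose body is cut off in this archive), and it assumes the !PAE 32-bit values PAGE_SHIFT = 12, SECTION_SIZE_BITS = 26 and MAX_PHYSMEM_BITS = 32. section_map, clamp_pfn_range() and model_memory_present() are hypothetical stand-ins for the kernel's data structure and function.

```c
#include <assert.h>

/* Assumed !PAE 32-bit values (see lead-in). */
#define PAGE_SHIFT             12
#define SECTION_SIZE_BITS      26
#define MAX_PHYSMEM_BITS       32
#define PFN_SECTION_SHIFT      (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION      (1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS        (1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))
#define SECTION_MARKED_PRESENT 1UL

/* Simplified stand-in for the kernel's section map. */
static unsigned long section_map[NR_MEM_SECTIONS];

/* Clamp [*start, *end) to the pfn range the section map can describe. */
void clamp_pfn_range(unsigned long *start, unsigned long *end)
{
	unsigned long max_arch_pfn = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);

	if (*start >= max_arch_pfn) {
		*start = *end = max_arch_pfn;	/* nothing usable: empty range */
		return;
	}
	if (*end > max_arch_pfn)
		*end = max_arch_pfn;
}

/* Model of the marking loop; returns the number of sections marked. */
unsigned long model_memory_present(unsigned long start, unsigned long end)
{
	unsigned long pfn, marked = 0;

	clamp_pfn_range(&start, &end);
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
		unsigned long section = pfn >> PFN_SECTION_SHIFT;

		/* Without the clamp this would fire for pfns above 4 GB. */
		assert(section < NR_MEM_SECTIONS);
		section_map[section] |= SECTION_MARKED_PRESENT;
		marked++;
	}
	return marked;
}
```

For a 5 GB range (end pfn 1310720) the model marks exactly 64 sections and ignores the rest; a range that lies entirely above 4 GB marks none.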
can you check why find_max_pfn() in e820_32.c needs to call memory_present? wonder if it can be removed. YH --
this is the only call to memory_present() we do in 32-bit arch setup, so it's required. (the function find_max_pfn() is woefully misnamed, but that's a cleanup - i just fixed this in x86.git.) Ingo --
We could clip there if SPARSEMEM is configured. I wonder if this affects other platforms that need HIGHMEM support? --
i.e. as per my previous argument i'd consider the need to sanitize the calls in the architecture fundamentally wrong. whether the core code emits a warning or allows the call is an additional question i mention in the changelog - but the core sparse memory code should _definitely_ not silently overflow a key internal array ... (a data structure that the architecture code is not even aware of) Ingo --
or you could move that check into find_max_pfn for x86_32, so that it does not affect other platforms (regarding Christoph's concern)? YH --
the patch doesn't have side effects on x86_64. YH --
also no side effects on my ia64/NUMA box, which has a sparse physical memory map. Thanks, -Kame --
64 bit is calling that via paging_init
==> sparse_memory_present_with_active_regions(MAX_NUMNODES).
and
void __init sparse_memory_present_with_active_regions(int nid)
{
	int i;
	for_each_active_range_index_in_nid(i, nid)
		memory_present(early_node_map[i].nid,
			       early_node_map[i].start_pfn,
			       early_node_map[i].end_pfn);
}
that is somewhat later than on 32 bit.
YH
--
yeah - 64-bit is different here and it's not affected by the problem because there SECTION_SIZE_BITS is 27 (==128 MB chunks), MAX_PHYSADDR_BITS is 40 (== 1 TB) - giving 8192 section map entries. Once larger than 1 TB 64-bit x86 systems are created MAX_PHYSADDR_BITS needs to be increased. The only downside of the current setup on 64-bit is that it wastes 128K of RAM on the majority of systems. We could perhaps try a shift of 28, which halves the footprint to 64K of RAM, and which still is good enough to allow the PCI aperture to remain a hole on most systems. It would also compress the data-cache footprint of the sparse memory maps. (without having to use sparsemem-extreme indirection) Ingo --
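The figures in the message above follow from a short calculation, sketched here. The 16-byte entry size is an assumption; it is the value implied by 8192 entries taking 128K. The function names are hypothetical.

```c
#include <assert.h>

/* Number of section map entries for a given address width and section size. */
unsigned long nr_mem_sections(unsigned int max_physaddr_bits,
			      unsigned int section_size_bits)
{
	assert(max_physaddr_bits >= section_size_bits);
	return 1UL << (max_physaddr_bits - section_size_bits);
}

/* Size of the section map in KB, for a given per-entry size in bytes. */
unsigned long section_map_kb(unsigned int max_physaddr_bits,
			     unsigned int section_size_bits,
			     unsigned long entry_bytes)
{
	return nr_mem_sections(max_physaddr_bits, section_size_bits)
		* entry_bytes / 1024;
}
```

With 40 address bits and a shift of 27, nr_mem_sections() gives 8192 entries and section_map_kb(40, 27, 16) gives 128; raising the shift to 28 halves both.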
also 64 bit
early_node_map[10] active PFN ranges
0: 0 -> 149
0: 256 -> 917408
0: 1048576 -> 8519680
1: 8519680 -> 16908288
2: 16908288 -> 25296896
3: 25296896 -> 33685504
4: 33685504 -> 42074112
5: 42074112 -> 50462720
6: 50462720 -> 58851328
7: 58851328 -> 67239936
and 32 bit only has one entry
[ 0.000000] early_node_map[1] active PFN ranges
[ 0.000000] 0: 0 -> 1048576
YH
--
Well okay this fixes it but is this the right fix? The arch should not call memory_present() with an invalid pfn. --
it is the right fix. The architecture memory setup code doesnt even _know_ the limits at this place in an open-coded way (and shouldnt know them) - and even later on we use pfn_valid() to determine whether to attempt to get to a struct page and free it into the buddy. [ Of course the architecture code in general 'knows' about the limits - but still it's cleaner to have a dumb enumeration interface here combined with a resilient core code - that's always going to be less fragile. ] btw., i just did some bug history analysis, the calls were originally added when sparsemem support was added:
| commit 215c3409eed16c89b6d11ea1126bd9d4f36b9afd
| Author: Andy Whitcroft <apw@shadowen.org>
| Date: Fri Jan 6 00:12:06 2006 -0800
|
| [PATCH] i386 sparsemem for single node systems
in v2.6.15-1003-g215c340. (so this appears to be an unfixed bug in v2.6.16 as well)
Ingo
--
the corruption might happen when encoding a non-zero node ID, or due to the SECTION_MARKED_PRESENT which is 0x1: mmzone.h:#define SECTION_MARKED_PRESENT (1UL<<0) Ingo --
Joe Perches pointed out that the ULL was superfluous (i typoed it, i
knew it's a pfn). Updated patch below.
Ingo
-------------------------->
Subject: mm: sparsemem memory_present() fix
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 16 01:40:00 CEST 2008
fix memory corruption and crash on 32-bit x86 systems.
if a !PAE x86 kernel is booted on a 32-bit system with more than
4GB of RAM, then we call memory_present() with a start/end that
goes outside the scope of MAX_PHYSMEM_BITS.
that causes this loop to happily walk over the limit of the
sparse memory section map:
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
		unsigned long section = pfn_to_section_nr(pfn);
		struct mem_section *ms;
		sparse_index_init(section, nid);
		set_section_nid(section, nid);
		ms = __nr_to_section(section);
		if (!ms->section_mem_map)
			ms->section_mem_map = sparse_encode_early_nid(nid) |
							SECTION_MARKED_PRESENT;
'ms' will be out of bounds and we'll corrupt a small amount of memory by
encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it.
the fix is to sanity check anything the architecture passes to sparsemem.
this bug seems to be rather old (as old as sparsemem support itself),
but the exact incarnation depended on random details like configs,
which made this bug more prominent in v2.6.25-to-be.
an additional enhancement might be to print a warning about ignored
or trimmed memory ranges.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/sparse.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux/mm/sparse.c
===================================================================
--- linux.orig/mm/sparse.c
+++ linux/mm/sparse.c
@@ -149,8 +149,18 @@ static inline int sparse_early_nid(struc
/* Record a memory area against a node. */
void __init memory_present(int nid, unsigned long ...
--
So it's a general issue that has been there for years that we are now noticing because we are now testing with memory sizes > 4GB. This also affects the enterprise releases (SLES10, RHEL5). Argh! I wonder why this did not show up earlier in testing? Running a kernel that cannot access all of memory is unusual I guess. --
i guess people saw the "you are not running a PAE kernel" warning and went to a PAE kernel which didnt have this issue. OTOH, quite a few testers consciously use non-PAE kernels on 4GB systems, so i'd not be surprised if this solved a few mystery regressions we have. Ingo --
i believe this was the reason why my many bisection attempts were unsuccessful: the bug pattern was not stable and seemingly working kernels had the memory corruption too. It was pure luck that v2.6.24 "worked" and v2.6.25-rc9 broke visibly. Ingo --
Ok, you didn't make that addendum to your second version, so I added it myself. Anyway, good job. I've pushed this out, and will let this simmer at least overnight to see if there are any brown-paper-bag issues (either with this or with some last changes from Andrew), but I'm happy, and I think I'll do the real 2.6.25 tomorrow. Linus --
I'm sorry to be too late here.. how about max_arch_pfn = NR_MEM_SECTIONS * PAGES_PER_SECTION ? Thanks, -Kame --
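The two ways of writing the limit give the same number, since 2^(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) * 2^(SECTION_SIZE_BITS - PAGE_SHIFT) = 2^(MAX_PHYSMEM_BITS - PAGE_SHIFT). The sketch below checks this for given parameters; it assumes NR_MEM_SECTIONS and PAGES_PER_SECTION are the plain powers of two described in this thread, and the function names are hypothetical.

```c
#include <assert.h>

/* Limit as written in the patch: 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT). */
unsigned long max_arch_pfn_from_bits(unsigned int max_physmem_bits,
				     unsigned int page_shift)
{
	assert(max_physmem_bits >= page_shift);
	return 1UL << (max_physmem_bits - page_shift);
}

/* Limit as suggested above: NR_MEM_SECTIONS * PAGES_PER_SECTION. */
unsigned long max_arch_pfn_from_sections(unsigned int max_physmem_bits,
					 unsigned int section_size_bits,
					 unsigned int page_shift)
{
	unsigned long nr_mem_sections, pages_per_section;

	assert(max_physmem_bits >= section_size_bits);
	assert(section_size_bits >= page_shift);
	nr_mem_sections = 1UL << (max_physmem_bits - section_size_bits);
	pages_per_section = 1UL << (section_size_bits - page_shift);
	return nr_mem_sections * pages_per_section;
}
```

The second form has the advantage of naming the array bound directly, which is the quantity the loop must not exceed.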
Very cool :) This fixed the silent lock-up that I was getting when using your config as well. At a bit of a loss yesterday to explain what was going wrong, I had started putting together patches to sanity check memory initialisation at various different stages trying to catch where things were going pear-shaped. You found the bug before it was done but I finished the basics anyway and posted it as "[RFC] Verification and debugging of memory initialisation". Something like it may help avoid similar headaches for people who tend to run into -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
the method was to notice that the slub_debug_slabs SLUB variable got corrupted from an expected value of 0 to a value of 0x1. Then i added a simple brute-force function-tracer hook (in sched-devel) that checked when slub_debug_slabs went from 0 to 1, and which then printed a backtrace. Since under CONFIG_FTRACE=y every kernel function calls this callback, it triggered immediately after the value got corrupted:
[ 0.000000] console [earlyser0] enabled
[ 0.000000] BUG: slub_debug_slabs: 00000001
[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.25-rc9-sched-devel.git-x86-latest.git #982
[ 0.000000] [<c0177fba>] print_slub_debug_slabs+0x3a/0x40
[ 0.000000] [<c01050f7>] trace+0x8/0x11
[ 0.000000] [<c0cc929e>] ? mtrr_bp_init+0xe/0x320
[ 0.000000] [<c01050f7>] ? trace+0x8/0x11
[ 0.000000] [<c0cd7369>] ? memory_present+0x9/0x50
[ 0.000000] [<c0cc7a09>] ? find_max_pfn+0x99/0xb0
[ 0.000000] [<c0cc6af7>] setup_arch+0x217/0x470
[ 0.000000] [<c012c59b>] ? printk+0x1b/0x20
[ 0.000000] [<c0cc2b46>] start_kernel+0x96/0x3f0
[ 0.000000] [<c0cc22fd>] i386_start_kernel+0xd/0x10
[ 0.000000] =======================
[ 0.000000] x86: PAT support disabled.
and the backtrace had all the guilty parties on stack - memory_present() [which was just called] and find_max_pfn()/setup_arch() - thanks to the new fuzzy "?" backtrace entries we print out in v2.6.25. (i could also have printed out the current ftrace buffer as well, showing the history of all recent function calls that the kernel executed.)
Ingo
--
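The technique described above can be illustrated with a small userspace analogy. This is not the kernel hook used in the thread (that code is not shown); every name below is hypothetical. The idea is the same: a hook runs at each function entry, checks a variable that should stay 0, and on the first change reports the function that was entered most recently before the change was noticed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static unsigned long watched_var;          /* stand-in for slub_debug_slabs */
static const char *last_func = "(none)";   /* most recent function entered */
static const char *suspect;                /* set once, when the change is noticed */

/* Called on every function entry; reports the first observed corruption. */
void trace_hook(const char *func)
{
	assert(func != NULL);
	if (watched_var != 0 && suspect == NULL) {
		suspect = last_func;
		fprintf(stderr, "BUG: watched_var: %08lx on entry to %s, "
			"previous function: %s\n", watched_var, func, last_func);
	}
	last_func = func;
}

/* A well-behaved function: only calls the hook. */
void innocent(void)
{
	trace_hook(__func__);
}

/* A function that clobbers the watched variable after its entry hook ran. */
void buggy_writer(void)
{
	trace_hook(__func__);
	watched_var = 0x1;
}

/* Returns 1 if the hook pins the corruption on buggy_writer(). */
int demo_finds_writer(void)
{
	innocent();
	buggy_writer();
	innocent();
	return suspect != NULL && strcmp(suspect, "buggy_writer") == 0;
}
```

The check is cheap and runs at function granularity, so the report comes at the very next function entry after the bad write. That is enough to name the function that was running, much as the stale "?" entries in the backtrace above point at memory_present().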
The simplest solution for now may be to go with your workaround increasing SECTION_SIZE_BITS to 27. PAE mode already uses 30 and x86_64 also works with 27. This is going to affect the memory hotplug granularity for !PAE 32 bit configs though. Kame-san, any concerns with that? --
the bug's effects are so severe that this is the last thing i'd like to do. Ingo --
more verbosely: we sometimes do "blind" reverts, if it's reasonably established (or strongly suspected) that a revert makes a bug less severe. We do this even if we dont fully understand the bug and its effects and time runs out - on the assumption that we wont get worse than the old code was. but what i'd not really like to do are blind _non-revert_ changes. With your suggested change we'd introduce a seemingly innocuous but still wholly new (and untested) memory setup layout on the most popular Linux kernel memory config in existence. (!PAE 32-bit is still being run on more than 50% of the Linux desktops - around 80% runs 32-bit kernels.) And as this bug demonstrates, seemingly small differences appear to have large effects so we cannot know in what direction that would go - we might turn a rare regression into a common regression. I'd rather release with this bug being unfixed than with tweaking it just because the effect seems less severe on a totally unrepresentative set of systems. Ingo --
btw., here's the 'good' versus 'bad' bootup log (vanilla kernel spiced with a few extra stats printed out [*]): http://redhat.com/~mingo/misc/boot.26.log # bad http://redhat.com/~mingo/misc/boot.27.log # good the only difference is SECTION_SIZE_BITS == 26 versus 27. looking at the dmesg diff, there's just minimal (and expected) offset difference in some structure sizes. (more sparse maps use a bit more memory) Ingo [*] in case you wonder why memory_section->map is twice its size - i doubled it just to eliminate any doubts about off-by-one errors. Their natural size, as returned by bootmem, was 512KB plus 16 bytes (!), which seemed a bit weird. Probably a section entry came between two memory map allocations? --
Allowing systems without node 0 is a major change for x86. --
I also have an internal report that x86-git causes boot to fail with an 8p if one starts with a x86_64 config file and then converts to x86_32. Somehow the NR_CPUS is set to 255 in that case. Could this exhaust memory? I guess the per cpu cleanup work may figure in that area. Mike? --
how about reading my bugreport that you replied to: http://lkml.org/lkml/2008/4/11/34 It gives an answer to your question, trivially so. It includes an easy link to the very config that failed: http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad which would tell you: CONFIG_NR_CPUS=8 so no, it's not 255 CPUs exhausting RAM ... Ingo --
