Re: [patch] mm: sparsemem memory_present() memory corruption fix

From: Ingo Molnar
Date: Friday, April 11, 2008 - 12:41 am

our x86.git randconfig auto-qa found a mm/slab.c early-bootup crash in 
mainline that got introduced since v2.6.24.

  http://redhat.com/~mingo/misc/log-Thu_Apr_10_10_41_16_CEST_2008.bad
  http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad

Note, the very same bzImage does not crash on other testboxes - only on 
this 8-way box with 4GB of RAM.

i tried a "use v2.6.24's slab.c" revert (with a few API fixes needed for 
it to build on .25) but that didn't solve the problem either.

i tried a bisection yesterday but it didn't work out too well - a 
combination of block layer (?) and networking regressions made it 
impossible.

Here's the list of "good" bisection points since v2.6.24 (from 
multiple bisection runs):

 0773769191d943358a8392fa86abd756d004c4b6
 21af0297c7e56024a5ccc4d8ad2a590f9ec371ba
 26b8256e2bb930a8e4d4d10aa74950d8921376b8
 2a10e7c41254941cac87be1eccdcb6379ce097f5
 3aa88cdf6bcc9e510c0707581131b821a7d3b7cb
 49914084e797530d9baaf51df9eda77babc98fa8
 53a6e2342d73d509318836e320f70cd286acd69c
 5be3bda8987b12a87863c89b74b136fdb1f072db
 6d5f718a497375f853d90247f5f6963368e89803
 7272dcd31d56580dee7693c21e369fd167e137fe
 77de2c590ec72828156d85fa13a96db87301cc68
 82cfbb008572b1a953091ef78f767aa3ca213092
 b75f53dba8a4a61fda1ff7e0fb0fe3b0d80e0c64
 c087567d3ffb2c7c61e091982e6ca45478394f1a
 d4b37ff73540ab90bee57b882a10b21e2f97939f
 fde1b3fa947c2512e3715962ebb1d3a6a9b9bb7d

the "bad" bisection points where i saw a slab.c crash were:

 7180c4c9e09888db0a188f729c96c6d7bd61fa83
 7fa2ac3728ce828070fa3d5846c08157fe5ef431

this still leaves a rather large set of commits:

  Bisecting: 1874 revisions left to test after this

and the mm/ bits alone look voluminous:

 $ git-bisect visualize -p -- mm | diffstat | tail -1
 106 files changed, 67759 insertions(+), 20852 deletions(-)

	Ingo

---------------->
Subject: slab: revert
From: Ingo Molnar <mingo@elte.hu>
Date: Thu Apr 10 11:04:16 CEST 2008

Signed-off-by: Ingo Molnar ...
From: Pekka Enberg
Date: Friday, April 11, 2008 - 1:21 am

Hi Ingo,


As mentioned privately, I suspect it's the page allocator changes that
went into 2.6.24. Mel, Christoph, any ideas?
--

From: Pekka Enberg
Date: Friday, April 11, 2008 - 1:50 am

So I'm thinking it's probably related to this patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=523b94...

As kmalloc_node() in setup_cpu_cache() returns NULL, it seems likely
to be due to the use of GFP_THISNODE in cache_alloc_refill() when
calling cache_grow() and that the semantics changed. No idea why page
allocator would think your UMA "local node" has no memory though.

                        Pekka
--

From: Ingo Molnar
Date: Friday, April 11, 2008 - 1:54 am

but ... as i said it in my report, this is a regression since v2.6.24 - 
v2.6.24 (and a whole bunch of commits since then, i listed the IDs) 
booted up fine. The commit ID you mention is: v2.6.23-4345-g523b945, way 
earlier than the good commit IDs.

so this is a recent regression.

	Ingo
--

From: Pekka Enberg
Date: Friday, April 11, 2008 - 2:05 am

Hi Ingo.


Right. Then you probably want to look into any changes in arch/x86/
related to setting up the zonelists. I'm fairly certain this is not a
slab bug and I don't see any recent changes to the page allocator
either that would explain this.
--

From: Pekka Enberg
Date: Friday, April 11, 2008 - 2:11 am

And I'd lose as you're 32-bit. Oh well, that's the price to pay for
pretending to know x86 arch internals.
--

From: Ingo Molnar
Date: Friday, April 11, 2008 - 2:24 am

yeah, sorry - we are working hard to unify generic bits like that, but 
it's a huge architecture.

btw., i always felt that the zone/memory setup is rather fragile and 
ad-hoc in places and it trusts the architecture code too much. Just in 
the .25 cycle i've seen about a dozen bugs all around that thing. I 
believe we should work on making the info that an architecture feeds to 
the MM "fool proof" - i.e. sanity-check for overlaps and other common 
setup errors. It is easy for an architecture to mess up those things... 
Especially on oddball systems that are too large or too small to be 
normally tested. It's a common, recurring bug pattern that we could 
avoid by being a bit more resilient.

if this is a zone setup bug then a sanity-check could catch it right 
where it happens - not much later in the slab code or so.

	Ingo
--

From: Nick Piggin
Date: Friday, April 11, 2008 - 3:34 am

BTW. I think I'm seeing some problems perhaps related to change page
attr stuff for DEBUG_PAGEALLOC on x86-64. And I don't know if it is the
same thing, but some general instability around either the page allocator
or slab allocator.

The debug pagealloc problems seem to be that a thread suddenly gets stuck
in the kernel spinning in cpa (usually on one of the locks) and never
seems to recover. Once it seemed to be spinning in clear_page_... too,
but perhaps could it be messing up the page attributes and running so
slowly that it just appears to be hanging? I'll try to get more info here
but it is hard to reproduce.

The general instability -- I've just seen an oops or two in the page
allocation path in slub recently. Nothing reportable because I've been
running my own patches and/or been unable to reproduce... but it is a bit
unusual and I'll keep an eye out.

Anyway, I'd suggest cooking this kernel a bit longer before release...

--

From: Christoph Lameter
Date: Friday, April 11, 2008 - 12:28 pm

Could you post the zone setup of the system that fails? A memory map would 
be useful and full dmesg output up to the failure.

--

From: Christoph Lameter
Date: Saturday, April 12, 2008 - 3:38 am

Pekka pointed out the orig post that had the boot info.

One commonality in the two failures inhouse and yours is that 
both had more than 4G memory and had CONFIG_HIGHMEM4G set. Possibly
a wraparound of some limit?


--

From: Yinghai Lu
Date: Saturday, April 12, 2008 - 10:22 am

Ingo,

does your old 8 socket system work with 64bit kernel?

YH
--

From: Ingo Molnar
Date: Monday, April 14, 2008 - 10:43 pm

it's all in my bugreport you are replying to:

   http://lkml.org/lkml/2008/4/11/34

it's a full dmesg up to the failure, which starts with the memory map of 
the system ...

	Ingo
--

From: Mel Gorman
Date: Tuesday, April 15, 2008 - 2:36 am

I hadn't realised that such setup errors were common. It should be already able
to handle some overlapping problems in add_active_range().

I'm playing catch-up here but looking at your dmesg output, I see the
following snippets.

[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[    0.000000]  BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000efff8000 (usable)
[    0.000000]  BIOS-e820: 00000000efff8000 - 00000000f0000000 (ACPI data)

There are two portions of usable memory with a few holes there.

[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 0000000110000000 (usable)

And there is memory over the 4GB boundary but....

[    0.000000] Warning only 4GB will be used.
[    0.000000] Use a HIGHMEM64G enabled kernel.
[    0.000000] Entering add_active_range(0, 0, 1048576) 0 entries of 256 used

It's recognised and only memory below 4GB is registered and it's all on
node 0. However, I do note that it also registers all the holes as valid
memory. The memory should never get freed because it should be reserved
during boot by reserve_bootmem() but it still raises an eyebrow.

[    0.000000] early_node_map[1] active PFN ranges
[    0.000000]     0:        0 ->  1048576
[    0.000000] On node 0 totalpages: 1048576
[    0.000000]   DMA zone: 32 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 4064 pages, LIFO batch:0
[    0.000000]   Normal zone: 1760 pages used for memmap
[    0.000000]   Normal zone: 223520 pages, LIFO batch:31
[    0.000000]   HighMem zone: 6400 pages used for memmap
[    0.000000]   HighMem zone: 812800 pages, LIFO batch:31
[    0.000000]   ...
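
The per-zone figures in this dmesg excerpt are internally consistent, which can be checked with a little arithmetic. The sketch below is only a user-space model: it assumes 4 KB pages and a 32-byte struct page (inferred from the numbers above, not stated anywhere in the thread), and the helper names memmap_pages() and usable_pages() are hypothetical.

```c
#include <stdint.h>

#define PAGE_SIZE        4096u  /* assumption: 4 KB pages on 32-bit x86 */
#define STRUCT_PAGE_SIZE 32u    /* assumption: inferred from the dmesg figures */

/* Pages of mem_map[] needed to describe `spanned` pages of a zone. */
static uint32_t memmap_pages(uint32_t spanned)
{
    uint64_t bytes = (uint64_t)spanned * STRUCT_PAGE_SIZE;

    return (uint32_t)((bytes + PAGE_SIZE - 1) / PAGE_SIZE);
}

/* Pages left in the zone once its share of mem_map[] is subtracted. */
static uint32_t usable_pages(uint32_t spanned)
{
    return spanned - memmap_pages(spanned);
}
```

With a 16 MB DMA zone (4096 pages), lowmem ending at 896 MB (225280 Normal pages) and the rest of the 4 GB as HighMem (819200 pages), these helpers give the same "used for memmap" and per-zone page counts as the log.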
From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 3:03 am

32-bit does memory_present() calls to register all RAM - and those calls 
are correct (they do not include holes) and the resulting sparse memory 
section layout looks correct, and all the mem_map[] chunk allocations 
succeed as well.

furthermore, when freeing memory from bootmem allocator into the buddy 
allocator we consult the e820 map again via a page_is_ram() call, so we 
make sure holes do not end up in the memory map and in the free page 

yep. I did a few extra printouts to make sure, but came to the same 
conclusion. The system boots fine with the same config on v2.6.24.
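
For readers unfamiliar with the e820 re-check mentioned above, here is a rough user-space model of it. This is a sketch, not the kernel's page_is_ram(): the table is the BIOS map quoted in this thread, the name page_is_ram_model() is hypothetical, 4 KB pages are assumed, and the 4GB cut-off applied by a !PAE kernel is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KB pages */

enum e820_type { E820_RAM = 1, E820_RESERVED = 2, E820_ACPI = 3 };

struct e820_entry {
    uint64_t start;         /* first byte of the range */
    uint64_t end;           /* one past the last byte of the range */
    enum e820_type type;
};

/* The BIOS-e820 map quoted in this thread. */
static const struct e820_entry e820_map[] = {
    { 0x0000000000000000ULL, 0x000000000009f800ULL, E820_RAM },
    { 0x000000000009f800ULL, 0x00000000000a0000ULL, E820_RESERVED },
    { 0x00000000000f0000ULL, 0x0000000000100000ULL, E820_RESERVED },
    { 0x0000000000100000ULL, 0x00000000efff8000ULL, E820_RAM },
    { 0x00000000efff8000ULL, 0x00000000f0000000ULL, E820_ACPI },
    { 0x00000000fec00000ULL, 0x00000000fec10000ULL, E820_RESERVED },
    { 0x00000000fee00000ULL, 0x00000000fee10000ULL, E820_RESERVED },
    { 0x00000000fff80000ULL, 0x0000000100000000ULL, E820_RESERVED },
    { 0x0000000100000000ULL, 0x0000000110000000ULL, E820_RAM },
};

/* Return 1 if the whole page frame lies inside a usable e820 range. */
static int page_is_ram_model(uint64_t pfn)
{
    for (size_t i = 0; i < sizeof(e820_map) / sizeof(e820_map[0]); i++) {
        const struct e820_entry *e = &e820_map[i];
        uint64_t first, last;

        if (e->type != E820_RAM)
            continue;
        /* Round the start up and the end down to whole pages. */
        first = (e->start + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
        last = e->end >> PAGE_SHIFT;    /* exclusive */
        if (pfn >= first && pfn < last)
            return 1;
    }
    return 0;
}
```

In this model the partially usable page at 0x9f000 and everything in the hole between 0xefff8000 and 4GB are rejected, so such pages would never be handed to the buddy allocator.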

	Ingo
--

From: Ingo Molnar
Date: Monday, April 14, 2008 - 11:25 pm

you asked me to run with the debug patch attached below. I just tried 
vanilla -rc9 (head 120dd64cacd4fb7) and it still crashes with this 
config:

  http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9

debug output is:

  http://redhat.com/~mingo/misc/log-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9

so it's probably the first few page allocations (setup_cpu_cache()) 
going wrong already - suggesting some fundamental borkage in SLAB?

note, when i change SLAB to SLUB (and keep the config unchanged 
otherwise), i get a similar early crash:

  http://redhat.com/~mingo/misc/log-Tue_Apr_15_07_24_59_CEST_2008.bad
  http://redhat.com/~mingo/misc/config-Tue_Apr_15_07_24_59_CEST_2008.bad

i've also uploaded a bzImage (SLUB, debug patch not applied) that you 
can pick up and run on any 32-bit test-system:

  http://redhat.com/~mingo/misc/bzImage-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9

it's a relatively generic bzImage that should boot on most whitebox PCs 
on most distros as long as you use a pure ext3 setup and might even give 
you networking (no modules or initrd is needed). It boots fine on two 
other 32-bit PCs i have (an Intel laptop and an AMD desktop).

	Ingo

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1485,6 +1485,7 @@ restart:
 		 * Happens if we have an empty zonelist as a result of
 		 * GFP_THISNODE being used on a memoryless node
 		 */
+		WARN_ON(1);
 		return NULL;
 	}
 
Index: linux/mm/slab.c
===================================================================
--- linux.orig/mm/slab.c
+++ linux/mm/slab.c
@@ -1682,6 +1682,7 @@ static void *kmem_getpages(struct kmem_c
 		flags |= __GFP_RECLAIMABLE;
 
 	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+	WARN_ON(!page);
 	if (!page)
 		return NULL;
 
@@ -2620,6 +2621,7 @@ static struct slab *alloc_slabmgmt(struc
 		/* Slab management obj is off-slab. */
 ...
From: Pekka Enberg
Date: Monday, April 14, 2008 - 11:41 pm

I think it's still pointing to the page allocator and/or setting up

...especially considering you have a similar crash with SLUB as well.

Now this:

[    0.000999] ------------[ cut here ]------------
[    0.000999] WARNING: at mm/slab.c:1685 cache_alloc_refill+0x2a6/0x4a3()
[    0.000999] Pid: 0, comm: swapper Not tainted 2.6.25-rc9 #924
[    0.000999]  [<c0121b6f>] warn_on_slowpath+0x3c/0x4c
[    0.000999]  [<c0781873>] ? _spin_unlock_irqrestore+0xf/0x13
[    0.000999]  [<c02941ad>] ? delay_tsc+0x2e/0x4e
[    0.000999]  [<c029414d>] ? __delay+0x9/0xb
[    0.000999]  [<c0353db3>] ? serial8250_console_putchar+0x80/0x86
[    0.000999]  [<c0148822>] ? get_page_from_freelist+0x230/0x345
[    0.000999]  [<c0121eb1>] ? __call_console_drivers+0x56/0x63
[    0.000999]  [<c01489bb>] ? __alloc_pages+0x6e/0x2be
[    0.000999]  [<c015bd2e>] cache_alloc_refill+0x2a6/0x4a3
[    0.000999]  [<c015ba3f>] kmem_cache_alloc+0x5b/0xa4

Says that alloc_pages_node() returned NULL early on in the boot.
However, GFP_THISNODE is ruled out as this:

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1485,6 +1485,7 @@ restart:
                * Happens if we have an empty zonelist as a result of
                * GFP_THISNODE being used on a memoryless node
                */
+               WARN_ON(1);
               return NULL;
       }

does not trigger. Hmm...
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 12:08 am

i did a .config bisection and it pinpointed CONFIG_SPARSEMEM=y as the 
culprit. Changing it to FLATMEM gives a correctly booting system.

if you look at the good versus bad bootup log:

  http://redhat.com/~mingo/misc/log-Tue_Apr_15_07_24_59_CEST_2008.good
  http://redhat.com/~mingo/misc/log-Tue_Apr_15_07_24_59_CEST_2008.bad

(both SLUB) you'll see that the zone layout provided by the architecture 
code is _exactly_ the same and looks sane as well. So this is not an 
architecture zone layout bug, this is probably sparsemem setup (and/or 
the page allocator) getting confused by something.

why are there no good debug logs possible in this area? To debug such 
bugs we'd need an early dump of the precise layout of all memory maps, 
what points where, how large it is, where it is allocated - and then 
compare it with how the rest of the system is laid out - looking at 
possible overlaps or other bugs. This 8-way box is a pain to debug on, 
it takes a long time to boot it up, etc. etc.

	Ingo
--

From: Yinghai Lu
Date: Tuesday, April 15, 2008 - 1:31 am

so the same config on 64-bit with SLUB works and only 32-bit is broken? or is 2.6.24 with 32-bit + sparse + SLUB broken already?

YH

--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 1:46 am

this is a 32-bit-only box.

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 2:11 am

i've done a revert of the page allocator to v2.6.24 status (with fixes 
ontop to make it work on .25 infrastructure), via the patch below - but 
this didn't change the problem.

i also doubled the sparse mem_map[] allocations on the theory that they 
might overflow - but that didn't solve the crash either.

	Ingo

------------------------>
Subject: revert: page alloc
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Apr 15 10:44:34 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/gfp.h    |    2 
 include/linux/mmzone.h |    2 
 mm/page_alloc.c        |  169 ++++++++++++++++++++++---------------------------
 mm/vmstat.c            |   61 ++++++++---------
 4 files changed, 110 insertions(+), 124 deletions(-)

Index: linux/include/linux/gfp.h
===================================================================
--- linux.orig/include/linux/gfp.h
+++ linux/include/linux/gfp.h
@@ -227,7 +227,5 @@ extern void free_cold_page(struct page *
 
 void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
-void drain_all_pages(void);
-void drain_local_pages(void *dummy);
 
 #endif /* __LINUX_GFP_H */
Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -113,7 +113,7 @@ struct per_cpu_pages {
 };
 
 struct per_cpu_pageset {
-	struct per_cpu_pages pcp;
+	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
 #ifdef CONFIG_NUMA
 	s8 expire;
 #endif
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -19,7 +19,6 @@
 #include <linux/swap.h>
 #include <linux/interrupt.h>
 #include <linux/pagemap.h>
-#include <linux/jiffies.h>
 #include <linux/bootmem.h>
 #include <linux/compiler.h>
 #include <linux/kernel.h>
@@ -44,7 +43,6 @@
 #include <linux/backing-dev.h>
 #include ...
From: Linus Torvalds
Date: Tuesday, April 15, 2008 - 9:02 am

Well, I think it suggests some fundamental borkage in the page allocator.

That first warn-on is from the "alloc_pages_node()" returning NULL at 
bootup. Sure, it could be that the arguments are bogus, but that sounds 
unlikely since none of that is dependent on any kconfig stuff.

The fact that it happens with both SLUB/SLAB makes that even more obvious.

Now, you don't have fault injection on, so it can't be that, and your 
debug entry for *z == NULL didn't trigger in alloc_pages, so it's not that 
one either. 

However, if __alloc_pages() failed, I would have expected to see the 
"memory allocation failed" printk. Why didn't it? Is printk_ratelimit() 
broken at boot (last_msg starts out as zero - maybe it should start out as 
a negative number)?

			Linus
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 9:15 am

btw., now with a second full day spent on this regression, i have 
figured out a workaround the hard way: increasing SECTION_SIZE_BITS in 
include/asm-x86/sparsemem.h from 26 to 27 makes it go away. (i.e. we use 
section chunks of 128 MB instead of 64 MB before) I've given up on 
analyzing the crash site - it seems rather random and uninformative and 
just suggests page allocator borkage.

So this seems like a general sparsemem borkage. PAE uses a shift of 30 
due to page->flags shortage (which masks this bug), 64-bit uses 27 which 
too probably masks this bug.

Since this is a !NUMA config and !PAE as well, NODES_SHIFT is 0, 
ZONES_SHIFT is 2, so the theory of running out of bits in page->flags is 
wrong as well.

I also tried a hack to double the size of all sparsemem mem_map 
allocations (on the theory of an overflow there) - but it didn't help.

So i think we need to go down further into the page allocator. Perhaps 
the buddy bitmaps are wrongly sized somewhere. I'm grasping at straws.

Btw., Mel Gorman has reproduced crashes with my bzImage on his box (and 
a hang with my config, using his build), so i think we can eliminate hw 
and build environment specialities as a cause.

	Ingo
--

From: Linus Torvalds
Date: Tuesday, April 15, 2008 - 10:23 am

Interesting.

I wonder..

So since you don't have NUMA, you have NODES_SHIFT == 0.

That in turn means that NODE_NOT_IN_PAGE_FLAGS is _not_ set.

That, in turn, means that ZONEID_SHIFT does *not* contain SECTIONS_SHIFT.
Is that really what is supposed to happen?

Because then "page_is_buddy()" will not even test the section, as far as I 
can tell.

But I'm probably missing something. Why would we not need to test the 
section in page_zone_id() when the node ID is in the page flags (but has 
zero size)?
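
The chain of implications above can be replayed with the preprocessor itself. The following is a sketch under assumptions: the values NODES_SHIFT == 0 and ZONES_SHIFT == 2 come from this thread, SECTIONS_SHIFT == 6 is the 64 MB section case, NODES_WIDTH is taken to be 0, and the #else branch of ZONEID_SHIFT is assumed to be NODES_SHIFT + ZONES_SHIFT (it is not quoted anywhere in the thread). The two accessor functions are hypothetical helpers.

```c
/* Values for the failing config (32-bit, !NUMA, !PAE). */
#define NODES_SHIFT     0
#define NODES_WIDTH     0   /* assumption: no node bits in page->flags */
#define ZONES_SHIFT     2
#define SECTIONS_SHIFT  6   /* 64 MB sections in a 4 GB address space */

/* The mm.h condition: false when NODES_SHIFT == 0. */
#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
#define NODE_NOT_IN_PAGE_FLAGS
#endif

#ifdef NODE_NOT_IN_PAGE_FLAGS
#define ZONEID_SHIFT    (SECTIONS_SHIFT + ZONES_SHIFT)
#else
#define ZONEID_SHIFT    (NODES_SHIFT + ZONES_SHIFT)  /* assumed #else branch */
#endif

/* 1 if the section number takes part in the zone ID, 0 otherwise. */
static int node_not_in_page_flags(void)
{
#ifdef NODE_NOT_IN_PAGE_FLAGS
    return 1;
#else
    return 0;
#endif
}

/* Number of page->flags bits used to identify a zone for the buddy check. */
static int zoneid_shift(void)
{
    return ZONEID_SHIFT;
}
```

Under these assumptions the zone ID is only the 2 zone bits wide, so a buddy check based on it would not distinguish sections, which is the situation being asked about.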

		Linus
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 12:35 pm

still crashes with the patch below - find the crash-log further below.

(the kernel has a few more non-destructive debug printouts and debug 
checks included as well, which you can see in the log, but it's a 
vanilla kernel otherwise.)

	Ingo

----------------------->
Subject: nodes: shift fix
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Apr 15 21:15:21 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/mm.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -424,7 +424,7 @@ static inline void set_compound_order(st
  * We are going to use the flags for the page to node mapping if its in
  * there.  This includes the case where there is no node, so it is implicit.
  */
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#if NODES_WIDTH <= 0 || NODES_SHIFT == 0
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 

[    0.000000] Linux version 2.6.25-rc9 (mingo@dione) (gcc version 4.2.2) #960 SMP Tue Apr 15 21:16:23 CEST 2008
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[    0.000000]  BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000efff8000 (usable)
[    0.000000]  BIOS-e820: 00000000efff8000 - 00000000f0000000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 0000000110000000 (usable)
[    0.000000] console [earlyser0] enabled
[    0.000000] Warning only 4GB will be used.
[    0.000000] Use a HIGHMEM64G enabled kernel.
[    0.000000] 3200MB HIGHMEM ...
From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 12:41 pm

Peter "radar eye" Zijlstra noticed an ugly and annoying typo in mm.h:

-#ifdef NODE_NOT_IN_PAGEFLAGS
+#ifdef NODE_NOT_IN_PAGE_FLAGS

but even with the full fix (see below) the same crash remains.

i think getting NODE_NOT_IN_PAGEFLAGS wrong seems to result in 
non-optimal but still correct code - by virtue of NODES_MASK ending up 
zero.
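
Why such a typo goes unnoticed: #ifdef on a name that was never defined is simply false, with no compiler diagnostic. A minimal stand-alone illustration follows; the *_BRANCH_TAKEN macros and both helper functions are made up for this sketch and do not exist in mm.h.

```c
/* The macro that mm.h actually defines (shown here unconditionally). */
#define NODE_NOT_IN_PAGE_FLAGS

/* Misspelled test: this name is never defined, so the branch is dead. */
#ifdef NODE_NOT_IN_PAGEFLAGS
#define MISSPELLED_BRANCH_TAKEN 1
#else
#define MISSPELLED_BRANCH_TAKEN 0
#endif

/* Corrected test: matches the definition above. */
#ifdef NODE_NOT_IN_PAGE_FLAGS
#define CORRECTED_BRANCH_TAKEN 1
#else
#define CORRECTED_BRANCH_TAKEN 0
#endif

static int misspelled_branch_taken(void)
{
    return MISSPELLED_BRANCH_TAKEN;
}

static int corrected_branch_taken(void)
{
    return CORRECTED_BRANCH_TAKEN;
}
```

The misspelled variant always falls through to the #else side, regardless of whether NODE_NOT_IN_PAGE_FLAGS is defined.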

	Ingo

----------------------->
Subject: nodes: shift fix
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Apr 15 21:15:21 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/mm.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -424,7 +424,7 @@ static inline void set_compound_order(st
  * We are going to use the flags for the page to node mapping if its in
  * there.  This includes the case where there is no node, so it is implicit.
  */
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#if NODES_WIDTH <= 0 || NODES_SHIFT == 0
 #define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
@@ -442,7 +442,7 @@ static inline void set_compound_order(st
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allcator */
-#ifdef NODE_NOT_IN_PAGEFLAGS
+#ifdef NODE_NOT_IN_PAGE_FLAGS
 #define ZONEID_SHIFT		(SECTIONS_SHIFT + ZONES_SHIFT)
 #define ZONEID_PGOFF		((SECTIONS_PGOFF < ZONES_PGOFF)? \
 						SECTIONS_PGOFF : ZONES_PGOFF)
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 12:39 pm

Hmmmm. SECTION_SIZE_BITS == 26 means SECTIONS_SHIFT == 6. 
Increasing SECTION_SIZE_BITS to 27 reduces SECTIONS_SHIFT to 5. Thereby 
the number of sparsemem sections (NR_MEM_SECTIONS) is reduced to half (64 
to 32).
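
The arithmetic behind those numbers, as a small sketch. It assumes SECTIONS_SHIFT is MAX_PHYSMEM_BITS minus SECTION_SIZE_BITS with MAX_PHYSMEM_BITS == 32 (the 4GB !PAE physical address space mentioned in this thread); the function names are hypothetical.

```c
#define MAX_PHYSMEM_BITS 32u    /* assumption: 32-bit !PAE, 4 GB physical space */

/* Bits needed to number all sections of the given size. */
static unsigned int sections_shift(unsigned int section_size_bits)
{
    return MAX_PHYSMEM_BITS - section_size_bits;
}

/* Total number of sparsemem sections (NR_MEM_SECTIONS). */
static unsigned long nr_mem_sections(unsigned int section_size_bits)
{
    return 1UL << sections_shift(section_size_bits);
}

/* Size of one section in megabytes. */
static unsigned long section_size_mb(unsigned int section_size_bits)
{
    return (1UL << section_size_bits) >> 20;
}
```

For SECTION_SIZE_BITS of 26 this gives 64 sections of 64 MB; for 27 it gives 32 sections of 128 MB.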
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 12:54 pm

yes, as i said in this thread already earlier today, the sparse chunking 
goes from 64MB to 128MB. (and hence, by virtue of !PAE having a 4GB 
physical address space, the # of sparse sections goes from 64 to 32 - 
you can see the full sparse sections printout in my latest crashlog in 
my previous mail, including the NR_MEM_SECTIONS printout.)

Pretty please, could you pay more than cursory attention to this bug i 
already spent two full days on and which is blocking the v2.6.25 
release?

Your commits are all over the place in this code, and you are one of the 
maintainers as well. We've got 5000 lines of flux in mm/* in v2.6.25.

I'm just guessing my way around, but right now my impression is that the 
current early memory setup code is unrobust, over-complex, occasionally 
butt-ugly to read code in high need of cleanups, simplifications and 
debug facilities, visibly plagued by hit-and-run changes with frequent 
typos and everything else you normally don't want to see in the core 
kernel. (Did i get your attention now? ;-)

	Ingo
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 1:03 pm

Yeah trying to get to understand how exactly sparsemem works and how the 
32 bit highmem stuff interacts with it... Sorry not code that I am an 
expert in nor the platform that I am familiar with. Code mods there 
required heavy review from multiple parties with expertise in various 

I thought the NR_SECTIONS stuff would trigger some memories. Adding apw 
who seemed to be most familiar with the material in the past 
(AFAICT NODE_NOT_IN_PAGE_FLAGS is there for IBM NUMAQ etc) and Kame-san.

Andy, Kame-san could you have a look at the sparsemem config issue with 
32 bit !PAE? This is SPARSEMEM_STATIC.


--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 1:17 pm

yeah - sorry about that impatient flame. And it could still be anything 
from the page allocator to bootmem - or some completely unrelated piece 
of code corrupting some key data structure.

sparsemem is supposed to work roughly like this on x86 (32-bit):

- the x86 memory map comes from the bios via e820.

- those individual chunks of e820-enumerated memory get
  registered with mm/sparse.c's data structures via memory_present()
  callbacks. [btw., this should be renamed to register_memory_present()
  or register_sparse_range() - something less opaque.]

- there's really just 3 RAM areas that matter on this box, and the last 
  one is unusable for !PAE, which leaves 2.

- there's a 256 MB PCI aperture hole at 0xf0000000.

- out of the 64 sparse memory chunks the first 60 get filled in (all have 
  at least some RAM content) - the last 4 [the PCI aperture 
  hole] remain !present.

- we pass in an array of 3 zones to free_area_init_nodes().

- we free the lowmem pages into the buddy allocator via the usual 
  generic setup

- we have a special loop for highmem pages in arch/x86/mm/init_32.c, 
  set_highmem_pages_init(). This just goes through the PFNs one by one 
  and does an explicit __free_page() on all RAM pages that are in the 
  mem_map[] and which are non-reserved.

and that's it roughly.
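
The section bookkeeping in the list above can be modelled in a few lines of user-space C. This is only a sketch under assumptions: it works on byte addresses rather than PFNs, ignores node IDs, clips everything at 4GB as a !PAE kernel would, and the names mark_present() and count_present_sections() are hypothetical, not the mm/sparse.c interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTION_SIZE_BITS 26    /* 64 MB sections */
#define MAX_PHYSMEM_BITS  32    /* 4 GB physical address space (!PAE) */
#define NR_MEM_SECTIONS   (1u << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))

struct ram_range {
    uint64_t start;     /* first byte */
    uint64_t end;       /* one past the last byte */
};

/* The three usable e820 ranges from the dmesg quoted in this thread. */
static const struct ram_range usable_ram[] = {
    { 0x000000000ULL, 0x00009f800ULL },
    { 0x000100000ULL, 0x0efff8000ULL },
    { 0x100000000ULL, 0x110000000ULL },     /* above 4 GB, unusable here */
};

static unsigned char section_present[NR_MEM_SECTIONS];

/* Mark every section that contains at least one byte of [start, end). */
static void mark_present(uint64_t start, uint64_t end)
{
    const uint64_t limit = 1ULL << MAX_PHYSMEM_BITS;

    if (end > limit)
        end = limit;
    if (start >= end)
        return;
    for (uint64_t s = start >> SECTION_SIZE_BITS;
         s <= (end - 1) >> SECTION_SIZE_BITS; s++)
        section_present[s] = 1;
}

/* Register all usable ranges and return how many sections are present. */
static unsigned int count_present_sections(void)
{
    unsigned int n = 0;

    memset(section_present, 0, sizeof(section_present));
    for (size_t i = 0; i < sizeof(usable_ram) / sizeof(usable_ram[0]); i++)
        mark_present(usable_ram[i].start, usable_ram[i].end);
    for (size_t s = 0; s < NR_MEM_SECTIONS; s++)
        n += section_present[s];
    return n;
}
```

With the map from this box, sections 0 through 59 end up present and the four sections covering the hole at 0xf0000000 stay empty, matching the description above.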

my current guess would have been some bootmem regression/interaction 
that messes up the buddy bitmaps - but i just reverted to the v2.6.24 
version of bootmem.c and that crashes too ...

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 1:28 pm

btw., highmem shouldn't matter because it does not influence how we 
allocate our key data structures.

i confirmed that by turning set_highmem_pages_init() into a NOP - the 
kernel still crashed with just lowmem memory being around.

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 1:34 pm

and booting with NOHIGHMEM gives a crash too - updated config attached. 
(in case anyone wonders about the CONFIG_M486=y - it crashes with 
CONFIG_M686=y too)

	Ingo

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc9
# Tue Apr 15 22:14:51 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
# CONFIG_SWAP is not set
# CONFIG_SYSVIPC is not set
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not ...
From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 1:42 pm

changing the .config to UP makes it boot up fine. Config and bootlog 
attached.

	Ingo

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc9
# Tue Apr 15 22:20:58 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
# CONFIG_SWAP is not set
# CONFIG_SYSVIPC is not set
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=20
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not ...
From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 1:50 pm

Vexing. The failure in the slabs suggests that no lowmem pages were freed 
during the walk of the bootmem bitmaps. Could you call show_mem before 
kmem_cache_init() runs?
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 1:58 pm

sure - find the crashlog below.

but it seems there's plenty of free RAM in the buddy:

[    0.000999] DMA: 3*4kB 2*8kB 4*16kB 2*32kB 3*64kB 1*128kB 1*256kB 
                    0*512kB 1*1024kB 1*2048kB 0*4096kB = 3804kB

[    0.000999] Normal: 54*4kB 54*8kB 54*16kB 54*32kB 54*64kB 60*128kB 
                       60*256kB 0*512kB 1*1024kB 0*2048kB 197*4096kB = 
                       837672kB

and the bug pattern seems to be memory corruption - not memory 
exhaustion.

i.e. we allocated RAM but it got corrupted after allocation.
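
The totals in those two show_mem() lines can be re-derived from the per-order counts. A small sketch, assuming 4 kB pages and 11 buddy orders (4kB up to 4096kB, as printed); free_kb() and the two arrays are hypothetical names holding the figures quoted above.

```c
#include <stddef.h>

#define MAX_ORDER 11    /* orders 0..10, i.e. 4kB .. 4096kB blocks */
#define PAGE_KB   4UL   /* assumption: 4 kB pages */

/* Free block counts per order, copied from the show_mem() output. */
static const unsigned long dma_free[MAX_ORDER] =
    { 3, 2, 4, 2, 3, 1, 1, 0, 1, 1, 0 };
static const unsigned long normal_free[MAX_ORDER] =
    { 54, 54, 54, 54, 54, 60, 60, 0, 1, 0, 197 };

/* Sum the free memory of one zone in kB. */
static unsigned long free_kb(const unsigned long nr_free[MAX_ORDER])
{
    unsigned long total = 0;

    for (size_t order = 0; order < MAX_ORDER; order++)
        total += nr_free[order] * (PAGE_KB << order);
    return total;
}
```

Both sums agree with the printed totals (3804kB for DMA, 837672kB for Normal), so the free lists themselves look well-formed at this point in boot.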

	Ingo

Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -609,6 +609,7 @@ asmlinkage void __init start_kernel(void
 	mem_init();
 	enable_debug_pagealloc();
 	cpu_hotplug_init();
+	show_mem();
 	kmem_cache_init();
 	setup_per_cpu_pageset();
 	numa_policy_init();

[    0.000000] Linux version 2.6.25-rc9 (mingo@dione) (gcc version 4.2.2) #968 SMP Tue Apr 15 22:39:35 CEST 2008
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[    0.000000]  BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000efff8000 (usable)
[    0.000000]  BIOS-e820: 00000000efff8000 - 00000000f0000000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 0000000110000000 (usable)
[    0.000000] console [earlyser0] enabled
[    0.000000] Warning only 896MB will be used.
[    0.000000] Use a HIGHMEM64G enabled kernel.
[    0.000000] 896MB LOWMEM available.
[    0.000000] Scan SMP from c0000000 for 1024 bytes.
[    0.000000] Scan SMP from c009fc00 for ...
From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 2:08 pm

SLUB does not do a memory allocation where it fails here but simply 

In some situations we are screwing up the per cpu data handling on 
32 bit x86? Adding Mike. This looks like the per cpu area overlaps with 
something else?



--

From: Mike Travis
Date: Tuesday, April 15, 2008 - 2:16 pm

I'll certainly take a look...

-Mike
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 2:19 pm

yep, that was my other theory - and i doubled CONFIG_NR_CPUS to reduce 
that chance.

in hindsight ... that wont save us from any overlap, right?

what's the best way to artificially increase the size of the allocated 
per cpu area? (say double it)

	Ingo
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 2:21 pm

Add a big  per cpu declaration?

static DEFINE_PER_CPU(char, dummy)[10000];


--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 2:23 pm

what's the guarantee that it's at the end of the section? I'd like to 
pad the per cpu areas at their end. (doubling their size is a good way 
to achieve that)

	Ingo
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 2:24 pm

No guarantee. It's up to the linker. Sorry. We could add a new percpu.last
section but that requires a number of changes to linking.

--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 2:28 pm

ah. Then the patch below should do the trick, right?

	Ingo

------------->
Subject: larger: percpu
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Apr 15 23:13:18 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/vmlinux_32.lds.S |    1 +
 1 file changed, 1 insertion(+)

Index: linux/arch/x86/kernel/vmlinux_32.lds.S
===================================================================
--- linux.orig/arch/x86/kernel/vmlinux_32.lds.S
+++ linux/arch/x86/kernel/vmlinux_32.lds.S
@@ -186,6 +186,7 @@ SECTIONS
 	__per_cpu_start = .;
 	*(.data.percpu)
 	*(.data.percpu.shared_aligned)
+	. = . + 65536;
 	__per_cpu_end = .;
   }
   . = ALIGN(PAGE_SIZE);
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 2:33 pm

Hopefully. The linker sometimes reacts in funky ways and we do some 
strange magic with the offsets through gcc memory models (at least on 
x86_64 not sure what 32 bit does).
--

From: Mike Travis
Date: Tuesday, April 15, 2008 - 2:43 pm

Or you could try this:

--- linux-2.6.x86.sched-last-0415.orig/include/linux/percpu.h
+++ linux-2.6.x86.sched-last-0415/include/linux/percpu.h
@@ -38,10 +38,7 @@

 /* Enough to cover all DEFINE_PER_CPUs in kernel, including modules. */
 #ifndef PERCPU_ENOUGH_ROOM
-#ifdef CONFIG_MODULES
-#define PERCPU_MODULE_RESERVE  8192
-#else
-#define PERCPU_MODULE_RESERVE  0
+#define PERCPU_MODULE_RESERVE  65536
 #endif

 #define PERCPU_ENOUGH_ROOM                                             \
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 3:07 pm

this seems to have the intended effect of +0x10000 padding at the end of 
the percpu area:

 c0b927d0 D per_cpu__cpu_info
 c0b92880 d per_cpu__runqueues
 c0ba2d00 D __per_cpu_end
 c0ba3000 B __bss_start

it still crashes though, with a very similar crash pattern to the
previous ones.

	Ingo

--

From: Mike Travis
Date: Tuesday, April 15, 2008 - 2:27 pm

I don't know that there is a boot option.  If modules are defined it
adds an extra 8k.  The size is defined in include/linux/percpu.h 
(PERCPU_ENOUGH_ROOM).

Otherwise define a really large per_cpu variable...?

-Mike
--

From: Pekka Enberg
Date: Tuesday, April 15, 2008 - 1:34 pm

I know this is a bit of hand-waving but have you noticed how all the
interesting sparsemem changes that one would expect to have caused the 
breakage happened _before_ v2.6.24? So sorry for asking this again but 
are we 110% sure the problem does not trigger with any of the 
v2.6.24-rcN kernels?

			Pekka
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 1:40 pm

quite. Here are all the successful bootups from my (failed) bisection
attempt:

 0773769191d943358a8392fa86abd756d004c4b6
 21af0297c7e56024a5ccc4d8ad2a590f9ec371ba
 26b8256e2bb930a8e4d4d10aa74950d8921376b8
 2a10e7c41254941cac87be1eccdcb6379ce097f5
 3aa88cdf6bcc9e510c0707581131b821a7d3b7cb
 49914084e797530d9baaf51df9eda77babc98fa8
 53a6e2342d73d509318836e320f70cd286acd69c
 5be3bda8987b12a87863c89b74b136fdb1f072db
 6d5f718a497375f853d90247f5f6963368e89803
 7272dcd31d56580dee7693c21e369fd167e137fe
 77de2c590ec72828156d85fa13a96db87301cc68
 82cfbb008572b1a953091ef78f767aa3ca213092
 b75f53dba8a4a61fda1ff7e0fb0fe3b0d80e0c64
 c087567d3ffb2c7c61e091982e6ca45478394f1a
 d4b37ff73540ab90bee57b882a10b21e2f97939f
 fde1b3fa947c2512e3715962ebb1d3a6a9b9bb7d

or, via git-describe:

v2.6.24-3908-g0773769
v2.6.24-2392-g21af029
v2.6.24-3868-g26b8256
v2.6.24-4463-g2a10e7c
v2.6.24-4457-g3aa88cd
v2.6.24
v2.6.24-3522-g53a6e23
v2.6.24-3131-g5be3bda
v2.6.24-4461-g6d5f718
v2.6.24-3891-g7272dcd
v2.6.24-3902-g77de2c5
v2.6.24-3613-g82cfbb0
v2.6.24-4449-gb75f53d
v2.6.24-3911-gc087567
v2.6.24-3913-gd4b37ff
v2.6.24-4464-gfde1b3f

i.e. vanilla v2.6.24 and a whole bunch of commits after it were booting 
just fine. (the problem might have been masked up to a certain point in 
theory, but given how resilient it is to offset changes in my testing i 
find that not very probable [but not impossible] )

	Ingo
--

From: Linus Torvalds
Date: Tuesday, April 15, 2008 - 2:06 pm

Ok, can you try this script

	git bisect start
	git bisect bad 7fa2ac3728ce828070fa3d5846c08157fe5ef431
	git bisect good 0773769191d943358a8392fa86abd756d004c4b6
	git bisect good 21af0297c7e56024a5ccc4d8ad2a590f9ec371ba
	git bisect good 26b8256e2bb930a8e4d4d10aa74950d8921376b8
	git bisect good 2a10e7c41254941cac87be1eccdcb6379ce097f5
	git bisect good 3aa88cdf6bcc9e510c0707581131b821a7d3b7cb
	git bisect good 49914084e797530d9baaf51df9eda77babc98fa8
	git bisect good 53a6e2342d73d509318836e320f70cd286acd69c
	git bisect good 5be3bda8987b12a87863c89b74b136fdb1f072db
	git bisect good 6d5f718a497375f853d90247f5f6963368e89803
	git bisect good 7272dcd31d56580dee7693c21e369fd167e137fe
	git bisect good 77de2c590ec72828156d85fa13a96db87301cc68
	git bisect good 82cfbb008572b1a953091ef78f767aa3ca213092
	git bisect good b75f53dba8a4a61fda1ff7e0fb0fe3b0d80e0c64
	git bisect good c087567d3ffb2c7c61e091982e6ca45478394f1a
	git bisect good d4b37ff73540ab90bee57b882a10b21e2f97939f
	git bisect good fde1b3fa947c2512e3715962ebb1d3a6a9b9bb7d

and then you'll apparently hit that commit you had compile problems
with.  HOWEVER, at that point, just do

	git bisect visualize

and pick a commit somewhere roughly half-way that you suspect is a good 
point of testing, but not near the range that you had problems with.  If 
you have compile problems in the middle, pick something that is just one 
third down, for example. It will make the bisection slower, but 
considering how unable we've been to make much progress other ways, if we 
can narrow it down from 1874 commits to something smaller, I suspect we'll 
be much happier.

Then you just do

	git checkout <sha-you-picked-out-here>

and compile that one, and check. 

Besides, while the _optimal_ point is half-way, even if you only remove a 
third or a quarter of the commits at each stage, it's still going to be 
an exponential thing.

		Linus
--
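
The point about sub-optimal bisection points still converging quickly can
be checked with a small calculation. The sketch below is a hypothetical
model, not anything git does internally: it assumes every test discards a
fixed fraction of the remaining commits (rounded down, but at least one)
and counts the tests needed to get from 1874 candidates down to one. The
function name bisect_steps() is made up for this illustration.

```c
/*
 * Hypothetical model: each test discards a fixed fraction (num/den) of
 * the remaining candidate commits, rounded down but never less than one.
 * Returns how many tests it takes to get down to a single commit.
 */
static unsigned int bisect_steps(unsigned long commits,
				 unsigned long num, unsigned long den)
{
	unsigned int steps = 0;

	while (commits > 1) {
		unsigned long drop = commits * num / den;

		if (drop == 0)
			drop = 1;	/* always make some progress */
		commits -= drop;
		steps++;
	}
	return steps;
}
```

Under this model bisect_steps(1874, 1, 2) gives 11 tests and
bisect_steps(1874, 1, 3) gives 19, so discarding only a third per step
costs fewer than twice as many builds as perfect halving.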

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 2:13 pm

ok, will try that now.

The 'bad' points i posted are definitely well-established as i 
post-validated them by looking for the slab.c crash pattern in the
serial logs and looking at the git commit in the bootup signature. (this 
is more reliable than looking at bisection logs - i tried 4 different 
bisection runs and not all were reliable.)

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 2:24 pm

btw., as i progress with that bisection effort, i triggered new crash 
patterns, which happen later during bootup:

[    0.775886] initcall 0xc0b00559 ran for 0 msecs: ksysfs_init+0x0/0x96()
[    0.777885] Calling initcall 0xc0b01eb8: filelock_init+0x0/0x27()
[    0.780137] BUG: unable to handle kernel NULL pointer dereference at 00000001
[    0.782883] IP: [<c0293981>] strlen+0xb/0x15
[    0.784884] *pde = 00000000 
[    0.786889] Oops: 0000 [#1] SMP 
[    0.787880] 
[    0.787880] Pid: 1, comm: swapper Not tainted (2.6.24-05281-g6232665 #3)
[    0.787880] EIP: 0060:[<c0293981>] EFLAGS: 00010286 CPU: 0
[    0.787880] EIP is at strlen+0xb/0x15
[    0.787880] EAX: 00000000 EBX: 00040000 ECX: ffffffff EDX: 00040000
[    0.787880] ESI: c0915320 EDI: 00000001 EBP: f7c23f08 ESP: f7c23f04
[    0.787880]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[    0.787880] Process swapper (pid: 1, ti=f7c22000 task=f7c20000 task.ti=f7c22000)
[    0.787880] Stack: f7c1f540 f7c23f18 c015b138 00000094 c016b45a f7c23f48 c015c92c c016b45a 
[    0.787880]        00000025 c0915320 f7c1f540 000000d0 f7c23f58 00000282 f7c1f540 00000000 
[    0.787880]        00000000 f7c23f84 c015cbc8 00000060 00000000 00040000 c016b45a c0133e81 
[    0.787880] Call Trace:
[    0.787880]  [<c015b138>] ? kmem_cache_flags+0x3d/0x5b
[    0.787880]  [<c016b45a>] ? init_once+0x0/0xc
[    0.787880]  [<c015c92c>] ? kmem_cache_open+0x64/0x128
[    0.787880]  [<c016b45a>] ? init_once+0x0/0xc
[    0.787880]  [<c015cbc8>] ? kmem_cache_create+0x14e/0x1d6
[    0.787880]  [<c016b45a>] ? init_once+0x0/0xc
[    0.787880]  [<c0133e81>] ? ktime_get_ts+0x3b/0x3f
[    0.787880]  [<c0133e96>] ? ktime_get+0x11/0x2f
[    0.787880]  [<c0b01ed6>] ? filelock_init+0x1e/0x27
[    0.787880]  [<c016b45a>] ? init_once+0x0/0xc
[    0.787880]  [<c0af176d>] ? kernel_init+0x13f/0x298
[    0.787880]  [<c0103b46>] ? ret_from_fork+0x6/0x20
[    0.787880]  [<c0af162e>] ? kernel_init+0x0/0x298
[    0.787880]  [<c0af162e>] ? kernel_init+0x0/0x298
[    ...
From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 2:42 pm

It may help to run with slub_debug (or CONFIG_SLUB_DEBUG_ON) then to 
detect the corruption as early as possible. Only works if you get by 
kmem_cache_init() though. Should give us some informative dumps of what is 
exactly corrupted if it hits a slab object.
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 2:55 pm

grumble. Do you read my bugreports?

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLUB_DEBUG_ON=y

	Ingo
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 3:06 pm

Cannot memorize everything sorry. This looked like slab object corruption 
and there were no slub diagnostics in the log.

Trying to duplicate the issue here.

Boot with PAE support on a machine with 8G RAM 8 processors here works 
fine.

Also booting without highmem support (SMP) works fine. 

On how many machines does the problem occur?

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc9
# Tue Apr 15 21:59:36 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General ...
From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 3:13 pm

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 3:27 pm

Yup that config fails here too... 




--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 3:32 pm

great. Note that it's randconfig generated - so watch out for weird 
config combinations.

randconfig, besides finding build-bugs, is also good at finding various 
runtime bugs: it is great at finding weird alignment and 
boundary-condition bugs in generic code, and it's also great at finding 
races (by virtue of introducing random delays between various functions, 
via random enabling/disabling of debug facilities and other options that 
impact the generated code's layout and timing).

	Ingo
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 4:22 pm

Hmmm... If one enables CONFIG_X86_PAE (even with no highmem) then 
everything is fine. For PAE to be enabled some other things also fall by 
the wayside. Diff to your failing config follows. Will try to minimize 
the diff even further:

--- config-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9	2008-04-15 06:02:13.000000000 +0000
+++ .config	2008-04-15 23:15:53.000000000 +0000
@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
 # Linux kernel version: 2.6.25-rc9
-# Tue Apr 15 07:37:33 2008
+# Tue Apr 15 23:15:53 2008
 #
 # CONFIG_64BIT is not set
 CONFIG_X86_32=y
@@ -169,11 +169,11 @@
 # CONFIG_X86_VSMP is not set
 # CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER is not set
 CONFIG_PARAVIRT_GUEST=y
+# CONFIG_XEN is not set
 CONFIG_VMI=y
-# CONFIG_LGUEST_GUEST is not set
 CONFIG_PARAVIRT=y
 # CONFIG_M386 is not set
-CONFIG_M486=y
+# CONFIG_M486 is not set
 # CONFIG_M586 is not set
 # CONFIG_M586TSC is not set
 # CONFIG_M586MMX is not set
@@ -196,20 +196,23 @@
 # CONFIG_MVIAC3_2 is not set
 # CONFIG_MVIAC7 is not set
 # CONFIG_MPSC is not set
-# CONFIG_MCORE2 is not set
+CONFIG_MCORE2=y
 # CONFIG_GENERIC_CPU is not set
 # CONFIG_X86_GENERIC is not set
 CONFIG_X86_CMPXCHG=y
-CONFIG_X86_L1_CACHE_SHIFT=4
+CONFIG_X86_L1_CACHE_SHIFT=6
 CONFIG_X86_XADD=y
-# CONFIG_X86_PPRO_FENCE is not set
-CONFIG_X86_F00F_BUG=y
 CONFIG_X86_WP_WORKS_OK=y
 CONFIG_X86_INVLPG=y
 CONFIG_X86_BSWAP=y
 CONFIG_X86_POPAD_OK=y
-CONFIG_X86_ALIGNMENT_16=y
-CONFIG_X86_MINIMUM_CPU_FAMILY=4
+CONFIG_X86_GOOD_APIC=y
+CONFIG_X86_INTEL_USERCOPY=y
+CONFIG_X86_USE_PPRO_CHECKSUM=y
+CONFIG_X86_P6_NOP=y
+CONFIG_X86_TSC=y
+CONFIG_X86_MINIMUM_CPU_FAMILY=6
+CONFIG_X86_DEBUGCTLMSR=y
 # CONFIG_HPET_TIMER is not set
 # CONFIG_IOMMU_HELPER is not set
 CONFIG_NR_CPUS=8
@@ -229,11 +232,11 @@
 CONFIG_MICROCODE_OLD_INTERFACE=y
 CONFIG_X86_MSR=y
 CONFIG_X86_CPUID=y
-# CONFIG_NOHIGHMEM is not set
-CONFIG_HIGHMEM4G=y
+CONFIG_NOHIGHMEM=y
+# CONFIG_HIGHMEM4G is not set
 # CONFIG_HIGHMEM64G is not set
 ...
From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 4:27 pm

... in the thread i've already explained that it's because on PAE we use 
1GB sparse chunks (shift 30) which masks the bug.

(on PAE we cannot go below a shift of 29 due to shortage of page->flags)

	Ingo
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 4:32 pm

Ahh. Right. That is the same situation as HIGHMEM_64G. I was able to 
enable PAE without HIGHMEM_64G. Thought that would keep things as is.
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 5:04 pm

Added some printks to the initialization of slub and I see 0x1 double 
words written over global variables that should be zero. The cpu mask to 
track processors that are initialized is screwed up 
(kmem_cach_cpu_free_init_once).

[    0.000999] 0xc0cb320c:  00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 ................
[    0.000999] 0xc0cb321c:  00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 ................
[    0.000999] 0xc0cb322c:  00 00 00 00 01 00 00 00                 ........

c0cb3010 B mem_section
c0cb3210 b lock.25923
c0cb3214 b shmem_inode_cachep
c0cb3218 b shm_mnt
c0cb321c b slab_state
c0cb3220 b kmem_cach_cpu_free_init_once
c0cb3224 b slub_debug
c0cb3228 b slub_debug_slabs

mem_section is 512 bytes long. Array overrun?

--
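
The "array overrun?" question can be tested against the numbers in the
symbol listing and hex dump above. The sketch below assumes each
mem_section entry is 8 bytes (two 32-bit words) on this config; that
entry size is an assumption, chosen because the 0x1 words in the dump
repeat at an 8-byte stride. The helper names are made up for this
illustration.

```c
#include <stdint.h>

/* Values taken from the symbol listing and hex dump in the thread. */
#define MEM_SECTION_BASE   0xc0cb3010u  /* address of mem_section */
#define MEM_SECTION_BYTES  512u         /* size reported in the thread */
#define ENTRY_BYTES        8u           /* assumption: two 32-bit words per entry */

/* Number of entries the array really has under the assumption above. */
static unsigned int nr_entries(void)
{
	return MEM_SECTION_BYTES / ENTRY_BYTES;
}

/* Index an address would have if it were treated as mem_section[i]. */
static unsigned int entry_index(uint32_t addr)
{
	return (addr - MEM_SECTION_BASE) / ENTRY_BYTES;
}

/* Non-zero if addr is the first word of an "entry" past the array end. */
static int is_overrun_slot(uint32_t addr)
{
	return (addr - MEM_SECTION_BASE) % ENTRY_BYTES == 0 &&
	       entry_index(addr) >= nr_entries();
}
```

With these numbers the array has 64 entries, and the dwords that read 0x1
in the dump (0xc0cb3210, 0xc0cb3218, 0xc0cb3220, 0xc0cb3228) line up with
the first word of would-be entries 64 through 67, while the zero dwords in
between fall on the second word of each entry.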

From: Yinghai Lu
Date: Tuesday, April 15, 2008 - 4:18 pm

same .config  with -rc9 on one system with 128g 4 sockets with quad core crashed here too.

YH
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 5:03 pm

finally found it ... the patch below solves the sparsemem crash and the 
testsystem boots up fine now:

  mars:~> uname -a
  Linux mars 2.6.25-rc9-sched-devel.git-x86-latest.git #985 SMP Wed Apr 16 
  01:37:37 CEST 2008 i686 i686 i386 GNU/Linux

yay! :-)

	Ingo

ps. anyone who can correctly guess the method with which i found the 
    exact place that corrupted memory will get a free beer next time we 
    meet :-)

------------------------->
Subject: mm: sparsemem memory_present() memory corruption fix
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 16 01:40:00 CEST 2008

fix memory corruption and crash on 32-bit x86 systems.

if a !PAE x86 kernel is booted on a 32-bit system with more than
4GB of RAM, then we call memory_present() with a start/end that
goes outside the scope of MAX_PHYSMEM_BITS.

that causes this loop to happily walk over the limit of the
sparse memory section map:

    for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
                unsigned long section = pfn_to_section_nr(pfn);
                struct mem_section *ms;

                sparse_index_init(section, nid);
                set_section_nid(section, nid);

                ms = __nr_to_section(section);
                if (!ms->section_mem_map)
                        ms->section_mem_map = sparse_encode_early_nid(nid) |

'ms' will be out of bounds and we'll corrupt a small amount of memory by
encoding the node ID. Depending on what that memory is, we might crash, 
misbehave or just not notice the bug.

the fix is to sanity check anything the architecture passes to sparsemem.

this bug seems to be rather old (as old as sparsemem support itself),
but the exact incarnation depended on random details like configs,
which made this bug more prominent in v2.6.25-to-be.

an additional enhancement might be to print a warning about ignored
or trimmed memory ranges.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 mm/sparse.c |   10 ++++++++++
 1 file changed, 10 ...
From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 5:10 pm

Yes that fixes it here too. And the corruption that I saw of slab 
variables is explained by your analysis. Thanks!

Tested-by: Christoph Lameter <clameter@sgi.com>
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 5:18 pm

i re-checked the original SLAB config too and that boots fine as well 
now - so i'm confident that the regression has been sufficiently cured.

it's getting quite late here (or rather, it's getting early :-/ ) so it 
would be nice if others could double-check this calculation (with an eye 
on all possible architectures):

+       unsigned long max_arch_pfn = 1ULL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);

and also check my analysis whether it is correct and whether it matches 
the reported bug patterns. But otherwise the fix looks like a safe fix 
for v2.6.25-final to me - it only filters out values from sparsemem 
input that are nonsensical in the sparsemem framework anyway.

	Ingo
--
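
For readers who want to double-check the calculation, here is a userspace
model of the sanity check described in the changelog. It is a sketch
only, not the patch itself: the constants are assumed values for 32-bit
!PAE x86 (PAGE_SHIFT 12, SECTION_SIZE_BITS 26, MAX_PHYSMEM_BITS 32), the
section map is reduced to one flag per section, and
model_memory_present() is a made-up name.

```c
/* Assumed values for 32-bit !PAE x86; not taken from the patch itself. */
#define PAGE_SHIFT         12
#define SECTION_SIZE_BITS  26
#define MAX_PHYSMEM_BITS   32

#define PFN_SECTION_SHIFT  (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION  (1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS    (1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))

/* Stand-in for the section map: one "present" flag per section. */
static unsigned char section_present[NR_MEM_SECTIONS];

/*
 * Model of memory_present() with the range check: pfn ranges beyond the
 * sparsemem limit are ignored or trimmed before the loop runs.
 * Returns the number of sections newly marked present.
 */
static unsigned long model_memory_present(unsigned long start,
					  unsigned long end)
{
	unsigned long max_arch_pfn = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);
	unsigned long pfn, marked = 0;

	if (start >= max_arch_pfn)
		return 0;		/* wholly out of range: ignore */
	if (end > max_arch_pfn)
		end = max_arch_pfn;	/* partially out of range: trim */

	start &= ~(PAGES_PER_SECTION - 1);	/* round down to a section */
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
		/* pfn < max_arch_pfn here, so section < NR_MEM_SECTIONS */
		unsigned long section = pfn >> PFN_SECTION_SHIFT;

		if (!section_present[section]) {
			section_present[section] = 1;
			marked++;
		}
	}
	return marked;
}
```

A hypothetical call with start 0 and end 0x110000 (the pfn of the top of
the e820 map shown earlier) marks exactly 64 sections. Without the trim,
the last iteration would compute section 67 and index past a 64-entry
map.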

From: Yinghai Lu
Date: Tuesday, April 15, 2008 - 5:32 pm

can you check why find_max_pfn() e820_32.c need to call memory_present?
wonder if it can be removed.

YH
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 5:44 pm

this is the only call to memory_present() we do in 32-bit arch setup, so 
it's required.

(the function find_max_pfn() is woefully misnamed, but that's a cleanup 
- i just fixed this in x86.git.)

	Ingo
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 5:46 pm

We could clip there if SPARSEMEM is configured. I wonder if this affects 
other platforms that need HIGHMEM support?
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 5:52 pm

clip where and what?

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 6:17 pm

i.e. as per my previous argument i'd consider the need to sanitize the 
calls in the architecture fundamentally wrong.

whether the core code emits a warning or allows the call is an 
additional question i mention in the changelog - but the core sparse 
memory code should _definitely_ not silently overflow a key internal 
array ... (of which data structure the architecture code is not even 
aware of)

	Ingo
--

From: Yinghai Lu
Date: Tuesday, April 15, 2008 - 6:30 pm

or you can move that check into find_max_pfn for x86_32? so it will
not affect other platform regarding Christoph's concern?

YH
--

From: Yinghai Lu
Date: Tuesday, April 15, 2008 - 7:00 pm

the patch doesn't have side effects on x86_64.

YH
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 15, 2008 - 7:20 pm

On Tue, 15 Apr 2008 19:00:18 -0700
also no side effects on my ia64/NUMA box, which has sparse physical memory map.

Thanks,
-Kame



--

From: Yinghai Lu
Date: Tuesday, April 15, 2008 - 5:56 pm

64 bit is calling that via paging_init
==>sparse_memory_present_with_active_regions(MAX_NUMNODES).

and
void __init sparse_memory_present_with_active_regions(int nid)
{
        int i;

        for_each_active_range_index_in_nid(i, nid)
                memory_present(early_node_map[i].nid,
                                early_node_map[i].start_pfn,
                                early_node_map[i].end_pfn);
}

that is somewhat later than on 32 bit.

YH
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 6:02 pm

yeah - 64-bit is different here and it's not affected by the problem 
because there SECTION_SIZE_BITS is 27 (==128 MB chunks), 
MAX_PHYSADDR_BITS is 40 (== 1 TB) - giving 8192 section map entries. 
Once larger than 1 TB 64-bit x86 systems are created MAX_PHYSADDR_BITS 
needs to be increased.

The only downside of the current setup on 64-bit is that it wastes 128K 
of RAM on the majority of systems. We could perhaps try a shift of 28, 
which halves the footprint to 64K of RAM, and which still is good enough 
to allow the PCI aperture to remain a hole on most systems. It would 
also compress the data-cache footprint of the sparse memory maps. 
(without having to use sparsemem-extreme indirection)

	Ingo
--
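
The footprint figures in the message above follow from simple shifts. The
sketch below reproduces them; the 16-byte entry size is an assumption
implied by the quoted numbers (128K spread over 8192 entries), and both
function names are made up for this illustration.

```c
/* Number of section map entries for a given address width and section size. */
static unsigned long nr_sections(unsigned int max_physaddr_bits,
				 unsigned int section_size_bits)
{
	return 1UL << (max_physaddr_bits - section_size_bits);
}

/* Static footprint of the section map in bytes. */
static unsigned long section_map_bytes(unsigned int max_physaddr_bits,
				       unsigned int section_size_bits,
				       unsigned long entry_bytes)
{
	return nr_sections(max_physaddr_bits, section_size_bits) * entry_bytes;
}
```

With 40 address bits, a shift of 27 gives 8192 entries and 128K at 16
bytes per entry; a shift of 28 gives 4096 entries and 64K.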

From: Yinghai Lu
Date: Tuesday, April 15, 2008 - 6:17 pm

also 64 bit
early_node_map[10] active PFN ranges
    0:        0 ->      149
    0:      256 ->   917408
    0:  1048576 ->  8519680
    1:  8519680 -> 16908288
    2: 16908288 -> 25296896
    3: 25296896 -> 33685504
    4: 33685504 -> 42074112
    5: 42074112 -> 50462720
    6: 50462720 -> 58851328
    7: 58851328 -> 67239936

and 32 bit only has one entry
[    0.000000] early_node_map[1] active PFN ranges
[    0.000000]     0:        0 ->  1048576

YH
--

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 5:19 pm

Well okay this fixes it but is this the right fix? The arch should not 
call memory_present() with an invalid pfn.
--

From: Yinghai Lu
Date: Tuesday, April 15, 2008 - 5:33 pm

yes in find_max_pfn...

YH
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 5:36 pm

it is the right fix. The architecture memory setup code doesnt even 
_know_ the limits at this place in an open-coded way (and shouldnt know 
them) - and even later on we use pfn_valid() to determine whether to 
attempt to get to a struct page and free it into the buddy.

[ Of course the architecture code in general 'knows' about the limits - 
  but still it's cleaner to have a dumb enumeration interface here 
  combined with a resilient core code - that's always going to be less 
  fragile. ]

btw., i just did some bug history analysis, the calls were originally 
added when sparsemem support was added:

| commit 215c3409eed16c89b6d11ea1126bd9d4f36b9afd
| Author: Andy Whitcroft <apw@shadowen.org>
| Date:   Fri Jan 6 00:12:06 2006 -0800
|
|    [PATCH] i386 sparsemem for single node systems

in v2.6.15-1003-g215c340. (so this appears to be an unfixed bug in 
v2.6.16 as well)

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 5:34 pm

the corruption might happen when encoding a non-zero node ID, or due to 
the SECTION_MARKED_PRESENT which is 0x1:

mmzone.h:#define        SECTION_MARKED_PRESENT  (1UL<<0)

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 5:40 pm

Joe Perches pointed out that the ULL was superfluous (i typoed it, i 
knew it's a pfn). Updated patch below.

	Ingo

-------------------------->
Subject: mm: sparsemem memory_present() fix
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 16 01:40:00 CEST 2008

fix memory corruption and crash on 32-bit x86 systems.

if a !PAE x86 kernel is booted on a 32-bit system with more than
4GB of RAM, then we call memory_present() with a start/end that
goes outside the scope of MAX_PHYSMEM_BITS.

that causes this loop to happily walk over the limit of the
sparse memory section map:

    for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
                unsigned long section = pfn_to_section_nr(pfn);
                struct mem_section *ms;

                sparse_index_init(section, nid);
                set_section_nid(section, nid);

                ms = __nr_to_section(section);
                if (!ms->section_mem_map)
                        ms->section_mem_map = sparse_encode_early_nid(nid) |
			                                SECTION_MARKED_PRESENT;

'ms' will be out of bounds and we'll corrupt a small amount of memory by
encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it.

the fix is to sanity check anything the architecture passes to sparsemem.

this bug seems to be rather old (as old as sparsemem support itself),
but the exact incarnation depended on random details like configs,
which made this bug more prominent in v2.6.25-to-be.

an additional enhancement might be to print a warning about ignored
or trimmed memory ranges.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 mm/sparse.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux/mm/sparse.c
===================================================================
--- linux.orig/mm/sparse.c
+++ linux/mm/sparse.c
@@ -149,8 +149,18 @@ static inline int sparse_early_nid(struc
 /* Record a memory area against a node. */
 void __init memory_present(int nid, unsigned long ...
From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 5:45 pm

So it's a general issue that has been there for years that we are now 
noticing because we are now testing with memory sizes > 4GB. This also 
affects the enterprise releases (SLES10, RHEL5). Argh!

I wonder why this did not show up earlier in testing? Running a kernel 
that cannot access all of memory is unusual I guess.


--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 5:52 pm

i guess people saw the "you are not running a PAE kernel" warning and 
went to a PAE kernel which didnt have this issue.

OTOH, quite a few testers consciously use non-PAE kernels on 4GB 
systems, so i'd not be surprised if this solved a few mystery 
regressions we have.

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 6:14 pm

i believe this was the reason why my many bisection attempts were 
unsuccessful: the bug pattern was not stable and seemingly working 
kernels had the memory corruption too. It was pure luck that v2.6.24 
"worked" and v2.6.25-rc9 broke visibly.

	Ingo
--

From: Linus Torvalds
Date: Tuesday, April 15, 2008 - 7:45 pm

Ok, you didn't make that addendum to your second version, so I added it 
myself.

Anyway, good job. I've pushed this out, and will let this simmer at least 
overnight to see if there are any brown-paper-bag issues (either with this 
or with some last changes from Andrew), but I'm happy, and I think I'll do 
the real 2.6.25 tomorrow.

		Linus
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 15, 2008 - 6:48 pm

I'm sorry to be too late here..

On Wed, 16 Apr 2008 02:03:56 +0200
how about

max_arch_pfn = NR_MEM_SECTIONS * PAGES_PER_SECTION.

?
Thanks,
-Kame

--
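
The two ways of writing the limit should give the same value, assuming
NR_MEM_SECTIONS is 1 << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) and
PAGES_PER_SECTION is 1 << (SECTION_SIZE_BITS - PAGE_SHIFT): the
SECTION_SIZE_BITS terms cancel in the product. The sketch below checks
that for three assumed configurations; the struct and function names are
made up for this illustration.

```c
/* One sparsemem configuration; the field values used are assumptions. */
struct sparsemem_cfg {
	unsigned int page_shift;
	unsigned int section_size_bits;
	unsigned int max_physmem_bits;
};

/* Limit as written in the patch: 1 << (MAX_PHYSMEM_BITS - PAGE_SHIFT). */
static unsigned long long limit_from_physmem_bits(const struct sparsemem_cfg *c)
{
	return 1ULL << (c->max_physmem_bits - c->page_shift);
}

/* Limit as suggested here: NR_MEM_SECTIONS * PAGES_PER_SECTION. */
static unsigned long long limit_from_sections(const struct sparsemem_cfg *c)
{
	unsigned long long nr_mem_sections =
		1ULL << (c->max_physmem_bits - c->section_size_bits);
	unsigned long long pages_per_section =
		1ULL << (c->section_size_bits - c->page_shift);

	return nr_mem_sections * pages_per_section;
}
```

For example, with a page shift of 12, a section size of 26 bits and 32
physical memory bits, both functions return 1048576 pfns. The second form
has the advantage of being tied directly to the size of the section map.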

From: Mel Gorman
Date: Wednesday, April 16, 2008 - 7:05 am

Very cool :) This fixed the silent lock-up that I was getting when using
your config as well.

At a bit of a loss yesterday to explain what was going wrong, I had started
putting together patches to sanity check memory initialisation at various
different stages trying to catch where things were going pear-shaped. You
found the bug before it was done but I finished the basics anyway and posted
it as "[RFC] Verification and debugging of memory initialisation". Something
like it may help avoid similar headaches for people who tend to run into

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Ingo Molnar
Date: Wednesday, April 16, 2008 - 8:03 am

the method was to notice that the slub_debug_slabs SLUB variable got 
corrupted from an expected value of 0 to a value of 0x1.

Then i added a simple brute-force function-tracer hook (in sched-devel) 
that checked when slub_debug_slabs went from 0 to 1, and which then 
printed a backtrace.

Since under CONFIG_FTRACE=y every kernel function calls this callback, 
it triggered immediately after the value got corrupted:

[    0.000000] console [earlyser0] enabled
[    0.000000] BUG: slub_debug_slabs: 00000001
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.25-rc9-sched-devel.git-x86-latest.git #982
[    0.000000]  [<c0177fba>] print_slub_debug_slabs+0x3a/0x40
[    0.000000]  [<c01050f7>] trace+0x8/0x11
[    0.000000]  [<c0cc929e>] ? mtrr_bp_init+0xe/0x320
[    0.000000]  [<c01050f7>] ? trace+0x8/0x11
[    0.000000]  [<c0cd7369>] ? memory_present+0x9/0x50
[    0.000000]  [<c0cc7a09>] ? find_max_pfn+0x99/0xb0
[    0.000000]  [<c0cc6af7>] setup_arch+0x217/0x470
[    0.000000]  [<c012c59b>] ? printk+0x1b/0x20
[    0.000000]  [<c0cc2b46>] start_kernel+0x96/0x3f0
[    0.000000]  [<c0cc22fd>] i386_start_kernel+0xd/0x10
[    0.000000]  =======================
[    0.000000] x86: PAT support disabled.

and the backtrace had all the guilty parties on stack - memory_present() 
[which was just called] and find_max_pfn()/setup_arch() - thanks to the 
new fuzzy "?" backtrace entries we print out in v2.6.25.

(i could also have printed out the current ftrace buffer as well, 
showing the history of all recent function calls that the kernel 
executed.)

	Ingo
--
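
The debugging method above can be sketched in plain userspace C. This is
only an analogy under stated assumptions: it does not use the ftrace API,
the hook is called by hand at the top of each function instead of being
inserted by the compiler, and all names (watched, trace_hook,
run_boot_sequence and so on) are made up for this illustration.

```c
#include <stddef.h>

/* Watched word; stands in for slub_debug_slabs in the thread. */
static unsigned long watched;
/* Name of the function whose entry hook first saw the corruption. */
static const char *first_seen_in;
/* Running count of hook invocations. */
static unsigned int hook_calls;
/* Value of hook_calls when the corruption was first seen (0 = never). */
static unsigned int detected_at_call;

/* Hook run at the entry of every "traced" function. */
static void trace_hook(const char *fn)
{
	hook_calls++;
	if (watched != 0 && first_seen_in == NULL) {
		first_seen_in = fn;
		detected_at_call = hook_calls;
	}
}

static void innocent_function(void)
{
	trace_hook(__func__);
}

static void corrupting_function(void)
{
	trace_hook(__func__);	/* the value is still 0 at entry */
	watched = 1;		/* models the stray 0x1 write */
}

/* Returns the hook call number at which the corruption was noticed. */
static unsigned int run_boot_sequence(void)
{
	watched = 0;
	first_seen_in = NULL;
	hook_calls = 0;
	detected_at_call = 0;

	innocent_function();
	corrupting_function();
	innocent_function();	/* hook fires here, right after the write */
	return detected_at_call;
}
```

Note that the hook trips at the entry of the next traced function after
the bad write, not inside the culprit. That matches the backtrace above,
where the check fired from a later call and the culprit only appeared as
a stale "?" entry on the stack.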

From: Christoph Lameter
Date: Tuesday, April 15, 2008 - 1:54 pm

The simplest solution for now may be to go with your workaround increasing 
SECTION_SIZE_BITS to 27. PAE mode already uses 30 and x86_64 also works 
with 27. This is going to affect the memory hotplug granularity for !PAE 
32 bit configs though. Kame-san, any concerns with that?
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 1:58 pm

the bug's effects are so severe that this is the last thing i'd like to 
do.

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 2:08 pm

more verbosely: we sometimes do "blind" reverts, if it's reasonably 
established (or strongly suspected) that a revert makes a bug less 
severe. We do this even if we don't fully understand the bug and its 
effects and time runs out - on the assumption that we won't get worse 
than the old code was.

but what i'd not really like to do are blind _non-revert_ changes. With 
your suggested change we'd introduce a seemingly innocuous but still 
wholly new (and untested) memory setup layout on the most popular Linux 
kernel memory config in existence. (!PAE 32-bit is still being run on 
more than 50% of the Linux desktops - around 80% run 32-bit kernels.)

And as this bug demonstrates, seemingly small differences appear to 
have large effects so we cannot know in what direction that would go - 
we might turn a rare regression into a common regression. I'd rather 
release with this bug unfixed than tweak it just because the effect 
seems less severe on a totally unrepresentative set of systems.

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 15, 2008 - 1:23 pm

btw., here's the 'good' versus 'bad' bootup log (vanilla kernel spiced 
with a few extra stats printed out [*]):

 http://redhat.com/~mingo/misc/boot.26.log         # bad
 http://redhat.com/~mingo/misc/boot.27.log         # good

the only difference is SECTION_SIZE_BITS == 26 versus 27.

looking at the dmesg diff, there's just minimal (and expected) offset 
difference in some structure sizes. (more sparse maps use a bit more 
memory)

	Ingo

[*] in case you wonder why memory_section->map is twice its size - i 
    doubled it just to eliminate any doubts about off-by-one errors. 
    Their natural size, as returned by bootmem, was 512KB plus 16 bytes 
    (!), which seemed a bit weird. Probably a section entry came between 
    two memory map allocations?
--

From: Christoph Lameter
Date: Friday, April 11, 2008 - 12:26 pm

Allowing systems without node 0 is a major change for x86.

--

From: Christoph Lameter
Date: Friday, April 11, 2008 - 12:25 pm

I also have an internal report that x86-git causes boot to fail on an 8p 
system if one starts with an x86_64 config file and then converts to 
x86_32. Somehow the NR_CPUS is set to 255 in that case. Could this 
exhaust memory? I guess the per cpu cleanup work may figure in that area.
Mike?

--

From: Ingo Molnar
Date: Monday, April 14, 2008 - 10:49 pm

how about reading my bugreport that you replied to:

   http://lkml.org/lkml/2008/4/11/34

It gives an answer to your question, trivially so. It includes an easy 
link to the very config that failed:

   http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad

which would tell you:

   CONFIG_NR_CPUS=8

so no, it's not 255 CPUs exhausting RAM ...

	Ingo
--
