our x86.git randconfig auto-qa found a mm/slab.c early-bootup crash in
mainline that got introduced since v2.6.24.

http://redhat.com/~mingo/misc/log-Thu_Apr_10_10_41_16_CEST_2008.bad
http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad

Note, the very same bzImage does not crash on other testboxes - only on
this 8-way box with 4GB of RAM.

i tried a "use v2.6.24's slab.c" revert (with a few API fixes needed for
it to build on .25) but that didnt solve the problem either.

i tried a bisection yesterday but it didnt work out too well - a
combination of block layer (?) and networking regressions made it
impossible.

Here's the list of "good" bisection points between v2.6.24 (from
multiple bisection runs):

0773769191d943358a8392fa86abd756d004c4b6
21af0297c7e56024a5ccc4d8ad2a590f9ec371ba
26b8256e2bb930a8e4d4d10aa74950d8921376b8
2a10e7c41254941cac87be1eccdcb6379ce097f5
3aa88cdf6bcc9e510c0707581131b821a7d3b7cb
49914084e797530d9baaf51df9eda77babc98fa8
53a6e2342d73d509318836e320f70cd286acd69c
5be3bda8987b12a87863c89b74b136fdb1f072db
6d5f718a497375f853d90247f5f6963368e89803
7272dcd31d56580dee7693c21e369fd167e137fe
77de2c590ec72828156d85fa13a96db87301cc68
82cfbb008572b1a953091ef78f767aa3ca213092
b75f53dba8a4a61fda1ff7e0fb0fe3b0d80e0c64
c087567d3ffb2c7c61e091982e6ca45478394f1a
d4b37ff73540ab90bee57b882a10b21e2f97939f
fde1b3fa947c2512e3715962ebb1d3a6a9b9bb7d

the "bad" bisection points where i saw a slab.c crash were:

7180c4c9e09888db0a188f729c96c6d7bd61fa83
7fa2ac3728ce828070fa3d5846c08157fe5ef431

this still leaves a rather large set of commits:

Bisecting: 1874 revisions left to test after this

and the mm/ bits alone look voluminous:

$ git-bisect visualize -p -- mm | diffstat | tail -1
106 files changed, 67759 insertions(+), 20852 deletions(-)

Ingo
---------------->
Subject: slab: revert
From: Ingo Molnar <mingo@elte.hu>
Date: Thu Apr 10 11:04:16 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte...
Hi Ingo,
As mentioned privately, I suspect it's the page allocator changes that
went into 2.6.24. Mel, Christoph, any ideas?
--
So I'm thinking it's probably related to this patch:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...
As kmalloc_node() in setup_cpu_cache() returns NULL, it seems likely
to be due to the use of GFP_THISNODE in cache_alloc_refill() when
calling cache_grow() and that the semantics changed. No idea why page
allocator would think your UMA "local node" has no memory though.

Pekka
--
but ... as i said it in my report, this is a regression since v2.6.24 -
v2.6.24 (and a whole bunch of commits since then, i listed the IDs)
booted up fine. The commit ID you mention is: v2.6.23-4345-g523b945, way
earlier than the good commit IDs.

so this is a recent regression.
Ingo
--
Hi Ingo.
Right. Then you probably want to look into any changes in arch/x86/
related to setting up the zonelists. I'm fairly certain this is not a
slab bug and I don't see any recent changes to the page allocator
either that would explain this.
--
I also have an internal report that x86-git causes boot to fail with an 8p
if one starts with a x86_64 config file and then converts to
x86_32. Somehow the NR_CPUS is set to 255 in that case. Could this
exhaust memory? I guess the per cpu cleanup work may figure in that area.
Mike?
--
how about reading my bugreport that you replied to:
http://lkml.org/lkml/2008/4/11/34
It gives an answer to your question, trivially so. It includes an easy
link to the very config that failed:

http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad

which would tell you:
CONFIG_NR_CPUS=8
so no, it's not 255 CPUs exhausting RAM ...
Ingo
--
I'd be willing to put some money on this:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...
--
Allowing systems without node 0 is a major change for x86.
--
And I'd lose as you're 32-bit. Oh well, that's the price to pay for
pretending to know x86 arch internals.
--
you asked me to run with the debug patch attached below. I just tried
vanilla -rc9 (head 120dd64cacd4fb7) and it still crashes with this
config:

http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9

debug output is:

http://redhat.com/~mingo/misc/log-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9

so it's probably the first few page allocations (setup_cpu_cache())
going wrong already - suggesting some fundamental borkage in SLAB?

note, when i change SLAB to SLUB (and keep the config unchanged
otherwise), i get a similar early crash:

http://redhat.com/~mingo/misc/log-Tue_Apr_15_07_24_59_CEST_2008.bad
http://redhat.com/~mingo/misc/config-Tue_Apr_15_07_24_59_CEST_2008.bad

i've also uploaded a bzImage (SLUB, debug patch not applied) that you
can pick up and run on any 32-bit test-system:

http://redhat.com/~mingo/misc/bzImage-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9

it's a relatively generic bzImage that should boot on most whitebox PCs
on most distros as long as you use a pure ext3 setup and might even give
you networking (no modules or initrd is needed). It boots fine on two
other 32-bit PCs i have (an Intel laptop and an AMD desktop).

Ingo

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1485,6 +1485,7 @@ restart:
 		 * Happens if we have an empty zonelist as a result of
 		 * GFP_THISNODE being used on a memoryless node
 		 */
+		WARN_ON(1);
 		return NULL;
 	}

Index: linux/mm/slab.c
===================================================================
--- linux.orig/mm/slab.c
+++ linux/mm/slab.c
@@ -1682,6 +1682,7 @@ static void *kmem_getpages(struct kmem_c
 		flags |= __GFP_RECLAIMABLE;

 	page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+	WARN_ON(!page);
 	if (!page)
 		return NULL;

@@ -2620,6 +2621,7 @@ static struct slab *alloc_slabmgmt(struc
 		/* Slab management obj is off-slab. */
 		slab...
Well, I think it suggests some fundamental borkage in the page allocator.

That first warn-on is from the "alloc_pages_node()" returning NULL at
bootup. Sure, it could be that the arguments are bogus, but that sounds
unlikely since none of that is dependent on any kconfig stuff.

The fact that it happens with both SLUB/SLAB makes that even more obvious.

Now, you don't have fault injection on, so it can't be that, and your
debug entry for *z == NULL didn't trigger in alloc_pages, so it's not that
one either.

However, if __alloc_pages() failed, I would have expected to see the
"memory allocation failed" printk. Why didn't it? Is printk_ratelimit()
broken at boot (last_msg starts out as zero - maybe it should start out as
a negative number)?

Linus
--
btw., now with a second full day spent on this regression, i have
figured out a workaround the hard way: increasing SECTION_SIZE_BITS in
include/asm-x86/sparsemem.h from 26 to 27 makes it go away. (i.e. we use
section chunks of 128 MB instead of 64 MB before) I've given up on
analyzing the crash site - it seems rather random and uninformative and
just suggests page allocator borkage.

So this seems like a general sparsemem borkage. PAE uses a shift of 30
due to page->flags shortage (which masks this bug), 64-bit uses 27 which
too probably masks this bug.

Since this is a !NUMA config and !PAE as well, NODES_SHIFT is 0,
ZONES_SHIFT is 2, so the theory of running out of bits in page->flags is
wrong as well.

I also tried a hack to double the size of all sparsemem mem_map
allocations (on the theory of an overflow there) - but it didnt help.

So i think we need to go down further into the page allocator. Perhaps
the buddy bitmaps are wrongly sized somewhere. I'm grasping at straws.

Btw., Mel Gorman has reproduced crashes with my bzImage on his box (and
a hang with my config, using his build), so i think we can eliminate hw
and build environment specialities as a cause.

Ingo
--
btw., here's the 'good' versus 'bad' bootup log (vanilla kernel spiced
with a few extra stats printed out [*]):

http://redhat.com/~mingo/misc/boot.26.log # bad
http://redhat.com/~mingo/misc/boot.27.log # good

the only difference is SECTION_SIZE_BITS == 26 versus 27.

looking at the dmesg diff, there's just minimal (and expected) offset
difference in some structure sizes. (more sparse maps use a bit more
memory)

Ingo

[*] in case you wonder why memory_section->map is twice its size - i
doubled it just to eliminate any doubts about off-by-one errors.
Their natural size, as returned by bootmem, was 512KB plus 16 bytes
(!), which seemed a bit weird. Probably a section entry came between
two memory map allocations?
--
Interesting.
I wonder..
So since you don't have NUMA, you have NODES_SHIFT == 0.
That in turn means that NODE_NOT_IN_PAGE_FLAGS is _not_ set.
That, in turn, means that ZONEID_SHIFT does *not* contain SECTIONS_SHIFT.
Is that really what is supposed to happen?

Because then "page_is_buddy()" will not even test the section, as far as I
can tell.

But I'm probably missing something. Why would we not need to test the
section in page_zone_id() when the node ID is in the page flags (but has
zero size)?

Linus
--
Hmmmm. SECTION_SIZE_BITS == 26 means SECTIONS_SHIFT == 6.
Increasing SECTION_SIZE_BITS to 27 reduces SECTIONS_SHIFT to 5. Thereby
the number of sparsemem sections (NR_MEM_SECTIONS) is reduced to half (64
to 32).
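
The arithmetic behind those numbers can be checked with a small user-space
sketch. It is not kernel code: the function names are made up for
illustration, and the 32-bit physical address width for !PAE is taken from
the "4GB physical address space" remark elsewhere in this thread.

```c
#include <assert.h>

/*
 * Number of sparsemem sections needed to cover a physical address space of
 * 2^max_physmem_bits bytes when each section spans 2^section_size_bits bytes.
 * (Illustrative helper, not a kernel function.)
 */
unsigned long nr_mem_sections(unsigned int max_physmem_bits,
                              unsigned int section_size_bits)
{
    assert(max_physmem_bits >= section_size_bits);
    return 1UL << (max_physmem_bits - section_size_bits);
}

/* Size of one section in megabytes (2^20 bytes). */
unsigned long section_size_mb(unsigned int section_size_bits)
{
    assert(section_size_bits >= 20);
    return 1UL << (section_size_bits - 20);
}
```

With 32 address bits, nr_mem_sections(32, 26) gives 64 sections of 64 MB
and nr_mem_sections(32, 27) gives 32 sections of 128 MB, matching the
figures quoted above.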
--
yes, as i said in this thread already earlier today, the sparse chunking
goes from 64MB to 128MB. (and hence, by virtue of !PAE having a 4GB
physical address space, the # of sparse sections goes from 64 to 32 -
you can see the full sparse sections printout in my latest crashlog in
my previous mail, including the NR_MEM_SECTIONS printout.)

Pretty please, could you pay more than cursory attention to this bug i
already spent two full days on and which is blocking the v2.6.25
release?

Your commits are all over the place in this code, and you are one of the
maintainers as well. We've got 5000 lines of flux in mm/* in v2.6.25.

I'm just guessing my way around, but right now my impression is that the
current early memory setup code is unrobust, over-complex, occasionally
butt-ugly to read code in high need of cleanups, simplifications and
debug facilities, visibly plagued by hit-and-run changes with frequent
typos and everything else you normally dont want to see in the core
kernel. (Did i get your attention now? ;-)

Ingo
--
Yeah trying to get to understand how exactly sparsemem works and how the
32 bit highmem stuff interacts with it... Sorry not code that I am an
expert in nor the platform that I am familiar with. Code mods there
required heavy review from multiple parties with expertise in various

I thought the NR_SECTIONS stuff would trigger some memories. Adding apw
who seemed to be most familiar with the material in the past
(AFAICT NODE_NOT_IN_PAGE_FLAGS is there for IBM NUMAQ etc) and Kame-san.

Andy, Kame-san could you have a look at the sparsemem config issue with
32 bit !PAE? This is SPARSEMEM_STATIC.
--
yeah - sorry about that impatient flame. And it could still be anything
from the page allocator to bootmem - or some completely unrelated piece
of code corrupting some key data structure.

sparsemem is supposed to work roughly like this on x86 (32-bit):

- the x86 memory map comes from the bios via e820.

- those individual chunks of e820-enumerated memory get
  registered with mm/sparse.c's data structures via memory_present()
  callbacks. [btw., this should be renamed to register_memory_present()
  or register_sparse_range() - something less opaque.]

- there's really just 3 RAM areas that matter on this box, and the last
  one is unusable for !PAE, which leaves 2.

- there's a 256 MB PCI aperture hole at 0xf0000000.

- out of the 64 sparse memory chunks the first 60 get filled in (all have
  at least partially some RAM content) - the last 4 [the PCI aperture
  hole] remain !present.

- we pass in an array of 3 zones to free_area_init_nodes().

- we free the lowmem pages into the buddy allocator via the usual
  generic setup

- we have a special loop for highmem pages in arch/x86/mm/init_32.c,
  set_highmem_pages_init(). This just goes through the PFNs one by one
  and does an explicit __free_page() on all RAM pages that are in the
  mem_map[] and which are non-reserved.

and that's it roughly.

my current guess would have been some bootmem regression/interaction
that messes up the buddy bitmaps - but i just reverted to the v2.6.24
version of bootmem.c and that crashes too ...

Ingo
--
The simplest solution for now may be to go with your workaround increasing
SECTION_SIZE_BITS to 27. PAE mode already uses 30 and x86_64 also works
with 27. This is going to affect the memory hotplug granularity for !PAE
32 bit configs though. Kame-san, any concerns with that?
--
the bug's effects are so severe that this is the last thing i'd like to
do.

Ingo
--
more verbosely: we sometimes do "blind" reverts, if it's reasonably
established (or strongly suspected) that a revert makes a bug less
severe. We do this even if we dont fully understand the bug and its
effects and time runs out - on the assumption that we wont get worse
than the old code was.

but what i'd not really like to do are blind _non-revert_ changes. With
your suggested change we'd introduce a seemingly innocuous but still
wholly new (and untested) memory setup layout on the most popular Linux
kernel memory config in existence. (!PAE 32-bit is still being run on
more than 50% of the Linux desktops - around 80% runs 32-bit kernels.)

And as this bug demonstrates, seemingly small differences appear to
have large effects so we cannot know in what direction that would go -
we might turn a rare regression into a common regression. I'd rather
release with this bug being unfixed than with tweaking it just because
the effect seems less severe on a totally unrepresentative set of
systems.

Ingo
--
I know this is bit of hand-waving but have you noticed how all the
interesting sparsemem changes that one would expect to have caused the
breakage happened _before_ v2.6.24? So sorry for asking this again but
are we 110% sure the problem does not trigger with any of the
v2.6.24-rcN kernels?

Pekka
--
quite. Here are all the successful bootups from my (failed) bisection
attempt:

0773769191d943358a8392fa86abd756d004c4b6
21af0297c7e56024a5ccc4d8ad2a590f9ec371ba
26b8256e2bb930a8e4d4d10aa74950d8921376b8
2a10e7c41254941cac87be1eccdcb6379ce097f5
3aa88cdf6bcc9e510c0707581131b821a7d3b7cb
49914084e797530d9baaf51df9eda77babc98fa8
53a6e2342d73d509318836e320f70cd286acd69c
5be3bda8987b12a87863c89b74b136fdb1f072db
6d5f718a497375f853d90247f5f6963368e89803
7272dcd31d56580dee7693c21e369fd167e137fe
77de2c590ec72828156d85fa13a96db87301cc68
82cfbb008572b1a953091ef78f767aa3ca213092
b75f53dba8a4a61fda1ff7e0fb0fe3b0d80e0c64
c087567d3ffb2c7c61e091982e6ca45478394f1a
d4b37ff73540ab90bee57b882a10b21e2f97939f
fde1b3fa947c2512e3715962ebb1d3a6a9b9bb7d

or, via git-describe:

v2.6.24-3908-g0773769
v2.6.24-2392-g21af029
v2.6.24-3868-g26b8256
v2.6.24-4463-g2a10e7c
v2.6.24-4457-g3aa88cd
v2.6.24
v2.6.24-3522-g53a6e23
v2.6.24-3131-g5be3bda
v2.6.24-4461-g6d5f718
v2.6.24-3891-g7272dcd
v2.6.24-3902-g77de2c5
v2.6.24-3613-g82cfbb0
v2.6.24-4449-gb75f53d
v2.6.24-3911-gc087567
v2.6.24-3913-gd4b37ff
v2.6.24-4464-gfde1b3f

i.e. vanilla v2.6.24 and a whole bunch of commits after it were booting
just fine. (the problem might have been masked up to a certain point in
theory, but given how resilient it is to offset changes in my testing i
find that not very probable [but not impossible] )

Ingo
--
Ok, can you try this script
git bisect start
git bisect bad 7fa2ac3728ce828070fa3d5846c08157fe5ef431
git bisect good 0773769191d943358a8392fa86abd756d004c4b6
git bisect good 21af0297c7e56024a5ccc4d8ad2a590f9ec371ba
git bisect good 26b8256e2bb930a8e4d4d10aa74950d8921376b8
git bisect good 2a10e7c41254941cac87be1eccdcb6379ce097f5
git bisect good 3aa88cdf6bcc9e510c0707581131b821a7d3b7cb
git bisect good 49914084e797530d9baaf51df9eda77babc98fa8
git bisect good 53a6e2342d73d509318836e320f70cd286acd69c
git bisect good 5be3bda8987b12a87863c89b74b136fdb1f072db
git bisect good 6d5f718a497375f853d90247f5f6963368e89803
git bisect good 7272dcd31d56580dee7693c21e369fd167e137fe
git bisect good 77de2c590ec72828156d85fa13a96db87301cc68
git bisect good 82cfbb008572b1a953091ef78f767aa3ca213092
git bisect good b75f53dba8a4a61fda1ff7e0fb0fe3b0d80e0c64
git bisect good c087567d3ffb2c7c61e091982e6ca45478394f1a
git bisect good d4b37ff73540ab90bee57b882a10b21e2f97939f
git bisect good fde1b3fa947c2512e3715962ebb1d3a6a9b9bb7d

and then you'll apparently hit that commit you had compile problems
with. HOWEVER, at that point, just do

git bisect visualize

and pick a commit somewhere roughly half-way that you suspect is a good
point of testing, but not near the range that you had problems with. If
you have compile problems in the middle, pick something that is just one
third down, for example. It will make the bisection slower, but
considering how unable we've been to make much progress other ways, if we
can narrow it down from 1874 commits to something smaller, I suspect we'll
be much happier.

Then you just do

git checkout <sha-you-picked-out-here>

and compile that one, and check.

Besides, while the _optimal_ point is half-way, even if you only remove a
third or a quarter of the commits at each stage, it's still going to be
an exponential thing.

Linus
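
To put rough numbers on the "still exponential" remark, the following
user-space sketch counts test rounds under a deliberately pessimistic
model: each round keeps a fixed fraction of the remaining candidates,
rounded down. The function name and the model are assumptions made for
illustration; git bisect itself picks test points from the commit graph
and does not work this way internally.

```c
#include <assert.h>

/*
 * Count the test rounds needed to shrink `revisions` candidate commits to
 * one, when every round keeps keep_num/keep_den of the remaining candidates
 * (rounded down, but never below one).
 */
unsigned int bisect_rounds(unsigned long revisions,
                           unsigned long keep_num,
                           unsigned long keep_den)
{
    unsigned int rounds = 0;

    /* The kept fraction must be strictly below 1 or the loop never ends. */
    assert(keep_den > 0 && keep_num < keep_den);

    while (revisions > 1) {
        unsigned long next = revisions * keep_num / keep_den;

        if (next < 1)
            next = 1;
        revisions = next;
        rounds++;
    }
    return rounds;
}
```

Under this model, bisect_rounds(1874, 1, 2) is 10 rounds for ideal halving,
while bisect_rounds(1874, 2, 3), where only a third is removed each time,
is 17 rounds: slower, but still logarithmic in the number of commits.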
--
finally found it ... the patch below solves the sparsemem crash and the
testsystem boots up fine now:

mars:~> uname -a
Linux mars 2.6.25-rc9-sched-devel.git-x86-latest.git #985 SMP Wed Apr 16 01:37:37 CEST 2008 i686 i686 i386 GNU/Linux

yay! :-)

Ingo

ps. anyone who can correctly guess the method with which i found the
exact place that corrupted memory will get a free beer next time we
meet :-)

------------------------->
Subject: mm: sparsemem memory_present() memory corruption fix
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 16 01:40:00 CEST 2008

fix memory corruption and crash on 32-bit x86 systems.

if a !PAE x86 kernel is booted on a 32-bit system with more than
4GB of RAM, then we call memory_present() with a start/end that
goes outside the scope of MAX_PHYSMEM_BITS.

that causes this loop to happily walk over the limit of the
sparse memory section map:

    for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
        unsigned long section = pfn_to_section_nr(pfn);
        struct mem_section *ms;

        sparse_index_init(section, nid);
        set_section_nid(section, nid);

        ms = __nr_to_section(section);
        if (!ms->section_mem_map)
            ms->section_mem_map = sparse_encode_early_nid(nid) |
                                  SECTION_MARKED_PRESENT;

'ms' will be out of bounds and we'll corrupt a small amount of memory by
encoding the node ID. Depending on what that memory is, we might crash,
misbehave or just not notice the bug.

the fix is to sanity check anything the architecture passes to sparsemem.

this bug seems to be rather old (as old as sparsemem support itself),
but the exact incarnation depended on random details like configs,
which made this bug more prominent in v2.6.25-to-be.

an additional enhancement might be to print a warning about ignored
or trimmed memory ranges.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/sparse.c | 10 ++++++++++
1 file cha...
the method was to notice that the slub_debug_slabs SLUB variable got
corrupted from an expected value of 0 to a value of 0x1.

Then i added a simple brute-force function-tracer hook (in sched-devel)
that checked when slub_debug_slabs went from 0 to 1, and which then
printed a backtrace.

Since under CONFIG_FTRACE=y every kernel function calls this callback,
it triggered immediately after the value got corrupted:

[ 0.000000] console [earlyser0] enabled
[ 0.000000] BUG: slub_debug_slabs: 00000001
[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.25-rc9-sched-devel.git-x86-latest.git #982
[ 0.000000] [<c0177fba>] print_slub_debug_slabs+0x3a/0x40
[ 0.000000] [<c01050f7>] trace+0x8/0x11
[ 0.000000] [<c0cc929e>] ? mtrr_bp_init+0xe/0x320
[ 0.000000] [<c01050f7>] ? trace+0x8/0x11
[ 0.000000] [<c0cd7369>] ? memory_present+0x9/0x50
[ 0.000000] [<c0cc7a09>] ? find_max_pfn+0x99/0xb0
[ 0.000000] [<c0cc6af7>] setup_arch+0x217/0x470
[ 0.000000] [<c012c59b>] ? printk+0x1b/0x20
[ 0.000000] [<c0cc2b46>] start_kernel+0x96/0x3f0
[ 0.000000] [<c0cc22fd>] i386_start_kernel+0xd/0x10
[ 0.000000] =======================
[ 0.000000] x86: PAT support disabled.

and the backtrace had all the guilty parties on stack - memory_present()
[which was just called] and find_max_pfn()/setup_arch() - thanks to the
new fuzzy "?" backtrace entries we print out in v2.6.25.

(i could also have printed out the current ftrace buffer as well,
showing the history of all recent function calls that the kernel
executed.)

Ingo
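
The idea of that hook can be sketched in user space as below. This is only
an illustration under stated assumptions: watched_var stands in for
slub_debug_slabs, trace_check() is a made-up name, and the caller name is
passed in by hand, whereas the kernel hook described above was invoked
through the function tracer on every function entry and printed a real
backtrace.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the variable that was being corrupted. */
unsigned long watched_var = 0;

/* Set once the first bad value has been reported, so we report only once. */
static int corruption_seen;

/*
 * Check hook, meant to be called at every function entry. Returns 1 the
 * first time the watched variable differs from its expected value, and 0
 * otherwise. The first caller that sees the bad value is the one that ran
 * right after the corrupting write.
 */
int trace_check(const char *caller, unsigned long expected)
{
    if (corruption_seen || watched_var == expected)
        return 0;

    corruption_seen = 1;
    fprintf(stderr, "BUG: watched_var: %08lx (first seen in %s)\n",
            watched_var, caller);
    return 1;
}
```

The finer the granularity of the calls, the closer the first report lands
to the corrupting write, which is why calling the check from every kernel
function pinpointed memory_present() immediately.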
--
Very cool :) This fixed the silent lock-up that I was getting when using
your config as well.

At a bit of a loss yesterday to explain what was going wrong, I had started
putting together patches to sanity check memory initialisation at various
different stages trying to catch where things were going pear-shaped. You
found the bug before it was done but I finished the basics anyway and posted
it as "[RFC] Verification and debugging of memory initialisation". Something
like it may help avoid similar headaches for people who tend to run into

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
I'm sorry to be too late here..
On Wed, 16 Apr 2008 02:03:56 +0200
how about

max_arch_pfn = NR_MEM_SECTIONS * PAGES_PER_SECTION.
?

Thanks,
-Kame
--
the corruption might happen when encoding a non-zero node ID, or due to
the SECTION_MARKED_PRESENT which is 0x1:

mmzone.h:#define SECTION_MARKED_PRESENT (1UL<<0)

Ingo
--
Ok, you didn't make that addendum to your second version, so I added it
myself.

Anyway, good job. I've pushed this out, and will let this simmer at least
overnight to see if there are any brown-paper-bag issues (either with this
or with some last changes from Andrew), but I'm happy, and I think I'll do
the real 2.6.25 tomorrow.

Linus
--
Joe Perches pointed out that the ULL was superfluous (i typoed it, i
knew it's a pfn). Updated patch below.

Ingo

-------------------------->
Subject: mm: sparsemem memory_present() fix
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 16 01:40:00 CEST 2008

fix memory corruption and crash on 32-bit x86 systems.

if a !PAE x86 kernel is booted on a 32-bit system with more than
4GB of RAM, then we call memory_present() with a start/end that
goes outside the scope of MAX_PHYSMEM_BITS.

that causes this loop to happily walk over the limit of the
sparse memory section map:

    for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
        unsigned long section = pfn_to_section_nr(pfn);
        struct mem_section *ms;

        sparse_index_init(section, nid);
        set_section_nid(section, nid);

        ms = __nr_to_section(section);
        if (!ms->section_mem_map)
            ms->section_mem_map = sparse_encode_early_nid(nid) |
                                  SECTION_MARKED_PRESENT;

'ms' will be out of bounds and we'll corrupt a small amount of memory by
encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it.

the fix is to sanity check anything the architecture passes to sparsemem.

this bug seems to be rather old (as old as sparsemem support itself),
but the exact incarnation depended on random details like configs,
which made this bug more prominent in v2.6.25-to-be.

an additional enhancement might be to print a warning about ignored
or trimmed memory ranges.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/sparse.c | 10 ++++++++++
1 file changed, 10 insertions(+)

Index: linux/mm/sparse.c
===================================================================
--- linux.orig/mm/sparse.c
+++ linux/mm/sparse.c
@@ -149,8 +149,18 @@ static inline int sparse_early_nid(struc
/* Record a memory area against a node. */
void __init memory_present(in...
i believe this was the reason why my many bisection attempts were
unsuccessful: the bug pattern was not stable and seemingly working
kernels had the memory corruption too. It was pure luck that v2.6.24
"worked" and v2.6.25-rc9 broke visibly.

Ingo
--
So it's a general issue that has been there for years that we are now
noticing because we are now testing with memory sizes > 4GB. This also
affects the enterprise releases (SLES10, RHEL5). Argh!

I wonder why this did not show up earlier in testing? Running a kernel
that cannot access all of memory is unusual I guess.
--
i guess people saw the "you are not running a PAE kernel" warning and
went to a PAE kernel which didnt have this issue.

OTOH, quite a few testers consciously use non-PAE kernels on 4GB
systems, so i'd not be surprised if this solved a few mystery
regressions we have.

Ingo
--
Well okay this fixes it but is this the right fix? The arch should not
call memory_present() with an invalid pfn.
--
it is the right fix. The architecture memory setup code doesnt even
_know_ the limits at this place in an open-coded way (and shouldnt know
them) - and even later on we use pfn_valid() to determine whether to
attempt to get to a struct page and free it into the buddy.

[ Of course the architecture code in general 'knows' about the limits -
but still it's cleaner to have a dumb enumeration interface here
combined with a resilient core code - that's always going to be less
fragile. ]

btw., i just did some bug history analysis, the calls were originally
added when sparsemem support was added:

| commit 215c3409eed16c89b6d11ea1126bd9d4f36b9afd
| Author: Andy Whitcroft <apw@shadowen.org>
| Date: Fri Jan 6 00:12:06 2006 -0800
|
| [PATCH] i386 sparsemem for single node systems

in v2.6.15-1003-g215c340. (so this appears to be an unfixed bug in
v2.6.16 as well)

Ingo
--
yes in find_max_pfn...
YH
--
i re-checked the original SLAB config too and that boots fine as well
now - so i'm confident that the regression has been sufficiently cured.

it's getting quite late here (or rather, it's getting early :-/ ) so it
would be nice if others could double-check this calculation (with an eye
on all possible architectures):

+ unsigned long max_arch_pfn = 1ULL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);

and also check my analysis whether it is correct and whether it matches
the reported bug patterns. But otherwise the fix looks like a safe fix
for v2.6.25-final to me - it only filters out values from sparsemem
input that are nonsensical in the sparsemem framework anyway.

Ingo
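
For anyone wanting to check that calculation by hand, here is a user-space
sketch of the clamping idea with the !PAE 32-bit constants discussed in
this thread. It is not the patch itself (which is truncated above): the
SK_ constants, the toy section_present[] array and
memory_present_clamped() are names invented for this sketch, and 4 KB
pages are assumed.

```c
#include <assert.h>

enum {
    SK_PAGE_SHIFT        = 12,  /* assumed 4 KB pages */
    SK_MAX_PHYSMEM_BITS  = 32,  /* !PAE 32-bit x86: 4 GB physical space */
    SK_SECTION_SIZE_BITS = 26,  /* 64 MB sections */
    SK_NR_MEM_SECTIONS   = 1 << (SK_MAX_PHYSMEM_BITS - SK_SECTION_SIZE_BITS),
    SK_PAGES_PER_SECTION = 1 << (SK_SECTION_SIZE_BITS - SK_PAGE_SHIFT)
};

/* Toy section map: a non-zero entry means "present". */
static unsigned char section_present[SK_NR_MEM_SECTIONS];

/*
 * Mark the sections covering the PFN range [start, end) as present, but
 * first drop or trim any part of the range that lies above the largest
 * PFN the section map can describe. Returns the number of sections marked.
 */
unsigned int memory_present_clamped(unsigned long start, unsigned long end)
{
    const unsigned long max_arch_pfn =
        1UL << (SK_MAX_PHYSMEM_BITS - SK_PAGE_SHIFT);
    unsigned int visited = 0;
    unsigned long pfn;

    if (start >= max_arch_pfn)
        return 0;               /* range is entirely out of scope */
    if (end > max_arch_pfn)
        end = max_arch_pfn;     /* trim the part beyond the map */

    /* Round start down to a section boundary. */
    start &= ~((unsigned long)SK_PAGES_PER_SECTION - 1);

    for (pfn = start; pfn < end; pfn += SK_PAGES_PER_SECTION) {
        unsigned long section =
            pfn >> (SK_SECTION_SIZE_BITS - SK_PAGE_SHIFT);

        /* Holds for every iteration because of the clamp above. */
        assert(section < SK_NR_MEM_SECTIONS);
        section_present[section] = 1;
        visited++;
    }
    return visited;
}
```

Without the two checks at the top, a range such as [0, 0x120000) (4.5 GB
worth of pages) would index sections 64..71 of a 64-entry array. With
these constants, 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT) and
NR_MEM_SECTIONS * PAGES_PER_SECTION are the same value, 0x100000.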
--
can you check why find_max_pfn() e820_32.c need to call memory_present?
wonder if it can be removed.

YH
--
this is the only call to memory_present() we do in 32-bit arch setup, so
it's required.

(the function find_max_pfn() is woefully misnamed, but that's a cleanup
- i just fixed this in x86.git.)

Ingo
--
64 bit is calling that via paging_init
==>sparse_memory_present_with_active_regions(MAX_NUMNODES).

and

void __init sparse_memory_present_with_active_regions(int nid)
{
	int i;

	for_each_active_range_index_in_nid(i, nid)
		memory_present(early_node_map[i].nid,
				early_node_map[i].start_pfn,
				early_node_map[i].end_pfn);
}

that is somewhat later than 32 bit.

YH
--
yeah - 64-bit is different here and it's not affected by the problem
because there SECTION_SIZE_BITS is 27 (==128 MB chunks),
MAX_PHYSADDR_BITS is 40 (== 1 TB) - giving 8192 section map entries.
Once larger than 1 TB 64-bit x86 systems are created MAX_PHYSADDR_BITS
needs to be increased.

The only downside of the current setup on 64-bit is that it wastes 128K
of RAM on the majority of systems. We could perhaps try a shift of 28,
which halves the footprint to 64K of RAM, and which still is good enough
to allow the PCI aperture to remain a hole on most systems. It would
also compress the data-cache footprint of the sparse memory maps.
(without having to use sparsemem-extreme indirection)

Ingo
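
The footprint figures above can be reproduced with a short user-space
sketch. The per-entry size of 16 bytes is an assumption inferred from the
numbers in this mail (128K / 8192 entries), and section_map_bytes() is an
illustrative name, not a kernel function.

```c
#include <assert.h>

/*
 * Bytes used by a flat section map with one entry per section, for a
 * physical address space of 2^max_physaddr_bits bytes split into sections
 * of 2^section_size_bits bytes. entry_size is the size of one map entry.
 */
unsigned long section_map_bytes(unsigned int max_physaddr_bits,
                                unsigned int section_size_bits,
                                unsigned long entry_size)
{
    assert(max_physaddr_bits >= section_size_bits);
    return (1UL << (max_physaddr_bits - section_size_bits)) * entry_size;
}
```

With 40 address bits and 16-byte entries, a shift of 27 yields 8192
entries and 128K, and a shift of 28 yields 4096 entries and 64K.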
--
also 64 bit
early_node_map[10] active PFN ranges
0: 0 -> 149
0: 256 -> 917408
0: 1048576 -> 8519680
1: 8519680 -> 16908288
2: 16908288 -> 25296896
3: 25296896 -> 33685504
4: 33685504 -> 42074112
5: 42074112 -> 50462720
6: 50462720 -> 58851328
7: 58851328 -> 67239936

and 32 bit only has one entry

[ 0.000000] early_node_map[1] active PFN ranges
[ 0.000000] 0: 0 -> 1048576

YH
--
We could clip there if SPARSEMEM is configured. I wonder if this affects
other platforms that need HIGHMEM support?
--
clip where and what?
Ingo
--
i.e. as per my previous argument i'd consider the need to sanitize the
calls in the architecture fundamentally wrong.

whether the core code emits a warning or allows the call is an
additional question i mention in the changelog - but the core sparse
memory code should _definitely_ not silently overflow a key internal
array ... (of which data structure the architecture code is not even
aware of)

Ingo
--
or you can move that check into find_max_pfn for x86_32? so it will
not affect other platform regarding Christoph's concern?

YH
--
the patch doesn't have side effects on x86_64.
YH
--
On Tue, 15 Apr 2008 19:00:18 -0700
also no side effects on my ia64/NUMA box, which has sparse physical memory map.

Thanks,
-Kame
--
Yes that fixes it here too. And the corruption that I saw of slab
variables is explained by your analysis. Thanks!

Tested-by: Christoph Lameter <clameter@sgi.com>
--
ok, will try that now.

The 'bad' points i posted are definitely well-established as i
post-validated them by looking for the slab.c crash pattern in the
serial logs and looking at the git commit in the bootup signature. (this
is more reliable than looking at bisection logs - i tried 4 different
bisection runs and not all were reliable.)

Ingo
--
btw., as i progress with that bisection effort, i triggered new crash
patterns, which happen later during bootup:

[ 0.775886] initcall 0xc0b00559 ran for 0 msecs: ksysfs_init+0x0/0x96()
[ 0.777885] Calling initcall 0xc0b01eb8: filelock_init+0x0/0x27()
[ 0.780137] BUG: unable to handle kernel NULL pointer dereference at 00000001
[ 0.782883] IP: [<c0293981>] strlen+0xb/0x15
[ 0.784884] *pde = 00000000
[ 0.786889] Oops: 0000 [#1] SMP
[ 0.787880]
[ 0.787880] Pid: 1, comm: swapper Not tainted (2.6.24-05281-g6232665 #3)
[ 0.787880] EIP: 0060:[<c0293981>] EFLAGS: 00010286 CPU: 0
[ 0.787880] EIP is at strlen+0xb/0x15
[ 0.787880] EAX: 00000000 EBX: 00040000 ECX: ffffffff EDX: 00040000
[ 0.787880] ESI: c0915320 EDI: 00000001 EBP: f7c23f08 ESP: f7c23f04
[ 0.787880] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 0.787880] Process swapper (pid: 1, ti=f7c22000 task=f7c20000 task.ti=f7c22000)
[ 0.787880] Stack: f7c1f540 f7c23f18 c015b138 00000094 c016b45a f7c23f48 c015c92c c016b45a
[ 0.787880] 00000025 c0915320 f7c1f540 000000d0 f7c23f58 00000282 f7c1f540 00000000
[ 0.787880] 00000000 f7c23f84 c015cbc8 00000060 00000000 00040000 c016b45a c0133e81
[ 0.787880] Call Trace:
[ 0.787880] [<c015b138>] ? kmem_cache_flags+0x3d/0x5b
[ 0.787880] [<c016b45a>] ? init_once+0x0/0xc
[ 0.787880] [<c015c92c>] ? kmem_cache_open+0x64/0x128
[ 0.787880] [<c016b45a>] ? init_once+0x0/0xc
[ 0.787880] [<c015cbc8>] ? kmem_cache_create+0x14e/0x1d6
[ 0.787880] [<c016b45a>] ? init_once+0x0/0xc
[ 0.787880] [<c0133e81>] ? ktime_get_ts+0x3b/0x3f
[ 0.787880] [<c0133e96>] ? ktime_get+0x11/0x2f
[ 0.787880] [<c0b01ed6>] ? filelock_init+0x1e/0x27
[ 0.787880] [<c016b45a>] ? init_once+0x0/0xc
[ 0.787880] [<c0af176d>] ? kernel_init+0x13f/0x298
[ 0.787880] [<c0103b46>] ? ret_from_fork+0x6/0x20
[ 0.787880] [<c0af162e>]...
It may help to run with slub_debug (or CONFIG_SLUB_DEBUG_ON) then to
detect the corruption as early as possible. Only works if you get by
kmem_cache_init() though. Should give us some informative dumps of what is
exactly corrupted if it hits a slab object.
--
grumble. Do you read my bugreports?
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLUB_DEBUG_ON=y

Ingo
--
Cannot memorize everything sorry. This looked like slab object corruption
and there were no slub diagnostics in the log.

Trying to duplicate the issue here.

Boot with PAE support on a machine with 8G RAM 8 processors here works
fine.

Also booting without highmem support (SMP) works fine.

On how many machines does the problem occur?

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc9
# Tue Apr 15 21:59:36 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""...
same .config with -rc9 on one system with 128g 4 sockets with quad core crashed here too.
YH
--
have you tried my config on that box? Check:
http://redhat.com/~mingo/misc/config-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9
Ingo
--
Yup that config fails here too...
--
Hmmm... If one enables CONFIG_X86_PAE (even with no highmem) then
everything is fine. For PAE to be enabled some other things also fall by
the wayside. Diff to your failing config follows. Will try to minimize
the diff even further:

--- config-Thu_Apr_10_10_41_16_CEST_2008.bad.rc9 2008-04-15 06:02:13.000000000 +0000
+++ .config 2008-04-15 23:15:53.000000000 +0000
@@ -1,7 +1,7 @@
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc9
-# Tue Apr 15 07:37:33 2008
+# Tue Apr 15 23:15:53 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
@@ -169,11 +169,11 @@
# CONFIG_X86_VSMP is not set
# CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER is not set
CONFIG_PARAVIRT_GUEST=y
+# CONFIG_XEN is not set
CONFIG_VMI=y
-# CONFIG_LGUEST_GUEST is not set
CONFIG_PARAVIRT=y
# CONFIG_M386 is not set
-CONFIG_M486=y
+# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
@@ -196,20 +196,23 @@
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
-# CONFIG_MCORE2 is not set
+CONFIG_MCORE2=y
# CONFIG_GENERIC_CPU is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
-CONFIG_X86_L1_CACHE_SHIFT=4
+CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
-# CONFIG_X86_PPRO_FENCE is not set
-CONFIG_X86_F00F_BUG=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
-CONFIG_X86_ALIGNMENT_16=y
-CONFIG_X86_MINIMUM_CPU_FAMILY=4
+CONFIG_X86_GOOD_APIC=y
+CONFIG_X86_INTEL_USERCOPY=y
+CONFIG_X86_USE_PPRO_CHECKSUM=y
+CONFIG_X86_P6_NOP=y
+CONFIG_X86_TSC=y
+CONFIG_X86_MINIMUM_CPU_FAMILY=6
+CONFIG_X86_DEBUGCTLMSR=y
# CONFIG_HPET_TIMER is not set
# CONFIG_IOMMU_HELPER is not set
CONFIG_NR_CPUS=8
@@ -229,11 +232,11 @@
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
-# CONFIG_NOHIGHMEM is not set
-CONFIG_HIGHMEM4G=y
+CONFIG_NOHIGHMEM=y
+# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
CONFIG_...
... in the thread i've already explained that it's because on PAE we use
1GB sparse chunks (shift 30) which masks the bug.

(on PAE we cannot go below a shift of 29 due to a shortage of page->flags bits)
Ingo
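
A rough sketch of why the section shift matters. It counts how many sparsemem sections are needed to cover the top of the e820 map quoted in this thread (0x110000000) and how many entries a section table sized for the addressable range would hold. The shift of 30 for PAE comes from the message above; the non-PAE shift of 26 and the physical address widths of 32 and 36 bits are assumptions used for illustration, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Top of the last usable e820 range reported in the boot logs of this thread. */
#define E820_TOP 0x110000000ULL

/*
 * Number of sections needed to cover [0, top_addr), rounding up.
 * section_shift is log2 of the section size in bytes.
 */
uint64_t sections_needed(uint64_t top_addr, unsigned section_shift)
{
    uint64_t section_size = 1ULL << section_shift;

    return (top_addr + section_size - 1) >> section_shift;
}

/*
 * Number of entries in a section table that is sized for phys_bits of
 * physical address space (ASSUMPTION: 32 for non-PAE, 36 for PAE).
 */
uint64_t sections_in_table(unsigned phys_bits, unsigned section_shift)
{
    return 1ULL << (phys_bits - section_shift);
}
```

Under these assumptions a shift of 30 needs 5 sections out of a 64-entry table, while a shift of 26 needs 68 sections against a table of only 64, which would be consistent with the bug being masked on PAE.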
--
Added some printks to the initialization of slub and I see 0x1 double
words written over global variables that should be zero. The cpu mask to
track processors that are initialized is screwed up
(kmem_cach_cpu_free_init_once).

[ 0.000999] 0xc0cb320c: 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 ................
[ 0.000999] 0xc0cb321c: 00 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 ................
[ 0.000999] 0xc0cb322c: 00 00 00 00 01 00 00 00 ........

c0cb3010 B mem_section
c0cb3210 b lock.25923
c0cb3214 b shmem_inode_cachep
c0cb3218 b shm_mnt
c0cb321c b slab_state
c0cb3220 b kmem_cach_cpu_free_init_once
c0cb3224 b slub_debug
c0cb3228 b slub_debug_slabs

mem_section is 512 bytes long. Array overrun?
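
The overrun suspicion can be checked with a little address arithmetic on the symbol list above. This is only a sketch: the two addresses are taken from the listing, but the 8-byte entry size is an assumption about sizeof(struct mem_section) on this 32-bit build, and the helper names are made up for illustration.

```c
#include <stdint.h>

#define MEM_SECTION_START 0xc0cb3010u /* "B mem_section" in the symbol list */
#define NEXT_SYMBOL       0xc0cb3210u /* "b lock.25923", the symbol right after it */
#define ENTRY_SIZE        8u          /* ASSUMPTION: size of one mem_section entry */

/* Size of the array in bytes, inferred from the gap to the next symbol. */
uint32_t array_bytes(void)
{
    return NEXT_SYMBOL - MEM_SECTION_START;
}

/* Number of entries that fit in the array under the assumed entry size. */
uint32_t array_entries(void)
{
    return array_bytes() / ENTRY_SIZE;
}

/* Address of entry idx; any idx >= array_entries() lies past the array. */
uint32_t entry_addr(uint32_t idx)
{
    return MEM_SECTION_START + idx * ENTRY_SIZE;
}
```

With these numbers the array holds 64 entries, and entries 64, 65 and 66 would start at c0cb3210, c0cb3218 and c0cb3220. Those are the addresses where the dump shows 0x1 words, spaced 8 bytes apart, which fits a write of 1 into the first word of each entry past the end.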
--
Ahh. Right. That is the same situation as HIGHMEM_64G. I was able to
enable PAE without HIGHMEM_64G. Thought that would keep things as is.
--
great. Note that it's randconfig generated - so watch out for weird
config combinations.

randconfig, besides finding build-bugs, is also good at finding various
runtime bugs: it is great at finding weird alignment and
boundary-condition bugs in generic code, and it's also great at finding
races (by virtue of introducing random delays between various functions,
via random enabling/disabling of debug facilities and other options that
impact the generated code's layout and timing).

Ingo
--
btw., highmem shouldn't matter because it does not influence how we
allocate our key data structures.

i confirmed that by turning set_highmem_pages_init() into a NOP - the
kernel still crashed with just lowmem memory being around.

Ingo
--
and booting with NOHIGHMEM gives a crash too - updated config attached.
(in case anyone wonders about the CONFIG_M486=y - it crashes with
CONFIG_M686=y too)

Ingo
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc9
# Tue Apr 15 22:14:51 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
# CONFIG_SWAP is not set
# CONFIG_SYSVIPC is not set
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# C...
changing the .config to UP makes it boot up fine. Config and bootlog
attached.

Ingo
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.25-rc9
# Tue Apr 15 22:20:58 2008
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
# CONFIG_SWAP is not set
# CONFIG_SYSVIPC is not set
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=20
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_...
Vexing. The failure in the slabs suggests that no lowmem pages were freed
during the walk of the bootmem bitmaps. Could you call show_mem before
kmem_cache_init() runs?
--
sure - find the crashlog below.
but it seems there's plenty of free RAM in the buddy:
[ 0.000999] DMA: 3*4kB 2*8kB 4*16kB 2*32kB 3*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3804kB
[ 0.000999] Normal: 54*4kB 54*8kB 54*16kB 54*32kB 54*64kB 60*128kB 60*256kB 0*512kB 1*1024kB 0*2048kB 197*4096kB = 837672kB

and the bug pattern seems to be memory corruption - not memory
exhaustion.

i.e. we allocated RAM but it got corrupted after allocation.
Ingo
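
For reference, the totals in the two free-list lines can be reproduced from the per-order block counts. This is a small sketch, not kernel code: the counts are copied from the DMA and Normal lines of the log, the 4 kB base page size is read off the "4kB" label, and the function and array names are illustrative only.

```c
#include <stddef.h>

#define ORDERS_SHOWN 11 /* orders 0..10, i.e. 4kB up to 4096kB blocks */

/* Free block counts per order, copied from the DMA and Normal lines. */
const unsigned long dma_counts[ORDERS_SHOWN] =
    { 3, 2, 4, 2, 3, 1, 1, 0, 1, 1, 0 };
const unsigned long normal_counts[ORDERS_SHOWN] =
    { 54, 54, 54, 54, 54, 60, 60, 0, 1, 0, 197 };

/* Total free memory in kB: each block of a given order is (4 << order) kB. */
unsigned long buddy_free_kb(const unsigned long counts[ORDERS_SHOWN])
{
    unsigned long total = 0;

    for (size_t order = 0; order < ORDERS_SHOWN; order++)
        total += counts[order] * (4UL << order);
    return total;
}
```

Both sums match the log (3804 kB and 837672 kB), so the buddy lists do hold free lowmem at this point in the boot.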
Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -609,6 +609,7 @@ asmlinkage void __init start_kernel(void
mem_init();
enable_debug_pagealloc();
cpu_hotplug_init();
+ show_mem();
kmem_cache_init();
setup_per_cpu_pageset();
numa_policy_init();

[ 0.000000] Linux version 2.6.25-rc9 (mingo@dione) (gcc version 4.2.2) #968 SMP Tue Apr 15 22:39:35 CEST 2008
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[ 0.000000] BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000efff8000 (usable)
[ 0.000000] BIOS-e820: 00000000efff8000 - 00000000f0000000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000110000000 (usable)
[ 0.000000] console [earlyser0] enabled
[ 0.000000] Warning only 896MB will be used.
[ 0.000000] Use a HIGHMEM64G enabled kernel.
[ 0.000000] 896MB LOWMEM available.
[ 0.000000] Scan SMP from c0000000 for 1024 bytes.
[ 0.000000] Scan SMP from c009fc00 for 1024 ...
SLUB does not do a memory allocation where it fails here but simply
In some situations we are screwing up the per cpu data handling on
32 bit x86? Adding Mike. This looks like the per cpu area overlaps with
something else?
--
yep, that was my other theory - and i doubled CONFIG_NR_CPUS to reduce
that chance.

in hindsight ... that won't save us from any overlap, right?
what's the best way to artificially increase the size of the allocated
per cpu area? (say double it)

Ingo
--
I don't know that there is a boot option. If modules are defined it
adds an extra 8k. The size is defined in include/linux/percpu.h
(PERCPU_ENOUGH_ROOM).

Otherwise define a really large per_cpu variable...?
-Mike
--
Add a big per cpu declaration?
static DEFINE_PER_CPU(char, dummy)[10000];
--
what's the guarantee that it's at the end of the section? I'd like to
pad the per cpu areas at their end. (doubling their size is a good way
to achieve that)

Ingo
--
No guarantee. It's up to the linker. Sorry. We could add a new percpu.last
section but that requires a number of changes to linking.
--
ah. Then the patch below should do the trick, right?
Ingo
------------->
Subject: larger: percpu
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Apr 15 23:13:18 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/vmlinux_32.lds.S | 1 +
1 file changed, 1 insertion(+)

Index: linux/arch/x86/kernel/vmlinux_32.lds.S
===================================================================
--- linux.orig/arch/x86/kernel/vmlinux_32.lds.S
+++ linux/arch/x86/kernel/vmlinux_32.lds.S
@@ -186,6 +186,7 @@ SECTIONS
__per_cpu_start = .;
*(.data.percpu)
*(.data.percpu.shared_aligned)
+ . = . + 65536;
__per_cpu_end = .;
}
. = ALIGN(PAGE_SIZE);
--
this seems to have the intended effect of +0x10000 padding at the end of
the percpu area:

c0b927d0 D per_cpu__cpu_info
c0b92880 d per_cpu__runqueues
c0ba2d00 D __per_cpu_end
c0ba3000 B __bss_start

it still crashes though, with a very similar crash pattern to the
previous ones.

Ingo
--
Or you could try this:
--- linux-2.6.x86.sched-last-0415.orig/include/linux/percpu.h
+++ linux-2.6.x86.sched-last-0415/include/linux/percpu.h
@@ -38,10 +38,7 @@

/* Enough to cover all DEFINE_PER_CPUs in kernel, including modules. */
#ifndef PERCPU_ENOUGH_ROOM
-#ifdef CONFIG_MODULES
-#define PERCPU_MODULE_RESERVE 8192
-#else
-#define PERCPU_MODULE_RESERVE 0
+#define PERCPU_MODULE_RESERVE 65536
#endif

#define PERCPU_ENOUGH_ROOM \
--
Hopefully. The linker sometimes reacts in funky ways and we do some
strange magic with the offsets through gcc memory models (at least on
x86_64 not sure what 32 bit does).
--
I'll certainly take a look...
-Mike
--
still crashes with the patch below - find the crash-log further below.
(the kernel has a few more non-destructive debug printouts and debug
checks included as well, which you can see in the log, but it's a
vanilla kernel otherwise.)

Ingo
----------------------->
Subject: nodes: shift fix
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Apr 15 21:15:21 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -424,7 +424,7 @@ static inline void set_compound_order(st
* We are going to use the flags for the page to node mapping if its in
* there. This includes the case where there is no node, so it is implicit.
*/
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#if NODES_WIDTH <= 0 || NODES_SHIFT == 0
#define NODE_NOT_IN_PAGE_FLAGS
#endif

[ 0.000000] Linux version 2.6.25-rc9 (mingo@dione) (gcc version 4.2.2) #960 SMP Tue Apr 15 21:16:23 CEST 2008
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[ 0.000000] BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000efff8000 (usable)
[ 0.000000] BIOS-e820: 00000000efff8000 - 00000000f0000000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000110000000 (usable)
[ 0.000000] console [earlyser0] enabled
[ 0.000000] Warning only 4GB will be used.
[ 0.000000] Use a HIGHMEM64G enabled kernel.
[ 0.000000] 3200...
Peter "radar eye" Zijlstra noticed an ugly and annoying typo in mm.h:

-#ifdef NODE_NOT_IN_PAGEFLAGS
+#ifdef NODE_NOT_IN_PAGE_FLAGS

but even with the full fix (see below) the same crash remains.
i think getting NODE_NOT_IN_PAGEFLAGS wrong seems to result in
non-optimal but still correct code - by virtue of NODES_MASK ending up
zero.

Ingo
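
A standalone illustration of why a typo like this goes unnoticed: #ifdef on a name that is never defined is not an error, the preprocessor just takes the other branch. The two macro names are the ones from the mm.h hunk; the *_BRANCH_TAKEN macros and the helper functions are hypothetical and exist only for this demonstration.

```c
/* The correctly spelled macro, defined the way mm.h defines it. */
#define NODE_NOT_IN_PAGE_FLAGS

/* Misspelled name: never defined anywhere, so this test is silently false. */
#ifdef NODE_NOT_IN_PAGEFLAGS
#define TYPO_BRANCH_TAKEN 1
#else
#define TYPO_BRANCH_TAKEN 0
#endif

/* Correct name: the test is true and the intended branch is compiled. */
#ifdef NODE_NOT_IN_PAGE_FLAGS
#define CORRECT_BRANCH_TAKEN 1
#else
#define CORRECT_BRANCH_TAKEN 0
#endif

/* Report which branch the preprocessor selected for the misspelled test. */
int typo_branch_taken(void)
{
    return TYPO_BRANCH_TAKEN;
}

/* Report which branch the preprocessor selected for the correct test. */
int correct_branch_taken(void)
{
    return CORRECT_BRANCH_TAKEN;
}
```

gcc's -Wundef would not help here either, since it only covers undefined identifiers evaluated in #if expressions, not #ifdef tests.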
----------------------->
Subject: nodes: shift fix
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Apr 15 21:15:21 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/mm.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -424,7 +424,7 @@ static inline void set_compound_order(st
* We are going to use the flags for the page to node mapping if its in
* there. This includes the case where there is no node, so it is implicit.
*/
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#if NODES_WIDTH <= 0 || NODES_SHIFT == 0
#define NODE_NOT_IN_PAGE_FLAGS
#endif

@@ -442,7 +442,7 @@ static inline void set_compound_order(st
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))

/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allcator */
-#ifdef NODE_NOT_IN_PAGEFLAGS
+#ifdef NODE_NOT_IN_PAGE_FLAGS
#define ZONEID_SHIFT (SECTIONS_SHIFT + ZONES_SHIFT)
#define ZONEID_PGOFF ((SECTIONS_PGOFF < ZONES_PGOFF)? \
SECTIONS_PGOFF : ZONES_PGOFF)
--
I think it's still pointing to the page allocator and/or setting up
...especially considering you have similar crash with SLUB as well.
Now this:
[ 0.000999] ------------[ cut here ]------------
[ 0.000999] WARNING: at mm/slab.c:1685 cache_alloc_refill+0x2a6/0x4a3()
[ 0.000999] Pid: 0, comm: swapper Not tainted 2.6.25-rc9 #924
[ 0.000999] [<c0121b6f>] warn_on_slowpath+0x3c/0x4c
[ 0.000999] [<c0781873>] ? _spin_unlock_irqrestore+0xf/0x13
[ 0.000999] [<c02941ad>] ? delay_tsc+0x2e/0x4e
[ 0.000999] [<c029414d>] ? __delay+0x9/0xb
[ 0.000999] [<c0353db3>] ? serial8250_console_putchar+0x80/0x86
[ 0.000999] [<c0148822>] ? get_page_from_freelist+0x230/0x345
[ 0.000999] [<c0121eb1>] ? __call_console_drivers+0x56/0x63
[ 0.000999] [<c01489bb>] ? __alloc_pages+0x6e/0x2be
[ 0.000999] [<c015bd2e>] cache_alloc_refill+0x2a6/0x4a3
[ 0.000999] [<c015ba3f>] kmem_cache_alloc+0x5b/0xa4

Says that alloc_pages_node() returned NULL early on in the boot.
However, GFP_THISNODE is ruled out as this:

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1485,6 +1485,7 @@ restart:
* Happens if we have an empty zonelist as a result of
* GFP_THISNODE being used on a memoryless node
*/
+ WARN_ON(1);
return NULL;
}

does not trigger. Hmm...
--
i did a .config bisection and it pinpointed CONFIG_SPARSEMEM=y as the
culprit. Changing it to FLATMEM gives a correctly booting system.

if you look at the good versus bad bootup log:

http://redhat.com/~mingo/misc/log-Tue_Apr_15_07_24_59_CEST_2008.good
http://redhat.com/~mingo/misc/log-Tue_Apr_15_07_24_59_CEST_2008.bad

(both SLUB) you'll see that the zone layout provided by the architecture
code is _exactly_ the same and looks sane as well. So this is not an
architecture zone layout bug, this is probably sparsemem setup (and/or
the page allocator) getting confused by something.

why are there no good debug logs possible in this area? To debug such
bugs we'd need an early dump of the precise layout of all memory maps,
what points where, how large it is, where it is allocated - and then
compare it with how the rest of the system is laid out - looking at
possible overlaps or other bugs. This 8-way box is a pain to debug on,
it takes a long time to boot it up, etc. etc.

Ingo
--
i've done a revert of the page allocator to v2.6.24 status (with fixes
on top to make it work on .25 infrastructure), via the patch below - but
this didn't change the problem.

i also doubled the sparse mem_map[] allocations on the theory that they
might overflow - but that didn't solve the crash either.

Ingo
------------------------>
Subject: revert: page alloc
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Apr 15 10:44:34 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/gfp.h | 2
include/linux/mmzone.h | 2
mm/page_alloc.c | 169 ++++++++++++++++++++++---------------------------
mm/vmstat.c | 61 ++++++++---------
4 files changed, 110 insertions(+), 124 deletions(-)

Index: linux/include/linux/gfp.h
===================================================================
--- linux.orig/include/linux/gfp.h
+++ linux/include/linux/gfp.h
@@ -227,7 +227,5 @@ extern void free_cold_page(struct page *

void page_alloc_init(void);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
-void drain_all_pages(void);
-void drain_local_pages(void *dummy);

#endif /* __LINUX_GFP_H */
Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -113,7 +113,7 @@ struct per_cpu_pages {
};

struct per_cpu_pageset {
- struct per_cpu_pages pcp;
+ struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */
#ifdef CONFIG_NUMA
s8 expire;
#endif
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -19,7 +19,6 @@
#include <linux/swap.h>
#include <linux/interrupt.h>
#include <linux/pagemap.h>
-#include <linux/jiffies.h>
#include <linux/bootmem.h>
#include <linux/compiler.h>
#include <linux/kernel.h>
@@ -44,7 +43,6 @@
#include...
so the same config on 64 bit with SLUB works and only 32bit is broken? or is 2.6.24 with 32bit + sparse + slub already broken?
YH
--
this is a 32-bit-only box.
Ingo
--
yeah, sorry - we are working hard to unify generic bits like that, but
it's a huge architecture.

btw., i always felt that the zone/memory setup is rather fragile and
ad-hoc in places and it trusts the architecture code too much. Just in
the .25 cycle i've seen about a dozen bugs all around that thing. I
believe we should work on making the info that an architecture feeds to
the MM "fool proof" - i.e. sanity-check for overlaps and other common
setup errors. It is easy for an architecture to mess up those things...
Especially on oddball systems that are too large or too small to be
normally tested. It's a common, recurring bug pattern that we could
avoid by being a bit more resilient.

if this is a zone setup bug then a sanity-check could catch it right
where it happens - not much later in the slab code or so.

Ingo
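
To make the "sanity-check for overlaps" idea concrete, here is a minimal user-space sketch. It is not an existing kernel interface: struct mem_range, first_bad_range() and both tables are hypothetical names. The first table holds the usable ranges from the e820 map quoted in this thread; the second is a deliberately broken example.

```c
#include <stddef.h>
#include <stdint.h>

/* A half-open physical address range [start, end). Hypothetical type. */
struct mem_range {
    uint64_t start;
    uint64_t end;
};

/* Usable ranges as reported by the BIOS-e820 lines in this thread. */
const struct mem_range e820_usable[3] = {
    { 0x0000000000000000ULL, 0x000000000009f800ULL },
    { 0x0000000000100000ULL, 0x00000000efff8000ULL },
    { 0x0000000100000000ULL, 0x0000000110000000ULL },
};

/* Deliberately broken input: the second range starts inside the first. */
const struct mem_range overlapping_example[2] = {
    { 0x0000000000000000ULL, 0x000000000009f800ULL },
    { 0x000000000009f000ULL, 0x00000000000a0000ULL },
};

/*
 * Return the index of the first range that is empty, inverted, or
 * overlaps an earlier range, or -1 if the whole table looks sane.
 */
int first_bad_range(const struct mem_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (r[i].start >= r[i].end)
            return (int)i;
        for (size_t j = 0; j < i; j++) {
            if (r[i].start < r[j].end && r[j].start < r[i].end)
                return (int)i;
        }
    }
    return -1;
}
```

A check of this kind run when the architecture registers its ranges would report the offending entry at setup time rather than as a later slab crash. The quadratic loop is a simplification that is fine for a handful of ranges.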
--
I hadn't realised that such setup errors were common. It should be already able
to handle some overlapping problems in add_active_range().

I'm playing catch-up here but looking at your dmesg output, I see the
following snippets.

[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[ 0.000000] BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000efff8000 (usable)
[ 0.000000] BIOS-e820: 00000000efff8000 - 00000000f0000000 (ACPI data)

There are two portions of usable memory with a few holes there.
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000110000000 (usable)

And there is memory over the 4GB boundary but....
[ 0.000000] Warning only 4GB will be used.
[ 0.000000] Use a HIGHMEM64G enabled kernel.
[ 0.000000] Entering add_active_range(0, 0, 1048576) 0 entries of 256 used

It's recognised and only memory below 4GB is registered and it's all on
node 0. However, I do note that it also registers all the holes as valid
memory. The memory should never get freed because it should be reserved
during boot by reserve_bootmem() but it still raises an eyebrow.

[ 0.000000] early_node_map[1] active PFN ranges
[ 0.000000] 0: 0 -> 1048576
[ 0.000000] On node 0 totalpages: 1048576
[ 0.000000] DMA zone: 32 pages used for memmap
[ 0.000000] DMA zone: 0 pages reserved
[ 0.000000] DMA zone: 4064 pages, LIFO batch:0
[ 0.000000] Normal zone: 1760 pages used for memmap
[ 0.000000] Normal zone: 223520 pages, LIFO batch:31
[ 0.000000] HighMem zone: 6400 pages used for memmap
[ 0.000000] HighMem zone: 812800 pages, LIFO batch:31
[ 0.000000] Movable...
32-bit does memory_present() calls to register all RAM - and those calls
are correct (they do not include holes) and the resulting sparse memory
section layout looks correct, and all the mem_map[] chunk allocations
succeed as well.

furthermore, when freeing memory from bootmem allocator into the buddy
allocator we consult the e820 map again via a page_is_ram() call, so we
make sure holes do not end up in the memory map and in the free page

yep. I did a few extra printouts to make sure, but came to the same
conclusion. The system boots fine with the same config on v2.6.24.

Ingo
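
The hole filtering described above can be sketched as a lookup of a page frame number against the usable e820 ranges. This is a simplified user-space illustration and not the kernel's page_is_ram(): the range table is copied from the e820 map in this thread, 4 kB pages are assumed, and struct ram_range and pfn_is_usable_ram() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* ASSUMPTION: 4 kB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* A half-open physical address range [start, end). Hypothetical type. */
struct ram_range {
    uint64_t start;
    uint64_t end;
};

/* Usable ranges as reported by the BIOS-e820 lines in this thread. */
const struct ram_range usable_ram[3] = {
    { 0x0000000000000000ULL, 0x000000000009f800ULL },
    { 0x0000000000100000ULL, 0x00000000efff8000ULL },
    { 0x0000000100000000ULL, 0x0000000110000000ULL },
};

/* Return 1 if the whole page frame pfn lies inside a usable range, else 0. */
int pfn_is_usable_ram(uint64_t pfn)
{
    for (size_t i = 0; i < sizeof(usable_ram) / sizeof(usable_ram[0]); i++) {
        /* Round the start up and the end down to whole pages. */
        uint64_t first = (usable_ram[i].start + SKETCH_PAGE_SIZE - 1)
                         >> SKETCH_PAGE_SHIFT;
        uint64_t limit = usable_ram[i].end >> SKETCH_PAGE_SHIFT;

        if (pfn >= first && pfn < limit)
            return 1;
    }
    return 0;
}
```

With this table, frame 0x9f (only half covered by the first range), frame 0xa0 (in the legacy hole) and frame 0xefff8 (ACPI data) are all rejected, while 0x9e, 0x100 and 0x100000 are accepted.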
--
Could you post the zone setup of the system that fails? A memory map would
be useful and full dmesg output up to the failure.
--
it's all in my bugreport you are replying to:
http://lkml.org/lkml/2008/4/11/34
it's a full dmesg up to the failure, which starts with the memory map of
the system ...

Ingo
--
Pekka pointed out the orig post that had the boot info.
One commonality in the two failures inhouse and yours is that
both had more than 4G memory and had CONFIG_HIGHMEM4G set. Possibly
a wraparound of some limit?
--
Ingo,
does your old 8 socket system work with 64bit kernel?
YH
--
BTW. I think I'm seeing some problems perhaps related to change page
attr stuff for DEBUG_PAGEALLOC on x86-64. And I don't know if it is the
same thing, but some general instability around either the page allocator
or slab allocator.

The debug pagealloc problems seem to be that a thread suddenly gets stuck
in the kernel spinning in cpa (usually on one of the locks) and never
seems to recover. Once it seemed to be spinning in clear_page_... too,
but perhaps could it be messing up the page attributes and running so
slowly that it just appears to be hanging? I'll try to get more info here
but it is hard to reproduce.

The general instability -- I've just seen an oops or two in the page
allocation path in slub recently. Nothing reportable because I've been
running my own patches and/or been unable to reproduce... but it is a bit
unusual and I'll keep an eye out.

Anyway, I'd suggest cooking this kernel a bit longer before release...
--
