"Finally found it ... the patch below solves the sparsemem crash and the test system boots up fine now," announced Ingo Molnar [1]. He described the patch as fixing a "memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS." He included a source snippet with the loop that could corrupt memory, "depending on what that memory is, we might crash, misbehave or just not notice the bug." Ingo went on to note that the bug was first introduced with sparsemem support in the 2.6.16 kernel:
"I believe this was the reason why my many bisection attempts were unsuccessful: the bug pattern was not stable and seemingly working kernels had the memory corruption too. It was pure luck that v2.6.24 'worked' and v2.6.25-rc9 broke visibly."
Linux creator Linus Torvalds replied, "good job. I've pushed this out, and will let this simmer at least overnight to see if there are any brown-paper-bag issues (either with this or with some last changes from Andrew), but I'm happy, and I think I'll do the real 2.6.25 tomorrow."
From: Ingo Molnar <mingo@...>
Subject: [patch] mm: sparsemem memory_present() memory corruption fix
[1]Date: Apr 15, 8:03 pm 2008
finally found it ... the patch below solves the sparsemem crash and the
testsystem boots up fine now:
mars:~> uname -a
Linux mars 2.6.25-rc9-sched-devel.git-x86-latest.git #985 SMP Wed Apr 16
01:37:37 CEST 2008 i686 i686 i386 GNU/Linux
yay! :-)
Ingo
ps. anyone who can correctly guess the method with which i found the
exact place that corrupted memory will get a free beer next time we
meet :-)
------------------------->
Subject: mm: sparsemem memory_present() memory corruption fix
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 16 01:40:00 CEST 2008
fix memory corruption and crash on 32-bit x86 systems.
if a !PAE x86 kernel is booted on a 32-bit system with more than
4GB of RAM, then we call memory_present() with a start/end that
goes outside the scope of MAX_PHYSMEM_BITS.
that causes this loop to happily walk over the limit of the
sparse memory section map:
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
unsigned long section = pfn_to_section_nr(pfn);
struct mem_section *ms;
sparse_index_init(section, nid);
set_section_nid(section, nid);
ms = __nr_to_section(section);
if (!ms->section_mem_map)
ms->section_mem_map = sparse_encode_early_nid(nid) |
'ms' will be out of bounds and we'll corrupt a small amount of memory by
encoding the node ID. Depending on what that memory is, we might crash,
misbehave or just not notice the bug.
the fix is to sanity check anything the architecture passes to sparsemem.
this bug seems to be rather old (as old as sparsemem support itself),
but the exact incarnation depended on random details like configs,
which made this bug more prominent in v2.6.25-to-be.
an additional enhancement might be to print a warning about ignored
or trimmed memory ranges.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/sparse.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux/mm/sparse.c
===================================================================
--- linux.orig/mm/sparse.c
+++ linux/mm/sparse.c
@@ -149,8 +149,18 @@ static inline int sparse_early_nid(struc
/* Record a memory area against a node. */
void __init memory_present(int nid, unsigned long start, unsigned long end)
{
+ unsigned long max_arch_pfn = 1ULL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
unsigned long pfn;
+ /*
+ * Sanity checks - do not allow an architecture to pass
+ * in larger pfns than the maximum scope of sparsemem:
+ */
+ if (start >= max_arch_pfn)
+ return;
+ if (end >= max_arch_pfn)
+ end = max_arch_pfn;
+
start &= PAGE_SECTION_MASK;
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
unsigned long section = pfn_to_section_nr(pfn);
--
From: Ingo Molnar <mingo@...>
Subject: Re: [patch] mm: sparsemem memory_present() memory corruption fix
[1]Date: Apr 15, 8:34 pm 2008
small addendum to the changelog:
> if (!ms->section_mem_map)
> ms->section_mem_map = sparse_encode_early_nid(nid) |
+ | SECTION_MARKED_PRESENT
>
> 'ms' will be out of bounds and we'll corrupt a small amount of memory by
> encoding the node ID. Depending on what that memory is, we might crash,
> misbehave or just not notice the bug.
the corruption might happen when encoding a non-zero node ID, or due to
the SECTION_MARKED_PRESENT which is 0x1:
mmzone.h:#define SECTION_MARKED_PRESENT (1UL<<0)
Ingo
--
From: Linus Torvalds <torvalds@...>
Subject: Re: [patch] mm: sparsemem memory_present() memory corruption fix
[1]Date: Apr 15, 10:45 pm 2008
On Wed, 16 Apr 2008, Ingo Molnar wrote:
>
> small addendum to the changelog:
Ok, you didn't make that addendum to your second version, so I added it
myself.
Anyway, good job. I've pushed this out, and will let this simmer at least
overnight to see if there are any brown-paper-bag issues (either with this
or with some last changes from Andrew), but I'm happy, and I think I'll do
the real 2.6.25 tomorrow.
Linus
--
From: Ingo Molnar <mingo@...>
Subject: Re: [patch] mm: sparsemem memory_present() memory corruption fix
[1]Date: Apr 15, 8:40 pm 2008
* Ingo Molnar <mingo@elte.hu> wrote:
> small addendum to the changelog:
Joe Perches pointed out that the ULL was superfluous (i typoed it, i
knew it's a pfn). Updated patch below.
Ingo
-------------------------->
Subject: mm: sparsemem memory_present() fix
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Apr 16 01:40:00 CEST 2008
fix memory corruption and crash on 32-bit x86 systems.
if a !PAE x86 kernel is booted on a 32-bit system with more than
4GB of RAM, then we call memory_present() with a start/end that
goes outside the scope of MAX_PHYSMEM_BITS.
that causes this loop to happily walk over the limit of the
sparse memory section map:
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
unsigned long section = pfn_to_section_nr(pfn);
struct mem_section *ms;
sparse_index_init(section, nid);
set_section_nid(section, nid);
ms = __nr_to_section(section);
if (!ms->section_mem_map)
ms->section_mem_map = sparse_encode_early_nid(nid) |
SECTION_MARKED_PRESENT;
'ms' will be out of bounds and we'll corrupt a small amount of memory by
encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it.
the fix is to sanity check anything the architecture passes to sparsemem.
this bug seems to be rather old (as old as sparsemem support itself),
but the exact incarnation depended on random details like configs,
which made this bug more prominent in v2.6.25-to-be.
an additional enhancement might be to print a warning about ignored
or trimmed memory ranges.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/sparse.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: linux/mm/sparse.c
===================================================================
--- linux.orig/mm/sparse.c
+++ linux/mm/sparse.c
@@ -149,8 +149,18 @@ static inline int sparse_early_nid(struc
/* Record a memory area against a node. */
void __init memory_present(int nid, unsigned long start, unsigned long end)
{
+ unsigned long max_arch_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
unsigned long pfn;
+ /*
+ * Sanity checks - do not allow an architecture to pass
+ * in larger pfns than the maximum scope of sparsemem:
+ */
+ if (start >= max_arch_pfn)
+ return;
+ if (end >= max_arch_pfn)
+ end = max_arch_pfn;
+
start &= PAGE_SECTION_MASK;
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
unsigned long section = pfn_to_section_nr(pfn);
--
From: Ingo Molnar <mingo@...>
Subject: Re: [patch] mm: sparsemem memory_present() memory corruption fix
[1]Date: Apr 15, 9:14 pm 2008
* Ingo Molnar <mingo@elte.hu> wrote:
> but the exact incarnation depended on random details like configs,
> which made this bug more prominent in v2.6.25-to-be.
i believe this was the reason why my many bisection attempts were
unsuccessful: the bug pattern was not stable and seemingly working
kernels had the memory corruption too. It was pure luck that v2.6.24
"worked" and v2.6.25-rc9 broke visibly.
Ingo
--
From: Christoph Lameter <clameter@...>
Subject: Re: [patch] mm: sparsemem memory_present() memory corruption fix
[1]Date: Apr 15, 8:45 pm 2008
> fix memory corruption and crash on 32-bit x86 systems.
>
> if a !PAE x86 kernel is booted on a 32-bit system with more than
> 4GB of RAM, then we call memory_present() with a start/end that
> goes outside the scope of MAX_PHYSMEM_BITS.
So its a general issue that has been there for years that we are now
noticing because we are now testing with memory sizes > 4GB. This also
affects the enterprise releases (SLES10, RHEL5). Argh!
I wonder why this did not show up earlier in testing? Running a kernel
that cannot access all of memory is unusual I guess.
--
From: Ingo Molnar <mingo@...>
Subject: Re: [patch] mm: sparsemem memory_present() memory corruption fix
[1]Date: Apr 15, 8:52 pm 2008
* Christoph Lameter <clameter@sgi.com> wrote:
> I wonder why this did not show up earlier in testing? Running a kernel
> that cannot access all of memory is unusual I guess.
i guess people saw the "you are not running a PAE kernel" warning and
went to a PAE kernel which didnt have this issue.
OTOH, quite a few testers consciously use non-PAE kernels on 4GB
systems, so i'd not be surprised if this solved a few mystery
regressions we have.
Ingo
--From: Christoph Lameter <clameter@...> Subject: Re: [patch] mm: sparsemem memory_present() memory corruption fix [1]Date: Apr 15, 8:19 pm 2008 On Wed, 16 Apr 2008, Ingo Molnar wrote: > if a !PAE x86 kernel is booted on a 32-bit system with more than > 4GB of RAM, then we call memory_present() with a start/end that > goes outside the scope of MAX_PHYSMEM_BITS. Well okay this fixes it but is this the right fix? The arch should not call memory_present() with an invalid pfn. --
From: Ingo Molnar <mingo@...> Subject: Re: [patch] mm: sparsemem memory_present() memory corruption fix [1]Date: Apr 15, 8:36 pm 2008 * Christoph Lameter <clameter@sgi.com> wrote: > On Wed, 16 Apr 2008, Ingo Molnar wrote: > > > if a !PAE x86 kernel is booted on a 32-bit system with more than 4GB > > of RAM, then we call memory_present() with a start/end that goes > > outside the scope of MAX_PHYSMEM_BITS. > > Well okay this fixes it but is this the right fix? The arch should not > call memory_present() with an invalid pfn. it is the right fix. The architecture memory setup code doesnt even _know_ the limits at this place in an open-coded way (and shouldnt know them) - and even later on we use pfn_valid() to determine whether to attempt to get to a struct page and free it into the buddy. [ Of course the architecture code in general 'knows' about the limits - but still it's cleaner to have a dumb enumeration interface here combined with a resilient core code - that's always going to be less fragile. ] btw., i just did some bug history analysis, the calls were originally added when sparsemem support was added: | commit 215c3409eed16c89b6d11ea1126bd9d4f36b9afd | Author: Andy Whitcroft <apw@shadowen.org> | Date: Fri Jan 6 00:12:06 2006 -0800 | | [PATCH] i386 sparsemem for single node systems in v2.6.15-1003-g215c340. (so this is appears to be an unfixed bug in v2.6.16 as well) Ingo --
From: Ingo Molnar <mingo@...> Subject: Re: [patch] mm: sparsemem memory_present() memory corruption fix [1]Date: Apr 16, 11:03 am 2008 * Ingo Molnar <mingo@elte.hu> wrote: > ps. anyone who can correctly guess the method with which i found the > exact place that corrupted memory will get a free beer next time > we meet :-) the method was to notice that the slub_debug_slabs SLUB variable got corrupted from an expected value of 0 to a value of 0x1. Then i added a simple brute-force function-tracer hook (in sched-devel) that checked when slub_debug_slabs went from 0 to 1, and which then printed a backtrace. Since under CONFIG_FTRACE=y every kernel function calls this callback, it triggered immediately after the value got corrupted: [ 0.000000] console [earlyser0] enabled [ 0.000000] BUG: slub_debug_slabs: 00000001 [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.25-rc9-sched-devel.git-x86-latest.git #982 [ 0.000000] [<c0177fba>] print_slub_debug_slabs+0x3a/0x40 [ 0.000000] [<c01050f7>] trace+0x8/0x11 [ 0.000000] [<c0cc929e>] ? mtrr_bp_init+0xe/0x320 [ 0.000000] [<c01050f7>] ? trace+0x8/0x11 [ 0.000000] [<c0cd7369>] ? memory_present+0x9/0x50 [ 0.000000] [<c0cc7a09>] ? find_max_pfn+0x99/0xb0 [ 0.000000] [<c0cc6af7>] setup_arch+0x217/0x470 [ 0.000000] [<c012c59b>] ? printk+0x1b/0x20 [ 0.000000] [<c0cc2b46>] start_kernel+0x96/0x3f0 [ 0.000000] [<c0cc22fd>] i386_start_kernel+0xd/0x10 [ 0.000000] ======================= [ 0.000000] x86: PAT support disabled. and the backtrace had all the guilty parties on stack - memory_present() [which was just called] and find_max_pfn()/setup_arch() - thanks to the new fuzzy "?" backtrace entries we print out in v2.6.25. (i could also have printed out the current ftrace buffer as well, showing the history of all recent function calls that the kernel executed.) Ingo --
Related links:
- Archive of above thread [1]