Please make NO_BOOTMEM default to n, at least for amd64, where I've found
that it leads to all kinds of strange, undebuggable boot hangs and errors
(with relatively current Fedora development userland).
Also, the help text for the item makes little sense to a non-expert in
this area:
" ---help---
Use early_res directly instead of bootmem before slab is ready.
- allocator (buddy) [generic]
- early allocator (bootmem) [generic]
- very early allocator (reserve_early*()) [x86]
- very very early allocator (early brk model) [x86]
So reduce one layer between early allocator to final allocator."
I had no idea what all this meant, so I trusted the default=y and then spent
several hours wondering why everything was breaking, and would likely not
have figured it out in linear time without a suggestion from Dave Airlie.
- James
--
James Morris
<jmorris@namei.org>
--
Have you tested it with the latest fixes that are now in Linus' tree (-rc3)?
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
--
Yes, it was happening with -rc3. -- James Morris <jmorris@namei.org> --
Please send the bootlog if possible.
BTW, please try git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-2.6-yinghai.git
Thanks
Yinghai
--
Could you please send the bootlog that Yinghai asked for, plus also one that you get with NO_BOOTMEM turned off (for comparison)?

Also, when did you first hit this bug? This code has been upstream for almost a month, and it was in linux-next before that - so you should have hit this much sooner. A rough timeframe would suffice. I suppose you were booting upstream kernels during the merge window as well?

We can flip the default around if there's no fix available based on the bootlogs. (Plus the help text should definitely be improved.)

Thanks, Ingo
--
A default-y config option still causing regressions at -rc3, and you guys keep going? This is the sort of shit Linus would flame me for a day or two for. Can we get some f'ing consistency here?
--
Yeah. I think we need to remove the crap. I thought the problems were known, and fixed in -rc3. Clearly they weren't. And by now it's not about changing the default any more - by now it's about removing the known-crap code. Linus --
Yeah. It would still be nice to get the before/after bootlogs, because we'd like to

Ok, we can certainly do that too. Should we scrap the whole x86 bootmem conversion to begin with? I'm not sure there's any fundamentally less risky way to do it, so if we try this again in .35 we might run into similar regressions and I'd like to avoid that.

I wouldn't mind not having to do that at all; it's been a lot of pain to pull it off and the lmb conversion looks even more intrusive.

Thanks, Ingo
--
Note, without trying to defend the bootmem conversion itself, which didn't work out well: this is not some optional new driver feature that was randomly made default-y, but an infrastructure change that was to be made unconditional in .35. The flag was basically a testing/debug flag to allow the old code to be used too, in case the new code was buggy.

This is what helped James to report this today, instead of forcing James through a very difficult ~14-reboot bisection.

Thanks, Ingo
--
Are you testing this btw with initramfs/initrds? I suspect lots of testing is being done by people on monolithic kernels, this is just a misc guess, considering I couldn't boot from when this landed until rc3 with this option on a basic 32-bit install on a dual-core 64-bit CPU, it suggested a hole of some sort in the test coverage. Dave --
So -rc3 is working on your setup?
Yinghai
--
Hi Dave,

The only bug report I remember getting from you had no details and was in reply to another bug report which was, indeed, addressed, so we had every reason to believe it was being dealt with by the patchset which did indeed go into -rc3 (and does address a problem with initramfs in particular cases.)

Clearly James Morris' problem is something unrelated, and regardless of course of action we need to track it down. If you also are having problems with -rc3 we would really appreciate as much detail as possible -- boot logs at the very minimum -- so we have a chance at all to track down the problems that do exist.

-hpa
--
I don't have the old boot logs, and have since upgraded the system further. IIRC, the boot was failing after not being able to find the root fs (ext3/lvm/raid0). I thought it was a dracut issue, but it seemed to be

In this case, in the last few days (also when I first saw or noticed the bootmem option). I was booting relatively recent linus kernels during the merge window, although my main work was being done on an older upstream kernel.
--
James Morris
<jmorris@namei.org>
--
Please, could you send any bootlog then that we could work from? That way we could check the memory layout and guess the rough shape of the early

Ok - initrd unpack failing or initial mount failing is consistent with the initrd getting corrupted by overlapping early reservations due to allocator

Ok, so it's not an old regression but possibly a bug in one of the fixes. Not good.

Ingo
--
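To make the "overlapping early reservations" failure mode concrete, here is a minimal user-space sketch of the kind of check one could run by hand against a bootlog. It is not kernel code: `struct demo_res`, `demo_overlaps()` and `demo_find_overlap()` are hypothetical names, and the ranges are assumed to be half-open `[start, end)`, which the bootlog format does not state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reservation record; assumed half-open: [start, end). */
struct demo_res {
	unsigned long long start;
	unsigned long long end;
};

/* Two half-open ranges overlap iff each one starts before the other ends. */
bool demo_overlaps(struct demo_res a, struct demo_res b)
{
	return a.start < b.end && b.start < a.end;
}

/*
 * Return the index of the first reservation that overlaps target,
 * or -1 if target is clear of all of them.
 */
int demo_find_overlap(const struct demo_res *res, size_t n,
		      struct demo_res target)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (demo_overlaps(res[i], target))
			return (int)i;
	}
	return -1;
}
```

Feeding it the reservation list from a bootlog plus the address range the initrd was loaded at would show whether any early reservation lands on top of the initrd. A hit is only a hint about where to look, not a diagnosis.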
This would rather match the problem that was addressed by the patch in -rc3. Any help in reproducing it would be great. -hpa --
Upgraded to the latest rawhide userland -- I have not since tested with -- James Morris <jmorris@namei.org> --
That would be great. The sooner the better, obviously. -hpa --
I'm not seeing any problems now, with current Linus and rawhide. I'll leave bootmem off and see if anything comes up again. -- James Morris <jmorris@namei.org> --
(a current bootlog would still be nice) Dave, can you reproduce any of these problems with Linus's latest? Ingo --
ping? Can you or Dave reproduce the bug with -rc3 or later kernels? (If not then it probably means that the bug you triggered was already fixed at the time you reported it, as hpa suspected.) Thanks, Ingo --
James already reported that -rc3 fixed the problem for him. Dave implied that -rc3 fixed the problem for him.
Thanks
Yinghai
--
Hm, I'm confused, does this mean that it was all fixed upstream already when Dave and James sent their complaints? It would be nice to have a confirmation from Dave for that (beyond 'implying' it), to not keep this thread open-ended.
Thanks, Ingo
--
Okay, I built a Linus head and it booted on the previously broken machine with CONFIG_NO_BOOTMEM=y.
Dave.
--
I haven't seen it since. - James -- James Morris <jmorris@namei.org> --
In case you have a 32bit system without RAM installed on node0, please check.

Thanks
Yinghai

Subject: [PATCH] x86: Fix 32bit system without RAM on Node0

When 32bit numa is used, free_all_bootmem() will still only go over
node id 0.
If node 0 doesn't have RAM installed, we need to go with node1,
because early_node_map still uses 1 for all ranges, and RAM from node1
becomes low RAM.
Try to use MAX_NUMNODES like 64bit numa does.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/mm/init_32.c | 5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -875,7 +875,12 @@ void __init mem_init(void)
 	BUG_ON(!mem_map);
 #endif
 	/* this will put all low memory onto the freelists */
+#if defined(CONFIG_NO_BOOTMEM) && defined(MAX_NUMNODES)
+	/* In case some 32bit systems don't have RAM installed on node0 */
+	totalram_pages += free_all_memory_core_early(MAX_NUMNODES);
+#else
 	totalram_pages += free_all_bootmem();
+#endif

 	reservedpages = 0;
 	for (tmp = 0; tmp < max_low_pfn; tmp++)
--
So we get into this branch if CONFIG_NO_BOOTMEM is enabled but MAX_NUMNODES is

Btw., and I said this before, I absolutely hate the CONFIG_NO_BOOTMEM naming as well (a negative in the option), but it is what expresses the 'this is where we want to go' state better, and thus CONFIG_NO_BOOTMEM removal will be a straight removal instead of a removal of the inverse.

Thanks, Ingo
--
yes. free_all_bootmem() will call free_all_memory_core_early(NODE_DATA(0)->node_id); Thanks Yinghai Lu --
Well, and that whole #ifdeffery is disgusting as well - even if the goal was to remove CONFIG_NO_BOOTMEM ASAP. Please learn to use proper intermediate helper functions and at minimum put the conversion ugliness somewhere that doesn't intrude on our daily flow in .c files.

The best rule is to _never ever_ put an #ifdef construct into a .c file. It doesn't matter what the goal of the #ifdef is - such ugliness in code is never justified.

Thanks, Ingo
--
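As a rough illustration of the helper-function approach described above, the sketch below keeps the conditional in one header-style helper so that the call site carries no #ifdef. It is a stand-alone toy, not the kernel's code: `DEMO_NO_BOOTMEM`, `DEMO_MAX_NODES`, `demo_free_all_lowmem()` and the page counts are all hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two implementations being switched between. */
static inline unsigned long demo_free_early(int nid)
{
	assert(nid >= 0);
	return 100;	/* made-up number of pages released */
}

static inline unsigned long demo_free_bootmem(void)
{
	return 80;	/* made-up number of pages released */
}

/* ---- would live in a header: the only place the #ifdef appears ---- */
#define DEMO_MAX_NODES 4	/* hypothetical "all nodes" value */

#ifdef DEMO_NO_BOOTMEM
static inline unsigned long demo_free_all_lowmem(void)
{
	return demo_free_early(DEMO_MAX_NODES);
}
#else
static inline unsigned long demo_free_all_lowmem(void)
{
	return demo_free_bootmem();
}
#endif

/* ---- would live in the .c file: no conditional compilation here ---- */
static inline unsigned long demo_mem_init(unsigned long totalram_pages)
{
	totalram_pages += demo_free_all_lowmem();
	return totalram_pages;
}
```

Built without `-DDEMO_NO_BOOTMEM`, `demo_mem_init()` takes the bootmem stand-in; built with it, the early-allocator stand-in. Either way the body of `demo_mem_init()` stays the same, which is the point of the pattern.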
if you agree that i can have one nobootmem.c in mm/ Thanks Yinghai --
I think what we want is your lmb series, with CONFIG_NO_BOOTMEM eliminated altogether and x86 converted to pure (extended) lmb facilities, and without any traces of bootmem left in x86. I.e. a really clean series with no CONFIG_NO_BOOTMEM kind of #ifdef crap left around. This means 'nobootmem.c' (albeit saner than an #ifdef jungle) would be moot as well. We tried the dual model as it seemed prudent from a testing/conversion POV (and it certainly allowed people to turn the new code off), but it's rather ugly and we still have bugs left. This means that if Linus likes that approach the conversion will be very binary and very painful. The other option would be to go back to bootmem and forget about the whole nobootmem and lmb thing. Ingo --
That does not make much sense as bootmem is not only used on the architecture side but also in generic code. So you either have to emulate the API on x86

I think this was an implementation thing rather than a problem with the model per se. As written above, you can hardly get away without emulating the bootmem API

I suppose it would be safest to replace early_res with lmb first to get in sync with the other archs using it.

Step two would be to extend LMB and implement a bootmem emulation API on top of it so that architectures can switch over to non-bootmem mode one by one.

Then you can drop the real bootmem code and switch generic code to use LMB natively, also site by site. And finally, drop the emulation API.

If other architectures object to removing bootmem, there really is no point for x86 to even try it.

For step one to work out, it's probably easiest to fully revert to the .33 state than having to replace early_res while in its current state?
--
That would be better, or more commonly, use inlines.

I'm still totally puzzled about this patch as well as the comment:

+#if defined(CONFIG_NO_BOOTMEM) && defined(MAX_NUMNODES)
+	/* In case some 32bit systems don't have RAM installed on node0 */
+	totalram_pages += free_all_memory_core_early(MAX_NUMNODES);
+#else
 	totalram_pages += free_all_bootmem();
+#endif

Why is that "32 bits" specific? Second, MAX_NUMNODES is defined whenever <linux/numa.h> is included, so what on Earth is this supposed to signify? Are you trying to say MAX_NUMNODES > 1? Or are you trying to say CONFIG_NUMA?

Furthermore, I really don't see the connection between this and James Morris' reported problem, which he reports as "amd64", which presumably is an x86-64 kernel and not 32 bits... James, is that correct? Any more details you can give about the system?

I *really* don't want to go into cargo cult programming mode, that would suck eggs no matter what.

-hpa
--
you are right, this one should be more clear.
Subject: [PATCH -v2] nobootmem, x86: Fix 32bit system without RAM on Node0
When 32bit numa is used, free_all_bootmem() will still only go over
node id 0.
If node 0 doesn't have RAM installed, we need to go with node1,
because early_node_map still uses 1 for all ranges, and RAM from node1
becomes low RAM.
Try to use MAX_NUMNODES like 64bit numa does.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
mm/bootmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -303,7 +303,7 @@ unsigned long __init free_all_bootmem_no
unsigned long __init free_all_bootmem(void)
{
#ifdef CONFIG_NO_BOOTMEM
- return free_all_memory_core_early(NODE_DATA(0)->node_id);
+ return free_all_memory_core_early(MAX_NUMNODES);
#else
return free_all_bootmem_core(NODE_DATA(0)->bdata);
It happened on one of my test setups: node0 RAM disappeared somehow,
and I found that 32bit numa doesn't work in that case.
Thanks
Yinghai
--
... which is useful and valid, but I still think this isn't related to James' problem, if James' problem wasn't actually fixed in -rc3. That's the part that I'm afraid I have to be confused about... all the known problems except the above are fixed in -rc3, and I'd at least like to have a validated bug report of any sort before saying it should all be tossed. This patch looks a lot better. The whole use of MAX_NUMNODES as a sentinel (which appears inherited from mm/page_alloc.c, and as such is a pre-existing convention which is also invoked here) really could use a comment, though. -hpa --
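The sentinel convention mentioned above can be shown in miniature. In the sketch below, passing the one-past-the-end node id means "do not filter by node". This is a user-space illustration only: `DEMO_MAX_NUMNODES`, `struct demo_range` and `demo_count_pages()` are hypothetical and are not the kernel's early_node_map code.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_NUMNODES 8	/* hypothetical; one past the last valid node id */

struct demo_range {
	int nid;			/* node that owns this PFN range */
	unsigned long start_pfn;
	unsigned long end_pfn;		/* exclusive */
};

/*
 * Count the pages in ranges that belong to nid.  Passing
 * DEMO_MAX_NUMNODES is the sentinel for "any node": no range is
 * filtered out, so every range is counted.
 */
unsigned long demo_count_pages(const struct demo_range *r, size_t n, int nid)
{
	unsigned long pages = 0;
	size_t i;

	assert(nid >= 0 && nid <= DEMO_MAX_NUMNODES);
	for (i = 0; i < n; i++) {
		if (nid != DEMO_MAX_NUMNODES && r[i].nid != nid)
			continue;
		pages += r[i].end_pfn - r[i].start_pfn;
	}
	return pages;
}
```

With every range owned by node 1, as in the dump quoted later in the thread, asking for node 0 yields zero pages while the sentinel yields all of them. That mirrors why walking only node id 0 left the system with no free memory.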
Sure. I will send an updated one with comments there.
Thanks
Yinghai
--
On one system without RAM on node0, I got the following dump with a 32bit numa kernel:
early_node_map[4] active PFN ranges
1: 0x00000010 -> 0x00000099
1: 0x00000100 -> 0x0007da00
1: 0x0007e800 -> 0x0007ffa0
1: 0x0007ffae -> 0x0007ffb0
Subtract (29 early reservations)
#000 [0000001000 - 0000002000]
#001 [0000089000 - 000008f000]
#002 [0000091000 - 0000093500]
#003 [0000094000 - 0000099000]
#004 [0000099400 - 0000100000]
#005 [0000200000 - 0000eb7644]
#006 [0000eb8000 - 0000ec327c]
#007 [007c400000 - 007c40e000]
#008 [007c440000 - 007c44e000]
#009 [007c480000 - 007c48e000]
#010 [007c4c0000 - 007c4ce000]
#011 [007c500000 - 007c50e000]
#012 [007c540000 - 007c54e000]
#013 [007c580000 - 007c58e000]
#014 [007c5c0000 - 007c5ce000]
#015 [007c674000 - 007cbfe000]
#016 [007cbfe500 - 007cbfe530]
#017 [007cbfe540 - 007cbfe5d0]
#018 [007cbfe600 - 007cbfe620]
#019 [007cbfe640 - 007cbfe660]
#020 [007cbfe680 - 007cbfe684]
#021 [007cbfe6c0 - 007cbfe6c4]
#022 [007cbfe700 - 007cbfe77e]
#023 [007cbfe780 - 007cbfe7fe]
#024 [007cbfe800 - 007cbfec54]
#025 [007cbfec80 - 007cbfeede]
#026 [007cbfef00 - 007cbfef2d]
#027 [007cbfef40 - 007e800000]
#028 [007e9ca000 - 007ff95000]
(0 free memory ranges)
Initializing HighMem for node 0 (00000000:00000000)
Initializing HighMem for node 1 (00000000:00000000)
Memory: 0k/2096832k available (6662k kernel code, 2096300k reserved, 4829k data, 484k init, 0k highmem)
virtual kernel memory layout:
fixmap : 0xff637000 - 0xfffff000 (10016 kB)
pkmap : 0xff200000 - 0xff400000 (2048 kB)
vmalloc : 0xc07b0000 - 0xff1fe000 (1002 MB)
lowmem : 0x40000000 - 0xbffb0000 (2047 MB)
.init : 0x40d39000 - 0x40db2000 ( 484 kB)
.data : 0x40881924 - 0x40d38e1c (4829 kB)
.text : 0x40200000 - 0x40881924 (6662 kB)
Checking if this processor honours the WP bit even in supervisor mode...Ok.
swapper: page allocation failure. order:0, mode:0x0
Pid: 0, comm: ...
--
Please address the separate bug fix in a separate patch.
--
Sent from my mobile phone, pardon any lack of formatting.
When 32bit numa is used, free_all_bootmem() will still only go over
node id 0.
If node 0 doesn't have RAM installed, we need to go with node1,
because early_node_map still uses 1 for all ranges, and RAM from node1
becomes low RAM.

This one fixes the BOOTMEM path by looping over bdata_list.

-v3: add more comments, and fix bootmem path too.
-v4: separate from one big patch

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 mm/bootmem.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -312,7 +312,13 @@ unsigned long __init free_all_bootmem(vo
 	 */
 	return free_all_memory_core_early(MAX_NUMNODES);
 #else
-	return free_all_bootmem_core(NODE_DATA(0)->bdata);
+	unsigned long total_pages = 0;
+	bootmem_data_t *bdata;
+
+	list_for_each_entry(bdata, &bdata_list, list)
+		total_pages += free_all_bootmem_core(bdata);
+
+	return total_pages;
 #endif
 }
--
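The change in the patch above, freeing from every per-node descriptor instead of only node 0's, can be pictured with a toy list walk. This is a hedged user-space sketch: `struct demo_bdata` and both functions are hypothetical, and the real code uses the kernel's `list_for_each_entry()` over `bdata_list` rather than a hand-rolled singly linked list.

```c
#include <stddef.h>

/* Hypothetical per-node descriptor; the kernel's bootmem_data_t is richer. */
struct demo_bdata {
	unsigned long free_pages;	/* pages this node can hand over */
	struct demo_bdata *next;	/* simple singly linked list */
};

/* Old behaviour in miniature: only the first (node 0) descriptor is used. */
unsigned long demo_free_node0_only(const struct demo_bdata *head)
{
	return head ? head->free_pages : 0;
}

/* Patched behaviour in miniature: walk every descriptor and sum the pages. */
unsigned long demo_free_all_nodes(const struct demo_bdata *head)
{
	unsigned long total_pages = 0;
	const struct demo_bdata *b;

	for (b = head; b != NULL; b = b->next)
		total_pages += b->free_pages;
	return total_pages;
}
```

With a node 0 descriptor that holds no pages followed by a populated node 1, the first function reports nothing freed and the second reports node 1's pages, which is the situation the patch description is about.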
Commit-ID:  aa235fc712f379d4194cff9217f07026c452c141
Gitweb:     http://git.kernel.org/tip/aa235fc712f379d4194cff9217f07026c452c141
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Wed, 31 Mar 2010 20:45:27 -0700
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Thu, 1 Apr 2010 14:41:19 -0700

bootmem, x86: Fix 32bit numa system without RAM on node 0

When 32bit numa is used, free_all_bootmem() will still only go over
with node id 0.

If node 0 doesn't have RAM installed, the lowest populated node
becomes low RAM. This one fixes BOOTMEM path by iterating over the
bdata_list.

-v3: add more comments, and fix bootmem path too.
-v4: seperate from one big patch

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4BB416D7.6090203@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 mm/bootmem.c | 8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 2058cb7..ba37d62 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -312,7 +312,13 @@ unsigned long __init free_all_bootmem(void)
 	 */
 	return free_all_memory_core_early(MAX_NUMNODES);
 #else
-	return free_all_bootmem_core(NODE_DATA(0)->bdata);
+	unsigned long total_pages = 0;
+	bootmem_data_t *bdata;
+
+	list_for_each_entry(bdata, &bdata_list, list)
+		total_pages += free_all_bootmem_core(bdata);
+
+	return total_pages;
 #endif
 }
--
On one system without RAM on node0, I got the following dump with a 32bit numa kernel:
early_node_map[4] active PFN ranges
1: 0x00000010 -> 0x00000099
1: 0x00000100 -> 0x0007da00
1: 0x0007e800 -> 0x0007ffa0
1: 0x0007ffae -> 0x0007ffb0
...
Subtract (29 early reservations)
#000 [0000001000 - 0000002000]
#001 [0000089000 - 000008f000]
#002 [0000091000 - 0000093500]
...
#027 [007cbfef40 - 007e800000]
#028 [007e9ca000 - 007ff95000]
(0 free memory ranges)
Initializing HighMem for node 0 (00000000:00000000)
Initializing HighMem for node 1 (00000000:00000000)
Memory: 0k/2096832k available (6662k kernel code, 2096300k reserved, 4829k data, 484k init, 0k highmem)
...
Checking if this processor honours the WP bit even in supervisor mode...Ok.
swapper: page allocation failure. order:0, mode:0x0
Pid: 0, comm: swapper Not tainted 2.6.34-rc3-tip-03818-g4b1ea6c-dirty #35
Call Trace:
[<4087a5dc>] ? printk+0xf/0x11
[<40286728>] __alloc_pages_nodemask+0x417/0x487
[<402a9ce1>] new_slab+0xe2/0x1fe
[<402aa5b2>] kmem_cache_open+0x185/0x358
[<402abbc0>] T.954+0x1c/0x60
[<40d52a29>] kmem_cache_init+0x24/0x113
[<40d39738>] start_kernel+0x166/0x2e4
[<40d3940e>] ? unknown_bootoption+0x0/0x18e
[<40d390ce>] i386_start_kernel+0xce/0xd5
Mem-Info:
Node 1 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 1 Normal per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:0 slab_reclaimable:0 slab_unreclaimable:0
mapped:0 shmem:0 pagetables:0 bounce:0
When 32bit numa is used, free_all_bootmem() will still only go over
node id 0.
If node 0 doesn't have RAM installed, we need to go with node1,
because early_node_map still uses 1 for all ranges, and RAM from node1
becomes low RAM.
Try to use MAX_NUMNODES like 64bit numa does.
Note: the BOOTMEM path has the same problem.
This bug existed before we had NO_BOOTMEM ...
--
Commit-ID:  337998587f802535896e9ed16d19f97915ccd368
Gitweb:     http://git.kernel.org/tip/337998587f802535896e9ed16d19f97915ccd368
Author:     Yinghai Lu <yinghai@kernel.org>
AuthorDate: Wed, 31 Mar 2010 20:44:09 -0700
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Thu, 1 Apr 2010 14:39:29 -0700

nobootmem, x86: Fix 32bit numa system without RAM on node 0

On one system without RAM on node0, got following boot dump with a
32 bit NUMA kernel:

early_node_map[4] active PFN ranges
1: 0x00000010 -> 0x00000099
1: 0x00000100 -> 0x0007da00
1: 0x0007e800 -> 0x0007ffa0
1: 0x0007ffae -> 0x0007ffb0
...
Subtract (29 early reservations)
#000 [0000001000 - 0000002000]
#001 [0000089000 - 000008f000]
#002 [0000091000 - 0000093500]
...
#027 [007cbfef40 - 007e800000]
#028 [007e9ca000 - 007ff95000]
(0 free memory ranges)
Initializing HighMem for node 0 (00000000:00000000)
Initializing HighMem for node 1 (00000000:00000000)
Memory: 0k/2096832k available (6662k kernel code, 2096300k reserved, 4829k data, 484k init, 0k highmem)
...
Checking if this processor honours the WP bit even in supervisor mode...Ok.
swapper: page allocation failure. order:0, mode:0x0
Pid: 0, comm: swapper Not tainted 2.6.34-rc3-tip-03818-g4b1ea6c-dirty #35
Call Trace:
[<4087a5dc>] ? printk+0xf/0x11
[<40286728>] __alloc_pages_nodemask+0x417/0x487
[<402a9ce1>] new_slab+0xe2/0x1fe
[<402aa5b2>] kmem_cache_open+0x185/0x358
[<402abbc0>] T.954+0x1c/0x60
[<40d52a29>] kmem_cache_init+0x24/0x113
[<40d39738>] start_kernel+0x166/0x2e4
[<40d3940e>] ? unknown_bootoption+0x0/0x18e
[<40d390ce>] i386_start_kernel+0xce/0xd5
Mem-Info:
Node 1 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 1 Normal per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:0 slab_reclaimable:0 slab_unreclaimable:0
mapped:0 shmem:0 pagetables:0 ...
I too noticed this absolutely catastrophic "help" text but forgot to send a bug report. Either this option can be explained and the text fixed, or it cannot be explained and shouldn't be an option in the first place.
--
Stefan Richter
-=====-==-=- --== =====
http://arcgraph.de/sr/
--
