Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

From: Linus Torvalds
Date: Monday, August 25, 2008 - 11:00 am

On Mon, 25 Aug 2008, Alan D. Brunelle wrote:

Ok, so I took a closer look, and the oops really is suggestive..


Ok, 4840 bytes left out of 8kB.


.. and this one is 4784 bytes left..


Uhhuh! The previous "modprobe" uses stack like mad.  It could be 
"fuse_init()" that has done it, but looking at fuse, I seriously doubt it. 
It doesn't seem to do anything particularly bad.

So something has used over 6kB of stack, and it may well be the module 
loading code itself.

The next stage is the actual oops itself:


This really looks like

	ti->task->blocked_on = waiter;

where "ti->task" is NULL. You probably have almost everything enabled in 
order to turn "struct task_struct" that big, but judging by your register 
state it's really an offset off a NULL pointer, not some small integer.
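To illustrate the "offset off a NULL pointer" reasoning, here is a minimal C sketch. The struct "fake_task" and its 5000-byte padding are made up for the example; they are not the real "struct task_struct" layout, which depends on the configuration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a big task_struct; the sizes are invented. */
struct fake_task {
	char config_dependent_state[5000];
	void *blocked_on;
};

/*
 * Address the CPU would touch for "task->blocked_on = waiter",
 * given the numeric value of the "task" pointer.
 */
static uintptr_t blocked_on_address(uintptr_t task_base)
{
	return task_base + offsetof(struct fake_task, blocked_on);
}
```

With a NULL base, the faulting address is exactly the field offset, so a fault address in the low thousands suggests a NULL struct pointer rather than a random small integer.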

Now, there is no way "ti->task" can _possibly_ be NULL. No way.

Well, except that "ti" sits just below the stack, and a stack overflow 
would have overwritten it.
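A rough userspace model of that layout, assuming an 8kB region with the thread_info at its lowest address and the stack growing down from the top. The names "fake_thread_info", "fake_thread_union" and "simulate_stack_use" are invented for this sketch, and the real structure has many more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define THREAD_SIZE 8192	/* 8kB stack region, as in the report */

/* Hypothetical, heavily reduced thread_info: only the task pointer. */
struct fake_thread_info {
	void *task;
};

/* thread_info shares the region with the stack, at the low end. */
union fake_thread_union {
	struct fake_thread_info ti;
	unsigned char stack[THREAD_SIZE];
};

/*
 * Pretend that "used" bytes of stack were consumed from the top of
 * the region downward, and that the frames happened to hold zeroes.
 */
static void simulate_stack_use(union fake_thread_union *u, size_t used)
{
	assert(used <= THREAD_SIZE);
	memset(u->stack + THREAD_SIZE - used, 0, used);
}
```

As long as the usage stays clear of the bottom of the region, "ti.task" survives; once it reaches the bottom, the pointer is overwritten, which in this model leaves it NULL.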

So I seriously do believe that you have run out of stack. If that is true, 
then it's quite likely that with DEBUG_PAGEALLOC you'll actually get a 
double fault, which in turn is fairly hard to debug (you look at it wrong 
and it turns into a triple fault which is going to just reboot your 
machine immediately).

Now, the stack overflow probably happened a few calls earlier (and just 
left your thread_info corrupted), but there is more reason to believe you 
have stack overflow and thread_info corruption later in your output:


Here there is only 408 bytes left, which is _way_ too little, but it's 
also an optimistic measure. What the stack usage code does is to just 
see how many zeroes it can find on the stack. If you have a big stack 
frame somewhere, it's quite possible that it actually used all your stack 
and then some, but left a bunch of zeroes around.
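The zero-counting idea can be sketched like this. It is a simplified userspace model, not the kernel's actual implementation, and the function name "stack_bytes_never_used" is made up.

```c
#include <stddef.h>

/*
 * Count how many bytes at the low end of a stack area still hold
 * zeroes. "lowest" points at the lowest address of the area (the
 * end the stack grows toward); "nwords" is its size in words.
 */
static size_t stack_bytes_never_used(const unsigned long *lowest,
				     size_t nwords)
{
	size_t i = 0;

	while (i < nwords && lowest[i] == 0)
		i++;
	return i * sizeof(unsigned long);
}
```

The measure only sees words that were actually written. A deep frame that reserved a big buffer without touching all of it leaves zeroes behind, so the reported free space can be larger than what was really left.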

And the do_exit() oops is simply because once the thread_info is 
corrupted, all the basic thread data structures are crap, and yes, you're 
almost guaranteed to oops at that point.

Could you make your kernel image available somewhere, and we can take a 
look at it? Some versions of gcc are total pigs when it comes to stack 
usage, and your exact configuration matters too.  But yes, module loading 
is a bad case: for me, "sys_init_module()" contains

	subq    $392, %rsp      #,

which is probably mostly because of the insane inlining gcc does (ie it 
will likely have inlined every single function in that file that is only 
called once, and then it will make all local variables of all those 
functions alive over the whole function and allocate stack-space for them 
ALL AT THE SAME TIME).
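A back-of-the-envelope model of why that hurts, assuming helpers that are each called once, one after another (not nested), and ignoring return addresses and saved registers. The function names and any frame sizes you feed in are invented for illustration; they are not measurements of the module loading code.

```c
#include <stddef.h>

/*
 * Peak stack when the helpers stay real function calls: only one
 * helper frame is live at a time, so the biggest one decides.
 */
static size_t peak_with_real_calls(size_t caller_frame,
				   const size_t *callee_frames, size_t n)
{
	size_t biggest = 0;

	for (size_t i = 0; i < n; i++)
		if (callee_frames[i] > biggest)
			biggest = callee_frames[i];
	return caller_frame + biggest;
}

/*
 * Peak stack when every helper is inlined and the compiler gives
 * each helper's locals its own slot for the whole function.
 */
static size_t peak_when_inlined_without_sharing(size_t caller_frame,
						const size_t *callee_frames,
						size_t n)
{
	size_t total = caller_frame;

	for (size_t i = 0; i < n; i++)
		total += callee_frames[i];
	return total;
}
```

For example, with a made-up 104-byte caller frame and helper frames of 128, 96 and 64 bytes, the first model gives 232 bytes and the second gives 392.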

Gcc sometimes drives me mad. Its inlining decisions are almost always 
pure and utter sh*t. But clearly something changed for you to start 
triggering this, and I think that also explains why you bisected things to 
the merge commit rather than to any individual change - because it was 
probably not any individual change that pushed it over the limit, but two 
different changes that made for bigger stack pressure, and _together_ they 
pushed you over the limit.

So it also explains why the merge you found had no possible merge errors 
on a source level - there were no actual clashes anywhere. Just a slow 
growth of stack that combined to something that overflowed.

And yes, I bet the change by Arjan to use do_one_initcall() was _part_ of 
it. It adds roughly 112 bytes of stack pressure to that module loading 
path, because of the 64-byte array and the extra function call (8 bytes 
for return address) with at least 5 quad-words saved (40 bytes) for 
register spills.
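Spelled out, the estimate is just the sum of the three pieces named above. The enum and function names are made up for this sketch; only the byte counts come from the text.

```c
/* Pieces of the estimate, in bytes (names are hypothetical). */
enum {
	MSG_ARRAY_BYTES      = 64,	/* the 64-byte array */
	RETURN_ADDRESS_BYTES = 8,	/* one extra function call */
	SAVED_QUADWORDS      = 5,	/* "at least 5" registers saved */
	QUADWORD_BYTES       = 8,
};

/* Rough extra stack used by the added call on the module load path. */
static int extra_stack_bytes(void)
{
	return MSG_ARRAY_BYTES + RETURN_ADDRESS_BYTES +
	       SAVED_QUADWORDS * QUADWORD_BYTES;
}
```

That is 64 + 8 + 40 = 112 bytes, and since the register count is a lower bound, the real number could be somewhat higher.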

But there were probably other things happening too that made things worse.

So if there is some place where you can upload your 'vmlinux' binary, it 
would be good.

			Linus