Hello,
I'm seeing an early exception (0e) - which seems related to lockdep - at
boot with many 2.6.27 kernels and I'm having trouble tracking it down.
The strange thing is that it comes and goes with different kernel
versions, but a "bad" kernel consistently fails across reboots. It also
seems to be sensitive to the configuration (attached), at least in one
case the difference between a non-working kernel and a working one is
CONFIG_DEBUG enabled in the latter.
The address printed is inside the function __lock_acquire:
in __lock_acquire (/home/kronos/src/linux-2.6.git/kernel/lockdep.c:727).
722
723 /*
724 * We can walk the hash lockfree, because the hash only
725 * grows, and we are careful when adding entries to the end:
726 */
727 list_for_each_entry(class, hash_head, hash_entry) {
728 if (class->key == key) {
729 WARN_ON_ONCE(class->name != lock->name);
730 return class;
731 }
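
For readers less familiar with the idiom, the loop at line 727 can be approximated in plain userspace C as below. This is only a sketch: `struct class_sketch`, `bucket_init`, `bucket_add_tail` and `bucket_find` are hypothetical names made up for illustration, and the structure carries only the fields the quoted loop touches, not the kernel's real layout. The point it tries to show is that every iteration dereferences a pointer read from the list, so a damaged bucket head or entry would fault inside the walk itself.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, cut-down lock class: only what the quoted loop uses. */
struct class_sketch {
    struct list_head hash_entry; /* links the class into its hash bucket */
    const void *key;             /* compared against the lookup key */
    const char *name;
};

/* An empty bucket is a head that points at itself. */
static void bucket_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Append at the tail, matching "adding entries to the end". */
static void bucket_add_tail(struct list_head *head, struct class_sketch *c)
{
    c->hash_entry.prev = head->prev;
    c->hash_entry.next = head;
    head->prev->next = &c->hash_entry;
    head->prev = &c->hash_entry;
}

/* Walk one bucket; each step follows a pointer stored in the list. */
static struct class_sketch *bucket_find(struct list_head *head, const void *key)
{
    struct list_head *pos;

    for (pos = head->next; pos != head; pos = pos->next) {
        struct class_sketch *c =
            container_of(pos, struct class_sketch, hash_entry);
        if (c->key == key)
            return c;
    }
    return NULL; /* no class registered for this key */
}
```

Calling `bucket_find()` on a freshly initialised bucket returns NULL; after two `bucket_add_tail()` calls it returns whichever entry carries the matching key.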
And the disassembly (faulting address is 0xffffffff80253b66):
0xffffffff80253b00 <__lock_acquire+299>: shr $0x34,%rax
0xffffffff80253b04 <__lock_acquire+303>: shl $0x4,%rax
0xffffffff80253b08 <__lock_acquire+307>: lea -0x7f57e9b0(%rax),%rdx
0xffffffff80253b0f <__lock_acquire+314>: mov -0x7f57e9b0(%rax),%r13
0xffffffff80253b16 <__lock_acquire+321>: jmp 0xffffffff80253b66 <__lock_acquire+401>
0xffffffff80253b18 <__lock_acquire+323>: cmp %r8,0x20(%r13)
0xffffffff80253b1c <__lock_acquire+327>: jne 0xffffffff80253b63 <__lock_acquire+398>
0xffffffff80253b1e <__lock_acquire+329>: mov -0x90(%rbp),%r10
0xffffffff80253b25 <__lock_acquire+336>: mov 0x10(%r10),%r10
0xffffffff80253b29 <__lock_acquire+340>: cmp %r10,0x140(%r13)
0xffffffff80253b30 <__lock_acquire+347>: je 0xffffffff80253ea0 <__lock_acquire+1227>
0xffffffff80253b36 ...

Right - except this isn't in __lock_acquire, it's from

Might - I've never tried it..
--
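
As a reading aid for the first instructions of the listing above (`shr $0x34`, `shl $0x4`, then a load at displacement `-0x7f57e9b0(%rax)`), the arithmetic can be reproduced in a few lines of C. This is a sketch under assumptions: that `%rax` holds an already hashed 64-bit key at that point, and that the displacement is the base of a table of 16-byte bucket heads. The function names are hypothetical and nothing here is taken from the kernel source.

```c
#include <stdint.h>

/* shr $0x34: keep the top 12 bits of the 64-bit value (64 - 52 = 12). */
#define SKETCH_INDEX_SHIFT 52
/* shl $0x4: scale the index by 16 bytes per table entry. */
#define SKETCH_ENTRY_SHIFT 4

/* Byte offset of the bucket inside the table, as the two shifts compute it. */
static uint64_t bucket_offset(uint64_t hashed)
{
    return (hashed >> SKETCH_INDEX_SHIFT) << SKETCH_ENTRY_SHIFT;
}

/* The 32-bit displacement -0x7f57e9b0, sign-extended to 64 bits. */
static uint64_t table_base(void)
{
    return (uint64_t)(int64_t)(int32_t)-0x7f57e9b0;
}

/* Address that the lea/mov pair in the listing would operate on. */
static uint64_t bucket_address(uint64_t hashed)
{
    return table_base() + bucket_offset(hashed);
}
```

Under these assumptions the table base works out to 0xffffffff80a81650, the index selects one of 4096 buckets, and bucket offsets range from 0 to 0xfff0.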
Of course not ;-)

It dies before dumping the stack trace (or at least it doesn't make it to the
console - that machine has a serial port, but -rc2 is working, after that
it's intermittent. The last kernel that I

Oops, will send as soon as I get home.

Luca
--
OK, once you send your .config, I'll try and reproduce on one of my
machines. Perhaps you can also provide the git-describe output of a known

Sure thing..
--
Hum, I forgot to mention that it dies very early, just after:

Kernel is alive
Kernel is really alive
<exception>
<dead>

Is netconsole already up at that point?

Luca
--
Good question, I suppose not, that only happens after the device probing
has found the eth card.
--
Another thing you could try is booting this kernel in qemu and attaching to
its gdb stub. I've had varying levels of success doing that; you need a very
recent version (might still be an svn snapshot) of qemu to get x86_64 gdb to
work iirc.
--
Config is attached. I tried netconsole, but got no output. No luck with qemu
either :S

QEMU from SVN reports that:

qemu: fatal: Trying to execute code outside RAM or ROM at 0x00000000000a0000

KVM goes a bit further but dies due to an exception while the kernel is
still executing in real mode.

Luca
Sadly your config just boots, albeit not to userspace due to missing drivers.

root@twins:/mnt/build/linux-2.6# git-describe
v2.6.27-rc5-6-gbef69ea
root@twins:/mnt/build/linux-2.6# /mnt/md0/cross/bin/x86_64-linux-gcc --version
x86_64-linux-gcc (GCC) 4.3.1 20080510 (prerelease)
--
Yes, I managed to boot it with qemu... I tried kgdb - without luck - kernel
dies too early. I also managed to get a stack trace :D

http://img151.imageshack.us/my.php?image=tracedm1.jpg

It seems that lockdep is an innocent bystander... the kernel died with
panic() in __reserve_early, and then took another exception while printing
the panic (I guess). Will add further debug stuff to see wtf is going on.

Luca
--
