Re: [2.6.27] early exception - lockdep related?

From: Luca Tettamanti
Date: Tuesday, September 2, 2008 - 2:06 pm

Hello,
I'm seeing an early exception (0e), which seems related to lockdep, at
boot with many 2.6.27 kernels, and I'm having trouble tracking it down.
The strange thing is that it comes and goes with different kernel
versions, but a "bad" kernel consistently fails across reboots. It also
seems to be sensitive to the configuration (attached): in at least one
case the difference between a non-working kernel and a working one is
CONFIG_DEBUG enabled in the latter.

The address printed is inside the function __lock_acquire:

in __lock_acquire (/home/kronos/src/linux-2.6.git/kernel/lockdep.c:727).
722
723             /*
724              * We can walk the hash lockfree, because the hash only
725              * grows, and we are careful when adding entries to the end:
726              */
727             list_for_each_entry(class, hash_head, hash_entry) {
728                     if (class->key == key) {
729                             WARN_ON_ONCE(class->name != lock->name);
730                             return class;
731                     }
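
The walk quoted above can be illustrated with a minimal user-space sketch. It assumes a circular doubly linked list with a dedicated head node, as the kernel's list_for_each_entry expects; the struct and function names (lock_class_sketch, sketch_lookup and friends) are hypothetical stand-ins, not the lockdep code. Exception 0e is the x86 page-fault vector, and a walk like this dereferences every next pointer it follows, so a bucket whose next pointer is NULL or stale would fault inside the loop. The sketch turns that case into an assertion.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical, trimmed-down lock class: only the fields the walk touches. */
struct lock_class_sketch {
    struct list_head hash_entry;
    const void *key;
    const char *name;
};

/* Recover the enclosing struct from a pointer to its embedded list node. */
#define sketch_entry(ptr) \
    ((struct lock_class_sketch *)((char *)(ptr) - \
        offsetof(struct lock_class_sketch, hash_entry)))

/* An empty list is a head that points at itself in both directions. */
static void sketch_list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Append a node at the end of the chain, just before the head. */
static void sketch_list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Walk one hash chain and return the class with a matching key, or NULL. */
static struct lock_class_sketch *sketch_lookup(struct list_head *hash_head,
                                               const void *key)
{
    struct list_head *pos;

    /* An uninitialized bucket would be dereferenced below; catch it here. */
    assert(hash_head != NULL && hash_head->next != NULL);

    for (pos = hash_head->next; pos != hash_head; pos = pos->next) {
        struct lock_class_sketch *cls = sketch_entry(pos);

        if (cls->key == key)
            return cls;
    }
    return NULL;
}
```

Looking up a key in a freshly initialized bucket returns NULL, while a bucket that was never initialized trips the assertion instead of faulting.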

And the disassembly (the faulting address is 0xffffffff80253b66):

0xffffffff80253b00 <__lock_acquire+299>:        shr    $0x34,%rax
0xffffffff80253b04 <__lock_acquire+303>:        shl    $0x4,%rax
0xffffffff80253b08 <__lock_acquire+307>:        lea    -0x7f57e9b0(%rax),%rdx
0xffffffff80253b0f <__lock_acquire+314>:        mov    -0x7f57e9b0(%rax),%r13
0xffffffff80253b16 <__lock_acquire+321>:        jmp    0xffffffff80253b66 <__lock_acquire+401>
0xffffffff80253b18 <__lock_acquire+323>:        cmp    %r8,0x20(%r13)
0xffffffff80253b1c <__lock_acquire+327>:        jne    0xffffffff80253b63 <__lock_acquire+398>
0xffffffff80253b1e <__lock_acquire+329>:        mov    -0x90(%rbp),%r10
0xffffffff80253b25 <__lock_acquire+336>:        mov    0x10(%r10),%r10
0xffffffff80253b29 <__lock_acquire+340>:        cmp    %r10,0x140(%r13)
0xffffffff80253b30 <__lock_acquire+347>:        je     0xffffffff80253ea0 <__lock_acquire+1227>
0xffffffff80253b36 ...
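
The first four instructions of the listing can be read as a hash-bucket address computation. The sketch below redoes that arithmetic in user-space C. It assumes that %rax already holds a 64-bit hash value, that the 32-bit displacement is the sign-extended base of the bucket array, and that each bucket is a 16-byte pair of pointers; none of this is confirmed by the thread, and the names are hypothetical. Note that the faulting address itself (+401, the target of the jmp) is not part of the quoted listing.

```c
#include <stdint.h>

/* shr $0x34 keeps the top 64 - 52 = 12 bits of the value in %rax. */
#define SKETCH_HASH_SHIFT   52
/* shl $0x4 scales the index by 16 bytes (assumed: two 64-bit pointers). */
#define SKETCH_BUCKET_SHIFT 4
/* Displacement used by the lea/mov pair, exactly as disassembled. */
#define SKETCH_TABLE_DISP   ((int32_t)(-0x7f57e9b0))

/* Sign-extend the 32-bit displacement to a 64-bit kernel address. */
static uint64_t sketch_table_base(void)
{
    return (uint64_t)(int64_t)SKETCH_TABLE_DISP;
}

/* Address loaded into %rdx by the lea, for a given incoming %rax. */
static uint64_t sketch_bucket_address(uint64_t rax)
{
    uint64_t index = rax >> SKETCH_HASH_SHIFT;      /* 0 .. 4095 */

    return sketch_table_base() + (index << SKETCH_BUCKET_SHIFT);
}
```

Under these assumptions the table starts at 0xffffffff80a81650 and has 4096 buckets; the mov at +314 then loads the first pointer of the selected bucket into %r13, which is the value the loop goes on to dereference.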
From: Peter Zijlstra
Date: Tuesday, September 2, 2008 - 11:52 pm

Right - except this isn't in __lock_acquire, it's from


Might - I've never tried it..

--

From: Luca Tettamanti
Date: Wednesday, September 3, 2008 - 1:28 am

Of course not ;-) It dies before dumping the stack trace (or at least
it doesn't make it to the console - that machine has a serial port, but

-rc2 is working, after that it's intermittent. The last kernel that I

Oops, will send as soon as I get home.

Luca
--

From: Peter Zijlstra
Date: Wednesday, September 3, 2008 - 1:37 am

OK, once you send your .config, I'll try to reproduce on one of my
machines. Perhaps you can also provide the git-describe output of a known

Sure thing..

--

From: Luca Tettamanti
Date: Wednesday, September 3, 2008 - 1:54 am

Hum, I forgot to mention that it dies very early, just after:

Kernel is alive
Kernel is really alive
<exception>
<dead>

Is netconsole already up at that point?

Luca
--

From: Peter Zijlstra
Date: Wednesday, September 3, 2008 - 2:01 am

Good question, I suppose not, that only happens after the device probing
has found the eth card.

--

From: Peter Zijlstra
Date: Wednesday, September 3, 2008 - 2:03 am

Another thing you could try is booting this kernel in qemu and
attaching to its gdb stub.

I've had varying levels of success doing that; you need a very recent
version of qemu (might still be an svn snapshot) to get x86_64 gdb to
work, iirc.

--

From: Luca Tettamanti
Date: Wednesday, September 3, 2008 - 12:21 pm

Config is attached.

I tried netconsole, but got no output. No luck with qemu either :S

QEMU from SVN reports that:
qemu: fatal: Trying to execute code outside RAM or ROM at 0x00000000000a0000

KVM goes a bit further but dies due to an exception while the kernel
is still executing in real mode.

Luca
--

From: Peter Zijlstra
Date: Thursday, September 4, 2008 - 7:25 am

Sadly your config just boots, albeit not to userspace due to missing
drivers.

root@twins:/mnt/build/linux-2.6# git-describe
v2.6.27-rc5-6-gbef69ea

root@twins:/mnt/build/linux-2.6# /mnt/md0/cross/bin/x86_64-linux-gcc --version
x86_64-linux-gcc (GCC) 4.3.1 20080510 (prerelease)


--

From: Luca Tettamanti
Date: Thursday, September 4, 2008 - 1:51 pm

Yes, I managed to boot it with qemu... I tried kgdb - without luck -
kernel dies too early.
I also managed to get a stack trace :D

http://img151.imageshack.us/my.php?image=tracedm1.jpg

It seems that lockdep is an innocent bystander... the kernel died with
panic() in __reserve_early, and then took another exception while
printing the panic (I guess).
Will add further debug stuff to see wtf is going on.
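
One kind of condition under which an early-reservation helper might bail out is a new range colliding with an existing one, or the table being full. The sketch below is a user-space illustration of such a check, assuming half-open [start, end) ranges and a small fixed-size table; it is not the kernel's __reserve_early, the names and the table size are hypothetical, and it returns a status code where the kernel would panic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table size; the kernel's real limit is not assumed here. */
#define SKETCH_MAX_EARLY_RES 20

/* One reserved physical range, half-open: [start, end). */
struct early_res_sketch {
    uint64_t start, end;
    const char *name;
};

struct early_res_table {
    struct early_res_sketch res[SKETCH_MAX_EARLY_RES];
    size_t count;
};

enum reserve_status {
    RESERVE_OK,
    RESERVE_OVERLAP,   /* where a real kernel helper could panic */
    RESERVE_FULL       /* likewise */
};

/* Two half-open ranges overlap if each starts before the other ends. */
static int sketch_overlaps(const struct early_res_sketch *r,
                           uint64_t start, uint64_t end)
{
    return start < r->end && end > r->start;
}

/* Record [start, end) unless it collides with an earlier reservation. */
static enum reserve_status sketch_reserve_early(struct early_res_table *t,
                                                uint64_t start, uint64_t end,
                                                const char *name)
{
    size_t i;

    for (i = 0; i < t->count; i++) {
        if (sketch_overlaps(&t->res[i], start, end))
            return RESERVE_OVERLAP;
    }
    if (t->count >= SKETCH_MAX_EARLY_RES)
        return RESERVE_FULL;

    t->res[t->count].start = start;
    t->res[t->count].end = end;
    t->res[t->count].name = name;
    t->count++;
    return RESERVE_OK;
}
```

Printing the start, end and name of both the new and the conflicting entry at the point of failure is the kind of debug output that would show which two reservations collide.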

Luca
--
