The oops below happened today with v2.6.37-rc5-62-g6313e3c with the
following hpet oneline fix which should be unrelated (tglx has an
equivalent one in -tip/urgent already):
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index ae03cab..0388a70 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -408,7 +408,7 @@ static int hpet_next_event(unsigned long delta,
*/
res = (s32)(cnt - hpet_readl(HPET_COUNTER));
- return res < 8 ? -ETIME : 0;
+ return res < 128 ? -ETIME : 0;
}
static void hpet_legacy_set_mode(enum clock_event_mode mode,
--
SysRq was still functioning though and I was able to sync+reboot
therefore the oops is from /var/log/syslog. Any ideas?
--
Dec 14 10:17:26 liondog kernel: [12312.144385] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1081
Dec 14 10:17:26 liondog kernel: [12312.144390] in_atomic(): 0, irqs_disabled(): 1, pid: 1930, name: Xorg
Dec 14 10:17:26 liondog kernel: [12312.144393] no locks held by Xorg/1930.
Dec 14 10:17:26 liondog kernel: [12312.144394] irq event stamp: 959429057
Dec 14 10:17:26 liondog kernel: [12312.144396] hardirqs last enabled at (959429057): [<ffffffff81021d4e>] do_page_fault+0xa5/0x3a7
Dec 14 10:17:26 liondog kernel: [12312.144402] hardirqs last disabled at (959429056): [<ffffffff8144f403>] error_sti+0x5/0x6
Dec 14 10:17:26 liondog kernel: [12312.144407] softirqs last enabled at (959429046): [<ffffffff81041f38>] __do_softirq+0x2b4/0x304
Dec 14 10:17:26 liondog kernel: [12312.144412] softirqs last disabled at (959429021): [<ffffffff8100324c>] call_softirq+0x1c/0x28
Dec 14 10:17:26 liondog kernel: [12312.144417] Pid: 1930, comm: Xorg Not tainted 2.6.37-rc5-00063-g779049f #1
Dec 14 10:17:26 liondog kernel: [12312.144419] Call Trace:
Dec 14 10:17:26 liondog kernel: [12312.144424] [<ffffffff81068dc0>] ? print_irqtrace_events+0xd0/0xd4
Dec 14 10:17:26 liondog kernel: [12312.144429] [<ffffffff81030cc8>] __might_sleep+0x11c/0x120
Dec 14 ...The "sleeping function called from invalid context" part is
uninteresting - it's just a result of the real bug, which is the oops.
The oops gets printed out later, because the might_sleep() test just
happens to trigger first in the page faulting path.
So ignore the early part of the dmesg, the only thing relevant is the
kernel paging request in __lock_acquire.
Of course, that one is then followed rather closely by a GP fault in
the radeon driver, which I don't know is real or not. It could easily
be some bad interaction with the kmsg_dump() thing (a result of the
first page fault) interacting badly with the X server getting killed
by that bogus page fault. So it's hard to tell whether the GP fault is
real and maybe relevant for the discussion, or whether it's a result
That code disassembles to
0: 74 33 je 0x35
2: 80 3d 25 1b c7 00 01 cmpb $0x1,0xc71b25(%rip) # 0xc71b2e
9: 74 2a je 0x35
b: be a6 02 00 00 mov $0x2a6,%esi
10: 48 c7 c7 d1 8b 5e 81 mov $0xffffffff815e8bd1,%rdi
17: e8 af dc fc ff callq 0xfffffffffffcdccb
1c: c6 05 0b 1b c7 00 01 movb $0x1,0xc71b0b(%rip) # 0xc71b2e
23: eb 10 jmp 0x35
25: 49 89 d6 mov %rdx,%r14
28:* 49 8b 16 mov (%r14),%rdx <-- trapping instruction
2b: 49 39 ce cmp %rcx,%r14
2e: 0f 18 0a prefetcht0 (%rdx)
31: 75 bc jne 0xffffffffffffffef
33: eb 09 jmp 0x3e
which I think matches up with this:
movq (%rdx), %rsi # class_293->hash_entry.next, D.31652
cmpq %rcx, %rdx # hash_head, class
prefetcht0 (%rsi) # D.31652
jne .L449 #,
jmp .L560 #
which looks like one of the "list_for_each_entry()" uses of that
hash_head. In fact, building lockdep.lst and matching against that
0x2a6 constant (it's a warn_slowpath_null thing and is the ...Hm, that's something I wouldn't have seen from this perspective - I
would've started chasing the reason for triggering the might_sleep()
yes it does, __lock_acquire+0x19d = 0x4eb6, i.e.
4ea5: e8 00 00 00 00 callq 4eaa <__lock_acquire+0x191>
4eaa: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4eb1 <__lock_acquire+0x198>
4eb1: eb 10 jmp 4ec3 <__lock_acquire+0x1aa>
4eb3: 49 89 d6 mov %rdx,%r14
4eb6: 49 8b 16 mov (%r14),%rdx <-------------------
4eb9: 49 39 ce cmp %rcx,%r14
4ebc: 0f 18 0a prefetcht0 (%rdx)
4ebf: 75 bc jne 4e7d <__lock_acquire+0x164>
Also true:
hash_head = classhashentry(key);
ffffffff8106d17f: 48 89 c1 mov %rax,%rcx
ffffffff8106d182: 48 c1 e1 3f shl $0x3f,%rcx
ffffffff8106d186: 48 01 ca add %rcx,%rdx
ffffffff8106d189: 48 c1 ea 34 shr $0x34,%rdx
ffffffff8106d18d: 48 c1 e2 04 shl $0x4,%rdx
ffffffff8106d191: 48 8d 8a 00 00 00 00 lea 0x0(%rdx),%rcx
ffffffff8106d194: R_X86_64_32S .bss+0x5dfeb0
/*
* We can walk the hash lockfree, because the hash only
* grows, and we are careful when adding entries to the end:
*/
list_for_each_entry(class, hash_head, hash_entry) {
ffffffff8106d198: 4c 8b b2 00 00 00 00 mov 0x0(%rdx),%r14
ffffffff8106d19b: R_X86_64_32S .bss+0x5dfeb0
ffffffff8106d19f: eb 39 jmp ffffffff8106d1da <__lock_acquire+0x19d>
if (class->key == key) {
ffffffff8106d1a1: 49 39 46 20 cmp %rax,0x20(%r14)
ffffffff8106d1a5: 75 30 jne ffffffff8106d1d7 <__lock_acquire+0x19a>
WARN_ON_ONCE(class->name != lock->name);
ffffffff8106d1a7: 48 8b 43 18 mov 0x18(%rbx),%rax
ffffffff8106d1ab: 49 39 86 a0 01 00 00 cmp %rax,0x1a0(%r14)
ffffffff8106d1b2: 74 33 je ffffffff8106d1e7 ...Typically such crashes in lockdep are indeed indicative of a scribble, typically a use after free with slab poisoning. But I'll try and run through that code later today to see if I can spot anything. --
