(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).
On Thu, 11 Sep 2008 16:46:29 -0700 (PDT)
argh, death by wordwrapping.
I can't work out who called panic(), nor why.
The panic code called the kexec code which called mutex_trylock() which
called spin_lock_mutex() which then stupidly went and blurted a load of
debug stuff because of in_interrupt().
Something like this:
--- a/include/linux/debug_locks.h~a
+++ a/include/linux/debug_locks.h
@@ -17,7 +17,7 @@ extern int debug_locks_off(void);
 ({									\
 	int __ret = 0;							\
 									\
-	if (unlikely(c)) {						\
+	if (!oops_in_progress && unlikely(c)) {				\
 		if (debug_locks_off() && !debug_locks_silent)		\
 			WARN_ON(1);					\
 		__ret = 1;						\
_
might prevent the debugging code from preventing us from finding bugs :(
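
For anyone following along, a rough userspace sketch of what that one-line
change does. This is not the kernel header: oops_in_progress, debug_locks,
debug_locks_silent and debug_locks_off() are stand-in stubs whose behaviour
is assumed, warn_count is a hypothetical counter standing in for WARN_ON(1),
and the macro needs the GNU statement-expression extension (gcc or clang).

```c
/* Stand-ins for kernel state. Names mirror the patch; behaviour is assumed. */
static int oops_in_progress;	/* nonzero while a panic/oops is being handled */
static int debug_locks = 1;	/* lock debugging currently enabled */
static int debug_locks_silent;	/* when set, suppress the warning itself */
static int warn_count;		/* hypothetical: counts would-be WARN_ON(1) hits */

/* Switch lock debugging off; return 1 only if this call did the switching. */
static int debug_locks_off(void)
{
	int was_on = debug_locks;

	debug_locks = 0;
	return was_on;
}

#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Same shape as the patched macro, with WARN_ON(1) replaced by a counter. */
#define DEBUG_LOCKS_WARN_ON(c)						\
({									\
	int __ret = 0;							\
									\
	if (!oops_in_progress && unlikely(c)) {				\
		if (debug_locks_off() && !debug_locks_silent)		\
			warn_count++;	/* kernel: WARN_ON(1) */	\
		__ret = 1;						\
	}								\
	__ret;								\
})
```

With oops_in_progress set, the check evaluates to 0 whatever the condition
is, so nothing is printed and lock debugging is not switched off. That is
the whole point: the panic path gets to finish without extra noise.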
--
It might be a regression. ;)  The last build we were running on this hardware
was 2.6.24.2 and NMI watchdog support was not enabled.  We were however
experiencing random deadlocks, which I had been attributing to problems with
forcedeth.c (which causes the NIC to totally crap out but not deadlock the
machine) but I am now of the mind that there are

One more data point.  We booted this kernel on 14 machines this morning

Do you want me to give that patch a try or sit tight for a bit?

-J
--
--
It'd be good if you can try it please, see if we can get a cleaner trace. --
agreed - applied your fix in the form below to tip/master - thanks Andrew.

J, you might want to try tip/master, it includes all known fixes for this
area and this debug improvement as well. You can pick it up via:

   http://people.redhat.com/mingo/tip.git/README

	Ingo

---------->
From 53b9d87f41a3d8838210ad7cdef02d814817ce85 Mon Sep 17 00:00:00 2001
From: Andrew Morton <akpm@linux-foundation.org>
Date: Thu, 11 Sep 2008 17:02:58 -0700
Subject: [PATCH] lock debug: sit tight when we are already in a panic

in:

> http://bugzilla.kernel.org/show_bug.cgi?id=11543

The panic code called the kexec code which called mutex_trylock() which
called spin_lock_mutex() which then stupidly went and blurted a load of
debug stuff because of in_interrupt().

Keep the lock debug code from escallating an already crappy situation.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/debug_locks.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/debug_locks.h b/include/linux/debug_locks.h
index 4aaa4af..096476f 100644
--- a/include/linux/debug_locks.h
+++ b/include/linux/debug_locks.h
@@ -17,7 +17,7 @@ extern int debug_locks_off(void);
 ({									\
 	int __ret = 0;							\
 									\
-	if (unlikely(c)) {						\
+	if (!oops_in_progress && unlikely(c)) {				\
 		if (debug_locks_off() && !debug_locks_silent)		\
 			WARN_ON(1);					\
 		__ret = 1;						\
--
I just rolled out -rc5 from netdev + Andrew's debug patch + the HPET patch
Thomas pointed me at.  I'll let it roast on these 14 machines in production
over the weekend to see if we get another panic.

I'm attaching the dmesg from this kernel.  We're still getting the NMI
watchdog warning and the rtc is [still] hosed (I think it was last working
around -rc3).

-J
--
Since Friday 2 different machines have experienced crashes. One was a total
deadlock with no response on the console. The other one reported the trace
below on the console and stopped responding to ssh but I was able to log in via
the serial console and reboot the system. This particular system has had a
number of "odd" kernel traces over the last year and I'm starting to actually
wonder if it may have a bad DIMM in it as occasionally the failure mode seems
to be different than the deadlocks/etc. we see in the other 15 nodes with
identical hardware.
[30712.654542] general protection fault: 0000 [1] SMP
<Sep/12 09:25 pm>[30712.657678] CPU 3
<Sep/12 09:25 pm>[30712.657678] Modules linked in: w83627hf hwmon_vid autofs4 smsc37b787_wdt k8temp i2c_nforce2 i2c_core forcedeth tg3 libphy e1000 xfs dm_snapshot dm_mirror dm_log aacraid 3w_9xxx 3w_xxxx atp870u arcmsr aic7xxx scsi_wait_scan
<Sep/12 09:25 pm>[30712.657678] Pid: 1178, comm: rpciod/3 Not tainted 2.6.27-rc5-22033-gd26acd9-dirty #2
<Sep/12 09:25 pm>[30712.657678] RIP: 0010:[<ffffffff805ac57d>] [<ffffffff805ac57d>] rpc_count_iostats+0x35/0xb8
<Sep/12 09:25 pm>[30712.657678] RSP: 0018:ffff88012e5d1e08 EFLAGS: 00010206
<Sep/12 09:25 pm>[30712.657678] RAX: ffffffff807adb48 RBX: ffff880126d61088 RCX: 0400000000000000
<Sep/12 09:25 pm>[30712.657678] RDX: ffff88022bcc0380 RSI: ffff88022bcc0000 RDI: ffff880126d61088
<Sep/12 09:25 pm>[30712.657678] RBP: ffff88022bc88000 R08: 0000000000000003 R09: ffff88022e038f10
<Sep/12 09:25 pm>[30712.657678] R10: 0000000000000001 R11: ffff88012e4a0048 R12: 0400000000000000
<Sep/12 09:25 pm>[30712.657678] R13: ffffffff8059f7a8 R14: ffff88022bc88610 R15: 0000000000000000
<Sep/12 09:25 pm>[30712.657678] FS: 00007f006fb306f0(0000) GS:ffff88022fa0d780(0000) knlGS:0000000000000000
<Sep/12 09:25 pm>[30712.657678] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<Sep/12 09:25 pm>[30712.657678] CR2: 00000000011e3018 CR3: 00000001f3527000 CR4:<Sep/12 09:25 pm>
<Sep/12 09:25 pm>[30712.657678] ...

In addition to the deadlocks, we still have the watchdog warning:

[ 0.460034] Testing NMI watchdog ...
[ 0.532557] WARNING: CPU#0: NMI appears to be stuck (0->0)!
[ 0.533301] Please report this to bugzilla.kernel.org,
[ 0.536635] and attach the output of the 'dmesg' command.

Perhaps an HPET problem:

[ 0.993800] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 31
[ 0.999969] hpet0: 3 32-bit timers, 25000000 Hz
[ 1.004396] ACPI: RTC can wake from S4
[ 1.006637] Clockevents: could not switch to one-shot mode:<6>Clockevents: could not switch to one-shot mode: lapic is not functional.
[ 1.009844] Could not switch to high resolution mode on CPU 3
[ 1.009848] Clockevents: could not switch to one-shot mode: lapic is not functional.
[ 1.009852] Could not switch to high resolution mode on CPU 2
[ 1.009855] Clockevents: could not switch to one-shot mode: lapic is not functional.
[ 1.009858] Could not switch to high resolution mode on CPU 1
[ 1.009969] lapic is not functional.
[ 1.056944] Could not switch to high resolution mode on CPU 0

And a failure to create a /dev/rtc[0] device with udev 115 or 119 and this
entry in the dmesg.

[ 7.498900] drivers/rtc/hctosys.c: unable to open rtc device (rtc0)

-J
--
--
Can you try nmi_watchdog=2 ? Thanks, tglx --
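
After rebooting with a different nmi_watchdog= setting, one way to see
whether NMIs are arriving at all is to watch the NMI row of /proc/interrupts.
The following is only a sketch under assumptions: sum_irq_row() and
read_nmi_total() are hypothetical helper names, the row is assumed to look
like "NMI: <count per CPU> ... <description>", and a line is assumed to fit
in 4096 bytes.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts row, e.g.
 *   "NMI:   12   9   11   10   Non-maskable interrupts"
 * Returns -1 if the line does not start with the given label.
 */
static long sum_irq_row(const char *line, const char *label)
{
	size_t len = strlen(label);
	const char *p = line;
	long total = 0;

	while (isspace((unsigned char)*p))
		p++;
	if (strncmp(p, label, len) != 0)
		return -1;
	p += len;

	for (;;) {
		char *end;

		while (isspace((unsigned char)*p))
			p++;
		if (!isdigit((unsigned char)*p))
			break;	/* reached the description or end of line */
		total += strtol(p, &end, 10);
		p = end;
	}
	return total;
}

/* Return the summed NMI count from the given file, or -1 if unavailable. */
static long read_nmi_total(const char *path)
{
	char line[4096];	/* assumed large enough for one row */
	long total = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		total = sum_irq_row(line, "NMI:");
		if (total >= 0)
			break;
	}
	fclose(f);
	return total;
}
```

Calling read_nmi_total("/proc/interrupts") twice, a few seconds apart, and
comparing the results gives a rough picture. A total that never moves would
be consistent with the "(0->0)" in the boot-time warning. In local APIC mode
the count may grow only slowly on an idle box, so a small delta is not
necessarily a problem.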
[Thomas Gleixner - Tue, Sep 16, 2008 at 07:14:40AM -0700]
| On Mon, 15 Sep 2008, Joshua Hoblitt wrote:
|
| > In addition to the deadlocks, we still have the watchdog warning:
| >
| > [ 0.460034] Testing NMI watchdog ...
| > [ 0.532557] WARNING: CPU#0: NMI appears to be stuck (0->0)!
| > [ 0.533301] Please report this to bugzilla.kernel.org,
| > [ 0.536635] and attach the output of the 'dmesg' command.
| >
| > Perhaps an HPET problem:
| >
| > [ 0.993800] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 31
| > [ 0.999969] hpet0: 3 32-bit timers, 25000000 Hz
| > [ 1.004396] ACPI: RTC can wake from S4
| > [ 1.006637] Clockevents: could not switch to one-shot
| > mode:<6>Clockevents: could not switch to one-shot mode: lapic is not functional.
| > [ 1.009844] Could not switch to high resolution mode on CPU 3
| > [ 1.009848] Clockevents: could not switch to one-shot mode: lapic is not functional.
| > [ 1.009852] Could not switch to high resolution mode on CPU 2
| > [ 1.009855] Clockevents: could not switch to one-shot mode: lapic is not functional.
| > [ 1.009858] Could not switch to high resolution mode on CPU 1
| > [ 1.009969] lapic is not functional.
| > [ 1.056944] Could not switch to high resolution mode on CPU 0
|
| No, that's documented behaviour:
|
| > > > [ 0.126660] APIC timer registered as dummy, due to nmi_watchdog=1!
|
| Can you try nmi_watchdog=2 ?
|
| Thanks,
|
| tglx
|

And give apic=debug a try too please. I remember there
was a problem with SB600 on ACPI side (but they should
be already fixed)

- Cyrill -
--
[Cyrill Gorcunov - Tue, Sep 16, 2008 at 09:56:29PM +0400]
...
|
| And give apic=debug a try too please. I remember there
| was a problem with SB600 on ACPI side (but they should
| be already fixed)
|
| - Cyrill -

Sorry Thomas, I meant to send the message to Joshua

- Cyrill -
--
