Re: [Bugme-new] [Bug 11543] New: kernel panic: softlockup in tick_periodic() ???

From: Andrew Morton
Date: Thursday, September 11, 2008 - 5:02 pm

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 11 Sep 2008 16:46:29 -0700 (PDT)


argh, death by wordwrapping.

I can't work out who called panic(), nor why.

The panic code called the kexec code which called mutex_trylock() which
called spin_lock_mutex() which then stupidly went and blurted a load of
debug stuff because of in_interrupt().

Something like this:

--- a/include/linux/debug_locks.h~a
+++ a/include/linux/debug_locks.h
@@ -17,7 +17,7 @@ extern int debug_locks_off(void);
 ({									\
 	int __ret = 0;							\
 									\
-	if (unlikely(c)) {						\
+	if (!oops_in_progress && unlikely(c)) {				\
 		if (debug_locks_off() && !debug_locks_silent)		\
 			WARN_ON(1);					\
 		__ret = 1;						\
_

might prevent the debugging code from preventing us from finding bugs :(
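
For anyone following along, here is a rough user-space sketch of what the
patched check does. It is only a model under stated assumptions: the globals
are plain ints standing in for the kernel's, debug_locks_off() is reduced to
"clear the flag and report whether it was set", warnings_emitted is a
hypothetical counter standing in for WARN_ON(1), and the macro is written as
an ordinary function called debug_locks_warn_on().

```c
#include <stdio.h>

/* Stand-ins for kernel state; names follow the patch, behaviour is simplified. */
static int oops_in_progress;	/* nonzero once a panic/oops has started */
static int debug_locks = 1;	/* lock debugging currently enabled */
static int debug_locks_silent;	/* when set, suppress the warning */
static int warnings_emitted;	/* hypothetical counter replacing WARN_ON(1) */

/* Simplified model: turn lock debugging off, return 1 if it was on before. */
static int debug_locks_off(void)
{
	int was_on = debug_locks;

	debug_locks = 0;
	return was_on;
}

/* Function form of the patched DEBUG_LOCKS_WARN_ON(c) (assumption: no unlikely()). */
static int debug_locks_warn_on(int c)
{
	int ret = 0;

	/* The new !oops_in_progress test skips the whole body during a panic. */
	if (!oops_in_progress && c) {
		if (debug_locks_off() && !debug_locks_silent) {
			warnings_emitted++;
			fprintf(stderr, "lock debug warning\n");
		}
		ret = 1;
	}
	return ret;
}
```

In this model, a true condition warns once and returns 1 in normal
operation, but returns 0 without printing anything once oops_in_progress is
set, so callers that branch on the result also see "no problem" during a
panic.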

--

From: j_kernel
Date: Thursday, September 11, 2008 - 7:54 pm

It might be a regression. ;) The last build we were running on this
hardware was 2.6.24.2 and NMI watchdog support was not enabled.  We were
however experiencing random deadlocks, which I had been attributing to
problems with forcedeth.c (which causes the NIC to totally crap out
but not deadlock the machine) but I am now of the mind that there are

One more data point.  We booted this kernel on 14 machines this morning

Do you want me to give that patch a try or sit tight for a bit?

-J

--

From: Andrew Morton
Date: Thursday, September 11, 2008 - 7:57 pm

It'd be good if you could try it, please, to see if we can get a cleaner
trace.
--

From: Ingo Molnar
Date: Friday, September 12, 2008 - 2:13 am

agreed - applied your fix in the form below to tip/master - thanks 
Andrew.

J, you might want to try tip/master, it includes all known fixes for 
this area and this debug improvement as well. You can pick it up via:

  http://people.redhat.com/mingo/tip.git/README

	Ingo

---------->
From 53b9d87f41a3d8838210ad7cdef02d814817ce85 Mon Sep 17 00:00:00 2001
From: Andrew Morton <akpm@linux-foundation.org>
Date: Thu, 11 Sep 2008 17:02:58 -0700
Subject: [PATCH] lock debug: sit tight when we are already in a panic

in:

  > http://bugzilla.kernel.org/show_bug.cgi?id=11543

The panic code called the kexec code which called mutex_trylock() which
called spin_lock_mutex() which then stupidly went and blurted a load of
debug stuff because of in_interrupt().

Keep the lock debug code from escalating an already crappy situation.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/debug_locks.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/debug_locks.h b/include/linux/debug_locks.h
index 4aaa4af..096476f 100644
--- a/include/linux/debug_locks.h
+++ b/include/linux/debug_locks.h
@@ -17,7 +17,7 @@ extern int debug_locks_off(void);
 ({									\
 	int __ret = 0;							\
 									\
-	if (unlikely(c)) {						\
+	if (!oops_in_progress && unlikely(c)) {				\
 		if (debug_locks_off() && !debug_locks_silent)		\
 			WARN_ON(1);					\
 		__ret = 1;						\
--

From: Joshua Hoblitt
Date: Friday, September 12, 2008 - 5:13 pm

I just rolled out -rc5 from netdev + Andrew's debug patch + the HPET
patch Thomas pointed me at.  I'll let it roast on these 14 machines in
production over the weekend to see if we get another panic.

I'm attaching the dmesg from this kernel.  We're still getting the NMI
watchdog warning and the rtc is [still] hosed (I think it was last
working around -rc3).

-J

--
From: Joshua Hoblitt
Date: Monday, September 15, 2008 - 2:06 pm

Since Friday, two different machines have crashed.  One was a total deadlock
with no response on the console.  The other reported the trace below on the
console and stopped responding to ssh, but I was able to log in via the serial
console and reboot the system.  This particular system has had a number of
"odd" kernel traces over the last year, and I'm starting to wonder whether it
has a bad DIMM, as its failure mode occasionally seems to be different than
the deadlocks/etc. we see in the other 15 nodes with identical hardware.

[30712.654542] general protection fault: 0000 [1] SMP 
<Sep/12 09:25 pm>[30712.657678] CPU 3 
<Sep/12 09:25 pm>[30712.657678] Modules linked in: w83627hf hwmon_vid autofs4 smsc37b787_wdt k8temp i2c_nforce2 i2c_core forcedeth tg3 libphy e1000 xfs dm_snapshot dm_mirror dm_log aacraid 3w_9xxx 3w_xxxx atp870u arcmsr aic7xxx scsi_wait_scan
<Sep/12 09:25 pm>[30712.657678] Pid: 1178, comm: rpciod/3 Not tainted 2.6.27-rc5-22033-gd26acd9-dirty #2
<Sep/12 09:25 pm>[30712.657678] RIP: 0010:[<ffffffff805ac57d>]  [<ffffffff805ac57d>] rpc_count_iostats+0x35/0xb8
<Sep/12 09:25 pm>[30712.657678] RSP: 0018:ffff88012e5d1e08  EFLAGS: 00010206
<Sep/12 09:25 pm>[30712.657678] RAX: ffffffff807adb48 RBX: ffff880126d61088 RCX: 0400000000000000
<Sep/12 09:25 pm>[30712.657678] RDX: ffff88022bcc0380 RSI: ffff88022bcc0000 RDI: ffff880126d61088
<Sep/12 09:25 pm>[30712.657678] RBP: ffff88022bc88000 R08: 0000000000000003 R09: ffff88022e038f10
<Sep/12 09:25 pm>[30712.657678] R10: 0000000000000001 R11: ffff88012e4a0048 R12: 0400000000000000
<Sep/12 09:25 pm>[30712.657678] R13: ffffffff8059f7a8 R14: ffff88022bc88610 R15: 0000000000000000
<Sep/12 09:25 pm>[30712.657678] FS:  00007f006fb306f0(0000) GS:ffff88022fa0d780(0000) knlGS:0000000000000000
<Sep/12 09:25 pm>[30712.657678] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<Sep/12 09:25 pm>[30712.657678] CR2: 00000000011e3018 CR3: 00000001f3527000 CR4:
<Sep/12 09:25 pm>[30712.657678] ...
--

From: Joshua Hoblitt
Date: Monday, September 15, 2008 - 7:54 pm

In addition to the deadlocks, we still have the watchdog warning:

[    0.460034] Testing NMI watchdog ... 
[    0.532557] WARNING: CPU#0: NMI appears to be stuck (0->0)!
[    0.533301] Please report this to bugzilla.kernel.org,
[    0.536635] and attach the output of the 'dmesg' command.

Perhaps an HPET problem:

[    0.993800] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 31
[    0.999969] hpet0: 3 32-bit timers, 25000000 Hz
[    1.004396] ACPI: RTC can wake from S4
[    1.006637] Clockevents: could not switch to one-shot
mode:<6>Clockevents: could not switch to one-shot mode: lapic is not functional.
[    1.009844] Could not switch to high resolution mode on CPU 3
[    1.009848] Clockevents: could not switch to one-shot mode: lapic is not functional.
[    1.009852] Could not switch to high resolution mode on CPU 2
[    1.009855] Clockevents: could not switch to one-shot mode: lapic is not functional.
[    1.009858] Could not switch to high resolution mode on CPU 1
[    1.009969]  lapic is not functional.
[    1.056944] Could not switch to high resolution mode on CPU 0

And a failure to create a /dev/rtc[0] device with udev 115 or 119 and
this entry in the dmesg.

[    7.498900] drivers/rtc/hctosys.c: unable to open rtc device (rtc0)

-J

--

From: Thomas Gleixner
Date: Tuesday, September 16, 2008 - 7:14 am

Can you try nmi_watchdog=2 ?

Thanks,

	tglx
--

From: Cyrill Gorcunov
Date: Tuesday, September 16, 2008 - 10:56 am

[Thomas Gleixner - Tue, Sep 16, 2008 at 07:14:40AM -0700]
| On Mon, 15 Sep 2008, Joshua Hoblitt wrote:
| 
| > In addition to the deadlocks, we still have the watchdog warning:
| > 
| > [    0.460034] Testing NMI watchdog ... 
| > [    0.532557] WARNING: CPU#0: NMI appears to be stuck (0->0)!
| > [    0.533301] Please report this to bugzilla.kernel.org,
| > [    0.536635] and attach the output of the 'dmesg' command.
| > 
| > Perhaps an HPET problem:
| > 
| > [    0.993800] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 31
| > [    0.999969] hpet0: 3 32-bit timers, 25000000 Hz
| > [    1.004396] ACPI: RTC can wake from S4
| > [    1.006637] Clockevents: could not switch to one-shot
| > mode:<6>Clockevents: could not switch to one-shot mode: lapic is not functional.
| > [    1.009844] Could not switch to high resolution mode on CPU 3
| > [    1.009848] Clockevents: could not switch to one-shot mode: lapic is not functional.
| > [    1.009852] Could not switch to high resolution mode on CPU 2
| > [    1.009855] Clockevents: could not switch to one-shot mode: lapic is not functional.
| > [    1.009858] Could not switch to high resolution mode on CPU 1
| > [    1.009969]  lapic is not functional.
| > [    1.056944] Could not switch to high resolution mode on CPU 0
| 
| No, that's documented behaviour:
| 
| > > > [    0.126660] APIC timer registered as dummy, due to nmi_watchdog=1!
| 
| Can you try nmi_watchdog=2 ?
| 
| Thanks,
| 
| 	tglx
| 

And give apic=debug a try too, please. I remember there
was a problem with SB600 on the ACPI side (but it should
already be fixed).

		- Cyrill -
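
When switching boot parameters across a batch of machines, it can help to
confirm which nmi_watchdog mode each one actually booted with. Below is a
small sketch, assuming the command line has already been read into a string
(for example from /proc/cmdline). nmi_watchdog_mode() is a hypothetical
helper, not a kernel or libc API. As far as I recall from the x86
documentation of that era, nmi_watchdog=1 selects the I/O APIC watchdog and
nmi_watchdog=2 the local APIC one.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return the integer value of "nmi_watchdog=<n>" found in a kernel command
 * line, or -1 if the parameter is absent. Hypothetical helper for
 * illustration; it only understands a plain decimal value.
 */
static int nmi_watchdog_mode(const char *cmdline)
{
	static const char key[] = "nmi_watchdog=";
	const size_t keylen = sizeof(key) - 1;
	const char *p = cmdline;

	while ((p = strstr(p, key)) != NULL) {
		/* Accept a match only at the start of a space-separated token. */
		if (p == cmdline || p[-1] == ' ')
			return atoi(p + keylen);
		p += keylen;
	}
	return -1;
}
```

A result of 1 would match the "APIC timer registered as dummy, due to
nmi_watchdog=1!" line quoted above; after the suggested change it should
report 2.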
--

From: Cyrill Gorcunov
Date: Tuesday, September 16, 2008 - 10:57 am

[Cyrill Gorcunov - Tue, Sep 16, 2008 at 09:56:29PM +0400]
...
| 
| And give apic=debug a try too, please. I remember there
| was a problem with SB600 on the ACPI side (but it should
| already be fixed).
| 
| 		- Cyrill -

Sorry Thomas, I meant to send the message to Joshua

		- Cyrill -
--