Re: kernel exception in Fusion MPT base driver 3.04.06 (linux-2.6.24.3)

Previous thread: RFC: /dev/stdin, symlinks & permissions by Michael Tokarev on Monday, March 17, 2008 - 4:26 pm. (8 messages)

Next thread: vfree with spin_lock_bh by Jan Engelhardt on Monday, March 17, 2008 - 4:30 pm. (2 messages)
From: Sabuj Pattanayek
Date: Monday, March 17, 2008 - 4:29 pm

Hi,

I'm running (uname -a):

Linux porpoise 2.6.24.3 #11 Mon Mar 17 17:24:30 CDT 2008 ppc64
PPC970FX, altivec supported RackMac3,1 GNU/Linux

on an Apple Xserve G5 with the following Fibre Channel card (lspci
-vvv; it has two fibre connections):

0001:06:03.0 Fibre Channel: LSI Logic / Symbios Logic FC929X Fibre
Channel Adapter (rev 81)
        Subsystem: LSI Logic / Symbios Logic Unknown device 10d0
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Latency: 16 (16000ns min, 2500ns max), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 53
        Region 0: I/O ports at 0400 [size=256]
        Region 1: Memory at 90030000 (64-bit, non-prefetchable) [size=64K]
        Region 3: Memory at 90020000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at 90200000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+
Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [68] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=8
                Status: Dev=ff:1f.0 64bit+ 133MHz+ SCD- USC- DC=simple
DMMRBC=2048 DMOST=8 DMCRS=64 RSCEM- 266MHz- 533MHz-

0001:06:03.1 Fibre Channel: LSI Logic / Symbios Logic FC929X Fibre
Channel Adapter (rev 81)
        Subsystem: LSI Logic / Symbios Logic Unknown device 10d0
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Latency: 16 (16000ns min, 2500ns max), Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 53
        Region 0: I/O ports at <unassigned> [disabled]
        Region 1: Memory at 90010000 (64-bit, non-prefetchable)
[disabled] [size=64K]
        Region 3: ...
From: Andrew Morton
Date: Tuesday, March 18, 2008 - 12:25 am

And that's a totally, wildly different part of the kernel.  And it's a
code path which millions of machines run all the time.


Did any earlier kernel work OK?  If so, which?  2.6.23?

Thanks.
--

From: Michael Reed
Date: Tuesday, March 18, 2008 - 10:40 am

Fixing up Eric's email address.  It's no longer "lsil.com".


Does the exception happen every time?

--

From: Sabuj Pattanayek
Date: Wednesday, March 19, 2008 - 2:11 pm

Yes, and now I've tested older kernels:

Linux porpoise 2.6.20 #4 Wed Mar 19 15:32:32 CDT 2008 ppc64 PPC970FX,
altivec supported RackMac3,1 GNU/Linux

...and put another very similar LSI FC card into the box that is known
to be working (the card in the previous email was also known to be
working, but just to be sure...):


0001:06:03.0 Fibre Channel: LSI Logic / Symbios Logic FC929X Fibre
Channel Adapter (rev 81)
        Subsystem: LSI Logic / Symbios Logic Unknown device 10d0
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Latency: 16 (16000ns min, 2500ns max), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 53
        Region 0: I/O ports at 0400 [disabled] [size=256]
        Region 1: Memory at 90030000 (64-bit, non-prefetchable)
[disabled] [size=64K]
        Region 3: Memory at 90020000 (64-bit, non-prefetchable)
[disabled] [size=64K]
        Expansion ROM at 90200000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+
Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [68] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=8
                Status: Dev=ff:1f.0 64bit+ 133MHz+ SCD- USC- DC=simple
DMMRBC=2048 DMOST=8 DMCRS=64 RSCEM- 266MHz- 533MHz-


0001:06:03.1 Fibre Channel: LSI Logic / Symbios Logic FC929X Fibre
Channel Adapter (rev 81)
        Subsystem: LSI Logic / Symbios Logic Unknown device 10d0
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Latency: 16 (16000ns min, 2500ns max), Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 53
        Region 0: I/O ports at <unassigned> ...
From: Sabuj Pattanayek
Date: Wednesday, March 19, 2008 - 4:36 pm

Well, I just got the kernel to boot all the way with both 2.6.24.3 and
2.6.20, accidentally, by changing the FC host-side settings on all the
target Infortrend EonStor RAID boxes from loop to point-to-point. All
the targets and initiators are connected to a QLogic 5600 FC switch,
and zones have been set up such that the initiators can see only the
targets. This active zoning has not changed during testing.

At the time I was actually testing the EonStor boxes outside the FC
switch by connecting them directly to another Apple Xserve running OS
X, which uses the same LSI FC HBA that I reported in my first post
regarding this issue. After confirming that each target (LUN) could be
seen by the LSI FC HBA in the OS X Xserve, I reconnected the targets
to the switch. The Linux Xserve had been endlessly rebooting itself
after each kernel panic, so to my surprise I found that it had
completed booting! All targets and LUNs can be seen under "cat
/proc/scsi/scsi".
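
For anyone repeating this check across reboots, the "Host:" lines in that
file can be tallied with a few lines of Python. This is only a minimal
sketch: the SAMPLE text is made up for illustration (the vendor and model
strings are hypothetical, not output from this machine), and list_luns is
a name invented here.

```python
import re

# Hypothetical sample in the usual /proc/scsi/scsi layout; NOT real output
# from the system discussed in this thread.
SAMPLE = """\
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: EXAMPLE  Model: DISK-A           Rev: 0001
  Type:   Direct-Access                    ANSI  SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: EXAMPLE  Model: DISK-B           Rev: 0001
  Type:   Direct-Access                    ANSI  SCSI revision: 03
"""

# One "Host:" line introduces each attached device entry.
HOST_RE = re.compile(
    r"^Host:\s+(\S+)\s+Channel:\s+(\d+)\s+Id:\s+(\d+)\s+Lun:\s+(\d+)",
    re.MULTILINE,
)

def list_luns(text):
    """Return (host, channel, id, lun) tuples found in the given text."""
    return [(h, int(c), int(i), int(l)) for h, c, i, l in HOST_RE.findall(text)]

if __name__ == "__main__":
    for entry in list_luns(SAMPLE):
        print(entry)
```

On a live system the text would come from reading /proc/scsi/scsi instead
of SAMPLE; comparing the list before and after a target reset shows
whether any LUN went missing.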

To confirm that the loop host-side setting was causing the problem, I
set one of the EonStor boxes back to loop and reset it. Immediately
after the EonStor came up, it caused a kernel panic on the Linux Xserve
(which was already booted into the OS). This was the output on the
console:

porpoise login: Unrecoverable FP Unavailable Exception 800 at d000000000073da0
Oops: Unrecoverable FP Unavailable Exception, sig: 6 [#1]

Modules linked in: mptfc mptscsih mptbase
NIP: D000000000073DA0 LR: D0000000000675BC CTR: D000000000073DA0
REGS: c00000000076b4e0 TRAP: 0800   Not tainted  (2.6.20)
MSR: 9000000000009032 <EE,ME,IR,DR>  CR: 48004048  XER: 00000000
TASK = c000000000671420[0] 'swapper' THREAD: c000000000768000
GPR00: D000000000073DA0 C00000000076B760 D0000000000803D0 C00000000FA46000
GPR04: C000000001B81310 0000000000000053 C00000000076B833 9000000000049032
GPR08: C000000001B82800 D0000000000782F8 0000000000000000 0000000000000000
GPR12: D00000000006A2F0 C000000000671C80 000000000023FB28 C00000000058C6D8
GPR16: C000000000664DB0 ...
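
As a reading aid for the MSR line in the dump above, the value can be
decoded bit by bit. The sketch below assumes the standard 64-bit PowerPC
MSR bit positions (as defined in the kernel's asm/reg.h); decode_msr is a
name invented here, and the oops output itself prints only a subset of
these flags.

```python
# Bit positions of the 64-bit PowerPC Machine State Register (assumed to
# match the kernel's asm/reg.h definitions).
MSR_BITS = {
    63: "SF", 60: "HV", 15: "EE", 14: "PR", 13: "FP", 12: "ME",
    11: "FE0", 10: "SE", 9: "BE", 8: "FE1", 5: "IR", 4: "DR",
    2: "PMM", 1: "RI", 0: "LE",
}

def decode_msr(value):
    """Return the names of the MSR bits set in value, highest bit first."""
    return [
        name
        for bit, name in sorted(MSR_BITS.items(), reverse=True)
        if (value >> bit) & 1
    ]

if __name__ == "__main__":
    msr = 0x9000000000009032  # MSR value from the oops above
    flags = decode_msr(msr)
    print(flags)
    print("FP enabled:", "FP" in flags)
```

The FP bit is clear in this value, which is the condition under which
executing a floating-point instruction raises the 0x800 "FP Unavailable"
exception. That only describes the state at the time of the trap; it does
not by itself say why the CPU ended up executing such an instruction at
that address.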
From: Sabuj Pattanayek
Date: Wednesday, April 2, 2008 - 5:02 pm

Hi all,


After going through lots of hardware (HBAs, motherboards, PSUs) it was
determined that there was a problem with the PCI-X riser in this
particular system. Sorry about the false alarm but this one was
difficult to track down.

Thanks,
Sabuj
--
