Ok. I was just so happy (see "Status update on Sun Neptune 10Gbit driver" earlier). But then it "blew up" again:

Mar 26 13:25:49 hest kernel: [25335.505049] ------------[ cut here ]------------
Mar 26 13:25:49 hest kernel: [25335.505055] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x1fd/0x210()
Mar 26 13:25:49 hest kernel: [25335.505057] Hardware name: Sun Fire X4600 M2
Mar 26 13:25:49 hest kernel: [25335.505059] NETDEV WATCHDOG: eth4 (niu): transmit timed out
Mar 26 13:25:49 hest kernel: [25335.505060] Modules linked in: af_packet ext4 jbd2 crc16 nfsd exportfs autofs4 nfs lockd auth_rpcgss sunrpc iptable_filter ip_tables x_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 parport_pc lp parport loop sr_mod joydev psmouse niu usb_storage usbhid i2c_nforce2 libusual hid serio_raw pcspkr shpchp k8temp pci_hotplug i2c_core button evdev ext3 jbd mbcache ide_cd_mod cdrom sg sd_mod ata_generic libata mptsas mptspi mptscsih qla2xxx mptbase scsi_transport_sas scsi_transport_fc ehci_hcd scsi_transport_spi ohci_hcd e1000 scsi_mod amd74xx usbcore dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys fuse
Mar 26 13:25:49 hest kernel: [25335.505109] Pid: 0, comm: swapper Not tainted 2.6.29 #30
Mar 26 13:25:49 hest kernel: [25335.505111] Call Trace:
Mar 26 13:25:49 hest kernel: [25335.505113] <IRQ> [<ffffffff8023d5c2>] warn_slowpath+0xf2/0x130
Mar 26 13:25:49 hest kernel: [25335.505124] [<ffffffff80239d2d>] task_tick_fair+0x4d/0xd0
Mar 26 13:25:49 hest kernel: [25335.505130] [<ffffffff80355e33>] cpumask_next_and+0x23/0x40
Mar 26 13:25:49 hest kernel: [25335.505132] [<ffffffff80233f84>] find_busiest_group+0x204/0x870
Mar 26 13:25:49 hest kernel: [25335.505136] [<ffffffff8035b65e>] strlcpy+0x4e/0x80
Mar 26 13:25:49 hest kernel: [25335.505138] [<ffffffff8041f11d>] dev_watchdog+0x1fd/0x210
Mar 26 13:25:49 hest kernel: [25335.505141] ...
There was actually a bit more in the log:

Mar 26 13:25:49 hest kernel: [25335.505176] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:25:49 hest kernel: [25335.587191] niu 0000:84:00.0: niu: eth4: bits (40000000) of register RXDMA_CFIG1 would not clear, val[c0000000]
Mar 26 13:25:49 hest last message repeated 4 times
Mar 26 13:25:58 hest kernel: [25345.504898] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:08 hest kernel: [25355.504758] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:13 hest kernel: [25360.504687] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:18 hest kernel: [25365.504619] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:23 hest kernel: [25370.504549] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:28 hest kernel: [25375.504479] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:33 hest kernel: [25380.504409] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:38 hest kernel: [25385.504340] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting

This is probably the interesting part:

Mar 26 13:25:49 hest kernel: [25335.587191] niu 0000:84:00.0: niu: eth4: bits (40000000) of register RXDMA_CFIG1 would not clear, val[c0000000]

Any suggestions? Is this perhaps just broken hardware, or a driver issue? (I had the Sun nxge driver working for around 180 days on the same card, so I would assume the hardware is OK.)

Jesper
--
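For readers following along, the arithmetic behind the "would not clear" message can be sketched in a few lines of Python. This is only an illustration of the mask check implied by the log line (the driver waited for the bits in 40000000 to read back as zero and saw c0000000 instead). The function names are made up for this sketch, and what each bit means in hardware is not assumed here.

```python
def stuck_bits(requested_mask, register_value):
    """Return the bits of requested_mask that are still set in register_value."""
    return requested_mask & register_value


def set_bit_positions(value):
    """List the positions (0 = least significant) of all bits set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# Values copied from the log line; the names are hypothetical.
mask = 0x40000000  # bits the driver waited on
val = 0xC0000000   # value the register still held

print(hex(stuck_bits(mask, val)))  # bits from the mask that never cleared
print(set_bit_positions(val))      # every bit set in the register value
```

So bit 30 (the one being waited on) stayed set, and bit 31 was set as well.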
Jesper,

One of the RX ring DMAs is failing to reset. I guess whatever is hanging the TX side is affecting the RX side as well. Can you run lspci on the function and its siblings?

Regards
--
Like this (please tell me if this isn't the lspci output you wanted):
k# lspci -vvv -s 84:00
84:00.0 Ethernet controller: Sun Microsystems Computer Corp.
Multithreaded 10 Gigabit Ethernet Network Controller (rev 01)
Subsystem: Sun Microsystems Computer Corp. Unknown device 0000
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 43
Region 0: Memory at fd000000 (64-bit, non-prefetchable) [size=16M]
Region 2: Memory at fe9f8000 (64-bit, non-prefetchable) [size=32K]
Region 4: Memory at fe9f0000 (64-bit, non-prefetchable) [size=32K]
Expansion ROM at fe800000 [disabled] [size=1M]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] Message Signalled Interrupts: Mask+ 64bit+
Queue=0/5 Enable-
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable+ Mask- TabSize=32
Vector table: BAR=2 offset=00000000
PBA: BAR=2 offset=00004000
Capabilities: [80] Express Endpoint IRQ 0
Device: Supported: MaxPayload 1024 bytes, PhantFunc 0,
ExtTag-
Device: Latency L0s <4us, L1 <8us
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 1
Link: Latency L0s <512ns, L1 ...

Jesper,

I was wondering if you can get the register dump just after the NIC hangs:

lspci -vvv -xxx -s 84:0

Regards
--
I will try to do that, but it more or less involves putting a known-bad driver into production and waiting X days (where X is usually more than 2 and less than 7). So if there is more debugging code that would be helpful to have in the driver/kernel, it would be preferable to get it in at the same time, to reduce the number of trial-and-error cycles.

--
Jesper
--
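Since the hang takes days to show up, one way to avoid missing the moment is to have a small script watch the kernel log and take the requested dump as soon as the first timeout appears. The following is a rough sketch, not something from this thread: the function names are hypothetical, the regular expression is modelled only on the log lines quoted above, and it assumes lspci is in PATH and the script runs as root (needed for -xxx).

```python
import re
import subprocess

# Modelled on lines such as:
#   niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
TIMEOUT_RE = re.compile(
    r"niu (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"niu: (?P<iface>\w+): Transmit timed out"
)


def find_timeout(line):
    """Return (pci_address, interface) if the line is a niu TX timeout, else None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return match.group("addr"), match.group("iface")


def dump_command(pci_addr):
    """Build the lspci command for the device, with the function number dropped
    so that sibling functions on the same slot are dumped too."""
    slot = pci_addr.rsplit(".", 1)[0]
    return ["lspci", "-vvv", "-xxx", "-s", slot]


def watch(stream, out_path):
    """Read log lines from stream; on the first timeout, save a dump and stop."""
    for line in stream:
        hit = find_timeout(line)
        if hit is None:
            continue
        result = subprocess.run(dump_command(hit[0]), capture_output=True, text=True)
        with open(out_path, "w") as out:
            out.write(line)           # the log line that triggered the dump
            out.write(result.stdout)  # the register dump itself
        return hit
    return None
```

It could be fed from something like `tail -F` on the kernel log, by calling `watch(sys.stdin, "niu-dump.txt")`. The log location and output file name are placeholders to adapt to the machine.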
