niu driver - Transmit timed out - 2.6.29

Previous thread: [GIT]: Networking for 2.6.30 by David Miller on Thursday, March 26, 2009 - 2:07 am. (15 messages)

Next thread: ipv6: Plug sk_buff leak in ipv6_rcv (net/ipv6/ip6_input.c) by Jesper Nilsson on Thursday, March 26, 2009 - 5:38 am. (2 messages)
From: Jesper Krogh
Date: Thursday, March 26, 2009 - 5:44 am

Ok. I was just so happy .. (See "Status update on Sun Neptune 10Gbit 
driver" earlier).

But then it "blew up" again:

Mar 26 13:25:49 hest kernel: [25335.505049] ------------[ cut here ]------------
Mar 26 13:25:49 hest kernel: [25335.505055] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x1fd/0x210()
Mar 26 13:25:49 hest kernel: [25335.505057] Hardware name: Sun Fire X4600 M2
Mar 26 13:25:49 hest kernel: [25335.505059] NETDEV WATCHDOG: eth4 (niu): transmit timed out
Mar 26 13:25:49 hest kernel: [25335.505060] Modules linked in: af_packet ext4 jbd2 crc16 nfsd exportfs autofs4 nfs lockd auth_rpcgss sunrpc iptable_filter ip_tables x_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 parport_pc lp parport loop sr_mod joydev psmouse niu usb_storage usbhid i2c_nforce2 libusual hid serio_raw pcspkr shpchp k8temp pci_hotplug i2c_core button evdev ext3 jbd mbcache ide_cd_mod cdrom sg sd_mod ata_generic libata mptsas mptspi mptscsih qla2xxx mptbase scsi_transport_sas scsi_transport_fc ehci_hcd scsi_transport_spi ohci_hcd e1000 scsi_mod amd74xx usbcore dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys fuse
Mar 26 13:25:49 hest kernel: [25335.505109] Pid: 0, comm: swapper Not tainted 2.6.29 #30
Mar 26 13:25:49 hest kernel: [25335.505111] Call Trace:
Mar 26 13:25:49 hest kernel: [25335.505113]  <IRQ>  [<ffffffff8023d5c2>] warn_slowpath+0xf2/0x130
Mar 26 13:25:49 hest kernel: [25335.505124]  [<ffffffff80239d2d>] task_tick_fair+0x4d/0xd0
Mar 26 13:25:49 hest kernel: [25335.505130]  [<ffffffff80355e33>] cpumask_next_and+0x23/0x40
Mar 26 13:25:49 hest kernel: [25335.505132]  [<ffffffff80233f84>] find_busiest_group+0x204/0x870
Mar 26 13:25:49 hest kernel: [25335.505136]  [<ffffffff8035b65e>] strlcpy+0x4e/0x80
Mar 26 13:25:49 hest kernel: [25335.505138]  [<ffffffff8041f11d>] dev_watchdog+0x1fd/0x210
Mar 26 13:25:49 hest kernel: [25335.505141]  ...
From: Jesper Krogh
Date: Friday, March 27, 2009 - 12:31 pm

There was actually a bit more in the log:

Mar 26 13:25:49 hest kernel: [25335.505176] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:25:49 hest kernel: [25335.587191] niu 0000:84:00.0: niu: eth4: bits (40000000) of register RXDMA_CFIG1 would not clear, val[c0000000]
Mar 26 13:25:49 hest last message repeated 4 times
Mar 26 13:25:58 hest kernel: [25345.504898] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:08 hest kernel: [25355.504758] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:13 hest kernel: [25360.504687] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:18 hest kernel: [25365.504619] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:23 hest kernel: [25370.504549] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:28 hest kernel: [25375.504479] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:33 hest kernel: [25380.504409] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
Mar 26 13:26:38 hest kernel: [25385.504340] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting
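
For anyone following along, the spacing of the reset attempts can be read straight off the bracketed kernel timestamps. The sketch below is only an illustration: the function name and regular expression are made up for this example, and it assumes each log entry has been joined back onto a single line.

```python
import re

# Hypothetical helper, not part of any driver or tool.
# Captures the kernel uptime stamp on niu "Transmit timed out, resetting" lines.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s+niu .*Transmit timed out, resetting")

def reset_intervals(lines):
    """Return the gaps in seconds (rounded to 0.1 s) between successive reset messages."""
    stamps = [float(m.group(1)) for line in lines if (m := STAMP.search(line))]
    return [round(b - a, 1) for a, b in zip(stamps, stamps[1:])]

# First four reset messages from the log above, one entry per line.
LOG = [
    "Mar 26 13:25:49 hest kernel: [25335.505176] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting",
    "Mar 26 13:25:58 hest kernel: [25345.504898] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting",
    "Mar 26 13:26:08 hest kernel: [25355.504758] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting",
    "Mar 26 13:26:13 hest kernel: [25360.504687] niu 0000:84:00.0: niu: eth4: Transmit timed out, resetting",
]

print(reset_intervals(LOG))
```

In this log the first two gaps are about 10 seconds and the later ones about 5 seconds, so the watchdog keeps firing and each reset evidently fails to bring the interface back.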

This is probably the interesting part:
Mar 26 13:25:49 hest kernel: [25335.587191] niu 0000:84:00.0: niu: eth4: bits (40000000) of register RXDMA_CFIG1 would not clear, val[c0000000]
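
To make that message easier to read, here is a small sketch that splits the two hex values into bit positions. The helper names are invented for this example. The bit names in the table are an assumption based on my recollection of the driver's register definitions and should be checked against niu.h before relying on them.

```python
# Values copied from the log line above.
MASK = 0x40000000   # bits the driver waited on
VAL = 0xC0000000    # register value reported when the wait gave up

# ASSUMPTION: names recalled from the niu register definitions, not verified here.
ASSUMED_NAMES = {31: "EN (channel enable)", 30: "RST (channel reset)"}

def set_bit_positions(value):
    """Return the positions of all bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

def stuck_bits(mask, val):
    """Return the bits from mask that are still set in val."""
    return mask & val

for bit in set_bit_positions(VAL):
    state = "stuck (waited on)" if (MASK >> bit) & 1 else "set (not waited on)"
    print(f"bit {bit}: {state}, assumed name: {ASSUMED_NAMES.get(bit, 'unknown')}")
```

So the driver waited for bit 30 to clear and it stayed set, while bit 31 was set as well.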

Any suggestions?

Is this perhaps just broken hardware, or a driver issue? (I had the 
Sun nxge driver working for around 180 days on the same card, so I 
would assume the hardware is OK.)

Jesper
-- 
Jesper

--

From: Matheos Worku
Date: Friday, March 27, 2009 - 5:42 pm

Jesper,

One of the RX ring DMAs is failing to reset. I guess whatever is 
hanging the TX side is affecting the RX side as well. Can you run lspci 
on the function and its siblings?
Regards
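
A note on "siblings": these are the other PCI functions that share the same domain:bus:device as 0000:84:00.0. A minimal sketch for listing them follows, assuming the standard sysfs layout under /sys/bus/pci/devices. The function names are hypothetical.

```python
import os

def list_pci_devices(root="/sys/bus/pci/devices"):
    """Return PCI addresses (domain:bus:device.function) found in sysfs."""
    return sorted(os.listdir(root))

def sibling_functions(bdf, all_devices):
    """Return every address in all_devices that shares domain:bus:device with bdf."""
    slot = bdf.rsplit(".", 1)[0]          # "0000:84:00.0" -> "0000:84:00"
    return sorted(d for d in all_devices if d.rsplit(".", 1)[0] == slot)

# Usage on the affected machine (needs a Linux system with sysfs mounted):
#   sibling_functions("0000:84:00.0", list_pci_devices())
```

Each address returned can then be passed to lspci with the -s option.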

--

From: Jesper Krogh
Date: Friday, March 27, 2009 - 11:05 pm

Like this (please guide me if that wasn't the correct lspci output):

k# lspci -vvv -s 84:00
84:00.0 Ethernet controller: Sun Microsystems Computer Corp. Multithreaded 10 Gigabit Ethernet Network Controller (rev 01)
         Subsystem: Sun Microsystems Computer Corp. Unknown device 0000
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 0, Cache Line Size: 64 bytes
         Interrupt: pin A routed to IRQ 43
         Region 0: Memory at fd000000 (64-bit, non-prefetchable) [size=16M]
         Region 2: Memory at fe9f8000 (64-bit, non-prefetchable) [size=32K]
         Region 4: Memory at fe9f0000 (64-bit, non-prefetchable) [size=32K]
         Expansion ROM at fe800000 [disabled] [size=1M]
         Capabilities: [40] Power Management version 2
                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [50] Message Signalled Interrupts: Mask+ 64bit+ Queue=0/5 Enable-
                 Address: 0000000000000000  Data: 0000
                 Masking: 00000000  Pending: 00000000
         Capabilities: [70] MSI-X: Enable+ Mask- TabSize=32
                 Vector table: BAR=2 offset=00000000
                 PBA: BAR=2 offset=00004000
         Capabilities: [80] Express Endpoint IRQ 0
                 Device: Supported: MaxPayload 1024 bytes, PhantFunc 0, ExtTag-
                 Device: Latency L0s <4us, L1 <8us
                 Device: AtnBtn- AtnInd- PwrInd-
                 Device: Errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                 Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                 Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
                 Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 1
                 Link: Latency L0s <512ns, L1 ...
From: Matheos Worku
Date: Friday, March 27, 2009 - 11:18 pm

Jesper,

I was wondering if you can get the register dump just after the NIC hangs.

lspci -vvv -xxx -s 84:0

Regards

--

From: Jesper Krogh
Date: Saturday, March 28, 2009 - 12:25 am

I will try to do that, but it more or less involves putting a "known 
bad driver" in production and waiting X days (where X is usually less 
than 7 and more than 2). So if there is more debugging code that would 
be helpful to have in the driver/kernel, it would be preferable to get 
it in at the same time, in order to reduce the number of 
trial-and-error cycles.
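
Since the hang may take days to show up, one option is to capture the dump automatically when the timeout message appears. The following is only a rough sketch of that idea, not something from the driver: the function names are hypothetical, the caller is assumed to feed it kernel log lines (for example from a followed log file), and it would have to run as root because lspci only shows the full config space with -xxx to root.

```python
import subprocess

def is_niu_timeout(line):
    """True for niu watchdog/driver lines reporting a transmit timeout."""
    lowered = line.lower()
    return "niu" in lowered and "transmit timed out" in lowered

def dump_command(slot="84:0"):
    """The register dump command requested earlier in this thread."""
    return ["lspci", "-vvv", "-xxx", "-s", slot]

def capture_on_timeout(lines, run=subprocess.run):
    """Run the dump once, at the first timeout line seen; return its result or None."""
    for line in lines:
        if is_niu_timeout(line):
            return run(dump_command(), capture_output=True, text=True)
    return None
```

The `run` parameter exists only so the trigger logic can be tried out without actually invoking lspci.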

-- 
Jesper

--
