Re: [Bugme-new] [Bug 12877] New: tg3: eth0 transit timed out, resetting -> dead NIC

Previous thread: Re: [Bugme-new] [Bug 12876] New: irq 18: nobody cared after down-ing an e1000 interface by Andrew Morton on Sunday, March 15, 2009 - 2:30 pm. (2 messages)

Next thread: Re: [PATCH 1/7] tcp: remove pointless .dsack/.num_sacks code by David Miller on Sunday, March 15, 2009 - 8:10 pm. (1 message)
From: Andrew Morton
Date: Sunday, March 15, 2009 - 2:32 pm

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).


--

From: Michael Chan
Date: Monday, March 16, 2009 - 2:23 pm

At the time of tx timeout, the registers all return 0xffffffff.  Does
the subsequent reset bring the device back?  If the device is brought
back, there should be a link up message and traffic should resume.  If
not, please provide lspci -vvvxxx on the eth0 device after the failure.

Also, when one ethernet port fails, does the other port (from the same
dual port device) function ok?

Thanks.




--

From: Bernhard Schmidt
Date: Monday, March 16, 2009 - 3:46 pm

The port does not pass traffic until I rmmod/modprobe the driver or
reset the whole system; it doesn't recover by itself. "[784081.605984]
tg3: eth0: Link is down." is the last message I see.


I'm not exactly sure about that, as nowadays there is nothing connected
to it anymore. However, when the problem first occurred there was, and
I'm pretty sure the second port was okay.

Bernhard
--

From: Bernhard Schmidt
Date: Tuesday, March 17, 2009 - 3:09 pm

Attached, both after the crash (tg3.crashed) and after I reloaded the
module (tg3.reloaded). Additional info: ifdown/ifup does not fix the
situation, and both take pretty long:

# ifdown eth0
tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear 
MAC_TX_MODE=ffffffff
# ifup eth0
tg3 0000:03:04.0: irq 1272 for MSI/MSI-X
ADDRCONF(NETDEV_UP): eth0: link is not ready

and it stays dead.

# rmmod tg3
tg3 0000:03:04.1: PCI INT B disabled
tg3 0000:03:04.0: PCI INT A disabled
# modprobe tg3
tg3.c:v3.94 (August 14, 2008)
tg3 0000:03:04.0: enabling device (0000 -> 0002)
tg3 0000:03:04.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
eth0: Tigon3 [partno(N/A) rev 9003 PHY(5714)] (PCIX:133MHz:64-bit) 
10/100/1000Base-T Ethernet 00:21:5a:99:0a:28
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[1]
eth0: dma_rwctrl[76148000] dma_mask[40-bit]
tg3 0000:03:04.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
eth1: Tigon3 [partno(N/A) rev 9003 PHY(5714)] (PCIX:133MHz:64-bit) 
10/100/1000Base-T Ethernet 00:21:5a:99:0a:29
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[76148000] dma_mask[40-bit]
# ifup eth0
ADDRCONF(NETDEV_UP): eth0: link is not ready
device eth0 entered promiscuous mode
tg3: eth0: Link is up at 100 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready


Still no clue about that, I need to find someone who can connect 
something there.

Bernhard
--

From: Michael Chan
Date: Tuesday, March 17, 2009 - 4:30 pm

Thanks for the information.  The memory enable bit in the PCI command
register was cleared during tx_timeout.  That's why all the registers
were reading 0xffffffff.  The tx_timeout code in tg3 would not be able
to reset the chip if that bit was cleared.  We need to find out why that
bit was cleared.  We should also enhance the tx timeout code so that it
can recover more completely even if the memory enable bit is cleared.

Thanks.
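
A minimal userspace sketch of the point above, not tg3 code: the memory enable bit is bit 1 (0x2) of the 16-bit PCI command register, and with it cleared the device stops answering memory-space reads, which then come back as all ones. The "enabling device (0000 -> 0002)" line in the reload log earlier in this thread appears consistent with this, since it shows the command register going from 0 to memory-enabled. The helper names mmio_enabled() and model_mmio_read() are hypothetical and exist only for illustration; the PCI_COMMAND_* values are the standard ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits in the 16-bit PCI command register (standard PCI values, the
 * same ones the kernel names PCI_COMMAND_* in its PCI register header). */
#define PCI_COMMAND_IO      0x1  /* respond to I/O space accesses */
#define PCI_COMMAND_MEMORY  0x2  /* respond to memory space (MMIO) accesses */
#define PCI_COMMAND_MASTER  0x4  /* allow the device to act as bus master */

/* Hypothetical helper: true if memory-space decoding is enabled. */
static bool mmio_enabled(uint16_t pci_cmd)
{
    return (pci_cmd & PCI_COMMAND_MEMORY) != 0;
}

/* Hypothetical model of a 32-bit register read (assumption: a read that
 * the device does not claim returns all ones, as reported in this thread).
 * real_value stands in for whatever the register actually holds. */
static uint32_t model_mmio_read(uint16_t pci_cmd, uint32_t real_value)
{
    return mmio_enabled(pci_cmd) ? real_value : 0xffffffffu;
}
```

With a command word of 0x0000, model_mmio_read() returns 0xffffffff for any register, matching the MAC_TX_MODE=ffffffff seen during ifdown; with 0x0002 it returns the real value.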


--

From: Matt Carlson
Date: Thursday, March 19, 2009 - 9:58 am

Hi Bernhard.  I talked to Michael about this and we'd like you to try
two things.

1) Can you disable iLo2 and see if you can still reproduce the problem?
2) Can you apply the following patch to get more information on when
   MMIO gets disabled?


diff -Nrup 1/drivers/net/tg3.c 2/drivers/net/tg3.c
--- 1/drivers/net/tg3.c	2009-02-20 14:41:27.000000000 -0800
+++ 2/drivers/net/tg3.c	2009-03-19 09:51:17.000000000 -0700
@@ -7799,6 +7799,12 @@ static void tg3_timer(unsigned long __op
 
 	/* This part only runs once per second. */
 	if (!--tp->timer_counter) {
+		u16 pci_cmd;
+
+		pci_read_config_word(tp->pdev, PCI_COMMAND, &pci_cmd);
+		if (!(pci_cmd & PCI_COMMAND_MEMORY))
+			printk( KERN_WARNING "PCI Memory Mapped IO Disabled!!!!\n" );
+
 		if (tp->tg3_flags2 & TG3_FLG2_5705_PLUS)
 			tg3_periodic_fetch_stats(tp);
 



--

From: Bernhard Schmidt
Date: Thursday, March 19, 2009 - 11:06 am

On Thu, Mar 19, 2009 at 09:58:42AM -0700, Matt Carlson wrote:


That will take a few days, I'll ask the on-site guys to check whether we
have an external IP-KVM available. 

Would switching the uplink connection to eth1 (and not use eth0 in

Applied, I'll send you the information as soon as it happens again
(which seems to be happening rather often the last couple of days).

Bernhard
--

From: Matt Carlson
Date: Thursday, March 19, 2009 - 11:15 am

Sure, let's try that.  Maybe this is the better way to go anyways.  I just
learned that disabling iLo2 doesn't necessarily disable the management
firmware on the network device.

For this to be a meaningful test though, we need to verify that the
driver sign-on messages have a line that reads "ASF[0]" on eth1.  Can

Thanks.

--

From: Bernhard Schmidt
Date: Thursday, March 19, 2009 - 11:19 am

Yes:

[1186598.218205] eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1]
WireSpeed[1] TSOcap[1]
[1186598.281314] eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0]
WireSpeed[1] TSOcap[1]
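
For checking this on several machines, the ASF field can be pulled out of a sign-on line mechanically. The following is only a sketch: asf_flag() is a hypothetical helper, and it assumes the line format shown above (a literal "ASF[" followed by a number).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the number inside "ASF[n]" from one tg3
 * sign-on line, or -1 if the line carries no ASF field. */
static int asf_flag(const char *line)
{
    const char *p = strstr(line, "ASF[");
    int value;

    /* '[' is a literal in a scanf format unless it follows '%'. */
    if (p == NULL || sscanf(p, "ASF[%d]", &value) != 1)
        return -1;
    return value;
}
```

Applied to the two lines above, it yields 1 for eth0 and 0 for eth1.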

I'll keep the traffic on eth0 now until it breaks again, hopefully
spewing out more debugging info with your patch and then I'll switch to
eth1.

Bernhard
--

From: Bernhard Schmidt
Date: Sunday, March 22, 2009 - 6:21 am

On Thu, Mar 19, 2009 at 09:58:42AM -0700, Matt Carlson wrote:


Not sure this is going to help you. The NIC crashed twice tonight; the
logs look like this:

Mar 22 04:06:46 svr02 kernel: [1392136.468921] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:47 svr02 kernel: [1392137.520288] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:48 svr02 kernel: [1392138.568267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:49 svr02 kernel: [1392139.616266] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:50 svr02 kernel: [1392140.664266] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:51 svr02 kernel: [1392141.712276] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:52 svr02 kernel: [1392142.760297] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:53 svr02 kernel: [1392143.808258] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:54 svr02 kernel: [1392144.856256] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:55 svr02 kernel: [1392145.904266] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:56 svr02 kernel: [1392146.952276] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:57 svr02 kernel: [1392148.000267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:58 svr02 kernel: [1392149.048271] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:59 svr02 kernel: [1392150.096268] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:00 svr02 kernel: [1392151.144277] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:01 svr02 kernel: [1392152.192287] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:02 svr02 kernel: [1392153.240268] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:03 svr02 kernel: [1392154.288267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:04 svr02 kernel: [1392155.336267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:05 svr02 kernel: [1392156.384268] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:07 svr02 kernel: [1392157.432267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:08 svr02 kernel: [1392158.480266] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:09 svr02 kernel: [1392159.528289] PCI Memory Mapped IO ...
--

From: Matt Carlson
Date: Monday, March 23, 2009 - 11:18 am

So traffic on this box must be pretty light for the watchdog to fire off

O.K.  I eagerly await your results.

--

From: Bernhard Schmidt
Date: Monday, March 23, 2009 - 5:35 pm

On 23.03.2009 19:18, Matt Carlson wrote:


Just to make sure I didn't confuse you, the "watchdog" I was talking 
about here is a shellscript like this, executed every minute

---
/bin/ping -q -c 5 <defaultgw> > /dev/null
RC=$?
if [ ${RC} -ne 0 ]; then
	rmmod tg3; sleep 5; modprobe tg3; sleep 5; ifup --force eth0
fi
---


The tg3 watchdog (tg3: eth0: transmit timed out, resetting) did not
appear at all in this cycle, so I guess the check script unloaded the
module before it could fire.

Yes, the NIC is very lightly loaded, around 100kbps / 70pps in each 

So far so good, but it has only been running ~36 hours, that's not 
really a stability spree yet :-)

I'll keep you updated.

Bernhard
--

From: Matt Carlson
Date: Tuesday, March 31, 2009 - 9:26 am

Bernhard, any word on what happened?

--

From: Bernhard Schmidt
Date: Tuesday, March 31, 2009 - 3:16 pm

So far so good. In the last week my watchdog (cannot reach the default
gateway) triggered once, but since there were no "PCI Memory Mapped IO
Disabled!!!!" messages in the logfile I assume that was a real network
problem. Since the box ran fine for months initially and the first two
occurrences of this issue were two weeks apart I cannot say for sure, but
it definitely feels better than the "once in two days" in the end.

Bernhard
--

From: Bernhard Schmidt
Date: Monday, April 13, 2009 - 2:54 pm

No crashes in the last two weeks.

Do you have any further suggestions on how to debug this, or should we
accept that portsharing doesn't work very well? This is the second
problem we can directly attribute to the sharing with the iLO. The first
one was the "no IPv6 unless the port is in promiscuous mode" issue; we
had a thread about this here on netdev a few months ago.

Bernhard
--

From: Matt Carlson
Date: Tuesday, April 14, 2009 - 11:29 am

No, I think we need to get this to work with portsharing.  We just need
to figure out what it is about it that causes these types of errors.

I talked to the firmware maintainer here.  We have a couple ideas that
might uncover what is happening.  Our next step is to develop a set of
tests that will show under what assumptions the firmware is operating.
Once we have that, I'll ask you to patch your driver so that we can see
what is happening from your end.

Stay tuned.

--
