(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). --
At the time of tx timeout, the registers all return 0xffffffff. Does the subsequent reset bring the device back? If the device is brought back, there should be a link up message and traffic should resume. If not, please provide lspci -vvvxxx on the eth0 device after the failure. Also, when one ethernet port fails, does the other port (from the same dual port device) function ok? Thanks. --
The port does not pass traffic until I rmmod/modprobe the driver or reset the whole system, it doesn't recover by itself. "[784081.605984] tg3: eth0: Link is down." is the last message I see. I'm not exactly sure about that, as nowadays there is nothing connected to it anymore. However, when the problem first occurred there was, and I'm pretty sure the second port was okay. Bernhard --
Attached, both after the crash (tg3.crashed) and after I reloaded the module (tg3.reloaded).
Additional info, ifdown/ifup does not fix the situation, both take pretty long
# ifdown eth0
tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
# ifup eth0
tg3 0000:03:04.0: irq 1272 for MSI/MSI-X
ADDRCONF(NETDEV_UP): eth0: link is not ready
and it stays dead.
# rmmod tg3
tg3 0000:03:04.1: PCI INT B disabled
tg3 0000:03:04.0: PCI INT A disabled
# modprobe tg3
tg3.c:v3.94 (August 14, 2008)
tg3 0000:03:04.0: enabling device (0000 -> 0002)
tg3 0000:03:04.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
eth0: Tigon3 [partno(N/A) rev 9003 PHY(5714)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:21:5a:99:0a:28
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[1]
eth0: dma_rwctrl[76148000] dma_mask[40-bit]
tg3 0000:03:04.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
eth1: Tigon3 [partno(N/A) rev 9003 PHY(5714)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:21:5a:99:0a:29
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[76148000] dma_mask[40-bit]
# ifup eth0
ADDRCONF(NETDEV_UP): eth0: link is not ready
device eth0 entered promiscuous mode
tg3: eth0: Link is up at 100 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Still no clue about that, I need to find someone who can connect something there.
Bernhard
Thanks for the information. The memory enable bit in the PCI command register was cleared during tx_timeout. That's why all the registers were reading 0xffffffff. The tx_timeout code in tg3 would not be able to reset the chip if that bit was cleared. We need to find out why that bit was cleared. We should also enhance the tx timeout code so that it can recover more completely even if the memory enable bit is cleared. Thanks. --
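For anyone following along, the check described above can be sketched in plain C outside the driver. This is only an illustration, not tg3 code: it assumes you have a raw copy of the device's PCI configuration space (for example the bytes printed by lspci -xxx), and the helper names cfg_command and cfg_memory_enabled are made up for this example. The only facts it relies on are the standard config-space layout: the 16-bit command register sits at offset 0x04 and bit 1 (0x0002) is Memory Space Enable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard PCI configuration-space layout: the 16-bit command register
 * is at offset 0x04 and bit 1 is Memory Space Enable. */
#define CFG_COMMAND_OFFSET 0x04
#define CFG_COMMAND_MEMORY 0x0002

/* Hypothetical helper: extract the little-endian command register from a
 * raw config-space dump of at least 6 bytes. */
uint16_t cfg_command(const uint8_t *cfg, size_t len)
{
    assert(cfg != NULL && len >= CFG_COMMAND_OFFSET + 2);
    return (uint16_t)(cfg[CFG_COMMAND_OFFSET] |
                      (cfg[CFG_COMMAND_OFFSET + 1] << 8));
}

/* Hypothetical helper: true if the device is allowed to decode its
 * memory BARs, i.e. MMIO register access can work at all. */
bool cfg_memory_enabled(const uint8_t *cfg, size_t len)
{
    return (cfg_command(cfg, len) & CFG_COMMAND_MEMORY) != 0;
}
```

With the bit clear the device no longer responds to memory cycles, and reads of its registers typically come back as all ones, which matches the 0xffffffff values seen at tx timeout. The "enabling device (0000 -> 0002)" line in the modprobe output earlier in the thread is the same bit being set again when the driver is reloaded.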
Hi Bernhard. I talked to Michael about this and we'd like you to try
two things.
1) Can you disable iLo2 and see if you can still reproduce the problem?
2) Can you apply the following patch to get more information on when
MMIO gets disabled?
diff -Nrup 1/drivers/net/tg3.c 2/drivers/net/tg3.c
--- 1/drivers/net/tg3.c 2009-02-20 14:41:27.000000000 -0800
+++ 2/drivers/net/tg3.c 2009-03-19 09:51:17.000000000 -0700
@@ -7799,6 +7799,12 @@ static void tg3_timer(unsigned long __op
 	/* This part only runs once per second. */
 	if (!--tp->timer_counter) {
+		u16 pci_cmd;
+
+		pci_read_config_word(tp->pdev, PCI_COMMAND, &pci_cmd);
+		if (!(pci_cmd & PCI_COMMAND_MEMORY))
+			printk( KERN_WARNING "PCI Memory Mapped IO Disabled!!!!\n" );
+
 		if (tp->tg3_flags2 & TG3_FLG2_5705_PLUS)
 			tg3_periodic_fetch_stats(tp);
> 70: 00 00 00 00 00 00 00 00 00
On Thu, Mar 19, 2009 at 09:58:42AM -0700, Matt Carlson wrote:
That will take a few days, I'll ask the on-site guys to check whether we have an external IP-KVM available.
Would switching the uplink connection to eth1 (and not use eth0 in
Applied, I'll send you the information as soon as it happens again (which seems to be happening rather often the last couple of days).
Bernhard --
Sure, let's try that. Maybe this is the better way to go anyways. I just learned that disabling iLo2 doesn't necessarily disable the management firmware on the network device. For this to be a meaningful test though, we need to verify that the driver sign-on messages have a line that reads "ASF[0]" on eth1. Can Thanks. --
Yes:
[1186598.218205] eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[1]
[1186598.281314] eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
I'll keep the traffic on eth0 now until it breaks again, hopefully spewing out more debugging info with your patch and then I'll switch to eth1.
Bernhard --
On Thu, Mar 19, 2009 at 09:58:42AM -0700, Matt Carlson wrote:
Not sure this is going to help you. NIC crashed two times tonight, logs look like this
Mar 22 04:06:46 svr02 kernel: [1392136.468921] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:47 svr02 kernel: [1392137.520288] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:48 svr02 kernel: [1392138.568267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:49 svr02 kernel: [1392139.616266] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:50 svr02 kernel: [1392140.664266] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:51 svr02 kernel: [1392141.712276] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:52 svr02 kernel: [1392142.760297] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:53 svr02 kernel: [1392143.808258] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:54 svr02 kernel: [1392144.856256] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:55 svr02 kernel: [1392145.904266] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:56 svr02 kernel: [1392146.952276] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:57 svr02 kernel: [1392148.000267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:58 svr02 kernel: [1392149.048271] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:06:59 svr02 kernel: [1392150.096268] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:00 svr02 kernel: [1392151.144277] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:01 svr02 kernel: [1392152.192287] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:02 svr02 kernel: [1392153.240268] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:03 svr02 kernel: [1392154.288267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:04 svr02 kernel: [1392155.336267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:05 svr02 kernel: [1392156.384268] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:07 svr02 kernel: [1392157.432267] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:08 svr02 kernel: [1392158.480266] PCI Memory Mapped IO Disabled!!!!
Mar 22 04:07:09 svr02 kernel: [1392159.528289] PCI Memory Mapped IO ...
So traffic on this box must be pretty light for the watchdog to fire off
O.K. I eagerly await your results. --
On 23.03.2009 19:18, Matt Carlson wrote:
Just to make sure I didn't confuse you, the "watchdog" I was talking
about here is a shellscript like this, executed every minute
---
/bin/ping -q -c 5 <defaultgw> > /dev/null
RC=$?
if [ ${RC} -ne 0 ]; then
    rmmod tg3; sleep 5; modprobe tg3; sleep 5; ifup --force eth0
fi
---
The tg3 watchdog (tg3: eth0: transmit timed out, resetting) did not
appear at all in this cycle, so I guess the check script unloaded the
module before it could fire.
Yes, the NIC is very lightly loaded, around 100kbps / 70pps in each
So far so good, but it has only been running ~36 hours, that's not
really a stability spree yet :-)
I'll keep you updated.
Bernhard
--
So far so good. In the last week my watchdog (cannot reach the default gateway) triggered once, but since there were no "PCI Memory Mapped IO Disabled!!!!" messages in the logfile I assume that was a real network problem. Since the box ran fine for months initially and the first two occurrences of this issue were two weeks apart I cannot say for sure, but it definitely feels better than the "once in two days" in the end. Bernhard --
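The reasoning above (a watchdog trigger without the patch's warning in the log is probably an ordinary network problem) can be written down as a small check. This is a rough sketch under my own assumptions: the log is already in memory as one NUL-terminated string, and count_mmio_warnings and trigger_is_mmio_failure are hypothetical names, not part of tg3 or of the check script.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Marker printed by the debugging patch in this thread. */
static const char MMIO_MARKER[] = "PCI Memory Mapped IO Disabled!!!!";

/* Hypothetical helper: count non-overlapping occurrences of the marker
 * in a NUL-terminated log buffer. */
size_t count_mmio_warnings(const char *log)
{
    size_t n = 0;
    const char *p = log;

    assert(log != NULL);
    while ((p = strstr(p, MMIO_MARKER)) != NULL) {
        n++;
        p += sizeof MMIO_MARKER - 1; /* skip past this match */
    }
    return n;
}

/* Hypothetical helper: 1 if the log around a watchdog trigger contains
 * the warning, 0 if the trigger was more likely an unrelated outage. */
int trigger_is_mmio_failure(const char *log)
{
    return count_mmio_warnings(log) > 0;
}
```

A trigger with a count of zero says nothing certain about the NIC; it only means the once-per-second check in the patched timer never saw the memory enable bit cleared.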
No crashes in the last two weeks. Do you have any further suggestions on how to debug this or should we accept that portsharing doesn't work very well? This is the second problem we can directly attribute to the sharing with the iLO (the first one being the "no IPv6 unless the port is in promiscuous mode", we had a thread about this here on netdev a few months ago). Bernhard --
No, I think we need to get this to work with portsharing. We just need to figure out what it is about it that causes these types of errors. I talked to the firmware maintainer here. We have a couple ideas that might uncover what is happening. Our next step is to develop a set of tests that will show under what assumptions the firmware is operating. Once we have that, I'll ask you to patch your driver so that we can see what is happening from your end. Stay tuned. --
