This patch adds a workaround for lost MSI interrupts. There is a race
condition in the HW in which future interrupts could be missed. The
workaround is to toggle the MSI irq mask.

Signed-off-by: Ayaz Abdulla <aabdulla@nvidia.com>
This patch adds a workaround for lost MSI interrupts. There is a race
condition in the HW in which future interrupts could be missed. The
workaround is to toggle the MSI irq mask.

Added cleanup based on comments from Andrew Morton.
Signed-off-by: Ayaz Abdulla <aabdulla@nvidia.com>
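
A minimal user-space sketch of the toggle described above, for illustration only. The mock register array, mock_writel(), and the two constant values are assumptions invented for this sketch; the real driver uses writel() on memory-mapped registers, with the flag and vector values defined in forcedeth.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real definitions live in forcedeth.c. */
enum { REG_MSI_IRQ_MASK = 0, REG_COUNT = 1 };
#define MSI_ENABLED_FLAG     0x0040u /* assumed value, illustration only */
#define MSI_VECTOR_0_ENABLED 0x01u   /* assumed value, illustration only */
#define WRITE_LOG_LEN        8

struct mock_nic {
	uint32_t msi_flags;
	uint32_t regs[REG_COUNT];
	uint32_t write_log[WRITE_LOG_LEN]; /* records every value written */
	size_t writes;
};

/* Mock of a register write: stores the value and logs it. */
static void mock_writel(struct mock_nic *nic, uint32_t value, int reg)
{
	assert(nic->writes < WRITE_LOG_LEN);
	nic->regs[reg] = value;
	nic->write_log[nic->writes++] = value;
}

/* Toggle the MSI irq mask: clear it, then re-enable vector 0.
 * Does nothing when MSI is not enabled. */
static void msi_workaround(struct mock_nic *nic)
{
	if (nic->msi_flags & MSI_ENABLED_FLAG) {
		mock_writel(nic, 0, REG_MSI_IRQ_MASK);
		mock_writel(nic, MSI_VECTOR_0_ENABLED, REG_MSI_IRQ_MASK);
	}
}
```

With MSI enabled, a call to msi_workaround() produces exactly two writes to the mask register, first 0 and then the vector-0 enable value. With MSI disabled it produces none.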
On Tue, 03 Jun 2008 16:51:46 -0400
I'm not loving the implementation.
- That `inline' adds 35 bytes more text to the driver, and we expect
  that this often yields slower code.

- Every caller of nv_msi_workaround() already has the fe_priv* in a
  local variable, so why not pass that in and save the additional
  pointer calculation?

So this:

diff -puN drivers/net/forcedeth.c~forcedeth-msi-interrupts-uninlining drivers/net/forcedeth.c
--- a/drivers/net/forcedeth.c~forcedeth-msi-interrupts-uninlining
+++ a/drivers/net/forcedeth.c
@@ -3277,15 +3277,14 @@ static void nv_link_irq(struct net_devic
 	dprintk(KERN_DEBUG "%s: link change notification done.\n", dev->name);
 }
 
-static inline void nv_msi_workaround(struct net_device *dev)
+static void nv_msi_workaround(struct fe_priv *np)
 {
-	struct fe_priv *np = netdev_priv(dev);
-	u8 __iomem *base = get_hwbase(dev);
-
 	/* Need to toggle the msi irq mask within the ethernet device,
 	 * otherwise, future interrupts will not be detected.
 	 */
 	if (np->msi_flags & NV_MSI_ENABLED) {
+		u8 __iomem *base = np->base;
+
 		writel(0, base + NvRegMSIIrqMask);
 		writel(NVREG_MSI_VECTOR_0_ENABLED, base + NvRegMSIIrqMask);
 	}
@@ -3313,7 +3312,7 @@ static irqreturn_t nv_nic_irq(int foo, v
 		if (!(events & np->irqmask))
 			break;
 
-		nv_msi_workaround(dev);
+		nv_msi_workaround(np);
 
 		spin_lock(&np->lock);
 		nv_tx_done(dev);
@@ -3430,7 +3429,7 @@ static irqreturn_t nv_nic_irq_optimized(
 		if (!(events & np->irqmask))
 			break;
 
-		nv_msi_workaround(dev);
+		nv_msi_workaround(np);
 
 		spin_lock(&np->lock);
 		nv_tx_done_optimized(dev, TX_WORK_PER_LOOP);
@@ -3772,7 +3771,7 @@ static irqreturn_t nv_nic_irq_test(int f
 	if (!(events & NVREG_IRQ_TIMER))
 		return IRQ_RETVAL(0);
 
-	nv_msi_workaround(dev);
+	nv_msi_workaround(np);
 
 	spin_lock(&np->lock);
 	np->intr_test = 1;

saves 42 bytes of text.
Now, if the (np->msi_flags & NV_MSI_EN...
-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------
--
On Fri, 06 Jun 2008 13:15:32 -0400
Sorry, what I meant was: do you believe that this patch should be in
2.6.26?
--
Yes, it should be treated as a critical fix.
--
On Fri, 06 Jun 2008 14:04:05 -0400
So should it also be backported into 2.6.25.x?
Bear in mind that $major_distros are apparently basing product on
2.6.25.
--
Yes, that would be great!
--
Hi Ayaz,
How far back should it be back ported? Do you know?
Thanks,
Karen Shaeffer
Neuralscape, Palo Alto, Ca. 94306
shaeffer@neuralscape.com http://www.neuralscape.com
--
Karen,
It is a bug fix that affects MSI interrupts. Any kernel that is
accepting critical fixes should add the patch.

Thanks,
Ayaz
--
Hello,
Let me explain and maybe get some advice. I have recently
done work with the Sun Netra X4200 M2 Server that uses
the Nvidia CK48 chipset and the forcedeth driver. You can
see an architecture overview here:
http://www.sun.com/servers/netra/x4200/wp.pdf

The Nvidia NIC that is integrated into the Nvidia 2200 chip,
has a failure mode for the following linux kernels
2.6.21.x
2.6.23.x
2.6.24.x
RHEL 2.6.18*

The NIC will hang under specific conditions for all these
kernels. First, you must run the NIC in 100 Mb mode with autoneg
enabled, then it will always link in a mismatch with the switch.
The switch will link at 100 Mb full duplex, while the Nvidia NIC
will link at 100 Mb half duplex. This was shown with both Cisco
managed switches and HP managed switches, and I suspect it will
happen with any switch. (You can force the NIC to 100 Mb full,
but autoneg will always result in the link mismatch.)

Once this link mismatch is in effect, then, if you run it long
enough, the NIC will eventually hang and become completely
disabled. (I know you shouldn't run a NIC in link mismatch, but end
users in the field sometimes don't realize it has happened.) It could
take days or weeks under reasonably heavy load, but it will always
hang in the end. Continually rebooting the server will result in the
hang in a matter of hours, where the link negotiation results in the
hang. No packets are ever transmitted in these cases. Because it
is reproducible in a matter of hours, this is the preferred way to
reproduce the failure mode.

The ethtool online test will pass. The ethtool offline test
will fail. The driver does TX register dumps into the logs and
reports TX busy errors. I provided all this information to Ayaz
in real time, but never got any response or comment from him.

Even a soft reboot will not clear this failure. This initially
led me to conclude this is a hardware failure, but it isn't
100% certain to be the case. This is because the NIC is known
to hang at boot time during ...
Karen,
Is the switch in forced mode? That would explain the mismatch. I can
look into a fix to work around the hang.

Please open a bugzilla bug as you recommend. Emails get deleted after a
couple of weeks and it is better to have a permanent location to track
issues like this.

Thanks,
Ayaz
Hi Ayaz,
Thank you for responding.
No, the switch vlan port is configured for 100 Mb autoneg on. The NIC
is configured with a boot time script using ethtool for 100 Mb autoneg on.
The result is always the same. The switch port links at 100 Mb full duplex,
and the NIC links at 100 Mb half duplex.

One can configure the NIC for 100 Mb autoneg off full duplex. And then
the link mismatch does not occur. And I have never seen the failure
in this case. But end users in the field sometimes just don't pay attention
and end up with the mismatch, so the failure was first discovered by an
end user.

At 1000 Mb, the link mismatch never happens. I've only seen this failure at
100 Mb.

The bug has been opened with ID 10885. I don't work with that project anymore,
but I'll definitely follow up with the appropriate folks, when this issue
is resolved. I am sure they will be very interested in a resolution.

Thanks,
Karen
--
Karen Shaeffer
Neuralscape, Palo Alto, Ca. 94306
shaeffer@neuralscape.com http://www.neuralscape.com
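
The mismatch described above (same speed on both ends, different duplex) can be expressed as a small check. This is only an illustrative sketch: the types and names below are hypothetical, not forcedeth or ethtool definitions, and the caller is assumed to have obtained both link states some other way (for example by reading them off the NIC and the switch port by hand).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types only; not taken from any driver or ethtool header. */
enum duplex_mode { DUPLEX_MODE_HALF, DUPLEX_MODE_FULL };

struct link_state {
	int speed_mbps;
	enum duplex_mode duplex;
};

enum link_check { LINK_OK, LINK_SPEED_MISMATCH, LINK_DUPLEX_MISMATCH };

/* Compare the negotiated state of the two link ends.
 * Speed is checked first; duplex only matters if the speeds agree. */
static enum link_check compare_link(const struct link_state *nic,
				    const struct link_state *sw)
{
	assert(nic != NULL && sw != NULL);

	if (nic->speed_mbps != sw->speed_mbps)
		return LINK_SPEED_MISMATCH;
	if (nic->duplex != sw->duplex)
		return LINK_DUPLEX_MISMATCH;
	return LINK_OK;
}
```

For the reported case, a NIC at 100 Mb half duplex against a switch port at 100 Mb full duplex yields LINK_DUPLEX_MISMATCH; forcing the NIC to 100 Mb full duplex yields LINK_OK.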
--
Hi,
One more detail. I personally reproduced the NIC mismatch and the NIC TX
failure many times in the lab on 5 different Netra servers, using quite
a few different kernel.org and RHEL kernels. It happened on every kernel I
tested, with 32 and 64 bit compiles. And it was produced in a data center
using the 2.6.22.10 kernel several times completely independent of me. For
more details, please see the bug ID 10885.

And I need to clarify that the last kernel I tested this on was actually
linux-2.6.24-rc8-git6. I misstated in the bug that I observed this failure
on the 2.6.25.4 kernel. That is inaccurate. I don't know whether it exists in
the 2.6.25.4 kernel, because I never tested that kernel. My error.

Thanks,
Karen
--
Karen Shaeffer
Neuralscape, Palo Alto, Ca. 94306
shaeffer@neuralscape.com http://www.neuralscape.com
--
Or does it go back even further, maybe even as far as 2.6.18?
I've observed the CK48 NICs hang on 2.6.18 - 2.6.24 kernels.
Just wondering if this is relevant.

Thanks,
Karen
--
Karen Shaeffer
Neuralscape, Palo Alto, Ca. 94306
shaeffer@neuralscape.com http://www.neuralscape.com
--
