Re: [PATCH 2/5] forcedeth: interrupt handling cleanup

Previous thread: [PATCH] [RFC] Time-based RFC 4122 UUID generator by Helge Deller on Saturday, October 6, 2007 - 6:53 am. (4 messages)

Next thread: [PATCH]AIO: fix cleanup in io_submit_one(...) by Yan Zheng on Saturday, October 6, 2007 - 9:26 am. (1 message)
From: Jeff Garzik
Date: Saturday, October 6, 2007 - 8:12 am

The 'fe-lock' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git fe-lock

contains the following changes that I would like to get tested:

      [netdrvr] forcedeth: make NAPI unconditional
      [netdrvr] forcedeth: interrupt handling cleanup
      [netdrvr] forcedeth: process TX completions using NAPI
      [netdrvr] forcedeth: internal simplification and cleanups
      [netdrvr] forcedeth: timer overhaul

These are intended for feedback and testing, NOT for merging.

The goals of these changes are:
* move the driver towards a more sane, simple, easy to verify locking
  setup -- irq handler would often acquire/release the lock twice
  for each interrupt -- and hopefully

* to eliminate a rarely used, apparently fragile locking scheme that
  includes heavy use of disable_irq().  this tool is most often employed
  during NIC reset/reconfiguration, so satisfying this goal implies
  changing the way NIC reset and config are accomplished.

Miscellaneous notes:
* using the new napi_struct stuff in net-2.6.24, the TX completion
  process has been moved to a -separate- NAPI polling channel.  thus
  there are now two napi_structs, one for RX and one for TX, independent
  of each other.

* I feel TX NAPI is a useful tool, because it provides an independent TX
  process control point and system load feedback point.
  Thus I felt this was slightly superior to tasklets.

* But who knows if this is a good idea?  :)  I am interested in
  feedback and criticism on this issue.

-

From: Jeff Garzik
Date: Saturday, October 6, 2007 - 8:13 am

commit 7bfc023b952e8e12c7333efccd2e78023c546a7c
Author: Jeff Garzik <jeff@garzik.org>
Date:   Fri Oct 5 20:50:24 2007 -0400

    [netdrvr] forcedeth: make NAPI unconditional
    
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

 drivers/net/Kconfig     |   17 -----
 drivers/net/forcedeth.c |  149 +-----------------------------------------------
 2 files changed, 4 insertions(+), 162 deletions(-)

7bfc023b952e8e12c7333efccd2e78023c546a7c
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 9c635a2..59eab61 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -1430,23 +1430,6 @@ config FORCEDETH
 	  <file:Documentation/networking/net-modules.txt>.  The module will be
 	  called forcedeth.
 
-config FORCEDETH_NAPI
-	bool "Use Rx and Tx Polling (NAPI) (EXPERIMENTAL)"
-	depends on FORCEDETH && EXPERIMENTAL
-	help
-	  NAPI is a new driver API designed to reduce CPU and interrupt load
-	  when the driver is receiving lots of packets from the card. It is
-	  still somewhat experimental and thus not yet enabled by default.
-
-	  If your estimated Rx load is 10kpps or more, or if the card will be
-	  deployed on potentially unfriendly networks (e.g. in a firewall),
-	  then say Y here.
-
-	  See <file:Documentation/networking/NAPI_HOWTO.txt> for more
-	  information.
-
-	  If in doubt, say N.
-
 config CS89x0
 	tristate "CS89x0 support"
 	depends on NET_PCI && (ISA || MACH_IXDP2351 || ARCH_IXDP2X01 || ARCH_PNX010X)
diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index dae30b7..49906cc 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -123,12 +123,7 @@
  * DEV_NEED_TIMERIRQ will not harm you on sane hardware, only generating a few
  * superfluous timer interrupts from the nic.
  */
-#ifdef CONFIG_FORCEDETH_NAPI
-#define DRIVERNAPI "-NAPI"
-#else
-#define DRIVERNAPI
-#endif
-#define FORCEDETH_VERSION		"0.60"
+#define FORCEDETH_VERSION		"1.00"
 #define DRV_NAME			"forcedeth"
 
 #include <linux/module.h>
@@ ...
From: Jeff Garzik
Date: Saturday, October 6, 2007 - 8:14 am

commit a606d2a111cdf948da5d69eb1de5526c5c2dafef
Author: Jeff Garzik <jeff@garzik.org>
Date:   Fri Oct 5 22:56:05 2007 -0400

    [netdrvr] forcedeth: interrupt handling cleanup
    
    * nv_nic_irq_optimized() and nv_nic_irq_other() were complete duplicates
      of nv_nic_irq(), with the exception of one function call.  Consolidate
      all three into a single interrupt handler, deleting a lot of redundant
      code.
    
    * greatly simplify irq handler locking.
    
      Prior to this change, the irq handler(s) would acquire and release
      np->lock for each action (RX, TX, other events).
    
      For the common case -- RX or TX work -- the lock is always acquired,
      making all successive acquire/release combinations largely redundant.
    
      Acquire the lock at the beginning of the irq handler, and release it at
      the end of the irq handler.  This is simple, easy, and obvious.
    
    * remove irq handler work loop.
    
      All interesting events emanating from the irq handler either have
      their own work loops, or they poke a timer into action.
    
      Therefore, delete the pointless master interrupt handler work loop.
    
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

 drivers/net/forcedeth.c |  325 +++++++++++-------------------------------------
 1 file changed, 77 insertions(+), 248 deletions(-)

a606d2a111cdf948da5d69eb1de5526c5c2dafef
diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 49906cc..1d1a5f5 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -2917,208 +2917,110 @@ static void nv_link_irq(struct net_device *dev)
 	dprintk(KERN_DEBUG "%s: link change notification done.\n", dev->name);
 }
 
-static irqreturn_t nv_nic_irq(int foo, void *data)
+static irqreturn_t __nv_nic_irq(struct net_device *dev, bool optimized)
 {
-	struct net_device *dev = (struct net_device *) data;
 	struct fe_priv *np = netdev_priv(dev);
 	u8 __iomem *base = get_hwbase(dev);
 	u32 ...
From: Yinghai Lu
Date: Saturday, October 6, 2007 - 9:43 pm

any chance to create three verion irq handlers for ioapic, msi, msi-x...?

MACRO or inline function?

YH
-

From: Jeff Garzik
Date: Sunday, October 7, 2007 - 4:40 am

MSI-X already has its own separate interrupt handlers.  MSI and INTx 
call the same interrupt handling code, like the unmodified driver goes. 
  Creating an MSI-specific irq handler would not save very much AFAICS, 
but I might be missing something.

Do you have ideas/suggestions for a different method?

	Jeff



-

From: Yinghai Lu
Date: Sunday, October 7, 2007 - 12:36 pm

in the e1000 driver, it has seperate handler for msi and ioapic.

but in forcedeth, the nv_nic_irq_optimized keep check msi_flags...,
with num of msi interrupt number, that could cause cpu loading get a
little bit high..., even the network performance is ok.

YH
-

From: Jeff Garzik
Date: Sunday, October 7, 2007 - 1:07 pm

With all the activity in the interrupt handler, a few in-cache branches 
are definitely going to be lost in the noise.

Separating the interrupt handlers between MSI and non-MSI tends to be of 
more benefit when the separation is accompanied by more efficient 
locking in the MSI interrupt handler, or a different mode of interrupt 
clear, or some other attribute.

Though CPU usage would be a good thing to measure, with these patches.

	Jeff




-

From: Ingo Molnar
Date: Sunday, October 7, 2007 - 2:03 am

i like that! One forcedeth annoyance that triggers frequently on one of 
my testboxes is:

[  120.955202] eth0: too many iterations (6) in nv_nic_irq.
[  121.233865] eth0: too many iterations (6) in nv_nic_irq.
[  129.215450] eth0: too many iterations (6) in nv_nic_irq.
[  139.734408] eth0: too many iterations (6) in nv_nic_irq.
[  144.546811] eth0: too many iterations (6) in nv_nic_irq.
[  153.811005] eth0: too many iterations (6) in nv_nic_irq.
[  154.695879] eth0: too many iterations (6) in nv_nic_irq.
[  155.455078] eth0: too many iterations (6) in nv_nic_irq.
[  173.912162] eth0: too many iterations (6) in nv_nic_irq.

	Ingo
-

From: Jeff Garzik
Date: Saturday, October 6, 2007 - 8:14 am

commit 57cbfacc00d69be2ba02b65d1021442273b76263
Author: Jeff Garzik <jeff@garzik.org>
Date:   Fri Oct 5 23:25:56 2007 -0400

    [netdrvr] forcedeth: process TX completions using NAPI
    
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

 drivers/net/forcedeth.c |  143 +++++++++++++++++++++++++++---------------------
 1 file changed, 83 insertions(+), 60 deletions(-)

57cbfacc00d69be2ba02b65d1021442273b76263
diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 1d1a5f5..1c236e6 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -744,6 +744,7 @@ struct fe_priv {
 
 	struct net_device *dev;
 	struct napi_struct napi;
+	struct napi_struct tx_napi;
 
 	/* General data:
 	 * Locking: spin_lock(&np->lock); */
@@ -810,7 +811,6 @@ struct fe_priv {
 	union ring_type tx_ring;
 	u32 tx_flags;
 	int tx_ring_size;
-	int tx_stop;
 
 	/* vlan fields */
 	struct vlan_group *vlangrp;
@@ -1763,10 +1763,7 @@ static int nv_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	empty_slots = nv_get_empty_tx_slots(np);
 	if (unlikely(empty_slots <= entries)) {
-		spin_lock_irq(&np->lock);
 		netif_stop_queue(dev);
-		np->tx_stop = 1;
-		spin_unlock_irq(&np->lock);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -1879,10 +1876,7 @@ static int nv_start_xmit_optimized(struct sk_buff *skb, struct net_device *dev)
 
 	empty_slots = nv_get_empty_tx_slots(np);
 	if (unlikely(empty_slots <= entries)) {
-		spin_lock_irq(&np->lock);
 		netif_stop_queue(dev);
-		np->tx_stop = 1;
-		spin_unlock_irq(&np->lock);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -1987,12 +1981,16 @@ static int nv_start_xmit_optimized(struct sk_buff *skb, struct net_device *dev)
  *
  * Caller must own np->lock.
  */
-static void nv_tx_done(struct net_device *dev)
+static int nv_tx_done(struct net_device *dev, int limit)
 {
 	struct fe_priv *np = netdev_priv(dev);
 	u32 flags;
+	int orig_limit = limit;
 	struct ring_desc* orig_get_tx = np->get_tx.orig;
 
+	if (unlikely(limit < ...
From: Jeff Garzik
Date: Sunday, October 7, 2007 - 7:39 am

The attached patch fixes an obvious bug.  Once applied, TX NAPI actually 
works :)


From: Jeff Garzik
Date: Saturday, October 6, 2007 - 8:14 am

commit 39572457a4dfe9a9dc1efd6641e7a6467e5658a1
Author: Jeff Garzik <jeff@garzik.org>
Date:   Sat Oct 6 01:21:01 2007 -0400

    [netdrvr] forcedeth: internal simplification and cleanups
    
    * remove changelog from source; its kept in git repository
    
    * split guts of RX/TX DMA engine disable into disable portion,
      and wait/etc. portions.
    
    * consolidate descriptor version tests using nv_optimized()
    
    * consolidate NIC DMA start, stop and drain into
      nv_start_txrx(), nv_stop_txrx(), nv_drain_txrx()
    
    * change nv_poll_controller() to call interrupt handling function
    
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

 drivers/net/forcedeth.c |  228 +++++++++++++++++-------------------------------
 1 file changed, 84 insertions(+), 144 deletions(-)

39572457a4dfe9a9dc1efd6641e7a6467e5658a1
diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 1c236e6..d6eacd7 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -29,89 +29,7 @@
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  *
- * Changelog:
- * 	0.01: 05 Oct 2003: First release that compiles without warnings.
- * 	0.02: 05 Oct 2003: Fix bug for nv_drain_tx: do not try to free NULL skbs.
- * 			   Check all PCI BARs for the register window.
- * 			   udelay added to mii_rw.
- * 	0.03: 06 Oct 2003: Initialize dev->irq.
- * 	0.04: 07 Oct 2003: Initialize np->lock, reduce handled irqs, add printks.
- * 	0.05: 09 Oct 2003: printk removed again, irq status print tx_timeout.
- * 	0.06: 10 Oct 2003: MAC Address read updated, pff flag generation updated,
- * 			   irq mask updated
- * 	0.07: 14 Oct 2003: Further irq mask updates.
- * 	0.08: 20 Oct 2003: rx_desc.Length initialization added, nv_alloc_rx refill
- * 			   added into irq handler, NULL check for drain_ring.
- * 	0.09: 20 Oct 2003: Basic link speed irq implementation. Only handle the
- * ...
From: Jeff Garzik
Date: Saturday, October 6, 2007 - 8:15 am

commit d7c766113ee2ec66ae8975e0acbad086d2c23594
Author: Jeff Garzik <jeff@garzik.org>
Date:   Sat Oct 6 10:57:56 2007 -0400

    [netdrvr] forcedeth: timer overhaul
    
    * convert stats_poll timer to a delayed-work workqueue stats_task
    
    * protect hw stats update with a lock
    
    * now that recovery is the only remaining use of nv_do_nic_poll(),
      rename it to nv_reset_task(), move it from a timer to a workqueue,
      and delete all non-recovery-related code from the function.
    
    * kill np->in_shutdown, it mirrors netif_running().  furthermore,
      the overwhelming majority of sites that tested np->in_shutdown
      were inside rtnl_lock() and guaranteed never to race against shutdown
      anyway.
    
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

 drivers/net/forcedeth.c |  194 ++++++++++++++++++------------------------------
 1 file changed, 75 insertions(+), 119 deletions(-)

d7c766113ee2ec66ae8975e0acbad086d2c23594
diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index d6eacd7..a037f49 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -62,6 +62,7 @@
 #include <linux/init.h>
 #include <linux/if_vlan.h>
 #include <linux/dma-mapping.h>
+#include <linux/workqueue.h>
 
 #include <asm/irq.h>
 #include <asm/io.h>
@@ -450,7 +451,6 @@ union ring_type {
 #define NV_PKTLIMIT_2	9100	/* Actual limit according to NVidia: 9202 */
 
 #define OOM_REFILL	(1+HZ/20)
-#define POLL_WAIT	(1+HZ/100)
 #define LINK_TIMEOUT	(3*HZ)
 #define STATS_INTERVAL	(10*HZ)
 
@@ -670,7 +670,6 @@ struct fe_priv {
 	 * Locking: spin_lock(&np->lock); */
 	struct net_device_stats stats;
 	struct nv_ethtool_stats estats;
-	int in_shutdown;
 	u32 linkspeed;
 	int duplex;
 	int autoneg;
@@ -681,7 +680,6 @@ struct fe_priv {
 	unsigned int phy_model;
 	u16 gigabit;
 	int intr_test;
-	int recover_error;
 
 	/* General data: RO fields */
 	dma_addr_t ring_addr;
@@ -710,9 +708,9 @@ struct fe_priv {
 	unsigned int ...
From: Jeff Garzik
Date: Saturday, October 6, 2007 - 8:17 am

It should also be pointed out that these patches were generated on top
of davem's net-2.6.24.git tree.

They -probably- apply to -mm, but you might have to remove the forcedeth
patches -mm already has, before applying.


-

From: Jeff Garzik
Date: Saturday, October 6, 2007 - 8:24 am

s/and hopefully//   (it became the next bullet point)

-

From: Ingo Molnar
Date: Sunday, October 7, 2007 - 2:08 am

/me agrees violently

btw., when i played with this tunable under -rt:

 enum {
         NV_OPTIMIZATION_MODE_THROUGHPUT,
         NV_OPTIMIZATION_MODE_CPU
 };
 static int optimization_mode = NV_OPTIMIZATION_MODE_THROUGHPUT;

the MODE_CPU one gave (much) _higher_ bandwidth. The queueing model in 
forcedeth seemed to be not that robust and i think a single queueing 
model should be adopted instead of this tunable. (which i think just hid 
some bug/dependency) But i never got to the bottom of it so it's just 
the impression i got.

	Ingo
-

From: Jeff Garzik
Date: Sunday, October 7, 2007 - 4:34 am

That's interesting.  It will be informative to narrow down the variables 
affected by this.  My changes stirred the pot quite a bit :)

* 'throughput' mode enables MSI-X, and separate interrupt vectors for RX 
and TX.  so, NVIDIA's MSI-X implementation, our generic MSI-X support, 
or "Known bugs" (see top of file) may be a factor here.

* 'throughput' mode also changes the NIC's timer interrupt frequency

* do you recall if you were running in NAPI mode?  It defaulted to off 
in Kconfig, but I turned it on unconditionally.

* I think TX NAPI has the potential to make the optimization_mode 
irrelevant (along with the other changes, most notably the interrupt 
handling change)

* and overall, yes, if we can have a single queueing model / 
optimization mode I am strongly in favor of that.

Testing welcome ;-)  Though these patches are raw and "hot off the 
presses", so unrelated bugs are practically a certainty.  And I am 
worrying about the "Known bugs" note at the top.  My gut feeling is that 
this was, in part, misunderstanding on the part of reverse-engineers, 
since corrected when NVIDIA started contributing to the driver.

	Jeff



-

From: Jeff Garzik
Date: Sunday, October 7, 2007 - 4:23 am

commit abca163a14b28c234df9bf38034bc967ff81c3aa
Author: Jeff Garzik <jeff@garzik.org>
Date:   Sun Oct 7 07:22:14 2007 -0400

    [netdrvr] forcedeth: wrap slow path hw manipulation inside hw_mutex
    
    * This makes sure everybody who wants to start/stop the RX and TX engines
      first acquires this mutex.
    
    * tx_timeout code was deleted, replaced by scheduling reset_task.
    
    * linkchange moved to a workqueue (always inside hw_mutex)
    
    * simplified irq handling a bit
    
    * make sure to disable workqueues before NAPI
    
    Signed-off-by: Jeff Garzik <jgarzik@redhat.com>

 drivers/net/forcedeth.c |  272 ++++++++++++++++++++++++++++++------------------
 1 file changed, 175 insertions(+), 97 deletions(-)

abca163a14b28c234df9bf38034bc967ff81c3aa
diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index a037f49..d1c1efa 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -63,6 +63,7 @@
 #include <linux/if_vlan.h>
 #include <linux/dma-mapping.h>
 #include <linux/workqueue.h>
+#include <linux/mutex.h>
 
 #include <asm/irq.h>
 #include <asm/io.h>
@@ -647,6 +648,12 @@ struct nv_skb_map {
 	unsigned int dma_len;
 };
 
+struct nv_mc_info {
+	u32 addr[2];
+	u32 mask[2];
+	u32 pff;
+};
+
 /*
  * SMP locking:
  * All hardware access under dev->priv->lock, except the performance
@@ -709,6 +716,8 @@ struct fe_priv {
 	unsigned int pkt_limit;
 	struct timer_list oom_kick;
 	struct work_struct reset_task;
+	struct work_struct linkchange_task;
+	struct work_struct mcast_task;
 	struct delayed_work stats_task;
 	u32 reset_task_irq;
 	int rx_ring_size;
@@ -731,14 +740,18 @@ struct fe_priv {
 	int tx_ring_size;
 
 	/* vlan fields */
-	struct vlan_group *vlangrp;
+	struct vlan_group	*vlangrp;
 
 	/* msi/msi-x fields */
-	u32 msi_flags;
-	struct msix_entry msi_x_entry[NV_MSI_X_MAX_VECTORS];
+	u32			msi_flags;
+	struct msix_entry	msi_x_entry[NV_MSI_X_MAX_VECTORS];
 
 	/* flow control */
-	u32 ...
From: Jeff Garzik
Date: Sunday, October 7, 2007 - 7:40 am

You will need the attached patch to even build (oops).

Also, testing shows there is a mutex deadlock somewhere.

From: Jeff Garzik
Date: Sunday, October 7, 2007 - 7:47 am

OK, I've successfully tested patches 1-5 on an AMD64 system with

00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)

A trivial s/napi/tx_napi/ fix must be applied to patch #3 (sent in 
separate email).  Once that is done, the patchset works flawlessly here, 
passing simple RX, TX, RX+TX TCP stress tests.

I never ran into any TX problems, of the type hinted at by the "Known 
bugs" section at the top of forcedeth.c.  Either (a) my chip does not 
have that bug, (b) my chip needs DEV_NEED_TIMERIRQ for other reasons, or 
(c) the issue is not a hardware issue but a driver bug that is now fixed.

I'm going to hope its (c), <grin>  But only testing will tell.

	Jeff


P.S.  Depending on when this gets fixed, you might have to revert 
net-2.6.24.git commit 5f5dace1ce001b24fb8286e09ffd3c4d2b547e09 ("[NET]: 
Various dst_ifdown routines to catch refcounting bugs").
-

From: Yinghai Lu
Date: Sunday, October 7, 2007 - 12:39 pm

need to test that with mcp55 based, that will use MSI.

YH
-

From: Jeff Garzik
Date: Sunday, October 7, 2007 - 1:05 pm

I would love to see an MSI-X test too (might have to turn on throughput 
mode).

	Jeff



-

Previous thread: [PATCH] [RFC] Time-based RFC 4122 UUID generator by Helge Deller on Saturday, October 6, 2007 - 6:53 am. (4 messages)

Next thread: [PATCH]AIO: fix cleanup in io_submit_one(...) by Yan Zheng on Saturday, October 6, 2007 - 9:26 am. (1 message)