Re: [PATCH net-2.6.24] Fix documentation for dev_put()/dev_hold()

Previous thread: ipsec: icmp fragmentation-needed from ipsec-gateway is not encrypted by Wolfgang Walter on Tuesday, September 18, 2007 - 3:57 am. (1 message)

Next thread: [PATCH] IPoIB: Optimizations in poll handler. by Krishna Kumar on Tuesday, September 18, 2007 - 4:39 am. (1 message)
From: Krishna Kumar
Date: Tuesday, September 18, 2007 - 4:18 am

While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with
the message: "waiting for ib2 to become free.  Usage count = -515276",
etc.

The problem is that the poll handler does netif_rx_complete (which
does a dev_put) followed by netif_rx_reschedule() to schedule for
more receives (which again does a dev_put). This reduces refcount to
< 0 (depending on how many times netif_rx_complete followed by
netif_rx_reschedule was called).

The following patch fixes the bug, but I don't know if there is some
specific IB issue that prevents this approach.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_ib.c |   11 ++++-------
 1 files changed, 4 insertions(+), 7 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-09-18 15:50:09.000000000 +0530
+++ new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-09-18 16:14:20.000000000 +0530
@@ -291,7 +291,6 @@ int ipoib_poll(struct napi_struct *napi,
 
 	done  = 0;
 
-poll_more:
 	while (done < budget) {
 		int max = (budget - done);
 
@@ -316,12 +315,10 @@ poll_more:
 	}
 
 	if (done < budget) {
-		netif_rx_complete(dev, napi);
-		if (unlikely(ib_req_notify_cq(priv->cq,
-					      IB_CQ_NEXT_COMP |
-					      IB_CQ_REPORT_MISSED_EVENTS)) &&
-		    netif_rx_reschedule(napi))
-			goto poll_more;
+		if (likely(!ib_req_notify_cq(priv->cq,
+					     IB_CQ_NEXT_COMP |
+					     IB_CQ_REPORT_MISSED_EVENTS)))
+			netif_rx_complete(dev, napi);
 	}
 
 	return done;
-

From: Krishna Kumar
Date: Tuesday, September 18, 2007 - 4:18 am

Follow-up cleanup and "while loop" optimization in the poll handler.
net_rx_action guarantees that 'budget' is atleast 1.

Note: This could also be done for poll handlers of other drivers.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_ib.c |   22 ++++++++--------------
 1 files changed, 8 insertions(+), 14 deletions(-)

diff -ruNp new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c new2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-09-18 16:14:20.000000000 +0530
+++ new2/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-09-18 16:31:42.000000000 +0530
@@ -285,19 +285,16 @@ int ipoib_poll(struct napi_struct *napi,
 {
 	struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, napi);
 	struct net_device *dev = priv->dev;
-	int done;
-	int t;
-	int n, i;
+	int num_wc, max_wc;
+	int done = 0;
 
-	done  = 0;
-
-	while (done < budget) {
-		int max = (budget - done);
+	do {
+		int i;
 
-		t = min(IPOIB_NUM_WC, max);
-		n = ib_poll_cq(priv->cq, t, priv->ibwc);
+		max_wc = min(IPOIB_NUM_WC, budget - done);
+		num_wc = ib_poll_cq(priv->cq, max_wc, priv->ibwc);
 
-		for (i = 0; i < n; i++) {
+		for (i = 0; i < num_wc; i++) {
 			struct ib_wc *wc = priv->ibwc + i;
 
 			if (wc->wr_id & IPOIB_CM_OP_SRQ) {
@@ -309,10 +306,7 @@ int ipoib_poll(struct napi_struct *napi,
 			} else
 				ipoib_ib_handle_tx_wc(dev, wc);
 		}
-
-		if (n != t)
-			break;
-	}
+	} while (num_wc == max_wc && done < budget);
 
 	if (done < budget) {
 		if (likely(!ib_req_notify_cq(priv->cq,
-

From: Roland Dreier
Date: Tuesday, September 18, 2007 - 7:27 am

Thanks for testing on ehca...

 > While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with

I don't think you're actually using rc6 bits, since in your patch you have:

 > -poll_more:

and I think that is only in Dave's net-2.6.24 tree now, right?

 > The problem is that the poll handler does netif_rx_complete (which
 > does a dev_put) followed by netif_rx_reschedule() to schedule for
 > more receives (which again does a dev_put). This reduces refcount to
 > < 0 (depending on how many times netif_rx_complete followed by
 > netif_rx_reschedule was called).

Dave, the real problem seems to be that netif_rx_recschedule() calls
__napi_schedule() rather than __netif_rx_schedule(), so it misses the
call to dev_hold() that is needed to balance the dev_put() in
netif_rx_complete().  The current netif_rx_reschedule() looks like it
really should be napi_reschedule(), and we need a new function that
takes a netdev too.  Or am I misunderstanding the refcounting?

I'll send a patch once I've had some breakfast and had a chance to at
least compile it...

Krishna, unfortunately your proposed fix has a race:

 > -		netif_rx_complete(dev, napi);
 > -		if (unlikely(ib_req_notify_cq(priv->cq,
 > -					      IB_CQ_NEXT_COMP |
 > -					      IB_CQ_REPORT_MISSED_EVENTS)) &&
 > -		    netif_rx_reschedule(napi))
 > -			goto poll_more;
 > +		if (likely(!ib_req_notify_cq(priv->cq,
 > +					     IB_CQ_NEXT_COMP |
 > +					     IB_CQ_REPORT_MISSED_EVENTS)))

It is possible for an interrupt to happen immediately right here,
before the netif_rx_complete(), so that netif_rx_schedule() gets
called while we are still on the poll list.

 > +			netif_rx_complete(dev, napi);

 - R.
-

From: Krishna Kumar2
Date: Tuesday, September 18, 2007 - 8:23 pm

Hi Roland,


Nope, that was what I downloaded yesterday:

VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 23
EXTRAVERSION =-rc6

To be clear, netif_rx_schedule while we are still in the poll list will not
do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared
by netif_rx_complete which has not yet run). Effectively we lost/delayed
processing an interrupt, if I understood the code right.

I agree with you on the new patch.

thanks,

- KK

-

From: Roland Dreier
Date: Tuesday, September 18, 2007 - 8:30 pm

> > and I think that is only in Dave's net-2.6.24 tree now, right?
 > 
 > Nope, that was what I downloaded yesterday:
 > 
 > VERSION = 2
 > PATCHLEVEL = 6
 > SUBLEVEL = 23
 > EXTRAVERSION =-rc6
 > NAME = Pink Farting Weasel

Please double check your tree.  I just very carefully looked at my
trees, and the poll_more: label is added in commit 6b460a71 ("[NET]:
Make NAPI polling independent of struct net_device objects.") which is
only in the net-2.6.24 tree.  Of course Dave did not change the
version information in the Makefile since he wouldn't want Linus to
pick up any extra strange changes when he pulls, so a net-2.6.24 tree
will look like 2.6.23-rc6 as you quoted.

And the refcounting bug I fixed is only in net-2.6.24.

 > To be clear, netif_rx_schedule while we are still in the poll list will not
 > do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared
 > by netif_rx_complete which has not yet run). Effectively we lost/delayed
 > processing an interrupt, if I understood the code right.

Right, we lose an interrupt, and since the CQ events are one-shot, we
never get another one, and the interface is effectively dead.

 - R.
-

From: Krishna Kumar2
Date: Tuesday, September 18, 2007 - 9:24 pm

You are absolutely right. My wording was incorrect, I should have said

Thanks,

- KK

-

From: Roland Dreier
Date: Tuesday, September 18, 2007 - 10:58 am

netif_rx_complete() takes a netdev parameter and does dev_put() on
that netdev, so netif_rx_reschedule() needs to also take a netdev
parameter and do dev_hold() on it to avoid reference counts from
getting becoming negative because of unbalanced dev_put()s.

This should fix the problem reported by Krishna Kumar
<krkumar2@in.ibm.com> with IPoIB waiting forever for netdev refcounts
to become 0 during module unload.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
---
Dave, feel free to roll this up into earlier NAPI conversion patches
(assuming I'm understanding things correctly and this patch actually
makes sense!).

BTW, it looks like drivers/net/ibm_emac/ibm_emac_mal.c would not have 
built in the current net-2.6.24 tree, since its call to
netif_rx_reschedule() was left with the netdev parameter.  So that
file does not need to be touched in this patch.

 drivers/infiniband/ulp/ipoib/ipoib_ib.c |    2 +-
 drivers/net/arm/ep93xx_eth.c            |    2 +-
 drivers/net/ehea/ehea_main.c            |    2 +-
 drivers/net/ibmveth.c                   |    2 +-
 include/linux/netdevice.h               |   21 +++++++++++----------
 5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 6a2bff4..481e4b6 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -320,7 +320,7 @@ poll_more:
 		if (unlikely(ib_req_notify_cq(priv->cq,
 					      IB_CQ_NEXT_COMP |
 					      IB_CQ_REPORT_MISSED_EVENTS)) &&
-		    netif_rx_reschedule(napi))
+		    netif_rx_reschedule(dev, napi))
 			goto poll_more;
 	}
 
diff --git a/drivers/net/arm/ep93xx_eth.c b/drivers/net/arm/ep93xx_eth.c
index f3858d1..7f016f3 100644
--- a/drivers/net/arm/ep93xx_eth.c
+++ b/drivers/net/arm/ep93xx_eth.c
@@ -309,7 +309,7 @@ poll_some_more:
 		}
 		spin_unlock_irq(&ep->rx_lock);
 
-		if (more && netif_rx_reschedule(napi))
+		if (more && netif_rx_reschedule(dev, ...
From: Roland Dreier
Date: Tuesday, September 18, 2007 - 11:04 am

It looks like the comments for dev_put() and dev_hold() got reversed somehow.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
---
 include/linux/netdevice.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index be5fe05..239ae68 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1030,7 +1030,7 @@ extern int		netdev_budget;
 extern void netdev_run_todo(void);
 
 /**
- *	dev_put - get reference to device
+ *	dev_put - release reference to device
  *	@dev: network device
  *
  * Hold reference to device to keep it from being freed.
@@ -1041,7 +1041,7 @@ static inline void dev_put(struct net_device *dev)
 }
 
 /**
- *	dev_hold - release reference to device
+ *	dev_hold - get reference to device
  *	@dev: network device
  *
  * Release reference to device to allow it to be freed.
-

From: David Miller
Date: Tuesday, September 18, 2007 - 1:16 pm

From: Roland Dreier <rdreier@cisco.com>

Applied, thanks Roland.
-

From: David Miller
Date: Tuesday, September 18, 2007 - 1:15 pm

From: Roland Dreier <rdreier@cisco.com>


Yes, I know, this is the one NAPI driver that hasn't been converted.

It's a complicated conversion because of how the driver and the data
structures have been arranged (in short, a mess) which makes it
insanely difficult to get from a queue instance back up to a network
device or similar.

Further complicating things is that you need to setup a ppc32
cross-build environment to even build test a conversion, and I'm not
comfortable doing the surgery until I can test build the thing.

And this may be hard to believe, but other things have been more
pressing than setting up a ppc32 cross-build environment :-)

This is a hint of anyone looking for something to do that it'd
be much appreciated for someone to tackle the ibm_emac conversion.

Thanks.
-

From: Roland Dreier
Date: Tuesday, September 18, 2007 - 3:46 pm

> Further complicating things is that you need to setup a ppc32
 > cross-build environment to even build test a conversion, and I'm not
 > comfortable doing the surgery until I can test build the thing.

OK, I actually have a system with a ppc 440 SoC that uses this driver,
so I'll try to get things to the stage where I can boot net-2.6.24 on
it and see if I can get the driver working...
-

From: David Miller
Date: Tuesday, September 18, 2007 - 3:50 pm

From: Roland Dreier <rdreier@cisco.com>

Thanks a lot Roland.
-

Previous thread: ipsec: icmp fragmentation-needed from ipsec-gateway is not encrypted by Wolfgang Walter on Tuesday, September 18, 2007 - 3:57 am. (1 message)

Next thread: [PATCH] IPoIB: Optimizations in poll handler. by Krishna Kumar on Tuesday, September 18, 2007 - 4:39 am. (1 message)