While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with
the message: "waiting for ib2 to become free. Usage count = -515276",
etc.
The problem is that the poll handler does netif_rx_complete (which
does a dev_put) followed by netif_rx_reschedule() to schedule for
more receives (which again does a dev_put). This reduces refcount to
< 0 (depending on how many times netif_rx_complete followed by
netif_rx_reschedule was called).
The following patch fixes the bug, but I don't know if there is some
specific IB issue that prevents this approach.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_ib.c | 11 ++++-------
1 files changed, 4 insertions(+), 7 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 15:50:09.000000000 +0530
+++ new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 16:14:20.000000000 +0530
@@ -291,7 +291,6 @@ int ipoib_poll(struct napi_struct *napi,
done = 0;
-poll_more:
while (done < budget) {
int max = (budget - done);
@@ -316,12 +315,10 @@ poll_more:
}
if (done < budget) {
- netif_rx_complete(dev, napi);
- if (unlikely(ib_req_notify_cq(priv->cq,
- IB_CQ_NEXT_COMP |
- IB_CQ_REPORT_MISSED_EVENTS)) &&
- netif_rx_reschedule(napi))
- goto poll_more;
+ if (likely(!ib_req_notify_cq(priv->cq,
+ IB_CQ_NEXT_COMP |
+ IB_CQ_REPORT_MISSED_EVENTS)))
+ netif_rx_complete(dev, napi);
}
return done;
-
Follow-up cleanup and "while loop" optimization in the poll handler.
net_rx_action guarantees that 'budget' is atleast 1.
Note: This could also be done for poll handlers of other drivers.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_ib.c | 22 ++++++++--------------
1 files changed, 8 insertions(+), 14 deletions(-)
diff -ruNp new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c new2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- new1/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 16:14:20.000000000 +0530
+++ new2/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-18 16:31:42.000000000 +0530
@@ -285,19 +285,16 @@ int ipoib_poll(struct napi_struct *napi,
{
struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, napi);
struct net_device *dev = priv->dev;
- int done;
- int t;
- int n, i;
+ int num_wc, max_wc;
+ int done = 0;
- done = 0;
-
- while (done < budget) {
- int max = (budget - done);
+ do {
+ int i;
- t = min(IPOIB_NUM_WC, max);
- n = ib_poll_cq(priv->cq, t, priv->ibwc);
+ max_wc = min(IPOIB_NUM_WC, budget - done);
+ num_wc = ib_poll_cq(priv->cq, max_wc, priv->ibwc);
- for (i = 0; i < n; i++) {
+ for (i = 0; i < num_wc; i++) {
struct ib_wc *wc = priv->ibwc + i;
if (wc->wr_id & IPOIB_CM_OP_SRQ) {
@@ -309,10 +306,7 @@ int ipoib_poll(struct napi_struct *napi,
} else
ipoib_ib_handle_tx_wc(dev, wc);
}
-
- if (n != t)
- break;
- }
+ } while (num_wc == max_wc && done < budget);
if (done < budget) {
if (likely(!ib_req_notify_cq(priv->cq,
-
Thanks for testing on ehca... > While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with I don't think you're actually using rc6 bits, since in your patch you have: > -poll_more: and I think that is only in Dave's net-2.6.24 tree now, right? > The problem is that the poll handler does netif_rx_complete (which > does a dev_put) followed by netif_rx_reschedule() to schedule for > more receives (which again does a dev_put). This reduces refcount to > < 0 (depending on how many times netif_rx_complete followed by > netif_rx_reschedule was called). Dave, the real problem seems to be that netif_rx_recschedule() calls __napi_schedule() rather than __netif_rx_schedule(), so it misses the call to dev_hold() that is needed to balance the dev_put() in netif_rx_complete(). The current netif_rx_reschedule() looks like it really should be napi_reschedule(), and we need a new function that takes a netdev too. Or am I misunderstanding the refcounting? I'll send a patch once I've had some breakfast and had a chance to at least compile it... Krishna, unfortunately your proposed fix has a race: > - netif_rx_complete(dev, napi); > - if (unlikely(ib_req_notify_cq(priv->cq, > - IB_CQ_NEXT_COMP | > - IB_CQ_REPORT_MISSED_EVENTS)) && > - netif_rx_reschedule(napi)) > - goto poll_more; > + if (likely(!ib_req_notify_cq(priv->cq, > + IB_CQ_NEXT_COMP | > + IB_CQ_REPORT_MISSED_EVENTS))) It is possible for an interrupt to happen immediately right here, before the netif_rx_complete(), so that netif_rx_schedule() gets called while we are still on the poll list. > + netif_rx_complete(dev, napi); - R. -
Hi Roland, Nope, that was what I downloaded yesterday: VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 23 EXTRAVERSION =-rc6 To be clear, netif_rx_schedule while we are still in the poll list will not do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared by netif_rx_complete which has not yet run). Effectively we lost/delayed processing an interrupt, if I understood the code right. I agree with you on the new patch. thanks, - KK -
> > and I think that is only in Dave's net-2.6.24 tree now, right? > > Nope, that was what I downloaded yesterday: > > VERSION = 2 > PATCHLEVEL = 6 > SUBLEVEL = 23 > EXTRAVERSION =-rc6 > NAME = Pink Farting Weasel Please double check your tree. I just very carefully looked at my trees, and the poll_more: label is added in commit 6b460a71 ("[NET]: Make NAPI polling independent of struct net_device objects.") which is only in the net-2.6.24 tree. Of course Dave did not change the version information in the Makefile since he wouldn't want Linus to pick up any extra strange changes when he pulls, so a net-2.6.24 tree will look like 2.6.23-rc6 as you quoted. And the refcounting bug I fixed is only in net-2.6.24. > To be clear, netif_rx_schedule while we are still in the poll list will not > do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared > by netif_rx_complete which has not yet run). Effectively we lost/delayed > processing an interrupt, if I understood the code right. Right, we lose an interrupt, and since the CQ events are one-shot, we never get another one, and the interface is effectively dead. - R. -
You are absolutely right. My wording was incorrect, I should have said Thanks, - KK -
netif_rx_complete() takes a netdev parameter and does dev_put() on that netdev, so netif_rx_reschedule() needs to also take a netdev parameter and do dev_hold() on it to avoid reference counts from getting becoming negative because of unbalanced dev_put()s. This should fix the problem reported by Krishna Kumar <krkumar2@in.ibm.com> with IPoIB waiting forever for netdev refcounts to become 0 during module unload. Signed-off-by: Roland Dreier <rolandd@cisco.com> --- Dave, feel free to roll this up into earlier NAPI conversion patches (assuming I'm understanding things correctly and this patch actually makes sense!). BTW, it looks like drivers/net/ibm_emac/ibm_emac_mal.c would not have built in the current net-2.6.24 tree, since its call to netif_rx_reschedule() was left with the netdev parameter. So that file does not need to be touched in this patch. drivers/infiniband/ulp/ipoib/ipoib_ib.c | 2 +- drivers/net/arm/ep93xx_eth.c | 2 +- drivers/net/ehea/ehea_main.c | 2 +- drivers/net/ibmveth.c | 2 +- include/linux/netdevice.h | 21 +++++++++++---------- 5 files changed, 15 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 6a2bff4..481e4b6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -320,7 +320,7 @@ poll_more: if (unlikely(ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)) && - netif_rx_reschedule(napi)) + netif_rx_reschedule(dev, napi)) goto poll_more; } diff --git a/drivers/net/arm/ep93xx_eth.c b/drivers/net/arm/ep93xx_eth.c index f3858d1..7f016f3 100644 --- a/drivers/net/arm/ep93xx_eth.c +++ b/drivers/net/arm/ep93xx_eth.c @@ -309,7 +309,7 @@ poll_some_more: } spin_unlock_irq(&ep->rx_lock); - if (more && netif_rx_reschedule(napi)) + if (more && netif_rx_reschedule(dev, ...
It looks like the comments for dev_put() and dev_hold() got reversed somehow. Signed-off-by: Roland Dreier <rolandd@cisco.com> --- include/linux/netdevice.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index be5fe05..239ae68 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1030,7 +1030,7 @@ extern int netdev_budget; extern void netdev_run_todo(void); /** - * dev_put - get reference to device + * dev_put - release reference to device * @dev: network device * * Hold reference to device to keep it from being freed. @@ -1041,7 +1041,7 @@ static inline void dev_put(struct net_device *dev) } /** - * dev_hold - release reference to device + * dev_hold - get reference to device * @dev: network device * * Release reference to device to allow it to be freed. -
From: Roland Dreier <rdreier@cisco.com> Applied, thanks Roland. -
From: Roland Dreier <rdreier@cisco.com> Yes, I know, this is the one NAPI driver that hasn't been converted. It's a complicated conversion because of how the driver and the data structures have been arranged (in short, a mess) which makes it insanely difficult to get from a queue instance back up to a network device or similar. Further complicating things is that you need to setup a ppc32 cross-build environment to even build test a conversion, and I'm not comfortable doing the surgery until I can test build the thing. And this may be hard to believe, but other things have been more pressing than setting up a ppc32 cross-build environment :-) This is a hint of anyone looking for something to do that it'd be much appreciated for someone to tackle the ibm_emac conversion. Thanks. -
> Further complicating things is that you need to setup a ppc32 > cross-build environment to even build test a conversion, and I'm not > comfortable doing the surgery until I can test build the thing. OK, I actually have a system with a ppc 440 SoC that uses this driver, so I'll try to get things to the stage where I can boot net-2.6.24 on it and see if I can get the driver working... -
From: Roland Dreier <rdreier@cisco.com> Thanks a lot Roland. -
