Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices

Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
From: Andy Gospodarek
Date: Friday, June 4, 2010 - 12:18 pm

On Wed, Jun 02, 2010 at 06:04:45PM +0800, Cong Wang wrote:

Sorry I've been silent until now.  This does seem quite similar to a
problem I've previously encountered when dealing with bonding+netpoll on
some old 2.6.9-based kernels.  There is no guarantee the methods used
there will apply here, but I'll talk about them anyway.

As Flavio noticed, recursive calls into the bond transmit routines were
not a good idea.  I discovered the same and worked around this issue by
checking to see if we could take the bond->lock for writing before
continuing.  If we could not get, I wanted to signal that this should be
queued for transmission later.  Based on the flow of netpoll_send_skb
(or possibly for another reason that is escaping me right now) I added
one of these checks in bond_poll_controller too.  These aren't the
prettiest fixes, but seemed to work well for me when I did this work in
the past.  I realize the differences are not that great compared to some
of the patches posted by Flavio, but I think they are worth trying.

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index ef60244..d7b9b99 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1290,6 +1290,12 @@ static bool slaves_support_netpoll(struct net_device *bond_dev)
 static void bond_poll_controller(struct net_device *bond_dev)
 {
 	struct net_device *dev = bond_dev->npinfo->netpoll->real_dev;
+	struct bonding *bond = netdev_priv(bond_dev);
+
+	if (!write_trylock(&bond->lock))
+		return;
+	write_unlock(&bond->lock);
+
 	if (dev != bond_dev)
 		netpoll_poll_dev(dev);
 }
@@ -4418,7 +4424,11 @@ static void bond_set_xmit_hash_policy(struct bonding *bond)
 
 static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	const struct bonding *bond = netdev_priv(dev);
+	struct bonding *bond = netdev_priv(dev);
+
+	if (!write_trylock(&bond->lock))
+		return NETDEV_TX_BUSY;
+	write_unlock(&bond->lock);
 
 	switch (bond->params.mode) {
 	case BOND_MODE_ROUNDROBIN:

The other key to all of this is to make sure that queuing is done
correctly now that we expect to queue these frames and have them sent at
some point when there is a member of the bond that is actually capable
of sending them out.

The new style of sending queued skbs in a workqueue is much better than
what was done in the 2.6.9 timeframe, but careful attention should still
be paid to txq lock and which processor is the owner.  Returning
something other than NETDEV_TX_OK from bond_start_xmit and checking for
locks being held there should also help with any deadlocks that show up
while running in queue_process (though they would not be recursive).

I'm not in a good spot to test this right now, but I can take a look at
next week and we can try and track down any of the other deadlocks that
currently exist as I suspect this will not resolve all of the issues.
--
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
[v5 Patch 2/3] bridge: make bridge support netpoll, Amerigo Wang, (Wed May 5, 1:11 am)
[v5 Patch 3/3] bonding: make bonding support netpoll, Amerigo Wang, (Wed May 5, 1:11 am)
Re: [v5 Patch 1/3] netpoll: add generic support for bridge ..., Andy Gospodarek, (Fri Jun 4, 12:18 pm)
[PATCH] netconsole: queue console messages to send later, Flavio Leitner, (Mon Jun 7, 12:24 pm)
Re: [PATCH] netconsole: queue console messages to send later, Stephen Hemminger, (Mon Jun 7, 1:00 pm)