Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue

From: Jeff Garzik
Date: Tuesday, October 9, 2007 - 5:56 pm

unfortunately it does not seem to build flawlessly:


drivers/net/bonding/bond_main.c: In function ‘bond_setup_by_slave’:
drivers/net/bonding/bond_main.c:1264: error: ‘struct net_device’ has no member named ‘hard_header’
drivers/net/bonding/bond_main.c:1264: error: ‘struct net_device’ has no member named ‘hard_header’
drivers/net/bonding/bond_main.c:1265: error: ‘struct net_device’ has no member named ‘rebuild_header’
drivers/net/bonding/bond_main.c:1265: error: ‘struct net_device’ has no member named ‘rebuild_header’
drivers/net/bonding/bond_main.c:1266: error: ‘struct net_device’ has no member named ‘hard_header_cache’
drivers/net/bonding/bond_main.c:1266: error: ‘struct net_device’ has no member named ‘hard_header_cache’
drivers/net/bonding/bond_main.c:1267: error: ‘struct net_device’ has no member named ‘header_cache_update’
drivers/net/bonding/bond_main.c:1267: error: ‘struct net_device’ has no member named ‘header_cache_update’
drivers/net/bonding/bond_main.c:1268: error: ‘struct net_device’ has no member named ‘hard_header_parse’
drivers/net/bonding/bond_main.c:1268: error: ‘struct net_device’ has no member named ‘hard_header_parse’
drivers/net/bonding/bond_main.c: In function ‘bond_release_and_destroy’:
drivers/net/bonding/bond_main.c:1864: warning: too few arguments for format

-

From: David Miller
Date: Tuesday, October 9, 2007 - 6:12 pm

From: Jeff Garzik <jeff@garzik.org>

Yeah, it doesn't handle Stephen Hemminger's header_ops change
in net-2.6.24.

-

From: Jay Vosburgh
Date: Tuesday, October 9, 2007 - 6:18 pm

Gaah.  I'll sort it out and repost.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
-

From: Moni Shoua
Date: Wednesday, October 10, 2007 - 9:03 am

Hi Jay, Jeff
Thanks for the help with making the patch compile under 2.6.24.
However, patch #3 is missing a line in bond_setup_by_slave that should look like this:

	bond_dev->header_ops        = slave_dev->header_ops;

I rewrote the patch and also fixed patch #8, which had become broken.

I would send the new patches now but there is more....
I also ran a test for the code in the branch of 2.6.24 and found a problem.
I see that ifconfig down doesn't return (for IPoIB interfaces) and it's stuck in napi_disable() in the kernel (any idea why?)

I am trying to solve it now so I'd like to wait a short time before applying these patches. 
I guess that I'll need to add something.



thanks
   MoniS
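
For readers following the build errors earlier in the thread, here is a rough userspace sketch of why a single header_ops assignment replaces the per-callback copies in bond_setup_by_slave. It is not kernel code: the callback signatures are reduced to placeholders, and demo_create and demo_header_ops are hypothetical names invented for illustration. The only line taken from the thread is the header_ops assignment itself.

```c
#include <assert.h>
#include <stddef.h>

struct net_device;

/* Simplified model: the header callbacks live in one shared table
 * instead of being separate members of struct net_device.
 * Signatures here are placeholders, not the kernel's. */
struct header_ops {
	int (*create)(struct net_device *dev);
	int (*parse)(const struct net_device *dev);
	int (*rebuild)(struct net_device *dev);
	int (*cache)(const struct net_device *dev);
	void (*cache_update)(struct net_device *dev);
};

struct net_device {
	const char *name;
	const struct header_ops *header_ops;
};

/* Hypothetical callback standing in for a slave's header builder. */
static int demo_create(struct net_device *dev)
{
	(void)dev;
	return 42;	/* arbitrary marker value for the demo */
}

/* Hypothetical ops table a slave device might point at. */
static const struct header_ops demo_header_ops = {
	.create = demo_create,
};

/* The bond inherits every header callback by copying one pointer. */
static void bond_setup_by_slave(struct net_device *bond_dev,
				const struct net_device *slave_dev)
{
	assert(bond_dev != NULL && slave_dev != NULL);
	bond_dev->header_ops = slave_dev->header_ops;
}
```

In this model, code that still assigns bond_dev->hard_header and friends cannot compile because those members no longer exist, which matches the errors reported above.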

-

From: Roland Dreier
Date: Wednesday, October 10, 2007 - 11:31 am

> I also ran a test for the code in the branch of 2.6.24 and found a problem.
 > I see that ifconfig down doesn't return (for IPoIB interfaces) and it's stuck in napi_disable() in the kernel (any idea why?)

For what it's worth, I took the upstream 2.6.23 git tree and merged in
Dave's latest net-2.6.24 tree and my latest for-2.6.24 tree and tried
that.  I brought up an IPoIB interface, sent a few pings, and did
ifconfig down, and it worked fine.

Can you try the same thing without the bonding patches to see if your
setup works OK too?

Also can you give more details about what you do to get ifconfig down stuck?

 - R.
-

From: Moni Shoua
Date: Thursday, October 11, 2007 - 7:48 am

Without bonding ifconfig down works fine. 
It happens only when ib interfaces are slaves of a bonding device.
I thought before that it was stuck in napi_disable(), and that is almost right.
I put prints before and after the call to napi_disable() and see that it is called twice.
I'll try to investigate in this direction.

ib0: stopping interface
ib0: before napi_disable
ib0: after napi_disable
ib0: downing ib_dev
ib0: All sends and receives done.
ib0: stopping interface
ib0: before napi_disable



There is also a dump of the kernel log after 'echo t > /proc/sysrq-trigger' (for ifconfig)

SysRq : Show State

ifconfig      S 0000000000000000     0  6311   6099
 ffff810034f49d18 0000000000000086 0000000000000000 ffffffffffffffff
 ffff810037e747c0 ffff810037e747c0 000000013481e000 ffff81003a851a78
 ffff81003a851840 000000003b0c8c00 0000000000000000 00000000802358ee
Call Trace:
 [<ffffffff8023cc89>] lock_timer_base+0x24/0x49
 [<ffffffff80403754>] schedule_timeout+0x8a/0xad
 [<ffffffff8023d241>] process_timeout+0x0/0x5
 [<ffffffff8023d6ec>] msleep_interruptible+0x11/0x39
 [<ffffffff884081a7>] :ib_ipoib:ipoib_stop+0x64/0x12c
 [<ffffffff8039fc07>] dev_close+0x3e/0x56
 [<ffffffff803a1c31>] dev_change_flags+0xa7/0x15f
 [<ffffffff803e5bee>] devinet_ioctl+0x293/0x5ed
 [<ffffffff803e775b>] inet_ioctl+0x7f/0x9d
 [<ffffffff80395b2e>] sock_ioctl+0x0/0x1fe
 [<ffffffff80395d08>] sock_ioctl+0x1da/0x1fe
 [<ffffffff802947d9>] do_ioctl+0x29/0x6f
 [<ffffffff80294a75>] vfs_ioctl+0x256/0x267
 [<ffffffff80294adf>] sys_ioctl+0x59/0x7a
 [<ffffffff8020bc0e>] system_call+0x7e/0x83


-

From: Roland Dreier
Date: Thursday, October 11, 2007 - 1:17 pm

> It happens only when ib interfaces are slaves of a bonding device.
 > I thought before that the stuck is in napi_disable() but it's almost right.
 > I put prints before and after call to napi_disable and see that it is called twice.
 > I'll try to investigate in this direction.
 > 
 > ib0: stopping interface
 > ib0: before napi_disable
 > ib0: after napi_disable
 > ib0: downing ib_dev
 > ib0: All sends and receives done.
 > ib0: stopping interface
 > ib0: before napi_disable

Yes, two napi_disable()s in a row without a matching napi_enable()
will deadlock.  I guess the question is why the ipoib interface is
being stopped twice.

If you just take the net-2.6.24 tree (without bonding patches), does
bonding for ethernet interfaces work OK, or is there a similar problem
with double napi_disable()?  How about bonding of ethernet after this
batch of bonding patches?

 - R.
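
A toy model of the hang described above, under the assumption that napi_disable() waits until it can claim a per-NAPI "scheduled" flag and that only a matching napi_enable() gives the flag back. The struct and function names below are hypothetical, and the wait loop is bounded so the sketch terminates where the real call would keep sleeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the NAPI "scheduled" state bit. */
struct napi_model {
	bool sched;
};

/* One attempt to claim the flag, as the disable path is assumed to do. */
static bool napi_try_disable(struct napi_model *n)
{
	if (n->sched)
		return false;	/* already claimed: the real call would sleep and retry */
	n->sched = true;
	return true;
}

/* Bounded version of the wait loop so the model can report a hang. */
static bool napi_disable_bounded(struct napi_model *n, int max_tries)
{
	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++)
		if (napi_try_disable(n))
			return true;
	return false;	/* nobody released the flag: this is the deadlock */
}

/* Releases the flag claimed by a previous disable. */
static void napi_enable_model(struct napi_model *n)
{
	assert(n->sched);
	n->sched = false;
}
```

In this model the first disable succeeds at once, a second disable with no enable in between never succeeds no matter how many tries it gets, and it succeeds again only after napi_enable_model() has run.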
-

From: Jay Vosburgh
Date: Thursday, October 11, 2007 - 3:01 pm

Roland Dreier <rdreier@cisco.com> wrote:

	I just checked this on an x86 box.  The bonding in stock net-2.6
pulled this morning or last night works ok (I did some basic tests,
including ifconfig down / up, with e100).  This remains true with the
IPoIB bonding patches applied.  I do not have hardware available to test
IPoIB.

	I did get a whammy from tg3, but I think this is unrelated to
bonding (as it happens when tg3 comes up, before bonding is involved):

BUG: unable to handle kernel paging request at virtual address 00004214
 printing eip:
e0828017
*pde = 00000000
Oops: 0002 [#1]
SMP
Modules linked in: thermal processor fan button loop e1000 sg evdev tg3 e100 rtb
CPU:    0
EIP:    0060:[<e0828017>]    Not tainted VLI
EFLAGS: 00010206   (2.6.23-ipv6 #1)
EIP is at tg3_ape_write32+0x7/0x10 [tg3]
eax: de9304c0   ebx: dde8fe18   ecx: 00000000   edx: 00004214
esi: de9304c0   edi: 00000000   ebp: dde8fe28   esp: dde8fdd4
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process ip (pid: 2817, ti=dde8e000 task=dff4e0b0 task.ti=dde8e000)
Stack: e082fb2e 00000000 dde8fdf4 c01ece3e dde8fdf8 000003fe 00000000 00005400
       08000000 00001aa0 e083b340 08001aa0 00000060 e083ce00 08001b20 00000030
       e083ce80 00000101 de9304c0 00000001 dde56800 dde8fe38 e0830178 dff69000
Call Trace:
 [<c010536a>] show_trace_log_lvl+0x1a/0x30
 [<c0105429>] show_stack_log_lvl+0xa9/0xd0
 [<c0105639>] show_registers+0x1e9/0x2f0
 [<c0105851>] die+0x111/0x260
 [<c011c5dc>] do_page_fault+0x18c/0x6a0
 [<c0319bea>] error_code+0x72/0x78
 [<e0830178>] tg3_init_hw+0x38/0x50 [tg3]
 [<e0838886>] tg3_open+0x276/0x5d0 [tg3]
 [<c02aead8>] dev_open+0x38/0x80
 [<c02ad5cd>] dev_change_flags+0x7d/0x1a0
 [<c02f63d8>] devinet_ioctl+0x4c8/0x660
 [<c02f698b>] inet_ioctl+0x6b/0x90
 [<c02a0e5a>] sock_ioctl+0x5a/0x210
 [<c017cd98>] do_ioctl+0x28/0x80
 [<c017ce47>] vfs_ioctl+0x57/0x290
 [<c017d0b9>] sys_ioctl+0x39/0x60
 [<c01042a2>] sysenter_past_esp+0x5f/0x99
 =======================
Code: <89> 0a c3 8d b6 00 00 ...

-

From: Moni Shoua
Date: Saturday, October 13, 2007 - 8:24 am

I will be near my lab only tomorrow...
I will check this and let you know.

-

From: Moni Shoua
Date: Sunday, October 14, 2007 - 8:51 am

Ok, I think I know what happens here.
When bonding gets a NETDEV_GOING_DOWN event it releases the slave and
in the process closes the slave device (this is new code). ifconfig, on the other hand,
closes the device one more time, and this is why we see two napi_disable() calls in a row.

The fix, in my opinion, is in bonding: it should react to NETDEV_UNREGISTER and not to NETDEV_GOING_DOWN.
I want to test this point and if it's good I'll submit new patches.


thanks
  MoniS
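
The double close described in this message can be sketched as a small userspace model. Everything here is hypothetical (the event enum, the struct, the function names); it only mirrors the sequence reported in the thread, where the going-down notification reaches bonding before the caller's own close runs, and bonding's release path closes the slave as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event codes; the kernel's NETDEV_* values are not reproduced. */
enum netdev_event_model { EV_GOING_DOWN, EV_DOWN, EV_UNREGISTER };

struct slave_model {
	bool enslaved;
	int close_calls;	/* each close would call napi_disable() once */
};

static void dev_close_model(struct slave_model *s)
{
	assert(s);
	s->close_calls++;
}

/* Releasing a slave also closes it, as described in the message above. */
static void bond_release_model(struct slave_model *s)
{
	if (!s->enslaved)
		return;
	s->enslaved = false;
	dev_close_model(s);
}

/* Bonding's reaction to device events; the flag selects old vs proposed behaviour. */
static void bond_slave_event_model(struct slave_model *s,
				   enum netdev_event_model ev,
				   bool release_on_going_down)
{
	switch (ev) {
	case EV_GOING_DOWN:
		if (release_on_going_down)
			bond_release_model(s);
		break;
	case EV_UNREGISTER:
		bond_release_model(s);
		break;
	default:
		break;
	}
}

/* "ifconfig down": listeners are notified first, then the device is closed. */
static void ifconfig_down_model(struct slave_model *s, bool release_on_going_down)
{
	bond_slave_event_model(s, EV_GOING_DOWN, release_on_going_down);
	dev_close_model(s);
}
```

With release_on_going_down set, one ifconfig down yields two closes of the same slave; with it cleared, there is one close and the slave stays enslaved until an unregister event arrives.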

-
