Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

Previous thread: InfiniBand/RDMA merge plans for 2.6.24 by Roland Dreier on Thursday, September 13, 2007 - 10:57 am. (26 messages)

Next thread: [PATCH] i386: Compaq EVO N800c needs PCI bus renumbering by Juha Laiho on Thursday, September 13, 2007 - 11:21 am. (1 message)
From: Steve Wise
Date: Thursday, September 13, 2007 - 12:16 pm

iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

Version 2:

- added a per-device mutex for the address and listening endpoints lists.

- wait for all replies if sending multiple passive_open requests to rnic.

- log warning if no addresses are available when a listen is issued.

- tested

---

Design:

The sysadmin creates "for iwarp use only" alias interfaces of the form
"devname:iw*" where devname is the native interface name (eg eth0) for the
iwarp netdev device.  The alias label can be anything starting with "iw".
The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver.

EG:
	ifconfig eth0 192.168.70.123 up
	ifconfig eth0:iw1 192.168.71.123 up
	ifconfig eth0:iw2 192.168.72.123 up

In the above example, 192.168.70/24 is for TCP traffic, while
192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use.

The rdma-only interface must be on its own IP subnet. This allows routing
all rdma traffic onto this interface.

The iWARP driver must translate all listens on address 0.0.0.0 to the
set of rdma-only ip addresses for the device in question.  This prevents
incoming connect requests to the TCP ipaddresses from going up the
rdma stack.

Implementation Details:

- The iw_cxgb3 driver registers for inetaddr events via
register_inetaddr_notifier().  This allows tracking the iwarp-only
addresses/subnets as they get added and deleted.  The iwarp driver
maintains a list of the current iwarp-only addresses.

- The iw_cxgb3 driver builds the list of iwarp-only addresses for its
devices at module insert time.  This is needed because the inetaddr
notifier callbacks don't "replay" address-add events when someone
registers.  So the driver must build the initial list at module load time.

- When a listen is done on address 0.0.0.0, then the iw_cxgb3 driver
must translate that into a set of listens on the iwarp-only addresses.
This is implemented by maintaining a list of stid/addr entries per
listening endpoint.

- When a new ...
From: Sean Hefty
Date: Thursday, September 13, 2007 - 12:54 pm

I've only given this a high level review at this point, and while the 
patch looks okay on first pass, is there a way to move some of this 
functionality to either the rdma_cm or iw_cm?  I don't like the idea of 
every iwarp driver having to implement address/listen list maintenance. 

There are a couple of areas that I made a note to look at in more detail 
(because I didn't understand everything that was happening), but I did 
have one minor nit - most uses of list_del_init can just be list_del.

- Sean
-

From: Steve Wise
Date: Saturday, September 15, 2007 - 7:07 am

I think the translating of listen requests from 0.0.0.0->specific 

Ok.

-

From: Steve Wise
Date: Sunday, September 23, 2007 - 1:33 pm

Note: some rnic drivers might want to support this differently.  So 
maybe we don't want this in the iwcm yet until we see that more iwarp 

fixed.

-

From: Evgeniy Polyakov
Date: Friday, September 14, 2007 - 6:09 am

If the only solutions to solve a problem with hardware are to steal
packets or became a real device, then real device is much more



I.e. it is not allowed to create ':iw' alias for anyone else?

Do you know, that cxgb3 function names suck? :)

Wants to be sizeof(struct iwch_listen_entry) and in other places too.

I skipped rdma internals of the patch, since I do not know it enough 
to judge, but your approach looks good from core network point of view.
Maybe you should automatically create an alias each time new interface
is added so that admin would not care about proper aliases?

-- 
	Evgeniy Polyakov
-

From: Roland Dreier
Date: Friday, September 14, 2007 - 9:06 am

> Maybe you should automatically create an alias each time new interface
 > is added so that admin would not care about proper aliases?

I agree that makes much more sense from a user interface point of
view.  Unfortunately an alias without an address doesn't make sense,
so there doesn't seem to be a way to implement that.

 - R.
-

From: Steve Wise
Date: Saturday, September 15, 2007 - 8:56 am

This is a real device.  I don't understand your question?  Packets 

No, insert_ifa() allocates a struct iwch_addrlist, which has 2 fields: a 

There are two causes where this is called: 1) during module init to 
populate the list of iwarp addresses.  If we failed in that case then, I 
_could_ then not register.  2) we get called via the notifier mechanism 
when an address is added.  If that fails, the caller doesn't care (since 
we're on the notifier callout thread).  But the code could perhaps 
unregister the device.  I prefer just logging an error in case 2.  I'll 
look into not registering if we cannot get any address due to lack of 
memory.  But there's another case:  we load the module and the admin 
hasn't yet created the ethX:iw interface.

Perhaps I should change the code to only register as a working rdma 


Do you mean I shouldn't use sizeof *le, but rather sizeof(struct 

That would be much better IMO, but the problem is that I cannot create 
an alias without an actual ip address.  Unless we change the core 
services to allow it.

Thanks for reviewing!

Steve.


-

From: Evgeniy Polyakov
Date: Sunday, September 16, 2007 - 7:22 am

Hi Steve.



sizeof(struct iwch_addrlist) of course, not (*addr).

Does second case ends up with problem you described in the initial

Does creating the whole new netdevice is a too big overhead, or is it


-- 
	Evgeniy Polyakov
-

From: Steve Wise
Date: Monday, September 17, 2007 - 8:25 am

No, the 2nd case (a failure to get the list of iwarp-only ip addresses) 
will cause rdma apps to not receive any incoming connections.

Consider we have eth1 with 1.1.1.1/24.  Next, the admin creates the 
iwarp interface thusly:

ifconfig eth1:iw 2.2.2.2 netmask 255.255.255.0 up

The iw_cxgb3 driver gets a netblock notifier event for the addition of 
the eth1:iw interface, but FAILS to alloc the memory to keep track that 
2.2.2.2 is the iwarp-only address.

Next an rdma app binds to 0.0.0.0, port X.  The iw_cxgb3 attempts to map 
0.0.0.0 to the set of valid iwarp addresses, but there are none. 
iw_cxgb3 logs a warning saying no addresses are available.  The 
application ends up not listening to any address.

So TCP apps aren't affected.

Also, if the admin notes the error log entry, and re-initializes the 
eth1:iw interface -and- this time the kmalloc() works, then the existing 
rdma app bound to 0.0.0.0 port X will then start receiving connect 
requests to 2.2.2.2 port X.


I think its too big overhead, and pretty invasive on the low level cxgb3 
driver.  I think having a device in the 'ifconfig -a' after iw_cxgb3 is 
loaded and devices discovered would be a good thing for the admin.  This 
is the angle Roland suggested.  I'm just not sure how to implement it.

But if someone could explain how I might create this full netdevice as a 
pseudo device on top of the real one, maybe I could implement it.

Note that non TCP traffic still needs to utilize this interface for ND 

Ok, now I get it. :)


Steve.
-

From: Evgeniy Polyakov
Date: Wednesday, September 19, 2007 - 3:56 am

Hi Steve.


Just a though - what about allowing secondary addresses with the same
address as main one? I.e. change bit of the core code to allow creating
aliases with the same address as main device, so that you would be able
to create ':iw' alias during rdma device initialization?

If this solution is not acceptible, then I belive your alias change is
the way to go.

-- 
	Evgeniy Polyakov
-

From: Steve Wise
Date: Friday, September 21, 2007 - 8:10 am

The problem is that on rdma route/address resolution the rdma core CM 
uses the routing table to look up which local device to use.  So what we 
need is separate ip subnets for rdma vs non rdma tcp.

Also, to avoid the original issue of 4-tuple conflicts, the rdma device 
_must_ listen on specific local "rdma-only" ip addresses and thus they 
must be not the same address as that used for native host tcp traffic.

Steve.
-

From: Sean Hefty
Date: Monday, September 17, 2007 - 9:09 am

See chapter 14 of CodingStyle document.  kmalloc(sizeof *addr... is correct.

- Sean
-

From: Evgeniy Polyakov
Date: Monday, September 17, 2007 - 9:17 am

Come on, do not start a flame war about how parameters into kmalloc
should be provided - there are much more serious issues unresolved yes. It
does help grepping the code, but if you feel that this is a serious threat,
then use your preferred way.

-- 
	Evgeniy Polyakov
-

Previous thread: InfiniBand/RDMA merge plans for 2.6.24 by Roland Dreier on Thursday, September 13, 2007 - 10:57 am. (26 messages)

Next thread: [PATCH] i386: Compaq EVO N800c needs PCI bus renumbering by Juha Laiho on Thursday, September 13, 2007 - 11:21 am. (1 message)