Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

Previous thread: InfiniBand/RDMA merge plans for 2.6.24 by Roland Dreier on Thursday, September 13, 2007 - 1:57 pm. (26 messages)

Next thread: [PATCH] i386: Compaq EVO N800c needs PCI bus renumbering by Juha Laiho on Thursday, September 13, 2007 - 2:21 pm. (1 message)
To: <rdreier@...>, <sean.hefty@...>
Cc: <netdev@...>, <linux-kernel@...>, <general@...>
Date: Thursday, September 13, 2007 - 3:16 pm

iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

Version 2:

- added a per-device mutex for the address and listening endpoints lists.

- wait for all replies if sending multiple passive_open requests to rnic.

- log warning if no addresses are available when a listen is issued.

- tested

---

Design:

The sysadmin creates "for iwarp use only" alias interfaces of the form
"devname:iw*" where devname is the native interface name (eg eth0) for the
iwarp netdev device. The alias label can be anything starting with "iw".
The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver.

EG:
ifconfig eth0 192.168.70.123 up
ifconfig eth0:iw1 192.168.71.123 up
ifconfig eth0:iw2 192.168.72.123 up

In the above example, 192.168.70/24 is for TCP traffic, while
192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use.

The rdma-only interface must be on its own IP subnet. This allows routing
all rdma traffic onto this interface.

The iWARP driver must translate all listens on address 0.0.0.0 to the
set of rdma-only ip addresses for the device in question. This prevents
incoming connect requests to the TCP ipaddresses from going up the
rdma stack.

Implementation Details:

- The iw_cxgb3 driver registers for inetaddr events via
register_inetaddr_notifier(). This allows tracking the iwarp-only
addresses/subnets as they get added and deleted. The iwarp driver
maintains a list of the current iwarp-only addresses.

- The iw_cxgb3 driver builds the list of iwarp-only addresses for its
devices at module insert time. This is needed because the inetaddr
notifier callbacks don't "replay" address-add events when someone
registers. So the driver must build the initial list at module load time.

- When a listen is done on address 0.0.0.0, then the iw_cxgb3 driver
must translate that into a set of listens on the iwarp-only addresses.
This is implemented by maintaining a list of stid/addr entries per
listening endpoint.

- When a new iwarp-only ...

To: Steve Wise <swise@...>
Cc: <rdreier@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Friday, September 14, 2007 - 9:09 am

If the only solutions to solve a problem with hardware are to steal
packets or became a real device, then real device is much more

I.e. it is not allowed to create ':iw' alias for anyone else?

Do you know, that cxgb3 function names suck? :)

Wants to be sizeof(struct iwch_listen_entry) and in other places too.

I skipped rdma internals of the patch, since I do not know it enough
to judge, but your approach looks good from core network point of view.
Maybe you should automatically create an alias each time new interface
is added so that admin would not care about proper aliases?

--
Evgeniy Polyakov
-

To: Evgeniy Polyakov <johnpol@...>
Cc: <rdreier@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Saturday, September 15, 2007 - 11:56 am

This is a real device. I don't understand your question? Packets

No, insert_ifa() allocates a struct iwch_addrlist, which has 2 fields: a

There are two causes where this is called: 1) during module init to
populate the list of iwarp addresses. If we failed in that case then, I
_could_ then not register. 2) we get called via the notifier mechanism
when an address is added. If that fails, the caller doesn't care (since
we're on the notifier callout thread). But the code could perhaps
unregister the device. I prefer just logging an error in case 2. I'll
look into not registering if we cannot get any address due to lack of
memory. But there's another case: we load the module and the admin
hasn't yet created the ethX:iw interface.

Perhaps I should change the code to only register as a working rdma

Do you mean I shouldn't use sizeof *le, but rather sizeof(struct

That would be much better IMO, but the problem is that I cannot create
an alias without an actual ip address. Unless we change the core
services to allow it.

Thanks for reviewing!

Steve.

-

To: Steve Wise <swise@...>
Cc: Evgeniy Polyakov <johnpol@...>, <netdev@...>, <rdreier@...>, <general@...>, <linux-kernel@...>
Date: Monday, September 17, 2007 - 12:09 pm

See chapter 14 of CodingStyle document. kmalloc(sizeof *addr... is correct.

- Sean
-

To: Sean Hefty <mshefty@...>
Cc: Steve Wise <swise@...>, <netdev@...>, <rdreier@...>, <general@...>, <linux-kernel@...>
Date: Monday, September 17, 2007 - 12:17 pm

Come on, do not start a flame war about how parameters into kmalloc
should be provided - there are much more serious issues unresolved yes. It
does help grepping the code, but if you feel that this is a serious threat,
then use your preferred way.

--
Evgeniy Polyakov
-

To: Steve Wise <swise@...>
Cc: <rdreier@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Sunday, September 16, 2007 - 10:22 am

Hi Steve.

sizeof(struct iwch_addrlist) of course, not (*addr).

Does second case ends up with problem you described in the initial

Does creating the whole new netdevice is a too big overhead, or is it

--
Evgeniy Polyakov
-

To: Evgeniy Polyakov <johnpol@...>
Cc: <rdreier@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Monday, September 17, 2007 - 11:25 am

No, the 2nd case (a failure to get the list of iwarp-only ip addresses)
will cause rdma apps to not receive any incoming connections.

Consider we have eth1 with 1.1.1.1/24. Next, the admin creates the
iwarp interface thusly:

ifconfig eth1:iw 2.2.2.2 netmask 255.255.255.0 up

The iw_cxgb3 driver gets a netblock notifier event for the addition of
the eth1:iw interface, but FAILS to alloc the memory to keep track that
2.2.2.2 is the iwarp-only address.

Next an rdma app binds to 0.0.0.0, port X. The iw_cxgb3 attempts to map
0.0.0.0 to the set of valid iwarp addresses, but there are none.
iw_cxgb3 logs a warning saying no addresses are available. The
application ends up not listening to any address.

So TCP apps aren't affected.

Also, if the admin notes the error log entry, and re-initializes the
eth1:iw interface -and- this time the kmalloc() works, then the existing
rdma app bound to 0.0.0.0 port X will then start receiving connect
requests to 2.2.2.2 port X.

I think its too big overhead, and pretty invasive on the low level cxgb3
driver. I think having a device in the 'ifconfig -a' after iw_cxgb3 is
loaded and devices discovered would be a good thing for the admin. This
is the angle Roland suggested. I'm just not sure how to implement it.

But if someone could explain how I might create this full netdevice as a
pseudo device on top of the real one, maybe I could implement it.

Note that non TCP traffic still needs to utilize this interface for ND

Ok, now I get it. :)

Steve.
-

To: Steve Wise <swise@...>
Cc: <rdreier@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Wednesday, September 19, 2007 - 6:56 am

Hi Steve.

Just a though - what about allowing secondary addresses with the same
address as main one? I.e. change bit of the core code to allow creating
aliases with the same address as main device, so that you would be able
to create ':iw' alias during rdma device initialization?

If this solution is not acceptible, then I belive your alias change is
the way to go.

--
Evgeniy Polyakov
-

To: Evgeniy Polyakov <johnpol@...>
Cc: <rdreier@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Friday, September 21, 2007 - 11:10 am

The problem is that on rdma route/address resolution the rdma core CM
uses the routing table to look up which local device to use. So what we
need is separate ip subnets for rdma vs non rdma tcp.

Also, to avoid the original issue of 4-tuple conflicts, the rdma device
_must_ listen on specific local "rdma-only" ip addresses and thus they
must be not the same address as that used for native host tcp traffic.

Steve.
-

To: Evgeniy Polyakov <johnpol@...>
Cc: Steve Wise <swise@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Friday, September 14, 2007 - 12:06 pm

> Maybe you should automatically create an alias each time new interface
> is added so that admin would not care about proper aliases?

I agree that makes much more sense from a user interface point of
view. Unfortunately an alias without an address doesn't make sense,
so there doesn't seem to be a way to implement that.

- R.
-

To: Steve Wise <swise@...>
Cc: <rdreier@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Thursday, September 13, 2007 - 3:54 pm

I've only given this a high level review at this point, and while the
patch looks okay on first pass, is there a way to move some of this
functionality to either the rdma_cm or iw_cm? I don't like the idea of
every iwarp driver having to implement address/listen list maintenance.

There are a couple of areas that I made a note to look at in more detail
(because I didn't understand everything that was happening), but I did
have one minor nit - most uses of list_del_init can just be list_del.

- Sean
-

To: Sean Hefty <mshefty@...>
Cc: <rdreier@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Sunday, September 23, 2007 - 4:33 pm

Note: some rnic drivers might want to support this differently. So
maybe we don't want this in the iwcm yet until we see that more iwarp

fixed.

-

To: Sean Hefty <mshefty@...>
Cc: <rdreier@...>, <sean.hefty@...>, <netdev@...>, <linux-kernel@...>, <general@...>
Date: Saturday, September 15, 2007 - 10:07 am

I think the translating of listen requests from 0.0.0.0->specific

Ok.

-

Previous thread: InfiniBand/RDMA merge plans for 2.6.24 by Roland Dreier on Thursday, September 13, 2007 - 1:57 pm. (26 messages)

Next thread: [PATCH] i386: Compaq EVO N800c needs PCI bus renumbering by Juha Laiho on Thursday, September 13, 2007 - 2:21 pm. (1 message)