iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts. Version 2: - added a per-device mutex for the address and listening endpoints lists. - wait for all replies if sending multiple passive_open requests to rnic. - log warning if no addresses are available when a listen is issued. - tested --- Design: The sysadmin creates "for iwarp use only" alias interfaces of the form "devname:iw*" where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with "iw". The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver. EG: ifconfig eth0 192.168.70.123 up ifconfig eth0:iw1 192.168.71.123 up ifconfig eth0:iw2 192.168.72.123 up In the above example, 192.168.70/24 is for TCP traffic, while 192.168.71/24 and 192.168.72/24 are for iWARP/RDMA use. The rdma-only interface must be on its own IP subnet. This allows routing all rdma traffic onto this interface. The iWARP driver must translate all listens on address 0.0.0.0 to the set of rdma-only ip addresses for the device in question. This prevents incoming connect requests to the TCP ipaddresses from going up the rdma stack. Implementation Details: - The iw_cxgb3 driver registers for inetaddr events via register_inetaddr_notifier(). This allows tracking the iwarp-only addresses/subnets as they get added and deleted. The iwarp driver maintains a list of the current iwarp-only addresses. - The iw_cxgb3 driver builds the list of iwarp-only addresses for its devices at module insert time. This is needed because the inetaddr notifier callbacks don't "replay" address-add events when someone registers. So the driver must build the initial list at module load time. - When a listen is done on address 0.0.0.0, then the iw_cxgb3 driver must translate that into a set of listens on the iwarp-only addresses. This is implemented by maintaining a list of stid/addr entries per listening endpoint. - When a new ...
I've only given this a high level review at this point, and while the patch looks okay on first pass, is there a way to move some of this functionality to either the rdma_cm or iw_cm? I don't like the idea of every iwarp driver having to implement address/listen list maintenance. There are a couple of areas that I made a note to look at in more detail (because I didn't understand everything that was happening), but I did have one minor nit - most uses of list_del_init can just be list_del. - Sean -
I think the translating of listen requests from 0.0.0.0->specific Ok. -
Note: some rnic drivers might want to support this differently. So maybe we don't want this in the iwcm yet until we see that more iwarp fixed. -
If the only solutions to solve a problem with hardware are to steal packets or became a real device, then real device is much more I.e. it is not allowed to create ':iw' alias for anyone else? Do you know, that cxgb3 function names suck? :) Wants to be sizeof(struct iwch_listen_entry) and in other places too. I skipped rdma internals of the patch, since I do not know it enough to judge, but your approach looks good from core network point of view. Maybe you should automatically create an alias each time new interface is added so that admin would not care about proper aliases? -- Evgeniy Polyakov -
> Maybe you should automatically create an alias each time new interface > is added so that admin would not care about proper aliases? I agree that makes much more sense from a user interface point of view. Unfortunately an alias without an address doesn't make sense, so there doesn't seem to be a way to implement that. - R. -
This is a real device. I don't understand your question? Packets No, insert_ifa() allocates a struct iwch_addrlist, which has 2 fields: a There are two causes where this is called: 1) during module init to populate the list of iwarp addresses. If we failed in that case then, I _could_ then not register. 2) we get called via the notifier mechanism when an address is added. If that fails, the caller doesn't care (since we're on the notifier callout thread). But the code could perhaps unregister the device. I prefer just logging an error in case 2. I'll look into not registering if we cannot get any address due to lack of memory. But there's another case: we load the module and the admin hasn't yet created the ethX:iw interface. Perhaps I should change the code to only register as a working rdma Do you mean I shouldn't use sizeof *le, but rather sizeof(struct That would be much better IMO, but the problem is that I cannot create an alias without an actual ip address. Unless we change the core services to allow it. Thanks for reviewing! Steve. -
Hi Steve. sizeof(struct iwch_addrlist) of course, not (*addr). Does second case ends up with problem you described in the initial Does creating the whole new netdevice is a too big overhead, or is it -- Evgeniy Polyakov -
No, the 2nd case (a failure to get the list of iwarp-only ip addresses) will cause rdma apps to not receive any incoming connections. Consider we have eth1 with 1.1.1.1/24. Next, the admin creates the iwarp interface thusly: ifconfig eth1:iw 2.2.2.2 netmask 255.255.255.0 up The iw_cxgb3 driver gets a netblock notifier event for the addition of the eth1:iw interface, but FAILS to alloc the memory to keep track that 2.2.2.2 is the iwarp-only address. Next an rdma app binds to 0.0.0.0, port X. The iw_cxgb3 attempts to map 0.0.0.0 to the set of valid iwarp addresses, but there are none. iw_cxgb3 logs a warning saying no addresses are available. The application ends up not listening to any address. So TCP apps aren't affected. Also, if the admin notes the error log entry, and re-initializes the eth1:iw interface -and- this time the kmalloc() works, then the existing rdma app bound to 0.0.0.0 port X will then start receiving connect requests to 2.2.2.2 port X. I think its too big overhead, and pretty invasive on the low level cxgb3 driver. I think having a device in the 'ifconfig -a' after iw_cxgb3 is loaded and devices discovered would be a good thing for the admin. This is the angle Roland suggested. I'm just not sure how to implement it. But if someone could explain how I might create this full netdevice as a pseudo device on top of the real one, maybe I could implement it. Note that non TCP traffic still needs to utilize this interface for ND Ok, now I get it. :) Steve. -
Hi Steve. Just a though - what about allowing secondary addresses with the same address as main one? I.e. change bit of the core code to allow creating aliases with the same address as main device, so that you would be able to create ':iw' alias during rdma device initialization? If this solution is not acceptible, then I belive your alias change is the way to go. -- Evgeniy Polyakov -
The problem is that on rdma route/address resolution the rdma core CM uses the routing table to look up which local device to use. So what we need is separate ip subnets for rdma vs non rdma tcp. Also, to avoid the original issue of 4-tuple conflicts, the rdma device _must_ listen on specific local "rdma-only" ip addresses and thus they must be not the same address as that used for native host tcp traffic. Steve. -
See chapter 14 of CodingStyle document. kmalloc(sizeof *addr... is correct. - Sean -
Come on, do not start a flame war about how parameters into kmalloc should be provided - there are much more serious issues unresolved yes. It does help grepping the code, but if you feel that this is a serious threat, then use your preferred way. -- Evgeniy Polyakov -
