My apologies for the multiple post. I got bit the first time around by my MUA's configuration.

----

Greetings.

For some time now, the kernel and I have been having an argument over what the MTU should be for serving NFS over Infiniband. I say 65520, the documented maximum for connected mode. But, so far, I've been unable to have anything over 32192 remain stable.

Back in the 2.6.14 -> .15 period, sunrpc's sk_buff allocations were changed from GFP_KERNEL to GFP_ATOMIC (mainline commit b079fa7baa86b47579f3f60f86d03d21c76159b8). Understandably, this was to prevent recursion through the NFS and sunrpc code. This is fine for the most common MTU out there, as the kernel is almost certain to find a free page. But, as one increases the MTU, memory fragmentation starts to play a role in nixing these allocations.

These allocation failures ultimately result in sparse files being written through NFS. Granted, many of my users' applications are oblivious to this because they don't check for such errors. But it would be nice if the kernel were more resilient in this regard.

For a few months now, I've been running with sunrpc sk_buff allocations using GFP_NOFS instead, which allows for dirty data to be flushed out and still avoids recursion through sunrpc. With this, I've been able to increase the stable MTU to 32192. But no further, as eventually there is no dirty data left and memory fragmentation becomes mostly due to yet-to-be-sync'ed filesystem data. There's also the matter that using GFP_NOFS for this can slow down NFS quite a bit.

In regrouping for my next tack at this, I noticed that all stack traces go through ip_append_data(). This would be ip6_append_data() in the IPv6 case. A _very_ rough draft that would have ip_append_data() temporarily drop down to a smaller fake MTU follows ...

diff -adNpru linux-2.6.35.2/net/ipv4/ip_output.c devel-2.6.35.2/net/ipv4/ip_output.c
--- linux-2.6.35.2/net/ipv4/ip_output.c 2010-08-13 14:44:56.000000000 ...
On Mon, 23 Aug 2010 08:44:37 -0600 (MDT)

Why doesn't NFS generate page-size fragments? Does Infiniband or your device not support this? Anything that requires higher-order allocation is going to be unstable under load. Let's fix the cause, not apply a band-aid solution to the symptom.
--
From what I can tell, IP fragmentation is done centrally. The MTU is a device attribute, yes. But, here, it is ip_append_data(), not NFS nor the device driver, whose responsibility it is to break up the payload into fragments, either by itself or using any facility supported by the adapter.

What I'm saying is that there's no reason to require all fragments, except the last, to be MTU-sized. The RFCs I've looked at allow them to be shorter, which can be used to advantage when MTU-sized fragments cannot be allocated in a memory fragmentation scenario, instead of reporting an error.

Marc.

+----------------------------------+----------------------------------+
| Marc Aurele La France            | work: 1-780-492-9310             |
| Academic Information and         | fax: 1-780-492-1729              |
| Communications Technologies      | email: tsi@ualberta.ca           |
| 352 General Services Building    +----------------------------------+
| University of Alberta            |                                  |
| Edmonton, Alberta                | Standard disclaimers apply       |
| T6G 2H1                          |                                  |
| CANADA                           |                                  |
+----------------------------------+----------------------------------+
--
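To put rough numbers on the trade-off being proposed, here is a minimal userspace sketch (not kernel code) of how many IP fragments a payload needs at a given MTU. It assumes a 20-byte IPv4 header without options and relies on the RFC 791 rule that fragment offsets are counted in 8-byte units, so every fragment but the last carries a multiple of 8 payload bytes. The names frag_payload_max() and frag_count() are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define IPV4_HDR_LEN 20u /* assumption: IPv4 header without options */

/* Largest payload a non-final fragment can carry at this MTU.
 * Fragment offsets are expressed in 8-byte units (RFC 791), so the
 * result is rounded down to a multiple of 8. */
static size_t frag_payload_max(size_t mtu)
{
    assert(mtu >= IPV4_HDR_LEN + 8);
    return (mtu - IPV4_HDR_LEN) & ~(size_t)7;
}

/* Number of fragments needed to carry len bytes of IP payload
 * (UDP header included) when fragments are sized for mtu. */
static size_t frag_count(size_t len, size_t mtu)
{
    size_t per_frag = frag_payload_max(mtu);

    return (len + per_frag - 1) / per_frag;
}
```

Under these assumptions, a 32776-byte payload (a 32 KiB write plus an 8-byte UDP header, RPC overhead ignored) goes out unfragmented at MTU 65520, while frag_count(32776, 4096) comes to 9 fragments if the stack were to fall back to a page-sized "fake" MTU.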
[...] Stephen and I are not talking about IP fragmentation, but about the ability to append 'fragments' to an skb rather than putting the entire packet payload in a linear buffer. See <http://vger.kernel.org/~davem/skb_data.html>. Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Communications Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. --
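The distinction drawn here can be illustrated with a toy userspace model of a buffer that keeps only protocol headers in a small linear area and references the payload page by page. This is a sketch only: struct toy_skb, struct toy_page, TOY_MAX_FRAGS and the helper names are hypothetical and much simpler than the kernel's real sk_buff and its fragment array.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_FRAGS 16u /* illustrative limit, not the kernel's */

struct toy_page {
    unsigned int refcount;              /* how many buffers point here */
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_frag {
    struct toy_page *page;
    size_t offset;
    size_t len;
};

struct toy_skb {
    unsigned char head[128];            /* small linear area: headers only */
    size_t head_len;
    struct toy_frag frags[TOY_MAX_FRAGS];
    size_t nr_frags;
};

/* Attach part of a page to the buffer. No payload bytes are copied;
 * only a reference is taken, so no large contiguous allocation is needed. */
static int toy_skb_add_frag(struct toy_skb *skb, struct toy_page *page,
                            size_t offset, size_t len)
{
    assert(offset + len <= TOY_PAGE_SIZE);
    if (skb->nr_frags == TOY_MAX_FRAGS)
        return -1;
    page->refcount++;
    skb->frags[skb->nr_frags].page = page;
    skb->frags[skb->nr_frags].offset = offset;
    skb->frags[skb->nr_frags].len = len;
    skb->nr_frags++;
    return 0;
}

/* Total packet length: linear header bytes plus all page fragments. */
static size_t toy_skb_len(const struct toy_skb *skb)
{
    size_t total = skb->head_len;
    size_t i;

    for (i = 0; i < skb->nr_frags; i++)
        total += skb->frags[i].len;
    return total;
}
```

In this model a 64 KiB payload is described by sixteen single-page references, so the largest allocation is one page, whereas a fully linear buffer of the same size needs one contiguous multi-page block.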
 [<ffffffff810a5abe>] __alloc_pages_nodemask+0x617/0x692
 [<ffffffff81061688>] ? mark_held_locks+0x49/0x64
 [<ffffffff810d018b>] kmalloc_large_node+0x61/0x9e
 [<ffffffff810d3050>] __kmalloc_node_track_caller+0x32/0x159
 [<ffffffff812612da>] ? sock_alloc_send_pskb+0xc9/0x2ea
 [<ffffffff81265cc6>] __alloc_skb+0x74/0x163
 [<ffffffff812612da>] sock_alloc_send_pskb+0xc9/0x2ea
 [<ffffffff81061688>] ? mark_held_locks+0x49/0x64
 [<ffffffff81261510>] sock_alloc_send_skb+0x15/0x17
 [<ffffffff81299317>] ip_append_data+0x500/0x9d0
 [<ffffffff8103feae>] ? local_bh_enable+0xb7/0xbd
 [<ffffffff8129a804>] ? ip_generic_getfrag+0x0/0x92
 [<ffffffff81292bcd>] ? ip_route_output_flow+0x82/0x1f9
 [<ffffffff812b8990>] udp_sendmsg+0x4ec/0x60c
 [<ffffffff812bf2ac>] inet_sendmsg+0x4b/0x58
 [<ffffffff8125dd89>] sock_sendmsg+0xd9/0xfa
 [<ffffffff81063fb0>] ? __lock_acquire+0x787/0x7f5
 [<ffffffff81063fb0>] ? __lock_acquire+0x787/0x7f5
 [<ffffffff8125fcf5>] kernel_sendmsg+0x37/0x43
 [<ffffffffa0267cd2>] xs_send_kvec+0x88/0x93 [sunrpc]
 [<ffffffff812f08dc>] ? _raw_spin_unlock_irqrestore+0x44/0x4c
 [<ffffffffa0267d5c>] xs_sendpages+0x7f/0x1be [sunrpc]
 [<ffffffffa026952f>] xs_udp_send_request+0x5b/0x103 [sunrpc]
 [<ffffffffa0266c0a>] xprt_transmit+0x11f/0x1f5 [sunrpc]
 [<ffffffffa02ea140>] ? nfs3_xdr_writeargs+0x0/0x82 [nfs]
 [<ffffffffa02648b9>] call_transmit+0x218/0x25e [sunrpc]
 [<ffffffffa026aced>] __rpc_execute+0x9b/0x288 [sunrpc]
 [<ffffffffa026aeef>] rpc_async_schedule+0x15/0x17 [sunrpc]
 [<ffffffff81051137>] worker_thread+0x1ed/0x2e6
 [<ffffffff810510e1>] ? worker_thread+0x197/0x2e6
 [<ffffffffa026aeda>] ? rpc_async_schedule+0x0/0x17 [sunrpc]
 [<ffffffff8105450f>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffff81050f4a>] ? worker_thread+0x0/0x2e6
 [<ffffffff810541b2>] kthread+0x82/0x8a
 [<ffffffff81002f14>] kernel_thread_helper+0x4/0x10
 [<ffffffff81030d20>] ? finish_task_switch+0x0/0xd6
 [<ffffffff81002f10>] ? kernel_thread_helper+0x0/0x10

Humm. ...
Not necessarily. Offloading it to hardware, where possible, is usually

The inability to allocate large linear buffers is not a good reason to generate packets smaller than the MTU. You are working around the real problem.

Ben.
--
On Tue, 24 Aug 2010 23:20:41 +0100

If the NFS server is smart enough to generate:

    Header (skb) + one or more pages in fragment list

then IP fragmentation could do the fragmentation by allocating new (small) header skbs and assigning the same pages to multiple skbs using the page ref count. It obviously isn't working that way.

The whole problem is moot because NFS over UDP has known data corruption issues in the face of packet loss. The sequence number of the IP fragment (the 16-bit IP ID) can easily wrap around, causing old data to be grouped with new data, and the UDP checksum is so weak that the resulting UDP packet will be consumed by the NFS client and passed to the user application as a corrupted disk block.

DON'T USE NFS OVER UDP!
--
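A back-of-the-envelope sketch of the wrap-around concern: the IPv4 identification field is 16 bits wide, so it repeats after 65536 datagrams. The helper below estimates how long that takes on a saturated link. It assumes one ID per datagram, sequential ID assignment towards a single destination, and no protocol overhead; ipid_wrap_seconds() is a hypothetical name.

```c
#include <assert.h>

/* Seconds until the 16-bit IPv4 identification field repeats, assuming
 * the link is kept full with datagrams of the given size and each
 * datagram consumes exactly one sequentially assigned ID. */
static double ipid_wrap_seconds(double link_bits_per_sec, double datagram_bytes)
{
    double datagrams_per_sec;

    assert(link_bits_per_sec > 0.0 && datagram_bytes > 0.0);
    datagrams_per_sec = link_bits_per_sec / (8.0 * datagram_bytes);
    return 65536.0 / datagrams_per_sec;
}
```

At 1 Gbit/s with 8 KiB datagrams this works out to roughly 4.3 seconds, which is shorter than the 30-second default for the Linux fragment reassembly timeout (net.ipv4.ipfrag_time). A stale fragment left over from a lost datagram can therefore still be queued when its ID comes around again.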
It is, but ip_append_data() allocates a huge head if the MTU is huge. But Marc's point is to use a big MTU, so that no IP fragmentation is needed. All UDP applications using MSG_MORE will hit order-2 allocations if MTU=9000, for example... --
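The relationship between MTU and allocation order can be sketched in a few lines of userspace C. order_for_size() mirrors the idea behind the kernel's get_order() (the smallest power-of-two number of pages that covers a size). The per-buffer overhead is passed in as a parameter because its real value depends on kernel version and configuration; the 512 bytes used in the note below is purely an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
static unsigned int order_for_size(size_t size, size_t page_size)
{
    unsigned int order = 0;
    size_t span = page_size;

    assert(size > 0 && page_size > 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Order needed for a linear buffer holding one MTU-sized packet plus
 * per-buffer overhead (overhead is a caller-supplied assumption). */
static unsigned int order_for_mtu(size_t mtu, size_t overhead, size_t page_size)
{
    return order_for_size(mtu + overhead, page_size);
}
```

With 4 KiB pages and the assumed 512 bytes of overhead, an MTU of 1500 stays at order 0, 9000 needs order 2, 32192 needs order 3, and 65520 needs order 5, since 65520 plus any overhead above 16 bytes no longer fits in 64 KiB.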
Hi Alexey, A few hours ago, I privately asked Marc Aurele whether his Infiniband device supports NETIF_F_SG in its features ;) Thanks! --
Hmm, strange. As I remember, it was supposed to work right.

If the device supports SG (which is required to accept non-linear skbs anyway), then ip_append_* should allocate skbs not rounded up to the MTU, and we should allocate a small skb with the NFS header only. Does it not work?

I can only guess at one possible trap: people could do _one_ huge ip_append_data() (instead of the "planned" scenario, where the header is sent with ip_append_data() and the following payload is appended with ip_append_page()). A huge ip_append_data() will indeed generate a huge skb. Is this the problem?

BTW, this issue could be revisited and this "will generate huge" behaviour reconsidered. Automatic generation of fragmented skbs was deliberately suppressed, because it was found that all devices existing at the time this code was written were strongly biased against SG. The current code tries to _avoid_ generating non-linear skbs unless they are intended for zero-copy, which compensates for the bias against SG. Modern hardware should work better.

Alexey
--
Generating smaller-than-MTU fragments is better than giving up and

Point of clarification: we're talking about the client here, not the

Steady now. There's no need to YELL nor be arrogant. You and I both know there's a place for NFS over UDP. That's not changing any time soon. While I'm aware of the issue you brought up, it is separate from the one at hand in this discussion.

I do want to thank you, however, for reminding me of TCP. It's something 20/20 hindsight says I should have checked out before starting this thread. Logistically, it'll be a few days before I can do so though. If that allows me to increase the MTU all the way up to 65520, then this UDP thing will likely remain unresolved.

Thanks.

Marc.
--
Unfortunately, your Infiniband device lacks NETIF_F_SG support. Doesn't that mean any MTU even slightly larger than PAGE_SIZE minus overhead will need high-order allocations? --
On Thu, 26 Aug 2010 08:43:42 -0600 (Mountain Daylight Time)

The Infiniband device driver needs to be fixed to do SG and checksum offload. Otherwise it is insane to try to run a large MTU over it. I even wonder if the dev_change_mtu() function should reject an MTU > PAGE_SIZE for devices that don't do scatter/gather, or at least raise a warning.
--
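The check being floated here could look something like the following toy predicate. It is a userspace sketch of the idea only, not a patch: struct toy_netdev, TOY_F_SG (standing in for NETIF_F_SG, with an illustrative value) and toy_mtu_acceptable() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_F_SG 0x1u /* stand-in for NETIF_F_SG; value is illustrative */

struct toy_netdev {
    unsigned int features;
};

/* Hypothetical policy: a device that cannot do scatter/gather needs a
 * linear buffer per packet, so only accept an MTU that fits in one page.
 * Devices with scatter/gather are not restricted by this check. */
static bool toy_mtu_acceptable(const struct toy_netdev *dev, size_t new_mtu,
                               size_t page_size)
{
    assert(dev != NULL && page_size > 0);
    if (dev->features & TOY_F_SG)
        return true;
    return new_mtu <= page_size;
}
```

A caller could reject the change outright when the predicate returns false, or merely log a warning, which are the two options mentioned above.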
From: Stephen Hemminger <shemminger@vyatta.com>

Agreed, this problem is in the infiniband layer and should be fixed there.

But I fear there is a real potential blocker for this: if the infiniband layer can't checksum transmit packets in hardware, we cannot legitimately add SG support.

Paged SKBs can have references to page cache pages and similar. These can be updated asynchronously to the transmit, there is no locking at all to freeze the contents, and therefore full checksum offload is required to support SG correctly.

So don't get the idea to do the checksum in software in the infiniband layer, and advertize hw checksumming support, to get around this :-)
--
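The hazard described here can be reproduced in miniature in userspace: compute a checksum over a buffer, let the buffer change, and the earlier checksum no longer matches the bytes that would go out. The sketch below uses the RFC 1071 Internet checksum; checksum_survives_update() is a hypothetical helper that simulates a page being modified between software checksumming and transmission.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over len bytes (big-endian 16-bit words). */
static uint16_t inet_checksum(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(len <= 65536); /* keeps the 32-bit accumulator from overflowing */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;   /* odd trailing byte, zero padded */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Simulate the race: checksum first, then the page content changes
 * before "transmit". Returns true only if the early checksum still
 * matches what is now in the buffer. */
static bool checksum_survives_update(unsigned char *page, size_t len,
                                     size_t pos, unsigned char new_byte)
{
    uint16_t early = inet_checksum(page, len);

    assert(pos < len);
    page[pos] = new_byte;               /* asynchronous update of the page */
    return early == inet_checksum(page, len);
}
```

When the checksum is computed by the hardware as the data is read for transmission, there is no such window, which is why full checksum offload is described above as a prerequisite for SG.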
> Infiniband device driver needs to be fixed to do SG and checksum offload.
> Otherwise it is insane to try and run large MTU over it. I even wonder if
> the dev_change_mtu() function should reject > PAGESIZE mtu for devices
> that don't do scatter/gather or at least a raise a warning.

It's not possible to "fix" the driver to do checksum offload, since the underlying hardware does not support it. Theoretically we could handle SG but of course there's no point in that without checksum offload.

I think there is some confusion about what IPoIB is in this thread, so let me try to give some basic background to help the discussion.

There are two "modes" that an IPoIB interface can operate in: datagram mode and connected mode.

In datagram mode, packets given to the IPoIB driver are sent as IB unreliable datagram messages, which means each skb turns into one packet on the wire -- very much like the ethernet case. In this mode, the MTU is limited by the MTU on the IB side, which is typically either 2K or 4K depending on the adapter and the switches involved. Modern IB adapters do support checksum offload and large send offload for datagrams, so we can and do enable SG and IP_CSUM.

In connected mode, the IPoIB driver actually makes a reliable connection to each peer. For reliable connections, IB adapters can actually send messages up to 4GB, with the adapter handling all the segmentation and transport level acks etc. -- the host system simply queues one work request for each message of any size. These work requests do support gather/scatter, but no existing adapter supports checksum offload for messages on reliable connections. However, since reliable connections support arbitrary sized messages, in connected mode the IPoIB driver allows an MTU up to roughly the maximum 64K IP message size. (I don't think anyone has tried it with bigger IPv6 jumbograms ;)

It does seem even with all the horrible memory allocation problems caused by requiring huge linear skbs, ...
By the way, for the original poster: is using NFS/RDMA a possibility? That might give even better performance than any config of IPoIB if you have an InfiniBand fabric anyway. - R. --
Yes, NFS/RDMA is a possibility I need to look at as well. Thanks. Marc. --
On advanced cluster-area networks with large MTUs, the ACK packets in TCP will probably kill your performance. That's one of the main reasons we keep NFS over UDP on life support! :-) -- chuck[dot]lever[at]oracle[dot]com --
Just to close off on this. It's been a few weeks now, but moving to NFS over TCP allows me to increase the MTU all the way up to 65520 without issues. Thanks for the help. Marc. --
[...] I'm not familiar with the NFS server, but what you're saying suggests that this code needs a more radical rethink. Firstly, I don't see why NFS should require each packet's payload to be contiguous. It could use page fragments and then leave it to the networking core to linearize the buffer if necessary for stupid hardware. Secondly, if it's doing its own segmentation it can't take advantage of TSO. This is likely to be a real drag on performance. If it were taking advantage of TSO then the effective MTU over TCP/IP could be about 64K and it would already have hit this problem on Ethernet. Ben. --
