Re: RFC: MTU for serving NFS on Infiniband

From: Marc Aurele La France
Date: Monday, August 23, 2010 - 7:44 am

My apologies for the multiple post.  I got bit the first time around by my 
MUA's configuration.

----

Greetings.

For some time now, the kernel and I have been having an argument over what 
the MTU should be for serving NFS over Infiniband.  I say 65520, the 
documented maximum for connected mode.  But, so far, I've been unable to have 
anything over 32192 remain stable.

Back in the 2.6.14 -> .15 period, sunrpc's sk_buff allocations were changed 
from GFP_KERNEL to GFP_ATOMIC (mainline commit 
b079fa7baa86b47579f3f60f86d03d21c76159b8).  Understandably, this was to prevent recursion through 
the NFS and sunrpc code.  This is fine for the most common MTU out there, as 
the kernel is almost certain to find a free page.  But, as one increases the 
MTU, memory fragmentation starts to play a role in nixing these allocations.
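
To put numbers on this: the page allocator hands out physically contiguous memory in power-of-two blocks of pages, so a linear buffer large enough for a big MTU needs a higher "order" than one for a 1500-byte frame. A minimal sketch of that arithmetic follows, assuming 4 KiB pages and ignoring the skb's own bookkeeping overhead; `alloc_order` is a hypothetical helper, not a kernel function.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86-64


def alloc_order(nbytes, page_size=PAGE_SIZE):
    """Smallest order such that 2**order contiguous pages hold nbytes."""
    pages = -(-nbytes // page_size)   # ceiling division
    return (pages - 1).bit_length()   # round the page count up to a power of two


for mtu in (1500, 9000, 32192, 65520):
    order = alloc_order(mtu)
    print(f"MTU {mtu:>5}: order {order} ({2 ** order} contiguous pages)")
```

An order-0 request is satisfied by any single free page, whereas a higher-order request needs a run of physically contiguous free pages, which is exactly what fragmentation makes scarce.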

These allocation failures ultimately result in sparse files being written 
through NFS.  Granted, many of my users' applications are oblivious to 
this because they don't check for such errors.  But it would be nice if the 
kernel were more resilient in this regard.

For a few months now, I've been running with sunrpc sk_buff allocations using 
GFP_NOFS instead, which allows for dirty data to be flushed out and still 
avoids recursion through sunrpc.  With this, I've been able to increase the 
stable MTU to 32192.  But no further, as eventually there is no dirty data 
left and memory fragmentation becomes mostly due to yet-to-be-sync'ed 
filesystem data.  There's also the matter that using GFP_NOFS for this can 
slow down NFS quite a bit.

In regrouping for my next tack at this, I noticed that all stack traces go 
through ip_append_data().  This would be ip6_append_data() in the IPv6 case.
A _very_ rough draft that would have ip_append_data() temporarily drop down 
to a smaller fake MTU follows ...

diff -adNpru linux-2.6.35.2/net/ipv4/ip_output.c devel-2.6.35.2/net/ipv4/ip_output.c
--- linux-2.6.35.2/net/ipv4/ip_output.c	2010-08-13 14:44:56.000000000 ...
From: Stephen Hemminger
Date: Monday, August 23, 2010 - 8:05 am

On Mon, 23 Aug 2010 08:44:37 -0600 (MDT)

Why doesn't NFS generate page-size fragments?  Does Infiniband or your
device not support this?  Anything that requires higher-order allocation
is going to be unstable under load.  Let's fix the cause rather than apply
a band-aid to the symptom.
--

From: Marc Aurele La France
Date: Tuesday, August 24, 2010 - 8:14 am

From what I can tell, IP fragmentation is done centrally.

The MTU is a device attribute, yes.  But, here, it is ip_append_data(), 
not NFS nor the device driver, whose responsibility it is to break up the 
payload into fragments, either by itself or using any facility supported 
by the adapter.  What I'm saying is that there's no reason to require all 
fragments, except the last, to be MTU-sized.  The RFCs I've looked at 
allow them to be shorter which can be used to advantage when MTU-sized 
fragments cannot be allocated in a memory fragmentation scenario, instead 
of reporting an error.
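
The wire-level rule behind this can be sketched as follows. In IPv4 the fragment offset is counted in 8-byte units, so every fragment except the last must carry a multiple of 8 bytes of payload, but nothing requires it to fill the MTU. `fragment_plan` and its parameters are hypothetical names used for illustration, and the 20-byte header assumes no IP options; this models only the offsets, not what ip_append_data() actually does.

```python
def fragment_plan(payload_len, max_frag_payload):
    """Return (offset, length, more_fragments) tuples covering payload_len bytes.

    Every fragment except the last carries a multiple of 8 bytes, because the
    IPv4 fragment offset field counts in 8-byte units (RFC 791).
    """
    step = max_frag_payload & ~7      # round down to a multiple of 8
    if step <= 0:
        raise ValueError("max_frag_payload must be at least 8")
    plan, offset = [], 0
    while offset < payload_len:
        length = min(step, payload_len - offset)
        more = offset + length < payload_len
        plan.append((offset, length, more))
        offset += length
    return plan


IP_HEADER = 20  # assumption: IPv4 header without options

# The same datagram cut into MTU-sized pieces and into page-sized pieces.
full = fragment_plan(65000, 32192 - IP_HEADER)
small = fragment_plan(65000, 4096 - IP_HEADER)
print(len(full), "fragments at MTU size,", len(small), "at page size")
```

Both plans describe the same datagram; the second simply uses more, smaller fragments, each of which fits in a single page.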

Marc.

+----------------------------------+----------------------------------+
|  Marc Aurele La France           |  work:   1-780-492-9310          |
|  Academic Information and        |  fax:    1-780-492-1729          |
|    Communications Technologies   |  email:  tsi@ualberta.ca         |
|  352 General Services Building   +----------------------------------+
|  University of Alberta           |                                  |
|  Edmonton, Alberta               |    Standard disclaimers apply    |
|  T6G 2H1                         |                                  |
|  CANADA                          |                                  |
+----------------------------------+----------------------------------+
--

From: Ben Hutchings
Date: Tuesday, August 24, 2010 - 10:57 am

[...]

Stephen and I are not talking about IP fragmentation, but about the
ability to append 'fragments' to an skb rather than putting the entire
packet payload in a linear buffer.  See
<http://vger.kernel.org/~davem/skb_data.html>.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Marc Aurele La France
Date: Tuesday, August 24, 2010 - 1:33 pm

[<ffffffff810a5abe>] __alloc_pages_nodemask+0x617/0x692
  [<ffffffff81061688>] ? mark_held_locks+0x49/0x64
  [<ffffffff810d018b>] kmalloc_large_node+0x61/0x9e
  [<ffffffff810d3050>] __kmalloc_node_track_caller+0x32/0x159
  [<ffffffff812612da>] ? sock_alloc_send_pskb+0xc9/0x2ea
  [<ffffffff81265cc6>] __alloc_skb+0x74/0x163
  [<ffffffff812612da>] sock_alloc_send_pskb+0xc9/0x2ea
  [<ffffffff81061688>] ? mark_held_locks+0x49/0x64
  [<ffffffff81261510>] sock_alloc_send_skb+0x15/0x17
  [<ffffffff81299317>] ip_append_data+0x500/0x9d0
  [<ffffffff8103feae>] ? local_bh_enable+0xb7/0xbd
  [<ffffffff8129a804>] ? ip_generic_getfrag+0x0/0x92
  [<ffffffff81292bcd>] ? ip_route_output_flow+0x82/0x1f9
  [<ffffffff812b8990>] udp_sendmsg+0x4ec/0x60c
  [<ffffffff812bf2ac>] inet_sendmsg+0x4b/0x58
  [<ffffffff8125dd89>] sock_sendmsg+0xd9/0xfa
  [<ffffffff81063fb0>] ? __lock_acquire+0x787/0x7f5
  [<ffffffff81063fb0>] ? __lock_acquire+0x787/0x7f5
  [<ffffffff8125fcf5>] kernel_sendmsg+0x37/0x43
  [<ffffffffa0267cd2>] xs_send_kvec+0x88/0x93 [sunrpc]
  [<ffffffff812f08dc>] ? _raw_spin_unlock_irqrestore+0x44/0x4c
  [<ffffffffa0267d5c>] xs_sendpages+0x7f/0x1be [sunrpc]
  [<ffffffffa026952f>] xs_udp_send_request+0x5b/0x103 [sunrpc]
  [<ffffffffa0266c0a>] xprt_transmit+0x11f/0x1f5 [sunrpc]
  [<ffffffffa02ea140>] ? nfs3_xdr_writeargs+0x0/0x82 [nfs]
  [<ffffffffa02648b9>] call_transmit+0x218/0x25e [sunrpc]
  [<ffffffffa026aced>] __rpc_execute+0x9b/0x288 [sunrpc]
  [<ffffffffa026aeef>] rpc_async_schedule+0x15/0x17 [sunrpc]
  [<ffffffff81051137>] worker_thread+0x1ed/0x2e6
  [<ffffffff810510e1>] ? worker_thread+0x197/0x2e6
  [<ffffffffa026aeda>] ? rpc_async_schedule+0x0/0x17 [sunrpc]
  [<ffffffff8105450f>] ? autoremove_wake_function+0x0/0x3d
  [<ffffffff81050f4a>] ? worker_thread+0x0/0x2e6
  [<ffffffff810541b2>] kthread+0x82/0x8a
  [<ffffffff81002f14>] kernel_thread_helper+0x4/0x10
  [<ffffffff81030d20>] ? finish_task_switch+0x0/0xd6
  [<ffffffff81002f10>] ? kernel_thread_helper+0x0/0x10



Humm.  ...
From: Ben Hutchings
Date: Tuesday, August 24, 2010 - 3:20 pm

Not necessarily.  Offloading it to hardware, where possible, is usually

The inability to allocate large linear buffers is not a good reason to
generate packets smaller than the MTU.  You are working around the real
problem.

Ben.


--

From: Stephen Hemminger
Date: Tuesday, August 24, 2010 - 3:39 pm

On Tue, 24 Aug 2010 23:20:41 +0100

If the NFS server were smart enough to generate:
   Header (skb) + one or more pages in fragment list
then IP fragmentation could be done by allocating
new (small) header skbs and assigning the same pages to
multiple skbs using the page ref count.

It obviously isn't working that way.
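
A toy model of the scheme described above, assuming 4 KiB pages: the payload stays in its original pages, and each outgoing fragment gets a small header plus references to the page ranges it covers. `Page`, `Skb` and `split_by_reference` are hypothetical stand-ins for illustration, not the kernel's structures.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumption: 4 KiB pages


@dataclass
class Page:
    data: bytearray
    refcount: int = 0


@dataclass
class Skb:
    header: bytes                               # small linear area
    frags: list = field(default_factory=list)   # (page, offset, length) entries


def split_by_reference(header, pages, payload_len, frag_payload):
    """Split a paged payload into skbs that share the original pages."""
    skbs, pos = [], 0
    while pos < payload_len:
        end = min(pos + frag_payload, payload_len)
        skb = Skb(header=header)
        while pos < end:
            page = pages[pos // PAGE_SIZE]
            off = pos % PAGE_SIZE
            length = min(PAGE_SIZE - off, end - pos)
            page.refcount += 1                  # take a reference, copy nothing
            skb.frags.append((page, off, length))
            pos += length
        skbs.append(skb)
    return skbs


pages = [Page(bytearray(PAGE_SIZE)) for _ in range(4)]
skbs = split_by_reference(b"hdr", pages, 4 * PAGE_SIZE, 6000)
for i, skb in enumerate(skbs):
    print(i, [(off, length) for _, off, length in skb.frags])
print("refcounts:", [p.refcount for p in pages])
```

No piece of any fragment is larger than one page, so the only new allocations are the small headers.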

The whole problem is moot because NFS over UDP has known data corruption
issues in the face of packet loss.  The 16-bit IP identification field that
ties fragments together can easily wrap around, causing old data to be
grouped with new data, and the UDP checksum is so weak that the resulting
UDP packet will be consumed by the NFS client and passed to the user
application as a corrupted disk block.

DON'T USE NFS OVER UDP!
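
For a sense of scale, the wrap-around time mentioned above can be estimated from the 16-bit IPv4 Identification field. The sketch below assumes a sender running at line rate toward one peer and consuming one ID per datagram; the datagram size and link speeds are illustrative values, not measurements from this thread, and `seconds_until_id_wrap` is a hypothetical helper.

```python
IP_ID_SPACE = 2 ** 16   # the IPv4 Identification field is 16 bits wide


def seconds_until_id_wrap(datagram_bytes, link_bits_per_second):
    """Time for a sender at line rate to reuse an IP ID (headers ignored)."""
    datagrams_per_second = link_bits_per_second / (8 * datagram_bytes)
    return IP_ID_SPACE / datagrams_per_second


for rate in (1e9, 10e9):                       # illustrative link speeds
    t = seconds_until_id_wrap(8192, rate)      # illustrative 8 KiB datagrams
    print(f"{rate / 1e9:.0f} Gbit/s: IDs repeat after about {t:.2f} s")
```

With 8 KiB datagrams this comes to a little over 4 seconds at 1 Gbit/s and under half a second at 10 Gbit/s. Both are far shorter than the 30-second default of Linux's net.ipv4.ipfrag_time reassembly timeout, so fragments left over from a lost datagram can still be queued when their ID comes around again.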




--

From: Eric Dumazet
Date: Tuesday, August 24, 2010 - 10:54 pm

It is, but ip_append_data() is allocating a huge head if MTU is huge.


But Marc's point is to use a big MTU, so that no IP fragmentation is
needed.

All UDP applications using MSG_MORE will hit the order-2 allocations if
MTU=9000 for example...



--

From: Eric Dumazet
Date: Wednesday, August 25, 2010 - 5:17 am

Hi Alexey,

A few hours ago, I privately asked Marc Aurele whether his infiniband device
supports NETIF_F_SG in its features ;)

Thanks !


--

From: Alexey Kuznetsov
Date: Wednesday, August 25, 2010 - 5:10 am

Hmm, strange, as I remember, it was supposed to work right.

If the device supports SG (which is required to accept non-linear skbs anyway),
then ip_append_* should allocate skbs that are not rounded up to the MTU, and we
should allocate a small skb with the NFS header only. Doesn't it work?

I can only guess one possible trap: people could do _one_ huge ip_append_data()
(instead of "planned" scenario, when the header is sent with ip_append_data()
and the following payload is appended with ip_append_page()). Huge ip_append_data()
will generate huge skb indeed. Is this the problem?
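
The cases discussed here can be summarized in a simplified model: a small head when the header is appended on its own to an SG-capable device, a huge head when everything is passed in one call, and an MTU-sized head when more data is expected and the device cannot do SG. This is a paraphrase of the behaviour described in this thread, not a transcription of ip_append_data(); `head_alloc_len`, its parameters and the 28-byte header length are hypothetical.

```python
def head_alloc_len(datalen, hdrlen, mtu, msg_more, dev_has_sg):
    """Linear bytes requested for a new skb head in this simplified model."""
    if msg_more and not dev_has_sg:
        # More data is expected and it can only go into the linear area,
        # so room for a full MTU is reserved up front.
        return mtu
    # Otherwise only what is being copied right now, capped at the MTU.
    return min(datalen + hdrlen, mtu)


MTU = 65520
HDR = 28  # assumption: 20-byte IPv4 header plus 8-byte UDP header

print("header only, SG device:   ", head_alloc_len(100, HDR, MTU, True, True))
print("header only, no SG:       ", head_alloc_len(100, HDR, MTU, True, False))
print("one huge append, SG device:", head_alloc_len(60000, HDR, MTU, False, True))
```

In this model only the first case stays within a single page; the other two need a large contiguous buffer.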


BTW this issue could be revisited, and this "will generate huge" behaviour reconsidered.
Automatic generation of fragmented skbs was deliberately suppressed, because it was
found that all devices existing at the moment when this code was written
were strongly biased against SG. The current code tries to _avoid_ generating
non-linear skbs unless they are intended for zero-copy, which compensates for
the bias against SG. Modern hardware should work better.

Alexey
--

From: Marc Aurele La France
Date: Thursday, August 26, 2010 - 4:40 am

Generating smaller-than-MTU fragments is better than giving up and 

Point of clarification:  we're talking about the client here, not the 

Steady now.  There's no need to YELL nor be arrogant.  You and I both know 
there's a place for NFS over UDP.  That's not changing any time soon.  While 
I'm aware of the issue you brought up, it is separate from the one at hand in 
this discussion.

I do want to thank you, however, for reminding me of TCP.  It's something 
20/20 hindsight says I should have checked out before starting this thread. 
Logistically, it'll be a few days before I can do so though.  If that allows 
me to increase the MTU all the way up to 65520, then this UDP thing will 
likely remain unresolved.

Thanks.

Marc.

--

From: Eric Dumazet
Date: Thursday, August 26, 2010 - 4:57 am

Unfortunately, your infiniband device lacks NETIF_F_SG support.

Won't an MTU even a bit larger than PAGE_SIZE minus overhead need
high-order allocations?



--

From: Stephen Hemminger
Date: Thursday, August 26, 2010 - 4:53 pm

On Thu, 26 Aug 2010 08:43:42 -0600 (Mountain Daylight Time)

The Infiniband device driver needs to be fixed to do SG and checksum offload.
Otherwise it is insane to try to run a large MTU over it. I even wonder if
the dev_change_mtu() function should reject an MTU > PAGESIZE for devices
that don't do scatter/gather, or at least raise a warning.
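
The policy suggested here could look roughly like this. It is a sketch of the idea only: `mtu_policy`, its return values and the `strict` switch are hypothetical, and 4 KiB pages are assumed.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def mtu_policy(new_mtu, dev_has_sg, page_size=PAGE_SIZE, strict=False):
    """Classify an MTU change as 'accept', 'warn' or 'reject'."""
    if dev_has_sg or new_mtu <= page_size:
        return "accept"     # paged skbs are possible, or one page is enough
    # A linear buffer larger than a page would be needed for every packet.
    return "reject" if strict else "warn"


for mtu, sg in ((1500, False), (65520, True), (65520, False)):
    print(mtu, "SG" if sg else "no SG", "->", mtu_policy(mtu, sg))
```

Whether to warn or to refuse outright is the open question in the suggestion; the `strict` flag just makes both outcomes visible.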
--

From: David Miller
Date: Thursday, August 26, 2010 - 5:06 pm

From: Stephen Hemminger <shemminger@vyatta.com>

Agreed, this problem is in the infiniband layer and should be fixed
there.

But I fear there is a real potential blocker for this, if the
infiniband layer can't checksum transmit packets in hardware we cannot
legitimately add SG support.

Paged SKBs can have references to page cache pages and similar.  These
can be updated asynchronously to the transmit, there is no locking at
all to freeze the contents, and therefore full checksum offload is
required to support SG correctly.

So don't get the idea to do the checksum in software in the infiniband
layer, and advertize hw checksumming support, to get around this :-)
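
The hazard can be shown with a software checksum over a buffer that is referenced rather than copied. The sketch below implements the RFC 1071 Internet checksum; the `page` bytearray is a stand-in for a page-cache page, and the asynchronous update is simulated by a plain assignment between checksumming and transmission.

```python
def inet_checksum(data):
    """16-bit one's-complement Internet checksum (RFC 1071)."""
    if len(data) % 2:
        data = bytes(data) + b"\x00"       # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                     # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


page = bytearray(b"A" * 4096)              # stands in for a page-cache page
csum_at_queue_time = inet_checksum(page)   # software checksum, page not copied
page[100] = ord("B")                       # another writer updates the page
csum_on_the_wire = inet_checksum(page)     # what the receiver will compute
stale = csum_at_queue_time != csum_on_the_wire
print("checksum stale:", stale)
```

When the checksum is computed by the hardware as the data is read for transmission, there is no window between the two steps, which is why full checksum offload is the prerequisite.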
--

From: Roland Dreier
Date: Friday, August 27, 2010 - 9:20 am

> Infiniband device driver needs to be fixed to do SG and checksum offload.
> Otherwise it is insane to try and run large MTU over it. I even wonder if
> the dev_change_mtu() function should reject > PAGESIZE mtu for devices
> that don't do scatter/gather or at least raise a warning.

It's not possible to "fix" the driver to do checksum offload, since the
underlying hardware does not support it.  Theoretically we could handle
SG but of course there's no point in that without checksum offload.

I think there is some confusion about what IPoIB is in this thread, so
let me try to give some basic background to help the discussion.  There
are two "modes" that an IPoIB interface can operate in: datagram mode
and connected mode.

In datagram mode, packets given to the IPoIB driver are sent as IB
unreliable datagram messages, which means each skb turns into one packet
on the wire -- very much like the ethernet case.  In this mode, the MTU
is limited by the MTU on the IB side, which is typically either 2K or 4K
depending on the adapter and the switches involved.  Modern IB adapters
do support checksum offload and large send offload for datagrams, so we
can and do enable SG and IP_CSUM.

In connected mode, the IPoIB driver actually makes a reliable connection
to each peer.  For reliable connections, IB adapters can actually send
messages up to 4GB, with the adapter handling all the segmentation and
transport level acks etc. -- the host system simply queues one work
request for each message of any size.  These work requests do support
gather/scatter, but no existing adapter supports checksum offload for
messages on reliable connections.

However, since reliable connections support arbitrary sized messages, in
connected mode the IPoIB driver allows an MTU up to roughly the maximum
64K IP message size.  (I don't think anyone has tried it with bigger
IPv6 jumbograms ;)

It does seem even with all the horrible memory allocation problems
caused by requiring huge linear skbs, ...
From: Roland Dreier
Date: Friday, August 27, 2010 - 10:16 am

By the way, for the original poster: is using NFS/RDMA a possibility?
That might give even better performance than any config of IPoIB if you
have an InfiniBand fabric anyway.

 - R.
--

From: Marc Aurele La France
Date: Friday, August 27, 2010 - 10:53 am

Yes, NFS/RDMA is a possibility I need to look at as well.

Thanks.

Marc.

--

From: Chuck Lever
Date: Thursday, August 26, 2010 - 7:58 am

On advanced cluster-area networks with large MTUs, the ACK packets in TCP will probably kill your performance.  That's one of the main reasons we keep NFS over UDP on life support!  :-)

-- 
chuck[dot]lever[at]oracle[dot]com




--

From: Marc Aurele La France
Date: Thursday, September 30, 2010 - 11:50 am

Just to close off on this.  It's been a few weeks now, but moving to NFS 
over TCP allows me to increase the MTU all the way up to 65520 without 
issues.

Thanks for the help.

Marc.

--

From: Ben Hutchings
Date: Monday, August 23, 2010 - 8:12 am

[...]

I'm not familiar with the NFS server, but what you're saying suggests
that this code needs a more radical rethink.

Firstly, I don't see why NFS should require each packet's payload to be
contiguous.  It could use page fragments and then leave it to the
networking core to linearize the buffer if necessary for stupid
hardware.

Secondly, if it's doing its own segmentation it can't take advantage of
TSO.  This is likely to be a real drag on performance.  If it were
taking advantage of TSO then the effective MTU over TCP/IP could be
about 64K and it would already have hit this problem on Ethernet.

Ben.


--
