Re: UDP path MTU discovery

Previous thread: [PATCH] igb: update hw_debug macro to make use of netdev_dbg call by Jeff Kirsher on Thursday, March 25, 2010 - 4:10 pm. (2 messages)

Next thread: [net-2.6 PATCH] ixgbe: Do not run all Diagnostic offline tests when VFs are active by Jeff Kirsher on Thursday, March 25, 2010 - 8:06 pm. (2 messages)
From: Glen Turner
Date: Thursday, March 25, 2010 - 5:02 pm

[This is a second attempt to report this bug.]

Path MTU Discovery for UDP underperforms for IPv4 and fails
for IPv6 in Linux for transactional services like DHCP and
RADIUS running on jumbo frame interfaces.

These servers send packets with exponential back-off. UDP
Path MTU Discovery probes for the path MTU each time the
application sends a packet. So if you start with a high
enough interface MTU then the server application backoff
times get huge and the client gives up before the path
MTU is discovered.

This differs from TCP, where it is the kernel -- and not
the application -- which organises retransmission. On
receiving an ICMP Fragmentation Needed the kernel can
immediately re-probe the path MTU with no waiting for
an exponential timer to expire.

In IPv4 there is a work-around for the server: turn off
Path MTU Discovery and allow routers to fragment the packet
as needed. Looking at the code for the various transactional
servers (ISC DHCP, FreeRADIUS, RADIATOR, radsecproxy) they
all disable Path MTU Discovery on Linux. This workaround has
the side effect of hiding the problem, misleading people into
thinking that UDP Path MTU Discovery actually works for these
transactional servers.
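
The IPv4 work-around described above can be sketched as follows. This is
a minimal illustration, assuming a Linux UDP socket; the helper name
disable_pmtud_ipv4 is hypothetical and is not taken from any of the
servers mentioned.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: stop setting DF on this socket's outgoing IPv4
 * datagrams, so the local host and routers may fragment them instead of
 * dropping them with an ICMP Fragmentation Needed.
 * Returns 0 on success, -1 on failure (errno set by setsockopt). */
int disable_pmtud_ipv4(int fd)
{
    int val = IP_PMTUDISC_DONT;
    return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
}
```

A server would call this once after creating its socket, before the
first sendto().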

In IPv6 routers do not fragment packets, so there is no
work-around. Transactional servers which use UDP over IPv6 encounter
exponential backoffs within the application and the client
abandons the transaction. There is no way for the server to
know that the packet was lost due to Path MTU Discovery and
to immediately re-transmit it (without an exponential penalty)
so that the MTU can be probed again.

This can be viewed as a flaw in the RFC and in the sockets API
for which IPv6 has removed the common work-around.

Thank you, Glen

-- 
 Glen Turner
 www.gdt.id.au/~gdt

--

From: Rick Jones
Date: Thursday, March 25, 2010 - 5:53 pm

So, presuming it is indeed a bug, what form might a fix take? Are you suggesting 
there should be a way for an application to say "Please let me see/know about 
the ICMP messages?"  Is that option available on other platforms as a 
platform-specific extension?  I don't have the details, but the HP-UX 11i v3 
(11.31) netinet/udp.h file contains these:

#define UDP_RX_ICMP     0x02    /* boolean; get/set ICMP packets reception */
                                 /* Set to 1 if ICMP packets are to be received*/

#define UDP_RX_ICMP6    0x03    /* boolean; get/set ICMPv6 packets reception */
                                 /* Set to 1 if ICMPv6 packets are to be
                                    received */

and it does appear that they are in more places than just HP-UX - there are some 
hits for that for the old Apple Open Transport - which makes sense - it too had 
Mentat origins.

rick jones
--

From: David Miller
Date: Thursday, March 25, 2010 - 8:26 pm

From: Rick Jones <rick.jones2@hp.com>

We already provide this information.

The socket ends up with EMSGSIZE in its error queue, so the next time
the application does I/O it sees that error immediately from the
read/write call and thus knows that path MTU arrived.
--

From: Rick Jones
Date: Friday, March 26, 2010 - 10:48 am

A possibly pedantic question, but only when it does I/O, or also when/if it is 
in poll/select?

What distinguishes this EMSGSIZE from a run-of-the-mill EMSGSIZE error such as 
one gets from trying to send a datagram larger than SO_SNDBUF?

That is something that happens all the time in netperf when people forget a -m 
option on UDP_STREAM tests :)  Netperf gets the error and exits.  But supposing 
I wanted to make netperf more sophisticated in that regard - what sort of things 
must it do?  Call getsockopt(SO_SNDBUF) to check the size of the failed send 
against SO_SNDBUF and only then decide if it is an error on this send or an ICMP 
Datagram Too Big arrived indication from a previous send?  I know that netperf 
already has this information, so using it as the example is a bit stretched, but 
let's presume for the moment that netperf just has a socket handed to it from 
"somewhere."

rick jones
--

From: Glen Turner
Date: Wednesday, March 31, 2010 - 4:42 pm

Thanks David.

Does select() return from its blocking so the application can make
use of this indication immediately, rather than after the
application's exponentially-increasing wait?

Is an incoming ICMP the only cause of EMSGSIZE?  That is, can an
application safely retransmit immediately?

-- 
 Glen Turner
 www.gdt.id.au/~gdt

--

From: Rick Jones
Date: Wednesday, March 31, 2010 - 5:06 pm

Under Linux perhaps, and assuming it can guess which prior send triggered the 
EMSGSIZE, but under HP-UX EMSGSIZE means you tried to send a datagram larger 
than the socket buffer:

tusc src/netperf -t UDP_RR -- -s 1024 -r 60K
...
send(4, 0x4000ee68, 61440, 0) ............................ ERR#218 EMSGSIZE

I've not checked BSD, Solaris or AIX.

On a 2.6.22 kernel where I do the same thing, it returns ENOBUFS instead.

strace src/netperf -H localhost -t UDP_RR -- -s 1024 -r 60K
...
send(4, "netperf\0netperf\0netperf\0netperf\0n"..., 61440, 0) = -1 ENOBUFS (No 
buffer space available)

Of course the send() manpage on various Linux systems I've tried says:

        EMSGSIZE
               The  socket  type  requires that message be sent atomically, and
               the size of the message to be sent made this impossible.

        ENOBUFS
               The output queue for a network interface was full.  This  gener-
               ally  indicates  that the interface has stopped sending, but may
               be caused by transient congestion.   (Normally,  this  does  not
               occur in Linux.  Packets are just silently dropped when a device
               queue overflows.)

I suppose they are old on that system.  Netperf interprets an ENOBUFS per the 
manpage, and will not exit immediately in a UDP_STREAM test, but will simply 
count the send as failed and try again.  Not sure if it is worth trying to teach 
netperf differently here or not.

rick jones
--

From: Hagen Paul Pfeifer
Date: Wednesday, March 31, 2010 - 4:51 pm

IIRC, yes.
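
On the select()/poll() question, a minimal sketch of how a server could
wait for either a reply or a pending socket error instead of sleeping for
its whole back-off interval. It assumes that Linux flags a queued error
on the socket as POLLERR; the helper name wait_reply_or_error and its
return codes are hypothetical.

```c
#include <poll.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: wait up to timeout_ms for a reply or an error.
 * Returns 1 if an error is pending on the socket (POLLERR),
 *         2 if a datagram is readable (POLLIN),
 *         0 if the timeout expired with nothing to report,
 *        -1 if poll() failed or reported an unexpected event. */
int wait_reply_or_error(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&p, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    if (p.revents & POLLERR)   /* POLLERR is reported even if not requested */
        return 1;
    if (p.revents & POLLIN)
        return 2;
    return -1;
}
```

The caller would pass its current back-off interval as timeout_ms and
treat a return of 1 as the cue to inspect the error and retransmit.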


Cheers, Hagen

-- 
Hagen Paul Pfeifer <hagen@jauu.net>  ||  http://jauu.net/
Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
--

From: David Miller
Date: Thursday, March 25, 2010 - 8:24 pm

From: Glen Turner <gdt@gdt.id.au>

So the argument is, the kernel TCP does retransmission smart,
userspace UDP apps do it stupidly, so let's turn off the feature
instead of fixing userspace.

Right?

Sorry, fix this correctly in the user apps.  Putting the
blame on UDP path MTU discovery is placing it in the
wrong spot.

--

From: Andi Kleen
Date: Sunday, March 28, 2010 - 1:41 am

It means though that all IPv6 UDP applications essentially have
to implement path mtu discovery support (which is non trivial) 

Will be likely a long time until they're all fixed.

Seems like a big hole not considered by the IPv6 designers?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Glen Turner
Date: Wednesday, March 31, 2010 - 4:57 pm

It is trivial from the application's point of view to let the
kernel find the UDP Path MTU. We just need more information
from the kernel as to when it would like to see those packets
(ie, for performance we'd like to feed in the packet to re-send
as soon as the ICMP Packet Too Big arrives for the previous

There's no need to make that assumption.  We'd very much like
transactional UDP protocols to work well in advanced networks.
The other choices -- holding down millions of TCP sockets,
or using new protocols (and there are competing proposals) --
don't exactly fill our operations teams with confidence.

We'd very much like to use UDP where we can and something else

Yeah. The sockets API for IPv6 required an additional feature that
the IETF did not foresee.

-- 
 Glen Turner
 www.gdt.id.au/~gdt

--

From: Andi Kleen
Date: Wednesday, March 31, 2010 - 5:57 pm

Linux (or in this concrete case ANK) did foresee it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Andi Kleen
Date: Sunday, March 28, 2010 - 1:50 am

You can still turn path mtu discovery off and Linux will
fragment based on the known path MTU (I believe when
the too big fragment gets an ICMP back the pmtu gets updated)

However you might lose a few packets in the process until the path MTU
is known, but at least it will stay cached (unless you thrash the
routing cache)

In theory one could probably add some hack in the kernel UDP code
to hold one packet and retransmit it immediately with fragments when
the ICMP comes in. However that would be quite far in behaviour from
traditional UDP and be considered very ugly. It could also mess up
congestion avoidance schemes done by the application. 

Still might be preferable over rewriting zillions of applications?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Rick Jones
Date: Monday, March 29, 2010 - 10:01 am

But which of the last N datagrams sent by the application should be retained for 
retransmission?  It could be scores if not hundreds of datagrams depending on 
the behaviour of the application and the latency to the narrow part of the network.

That the IPv6 specification was heavily "influenced" by "the router guys" seems 
increasingly clear...

rick jones
--

From: Andi Kleen
Date: Monday, March 29, 2010 - 1:14 pm

Yes, if there's a large window you lose. I guess it would make protocols
like DHCP work at least ("transactional UDP" as the original poster called it)

I don't know if it would fix enough applications to be worth 
implementing. The only way to find out would be to try I guess.

Yes it sounds like the IETF didn't completely think that through.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Rick Jones
Date: Monday, March 29, 2010 - 1:25 pm

I don't think there are any good solutions that do not require either 
application involvement, or a modification to IPv6.

How about allowing an application to request that (copies of) ICMP(v6) messages 
be made available through the socket?  In that way, the application, which 
ostensibly already has to be keeping track of its sends for its own nefarious 
retransmission porpoises can receive the "signal" just like TCP does and perhaps 
there will be enough in the ICMPv6 message for the application to know which 
message(s) need to be retransmitted.

rick jones
--

From: Edgar E. Iglesias
Date: Monday, March 29, 2010 - 1:50 pm

Are things really that bad?

These "transactional" IPv6 apps all have the option to stick to 1280
sized datagrams to avoid the problem. If throughput is an issue these
apps will surely benefit from proper PMTUD anyway or?

Cheers
--

From: Rick Jones
Date: Monday, March 29, 2010 - 2:01 pm

I would get the alphabet soup completely garbled, but the DNS folks are talking 
about EDNS (?) message sizes upwards of 4096 bytes - encryption/authentication 
and other angels being asked to dance on the head of the DNS pin are asking for 
more and more space in the messages.

So, someone will have to blink somewhere - either DNS will have to go TCP and 
*possibly* take RTT hits there depending on various patch streams, or the IEEE 
will have to sanction jumbo frames and people deploy them widely, or it will 
have to become feasible to actually do the occasional IPv6 datagram 
fragmentation and get a timely retransmission out of a UDP application on a PMTU 
hit.

rick jones
--

From: Eric Dumazet
Date: Monday, March 29, 2010 - 2:29 pm

1) 4096 bytes UDP messages... well...
2) Using regular TCP for DNS servers... well...

I believe some guys were pushing TCPCT (Cookie Transactions) for this
case ( http://tools.ietf.org/html/draft-simpson-tcpct-00.html )

(That is, using an enhanced TCP for long DNS queries... but not only for
DNS...)



--

From: Templin, Fred L
Date: Monday, March 29, 2010 - 4:38 pm

IPv4 gets by this by setting DF=0 in the IP header, and
lets the network fragment the packet if necessary. IPv6 can
similarly get by this by having the sending host fragment
the large UDP packet into IPv6 fragments no longer than
1280 bytes each.

But wait! IPv4 hosts are only required to reassemble 576 bytes
at a minimum, and IPv6 hosts are only required to reassemble
1500 bytes at a minimum. Indeed, RFC2460 says:

   "An upper-layer protocol or application that depends on IPv6
   fragmentation to send packets larger than the MTU of a path should
   not send packets larger than 1500 octets unless it has assurance that
   the destination is capable of reassembling packets of that larger
   size."

but it is not clear how the sender can get such "assurance".
In the end, perhaps IPv6 should just do what IPv4 does;
turn off PMTUD and hope for the best?
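
To put numbers on the host-based fragmentation described above, here is a
small arithmetic sketch. It assumes a plain 40-byte IPv6 header, an 8-byte
fragment header, an 8-byte UDP header and no other extension headers; the
constant and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum {
    V6_MIN_MTU      = 1280, /* minimum IPv6 link MTU */
    V6_HDR_LEN      = 40,   /* fixed IPv6 header */
    V6_FRAG_HDR_LEN = 8,    /* fragment extension header */
    UDP_HDR_LEN     = 8,
    V6_MIN_REASM    = 1500  /* reassembly size every host must accept */
};

/* Number of IPv6 packets needed to carry a UDP payload of payload_len
 * bytes when no packet on the wire may exceed 1280 bytes. */
size_t v6_min_mtu_packets(size_t payload_len)
{
    size_t datagram = UDP_HDR_LEN + payload_len;

    if (V6_HDR_LEN + datagram <= V6_MIN_MTU)
        return 1; /* fits without a fragment header */

    /* Fragment data must be a multiple of 8 bytes: 1232 per fragment. */
    size_t per_frag = (size_t)(V6_MIN_MTU - V6_HDR_LEN - V6_FRAG_HDR_LEN)
                      & ~(size_t)7;
    return (datagram + per_frag - 1) / per_frag;
}

/* True if the reassembled packet (IPv6 header + UDP header + payload)
 * is larger than the 1500 octets a receiver is required to reassemble. */
bool v6_exceeds_min_reassembly(size_t payload_len)
{
    return V6_HDR_LEN + UDP_HDR_LEN + payload_len > V6_MIN_REASM;
}
```

Under these assumptions a 4096-byte payload needs four packets and also
exceeds the 1500-octet reassembly guarantee, which is the "assurance"
problem raised above.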

Fred
--

From: Andi Kleen
Date: Monday, March 29, 2010 - 10:20 pm

That's true -- in theory the UDP app unwilling/unable to do proper PMTU discovery 
could set the path mtu to 1280 + header and still keep path mtu discovery off 
and then just fragment. 

Drawback would be of course suboptimal network use with too small MTUs
in the common case.

Right now there is no socket option to set the path MTU. We
have an IP_MTU option, but it only works for getting the MTU.
That's because the PMTU is in the routing cache entry and shared
by multiple sockets. Presumably one could add a special case
with an MTU in the socket overriding the one in the destination entry.
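
For reference, a minimal sketch of the read-only IP_MTU query mentioned
above, assuming a Linux IPv4 UDP socket. The helper name query_path_mtu
is hypothetical. The option only returns a value once the socket has been
connect()ed, since the path MTU lives in the route for a specific
destination.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: return the kernel's currently known path MTU for
 * a connected socket, or -1 on failure (errno set by getsockopt, e.g.
 * ENOTCONN when the socket has no destination yet). */
int query_path_mtu(int fd)
{
    int mtu = 0;
    socklen_t len = sizeof(mtu);

    if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) < 0)
        return -1;
    return mtu;
}
```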

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Eric Dumazet
Date: Monday, March 29, 2010 - 11:06 pm

We have IP_MTU_DISCOVER option with four existing values



/* IP_MTU_DISCOVER values */
#define IP_PMTUDISC_DONT                0       /* Never send DF frames */
#define IP_PMTUDISC_WANT                1       /* Use per route hints  */
#define IP_PMTUDISC_DO                  2       /* Always DF            */
#define IP_PMTUDISC_PROBE               3       /* Ignore dst pmtu      */

We might add a fifth value (or open full range) and change 

static inline int ip_skb_dst_mtu(struct sk_buff *skb)
{
        struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL;

        return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ?
               skb_dst(skb)->dev->mtu : dst_mtu(skb_dst(skb));
}

->

static inline int ip_skb_dst_mtu(struct sk_buff *skb)
{
	if (skb->sk) {
		struct inet_sock *inet = inet_sk(skb->sk);

		if (inet->pmtudisc > IP_PMTUDISC_PROBE)
			return inet->pmtudisc;
		if (inet->pmtudisc == IP_PMTUDISC_PROBE)
			return skb_dst(skb)->dev->mtu;
	}
	return dst_mtu(skb_dst(skb));
}



--

From: Andi Kleen
Date: Monday, March 29, 2010 - 11:16 pm

I think you would just need a sk->pmtu and an option to set/unset it
Or perhaps just a flag that clamps to 1280? 

Then the existing mtu discover options would be sufficient.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Andi Kleen
Date: Monday, March 29, 2010 - 11:17 pm

Hmm, never mind your scheme of mapping it to the pmtudisc field would
probably work. Except it's u8 currently, would need to be u16 or u32

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Edgar E. Iglesias
Date: Monday, March 29, 2010 - 11:16 pm

Sorry I'm not following you here.. Why do you need to set the MTU?

IIUC:
UDP is supposed to preserve datagram boundaries, so the sender should
when seeing an EMSGSIZE, read the PMTU and avoid sending further UDP
packets larger than that. Userspace has control over the UDP datagram
size. If it can, the app will also at this point retransmit any recent
packets that went out larger than the fresh PMTU.

If you don't want to hassle with all of that, the app can stick to
1280 (or I guess for the extreme/lazy cases turn on fragmentation)..

Cheers
--

From: Andi Kleen
Date: Monday, March 29, 2010 - 11:19 pm

See the early mails in this thread. This is about apps who can't
limit themselves to 1280, but still don't want full blown PMTU.
[They probably should, but it can be a lot of work]

Setting the MTU would allow forcing fragmentation on the sending host
as a workaround similar to IPv4.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Edgar E. Iglesias
Date: Tuesday, March 30, 2010 - 1:20 am

Yes, but I don't see why you need an option with semantics of setting an MTU.

If a UDP app wants to use fragmentation (for whatever reason) setting
a boolean flag like XXX_PMTUDISC_DONT should be enough. The kernel will for
IPv6 have to work with the real PMTU or stick to 1280 when generating the
fragments. Keep in mind that unlike IPv4, IPv6 has no DF flag. It's up to
the sender to create the fragments.

Where does the application controllable per socket MTU come into the
picture?

Cheers
--

From: Andi Kleen
Date: Tuesday, March 30, 2010 - 7:12 am

To set the minimum path MTU so that there is a guarantee that IPv6 routers
(which are unable to fragment themselves) will never drop it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Edgar E. Iglesias
Date: Tuesday, March 30, 2010 - 3:04 pm

Not sure I agree with that kind of solution but that's probably
because of misunderstandings on my side :)

Thanks for explaining.
--

From: Andi Kleen
Date: Tuesday, March 30, 2010 - 9:06 am

Thanks for the pointer. The option is right now only defined,
but not implemented.  But yes it would help. 

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Templin, Fred L
Date: Tuesday, March 30, 2010 - 8:58 am

Right. Some apps may need to send isolated packets that
are larger than the path MTU without invoking path MTU

Right again. Unlike IPv4, however, IPv6 does not allow
in-the-network fragmentation. So when in doubt, apps
that need to send isolated packets that may violate the
path MTU should really perform host-based fragmentation
with a maximum fragment size of 1280. Isn't there a
socket option "IPV6_USE_MIN_MTU" that apps can use to
force fragmentation on large packets (RFC3542)?

Caveat - the app may have no way of knowing whether
the destination is capable of reassembling fragmented
packets larger than 1500...

Fred
--

From: Glen Turner
Date: Wednesday, March 31, 2010 - 4:43 pm

We don't need that sort of exotica from the kernel.  The applications
have to be prepared to retransmit lost packets in any case.

What we need is an API for an instant notification that an ICMP Packet
Too Big message has arrived concerning the socket.

Then the application simply retransmits immediately, without adding
to the exponential backoff penalty which the application maintains.
The application maintains an overall packet-transmitted limit to prevent
it can use for UDP Path MTU Discovery (paced at the RTT, so not
contributing to congestion collapse). That stream halts when the
first packet makes it to the end system.
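
The retransmission policy described above can be sketched as a small
state machine. This is only an illustration under assumptions: the type
and function names are hypothetical, the doubling schedule stands in for
whatever back-off the application already uses, and the caller is assumed
to learn of the Packet Too Big event by some means.

```c
/* Hypothetical retransmission state kept by the application. */
struct retx_state {
    unsigned delay_ms;     /* current back-off interval */
    unsigned max_delay_ms; /* cap on the back-off interval */
    unsigned sends_left;   /* overall limit on retransmissions */
};

enum retx_event {
    RETX_TIMEOUT,    /* no reply within delay_ms */
    RETX_PKT_TOO_BIG /* path MTU error reported for this socket */
};

/* Decide what to do after an event. Returns the number of milliseconds
 * to wait for a reply after retransmitting now, or -1 to give up.
 * A timeout doubles the back-off (capped); a Packet Too Big event
 * leaves it unchanged, so the MTU probe carries no back-off penalty,
 * but both count against the overall send limit. */
int retx_on_event(struct retx_state *st, enum retx_event ev)
{
    if (st->sends_left == 0)
        return -1;
    st->sends_left--;

    if (ev == RETX_TIMEOUT) {
        if (st->delay_ms > st->max_delay_ms / 2)
            st->delay_ms = st->max_delay_ms;
        else
            st->delay_ms *= 2;
    }
    return (int)st->delay_ms;
}
```

Counting the immediate retransmissions against sends_left is what keeps
a misbehaving path from turning the probe into an unbounded stream.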

As for David Miller's rant, the applications currently have no choice
but to "do it stupidly" as the kernel doesn't pass enough information
for user space to do it intelligently.  If the kernel passed user space
the same indication as TCP gets, then we could -- and would -- do it
right.

Re-writing the applications to take advantage of the API is no great
shakes -- there aren't many of them, they are written by people with
a good knowledge of networking, but unfortunately they tend to do
important stuff (allocate addresses, serve names, authenticate link
layer access).

It would be nice if the API had some commonality between platforms.
But there's no shortage of #ifdefs already, and one more to make
these applications work well for IPv6 on jumbo frames on the platform
of choice for networking infrastructure would be seen by application
authors as well worthwhile.

Thanks for your consideration,
Glen

-- 
 Glen Turner
 www.gdt.id.au/~gdt

--

From: Andi Kleen
Date: Wednesday, March 31, 2010 - 5:55 pm

That's wrong. Linux has supported UDP/RAW pmtu discovery for many, many
years.

I have a really old presentation on it (from 2000 or so):

http://halobates.de/net-topics/text33.htm
http://halobates.de/net-topics/text34.htm
http://halobates.de/net-topics/text35.htm
http://halobates.de/net-topics/text36.htm

It's also in the manpages.

However I suspect it's too much work to change a lot of applications
to that, so I suspect the IPV6_USE_MIN_MTU workaround is still needed.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Glen Turner
Date: Thursday, April 1, 2010 - 10:41 pm

Hi Andi,

So what should I code?  The suggested EMSGSIZE or your suggestion
of grabbing all returning ICMP and parsing it?  Noting that the
second choice is pretty ugly.  That both seem specific to Linux is
frustrating, but that is life -- adding support for an operating
system seems to inevitably add #ifdefs for this sort of code.

Let me know and I'll code it into FreeRADIUS and radsecproxy and
I'll see how they go with 802.1x requests over IPv6.

Thanks so much for your time, Glen

-- 
 Glen Turner
 www.gdt.id.au/~gdt

--

From: Andi Kleen
Date: Sunday, April 4, 2010 - 3:25 am

You don't need to parse any ICMPs, the kernel does that for you. 
See the documentation of IP_RECVERR in ip(7). The MTU is in ee_info

First you need to enable path mtu discovery for the socket
using IP_MTU_DISCOVER.

So you can either keep track of the MTU yourself based on extended
errors coming out of IP_RECVERR, or ask the kernel using IP_MTU when
the socket is connected or simply lower when you see an EMSGSIZE. It's also 

Well when the other OSes see the need they will hopefully add similar
interfaces, with some luck even compatible to the ones in Linux.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--
