[This is a second attempt to report this bug.] Path MTU Discovery for UDP underperforms for IPv4 and fails for IPv6 in Linux for transactional services like DHCP and RADIUS running on jumbo frame interfaces. These servers send packets with exponential back-off. UDP Path MTU Discovery probes for the path MTU each time the application sends a packet. So if you start with a high enough interface MTU then the server application backoff times get huge and the client gives up before the path MTU is discovered. This differs from TCP, where it is the kernel -- and not the application -- which organises retransmission. On receiving an ICMP Fragmentation Needed the kernel can immediately re-probe the path MTU without waiting for an exponential timer to expire. In IPv4 there is a work-around for the server: turn off Path MTU Discovery and allow routers to fragment the packet as needed. Looking at the code for the various transactional servers (ISC DHCP, FreeRADIUS, RADIATOR, radsecproxy) they all disable Path MTU Discovery on Linux. This work-around has the side effect of hiding the problem, misleading people into thinking that UDP Path MTU Discovery actually works for these transactional servers. In IPv6 routers do not fragment packets, so there is no work-around. Transactional servers which use UDP over IPv6 encounter exponential backoffs within the application and the client abandons the transaction. There is no way for the server to know that the packet was lost due to Path MTU Discovery and to immediately re-transmit it (without an exponential penalty) so that the MTU can be probed again. This can be viewed as a flaw in the RFC and in the sockets API for which IPv6 has removed the common work-around. Thank you, Glen -- Glen Turner www.gdt.id.au/~gdt --
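For readers unfamiliar with the IPv4 work-around described above, here is a minimal sketch of what it typically looks like on Linux. It assumes an already-created IPv4 UDP socket; the helper name disable_pmtu_discovery is made up for illustration and is not taken from any of the servers mentioned.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: clear DF on outgoing IPv4 datagrams for this socket,
 * so routers may fragment them instead of sending back ICMP Fragmentation
 * Needed. Returns 0 on success, -1 on error (errno set by setsockopt). */
static int disable_pmtu_discovery(int fd)
{
    int val = IP_PMTUDISC_DONT;

    return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
}
```

This only helps for IPv4; as the message says, IPv6 routers never fragment, so the same trick does not rescue an oversized IPv6 datagram in the network.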
So, presuming it is indeed a bug what form might a fix take? Are you suggesting
there should be a way for an application to say "Please let me see/know about
the ICMP messages?" Is that option available on other platforms as a
platform-specific extension? I don't have the details, but the HP-UX 11i v3
(11.31) netinet/udp.h file contains these:
    #define UDP_RX_ICMP  0x02 /* boolean; get/set ICMP packets reception */
                              /* Set to 1 if ICMP packets are to be received */
    #define UDP_RX_ICMP6 0x03 /* boolean; get/set ICMPv6 packets reception */
                              /* Set to 1 if ICMPv6 packets are to be received */
and it does appear that they are in more places than just HP-UX - there are some
hits for that for the old Apple Open Transport - which makes sense - it too had
Mentat origins.
rick jones
--
From: Rick Jones <rick.jones2@hp.com> We already provide this information. The socket ends up with EMSGSIZE in its error queue, so the next time the application does I/O it sees that error immediately from the read/write call and thus knows that path MTU arrived. --
A possibly pedantic question, but only when it does I/O, or also when/if it is in poll/select? What distinguishes this EMSGSIZE from a run-of-the-mill EMSGSIZE error such as one gets from trying to send a datagram larger than SO_SNDBUF? That is something that happens all the time in netperf when people forget a -m option on UDP_STREAM tests :) Netperf gets the error and exits. But supposing I wanted to make netperf more sophisticated in that regard - what sort of things must it do? Call getsockopt(SO_SNDBUF) to check the size of the failed send against SO_SNDBUF and only then decide if it is an error on this send or an "ICMP Datagram Too Big arrived" indication from a previous send? I know that netperf already has this information, so using it as the example is a bit stretched, but let's presume for the moment that netperf just has a socket handed to it from "somewhere." rick jones --
Thanks David. Does select() return from its blocking so the application can make use of this indication immediately, rather than after the application's exponentially-increasing wait? Is an incoming ICMP the only cause of EMSGSIZE? That is, can an application safely retransmit immediately? -- Glen Turner www.gdt.id.au/~gdt --
Under Linux perhaps, and assuming it can guess which prior send triggered the
EMSGSIZE, but under HP-UX EMSGSIZE means you tried to send a datagram larger
than the socket buffer:
    tusc src/netperf -t UDP_RR -- -s 1024 -r 60K
    ...
    send(4, 0x4000ee68, 61440, 0) ............................ ERR#218 EMSGSIZE
I've not checked BSD, Solaris or AIX.
On a 2.6.22 kernel where I do the same thing, it returns ENOBUFS instead.
    strace src/netperf -H localhost -t UDP_RR -- -s 1024 -r 60K
    ...
    send(4, "netperf\0netperf\0netperf\0netperf\0n"..., 61440, 0) = -1 ENOBUFS (No buffer space available)
Of course the send() manpage on various Linux systems I've tried says:
    EMSGSIZE
           The socket type requires that message be sent atomically, and
           the size of the message to be sent made this impossible.
    ENOBUFS
           The output queue for a network interface was full. This
           generally indicates that the interface has stopped sending, but
           may be caused by transient congestion. (Normally, this does not
           occur in Linux. Packets are just silently dropped when a device
           queue overflows.)
I suppose they are old on that system. Netperf interprets an ENOBUFS per the
manpage, and will not exit immediately in a UDP_STREAM test, but will simply
count the send as failed and try again. Not sure if it is worth trying to teach
netperf differently here or not.
rick jones
--
IIRC, yes. Cheers, Hagen -- Hagen Paul Pfeifer <hagen@jauu.net> || http://jauu.net/ Telephone: +49 174 5455209 || Key Id: 0x98350C22 Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22 --
From: Glen Turner <gdt@gdt.id.au> So the argument is, the kernel TCP does retransmission smart, userspace UDP apps do it stupidly, so let's turn off the feature instead of fixing userspace. Right? Sorry, fix this correctly in the user apps. Putting the blame on UDP path MTU discovery is placing it in the wrong spot. --
It means though that all IPv6 UDP applications essentially have to implement path mtu discovery support (which is non trivial) Will be likely a long time until they're all fixed. Seems like a big hole not considered by the IPv6 designers? -Andi -- ak@linux.intel.com -- Speaking for myself only. --
It is trivial from the applications point of view to let the kernel find the UDP Path MTU. We just need more information from the kernel as to when it would like to see those packets (ie, for performance we'd like to feed in the packet to re-send as soon as the ICMP Packet Too Big arrives for the previous There's no need to make that assumption. We'd very much like transactional UDP protocols to work well in advanced networks. The other choices -- holding down millions of TCP sockets, or using new protocols (and there are competing proposals) -- don't exactly fill our operations teams with confidence. We'd very much like to use UDP were we can and something else Yeah. The sockets API for IPv6 required an additional feature that the IETF did not foresee. -- Glen Turner www.gdt.id.au/~gdt --
Linux (or in this concrete case ANK) did foresee it. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
You can still turn path mtu discovery off and Linux will fragment based on the known path MTU (I believe when the too-big fragment gets an ICMP back the pmtu gets updated). However you might lose a few packets in the process until the path MTU is known, but at least it will stay cached (unless you thrash the routing cache). In theory one could probably add some hack in the kernel UDP code to hold one packet and retransmit it immediately with fragments when the ICMP comes in. However that would be quite far in behaviour from traditional UDP and be considered very ugly. It could also mess up congestion avoidance schemes done by the application. Still might be preferable over rewriting zillions of applications? -Andi -- ak@linux.intel.com -- Speaking for myself only. --
But which of the last N datagrams sent by the application should be retained for retransmission? It could be scores if not hundreds of datagrams depending on the behaviour of the application and the latency to the narrow part of the network. That the IPv6 specification was heavily "influenced" by "the router guys" seems increasingly clear... rick jones --
Yes, if there's a large window you lose. I guess it would make protocols like DHCP work at least ("transactional UDP" as the original poster called it) I don't know if it would fix enough applications to be worth implementing. The only way to find out would be to try I guess. Yes it sounds like the IETF didn't completely think that through. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
I don't think there are any good solutions that do not require either application involvement, or a modification to IPv6. How about allowing an application to request that (copies of) ICMP(v6) messages be made available through the socket? In that way, the application, which ostensibly already has to be keeping track of its sends for its own nefarious retransmission porpoises can receive the "signal" just like TCP does and perhaps there will be enough in the ICMPv6 message for the application to know which message(s) need to be retransmitted. rick jones --
Are things really that bad? These "transactional" IPv6 apps all have the option to stick to 1280 sized datagrams to avoid the problem. If throughput is an issue these apps will surely benefit from proper PMTUD anyway or? Cheers --
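As a quick illustration of what "sticking to 1280" means for the application's datagram size, here is a small arithmetic sketch. It assumes fixed-size headers (40 bytes IPv6, 20 bytes IPv4 without options, 8 bytes UDP) and no IPv6 extension headers; the function name is made up for this example.

```c
/* Largest UDP payload that fits in one unfragmented packet of path_mtu
 * bytes, assuming no IP options or IPv6 extension headers. */
static int max_udp_payload(int path_mtu, int is_ipv6)
{
    int ip_hdr = is_ipv6 ? 40 : 20;   /* fixed IP header size */
    int room = path_mtu - ip_hdr - 8; /* minus the 8-byte UDP header */

    return room > 0 ? room : 0;
}
```

With the IPv6 minimum MTU of 1280, max_udp_payload(1280, 1) gives 1232 bytes of application data per datagram.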
I would get the alphabet soup completely garbled, but the DNS folks are talking about EDNS (?) message sizes upwards of 4096 bytes - encryption/authentication and other angels being asked to dance on the head of the DNS pin are asking for more and more space in the messages. So, someone will have to blink somewhere - either DNS will have to go TCP and *possibly* take RTT hits there depending on various patch streams, or the IEEE will have to sanction jumbo frames and people deploy them widely, or it will have to become feasible to actually do the occasional IPv6 datagram fragmentation and get a timely retransmission out of a UDP application on a PMTU hit. rick jones --
1) 4096 bytes UDP messages... well... 2) Using regular TCP for DNS servers... well... I believe some guys were pushing TCPCT (Cookie Transactions) for this case ( http://tools.ietf.org/html/draft-simpson-tcpct-00.html ) (That is, using an enhanced TCP for long DNS queries... but not only for DNS...) --
IPv4 gets by this by setting DF=0 in the IP header, and lets the network fragment the packet if necessary. IPv6 can similarly get by this by having the sending host fragment the large UDP packet into IPv6 fragments no longer than 1280 bytes each. But wait! IPv4 hosts are only required to reassemble 576 bytes at a minimum, and IPv6 hosts are only required to reassemble 1500 bytes at a minimum. Indeed, RFC2460 says: "An upper-layer protocol or application that depends on IPv6 fragmentation to send packets larger than the MTU of a path should not send packets larger than 1500 octets unless it has assurance that the destination is capable of reassembling packets of that larger size." but it is not clear how the sender can get such "assurance". In the end, perhaps IPv6 should just do what IPv4 does; turn off PMTUD and hope for the best? Fred --
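The numbers in this message can be worked through with a short sketch: how many fragments a sender produces when it fragments a UDP datagram to a given MTU, and whether the reassembled packet goes past the 1500-octet size every IPv6 host must be able to reassemble. The header sizes are the standard fixed ones (40-byte IPv6 header, 8-byte fragment header, 8-byte UDP header); the names are invented for illustration, and other extension headers are assumed absent.

```c
#include <stddef.h>

#define EX_IPV6_HDR   40u   /* fixed IPv6 header */
#define EX_FRAG_HDR    8u   /* IPv6 fragment extension header */
#define EX_UDP_HDR     8u   /* UDP header */
#define EX_MIN_REASM 1500u  /* minimum reassembly size quoted from RFC 2460 */

/* Number of IPv6 packets needed to carry udp_payload bytes when the
 * sending host fragments to mtu. Returns 1 if no fragmentation is needed. */
static size_t ipv6_udp_fragment_count(size_t udp_payload, size_t mtu)
{
    size_t datagram = EX_UDP_HDR + udp_payload;  /* the fragmentable part */
    size_t per_frag;

    if (EX_IPV6_HDR + datagram <= mtu)
        return 1;
    /* Every fragment except the last carries a multiple of 8 bytes. */
    per_frag = (mtu - EX_IPV6_HDR - EX_FRAG_HDR) & ~(size_t)7;
    return (datagram + per_frag - 1) / per_frag;
}

/* Non-zero if the reassembled packet would exceed 1500 octets, i.e. the
 * sender needs "assurance" that the destination can reassemble it. */
static int exceeds_min_reassembly(size_t udp_payload)
{
    return EX_IPV6_HDR + EX_UDP_HDR + udp_payload > EX_MIN_REASM;
}
```

For a 4096-byte payload fragmented to 1280, this gives 4 fragments and a reassembled size above 1500 octets, which is the case the RFC 2460 quote warns about.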
That's true -- in theory the UDP app unwilling/unable to do proper PMTU discovery could set the path mtu to 1280 + header and still keep path mtu discovery off and then just fragment. Drawback would be of course suboptimal network use with too small MTUs in the common case. Right now there is no suitable socket option to set the path mtu. We have an IP_MTU option, but it only works for getting the MTU. That's because the PMTU is in the routing cache entry and shared by multiple sockets. Presumably one could add a special case with an MTU in the socket overriding the one in the destination entry. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
We have IP_MTU_DISCOVER option with four existing values
    /* IP_MTU_DISCOVER values */
    #define IP_PMTUDISC_DONT   0 /* Never send DF frames */
    #define IP_PMTUDISC_WANT   1 /* Use per route hints */
    #define IP_PMTUDISC_DO     2 /* Always DF */
    #define IP_PMTUDISC_PROBE  3 /* Ignore dst pmtu */
We might add a fifth value (or open full range) and change
    static inline int ip_skb_dst_mtu(struct sk_buff *skb)
    {
        struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL;

        return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ?
               skb_dst(skb)->dev->mtu : dst_mtu(skb_dst(skb));
    }
->
    static inline int ip_skb_dst_mtu(struct sk_buff *skb)
    {
        if (skb->sk) {
            struct inet_sock *inet = inet_sk(skb->sk);

            /* proposed: values above PROBE are a per-socket MTU */
            if (inet->pmtudisc > IP_PMTUDISC_PROBE)
                return inet->pmtudisc;
            /* existing behaviour: ignore dst pmtu, use device MTU */
            if (inet->pmtudisc == IP_PMTUDISC_PROBE)
                return skb_dst(skb)->dev->mtu;
        }
        return dst_mtu(skb_dst(skb));
    }
--
I think you would just need a sk->pmtu and a option to set/unset it Or perhaps just a flag that clamps to 1280? Then the existing mtu discover options would be sufficient. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Hmm, never mind your scheme of mapping it to the pmtudisc field would probably work. Except it's u8 currently, would need to be u16 or u32 -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Sorry I'm not following you here.. Why do you need to set the MTU? IIUC: UDP is supposed to preserve datagram boundaries, so the sender should when seeing an EMSGSIZE, read the PMTU and avoid sending further UDP packets larger than that. Userspace has control over the UDP datagram size. If it can, the app will also at this point retransmit any recent packets that went out larger than the fresh PMTU. If you don't want to hassle with all of that, the app can stick to 1280 (or I guess for the extreme/lazy cases turn on fragmentation).. Cheers --
See the early mails in this thread. This is about apps who can't limit themselves to 1280, but still don't want full blown PMTU. [They probably should, but it can be a lot of work] The MTU would allow to force fragmentation on the sending host as a workaround similar to IPv4. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Yes, but I don't see why you need an option with semantics of setting an MTU. If a UDP app wants to use fragmentation (for whatever reason) setting a boolean flag like XXX_PMTUDISC_DONT should be enough. The kernel will for IPv6 have to work with the real PMTU or stick to 1280 when generating the fragments. Keep in mind that unlike IPv4, IPv6 has no DF flag. It's up to the sender to create the fragments. Where does the application-controllable per-socket MTU come into the picture? Cheers --
To set the minimum path MTU so that there is a guarantee that IPv6 routers (which are unable to fragment themselves) will never drop it. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Not sure I agree with that kind of solution but that's probably because of misunderstandings on my side :) Thanks for explaining. --
Thanks for the pointer. The option is right now only defined, but not implemented. But yes it would help. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Right. Some apps may need to send isolated packets that are larger than the path MTU without invoking path MTU Right again. Unlike IPv4, however, IPv6 does not allow in-the-network fragmentation. So when in doubt, apps that need to send isolated packets that may violate the path MTU should really perform host-based fragmentation with a maximum fragment size of 1280. Isn't there a socket option "IPV6_USE_MIN_MTU" that apps can use to force fragmentation on large packets (RFC3542)? Caveat - the app may have no way of knowing whether the destination is capable of reassembling fragmented packets larger than 1500... Fred --
We don't need that sort of exotica from the kernel. The applications have to be prepared to retransmit lost packets in any case. What we need is an API for an instant notification that an ICMP Packet Too Big message has arrived concerning the socket. Then the application simply retransmits immediately, without adding to the exponential backoff penalty which the application maintains. The application maintains an overall packet-transmitted limit to prevent it can use for UDP Path MTU Discovery (paced at the RTT, so not contributing to congestion collapse). That stream halts when the first packet makes it to the end system. As for David Miller's rant, the applications currently have no choice but to "do it stupidly" as the kernel doesn't pass enough information for user space to do it intelligently. If the kernel passed user space the same indication as TCP gets, then we could -- and would -- do it right. Re-writing the applications to take advantage of the API is no great shakes -- there aren't many of them, they are written by people with a good knowledge of networking, but unfortunately they tend to do important stuff (allocate addresses, serve names, authenticate link layer access). It would be nice if the API had some commonality between platforms. But there's no shortage of #ifdefs already, and one more to make these applications work well for IPv6 on jumbo frames on the platform of choice for networking infrastructure would be seen by application authors as well worthwhile. Thanks for your consideration, Glen -- Glen Turner www.gdt.id.au/~gdt --
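The retransmission policy described in this message can be sketched as a tiny state machine: a timer expiry doubles the back-off, a "packet too big" notification triggers an immediate resend without touching the back-off, and an overall send limit bounds both. This is only one reading of the message; the struct, enum and function names are hypothetical and not taken from FreeRADIUS or any other server.

```c
/* Hypothetical per-transaction retransmission state. */
struct txn {
    long timeout_ms;     /* current back-off timer value */
    unsigned sends;      /* datagrams sent so far for this transaction */
    unsigned max_sends;  /* overall limit on transmissions */
};

enum txn_event {
    TXN_TIMEOUT,         /* back-off timer expired with no reply */
    TXN_PMTU_UPDATE      /* kernel reported a smaller path MTU */
};

/* Account for one retransmission caused by ev. Returns the timer value
 * (in ms) to arm after resending, or -1 if the send limit is reached
 * and the transaction should be abandoned. */
static long txn_retransmit(struct txn *t, enum txn_event ev)
{
    if (t->sends >= t->max_sends)
        return -1;
    t->sends++;
    if (ev == TXN_TIMEOUT)
        t->timeout_ms *= 2;  /* ordinary loss: exponential back-off */
    /* TXN_PMTU_UPDATE: resend now, back-off penalty left unchanged */
    return t->timeout_ms;
}
```

Counting PMTU-triggered resends against max_sends is what keeps a long series of MTU steps from turning into an unbounded stream of packets.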
That's wrong. Linux has supported UDP/RAW pmtu discovery for many, many years. I have a really old presentation on it (from 2000 or so): http://halobates.de/net-topics/text33.htm http://halobates.de/net-topics/text34.htm http://halobates.de/net-topics/text35.htm http://halobates.de/net-topics/text36.htm It's also in the manpages. However I suspect it's too much work to change a lot of applications to that, so I suspect the IPV6_MIN_MTU workaround is still needed. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Hi Andi, So what should I code? The suggested EMSGSIZE or your suggestion of grabbing all returning ICMP and parsing it? Noting that the second choice is pretty ugly. That both seem specific to Linux is frustrating, but that is life -- adding support for an operating system seems to inevitably add #ifdefs for this sort of code. Let me know and I'll code it into FreeRADIUS and radsecproxy and I'll see how they go with 802.1x requests over IPv6. Thanks so much for your time, Glen -- Glen Turner www.gdt.id.au/~gdt --
You don't need to parse any ICMPs, the kernel does that for you. See the documentation of IP_RECVERR in ip(7). The MTU is in ee_info. First you need to enable path mtu discovery for the socket using IP_MTU_DISCOVER. So you can either keep track of the MTU yourself based on extended errors coming out of IP_RECVERR, or ask the kernel using IP_MTU when the socket is connected, or simply lower it when you see an EMSGSIZE. It's also Well when the other OS see the need they will hopefully add similar interfaces, with some luck even compatible to the ones in Linux. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
