Hi all,

based on multiple user complaints about poor LAN performance with TCP window autotuning on the receiver side, we conducted several tests at our university to verify whether these complaints are valid. Unfortunately, our results confirmed that the present implementation indeed behaves erratically in a LAN context and causes serious harm to LAN operation.

The behaviour could be described as a "spiraling death" syndrome. While TCP with a constant and decently sized rx window natively reduces its transmission rate when RTT increases, autotuning does exactly the opposite: as a response to increased RTT it increases the rx window size (which in turn again increases RTT...). As this happens again and again, the result is a complete waste of all available buffers at the sending host or at the bottleneck point, resulting in up to 267 msec (!) latency in a LAN context (with a 100 Mbps ethernet connection, default txqueuelen=1000, MTU=1500 and the sky2 driver). Needless to say, this means the LAN is almost unusable.

With autotuning disabled, the same situation results in just 5 msec latency and still full 100 Mbps link utilization, since with a 64 kB rx window the TCP transmission is solely controlled by RTT without ever going into congestion avoidance mode, as there are no packet drops.

As rx window autotuning is enabled in all recent kernels, and with 1 GB of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly and we believe it needs urgent attention. As demonstrated above, such a huge rx window (which is at least 100*BDP of the example above) does not deliver any performance gain; instead it seriously harms other hosts and/or applications. It should also be noted that a host with autotuning enabled steals an unfair share of the total available bandwidth, which might look like a "better" performing TCP stack at first sight - however, such behaviour is not appropriate (RFC2914, section 3.2).

The possible solution to the above problem could be e.g. to limit ...
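The feedback loop described in this message can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel's algorithm: the sender is assumed to always fill the offered window, there is no packet loss, the receiver is assumed to offer twice what it received in the previous round trip (the factor of two comes from the description of autotuning later in this thread), and Ethernet framing overhead is ignored. The function name, the 0.2 ms unloaded RTT and the 3 MiB cap (the effective window quoted later in the thread for a 4 MB tcp_rmem) are illustrative choices.

```python
def simulate_window_growth(link_bps, base_rtt_s, initial_window, window_cap, rounds):
    """Toy model: return [(window_bytes, rtt_seconds), ...], one entry per round trip."""
    rate_bytes_per_s = link_bps / 8.0
    window = initial_window
    history = []
    for _ in range(rounds):
        # Data beyond what the path itself can hold waits in a queue, so the
        # RTT cannot be shorter than the time needed to drain one full window.
        rtt = max(base_rtt_s, window / rate_bytes_per_s)
        history.append((window, rtt))
        # Assumed rule: the receiver saw one full window during this RTT and
        # offers twice that amount next time, up to the configured cap.
        window = min(window_cap, 2 * window)
    return history


if __name__ == "__main__":
    for window, rtt in simulate_window_growth(
        link_bps=100e6,              # 100 Mbps Ethernet, as in the report
        base_rtt_s=0.0002,           # hypothetical unloaded LAN RTT
        initial_window=64 * 1024,
        window_cap=3 * 1024 * 1024,  # hypothetical effective window cap
        rounds=8,
    ):
        print(f"window={window:>8d} B  rtt={rtt * 1000:7.2f} ms")
```

In this model the window doubles every round until it reaches the cap, and the RTT settles at the time needed to drain one capped window, which is a few hundred milliseconds at 100 Mbps.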
It's well known that "standard" TCP fills all available drop-tail buffers, and that this behavior is not desirable.

The situation you describe is exactly what congestion control (the topic of RFC2914) should fix. It is not the role of the receive window (flow control). It is really the sender's job to detect and react to this, not the receiver's. (We have had this discussion before on netdev.)

There are a number of delay-based congestion control algorithms that have been implemented and are available in Linux, but all have proved problematic in many cases, and have not been suitable to enable widely. This is still an active research topic.

Another option in LANs is to enable AQM. In Linux, you can configure the bottleneck interface qdisc to be any of a number of RED-like early droppers. Most commercial routers also offer the ability to configure AQM on interfaces, though most do not enable it by default.

-John
--
Well, in practice that was always limited by receive window size, which
was by default 64 kB on most operating systems. So this undesirable behavior
was limited to hosts where receive window was manually increased to huge
values.
Today, the real effect of autotuning is the same as changing the receive window
size to 4 MB on *all* hosts, since there's no mechanism to prevent it from
It's not of high importance whose job it is according to pure theory.
What matters is that autotuning introduced a serious problem in the LAN context
by disabling any possibility to properly react to increasing RTT. Again,
it's not important whether this functionality was there by design or by
coincidence, but it was holding the system well-balanced for many years.
Now, as autotuning is enabled by default in the stock kernel, this problem is
spreading into LANs without users even knowing what's going on. Therefore
I'd like to suggest looking for a decent fix which could be implemented
in a relatively short time frame. My proposal is this:
- measure RTT during the initial phase of the TCP connection (first X segments)
- compute the maximal receive window size from the measured RTT, using a
  configurable constant representing the bandwidth part of BDP
- let autotuning do its work up to that limit.
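
The three steps above could be sketched roughly as follows. This is a hypothetical illustration, not kernel code: the function name is made up, taking the minimum of the early RTT samples is one possible reading of "measure RTT during the initial phase", and the 64 KiB floor and 4 MiB ceiling are assumptions added so the result stays within the range discussed in this thread.

```python
def rtt_bounded_window_cap(rtt_samples_s, bandwidth_bps,
                           floor_bytes=64 * 1024,
                           ceiling_bytes=4 * 1024 * 1024):
    """Return a per-connection upper limit for the autotuned receive window.

    rtt_samples_s: RTTs (seconds) measured over the first X segments.
    bandwidth_bps: configurable constant, the bandwidth part of the BDP.
    """
    if not rtt_samples_s:
        raise ValueError("need at least one RTT sample")
    # Assumption: the smallest early sample is closest to the unloaded RTT.
    base_rtt = min(rtt_samples_s)
    bdp_bytes = round(bandwidth_bps / 8 * base_rtt)
    # Autotuning would then be allowed to grow the window only up to this value.
    return max(floor_bytes, min(ceiling_bytes, bdp_bytes))


if __name__ == "__main__":
    # Hypothetical RTT samples; 1 Gbps is an example value for the constant.
    print(rtt_bounded_window_cap([0.0003, 0.0002, 0.00025], 1e9))  # LAN-like path
    print(rtt_bounded_window_cap([0.010, 0.011], 1e9))             # regional path
    print(rtt_bounded_window_cap([0.100, 0.105], 1e9))             # long-distance path
```

With these example inputs a LAN-like connection stays at the floor, while a long-distance connection is still allowed to reach the ceiling.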
With kind regards,
M.
--
On Mon, 9 Mar 2009 21:05:05 +0100

So you have broken infrastructure or senders and you want to blame the receiver? The receiver is not responsible for flow control in TCP.
--
From: Marian Ďurkovič <md@bts.sk>

You say "was" as if this was a recent change. Linux has been doing

There is, on the sender side (congestion control) and at the intermediate bottleneck routers (active queue management).

You are pointing the blame at the wrong area, as both John and Stephen are trying to tell you.
If I recall correctly, when I have asked about this behaviour in the past, I was told that the autotuning receiver would always try to offer the sender 2X what the receiver thought the sender's cwnd happened to be. Is my recollection incorrect, or is this then:

[root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 128K
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 0 AF_INET
THROUGHPUT=941.30
LSS_SIZE_REQ=131072
LSS_SIZE=262142
LSS_SIZE_END=262142
RSR_SIZE_REQ=-1
RSR_SIZE=87380
RSR_SIZE_END=3900000

not intended behaviour? LSS == Local Socket Send; RSR == Remote Socket Receive.

dl5855 is running RHEL 5.2 (2.6.18-92.el5)
sut42 is running a nf-next-2.6 about two or three weeks old with some of the 32-core scaling patches applied (2.6.29-rc5-nfnextconntrack)

I'm assuming that by setting the SO_SNDBUF on the netperf (sending) side to 128K/256K that will be the limit on what it will ever put out onto the connection at one time, but by the end of the 10 second test over the local GbE LAN the receiver's autotuned SO_RCVBUF has grown to 3900000.

rick jones
--
Hi Rick,

(Pretty sure we went over this already, but once more..) The receiver does not size to twice cwnd. It sizes to twice the amount of data that the application read in one RTT. In the common case of a path bottleneck and a receiving application that always keeps up, this equals 2*cwnd, but the distinction is very important to understanding its behavior in other cases.

In your test where you limit sndbuf to 256k, you will find that you did not fill up the bottleneck queues, and you did not get a significantly increased RTT, which are the negative effects we want to avoid. The large receive window caused no trouble at all.

-John
--
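
The distinction drawn in this message can be shown with a minimal sketch. It encodes only the rule as stated here (twice the data the application read in one RTT, bounded by a cap); the function name, the cap handling and the example numbers are illustrative assumptions, and the real kernel logic has more detail than this.

```python
def autotuned_rcv_window(bytes_read_in_last_rtt, cap_bytes):
    """Window target: twice what the application read in one RTT, up to a cap."""
    return min(cap_bytes, 2 * bytes_read_in_last_rtt)


if __name__ == "__main__":
    cap = 4 * 1024 * 1024
    cwnd_bytes = 100 * 1448  # hypothetical sender cwnd: 100 full-sized segments

    # Application keeps up: it reads everything that arrived, so the target
    # coincides with 2 * cwnd.
    print(autotuned_rcv_window(cwnd_bytes, cap))

    # Application is the bottleneck: it read only 50000 bytes in the RTT,
    # so the target follows the application, not the sender's cwnd.
    print(autotuned_rcv_window(50000, cap))
```

The two cases differ only in what the application managed to read, which is why the rule cannot be summarized as "2*cwnd" in general.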
> (Pretty sure we went over this already, but once more..)

Sometimes I am but dense north by northwest, but I am also occasionally simply

What is the definition of "significantly" here? With my 256K capped SO_SNDBUF ping seems to report like this:

[root@dl5855 ~]# ping sut42
PING sut42.west (10.208.0.45) 56(84) bytes of data.
64 bytes from sut42.west (10.208.0.45): icmp_seq=1 ttl=64 time=1.58 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=2 ttl=64 time=0.126 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=3 ttl=64 time=0.103 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=5 ttl=64 time=0.104 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=6 ttl=64 time=0.100 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=7 ttl=64 time=0.140 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=8 ttl=64 time=0.103 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=9 ttl=64 time=11.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=10 ttl=64 time=10.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=11 ttl=64 time=7.42 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=12 ttl=64 time=4.51 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=13 ttl=64 time=1.56 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=14 ttl=64 time=4.47 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=15 ttl=64 time=4.63 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=16 ttl=64 time=1.66 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=17 ttl=64 time=7.65 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=18 ttl=64 time=4.73 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=19 ttl=64 time=0.135 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=20 ttl=64 time=0.116 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=21 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=22 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=23 ttl=64 time=0.098 ms
64 bytes from ...
I think his point was that only now does it become a visible problem, as >= 1GB of memory is widespread, which leads to 4MB rx buffer sizes.

Perhaps this points to the default buffer sizing heuristics being too aggressive for >= 1GB?

Perhaps something like this patch? Marian, does that help?

-Andi

TCP: Lower per socket RX buffer sizing threshold

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 net/ipv4/tcp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.28-test/net/ipv4/tcp.c
===================================================================
--- linux-2.6.28-test.orig/net/ipv4/tcp.c 2009-02-09 11:06:52.000000000 +0100
+++ linux-2.6.28-test/net/ipv4/tcp.c 2009-03-11 11:01:53.000000000 +0100
@@ -2757,9 +2757,9 @@
 	sysctl_tcp_mem[1] = limit;
 	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
-	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
-	max_share = min(4UL*1024*1024, limit);
+	/* Set per-socket limits to no more than 1/256 the pressure threshold */
+	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 8);
+	max_share = min(2UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
 	sysctl_tcp_wmem[1] = 16*1024;
--
ak@linux.intel.com -- Speaking for myself only.
--
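
To see what the two changed lines of the patch compute, the arithmetic can be mirrored outside the kernel. This is a sketch, not the kernel function: it assumes 4 KiB pages (PAGE_SHIFT = 12), and the pressure-threshold values passed in are hypothetical examples rather than numbers taken from any particular machine.

```python
def per_socket_cap(pressure_pages, fraction_shift, ceiling_bytes, page_shift=12):
    """Mirror of: limit = pressure << (PAGE_SHIFT - shift); min(ceiling, limit).

    pressure_pages: the tcp_mem pressure threshold, counted in pages.
    fraction_shift: 7 means 1/128 of the threshold, 8 means 1/256.
    """
    limit = pressure_pages << (page_shift - fraction_shift)
    return min(ceiling_bytes, limit)


if __name__ == "__main__":
    MiB = 1024 * 1024
    for pages in (8192, 131072):  # hypothetical thresholds: 32 MiB and 512 MiB
        before = per_socket_cap(pages, fraction_shift=7, ceiling_bytes=4 * MiB)
        after = per_socket_cap(pages, fraction_shift=8, ceiling_bytes=2 * MiB)
        print(f"pressure={pages:>6d} pages  before={before:>8d} B  after={after:>8d} B")
```

Both the fraction and the ceiling are halved by the patch, so the resulting per-socket limit is halved regardless of which of the two terms is the binding one.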
Yes, exactly! We run into this after number of workstations were upgraded
Sure - as it lowers the maximum from 4MB to 2MB, the net result is that
RTTs at 100 Mbps immediately went down from 267 msec to:
--- x.x.x.x ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8992ms
rtt min/avg/max/mdev = 134.417/134.770/134.911/0.315 ms
Still, this is too high for a 100 Mbps network, since the RTTs with a 64 KB static
rx buffer look like this (with no performance penalty):
--- x.x.x.x ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 5.163/5.355/5.476/0.102 ms
I.e. the patch helps significantly, as expected; however, having one static
limit for all NIC speeds as well as for the whole range of RTTs is suboptimal
in principle.
Thanks & kind regards,
M.
--
From: Andi Kleen <andi@firstfloor.org>

It's necessary, Andi: you can't fill a transcontinental connection without at least a 4MB receive buffer.

Did you read the commit message of the change that increased the limit?
--
Seems pretty arbitrary to me. It's the value for a given bandwidth*latency product, but why not half or twice the bandwidth? I don't think that number is written in stone like you claim.

Anyway, it was just a test patch and it indeed seems to address the problem at least partly.

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
Besides being arbitrary, it's also incorrect. The defaults in
tcp.c set both tcp_wmem and tcp_rmem to 4 MB, ignoring
the fact that this results in a 4 MB send buffer but only a 3 MB
receive buffer due to other defaults (tcp_adv_win_scale=2).
Indeed, 3MB*(1538/1448)/100Mbps is equal to 267.3 msec
- i.e. exactly the latency we're seeing.
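
The arithmetic above can be reproduced with a short sketch. It assumes, as the formula does, that each 1448-byte payload occupies 1538 bytes of time on the wire (a 1500-byte MTU plus Ethernet header, FCS, preamble and inter-frame gap), and that tcp_adv_win_scale=2 leaves three quarters of the buffer for the advertised window. The function names are illustrative only.

```python
def advertised_window(rcvbuf_bytes, adv_win_scale=2):
    """Window left after reserving rcvbuf / 2**adv_win_scale (positive scale only)."""
    return rcvbuf_bytes - (rcvbuf_bytes >> adv_win_scale)


def drain_time_s(window_bytes, link_bps, wire_bytes=1538, payload_bytes=1448):
    """Time to push one full window through the link, including framing overhead."""
    on_wire = window_bytes * wire_bytes / payload_bytes
    return on_wire * 8 / link_bps


if __name__ == "__main__":
    MiB = 1024 * 1024
    for label, window in (
        ("4 MB tcp_rmem", advertised_window(4 * MiB)),
        ("2 MB tcp_rmem", advertised_window(2 * MiB)),
        ("64 kB window ", 64 * 1024),
    ):
        print(f"{label}: {drain_time_s(window, 100e6) * 1000:.1f} ms")
```

For the 4 MB case this gives 267.3 ms, and halving the buffer halves the figure, which is in line with the measurement of about 134 ms reported earlier in the thread for the 2 MB patch.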
With kind regards,
M.
--
It is of course just a number, though not exactly arbitrary -- it's approximately the required value for transcontinental 100 Mbps paths. Choosing the value is a matter of engineering trade-offs, and seemed like a reasonable cap at this time.

Any cap so much lower that it would give a small bound for LAN latencies would bring us back to the bad old days where you couldn't get anything more than 10 Mbps on the wide area.

-John
--
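
The trade-off described here follows from the bandwidth-delay product, which can be checked with a small sketch. The RTT values below (50 ms and 250 ms) are hypothetical examples chosen for illustration; the thread itself does not state which RTT was used to arrive at the 4 MB figure.

```python
def window_limited_throughput_bps(window_bytes, rtt_s):
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes * 8 / rtt_s


def window_needed_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_s / 8


if __name__ == "__main__":
    # A fixed 64 KiB window on a hypothetical 50 ms wide-area path.
    mbps = window_limited_throughput_bps(64 * 1024, 0.050) / 1e6
    print(f"64 KiB window at 50 ms RTT: {mbps:.1f} Mbps at most")

    # Window needed to fill 100 Mbps on a hypothetical 250 ms path.
    needed = window_needed_bytes(100e6, 0.250)
    print(f"100 Mbps at 250 ms RTT needs about {needed / 1e6:.2f} MB in flight")
```

Under these assumed RTTs, a 64 KiB window tops out near 10 Mbps, while filling a 100 Mbps long-distance path needs a window of a few megabytes.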
This is very likely a stupid question, but anyway... Is this with all applications, or only some pathological ones (one of which we both wrote code for, alright) with abnormally large send buffers?

--
Rémi Denis-Courmont
Maemo Software, Nokia Devices R&D
--
