Re: TCP rx window autotuning harmful at LAN context

Previous thread: [PATCH 3/3] bnx2x: Using DMAE to initialize the chip by Eilon Greenstein on Monday, March 9, 2009 - 3:52 am. (1 message)

Next thread: [GIT]: Networking by David Miller on Monday, March 9, 2009 - 5:29 am. (4 messages)
From: Marian
Date: Monday, March 9, 2009 - 4:25 am

Hi all,

  Based on multiple user complaints about poor LAN performance with
TCP window autotuning on the receiver side, we conducted several tests at
our university to verify whether these complaints are valid. Unfortunately,
our results confirmed that the present implementation indeed behaves
erratically in a LAN context and causes serious harm to LAN operation.

  The behaviour could be described as a "spiraling death" syndrome. While
TCP with a constant and decently sized rx window naturally reduces its
transmission rate when RTT increases, autotuning does exactly the opposite:
in response to increased RTT it increases the rx window size (which in turn
increases RTT again...). As this happens again and again, the result is
a complete waste of all available buffers at the sending host or at the
bottleneck point, resulting in up to 267 msec (!) latency in a LAN context
(with a 100 Mbps ethernet connection, default txqueuelen=1000, MTU=1500 and
the sky2 driver). Needless to say, this means the LAN is almost unusable.

   With autotuning disabled, the same situation results in just 5 msec
latency and still full 100 Mbps link utilization, since with a 64 kB rx window
the TCP transmission is controlled solely by RTT without ever going into
congestion avoidance mode, as there are no packet drops.

   As rx window autotuning is enabled in all recent kernels, and with 1 GB
of RAM the maximum tcp_rmem becomes 4 MB, this problem is spreading rapidly
and we believe it needs urgent attention. As demonstrated above, such a huge
rx window (which is at least 100*BDP of the example above) does not deliver
any performance gain; instead it seriously harms other hosts and/or
applications. It should also be noted that a host with autotuning enabled
steals an unfair share of the total available bandwidth, which might look
like a "better" performing TCP stack at first sight; however, such behaviour
is not appropriate (RFC2914, section 3.2).

   The possible solution to the above problem could be e.g. to limit ...
From: John Heffner
Date: Monday, March 9, 2009 - 11:01 am

It's well known that "standard" TCP fills all available drop-tail
buffers, and that this behavior is not desirable.

The situation you describe is exactly what congestion control (the
topic of RFC2914) should fix.  It is not the role of receive window
(flow control).  It is really the sender's job to detect and react to
this, not the receiver's.  (We have had this discussion before on
netdev.)  There are a number of delay-based congestion control
algorithms that have been implemented and are available in Linux, but
all have proved problematic in many cases, and have not been suitable
to enable widely.  This is still an active research topic.

Another option in LANs is to enable AQM.  In Linux, you can configure
the bottleneck interface qdisc to be any of a number of RED-like early
droppers.  Most commercial routers also offer the ability to configure
AQM on interfaces, though most do not enable it by default.
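
As a rough illustration of what a "RED-like early dropper" does, here is a minimal sketch of the classic RED drop-probability curve. The function and parameter names (min_th, max_th, max_p, weight) are illustrative only and do not correspond to any particular qdisc's configuration syntax; real implementations add further refinements.

```python
def ewma(avg, sample, weight):
    """Smooth the instantaneous queue length into a running average."""
    return (1.0 - weight) * avg + weight * sample


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Textbook RED curve: no early drops below min_th, a linear ramp
    up to max_p at max_th, and unconditional drops at or above max_th."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


if __name__ == "__main__":
    # Hypothetical thresholds, in packets.
    for q in (5, 20, 29, 30):
        print(q, red_drop_probability(q, min_th=10, max_th=30, max_p=0.1))
```

The point of dropping early is that the sender's congestion control sees a loss signal long before a drop-tail queue (such as txqueuelen=1000) is full.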

  -John
--

From: Marian
Date: Monday, March 9, 2009 - 1:05 pm

Well, in practice that was always limited by the receive window size, which
was by default 64 kB on most operating systems. So this undesirable behavior
was limited to hosts where the receive window was manually increased to huge
values.

Today, the real effect of autotuning is the same as changing the receive window
size to 4 MB on *all* hosts, since there's no mechanism to prevent it from

It's not of high importance whose job it is according to pure theory.
What matters is that autotuning introduced a serious problem in the LAN
context by disabling any possibility to properly react to increasing RTT.
Again, it's not important whether this functionality was there by design or
by coincidence, but it kept the system well-balanced for many years.

Now, as autotuning is enabled by default in the stock kernel, this problem is
spreading into LANs without users even knowing what's going on. Therefore
I'd like to suggest looking for a decent fix which could be implemented
in a relatively short time frame. My proposal is this:

- measure RTT during the initial phase of the TCP connection (first X segments)
- compute the maximal receive window size from the measured RTT, using
  a configurable constant representing the bandwidth part of BDP
- let autotuning do its work up to that limit.
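
The three steps above could be sketched roughly as follows. This is only an illustration of the proposal, not kernel code: the function names, the choice of the minimum sample as the "initial RTT", and the floor/ceiling defaults (64 kB and 4 MB, the two values discussed in this thread) are assumptions.

```python
def initial_rtt(samples_s):
    """Use the smallest RTT seen over the first X segments as the base RTT
    (assumption: the minimum is least affected by queueing)."""
    return min(samples_s)


def rcv_window_cap(rtt_s, assumed_bw_bps,
                   floor_bytes=64 * 1024, ceiling_bytes=4 * 1024 * 1024):
    """Cap for autotuning: BDP = configured bandwidth * measured RTT,
    clamped between a floor and the existing global maximum."""
    bdp_bytes = int(assumed_bw_bps * rtt_s / 8)
    return max(floor_bytes, min(bdp_bytes, ceiling_bytes))


if __name__ == "__main__":
    bw = 1e9  # hypothetical configurable constant: 1 Gbps
    for rtts in ([0.00013, 0.0001, 0.0016], [0.012, 0.010], [0.11, 0.10]):
        rtt = initial_rtt(rtts)
        print(f"rtt={rtt * 1000:.1f} ms -> cap={rcv_window_cap(rtt, bw)} bytes")
```

With these assumed numbers a LAN connection stays at the 64 kB floor, while a long-RTT connection is still allowed to grow to the full 4 MB.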

  With kind regards,

        M. 
--

From: Stephen Hemminger
Date: Monday, March 9, 2009 - 1:24 pm

On Mon, 9 Mar 2009 21:05:05 +0100

So you have broken infrastructure or senders and you want to blame
the receiver? The receiver is not responsible for flow control in TCP.

--

From: David Miller
Date: Monday, March 9, 2009 - 5:09 pm

From: Marian Ďurkovič <md@bts.sk>

You say "was" as if this was a recent change.  Linux has been doing

There is, on the sender side (congestion control) and at the
intermediate bottleneck routers (active queue management).

You are pointing the blame at the wrong area, as both John and Stephen
are trying to tell you.
From: Rick Jones
Date: Monday, March 9, 2009 - 5:34 pm

If I recall correctly, when I have asked about this behaviour in the past, I was 
told that the autotuning receiver would always try to offer the sender 2X what 
the receiver thought the sender's cwnd happened to be.  Is my recollection 
incorrect, or is this then:

[root@dl5855 ~]# netperf -t omni -H sut42 -- -k foo -s 128K
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sut42.west (10.208.0.45) port 
0 AF_INET
THROUGHPUT=941.30
LSS_SIZE_REQ=131072
LSS_SIZE=262142
LSS_SIZE_END=262142
RSR_SIZE_REQ=-1
RSR_SIZE=87380
RSR_SIZE_END=3900000

not intended behaviour?  LSS == Local Socket Send; RSR == Remote Socket Receive.
  dl5855 is running RHEL 5.2 (2.6.18-92.el5); sut42 is running an nf-next-2.6 about
two or three weeks old with some of the 32-core scaling patches applied
(2.6.29-rc5-nfnextconntrack).

I'm assuming that by setting the SO_SNDBUF on the netperf (sending) side to 
128K/256K that will be the limit on what it will ever put out onto the connection 
at one time, but by the end of the 10 second test over the local GbE LAN the 
receiver's autotuned SO_RCVBUF has grown to 3900000.

rick jones
--

From: John Heffner
Date: Monday, March 9, 2009 - 8:55 pm

Hi Rick,

(Pretty sure we went over this already, but once more..)  The receiver
does not size to twice cwnd.  It sizes to twice the amount of data
that the application read in one RTT.  In the common case of a path
bottleneck and a receiving application that always keeps up, this
equals 2*cwnd, but the distinction is very important to understanding
its behavior in other cases.
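
To make the "twice what the application read in one RTT" rule concrete, here is a toy model, not the kernel's implementation. It assumes a sender that is never limited by cwnd or its send buffer, a drop-free bottleneck queue, an application that always keeps up, and a window equal to the buffer size (tcp_adv_win_scale is ignored). The initial size 87380 and the 4 MB cap are the values that appear elsewhere in this thread; everything else is hypothetical.

```python
class ReceiveAutotuner:
    """Grow the receive buffer to 2x the bytes the app read in the last RTT."""

    def __init__(self, initial_bytes=87380, max_bytes=4 * 1024 * 1024):
        self.rcvbuf = initial_bytes
        self.max_bytes = max_bytes

    def on_rtt_elapsed(self, bytes_read_by_app):
        target = 2 * bytes_read_by_app
        if target > self.rcvbuf:  # the buffer only ever grows
            self.rcvbuf = min(target, self.max_bytes)
        return self.rcvbuf


def simulate(tuner, rate_bps, base_rtt_s, rounds):
    """Each round the sender fills the whole window; anything beyond the
    BDP sits in the bottleneck queue and inflates the RTT."""
    trace = []
    for _ in range(rounds):
        window = tuner.rcvbuf
        rtt_s = max(base_rtt_s, window * 8 / rate_bps)
        trace.append((window, rtt_s))
        # One window's worth of data arrives, and is read, per (inflated) RTT.
        tuner.on_rtt_elapsed(bytes_read_by_app=window)
    return trace


if __name__ == "__main__":
    for window, rtt_s in simulate(ReceiveAutotuner(), 100e6, 0.0002, 8):
        print(f"window={window:>8} bytes  rtt={rtt_s * 1000:7.2f} ms")
```

Under these assumptions the window doubles every round until it reaches the cap, which is the feedback loop described at the start of the thread. When the sender is limited by something else, the window stays above what is in flight and causes no queueing.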

In your test where you limit sndbuf to 256k, you will find that you
did not fill up the bottleneck queues, and you did not get a
significantly increased RTT, which are the negative effects we want to
avoid.  The large receive window caused no trouble at all.

  -John
--

From: Rick Jones
Date: Tuesday, March 10, 2009 - 10:20 am

> (Pretty sure we went over this already, but once more..) 

Sometimes I am but dense north by northwest, but I am also occasionally simply 

What is the definition of "significantly" here?

With my 256K capped SO_SNDBUF ping seems to report like this:

[root@dl5855 ~]# ping sut42
PING sut42.west (10.208.0.45) 56(84) bytes of data.
64 bytes from sut42.west (10.208.0.45): icmp_seq=1 ttl=64 time=1.58 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=2 ttl=64 time=0.126 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=3 ttl=64 time=0.103 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=5 ttl=64 time=0.104 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=6 ttl=64 time=0.100 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=7 ttl=64 time=0.140 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=8 ttl=64 time=0.103 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=9 ttl=64 time=11.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=10 ttl=64 time=10.3 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=11 ttl=64 time=7.42 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=12 ttl=64 time=4.51 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=13 ttl=64 time=1.56 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=14 ttl=64 time=4.47 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=15 ttl=64 time=4.63 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=16 ttl=64 time=1.66 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=17 ttl=64 time=7.65 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=18 ttl=64 time=4.73 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=19 ttl=64 time=0.135 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=20 ttl=64 time=0.116 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=21 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=22 ttl=64 time=0.102 ms
64 bytes from sut42.west (10.208.0.45): icmp_seq=23 ttl=64 time=0.098 ms
64 bytes from ...
From: Andi Kleen
Date: Wednesday, March 11, 2009 - 3:03 am

I think his point was that only now does it become a visible problem,
as >= 1GB of memory is widespread, which leads to 4MB rx buffer sizes.

Perhaps this points to the default buffer sizing heuristics
being too aggressive for >= 1GB?

Perhaps something like this patch? Marian, does that help?

-Andi

TCP: Lower per socket RX buffer sizing threshold 

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 net/ipv4/tcp.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.28-test/net/ipv4/tcp.c
===================================================================
--- linux-2.6.28-test.orig/net/ipv4/tcp.c	2009-02-09 11:06:52.000000000 +0100
+++ linux-2.6.28-test/net/ipv4/tcp.c	2009-03-11 11:01:53.000000000 +0100
@@ -2757,9 +2757,9 @@
 	sysctl_tcp_mem[1] = limit;
 	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
-	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
-	max_share = min(4UL*1024*1024, limit);
+	/* Set per-socket limits to no more than 1/256 the pressure threshold */
+	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 8);
+	max_share = min(2UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
 	sysctl_tcp_wmem[1] = 16*1024;
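
For readers who want to see the effect of the two changed lines, here is a small sketch of the same arithmetic outside the kernel. It assumes 4 KiB pages (PAGE_SHIFT = 12) and takes the pressure threshold, in pages, as an input; the example page counts are made up.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def max_share(pressure_pages, shift, cap_bytes):
    """Mirror of the patched lines: pages << PAGE_SHIFT gives bytes, and
    subtracting `shift` from the shift count divides by 2**shift."""
    limit = pressure_pages << (PAGE_SHIFT - shift)
    return min(cap_bytes, limit)


def before(pressure_pages):
    return max_share(pressure_pages, 7, 4 * 1024 * 1024)  # 1/128, 4 MB cap


def after(pressure_pages):
    return max_share(pressure_pages, 8, 2 * 1024 * 1024)  # 1/256, 2 MB cap


if __name__ == "__main__":
    for pages in (32768, 131072, 262144):  # hypothetical thresholds
        print(pages, before(pages), after(pages))
```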


-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Marian
Date: Wednesday, March 11, 2009 - 4:03 am

Yes, exactly! We ran into this after a number of workstations were upgraded

Sure - as it lowers the maximum from 4MB to 2MB, the net result is that
RTTs at 100 Mbps immediately went down from 267 msec to:

--- x.x.x.x ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8992ms
rtt min/avg/max/mdev = 134.417/134.770/134.911/0.315 ms

Still, this is too high for a 100 Mbps network, since the RTTs with a 64 KB
static rx buffer look like this (with no performance penalty):

--- x.x.x.x ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9000ms
rtt min/avg/max/mdev = 5.163/5.355/5.476/0.102 ms

I.e. the patch helps significantly, as expected; however, having one static
limit for all NIC speeds as well as for the whole range of RTTs is suboptimal
in principle.


   Thanks & kind regards,

       M.

--

From: David Miller
Date: Wednesday, March 11, 2009 - 6:30 am

From: Andi Kleen <andi@firstfloor.org>

It's necessary, Andi: you can't fill a transcontinental
connection without at least a 4MB receive buffer.

Did you read the commit message of the change that increased
the limit?
--

From: Andi Kleen
Date: Wednesday, March 11, 2009 - 8:01 am

Seems pretty arbitrary to me. It's the value for a given bandwidth*latency
product, but why not half or twice the bandwidth? I don't think
that number is written in stone like you claim.

Anyway, it was just a test patch and it indeed seems to address
the problem at least partly.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Marian
Date: Wednesday, March 11, 2009 - 7:56 am

Besides being arbitrary, it's also incorrect. The defaults in
tcp.c set both tcp_wmem and tcp_rmem to 4 MB, ignoring
the fact that this results in a 4 MB send buffer but only a 3 MB
receive window due to other defaults (tcp_adv_win_scale=2).
 
Indeed, 3MB*(1538/1448)/100Mbps is equal to 267.3 msec
- i.e. exactly the latency we're seeing.
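
The arithmetic above can be reproduced with a short sketch. The window formula is the usual interpretation of a positive tcp_adv_win_scale (a 1/2^scale share of the buffer is reserved for overhead); reading 1538 as the on-wire size of a full frame and 1448 as its TCP payload is an assumption about what the two constants stand for.

```python
def window_from_buffer(buf_bytes, adv_win_scale=2):
    """Advertised window for a given buffer: with scale=2 a quarter of the
    buffer is held back, so 4 MB of buffer yields a 3 MB window."""
    if adv_win_scale <= 0:
        return buf_bytes >> (-adv_win_scale)
    return buf_bytes - (buf_bytes >> adv_win_scale)


def queueing_delay_ms(window_bytes, link_bps=100e6,
                      wire_bytes=1538, payload_bytes=1448):
    """Time to drain one full window through the bottleneck link,
    counting per-segment overhead on the wire."""
    return window_bytes * (wire_bytes / payload_bytes) * 8 / link_bps * 1000


if __name__ == "__main__":
    for buf_mb in (4, 2):
        win = window_from_buffer(buf_mb * 1024 * 1024)
        print(f"{buf_mb} MB buffer -> {win} byte window -> "
              f"{queueing_delay_ms(win):.1f} ms")
```

The 4 MB case gives about 267.3 ms, and the 2 MB case gives about 133.7 ms, close to the roughly 134.8 ms average measured with the test patch.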

   With kind regards,

         M.



--

From: John Heffner
Date: Wednesday, March 11, 2009 - 8:34 am

It is of course just a number, though not exactly arbitrary -- it's
approximately the required value for transcontinental 100 Mbps paths.
Choosing the value is a matter of engineering trade-offs, and seemed
like a reasonable cap at this time.

Any cap so much lower that it would give a small bound for LAN
latencies would bring us back to the bad old days where you couldn't
get anything more than 10 Mbps on the wide area.
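
The trade-off can be put in numbers with the standard window/RTT relation. The RTT values below are illustrative assumptions, not measurements from this thread.

```python
def max_throughput_mbps(window_bytes, rtt_s):
    """A sender can have at most one window in flight per round trip."""
    return window_bytes * 8 / rtt_s / 1e6


def required_window_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: the window needed to keep the path full."""
    return rate_bps * rtt_s / 8


if __name__ == "__main__":
    # 64 kB window over an assumed 50 ms wide-area RTT: roughly 10 Mbps.
    print(f"{max_throughput_mbps(64 * 1024, 0.050):.2f} Mbps")
    # Window needed for 100 Mbps at a few assumed RTTs.
    for rtt_ms in (0.1, 50, 100, 300):
        need = required_window_bytes(100e6, rtt_ms / 1000)
        print(f"rtt={rtt_ms} ms -> {need / 1024:.0f} KiB")
```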

  -John
--

From: Rémi
Date: Wednesday, March 11, 2009 - 2:02 am

This is very likely a stupid question, but anyway...

Is this with all applications, or only some pathological ones (one of which we 
both wrote code for, alright) with abnormally large send buffers?

-- 
Rémi Denis-Courmont
Maemo Software, Nokia Devices R&D

--
