Bonding, GRO and tcp_reordering

Previous thread: [PATCH] fix hang in dmfe driver on sending of big packet (linux-2.6.35) by Alexander V. Lukyanov on Tuesday, November 30, 2010 - 6:46 am. (3 messages)

Next thread: [PATCH 1/2] af_packet: use vmalloc_to_page() instead for the addresss returned by vmalloc() by Changli Gao on Tuesday, November 30, 2010 - 6:56 am. (10 messages)
From: Simon Horman
Date: Tuesday, November 30, 2010 - 6:55 am

Hi,

I just wanted to share what is a rather pleasing,
though to me somewhat surprising result.

I am testing bonding using balance-rr mode with three physical links to try
to get > gigabit speed for a single stream. Why?  Because I'd like to run
various tests at > gigabit speed and I don't have any 10G hardware at my
disposal.

The result I have is that with a 1500 byte MTU, tcp_reordering=3 and both
LSO and GSO disabled on both the sender and receiver I see:

# netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
(172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

  87380  16384   1472    10.01      1646.13   40.01    -1.00    3.982  -1.000

But with GRO enabled on the receiver I see:

# netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216
(172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384   1472    10.01      2613.83   19.32    -1.00    1.211   -1.000

Which is much better than any result I get tweaking tcp_reordering when
GRO is disabled on the receiver.

Tweaking tcp_reordering when GRO is enabled on the receiver seems to have
negligible effect.  Which is interesting, because my brief reading on the
subject indicated that tcp_reordering was the key tuning parameter for
bonding with balance-rr.

The only other parameter that seemed to ...
From: Ben Hutchings
Date: Tuesday, November 30, 2010 - 8:42 am

Did you also enable TSO/GSO on the sender?

What TSO/GSO will do is to change the round-robin scheduling from one
packet per interface to one super-packet per interface.  GRO then
coalesces the physical packets back into a super-packet.  The intervals
between receiving super-packets then tend to exceed the difference in
delay between interfaces, hiding the reordering.
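
A toy model of the effect described above, offered as a hedged sketch rather
than a measurement: units (packets or super-packets) are sent at a fixed
interval, round-robin over links that each add a fixed delay, and we count how
many units arrive after a later one. The function name, the link delays and
the intervals are illustrative assumptions, not values from this thread.

```python
def count_reordered(num_units, gap_us, link_delays_us):
    """Count units that arrive after a unit with a higher sequence number.

    Unit i is sent at time i * gap_us on link i % len(link_delays_us)
    (round-robin) and arrives after that link's fixed delay.
    """
    arrivals = []
    for seq in range(num_units):
        link = seq % len(link_delays_us)
        arrivals.append((seq * gap_us + link_delays_us[link], seq))
    arrivals.sort()  # order of arrival at the receiver

    highest_seen = -1
    reordered = 0
    for _, seq in arrivals:
        if seq < highest_seen:
            reordered += 1  # a later unit overtook this one
        else:
            highest_seen = seq
    return reordered


# Hypothetical per-link delays of 0, 30 and 60 us.
delays = [0, 30, 60]
print("small gap (per-packet):      ", count_reordered(30, 12, delays))
print("large gap (per-super-packet):", count_reordered(30, 500, delays))
```

Once the interval between units exceeds the largest difference in link delay,
the model reports no reordering, which is the hiding effect being described.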

[...]

Increasing MTU also increases the interval between packets on a TCP flow
using maximum segment size so that it is more likely to exceed the
difference in delay.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Eric Dumazet
Date: Tuesday, November 30, 2010 - 9:04 am

GRO is really only operational _if_ we receive several packets for the
same flow in the same NAPI run.

As soon as we exit NAPI mode, GRO packets are flushed.

Big MTU --> bigger delays between packets, so big chance that GRO cannot
trigger at all, since NAPI runs for one packet only.

One possibility with big MTU is to tweak "ethtool -c eth0" params
rx-usecs: 20
rx-frames: 5
rx-usecs-irq: 0
rx-frames-irq: 5
so that "rx-usecs" is bigger than the delay between two MTU full sized
packets.

Gigabit speed means 1 nanosecond per bit, and MTU=9000 means 72 us
delay between packets.

So try :

ethtool -C eth0 rx-usecs 100

to get a chance that several packets are delivered at once by the NIC.

Unfortunately, this also adds some latency, so it helps bulk transfers
and slows down interactive traffic.
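
The arithmetic above can be checked with a short sketch. It assumes, as the
72 us figure does, that only the MTU-sized payload is counted (no Ethernet
header, preamble or inter-frame gap); the function name is hypothetical.

```python
def frame_time_us(mtu_bytes, link_bps=1_000_000_000):
    """Time to serialize one MTU-sized frame on the wire, in microseconds."""
    return mtu_bytes * 8 * 1_000_000 / link_bps


for mtu in (1500, 9000):
    t = frame_time_us(mtu)
    print(f"MTU={mtu}: about {t:.0f} us between back-to-back frames")
```

An rx-usecs value larger than this per-frame time gives the NIC a chance to
hold the interrupt until a second full-sized frame has arrived.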


--

From: Simon Horman
Date: Tuesday, November 30, 2010 - 9:34 pm

Thanks Eric,

I was tweaking those values recently for some latency tuning
but I didn't think of them in relation to last night's tests.

In terms of my measurements, it's just benchmarking at this stage.
So a trade-off between throughput and latency is acceptable, so long
as I remember to measure what it is.

--

From: Eric Dumazet
Date: Tuesday, November 30, 2010 - 9:47 pm

I was thinking again this morning about GRO and bonding, and don't know
if it actually works...

Is GRO on for the individual eth0/eth1/eth2 you use, or on the bonding
device itself?



--

From: Simon Horman
Date: Wednesday, December 1, 2010 - 11:39 pm

All of the above. I can check different combinations if it helps.

--

From: Simon Horman
Date: Friday, December 3, 2010 - 6:38 am

To clarify my statement in a previous email that GSO had no effect: I
re-ran the tests and I still haven't observed any effect of GSO on my
results. However, I did notice that in order for GRO on the server to have
effect I also need TSO enabled on the client.  I thought that I had
previously checked that but I was mistaken.

Enabling TSO on the client while leaving GSO disabled on the server

Thanks, rx-usecs was set to 3 and changing it to 15 on the server
did seem to increase throughput with 1500 byte packets, although
CPU utilisation increased too, disproportionately so on the client.

MTU=1500, client,server:tcp_reordering=3, client:GSO=off,
	client:TSO=on, server:GRO=off, server:rx-usecs=3(default)
# netperf -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1591.34   16.35    5.80     1.683   2.390

MTU=1500, client,server:tcp_reordering=3(default), client:GSO=off,
	client:TSO=on, server:GRO=off, server:rx-usecs=15
# netperf -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1774.38   23.75    7.58     2.193   2.801

I also saw an improvement with GRO enabled on the server and TSO enabled on
the client.  Although in this ...
From: Simon Horman
Date: Tuesday, November 30, 2010 - 9:31 pm

It didn't seem to make any difference either way.

I hadn't considered that, thanks.

--

From: Rick Jones
Date: Tuesday, November 30, 2010 - 10:56 am

Why 1472 bytes per send?  If you wanted a 1-1 between the send size and the MSS, 
I would guess that 1448 would have been in order.  1472 would be the maximum 
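
For reference, a sketch of the payload arithmetic behind those two numbers.
It assumes IPv4 with no IP options, a 1500 byte MTU, and TCP timestamps
enabled (12 bytes of TCP options per segment); the constant and function
names are illustrative.

```python
IPV4_HEADER = 20      # bytes, no IP options
TCP_HEADER = 20       # bytes, base TCP header
TCP_TIMESTAMPS = 12   # bytes, timestamp option plus padding
UDP_HEADER = 8        # bytes


def tcp_payload(mtu, timestamps=True):
    """Largest TCP payload that fits in one MTU-sized IPv4 packet."""
    options = TCP_TIMESTAMPS if timestamps else 0
    return mtu - IPV4_HEADER - TCP_HEADER - options


def udp_payload(mtu):
    """Largest UDP payload that fits in one MTU-sized IPv4 packet."""
    return mtu - IPV4_HEADER - UDP_HEADER


print("TCP with timestamps:", tcp_payload(1500))
print("UDP:                ", udp_payload(1500))
```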

If you are changing things on the receiver, you should probably enable remote 

You are in a maze of twisty heuristics and algorithms, all interacting :)  If 
there are only three links in the bond, I suspect the chances for spurious fast 
retransmission are somewhat smaller than if you had say four, based on just 
hand-waving on three duplicate ACKs requires receipt of perhaps four out of 


Short of packet traces, taking snapshots of netstat statistics before and after 
each netperf run might be goodness - you can look at things like ratio of ACKs 
to data segments/bytes and such.  LRO/GRO can have a non-trivial effect on the 
number of ACKs, and ACKs are what matter for fast retransmit.

netstat -s > before
netperf ...
netstat -s > after
beforeafter before after > delta

where beforeafter comes from ftp://ftp.cup.hp.com/dist/networking/tools/
(for now; the site will have to go away before long, as the campus on which
it is located has been sold) and will subtract before from after.
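
If beforeafter is not to hand, a rough stand-in can be sketched in a few
lines. This is not the beforeafter tool itself, only an approximation of the
idea: it assumes each counter line has the same text in both snapshots apart
from its numbers, treats counters missing from the first snapshot as zero,
and does not distinguish identical lines that appear in different sections.
The function name is hypothetical.

```python
import re

# Standalone integers only, so the "3" in a name like "InType3" is left alone.
NUM = re.compile(r"(?<![\w.])\d+(?![\w.])")


def snapshot_delta(before_text, after_text):
    """Return after_text with each counter replaced by (after - before)."""
    before = {}
    for line in before_text.splitlines():
        # Key each line by its text with the numbers blanked out.
        before[NUM.sub("#", line)] = [int(n) for n in NUM.findall(line)]

    out = []
    for line in after_text.splitlines():
        old = iter(before.get(NUM.sub("#", line), []))
        out.append(NUM.sub(lambda m: str(int(m.group()) - next(old, 0)), line))
    return "\n".join(out)


if __name__ == "__main__":
    import sys
    # Usage: python delta.py before after > delta
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as b, open(sys.argv[2]) as a:
            print(snapshot_delta(b.read(), a.read()))
```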

happy benchmarking,

rick jones
--

From: Eric Dumazet
Date: Tuesday, November 30, 2010 - 11:14 am

Yes indeed. With a fast enough medium (or small MTUs), we can run into a
backlog processing problem {filling huge receive queues}, as seen on
loopback lately...

netstat -s can show some receive queue overrun in this case.

    TCPBacklogDrop: xxx



--

From: Simon Horman
Date: Tuesday, November 30, 2010 - 9:30 pm

Only to be consistent with UDP testing that I was doing at the same time.


Unfortunately NIC/slot availability only stretches to three links :-(


Thanks, I'll take a look into that.

--

From: Rick Jones
Date: Wednesday, December 1, 2010 - 12:42 pm

Only if you want to increase the chances of reordering that triggers spurious 
fast retransmits.

rick jones
--
