Hi,

I just wanted to share what is a rather pleasing, though to me somewhat surprising result.

I am testing bonding using balance-rr mode with three physical links to try to get > gigabit speed for a single stream. Why? Because I'd like to run various tests at > gigabit speed and I don't have any 10G hardware at my disposal.

The result I have is that with a 1500 byte MTU, tcp_reordering=3 and both LSO and GSO disabled on both the sender and receiver I see:

# netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384   1472    10.01      1646.13   40.01    -1.00    3.982   -1.000

But with GRO enabled on the receiver I see:

# netperf -c -4 -t TCP_STREAM -H 172.17.60.216 -- -m 1472
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384   1472    10.01      2613.83   19.32    -1.00    1.211   -1.000

Which is much better than any result I get tweaking tcp_reordering when GRO is disabled on the receiver.

Tweaking tcp_reordering when GRO is enabled on the receiver seems to have negligible effect. Which is interesting, because my brief reading on the subject indicated that tcp_reordering was the key tuning parameter for bonding with balance-rr.

The only other parameter that seemed to ...
Did you also enable TSO/GSO on the sender?

What TSO/GSO will do is to change the round-robin scheduling from one packet per interface to one super-packet per interface. GRO then coalesces the physical packets back into a super-packet. The intervals between receiving super-packets then tend to exceed the difference in delay between interfaces, hiding the reordering.

[...]

Increasing MTU also increases the interval between packets on a TCP flow using maximum segment size so that it is more likely to exceed the difference in delay.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
--
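Editor's sketch of the argument above, as a toy model rather than anything measured in this thread: packets are sent at a fixed interval, assigned round-robin to links that each add a fixed delay, and we count how many arrive behind a later sequence number. The function name, the link delays and the packet intervals are all hypothetical values chosen for illustration. It only shows the timing claim (no reordering once the interval between packets exceeds the largest difference in link delay); it does not model TSO, GRO or TCP itself.

```python
def count_reordered(n_packets, gap_us, link_delays_us):
    """Count packets that arrive after a packet with a higher sequence number.

    Packet i is sent at i * gap_us on link (i % number_of_links), and that
    link adds a fixed, hypothetical one-way delay.
    """
    n_links = len(link_delays_us)
    arrivals = sorted(
        (i * gap_us + link_delays_us[i % n_links], i) for i in range(n_packets)
    )
    highest_seen = -1
    reordered = 0
    for _, seq in arrivals:
        if seq < highest_seen:
            reordered += 1
        else:
            highest_seen = seq
    return reordered


# Hypothetical per-link delays in microseconds; largest difference is 30 us.
DELAYS_US = [10, 40, 25]

if __name__ == "__main__":
    for gap in (12, 72):
        n = count_reordered(30, gap, DELAYS_US)
        print(f"gap {gap} us between packets: {n} of 30 arrived out of order")
```

With the short interval the links overtake each other; with an interval longer than the 30 us spread, every packet arrives in order.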
GRO really is operational _if_ we receive several packets for the same flow in the same NAPI run. As soon as we exit NAPI mode, GRO packets are flushed.

Big MTU --> bigger delays between packets, so there is a big chance that GRO cannot trigger at all, since NAPI runs for one packet only.

One possibility with big MTU is to tweak the "ethtool -c eth0" params

rx-usecs: 20
rx-frames: 5
rx-usecs-irq: 0
rx-frames-irq: 5

so that "rx-usecs" is bigger than the delay between two full-sized (MTU) packets.

Gigabit speed means 1 nanosecond per bit, and MTU=9000 means 72 us delay between packets.

So try:

ethtool -C eth0 rx-usecs 100

to get a chance that several packets are delivered at once by the NIC.

Unfortunately, this also adds some latency, so it helps bulk transfers and slows down interactive traffic.
--
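Editor's sketch of the arithmetic above: at 1 Gbit/s one bit takes 1 ns, so a 9000-byte packet occupies the wire for 72 us. The helper names are hypothetical, and the calculation assumes the packet size equals the MTU, ignoring Ethernet header, preamble and inter-frame gap.

```python
GIGABIT_BPS = 1_000_000_000  # 1 Gbit/s, i.e. 1 ns per bit


def serialization_delay_us(packet_bytes, link_bps=GIGABIT_BPS):
    """Microseconds needed to put one packet of packet_bytes on the wire."""
    return packet_bytes * 8 * 1_000_000 / link_bps


def coalescing_spans_two_packets(rx_usecs, mtu_bytes, link_bps=GIGABIT_BPS):
    """True if the rx-usecs window is longer than one full-sized packet time,
    so a second back-to-back packet can arrive before the interrupt fires."""
    return rx_usecs > serialization_delay_us(mtu_bytes, link_bps)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {serialization_delay_us(mtu):.0f} us per packet")
    for rx_usecs in (20, 100):
        ok = coalescing_spans_two_packets(rx_usecs, 9000)
        print(f"rx-usecs {rx_usecs} with MTU 9000: spans two packets = {ok}")
```

Under these assumptions rx-usecs 20 is shorter than one 9000-byte packet time while rx-usecs 100 is longer, which matches the suggestion to raise it.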
Thanks Eric, I was tweaking those values recently for some latency tuning but I didn't think of them in relation to last night's tests.

In terms of my measurements, it's just benchmarking at this stage. So a trade-off between throughput and latency is acceptable, so long as I remember to measure what it is.
--
I was thinking again this morning about GRO and bonding, and I don't know if it actually works...

Is GRO enabled on the individual eth0/eth1/eth2 devices you use, or on the bonding device itself?
--
All of the above. I can check different combinations if it helps. --
To clarify my statement in a previous email that GSO had no effect: I re-ran the tests and I still haven't observed any effect of GSO on my results. However, I did notice that in order for GRO on the server to have effect I also need TSO enabled on the client. I thought that I had previously checked that but I was mistaken.

Enabling TSO on the client while leaving GSO disabled on the server

Thanks, rx-usecs was set to 3 and changing it to 15 on the server did seem to increase throughput with 1500 byte packets. Although CPU utilisation increased too, disproportionately so on the client.

MTU=1500, client,server:tcp_reordering=3, client:GSO=off, client:TSO=on, server:GRO=off, server:rx-usecs=3(default)

# netperf -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1591.34   16.35    5.80     1.683   2.390

MTU=1500, client,server:tcp_reordering=3(default), client:GSO=off, client:TSO=on, server:GRO=off, server:rx-usecs=15

# netperf -c -4 -t TCP_STREAM -H 172.17.60.216
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.17.60.216 (172.17.60.216) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      1774.38   23.75    7.58     2.193   2.801

I also saw an improvement with GRO enabled on the server and TSO enabled on the client. Although in this ...
It didn't seem to make any difference either way.

I hadn't considered that, thanks.
--
Why 1472 bytes per send? If you wanted a 1-1 between the send size and the MSS, I would guess that 1448 would have been in order. 1472 would be the maximum

If you are changing things on the receiver, you should probably enable remote

You are in a maze of twisty heuristics and algorithms, all interacting :) If there are only three links in the bond, I suspect the chances for spurious fast retransmission are somewhat smaller than if you had say four, based on just hand-waving on three duplicate ACKs requires receipt of perhaps four out of

Short of packet traces, taking snapshots of netstat statistics before and after each netperf run might be goodness - you can look at things like ratio of ACKs to data segments/bytes and such. LRO/GRO can have a non-trivial effect on the number of ACKs, and ACKs are what matter for fast retransmit.

netstat -s > before
netperf ...
netstat -s > after
beforeafter before after > delta

where beforeafter comes from (for now; the site will have to go away before long, as the campus on which it is located has been sold) ftp://ftp.cup.hp.com/dist/networking/tools/ and will subtract before from after.

happy benchmarking,

rick jones
--
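Editor's sketch, for readers who cannot fetch beforeafter: a minimal stand-in that subtracts one "netstat -s" snapshot from another. This is not the beforeafter tool and makes no claim to match its output. It assumes the Linux net-tools layout, where counters appear indented under a section header either as "<number> <description>" or as "<Name>: <number>"; lines with the number in the middle of the sentence are ignored. The function names are hypothetical and the sample snapshots are invented values, not measurements from this thread.

```python
import re

LEADING = re.compile(r"^\s+(\d+)\s+(.+)$")     # "    5000 segments received"
TRAILING = re.compile(r"^\s+(.+?):\s+(\d+)$")  # "    TCPBacklogDrop: 3"


def parse_counters(text):
    """Map (section, counter name) -> value for one netstat -s snapshot."""
    counters = {}
    section = ""
    for line in text.splitlines():
        if line and not line[0].isspace():
            section = line.rstrip(":").strip()  # unindented line: new section
            continue
        match = LEADING.match(line)
        if match:
            counters[(section, match.group(2).strip())] = int(match.group(1))
            continue
        match = TRAILING.match(line)
        if match:
            counters[(section, match.group(1).strip())] = int(match.group(2))
    return counters


def delta(before, after):
    """Counters that changed, as after minus before (missing counts as 0)."""
    return {
        key: value - before.get(key, 0)
        for key, value in after.items()
        if value != before.get(key, 0)
    }


# Invented sample snapshots, for illustration only.
BEFORE = "Tcp:\n    1000 segments received\n    900 segments sent out\nTcpExt:\n    TCPBacklogDrop: 0\n"
AFTER = "Tcp:\n    5000 segments received\n    2900 segments sent out\nTcpExt:\n    TCPBacklogDrop: 3\n"

if __name__ == "__main__":
    changes = delta(parse_counters(BEFORE), parse_counters(AFTER))
    for (section, name), change in sorted(changes.items()):
        print(f"{section}: {name}: {change}")
```

In practice the two strings would be read from the "before" and "after" files written by the commands above.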
Yes indeed. With a fast enough medium (or small MTUs), we can run into a
backlog processing problem {filling huge receive queues}, as seen on
loopback lately...
netstat -s can show some receive queue overruns in this case:
TCPBacklogDrop: xxx
--
Only to be consistent with UDP testing that I was doing at the same time.

Unfortunately NIC/slot availability only stretches to three links :-(

Thanks, I'll take a look into that.
--
Only if you want to increase the chances of reordering that triggers spurious fast retransmits.

rick jones
--
