Scott Feldman announced that with the release of development kernel 2.5.33, the e1000 driver now supports TCP Segmentation Offloading (TSO), raising two-way transfer rates by about 9% in his benchmark. (Send-only throughput did not increase because the wire's physical limit had already been reached, but CPU utilization for that test dropped from 40% to 19%.)
TCP Segmentation Offload (also called TCP Large Send) means that buffers much larger than the maximum transmission unit (MTU) of a given medium are passed across the bus to the network interface card (NIC). The work of dividing these large buffers into smaller packets is thus offloaded to the NIC. More specifically, the e1000 driver passes 64K packets to the network card, which then divides them into packets that fit the 1500-byte Ethernet MTU.
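To make the division concrete, here is a minimal Python sketch of the split the NIC performs. It is an illustration only, not the e1000 hardware logic: the function name `tso_segments` is hypothetical, and it assumes 40 bytes of IP and TCP headers per packet with no options, which leaves 1460 bytes of payload in each 1500-byte packet.

```python
def tso_segments(buffer_len, mtu=1500, header_len=40):
    """Return (offset, length) payload slices, one per packet on the wire.

    header_len is an assumption: 20 bytes of IPv4 plus 20 bytes of TCP,
    with no options.
    """
    mss = mtu - header_len  # payload bytes that fit in one MTU-sized packet
    if mss <= 0:
        raise ValueError("MTU must be larger than the header length")
    return [(off, min(mss, buffer_len - off))
            for off in range(0, buffer_len, mss)]


segments = tso_segments(64 * 1024)
print(len(segments), "packets; last slice:", segments[-1])
```

Under these assumptions a 64K buffer becomes 45 packets, the last one carrying a short 1296-byte payload.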
Alexey Kuznetsov added TSO support to the stack, noting that as of yet, "the implementation in tcp is still at [the] level of a toy". Be that as it may, it's a good start, and before long other TSO-capable devices will likely be supported as well.
From: Scott Feldman
To: linux-kernel mailing list, linux-net
Subject: TCP Segmentation Offloading (TSO)
Date: Mon, 2 Sep 2002 10:45:08 -0700

TCP Segmentation Offloading (TSO) is enabled[1] in 2.5.33, along with an enabled e1000 driver. Other capable devices can be enabled ala e1000; the driver interface (NETIF_F_TSO) is very simple.

So, fire up you favorite networking performance tool and compare the performance gains between 2.5.32 and 2.5.33 using e1000. I ran a quick test on a dual P4 workstation system using the commercial tool Chariot:

Tx/Rx TCP file send long (bi-directional Rx/Tx)
w/o TSO: 1500Mbps, 82% CPU
w/ TSO: 1633Mbps, 75% CPU

Tx TCP file send long (Tx only)
w/o TSO: 940Mbps, 40% CPU
w/ TSO: 940Mbps, 19% CPU

A good bump in throughput for the bi-directional test. The Tx-only test was
already at wire speed, so the gains are pure CPU savings.

I'd like to see SPECWeb results w/ and w/o TSO, and any other relevant
testing. UDP framentation is not offloaded, so keep testing to TCP.

-scott

[1] Kudos to Alexey Kuznetsov for enabling the stack with TSO support, to
Chris Leech for providing the e1000 bits and a prototype stack, and to David
Miller for consultation.
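The benchmark figures in Scott's message can be turned into relative changes with a few lines of Python. This is only a convenience sketch: the numbers are copied from the message, the helper `percent_change` is a name made up for this illustration, and the CPU figures are treated as relative changes in utilization rather than percentage points.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (without TSO, with TSO) pairs copied from the benchmark in the message
results = {
    "bi-directional": {"Mbps": (1500, 1633), "CPU %": (82, 75)},
    "Tx only": {"Mbps": (940, 940), "CPU %": (40, 19)},
}

for test, metrics in results.items():
    for metric, (before, after) in metrics.items():
        change = percent_change(before, after)
        print(f"{test:15s} {metric:6s} {change:+6.1f}%")
```

The bi-directional test gains roughly 9% in throughput, while the Tx-only test keeps the same throughput at about half the CPU utilization.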
From: David S. Miller
Subject: Re: TCP Segmentation Offloading (TSO)
Date: Mon, 02 Sep 2002 16:13:05 -0700 (PDT)

I would like to praise Intel for working so closely with us on
this. They gave us immediately, in one email, all the information we
needed to implement and test e1000 support for TSO under Linux.

With some other companies, doing this is like pulling teeth.
From: Alexey Kuznetsov
Subject: Re: TCP Segmentation Offloading (TSO)
Date: Mon, 2 Sep 2002 22:58:15 +0400 (MSD)

Hello!

> [1] Kudos to

Hmm... wait awhile with celebrating, the implementation in tcp is still
at level of a toy. Well, and it happens to crash, the patch is enclosed.

Alexey
[patch [0]]
From: Jordi Ros
Subject: RE: TCP Segmentation Offloading (TSO)
Date: Mon, 2 Sep 2002 21:58:32 -0700

One question regarding the throughput numbers,
what was the size of the packets built at the tcp layer (mss)?
i assume the mtu is ethernet 1500 Bytes, right? and that mss should be
something much bigger than mtu, which gives the performance improvement
shown in the numbers.

thanks,
jordi
From: David S. Miller
Subject: Re: TCP Segmentation Offloading (TSO)
Date: Mon, 02 Sep 2002 23:52:44 -0700 (PDT)

The performance improvement comes from the fact that the card
is given huge 64K packets, then the card (using the given ip/tcp
headers as a template) spits out 1500 byte mtu sized packets.

Less data DMA'd to the device per normal-mtu packet and less
per-packet data structure work by the cpu is where the improvement
comes from.
From: Jordi Ros
Subject: RE: TCP Segmentation Offloading (TSO)
Date: Tue, 3 Sep 2002 00:26:13 -0700

What i am wondering is how come we only get a few percentage improvement in
throughput. Theoretically, since 64KB/1.5KB ~= 40, we should get a
throughput improvement of 40 times. That would be the case of udp
transmiting in one direction, in the case of tcp transmiting in one
direction (which is the one you have implemented), since in average we have
(at most) 1 ack every 2 data packets, we should theoretically obtain a
throughput improvement of (40+20)/(1+20) = 3 (this comes from: without tso
we send 40 packets and receive 20 acks, this is, the cpu processes 60
packets; whereas with tso we send 1 packet and receive 20 acks, this is, the
cpu processes 21 packets).
However, we don't see in the numbers obtained neither an increase of
throughput of 300% nor a decrease in cpu utilization of such magnitude. Is
there any other bottleneck in the system that prevents us to see the 300%
improvement? (i am assuming the card can do tso at wire speed)

thank you,
jordi
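The arithmetic behind Jordi's estimate can be reproduced as follows. The sketch uses his own rounded figures (40 data packets per 64KB buffer, one ACK for every two data packets); the function name `cpu_packet_counts` is hypothetical. Note that it counts packets handled by the CPU, which is not the same thing as throughput on the wire.

```python
def cpu_packet_counts(data_packets=40, data_per_ack=2):
    """Packets the CPU handles per 64KB buffer, using the figures in the message.

    Returns (without_tso, with_tso, ratio).
    """
    acks = data_packets // data_per_ack
    without_tso = data_packets + acks  # every data packet built by the CPU
    with_tso = 1 + acks                # one large packet handed to the NIC
    return without_tso, with_tso, without_tso / with_tso


without_tso, with_tso, ratio = cpu_packet_counts()
print(f"{without_tso} vs {with_tso} packets, ratio {ratio:.2f}")
```

This gives 60 versus 21 packets, a ratio of about 2.86, which the message rounds to 3.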
From: David S. Miller
Subject: Re: TCP Segmentation Offloading (TSO)
Date: Tue, 03 Sep 2002 00:39:13 -0700 (PDT)

Because he's maxing out the physical medium already.
All the headers for each 1500 byte packet still have to hit the
physical wire, that isn't what is being eliminated. It's just
what is going over the PCI bus to the card that is being made
smaller.
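A rough calculation shows why roughly 940Mbps is already the ceiling for TCP payload on gigabit Ethernet, with or without TSO. The sketch below is a back-of-the-envelope estimate, not something taken from the thread: it assumes standard Ethernet framing overhead (preamble, start-of-frame delimiter, header, frame check sequence and inter-frame gap, 38 bytes in total) and, optionally, 12 bytes of TCP timestamp options. The function name `tcp_goodput_mbps` is hypothetical.

```python
# Assumed per-frame Ethernet overhead in bytes:
# preamble (7) + start-of-frame delimiter (1) + header (14)
# + frame check sequence (4) + inter-frame gap (12)
ETH_OVERHEAD = 7 + 1 + 14 + 4 + 12


def tcp_goodput_mbps(line_rate_mbps=1000.0, mtu=1500,
                     ip_tcp_header=40, tcp_options=0):
    """Upper bound on TCP payload throughput at a given line rate."""
    payload = mtu - ip_tcp_header - tcp_options  # payload bytes per packet
    wire_bytes = mtu + ETH_OVERHEAD              # bytes each packet occupies
    return line_rate_mbps * payload / wire_bytes


print(f"no TCP options:  {tcp_goodput_mbps():.1f} Mbps")
print(f"with timestamps: {tcp_goodput_mbps(tcp_options=12):.1f} Mbps")
```

Under these assumptions the ceiling is about 949Mbps without TCP options and about 941Mbps with timestamps, since every 1500-byte packet still carries its own headers on the wire.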
From: Scott Feldman
Subject: RE: TCP Segmentation Offloading (TSO)
Date: Tue, 3 Sep 2002 10:50:27 -0700

Jordi Ros wrote:
> What i am wondering is how come we only get a few percentage
> improvement in throughput. Theoretically, since 64KB/1.5KB ~=
> 40, we should get a throughput improvement of 40 times.

You're confusing number of packets with throughput. Cut the wire, and you
can't tell the difference with or without TSO. It's the same amount of data
on the wire. As David pointed out, the savings comes in how much data is
DMA'ed across the bus and how much the CPU is unburdened by the segmentation
task. A 64K TSO would be one pseudo header and the rest payload. Without
TSO you would add ~40 more headers. That's the savings across the bus.

> Is there any other bottleneck in the system that prevents
> us to see the 300% improvement? (i am assuming the card can
> do tso at wire speed)

My numbers are against PCI 64/66Mhz, so that's limiting. You're not going
to get much more that 940Mbps at 1GbE unidirectional. That's why all of the
savings at unidirectional Tx are in CPU reduction.

-scott
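Scott's point about headers crossing the bus can be put into rough numbers. The sketch below assumes a 54-byte Ethernet, IPv4 and TCP header with no options and a 64K payload; neither figure is stated in the thread, the function name `bus_bytes` is hypothetical, and descriptor and DMA transaction overheads are ignored.

```python
import math

# Assumed header size: Ethernet (14) + IPv4 (20) + TCP (20), no options
HEADER_LEN = 14 + 20 + 20


def bus_bytes(payload_len, tso, mtu=1500, ip_tcp_header=40):
    """Bytes DMA'd to the NIC for one buffer, counting headers and payload."""
    mss = mtu - ip_tcp_header
    segments = math.ceil(payload_len / mss)
    # With TSO a single template header crosses the bus;
    # without it, every segment carries its own header.
    headers = 1 if tso else segments
    return payload_len + headers * HEADER_LEN


without_tso = bus_bytes(64 * 1024, tso=False)
with_tso = bus_bytes(64 * 1024, tso=True)
print(without_tso, with_tso, without_tso - with_tso)
```

Under these assumptions the header bytes saved are only a few percent of the transfer, which suggests that the reduced per-packet CPU work David and Scott mention accounts for much of the gain.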
Related Links:
- Google archive of above thread [1]