Scott Feldman announced that with the release of development kernel 2.5.33, the e1000 driver now supports TCP Segmentation Offloading (TSO), offering a boost of roughly 9% in two-way transfer rates. (In the provided benchmark, send-only throughput did not increase, as the wire's physical limit had already been reached, but CPU utilization dropped from 40% to 19%.)
TCP Segmentation Offload (or TCP Large Send) is when buffers much larger than the supported maximum transmission unit (MTU) of a given medium are passed across the bus to the network interface card (NIC). The work of dividing the large buffers into smaller packets is thus offloaded to the NIC. More specifically, the e1000 driver passes 64k packets to the network card, which then divides them into MTU-sized 1500-byte packets.
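To make the division of labor concrete, here is a minimal Python sketch of what the card does with one large buffer. It is an illustration only, not the e1000 implementation: `TcpHeader` and `tso_segment` are hypothetical names, and the 20-byte IPv4 and TCP header sizes assume no IP or TCP options.

```python
from dataclasses import dataclass, replace

MTU = 1500                     # Ethernet MTU from the article
IP_HDR = 20                    # assumption: IPv4 header without options
TCP_HDR = 20                   # assumption: TCP header without options
MSS = MTU - IP_HDR - TCP_HDR   # 1460 payload bytes per wire packet


@dataclass(frozen=True)
class TcpHeader:
    """Hypothetical stand-in for the header template the host supplies."""
    seq: int


def tso_segment(template: TcpHeader, payload: bytes, mss: int = MSS):
    """Split one large buffer into (header, payload) pairs of at most mss bytes.

    Each header is a copy of the template with the sequence number
    advanced by the offset of its payload, modulo 2**32.
    """
    packets = []
    for offset in range(0, len(payload), mss):
        header = replace(template, seq=(template.seq + offset) % 2**32)
        packets.append((header, payload[offset:offset + mss]))
    return packets


big_buffer = bytes(64 * 1024)  # one 64 KB buffer handed to the card
packets = tso_segment(TcpHeader(seq=1000), big_buffer)
print(len(packets), len(packets[0][1]), len(packets[-1][1]))  # 45 1460 1296
```

Without TSO the host CPU runs a loop like this for every send; with TSO the host hands over the large buffer once and the card does the splitting.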
Alexey Kuznetsov added TSO support to the stack, noting that as of yet, "the implementation in tcp is still at [the] level of a toy". Be that as it may, it's a good start, and before long other TSO-capable devices will likely also be supported.
From: Scott Feldman
To: linux-kernel mailing list, linux-net
Subject: TCP Segmentation Offloading (TSO)
Date: Mon, 2 Sep 2002 10:45:08 -0700

TCP Segmentation Offloading (TSO) is enabled[1] in 2.5.33, along with an enabled e1000 driver. Other capable devices can be enabled ala e1000; the driver interface (NETIF_F_TSO) is very simple.

So, fire up your favorite networking performance tool and compare the performance gains between 2.5.32 and 2.5.33 using e1000. I ran a quick test on a dual P4 workstation system using the commercial tool Chariot:

Tx/Rx TCP file send long (bi-directional Rx/Tx)
  w/o TSO: 1500Mbps, 82% CPU
  w/ TSO: 1633Mbps, 75% CPU

Tx TCP file send long (Tx only)
  w/o TSO: 940Mbps, 40% CPU
  w/ TSO: 940Mbps, 19% CPU

A good bump in throughput for the bi-directional test. The Tx-only test was
already at wire speed, so the gains are pure CPU savings.

I'd like to see SPECWeb results w/ and w/o TSO, and any other relevant
testing. UDP fragmentation is not offloaded, so keep testing to TCP.

-scott
[1] Kudos to Alexey Kuznetsov for enabling the stack with TSO support, to
Chris Leech for providing the e1000 bits and a prototype stack, and to David
Miller for consultation.
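The benchmark figures in the message above are easier to compare as relative changes. The following Python sketch only does arithmetic on the four reported data points; the names `RESULTS` and `summarize` are made up for illustration, and "CPU % per Gbps" is a derived figure, not something the benchmark tool reported.

```python
# (throughput in Mbps, CPU utilization in %) as reported in the message above
RESULTS = {
    "bi-directional": {"without_tso": (1500, 82), "with_tso": (1633, 75)},
    "tx-only":        {"without_tso": (940, 40),  "with_tso": (940, 19)},
}


def summarize(without_tso, with_tso):
    """Return throughput gain, CPU drop (both in %), and CPU % per Gbps before/after."""
    (mbps0, cpu0), (mbps1, cpu1) = without_tso, with_tso
    return {
        "throughput_gain_pct": (mbps1 - mbps0) / mbps0 * 100,
        "cpu_drop_pct": (cpu0 - cpu1) / cpu0 * 100,
        "cpu_per_gbps_before": cpu0 / mbps0 * 1000,
        "cpu_per_gbps_after": cpu1 / mbps1 * 1000,
    }


summary = {name: summarize(**runs) for name, runs in RESULTS.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

On these numbers the bi-directional test gains about 8.9% in throughput, while the Tx-only test gains nothing in throughput but cuts CPU utilization by 52.5%.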
From: David S. Miller
Subject: Re: TCP Segmentation Offloading (TSO)
Date: Mon, 02 Sep 2002 16:13:05 -0700 (PDT)

I would like to praise Intel for working so closely with us on
this. They gave us immediately, in one email, all the information we
needed to implement and test e1000 support for TSO under Linux.

With some other companies, doing this is like pulling teeth.
From: Alexey Kuznetsov
Subject: Re: TCP Segmentation Offloading (TSO)
Date: Mon, 2 Sep 2002 22:58:15 +0400 (MSD)

Hello!

> [1] Kudos to

Hmm... wait awhile with celebrating, the implementation in tcp is still
at level of a toy. Well, and it happens to crash, the patch is enclosed.

Alexey
[patch]
From: Jordi Ros
Subject: RE: TCP Segmentation Offloading (TSO)
Date: Mon, 2 Sep 2002 21:58:32 -0700

One question regarding the throughput numbers,
what was the size of the packets built at the tcp layer (mss)?
i assume the mtu is ethernet 1500 Bytes, right? and that mss should be
something much bigger than mtu, which gives the performance improvement
shown in the numbers.

thanks,
jordi
From: David S. Miller
Subject: Re: TCP Segmentation Offloading (TSO)
Date: Mon, 02 Sep 2002 23:52:44 -0700 (PDT)

The performance improvement comes from the fact that the card
is given huge 64K packets, then the card (using the given ip/tcp
headers as a template) spits out 1500 byte mtu sized packets.

Less data DMA'd to the device per normal-mtu packet and less
per-packet data structure work by the cpu is where the improvement
comes from.
From: Jordi Ros
Subject: RE: TCP Segmentation Offloading (TSO)
Date: Tue, 3 Sep 2002 00:26:13 -0700

What i am wondering is how come we only get a few percentage improvement in
throughput. Theoretically, since 64KB/1.5KB ~= 40, we should get a
throughput improvement of 40 times. That would be the case of udp
transmiting in one direction, in the case of tcp transmiting in one
direction (which is the one you have implemented), since in average we have
(at most) 1 ack every 2 data packets, we should theoretically obtain a
throughput improvement of (40+20)/(1+20) = 3 (this comes from: without tso
we send 40 packets and receive 20 acks, this is, the cpu processes 60
packets; whereas with tso we send 1 packet and receive 20 acks, this is, the
cpu processes 21 packets).
However, we don't see in the numbers obtained neither an increase of
throughput of 300% nor a decrease in cpu utilization of such magnitude. Is
there any other bottleneck in the system that prevents us to see the 300%
improvement? (i am assuming the card can do tso at wire speed)

thank you,
jordi
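The estimate in the message above can be reproduced in a few lines of Python. This is only a sketch of that packet-counting model, with a hypothetical function name and parameters; it counts packets the host CPU touches, which is not the same thing as throughput on the wire.

```python
def host_packets(data_packets: int, tso: bool, data_per_ack: int = 2) -> int:
    """Packets the host CPU handles for one 64 KB send, per the message's model."""
    sent = 1 if tso else data_packets      # with TSO the host builds one big packet
    acks = data_packets // data_per_ack    # ACKs still arrive, one per two wire packets
    return sent + acks


without_tso = host_packets(40, tso=False)   # 40 + 20 = 60
with_tso = host_packets(40, tso=True)       # 1 + 20 = 21
print(without_tso, with_tso, round(without_tso / with_tso, 2))  # 60 21 2.86
```

The ratio of about 2.9 is the "3" in the message. It describes per-packet work on the host, so it bounds a possible CPU saving rather than predicting a throughput gain.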
From: David S. Miller
Subject: Re: TCP Segmentation Offloading (TSO)
Date: Tue, 03 Sep 2002 00:39:13 -0700 (PDT)

Because he's maxing out the physical medium already.
All the headers for each 1500 byte packet still have to hit the
physical wire, that isn't what is being eliminated. It's just
what is going over the PCI bus to the card that is being made
smaller.
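A back-of-the-envelope calculation shows why roughly 940Mbps is already the ceiling for TCP payload on gigabit Ethernet with a 1500-byte MTU. The Python sketch below uses standard Ethernet framing overheads; whether TCP timestamps were enabled in the benchmark is an assumption, and `goodput_mbps` is a hypothetical helper name.

```python
LINK_MBPS = 1000                 # gigabit Ethernet line rate
MTU = 1500                       # bytes of IP packet per frame
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, Ethernet header, FCS, inter-frame gap
IP_HDR = 20                      # assumption: IPv4 header without options
TCP_HDR = 20                     # TCP header without options
TCP_TIMESTAMPS = 12              # assumption: timestamp option enabled


def goodput_mbps(tcp_option_bytes: int = 0) -> float:
    """Maximum TCP payload rate when every frame on the wire is full-sized."""
    payload = MTU - IP_HDR - TCP_HDR - tcp_option_bytes
    on_wire = MTU + ETH_OVERHEAD
    return LINK_MBPS * payload / on_wire


print(round(goodput_mbps(), 1))                # 949.3
print(round(goodput_mbps(TCP_TIMESTAMPS), 1))  # 941.5
```

Every 1500-byte packet still carries its own headers and framing on the wire, with or without TSO, so this ceiling does not move.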
From: Scott Feldman
Subject: RE: TCP Segmentation Offloading (TSO)
Date: Tue, 3 Sep 2002 10:50:27 -0700

Jordi Ros wrote:
> What i am wondering is how come we only get a few percentage
> improvement in throughput. Theoretically, since 64KB/1.5KB ~=
> 40, we should get a throughput improvement of 40 times.

You're confusing number of packets with throughput. Cut the wire, and you
can't tell the difference with or without TSO. It's the same amount of data
on the wire. As David pointed out, the savings comes in how much data is
DMA'ed across the bus and how much the CPU is unburdened by the segmentation
task. A 64K TSO would be one pseudo header and the rest payload. Without
TSO you would add ~40 more headers. That's the savings across the bus.

> Is there any other bottleneck in the system that prevents
> us to see the 300% improvement? (i am assuming the card can
> do tso at wire speed)

My numbers are against PCI 64/66Mhz, so that's limiting. You're not going
to get much more than 940Mbps at 1GbE unidirectional. That's why all of the
savings at unidirectional Tx are in CPU reduction.

-scott
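The bus-side saving described in the thread can be put into rough numbers. The Python sketch below assumes 54 bytes of Ethernet, IPv4 and TCP headers per packet (no VLAN tag, no options) and ignores DMA descriptors and other per-packet bookkeeping; `bus_transfer` is a hypothetical name.

```python
import math

ETH_HDR, IP_HDR, TCP_HDR = 14, 20, 20      # assumption: no VLAN tag, no IP/TCP options
HEADERS = ETH_HDR + IP_HDR + TCP_HDR       # 54 bytes of headers per packet
MSS = 1500 - IP_HDR - TCP_HDR              # 1460 payload bytes per wire packet


def bus_transfer(payload_len: int, tso: bool) -> tuple[int, int]:
    """Return (packets handed to the NIC, bytes crossing the bus) for one send."""
    packets = 1 if tso else math.ceil(payload_len / MSS)
    return packets, payload_len + packets * HEADERS


payload = 64 * 1024
print(bus_transfer(payload, tso=False))  # (45, 67966)
print(bus_transfer(payload, tso=True))   # (1, 65590)
```

Under these assumptions the bytes saved on the bus are modest (2376 out of 67966, about 3.5%), while the number of packets the host has to build and hand over drops from 45 to 1, which is consistent with the gain showing up mainly as lower CPU utilization.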
Other NICs that can do this?
Does anyone know what other NICs can do TSO? Any chance this can benefit NICs besides the GigE ones? Google turns up not much.
I've got an SMC/WD 8013 16-bit ISA card I'm using for my 512/256 Kbps DSL link - could I benefit from using TSO? I'm also wondering if NAPI would help?
(Sorry, bit of a lame joke I know - I'm just happy that an ethernet card manufactured in 1991 is still useful to me today!)
Different question should be asked
I don't think it would be useful because, from what I understand (and I don't claim to follow the Linux kernel anymore nor to be an expert in networking), its biggest impact will come in large data transfers. I doubt you're doing large data transfers and even if you were, you're probably going to hit the limits of the ISA bus and the DSL link before you run into a need for TSO. WAY before you run into a need. But in server environments, it would be great. And along with jumbo frames, it would be delicious!
But the board would need to support splitting the packets itself! So only a few boards support it, and certainly not older ones anyway...
Well
Actually it _would_ be very useful (if the card supported it, which it doesn't) because it would reduce ISA bus traffic. Also, TSO will not benefit jumbo-frame installations as much as standard ones.
16-bit card?
My Linux box still runs an 8-bit WD8003 ISA card in the interface to the cable modem! Nothing else did the trick with zero empty PCI slots available (2nd ethernet card, VGA, TV tuner used all PCIs).
2b
Know what you mean. What's the point of wasting a PCI slot when you have an ISA slot free, a free ISA ethernet card to go in it, and a DSL / cable connection that will never flood a half-duplex 10 Mbps ethernet?
I've just run out of PCI slots as well, though my last one went to an S3 Virge PCI video card that I wanted to use to play with using video RAM as a block device (i.e. as per the article on Slashdot recently).
Well, it sort of worked: I was able to write to the video RAM, but not read from it! mke2fs and mkswap worked OK, but I couldn't mount it or add the RAM to swap.
The thought I've had is that because I'm not using XFree86 to initialise the video card (I'm using a G550 for X), the card might be write-only if it is not initialised. Either that or the memory on the card is write-only :-( Just wish I had another AGP slot, so I could stick my G200 back in my box.
I tried this out too..
I have a voodoo3 with 16 mb.. In 1600x1200x16bpp, I was using slightly less than 4mb of it. I was able to create a 12mb device, mkswap, swapon, etc.. It worked pretty well. I was even able to start X with the swap enabled.
However, I lost certain features of the video card. I could no longer do xvid colorspace conversions or video scaling. Once I saw that didn't work, I didn't even want to try 3D accel.
I ended up turning it back off because of this. In your situation with 2 video cards, you don't have to worry about that, I guess.
-molo
Broadcom does this
Help at work..
At work I have a "cluster" of servers that are on a dedicated gigabit switch all running with e1000 cards. We do a lot of large sequential file transfers.
Of course there's no way I can put 2.5.x on those boxen since they're in production. Any hope of seeing these in the 2.4.x series? If only one machine in the cluster has this will it show any improvement?
Well...
You have the e1000 driver and NAPI (IIRC e1000 has NAPI support implemented), so that should be a good start for you.
Also, the segment offload patches are pretty small - and a 50% CPU reduction on the send side is nothing to sneeze at. IMO we'll probably see it in 2.4 soon.
Yes, if one machine in the cluster has this, you'll see an improvement _for that machine_; you wouldn't see any difference on the other machines, of course. Sit tight, let it get some testing, and wait for it to be backported.
Does TSO require dedicated hw?
Hello,
I’m working on an embedded network interface (core by Synopsys) and I’ve already written a Linux device driver.
My hardware has an internal DMA so I’ve used the scatter gather implementation (plus some optimization: zero-copy etc etc…). Unfortunately, this hardware is not able to do the csum calculation (I perform that using the skb_checksum_help() function).
I’m wondering: could I add the TSO support too?
Sorry, but it's not clear to me whether segmentation offload requires dedicated hardware or whether I only need to modify my device driver in order to support it.
Regards,
Giuseppe
Hardware support is needed
TSO is a hardware function; the NIC needs to know how to fit the data into the template segment (IP and TCP headers, with some details left for the NIC to fill in) supplied by the OS. Without hardware support, you can't do TSO.
Difference between TSO, UFO and GSO
What are the differences among TSO, UFO and GSO?
If I've understood correctly, TSO and UFO are "special" HW functions.
Is GSO software-only support? I mean, does GSO require dedicated hardware too?
Sorry but I’m not familiar with network protocols, optimizations etc.
many thanks!!!