ipv4: Add GRO infrastructure
This patch adds GRO support for IPv4.
The criteria for merging is more stringent than LRO, in particular,
we require all fields in the IP header to be identical except for
the length, ID and checksum. In addition, the ID must form an
arithmetic sequence with a difference of one.
The ID requirement might seem overly strict, however, most hardware
TSO solutions already obey this rule. Linux itself also obeys this
whether GSO is in use or not.
In future we could relax this rule by storing the IDs (or rather
making sure that we don't drop them when pulling the aggregate
skb's tail).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---
include/net/protocol.h | 3 +
net/ipv4/af_inet.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 99 insertions(+)
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 8d024d7..cb2965a 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -39,6 +39,9 @@ struct net_protocol {
int (*gso_send_check)(struct sk_buff *skb);
struct sk_buff *(*gso_segment)(struct sk_buff *skb,
int features);
+ struct sk_buff **(*gro_receive)(struct sk_buff **head,
+ struct sk_buff *skb);
+ int (*gro_complete)(struct sk_buff *skb);
unsigned int no_policy:1,
netns_ok:1;
};
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1aa2dc9..260f081 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -94,6 +94,7 @@
#include <linux/igmp.h>
#include <linux/inetdevice.h>
#include <linux/netdevice.h>
+#include <net/checksum.h>
#include <net/ip.h>
#include <net/protocol.h>
#include <net/arp.h>
@@ -1245,6 +1246,99 @@ out:
return segs;
}
+static struct sk_buff **inet_gro_receive(struct sk_buff **head,
+ struct sk_buff *skb)
+{
+ struct net_protocol *ops;
+ struct sk_buff **pp = NULL;
+ struct sk_buff *p;
+ struct iphdr *iph;
+ int flush = 1;
+ int proto;
+ int id;
+
+ if ...net: Add Generic Receive Offload infrastructure This patch adds the top-level GRO (Generic Receive Offload) infrastructure. This is pretty similar to LRO except that this is protocol-independent. Instead of holding packets in an lro_mgr structure, they're now held in napi_struct. For drivers that intend to use this, they can set the NETIF_F_GRO bit and call napi_gro_receive instead of netif_receive_skb or just call netif_rx. The latter will call napi_receive_skb automatically. When napi_gro_receive is used, the driver must either call napi_complete/napi_rx_complete, or call napi_gro_flush in softirq context if the driver uses the primitives __napi_complete/__napi_rx_complete. Protocols will set the gro_receive and gro_complete function pointers in order to participate in this scheme. In addition to the packet, gro_receive will get a list of currently held packets. Each packet in the list has a same_flow field which is non-zero if it is a potential match for the new packet. For each packet that may match, they also have a flush field which is non-zero if the held packet must not be merged with the new packet. Once gro_receive has determined that the new skb matches a held packet, the held packet may be processed immediately if the new skb cannot be merged with it. In this case gro_receive should return the pointer to the existing skb in gro_list. Otherwise the new skb should be merged into the existing packet and NULL should be returned, unless the new skb makes it impossible for any further merges to be made (e.g., FIN packet) where the merged skb should be returned. Whenever the skb is merged into an existing entry, the gro_receive function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb merely matches an existing entry but can't be merged with it, then this shouldn't be set. If gro_receive finds it pointless to hold the new skb for future merging, it should set NAPI_GRO_CB(skb)->flush. Held packets will be flushed by napi_gro_flush which is ...
Success after freeing skb on error? :) -- Evgeniy Polyakov --
If this ever gets run we're in big trouble :) And if you're hacking Nope, same_flow == 1 means that the skb was merged into an existing No, ->count is for how many original skb's we have merged into skb, whilc the local count is for how many held skb's we have in total (used to limit the length of that list). Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote:
So why not call this field "merged"?
This belongs in a kernel-doc comment, not in the commit message.
We used a hash table in our own soft-LRO, used in out-of-tree driver
releases. This certainly improved performance in many-to-one
benchmarks. How much it matters in real applications, I'm less sure.
Are you intending for the VLAN driver to call napi_gro_receive()? If
not, I think this should treat VLAN tags as part of the MAC header.
Is this assignment to flush really necessary? Surely any skb on the
gro_list with flush == 1 gets removed before the next call to
The above loop is unclear because most of the body is supposed to run at
most once; I would suggest writing the loop and the failure case as:
rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list)
if (ptype->type == type && !ptype->dev && ptype->gro_receive)
break;
if (&ptype->list == head) {
rcu_read_unlock();
goto normal;
}
and then moving the rest of the loop body after this.
The inet_lro code accepts either skbs or pages and the sfc driver takes
advantage of this: so long as most packets can be coalesced by LRO, it's
cheaper to allocate page buffers in advance and then attach them to skbs
during LRO. I think you should support the use of page buffers.
Obviously it adds complexity but there's a real performance benefit.
(Alternately you could work out how to make skb allocation cheaper, and
everyone would be happy!)
[...]
Shouldn't the list already be empty at this point?
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
--
Because when it's used on held skb's it indicates whether the new Yes it is necessary. Only the one with flushing != 0 and is of the This is something which I will look at in future. The top priority It should be. However, if your driver is buggy and forgets to call napi_complete (or calls __napi_complete without napi_gro_flush) then we'll need this. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
BTW this should be pretty easy to implement through a second entry, e.g., napi_gro_receive_pages() that works just like its LRO counter-part lro_receive_frags. This would have its own protocol hooks so it doesn't need to do anything nasty to get at the packet headers. But I'd like to get the main infrastructure nailed down first before we tack on bells and whistles like this and VLAN support. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
In fact we don't even need extra hooks. All we need to do is to keep one pre-allocated in the napi struct, and use that to hold the pages while we process it. If it's merged (we'd modify the low-level skb_gro_receive to merge the pages directly rather than the skb), then we return the skb to use it again for the next packet. If not then we just allocate a new skb. This way none of the protocol-specific code needs to handle pages at all. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
This sounds good! -- Evgeniy Polyakov --
On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote: This covers frag_off, ttl and protocol. But frag_off and protocol have [...] Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Communications Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. --
No we haven't checked frag_off yet, we've only checked the DF bit. In any case, checking twice doesn't hurt. Good catch on the missing tos check though. I'll get that added. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
