Re: [PATCH 3/8] net: Add Generic Receive Offload infrastructure

Previous thread: [PATCH 2/8] net: Add frag_list support to GSO by Herbert Xu on Thursday, December 11, 2008 - 10:31 pm. (1 message)

Next thread: [PATCH 8/8] e1000e: Add GRO support by Herbert Xu on Thursday, December 11, 2008 - 10:31 pm. (2 messages)
From: Herbert Xu
Date: Thursday, December 11, 2008 - 10:31 pm

ipv4: Add GRO infrastructure

This patch adds GRO support for IPv4.

The criteria for merging is more stringent than LRO, in particular,
we require all fields in the IP header to be identical except for
the length, ID and checksum.  In addition, the ID must form an
arithmetic sequence with a difference of one.

The ID requirement might seem overly strict, however, most hardware
TSO solutions already obey this rule.  Linux itself also obeys this
whether GSO is in use or not.

In future we could relax this rule by storing the IDs (or rather
making sure that we don't drop them when pulling the aggregate
skb's tail).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/protocol.h |    3 +
 net/ipv4/af_inet.c     |   96 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/include/net/protocol.h b/include/net/protocol.h
index 8d024d7..cb2965a 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -39,6 +39,9 @@ struct net_protocol {
 	int			(*gso_send_check)(struct sk_buff *skb);
 	struct sk_buff	       *(*gso_segment)(struct sk_buff *skb,
 					       int features);
+	struct sk_buff	      **(*gro_receive)(struct sk_buff **head,
+					       struct sk_buff *skb);
+	int			(*gro_complete)(struct sk_buff *skb);
 	unsigned int		no_policy:1,
 				netns_ok:1;
 };
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1aa2dc9..260f081 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -94,6 +94,7 @@
 #include <linux/igmp.h>
 #include <linux/inetdevice.h>
 #include <linux/netdevice.h>
+#include <net/checksum.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/arp.h>
@@ -1245,6 +1246,99 @@ out:
 	return segs;
 }
 
+static struct sk_buff **inet_gro_receive(struct sk_buff **head,
+					 struct sk_buff *skb)
+{
+	struct net_protocol *ops;
+	struct sk_buff **pp = NULL;
+	struct sk_buff *p;
+	struct iphdr *iph;
+	int flush = 1;
+	int proto;
+	int id;
+
+	if ...
From: Herbert Xu
Date: Thursday, December 11, 2008 - 10:31 pm

net: Add Generic Receive Offload infrastructure

This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.

For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically.  When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.

Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.

In addition to the packet, gro_receive will get a list of currently held
packets.  Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet.  For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.

Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it.  In this case gro_receive should return the pointer to
the existing skb in gro_list.  Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.

Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow.  Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.

If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.

Held packets will be flushed by napi_gro_flush which is ...
From: Evgeniy Polyakov
Date: Friday, December 12, 2008 - 12:51 pm

Success after freeing skb on error? :)



-- 
	Evgeniy Polyakov
--

From: Herbert Xu
Date: Friday, December 12, 2008 - 2:45 pm

If this ever gets run we're in big trouble :) And if you're hacking

Nope, same_flow == 1 means that the skb was merged into an existing

No, ->count is for how many original skb's we have merged into
skb, whilc the local count is for how many held skb's we have
in total (used to limit the length of that list).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Ben Hutchings
Date: Friday, December 12, 2008 - 3:25 pm

On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote:

So why not call this field "merged"?


This belongs in a kernel-doc comment, not in the commit message.


We used a hash table in our own soft-LRO, used in out-of-tree driver
releases.  This certainly improved performance in many-to-one
benchmarks.  How much it matters in real applications, I'm less sure.


Are you intending for the VLAN driver to call napi_gro_receive()?  If
not, I think this should treat VLAN tags as part of the MAC header.

Is this assignment to flush really necessary?  Surely any skb on the
gro_list with flush == 1 gets removed before the next call to

The above loop is unclear because most of the body is supposed to run at
most once; I would suggest writing the loop and the failure case as:

	rcu_read_lock();
	list_for_each_entry_rcu(ptype, head, list)
		if (ptype->type == type && !ptype->dev && ptype->gro_receive)
			break;
	if (&ptype->list == head) {
		rcu_read_unlock();
		goto normal;
	}

and then moving the rest of the loop body after this.

The inet_lro code accepts either skbs or pages and the sfc driver takes
advantage of this: so long as most packets can be coalesced by LRO, it's
cheaper to allocate page buffers in advance and then attach them to skbs
during LRO.  I think you should support the use of page buffers.
Obviously it adds complexity but there's a real performance benefit.
(Alternately you could work out how to make skb allocation cheaper, and
everyone would be happy!)

[...]

Shouldn't the list already be empty at this point?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Herbert Xu
Date: Friday, December 12, 2008 - 3:56 pm

Because when it's used on held skb's it indicates whether the new


Yes it is necessary.  Only the one with flushing != 0 and is of the

This is something which I will look at in future.  The top priority

It should be.  However, if your driver is buggy and forgets to
call napi_complete (or calls __napi_complete without napi_gro_flush)
then we'll need this.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Friday, December 12, 2008 - 4:11 pm

BTW this should be pretty easy to implement through a second
entry, e.g., napi_gro_receive_pages() that works just like its
LRO counter-part lro_receive_frags.  This would have its own
protocol hooks so it doesn't need to do anything nasty to get
at the packet headers.

But I'd like to get the main infrastructure nailed down first
before we tack on bells and whistles like this and VLAN support.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Friday, December 12, 2008 - 8:43 pm

In fact we don't even need extra hooks.  All we need to do is to
keep one pre-allocated in the napi struct, and use that to hold
the pages while we process it.  If it's merged (we'd modify the
low-level skb_gro_receive to merge the pages directly rather than
the skb), then we return the skb to use it again for the next
packet.  If not then we just allocate a new skb.  This way none
of the protocol-specific code needs to handle pages at all.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Evgeniy Polyakov
Date: Saturday, December 13, 2008 - 7:03 am

This sounds good!

-- 
	Evgeniy Polyakov
--

From: Ben Hutchings
Date: Friday, December 12, 2008 - 3:55 pm

On Fri, 2008-12-12 at 16:31 +1100, Herbert Xu wrote:


This covers frag_off, ttl and protocol.  But frag_off and protocol have
[...]

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Herbert Xu
Date: Friday, December 12, 2008 - 4:04 pm

No we haven't checked frag_off yet, we've only checked the DF bit.
In any case, checking twice doesn't hurt.  Good catch on the missing
tos check though.  I'll get that added.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

Previous thread: [PATCH 2/8] net: Add frag_list support to GSO by Herbert Xu on Thursday, December 11, 2008 - 10:31 pm. (1 message)

Next thread: [PATCH 8/8] e1000e: Add GRO support by Herbert Xu on Thursday, December 11, 2008 - 10:31 pm. (2 messages)