Re: Data corruption issue with splice() on 2.6.27.10

Previous thread: Re: [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000 by Andrew Morton on Wednesday, December 24, 2008 - 12:20 am. (6 messages)

Next thread: [PATCH 0/5] qlge: Fixes for powerpc platform. by Ron Mercer on Wednesday, December 24, 2008 - 11:18 am. (11 messages)
From: Willy Tarreau
Date: Wednesday, December 24, 2008 - 8:28 am

[Empty message]
From: Jarek Poplawski
Date: Tuesday, January 6, 2009 - 1:54 am

Great story! Alas I don't understand this fully either, but it seems
Changli Gao was concerned with sendpage sending this "as pages", so
when NETIF_F_SG flag is available. Did you try this without SG btw?

Thanks,
Jarek P.
--

From: Willy Tarreau
Date: Tuesday, January 6, 2009 - 2:41 am

Hi Jarek,


No I did not. I can try, it's not too hard. It would in part defeat the
purpose of the mechanism (especially at 10 Gbps) but at least it will
help narrow the problem down.

Thanks for the tip, I'll keep you informed !
Willy

--

From: Jarek Poplawski
Date: Tuesday, January 6, 2009 - 3:01 am

On Tue, Jan 06, 2009 at 10:41:38AM +0100, Willy Tarreau wrote:

Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a
bit for these sendpages should then make it more reproducible, I guess.

Jarek P.
--

From: Willy Tarreau
Date: Tuesday, January 6, 2009 - 3:04 am

ah ? I can try that too, using iptables on the target machine to drop
a few outgoing acks.

Regards,
Willy

--

From: Willy Tarreau
Date: Tuesday, January 6, 2009 - 8:57 am

OK here is an update. It does not change anything to turn off any acceleration
feature on the interface (tg3) :

root@wtap:~# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off

It still forwards corrupted data like mad. I noticed that the corruption rate
is 10-100 times higher when forwarding from eth0 to eth0 than from eth0 to lo.

Maybe this can help find the culprit ?

Willy

--

From: Jarek Poplawski
Date: Wednesday, January 7, 2009 - 2:39 am

I hope it will, but I still don't get it. Anyway, here is an untested
patch, which I guess partly tries Changli Gao's recommendation to give
real pages to splice/pipe (but here it's always - not for sendpage
only).

BTW, I added Changli to Cc - great review!

Thanks,
Jarek P.

---

 net/core/skbuff.c |   41 +++++++++++++++++++++++++++--------------
 1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..4c080cd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	kfree_skb(skb);
+	put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	skb_get(skb);
+	get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
  */
 static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
-	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+	put_page(spd->pages[i]);
+}
 
-	kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+					  unsigned int offset)
+{
+	struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+	if (!p)
+		return NULL;
+	memcpy(p + offset, page + offset, len);
+
+	return p;
 }
 
 /*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
+	if (linear) {
+		page = linear_to_page(page, ...
From: Willy Tarreau
Date: Wednesday, January 7, 2009 - 5:22 am

[ CCing Evgeniy and Herbert who also participate to the thread ]


Well, I've just tested it. It did not fix the problem but made it worse.
Sending a few bytes at a time, the corruption is still there, from the
beginning. Here's what I fed :

willy@pcw:~$ nc -lp4001 
azerazerazerazerazerazer
eiguhaeihgaeighaeighaeirghiareg
aeroigjaeorgjaeorgjaoeigjaeoig
ioejrgoiaerjgoiaerjgoiaerjgaoiejgoaiejg

Here's what I got :

willy@pcw:~$ telnet 10.0.3.2 4000
Trying 10.0.3.2...
Connected to 10.0.3.2.
Escape character is '^]'.
_J0s9ñuMG1S9Ðt2D$EðL$T$

However, when feeding /dev/zero as in previous tests, the kernel paniced
in skb_release_data().

Regards,
--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 5:24 am

Well this patch can only make it worse because not only are you
still ref counting skb->head with get_page, but you've also
completely removed the skb ref count which means that the corruption
can only occur sooner.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Wednesday, January 7, 2009 - 5:38 am

But we don't need this skb... except its ->frags[]  pages, which are
get_paged?! (The rest is copied to new pages.)

Jarek P.
--

From: Jarek Poplawski
Date: Wednesday, January 7, 2009 - 5:31 am

...

Terrible mistake! Here is take 2.

Sorry!
Jarek P.

---

 net/core/skbuff.c |   41 +++++++++++++++++++++++++++--------------
 1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..a598034 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	kfree_skb(skb);
+	put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	skb_get(skb);
+	get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
  */
 static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
-	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+	put_page(spd->pages[i]);
+}
 
-	kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+					  unsigned int offset)
+{
+	struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+	if (!p)
+		return NULL;
+	memcpy((void *)p + offset, (void *)page + offset, len);
+
+	return p;
 }
 
 /*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
+	if (linear) {
+		page = linear_to_page(page, len, offset);
+		if (!page)
+			return 1;
+	}
+
 	spd->pages[spd->nr_pages] = page;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = ...
From: Jens Axboe
Date: Wednesday, January 7, 2009 - 5:35 am

is trying to do. I'm assuming you want to copy the page contents? If so,
you'd want something like

        memcpy(page_address(p) + offset, page_address(page) + offset, len);

with possible kmaps for 'page'.

Irregardless of that particular oddity, I don't think this is the right
path to take at all. We need to delay the pipe buffer consumption until
the appropriate time.

-- 
Jens Axboe

--

From: Evgeniy Polyakov
Date: Wednesday, January 7, 2009 - 5:40 am

As a proof of concept we can put a delayed work_struct into the buffer
and only release its content after some timeout big enough (like one
second or so) for the hardware to actually transmit its buffers.

-- 
	Evgeniy Polyakov
--

From: Willy Tarreau
Date: Wednesday, January 7, 2009 - 5:52 am

Evgeniy, I'd like to understand something related to our apparent lack of
knowledge of when the data is effectively transmitted. If we're focusing
on the send part, I can't understand why I never reproduce the corruption
when the data source is a file or loopback, but I only see it when the source
is an ethernet interface. How is it possible that a problem affecting only
the send side is so much selective about the source ? And in fact, why can't
we apply the same workflow for outgoing data for both types of sources ? It
seems to me that the page is released at the right time when sending a file,
and I don't see why we cannot apply the same principle when splicing between
sockets.

Please excuse me for my blattant ignorance in this area, as I once said, I
could not completely follow the whole splice process between tcp_splice_read()
and the moment the data leaves the machine. Also, I failed to understand what
linear data means. It seems to me this is the parts that are memcpy'd, but I'm
not sure.

Thanks,
Willy

--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 5:53 am

It doesn't happen with a file because in that case you don't
start with an skb so there is no skb->head.  It probably doesn't
happen with loopback because loopback does GSO so again skb->head
does not exist (so to speak).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Evgeniy Polyakov
Date: Wednesday, January 7, 2009 - 5:57 am

Yup, basically splice's transmit pipe buffer contains page references,
where the first one is actually not a real page but skb, while in the
case of sendfile() and/or splice() from the file first page is a real
page of the appropriate file.

-- 
	Evgeniy Polyakov
--

From: Willy Tarreau
Date: Wednesday, January 7, 2009 - 6:08 am

OK thanks guys for the clarifications.

Evgeniy, my printk() in tcp_sendpage() fired several times indicating we were
going through do_tcp_sendpage. During the same test, I observed a lot of
corruption.

Also, I have a good news. As you suggested, disabling both SG and GSO indeed
fixes the issue. do_tcp_sendpage() is not called anymore from tcp_sendpage()
in this case (according to dmesg).

Cheers,
Willy

--

From: Jarek Poplawski
Date: Wednesday, January 7, 2009 - 5:49 am

Hmm... in any case: take 3

Jarek P. 

---

 net/core/skbuff.c |   41 +++++++++++++++++++++++++++--------------
 1 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..6e43d52 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	kfree_skb(skb);
+	put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	skb_get(skb);
+	get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
  */
 static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
-	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+	put_page(spd->pages[i]);
+}
 
-	kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+					  unsigned int offset)
+{
+	struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+	if (!p)
+		return NULL;
+	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+	return p;
 }
 
 /*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
+	if (linear) {
+		page = linear_to_page(page, len, offset);
+		if (!page)
+			return 1;
+	}
+
 	spd->pages[spd->nr_pages] = page;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = offset;
-	spd->partial[spd->nr_pages].private ...
From: Herbert Xu
Date: Wednesday, January 7, 2009 - 5:52 am

[Empty message]
From: Jarek Poplawski
Date: Wednesday, January 7, 2009 - 6:02 am

Hmm.. Again - the main point of this patch was a proof of concept...
If the bug is really here it could be optimized (e.g. done only for
sendmesage) or fixed another way.

Jarek P.
--

From: Willy Tarreau
Date: Wednesday, January 7, 2009 - 6:00 am

that's what I was initially wondering about, but it looks like only linear
data is copied. Will that cause too much a overhead ? (I don't like copying
at all anyway, but if it can help us find the cause, I'll happily test it).

BTW, Jarek, don't be sorry, I *am* expecting to crash my laptop a number of
times before ensuring the bug is fixed. I just don't want to lose my data.

Cheers,
Willy

--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 6:01 am

For splicing from socket to socket or socket to file, all the
data will be linear with most drivers.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Monday, January 12, 2009 - 5:02 am

However, as we don't have a better fix yet, we probably should
take Jarek's patch for now since data corruption is bad.

This is a very hard problem, so in the end the only viable solution
might be to get the drivers to switch to using page frags, like
the Intel page split method.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Evgeniy Polyakov
Date: Monday, January 12, 2009 - 5:45 am

Iirc it copies data from skb to the new pipe page unconditionally while
it is needed only for skb->sendpage path, although it is not possible to
know what is the other side of the pipe (or not?).

What about storing a callback and private pointer in the shared info for
the skb and clone them during usual clone, and invoke the callback at
shared info freeing time, which in turn will call spd->spd_release()?

Given that we only need to protect linear part, it should be simple

As a long-term solution this sounds as the best case, but introduces
quite heavy overhead for the allocators. Right now we allocate
1500+shared_info rounded up to the nearest power of the two (2k), but
then we will either need to have own network allocator (I have one :) or
allocate PAGE_SIZE+shared_info rounded up to the pwoer of the two (i.e.
8k), which is unfeasible.

-- 
	Evgeniy Polyakov
--

From: Herbert Xu
Date: Monday, January 12, 2009 - 5:56 am

No that's not what I was suggesting.  The page split model allocates
an skb with a very small head that accomodates only the headers.
All payload is stored in the frags structure.

For 1500-byte packets, we can manage the payload area efficiently
by dividing each allocated page into 2K chunks.  The page will
then be automatically freed once all the 2K chunks on it have been
freed through the page ref count mechanism.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Evgeniy Polyakov
Date: Monday, January 12, 2009 - 5:59 am

That's the part I referred to as a network own allocator. We can have
multiple kmem_caches though for the popular MTUs and round up the
requested size otherwise.

-- 
	Evgeniy Polyakov
--

From: Herbert Xu
Date: Monday, January 12, 2009 - 2:11 pm

No we can't use kmem_caches because we're relying on using the
page refereounce count to free the page.  It has to be naked
pages.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Monday, January 12, 2009 - 6:15 am

I've wondered if something like this could work as a temporary hack?

!!! not compiled/tested !!!
---

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ce572f9..99b0876 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -264,6 +264,7 @@
 #include <linux/cache.h>
 #include <linux/err.h>
 #include <linux/crypto.h>
+#include <linux/page-flags.h>
 
 #include <net/icmp.h>
 #include <net/tcp.h>
@@ -776,7 +777,8 @@ ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
 	struct sock *sk = sock->sk;
 
 	if (!(sk->sk_route_caps & NETIF_F_SG) ||
-	    !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
+	    !(sk->sk_route_caps & NETIF_F_ALL_CSUM) ||
+	    PageSlab(page))
 		return sock_no_sendpage(sock, page, offset, size, flags);
 
 	lock_sock(sk);
--

From: Herbert Xu
Date: Monday, January 12, 2009 - 2:12 pm

No we should fix it in skb_splice_bits because the corruption
can affect other terminations as well if a delay occurs.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jarek Poplawski
Date: Monday, January 19, 2009 - 12:32 am

On Mon, Jan 12, 2009 at 01:15:17PM +0000, Jarek Poplawski wrote:

Just for the record: this wouldn't work yet:-( It should be probably
something like "PageCompound(compound_head(page))" test instead.

Jarek P.
--

From: Willy Tarreau
Date: Wednesday, January 7, 2009 - 5:39 am

no problem. However, Herbert explained that this fix could not work for
multiple reasons (refcounting not fixed where the issue really is). BTW,
I've just tried it, and it crashed again, this time in tcp_send_ack().

Regards,
Willy

--

From: Jarek Poplawski
Date: Wednesday, January 7, 2009 - 5:56 am

On Wed, Jan 07, 2009 at 01:39:02PM +0100, Willy Tarreau wrote:

I'm extremely sorry! (I don't even expect you'll dare take 3...)

Jarek P.
--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 5:44 am

Can you please upgrade your ethtool so that it knows about GSO?
Once you've done that please retest with everything including
GSO turned off.

Alternatively edit out the bit that enables GSO by default in
net/core/dev.c.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Ben Mansell
Date: Tuesday, January 6, 2009 - 10:42 am

FWIW, I can easily reproduce this on a Linux 2.6.27-9 (Ubuntu kernel), 
using both forcedeth and tg3 network drivers. It's reassuring to hear of 
this network corruption as I have been puzzling over non-blocking 
splice() code recently!

The corruption does seem to be confined to the user data on the 
connections, as I have been able to run some benchmarks using my own 
splice()-enabled HTTP proxy to transfer lots of data. All the TCP 
connections 'work' fine (as in no broken TCP), the initial HTTP request 
& response headers get through OK, but the body data of the responses 
sometimes gets corrupted. The benchmark seems to work flawlessly until 
you look at the web page data!

I'm happy to run any test code on systems here or provide any debug 
information if it would help to track this down.


Ben
--

From: Willy Tarreau
Date: Tuesday, January 6, 2009 - 11:15 am

Hi Ben,


Ah, so you might also have discovered a few annoyances with the API, eg
the fact that splice() returns after the first read in non-blocking mode,
as well as the fact that it never returns zero on close, but -EAGAIN,
which requires an additional recv(MSG_PEEK) to distinguish between a
close and a lack of data. But I leave that for a later discussion, let's

I confirm your observations. Benchmarks were OK, it's the first user of
my experimental code who reported wrong md5sums on their ISO images :-/
For this reason, I think that it's completely related to the way pages
are passed between sockets, but I'm too much a loser in this area. I
understood tcp_splice_read(), but can't manage to find what is called

That's nice, because I'd like to ensure that whatever fix is proposed
is properly validated, not only on my platforms!

Regards,
Willy

--

From: Jarek Poplawski
Date: Thursday, January 8, 2009 - 12:16 am

On 06-01-2009 19:15, Willy Tarreau wrote:

FYI, this should be just fixed:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54...

Regards,
Jarek P.
--

From: Willy Tarreau
Date: Thursday, January 8, 2009 - 1:05 am

Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch
I had pending here ;-)

I'll ping Greg for a backport into -stable, as applications relying on this
will clearly not work without that fix.

The other one I had consists in removing "|| !timeo" at the end of the loop,
because otherwise splice() returns very small chunks (typically 1448 or
1460 bytes), leading to disastrous performance on high bandwidth links.
At 10 Gbps, this means about 800000 calls to splice() per second!

Regards,
Willy

--

From: Ingo Molnar
Date: Thursday, January 8, 2009 - 7:53 am

looks interesting - would you mind to submit it?

	Ingo
--

From: Ben Mansell
Date: Thursday, January 8, 2009 - 8:16 am

FWIW, I've also tested this change with some splice() benchmarks. I can 
confirm that removing the "|| !timeo" works well and improves 
performance significantly.


Ben
--

From: Willy Tarreau
Date: Thursday, January 8, 2009 - 10:14 am

OK I will. I preferred to focus on the bug first, which is clearly
more annoying since it basically makes splice() unusuable in a proxy.
I wanted to avoid changes before the bug is fixed, but now, with the
first investigations, it seems that the fix will be in a different
area anyway so that doesn't matter.

Regards,
Willy

--

From: Evgeniy Polyakov
Date: Tuesday, January 6, 2009 - 11:32 am

Hi Willy.

Unfortunately I can not work on this problem right now, but will do if
things are not resolved after Jan 11 (long vacations will be finished in
Russia and I will return to my test machines :) But right now I have
one quesstion: I read several times your mail but still can not figure
out if receiving or sending side is broken?

I.e. can you splice from socket into the file, check the file, and then
splice to the another socket and check received data to find out which
side is broken? Or did I just missed that in the problem description?

Thanks a lot for the test application, it will greatly help to resolve
this issue.

-- 
	Evgeniy Polyakov
--

From: Jens Axboe
Date: Tuesday, January 6, 2009 - 11:37 am

I'll give this a spin tomorrow as well. A hunch tells me that this is
likely a page reuse issue, that splice is getting the reference to the
buffer dropped before the data has really been transmitted. IOW, the
page is likely fine reaching the ->sendpage() bit, but will be reused
before the data has actually been transmitted. So once you get that far,
other random data from that page is going out.

Just a guess, I'll try and reproduce this tomorrow!

-- 
Jens Axboe

--

From: Willy Tarreau
Date: Tuesday, January 6, 2009 - 11:55 am

Hi Jens,


I like your explanation because eventhough I don't understand the code
(can't follow it past the actors in fact), I understand the problem you're

OK. In order not to waste your time, run the test app from one interface to
the same one, with both the client and the server on the same machine, distinct
from the test app. It will trigger immediately. "nc|od -Ax -tx1" will save you
a lot of time on the client side too BTW.

Thanks,
Willy

--

From: Herbert Xu
Date: Tuesday, January 6, 2009 - 9:42 pm

[Empty message]
From: Willy Tarreau
Date: Tuesday, January 6, 2009 - 11:38 pm

So this means that anything relying on sendpage() is at risk ? What
I find really strange is that I can only reproduce the issue if the
spliced data come from a real interface. If they come from the loopback
or from a file, there is no problem. Maybe the ref counting is different

My question will sound stupid to some of you, but wouldn't increasing
the refcount on those skb solve the problem (and decreasing it once
the skb is effectively sent) ?

Regards,
Willy

--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 2:52 am

No the bug is in the splice code.  sendfile() and other sendpage
users are not affected.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Willy Tarreau
Date: Wednesday, January 7, 2009 - 2:54 am

but sendfile() now uses splice().

Willy

--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 4:52 am

Good point.  However, this bug only triggers when inbound splice
is reinjected into sendpage so sendfile() users shouldn't see it.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Jens Axboe
Date: Wednesday, January 7, 2009 - 1:17 am

So my hunch was pretty close. The fix would seem to involve NOT calling
ops->release(pipe, buf) until we actually have an ACK on that data gone
out.

-- 
Jens Axboe

--

From: Evgeniy Polyakov
Date: Wednesday, January 7, 2009 - 4:29 am

Hi Herbert.


That would not happen without scatter-gather support on the interface,
date would be plain copied, and after Jarek's requst Willy confirmed
that corruption happens with all acceleration being turned off.

-- 
	Evgeniy Polyakov
--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 4:50 am

It will happen without scatter-gather support.  The problem
is with skb->head, which is always there.  In any case, SG can't
make any difference because the skb is an inbound one and most
drivers only produce linear packets.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Evgeniy Polyakov
Date: Wednesday, January 7, 2009 - 4:56 am

But it will not push pages into the stack, but copy them including copy
of the submitted head?

-- 
	Evgeniy Polyakov
--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 4:59 am

It will use a single page entry for skb->head with splice, see
skb_splice_bits.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Evgeniy Polyakov
Date: Wednesday, January 7, 2009 - 5:15 am

Looks like we are talking about different directions of the dataflow.
I meant that set of pages submitted into the sending part will be copied
if sending interface does not support acceleration, and thus it will
copy part of the page corresponding to the linear part of the skb prior
the transmission, so even if skb will be freed right after that call
(prior data transmission by the hardware), it should not affect copied
data.

-- 
	Evgeniy Polyakov
--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 5:22 am

You must be looking at a different tcp.c than the one I've got
because mine clearly always uses skb frags in sendpage regardless
of SG support.

Yes we will linearize the packet in dev_queue_xmit but as soon
as the netdev stops the tx queue you'll get corruption.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 5:27 am

OK that can't happen because the linearisation precedes that, but
ARP (or anything else that delays the skb prior to dev_queue_xmit)
can still cause corruption.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 5:30 am

In fact it's probably TCP that's delaying the packet.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Evgeniy Polyakov
Date: Wednesday, January 7, 2009 - 5:37 am

Doesn't your tcp fallbacks to kernel_sendmsg() without sg in
tcp_sendpage()? And then just feeds data into the stack the same way it

That's perfectly valid when sendpage() returns and holds a reference to
the pages but not skb->head, so freed skb will free (and potentially
reuse) that area which has not been transmitted yet.
But without acceleration it will copy data and the whole original skb
may be freed without any problems.

-- 
	Evgeniy Polyakov
--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 5:42 am

Good point.  Did he check GSO though? GSO will always enable SG
on the socket regardless of the netdev's setting.  And if the device
started out with SG enabled then recent kernels will enable GSO
by default.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Evgeniy Polyakov
Date: Wednesday, January 7, 2009 - 5:46 am

Willy, what was the kernel you are tested no-accel behaviour and what
were the gso settings?
Can you add a simple single print into tcp_sendpage() to determine if
content was copied or fed into do_tcp_sendpages() otherwise?

-- 
	Evgeniy Polyakov
--

From: Willy Tarreau
Date: Wednesday, January 7, 2009 - 5:55 am

kernel is 2.6.27.10 + a few patches (squashfs, etc..., nothing related to
this area). My ethtool is a bit old and does not report GSO. I'll download

Yes, will do that too.

Willy

--

From: Herbert Xu
Date: Wednesday, January 7, 2009 - 5:57 am

2.6.27 will enable GSO by default if SG is enabled at registration
time.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Evgeniy Polyakov
Date: Wednesday, January 7, 2009 - 6:02 am

So even if later sg was turned off without gso being turned off, it
will try doing fair page transfer and not falling back to the copy.

The simplest case is to also turn gso off, but with older ethtools it is
not possible and you hit the bug.

If I understood correctly Jarek's patch, he wants to allocate a page and
copy linear content of the skb there, so this will end up being the same
case as with splicing from the file. And while generally this may be a
good idea, but with usual 1.5k mtu this will copy every skb, which is
rather bad idea. Will have to think :)

-- 
	Evgeniy Polyakov
--

From: Jarek Poplawski
Date: Wednesday, January 7, 2009 - 6:10 am

On Wed, Jan 07, 2009 at 04:02:59PM +0300, Evgeniy Polyakov wrote:

As I mentioned earlier, I (poorly) tried to realize Changli's idea
only, and it's more to verify the reason still (after it failed with
SG method...).

Thanks,
Jarek P.
--

From: Willy Tarreau
Date: Wednesday, January 7, 2009 - 6:15 am

Jarek, since it now works when disabling both SG and GSO, we know that the
bug is triggered when passing through do_tcp_sendpage(). Thus, I will not
try your patch #3 yet, but rather wait for fix candidates to try.

Thanks!
Willy

--

From: Jarek Poplawski
Date: Wednesday, January 7, 2009 - 6:22 am

Sure! Congratulations for debugging to you and Evgeniy!

Jarek P.
--

From: Jarek Poplawski
Date: Wednesday, January 7, 2009 - 7:01 am

On Wed, Jan 07, 2009 at 01:22:58PM +0000, Jarek Poplawski wrote:

...and Herbert! (I missed the beginning of this GSO trail...)

Jarek P.
--

From: Willy Tarreau
Date: Tuesday, January 6, 2009 - 11:50 am

Hi Evgeniy,


Maybe I wasn't very clear. It's only when splicing from a socket to another
socket that it breaks. Splicing from a file to a socket is OK (sendfile is
OK too BTW). Splicing from a socket to a file is OK.

And BTW, the receiving side must be on a real interface, not loopback. Here
are my observations :
  - splice data from lo to lo     => no corruption at all
  - splice data from lo to eth0   => no corruption at all
  - splice data from eth0 to lo   => occasional corruption
  - splice data from eth0 to eth0 => massive corruption


I figured it was an absolute necessity. The original code in my proxy is in
an experimental state and far too hard to debug for these purposes. It was
enough to detect the problem, but I could run a lot more tests with this
small test app ! An who know, maybe it will serve as an example for
non-blocking splice ;-)

Cheers,
Willy

--

From: Lennert Buytenhek
Date: Monday, January 19, 2009 - 1:39 am

Just to throw some more (hacky) example code into the pool, below is
an echo server that I hacked up to test nonblocking splice().  (You'll
need sf.net/projects/libivykis to use it.)  I also have a splice()
discard server and a patch to my intercept-connection-via-iptables-and-
forward-it-to-a-remote-SOCKS5-server-to-deal-with-crappy-VPNs app to
use splice() somewhere.

My main annoyances with splice(2) are/were:

1. -EAGAIN return on splice from socket/pipe to socket/pipe doesn't
   directly tell you whether the source ran out of data or the
   destination can't accept more data, which means you can't e.g. use
   epoll in edge triggered mode without jumping through some minor
   number of extra hoops.  (For a pipe you can keep track of how many
   bytes are in it by hand, but for a socket->pipe splice -EAGAIN return
   you'll have to assume that the pipe is full if there are >0 bytes in
   it.)

2. Because of (1), and because when splicing from a socket to a pipe
   it returns after the first bit of data (you mentioned this as well),
   you don't know at that point whether your pipe is full or not.

3. Always returns -EAGAIN even if there was a FIN or error on the
   source socket.  (Now fixed.)



#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <iv.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

struct connection {
	struct iv_fd sock;
	int pfd[2];
	int pipe_bytes;
	int saw_fin;
};


static void conn_kill(struct connection *conn)
{
	iv_unregister_fd(&conn->sock);
	close(conn->sock.fd);
	close(conn->pfd[0]);
	close(conn->pfd[1]);
	free(conn);
}

static void conn_pollin(void *_conn);
static void conn_pollout(void *_conn);
static void conn_pollerr(void *_conn);

static void conn_pollin(void *_conn)
{
	struct connection *conn = (struct connection ...
From: Willy Tarreau
Date: Monday, January 19, 2009 - 2:53 am

I proceeded the same way : if EAGAIN and data still in the pipe, then

In fact this is fixed now. tcp_splice_read() returns all available data,
which somewhat hides problem #1. I'm running with 23 kB in a push/pull

Yes I saw your fix, it was indeed very annoying because the only workaround
I found was to perform an recv(MSG_PEEK) on the socket after each EAGAIN
to check whether the connection was closed or not.

For these reasons, I'd really love to see the few recent fixes backported
to -stable ASAP. It will boost splice() adoption among products.

Regards,
Willy

--

Previous thread: Re: [Bugme-new] [Bug 12282] New: Network data corruption on eee 1000 by Andrew Morton on Wednesday, December 24, 2008 - 12:20 am. (6 messages)

Next thread: [PATCH 0/5] qlge: Fixes for powerpc platform. by Ron Mercer on Wednesday, December 24, 2008 - 11:18 am. (11 messages)