Great story! Alas I don't understand this fully either, but it seems Changli Gao was concerned with sendpage sending this "as pages", so when NETIF_F_SG flag is available. Did you try this without SG btw? Thanks, Jarek P. --
Hi Jarek, No I did not. I can try, it's not too hard. It would in part defeat the purpose of the mechanism (especially at 10 Gbps) but at least it will help narrow the problem down. Thanks for the tip, I'll keep you informed ! Willy --
On Tue, Jan 06, 2009 at 10:41:38AM +0100, Willy Tarreau wrote: Yes, I meant it only as a proof of concept. BTW, delaying TCP acks a bit for these sendpages should then make it more reproducible, I guess. Jarek P. --
ah ? I can try that too, using iptables on the target machine to drop a few outgoing acks. Regards, Willy --
OK here is an update. It does not change anything to turn off any acceleration feature on the interface (tg3) : root@wtap:~# ethtool -k eth0 Offload parameters for eth0: rx-checksumming: off tx-checksumming: off scatter-gather: off tcp segmentation offload: off It still forwards corrupted data like mad. I noticed that the corruption rate is 10-100 times higher when forwarding from eth0 to eth0 than from eth0 to lo. Maybe this can help find the culprit ? Willy --
I hope it will, but I still don't get it. Anyway, here is an untested
patch, which I guess partly tries Changli Gao's recommendation to give
real pages to splice/pipe (but here it's always - not for sendpage
only).
BTW, I added Changli to Cc - great review!
Thanks,
Jarek P.
---
net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..4c080cd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}
static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}
static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}
- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(p + offset, page + offset, len);
+
+ return p;
}
/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;
+ if (linear) {
+ page = linear_to_page(page, ...[ CCing Evgeniy and Herbert who also participate to the thread ] Well, I've just tested it. It did not fix the problem but made it worse. Sending a few bytes at a time, the corruption is still there, from the beginning. Here's what I fed : willy@pcw:~$ nc -lp4001 azerazerazerazerazerazer eiguhaeihgaeighaeighaeirghiareg aeroigjaeorgjaeorgjaoeigjaeoig ioejrgoiaerjgoiaerjgoiaerjgaoiejgoaiejg Here's what I got : willy@pcw:~$ telnet 10.0.3.2 4000 Trying 10.0.3.2... Connected to 10.0.3.2. Escape character is '^]'. _J0s9ñuMG1S9Ðt2D$EðL$T$ However, when feeding /dev/zero as in previous tests, the kernel paniced in skb_release_data(). Regards, --
Well this patch can only make it worse because not only are you still ref counting skb->head with get_page, but you've also completely removed the skb ref count which means that the corruption can only occur sooner. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
But we don't need this skb... except its ->frags[] pages, which are get_paged?! (The rest is copied to new pages.) Jarek P. --
...
Terrible mistake! Here is take 2.
Sorry!
Jarek P.
---
net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..a598034 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}
static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}
static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}
- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy((void *)p + offset, (void *)page + offset, len);
+
+ return p;
}
/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;
+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = ...is trying to do. I'm assuming you want to copy the page contents? If so,
you'd want something like
memcpy(page_address(p) + offset, page_address(page) + offset, len);
with possible kmaps for 'page'.
Irregardless of that particular oddity, I don't think this is the right
path to take at all. We need to delay the pipe buffer consumption until
the appropriate time.
--
Jens Axboe
--
As a proof of concept we can put a delayed work_struct into the buffer and only release its content after some timeout big enough (like one second or so) for the hardware to actually transmit its buffers. -- Evgeniy Polyakov --
Evgeniy, I'd like to understand something related to our apparent lack of knowledge of when the data is effectively transmitted. If we're focusing on the send part, I can't understand why I never reproduce the corruption when the data source is a file or loopback, but I only see it when the source is an ethernet interface. How is it possible that a problem affecting only the send side is so much selective about the source ? And in fact, why can't we apply the same workflow for outgoing data for both types of sources ? It seems to me that the page is released at the right time when sending a file, and I don't see why we cannot apply the same principle when splicing between sockets. Please excuse me for my blattant ignorance in this area, as I once said, I could not completely follow the whole splice process between tcp_splice_read() and the moment the data leaves the machine. Also, I failed to understand what linear data means. It seems to me this is the parts that are memcpy'd, but I'm not sure. Thanks, Willy --
It doesn't happen with a file because in that case you don't start with an skb so there is no skb->head. It probably doesn't happen with loopback because loopback does GSO so again skb->head does not exist (so to speak). Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Yup, basically splice's transmit pipe buffer contains page references, where the first one is actually not a real page but skb, while in the case of sendfile() and/or splice() from the file first page is a real page of the appropriate file. -- Evgeniy Polyakov --
OK thanks guys for the clarifications. Evgeniy, my printk() in tcp_sendpage() fired several times indicating we were going through do_tcp_sendpage. During the same test, I observed a lot of corruption. Also, I have a good news. As you suggested, disabling both SG and GSO indeed fixes the issue. do_tcp_sendpage() is not called anymore from tcp_sendpage() in this case (according to dmesg). Cheers, Willy --
Hmm... in any case: take 3
Jarek P.
---
net/core/skbuff.c | 41 +++++++++++++++++++++++++++--------------
1 files changed, 27 insertions(+), 14 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..6e43d52 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly;
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- kfree_skb(skb);
+ put_page(buf->page);
}
static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
- struct sk_buff *skb = (struct sk_buff *) buf->private;
-
- skb_get(skb);
+ get_page(buf->page);
}
static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@ fault:
*/
static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
- struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+ put_page(spd->pages[i]);
+}
- kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+ unsigned int offset)
+{
+ struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+ if (!p)
+ return NULL;
+ memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+ return p;
}
/*
@@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
*/
static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
unsigned int len, unsigned int offset,
- struct sk_buff *skb)
+ struct sk_buff *skb, int linear)
{
if (unlikely(spd->nr_pages == PIPE_BUFFERS))
return 1;
+ if (linear) {
+ page = linear_to_page(page, len, offset);
+ if (!page)
+ return 1;
+ }
+
spd->pages[spd->nr_pages] = page;
spd->partial[spd->nr_pages].len = len;
spd->partial[spd->nr_pages].offset = offset;
- spd->partial[spd->nr_pages].private ...Hmm.. Again - the main point of this patch was a proof of concept... If the bug is really here it could be optimized (e.g. done only for sendmesage) or fixed another way. Jarek P. --
that's what I was initially wondering about, but it looks like only linear data is copied. Will that cause too much a overhead ? (I don't like copying at all anyway, but if it can help us find the cause, I'll happily test it). BTW, Jarek, don't be sorry, I *am* expecting to crash my laptop a number of times before ensuring the bug is fixed. I just don't want to lose my data. Cheers, Willy --
For splicing from socket to socket or socket to file, all the data will be linear with most drivers. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
However, as we don't have a better fix yet, we probably should take Jarek's patch for now since data corruption is bad. This is a very hard problem, so in the end the only viable solution might be to get the drivers to switch to using page frags, like the Intel page split method. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Iirc it copies data from skb to the new pipe page unconditionally while it is needed only for skb->sendpage path, although it is not possible to know what is the other side of the pipe (or not?). What about storing a callback and private pointer in the shared info for the skb and clone them during usual clone, and invoke the callback at shared info freeing time, which in turn will call spd->spd_release()? Given that we only need to protect linear part, it should be simple As a long-term solution this sounds as the best case, but introduces quite heavy overhead for the allocators. Right now we allocate 1500+shared_info rounded up to the nearest power of the two (2k), but then we will either need to have own network allocator (I have one :) or allocate PAGE_SIZE+shared_info rounded up to the pwoer of the two (i.e. 8k), which is unfeasible. -- Evgeniy Polyakov --
No that's not what I was suggesting. The page split model allocates an skb with a very small head that accomodates only the headers. All payload is stored in the frags structure. For 1500-byte packets, we can manage the payload area efficiently by dividing each allocated page into 2K chunks. The page will then be automatically freed once all the 2K chunks on it have been freed through the page ref count mechanism. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
That's the part I referred to as a network own allocator. We can have multiple kmem_caches though for the popular MTUs and round up the requested size otherwise. -- Evgeniy Polyakov --
No we can't use kmem_caches because we're relying on using the page refereounce count to free the page. It has to be naked pages. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
I've wondered if something like this could work as a temporary hack? !!! not compiled/tested !!! --- diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index ce572f9..99b0876 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -264,6 +264,7 @@ #include <linux/cache.h> #include <linux/err.h> #include <linux/crypto.h> +#include <linux/page-flags.h> #include <net/icmp.h> #include <net/tcp.h> @@ -776,7 +777,8 @@ ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, struct sock *sk = sock->sk; if (!(sk->sk_route_caps & NETIF_F_SG) || - !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) + !(sk->sk_route_caps & NETIF_F_ALL_CSUM) || + PageSlab(page)) return sock_no_sendpage(sock, page, offset, size, flags); lock_sock(sk); --
No we should fix it in skb_splice_bits because the corruption can affect other terminations as well if a delay occurs. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
On Mon, Jan 12, 2009 at 01:15:17PM +0000, Jarek Poplawski wrote: Just for the record: this wouldn't work yet:-( It should be probably something like "PageCompound(compound_head(page))" test instead. Jarek P. --
no problem. However, Herbert explained that this fix could not work for multiple reasons (refcounting not fixed where the issue really is). BTW, I've just tried it, and it crashed again, this time in tcp_send_ack(). Regards, Willy --
On Wed, Jan 07, 2009 at 01:39:02PM +0100, Willy Tarreau wrote: I'm extremely sorry! (I don't even expect you'll dare take 3...) Jarek P. --
Can you please upgrade your ethtool so that it knows about GSO? Once you've done that please retest with everything including GSO turned off. Alternatively edit out the bit that enables GSO by default in net/core/dev.c. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
FWIW, I can easily reproduce this on a Linux 2.6.27-9 (Ubuntu kernel), using both forcedeth and tg3 network drivers. It's reassuring to hear of this network corruption as I have been puzzling over non-blocking splice() code recently! The corruption does seem to be confined to the user data on the connections, as I have been able to run some benchmarks using my own splice()-enabled HTTP proxy to transfer lots of data. All the TCP connections 'work' fine (as in no broken TCP), the initial HTTP request & response headers get through OK, but the body data of the responses sometimes gets corrupted. The benchmark seems to work flawlessly until you look at the web page data! I'm happy to run any test code on systems here or provide any debug information if it would help to track this down. Ben --
Hi Ben, Ah, so you might also have discovered a few annoyances with the API, eg the fact that splice() returns after the first read in non-blocking mode, as well as the fact that it never returns zero on close, but -EAGAIN, which requires an additional recv(MSG_PEEK) to distinguish between a close and a lack of data. But I leave that for a later discussion, let's I confirm your observations. Benchmarks were OK, it's the first user of my experimental code who reported wrong md5sums on their ISO images :-/ For this reason, I think that it's completely related to the way pages are passed between sockets, but I'm too much a loser in this area. I understood tcp_splice_read(), but can't manage to find what is called That's nice, because I'd like to ensure that whatever fix is proposed is properly validated, not only on my platforms! Regards, Willy --
On 06-01-2009 19:15, Willy Tarreau wrote: FYI, this should be just fixed: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4f7d54... Regards, Jarek P. --
Ah cool, thanks Jarek for notifying us. Indeed, it's the exact same patch I had pending here ;-) I'll ping Greg for a backport into -stable, as applications relying on this will clearly not work without that fix. The other one I had consists in removing "|| !timeo" at the end of the loop, because otherwise splice() returns very small chunks (typically 1448 or 1460 bytes), leading to disastrous performance on high bandwidth links. At 10 Gbps, this means about 800000 calls to splice() per second! Regards, Willy --
looks interesting - would you mind to submit it? Ingo --
FWIW, I've also tested this change with some splice() benchmarks. I can confirm that removing the "|| !timeo" works well and improves performance significantly. Ben --
OK I will. I preferred to focus on the bug first, which is clearly more annoying since it basically makes splice() unusuable in a proxy. I wanted to avoid changes before the bug is fixed, but now, with the first investigations, it seems that the fix will be in a different area anyway so that doesn't matter. Regards, Willy --
Hi Willy. Unfortunately I can not work on this problem right now, but will do if things are not resolved after Jan 11 (long vacations will be finished in Russia and I will return to my test machines :) But right now I have one quesstion: I read several times your mail but still can not figure out if receiving or sending side is broken? I.e. can you splice from socket into the file, check the file, and then splice to the another socket and check received data to find out which side is broken? Or did I just missed that in the problem description? Thanks a lot for the test application, it will greatly help to resolve this issue. -- Evgeniy Polyakov --
I'll give this a spin tomorrow as well. A hunch tells me that this is likely a page reuse issue, that splice is getting the reference to the buffer dropped before the data has really been transmitted. IOW, the page is likely fine reaching the ->sendpage() bit, but will be reused before the data has actually been transmitted. So once you get that far, other random data from that page is going out. Just a guess, I'll try and reproduce this tomorrow! -- Jens Axboe --
Hi Jens, I like your explanation because eventhough I don't understand the code (can't follow it past the actors in fact), I understand the problem you're OK. In order not to waste your time, run the test app from one interface to the same one, with both the client and the server on the same machine, distinct from the test app. It will trigger immediately. "nc|od -Ax -tx1" will save you a lot of time on the client side too BTW. Thanks, Willy --
So this means that anything relying on sendpage() is at risk ? What I find really strange is that I can only reproduce the issue if the spliced data come from a real interface. If they come from the loopback or from a file, there is no problem. Maybe the ref counting is different My question will sound stupid to some of you, but wouldn't increasing the refcount on those skb solve the problem (and decreasing it once the skb is effectively sent) ? Regards, Willy --
No the bug is in the splice code. sendfile() and other sendpage users are not affected. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
but sendfile() now uses splice(). Willy --
Good point. However, this bug only triggers when inbound splice is reinjected into sendpage so sendfile() users shouldn't see it. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
So my hunch was pretty close. The fix would seem to involve NOT calling ops->release(pipe, buf) until we actually have an ACK on that data gone out. -- Jens Axboe --
Hi Herbert. That would not happen without scatter-gather support on the interface, date would be plain copied, and after Jarek's requst Willy confirmed that corruption happens with all acceleration being turned off. -- Evgeniy Polyakov --
It will happen without scatter-gather support. The problem is with skb->head, which is always there. In any case, SG can't make any difference because the skb is an inbound one and most drivers only produce linear packets. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
But it will not push pages into the stack, but copy them including copy of the submitted head? -- Evgeniy Polyakov --
It will use a single page entry for skb->head with splice, see skb_splice_bits. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Looks like we are talking about different directions of the dataflow. I meant that set of pages submitted into the sending part will be copied if sending interface does not support acceleration, and thus it will copy part of the page corresponding to the linear part of the skb prior the transmission, so even if skb will be freed right after that call (prior data transmission by the hardware), it should not affect copied data. -- Evgeniy Polyakov --
You must be looking at a different tcp.c than the one I've got because mine clearly always uses skb frags in sendpage regardless of SG support. Yes we will linearize the packet in dev_queue_xmit but as soon as the netdev stops the tx queue you'll get corruption. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
OK that can't happen because the linearisation precedes that, but ARP (or anything else that delays the skb prior to dev_queue_xmit) can still cause corruption. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
In fact it's probably TCP that's delaying the packet. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Doesn't your tcp fallbacks to kernel_sendmsg() without sg in tcp_sendpage()? And then just feeds data into the stack the same way it That's perfectly valid when sendpage() returns and holds a reference to the pages but not skb->head, so freed skb will free (and potentially reuse) that area which has not been transmitted yet. But without acceleration it will copy data and the whole original skb may be freed without any problems. -- Evgeniy Polyakov --
Good point. Did he check GSO though? GSO will always enable SG on the socket regardless of the netdev's setting. And if the device started out with SG enabled then recent kernels will enable GSO by default. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Willy, what was the kernel you are tested no-accel behaviour and what were the gso settings? Can you add a simple single print into tcp_sendpage() to determine if content was copied or fed into do_tcp_sendpages() otherwise? -- Evgeniy Polyakov --
kernel is 2.6.27.10 + a few patches (squashfs, etc..., nothing related to this area). My ethtool is a bit old and does not report GSO. I'll download Yes, will do that too. Willy --
2.6.27 will enable GSO by default if SG is enabled at registration time. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
So even if later sg was turned off without gso being turned off, it will try doing fair page transfer and not falling back to the copy. The simplest case is to also turn gso off, but with older ethtools it is not possible and you hit the bug. If I understood correctly Jarek's patch, he wants to allocate a page and copy linear content of the skb there, so this will end up being the same case as with splicing from the file. And while generally this may be a good idea, but with usual 1.5k mtu this will copy every skb, which is rather bad idea. Will have to think :) -- Evgeniy Polyakov --
On Wed, Jan 07, 2009 at 04:02:59PM +0300, Evgeniy Polyakov wrote: As I mentioned earlier, I (poorly) tried to realize Changli's idea only, and it's more to verify the reason still (after it failed with SG method...). Thanks, Jarek P. --
Jarek, since it now works when disabling both SG and GSO, we know that the bug is triggered when passing through do_tcp_sendpage(). Thus, I will not try your patch #3 yet, but rather wait for fix candidates to try. Thanks! Willy --
Sure! Congratulations for debugging to you and Evgeniy! Jarek P. --
On Wed, Jan 07, 2009 at 01:22:58PM +0000, Jarek Poplawski wrote: ...and Herbert! (I missed the beginning of this GSO trail...) Jarek P. --
Hi Evgeniy, Maybe I wasn't very clear. It's only when splicing from a socket to another socket that it breaks. Splicing from a file to a socket is OK (sendfile is OK too BTW). Splicing from a socket to a file is OK. And BTW, the receiving side must be on a real interface, not loopback. Here are my observations : - splice data from lo to lo => no corruption at all - splice data from lo to eth0 => no corruption at all - splice data from eth0 to lo => occasional corruption - splice data from eth0 to eth0 => massive corruption I figured it was an absolute necessity. The original code in my proxy is in an experimental state and far too hard to debug for these purposes. It was enough to detect the problem, but I could run a lot more tests with this small test app ! An who know, maybe it will serve as an example for non-blocking splice ;-) Cheers, Willy --
Just to throw some more (hacky) example code into the pool, below is
an echo server that I hacked up to test nonblocking splice(). (You'll
need sf.net/projects/libivykis to use it.) I also have a splice()
discard server and a patch to my intercept-connection-via-iptables-and-
forward-it-to-a-remote-SOCKS5-server-to-deal-with-crappy-VPNs app to
use splice() somewhere.
My main annoyances with splice(2) are/were:
1. -EAGAIN return on splice from socket/pipe to socket/pipe doesn't
directly tell you whether the source ran out of data or the
destination can't accept more data, which means you can't e.g. use
epoll in edge triggered mode without jumping through some minor
number of extra hoops. (For a pipe you can keep track of how many
bytes are in it by hand, but for a socket->pipe splice -EAGAIN return
you'll have to assume that the pipe is full if there are >0 bytes in
it.)
2. Because of (1), and because when splicing from a socket to a pipe
it returns after the first bit of data (you mentioned this as well),
you don't know at that point whether your pipe is full or not.
3. Always returns -EAGAIN even if there was a FIN or error on the
source socket. (Now fixed.)
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <iv.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>
struct connection {
struct iv_fd sock;
int pfd[2];
int pipe_bytes;
int saw_fin;
};
static void conn_kill(struct connection *conn)
{
iv_unregister_fd(&conn->sock);
close(conn->sock.fd);
close(conn->pfd[0]);
close(conn->pfd[1]);
free(conn);
}
static void conn_pollin(void *_conn);
static void conn_pollout(void *_conn);
static void conn_pollerr(void *_conn);
static void conn_pollin(void *_conn)
{
struct connection *conn = (struct connection ...I proceeded the same way : if EAGAIN and data still in the pipe, then In fact this is fixed now. tcp_splice_read() returns all available data, which somewhat hides problem #1. I'm running with 23 kB in a push/pull Yes I saw your fix, it was indeed very annoying because the only workaround I found was to perform an recv(MSG_PEEK) on the socket after each EAGAIN to check whether the connection was closed or not. For these reasons, I'd really love to see the few recent fixes backported to -stable ASAP. It will boost splice() adoption among products. Regards, Willy --
