Hi, I am observing intermittent TCP-MD5 checksum failures (CONFIG_TCP_MD5SIG) on kernel 2.6.31 while talking to a BGP router. The problem is only seen in multi-core 64 bit machines. Is there any known bug in the per_cpu_ptr implementation (I am aware that the percpu allocator has been re-implemented in 2.6.33) that might cause a corruption in 64 bit SMP machines? Any pointers would be appreciated. Thanks, Bhaskar --
There was another recent report of incorrect MD5 signatures in <http://thread.gmane.org/gmane.linux.network/159556>, but without any response. Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Communications Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. --
I found another thread posted back in Jan 2007 with a similar bug (x86_64 on 2.6.20) but no replies to that as well. http://lkml.org/lkml/2007/1/20/56 Bhaskar --
On Tue, 4 May 2010 19:58:32 +0530 2.6.20 had lots of other MD5 bugs. Your problem might be related to GRO. MD5 may not handle multi-fragment packets. --
I am getting the issue on 2.6.31 and 2.6.28 (gro infrastructure was added in 2.6.29). Also, both segmentation offloading as well as receive offloading (gso/gro) are turned off. Moreover outgoing TCP packets are the ones with the corrupt checksums. Both tcpdump on my local machine and the BGP router on the other side complain of the bad checksums with the same packet. I am trying to figure out if there is something in the per-cpu implementation that might be causing a corruption (SMP and x86_64) but I am not really getting anywhere. I am trying to reproduce the bad checksums with the latest kernel sources since it has a new implementation of the percpu allocator. Any pointers would be highly appreciated! Thanks, Bhaskar --
On Tue, 4 May 2010 22:38:49 +0530 First turn off all offload settings on the device (TSO,GSO,SG,CSUM) then check that size of the bad packets. Are they fragmented or just simple linear packets? -- --
On Tue, May 4, 2010 at 10:43 PM, Stephen Hemminger Hi, TSO, GSO and SG are already turned off. rx/tx checksumming is on, but that shouldn't matter, right? # ethtool -k eth0 Offload parameters for eth0: rx-checksumming: on tx-checksumming: on scatter-gather: off tcp segmentation offload: off udp fragmentation offload: off generic segmentation offload: off The bad packets are very small in size, most have no data at all (<300 bytes). After adding some logs to kernel 2.6.31-12, it seems that tcp_v4_md5_hash_skb (function that calculates the md5 hash) is (might?) getting corrupt. The tcp4_pseudohdr (bp = &hp->md5_blk.ip4) structure's saddr, daddr and len fields get modified to different values towards the end of the tcp_v4_md5_hash_skb function whenever there is a checksum error. The tcp4_pseudohdr (bp) is within the tcp_md5sig_pool (hp), which is filled up by tcp_get_md5sig_pool (which calls per_cpu_ptr). Using a local copy of the tcp4_pseudohdr in the same function tcp_v4_md5_hash_skb (copied all fields from the original tcp4_pseudohdr within the tcp_md5sig_pool) and calculating the md5 checksum with the local tcp4_pseudohdr seems to solve the issue (don't see bad packets for a hours in load tests, and without the change I can see them instantaneously in the load tests). I am still unable to figure out how this is happening. Please let me know if you have any pointers. Thanks a lot! Bhaskar --
I am not familiar with this code, but I suspect same per_cpu data can be
used at both time by a sender (process context) and by a receiver
(softirq context).
To trigger this, you need at least two active md5 sockets.
tcp_get_md5sig_pool() should probably disable bh to make sure current
cpu wont be preempted by softirq processing
Something like :
diff --git a/include/net/tcp.h b/include/net/tcp.h
index fb5c66b..e232123 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1221,12 +1221,15 @@ struct tcp_md5sig_pool *tcp_get_md5sig_pool(void)
struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
if (!ret)
put_cpu();
+ else
+ local_bh_disable();
return ret;
}
static inline void tcp_put_md5sig_pool(void)
{
__tcp_put_md5sig_pool();
+ local_bh_enable();
put_cpu();
}
--
I put in the above change and ran some load tests with around 50 active TCP connections doing MD5. I could see only 1 bad packet in 30 min (earlier the problem used to occur instantaneously and repeatedly). I think there is another possibility of being preempted when calling tcp_alloc_md5sig_pool() this function releases the spinlock when calling __tcp_alloc_md5sig_pool(). I will run some more tests after changing the tcp_alloc_md5sig_pool and see if the problem is completely resolved. Thanks a lot for your help! Bhaskar --
This code should be completely rewritten for linux-2.6.35, its very ugly and over complex, yet it is not scalable. It could use true percpu data, with no central lock or refcount. --
From: Eric Dumazet <eric.dumazet@gmail.com> Yes I've always disliked the way the TCP MD5 pool stuff is coded, it's frankly crap and this is like the 5th major bug that had to get fixed in it. :-) But let's fix this as simply as possible in net-2.6 and -stable, so Eric let me know when you have a tested patch for me to apply. Thanks a lot! --
There are so many things ... I am comtemplating commit aa133076 __tcp_alloc_md5sig_pool() now do a : p = kzalloc(sizeof(*p), sk->sk_allocation); But later call : hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC); (GFP_KERNEL allocations for sure) tcp_v4_parse_md5_keys() also use : p = kzalloc(sizeof(*p), sk->sk_allocation); right after a (possibly sleeping) copy_from_user() Oh well... I'll test the already suggested patch before official submission, thanks. --
On Fri, 07 May 2010 07:32:09 +0200 Forget the per cpu data; the pool should just be scrapped. The only reason the pool exists is that the crypto hash state which should just be moved into the md5_info (88 bytes). The pseudo header can just be generated on the stack before passing to the crypto code. --
Sure, but I'm afraid there is no generic API do do that (if we want to reuse crypto/md5.c code) --
On Fri, 07 May 2010 19:21:33 +0200 It looks like the pool is just an optimization to avoid opening too many crypto API connections. This should only be an issue if offloading MD5. --
You mean we could allocate two contexts per socket, one for tx path, one for rx path, but TCP-MD5 implementors chose to use percpu allocations to factorize them. They should have allocated two contexts per cpu (one for process context, preemption disabled, one for BH context) As you said, this could be allocated on stack, with some changes to crypto API I guess. Since TCP is not a module, md5 is also static, so there is no module loading involved. struct crypto_tfm *__crypto_alloc_tfm(struct crypto_alg *alg, u32 type,u32 mask) --> struct crypto_tfm *__crypto_alloc_tfm_onstack(struct crypto_alg *alg, u32 type, u32 mask, void *storage, size_t storage_max_length) Or a direct plug to crypto/md5.c functions and hand crafted context. --
Hi, I had noticed the corruption in the context and actually did what is mentioned. I allocated the context on the stack and plugged in the md5.c functions. I was able to temporarily solve the problem, all this before I got a response on this thread. But now I have seeing another problem, when i change the MTU on the interface from 1500 to 4470 none of the message from the peer get thru and I get hash failed message. I am wondering if this is another bug getting hit in this scenario. Regards, Bijay --
Thats very fine, but you mix very different problems. Step by step resolution is required, and clean patches too, because plugging md5.c functions is not an option for stable series :) Obviously, nobody seriously used TCP-MD5 on linux, but you... --
Hi Eric, Didn't intend to mix the issues. It was a hack intended to calm the nerves. I am going to apply the proper patches asap. About the latest problem of MD5 not working with MTU set to 4470. I noticed this when i needed to change the MTU for some other purpose. Since it was a production box, i have to first set up my box with the right NIC card to reproduce this and try debugging it. In the meantime any ques will help. Thanks, Bijay --
Hi Eric, I guess that makes me the enviable one. So I am keen to test out this feature completely, as long as I know what to do as a next step, directions, patches. Thanks, Bijay --
MTU > 4000 is not reliable because of high order allocations on typical NICS. I am afraid you need NIC able to deliver page fragments. Its working here (32bit kernel) with a tg3 NIC, but I got following message : ifconfig eth3 mtu 9000 ... [51492.936500] 167731 total pagecache pages [51492.936500] 0 pages in swap cache [51492.936500] Swap cache stats: add 0, delete 0, find 0/0 [51492.936500] Free swap = 4192928kB [51492.936500] Total swap = 4192928kB [51492.936500] 1114110 pages RAM [51492.936500] 885761 pages HighMem [51492.936500] 77073 pages reserved [51492.936500] 134483 pages shared [51492.936500] 159131 pages non-shared [51492.953027] tg3 0000:14:04.1: eth3: Using a smaller RX standard ring. Only 183 out of 511 buffers were allocated successfully $ ethtool -g eth3 Ring parameters for eth3: Pre-set maximums: RX: 511 RX Mini: 0 RX Jumbo: 0 TX: 511 Current hardware settings: RX: 183 RX Mini: 0 RX Jumbo: 0 TX: 511 $ cat /proc/buddyinfo Node 0, zone DMA 5 2 1 1 2 2 2 1 0 1 0 Node 0, zone Normal 4285 1823 248 73 9 5 0 0 0 0 0 Node 0, zone HighMem 97 199 921 583 383 261 155 117 69 41 649 I know that if I try to stress RX path, I'll get failures. Could you explain me why you need both big MTUS and TCP-MD5 ? --
I need MD5 for my BGP sessions and need the jumbo packets for the IS-IS peering. MTU of 1500 results in LSPs higher that 1500 getting dropped at the peering router. --
I believe third problem comes from commit 4957faad (TCPCT part 1g: Responder Cookie => Initiator), from William Allen Simpson. When a SYN-ACK packet is built (in tcp_synack_options()), it specifically forbids a TIMESTAMP option to be included if SACK is also selected : doing_ts &= !ireq->sack_ok; Problem is this mask is done on a local variable. socket is still marked as being timestamp enabled. Later, when we build tcp options for data packets, we _include_ a timestamp, while our SYNACK didnt mention the option. So the following trafic can happen (and fails) : 18:38:29.041966 IP 192.168.0.33.58906 > 192.168.0.56.22226: Flags [S], seq 4014064674, win 8860, options [mss 4430,sackOK,TS val 519041 ecr 0,nop,wscale 7,nop,nop,md5can't check - 9b44126367effcf3247fcbf6da76b24d], length 0 18:38:29.042072 IP 192.168.0.56.22226 > 192.168.0.33.58906: Flags [S.], seq 586328714, ack 4014064675, win 5792, options [nop,nop,md5can't check - badd847799ded46f39642c341cc7e92b,mss 1460,nop,nop,sackOK,nop,wscale 7], length 0 18:38:29.042093 IP 192.168.0.33.58906 > 192.168.0.56.22226: Flags [.], ack 1, win 70, options [nop,nop,md5can't check - 3994ef6987df02a592963fba04c5d313], length 0 18:38:29.043217 IP 192.168.0.33.58906 > 192.168.0.56.22226: Flags [.], seq 1:1441, ack 1, win 70, options [nop,nop,md5can't check - 8399f7ccab3a6b8c5a3027ed58bba314], length 1440 18:38:29.043226 IP 192.168.0.33.58906 > 192.168.0.56.22226: Flags [P.], seq 1441:2501, ack 1, win 70, options [nop,nop,md5can't check - 701ebf65b1894a6bed4cefbf7a56596a], length 1060 18:38:29.043374 IP 192.168.0.56.22226 > 192.168.0.33.58906: Flags [.], ack 1441, win 68, options [nop,nop,md5can't check - 1badb315ba436ab59bff5b37daa871be,nop,nop,TS val 113051377 ecr 519041], length 0 18:38:29.043383 IP 192.168.0.56.22226 > 192.168.0.33.58906: Flags [.], ack 2501, win 91, options [nop,nop,md5can't check - 120564dcb99f822f3b70910282a6ed9d,nop,nop,TS val 113051377 ecr 519041], length 0 18:38:29.043673 IP 192.168.0.56.22226 > ...
And a fourth problem might be that tcp_md5_hash_skb_data() is not frag_list aware ? diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8ce2974..56ee0f2 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2985,6 +2985,7 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp, const unsigned head_data_len = skb_headlen(skb) > header_len ? skb_headlen(skb) - header_len : 0; const struct skb_shared_info *shi = skb_shinfo(skb); + struct sk_buff *frag_iter; sg_init_table(&sg, 1); @@ -2999,6 +3000,10 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp, return 1; } + skb_walk_frags(skb, frag_iter) + if (tcp_md5_hash_skb_data(hp, frag_iter, 0)) + return 1; + return 0; } --
On Wed, 12 May 2010 05:20:21 +0200 Yes, that looks like a possible bug, not sure what hardware generates frag_list. --
From: Stephen Hemminger <shemminger@vyatta.com> GRO generates frag_list --
ixgbe (82599) too, if I understand well this driver (TCP Receive Side Coalescing RSC) --
Bijay, had you tested this patch by any chance ? Thanks --
I am on quite an old kernel 2.6.27 and could not apply your patches. --
Thanks again for the status report. I see bug is older than what I stated in my previous mail I could reproduce it in my lab and confirm following patch fixes it This is a stable candidate (2.6.27 kernels) Thanks [PATCH] tcp: tcp_synack_options() fix Commit 33ad798c924b4a (tcp: options clean up) introduced a problem if MD5+SACK+timestamps were used in initial SYN message. Some stacks (old linux for example) try to negotiate MD5+SACK+TSTAMP sessions, but since 20 bytes of tcp options space are not enough to store all the bits needed, we chose to disable timestamps in this case. We send a SYN-ACK _without_ timestamp option, but socket has timestamps enabled and all further outgoing messages contain a TS block, all with the initial timestamp of the remote peer. Fix is to really disable timestamps option for the whole session. Reported-by: Bijay Singh <Bijay.Singh@guavus.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/ipv4/tcp_output.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 0dda86e..b8bb226 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -667,7 +667,7 @@ static unsigned tcp_synack_options(struct sock *sk, u8 cookie_plus = (xvp != NULL && !xvp->cookie_out_never) ? xvp->cookie_plus : 0; - bool doing_ts = ireq->tstamp_ok; + bool doing_ts; #ifdef CONFIG_TCP_MD5SIG *md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req); @@ -680,11 +680,12 @@ static unsigned tcp_synack_options(struct sock *sk, * rather than TS in order to fit in better with old, * buggy kernels, but that was deemed to be unnecessary. */ - doing_ts &= !ireq->sack_ok; + ireq->tstamp_ok &= !ireq->sack_ok; } #else *md5 = NULL; #endif + doing_ts = ireq->tstamp_ok; /* We always send an MSS option. */ opts->mss = mss; --
On Mon, 17 May 2010 07:03:49 +0200 Make this gets back to stable as well. -- --
On Mon, 17 May 2010 07:03:49 +0200
Since you are doing away with flag variable, why not this instead?
--- a/net/ipv4/tcp_output.c 2010-05-17 13:38:32.822025583 -0700
+++ b/net/ipv4/tcp_output.c 2010-05-17 13:41:47.321734775 -0700
@@ -668,7 +668,6 @@ static unsigned tcp_synack_options(struc
u8 cookie_plus = (xvp != NULL && !xvp->cookie_out_never) ?
xvp->cookie_plus :
0;
- bool doing_ts = ireq->tstamp_ok;
#ifdef CONFIG_TCP_MD5SIG
*md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
@@ -681,7 +680,7 @@ static unsigned tcp_synack_options(struc
* rather than TS in order to fit in better with old,
* buggy kernels, but that was deemed to be unnecessary.
*/
- doing_ts &= !ireq->sack_ok;
+ ireq->tstamp_ok &= !ireq->sack_ok;
}
#else
*md5 = NULL;
@@ -696,7 +695,7 @@ static unsigned tcp_synack_options(struc
opts->options |= OPTION_WSCALE;
remaining -= TCPOLEN_WSCALE_ALIGNED;
}
- if (likely(doing_ts)) {
+ if (likely(ireq->tstamp_ok)) {
opts->options |= OPTION_TS;
opts->tsval = TCP_SKB_CB(skb)->when;
opts->tsecr = req->ts_recent;
@@ -704,7 +703,7 @@ static unsigned tcp_synack_options(struc
}
if (likely(ireq->sack_ok)) {
opts->options |= OPTION_SACK_ADVERTISE;
- if (unlikely(!doing_ts))
+ if (unlikely(!ireq->tstamp_ok))
remaining -= TCPOLEN_SACKPERM_ALIGNED;
}
@@ -712,7 +711,7 @@ static unsigned tcp_synack_options(struc
* If the <SYN> options fit, the same options should fit now!
*/
if (*md5 == NULL &&
- doing_ts &&
+ ireq->tstamp_ok &&
cookie_plus > TCPOLEN_COOKIE_BASE) {
int need = cookie_plus; /* has TCPOLEN_COOKIE_BASE */
--
--
Sure, we can eliminate this doing_ts variable and save few bytes
Thanks
[PATCH] tcp: tcp_synack_options() fix
Commit 33ad798c924b4a (tcp: options clean up) introduced a problem
if MD5+SACK+timestamps were used in initial SYN message.
Some stacks (old linux for example) try to negotiate MD5+SACK+TSTAMP
sessions, but since 40 bytes of tcp options space are not enough to
store all the bits needed, we chose to disable timestamps in this case.
We send a SYN-ACK _without_ timestamp option, but socket has timestamps
enabled and all further outgoing messages contain a TS block, all with
the initial timestamp of the remote peer.
Fix is to really disable timestamps option for the whole session.
Reported-by: Bijay Singh <Bijay.Singh@guavus.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0dda86e..44e98d9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -667,7 +667,6 @@ static unsigned tcp_synack_options(struct sock *sk,
u8 cookie_plus = (xvp != NULL && !xvp->cookie_out_never) ?
xvp->cookie_plus :
0;
- bool doing_ts = ireq->tstamp_ok;
#ifdef CONFIG_TCP_MD5SIG
*md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
@@ -680,7 +679,7 @@ static unsigned tcp_synack_options(struct sock *sk,
* rather than TS in order to fit in better with old,
* buggy kernels, but that was deemed to be unnecessary.
*/
- doing_ts &= !ireq->sack_ok;
+ ireq->tstamp_ok &= !ireq->sack_ok;
}
#else
*md5 = NULL;
@@ -695,7 +694,7 @@ static unsigned tcp_synack_options(struct sock *sk,
opts->options |= OPTION_WSCALE;
remaining -= TCPOLEN_WSCALE_ALIGNED;
}
- if (likely(doing_ts)) {
+ if (likely(ireq->tstamp_ok)) {
opts->options |= OPTION_TS;
opts->tsval = TCP_SKB_CB(skb)->when;
opts->tsecr = req->ts_recent;
@@ -703,7 +702,7 @@ static unsigned tcp_synack_options(struct sock *sk,
}
if (likely(ireq->sack_ok)) {
opts->options |= ...From: Eric Dumazet <eric.dumazet@gmail.com> Applied, thanks! --
From: Stephen Hemminger <shemminger@vyatta.com> It's an issue because creating a crypto API context is expensive, so this influences our connection rates with MD5. --
Let me quickly point out that the IETF has recently replaced TCP-MD5 with TCP-AO (http://tools.ietf.org/html/draft-ietf-tcpm-tcp-auth-opt-11) which is days or weeks away from being published as an RFC. (The draft has been approved and it waiting for the RFC Editor to finish the final copy-editing.) You'll obviously still want TCP-MD5 support for talking to existing equipment, but significant cycles would IMO be better spent on TCP-AO. Lars
From: Lars Eggert <lars.eggert@nokia.com> Code we have and users use which is unstable and crashes is more important to work on and fix than code we don't have which user's therefore don't use. You're wrong from just about every possible angle. Whoever finds AO useful will work on it, just as was the case with MD5 support. And right now, that "whoever" is definitely not us. :-) --
I cant see a race with spinlock in tcp_alloc_md5sig_pool/__tcp_alloc_md5sig_pool(). We allocate structures for all cpus, so preemption.migration should be OK Could you elaborate please ? --
Here is my official patch submission, could you please test it ?
Thanks
[PATCH] tcp: fix MD5 (RFC2385) support
TCP MD5 support uses percpu data for temporary storage. It currently
disables preemption so that same storage cannot be reclaimed by another
thread on same cpu.
We also have to make sure a softirq handler wont try to use also same
context. Various bug reports demonstrated corruptions.
Fix is to disable preemption and BH.
Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/tcp.h | 21 +++------------------
net/ipv4/tcp.c | 34 ++++++++++++++++++++++++----------
2 files changed, 27 insertions(+), 28 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 75be5a2..aa04b9a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1197,30 +1197,15 @@ extern int tcp_v4_md5_do_del(struct sock *sk,
extern struct tcp_md5sig_pool * __percpu *tcp_alloc_md5sig_pool(struct sock *);
extern void tcp_free_md5sig_pool(void);
-extern struct tcp_md5sig_pool *__tcp_get_md5sig_pool(int cpu);
-extern void __tcp_put_md5sig_pool(void);
+extern struct tcp_md5sig_pool *tcp_get_md5sig_pool(void);
+extern void tcp_put_md5sig_pool(void);
+
extern int tcp_md5_hash_header(struct tcp_md5sig_pool *, struct tcphdr *);
extern int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *, struct sk_buff *,
unsigned header_len);
extern int tcp_md5_hash_key(struct tcp_md5sig_pool *hp,
struct tcp_md5sig_key *key);
-static inline
-struct tcp_md5sig_pool *tcp_get_md5sig_pool(void)
-{
- int cpu = get_cpu();
- struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
- if (!ret)
- put_cpu();
- return ret;
-}
-
-static inline void tcp_put_md5sig_pool(void)
-{
- __tcp_put_md5sig_pool();
- put_cpu();
-}
-
/* write queue abstraction */
static inline void tcp_write_queue_purge(struct sock *sk)
{
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0f8caf6..296150b ...Eric, Thanks a lot! I will test it out and let you know. BTW this patch seems to essentially do the same as the earlier fix you had posted (where you just do bh disable/enable). Am I missing something? With the earlier fix, I ran load tests with 80 TCP connections for over 6 hrs and found 5 bad checksum packets. So there is still a problem. Without the fix I see a bad packet every minute or so. Bhaskar --
My second patch is cleaner, using only out of line code (inline was not necessary and made include file bigger than necessary). Inline is fine if we can avoid a function call, but it was not the case. If you notice another corruption, it may be because of another problem, yet to be discovered. To you have a userland suite to test/stress tcp md5 connections ? --
We are still trying to find the other corruption. I will send you the tarball of a userland client/server suite to stress test the TCP-MD5 that we've been using to reproduce the issue. Thanks, Bhaskar --
OK, I found the second problem. if/when IP route cache is invalidated, ip_queue_xmit() has to refetch a route and calls sk_setup_caps(sk, &rt->u.dst), destroying the sk->sk_route_caps &= ~NETIF_F_GSO_MASK that MD5 desesperatly try to make all over its way (from tcp_transmit_skb() for example) So we send few bad packets, and everything is fine when tcp_transmit_skb() is called again. You get many errors on remote peer if you do ip route flush cache --
I am testing following patch :
include/net/sock.h | 8 ++++++++
net/core/sock.c | 1 +
net/ipv4/tcp_ipv4.c | 6 +++---
net/ipv4/tcp_output.c | 2 +-
net/ipv6/tcp_ipv6.c | 4 ++--
5 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..abfadfe 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -177,6 +177,7 @@ struct sock_common {
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
* @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
+ * @sk_route_nocaps: forbidden route capabilities (e.g NETIF_F_GSO_MASK)
* @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
* @sk_gso_max_size: Maximum GSO segment size to build
* @sk_lingertime: %SO_LINGER l_linger setting
@@ -276,6 +277,7 @@ struct sock {
int sk_forward_alloc;
gfp_t sk_allocation;
int sk_route_caps;
+ int sk_route_nocaps;
int sk_gso_type;
unsigned int sk_gso_max_size;
int sk_rcvlowat;
@@ -1257,6 +1259,12 @@ static inline int sk_can_gso(const struct sock *sk)
extern void sk_setup_caps(struct sock *sk, struct dst_entry *dst);
+static inline void sk_nocaps_add(struct sock *sk, int flags)
+{
+ sk->sk_route_nocaps |= flags;
+ sk->sk_route_caps &= ~flags;
+}
+
static inline int skb_copy_to_page(struct sock *sk, char __user *from,
struct sk_buff *skb, struct page *page,
int off, int copy)
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..5056a6a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1227,6 +1227,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
sk->sk_route_caps = dst->dev->features;
if (sk->sk_route_caps & NETIF_F_GSO)
sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
+ sk->sk_route_caps &= ~sk->sk_route_nocaps;
if (sk_can_gso(sk)) {
if (dst->header_len) {
sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
diff --git ...Patch solves the problem for me. I tested it with 200 MD5 sockets
established between two 16 cpus machine, with a multiqueue NIC. Trafic
of 100.000 pps per second, and "ip route flush cache" every minute on
both machines. After five hours, not a single frame had a bad hash
value.
Here is the official submission.
[PATCH] net: Introduce sk_route_nocaps
TCP-MD5 sessions have intermittent failures, when route cache is
invalidated. ip_queue_xmit() has to find a new route, calls
sk_setup_caps(sk, &rt->u.dst), destroying the
sk->sk_route_caps &= ~NETIF_F_GSO_MASK
that MD5 desperately try to make all over its way (from
tcp_transmit_skb() for example)
So we send few bad packets, and everything is fine when
tcp_transmit_skb() is called again for this socket.
Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.
Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/sock.h | 8 ++++++++
net/core/sock.c | 1 +
net/ipv4/tcp_ipv4.c | 6 +++---
net/ipv4/tcp_output.c | 2 +-
net/ipv6/tcp_ipv6.c | 4 ++--
5 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..abfadfe 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -177,6 +177,7 @@ struct sock_common {
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
* @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
+ * @sk_route_nocaps: forbidden route capabilities (e.g NETIF_F_GSO_MASK)
* @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
* @sk_gso_max_size: Maximum GSO segment size to build
* @sk_lingertime: %SO_LINGER l_linger setting
@@ -276,6 +277,7 @@ struct sock {
int sk_forward_alloc;
gfp_t sk_allocation;
int sk_route_caps;
+ int sk_route_nocaps;
...From: Eric Dumazet <eric.dumazet@gmail.com> Since the connection does recover eventually, I'm stuffing this into net-next-2.6 into net-2.6 After some time in net-next-2.6, we can submit it to -stable. --
From: Eric Dumazet <eric.dumazet@gmail.com> Applied. --
