Re: TCP-MD5 checksum failure on x86_64 SMP

Previous thread: [PATCH] IPv6: fix IPV6_RECVERR handling of locally-generated errors by Brian Haley on Monday, May 3, 2010 - 6:44 pm. (1 message)

Next thread: [PATCH] net-next: remove useless union keyword by Changli Gao on Monday, May 3, 2010 - 8:04 pm. (3 messages)
From: Bhaskar Dutta
Date: Monday, May 3, 2010 - 8:30 pm

Hi,

I am observing intermittent TCP-MD5 checksum failures
(CONFIG_TCP_MD5SIG)  on kernel 2.6.31 while talking to a BGP router.

The problem is only seen in multi-core 64 bit machines.
Is there any known bug in the per_cpu_ptr implementation (I am aware
that the percpu allocator has been re-implemented in 2.6.33) that
might cause a corruption in 64 bit SMP machines?

Any pointers would be appreciated.

Thanks,
Bhaskar
--

From: Ben Hutchings
Date: Tuesday, May 4, 2010 - 4:32 am

There was another recent report of incorrect MD5 signatures in
<http://thread.gmane.org/gmane.linux.network/159556>, but without any
response.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Bhaskar Dutta
Date: Tuesday, May 4, 2010 - 7:28 am

I found another thread posted back in Jan 2007 with a similar bug
(x86_64 on 2.6.20) but no replies to that as well.
http://lkml.org/lkml/2007/1/20/56

Bhaskar
--

From: Stephen Hemminger
Date: Tuesday, May 4, 2010 - 9:12 am

On Tue, 4 May 2010 19:58:32 +0530

2.6.20 had lots of other MD5 bugs. Your problem might be related to
GRO.  MD5 may not handle multi-fragment packets.
--

From: Bhaskar Dutta
Date: Tuesday, May 4, 2010 - 10:08 am

I am getting the issue on 2.6.31 and 2.6.28 (gro infrastructure was
added in 2.6.29).
Also, both segmentation offloading as well as receive offloading
(gso/gro) are turned off.

Moreover outgoing TCP packets are the ones with the corrupt checksums.
Both tcpdump on my local machine and the BGP router on the other side
complain of the bad checksums with the same packet.

I am trying to figure out if there is something in the per-cpu
implementation that might be causing a corruption (SMP and x86_64) but
I am not really getting anywhere.

I am trying to reproduce the bad checksums with the latest kernel
sources since it has a new implementation of the percpu allocator.

Any pointers would be highly appreciated!

Thanks,
Bhaskar
--

From: Stephen Hemminger
Date: Tuesday, May 4, 2010 - 10:13 am

On Tue, 4 May 2010 22:38:49 +0530


First turn off all offload settings on the device (TSO,GSO,SG,CSUM)
then check that size of the bad packets. Are they fragmented or
just simple linear packets?



-- 
--

From: Bhaskar Dutta
Date: Wednesday, May 5, 2010 - 11:03 am

On Tue, May 4, 2010 at 10:43 PM, Stephen Hemminger

Hi,

TSO, GSO and SG are already turned off.
rx/tx checksumming is on, but that shouldn't matter, right?

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

The bad packets are very small in size, most have no data at all (<300 bytes).

After adding some logs to kernel 2.6.31-12, it seems that
tcp_v4_md5_hash_skb (function that calculates the md5 hash) is
(might?) getting corrupt.

The tcp4_pseudohdr (bp = &hp->md5_blk.ip4) structure's saddr, daddr
and len fields get modified to different values towards the end of the
tcp_v4_md5_hash_skb function whenever there is a checksum error.

The tcp4_pseudohdr (bp) is within the tcp_md5sig_pool (hp), which is
filled up by tcp_get_md5sig_pool (which calls per_cpu_ptr).

Using a local copy of the tcp4_pseudohdr in the same function
tcp_v4_md5_hash_skb (copied all fields from the original
tcp4_pseudohdr within the tcp_md5sig_pool) and calculating the md5
checksum with the local  tcp4_pseudohdr seems to solve the issue
(don't see bad packets for a hours in load tests, and without the
change I can see them instantaneously in the load tests).

I am still unable to figure out how this is happening. Please let me
know if you have any pointers.

Thanks a lot!
Bhaskar
--

From: Eric Dumazet
Date: Wednesday, May 5, 2010 - 11:53 am

I am not familiar with this code, but I suspect same per_cpu data can be
used at both time by a sender (process context) and by a receiver
(softirq context).

To trigger this, you need at least two active md5 sockets.

tcp_get_md5sig_pool() should probably disable bh to make sure current
cpu wont be preempted by softirq processing


Something like :

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fb5c66b..e232123 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1221,12 +1221,15 @@ struct tcp_md5sig_pool          *tcp_get_md5sig_pool(void)
 	struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
 	if (!ret)
 		put_cpu();
+	else
+		local_bh_disable();
 	return ret;
 }
 
 static inline void             tcp_put_md5sig_pool(void)
 {
 	__tcp_put_md5sig_pool();
+	local_bh_enable();
 	put_cpu();
 }


--

From: Bhaskar Dutta
Date: Thursday, May 6, 2010 - 4:55 am

I put in the above change and ran some load tests with around 50
active TCP connections doing MD5.
I could see only 1 bad packet in 30 min (earlier the problem used to
occur instantaneously and repeatedly).

I think there is another possibility of being preempted when calling
tcp_alloc_md5sig_pool()
this function releases the spinlock when calling __tcp_alloc_md5sig_pool().

I will run some more tests after changing the  tcp_alloc_md5sig_pool
and see if the problem is completely resolved.

Thanks a lot for your help!
Bhaskar
--

From: Eric Dumazet
Date: Thursday, May 6, 2010 - 5:06 am

This code should be completely rewritten for linux-2.6.35, its very ugly
and over complex, yet it is not scalable.

It could use true percpu data, with no central lock or refcount.



--

From: David Miller
Date: Thursday, May 6, 2010 - 10:04 pm

From: Eric Dumazet <eric.dumazet@gmail.com>

Yes I've always disliked the way the TCP MD5 pool stuff is coded, it's
frankly crap and this is like the 5th major bug that had to get fixed
in it. :-)

But let's fix this as simply as possible in net-2.6 and -stable, so Eric
let me know when you have a tested patch for me to apply.

Thanks a lot!

--

From: Eric Dumazet
Date: Thursday, May 6, 2010 - 10:32 pm

There are so many things ...

I am comtemplating commit aa133076

__tcp_alloc_md5sig_pool() now do a : 

p = kzalloc(sizeof(*p), sk->sk_allocation);

But later call :

hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);

(GFP_KERNEL allocations for sure)


tcp_v4_parse_md5_keys() also use :

p = kzalloc(sizeof(*p), sk->sk_allocation);

right after a (possibly sleeping) copy_from_user()

Oh well...


I'll test the already suggested patch before official submission,
thanks.



--

From: Stephen Hemminger
Date: Friday, May 7, 2010 - 10:14 am

On Fri, 07 May 2010 07:32:09 +0200

Forget the per cpu data; the pool should just be scrapped.

The only reason the pool exists is that the crypto hash state which
should just be moved into the md5_info (88 bytes).  The pseudo
header can just be generated on the stack before passing to the crypto
code.

--

From: Eric Dumazet
Date: Friday, May 7, 2010 - 10:21 am

Sure, but I'm afraid there is no generic API do do that (if we want to
reuse crypto/md5.c code)



--

From: Stephen Hemminger
Date: Friday, May 7, 2010 - 10:36 am

On Fri, 07 May 2010 19:21:33 +0200

It looks like the pool is just an optimization to avoid opening too
many crypto API connections.  This should only be an issue if offloading
MD5.
--

From: Eric Dumazet
Date: Friday, May 7, 2010 - 2:40 pm

You mean we could allocate two contexts per socket, one for tx path, one
for rx path, but TCP-MD5 implementors chose to use percpu allocations to
factorize them. They should have allocated two contexts per cpu (one for
process context, preemption disabled, one for BH context)

As you said, this could be allocated on stack, with some changes to
crypto API I guess. Since TCP is not a module, md5 is also static, so
there is no module loading involved.

struct crypto_tfm *__crypto_alloc_tfm(struct crypto_alg *alg, u32
type,u32 mask)

-->

struct crypto_tfm *__crypto_alloc_tfm_onstack(struct crypto_alg *alg,
u32 type, u32 mask, void *storage, size_t storage_max_length)


Or a direct plug to crypto/md5.c functions and hand crafted context.



--

From: Bijay Singh
Date: Monday, May 10, 2010 - 7:55 am

Hi,
I had noticed the corruption in the context and actually did what is mentioned.

I allocated the context on the stack and plugged in the md5.c functions. I was able to temporarily solve the problem, all this before I got a response on this thread.

But now I have seeing another problem, when i change the MTU on the interface from 1500 to 4470 none of the message from the peer get thru and I get hash failed message. I am wondering if this is another bug getting hit in this scenario.

Regards,
Bijay


--

From: Eric Dumazet
Date: Monday, May 10, 2010 - 8:18 am

Thats very fine, but you mix very different problems.

Step by step resolution is required, and clean patches too, because
plugging md5.c functions is not an option for stable series :)

Obviously, nobody seriously used TCP-MD5 on linux, but you...



--

From: Bijay Singh
Date: Monday, May 10, 2010 - 10:27 am

Hi Eric,

Didn't intend to mix  the issues. It was a hack intended to calm the nerves. I am going to apply the proper patches asap.

About the latest problem of MD5 not working with MTU set to 4470. I noticed this when i needed to change the MTU for some other  purpose. 

Since it was a production box, i have to first set up my box with the right NIC card to reproduce this and try debugging it. In the meantime any ques will  help.

Thanks,
Bijay
 

--

From: Bijay Singh
Date: Monday, May 10, 2010 - 9:08 pm

Hi Eric,

I guess that makes me the enviable one. So I am keen to test out this feature completely, as long as I know what to do as a next step, directions, patches.

Thanks,
Bijay


--

From: Eric Dumazet
Date: Monday, May 10, 2010 - 11:27 pm

MTU > 4000 is not reliable because of high order allocations on typical
NICS. I am afraid you need NIC able to deliver page fragments.

Its working here (32bit kernel) with a tg3 NIC, but I got following
message :

ifconfig eth3 mtu 9000
...
[51492.936500] 167731 total pagecache pages
[51492.936500] 0 pages in swap cache
[51492.936500] Swap cache stats: add 0, delete 0, find 0/0
[51492.936500] Free swap  = 4192928kB
[51492.936500] Total swap = 4192928kB
[51492.936500] 1114110 pages RAM
[51492.936500] 885761 pages HighMem
[51492.936500] 77073 pages reserved
[51492.936500] 134483 pages shared
[51492.936500] 159131 pages non-shared
[51492.953027] tg3 0000:14:04.1: eth3: Using a smaller RX standard ring.
Only 183 out of 511 buffers were allocated successfully

$ ethtool -g eth3
Ring parameters for eth3:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		183
RX Mini:	0
RX Jumbo:	0
TX:		511

$ cat /proc/buddyinfo
Node 0, zone      DMA      5      2      1      1      2      2      2      1      0      1      0 
Node 0, zone   Normal   4285   1823    248     73      9      5      0      0      0      0      0 
Node 0, zone  HighMem     97    199    921    583    383    261    155    117     69     41    649 

I know that if I try to stress RX path, I'll get failures.

Could you explain me why you need both big MTUS and TCP-MD5 ?



--

From: Bijay Singh
Date: Tuesday, May 11, 2010 - 1:23 am

I need MD5 for my BGP sessions and need the jumbo packets for the IS-IS peering. MTU of 1500 results in LSPs higher that 1500 getting dropped at the peering router. 


--

From: Eric Dumazet
Date: Tuesday, May 11, 2010 - 1:50 pm

I believe third problem comes from commit 4957faad
(TCPCT part 1g: Responder Cookie => Initiator), from William Allen
Simpson.

When a SYN-ACK packet is built (in tcp_synack_options()),
it specifically forbids a TIMESTAMP option to be included if SACK is
also selected :

doing_ts &= !ireq->sack_ok;

Problem is this mask is done on a local variable. socket is still marked
as being timestamp enabled.


Later, when we build tcp options for data packets, we _include_ a
timestamp, while our SYNACK didnt mention the option.  

So the following trafic can happen (and fails) :

18:38:29.041966 IP 192.168.0.33.58906 > 192.168.0.56.22226: Flags [S], seq 4014064674, win 8860, options [mss 4430,sackOK,TS val 519041 ecr 0,nop,wscale 7,nop,nop,md5can't check - 9b44126367effcf3247fcbf6da76b24d], length 0
18:38:29.042072 IP 192.168.0.56.22226 > 192.168.0.33.58906: Flags [S.], seq 586328714, ack 4014064675, win 5792, options [nop,nop,md5can't check - badd847799ded46f39642c341cc7e92b,mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
18:38:29.042093 IP 192.168.0.33.58906 > 192.168.0.56.22226: Flags [.], ack 1, win 70, options [nop,nop,md5can't check - 3994ef6987df02a592963fba04c5d313], length 0
18:38:29.043217 IP 192.168.0.33.58906 > 192.168.0.56.22226: Flags [.], seq 1:1441, ack 1, win 70, options [nop,nop,md5can't check - 8399f7ccab3a6b8c5a3027ed58bba314], length 1440
18:38:29.043226 IP 192.168.0.33.58906 > 192.168.0.56.22226: Flags [P.], seq 1441:2501, ack 1, win 70, options [nop,nop,md5can't check - 701ebf65b1894a6bed4cefbf7a56596a], length 1060
18:38:29.043374 IP 192.168.0.56.22226 > 192.168.0.33.58906: Flags [.], ack 1441, win 68, options [nop,nop,md5can't check - 1badb315ba436ab59bff5b37daa871be,nop,nop,TS val 113051377 ecr 519041], length 0
18:38:29.043383 IP 192.168.0.56.22226 > 192.168.0.33.58906: Flags [.], ack 2501, win 91, options [nop,nop,md5can't check - 120564dcb99f822f3b70910282a6ed9d,nop,nop,TS val 113051377 ecr 519041], length 0
18:38:29.043673 IP 192.168.0.56.22226 > ...
From: Eric Dumazet
Date: Tuesday, May 11, 2010 - 8:20 pm

And a fourth problem might be that tcp_md5_hash_skb_data() is not
frag_list aware ?




diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8ce2974..56ee0f2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2985,6 +2985,7 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp,
 	const unsigned head_data_len = skb_headlen(skb) > header_len ?
 				       skb_headlen(skb) - header_len : 0;
 	const struct skb_shared_info *shi = skb_shinfo(skb);
+	struct sk_buff *frag_iter;
 
 	sg_init_table(&sg, 1);
 
@@ -2999,6 +3000,10 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp,
 			return 1;
 	}
 
+	skb_walk_frags(skb, frag_iter)
+		if (tcp_md5_hash_skb_data(hp, frag_iter, 0))
+			return 1;
+
 	return 0;
 }
 


--

From: Stephen Hemminger
Date: Wednesday, May 12, 2010 - 3:22 pm

On Wed, 12 May 2010 05:20:21 +0200

Yes, that looks like a possible bug, not sure what hardware
generates frag_list.

--

From: David Miller
Date: Wednesday, May 12, 2010 - 3:24 pm

From: Stephen Hemminger <shemminger@vyatta.com>

GRO generates frag_list
--

From: Eric Dumazet
Date: Sunday, May 16, 2010 - 12:53 pm

ixgbe (82599) too, if I understand well this driver (TCP Receive Side
Coalescing RSC)



--

From: Eric Dumazet
Date: Sunday, May 16, 2010 - 1:48 pm

Bijay, had you tested this patch by any chance ?

Thanks


--

From: Bijay Singh
Date: Sunday, May 16, 2010 - 8:49 pm

I am on quite an old kernel 2.6.27 and could not apply your patches.


--

From: Eric Dumazet
Date: Sunday, May 16, 2010 - 10:03 pm

Thanks again for the status report.

I see bug is older than what I stated in my previous mail

I could reproduce it in my lab and confirm following patch fixes it

This is a stable candidate (2.6.27 kernels)

Thanks

[PATCH] tcp: tcp_synack_options() fix 

Commit 33ad798c924b4a (tcp: options clean up) introduced a problem
if MD5+SACK+timestamps were used in initial SYN message.

Some stacks (old linux for example) try to negotiate MD5+SACK+TSTAMP
sessions, but since 20 bytes of tcp options space are not enough to
store all the bits needed, we chose to disable timestamps in this case.

We send a SYN-ACK _without_ timestamp option, but socket has timestamps
enabled and all further outgoing messages contain a TS block, all with
the initial timestamp of the remote peer.

Fix is to really disable timestamps option for the whole session.

Reported-by: Bijay Singh <Bijay.Singh@guavus.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/tcp_output.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0dda86e..b8bb226 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -667,7 +667,7 @@ static unsigned tcp_synack_options(struct sock *sk,
 	u8 cookie_plus = (xvp != NULL && !xvp->cookie_out_never) ?
 			 xvp->cookie_plus :
 			 0;
-	bool doing_ts = ireq->tstamp_ok;
+	bool doing_ts;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
@@ -680,11 +680,12 @@ static unsigned tcp_synack_options(struct sock *sk,
 		 * rather than TS in order to fit in better with old,
 		 * buggy kernels, but that was deemed to be unnecessary.
 		 */
-		doing_ts &= !ireq->sack_ok;
+		ireq->tstamp_ok &= !ireq->sack_ok;
 	}
 #else
 	*md5 = NULL;
 #endif
+	doing_ts = ireq->tstamp_ok;
 
 	/* We always send an MSS option. */
 	opts->mss = mss;


--

From: Stephen Hemminger
Date: Monday, May 17, 2010 - 10:22 am

On Mon, 17 May 2010 07:03:49 +0200

Make this gets back to stable as well.

-- 
--

From: Stephen Hemminger
Date: Monday, May 17, 2010 - 1:42 pm

On Mon, 17 May 2010 07:03:49 +0200

Since you are doing away with flag variable, why not this instead?


--- a/net/ipv4/tcp_output.c	2010-05-17 13:38:32.822025583 -0700
+++ b/net/ipv4/tcp_output.c	2010-05-17 13:41:47.321734775 -0700
@@ -668,7 +668,6 @@ static unsigned tcp_synack_options(struc
 	u8 cookie_plus = (xvp != NULL && !xvp->cookie_out_never) ?
 			 xvp->cookie_plus :
 			 0;
-	bool doing_ts = ireq->tstamp_ok;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
@@ -681,7 +680,7 @@ static unsigned tcp_synack_options(struc
 		 * rather than TS in order to fit in better with old,
 		 * buggy kernels, but that was deemed to be unnecessary.
 		 */
-		doing_ts &= !ireq->sack_ok;
+		ireq->tstamp_ok &= !ireq->sack_ok;
 	}
 #else
 	*md5 = NULL;
@@ -696,7 +695,7 @@ static unsigned tcp_synack_options(struc
 		opts->options |= OPTION_WSCALE;
 		remaining -= TCPOLEN_WSCALE_ALIGNED;
 	}
-	if (likely(doing_ts)) {
+	if (likely(ireq->tstamp_ok)) {
 		opts->options |= OPTION_TS;
 		opts->tsval = TCP_SKB_CB(skb)->when;
 		opts->tsecr = req->ts_recent;
@@ -704,7 +703,7 @@ static unsigned tcp_synack_options(struc
 	}
 	if (likely(ireq->sack_ok)) {
 		opts->options |= OPTION_SACK_ADVERTISE;
-		if (unlikely(!doing_ts))
+		if (unlikely(!ireq->tstamp_ok))
 			remaining -= TCPOLEN_SACKPERM_ALIGNED;
 	}
 
@@ -712,7 +711,7 @@ static unsigned tcp_synack_options(struc
 	 * If the <SYN> options fit, the same options should fit now!
 	 */
 	if (*md5 == NULL &&
-	    doing_ts &&
+	    ireq->tstamp_ok &&
 	    cookie_plus > TCPOLEN_COOKIE_BASE) {
 		int need = cookie_plus; /* has TCPOLEN_COOKIE_BASE */
 


 



-- 
--

From: Eric Dumazet
Date: Monday, May 17, 2010 - 2:04 pm

Sure, we can eliminate this doing_ts variable and save few bytes

Thanks

[PATCH] tcp: tcp_synack_options() fix 

Commit 33ad798c924b4a (tcp: options clean up) introduced a problem
if MD5+SACK+timestamps were used in initial SYN message.

Some stacks (old linux for example) try to negotiate MD5+SACK+TSTAMP
sessions, but since 40 bytes of tcp options space are not enough to
store all the bits needed, we chose to disable timestamps in this case.

We send a SYN-ACK _without_ timestamp option, but socket has timestamps
enabled and all further outgoing messages contain a TS block, all with
the initial timestamp of the remote peer.

Fix is to really disable timestamps option for the whole session.

Reported-by: Bijay Singh <Bijay.Singh@guavus.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0dda86e..44e98d9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -667,7 +667,6 @@ static unsigned tcp_synack_options(struct sock *sk,
 	u8 cookie_plus = (xvp != NULL && !xvp->cookie_out_never) ?
 			 xvp->cookie_plus :
 			 0;
-	bool doing_ts = ireq->tstamp_ok;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
@@ -680,7 +679,7 @@ static unsigned tcp_synack_options(struct sock *sk,
 		 * rather than TS in order to fit in better with old,
 		 * buggy kernels, but that was deemed to be unnecessary.
 		 */
-		doing_ts &= !ireq->sack_ok;
+		ireq->tstamp_ok &= !ireq->sack_ok;
 	}
 #else
 	*md5 = NULL;
@@ -695,7 +694,7 @@ static unsigned tcp_synack_options(struct sock *sk,
 		opts->options |= OPTION_WSCALE;
 		remaining -= TCPOLEN_WSCALE_ALIGNED;
 	}
-	if (likely(doing_ts)) {
+	if (likely(ireq->tstamp_ok)) {
 		opts->options |= OPTION_TS;
 		opts->tsval = TCP_SKB_CB(skb)->when;
 		opts->tsecr = req->ts_recent;
@@ -703,7 +702,7 @@ static unsigned tcp_synack_options(struct sock *sk,
 	}
 	if (likely(ireq->sack_ok)) {
 		opts->options |= ...
From: David Miller
Date: Monday, May 17, 2010 - 10:35 pm

From: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks!
--

From: David Miller
Date: Sunday, May 16, 2010 - 12:30 am

From: Stephen Hemminger <shemminger@vyatta.com>

It's an issue because creating a crypto API context is expensive, so this
influences our connection rates with MD5.
--

From: Lars Eggert
Date: Friday, May 7, 2010 - 1:46 am

Let me quickly point out that the IETF has recently replaced TCP-MD5 with TCP-AO (http://tools.ietf.org/html/draft-ietf-tcpm-tcp-auth-opt-11) which is days or weeks away from being published as an RFC. (The draft has been approved and it waiting for the RFC Editor to finish the final copy-editing.)

You'll obviously still want TCP-MD5 support for talking to existing equipment, but significant cycles would IMO be better spent on TCP-AO.

Lars
From: Eric Dumazet
Date: Friday, May 7, 2010 - 1:55 am

Thanks for this info !


--

From: David Miller
Date: Friday, May 7, 2010 - 2:12 am

From: Lars Eggert <lars.eggert@nokia.com>

Code we have and users use which is unstable and crashes is more
important to work on and fix than code we don't have which
user's therefore don't use.

You're wrong from just about every possible angle.

Whoever finds AO useful will work on it, just as was the case with MD5
support.  And right now, that "whoever" is definitely not us. :-)

--

From: Eric Dumazet
Date: Thursday, May 6, 2010 - 10:39 pm

I cant see a race with spinlock in
tcp_alloc_md5sig_pool/__tcp_alloc_md5sig_pool().

We allocate structures for all cpus, so preemption.migration should be
OK

Could you elaborate please ?


--

From: Eric Dumazet
Date: Friday, May 7, 2010 - 1:00 am

Here is my official patch submission, could you please test it ?

Thanks

[PATCH] tcp: fix MD5 (RFC2385) support

TCP MD5 support uses percpu data for temporary storage. It currently
disables preemption so that same storage cannot be reclaimed by another
thread on same cpu.

We also have to make sure a softirq handler wont try to use also same
context. Various bug reports demonstrated corruptions.

Fix is to disable preemption and BH.

Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/tcp.h |   21 +++------------------
 net/ipv4/tcp.c    |   34 ++++++++++++++++++++++++----------
 2 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 75be5a2..aa04b9a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1197,30 +1197,15 @@ extern int			tcp_v4_md5_do_del(struct sock *sk,
 extern struct tcp_md5sig_pool * __percpu *tcp_alloc_md5sig_pool(struct sock *);
 extern void			tcp_free_md5sig_pool(void);
 
-extern struct tcp_md5sig_pool	*__tcp_get_md5sig_pool(int cpu);
-extern void			__tcp_put_md5sig_pool(void);
+extern struct tcp_md5sig_pool	*tcp_get_md5sig_pool(void);
+extern void			tcp_put_md5sig_pool(void);
+
 extern int tcp_md5_hash_header(struct tcp_md5sig_pool *, struct tcphdr *);
 extern int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *, struct sk_buff *,
 				 unsigned header_len);
 extern int tcp_md5_hash_key(struct tcp_md5sig_pool *hp,
 			    struct tcp_md5sig_key *key);
 
-static inline
-struct tcp_md5sig_pool		*tcp_get_md5sig_pool(void)
-{
-	int cpu = get_cpu();
-	struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
-	if (!ret)
-		put_cpu();
-	return ret;
-}
-
-static inline void		tcp_put_md5sig_pool(void)
-{
-	__tcp_put_md5sig_pool();
-	put_cpu();
-}
-
 /* write queue abstraction */
 static inline void tcp_write_queue_purge(struct sock *sk)
 {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0f8caf6..296150b ...
From: Bhaskar Dutta
Date: Friday, May 7, 2010 - 1:59 am

Eric,

Thanks a lot! I will test it out and let you know.
BTW this patch seems to essentially do the same as the earlier fix you
had posted (where you just do bh disable/enable).
Am I missing something?

With the earlier fix, I ran load tests with 80 TCP connections for
over 6 hrs and found 5 bad checksum packets.
So there is still a problem. Without the fix I see a bad packet every
minute or so.

Bhaskar
--

From: Eric Dumazet
Date: Friday, May 7, 2010 - 2:37 am

My second patch is cleaner, using only out of line code (inline was not
necessary and made include file bigger than necessary). Inline is fine
if we can avoid a function call, but it was not the case.

If you notice another corruption, it may be because of another problem,
yet to be discovered.

To you have a userland suite to test/stress tcp md5 connections ?



--

From: Bhaskar Dutta
Date: Friday, May 7, 2010 - 3:50 am

We are still trying to find the other corruption.

I will send you the tarball of a userland client/server suite to
stress test the TCP-MD5 that we've been using to reproduce the issue.

Thanks,
Bhaskar
--

From: Eric Dumazet
Date: Friday, May 7, 2010 - 8:18 am

OK, I found the second problem.

if/when IP route cache is invalidated, ip_queue_xmit() has to refetch a
route and calls sk_setup_caps(sk, &rt->u.dst), destroying the 

sk->sk_route_caps &= ~NETIF_F_GSO_MASK

that MD5 desesperatly try to make all over its way (from
tcp_transmit_skb() for example)

So we send few bad packets, and everything is fine when
tcp_transmit_skb() is called again.

You get many errors on remote peer if you do

ip route flush cache



--

From: Eric Dumazet
Date: Friday, May 7, 2010 - 8:44 am

I am testing following patch :

 include/net/sock.h    |    8 ++++++++
 net/core/sock.c       |    1 +
 net/ipv4/tcp_ipv4.c   |    6 +++---
 net/ipv4/tcp_output.c |    2 +-
 net/ipv6/tcp_ipv6.c   |    4 ++--
 5 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..abfadfe 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -177,6 +177,7 @@ struct sock_common {
   *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
   *	@sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
   *	@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
+  *	@sk_route_nocaps: forbidden route capabilities (e.g NETIF_F_GSO_MASK)
   *	@sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
   *	@sk_gso_max_size: Maximum GSO segment size to build
   *	@sk_lingertime: %SO_LINGER l_linger setting
@@ -276,6 +277,7 @@ struct sock {
 	int			sk_forward_alloc;
 	gfp_t			sk_allocation;
 	int			sk_route_caps;
+	int			sk_route_nocaps;
 	int			sk_gso_type;
 	unsigned int		sk_gso_max_size;
 	int			sk_rcvlowat;
@@ -1257,6 +1259,12 @@ static inline int sk_can_gso(const struct sock *sk)
 
 extern void sk_setup_caps(struct sock *sk, struct dst_entry *dst);
 
+static inline void sk_nocaps_add(struct sock *sk, int flags)
+{
+	sk->sk_route_nocaps |= flags;
+	sk->sk_route_caps &= ~flags;
+}
+
 static inline int skb_copy_to_page(struct sock *sk, char __user *from,
 				   struct sk_buff *skb, struct page *page,
 				   int off, int copy)
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..5056a6a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1227,6 +1227,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
 	sk->sk_route_caps = dst->dev->features;
 	if (sk->sk_route_caps & NETIF_F_GSO)
 		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
+	sk->sk_route_caps &= ~sk->sk_route_nocaps;
 	if (sk_can_gso(sk)) {
 		if (dst->header_len) {
 			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
diff --git ...
From: Eric Dumazet
Date: Friday, May 7, 2010 - 2:18 pm

Patch solves the problem for me. I tested it with 200 MD5 sockets
established between two 16 cpus machine, with a multiqueue NIC. Trafic
of 100.000 pps per second, and "ip route flush cache" every minute on
both machines. After five hours, not a single frame had a bad hash
value.

Here is the official submission.

[PATCH] net: Introduce sk_route_nocaps

TCP-MD5 sessions have intermittent failures, when route cache is
invalidated. ip_queue_xmit() has to find a new route, calls
sk_setup_caps(sk, &rt->u.dst), destroying the 

sk->sk_route_caps &= ~NETIF_F_GSO_MASK

that MD5 desperately try to make all over its way (from
tcp_transmit_skb() for example)

So we send few bad packets, and everything is fine when
tcp_transmit_skb() is called again for this socket.

Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.

Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h    |    8 ++++++++
 net/core/sock.c       |    1 +
 net/ipv4/tcp_ipv4.c   |    6 +++---
 net/ipv4/tcp_output.c |    2 +-
 net/ipv6/tcp_ipv6.c   |    4 ++--
 5 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..abfadfe 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -177,6 +177,7 @@ struct sock_common {
   *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
   *	@sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
   *	@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
+  *	@sk_route_nocaps: forbidden route capabilities (e.g NETIF_F_GSO_MASK)
   *	@sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
   *	@sk_gso_max_size: Maximum GSO segment size to build
   *	@sk_lingertime: %SO_LINGER l_linger setting
@@ -276,6 +277,7 @@ struct sock {
 	int			sk_forward_alloc;
 	gfp_t			sk_allocation;
 	int			sk_route_caps;
+	int			sk_route_nocaps;
 ...
From: David Miller
Date: Sunday, May 16, 2010 - 12:37 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Since the connection does recover eventually, I'm stuffing this into
net-next-2.6 into net-2.6

After some time in net-next-2.6, we can submit it to -stable.
--

From: David Miller
Date: Sunday, May 16, 2010 - 12:35 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
--

Previous thread: [PATCH] IPv6: fix IPV6_RECVERR handling of locally-generated errors by Brian Haley on Monday, May 3, 2010 - 6:44 pm. (1 message)

Next thread: [PATCH] net-next: remove useless union keyword by Changli Gao on Monday, May 3, 2010 - 8:04 pm. (3 messages)