Fw: [Bug 14470] New: freez in TCP stack

Previous thread: [PATCH v1 6/7] gianfar: Add Multiple group Support by Sandeep Gopalpet on Monday, October 26, 2009 - 9:27 am. (1 message)

Next thread: [PATCH] can: sja1000: fix bug using library functions for skb allocation by Wolfgang Grandegger on Monday, October 26, 2009 - 8:46 am. (2 messages)
From: Stephen Hemminger
Date: Monday, October 26, 2009 - 8:41 am

Begin forwarded message:

Date: Mon, 26 Oct 2009 12:47:22 GMT
From: bugzilla-daemon@bugzilla.kernel.org
To: shemminger@linux-foundation.org
Subject: [Bug 14470] New: freez in TCP stack


http://bugzilla.kernel.org/show_bug.cgi?id=14470

           Summary: freez in TCP stack
           Product: Networking
           Version: 2.5
    Kernel Version: 2.6.31
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: IPV4
        AssignedTo: shemminger@linux-foundation.org
        ReportedBy: kolo@albatani.cz
        Regression: No


We are hiting kernel panics on Dell R610 servers with e1000e NICs; it apears
usualy under a high network trafic ( around 100Mbit/s) but it is not a rule it
has happened even on low trafic.

Servers are used as reverse http proxy (varnish).

On 6 equal servers this panic happens aprox 2 times a day depending on network
load. Machine completly freezes till the management watchdog reboots. 


We had to put serial console on these servers to catch the oops. Is there
anything else We can do to debug this?
The RIP is always the same:

RIP: 0010:[<ffffffff814203cc>]  [<ffffffff814203cc>]
tcp_xmit_retransmit_queue+0x8c/0x290

rest of the oops always differs a litle ... here is an example:

RIP: 0010:[<ffffffff814203cc>]  [<ffffffff814203cc>]
tcp_xmit_retransmit_queue+0x8c/0x290
RSP: 0018:ffffc90000003a40  EFLAGS: 00010246
RAX: ffff8807e7420678 RBX: ffff8807e74205c0 RCX: 0000000000000000
RDX: 000000004598a105 RSI: 0000000000000000 RDI: ffff8807e74205c0
RBP: ffffc90000003a80 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8807e74205c0 R14: ffff8807e7420678 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffc90000000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: ...
From: Andrew Morton
Date: Wednesday, October 28, 2009 - 3:13 pm

On Mon, 26 Oct 2009 08:41:32 -0700

Stephen, please retain the bugzilla and reporter email cc's when

Twice a day on six separate machines.  That ain't no hardware glitch.

Vaclav, are you able to say whether this is a regression?  Did those
machines run 2.6.30 (for example)?


--

From: Denys Fedoryschenko
Date: Wednesday, October 28, 2009 - 3:27 pm

I had issues on Dell also. On one fixed by bios update, another only after 
tuning some voodoo settings in sysctl (i was in hurry, no redundancy for this 
server, and it was rebooting each day 1-3 times). It happens also in 32 and 
64bit kernels (32bit userspace), also "heavy" tcp workload, both of them act 
as proxy.
But my issue probably different, on both Dell servers i had bnx2 with IPMI.
It was very weird, nmi_watchdog, panic on reboot / on oops, detect 
softlockups, detect deadlocks, detect hang tasks, hangcheck timer - didn't 
help, only hardware watchdog (IPMI or iTCO) able to catch hang and reboot 
server. Because i didn't had anything useful to report(remote server and 
netconsole didn't give anything), i didn't fill bugzilla report. 


--

From: Eric Dumazet
Date: Wednesday, October 28, 2009 - 10:35 pm

Code: 00 eb 28 8b 83 d0 03 00 00
  41 39 44 24 40    cmp    %eax,0x40(%r12)
  0f 89 00 01 00 00 jns ...
  41 0f b6 cd       movzbl %r13b,%ecx
  41 bd 2f 00 00 00 mov    $0x2f000000,%r13d
  83 e1 03          and    $0x3,%ecx
  0f 84 fc 00 00 00 je ...
  4d 8b 24 24       mov    (%r12),%r12    skb = skb->next
<>49 8b 04 24       mov    (%r12),%rax     << NULL POINTER dereference >>
  4d 39 f4          cmp    %r14,%r12
  0f 18 08          prefetcht0 (%rax)
  0f 84 d9 00 00 00 je  ...
  4c 3b a3 b8 01    cmp


crash is in 
void tcp_xmit_retransmit_queue(struct sock *sk)
{

<< HERE >> tcp_for_write_queue_from(skb, sk) {

}


Some skb in sk_write_queue has a NULL ->next pointer

Strange thing is R14 and RAX =ffff8807e7420678  (&sk->sk_write_queue) 
R14 is the stable value during the loop, while RAW is scratch register.

I dont have full disassembly for this function, but I guess we just entered the loop
(or RAX should be really different at this point)

So, maybe list head itself is corrupted (sk->sk_write_queue->next = NULL)

or, retransmit_skb_hint problem ? (we forget to set it to NULL in some cases ?)

--

From: Eric Dumazet
Date: Wednesday, October 28, 2009 - 10:59 pm

David, what do you think of following patch ?

I wonder if we should reorganize code to add sanity checks in tcp_unlink_write_queue()
that the skb we delete from queue is not still referenced.

[PATCH] tcp: clear retrans hints in tcp_send_synack()

There is a small possibility the skb we unlink from write queue 
is still referenced by retrans hints.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fcd278a..b22a72d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2201,6 +2201,7 @@ int tcp_send_synack(struct sock *sk)
 			struct sk_buff *nskb = skb_copy(skb, GFP_ATOMIC);
 			if (nskb == NULL)
 				return -ENOMEM;
+			tcp_clear_all_retrans_hints(tcp_sk(sk));
 			tcp_unlink_write_queue(skb, sk);
 			skb_header_release(nskb);
 			__tcp_add_write_queue_head(sk, nskb);
--

From: David Miller
Date: Wednesday, October 28, 2009 - 11:02 pm

From: Eric Dumazet <eric.dumazet@gmail.com>

Yes, the first thing I thought of when I saw this crash was the hints.

I'll think this over.
--

From: David Miller
Date: Thursday, October 29, 2009 - 1:00 am

From: Eric Dumazet <eric.dumazet@gmail.com>

So, this would only be true if we were dealing with a data
packet here.  We're not, this is a SYN+ACK which happens to
be cloned in the write queue.

The hint SKBs pointers can only point to real data packets.

And we're only dealing with data packets once we enter established
state, and when we enter established by definition we have unlinked
and freed up any SYN and SYN+ACK SKBs in the write queue.
--

From: =?ISO-8859-15?Q?Ilpo_J=E4rvinen?=
Date: Thursday, November 26, 2009 - 2:54 pm

How about this then... Does the original reporter have NFS in use?

[PATCH] tcp: clear hints to avoid a stale one (nfs only affected?)

Eric Dumazet mentioned in a context of another problem:

"Well, it seems NFS reuses its socket, so maybe we miss some
cleaning as spotted in this old patch"

I've not check under which conditions that actually happens but
if true, we need to make sure we don't accidently leave stale
hints behind when the write queue had to be purged (whether reusing
with NFS can actually happen if purging took place is something I'm
not sure of).

...At least it compiles.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/net/tcp.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..6b13faa 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1228,6 +1228,7 @@ static inline void tcp_write_queue_purge(struct sock *sk)
 	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
 		sk_wmem_free_skb(sk, skb);
 	sk_mem_reclaim(sk);
+	tcp_clear_all_retrans_hints(tcp_sk(sk));
 }
 
 static inline struct sk_buff *tcp_write_queue_head(struct sock *sk)
-- 
1.5.6.5
From: David Miller
Date: Thursday, November 26, 2009 - 4:37 pm

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>

I must be getting old and senile, but I specifically remembered that
we prevented a socket from ever being bound again once it has been
bound one time specifically so we didn't have to deal with issues
like this.

I really don't think it's valid for NFS to reuse the socket structure
like this over and over again.  And that's why only NFS can reproduce
this, the interfaces provided userland can't actually go through this
sequence after a socket goes down one time all the way to close.

Do we really want to audit each and every odd member of the socket
structure from the generic portion all the way down to INET and
TCP specifics to figure out what needs to get zero'd out?

So much relies upon the one-time full zero out during sock allocation.

Let's fix NFS instead.
--

From: Eric Dumazet
Date: Thursday, November 26, 2009 - 11:17 pm

bugzilla reference : http://bugzilla.kernel.org/show_bug.cgi?id=14580

Trond said :
  NFS MUST reuse the same port because on most servers, the replay cache is keyed
  to the port number. In other words, when we replay an RPC call, the server will
  only recognise it as a replay if it originates from the same port.
  See http://www.connectathon.org/talks96/werme1.html


Please note the socket stays bound to a given local port.

We want to connect() it to a possible other target, that's all.

In NFS case 'other target' is in fact the same target, but this
is a special case of a more general one.

Hmm... if an application wants to keep a local port for itself (not
allowing another one to get this (ephemeral ?) port during the 
close()/socket()/bind() window), this is the only way.
TCP state machine allows this IMHO.

google for "tcp AF_UNSPEC connect" to find many references and man pages
for this stuff.

http://kerneltrap.org/Linux/Connect_Specification_versus_Man_Page

How other Unixes / OS handle this ?
How many applications use this trick ?


--

From: David Miller
Date: Wednesday, December 2, 2009 - 4:10 pm

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>

I think this is a good safety net even if it doesn't specifically
fix a specific problem.

But I'd like to see this patch tested by the person seeing the
problem so we can know whether that is fixed or not.
--

From: David Miller
Date: Wednesday, December 2, 2009 - 11:24 pm

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>

Ok, since Linus just released 2.6.32 I'm tossing this into net-next-2.6
so it gets wider exposure.

I still want to see test results from the bug reporter, and if it fixes
things we can toss this into -stable too.
--

From: Andrew Morton
Date: Thursday, March 18, 2010 - 2:04 pm

On Wed, 02 Dec 2009 22:24:46 -0800 (PST)

Despite my request to take this to email, quite a few people have been
jumping onto this report via bugzilla:
http://bugzilla.kernel.org/show_bug.cgi?id=14470

Bit of a pita, but it'd be worth someone taking a look to ensure that
we're all talking about the same bug.

--

From: =?ISO-8859-15?Q?Ilpo_J=E4rvinen?=
Date: Friday, March 19, 2010 - 8:52 am

Could one try with this debug patch:

http://marc.info/?l=linux-kernel&m=126624014117610&w=2

It should prevent crashing too.

-- 
 i.
--

From: =?ISO-8859-15?Q?Ilpo_J=E4rvinen?=
Date: Thursday, October 29, 2009 - 5:58 am

One more alternative along those lines could perhaps be:

We enter with empty write_queue there and with the hint being null, so we 
take the else branch... and skb_peek then gives us the NULL ptr. However, 
I cannot see how this could happen as all branches trap with return 

...I don't understand how a stale reference would yield to a consistent 
NULL ptr crash there rather than hard to track corruption for most of the 
times and random crashes then here and there. Or perhaps we were just very 
lucky to immediately get only those reports which point out to the right 
track :-).

...I tried to find what is wrong with it but sadly came up only
ah-this-is-it-oh-wait-it's-ok type of things.

-- 
 i.
From: Eric Dumazet
Date: Thursday, October 29, 2009 - 7:08 am

When a skb is freed, and re-allocated, we clear most of its fields
in __alloc_skb()

memset(skb, 0, offsetof(struct sk_buff, tail));

Then if this skb is freed again, not queued anywhere, its skb->next stays NULL

So if we have a stale reference to a freed skb, we can :

- Get a NULL pointer, or a poisonned value (if SLUB_DEBUG)


Here is a debug patch to check we dont have stale pointers, maybe this will help ?sync


[PATCH] tcp: check stale pointers in tcp_unlink_write_queue()

In order to track some obscure bug, we check in tcp_unlink_write_queue() if
we dont have stale references to unlinked skb

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/tcp.h     |    4 ++++
 net/ipv4/tcp.c        |    2 +-
 net/ipv4/tcp_input.c  |    4 ++--
 net/ipv4/tcp_output.c |    8 ++++----
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 740d09b..09da342 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1357,6 +1357,10 @@ static inline void tcp_insert_write_queue_before(struct sk_buff *new,
 
 static inline void tcp_unlink_write_queue(struct sk_buff *skb, struct sock *sk)
 {
+	WARN_ON(skb == tcp_sk(sk)->retransmit_skb_hint);
+	WARN_ON(skb == tcp_sk(sk)->lost_skb_hint);
+	WARN_ON(skb == tcp_sk(sk)->scoreboard_skb_hint);
+	WARN_ON(skb == sk->sk_send_head);
 	__skb_unlink(skb, &sk->sk_write_queue);
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e0cfa63..328bdb1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1102,11 +1102,11 @@ out:
 
 do_fault:
 	if (!skb->len) {
-		tcp_unlink_write_queue(skb, sk);
 		/* It is the one place in all of TCP, except connection
 		 * reset, where we can be unlinking the send_head.
 		 */
 		tcp_check_send_head(sk, skb);
+		tcp_unlink_write_queue(skb, sk);
 		sk_wmem_free_skb(sk, skb);
 	}
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ba0eab6..fccc6e9 100644
--- a/net/ipv4/tcp_input.c
+++ ...
Previous thread: [PATCH v1 6/7] gianfar: Add Multiple group Support by Sandeep Gopalpet on Monday, October 26, 2009 - 9:27 am. (1 message)

Next thread: [PATCH] can: sja1000: fix bug using library functions for skb allocation by Wolfgang Grandegger on Monday, October 26, 2009 - 8:46 am. (2 messages)