Re: Re : Oops preceded by WARNING: at net/ipv4/tcp_input.c:1571 tcp_remove_reno_sacks()

Previous thread: [PATCH 2/2] [e1000 VLAN] Disable vlan hw accel when promiscuous mode by Joonwoo Park on Saturday, November 10, 2007 - 5:51 pm. (31 messages)

Next thread: Re: [PATCH 01/01] iproute2-2.6.23: RFC4214 Support (3) by Patrick McHardy on Saturday, November 10, 2007 - 8:26 pm. (1 message)
From: Chazarain Guillaume
Date: Saturday, November 10, 2007 - 6:39 pm

I'm currently running bittorrent with all of this. I just saw this (for the
first time ever), but otherwise it works fine:

WARNING: at net/ipv4/tcp_output.c:1807 tcp_simple_retransmit()
 [<c0104cb3>] show_trace_log_lvl+0x1a/0x2f
 [<c0105563>] show_trace+0x12/0x14
 [<c0105668>] dump_stack+0x15/0x17
 [<c02f6a79>] tcp_simple_retransmit+0xfa/0x185
 [<c02fa072>] tcp_v4_err+0x35d/0x4cb
 [<c0301f7d>] icmp_unreach+0x327/0x352
 [<c030159d>] icmp_rcv+0xe0/0xf7
 [<c02e2d75>] ip_local_deliver_finish+0x124/0x1ba
 [<c02e3178>] ip_local_deliver+0x72/0x7e
 [<c02e2c31>] ip_rcv_finish+0x299/0x2b9
 [<c02e30e8>] ip_rcv+0x1e1/0x1ff
 [<c02c755c>] netif_receive_skb+0x37d/0x401
 [<c02c9372>] process_backlog+0x5b/0x96
 [<c02c9037>] net_rx_action+0x87/0x152
 [<c0121c9f>] __do_softirq+0x38/0x7a

Just ran it with no errors for 6 minutes 30. The box is otherwise stable though.

I forgot to say that I have a kdump image of the crash (I had to recompile this
2.6.24-rc2 kernel as I deleted its vmlinux), so I could check that you are

(gdb) p sk->sk_write_queue.next
$11 = (struct sk_buff *) 0xe43a04b0
(gdb) p &sk->sk_write_queue

(gdb) p ((struct tcp_sock *) sk)->packets_out

(gdb) p ((struct tcp_sock *) sk)->lost_out
$14 = 4294967295
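
For reference, 4294967295 is 2^32 - 1, which is the value a 32-bit unsigned counter holds after being decremented once past zero. The sketch below only illustrates that arithmetic. It assumes lost_out is a 32-bit unsigned field (the printed value suggests so), and both helper names are hypothetical, not kernel functions.

```c
#include <stdint.h>

/* Hypothetical helper, not kernel code: decrement a 32-bit unsigned
 * counter with no guard, the way a miscounting code path would. */
uint32_t decrement_unchecked(uint32_t counter)
{
    return counter - 1u;    /* wraps modulo 2^32 when counter == 0 */
}

/* Guarded variant: refuses to go below zero and reports it instead.
 * Returns 0 on success, -1 if the counter was already zero. */
int decrement_checked(uint32_t *counter)
{
    if (*counter == 0)
        return -1;
    (*counter)--;
    return 0;
}
```

Calling decrement_unchecked(0) yields 4294967295, matching the gdb output above, while the checked variant leaves the counter at zero and returns -1.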

Some more gdb output for information:

#0  tcp_xmit_retransmit_queue (sk=0xe43a0440) at net/ipv4/tcp_output.c:1962
1962                            __u8 sacked = TCP_SKB_CB(skb)->sacked;

(gdb) bt
#0  tcp_xmit_retransmit_queue (sk=0xe43a0440) at net/ipv4/tcp_output.c:1962
#1  0xc02f298a in tcp_ack (sk=0xe43a0440, skb=0xc75720c0, flag=1038) at net/ipv4/tcp_input.c:2524
#2  0xc02f5208 in tcp_rcv_established (sk=0xe43a0440, skb=0xc75720c0, th=0xeac35058, len=32) at net/ipv4/tcp_input.c:4502
#3  0xc02fa711 in tcp_v4_do_rcv (sk=0xe43a0440, skb=0xc75720c0) at net/ipv4/tcp_ipv4.c:1572
#4  0xc02fc557 in tcp_v4_rcv (skb=0xc75720c0) at net/ipv4/tcp_ipv4.c:1696
#5  0xc02e4961 in ip_local_deliver_finish (skb=0xc75720c0) at net/ipv4/ip_input.c:233
#6  0xc02e4d64 in ...
From: Ilpo Järvinen
Date: Sunday, November 11, 2007 - 3:40 pm

The messages you had in the other mail are very likely a symptom of the 
same problem; it's just hard to tell from them where it really originates 
(because that would require expensive verification that nobody wants 
to do by default after simple operations). In many cases that WARN_ON is 
simply too late to tell when the problem causing adjustment/corruption 

WARNING: at net/ipv4/tcp_output.c:1807 tcp_simple_retransmit()
 [<c0104cb3>] show_trace_log_lvl+0x1a/0x2f
 [<c0105563>] show_trace+0x12/0x14
 [<c0105668>] dump_stack+0x15/0x17
 [<c02f6a79>] tcp_simple_retransmit+0xfa/0x185
 [<c02fa072>] tcp_v4_err+0x35d/0x4cb
 [<c0301f7d>] icmp_unreach+0x327/0x352

Hmm, that's related to path MTU things... It might have something to do 

Yeah, it's more likely a miscount somewhere rather than corruption but 
that wasn't obvious from the first mail...

...but alas, I haven't yet been able to come up with any theory on how 

Yeah, they are expected, the write_queue is empty. Another cause for 
those could have been corrupted write_queue (that's why I asked for the 

Underflows by one. ...We should just find out what causes this and fix 

Thanks for them, though they're not that useful because the problem 


No, it won't happen like that. ...I'd say that gdb is just confused. In 
case packets_out is zero (it occurs after a cumulative ACK only), for sure 
skb will become NULL because the retransmit_skb_hint was cleared due to 
cumulative ACK.

The crash location is the expected one in case packets_out gets zero 
during recovery and lost_out is miscounted/corrupt, as your dump shows.
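
A rough userspace model of that failure mode, for illustration only: the structures and the function below are toy stand-ins with hypothetical names, not the kernel's tcp_xmit_retransmit_queue(). The sketch assumes the walk uses lost_out as its budget and starts from the head of the write queue. Where real code would dereference a NULL skb, the model returns -1.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; these are NOT the kernel structures. */
struct toy_skb {
    struct toy_skb *next;
    int lost;                     /* nonzero if the segment is marked lost */
};

struct toy_sock {
    struct toy_skb *queue_head;   /* NULL models an empty write queue */
    uint32_t lost_out;            /* cached count of lost segments */
};

/*
 * Walk the queue looking for lost segments, trusting lost_out as the
 * loop budget. Returns the number of retransmit candidates found, or
 * -1 when the counter claims more lost segments than the queue holds,
 * which is the point where an unguarded walk would dereference NULL.
 */
int toy_retransmit_walk(const struct toy_sock *sk)
{
    const struct toy_skb *skb = sk->queue_head;
    uint32_t budget = sk->lost_out;
    int found = 0;

    while (budget > 0) {
        if (skb == NULL)
            return -1;            /* counter and queue disagree */
        if (skb->lost) {
            found++;
            budget--;
        }
        skb = skb->next;
    }
    return found;
}
```

With an empty queue and lost_out equal to 4294967295, the model returns -1 on the first iteration. With a consistent counter it simply visits the lost segments and stops.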

Anyway, thanks for digging these out.


Here's a bruteforce patch below... Since you had a couple of them during 
your overnight test, I'm sure it's relatively easy to catch... The 
first place where the tcp_verify_lost is triggered is the most 
interesting; the rest are likely ripples due to that earlier corruption... 
(Hopefully I've placed them this time to places where both queue and ...
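
The patch itself is not preserved in this archive. As a minimal sketch of the general brute-force idea, one can recount the lost segments by walking the queue and compare the result with the cached counter. The names dbg_skb and dbg_verify_lost are hypothetical and the structure is a toy, so this should not be read as the body of the tcp_verify_lost check mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy queue entry for illustration; not a kernel sk_buff. */
struct dbg_skb {
    struct dbg_skb *next;
    int lost;                 /* nonzero if the segment is marked lost */
};

/*
 * Recount lost segments the expensive way and compare with the cached
 * counter. Returns 1 when they agree, 0 when the counter has drifted.
 */
int dbg_verify_lost(const struct dbg_skb *head, uint32_t cached_lost_out)
{
    uint32_t counted = 0;

    for (const struct dbg_skb *skb = head; skb != NULL; skb = skb->next) {
        if (skb->lost)
            counted++;
    }
    return counted == cached_lost_out;
}
```

Calling such a check after each operation that touches the counter is costly, but the first call that returns 0 points at the operation that introduced the miscount rather than at a later ripple.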
From: Ilpo Järvinen
Date: Tuesday, November 13, 2007 - 2:35 pm

On Mon, 12 Nov 2007, Ilpo Järvinen wrote:
From: David Miller
Date: Tuesday, November 13, 2007 - 10:04 pm

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

This patch looks correct to me, so I added it to net-2.6

Chazarain, please let us know if it does indeed cure your
problem.

Thanks.

From: Ilpo Järvinen
Date: Wednesday, November 14, 2007 - 6:32 am

On Tue, 13 Nov 2007, David Miller wrote:

> From: "Ilpo_J
From: David Miller
Date: Wednesday, November 14, 2007 - 4:55 pm

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

Applied.

Thanks for making such an incredibly thorough investigation
into this bug!

From: Ilpo Järvinen
Date: Thursday, November 15, 2007 - 1:11 am

On Wed, 14 Nov 2007, David Miller wrote:

> From: "Ilpo_J
From: Guillaume Chazarain
Date: Thursday, November 15, 2007 - 3:31 am

Unfortunately, I couldn't manage to reproduce the problem with an
unpatched kernel. But your investigation, Ilpo, was really impressive.

BTW, even though I messed up the yahoo webmail configuration, you can
call me by my first name: Guillaume ;-)

Thanks again for such an awesome bug fixing attitude!

-- 
Guillaume

From: Ilpo Järvinen
Date: Thursday, November 15, 2007 - 4:51 am

These are usually very sensitive to other traffic because even a simple 
change in packet pattern changes behavior enough for it to disappear.
The same thing occurred with the fackets_out miscount a month ago as 
well; on a different weekday it just wasn't reproducible. ...Anyway, I'm 
pretty sure it's now fixed because there's a simple explanation for it 
due to the frto_highmark premature clearing bug. But if you would still 


The best thing is that usually, when forced to really think about what could go 
wrong, other, unrelated bugs seem to come up as well, though up to 10%
of the initial oh-nos end up being genuine bugs. ...Thus I still have 
a couple of miscount-due-to-GSO&hints fixes to do as a result of this 
venture besides the problems already fixed.

-- 
 i.
