I'm currently running bittorrent with all of this, I just saw this (for the first time ever), but otherwise it works fine:

WARNING: at net/ipv4/tcp_output.c:1807 tcp_simple_retransmit()
 [<c0104cb3>] show_trace_log_lvl+0x1a/0x2f
 [<c0105563>] show_trace+0x12/0x14
 [<c0105668>] dump_stack+0x15/0x17
 [<c02f6a79>] tcp_simple_retransmit+0xfa/0x185
 [<c02fa072>] tcp_v4_err+0x35d/0x4cb
 [<c0301f7d>] icmp_unreach+0x327/0x352
 [<c030159d>] icmp_rcv+0xe0/0xf7
 [<c02e2d75>] ip_local_deliver_finish+0x124/0x1ba
 [<c02e3178>] ip_local_deliver+0x72/0x7e
 [<c02e2c31>] ip_rcv_finish+0x299/0x2b9
 [<c02e30e8>] ip_rcv+0x1e1/0x1ff
 [<c02c755c>] netif_receive_skb+0x37d/0x401
 [<c02c9372>] process_backlog+0x5b/0x96
 [<c02c9037>] net_rx_action+0x87/0x152
 [<c0121c9f>] __do_softirq+0x38/0x7a

Just ran it with no errors for 6 minutes 30. The box is otherwise stable though.

I forgot to say that I have a kdump image of the crash (I had to recompile this 2.6.24-rc2 kernel as I deleted its vmlinux), so I could check that you are

(gdb) p sk->sk_write_queue.next
$11 = (struct sk_buff *) 0xe43a04b0
(gdb) p &sk->sk_write_queue
(gdb) p ((struct tcp_sock *) sk)->packets_out
(gdb) p ((struct tcp_sock *) sk)->lost_out
$14 = 4294967295

Some more gdb output for information:

#0 tcp_xmit_retransmit_queue (sk=0xe43a0440) at net/ipv4/tcp_output.c:1962
1962 __u8 sacked = TCP_SKB_CB(skb)->sacked;
(gdb) bt
#0 tcp_xmit_retransmit_queue (sk=0xe43a0440) at net/ipv4/tcp_output.c:1962
#1 0xc02f298a in tcp_ack (sk=0xe43a0440, skb=0xc75720c0, flag=1038) at net/ipv4/tcp_input.c:2524
#2 0xc02f5208 in tcp_rcv_established (sk=0xe43a0440, skb=0xc75720c0, th=0xeac35058, len=32) at net/ipv4/tcp_input.c:4502
#3 0xc02fa711 in tcp_v4_do_rcv (sk=0xe43a0440, skb=0xc75720c0) at net/ipv4/tcp_ipv4.c:1572
#4 0xc02fc557 in tcp_v4_rcv (skb=0xc75720c0) at net/ipv4/tcp_ipv4.c:1696
#5 0xc02e4961 in ip_local_deliver_finish (skb=0xc75720c0) at net/ipv4/ip_input.c:233
#6 0xc02e4d64 in ...
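For readers following the gdb output: the value 4294967295 printed for lost_out is what an unsigned 32-bit counter holds after being decremented once from zero. The sketch below illustrates only that arithmetic. The struct and function names are hypothetical stand-ins, not the kernel's code, and the only thing taken from the dump is that the counters behave as unsigned 32-bit values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters inspected in the dump.
 * Assumption: both are unsigned 32-bit, as the printed value suggests. */
struct counters {
    uint32_t packets_out; /* segments currently outstanding */
    uint32_t lost_out;    /* segments currently marked lost */
};

/* Decrement lost_out without checking it first, the way a
 * miscounting code path would. Unsigned arithmetic wraps modulo 2^32. */
void unchecked_dec_lost(struct counters *c)
{
    assert(c != NULL);
    c->lost_out--; /* 0 becomes UINT32_MAX */
}

/* Returns 1 if the counters are plausible: the number of lost
 * segments can never exceed the number of outstanding segments. */
int counters_sane(const struct counters *c)
{
    assert(c != NULL);
    return c->lost_out <= c->packets_out;
}
```

Starting from packets_out = 0 and lost_out = 0, a single call to unchecked_dec_lost() leaves lost_out at UINT32_MAX (4294967295), and counters_sane() then reports the inconsistency.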
The messages you had in the other mail are very likely symptoms of the same problem, it's just hard to tell from them where it really originates from (because it would require expensive verification that nobody wants to do by default after simple operations). In many cases that WARN_ON is simply too late to tell when the problem causing adjustment/corruption

> WARNING: at net/ipv4/tcp_output.c:1807 tcp_simple_retransmit()
>  [<c0104cb3>] show_trace_log_lvl+0x1a/0x2f
>  [<c0105563>] show_trace+0x12/0x14
>  [<c0105668>] dump_stack+0x15/0x17
>  [<c02f6a79>] tcp_simple_retransmit+0xfa/0x185
>  [<c02fa072>] tcp_v4_err+0x35d/0x4cb
>  [<c0301f7d>] icmp_unreach+0x327/0x352

Hmm, that's related to path MTU things... It might have something to do

Yeah, it's more likely a miscount somewhere rather than corruption but that wasn't obvious from the first mail...

...but alas, I haven't yet been able to come up with any theory on how

Yeah, they are expected, the write_queue is empty. Another cause for those could have been corrupted write_queue (that's why I asked for the

Underflows by one. ...We should just find out what causes this and fix

Thanks for them, though they're not that useful because the problem

No, it won't happen like that. ...I'd say that gdb is just confused. In case packets_out is zero (it occurs after a cumulative ACK only), for sure skb will become NULL because the retransmit_skb_hint was cleared due to cumulative ACK. The crash location is the expected one in case packets_out gets zero during recovery and lost_out is miscounted/corrupt, as your dump shows.

Anyway, thanks for digging these out.

Here's a bruteforce patch below... Since you had a couple of them during your overnight test, I'm sure it's relatively easy to catch... The first place where the tcp_verify_lost is triggered is the most interesting, the rest are likely ripples due to that earlier corruption...

(Hopefully I've placed them this time to places where both queue and ...
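The patch itself is not reproduced in this archive, so the following is only a rough user-space sketch of what a brute-force check of this kind could look like: walk the whole queue, recount the segments flagged as lost, and compare the result with the cached counter. Every name here (struct seg, struct conn, SEG_LOST, recount_lost, verify_lost) is hypothetical and chosen for illustration. It is not the kernel's tcp_verify_lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEG_LOST 0x1u /* hypothetical "marked lost" flag */

/* Hypothetical queue entry; pcount is the number of packets it covers. */
struct seg {
    uint32_t pcount;
    uint8_t flags;
    struct seg *next;
};

/* Hypothetical connection state with cached counters. */
struct conn {
    struct seg *head;     /* NULL when the queue is empty */
    uint32_t packets_out; /* cached: packets outstanding */
    uint32_t lost_out;    /* cached: packets marked lost */
};

/* Expensive part: recount lost packets by walking the whole queue. */
uint32_t recount_lost(const struct conn *c)
{
    uint32_t lost = 0;

    assert(c != NULL);
    for (const struct seg *s = c->head; s != NULL; s = s->next)
        if (s->flags & SEG_LOST)
            lost += s->pcount;
    return lost;
}

/* Returns 1 if the cached counter matches the queue contents and
 * does not exceed packets_out, 0 otherwise. */
int verify_lost(const struct conn *c)
{
    return recount_lost(c) == c->lost_out &&
           c->lost_out <= c->packets_out;
}
```

Calling such a check after every operation that touches the counters costs a full queue walk each time, which is why it only makes sense as a temporary debugging aid. The first call that fails points closest to the code path that introduced the miscount.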
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

This patch looks correct to me, so I added it to net-2.6.

Chazarain, please let us know if it does indeed cure your problem. Thanks. -
On Tue, 13 Nov 2007, David Miller wrote: > From: "Ilpo_J
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi> Applied. Thanks for making such an incredibly thorough investigation into this bug! -
On Wed, 14 Nov 2007, David Miller wrote: > From: "Ilpo_J
Unfortunately, I couldn't manage to reproduce the problem with an unpatched kernel. But your investigation, Ilpo, was really impressive.

BTW, even though I messed up the yahoo webmail configuration, you can call me by my first name: Guillaume ;-)

Thanks again for such an awesome bug fixing attitude!

-- Guillaume -
These are usually very sensitive to other traffic because even a simple change in packet pattern changes behavior enough for it to disappear. The same thing occurred with the fackets_out miscount a month ago as well, on a different weekday it just wasn't reproducible.

...Anyway, I'm pretty sure it's now fixed because there's a simple explanation for it due to the frto_highmark premature clearing bug. But if you would still

The best thing is that usually when forced to really think about what could go wrong, other, unrelated bugs seem to come up as well, though only up to 10% of the initial oh-nos end up being genuine bugs. ...Thus I still have a couple of miscount-due-to-GSO&hints fixes to do as a result of this venture besides the problems already fixed.

-- i. -
