Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

Previous thread: [PATCH 1/1] Char: vt, make sysfs operations atomic by Jiri Slaby on Monday, May 26, 2008 - 4:53 am. (6 messages)

Next thread: [PATCH take 2] UBIFS - new flash file system by Artem Bityutskiy on Monday, May 26, 2008 - 7:03 am. (2 messages)
From: Ingo Molnar
Date: Monday, May 26, 2008 - 4:56 am

in an overnight -tip testruns that is based on recent -git i got two 
stuck TCP connections:

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State      
tcp        0 174592 10.0.1.14:58015             10.0.1.14:3632              ESTABLISHED 
tcp    72134      0 10.0.1.14:3632              10.0.1.14:58015             ESTABLISHED 

on a previously reliable machine. That connection has been stuck for 9 
hours so it does not time out, etc. - and the distcc run that goes over 
that connection is stuck as well.

kernel config is attached.

in terms of debugging there's not much i can do i'm afraid. It's not 
possible to get a tcpdump of this incident, given the extreme amount of 
load these testboxes handle. This problem started sometime around rc3 
and it occured on two boxes (on a laptop and on a desktop), both are SMP 
Core2Duo based systems. I never saw this problem before on thousands of 
similar bootups, so i'm 99.9% sure the bug is either new or became 
easier to trigger.

It's not possible to bisect it as it needs up to 12 hours of heavy 
workload to trigger. The incident happened about 5 times since the first 
incident a couple of days ago - 4 times on one box and once on another 
box. The first failing head i became aware of was 78b58e549a3098. (-tip 
has other changes beyond -git but changes nothing in networking.)

One clue (which might or might not matter) is that distcc is one of the 
very few applications that makes use of sendfile().

	Ingo

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.26-rc3
# Sat May 24 04:28:24 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_DEFCONFIG_LIST="arch/x86/configs/x86_64_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not ...
From: Ilpo Järvinen
Date: Monday, May 26, 2008 - 6:28 am

...but you can still tcpdump that particular flow once the situation is 
discovered to see if TCP still tries to do something, no? One needs to 
tcpdump couple of minutes at minimum. Also please get /proc/net/tcp for 

Can you please try with /proc/sys/net/ipv4/tcp_frto set to zero though 
recv-q symptom seems weird would it be related to that (but there were 
some recent fixes to FRTO and retrans_stamp change could have some 
significance here)?

Other than that, nothing since -rc1 seems suspicious to me (though
I hardly understand every part of networking).


-- 
 i.
--

From: Ingo Molnar
Date: Monday, May 26, 2008 - 6:59 am

ok, i will first wait for it to trigger on a box and will do the tcpdump 
session (and /proc/net/tcp output), then i'll continue the tests with 
this done in the rc.local:

  echo 0 > /proc/sys/net/ipv4/tcp_frto

and will see whether the hung connections still occur. The cycle of 
testing will be very slow i suspect.

	Ingo
--

From: Ingo Molnar
Date: Monday, May 26, 2008 - 7:12 am

i got lucky ... the bug just reproduced again - find the netstat and 
/proc/net/tcp output below. New config attached.

i also ran this for 15 minutes:

  [root@europe ~]# tcpdump src and dst 10.0.1.15 -n
  tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
  listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes

but there was no output at all - i.e. TCP has not attempted to do 
anything to these connections.

there's a hung distcc task on the system, waiting for socket action 
forever:

  [root@europe ~]# strace -fp 19578
  Process 19578 attached - interrupt to quit
  select(5, NULL, [4], [4], {82, 90000} <unfinished ...>

disturbing that task via strace did not change the state of the socket - 
and that's not unexpected as it's a select(). [TCP state might be 
affected if strace impacted a recvmsg or a sendmsg wait directly.]

i'll keep this system in its hung state for an hour, just in case you 
can think of anything else to extract out of it.

	Ingo

------------->
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State      
tcp        0      0 10.0.1.15:22                10.0.1.16:42843             ESTABLISHED 
tcp        0 207232 10.0.1.15:37198             10.0.1.15:3632              ESTABLISHED 
tcp        0      0 10.0.1.15:22                10.0.1.16:36284             ESTABLISHED 
tcp    72283      0 10.0.1.15:3632              10.0.1.15:37198             ESTABLISHED 
Tcp:
    1931 active connections openings
    149 passive connection openings
    0 failed connection attempts
    0 connection resets received
    4 connections established
    436004 segments received
    98819 segments send out
    4 segments retransmited
    0 bad segments received.
    0 resets sent
UdpLite:
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                     
   0: ...
From: Ingo Molnar
Date: Monday, May 26, 2008 - 7:17 am

... which is not surprising as it traced eth0 ;-) [10.0.1.15 is the IP 
of the box and routes back to loopback]

with the proper tcpdump it gave:

[root@europe ~]# tcpdump -i lo
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 96 bytes
16:15:05.260000 IP europe.37198 > europe.distcc: . ack 2022529163 win 257 <nop,nop,timestamp 209526 197526>
16:15:05.260000 IP europe.distcc > europe.37198: . ack 1 win 0 <nop,nop,timestamp 209526 7020>

hm, a window of 257 and zero, is that right?

	Ingo
--

From: Ingo Molnar
Date: Monday, May 26, 2008 - 7:29 am

there's TCP ACK activity every 120 seconds:

 16:19:05.260000 IP europe.37198 > europe.distcc: . ack 2022529163 win 257
  <nop,nop,timestamp 233526 221526>
 16:19:05.260000 IP europe.distcc > europe.37198: . ack 1          win 0 
  <nop,nop,timestamp 233526 7020>
 16:21:05.260000 IP europe.37198 > europe.distcc: . ack 1          win 257
  <nop,nop,timestamp 245526 233526>
 16:21:05.260000 IP europe.distcc > europe.37198: . ack 1          win 0
  <nop,nop,timestamp 245526 7020>
 16:23:05.260000 IP europe.37198 > europe.distcc: . ack 1          win 257
  <nop,nop,timestamp 257526 245526>
 16:23:05.260000 IP europe.distcc > europe.37198: . ack 1          win 0
  <nop,nop,timestamp 257526 7020>
 16:25:05.260000 IP europe.37198 > europe.distcc: . ack 1          win 257 
  <nop,nop,timestamp 269526 257526>
 16:25:05.260000 IP europe.distcc > europe.37198: . ack 1          win 0 
  <nop,nop,timestamp 269526 7020>
 16:27:05.260000 IP europe.37198 > europe.distcc: . ack 1          win 257 
  <nop,nop,timestamp 281526 269526>
 16:27:05.260000 IP europe.distcc > europe.37198: . ack 1          win 0 
  <nop,nop,timestamp 281526 7020>

The socket on the other side of the localhost connection, 
10.0.1.15:3632, produces no packets.

a detail i forgot to mention: i reproduced the same TCP hang on a 32-bit 
system and on a 64-bit system as well.

	Ingo
--

From: Ilpo Järvinen
Date: Monday, May 26, 2008 - 7:43 am

Thanks, it's much easier to diagnose right with symptoms that are based

...Yes, though that 257 is lacks the effect of window scaling (and it's 
mostly irrelevant anyway since the problem occurs in opposite dir). The 
sender is using just window probes (/proc/net/tcp shows that too), that's 
why you see them occassionally. So the main problem seems to be that 
receiver doesn't read, rest is just a consequence of that (from TCP POV). 
I doubt that tcp_frto will affect to that (and I read what a stale 
retrans_stamp could do, it would hard cause any troubles, and those minor 
things wouldn't realize in your setup anyway).

I'll take a look into pre -rc1 changes now but it may well be beoynd
what I'm able to figure out.

-- 
 i.
--

From: Ilpo Järvinen
Date: Monday, May 26, 2008 - 7:58 am

Hmm, readfds is NULL isn't it?!? Are you sure you straced the right 

I fail to understand this paragraph due to excessive negation... :-)



-- 
 i.
--

From: Ingo Molnar
Date: Monday, May 26, 2008 - 9:23 am

i mean, sometimes a TCP connection can get 'unstuck' if you strace a 
task - that is because the TCP related syscall the task sits in gets 
interrupted. But in this case it's select() which doesnt explicitly take 
the socket, doesnt do any tcp_push_pending_frames() processing, etc. - 
it just its on the socket waitqueue AFAICS. And that's expected.

	Ingo
--

From: Ilpo Järvinen
Date: Monday, May 26, 2008 - 9:32 am

> * Ilpo J
From: Ingo Molnar
Date: Monday, May 26, 2008 - 9:54 am

it wasnt the receiving process. There's no receiving process - which is 
weird:

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
tcp        0 207232 europe:37198                europe:distcc               ESTABLISHED 19578/distcc        
tcp        0      0 europe:ssh                  dione:36284                 ESTABLISHED -                   
tcp        0      0 europe:ssh                  e2:45910                    ESTABLISHED -                   

i just gave it as a general example of why sometimes stracing a task can 
'disturb' the observed system and can kick the TCP state machine out of 

the socket does not seem to be owned. It should have closed down? 
Refcounting issue?

find below the sysrq-t dump.

	Ingo

-------------------------->
[11811.540000] SysRq : Show State
[11811.540000]   task                PC stack   pid father
[11811.540000] init          S f7c67b04     0     1      0
[11811.540000]        f7c70000 00000086 00000002 f7c67b04 f7c67b0c 00000000 00000046 f7ca08b8 
[11811.540000]        f7c67b2c 00119272 00000002 80130c53 00000000 01ca0000 f7c67b08 f7c70164 
[11811.540000]        82cbb080 0011907f 00000000 00000000 00000000 00000000 000000ff f7c67b2c 
[11811.540000] Call Trace:
[11811.540000]  [<80130c53>] internal_add_timer+0x103/0x110
[11811.540000]  [<806ab7da>] schedule_timeout+0x4a/0xd0
[11811.540000]  [<801310e0>] process_timeout+0x0/0x40
[11811.540000]  [<806ab7d5>] schedule_timeout+0x45/0xd0
[11811.540000]  [<8019868c>] do_select+0x3dc/0x500
[11811.540000]  [<801982b0>] do_select+0x0/0x500
[11811.540000]  [<80198d60>] __pollwait+0x0/0xe0
[11811.540000]  [<80120070>] default_wake_function+0x0/0x10
[11811.540000]  [<801852b6>] check_bytes_and_report+0x26/0xd0
[11811.540000]  [<80184f6e>] slab_pad_check+0x7e/0x100
[11811.540000]  [<80185695>] check_object+0xe5/0x1f0
[11811.540000]  [<80185e08>] __slab_free+0x238/0x2e0
[11811.540000]  ...
From: Ilpo Järvinen
Date: Monday, May 26, 2008 - 10:08 am

> * Ilpo J
From: Ingo Molnar
Date: Monday, May 26, 2008 - 11:12 am

yes, that would match the symptoms i think. I half-assumed that it's a 
state machine problem so i didnt even check what the reader does - and 
in this case it appears to not exist at all anymore ;-)

	Ingo
--

From: Ingo Molnar
Date: Monday, May 26, 2008 - 1:41 pm

after ~7 hours of uptime the networking code produced this assertion:

 [25441.140000] KERNEL: assertion (!sk->sk_ack_backlog) failed at 
 net/ipv4/inet_connection_sock.c (642)

no further information in the logs. I kept the failing system booted up 
for 8 hours and will reboot it now.

	Ingo
--

From: Ilpo Järvinen
Date: Monday, May 26, 2008 - 2:20 pm

...Added Pavel & Denis.


I keep wondering why distcc in the first place has a connection orphaned
before it has read all data from it, I guess it should not yet be
speculatively executing anything or so :-). Anyway, there should be RST
due to that data because nobody is going read it after the process in gone 
but since state is still ESTABLISHED, I doubt that tcp_close() was ever 

...I checked /proc/net/tcp output you gave earlier, and indeed, it shows 



-- 
 i.
--

From: Ray Lee
Date: Friday, May 30, 2008 - 9:23 am

On Mon, May 26, 2008 at 6:28 AM, Ilpo Järvinen

Okay, but in some sense you've already bisected this somewhat. I'm
assuming that your testing uses the latest tip and is refreshed daily.

If that's the case, then I would (possibly naively) expect the culprit
to show up in a:
  git log -p v2.6.26-rc1..78b58e549a3098
net/{compat.c,core,ipv4,netfilter,packet,sched,socket.c}

There are only a few commits in there that appear to touch network behavior:

  79d44516b4b178ffb6e2159c75584cfcfc097914
  a1c1f281b84a751fdb5ff919da3b09df7297619f
  62ab22278308a40bcb7f4079e9719ab8b7fe11b5

Reverting just those three and running overnight might provide a clue.
OTOH, I'm in no way a net/ expert, so if you are already working a
debugging strategy then feel free to ignore this. I'm only piping up
From: Arjan van de Ven
Date: Monday, May 26, 2008 - 9:24 am

On Mon, 26 May 2008 13:56:28 +0200

interestingly enough, I think I'm seeing something very similar to this
with the claws email client; as of 2.6.26-rc3 (sorry, didn't run
earlier ones because they broke my wireless) claws gets "stuck" once in
a while, really stuck, when talking to the imap server.....
It seems to happen a little less frequently when running under strace,
but it still happens, so that could just be a timing thing.

--

From: Peter Zijlstra
Date: Wednesday, May 28, 2008 - 2:27 am

Just a quick note to say, me too!!

same scenario: distcc on localhost.

--

From: Håkon Løvdal
Date: Saturday, May 31, 2008 - 7:25 am

Me too, however with a completely different scenario; my hung connections
are not related to distcc at all. The output from /proc/net/tcp that Ingo
posted a few days ago are somewhat different from mine, however I believe
this is the same problem or at least related. Just as Ingo experienced,
netstat -p only shows PID/program as '-' for the hung connections while
for other connections it shows the expected results.

I have recently bought a new PC and have started the process of copying
stuff from my old PC to the new PC. During this I have experienced this
hang several times. I started copying by using tar on both ends over a ssh
pipe but in order to eliminate possible ssh problems I also have tried tar
over a ttcp connection which also fails. There is no obvious pattern of
when this happens, I have experienced failures after transferring 1.15GB,
51.4GB and 23.6GB.

Here is the output from netstat -n -o filtered for port 22 and slightly
edited. All the lines started with Proto == tcp and Recv-Q == 0.

Send-Q Local Addr Foreign Addr  State       Timer
     0 old_pc:22  new_pc:52667  ESTABLISHED keepalive (3513.93/0/0)
     0 old_pc:22  new_pc:43825  ESTABLISHED keepalive (5467.38/0/0)
  2896 old_pc:22  new_pc:58601  ESTABLISHED on (21020884.65/0/0)
  4344 old_pc:22  new_pc:54105  ESTABLISHED on (21017016.33/0/0)
  2896 old_pc:22  new_pc:34149  ESTABLISHED on (20986889.24/0/0)

The first two connections are ongoing, working, interactive ssh
connections. The other three connections died days ago on my new PC.

One thing that caught my eyes was these very high timer values.
Checking the netstat source reveals that the value printed is "(double)
time_len / HZ" and that time_len is extracted from /proc/net/tcp. While
my CONFIG_HZ is 1000, I assume netstat has picked up HZ as 100 from
/usr/include/asm/param.h, and then things really seems to imply that
there is some integer overflow since 2^31 = 2147483648.

Looking into get_tcp4_sock in net/ipv4/tcp_ipv4.c I see that timer_expires
is initialized with ...
From: Ilpo Järvinen
Date: Saturday, May 31, 2008 - 9:09 am

Thanks for reporting!

On Sat, 31 May 2008, H
From: Ilpo Järvinen
Date: Saturday, May 31, 2008 - 10:22 am

On Sat, 31 May 2008, Ilpo J
From: Håkon Løvdal
Date: Saturday, May 31, 2008 - 10:58 am

The ssh connection used for copying (using the command <ssh old_pc "cd
/directory; tar cvf - *" | pv | tar xvf ->) died in the following way:
   ...
   Read from remote host old_pc: Connection timed out
   51.4GB 4:26:19 [3.29MB/s] [<=>                                   ]
   tar: Unexpected EOF in archive
   tar: Unexpected EOF in archive
   tar: Error is not recoverable: exiting now

and there are currently no traces of those ssh processes any longer on
the new PC, only these two active ssh interactive connections are present:

new_pc>ps -elf | sed -n '1p;/ssh/p'
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S hlovdal   1310  1195  0  80   0 - 14139 select 09:28 pts/14   00:00:00 ssh old_pc
5 S root      2477     1  0  80   0 - 15203 -      May26 ?        00:00:00 /usr/sbin/sshd
1 S hlovdal   2727  2701  0  80   0 - 12466 select May26 ?        00:00:00 /usr/bin/ssh-agent /bin/sh -c exec -l /bin/bash -c "/usr/bin/startkde"
0 S hlovdal  26436 19976  0  80   0 - 20375 pipe_w 19:43 pts/9    00:00:00 sed -n 1p;/ssh/p
0 S hlovdal  30065 18970  0  80   0 - 14174 select May28 pts/6    00:00:00 ssh old_pc
new_pc>

On the old PC there are three corresponding zombies:

old_pc>ps -elf | sed -n '1p;/ssh/p'
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
5 S root      4402     1  0  80   0 -  1600 -      May25 ?        00:00:02 /usr/sbin/sshd
1 S root      5418     1  0  80   0 -  1347 -      May25 ?        00:00:00 ssh-agent
1 S hlovdal   5752     1  0  80   0 -  1347 -      May25 ?        00:00:00 ssh-agent
4 S root      6824  4402  0  80   0 -  2315 -      09:28 ?        00:00:00 sshd: hlovdal [priv]
5 S hlovdal   6828  6824  0  80   0 -  2355 -      09:28 ?        00:00:01 sshd: hlovdal@pts/8
0 S hlovdal  17186 25557  0  80   0 -  1191 -      19:45 pts/0    00:00:00 sed -n 1p;/ssh/p
4 D root     24444  4402  0  80   0 -     0 -      May27 ?        00:00:00 [sshd]
5 Z hlovdal  24448 24444  0  80   0 -     0 -      May27 ...
From: Ilpo Järvinen
Date: Saturday, May 31, 2008 - 11:37 am

On Sat, 31 May 2008, H
From: Håkon Løvdal
Date: Saturday, May 31, 2008 - 1:25 pm

This is only from checking after the connection was dead. By the way,
I just had to remotely reboot the new machine because the window
manager locked up, however the old PC are still listing the defunct

I can of course keep on copying for testing purposes, but then I would like to
be able to dump only that single tcp connection, any tips of how to do that?
I found nothing specific in the manuals of wireshark and tcpdump. Of course
it is possible to capture everything and filter afterwards, but since I will be
transferring lots of data the logs will get huge and I would not like to have

I have not done any specific selection myself. On old_pc: bic, new_pc: cubic.

BR Håkon Løvdal
From: Ilpo Järvinen
Date: Saturday, May 31, 2008 - 2:39 pm

On Sat, 31 May 2008, H
From: Håkon Løvdal
Date: Saturday, May 31, 2008 - 2:45 pm

MjAwOC81LzMxIElscG8gSsOkcnZpbmVuIDxpbHBvLmphcnZpbmVuQGhlbHNpbmtpLmZpPjoKPiBP
biBTYXQsIDMxIE1heSAyMDA4LCBIw6Vrb24gTMO4dmRhbCB3cm90ZToKPj4gMjAwOC81LzMxIEls
cG8gSsOkcnZpbmVuIDxpbHBvLmphcnZpbmVuQGhlbHNpbmtpLmZpPjoKPj4gPiBTbyB5b3UgaGFk
IHRoYXQgJy0nIGVhcmxpZXIgYW5kIHlvdSBjaGVja2VkIGF0IHRoYXQgdGltZSBidXQgdGhlCj4+
ID4gY29ubmVjdGlvbiBpcyBub3cgYWxyZWFkeSBkZWFkPwo+Pgo+PiBUaGlzIGlzIG9ubHkgZnJv
bSBjaGVja2luZyBhZnRlciB0aGUgY29ubmVjdGlvbiB3YXMgZGVhZC4KPgo+IENvdWxkIHlvdSBw
bGVhc2UgcmVwaHJhc2UgdGhlIGFuc3dlciwgSSBmYWlsZWQgdG8gdW5kZXJzdGFuZCBpdC4uLiA6
LSkKPiAuLi5Zb3Ugc2FpZCBlYXJsaWVyIHRoYXQgeW91IGhhZCAnLScgb3duZWQgY29ubmVjdGlv
bnMgbGlrZSBJbmdvLCB3aGVuIGRpZAo+IHRoYXQgaGFwcGVuIChub3cgdGhlIGNvbm5lY3Rpb25z
IHdvbid0IGV4aXN0cyBhbnltb3JlLCBzbyBhdCB3aGF0IHBvaW50IG9mCj4gdGltZSB5b3Ugc2F3
IHRob3NlIG5vbi1vd25lZCBjb25uZWN0aW9ucyk/CgpXaGF0IEkgbWVhbiBpcyB0aGF0IEkganVz
dCB0ZXN0ZWQgIm5ldHN0YXQgLXAiIGFuZCBub3RpY2VkIHNpbWlsYXJpdGllcwp3aXRoIEluZ28n
cyByZXN1bHQgdG9kYXkgd2hpbGUgcmVwbHlpbmcgdG8gdGhpcyB0aHJlYWQsIHdoaWNoIGlzIGRh
eXMKYWZ0ZXIgdGhlIGNvbm5lY3Rpb25zIGVuZGVkLgoKQlIgSMOla29uIEzDuHZkYWwK
--

From: Håkon Løvdal
Date: Tuesday, June 3, 2008 - 5:10 pm

MjAwOC81LzMxIElscG8gSsOkcnZpbmVuIDxpbHBvLmphcnZpbmVuQGhlbHNpbmtpLmZpPjoKPiBJ
IGRpZG4ndCByZWFsbHkgbWVhbiB0Y3BkdW1wLCBJIHdhcyBtb3JlIHRoaW5raW5nIG9mIHN5c2Nh
bGwgd2hhdCBpcyB0aGUKPiBzeXNjYWxsIHdoZXJlIHRoZSBwcm9jZXNzIGlzIHdhaXRpbmcuIFRo
b3VnaCB0Y3BkdW1wIG1pZ2h0IHJldmVhbAo+IHNvbWV0aGluZyBhcyB3ZWxsIGFib3V0IHRoZSBi
ZWhhdmlvciB3aGVuIG5lYXJpbmcgdGhlIHByb2JsZW0sCgo+IEhtbSwgcGVyaGFwcyBwZXJpb2Rp
Y2FsbHkgY2hlY2tpbmcgL3Byb2MvbmV0L3RjcCAoZS5nLiwgb25jZSBwZXIgMTBzKSBpZgo+IHRo
ZSB0aW1lb3V0IGlzIGxhcmdlciB0aGFuIFRDUF9SVE9fTUFYIG1pZ2h0IGFsbG93IHNvbWUgc2Ny
aXB0IHRvCj4gaW1tZWRpYXRlbHkgbm90aWNlIHdoZW4gdGhpbmdzIGJyb2tlIHdoaWxlIHJlcHJv
ZHVjaW5nIGl0LiBTdG9yaW5nIGFsbAo+IHRob3NlIG9uY2UgcGVyIDEwcyB2YWx1ZXMgc2hvdWxk
bid0IGJlIGEgdG9vIGJpZyBlaXRoZXIsIGl0IGNvdWxkIGV2ZW4gYmUKPiBkb25lIGluIGJvdGgg
ZW5kcyBmb3IgYSBzaW5nbGUgZmxvdyAoYnV0IEknbGwgbGVhdmUgYSBzY3JpcHQgdG8gZG8gdGhh
dCBvbgo+IE1vbmRheSkuCgpIaS4gSSBoYXZlIG5vdyB0ZXN0ZWQgYW5kIHRha2VuIHZhcmlvdXMg
bG9ncyBiZWZvcmUvdW5kZXIvYWZ0ZXIgKHBzLCBuZXRzdGF0LAp0Y3BkdW1wLCBzdHJhY2UsIC9w
cm9jL25ldC90Y3ApLiBUaGUgbG9ncyBkaWQgbm90IGdyb3cgYXMgYmlnIGFzIEkgZmVhcmVkCmhv
d2V2ZXIgdGhleSBhcmUgd2F5IHRvbyBiaWcgYXR0YWNoIGhlcmUuIEFueSB0aXBzIG9uIGhvdyB0
byBkaXN0cmlidXRlPwoKSSBtZW50aW9uZWQgYmVmb3JlIHRoYXQgdGhlIHNzaCBjb25uZWN0aW9u
IGRpZWQgb24gdGhlIG5ldyBQQywgYnV0IHNvbWV0aW1lcwppdCBqdXN0IGhhbmdzIGFzIHdlbGws
IGkuZS4gdGhlIHNzaCBjb25uZWN0aW9uIGlzIGFsaXZlIGJ1dCB0aGVyZSBpcwpubyBkYXRhIHRy
YW5zZmVycmVkLgpUaGUgbGFzdCB0ZXN0IEkgZGlkIGhhbmcgYWZ0ZXIganVzdCAyOTJNQiB0cmFu
c2ZlcnJlZC4gSW4gdGhpcyBjYXNlIHB2Cmp1c3QgaGFuZ3MKb24gdGhlIG5ldyBQQywgd2FpdGlu
ZyBmb3IgZGF0YSB0aGF0IG5ldmVyIGFycml2ZXMuIE9uIHRoZSBzZW5kaW5nCnNpZGUgdGFyIHdo
aWNoCmlzIHByb2R1Y2luZyBkYXRhIGhhdmUgZGllZCBhbmQgdGhlIGNvcnJlc3BvbmRpbmcgc3No
IGlzIGxvY2tlZCBpbiBzb21lCnVuaW50ZXJydXB0aWJsZSBzbGVlcDoKCi8vbmV0c3RhdCBiZWZv
cmUgdGhlIGhhbmc6ClByb3RvIFJlY3YtUSBTZW5kLVEgTG9jYWwgQWRkcmVzcyAgICAgICAgICAg
ICAgIEZvcmVpZ24gQWRkcmVzcwogICAgIFN0YXRlICAgICAgIFBJRC9Qcm9ncmFtIG5hbWUgICAg
VGltZXIKdGNwICAgICAgICAwICAgIDUwNCBuZXdfcGM6MjIgICAgICAgICAgICAgICAgICAgb2xk
X3BjOjM5NDY0CiAg ...
From: Ilpo Järvinen
Date: Wednesday, June 4, 2008 - 4:14 am

On Wed, 4 Jun 2008, H
From: Håkon Løvdal
Date: Wednesday, June 4, 2008 - 7:00 am

MjAwOC82LzQgSWxwbyBKw6RydmluZW4gPGlscG8uamFydmluZW5AaGVsc2lua2kuZmk+Ogo+IE9u
IFdlZCwgNCBKdW4gMjAwOCwgSMOla29uIEzDuHZkYWwgd3JvdGU6Cj4KPj4gSGkuIEkgaGF2ZSBu
b3cgdGVzdGVkIGFuZCB0YWtlbiB2YXJpb3VzIGxvZ3MgYmVmb3JlL3VuZGVyL2FmdGVyIChwcywK
Pj4gbmV0c3RhdCwgdGNwZHVtcCwgc3RyYWNlLCAvcHJvYy9uZXQvdGNwKS4gVGhlIGxvZ3MgZGlk
IG5vdCBncm93IGFzIGJpZwo+PiBhcyBJIGZlYXJlZCBob3dldmVyIHRoZXkgYXJlIHdheSB0b28g
YmlnIGF0dGFjaCBoZXJlLiBBbnkgdGlwcyBvbiBob3cgdG8KPj4gZGlzdHJpYnV0ZT8KPgo+IFVz
dWFsbHkgc3VjaCBsb2dzIGNvbXByZXNzIHF1aXRlIHdlbGwuIEEgYmVzdCB0aGluZyBtaWdodCBi
ZSB0byBwdXQKPiBzb21ld2hlcmUgKGlmIHBvc3NpYmxlKSBhbmQgaHR0cDovLyByZWZlcmVuY2Ug
dGhlbS4KClRoZSBsYXJnZXN0IGZpbGUgd2FzIGEgc3RyYWNlIGxvZyBvZiAxLjVHQiB1bmNvbXBy
ZXNzZWQuIEkgd2lsbCB0cnkgdG8Kc29ydCBvdXQgc29tZXRoaW5nLgoKCj4+IC8vbmV0c3RhdCBi
ZWZvcmUgdGhlIGhhbmc6Cj4+IFByb3RvIFJlY3YtUSBTZW5kLVEgTG9jYWwgQWRkcmVzcyAgICAg
ICAgICAgICAgIEZvcmVpZ24gQWRkcmVzcwo+PiAgICAgIFN0YXRlICAgICAgIFBJRC9Qcm9ncmFt
IG5hbWUgICAgVGltZXIKPj4gdGNwICAgICAgICAwICAgIDUwNCBuZXdfcGM6MjIgICAgICAgICAg
ICAgICAgICAgb2xkX3BjOjM5NDY0Cj4+ICAgICAgRVNUQUJMSVNIRUQgMTcwNTQvc3NoZDogaGxv
dmRhbCBvbiAoMC4yMC8wLzApCj4+Cj4+IC8vIGFmdGVyIHRoZSBoYW5nOgo+PiBQcm90byBSZWN2
LVEgU2VuZC1RIExvY2FsIEFkZHJlc3MgICAgICAgICAgICAgICBGb3JlaWduIEFkZHJlc3MKPj4g
ICAgICBTdGF0ZSAgICAgICBQSUQvUHJvZ3JhbSBuYW1lICAgIFRpbWVyCj4+IHRjcCAgICAgICAg
MCAgIDI4OTYgb2xkX3BjOjIyICAgICAgICAgICAgICAgICAgIG5ld19wYzozOTQ2NAo+PiAgICAg
IEVTVEFCTElTSEVEIC0gICAgICAgICAgICAgICAgICAgb24gKDIxMzI5NDQxLjE5LzAvMCkKPgo+
ID8/PyBBcmUgdGhlc2Ugb24gdGhlIHNhbWUgbWFjaGluZT8gSG93IGNvbWUgdGhlIGlwLWFkZHJl
c3NlcyBnb3QgcmV2ZXJzZWQKPiAod2hpbGUgcG9ydHMgcmVtYWluIHRoZSBzYW1lKS4uLj8hPwoK
T29wcywgdGhhdCB3YXMgYW4gZWRpdGluZyBtaXN0YWtlIG9uIG15IHBhcnQsIG9sZCBhbmQgbmV3
IGFyZSBtaXhlZCB1cAppbiB0aGUgbGFzdCBsaW5lLgoKQlIgSMOla29uIEzDuHZkYWwK
--

From: Ilpo Järvinen
Date: Wednesday, June 4, 2008 - 8:09 am

On Wed, 4 Jun 2008, H
From: Håkon Løvdal
Date: Friday, June 6, 2008 - 2:32 am

T24gMDQvMDYvMjAwOCwgSWxwbyBKw6RydmluZW4gPGlscG8uamFydmluZW5AaGVsc2lua2kuZmk+
IHdyb3RlOgo+IEkgZ3Vlc3Mgbm90IGFsbCBvZiB0aGUgc3RyYWNlIHdpbGwgYmUgbmVjZXNzYXJ5
IGFueXdheS4gWW91IGNvdWxkIGZpcnN0Cj4gIGxlYXZlIHRoYXQgb3V0IGNvbXBsZXRlbHkgYW5k
IGNvbmNlbnRyYXRlIG9uIHRoZSBvdGhlciBsb2dzIGZpcnN0LgoKSSBoYXZlIG5vdyBzZXQgdXAg
YSBzZXJ2ZXIgb24gbXkgbWFjaGluZSBhbmQgbWFkZSBhbGwgbG9ncyBhdmFpbGFibGUuCkkgd2ls
bCBmb2xsb3cgdXAgd2l0aCB0aGUgYWRyZXNzIGluIHByaXZhdGUgZW1haWwgdG8gdGhvc2UgdGhh
dCBhbHJlYWR5CmhhdmUgcG9zdGVkIGluIHRoaXMgdGhyZWFkLiBGb3Igb3RoZXJzLCBwbGVhc2Ug
YXNrIHRvIHJlY2VpdmUgaXQuCgpCUiBIw6Vrb24gTMO4dmRhbAoKUFMKSSB3aWxsIGJlIGF3YXkg
Zm9yIHRoaXMgd2Vla2VuZC4K
--

From: Ilpo Järvinen
Date: Monday, June 9, 2008 - 12:24 pm

On Fri, 6 Jun 2008, H
From: Håkon Løvdal
Date: Tuesday, June 10, 2008 - 4:26 pm

MjAwOC82LzkgSWxwbyBKw6RydmluZW4gPGlscG8uamFydmluZW5AaGVsc2lua2kuZmk+Ogo+IE9u
IEZyaSwgNiBKdW4gMjAwOCwgSMOla29uIEzDuHZkYWwgd3JvdGU6Cj4+IEkgaGF2ZSBub3cgc2V0
IHVwIGEgc2VydmVyIG9uIG15IG1hY2hpbmUgYW5kIG1hZGUgYWxsIGxvZ3MgYXZhaWxhYmxlLgo+
PiBJIHdpbGwgZm9sbG93IHVwIHdpdGggdGhlIGFkcmVzcyBpbiBwcml2YXRlIGVtYWlsIHRvIHRo
b3NlIHRoYXQgYWxyZWFkeQo+PiBoYXZlIHBvc3RlZCBpbiB0aGlzIHRocmVhZC4gRm9yIG90aGVy
cywgcGxlYXNlIGFzayB0byByZWNlaXZlIGl0Lgo+Cj4gQ2FuIHlvdSBwbGVhc2UgZXhwbGFpbiB0
aGlzOgo+Cj4gaWpqYXJ2aW5Ad3JsLTU5On4vdG1wJCBidW56aXAyIC10IGNvcHl0ZXN0MDAzLmNh
cHR1cmUuYnoyCj4gYnVuemlwMjogY29weXRlc3QwMDMuY2FwdHVyZS5iejI6IGRhdGEgaW50ZWdy
aXR5IChDUkMpIGVycm9yIGluIGRhdGEKPgo+IFlvdSBjYW4gdXNlIHRoZSBgYnppcDJyZWNvdmVy
JyBwcm9ncmFtIHRvIGF0dGVtcHQgdG8gcmVjb3Zlcgo+IGRhdGEgZnJvbSB1bmRhbWFnZWQgc2Vj
dGlvbnMgb2YgY29ycnVwdGVkIGZpbGVzLgo+Cj4gaWpqYXJ2aW5Ad3JsLTU5On4vdG1wJCBscyAt
YWwgY29weXRlc3QwMDMuY2FwdHVyZS5iejIKPiAtcnctci0tLS0tIDEgaWpqYXJ2aW4gdGtvbCAy
NzU0NDE2MjAgIDkuIGtlc8ODwqTDgiDDgiAgMTY6MzkKPiBjb3B5dGVzdDAwMy5jYXB0dXJlLmJ6
Mgo+IGlqamFydmluQHdybC01OTp+L3RtcCQgbWQ1c3VtIGNvcHl0ZXN0MDAzLmNhcHR1cmUuYnoy
Cj4gNGJiZjgwMWNmMWZiZDUyMzk5MzIyYjA4OGMxYTIwNTggIGNvcHl0ZXN0MDAzLmNhcHR1cmUu
YnoyCj4gaWpqYXJ2aW5Ad3JsLTU5On4vdG1wJAo+Cj4gLi4uYWxzbyB0aGUgb25lIGZyb20gYWxs
ZmlsZXMuemlwIGlzIHNpbWlsYXJpbHkgZGFtYWdlZC4KPgo+IC4uLklmIHlvdSBkb24ndCBoYXZl
IGFuIGV4cGxhbmF0aW9uLCB0aGVtIGl0IG1pZ2h0IGJlIHRpbWUgdG8gY2hlY2sKPiB5b3VyIGhh
cmR3YXJlLi4uLgoKSSBkbyBub3QgaGF2ZSBhbnkgZ29vZCBleHBsYW5hdGlvbiAoYmFkIG1lbW9y
eT8pLiBXaGVuIGNoZWNraW5nIHRoZQpjb3B5dGVzdDAwMy5jYXB0dXJlLmJ6MiBmaWxlIGluIHRo
ZSBodHRwIHNlcnZlciBzdG9yYWdlIGRpcmVjdG9yeQpJIGdldCB0aGUgc2FtZSByZXN1bHRzIGFz
IHlvdS4gSG93ZXZlciBJIGFsc28gaGF2ZSB0aGUgb3JpZ2luYWwgZmlsZQpzdG9yZWQgc3RpbGwg
YW5kIGNvbXBhcmluZyB0aGVtIHdpdGggY21wIC1sIGdpdmVzCiAyMDU1ODY1NiAxMjQgMTA0CnNv
IG9idmlvdXNseSBhIGJpdCBoYXMgYmVlbiBmbGlwcGVkLiBJIGhhdmUgbm93IHJlcGxhY2VkIHRo
ZQplcnJvbmVvdXMgZmlsZSB3aXRoIGEgZ29vZCBjb3B5LCBzbyB5b3Ugc2hvdWxkIGJlIGFibGUg
dG8gZG93bmxvYWQgYWdhaW4uCgpCUiBIw6Vrb24gTMO4dmRhbAoKQnR3LCBJIGdvdCBhICJ0cmFk
aXRpb25hbCIgaGFu ...
From: Ilpo Järvinen
Date: Wednesday, June 11, 2008 - 6:39 am

On Wed, 11 Jun 2008, H
From: Håkon Løvdal
Date: Wednesday, June 18, 2008 - 5:30 pm

Hi. I tested the old pc with memtest86 and memtest86+ for 16 and 50
hours without any errors detected, so apparently the ram is ok.

I now just did another test from a freshley booted old pc, standing
idle for around 2 hours before I started the testing (no network activity
before I plugged in the cable just before starting testing).

The copying failed after just 2.62GB, so the logs are short and
nice. Stored on same server as earlier, new 2008-06-19 directory. One
thing I noticed in the dmesg output on the old pc was the following:

BUG: unable to handle kernel paging request at virtual address ffff3cff
printing eip: ffff3cff *pde = 00793067 *pte = 00000000
Oops: 0000 [#1] PREEMPT
Modules linked in: ...
Pid: 18103, comm: sshd Not tainted (2.6.24-gafd564a8 #4)
...
Call Trace:
 [<f8c45b39>] nf_conntrack_find_get+0x36/0x3a [nf_conntrack]
 [<f8c46379>] nf_conntrack_in+0x135/0x3e8 [nf_conntrack]
 [<c0423c5d>] local_bh_disable+0xa/0xb
 [<c0626d54>] _read_lock_bh+0x12/0x1a
...

BR Håkon Løvdal
From: Ingo Molnar
Date: Thursday, May 29, 2008 - 1:45 am

update: in the past 5 days of -tip testing i've gathered about 10 
randconfig kernel configs that all produced such failures.

Since the bug itself is very elusive (it takes up to 50 boot + 
kernel-rebuild-via-distccc iterations to trigger) bisection was still 
not an option - but with 10 configs statistical analysis of the configs 
is now possible.

I made a histogram of all kernel options present in those configs, and 
one networking related kernel option stood out:

      5 CONFIG_TCP_CONG_ADVANCED=y
      6 CONFIG_INET_TCP_DIAG=y
      6 CONFIG_TCP_MD5SIG=y
      9 CONFIG_TCP_CONG_CUBIC=y


the likelyhood of CONFIG_TCP_CONG_CUBIC=y being enabled in my randconfig 
runs is 75%. The likelyhood of CONFIG_TCP_CONG_CUBIC=y being enabled in 
10 configs in a row is 0.75^10, or 5.6%. So statistical analysis can say 
it with a 95% confidence that the presence of this option correlates to 
the hung sockets.

i have started testing this theory now, via the patch below, which turns 
off TCP_CONG_CUBIC. It will take about 50 bootups on the affected 
testsystems to confirm. (it will take a couple of hours today as not all 
testsystems show these hung socket symptoms)

distributions enable TCP_CONG_CUBIC by default:

  $ grep CUBIC /boot/config-2.6.24.7-92.fc8
  CONFIG_TCP_CONG_CUBIC=y
  CONFIG_DEFAULT_CUBIC=y

which would explain why Arjan and Peter triggered similar hangs as well.

	Ingo

---------------------->
Subject: qa: no TCP_CONG_CUBIC
From: Ingo Molnar <mingo@elte.hu>
Date: Thu May 29 09:45:51 CEST 2008

---
 net/ipv4/Kconfig |    4 ++++
 1 file changed, 4 insertions(+)

Index: tip/net/ipv4/Kconfig
===================================================================
--- tip.orig/net/ipv4/Kconfig
+++ tip/net/ipv4/Kconfig
@@ -454,6 +454,8 @@ config TCP_CONG_BIC
 config TCP_CONG_CUBIC
 	tristate "CUBIC TCP"
 	default y
+	depends on BROKEN_BOOT_ALLOWED
+	select BROKEN_BOOT
 	---help---
 	This is version 2.0 of BIC-TCP which uses a cubic growth function
 ...
From: Ilpo Järvinen
Date: Thursday, May 29, 2008 - 4:14 am

...I tried yesterday some accept (& read some) & close/exit type 
stressing but I couldn't get it to show up (though I'll try longer 

Do I understand you correctly... it doesn't explain the tenth case out 

Main problem with this explanation is that congestion control modules are 
only in use when TCP is in ESTABLISHED and transmitting normally, while it 
has nothing to do how we enter or leave ESTABLISHED.

But if it's really that the process who owned the connection already went 
away, I think we should end up into tcp_close() which changes the state 
from established (and send RST too if there's still data to be received, 
which would be picked up by the other end and that end would no longer 
keep established either). ...A failure to send the reset would show up in 
LINUX_MIB_TCPABORTFAILED. Because both ends remain in established, it kind 
of excludes the possibility that something would have accidently allowed 
the Recv-Q end to return back to established too (due to some bug).

To me there are mainly two weird things:
1) Why we see orphaning with data in the first place (I think distcc would 
be interested to read everything, unless some worker crashed in early...
Though some timeout in distcc could explain it as well but I don't know 
too well how distcc does everything)...
2) Why the connection is still in ESTABLISHED when it was orphaned and 
it has some data to receive... If it had some unread data, should be in 
CLOSE or in FIN_WAIT1 otherwise.

-- 
 i.
--

From: Ingo Molnar
Date: Thursday, May 29, 2008 - 4:22 am

after about 50 bootups i got a hung test again:

titan:~/tip> netstat -nt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             
State
tcp        0      0 10.0.1.14:22                10.0.1.16:58062  ESTABLISHED
tcp        0      0 10.0.1.14:22                10.0.1.16:60109  ESTABLISHED
tcp        0  86368 10.0.1.14:43914             10.0.1.16:3632   ESTABLISHED

and this time with CUBIC_TCP disabled - so that was a red herring.

	Ingo
--

From: Evgeniy Polyakov
Date: Thursday, May 29, 2008 - 6:05 am

Hi.


Are namespaces enabled? It is pretty non-trivial to make established
socket without process from userspace :)

-- 
	Evgeniy Polyakov
--

From: Ingo Molnar
Date: Thursday, May 29, 2008 - 6:43 am

they are enabled in this particular kernel config, but there were a 
couple of other failures where they were not enabled.

	Ingo
--

From: Ingo Molnar
Date: Thursday, May 29, 2008 - 6:08 am

btw., i now also have a hung socket over real network:

titan:~> netstat -nt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0  86752 10.0.1.14:49921             10.0.1.16:3632              ESTABLISHED

(the scenario is similar to the incident i've analyzed before, so i wont 
repeat the tcpdump, etc.)

	Ingo
--

From: Ilpo Järvinen
Date: Thursday, May 29, 2008 - 6:48 am

...Yes, also the previous one you quoted was already over the network.
But I think the key is in the receiving end rather than in the sending 
end. Could try to extract any information about these I mentioned earlier,
to me there are mainly two weird things:

1) Why we see orphaning with data in the first place. I think distcc would 
be interested to read everything, unless some worker crashed in early...
Though some timeout in distcc could explain it as well but I don't know 
too well how distcc does everything...

2) Why the receiving end of the connection is still in ESTABLISHED 
without an owner... If it had some unread data, should be in CLOSE or in 
FIN_WAIT1 otherwise. Ie., tcp_close() would change the state of the flow.

These both basically happen at the receiving end, though if there was 
unread data the RST would also reflected state change to the sending side 
and since there's window, also FIN would be sent right away and 
cause sender to leave ESTABLISHED rather than that FIN would get stuck 
into the tail of the TCP queue.

How to collect the information, I'm not too sure, tcp_close might well 
work as a plain printk because I think it shouldn't be that noisy. But 
that would likely just show that it wasn't called for a stuck receiver
at all, so it would probably end up being a dead-end anyway.

-- 
 i.
--

From: Ingo Molnar
Date: Friday, May 30, 2008 - 4:09 am

last night i turned off distcc support, and got about 200 successful 
bootups and zero TCP hangs (as expected - there's not much TCP traffic 
if the distcc cluster is not utilized).

but that's 200 overnight tests instead of the expected 600, so this is a 
major and rather crippling bug to me.

There's no good way to detect these hung sockets by me from userspace 
and get rid of them. Has anyone before thought of the obvious: to write 
a kernel-space "TCP socket watchdog" kernel feature that detects them 
and tries to free them so that people can become aware of it?

Hung sockets is a re-occuring bug in the TCP stack after all. (and it's 
a natural property of it: state machine designs are always vulnerable to 
lost event problems.)

	Ingo
--

From: Ilpo Järvinen
Date: Friday, May 30, 2008 - 2:12 pm

...It is quite easy to kill a TCP flow with enough information (seqnos 
which one could collect e.g., from a tcpdump), just a RST is necessary,
so user-space could do that too.

-- 
 i.
--

From: Ingo Molnar
Date: Friday, May 30, 2008 - 11:18 am

ah, in retrospect i realized that this test had one flaw: some of the 
systems i the build cluster already ran a newer kernel and hence were 
targets for this bug.

so i turned off CONFIG_TCP_CONG_CUBIC on all the testboxes and rebooted 
the cluster boxes into 2.6.25, and the hung sockets are now gone. (about 
150 successful iterations)

i did another change as well: i removed the localhost distcc component. 
I'll reinstate that now to make sure it's really related to 
TCP_CONG_CUBIC and not to localhost networking.

	Ingo
--

From: Ingo Molnar
Date: Friday, May 30, 2008 - 11:09 pm

ok, once i added back the localhost distcc component and the hung kernel 
build + stuck TCP socket bug happened again overnight:

 Active Internet connections (w/o servers)
 Proto Recv-Q Send-Q Local Address               Foreign Address             State
 tcp    72187      0 10.0.1.14:3632              10.0.1.14:47910             ESTABLISHED
 tcp        0 174464 10.0.1.14:47910             10.0.1.14:3632              ESTABLISHED

so it seems distcc over localhost was the aspect that made it fail.

_Perhaps_ what matters is to have the new post-rc3 TCP code on _both_ 
sides of the connection. But that is just a theory - it could be timing, 
etc.

	Ingo
--

From: Ilpo Järvinen
Date: Saturday, May 31, 2008 - 4:46 am

Btw, does your distcc perhaps happen enable TCP_DEFER_ACCEPT (there were
some post 2.6.25 changes into it)?

-- 
 i.
--

From: Ilpo Järvinen
Date: Saturday, May 31, 2008 - 5:18 am

On Sat, 31 May 2008, Ilpo J
From: Ingo Molnar
Date: Saturday, May 31, 2008 - 5:54 am

what is the easiest way to figure out whether my version of distcc 
enables TCP_DEFER_ACCEPT?

	Ingo
--

From: Ilpo Järvinen
Date: Saturday, May 31, 2008 - 5:58 am

> * Ilpo J
From: Ingo Molnar
Date: Saturday, May 31, 2008 - 9:35 am

seems to be used:

 22003 write(3, "distccd[22003] (dcc_listen_by_ad"..., 62) = 62
 22003 listen(4, 10)                     = 0
 22003 setsockopt(4, SOL_TCP, TCP_DEFER_ACCEPT, [1], 4) = 0

i'll queue up your reverts for testing in -tip.

	Ingo
--

From: Patrick McManus
Date: Saturday, May 31, 2008 - 3:46 pm

So the code you will revert came from my fingers. The circumstances here
make me nervous; while I'm at a loss to explain what might be going on
in particular, let me offer an apology in advance should the revert help
resolve the issue.

Here's what makes me nervous:

 * not a lot of code uses DEFER_ACCEPT.. frankly it was pretty broken
before 26 - but not broken this way .. the correlation of your bug using
it is significant. 

 * in 26, a server TCP socket (with DA) goes to ESTABLISHED when the 3rd
part of the handshake is received (as normal without DA), but the socket
isn't put on the accept queue until a real data packet arrives. (That's
the point of DA). In <= 25 this socket would have syn-recv until the
data packet arrived.

  - I did run tests where the server died in between the handshake being
completed and first data packet arriving - the client should see RST and
the server socket should disappear. But maybe something was missed?

Do I understand this correctly, the server process is gone but the
socket is still in the table? And the client process is still there
waiting for the server to do something - having sent a bunch of data?

Do we know if any data bytes (not handshake bytes) have been consumed by
the server side? If they were, that would seem to vindicate DA.

Also pointing away from DA is that you started seeing this with rc3 -
that code was included in rc1.Is that a firm observation, or maybe there
weren't enough datapoints to conclude that rc1 and rc2 were clean?

The most interesting patch is ec3c0982a2dd1e671bad8e9d26c28dcba0039d87
if anyone wants to eyeball it.



 

--

From: Ilpo Järvinen
Date: Saturday, May 31, 2008 - 10:51 pm

> > * Ilpo J
From: Eric Dumazet
Date: Saturday, May 31, 2008 - 11:04 pm

I believe Ingo problems come on long lived sockets (were many bytes were 
exchanged between the peers), so I dont think DEFER_ACCEPT is the cullprit.

I suggest to enable CONFIG_TIMER_STATS and to check timers, because 
/proc/net/tcp
can display apparently large timer values when the timer is elapsed 
(jiffies > icsk->icsk_timeout)
and jiffies_to_clock_t(timer_expires - jiffies) is then overflowing 
doing a multiply and a divide.

On a 64bits server running linux-2.6.24-rc2, I can see *strange* timers 
values too in /proc/net/tcp, but not
stuck TCP sessions. On 64 bits, these strange values have 1AD7F

grep 1AD7F /proc/net/tcp | obfuscate_IP_and_ports

2017: local_peer remote_peer 03 00000000:00000000 01:1AD7F29ABBA 
00000001     0        0 0 2 ffff81067e7520c0
2019: local_peer remote_peer 03 00000000:00000000 01:1AD7F29ABBA 
00000003     0        0 0 2 ffff8106c580bcc0
2029: local_peer remote_peer 03 00000000:00000000 01:1AD7F29ABBA 
00000002     0        0 0 2 ffff81067313fe40
2032: local_peer remote_peer 03 00000000:00000000 01:1AD7F29ABBA 
00000003     0        0 0 2 ffff8106c716c340
2039: local_peer remote_peer 03 00000000:00000000 01:1AD7F29ABBA 
00000002     0        0 0 2 ffff8107d45b3f40
2041: local_peer remote_peer 03 00000000:00000000 01:1AD7F29AB37 
00000000     0        0 0 2 ffff810718e221c0
6610: local_peer remote_peer 01 00000000:00000000 00:1AD7F29ABCA 
00000000     0        0 136594789 1 ffff8107183fb940 94 10 16 2 -1
9925: local_peer remote_peer 01 00000000:00000000 00:1AD7F29ABCA 
00000000     0        0 144451161 1 ffff8107051a9840 351 10 0 2 -1

On TCP_SYN_RECV (03) sockets, timer can apparently be elapsed by many 
ticks, while on TCP_ESTABLISHED (01) one, I get
jiffies_to_clock(-1) -> 1AD7F29ABCA value because the way 
get_tcp4_sock() is coded (jiffies can change while running this function).
Note the 00: that means that no timer in my case.

Running again the command one second later gives completely different 
results (other sockets are ...
From: Ingo Molnar
Date: Monday, June 2, 2008 - 2:23 am

I turned off localhost distcc two days ago and there has not been a 
single hung socket since then, so we now know it for sure that without 
localhost distcc connections, -tip's QA will not produce any hung 
sockets in about 1000 random-kernel-build+boot iterations.

i've added those reverts this morning and added back the localhost 

i'm wondering whether your suspicion on broken TCP timers is consistent 
with the symptoms i've seen: the hung sockets clearly produced periodic 
packet activity every 180 seconds, up to 8 hours, without ever changing 
their receive of send queue. So at least a part of the TCP timer 
mechanism for that specific stuck socket was working fine.

is there no sysctl or other debug mechanism to somehow get its full TCP 
state and the reasons for why it is stuck? I'm wondering how you debug 
broken TCP state machines without enabling testers to be able to dump 
all state and passing it to developers.

I have a clearly reproducable testcase and i'd like to help out, but the 
whole effort is stalled on 'not enough information' it appears. Doing 
random reverts might help in truly helpless situations where a bug has 
no debuggable state - but this situation seems really routine to me: 
it's very difficult to trigger the bug but once it triggers the bug 
scenario is stable and analyzable. I'd be glad to test any 
instrumentation patch that makes similar scenarios more analyzable.

	Ingo
--

From: Ingo Molnar
Date: Tuesday, June 3, 2008 - 2:40 am

update: your 3 reverts in tip/out-of-tree [commit dad98991c] definitely 
fixed the hangs!

Here is the testing i did:

first i ran about 500+ successful iterations on the affected testboxes 
with your revert patch applied, on multiple systems. Then today, without 
changing anything else on one of the testsystems i reverted your revert 
on that single system. After about an hour of testing, in 20 iterations 
i got a hang again over localhost:

 titan:~> netstat -nt
 Active Internet connections (w/o servers)
 Proto Recv-Q Send-Q Local Address               Foreign Address             State
 tcp        0 174592 10.0.1.14:34710             10.0.1.14:3632              ESTABLISHED
 tcp    72145      0 10.0.1.14:3632              10.0.1.14:34710             ESTABLISHED

so i hereby conclude that your revert works :) I've repeated the commit 
below that resolves this nasty regression.

phew ...

	Ingo

---------------->
commit dad98991c137c089d64a3057dddc3a12bd6866b0
Author: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Date:   Sat May 31 15:18:56 2008 +0300

    tcp: revert DEFER_ACCEPT modifications
    
    On Sat, 31 May 2008, Ilpo Järvinen wrote:
    
    > On Sat, 31 May 2008, Ingo Molnar wrote:
    >
    > > ok, once i added back the localhost distcc component and the hung kernel
    > > build + stuck TCP socket bug happened again overnight:
    > >
    > >  Active Internet connections (w/o servers)
    > >  Proto Recv-Q Send-Q Local Address               Foreign Address             State
    > >  tcp    72187      0 10.0.1.14:3632              10.0.1.14:47910             ESTABLISHED
    > >  tcp        0 174464 10.0.1.14:47910             10.0.1.14:3632              ESTABLISHED
    > >
    > > so it seems distcc over localhost was the aspect that made it fail.
    > >
    > > _Perhaps_ what matters is to have the new post-rc3 TCP code on _both_
    > > sides of the connection. But that is just a theory - it could be timing,
    > > etc.
    >
    > Btw, does ...
From: Patrick McManus
Date: Tuesday, June 3, 2008 - 7:41 am

hm.

Here's a theory - the DA code put the socket in the accept queue after
data has been queued on the socket, instead of the other way around
which is normal.

Could this be confusing the select() mechanism?

-Pat








--

From: Ilpo Järvinen
Date: Tuesday, June 3, 2008 - 2:46 pm

...It wasn't exactly out-of-tree, Evgeniy fixed a problem that was found
in "TCP_DEFER_ACCEPT updates - process as established", perhaps it just 
wasn't in your testing tree yet.

$ git-log -n 1 --pretty=full 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38 | 
grep "Commit:"

Are you sure this is enough to conclude the results? Seems quite small 
number to me to rule out luck. Especially considering that it was some 
amount of time in the tree already until you noticed it for the first 
time.

Anyway, nice that it seems to be helping. It was almost the only 
possibility on TCP side, I don't think there were any other state machine 
related changes. So it wasn't just "random revert" in that sense like you 
were implying :-), I just didn't have any theory how it would cause the 
problem... ...I even first disregarded DA that because of timeline 
in-exactness and because I wrongly assumed that distcc probably won't use 
it anyway, but then I checked later on and found out that it was present 
at least in the source I had lying around.

Anyway, it might be that the revert was a bit overkill, I'm not fully sure 
if 539fae and e4c7884 need to be reverted to fix it since main changes 
are in ec3c098. I just didn't want to take chances at first and put them 

...I couldn't immediately find anything obviously wrong with those changes 
but the patch below might be worth of a try (without the revert of 
course). If it ever spits out that WARN_ON for you, we were playing with 
fire too much and it's better to return on the safe side there...

-- 
 i.

[PATCH] tcp DEFER_ACCEPT: see if header prediction got turned on

If header prediction is turned on under some circumstances,
DA can deadlock though I have great trouble in figuring out
how it could ever happen while ending up into that else
branch (but I've been wrong before as well :-)).

Signed-off-by: Ilpo J
From: Ilpo Järvinen
Date: Tuesday, June 3, 2008 - 3:01 pm

On Wed, 4 Jun 2008, Ilpo J
From: David Miller
Date: Tuesday, June 3, 2008 - 3:03 pm

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

Keepalive is very long, it might still "seem" like a deadlock for
someone without much patience :-)
--

From: Ilpo Järvinen
Date: Tuesday, June 3, 2008 - 3:10 pm

On Tue, 3 Jun 2008, David Miller wrote:

> From: "Ilpo_J
From: Ilpo Järvinen
Date: Tuesday, June 3, 2008 - 4:22 pm

On Tue, 3 Jun 2008, David Miller wrote:

> From: "Ilpo_J
From: Joe Perches
Date: Tuesday, June 3, 2008 - 4:54 pm

Not commenting on the locking, but I think the code
would be clearer if tp->defer_tcp_accept was used in
a temporary instead.

Perhaps:

 net/ipv4/tcp_input.c |   33 ++++++++++++++++-----------------
 1 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b54d9d3..f846e11 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4535,18 +4535,18 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
 static int tcp_defer_accept_check(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-
-	if (tp->defer_tcp_accept.request) {
-		int queued_data =  tp->rcv_nxt - tp->copied_seq;
-		int hasfin =  !skb_queue_empty(&sk->sk_receive_queue) ?
+	struct tcp_deferred_accept_info *tdai = &tp->defer_tcp_accept;
+	if (tdai->request) {
+		int queued_data = tp->rcv_nxt - tp->copied_seq;
+		int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
 			tcp_hdr((struct sk_buff *)
 				sk->sk_receive_queue.prev)->fin : 0;
 
 		if (queued_data && hasfin)
 			queued_data--;
 
-		if (queued_data &&
-		    tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
+		bh_lock_sock(tdai->listen_sk);
+		if (queued_data && tdai->listen_sk->sk_state == TCP_LISTEN) {
 			if (sock_flag(sk, SOCK_KEEPOPEN)) {
 				inet_csk_reset_keepalive_timer(sk,
 							       keepalive_time_when(tp));
@@ -4554,23 +4554,22 @@ static int tcp_defer_accept_check(struct sock *sk)
 				inet_csk_delete_keepalive_timer(sk);
 			}
 
-			inet_csk_reqsk_queue_add(
-				tp->defer_tcp_accept.listen_sk,
-				tp->defer_tcp_accept.request,
-				sk);
+			inet_csk_reqsk_queue_add(tdai->listen_sk,
+						 tdai->request,
+						 sk);
 
-			tp->defer_tcp_accept.listen_sk->sk_data_ready(
-				tp->defer_tcp_accept.listen_sk, 0);
+			tdai->listen_sk->sk_data_ready(tdai->listen_sk, 0);
 
-			sock_put(tp->defer_tcp_accept.listen_sk);
+			sock_put(tdai->listen_sk);
 			sock_put(sk);
-			tp->defer_tcp_accept.listen_sk = ...
From: Ilpo Järvinen
Date: Tuesday, June 3, 2008 - 11:25 pm

On Tue, 3 Jun 2008, Joe Perches wrote:

> On Wed, 2008-06-04 at 02:22 +0300, Ilpo J
From: Patrick McManus
Date: Tuesday, June 3, 2008 - 7:54 pm

On Wed, 2008-06-04 at 02:22 +0300, Ilpo Järvinen wrote:


Ilpo, has anyone told you today that you rock? Well allow me - you
do_rcv() clearly has the listening socket locked in the non-DA case, and
in the DA case it is the 'child' ESTABLISHED socket that is locked -
leaving the accept queue unprotected. So simple.

-Pat


--

From: Ilpo Järvinen
Date: Tuesday, June 3, 2008 - 11:42 pm

On Tue, 3 Jun 2008, Patrick McManus wrote:

> On Wed, 2008-06-04 at 02:22 +0300, Ilpo J
From: Ingo Molnar
Date: Thursday, June 5, 2008 - 7:22 am

i unapplied the reverts then applied your fix, and got this hang after 
1.5 hours of testing:

 titan:~> netstat -nt
 Active Internet connections (w/o servers)
 Proto Recv-Q Send-Q Local Address           Foreign Address   State
 tcp        0  65032 10.0.1.14:39454         10.0.1.19:3632    ESTABLISHED

10.0.1.14 is this box, Core2Duo dual-core. 10.0.1.19 is another box with 
16 CPUs.

	Ingo
--

From: Ilpo Järvinen
Date: Thursday, June 5, 2008 - 11:00 am

> * Ilpo J
From: Ilpo Järvinen
Date: Thursday, June 5, 2008 - 2:13 pm

On Thu, 5 Jun 2008, Ilpo J
From: Patrick McManus
Date: Thursday, June 5, 2008 - 4:29 pm

Sorry, I don't see anything - it seems to boil down to the same code in
the DA and non-DA case as far as I can tell, but after a while all the
twisty passages seem to look alike.

If Ingo confirms that the recv end was running the locking patch code,
it would be interesting to just confirm the sysreq+t looks the same as
before - it is possible the patch turned the race into a non-obvious
deadlock.

I'm sure your smaller revert will make the problem go away just as the
larger one did, fwiw. 

The other odd thing is that Ingo did a lot of experimentation and was
only making this happen on localhost before (though I agree there is
nothing inherent about that lock and localhost) - isn't it odd that the
first trigger of it now is between two hosts? What do you make of that?




--

From: Ilpo Järvinen
Date: Friday, June 6, 2008 - 3:03 am

On Thu, 5 Jun 2008, Patrick McManus wrote:

> On Fri, 2008-06-06 at 00:13 +0300, Ilpo J
From: Patrick McManus
Date: Friday, June 6, 2008 - 10:11 am

maybe - I've noted that I can get the distcc server to crash with just a
little fuzz (telnet to it and close the telnet) - but it is true I
haven't seen anything odd using the distcc client.

Anyhow, my news is that using rc5 I have managed to reproduce it on
localhost - so it isn't just ingo anymore ! ;)

I didn't have a 16 cpu machine at my disposal, so I was concerned I
wouldn't be able to make it happen - but I setup a 64 bit kvm image with
-smp 4, running on my core2-duo and created a makefile with 3 src files
that I just compiled as "while true; do make -j3 all; done" - the
makefile uses distcc to localhost and has intentionally broken
dependencies so it just keeps recompiling stuff. The input files are
approximately 135k, 98k, and 16k after running gcc -E on them (which I
what I assume distcc does before putting them down the socket).

On rc5 I could get the lockup in under 20 minutes.. usually 10. I think
I did it 4 times. My compile test is probably a better trigger than the
kernel compile because the distcc connects are never staggered like they
would be in a large directory of files. (3 files, -j4).

When I apply the locking patch you (Ilpo) wrote, I cannot reproduce the
error at all in the first 90 minutes of testing. I'll let the test run
and update the list.

I'm holding out hope that Ingo's report did not have the locking patch
on the distcc server end - because it certainly makes a difference for 
me.

-Patrick

--

From: Ingo Molnar
Date: Friday, June 6, 2008 - 10:33 am

Hm, the distcc server had the full 3-patch-revert from Ilpo, was that 
supposed to fix the problem too, indirectly?

The box is running that 3-patch revert right now as well:

 phoenix:~> uptime
  19:20:28 up  9:58,  2 users,  load average: 7.75, 13.88, 30.95
 phoenix:~> uname -a
 Linux phoenix 2.6.26-rc4 #2352 SMP Fri Jun 6 09:18:07 CEST 2008 x86_64 x86_64 x86_64 GNU/Linux

... and i never saw a single hang today in the 10 hours of uptime this 
box has. (and it built a good 500 kernel today) Nor any hang yesterday, 
and that was a good 500 kernels too.

You can see it that the box built more than two thousand kernels in the 
past few days alone, so it's a rather busy little bee. The other 
testboxes built even more kernels - a quad box built and booted 2500 
kernels:

 #define UTS_VERSION "#2524 SMP PREEMPT Fri Jun 6 19:22:21 CEST 2008"

and i never saw a hang on that box either. 

a third box has:

 titan:~> uname -a
 Linux titan 2.6.26-rc5-00002-g737697d-dirty #2557 SMP PREEMPT Fri Jun 6 
 19:24:00 CEST 2008 x86_64 x86_64 x86_64 GNU/Linux

(this is the one that showed the hang for the first time)

The total count of kernel bootups i did this week for -tip QA was 
somewhere between five and ten thousand random build+bootups - and the 
only time i got a hang was when i removed the 3-patch-revert 
intentionally, on one of the boxes.

Maybe that 3-patch-revert just makes this locking bug a bit less likely 
to trigger, by accident? Out tip test-setup is specialized to find 
arch/x86 and scheduler bugs, not primarily to find networking bugs. (but 
at this test volume, and given that it makes use of distcc, it will 
trigger them too.)

i have a rather accurate timeline of when the hang first occured, do we 
know the timeline of the introduction of the locking bug by any chance? 
Which commit introduced it? (Ilpo's commit log does not say it)

Your test results are compelling nevertheless so i'll do a retest in any 
case, with all boxes either running an older ...
From: Ilpo Järvinen
Date: Friday, June 6, 2008 - 11:19 am

...I kind of fail to follow in general in this mail which patch have been 
tested and where and when... But I understand that it's just due to number 
of tests & hosts & kernels & what-else you use and know by heart (and we 
don't do all that well :-)). But I'll try to still sort it out below...


Yes, the problematic outside of locking portion shouldn't be there


No, part of the DEFER_ACCEPT stuff was postponed in 2.6.25..2.6.26-rc1 
timeframe (ec3c0982a2dd1e671bad8e9d26c28dcba0039d87) so that one portion 
of it ended up being added outside of the socket lock of the listening 
socket, while touching its datastructures. Without 
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 the deferred accept related 
things happen earlier, ie., while we still are under the lock of the 
listening socket. So that particular locking bug was _introduced_ by that 
ec3c change, not made more likely or so.

...Of course software is known to have bugs, so we might always be 

It seems to work quite well actually for this kind of networking related 

Ah, sorry I forgot to add that one there, it was sent quite late in the 
night and I just couldn't get sleep until sending the fix... :-) It was 
one of the reverted ones that did it:

If you want an older kernel, you would have to go basically to 2.6.25 or 
so.

To summarize. Both 3changes+1fix revert (you refer to it only as 3-patch 
revert) _and_ the locking fix I made should fix the problem (obviously 
they exclude each other). ...And end which is significant is the one which 
has LISTENing sockets (please keep this in mind if you still get the hang
and provide some info).

-- 
 i.
--

From: Ingo Molnar
Date: Friday, June 6, 2008 - 11:39 am

correct, that's what i use as fallback, some distro kernel which is 
2.6.25 or older.

but i'm confused a bit, you say v2.6.25-rc6-475-gec3c098 introduced the 
locking problem - so 2.6.25 is affected as well? This is a significant 
question because the fallback kernel is kernel-2.6.25.3-18.fc9.x86_64 on 
the 16-way box. (all other build-boxes have 2.6.24 or older as a 

ok.

For completeness, let me repeat the patch i referred to as the 
'3-patch-revert' below. (which indeed is 3+1 as you note)

this is the patch that appears to be working empirically. (Disclaimer: 
it might just hide the problem, change timings, have a lucky code 
layout, etc.)

	Ingo

----------------->
commit 9e5b6ca8fa8e07aceaddbfb0f2a324449e5ef014
Author: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Date:   Sat May 31 15:18:56 2008 +0300

    tcp: revert DEFER_ACCEPT modifications
    
    On Sat, 31 May 2008, Ilpo Järvinen wrote:
    
    > On Sat, 31 May 2008, Ingo Molnar wrote:
    >
    > > ok, once i added back the localhost distcc component and the hung kernel
    > > build + stuck TCP socket bug happened again overnight:
    > >
    > >  Active Internet connections (w/o servers)
    > >  Proto Recv-Q Send-Q Local Address               Foreign Address             State
    > >  tcp    72187      0 10.0.1.14:3632              10.0.1.14:47910             ESTABLISHED
    > >  tcp        0 174464 10.0.1.14:47910             10.0.1.14:3632              ESTABLISHED
    > >
    > > so it seems distcc over localhost was the aspect that made it fail.
    > >
    > > _Perhaps_ what matters is to have the new post-rc3 TCP code on _both_
    > > sides of the connection. But that is just a theory - it could be timing,
    > > etc.
    >
    > Btw, does your distcc perhaps happen enable TCP_DEFER_ACCEPT (there were
    > some post 2.6.25 changes into it)?
    
    Hmm, if so, please try this revert below (I hope I got everything that is
    related there, gitk is currently broken for me so that ...
From: Ilpo Järvinen
Date: Friday, June 6, 2008 - 12:49 pm

> * Ilpo J
From: Patrick McManus
Date: Friday, June 6, 2008 - 1:08 pm

This is all a bit confusing, but here are the conclusions I have drawn.

There definitely is a problem with the locking of the DA commit
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 . That code was part of 26-rc1
but it never appeared in 25. It exists in pretty much the same form in
rc5 (there was 1 patch to it over that time to fix a different problem).

We're certain this code has a problem with the accept queue both because
of code inspection and the fact that Ingo can back it out (as the
significant part of the 3-patch revert) and the problem goes away in his
testing.

I have run tests that can reproduce the hung socket with distcc over
localhost using 26-rc5. I can also apparently cure it using the locking
fix patch Ilpo sent (c9454f0..d21d2b9) on top of that. (My test of rc5
+lockpatch is at 4.5+ hrs and counting without failures, it fails 6
times an hour with vanilla rc5)

Based on all of that, the right thing to do seems to be to apply the
lockpatch (c9454f0..d21d2b9) to Linus's tree and not revert anything -
just fix the code and I'll send Ilpo and Ingo cookies at Christmas time
for being  great guys. Alternatively, Ingo could run the distcc servers
and clients on -tip with the lockpatch (nothing reverted) for more
testing.

The only lingering problem is Ingo's report yesterday
http://marc.info/?l=linux-netdev&m=121267587715976&w=2
of a distcc hang. In this one it was not over localhost and the distcc
server had the ec3c DA changes totally reverted. (The server is really
the only stack that matters in this case - the client is not impacted by
the DA changes). This has to be a different issue, because the ec3c code
we're talking about here wasn't on the server at all. As Ilpo mentions,
Hakon is beleived to have a different problem and maybe you've tripped
over that too?

If we're sure of that conclusion we should just take Ilpo's DA patch as
that will narrow the field for finding Hakon's issue. Its just with all
of these data points I'm not sure if I'm reaching the ...
From: Ilpo Järvinen
Date: Friday, June 6, 2008 - 2:12 pm

...added Arjan.



Problems were at least these:
- Accept queue addition was racy and could leave dangling items
- Dangling items caused inconsistent sk_ack_backlog
- Checking for still in LISTEN state was racy, could be changed after
  the check was made (shouldn't happen with distcc though)

I didn't read ->sk_data_ready that carefully, it could have some 
additional problems that are not listed (but they all should be fixed
by the added locking anyway).

AFAICT, rest of that ec3c change is safe wrt. locking, just holding sk is 
enough for the rest and those bits mostly shouldn't anyway be executed 

Anyway, we still would have an option to revert both the DA change + the 
locking fix later if the problem is still clearly more likely than with 

It definately didn't fit to picture that well if we would be talking just 
a single bug here.


...The H
From: Arjan van de Ven
Date: Friday, June 6, 2008 - 2:23 pm

On Sat, 7 Jun 2008 00:12:55 +0300 (EEST)

at this point the observed hung I had seems to be the app doing
something bad (it's the app that hung) so for now please disregard my


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Ilpo Järvinen
Date: Friday, June 6, 2008 - 2:28 pm

> "Ilpo J
From: David Miller
Date: Tuesday, June 10, 2008 - 3:49 pm

From: Patrick McManus <mcmanus@ducksong.com>

While the conclusion might be accurate, Ilpo's patch adds a lock
ordering problem that might introduce potential deadlocks (worst
case) or at least some lockdep warnings.  Both of which are
unacceptable.

I personally don't see any clear solution, and because this new
DA code moves an operation into a code block where the locking
order pretty much has to be inverted, I don't see any real short
term solution other than a revert of this particular DA commit.
--

From: Ilpo Järvinen
Date: Friday, June 6, 2008 - 11:25 am

In addition I think I've also seen some bits floating around that 
occassionally distcc does something weird in a correct setup too.

I briefly looked how distcc behaved while doing the stress_accept. Distcc 
basically seems to have n processes each accept()ing and some kind of 
memleak killer by limiting number of successive accepts then exit, while 
the parent who did the listen is only periodically (had some sleep(1)s) 

Also Peter Z has reported it earlier, it was distcc+localhost for him as 


It could be even easier if you make next in path gcc to play with 
nice, trying a number of different values might reveal some really fast 


...He had some issue with different versions being deployed at least in 
the past, and I failed to follow his latest answer :-).


-- 
 i.
--

From: David Miller
Date: Tuesday, June 10, 2008 - 3:32 pm

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

I took a close look at this, it seems this patch adds
an ABBA deadlock.  But I might be wrong.

Normally the locking order is:

	listen_sk --> some_child_sk

And this can be seen by the code paths that flow down
to tcp_child_process() in net/ipv4/tcp_minisocks.c (via
tcp_v4_do_rcv() for example).

However here in this patch we will lock in the:

	child_sk --> listen_sk

order.

Unless... these defer accepted sockets live outside of the
normal socket collection found by tcp_v{4,6}_hnd_req().  If
that is the case, that ought to make this locking order OK
but I fear that lockdep will likely complain because it has
no way to see this distinction.

If we cannot find a simple and easy way to deal with this locking
problem, I am reverting the defer-accept changes entirely.  It's not
the end of the world if this feature has to be deferred to 2.6.27
and this bug has been known for several weeks already.

--

From: Patrick McManus
Date: Wednesday, June 11, 2008 - 6:10 am

based on this, plus the comments from vitaliy it is clear that moving
the DA socket to established is pretty tricky. I agree I should address
these things in the >= 27 time frame.

imho we don't need to revert the first two patches:
e4c78840284f3f51b1896cf3936d60a6033c4d2c and
539fae89bebd16ebeafd57a87169bc56eb530d76 .. they have nothing to do with
rearranging states or the locking and they do fix the more notable DA
behavior issues.

-Patrick

--

From: Ilpo Järvinen
Date: Wednesday, June 11, 2008 - 8:13 am

On Tue, 10 Jun 2008, David Miller wrote:

> From: "Ilpo_J
From: Ingo Molnar
Date: Wednesday, June 4, 2008 - 12:23 am

out of the -tip tree :-) The -tip tree has 75+ topic branches at the 
moment, but TCP topics are not in its scope - so any TCP change is "out 
of tree" for the -tip tree.

People got confused in the past when they saw similar test patches show 
up in sched.git and x86.git before, so we wanted to make it very clear 
in -tip (with is the successor of sched.git, x86.git and a couple of 
other git trees) that these are commits we dont want to push anywhere. 

Commits in tip/out-of-tree dont get propagated into the tip/auto-*-next 
topic branches that linux-next and -mm picks up, they are purely a 
courtesy to help the testing/fixing of bugs in subsystems that are 
maintained in other git trees.

See attached below the current shortlog of the tip/out-of-tree topic 
branch - it contains changes all around the tree for various things that 
we triggered in -tip and are not yet upstream or are in flight somewhere 

a full day of testing on a testsystem with 500 random kernel builds and 
bootups (the kernel build done on the testsystem utilizing distcc and 
make -j100, so it's rather heavy and parallel TCP traffic per iteration) 
with no hang, compared to the same system with your reverts not applied 
that hung after an hour with 20-30 iterations.

And that count increased to 1000 successful test iterations since 
yesterday.

So i think yes, it seems rather conclusive, given the circumstances ;-) 

These random kernel boots found many 'impossible to trigger' bugs and 
races in the past. The reason for its race finding capability is the 
timing randomness of the resulting random kernel image: the delays 
caused by random combination of debugging facilities, build variants, 
kernel subsystem variants we have. This -tip qa method - as a 
side-effect of its coverage testing - simulates timing variantions that 
are otherwise only observable via hardware variations.

I.e. this is not the same kernel booted up a 1000 times - that would be 
a very narrow test. This is 1000 _different_ ...
From: David Miller
Date: Wednesday, June 4, 2008 - 11:24 am

From: Ingo Molnar <mingo@elte.hu>

Ilpo posted another patch which fixes a locking bug in the
code, please test with that patch.  I include it below so
that you know exactly which one I am referring to.

The quicker you test this, the faster I can merge it to
Linus and get the bug fixed for good.

[PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk

It seems that replacement of DA code also moved parts outside
of appropriate locking. The Ingo's problem seems to come from
the fact that two flows could now race in
(inet_csk_)reqsk_queue_add corrupting the queue. ...This can
leave dangling socks around which won't resolve themselves
without stimuli from outside (e.g., external RST would help
I think).

Then some details I'm not too sure of:
I guess we want to put listen_sk->sk_state checking under the
lock as well. I've not evaluated if ->sk_data_ready too
requires locking but assumed it does.

I'm by no means familiar with all locking variants, requirements,
etc.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   23 +++++++++++++----------
 1 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c9454f0..d21d2b9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4562,6 +4562,7 @@ static int tcp_defer_accept_check(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	if (tp->defer_tcp_accept.request) {
+		struct sock *listen_sk = tp->defer_tcp_accept.listen_sk;
 		int queued_data =  tp->rcv_nxt - tp->copied_seq;
 		int hasfin =  !skb_queue_empty(&sk->sk_receive_queue) ?
 			tcp_hdr((struct sk_buff *)
@@ -4570,8 +4571,9 @@ static int tcp_defer_accept_check(struct sock *sk)
 		if (queued_data && hasfin)
 			queued_data--;
 
-		if (queued_data &&
-		    tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
+		bh_lock_sock(listen_sk);
+
+		if (queued_data && listen_sk->sk_state == TCP_LISTEN) {
 			if (sock_flag(sk, SOCK_KEEPOPEN)) {
 ...
From: Ilpo Järvinen
Date: Wednesday, June 4, 2008 - 1:56 pm

> > * Ilpo J
From: David Miller
Date: Wednesday, June 4, 2008 - 2:55 pm

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

I don't know, and since I'm getting back onto an airplane again tomorrow
morning I'm the last person who is likely do be able to do an audit
of these cases for you :-)
--

Previous thread: [PATCH 1/1] Char: vt, make sysfs operations atomic by Jiri Slaby on Monday, May 26, 2008 - 4:53 am. (6 messages)

Next thread: [PATCH take 2] UBIFS - new flash fil