[TCP bug, regression] stuck distcc connections in latest -git

To: Willy Tarreau <w@...>, David S. Miller <davem@...>
Cc: David Newall <davidn@...>, Linus Torvalds <torvalds@...>, <akpm@...>, <netdev@...>, <linux-kernel@...>, Stefan Richter <stefanr@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, July 24, 2008 - 2:04 am

* Ingo Molnar <mingo@elte.hu> wrote:


the permanently hung distcc kernel build bug triggered again, twice. 
First time it happened yesterday, i left it running overnight and it 
never recovered after 14+ hours of waiting.

It shows a similar pattern, 'ESTABLISHED' state on both sides, but the 
client-side is stuck and the server (running latest kernel) is seemingly 
clueless about that fact:

 client:

  Proto Recv-Q Send-Q Local Address       Foreign Address     State
  tcp        0 375450 10.0.1.16:39201     10.0.1.19:3632      ESTABLISHED

 server:

  Proto Recv-Q Send-Q Local Address       Foreign Address     State
  tcp        0      0 10.0.1.19:3632      10.0.1.16:39201     ESTABLISHED

i waited ~30 minutes in this second case.

the client (running 2.6.24) retransmits periodically, every 120 seconds:

07:40:48.255452 IP dione.39201 > phoenix.distcc: . 1608:2144(536) ack 1 win 5840
07:40:48.255547 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:40:48.255564 IP dione.39201 > phoenix.distcc: . 67143:67679(536) ack 1 win 5840
07:40:48.255648 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:42:48.255440 IP dione.39201 > phoenix.distcc: . 2144:2680(536) ack 1 win 5840
07:42:48.255559 IP phoenix.distcc > dione.39201: . ack 2680 win 65535
07:42:48.255570 IP dione.39201 > phoenix.distcc: . 67679:68215(536) ack 1 win 5840
07:42:48.255659 IP phoenix.distcc > dione.39201: . ack 2680 win 65535
07:44:48.255436 IP dione.39201 > phoenix.distcc: . 2680:3216(536) ack 1 win 5840
07:44:48.255570 IP phoenix.distcc > dione.39201: . ack 3216 win 65535
07:44:48.255585 IP dione.39201 > phoenix.distcc: . 68215:68751(536) ack 1 win 5840
07:44:48.255669 IP phoenix.distcc > dione.39201: . ack 3216 win 65535

the server (running the latest kernel) responds:

07:40:47.551098 IP dione.39201 > phoenix.distcc: . 1072:1608(536) ack 1 win 5840
07:40:47.551141 IP phoenix.distcc > dione.39201: . ack 1608 win 65535
07:40:47.551204 IP dione.39201 > phoenix.distcc: . 66607:67143(536) ack 1 win 5840
07:40:47.551213 IP phoenix.distcc > dione.39201: . ack 1608 win 65535
07:42:47.570994 IP dione.39201 > phoenix.distcc: . 1608:2144(536) ack 1 win 5840
07:42:47.571027 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:42:47.571117 IP dione.39201 > phoenix.distcc: . 67143:67679(536) ack 1 win 5840
07:42:47.571127 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:44:47.590901 IP dione.39201 > phoenix.distcc: . 2144:2680(536) ack 1 win 5840
07:44:47.590960 IP phoenix.distcc > dione.39201: . ack 2680 win 65535
07:44:47.591042 IP dione.39201 > phoenix.distcc: . 67679:68215(536) ack 1 win 5840
07:44:47.591054 IP phoenix.distcc > dione.39201: . ack 2680 win 65535

full client socket state:

 dione:~> grep $(printf "%X\n" 39201) /proc/net/tcp
   44: 1001000A:9921 1301000A:0E30 01 0005ABF2:00000000 01:00002B8A 
       00000000   500        0 63130083 2 ffff81000c762d00 120000 0 0 28 101

 [ a few minutes later ]

   44: 1001000A:9921 1301000A:0E30 01 0005A392:00000000 01:00002BF0 
       00000000   500        0 63130083 2 ffff81000c762d00 120000 0 0 32 101

 [ i.e. the tx queue did decrease by 2144 bytes - 4x 536 bytes ]
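
For anyone wanting to double-check the hex fields in the two client samples above, here is a minimal Python sketch that decodes them. It assumes the usual IPv4 /proc/net/tcp layout (local address, remote address, state, tx_queue:rx_queue as the first fields after the slot number) and a little-endian box, so the address word has to be byte-swapped. The helper names (decode_endpoint, parse_line) are made up for illustration, and the wrapped two-line entries are joined into one string by hand.

```python
import socket
import struct


def decode_endpoint(field):
    """Decode 'HEXADDR:HEXPORT' as printed by /proc/net/tcp (IPv4)."""
    addr_hex, port_hex = field.split(":")
    # Assumption: little-endian machine, so the 32-bit address word
    # is repacked little-endian to recover network byte order.
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)


def parse_line(line):
    """Pick out the endpoint, state and queue fields of one entry."""
    fields = line.split()
    tx_hex, rx_hex = fields[4].split(":")
    return {
        "local": decode_endpoint(fields[1]),
        "remote": decode_endpoint(fields[2]),
        "state": int(fields[3], 16),  # 1 is ESTABLISHED
        "tx_queue": int(tx_hex, 16),
        "rx_queue": int(rx_hex, 16),
    }


# The two client samples from the message, each joined onto one line.
first = parse_line(
    "44: 1001000A:9921 1301000A:0E30 01 0005ABF2:00000000 01:00002B8A "
    "00000000 500 0 63130083 2 ffff81000c762d00 120000 0 0 28 101"
)
second = parse_line(
    "44: 1001000A:9921 1301000A:0E30 01 0005A392:00000000 01:00002BF0 "
    "00000000 500 0 63130083 2 ffff81000c762d00 120000 0 0 32 101"
)

delta = first["tx_queue"] - second["tx_queue"]
print(first["local"], "->", first["remote"], "state", first["state"])
print("tx_queue:", first["tx_queue"], "then", second["tx_queue"])
print("difference:", delta, "bytes =", delta // 536, "x 536")
```

The decoded endpoints match the netstat output (10.0.1.16:39201 to 10.0.1.19:3632), and the difference between the two tx_queue values comes out as four 536-byte segments.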

full server socket state:

 phoenix:~> grep $(printf "%X\n" 39201) /proc/net/tcp
    6: 1301000A:0E30 1001000A:9921 01 00000000:00000000 00:00000000 
       00000000    99        0 728382 1 ffff88042d8db280 300 4 30 2 -1

 [ a few minutes later ]

    6: 1301000A:0E30 1001000A:9921 01 00000000:00000000 00:00000000 
       00000000    99        0 728382 1 ffff88042d8db280 300 4 30 2 -1

 [ i.e. no change - no pending packets ]

I've started a longer capture session as well - it seems the TCP stack 
is slowly cycling through retransmissions of 536-byte packets, with 
375450 bytes pending? At 120 seconds a pop that would be about 23 hours 
to make any progress on - but i'm not sure i interpreted that right. It 
all looks very weird.
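
The 23 hours figure can be sanity-checked with a few lines of Python. This is only a rough sketch: it assumes the ack point keeps advancing by exactly one 536-byte segment per 120-second cycle (which is what the ack numbers in the trace above suggest) and that nothing else drains or refills the queue in the meantime. The function name drain_time_hours is made up for the example.

```python
PENDING_BYTES = 375450        # Send-Q reported by netstat on the client
SEGMENT_BYTES = 536           # payload size of each retransmitted segment
RETRANSMIT_INTERVAL_S = 120   # observed gap between retransmit cycles


def drain_time_hours(pending, segment, interval_s):
    """Hours needed to drain `pending` bytes at one segment per interval."""
    segments = -(-pending // segment)  # ceiling division
    return segments * interval_s / 3600.0


segments_needed = -(-PENDING_BYTES // SEGMENT_BYTES)
hours = drain_time_hours(PENDING_BYTES, SEGMENT_BYTES, RETRANSMIT_INTERVAL_S)
print(segments_needed, "segments, about", round(hours, 1), "hours")
```

That works out to 701 segments and a bit over 23 hours, in line with the estimate above.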

The timestamps of the two boxes are synced up to within about 1 second:

  earth4:~> for N in dione phoenix; do ssh $N date; done
  Thu Jul 24 07:44:02 CEST 2008
  Thu Jul 24 07:44:02 CEST 2008

( but the two boxes are responding to each other fine, so ordering of 
  events is not a question here. )

Any other state you'd like to see before i continue with -tip testing? 

	Ingo
