I turned off localhost distcc two days ago and there has not been a
single hung socket since then, so we now know for sure that without
localhost distcc connections, -tip's QA does not produce any hung
sockets in about 1000 random-kernel-build+boot iterations.

I've added those reverts this morning and added back the localhost
distcc rules - we'll see whether the hung sockets are back.

I'm wondering whether your suspicion of broken TCP timers is consistent
with the symptoms I've seen: the hung sockets clearly produced periodic
packet activity every 180 seconds, for up to 8 hours, without ever
changing their receive or send queues. So at least a part of the TCP
timer mechanism for that specific stuck socket was working fine.

Is there no sysctl or other debug mechanism to somehow get a stuck
socket's full TCP state and the reasons why it is stuck? I'm wondering
how you debug broken TCP state machines without enabling testers to
dump all the state and pass it on to developers.
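
For reference, a small part of that state (connection state, send and
receive queue sizes, pending timer type) is already exported through
/proc/net/tcp. Below is a minimal sketch of snapshotting it, assuming
IPv4 sockets and a little-endian host. The function names are made up
for illustration and are not from any existing tool, and this does not
expose internal state such as window or congestion variables.

```python
import socket
import struct

# State codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    0x01: "ESTABLISHED", 0x02: "SYN_SENT", 0x03: "SYN_RECV",
    0x04: "FIN_WAIT1", 0x05: "FIN_WAIT2", 0x06: "TIME_WAIT",
    0x07: "CLOSE", 0x08: "CLOSE_WAIT", 0x09: "LAST_ACK",
    0x0A: "LISTEN", 0x0B: "CLOSING",
}

# Timer codes as printed in the "tr" column.
TIMERS = {
    0: "none", 1: "retransmit", 2: "other (e.g. keepalive)",
    3: "time_wait", 4: "zero-window probe",
}


def decode_addr(field):
    """Decode 'HEXIP:HEXPORT' (IPv4; assumes a little-endian host)."""
    ip_hex, port_hex = field.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return f"{ip}:{int(port_hex, 16)}"


def parse_line(line):
    """Parse one non-header line of /proc/net/tcp into a dict."""
    f = line.split()
    tx, rx = f[4].split(":")
    timer, when = f[5].split(":")
    return {
        "local": decode_addr(f[1]),
        "remote": decode_addr(f[2]),
        "state": TCP_STATES.get(int(f[3], 16), f[3]),
        "tx_queue": int(tx, 16),
        "rx_queue": int(rx, 16),
        "timer": TIMERS.get(int(timer, 16), timer),
        "timer_expires": int(when, 16),  # raw value, in kernel ticks
        "retransmits": int(f[6], 16),
    }


def snapshot(path="/proc/net/tcp"):
    """Return one dict per IPv4 TCP socket currently listed."""
    with open(path) as fh:
        next(fh)  # skip the header line
        return [parse_line(line) for line in fh if line.strip()]
```

Taking two snapshots a few minutes apart and comparing the entries for
the distcc port would at least show whether the queues and the pending
timer type of a hung socket ever change.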

I have a clearly reproducible testcase and I'd like to help out, but the
whole effort appears to be stalled on 'not enough information'. Doing
random reverts might help in truly helpless situations where a bug has
no debuggable state - but this situation seems really routine to me:
it's very difficult to trigger the bug, but once it triggers, the bug
scenario is stable and analyzable. I'd be glad to test any
instrumentation patch that makes similar scenarios more analyzable.

	Ingo