Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

To: Ray Lee <ray-lk@...>
Cc: Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Netdev <netdev@...>, David S. Miller <davem@...>, Rafael J. Wysocki <rjw@...>, Andrew Morton <akpm@...>
Date: Friday, May 30, 2008 - 5:11 pm

Hi Ray,

...I reorganized it a bit.

On Fri, 30 May 2008, Ray Lee wrote:



I think you are missing a lot of clues here. At first I suspected my FRTO 
changes as well, but later discoveries pointed elsewhere... Those fixes are 
for sender behavior, which is not the problem here. It's just that once you 
have flow control, the sending TCP obviously gets stuck once all "buffering" 
capacity downstream is used up, and that's _correct_ sender behavior 
rather than a bug in itself. Therefore both FRTO and Ingo's theory 
about Cubic (though his test with 2.6.25 will definitely be a useful 
result, with or without Cubic :-)) completely fail to explain why the 
receiver didn't read the portion that was sitting there waiting (see 
below).
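
A minimal sketch of that sender behavior, assuming a loopback TCP connection whose receiving end never calls recv(). The function name and parameters are made up for illustration, and the byte count it reports depends on the local buffer settings:

```python
import socket


def fill_until_blocked(chunk_size=65536, max_bytes=1 << 30):
    """Send to a loopback peer that never reads.

    Returns (bytes_accepted, blocked): how many bytes the kernel accepted
    before a non-blocking send() reported that it would block.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # any free port
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()      # accepted, but never read from
    sender.setblocking(False)

    sent = 0
    blocked = False
    chunk = b"x" * chunk_size
    try:
        while sent < max_bytes:
            try:
                sent += sender.send(chunk)
            except BlockingIOError:
                # Receive and send buffers are full: the sender is stuck,
                # which is flow control working as intended.
                blocked = True
                break
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return sent, blocked
```

Calling `fill_until_blocked()` should report a stalled sender even though nothing is wrong on the sending side; the interesting question is only why the receiver stopped reading.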

Also, I think you missed one (though its commit message seems to say 
that it isn't relevant here, but who knows):

1ac06e0306d0192a7a4d9ea1c9e06d355ce7e7d3

...but still, that would hardly explain why the receiver queue was not 
consumed.


Of course Ingo could easily test without FRTO by playing with the sysctl; 
none of those three patches is in use if tcp_frto is set to zero (he 
probably didn't because I "cancelled" that request...?), but I find it 
very unlikely to help.
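
A minimal sketch for checking the current setting before such a test, assuming the usual procfs location of the sysctl. The helper name is hypothetical:

```python
from pathlib import Path

# Assumed procfs path for the net.ipv4.tcp_frto sysctl.
TCP_FRTO_PATH = "/proc/sys/net/ipv4/tcp_frto"


def frto_enabled(path=TCP_FRTO_PATH):
    """Return True if the sysctl file at `path` holds a non-zero value."""
    value = int(Path(path).read_text().split()[0])
    return value != 0
```

To actually turn FRTO off for a test run, write 0 to the same file as root, or run `sysctl -w net.ipv4.tcp_frto=0`.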


Me neither, I just know some about TCP, so I probably have as many 
problems as you do in understanding this :-).


...Thanks, I definitely don't mind any help here. It probably seems 
completely "stalled" partly because figuring this out leads me more and 
more into territory that was previously unknown to me (plus the time 
constraints I have). Not that it's a bad thing to learn & read a lot of 
other code too, but it just takes more time, and I cannot do anything 
while off-line like I could with code that I'm familiar with.

Would you perhaps have any clue about two clearly strange things I listed 
here:
  http://marc.info/?l=linux-kernel&m=121207001329497&w=2

...


...i.e., one connection, two endpoints:


             ^^^^^

Can you perhaps find/guess/think of some explanation for this _receiver 
queue_...? This was a trick question :-), as we already know that the 
receiving process is no longer there and therefore obviously won't be 
reading anything anymore. But that opened another question: why is TCP 
then still in ESTABLISHED? An orphaned TCP shouldn't be in the established 
state anymore; tcp_close should have changed the state (either at close or 
at process exit). I guess once it becomes known why tcp_close either wasn't 
called at all or didn't change the state of the flow (it's quite 
simple, see for yourself), the cause of the bug is found (it might even be 
that the process went away when it shouldn't have, either a bookkeeping 
bug somewhere or real death, or something along those lines).
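
A minimal sketch for spotting such flows from user space, assuming the usual /proc/net/tcp column layout (state in the fourth column, tx_queue:rx_queue in the fifth, inode in the tenth) and assuming that an orphaned socket is reported with inode 0. The function name is made up for illustration:

```python
ESTABLISHED = 0x01  # state code assumed from the /proc/net/tcp "st" column


def orphaned_established(lines):
    """Return (local, remote, rx_queue_bytes) for every ESTABLISHED entry
    that has inode 0, i.e. apparently no owning process."""
    found = []
    for line in lines:
        fields = line.split()
        # Skip the header line and anything too short to be a socket entry.
        if len(fields) < 10 or not fields[0].endswith(":"):
            continue
        state = int(fields[3], 16)
        _tx_hex, rx_hex = fields[4].split(":")
        inode = int(fields[9])
        if state == ESTABLISHED and inode == 0:
            found.append((fields[1], fields[2], int(rx_hex, 16)))
    return found
```

It can be run against a live system with `with open("/proc/net/tcp") as f: print(orphaned_established(f))`. Addresses are left in the kernel's hex form; a non-zero rx_queue on such an entry would match the stuck receiver queue discussed above.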

I was thinking of storing some info about the old owner in struct sock 
while orphaning, and collecting it once one of the flows gets stuck, but 
this requires me to figure out a lot of unknowns before I can just code 
it.

-- 
 i.
