From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 20 May 2010 22:47:19 +0200
Yes, this is silly.
...
HZ/25 is TCP_DELACK_MIN and TCP_ATO_MIN, but the actual value we use
here involves incorporation of various measurements made on the
connection (RTO, etc.)
...
...
There must be a tail-call here at tcp_v4_rcv() or something missed in
the backtrace stack scanning logic, because tcp_v4_rcv() and its main
inline tcp_v4_do_rcv() do not modify established state socket timers,
and in particular do not modify the delack timer, that I can see.
It must be in via tcp_rcv_established() or similar.
...
Never mind, it's the inlined prequeue stuff.
It uses a separate calculation of the delack timer offset, independent
of the one made by tcp_send_delayed_ack(). Its timer offset formula is:
	(3 * tcp_rto_min(sk)) / 4
with a MAX of:
	TCP_RTO_MAX
So every time we go in and out of recvmsg() we'll hop between these
two different delayed ACK settings.
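
To put rough numbers on the gap between the two settings, here is a
userspace sketch, not kernel code. It assumes HZ = 1000 (one jiffy per
millisecond) and that tcp_rto_min(sk) returns the default TCP_RTO_MIN of
HZ/5 with no per-route override. prequeue_delack_offset() is a
hypothetical helper name, used only to model the prequeue formula above.

```c
#include <assert.h>
#include <limits.h>

/* Assumption: HZ = 1000, so one jiffy is one millisecond. */
#define HZ              1000
#define TCP_DELACK_MIN  (HZ / 25)    /* 40 jiffies, same as TCP_ATO_MIN */
#define TCP_RTO_MIN     (HZ / 5)     /* 200 jiffies, assumed default rto_min */
#define TCP_RTO_MAX     (120 * HZ)   /* upper clamp on the timer offset */

/*
 * Hypothetical model of the prequeue delack offset:
 * 3/4 of rto_min, clamped to TCP_RTO_MAX.
 */
static unsigned long prequeue_delack_offset(unsigned long rto_min)
{
	unsigned long when;

	/* Guard the multiplication below against overflow. */
	assert(rto_min <= ULONG_MAX / 3);

	when = (3 * rto_min) / 4;
	return when > TCP_RTO_MAX ? TCP_RTO_MAX : when;
}
```

Under those assumptions prequeue_delack_offset(TCP_RTO_MIN) comes out
at 150 jiffies, against a TCP_DELACK_MIN floor of 40, so the timer can
be re-armed with quite different offsets depending on which path
touched it last.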
The prequeue logic is trying to stretch the delayed ACK to 3/4 of a
window of data. It's set a bit high, intentionally, in the hopes that
we'll get the process into recvmsg() and have it emit its response
packet from a subsequent sendmsg() (that the ACK can ride on) before
this timer fires.
But when we drop the socket lock to sleep or return to userspace, one
of the next packets is just going to reset this timer differently.
While the intentions of the prequeue code look legit, the use of two
different delayed ACK timeout schemes has bad implications elsewhere.
For example, if the delack timer does actually fire, there is this
ATO fixup code here:
	if (!icsk->icsk_ack.pingpong) {
		/* Delayed ACK missed: inflate ATO. */
		icsk->icsk_ack.ato = min(icsk->icsk_ack.ato << 1, icsk->icsk_rto);
	} else {
		/* Delayed ACK missed: leave pingpong mode and
		 * deflate ATO.
		 */
		icsk->icsk_ack.pingpong = 0;
		icsk->icsk_ack.ato = TCP_ATO_MIN;
	}
which is totally wrong if the delack timer offset is the one
calculated by the prequeue code. Doubling the ATO in that case
is completely the wrong thing to do.
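
A small userspace model of that fixup shows how the ATO moves when the
timer fires. This is only a sketch: struct ack_state and
delack_timer_fired() are hypothetical stand-ins for icsk->icsk_ack and
the timer handler, and HZ = 1000 is assumed.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: HZ = 1000, so one jiffy is one millisecond. */
#define HZ           1000
#define TCP_ATO_MIN  (HZ / 25)   /* 40 jiffies */

/* Hypothetical stand-in for the relevant fields of icsk->icsk_ack. */
struct ack_state {
	bool pingpong;
	unsigned int ato;
};

/* Models the ATO fixup applied when the delack timer fires. */
static void delack_timer_fired(struct ack_state *ack, unsigned int rto)
{
	assert(ack);

	if (!ack->pingpong) {
		/* Delayed ACK missed: inflate ATO, capped at the RTO. */
		unsigned int doubled = ack->ato << 1;

		ack->ato = doubled < rto ? doubled : rto;
	} else {
		/* Delayed ACK missed: leave pingpong mode, deflate ATO. */
		ack->pingpong = false;
		ack->ato = TCP_ATO_MIN;
	}
}
```

Starting from ato = 40 with pingpong clear and an rto of 200, three
firings take the ATO to 80, 160 and then 200. The handler has no way
to tell whether the expiry it is reacting to was scheduled from the
ATO or from the prequeue offset, so it inflates either way.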
So yes we have all kinds of inconsistencies here and we should
probably unify things so that the timer gets kicked less often.