Re: [RFC] TCP illinois max rtt aging

Previous thread: [PATCH 1/2] sky2: align IP header on Rx if possible by Stephen Hemminger on Wednesday, November 28, 2007 - 3:27 pm. (3 messages)

Next thread: RE: [PATCH 2/2] [net/wireless/iwlwifi] : iwlwifi 4965 Fix race conditional panic. by Joonwoo Park on Wednesday, November 28, 2007 - 6:43 pm. (1 message)
From: Stephen Hemminger
Date: Wednesday, November 28, 2007 - 4:47 pm

Lachlan Andrew observed that my TCP-Illinois implementation uses the
beta value incorrectly:
  The parameter  beta  in the paper specifies the amount to decrease
  *by*:  that is, on loss,
     W <-  W -  beta*W
  but in   tcp_illinois_ssthresh() uses  beta  as the amount
  to decrease  *to*: W <- beta*W

This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network,
hurting performance. Note: since the base beta value is .5, it has no
impact on a congested network.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>

--- a/net/ipv4/tcp_illinois.c	2007-08-18 07:50:15.000000000 -0700
+++ b/net/ipv4/tcp_illinois.c	2007-11-28 15:39:04.000000000 -0800
@@ -298,7 +298,7 @@ static u32 tcp_illinois_ssthresh(struct 
 	struct illinois *ca = inet_csk_ca(sk);
 
 	/* Multiplicative decrease */
-	return max((tp->snd_cwnd * ca->beta) >> BETA_SHIFT, 2U);
+	return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->beta) >> BETA_SHIFT), 2U);
 }
 
 
-

From: Herbert Xu
Date: Thursday, November 29, 2007 - 7:12 am

Applied to net-2.6.  Thanks Stephen!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-

From: Lachlan Andrew
Date: Wednesday, November 28, 2007 - 5:25 pm

Thanks Stephen.

A related problem (largely due to the published algorithm itself) is
that Illinois is very aggressive when it over-estimates the maximum
RTT.

At high load (say 200Mbps and 200ms RTT), a backlog of packets builds
up just after a loss, causing the RTT estimate to become large.  This
makes Illinois think that *all* losses are due to corruption not
congestion, and so only back off by 1/8 instead of 1/2.

I can't think how to fix this except by better RTT estimation, or
changes to Illinois itself.  Currently, I ignore RTT measurements when
   sacked_out != 0    and have a heuristic "RTT aging" mechanism, but
that's pretty ugly.

Cheers,
Lachlan



-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820    Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan
-

From: Stephen Hemminger
Date: Wednesday, November 28, 2007 - 5:43 pm

Ageing the RTT estimates needs to be done anyway.
Maybe something can be reused from H-TCP. The two are closely related.

-

From: Shao Liu
Date: Wednesday, November 28, 2007 - 10:26 pm

Hi Stephen and Lachlan,

Thanks for pointing out and fixing this bug.

For the max RTT problem, I have considered it also and I have some idea on
improve it. I also have some other places to improve. I will summarize all
my new ideas and send you an update. For me to change it, could you please
give me a link to download to latest source codes for the whole congestion
control module in Linux implementation, including the general module for all
algorithms, and the implementation for specific algorithms like TCP-Illinois
and H-TCP? 

Thanks for the help!
-Shao



-----Original Message-----
From: Stephen Hemminger [mailto:shemminger@linux-foundation.org] 
Sent: Wednesday, November 28, 2007 4:44 PM
To: Lachlan Andrew
Cc: David S. Miller; Herbert Xu; shaoliu@Princeton.EDU; Douglas Leith;
Robert Shorten; netdev@vger.kernel.org
Subject: Re: [PATCH] tcp-illinois: incorrect beta usage

Ageing the RTT estimates needs to be done anyway.
Maybe something can be reused from H-TCP. The two are closely related.


-

From: Stephen Hemminger
Date: Monday, December 3, 2007 - 3:52 pm

On Wed, 28 Nov 2007 21:26:12 -0800

The following adds gradual aging of max RTT.

--- a/net/ipv4/tcp_illinois.c	2007-11-29 08:58:35.000000000 -0800
+++ b/net/ipv4/tcp_illinois.c	2007-11-29 09:37:33.000000000 -0800
@@ -63,7 +63,10 @@ static void rtt_reset(struct sock *sk)
 	ca->cnt_rtt = 0;
 	ca->sum_rtt = 0;
 
-	/* TODO: age max_rtt? */
+	/* add slowly fading memory for maxRTT to accommodate routing changes */
+	if (ca->max_rtt > ca->base_rtt)
+		ca->max_rtt = ca->base_rtt
+			+ (((ca->max_rtt - ca->base_rtt) * 31) >> 5);
 }
 
 static void tcp_illinois_init(struct sock *sk)
--

From: Lachlan Andrew
Date: Monday, December 3, 2007 - 4:06 pm

Greetings Stephen,

Thanks.  We'll have to play with the rate of ageing.  I used the slower ageing

	if (ca->cnt_rtt > 3) {
		u64 mean_rtt = ca->sum_rtt;
		do_div (mean_rtt, ca->cnt_rtt);

		if (ca->max_rtt > mean_rtt)
			ca->max_rtt -= (ca->max_rtt - mean_rtt) >> 9;
	}

and still found that the max_rtt drops considerably within a congestion epoch.

What would also really help would be getting rid of RTT outliers
somehow.  I ignore RTT measurements when SACK is active:
       if (ca->max_rtt < rtt) {
	    struct tcp_sock *tp = tcp_sk(sk);
	    if (! tp->sacked_out )	// SACKs cause hi-CPU/hi-RTT. ignore
		ca->max_rtt = rtt;
	}
which helps a lot, but still gets some outliers.  Would it be possible
to time-stamp packets in the hardware interrupt handler, instead of
waiting for the post-processing stage?

Cheers,
Lachlan



-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820    Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan
--

From: Shao Liu
Date: Monday, December 3, 2007 - 4:59 pm

Hi Stephen and Lachlan,

Thanks for the discussion. The gradual aging is surely an option. And
another possibility is that, we compute the RTT just before congestion
notification, which ideally, represent the full queueing delay + propagation
delay. We can compute the average of the last M such values, and either use
the average as maxRTT, or use it as a benchmark to judge whether a sample is
outlier. How do you think of this idea?

And also a question, why the samples when SACK is active are outliers? 

For the accuracy of time stamping, I am not very familiar with the
implementation details. But I can think of two ways, 1) do time stamp in as
low layer as possible; 2) use as high priority thread to do it as possible.
For 2), we can use separate threads to do time stamp and to process packets.

Thanks and I will let you know more of my thoughts after I go over the
entire code space!
-Shao 


-----Original Message-----
From: Lachlan Andrew [mailto:lachlan.andrew@gmail.com] 
Sent: Monday, December 03, 2007 3:06 PM
To: Stephen Hemminger
Cc: shaoliu@Princeton.EDU; David S. Miller; Herbert Xu; Douglas Leith;
Robert Shorten; netdev@vger.kernel.org
Subject: Re: [RFC] TCP illinois max rtt aging

Greetings Stephen,

Thanks.  We'll have to play with the rate of ageing.  I used the slower
ageing

	if (ca->cnt_rtt > 3) {
		u64 mean_rtt = ca->sum_rtt;
		do_div (mean_rtt, ca->cnt_rtt);

		if (ca->max_rtt > mean_rtt)
			ca->max_rtt -= (ca->max_rtt - mean_rtt) >> 9;
	}

and still found that the max_rtt drops considerably within a congestion
epoch.

What would also really help would be getting rid of RTT outliers
somehow.  I ignore RTT measurements when SACK is active:
       if (ca->max_rtt < rtt) {
	    struct tcp_sock *tp = tcp_sk(sk);
	    if (! tp->sacked_out )	// SACKs cause hi-CPU/hi-RTT. ignore
		ca->max_rtt = rtt;
	}
which helps a lot, but still gets some outliers.  Would it be possible
to time-stamp packets in the hardware interrupt handler, instead of
waiting ...
From: Stephen Hemminger
Date: Monday, December 3, 2007 - 5:32 pm

On Mon, 3 Dec 2007 15:59:23 -0800

The problem with an average like that would be storing enough values
to be useful and choosing how many to store. Perhaps some form of
weighted sliding average which favors recent values heavily would work
best. Remember that RTT's have a huge noise component and you are
fighting against the long tail distribution trying to see the queue

Any sample with SACK is going to mean a loss or reordering has occurred.
So shouldn't the SACK values be useful, but RTT values from retransmits

Right now the resolution is in microseconds using the hardware clock.
The clock usage costs a little bit, but makes the math more accurate.
It would be worth exploring sensitivity by taking out RTT_STAMP from
the flags field and varying HZ.
--

From: Lachlan Andrew
Date: Monday, December 3, 2007 - 6:23 pm

Greetings,


When SACK is active, the per-packet processing becomes more involved,
tracking the list of lost/SACKed packets.  This causes a CPU spike
just after a loss, which increases the RTTs, at least in my
experience.  This is a separate issue from the fact that it is hard to
get RTT measurements from lost/retransmitted packets themselves.

Cheers,
Lachlan

-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820    Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan
--

From: Ilpo Järvinen
Date: Tuesday, December 4, 2007 - 1:37 am

I suspect that as long as old code was able to use hint, it wasn't doing 
that bad. But it was seriously lacking ability to take advantage of sack 
processing hint when e.g., a new hole appeared, or cumulative ACK arrived.

...Code available in net-2.6.25 might cure those.


-- 
 i.
--

From: Lachlan Andrew
Date: Thursday, December 6, 2007 - 8:27 pm

Greetings Ilpo,


We had been using one of your earlier patches, and still had the
problem.  I think you've cured the problem with SACK itself, but there
still seems to be something taking a lot of CPU while recovering from
the loss.  It is possible that it was to do with  web100  which we
have also been running, but I cut out most of the statistics from that
and still had problems.

Cheers,
Lachlan

-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820    Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan
--

From: Ilpo Järvinen
Date: Friday, December 7, 2007 - 4:05 am

On Thu, 6 Dec 2007, Lachlan Andrew wrote:
> On 04/12/2007, Ilpo J
From: David Miller
Date: Friday, December 7, 2007 - 5:41 am

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

Yes, it's the classic problem.  But it ought to be at least
partially masked when TSO is in use, because we'll only process
a handful of SKBs.  The more effectively TSO batches, the
less work clean_rtx_queue() will do.

When not doing TSO the behavior is super-stupid, we bump reference
counts on the same page multiple times while running over the SKBs
since consequetive SKBs cover data in different spans of the same
page.

The core issue is that we have a poorly behaving data container,
and therefore that's obviously what we need to change.

Conceptually what we probably need to do is seperate the data
maintainence from the SKB objects themselves.  There is a blob
that maintains the paged data state for everything in the
retransmit queue.  SKBs are built and get the page pointers
but don't actually grab references to the pages, the blob
does that and it keeps track of how many SKB references to each
page there are, non-atomically.

The hardest part is dealing with the page lifetime issues.
Unfortunately, when we trim the rtx queue, references to the clones
can still exist in the driver output path.  It's a difficult problem
to overcome in fact, so in the end my suggestion above might not

Web100 just provides statistics and other kinds of connection data
to userspace, all the actual algorithm etc. modifications have been
merged upstream and yanked out of the web100 patch.  I was looking
at it the other night and it's frankly totally uninteresting these
days :-)
--

From: Ilpo Järvinen
Date: Friday, December 7, 2007 - 6:05 am

On Fri, 7 Dec 2007, David Miller wrote:

> From: "Ilpo_J
From: David Miller
Date: Friday, December 7, 2007 - 6:32 pm

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

You're of course right, and it's ironic that I wrote the SACK
splitting code so I should have known this :-)

A possible approach just occurred to me wherein we maintain
the SACK state external to the SKBs so that we don't need to
mess with them at all.

That would allow us to eliminate the TSO splitting but it would
not remove the general problem of clean_rtx_queue()'s overhead.

I'll try to give some thought to this over the weekend.
--


On Fri, 7 Dec 2007, David Miller wrote:

> From: "Ilpo_J
From: David Miller
Date: Tuesday, December 11, 2007 - 5:32 am

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

Interesting approach, but I think there is limited value to this
(arguably) complex form.

The core issue is that the data and the SACK state are maintained in
the same datastructure.  The complexity in all the state management
and fixups in your patch is purely because of this.

If we maintain SACK scoreboard information seperately, outside of
the SKB, then there are only two changes to make:

1) Every access to TCP_SKB_CB() SACK scoreboard is adjusted to
   new data structure.

2) Retransmit is adjusted so that it can retransmit an SKB
   constructed as a portion of an existing SKB.  Since TSO
   implies SG, this can be handled with simple offset and
   length arguments and suitable creation of a clone referencing
   the pages in the SG vector that contain the desired data.

I would envision this SACK state thing to reference into the
retransmit queue SKB's somehow.  Each SACK block could perhaps
look something like:

	struct sack_ref {
		struct sk_buff *start_skb;
		struct sk_buff *end_skb;
		unsigned int start_off;
		unsigned int len;
	};

Traditionally we've prioritized the design of the SKB and other
infrastructure to suit TCP optimally and I still think we should
operate that way.

Therefore, long term, it is time to make a formal data blob to assist
with all of the repetitive work we do in cases like this and horribly
inefficient places like clean_rtx_queue().

So I'm basically advocating a two-pronged approach to this, the
seperate SACK scoreboard datastructure and the data blob.  I
think we can work on the former right now, and take our time with
the data blob because it requires lots of thinking and we should
get it right as it might have network driver interface implications.
--

From: Lachlan Andrew
Date: Tuesday, December 11, 2007 - 5:14 pm

Greetings all,


If you're redoing the driver interface, could I put in a request for
packet time-stamping at a lower level?

This thread started because TCP processing interferes with RTT
estimation.  This problem would be eliminated if time-stamping were
done as soon as the packet comes off the NIC.

$0.02
Lachlan

-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820    Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan
--

From: David Miller
Date: Wednesday, December 12, 2007 - 8:11 am

From: "Lachlan Andrew" <lachlan.andrew@gmail.com>

We don't do that because such timestamping is too expensive.

It used to be the case that we did this, but we stopped doing
that a long time ago.

On x86 for example, timestamping can involve touching a slow
I/O device to read the timestamp.  We do not want to do that
for every packet.

Also, we timestamp differently for TCP, the global high
resolution timestamp is overkill for this purpose.

Really, this is a silly idea and would only be a bandaid
for the problem at hand, that TCP input processing is
too expensive in certain circumstances.

--

From: Lachlan Andrew
Date: Wednesday, December 12, 2007 - 4:35 pm

Greetings Dave,


OK.  Thanks for the background.

I thought that a TSC read was fairly cheap.  Any messing around to
interpret it could be the responsibility of any task which actually
needs a high-resolution timestamp, couldn't it?  If TSC is disabled,

Overkill for Reno and Cubic, but useful for Vegas, LP, veno, Illinois
and YeAH which are all in the kernel.  They currently use "high
resolution" timestamps which are effectively quantized to the
scheduler resolution because of the way timestamping is done --


That problem should certainly be fixed as well -- I wasn't suggesting
this as an alternative.  Will fixing it fix the problem of those TCP
modules suffering from CPU load from other sources?

(I'm Cc'ing this to Darryl Veitch who has often wanted driver-level
time-stamping for achieving high-resolution synchronization between
hosts.)

Cheers,
Lachlan

-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820    Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan
--

From: David Miller
Date: Wednesday, December 12, 2007 - 4:38 pm

From: "Lachlan Andrew" <lachlan.andrew@gmail.com>

The TSC is cheap, but on the vast majority of systems it
isn't usable as a stable time source, and we have to read
the ACPI timer instead.
--

From: Stephen Hemminger
Date: Wednesday, December 12, 2007 - 5:00 pm

On Wed, 12 Dec 2007 15:35:49 -0800

In current kernel, the congestion control can choose the desired accuracy.
The units are always the same (microseconds), 
but if the flag value "TCP_CONG_RTT_STAMP" is set, then TCP
tries to the high resolution real time clock.  This is relatively
expensive so only vegas, veno, LP, and YeAH use it. It would be
worth an experiment to see whether it is worth it.

All the delay based congestion control algorithms are impractical
in the real world, but everyone keeps trying :-)



-- 
Stephen Hemminger <shemminger@linux-foundation.org>
--


Yeah, had just too much time while waiting for person who never
arrived... :-) It would have covered the typical case quite well

Not sure if it is going to all that straightforward because you seem to 
ignore in this simple description all details of performing the linking 
between the structures, which becomes tricky when we add those

...I.e., how to count retrans_out origins in such case because the 
presented sack_ref knows nothing about them nor do we have anything but 
one bit in ->sacked per skb. We need the origins when retransmission get 
sacked/acked to reduce retrans_out correctly. To solve this, I think the 
sack_ref must have ->sacked as well and the structure should store the 
R-bits there as well, and may L-bits too though their defination is 
more obvious because it's mostly a complement of sacked (except for small 
part near fackets_out and timedout marking that probably makes bookkeeping 
of L still necessary).

The pcount boundaries are no longer that well defined after we break skb 
boundaries during retransmitting, so doing this makes fackets_out 
calculation even more trickier to track, unless whole pcount is replaced 
by byte based fackets_out :-/, which is very trivial to determine from 
seqnos only (TCP_NODELAYers might get unhappy though if we do that in a 
straightforward way...). 

This would clearly be a good step from one perspective. Nowadays GSO'ed 
skbs may get fragmented when mss_now changes, for two or more consecutive 
non-SACKed skbs that means one or more extra (small) packets when 

...I assume that these are fast searchable? And skbs as well (because the 
linking of start/end_skb at minimum is a costly operation otherwise)?


-- 
 i.
--

From: David Miller
Date: Tuesday, January 8, 2008 - 12:36 am

Ilpo, just trying to keep an old conversation from dying off.

Did you happen to read a recent blog posting of mine?

	http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead

I've been thinking more and more and I think we might be able
to get away with enforcing that SACKs are always increasing in
coverage.

I doubt there are any real systems out there that drop out of order
packets that are properly formed and are in window, even though the
SACK specification (foolishly, in my opinion) allows this.

If we could free packets as SACK blocks cover them, all the problems
go away.

For one thing, this will allow the retransmit queue liberation during
loss recovery to be spread out over the event, instead of batched up
like crazy to the point where the cumulative ACK finally moves and
releases an entire window's worth of data.

Next, it would simplify all of this scanning code trying to figure out
which holes to fill during recovery.

And for SACK scoreboard marking, the RB trie would become very nearly
unecessary as far as I can tell.

I would not even entertain this kind of crazy idea unless I thought
the fundamental complexity simplification payback was enormous.  And
in this case I think it is.

What we could do is put some experimental hack in there for developers
to start playing with, which would enforce that SACKs always increase
in coverage.  If violated the connection reset and a verbose log
message is logged so we can analyze any cases that occur.

Sounds crazy, but maybe has potential.  What do you think?
--

From: Ilpo Järvinen
Date: Tuesday, January 8, 2008 - 5:12 am

Luckily we can see that already from MIBs, so quering people who have 
large servers, which are continously "testing" the internet :-), under 
their supervision or can access, and asking if they see any might help.
I checked my dept's interactive servers and all had zero renegings, but
I don't think I have access to www server which would have much wider 

I thought it a bit yesterday after reading your blog and came to 
conclusion that they won't, we can still get those nasty ACKs regardless 
of received SACK info (in here, missing). Even in some valid cases which 
include ACK losses besides actual data loss, not that this is the most 
common case but just wanted to point out that cleanup work is at least 
partially independent of SACK problem. So not "all" problems would go

Two key cases for real pattern are:

1. Losses once per n, where n is something small, like 2-20 or so, usually
   happens at slow start overshoot or when compething traffic slow starts. 
   Cumulative ACKs will cover only small part of the window once rexmits 
   make through, thus this is not a problem.
2. Single loss (or few at the beginning of the window), rest SACKed. 
   Cumulative ACK will cover original window when the last necessary 
   rexmit gets through.

Case 1 becomes nasty ACKy only if rexmit is lost as well, but in that case 
the arriving SACK blocks make the rest of the window equal to 2 :-).

So I'm now trying to solve just case 2. What if we could somehow "combine" 
adjacent skbs (or whatever they're called in that model) if SACK covers 
them both so that we still hold them but can drop them in a very 
efficient way. That would make the combining effort split per ACK. 
And if reneging would occur, we can think a way to put the necessary fuzz 
into a form which cannot hurt the rest of the system (relatively easy & 
fast if we add CA_Reneging and allow retransmitting a portion of an skb 
similar to what you suggested earlier).

And it might even be possible then to offer admin a ...
From: David Miller
Date: Wednesday, January 9, 2008 - 12:58 am

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>

RFCs are great guides by which to implement things, but they have been
often completely wrong or not practicle to follow strictly.

The handling of out of order ACKs with timestamps is my favorite
example.  Nobody performs an RFC compliant timestamp check on ACK
packets, or else their performance would go into the toilet during
packet reordering.

The URG bit setting is another one.

Especially when, practically speaking, we can in fact make changes
like I believe we can here I think we should.
--

From: John Heffner
Date: Tuesday, January 8, 2008 - 9:51 am

Linux has a code path where this can happen under memory over-commit, in 
tcp_prune_queue().  Also, I think one of the motivations for making SACK 
strictly advisory is there was some concern about buggy SACK 
implementations.  Keeping data in your retransmit queue allows you to 
fall back to timeout and go-back-n if things completely fall apart.  For 
better or worse, we have to deal with the spec the way it is.

Even if you made this assumption of "hard" SACKs, you still have to 
worry about large ACKs if SACK is disabled, though I guess you could say 
people running with large windows without SACK deserve what they get. :)


I haven't thought about this too hard, but can we approximate this by 
moving scaked data into a sacked queue, then if something bad happens 
merge this back into the retransmit queue?  The code will have to deal 
with non-contiguous data in the retransmit queue; I'm not sure offhand 
if that violates any assumptions.  You still have a single expensive ACK 
at the end of recovery, though I wonder how much this really hurts.  If 
you want to ameliorate this, you could save this sacked queue to be 
batch processed later, in application context for instance.

   -John


--

From: David Miller
Date: Tuesday, January 8, 2008 - 3:44 pm

From: John Heffner <jheffner@psc.edu>

That defeats the impetus for the change.

We want to free up the data, say, 2 packets at a time as
ACKs come in.  The key goal is smooth liberation of
retransmit queue packets over time.

The big problem is that recovery from even a single packet loss in a
window makes us run kfree_skb() for a all the packets in a full
window's worth of data when recovery completes.

If we just move such packets to a seperate list, we still have to
iterate over all of them when the cumulative ACK arrives.

This problem, that retransmit queue liberation is not smooth, is the
biggest flaw in how SACK is specified.  I mean, consider Ilpo's
mentioned case of 500,000 packet windows.  The issue cannot be
ignored.  SACK is clearly broken.

You speak of a path in Linux where we can reneg on SACKs, but I doubt
it really ever runs because of how aggressive the queue collapser is.
Alexey even has a comment there:

	 * This must not ever occur. */

To be honest this code sits here because it was written before the
queue collapser was coded up.

Really, find me a box where the LINUX_MIB_OFOPRUNED or
LINUX_MIB_RECVPRUNED counters are anything other than zero.

So this is a non-issue and I did consider it before proposing that we
redefine SACK.
--

From: Lachlan Andrew
Date: Tuesday, January 8, 2008 - 6:34 pm

Greetings David,


John also suggested freeing the packets as a lower priority task, just
doing it after they're acknowledged.

When the ACK finally comes, you could do something like moving John's
entire list of packets to a "to be freed" list, and free a few every
time (say) another ACK comes in.

$0.02,
Lachlan

-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820    Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan
--

From: David Miller
Date: Tuesday, January 8, 2008 - 11:35 pm

From: "Lachlan Andrew" <lachlan.andrew@gmail.com>

You could, but this requires extra state and the operations to
properly queue such items to the deferred task and wake it up.

And none of this would be necessary if we could make SACK
indications hard.
--

From: Andi Kleen
Date: Tuesday, January 8, 2008 - 7:25 pm

Why exactly is it a problem to free them all at once? Are you worried
about kernel preemption latencies?

-Andi
--

From: John Heffner
Date: Tuesday, January 8, 2008 - 9:27 pm

I also wonder how much of a problem this is (for now, with window sizes 
of order 10000 packets.  My understanding is that the biggest problems 
arise from O(N^2) time for recovery because every ack was expensive. 
Have current tests shown the final ack to be a major source of problems?

   -John
--

From: David Miller
Date: Tuesday, January 8, 2008 - 11:41 pm

From: John Heffner <jheffner@psc.edu>

Yes, several people have reported this.

It's a real issue, and it's only going to get worse.
--

From: John Heffner
Date: Wednesday, January 9, 2008 - 7:56 am

I may have missed some of this.  Does anyone have a link to some recent 
data?

   -John
--

From: SANGTAE HA
Date: Wednesday, January 9, 2008 - 11:14 am

I had some testing on this a month ago.
A small set of recent results with linux 2.6.23.9 are at
http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency
One of serious cases with a large number of packet losses (initial
loss is around 8000 packets) is at
http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency/600--TCP-TCP-NONE--400-3-1.0--...

Also, there is a comparison among three Linux kernels (2.6.13,
2.6.18-rc4, 2.6.20.3) at
http://netsrv.csc.ncsu.edu/wiki/index.php/Efficiency_of_SACK_processing

Sangtae
--

From: John Heffner
Date: Wednesday, January 9, 2008 - 11:23 am

If I'm reading this right, all these tests occur with large amounts of 
loss and tons of sack processing.  What would be most pertinent to this 
discussion would be a test with a large window, with delayed ack and 
sack disabled, and a single loss repaired by fast retransmit.  This 
would isolate the "single big ack" processing from other factors such as 
doubling the ack rate and sack processing.

I could probably set up such a test, but I don't want to duplicate 
effort if someone else already has done something similar.

Thanks,
   -John
--

From: Ilpo Järvinen
Date: Wednesday, January 9, 2008 - 5:55 am

This thread got started because I tried to solve the other latencies but 
realized that it helps very little because this latency spike would 
have remained unsolved and it happens in one of the most common case.

-- 
 i.
--

From: David Miller
Date: Tuesday, January 8, 2008 - 11:39 pm

From: Andi Kleen <andi@firstfloor.org>

If the cpu is there spinning freeing up 500,000 SKBs, it isn't
processing RX packets.

It adds severe spikes in CPU utilization that are even moderate
line rates begins to affect RTTs.

Or do you think it's OK to process 500,000 SKBs while locked
in a software interrupt.

Perhaps you have another broken awk script to prove this :-)
--

From: Andi Kleen
Date: Wednesday, January 9, 2008 - 12:03 am

You can always push it into a work queue.  Even put it to
other cores if you want. 

In fact this is already done partly for the ->completion_queue.
Wouldn't be a big change to queue it another level down.

Also even freeing a lot of objects doesn't have to be
that expensive. I suspect the most cost is in taking
the slab locks, but that could be batched. Without
that the kmem_free fast path isn't particularly
expensive, as long as the headers are still in cache.

Long ago I had sk_buff optimized for the routing case so that freeing
can be all done with a single cache line. That is 
long gone, but could be probably restored.

But asking for protocol changes just to work around such a

Your hand waved numbers on inline sizes there were definitely worse 
than mine. 

-Andi

--

From: David Miller
Date: Wednesday, January 9, 2008 - 12:16 am

From: Andi Kleen <andi@firstfloor.org>

We're touching SKB struct members, doing atomics on them, etc. for
objects we haven't referenced for at least two RTTs so are guarenteed
to be cache cold.

Let's say best case we can get it down to 2 cache line misses, and as
a very aggressive goal we can get the cost down to 100 cycles.  For
500,000 packets this is 500 million cpu cycles to free them all up.

That's 1/4 of a second even on a 2 GHZ cpu.

And yes there are inherent costs in handling TCP windows that are
500,000 packets in size.  But, that freeing cost should be spread

Your primary objective seems to be "being right", and that's fine but
realize that it makes discussing anything with you about as fun as
picking one's toe nails with an ice axe.

Eventually you will be ignored by most folks who get fed up by this
style of argument.

So, have fun being right rather than being pleasant to work with.
--

From: Evgeniy Polyakov
Date: Wednesday, January 9, 2008 - 2:47 am

Hi.


Postponing freeing of the skb has major drawbacks. Some time ago I
made a patch to postpone skb freeing behind rcu and got 2.5 times slower
connection speed on some machines with decreased CPU usage though.
So, queueing solution has to be proven with real data and although it
looks good in one situation, it can be really bad in another.

For interested reader: results of the RCUfication of the kfree_skbmem()
http://tservice.net.ru/~s0mbre/blog/devel/networking/2006/12/05

-- 
	Evgeniy Polyakov
--

From: Andi Kleen
Date: Wednesday, January 9, 2008 - 7:02 am

> Postponing freeing of the skb has major drawbacks. Some time ago I

Yes, the trick would be to make sure that it also does not tie up
too much memory. e.g. it would need some throttling at least.

Also the fast path of kmem_cache_free() is actually not that
much different from just putting something on a list so perhaps
it would not make that much difference.

-Andi
--

From: Ilpo Järvinen
Date: Friday, December 7, 2007 - 11:27 am

On Fri, 7 Dec 2007, Ilpo J
Previous thread: [PATCH 1/2] sky2: align IP header on Rx if possible by Stephen Hemminger on Wednesday, November 28, 2007 - 3:27 pm. (3 messages)

Next thread: RE: [PATCH 2/2] [net/wireless/iwlwifi] : iwlwifi 4965 Fix race conditional panic. by Joonwoo Park on Wednesday, November 28, 2007 - 6:43 pm. (1 message)