Re: Raise initial congestion window size / speedup slow start?

From: Ed W
Date: Wednesday, July 14, 2010 - 3:43 am

Hi, my network connection looks like 500Kbits with a round trip latency 
of perhaps 1s+ (it's a satellite link).

From what I can see the Linux initial congestion window is significantly 
limiting me here, with slow start taking many many seconds to open up 
the window wide enough to get the data flowing?  For protocols like http 
this is really hurting with all the short lived connections never really 
getting up to speed.  (throw in some random packet loss and things 
really screech to a halt)

Reading around there appear to be several previous attempts to modify 
the kernel to start with a slightly wider initial congestion window, say 
10 packets.  (Seems even Google did some work on this and agreed that a 
small increase of the initial cwnd to 10-ish would help even many non-satellite 
users?) However, all the work I can find is quite old and doesn't seem 
to give me much of a leg up in terms of experimenting with such changes 
on a modern kernel?

Does someone have some pointers on where to look to modify initial 
congestion window please?


Thanks

Ed W
--

From: Alan Cox
Date: Wednesday, July 14, 2010 - 4:58 am

For http it's the window the server end that will matter as most data
goes server->client. http also has HTTP/1.1 so that in the normal case
you don't get lots of small connections.

An http request is normally sub MTU size (unless it's got auth and lots
of cookie crap) so your congestion window should be irrelevant. It's the
cwnd the other end that will matter.

Have you considered running a web proxy the other end of the link? That
would keep the DNS lookup work on the fast rtt side, and mean if your web
browser is being sane you are maintaining one connection for most of your
work. It also means you can run advert and junk filters the better end.

If you want to explore it further you want: netdev@vger.kernel.org really.

Alan
--

From: Bill Davidsen
Date: Wednesday, July 14, 2010 - 8:21 am

Are you sure that's the issue? The backlog is in incoming, is it not?

Having dealt with moderately long delays pushing TB between timezones, have you set 
your window size up? Set /proc/sys/net/ipv4/tcp_adv_win_scale to 5 or 6 and see 
if that helps. You may have to go into /proc/sys/net/core and crank up the 
rmem_* settings, depending on your distribution.

This allows the server to push a lot of data without an ack, which is what you 
want, the ack will be delayed by the long latency, so this helps. You can 
calculate how large to make the setting, but "make it bigger until tcpdump never 
shows the window size < 2k" is the best way, even a bit larger than that won't 
hurt, although it will take a bit of memory.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
--

From: David Miller
Date: Wednesday, July 14, 2010 - 11:15 am

From: Bill Davidsen <davidsen@tmr.com>

You should never, ever, have to touch the various networking sysctl
values to get good performance in any normal setup.  If you do, it's a
bug, report it so we can fix it.

I cringe every time someone says to do this, so please do me a favor
and don't spread this further. :-)

For one thing, TCP dynamically adjusts the socket buffer sizes based
upon the behavior of traffic on the connection.

And the TCP memory limit sysctls (not the core socket ones) are sized
based upon available memory.  They are there to protect you from
situations such as having so much memory dedicated to socket buffers
that there is none left to do other things effectively.  It's a
protective limit, rather than a setting meant to increase or improve
performance.  So like the others, leave these alone too.
--

From: Ed W
Date: Wednesday, July 14, 2010 - 11:48 am

Just checking the basics here because I don't think this is a bug so 
much as a less common installation that differs from the "normal" case.

- When we create a tcp connection we always start with tcp slow start
- This sets the congestion window to effectively 4 packets?
- This applies in both directions?
- Remote sender responds to my hypothetical http request with the first 
4 packets of data
- We need to wait one RTT for the ack to come back and now we can send 
the next 8 packets,
- Wait for the next ack and at 16 packets we are now moving at a 
sensible fraction of the bandwidth delay product?
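
The ramp-up described in the list above can be put into numbers with a
small sketch. It assumes an idealised slow start (cwnd doubles every
round trip, no loss, no delayed ACKs) and a 1460-byte MSS; the 500 kbit/s
and 1 s figures come from the original post, everything else is an
assumption made for illustration.

```python
import math


def bdp_segments(bandwidth_bps, rtt_s, mss_bytes=1460):
    """Bandwidth-delay product in full-sized segments, rounded up."""
    return math.ceil(bandwidth_bps * rtt_s / 8 / mss_bytes)


def slow_start_flights(initial_cwnd, target_segments):
    """List the cwnd used in each round trip until it reaches the target."""
    flights = [initial_cwnd]
    while flights[-1] < target_segments:
        flights.append(flights[-1] * 2)  # idealised doubling per RTT
    return flights


target = bdp_segments(500_000, 1.0)  # the satellite link from the first post
for iw in (4, 10):
    flights = slow_start_flights(iw, target)
    print(f"IW={iw}: target={target} segments, flights={flights}, "
          f"doublings needed={len(flights) - 1}")
```

Under these assumptions the pipe holds about 43 segments, so starting
from 4 takes four doublings (roughly four seconds at a 1 s RTT) and
starting from 10 takes three.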

So just to be clear:
- We don't seem to have any user-space tuning knobs to influence this 
right now?
- In this age of short attention spans, a couple of extra seconds 
between clicking something and it responding is worth optimising (IMHO)
- I think I need to take this to netdev, but anyone else with any ideas 
happy to hear them?

Thanks

Ed W
--

From: Stephen Hemminger
Date: Wednesday, July 14, 2010 - 12:10 pm

On Wed, 14 Jul 2010 19:48:36 +0100

TCP slow start is required by the RFC. It is there to prevent a TCP congestion
collapse. The HTTP problem is exacerbated by things beyond the user's control:
  1. stupid server software that dribbles out data and doesn't use the full
    payload of the packets
  2. web pages with data from multiple sources (ads especially), each of which
    requires a new connection
  3. pages with huge graphics.

Most of this is because of sites that haven't figured out that somebody on a phone
across the globe might not have the same RTT and bandwidth as the developer on the
local network who created them.  Changing the initial cwnd isn't going to fix it.
--

From: Mitchell Erblich
Date: Wednesday, July 14, 2010 - 2:47 pm

IMO, in theory one of the RFCs states a window of 4 ETH MTU (~6k window)
size packets/segments to allow a fast retransmit if a pkt is dropped.

I thought there is a fast-rexmit knob of 2 or 3 DUPACKs, for faster loss recovery.
Theoretically it could be set to 1 DUPACK for lossy environments.

Now, the orig slow-start doubles the number of pkts per RTT assuming no loss,
which is a faster ramp up vs the orig congestion avoidance.

Now, with IPv4 with a default of 576 sized segments, without invalidating
the amount of data, 12 pkts could be sent. This would be helpful if your
app only generates smaller buffers,  gets more ACKs in return which sets
the ACK clocking at a faster rate. To compensate for the smaller pkt, the ABC
Experimental  RFC does byte counting to suggest fairness.
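
For reference, the initial window formula in RFC 3390 is
min(4*MSS, max(2*MSS, 4380 bytes)). The sketch below evaluates it for
three MSS values chosen to match the 576-byte, 1.5k and 9k MTUs mentioned
here; the 40 bytes subtracted for IP and TCP headers is an assumption
(no options).

```python
def rfc3390_initial_window(mss_bytes):
    """Upper bound on the initial window in bytes, per the RFC 3390 formula."""
    return min(4 * mss_bytes, max(2 * mss_bytes, 4380))


# MSS = MTU - 40 bytes of IPv4 + TCP headers (assumes no options).
for mss in (536, 1460, 8960):
    iw = rfc3390_initial_window(mss)
    print(f"MSS={mss}: IW={iw} bytes = {iw // mss} segments")
```

With a 1460-byte MSS the 4380-byte cap works out to 3 segments; the full
4 segments only apply to smaller segment sizes such as 536 bytes.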

During a few round trips, the pkt size could be increased to the 1.5k ETH MTU
and hopefully to even a 9k Jumbo, probing with one increasing sized pkt.
(?to prevent rexmit of the too large pkt, overlap the increasing pkt with the next
one?)


--

From: Rick Jones
Date: Wednesday, July 14, 2010 - 1:17 pm

Any TCP sender in some degree of compliance with the RFCs on the topic will 
employ slow-start.

Linux adds the auto-tuning of the receiver's advertised window.  It will start 

There may be some wrinkles depending on how many ACKs the receiver generates 

There is an effort under way, led by some folks at Google and including some 
others, to get the RFCs enhanced in support of the concept of larger initial 
congestion windows.  Some of the discussion may be in the "tcpm" mailing list 
(assuming I've not gotten my mailing lists confused).  There may be some 
previous discussion of that work in the netdev archives as well.


--

From: David Miller
Date: Wednesday, July 14, 2010 - 2:55 pm

From: Hagen Paul Pfeifer <hagen@jauu.net>

Although section 3 of RFC 5681 is a great text, it does not say at all
that increasing the initial CWND would lead to fairness issues.

To be honest, I think google's proposal holds a lot of weight.  If
over time link sizes and speeds are increasing (they are) then nudging
the initial CWND every so often is a legitimate proposal.  Were
someone to claim that utilization is lower than it could be because of
the currently specified initial CWND, I would have no problem
believing them.

And I'm happy to make Linux use an increased value once it has
traction in the standardization community.

But for all we know this side discussion about initial CWND settings
could have nothing to do with the issue being reported at the start of
this thread. :-)

--

From: Hagen Paul Pfeifer
Date: Wednesday, July 14, 2010 - 3:13 pm

Because it is only one side of the coin: conservatively probing the available
link capacity in conjunction with n simultaneous probing TCP/SCTP/DCCP

Currently I know of no working link capacity probing approach, without active
network feedback, for conservatively probing the available link capacity with a

;-) sure, but it is often wise to thwart these kind of discussions. It seems
these CWND discussions turn up once every other month. ;-)

Hagen

--

From: Rick Jones
Date: Wednesday, July 14, 2010 - 3:19 pm

Which suggests there is a constant "force" out there yet to be reckoned with. :)

rick jones
--

From: Hagen Paul Pfeifer
Date: Wednesday, July 14, 2010 - 3:40 pm

;-) I am _not_ unconscious, but the better address for this kind of
discussions is still tcpm.

Hagen
--

From: Ed W
Date: Wednesday, July 14, 2010 - 3:52 pm

So let's define the problem more succinctly:
- New TCP connections are assumed to have no knowledge of current 
network conditions (bah)
- We desire the connection to consume the maximum amount of bandwidth 

Sounds like smarter people than I have played this game, but just to 
chuck out one idea: How about attacking the idea that we have no 
knowledge of network conditions?  After all we have a bunch of 
information about:

1) very good information about the size of the link to the first hop (eg 
the modem/network card reported rate)
2) often a reasonably good idea about the bandwidth to the first 
"restrictive" router along our default path (ie usually the situation is 
there is a pool of high speed network locally, then a more limited 
connectivity between our network and other networks.  We can look at the 
maximum flows through our network device to outside our subnet and infer 
an approximate link speed from that)
3) often moderate quality information about the size of the link between 
us and a specific destination IP

So here goes: the heuristic could be to examine current flows through 
our interface, use this to offer hints to the remote end during SYN 
handshake as to a recommended starting size, and additionally the client 
side can examine the implied RTT of the SYN/ACK to further fine tune the 
initial cwnd?

In practice this could be implemented in other ways such as examining 
recent TCP congestion windows and using some heuristic to start "near" 
those.  Or remembering congestion windows recently used for popular 
destinations?  Also we can benefit the receiver of our data - if we see 
some app open up 16 http connections to some poor server then some of 
those connections will NOT be given large initial cwnd.

Essentially perhaps we can refine our initial cwnd heuristic somewhat if 
we assume better than zero knowledge about the network link?
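
One way to picture such a heuristic is the sketch below. It is purely
hypothetical and not anything the kernel implements: the function name,
the clamping bounds and the idea of dividing the estimated
bandwidth-delay product among active flows are all assumptions made for
illustration.

```python
def suggest_initial_cwnd(bottleneck_bps, handshake_rtt_s, active_flows=1,
                         mss_bytes=1460, floor=3, ceiling=64):
    """Hypothetical heuristic: a fair share of the estimated BDP, clamped.

    bottleneck_bps  -- guessed speed of the most restrictive link
    handshake_rtt_s -- RTT measured from the SYN / SYN-ACK exchange
    active_flows    -- connections already sharing that bottleneck
    """
    bdp_segments = bottleneck_bps * handshake_rtt_s / 8 / mss_bytes
    share = int(bdp_segments / max(active_flows, 1))
    return max(floor, min(ceiling, share))


print(suggest_initial_cwnd(500_000, 1.0))                   # lone flow
print(suggest_initial_cwnd(500_000, 1.0, active_flows=16))  # 16 parallel flows
print(suggest_initial_cwnd(100_000_000, 0.08))              # fast path, capped
```

The clamp matters in both directions: the floor keeps a crowded link from
dropping below today's default, and the ceiling limits the damage when
the bandwidth guess is badly wrong.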


Out of curiosity, why has it taken so long for active feedback to 
appear?  If every router simply added a ...
--

From: Hagen Paul Pfeifer
Date: Wednesday, July 14, 2010 - 4:01 pm

It is quite late here so I will quickly write two sentences about ECN: one
month ago Lars Eggert posted a link on the tcpm mailing list where Google (not
really sure if it was Google) analysed the deployment of ECN - the usage was
really low. Search for the PDF; it is quite an interesting one.

Hagen


--

From: Ed W
Date: Wednesday, July 14, 2010 - 4:05 pm

I would speculate that this is because there is a big warning on ECN 
saying that it may cause you to lose customers who can't connect to 
you... Businesses are driven by needing to support the most common case, 
not the most optimal (witness the pain of html development and needing 
to consider IE6...)

What would be more useful is for google to survey how many devices are 
unable to interoperate with ECN and if that number turned out to be 
extremely low, and this fact were advertised, then I suspect we might 
see a mass increase in its deployment?  I know I have it turned off on 
all my servers because I worry more about losing one customer than 
improving the experience for all customers...

Cheers

Ed W
--

From: Bill Fink
Date: Wednesday, July 14, 2010 - 8:49 pm

A long, long time ago, I suggested a Path BW Discovery mechanism
to the IETF, analogous to the Path MTU Discovery mechanism, but
it didn't get any traction.  Such information could be extremely
useful to TCP endpoints, to determine a maximum window size to
use, to effectively rate limit a much stronger sender from
overpowering a much weaker receiver (for example 10-GigE -> GigE),
resulting in abominable performance across large RTT paths
(as low as 12 Mbps), even in the absence of any real network
contention.

						-Bill
--

From: H.K. Jerry Chu
Date: Wednesday, July 14, 2010 - 10:29 pm

Unfortunately that is not going to help initcwnd (unless one can invent a
PBWD protocol from just 3WHS), and the web is dominated by short-lived
connections so the small initcwnd becomes a choke point.

--

From: Rick Jones
Date: Thursday, July 15, 2010 - 12:51 pm

I have to wonder if the only heuristic one could employ for divining the initial 
congestion window is to be either pessimistic/conservative or 
optimistic/liberal.  Or for that matter the only one one really needs here?

That's what it comes down to doesn't it?  At any one point in time, we don't 
*really* know the state of the network and whether it can handle the load we 
might wish to put upon it.  We are always reacting to it. Up until now, it has 
been felt necessary to be pessimistic/conservative at time of connection 
establishment and not rely as much on the robustness of the "control" part of 
avoidance and control.

Now, the folks at Google have lots of data to suggest we don't need to be so 
pessimistic/conservative and so we have to decide if we are willing to be more 
optimistic/liberal.  Broadly handwaving, the "netdev we" seems to be willing to 
be more optimistic/liberal in at least a few cases, and the question comes down 
to whether or not the "IETF we" will be similarly willing.

rick jones
--

From: Stephen Hemminger
Date: Thursday, July 15, 2010 - 1:48 pm

On Thu, 15 Jul 2010 12:51:22 -0700

I am not convinced that a host being aggressive with initial cwnd (Linux) would
not end up unfairly monopolizing available bandwidth compared to older more conservative
implementations (Windows). Whether fairness is important or not is another debate.

--

From: H.K. Jerry Chu
Date: Thursday, July 15, 2010 - 5:23 pm

I don't even consider a modest IW increase to 10 to be aggressive. The scaling
of IW is only adequate IMO given the huge b/w growth in the past decade.
Remember there could be plenty of flows sending large cwnd bursts at twice
the bottleneck link rate at any point of time in the network anyway, so
the "fairness" question may already be ill-defined. In any case we're
trying to conduct some experiments in a private testbed to hopefully get
some insights with real data.

Jerry

--

From: Hagen Paul Pfeifer
Date: Friday, July 16, 2010 - 2:03 am

Much weaker middlebox? The windowing mechanism should be sufficient to
keep endpoints from over-committing.

Anyway, your proposed draft (I didn't search for it) sounds like a
mechanism similar to RFC 4782: Quick-Start for TCP and IP.


   This document specifies an optional Quick-Start mechanism for
   transport protocols, in cooperation with routers, to determine an
   allowed sending rate at the start and, at times, in the middle of a
   data transfer (e.g., after an idle period).  While Quick-Start is
   designed to be used by a range of transport protocols, in this
   document we only specify its use with TCP.  Quick-Start is designed
   to allow connections to use higher sending rates when there is
   significant unused bandwidth along the path, and the sender and all
   of the routers along the path approve the Quick-Start Request.


Cheers, Hagen
--

From: Alan Cox
Date: Thursday, July 15, 2010 - 3:33 am

On Thu, 15 Jul 2010 00:13:01 +0200

Given perfect information from the network nodes you still need to
traverse the network each direction and then return an answer which means
with a 0.5sec end to end time as in the original posting causality itself
demands 1.5 seconds to get an answer which is itself incomplete and
obsolete.

Causality isn't showing any signs of going away soon.

--

From: Ed W
Date: Wednesday, July 14, 2010 - 3:05 pm

I'm sure you have covered this to the point you are fed up, but my 
searches turn up only a smattering of posts covering this - could you 
summarise why "you cannot raise the initial cwnd and expect a fair 
behaviour"?

Initial cwnd was changed (increased) in the past (rfc3390) and the RFC 
claims that studies then suggested that the benefits were all positive. 
Some reasonably smart people have suggested that it might be time to 
review the status quo again so it doesn't seem completely obvious that 

Sorry, what do you mean by a "consolidated contribution"?

That RFC is a subtle read - it appears to give more specific guidance on 
what to do in certain situations, but I'm not sure I see that it 
improves slow start convergence speed for my situation (large RTT)?  
Would you mind highlighting the new bits for those of us a bit newer to 

Oh, excellent.  This seems like exactly what I'm after.  (Thanks Stephen 
Hemminger!)

Many thanks

Ed W
--

From: Hagen Paul Pfeifer
Date: Wednesday, July 14, 2010 - 3:36 pm

Do you cite "An Argument for Increasing TCP's Initial Congestion Window"?
People at google stated that a CWND of 10 seems to be fair in their
measurements. 10 because the test setup was equipped with a reasonably large
link capacity? Do they analyse their modification in environments with a small
BDP (e.g. multihop MANET setup, ...)? I am curious, but we will see what

The objection/hint was more of general nature - not specific for larger RTTs.
Environments with larger RTTs are disadvantaged because TCP is ACK clocked.
Half-truth statement for my part because RTT fairness is and was an issue at

Great, you are welcome! ;-)


Hagen


--

From: Ed W
Date: Wednesday, July 14, 2010 - 4:01 pm

Well, I personally would shoot for starting from the position of 
assuming better than zero knowledge about our link and incorporating 
that into the initial cwnd estimate...

We know something about the RTT from the syn/ack times, speed of the 
local link and quickly we will learn about median window sizes to other 
destinations, plus additionally the kernel has some knowledge of other 
connections currently in progress.  With all that information perhaps we 
can make a more informed choice than just a hard coded magic number? (Oh 
and let's make the option pluggable so that we can soon have 10 different 
kernel options...)

Seems like there is evidence that networks are starting to cluster into 
groups that would benefit from a range of cwnd options (higher/lower) - 
perhaps there is some way to choose a reasonable heuristic to cluster 
these and choose a better starting option?

Cheers

Ed W


--

From: Tom Herbert
Date: Wednesday, July 14, 2010 - 9:12 pm

There is an Internet draft
(http://datatracker.ietf.org/doc/draft-hkchu-tcpm-initcwnd/) on
raising the default Initial Congestion window to 10 segments, as well
as a SIGCOMM paper (http://ccr.sigcomm.org/online/?q=node/621).  We
presented this proposal and data supporting it at Anaheim IETF, and
will be following up in Netherlands with more data including some of
which should further address fairness questions.

In terms of Linux implementation, setting ICW via ip route is
sufficient support on the server side.  There is also a proposed patch
which could allow applications to set ICW themselves (in hopes that
application can reduce number of simultaneous connections).  On the
client side we can now adjust the receive window to advertise larger
initial windows.  Among current implementations, Linux advertises the
smallest default receive window of major OSes, so it turns out Linux
clients won't get lower latency benefits currently (so we'll probably
ask to raise the default some day :-)).
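
As a sketch of the server-side setting mentioned above: iproute2 accepts
an `initcwnd` attribute on a route. The helper below only builds the
command line and prints it; the gateway address and device name are
placeholders, and actually applying the change needs root and an
existing route.

```python
import shlex
import subprocess


def initcwnd_command(route_spec, initcwnd):
    """Build the argv for `ip route change <route_spec> initcwnd <n>`."""
    if initcwnd < 1:
        raise ValueError("initcwnd must be a positive number of segments")
    return ["ip", "route", "change", *shlex.split(route_spec),
            "initcwnd", str(initcwnd)]


def apply_initcwnd(route_spec, initcwnd):
    """Run the command; requires root and a matching existing route."""
    subprocess.run(initcwnd_command(route_spec, initcwnd), check=True)


# Placeholder gateway and device; printed here rather than executed.
print(" ".join(initcwnd_command("default via 192.0.2.1 dev eth0", 10)))
```

Because the value lives on the route, it applies to every connection that
uses that route without the applications having to change.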

--

From: Ed W
Date: Thursday, July 15, 2010 - 12:48 am

You guys have obviously done a lot of work on this, however, it seems 
that there is a case for introducing some heuristics into the choice of 
init cwnd as well as offering the option to go larger?  An initial size 
of 10 packets is just another magic number that obviously works with the 
median bandwidth delay product on today's networks - can we not do 
better still?

Seems like a bunch of clever folks have already suggested tweaks to the 
steady state congestion avoidance, but so far everyone is afraid to 
touch the early stage heuristics?

Also would you guys not benefit from wider deployment of ECN?  Can you 
not help find some ways that deployment could be increased?  At present 
there are big warnings all over the option that it causes some problems, 
but there is no quantification of how much and really whether this 
warning is still appropriate?

Ed W

--

From: Jerry Chu
Date: Thursday, July 15, 2010 - 10:36 am

This is because there is not enough info for deriving any heuristic.
For initcwnd one is constrained to only info from 3WHS. This includes a
rough estimate of RTT plus all the bits in the SYN/SYN-ACK headers. I'm
assuming a stateless approach. We've tried a stateful solution (i.e.,
seeding initcwnd from past history) but found its complexity outweighs
the gain.

That will add yet another hoop for us to jump over. Also I'm not sure
a couple of bits are sufficient for a guesstimate of what initcwnd ought
to be.

Our reasoning is simple - there has been tremendous b/w growth since
rfc2414 was published. Even the lowest common denominator (i.e., dialup
links) has moved from 9.6Kbps to 56Kbps. That's a six fold increase. If
you believe initcwnd should grow proportionally to the buffer sizes in
access links, and the buffer sizes grow proportionally to b/w, then the
initcwnd ought to be 3*6 = 18 today.
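
The scaling argument above works out as follows. The only inputs are the
figures quoted in this message; rounding the growth factor to a whole
number before multiplying is an assumption that matches the "six fold"
wording.

```python
old_dialup_bps = 9_600   # lowest common denominator when rfc2414 appeared
new_dialup_bps = 56_000  # lowest common denominator quoted for today
old_initcwnd = 3         # segments

growth = new_dialup_bps / old_dialup_bps
scaled_initcwnd = old_initcwnd * round(growth)
print(f"growth={growth:.2f}x, scaled initcwnd={scaled_initcwnd}")
```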

We chose a modest increase (10) with the hope of expediting the
standardization process (and would certainly appreciate help from folks
on this list). 10 is very conservative considering many deployments have
gone beyond 3, including the Linux stack, which allows one additional
pkt if it's the last data pkt.

Longer term it will be nice to find a way to get rid of this fixed,
somewhat arbitrary initcwnd. Mark Allman's JumpStart is one idea, but
it'd be a much longer route.

--

From: H.K. Jerry Chu
Date: Wednesday, July 14, 2010 - 10:09 pm

Please don't mislead. Raising the initcwnd is actively being pursued at IETF
right now. If not here, where else? It is following the same path where initcwnd
was first raised in the late '90s through rfc2414/rfc3390.

IETF is not a standards organization just for protocol lawyers to play
word games. It is responsible for solving real technical issues as well.

--

From: Bill Fink
Date: Wednesday, July 14, 2010 - 7:52 pm

What's normal?  :-)

netem1% cat /proc/version 
Linux version 2.6.30.10-105.2.23.fc11.x86_64 (mockbuild@x86-01.phx2.fedoraproject.org) (gcc version 4.4.1 20090725 (Red Hat 4.4.1-2) (GCC) ) #1 SMP Thu Feb 11 07:06:34 UTC 2010

Linux TCP autotuning across an 80 ms RTT cross country network path:

netem1% nuttcp -T10 -i1 192.168.1.18
   14.1875 MB /   1.00 sec =  119.0115 Mbps     0 retrans
  558.0000 MB /   1.00 sec = 4680.7169 Mbps     0 retrans
  872.8750 MB /   1.00 sec = 7322.3527 Mbps     0 retrans
  869.6875 MB /   1.00 sec = 7295.5478 Mbps     0 retrans
  858.4375 MB /   1.00 sec = 7201.0165 Mbps     0 retrans
  857.3750 MB /   1.00 sec = 7192.2116 Mbps     0 retrans
  865.5625 MB /   1.00 sec = 7260.7193 Mbps     0 retrans
  872.3750 MB /   1.00 sec = 7318.2095 Mbps     0 retrans
  862.7500 MB /   1.00 sec = 7237.2571 Mbps     0 retrans
  857.6250 MB /   1.00 sec = 7194.1864 Mbps     0 retrans

 7504.2771 MB /  10.09 sec = 6236.5068 Mbps 11 %TX 25 %RX 0 retrans 80.59 msRTT

Manually specified 100 MB TCP socket buffer on the same path:

netem1% nuttcp -T10 -i1 -w100m 192.168.1.18
  106.8125 MB /   1.00 sec =  895.9598 Mbps     0 retrans
 1092.0625 MB /   1.00 sec = 9160.3254 Mbps     0 retrans
 1111.2500 MB /   1.00 sec = 9322.6424 Mbps     0 retrans
 1115.4375 MB /   1.00 sec = 9356.2569 Mbps     0 retrans
 1116.4375 MB /   1.00 sec = 9365.6937 Mbps     0 retrans
 1115.3125 MB /   1.00 sec = 9356.2749 Mbps     0 retrans
 1121.2500 MB /   1.00 sec = 9405.6233 Mbps     0 retrans
 1125.5625 MB /   1.00 sec = 9441.6949 Mbps     0 retrans
 1130.0000 MB /   1.00 sec = 9478.7479 Mbps     0 retrans
 1139.0625 MB /   1.00 sec = 9555.8559 Mbps     0 retrans

10258.5120 MB /  10.20 sec = 8440.3558 Mbps 15 %TX 40 %RX 0 retrans 80.59 msRTT

The manually selected TCP socket buffer size both ramps up
quicker and achieves a much higher steady state rate.
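
A quick check of why a buffer of roughly 100 MB suits this path: the
window needed to keep a link busy is the rate multiplied by the RTT. The
sketch below uses the 80.59 ms RTT reported by nuttcp above; the 10
Gbit/s line rate and the 64 KiB comparison window are assumptions added
for illustration.

```python
def window_bytes(rate_bps, rtt_s):
    """Bytes that must be in flight to sustain rate_bps over rtt_s."""
    return rate_bps * rtt_s / 8


def max_rate_bps(window, rtt_s):
    """Throughput ceiling in bit/s for a fixed window given in bytes."""
    return window * 8 / rtt_s


rtt = 0.08059  # 80.59 ms, from the nuttcp summary lines
print(f"10 Gbit/s needs about {window_bytes(10e9, rtt) / 1e6:.0f} MB in flight")
print(f"7.2 Gbit/s implies about {window_bytes(7.2e9, rtt) / 1e6:.0f} MB in flight")
print(f"a 64 KiB window tops out near {max_rate_bps(65536, rtt) / 1e6:.1f} Mbit/s")
```

By the same arithmetic, the autotuned run that settled around 7.2 Gbit/s
corresponds to somewhat over 70 MB in flight.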

					-Bill
--

From: H.K. Jerry Chu
Date: Wednesday, July 14, 2010 - 9:51 pm

Agreed, except there are indeed bugs in the code today in that the
code in various places assumes initcwnd as per RFC3390. So when
initcwnd is raised, that actual value may be limited unnecessarily by
the initial wmem/sk_sndbuf.

Will try to find time to submit a patch.

--

From: Patrick McManus
Date: Friday, July 16, 2010 - 10:01 am

Thanks for the discussion!

can you tell us more about the impl concerns of initcwnd stored on the
route?

and while I'm asking for info, can you expand on the conclusion
regarding poor cache hit rates for reusing learned cwnds? (ok, I admit I
only read the slides.. maybe the paper has more info?)

article and slides much appreciated and very interesting. I've long been
of the opinion that the downsides of being too aggressive once in a
while aren't all that serious anymore.. as someone else said in a
non-reservation world you are always trying to predict the future anyhow
and therefore overflowing a queue is always possible no matter how
conservative.




--

From: Ed W
Date: Friday, July 16, 2010 - 10:41 am

My guess is that this result is specific to google and their servers?

I guess we can probably stereotype the world into two pools of devices:

1) Devices in a pool of fast networking, but connected to the rest of 
the world through a relatively slow router
2) Devices connected via a high speed network and largely the bottleneck 
device is many hops down the line and well away from us

I'm thinking here 1) client users behind broadband routers, wireless, 
3G, dialup, etc and 2) public servers that have obviously been 
deliberately placed in locations with high levels of interconnectivity.

I think history information could be more useful for clients in category 
1) because there is a much higher probability that their most 
restrictive device is one hop away and hence affects all connections and 
relatively occasionally the bottleneck is multiple hops away.  For 
devices in category 2) it's much harder because the restriction will 
usually be lots of hops away and effectively you are trying to figure 
out and cache the speed of every ADSL router out there...  For sure you 
can probably figure out how to cluster this stuff and say that pool 
there is 56K dialup, that pool there is "broadband", that pool is cell 
phone, etc, but probably it's hard to do better than that?

So my guess is this is why google have had poor results investigating 
cwnd caching?

However, I would suggest that whilst it's of little value for the server 
side, it still remains a very interesting idea for the client side and 
the cache hit ratio would seem to be dramatically higher here?


I haven't studied the code, but given there is a userspace ability to 
change init cwnd through the IP utility, it would seem likely that 
relatively little coding would now be required to implement some kind of 
limited cwnd caching and experiment with whether this is a valuable 
addition?  I would have thought if you are only fiddling with devices 
behind a broadband router then there is little chance of you ...
--

From: H.K. Jerry Chu
Date: Friday, July 16, 2010 - 6:23 pm

Actually we have investigated two types of caches: a short-history, limited-size
internal cache that is subject to some LRU replacement policy, hence much
limiting the cache hit rate, and a long-history external cache, which provides
much more accurate cwnd history per subnet but with high complexity and
deployment headache.
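
To make the first kind of cache concrete, here is a minimal sketch of a
fixed-size LRU cache of recently seen cwnd values. It is not the
implementation described here: the class name, the default of 3
segments, and the choice to key entries by IPv4 /24 network are all
assumptions made for illustration.

```python
from collections import OrderedDict
import ipaddress


class CwndCache:
    """Small LRU cache of last-seen cwnd per IPv4 /24 (hypothetical design)."""

    def __init__(self, capacity=1024, default_cwnd=3):
        self.capacity = capacity
        self.default_cwnd = default_cwnd
        self._entries = OrderedDict()

    @staticmethod
    def _key(addr):
        # Collapse the host address to its /24 network.
        return ipaddress.ip_network(f"{addr}/24", strict=False)

    def lookup(self, addr):
        key = self._key(addr)
        if key not in self._entries:
            return self.default_cwnd    # miss: fall back to the fixed default
        self._entries.move_to_end(key)  # hit: mark as most recently used
        return self._entries[key]

    def record(self, addr, cwnd):
        key = self._key(addr)
        self._entries[key] = cwnd
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


cache = CwndCache(capacity=2)
cache.record("198.51.100.7", 20)
print(cache.lookup("198.51.100.200"))  # same /24, so this is a hit
cache.record("203.0.113.1", 12)
cache.record("192.0.2.9", 8)           # third network evicts the oldest entry
print(cache.lookup("198.51.100.7"))    # evicted, falls back to the default
```

With many distinct remote clients and a small capacity, most lookups end
up in the fallback branch, which is the hit-rate limitation described
above.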

Also we have set out for a much more ambitious goal, to not just speed
up our own services, but also provide a solution that could benefit the
whole web (see http://code.google.com/speed/index.html). The latter pretty
much precludes a complex external cache scheme mentioned above.

--

From: H.K. Jerry Chu
Date: Friday, July 16, 2010 - 5:36 pm

We have found two issues when altering initcwnd through the ip route cmd:
1. initcwnd is actually capped by sndbuf (i.e., tcp_wmem[1], which
defaults to a small value of 16KB). This problem has been obscured
by the TSO code, which fudges the flow control limit (and could be a bug by
itself).

2. the congestion backoff code is supposed to take inflight, rather than cwnd,
but initcwnd presents a special case. I don't fully understand the code yet to

This is partly due to our load balancer policy resulting in poor cache hit,
partly due to the sheer volume of remote clients. Some colleagues
tried to change the host cache to a /24 subnet cache but the result wasn't

Please voice your support to TCPM then :)

--

From: Rick Jones
Date: Monday, July 19, 2010 - 10:08 am

I'll ask my Emily Litella question of the day and inquire as to why that would 
be unique to altering initcwnd via the route?

The slightly less Emily Litella-esque question is why an application with a 
desire to know it could send more than 16K at one time wouldn't have either 
asked via its install docs to have the minimum tweaked (certainly if one is 
already tweaking routes...), or "gone all the way" and made an explicit 
setsockopt(SO_SNDBUF) call?  We are in a realm of applications for which there 
was a proposal to allow them to pick their own initcwnd right?  Having them pick 
an SO_SNDBUF size would seem to be no more to ask.
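
For illustration, an explicit send-buffer request from an application
looks roughly like this. The 64 KiB figure is an arbitrary example. On
Linux the value read back is normally larger than the request, because
the kernel doubles it to cover bookkeeping overhead, and it is capped by
net.core.wmem_max; setting it explicitly also turns off send-buffer
autotuning for that socket.

```python
import socket


def make_socket_with_sndbuf(requested_bytes):
    """Create a TCP socket and request a send buffer before connecting."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested_bytes)
    return sock


sock = make_socket_with_sndbuf(64 * 1024)
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print(f"requested 65536 bytes, kernel reports {granted}")
sock.close()
```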

rick jones

sendbuf_init = max(tcp_mem,initcwnd)?
--

From: H.K. Jerry Chu
Date: Monday, July 19, 2010 - 3:51 pm

Per app setting of initcwnd is just one case. Another is per route setting of
initcwnd basis through the ip route cmd. For the latter the initcwnd change is
more or less supposed to be transparent to apps.

This wasn't a big issue and can probably be easily fixed by initializing
sk_sndbuf to max(tcp_wmem[1], initcwnd) as you alluded to below. It is just
that our experiments got hindered by this little bug, but we weren't aware
of it sooner due to TSO fudging sndbuf.
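
A sketch of that max(tcp_wmem[1], initcwnd) idea, with one assumption
made explicit: initcwnd is counted in segments while the buffer is
counted in bytes, so the sketch multiplies by an assumed 1460-byte MSS.
It also ignores the per-packet overhead the kernel charges against the
buffer, so real sizing would need more headroom than shown.

```python
def initial_sndbuf(tcp_wmem_default, initcwnd_segments, mss_bytes=1460):
    """Sketch of max(tcp_wmem[1], initcwnd) with initcwnd converted to bytes."""
    return max(tcp_wmem_default, initcwnd_segments * mss_bytes)


for initcwnd in (3, 10, 16):
    print(f"initcwnd={initcwnd}: sndbuf={initial_sndbuf(16 * 1024, initcwnd)}")
```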

--

From: Hagen Paul Pfeifer
Date: Monday, July 19, 2010 - 4:42 pm

Maybe someone is interested: on the Transport Modeling Research Group (TMRG)
mailing list a new thread named "Proposal to increase TCP initial CWND"
started one day ago.

Cheers, Hagen

--

From: Bill Davidsen
Date: Thursday, July 15, 2010 - 4:14 pm

I think transit time measured in 1/10th sec would disqualify this as a "normal 
setup."

High bandwidth and high latency don't work well because you get "send until the 
window is full then wait for ack" and poor performance. I saw this with sat feed 
to Wyoming from GE's Research Center in upstate NY in the late 80's or early 
90's. (I think this was NYserNet at that time). I did feeds from NYC area to 
California and Hawaii with SBC in the early to mid 2k years. In every case 
SunOS, Solaris, AIX and Linux all failed to hit anything like reasonable 
transfer speeds without manually tweaking, and I got the advice on increasing 
window size from network engineers at ISPs and backbone providers.

The O.P. may have other issues, and may benefit from doing other things as well, 
but raising window size is a reasonable thing to do on links with RTT in 
hundreds of ms, and it's easy to try without changing config files.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
--

From: Ed W
Date: Wednesday, July 14, 2010 - 11:32 am

Well, I was simplifying a little bit, actually I have a bunch of 

I think I'm misunderstanding something fundamental here:

- Surely the limited congestion window is what throttles me at 
connection initialisation time and this will not be affected by changing 
the params you mention above?  For sure the sliding window will be 
relevant vs my bandwidth delay product once the tcp connection reaches 
steady state, but I'm mostly worried here about performance right at the 
creation of the connection?

- Both you and Alan mention that the bulk of the traffic is "incoming" - 
this implies you think it's relevant?  Obviously I'm missing something 
fundamental here because my understanding is that the congestion window 
shuts us down in both directions (at the start of the connection?)

Thanks for the replies - I will take it over to netdev

Ed W
--

From: Bill Davidsen
Date: Thursday, July 15, 2010 - 8:10 am

Perhaps they will give you an answer you like better.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein

--

From: Henrique de Moraes Holschuh
Date: Thursday, July 15, 2010 - 7:58 pm

Last time I dealt with such stuff (hundreds of VSATs across the whole
country, arriving at a Satellite Base Station), you absolutely had to use
protocol enhancement proxies in the SBS AND in the VSAT clients to get good
performance for typical end-user Internet usage.  This was a few years ago,
but it probably hasn't changed much.  I don't recall what proprietary stuff
was used for the proxy, but...

http://en.wikipedia.org/wiki/Performance_Enhancing_Proxy
http://sourceforge.net/projects/pepsal/

A Google search for pepsal will return a link to a PDF explaining the
design.  Maybe that could be of some help for you?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
--
