Re: [PATCH] tcp: Socket option to set congestion window

Previous thread: [RFC] IFLA_PORT_* iproute2 cmd line by Scott Feldman on Tuesday, May 25, 2010 - 8:19 pm. (4 messages)

Next thread: [PATCH] be2net: increase POST timeout for EEH recovery by Sathya Perla on Wednesday, May 26, 2010 - 12:00 am. (2 messages)
From: Tom Herbert
Date: Tuesday, May 25, 2010 - 10:01 pm

This patch allows an application to set the TCP congestion window
for a connection through a socket option.  The maximum value that
may be set is specified in a sysctl value.  When the sysctl is set to
zero, the default value, the socket option is disabled.

The socket option is most useful to set the initial congestion
window for a connection to a larger value than the default in
order to improve latency.  This socket option would typically be
used by an "intelligent" application which might have better knowledge
than the kernel as to what an appropriate initial congestion window is.

One use of this might be with an application which maintains per
client path characteristics.  This could allow setting the congestion
window more precisely than could be achieved through the
route command.

A second use of this might be to reduce the number of simultaneous
connections that a client might open to the server; for instance
when a web browser opens multiple connections to a server.  With multiple
connections the aggregate congestion window is larger than that of a
single connection (num_conns * cwnd); this can effectively be used to
circumvent slowstart and improve latency.  With this socket option, a
single connection with a large initial congestion window could be used,
which retains the latency properties of multiple connections while
reducing the number of connections (load) on the network.
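As a rough illustration of the num_conns * cwnd arithmetic above, here is a
small sketch. The function names are made up for this example, and the
figures in the note that follows (6 connections, 3 segments each, 1460-byte
MSS) are illustrative assumptions, not measurements from the patch.

```c
/* Aggregate initial window, in segments, of num_conns parallel
 * connections that each start with cwnd_segments (num_conns * cwnd). */
static unsigned long aggregate_cwnd(unsigned long num_conns,
                                    unsigned long cwnd_segments)
{
    return num_conns * cwnd_segments;
}

/* The same quantity expressed in bytes for a given MSS. */
static unsigned long aggregate_cwnd_bytes(unsigned long num_conns,
                                          unsigned long cwnd_segments,
                                          unsigned long mss_bytes)
{
    return aggregate_cwnd(num_conns, cwnd_segments) * mss_bytes;
}
```

With those assumed figures, aggregate_cwnd(6, 3) is 18 segments, which is
what a single connection would need as its initial window to match six
parallel ones.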

The sysctl to enable and control this feature is

  net.ipv4.tcp_user_cwnd_max

The socket option call would be:

  setsockopt(fd, IPPROTO_TCP, TCP_CWND, &val, sizeof (val))

where val is the congestion window in # MSS.
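A minimal sketch of how an application might wrap that call, assuming a
kernel carrying this patch. TCP_CWND does not exist in mainline headers, and
the numeric value the patch assigns is not visible in this excerpt, so the
define below is a placeholder only. The helper name set_tcp_cwnd is also
made up for illustration.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* ASSUMPTION: TCP_CWND comes from the proposed patch, not from mainline.
 * The number below is a placeholder, NOT the value used by the patch;
 * replace it with the definition from the patched <linux/tcp.h>. */
#ifndef TCP_CWND
#define TCP_CWND 0x7ffe
#endif

/* Ask for a congestion window of cwnd_mss segments on socket fd.
 * Returns 0 on success, -1 with errno set on failure. */
static int set_tcp_cwnd(int fd, int cwnd_mss)
{
    int val = cwnd_mss;

    if (cwnd_mss <= 0) {
        errno = EINVAL;   /* reject nonsensical windows before the syscall */
        return -1;
    }
    return setsockopt(fd, IPPROTO_TCP, TCP_CWND, &val, sizeof(val));
}
```

On a kernel without the patch the call is expected to fail, so callers
should treat an error as "feature unavailable" and carry on with the
default window.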


Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..9e9692f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -105,6 +105,7 @@ enum {
 #define TCP_COOKIE_TRANSACTIONS	15	/* TCP Cookie Transactions */
 #define TCP_THIN_LINEAR_TIMEOUTS 16      /* Use linear timeouts for thin ...
From: Stephen Hemminger
Date: Tuesday, May 25, 2010 - 10:08 pm

On Tue, 25 May 2010 22:01:13 -0700 (PDT)

The IETF TCP maintainers already think Linux TCP allows unsafe
operation; this will just allow more possible misuse and prove
their argument.  Until/unless this behavior is approved by
a wider set of research, I don't think it should be accepted at
this time.


-- 
--

From: David Miller
Date: Tuesday, May 25, 2010 - 10:52 pm

From: Stephen Hemminger <shemminger@vyatta.com>

Yes, and two other points I'd like to add.

1) Stop pretending a network path characteristic can be made into
   an application level one, else I'll stop reading your patches.

   You can try to use smoke and mirrors to make your justification by
   saying that an application can circumvent things right now by
   opening up multiple connections.  But guess what?  If that act
   overflows a network queue, we'll pull the CWND back on all of those
   connections while their CWNDs are still small and therefore way
   before things get out of hand.

   Whereas if you set the initial window high, the CWND is wildly out
   of control before we are even started.

   And even after your patch the "abuse" ability is still there.  So
   since your patch doesn't prevent the "abuse", you really don't care
   about CWND abuse.  Instead, you simply want to pimp your feature.

2) The very last application I'd want to use something like this is a
   damn web browser.

   Maybe a program, which is extremely sophisticated, like a database
   or caching manager, that runs privileged and somehow has complete
   and constantly updated knowledge of the network topology from end
   to end.  And iff, and only iff, we only would let privileged
   applications make the setting.

Right now we only allow this via a route setting, exactly because:

1) It is a network path characteristic, full stop.

2) Only humans can really know what the exact end to end path
   characteristics are on a per-route basis, and, given that, whether it
   is safe to increase the initial CWND as a result.
--

From: Tom Herbert
Date: Wednesday, May 26, 2010 - 12:06 am

It's really not that simple.  In the application with multiple
connections, congestion may only affect some number of connections, so
more of the aggregate window may be preserved.  This is an unfairness

Right, this should be fixed in the server not at the browsers.
Unfortunately, web browsers seem to have lost any self control in
limiting the number of simultaneous connections that can be opened (we
managed to get IE8 to open over 100 of them).  So the cat's way out of
the bag.  Servers can rein this problem in by only allowing fewer
Thanks to NAT, the concept of a network path or even host specific
path is a weakened concept.  On the Internet this may be a path
characteristic per client, which unfortunately has no visibility in
the kernel other than per connection state.  When a single IP address
may have thousands of hosts behind it, caching TCP parameters for that

In all but the most trivial networks, I do not believe humans are
capable of making an intelligent decision about this.  Don't get me
wrong, it's great that it can be set in the route, but there's nothing
at all that prevents naive abuse (a 2009 study showed that 15% of
connections on the Internet violate icw standards
anyway).  We have proposed in the IETF to raise the initial congestion
window, but dynamic mechanisms that algorithmically determine safe
values are still of interest and may be safer, which is what this patch
would allow.

Thanks for your comments!
--

From: David Miller
Date: Wednesday, May 26, 2010 - 12:33 am

From: Tom Herbert <therbert@google.com>

If this is true, then by all accounts your patch allows things to be
even worse.

Because now applications can still open up N connections, but with an
even larger initial CWND, with potentially exponential ramifications
on network congestion.

So yet another reason not to consider this feature seriously.  It's
not an application level attribute, it's a network path one.  Please
take it seriously because I really mean it.
--

From: Andi Kleen
Date: Wednesday, May 26, 2010 - 10:33 am

Yes all of Saudi-Arabia used to be (is?) one IP address...

Caching anything per IP is bogus.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Denys Fedorysychenko
Date: Wednesday, May 26, 2010 - 10:41 am

In Lebanon I have around 30k users behind a few IP addresses (around 6) (for
web).
Because backbone here costs $1200/Mbit, and it is mostly satellite (RTT 400+ ms),
TCP accelerators and a caching proxy are a must. Tproxy doesn't yet work well
enough to use the full set of IPs.

And no local google/youtube servers, so maybe I'm affected by something? :-)
--

From: David Miller
Date: Wednesday, May 26, 2010 - 2:08 pm

From: Andi Kleen <andi@firstfloor.org>

And letting the applications choose the CWND is better?!?!

Every single proposal being mentioned in this thread has huge,
obvious, downsides.

Just because there are some cases of people NAT'ing many machines
behind one IP address doesn't mean we kill performance for the rest of
the world (the majority of internet usage btw) by not caching TCP path
characteristics per IP address.

And just because applications open up many sockets to get better TCP
latency and work around per-connection CWND limits DOES NOT mean we
let the application increase the initial CWND so it can abuse this
EVEN MORE and cause EVEN BIGGER problems.

If people have real, sane, ideas about how to attack this problem I am
all ears.  But everything proposed here so far is complete and utter
crap.
--

From: Andi Kleen
Date: Wednesday, May 26, 2010 - 2:27 pm

No I actually agree with you on that. Just saying that
anything that relies on per IP caching is bad too.

As I understand the idea was that the application knows
what flows belong to a single peer and wants to have
a single cwnd for all of those. Perhaps there would
be a way to generalize that to tell it to the kernel.

e.g. have a "peer id"  that is known by applications
and the kernel could manage cwnds shared between connections
associated with the same peer id?

Just an idea, I admit I haven't thought very deeply
about this. Feel free to poke holes into it.

-Andi
--

From: David Miller
Date: Wednesday, May 26, 2010 - 3:10 pm

From: Andi Kleen <andi@firstfloor.org>

Yes, a CWND "domain" that can include multiple sockets is
something that might gain some traction.

The "domain" could just simply be the tuple {process,peer-IP}
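One way to picture such a domain key, purely as a sketch: the struct and
function names are hypothetical, and restricting the peer to an IPv4
address is an assumption made to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical key for a CWND "domain": the tuple {process, peer-IP}. */
struct cwnd_domain_key {
    pid_t    pid;        /* owning process */
    uint32_t peer_ipv4;  /* peer address, IPv4 only in this sketch */
};

/* Two sockets fall in the same domain when both tuple fields match. */
static bool cwnd_domain_equal(const struct cwnd_domain_key *a,
                              const struct cwnd_domain_key *b)
{
    return a->pid == b->pid && a->peer_ipv4 == b->peer_ipv4;
}
```

Keying on the process means sockets held by different processes of one
server never share a domain, even when they talk to the same peer.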
--

From: Rick Jones
Date: Wednesday, May 26, 2010 - 3:29 pm

Then all the app does is say "I'm in peer id foo" right?  Is that really that 
much different from making the setsockopt() call for a different cwnd value? 
Particularly if say the limit were not a global sysctl, but based on the 

Name or PID?

rick jones
--

From: Andi Kleen
Date: Thursday, May 27, 2010 - 12:57 am

The worst case with peer ids would be an app using its own peer id
for each connection. So each connection would have its own cwnd,
just like today. So the worst case is the same as today.

If it shares a peer id between connections, the real effective cwnd
of all those connections would also never be "worse" (that is,
larger) than it could be on a single connection.

So this limits the cwnds effectively with peer ids, although it also
gives a nice way to reuse an already existing cwnd for a new
connection (this does not make things worse because in theory
the app could have reused the same connection too).

So overall peer ids don't allow cwnds to grow beyond what is possible today.
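The bound argued above can be sketched with a toy user-space model. This is
not kernel code; the struct and function names are invented for
illustration, and window growth and loss handling are left out on purpose.

```c
#include <stdbool.h>

/* Toy model: one congestion window shared by every connection that
 * carries the same peer id. */
struct peer_cwnd {
    unsigned int cwnd;       /* shared window, in segments */
    unsigned int in_flight;  /* unacked segments summed over all members */
};

/* Any member connection may send only while the shared window has room. */
static bool peer_try_send(struct peer_cwnd *p)
{
    if (p->in_flight >= p->cwnd)
        return false;
    p->in_flight++;
    return true;
}

/* An ACK on any member connection frees one slot in the shared window. */
static void peer_ack(struct peer_cwnd *p)
{
    if (p->in_flight > 0)
        p->in_flight--;
}
```

However many connections join the peer id, the segments in flight across
all of them never exceed the one shared cwnd, which is the "never worse
than a single connection" property.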

If the cwnd is fully application controlled all these limits
are not there and a bittorrent client could just always set 
it to 1 million.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Hagen Paul Pfeifer
Date: Wednesday, May 26, 2010 - 4:15 pm

This discussion - as once a month - is about fairness. But if we define a
domain as a tuple of {process,peer-IP} the fairness is applied only for the
last link before "peer-IP".

But fairness applies to *all* links in between! For example: consider a
dumbbell scenario:


+------+                                   +------+ 
|      |                                   |      |  
|  H1  |                                   |  H3  | 
|      |                                   |      |  
+------+                                   +------+  
  10MB  \   +------+            +------+  / 10MB
         \  |      |   1MB/s    |      | / 
          > |  R1  |------------|  R2  |<    
         /  |      |            |      | \      
  10MB  /   +------+            +------+  \ 10MB 
+------+                                   +------+  
|      |                                   |      |        
|  H2  |                                   |  H4  | 
|      |                                   |      | 
+------+                                   +------+


How can a domain defined as {process,peer-IP} be fair to the 1MB/s bottleneck link?
It is not fair! And it is also not fair to open n simultaneous streams and so
on. This problem is discussed in several RFCs.
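To put a number on the "n simultaneous streams" point, here is a small
sketch under an idealized assumption: every flow crossing the bottleneck
gets an equal slice of it. Real TCP only approximates this, and the
function name is made up for the example.

```c
/* Idealized per-flow fairness on a shared bottleneck: each flow gets an
 * equal slice, so a host's share grows with the number of flows it opens.
 * link_capacity is in whatever unit the caller uses (e.g. MB/s). */
static double host_share(double link_capacity,
                         unsigned int host_flows,
                         unsigned int other_flows)
{
    unsigned int total = host_flows + other_flows;

    if (total == 0)
        return 0.0;
    return link_capacity * host_flows / total;
}
```

On the 1MB/s link in the diagram, one flow against one competing flow gets
half of the capacity in this model, while three flows against that same
single competitor get three quarters.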

.02


Best regards, Hagen


-- 
Hagen Paul Pfeifer <hagen@jauu.net>  ||  http://jauu.net/
Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22

--

From: David Miller
Date: Wednesday, May 26, 2010 - 8:04 pm

From: Hagen Paul Pfeifer <hagen@jauu.net>

You're asking about a network level issue in terms of what can be done
on a local end-node.

All an end-node can do is abide by congestion control rules and respond
to packet drops, as has been going on for decades.

People have basically (especially in Europe) given up on crazy crap
like RSVP and other forms of bandwidth limiting and reservation.  They
just oversubscribe their links, and increase their capacity as traffic
increases dictate.  It just isn't all that manageable to put people's
traffic into classes and control what they do on a large scale.

I'm also skeptical about those who say the fight belongs squarely at
the end nodes.  If you want to control the network traffic of the
meeting point of your dumbbell, you'll need a machine there doing RED
or traffic limiting.  End-host schemes simply aren't going to work
because I can just add more end-hosts to reintroduce the problem.

The dumbbell situation is independent of the end-node issues, that's
all I'm really saying.
--

From: Hagen Paul Pfeifer
Date: Thursday, May 27, 2010 - 12:08 am

No, I *write* about network level issues, this is the important item in my
mind.  It is about network stability and network fairness. The lion's share of
TCP algorithms are drafted to guarantee _network fairness and network stability_.

And by the way, the IETF (and our) paradigm is still to shift functionality to
end hosts - not into network core. "The Rise of the stupid network" [1] is
still a paradigm that is superior to the alternative where vendors put their
proprietary algorithms into the network and change the behavior in a

Right, and this will be reality for the next decades (at least for TCP;

I am not happy with this statement. This differs from the previous paragraph
where you complain about intelligent network components. Davem, to this
day the routers do exactly this: they do RED/WRED or whatever and signal to the
producer to reduce their bandwidth.

And this is the most important aspect in this email: core network components
rely on end hosts to behave in a fair manner. Disable Slow Start/Congestion
Avoidance and the network will instantly collapse (mmh, net-next? ;-)

The mechanism as proposed in the patch is not fair. There are a lot of
publications available that analyse the impact of CWND in great detail as well as

Davem, I know that you are a good guy and care a great deal about fairness
aspects. I wrote this email to bring fairness and network stability
aspects to a broad audience.

Hagen



-- 
Die Zensur ist das lebendige Gestaendnis der Grossen, dass sie 
nur verdummte Sklaven treten, aber keine freien Voelker regieren koennen.
- Johann Nepomuk Nestroy

--

From: David Miller
Date: Thursday, May 27, 2010 - 12:28 am

From: Hagen Paul Pfeifer <hagen@jauu.net>

Superior or not, it's simply never going to happen.  We are far beyond
being able to get to where we were before NAT'ing and shaping devices
started to get inserted everywhere on the network.

And I also don't see any of this stuff as fundamentally proprietary.

People want deep packet inspection, people want to control their users'
traffic.  And people, most importantly, are willing to pay for this.

Therefore, these elements will always be in the network.

Better to co-exist with them and use them to our advantage instead of
fantasizing about a utopia where they don't exist.
--

From: Hagen Paul Pfeifer
Date: Thursday, May 27, 2010 - 12:46 am

We will see! If no real interaction between peers is required,
ISP/Carrier/InternetExchanges will start to put their proprietary components
into the network. Because they have nifty features, the product development
phase is shortened (no boring standardization necessary) and so on. This is no

Sure, we have no alternative.

HGN


-- 
Hagen Paul Pfeifer <hagen@jauu.net>  ||  http://jauu.net/
Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
--

From: Tom Herbert
Date: Thursday, May 27, 2010 - 9:14 am

The mechanism proposed in the patch is merely an API change; misuse,
abuse, or unfairness are inferences of how it might be used.  Proper
safeguards should be applied to prevent misuse, but I don't see that
it should be any more insidious than 350 other mechanisms in the
system that could be used to screw things up.

Yes, there has been a lot of talk about CWND, but the standard has not
changed since 2002.  In the meantime, browsers have increased the
number of parallel connections they open to a destination, and servers
hide behind multiple domains-- the end result of this is that browsers
use aggregate initial congestion windows much larger than the
standard, which sidesteps slowstart and is a source of unfairness.
This is contrary to RFC 3390:

"When web browsers open simultaneous TCP connections to the same
destination, they are working against TCP's congestion control
mechanisms"
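For reference, RFC 3390 gives the upper bound on a single connection's
initial window as min(4*MSS, max(2*MSS, 4380 bytes)). A minimal sketch of
that formula follows; the function name is made up for the example.

```c
/* Upper bound on the initial window from RFC 3390, in bytes:
 *   min(4*MSS, max(2*MSS, 4380)) */
static unsigned long rfc3390_initial_window(unsigned long mss)
{
    unsigned long lower = 2 * mss > 4380 ? 2 * mss : 4380;
    unsigned long upper = 4 * mss;

    return upper < lower ? upper : lower;
}
```

For a 1460-byte MSS this gives 4380 bytes, i.e. three segments per
connection.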

I have yet to find any paper on CWND that analyzed the effect of this
phenomenon on the Internet, which is quite unfortunate.  In our own full
scale experiments
(http://code.google.com/speed/articles/tcp_initcwnd_paper.pdf), we
analyzed the effects of using larger initial congestion windows on the
Internet which might be the closest thing to such an analysis.  I know
in LEDBAT WG of IETF they are trying to come up with new
recommendations for number of connections a browser can open, this is
good but I hope it's not after the fact.

It would be better, by almost any perspective, to rein in the number
of connections servers are allowing clients to open.  However this
isn't going to happen if this means increased latency for end users;
there is no competitive rationale for servers to do that.  That's
where a primary motivation of this patch becomes evident.  Instead of
a server allowing 6 connections from a client, for instance, it could
allow just one connection but with an initial congestion window equal
to the aggregate of the 6 connections.  This reduces connections and
does not change the ...
From: Andi Kleen
Date: Thursday, May 27, 2010 - 11:56 am

I thought the point was to avoid cwnd inflation by multiple connections?
Now you're saying you actually want larger cwnds? 

If you simply want larger CWNDs the easiest is to bump up the
define in your local build.

But that cannot be done by default obviously.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Hagen Paul Pfeifer
Date: Thursday, May 27, 2010 - 12:19 pm

Right, the problem can be applied to other protocols as well. Often p2p
protocols behave unfairly. This problem is known, but there is currently no
IETF effort to address the problem. The problem is not that simple and it is

I know your paper and if I remember correctly I was a little bit sceptical
about the efforts to analyze the fairness behavior in depth. It takes one day
to validate the fairness issues: take NS3 (with NSC so you can take the Linux
network stack with your patch), set up a dumbbell topology and analyse the
behavior. I will read the paper one more time.

I would have no problem with your patch if you applied this patch on top of it:  ;-)


diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0ca9832..73f9d46 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2371,7 +2371,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		break;
 
 	case TCP_CWND:
-		if (sysctl_tcp_user_cwnd_max <= 0)
+		if (sysctl_tcp_user_cwnd_max <= 0 || !capable(CAP_NET_ADMIN))
 			err = -EPERM;
 		else if (val > 0 && sk->sk_state == TCP_ESTABLISHED &&
 		    icsk->icsk_ca_state == TCP_CA_Open) {

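The effect of the changed condition can be modelled in user space as
follows. This is only a sketch of the first test in the hunk above: the
function name and the boolean standing in for capable(CAP_NET_ADMIN) are
assumptions, and the checks that follow in the kernel are elided in the
quoted diff, so they are not modelled here.

```c
#include <errno.h>
#include <stdbool.h>

/* User-space model of the first test in the TCP_CWND case.
 * Returns -EPERM when the option is refused, and 0 when the kernel
 * would go on to the remaining checks. */
static int tcp_cwnd_gate(int sysctl_tcp_user_cwnd_max, bool has_net_admin)
{
    if (sysctl_tcp_user_cwnd_max <= 0 || !has_net_admin)
        return -EPERM;
    return 0;
}
```

With this change an unprivileged caller is refused even when the sysctl is
non-zero, which matches the earlier suggestion in the thread that only
privileged applications be allowed to make the setting.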

HGN
--

From: Andi Kleen
Date: Thursday, May 27, 2010 - 1:00 am

If process is in there this wouldn't work for a multi process
server?

Perhaps having it associated with a FD so that it could
be passed around with unix sockets if needed (just would
need to make sure the AF_UNIX gc can handle such cycles)

peer_id = open_peer_id();   
/* peer id is like a fd */

socket = socket( ... ); 
set_peer_id(socket, peer_id); 


...

close(peer_id);

-andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--
