This patch allows an application to set the TCP congestion window
for a connection through a socket option. The maximum value that
may be set is specified by a sysctl. When the sysctl is set to
zero, the default value, the socket option is disabled.
The socket option is most useful to set the initial congestion
window for a connection to a larger value than the default in
order to improve latency. This socket option would typically be
used by an "intelligent" application which might have better knowledge
than the kernel as to what an appropriate initial congestion window is.
One use of this might be with an application which maintains
per-client path characteristics. This could allow setting the
congestion window more precisely than could be achieved through the
route command.
A second use of this might be to reduce the number of simultaneous
connections that a client opens to a server, for instance when a web
browser opens multiple connections to a server. With multiple
connections the aggregate congestion window is larger than that of a
single connection (num_conns * cwnd), which can effectively be used to
circumvent slowstart and improve latency. With this socket option, a
single connection with a large initial congestion window could be used,
which retains the latency properties of multiple connections while
reducing the number of connections (load) on the network.
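To make the num_conns * cwnd arithmetic concrete, here is a minimal
sketch in C. It assumes each connection starts with the RFC 3390 initial
window, min(4*MSS, max(2*MSS, 4380 bytes)), and that the aggregate is
simply the per-connection value times the number of connections. The
function names are illustrative only and are not part of this patch.

```c
/* Initial window in bytes per RFC 3390:
 * min(4*MSS, max(2*MSS, 4380 bytes)). */
unsigned long rfc3390_initial_window(unsigned long mss)
{
    unsigned long upper = 4 * mss;
    unsigned long lower = 2 * mss;
    unsigned long iw = lower > 4380 ? lower : 4380;

    return iw < upper ? iw : upper;
}

/* Aggregate initial window, in bytes, of num_conns parallel connections
 * (illustrative helper, assumes every connection uses the same MSS). */
unsigned long aggregate_initial_window(unsigned int num_conns,
                                       unsigned long mss)
{
    return num_conns * rfc3390_initial_window(mss);
}
```

With a 1460-byte MSS this gives 4380 bytes (3 segments) per connection,
so six parallel connections start with 26280 bytes in aggregate, which is
the initial window a single connection would need in order to match them.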
The sysctl to enable and control this feature is
net.ipv4.tcp_user_cwnd_max
The socket option call would be:
setsockopt(fd, IPPROTO_TCP, TCP_CWND, &val, sizeof (val))
where val is the congestion window in units of MSS.
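A hedged sketch of how an application might wrap this call is shown
below. TCP_CWND exists only with this patch applied, and its numeric
value is not visible in the diff excerpt, so the wrapper takes the option
number as a parameter (tcp_cwnd_opt) rather than guessing it. The wrapper
name set_user_cwnd is hypothetical.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Ask the kernel to use cwnd_mss segments as the congestion window on fd.
 * tcp_cwnd_opt is the value of TCP_CWND from the patched <linux/tcp.h>;
 * it is passed in because stock headers do not define it.
 * Returns 0 on success or a negative errno value.
 */
int set_user_cwnd(int fd, int tcp_cwnd_opt, int cwnd_mss)
{
    if (cwnd_mss <= 0)
        return -EINVAL;

    if (setsockopt(fd, IPPROTO_TCP, tcp_cwnd_opt,
                   &cwnd_mss, sizeof(cwnd_mss)) < 0)
        return -errno;

    return 0;
}
```

A caller should be prepared for failure: with net.ipv4.tcp_user_cwnd_max
left at its default of zero the option is disabled, and a kernel without
the patch will not recognise the option at all.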
Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..9e9692f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -105,6 +105,7 @@ enum {
#define TCP_COOKIE_TRANSACTIONS 15 /* TCP Cookie Transactions */
#define TCP_THIN_LINEAR_TIMEOUTS 16 /* Use linear timeouts for thin ...

On Tue, 25 May 2010 22:01:13 -0700 (PDT)

The IETF TCP maintainers already think Linux TCP allows unsafe operation;
this will just allow more possible misuse and prove their argument.

Unless and until this behavior is backed by a wider body of research,
I don't think it should be accepted at this time.
--
From: Stephen Hemminger <shemminger@vyatta.com>

Yes, and two other points I'd like to add.

1) Stop pretending a network path characteristic can be made into an
   application level one, else I'll stop reading your patches.

   You can try to use smoke and mirrors to make your justification by
   saying that an application can circumvent things right now by opening
   up multiple connections. But guess what? If that act overflows a
   network queue, we'll pull the CWND back on all of those connections
   while their CWNDs are still small and therefore way before things get
   out of hand.

   Whereas if you set the initial window high, the CWND is wildly out of
   control before we are even started.

   And even after your patch the "abuse" ability is still there. So since
   your patch doesn't prevent the "abuse", you really don't care about
   CWND abuse. Instead, you simply want to pimp your feature.

2) The very last application I'd want to use something like this is a
   damn web browser. Maybe a program, which is extremely sophisticated,
   like a database or caching manager, that runs privileged and somehow
   has complete and constantly updated knowledge of the network topology
   from end to end.

   And iff, and only iff, we only would let privileged applications make
   the setting.

Right now we only allow this to be done via a route setting, exactly
because:

1) It is a network path characteristic, full stop.

2) Only humans can really know what the exact end to end path
   characteristics are on a per-route basis and, given that, whether it
   is safe to increase the initial CWND as a result.
--
It's really not that simple. In the application with multiple
connections, congestion may only affect some number of connections, so
more of the aggregate window may be preserved. This is an unfairness

Right, this should be fixed in the server, not at the browsers.
Unfortunately, web browsers seem to have lost any self control in
limiting the number of simultaneous connections that can be opened (we
managed to get IE8 to open over 100 of them). So the cat's way out of the
bag. Servers can rein this problem in by only allowing fewer

Thanks to NAT, the concept of a network path or even host specific path
is a weakened concept. On the Internet this may be a path characteristic
per client, which unfortunately has no visibility in the kernel other
than per connection state. When a single IP address may have thousands of
hosts behind it, caching TCP parameters for that

In all but the most trivial networks, I do not believe humans are capable
of making an intelligent decision about this. Don't get me wrong, it's
great that it can be set in the route, but there's nothing at all that
prevents naive abuse (a 2009 study showed that 15% of connections on the
Internet violate icw standards anyway).

We have proposed in the IETF to raise the initial congestion window, but
dynamic mechanisms that algorithmically determine safe values are still
of interest and may be safer, which is what this patch would allow.

Thanks for your comments!
--
From: Tom Herbert <therbert@google.com>

If this is true, then by all accounts your patch allows things to be even
worse.

Because now applications can still open up N connections, but with an
even larger initial CWND, with potentially exponential ramifications on
network congestion.

So yet another reason not to consider this feature seriously.

It's not an application level attribute, it's a network path one. Please
take it seriously because I really mean it.
--
Yes, all of Saudi-Arabia used to be (is?) one IP address...

Caching anything per IP is bogus.

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
In Lebanon I have around 30k users behind a few IP addresses (around 6)
for web traffic. Because backbone bandwidth here costs $1200/Mbit and
links are mostly satellite (RTT 400+ ms), TCP accelerators and caching
proxies are a must. Tproxy doesn't yet work well enough to use the full
set of IPs.

And there are no local Google/YouTube servers, so maybe I'm affected by
something? :-)
--
From: Andi Kleen <andi@firstfloor.org>

And letting the applications choose the CWND is better?!?!

Every single proposal being mentioned in this thread has huge, obvious,
downsides.

Just because there are some cases of people NAT'ing many machines behind
one IP address doesn't mean we kill performance for the rest of the world
(the majority of internet usage btw) by not caching TCP path
characteristics per IP address.

And just because applications open up many sockets to get better TCP
latency and work around per-connection CWND limits DOES NOT mean we let
the application increase the initial CWND so it can abuse this EVEN MORE
and cause EVEN BIGGER problems.

If people have real, sane, ideas about how to attack this problem I am
all ears. But everything proposed here so far is complete and utter crap.
--
No, I actually agree with you on that. Just saying that anything that
relies on per IP caching is bad too.

As I understand it, the idea was that the application knows what flows
belong to a single peer and wants to have a single cwnd for all of those.
Perhaps there would be a way to generalize that to tell it to the kernel.

e.g. have a "peer id" that is known by applications, and the kernel could
manage cwnds shared between connections associated with the same peer id?

Just an idea, I admit I haven't thought very deeply about this. Feel free
to poke holes in it.

-Andi
--
From: Andi Kleen <andi@firstfloor.org>
Yes, a CWND "domain" that can include multiple sockets is
something that might gain some traction.
The "domain" could just simply be the tuple {process,peer-IP}
--
Then all the app does is say "I am in peer id foo", right? Is that really
that much different from making the setsockopt() call for a different
cwnd value? Particularly if, say, the limit were not a global sysctl, but
based on the Name or PID?

rick jones
--
The worst case with peer id would be an app using its own peer id for
each connection. Then each connection would have its own cwnd, just like
today. So the worst case is the same as today.

If it shares a peer id between connections, the real effective cwnd of
all those connections would also never be "worse" (that is, larger) than
it could be on a single connection.

So this effectively limits the cwnds with peer ids, although it also
gives a nice way to reuse an already existing cwnd for a new connection
(this does not make things worse because in theory the app could have
reused the same connection too).

So overall peer ids don't allow cwnds to be enlarged beyond today's.

If the cwnd is fully application controlled, all these limits are not
there and a bittorrent client could just always set it to 1 million.

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
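One way to picture the shared-window argument is the toy user-space model
below. It is only a sketch of the idea discussed in this thread: the
struct and function names are hypothetical, nothing like this exists in
the kernel, and real congestion control (growth, loss response) is left
out. It shows just the bound: however many connections join a domain,
their combined in-flight data cannot exceed the one shared cwnd.

```c
/* Hypothetical model of a cwnd "domain" shared by several connections. */
struct cwnd_domain {
    unsigned int cwnd;      /* shared window, in MSS-sized segments */
    unsigned int in_flight; /* segments outstanding across all members */
};

/* A member connection asks to send `want` segments. Returns how many the
 * shared window permits right now and accounts for them as in flight. */
unsigned int domain_reserve(struct cwnd_domain *d, unsigned int want)
{
    unsigned int room = d->cwnd > d->in_flight ? d->cwnd - d->in_flight : 0;
    unsigned int granted = want < room ? want : room;

    d->in_flight += granted;
    return granted;
}

/* Account for `acked` segments that have left the network. */
void domain_ack(struct cwnd_domain *d, unsigned int acked)
{
    d->in_flight -= acked < d->in_flight ? acked : d->in_flight;
}
```

With a shared cwnd of 10, a first connection reserving 6 segments leaves
only 4 for a second one, which matches the claim that the aggregate never
exceeds what a single connection could have had.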
This discussion - as once a month - is about fairness. But if we define a
domain as a tuple of {process,peer-IP} the fairness is applied only for the
last link before "peer-IP".
But fairness applies to *all* links in between! For example: consider a
dumbbell scenario:
  +------+                                    +------+
  |      |                                    |      |
  |  H1  |                                    |  H3  |
  |      |                                    |      |
  +------+                                    +------+
     10MB \   +------+            +------+   / 10MB
           \  |      |   1MB/s    |      |  /
            > |  R1  |------------|  R2  | <
           /  |      |            |      |  \
     10MB /   +------+            +------+   \ 10MB
  +------+                                    +------+
  |      |                                    |      |
  |  H2  |                                    |  H4  |
  |      |                                    |      |
  +------+                                    +------+
How can a domain defined as {process,peer-IP} be fair to the 1MB/s
bottleneck link? It is not fair! And it is also not fair to open n
simultaneous streams and so on. This problem is discussed in several RFCs.
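As a rough illustration of the n-streams point, the sketch below assumes
idealised per-flow fairness at the bottleneck, i.e. the link capacity is
split evenly among all flows crossing it. That is an assumption (real TCP
only approximates it), 1MB/s is taken to mean 1000000 bytes/s, and
host_share is a hypothetical helper name.

```c
/* Bytes per second a host obtains on a bottleneck link when capacity is
 * divided evenly per flow (idealised assumption, integer arithmetic). */
unsigned long host_share(unsigned long capacity,
                         unsigned int host_flows,
                         unsigned int total_flows)
{
    if (total_flows == 0)
        return 0;

    return capacity * host_flows / total_flows;
}
```

If H1 opens six streams across the R1-R2 link and H2 opens one, H1 ends
up with roughly 857 KB/s and H2 with roughly 143 KB/s, although each host
would get 500 KB/s with one stream apiece.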
.02
Best regards, Hagen
--
Hagen Paul Pfeifer <hagen@jauu.net> || http://jauu.net/
Telephone: +49 174 5455209 || Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
--
From: Hagen Paul Pfeifer <hagen@jauu.net>

You're asking about a network level issue in terms of what can be done on
a local end-node.

All an end-node can do is abide by congestion control rules and respond
to packet drops, as has been going on for decades.

People have basically (especially in Europe) given up on crazy crap like
RSVP and other forms of bandwidth limiting and reservation. They just
oversubscribe their links, and increase their capacity as traffic
increases dictate. It just isn't all that manageable to put people's
traffic into classes and control what they do on a large scale.

I'm also skeptical about those who say the fight belongs squarely at the
end nodes. If you want to control the network traffic at the meeting
point of your dumbbell, you'll need a machine there doing RED or traffic
limiting. End-host schemes simply aren't going to work because I can just
add more end-hosts to reintroduce the problem.

The dumbbell situation is independent of the end-node issues, that's all
I'm really saying.
--
No, I *write* about network level issues, this is the important item in
my mind. It is about network stability and network fairness. The lion's
share of TCP algorithms are drafted to guarantee _network fairness and
network stability_.

And by the way, the IETF (and our) paradigm is still to shift
functionality to end hosts, not into the network core. "The Rise of the
stupid network" [1] is still a paradigm that is superior to the
alternative where vendors put their proprietary algorithms into the
network and change the behavior in a

Right, and this will be reality for the next decades (at least for TCP;

I am not happy with this statement. This differs from the previous
paragraph where you complain about intelligent network components.

Davem, to this day the routers do exactly this: they do RED/WRED,
whatever, and signal to the producer to reduce its bandwidth. And this is
the most important aspect in this email: core network components rely on
end hosts to behave in a fair manner. Disable Slow Start/Congestion
Avoidance and the network will instantly collapse (mmh, net-next? ;-)

The mechanism as proposed in the patch is not fair. There are a lot of
publications available that analyse the impact of CWND in great detail as
well as

Davem, I know that you are a good guy and worry about fairness aspects
really well. I wrote this email to popularize fairness and network
stability aspects to the broad audience.

Hagen
--
Censorship is the living confession of the mighty that they can only
kick stupefied slaves, but cannot govern free peoples.
- Johann Nepomuk Nestroy
--
From: Hagen Paul Pfeifer <hagen@jauu.net>

Superior or not, it's simply never going to happen.

We are far beyond being able to get to where we were before NAT'ing and
shaping devices started to get inserted everywhere on the network.

And I also don't see any of this stuff as fundamentally proprietary.

People want deep packet inspection, people want to control their users'
traffic. And people, most importantly, are willing to pay for this.

Therefore, these elements will always be in the network. Better to
co-exist with them and use them to our advantage instead of fantasizing
about a utopia where they don't exist.
--
We will see! If no real interaction between peers is required,
ISPs/carriers/Internet exchanges will start to put their proprietary
components into the network. Because they have nifty features, the
product development phase is shortened (no boring standardization
necessary) and so on. This is no

Sure, we have no alternative.

HGN
--
Hagen Paul Pfeifer <hagen@jauu.net> || http://jauu.net/
Telephone: +49 174 5455209 || Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
--
The mechanism proposed in the patch is merely an API change; misuse,
abuse, or unfairness are inferences of how it might be used. Proper
safeguards should be applied to prevent misuse, but I don't see that it
should be any more insidious than 350 other mechanisms in the system that
could be used to screw things up.

Yes, there has been a lot of talk about CWND, but the standard has not
changed since 2002. In the meantime, browsers have increased the number
of parallel connections they open to a destination, and servers hide
behind multiple domains-- the end result of this is that browsers use
aggregate initial congestion windows much larger than the standard, which
sidesteps slowstart and is a source of unfairness. This is contrary to
RFC 3390:

"When web browsers open simultaneous TCP connections to the same
destination, they are working against TCP's congestion control
mechanisms"

I have yet to find any paper on CWND that analyzed the effect of this
phenomenon on the Internet, which is quite unfortunate. In our own full
scale experiments
(http://code.google.com/speed/articles/tcp_initcwnd_paper.pdf), we
analyzed the effects of using larger initial congestion windows on the
Internet, which might be the closest thing to such an analysis. I know
the LEDBAT WG of the IETF is trying to come up with new recommendations
for the number of connections a browser can open; this is good, but I
hope it's not after the fact.

It would be better, from almost any perspective, to rein in the number of
connections servers are allowing clients to open. However, this isn't
going to happen if it means increased latency for end users; there is no
competitive rationale for servers to do that. That's where a primary
motivation of this patch becomes evident. Instead of a server allowing 6
connections from a client, for instance, it could allow just one
connection but with an initial congestion window equal to the aggregate
of the 6 connections. This reduces connections and does not change the
...
I thought the point was to avoid cwnd inflation by multiple connections?
Now you're saying you actually want larger cwnds?

If you simply want larger CWNDs the easiest is to bump up the define in
your local build. But that cannot be done by default obviously.

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
Right, the problem applies to other protocols as well. P2P protocols
often behave unfairly. This problem is known, but there is currently no
IETF effort to address it. The problem is not that simple and it is

I know your paper and, if I remember correctly, I was a little bit
sceptical about the efforts to analyze the fairness behavior in depth.
It takes one day to validate the fairness issues: take NS3 (with NSC, so
you can use the Linux network stack with your patch), set up a dumbbell
topology and analyse the behavior. I will read the paper one more time.

I would have no problem with your patch if you applied this patch on top
of it: ;-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0ca9832..73f9d46 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2371,7 +2371,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;
case TCP_CWND:
- if (sysctl_tcp_user_cwnd_max <= 0)
+ if (sysctl_tcp_user_cwnd_max <= 0 || !capable(CAP_NET_ADMIN))
err = -EPERM;
else if (val > 0 && sk->sk_state == TCP_ESTABLISHED &&
icsk->icsk_ca_state == TCP_CA_Open) {
HGN
--
If process is in there, this wouldn't work for a multi-process server?

Perhaps having it associated with a FD so that it could be passed around
with unix sockets if needed (just would need to make sure the AF_UNIX gc
can handle such cycles):

    peer_id = open_peer_id();    /* peer id is like a fd */
    socket = socket( ... );
    set_peer_id(socket, peer_id);
    ...
    close(peer_id);

-andi
--
ak@linux.intel.com -- Speaking for myself only.
--
