Linux: Window Scaling on the Internet

Submitted by Jeremy
on June 14, 2006 - 10:15am

Mark Lord reported a problem with the upcoming 2.6.17 kernel being unable to access www.everymac.com. He detailed a series of tests that isolated the problem down to a recent changeset titled, "set default max buffers from memory pool size", which goes on to explain, "this patch sets the maximum TCP buffer sizes (available to automatic buffer tuning, not to setsockopt) based on the TCP memory pool size. The maximum sndbuf and rcvbuf each will be up to 4 MB, but no more than 1/128 of the memory pressure threshold." John Heffner explained that somewhere between Mark's server and the webserver was a broken box that needs to be fixed, "in the meantime, disabling window scaling will work around the problem for you."

Linux creator Linus Torvalds suggested, "well, arguably, we shouldn't necessarily have defaults that use window scaling, or we should have ways to recognize automatically when it doesn't work (which may not be possible). It's not like there aren't broken boxes out there, and it might be better to make the default buffer sizes just be low enough that window scaling simply isn't an issue." David Miller responded by pointing out that window scaling has been enabled by default for a long time, but until now has only used a shift value of 1 or 2, "it is impossible to fill a cross-continental connection without using window scaling. A 64K window is all you get without scaling. Big buffers are absolutely necessary, and as John Heffner showed this need is growing exponentially and not slowing down. 6 megabit downlink is pretty commonplace in the US, and the standard is much higher in well connected countries such as South Korea." He also explained that it's not possible to detect broken boxes and dynamically turn off window scaling after it's been negotiated, "it's immutably active for the entire connection once enabled. Window scaling has been standardized and around for 14 years, RFC1323 was published in May of 1992. How much longer can we wait for it to be deployed properly? :-)"


From: Mark Lord [email blocked]
To: Linux Kernel [email blocked]
Subject: 2.6.17: networking bug??
Date:	Tue, 13 Jun 2006 10:08:51 -0400

Not bloody likely, I suppose.

But with 2.6.17-rc6, I am unable to talk to the webserver at www.everymac.com
and with 2.6.16.18 (configured identically), this works just fine.

This is with a very simple text access: {"telnet www.everymac.com 80", "GET /", "", ""}

Does that site work for anyone else here running 2.6.17-rc6 ??

I've tried it on three different machines (two Pentium-M boxes, and an AMD64 box/kernel),
all with the same results.  NFG with rc6, fine with earlier kernels.  Yes, they are all
behind the same Linux (2.6.16.xx) firewall, but that doesn't seem to bother anything.
Again, just switching to the older 2.6.16.xx kernels (or earlier) works fine.

I'm going insane!  Help!

Kernel .config from one of the machines is attached.


From: Mark Lord [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 10:26:17 -0400

Here are packet traces from the working (2.6.16) and non-working (2.6.17) kernels.
The differences I see are widely varying "window sizes".
What would cause this?

Here's a partial trace of the working connection 2.6.16.18:

IP silvy.localnet.32776 > zippy.localnet.domain: 50718+ A? www.everymac.com. (34)
IP zippy.localnet.domain > silvy.localnet.32776: 50718 1/5/5 A 216-145-246-23.rev.dls.net (234)
IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: S 2933486277:2933486277(0) win 5840 <mss 1460,sackOK,timestamp 730285 0,nop,wscale 2>
IP 216-145-246-23.rev.dls.net.www > silvy.localnet.56224: S 2545625510:2545625510(0) ack 2933486278 win 65535 <mss 1452,nop,wscale 1,nop,nop,timestamp 134760199 730285>
IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: . ack 1 win 1460 <nop,nop,timestamp 730448 134760199>
IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 1460 <nop,nop,timestamp 730448 134760199>
IP 216-145-246-23.rev.dls.net.www > silvy.localnet.56224: P 1:206(205) ack 607 win 32798 <nop,nop,timestamp 134760217 730448>
IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: . ack 206 win 1728 <nop,nop,timestamp 730626 134760217>
IP silvy.localnet.32776 > zippy.localnet.domain: 24229+ A? www.everymac.com. (34)
IP zippy.localnet.domain > silvy.localnet.32776: 24229 1/5/5 A 216-145-246-23.rev.dls.net (234)
IP silvy.localnet.56225 > 216-145-246-23.rev.dls.net.www: S 2943511062:2943511062(0) win 5840 <mss 1460,sackOK,timestamp 730932 0,nop,wscale 2>
IP 216-145-246-23.rev.dls.net.www > silvy.localnet.56225: S 3806049331:3806049331(0) ack 2943511063 win 65535 <mss 1452,nop,wscale 1,nop,nop,timestamp 134760264 730932>
IP silvy.localnet.56225 > 216-145-246-23.rev.dls.net.www: . ack 1 win 1460 <nop,nop,timestamp 731095 134760264>
IP silvy.localnet.56225 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 1460 <nop,nop,timestamp 731095 134760264>
IP 216-145-246-23.rev.dls.net.www > silvy.localnet.56225: P 1:206(205) ack 607 win 32798 <nop,nop,timestamp 134760281 731095>
IP silvy.localnet.56225 > 216-145-246-23.rev.dls.net.www: . ack 206 win 1728 <nop,nop,timestamp 731274 134760281>
IP silvy.localnet.32776 > zippy.localnet.domain: 55754+ A? adserver.kylemedia.com. (40)
IP zippy.localnet.domain > silvy.localnet.32776: 55754 1/5/5 A 216-145-246-23.rev.dls.net (249)
IP silvy.localnet.56226 > 216-145-246-23.rev.dls.net.www: S 2940109661:2940109661(0) win 5840 <mss 1460,sackOK,timestamp 731360 0,nop,wscale 2>
IP 216-145-246-23.rev.dls.net.www > silvy.localnet.56226: S 388231707:388231707(0) ack 2940109662 win 65535 <mss 1452,nop,wscale 1,nop,nop,timestamp 134760306 731360>

And again, from the non-working connection 2.6.17-rc6-git2:

IP silvy.localnet.32770 > zippy.localnet.domain: 44986+ A? www.everymac.com. (34)
IP zippy.localnet.domain > silvy.localnet.32770: 44986 1/5/5 A 216-145-246-23.rev.dls.net (234)
IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: S 3000518105:3000518105(0) win 5840 <mss 1460,sackOK,timestamp 4294759165 0,nop,wscale 6>
IP 216-145-246-23.rev.dls.net.www > silvy.localnet.33472: S 3368494549:3368494549(0) ack 3000518106 win 65535 <mss 1452,nop,wscale 1,nop,nop,timestamp 134771817 4294759165>
IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: . ack 1 win 92 <nop,nop,timestamp 4294759337 134771817>
IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 92 <nop,nop,timestamp 4294759337 134771817>
IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 92 <nop,nop,timestamp 4294760162 134771817>
IP 216-145-246-23.rev.dls.net.www > silvy.localnet.33472: . ack 607 win 32798 <nop,nop,timestamp 134771918 4294760162>
IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: F 607:607(0) ack 1 win 92 <nop,nop,timestamp 4294770176 134771817>
IP 216-145-246-23.rev.dls.net.www > silvy.localnet.33472: . ack 608 win 32798 <nop,nop,timestamp 134772918 4294770176>
IP 216-145-246-23.rev.dls.net.www > silvy.localnet.33472: F 206:206(0) ack 608 win 32798 <nop,nop,timestamp 134772918 4294770176>
IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: R 3000518713:3000518713(0) win 0

The client machine is "silvy", my firewall/dns box is "zippy",
and 216-145-246-23 is www.everymac.com.

The differences begin really early in these traces, with 2.6.16.18 using
a win size of 1460, and 2.6.17-rc6 using a win size of 92 ???

From: Mark Lord [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 11:00:12 -0400

Mark Lord wrote:
..
> The differences I see are widely varying "window sizes".
> What would cause this?

This is from (working) 2.6.16.18:

> IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: . ack 1 win 1460 <nop,nop,timestamp 730448 134760199>
> IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 1460 <nop,nop,timestamp 730448 134760199>
> IP 216-145-246-23.rev.dls.net.www > silvy.localnet.56224: P 1:206(205) ack 607 win 32798 <nop,nop,timestamp 134760217 730448>
> IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: . ack 206 win 1728 <nop,nop,timestamp 730626 134760217>

This is from (failing) 2.6.17-rc6-git2:

> IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: . ack 1 win 92 <nop,nop,timestamp 4294759337 134771817>
> IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 92 <nop,nop,timestamp 4294759337 134771817>
> IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 92 <nop,nop,timestamp 4294760162 134771817>
> IP 216-145-246-23.rev.dls.net.www > silvy.localnet.33472: . ack 607 win 32798 <nop,nop,timestamp 134771918 4294760162>

Both kernels default to /proc/sys/net/ipv4/tcp_window_scaling == 1,
and 2.6.16.18 works regardless of whether I turn it off/on again.

But 2.6.17-rc6-git2 fails to work with the webserver at www.everymac.com
when /proc/sys/net/ipv4/tcp_window_scaling == 1. Setting this to 0 "fixes" the problem.

BUG.

From: Mark Lord [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 11:28:51 -0400

Mmm. I notice that 2.6.17 has a new sysctl related to this stuff:

   /proc/sys/net/ipv4/tcp_workaround_signed_windows

It makes no difference whatsoever for me here when varied
while /proc/sys/net/ipv4/tcp_window_scaling==1.

The site www.everymac.com is still not browseable until
setting /proc/sys/net/ipv4/tcp_window_scaling===0.

There's one other difference I see in the tcpdump traces.
The first packets from each trace below show different
values for "wscale". The old (working) kernels use "wscale 2",
whereas 2.6.17 uses "wscale 6". In both cases, the value
seen in /proc/sys/net/ipv4/tcp_adv_win_scale is 2.

This is from (working) 2.6.16.18:
>
> IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: S 2933486277:2933486277(0) win 5840 <mss 1460,sackOK,timestamp 730285 0,nop,wscale 2>
> IP 216-145-246-23.rev.dls.net.www > silvy.localnet.56224: S 2545625510:2545625510(0) ack 2933486278 win 65535 <mss 1452,nop,wscale 1,nop,nop,timestamp 134760199 730285>
> IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: . ack 1 win 1460 <nop,nop,timestamp 730448 134760199>
> IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 1460 <nop,nop,timestamp 730448 134760199>
> IP 216-145-246-23.rev.dls.net.www > silvy.localnet.56224: P 1:206(205) ack 607 win 32798 <nop,nop,timestamp 134760217 730448>
> IP silvy.localnet.56224 > 216-145-246-23.rev.dls.net.www: . ack 206 win 1728 <nop,nop,timestamp 730626 134760217>

This is from (failing) 2.6.17-rc6-git2:
>
> IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: S 3000518105:3000518105(0) win 5840 <mss 1460,sackOK,timestamp 4294759165 0,nop,wscale 6>
> IP 216-145-246-23.rev.dls.net.www > silvy.localnet.33472: S 3368494549:3368494549(0) ack 3000518106 win 65535 <mss 1452,nop,wscale 1,nop,nop,timestamp 134771817 4294759165>
> IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: . ack 1 win 92 <nop,nop,timestamp 4294759337 134771817>
> IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 92 <nop,nop,timestamp 4294759337 134771817>
> IP silvy.localnet.33472 > 216-145-246-23.rev.dls.net.www: P 1:607(606) ack 1 win 92 <nop,nop,timestamp 4294760162 134771817>
> IP 216-145-246-23.rev.dls.net.www > silvy.localnet.33472: . ack 607 win 32798 <nop,nop,timestamp 134771918 4294760162>

Something is broken somewhere.
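Editorial note: the "win" values in the two traces are easier to compare once each is multiplied out by the "wscale" the client announced in its SYN. The sketch below is only an illustration built from lines copied out of the traces above; the helper name `advertised_bytes` is made up for this note. It relies on the RFC 1323 rule that the window in a SYN is never scaled, while later segments carry the window right-shifted by the sender's announced shift count.

```python
import re

# SYN and first ACK sent by the client ("silvy") in each trace above.
TRACES = {
    "2.6.16.18": (
        "S 2933486277:2933486277(0) win 5840 <mss 1460,sackOK,timestamp 730285 0,nop,wscale 2>",
        ". ack 1 win 1460 <nop,nop,timestamp 730448 134760199>",
    ),
    "2.6.17-rc6-git2": (
        "S 3000518105:3000518105(0) win 5840 <mss 1460,sackOK,timestamp 4294759165 0,nop,wscale 6>",
        ". ack 1 win 92 <nop,nop,timestamp 4294759337 134771817>",
    ),
}


def advertised_bytes(syn_line, ack_line):
    """Return (raw window field, shift, window in bytes) for a post-handshake segment."""
    shift = int(re.search(r"wscale (\d+)", syn_line).group(1))
    raw = int(re.search(r"win (\d+)", ack_line).group(1))
    return raw, shift, raw << shift


for kernel, (syn, ack) in TRACES.items():
    raw, shift, scaled = advertised_bytes(syn, ack)
    print(f"{kernel}: win {raw} with wscale {shift} means {scaled} bytes")
```

Read this way, both kernels are offering nearly the same receive window (5840 versus 5888 bytes); only the encoding in the 16-bit field differs.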

From: Mark Lord [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 12:58:05 -0400

..
> The site www.everymac.com is still not browseable until
> setting /proc/sys/net/ipv4/tcp_window_scaling===0.
>
> There's one other difference I see in the tcpdump traces.
> The first packets from each trace below show different
> values for "wscale". The old (working) kernels use "wscale 2",
> whereas 2.6.17 uses "wscale 6". In both cases, the value
> seen in /proc/sys/net/ipv4/tcp_adv_win_scale is 2.

Okay. More progress here. The calculation of the "wscale" values
is based on the "tcp_rmem" sysctl numbers.

The defaults for these *differ* between 2.6.16.18 and 2.6.17-rc*.

2.6.16: 4096 87380 174760
2.6.17: 4096 87380 2097152

If I change the tcp_rmem setting on 2.6.17 to match the old value,
then the website www.everymac.com becomes accessible again:

   echo 4096 87380 174760 > /proc/sys/net/ipv4/tcp_rmem

Looking at diffs between 2.6.16 and 2.6.17, I see a big rework
of the tcp_rmem code in linux/net/ipv4/tcp.c

Looks like something got broken there, or possibly the wscale
calculations have a bug that is only triggered by the new rmem values ??
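Editorial note: Mark's observation that "wscale" follows the third tcp_rmem number can be checked with a small model. The sketch below is a simplification, not the kernel's code: it assumes the shift is the smallest one that lets the maximum receive buffer fit in the 16-bit window field, capped at the RFC 1323 limit of 14, and the function names are invented for this note. The real kernel may take other limits into account as well.

```python
MAX_WINDOW_FIELD = 65535  # largest value the 16-bit TCP window field can hold
MAX_SHIFT = 14            # largest shift count allowed by RFC 1323


def parse_tcp_rmem(text):
    """Split a tcp_rmem line into (min, default, max) byte counts."""
    lo, default, hi = (int(part) for part in text.split())
    return lo, default, hi


def pick_wscale(max_rcvbuf):
    """Smallest shift that makes max_rcvbuf representable in 16 bits (assumed model)."""
    shift = 0
    space = max_rcvbuf
    while space > MAX_WINDOW_FIELD and shift < MAX_SHIFT:
        space >>= 1
        shift += 1
    return shift


for kernel, line in (("2.6.16", "4096 87380 174760"), ("2.6.17", "4096 87380 2097152")):
    _, _, rmem_max = parse_tcp_rmem(line)
    print(f"{kernel}: tcp_rmem max {rmem_max} -> wscale {pick_wscale(rmem_max)}")
```

Under that assumption the two defaults give shifts of 2 and 6, which agrees with the "wscale 2" and "wscale 6" options seen in the traces.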

From: Mark Lord [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 13:22:35 -0400

Mark Lord wrote:
> ..
>> The site www.everymac.com is still not browseable until
>> setting /proc/sys/net/ipv4/tcp_window_scaling===0.
>>
>> There's one other difference I see in the tcpdump traces.
>> The first packets from each trace below show different
>> values for "wscale". The old (working) kernels use "wscale 2",
>> whereas 2.6.17 uses "wscale 6". In both cases, the value
>> seen in /proc/sys/net/ipv4/tcp_adv_win_scale is 2.
>
> Okay. More progress here. The calculation of the "wscale" values
> is based on the "tcp_rmem" sysctl numbers.
>
> The defaults for these *differ* between 2.6.16.18 and 2.6.17-rc*.
>
> 2.6.16: 4096 87380 174760
> 2.6.17: 4096 87380 2097152
>
> If I change the tcp_rmem setting on 2.6.17 to match the old value,
> then the website www.everymac.com becomes accessible again:
>
> echo 4096 87380 174760 > /proc/sys/net/ipv4/tcp_rmem
>
> Looking at diffs between 2.6.16 and 2.6.17, I see a big rework
> of the tcp_rmem code in linux/net/ipv4/tcp.c
>
> Looks like something got broken there, or possibly the wscale
> calculations have a bug that is only triggered by the new rmem values ??
>

Okay, here's the blob that broke it.

> [TCP]: Set default max buffers from memory pool size
> author John Heffner [email blocked]
> Sat, 25 Mar 2006 09:34:07 +0000 (01:34 -0800)
> committer David S. Miller [email blocked]
> Sat, 25 Mar 2006 09:34:07 +0000 (01:34 -0800)
> commit 7b4f4b5ebceab67ce440a61081a69f0265e17c2a
> tree ac02c685ce23f2440fecbebaa5b55cd47947c03e tree
> parent 2babf9daae4a3561f3264638a22ac7d0b14a6f52 commit | commitdiff
> [TCP]: Set default max buffers from memory pool size
>
> This patch sets the maximum TCP buffer sizes (available to automatic
> buffer tuning, not to setsockopt) based on the TCP memory pool size.
> The maximum sndbuf and rcvbuf each will be up to 4 MB, but no more
> than 1/128 of the memory pressure threshold.
>
> Signed-off-by: John Heffner [email blocked]
> Signed-off-by: David S. Miller [email blocked]

John / David: Any ideas on what's gone awry here?

From: John Heffner [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 13:39:42 -0400

Mark Lord wrote:
> John / David: Any ideas on what's gone awry here?
>
>

Yes, you have some sort of a broken middlebox in your path (firewall,
transparent proxy, or similar) that doesn't correctly handle window
scaling. Check out this thread:
<http://marc.theaimsgroup.com/?l=linux-netdev&m=114478312100641&w=2>.

The best thing you can do is try to find this broken box and inform its
owner that it needs to be fixed. (If you can find out what it is, I'd
be interested to know.) In the meantime, disabling window scaling will
work around the problem for you.

  -John

From: Linus Torvalds [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 10:50:51 -0700 (PDT)

On Tue, 13 Jun 2006, John Heffner wrote:
>
> The best thing you can do is try to find this broken box and inform its owner
> that it needs to be fixed. (If you can find out what it is, I'd be interested
> to know.) In the meantime, disabling window scaling will work around the
> problem for you.

Well, arguably, we shouldn't necessarily have defaults that use window
scaling, or we should have ways to recognize automatically when it
doesn't work (which may not be possible).

It's not like there aren't broken boxes out there, and it might be better
to make the default buffer sizes just be low enough that window scaling
simply isn't an issue.

I suspect that the people who really want/need window scaling know about
it, and could be assumed to know enough to raise their limits, no?

		Linus

From: Mark Lord [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 14:26:12 -0400

Linus Torvalds wrote:
>
> On Tue, 13 Jun 2006, John Heffner wrote:
>> The best thing you can do is try to find this broken box and inform its owner
>> that it needs to be fixed. (If you can find out what it is, I'd be interested
>> to know.) In the meantime, disabling window scaling will work around the
>> problem for you.
>
> Well, arguably, we shouldn't necessarily have defaults that use window
> scaling, or we should have ways to recognize automatically when it
> doesn't work (which may not be possible).
>
> It's not like there aren't broken boxes out there, and it might be better
> to make the default buffer sizes just be low enough that window scaling
> simply isn't an issue.
>
> I suspect that the people who really want/need window scaling know about
> it, and could be assumed to know enough to raise their limits, no?

Agreed. It's taken me over a month here to realize that the particular
webserver in question (www.everymac.com) wasn't "dead", but merely being
blocked by my 2.6.17 kernel. All was fine with 2.6.16, as I discovered
today.

I wonder how many other "dead sites" there are out there,
that will be shut off from people when they "upgrade" to 2.6.17 ?

I'm a kernel hacker. Most users of 2.6.17 will not be.
The default should be something that works "by default".

Cheers

From: Mark Lord [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 15:08:59 -0400

Mark Lord wrote:
> Linus Torvalds wrote:
>
>> It's not like there aren't broken boxes out there, and it might be
>> better to make the default buffer sizes just be low enough that window
>> scaling simply isn't an issue.
>>
>> I suspect that the people who really want/need window scaling know
>> about it, and could be assumed to know enough to raise their limits, no?
>
> Agreed. It's taken me over a month here to realize that the particular
> webserver in question (www.everymac.com) wasn't "dead", but merely being
> blocked by my 2.6.17 kernel. All was fine with 2.6.16, as I discovered
> today.
>
> I wonder how many other "dead sites" there are out there,
> that will be shut off from people when they "upgrade" to 2.6.17 ?
>
> I'm a kernel hacker. Most users of 2.6.17 will not be.
> The default should be something that works "by default".

Further to this, the current behaviour is badly unpredictable.

A machine could be working perfectly, not (noticeably) affected
by this bug. And then the user adds another stick of RAM to it.

Poof.. many sites from the internet stop responding.
Obviously the RAM upgrade broke things.. must be bad RAM, right?

Err.. no, the networking stack simply decided to become incompatible
with certain sites, as a result of the user adding more RAM to their
machine.

BbD.
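Editorial note: the link between RAM and buffer limits that Mark describes follows from the patch description quoted earlier ("up to 4 MB, but no more than 1/128 of the memory pressure threshold"). The sketch below only restates that sentence as arithmetic. The function name and the example thresholds are hypothetical; the thread does not say what the pressure threshold was on Mark's machines, and the kernel may apply further bounds not modelled here.

```python
FOUR_MB = 4 * 1024 * 1024


def max_auto_buffer(pressure_threshold_bytes):
    """Upper limit for auto-tuned TCP buffers, per the patch description (assumed model)."""
    return min(FOUR_MB, pressure_threshold_bytes // 128)


# Hypothetical thresholds, chosen only to show how the limit moves with memory.
for threshold_mib in (128, 256, 512, 1024):
    limit = max_auto_buffer(threshold_mib * 1024 * 1024)
    print(f"pressure threshold {threshold_mib} MiB -> max buffer {limit} bytes")
```

A 256 MiB threshold would produce exactly the 2097152 bytes Mark saw in tcp_rmem, though that is a consistency check rather than a statement about his system. Because the window scale is derived from this maximum, a change in memory size can change the scale the kernel offers.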

From: David Miller [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 14:26:03 -0700 (PDT)

From: Mark Lord [email blocked]
Date: Tue, 13 Jun 2006 15:08:59 -0400

> Err.. no, the networking stack simply decided to become incompatible
> with certain sites, as a result of the user adding more RAM to their
> machine.

Let's discuss some facts.

First, you are getting window scaling by default with the older
kernel too. It's just a smaller window scale, using a shift
value of say 1 or 2.

What these broken middle boxes do is ignore the window scale
entirely.

So they don't apply a window scale to the advertised windows in each
packet. Therefore, they think a smaller amount of window space is
being advertised than really is. So they will silently drop packets
they think is outside of this bogus window they've calculated.

Now, when the window scale is smaller, the connection can still limp
along, albeit slowly, making forward progress even in the face of such
broken devices because half or a quarter of the window is still
available. It will retransmit a lot, and the congestion window won't
grow at all.

When the window scale is larger, this middle box bug makes it such
that not even one packet can fit into the miscalculated window and
things wedge. The box thinks that your window is "94" instead of
"94 << WINDOW_SCALE".

I think OpenBSD's claim (they did have the bug and probably still do
for all that I know) was that they wanted to make their firewalling
"stateless". This is a bogus argument because by definition you
cannot interpret the TCP window without having seen the initial
connection startup where the parameters are negotiated, and in
particular the window scale which will be used.

And you want to say we should try to work around systems designed
by people who think this is ok? :-)

It is impossible to fill a cross-continental connection without using
window scaling. A 64K window is all you get without scaling. Big
buffers are absolutely necessary, and as John Heffner showed this need
is growing exponentially and not slowing down. 6 megabit downlink is
pretty commonplace in the US, and the standard is much higher in well
connected countries such as South Korea.

Also, as John Heffner mentioned, even if we could detect the broken
boxes you can't just "turn off window scaling" after it's been
negotiated. It's immutably active for the entire connection once
enabled.

Window scaling has been standardized and around for 14 years, RFC1323
was published in May of 1992. How much longer can we wait for it to
be deployed properly? :-)

So the broken boxes, which to be honest are few and far between these
days, need to go, they really do.
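Editorial note: David's point about the 64K ceiling follows from the bandwidth-delay product, since a sender can have at most one window of data in flight per round trip. The numbers below are a worked illustration only: the 100 ms round-trip time is an assumed figure for a long-haul path, not something stated in the thread, and the function names are invented for this note.

```python
def max_throughput_bits_per_s(window_bytes, rtt_s):
    """Best-case throughput when at most one window can be in flight per round trip."""
    return window_bytes * 8 / rtt_s


def window_needed_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to keep the link full."""
    return bandwidth_bits_per_s * rtt_s / 8


ASSUMED_RTT_S = 0.100    # assumption: 100 ms round trip
UNSCALED_WINDOW = 65535  # largest window without window scaling
DOWNLINK_BPS = 6_000_000 # the "6 megabit downlink" from the email

ceiling = max_throughput_bits_per_s(UNSCALED_WINDOW, ASSUMED_RTT_S)
needed = window_needed_bytes(DOWNLINK_BPS, ASSUMED_RTT_S)
print(f"unscaled window ceiling: {ceiling / 1e6:.2f} Mbit/s")
print(f"window needed for 6 Mbit/s: {needed:.0f} bytes")
```

With that assumed round trip, an unscaled window tops out a little above 5 Mbit/s, and filling a 6 megabit link already needs more than 65535 bytes in flight.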

From: Mark Lord [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 17:49:21 -0400

David Miller wrote:
>..
> First, you are getting window scaling by default with the older
> kernel too. It's just a smaller window scale, using a shift
> value of say 1 or 2.
>
> What these broken middle boxes do is ignore the window scale
> entirely.
>
> So they don't apply a window scale to the advertised windows in each
> packet. Therefore, they think a smaller amount of window space is
> being advertised than really is. So they will silently drop packets
> they think is outside of this bogus window they've calculated.
>
> Now, when the window scale is smaller, the connection can still limp
> along, albeit slowly, making forward progress even in the face of such
> broken devices because half or a quarter of the window is still
> available. It will retransmit a lot, and the congestion window won't
> grow at all.
>
> When the window scale is larger, this middle box bug makes it such
> that not even one packet can fit into the miscalculated window and
> things wedge. The box thinks that your window is "94" instead of
> "94 << WINDOW_SCALE".
..

Unilaterally following the standard is all well and good
for those who know how to get around it when a site becomes
inaccessible, but not for Joe User.

If it always fails, or always works, that's not such a big problem.
I would never have complained if I had never been able to access
the web sites in question. But since it IS working in 2.6.16,
and got broken in 2.6.17, I'm bloody well going to complain.

I suppose the most important objection to our current behaviour
is that this behaviour *changes* when something totally unrelated
(to Joe User) happens: adding or removing a stick of RAM.

So I'm not against the window scaling, just against it's apparent
randomness (to the vast majority who are not "in the know").

We should perhaps just have a fixed upper memory setting, as we
currently do in 2.6.16, so that the behaviour is predictable.

On a related note.. I wonder if we can choose better values for
the window size, so that if the scale factor is ignored, we still
end up with reasonably sized packets? So that the other box
will not think our window is a mere "94" when the scale factor
is lost?

-ml
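Editorial note: the failure David describes can be modelled with the numbers from Mark's traces, where the server's first response segment was 205 bytes and the client advertised "win 1460" (shift 2) or "win 92" (shift 6). This is a deliberately simplified model with an invented function name; a real middlebox compares sequence numbers against window edges rather than a single length.

```python
def segment_passes(segment_len, window_field, shift, honors_scaling):
    """True if a box believes the segment fits in the receiver's advertised window."""
    window = (window_field << shift) if honors_scaling else window_field
    return segment_len <= window


RESPONSE_LEN = 205  # size of the server's first data segment in the traces

# (raw window field, shift) advertised by the client under each kernel
CASES = {"2.6.16.18": (1460, 2), "2.6.17-rc6-git2": (92, 6)}

for kernel, (field, shift) in CASES.items():
    good = segment_passes(RESPONSE_LEN, field, shift, honors_scaling=True)
    broken = segment_passes(RESPONSE_LEN, field, shift, honors_scaling=False)
    print(f"{kernel}: correct box passes={good}, scale-ignoring box passes={broken}")
```

In this model a box that ignores the scale still lets the 205-byte response through under 2.6.16.18, but drops it under 2.6.17-rc6-git2, which is consistent with the response never reaching the client in the failing trace.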

From: Rick Jones <rick.jones2@hp.com>
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 15:12:31 -0700

Mark

From everything I have read so far (which admittedly hasn't been
everything) it sounds like the firewall in question was a ticking
timebomb. If 2.6.17 hadn't set it off, something else might very well
have done so.

Or, if you prefer another metaphore, 2.6.17 was simply the last in a
series of straws on the back of the camel what was the firewall. Meta
issues of whether or not the camel that is firewalls should have ever
been allowed to poke its nose in the Internet Tent notwithstanding :)

At the very least, the firewall, if it is going to be "stateless," has
to strip the window scaling option from the SYN's that go past.
Otherwise, I would be inclined to agree with David that the firewall is
fundamentally broken.

rick jones

From: David Miller [email blocked]
Subject: Re: 2.6.17: networking bug??
Date: Tue, 13 Jun 2006 15:23:01 -0700 (PDT)

From: Mark Lord [email blocked]
Date: Tue, 13 Jun 2006 17:49:21 -0400

> I suppose the most important objection to our current behaviour
> is that this behaviour *changes* when something totally unrelated
> (to Joe User) happens: adding or removing a stick of RAM.

We are pretty much required to choose the TCP memory parameters
based upon how much physical memory is in the machine, and these
parameters in-turn are inextricably linked to what kind of window
scale we try to use for connections.

The behavior is unfortunate, but more unfortunate are the boxes that
create these problems in the first place. I believe their lifespan
is quite limited.

> We should perhaps just have a fixed upper memory setting, as we
> currently do in 2.6.16, so that the behaviour is predictable.

The change in 2.6.17 was exactly that we needed to increase this upper
limit to ~4MB.

> On a related note.. I wonder if we can choose better values for
> the window size, so that if the scale factor is ignored, we still
> end up with reasonably sized packets? So that the other box
> will not think our window is a mere "94" when the scale factor
> is lost?

We have an algorithm that tries to pick something based upon the set
of the values we might need to represent in the window field. If the
scale is too high, you lose accuracy, since the lower bits get chopped
off when the TCP header is being built and the computed window size is
shifted down. So we try to pick the smallest scale necessary to
represent the largest window size we might end up needing to
advertise.

A complication here is that we dynamically size both receive and send
buffers in response to our growing knowledge of the connection's
characteristics over time. So at the beginning we'll use a small
buffer size, and as the congestion window grows we'll increase our
buffer sizes to fill the pipe. This adds even more considerations for
window scale selection, as you can imagine.

One final word about window sizes. If you have a connection whose
bandwidth-delay-product needs an N byte buffer to fill, you actually
have to have an "N * 2" sized buffer available in order for fast
retransmit to work.
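Editorial note: the accuracy loss David mentions ("the lower bits get chopped off") can be seen by encoding a window into the 16-bit field and decoding it again. The sketch below is illustrative only; `encode_window` and `decode_window` are names made up for this note, and 5840 is simply the initial window that appears in the traces.

```python
def encode_window(window_bytes, shift):
    """Value placed in the 16-bit window field: the window shifted down, low bits lost."""
    return min(window_bytes >> shift, 0xFFFF)


def decode_window(field, shift):
    """Window the peer reconstructs from the field and the negotiated shift."""
    return field << shift


WINDOW = 5840  # initial window seen in the traces

for shift in (0, 2, 6, 14):
    field = encode_window(WINDOW, shift)
    seen = decode_window(field, shift)
    print(f"shift {shift:2d}: field {field:5d}, peer sees {seen} bytes, "
          f"granularity {1 << shift} bytes")
```

The window can only be expressed in multiples of 2^shift bytes, so a larger shift rounds small windows down more coarsely, and at the maximum shift of 14 a window this small would be advertised as zero. That is the trade-off behind picking the smallest scale that still covers the largest buffer.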

Anonymous (not verified)
on June 14, 2006 - 1:05pm

This sounds awfully similar to the problems with enabling ECN (Explicit Congestion Notification) by default.

Anonymous (not verified)
on June 15, 2006 - 12:22am

It sure does. I've still got ECN disabled on my Linux box. I first disabled it a couple years back when I couldn't reach verizon.com with it enabled. At the time I think I narrowed it down to a Cisco PIX.

Anonymous (not verified)
on June 15, 2006 - 9:01am

You might want to enable it now and see if it works a bit better. There are *very* few boxes still out there now which break with ECN enabled.

sites which break

Anonymous (not verified)
on December 7, 2006 - 6:26pm

Unfortunately southwest.com (the airline) breaks and they have no technical contact and refuse to read email. So I can't get them to fix it.

Anonymous (not verified)
on June 15, 2006 - 9:29am

... but Cisco knows all about routers!!!!!111one

Yes, that is pretty ridiculous.

Cisco

Craig Ringer (not verified)
on June 17, 2006 - 11:56am

Cisco are hardly perfect. However, their routers are frequently installed and *never* updated, even when Cisco begs its customers to patch them. Customers literally forget that they have them or where they are; customers refuse to patch them because "they work" and they don't want to risk breaking anything; etc. Doesn't support an important Internet standard? Oh well, the customers aren't complaining. Important security patch? We won't install it because the router is working and we don't want to mess with it.

*sigh*

This is very far from entirely Cisco's fault.

Anonymous (not verified)
on June 21, 2006 - 1:56pm

I recently installed a brand new Linksys wireless router (the srx200 model) for a client, and was very surprised to discover that it dropped packets with ECN (which their Linux servers were using). The client's previous router (also a consumer Linksys product) had no problem with it (I assume the new one has VxWorks on it...).

When I called Linksys to complain, they told me there was no process available to report bugs in their firmware. Arrrgg!!!

Generic hw bug reporting process

Helge Hafting (not verified)
on July 13, 2006 - 12:14pm

If they don't want to take bug reports - use the generic bug reporting mechanism for all kinds of hardware:

Return the faulty device for a refund.

That always gets some attention. It is reasonable to assume that they might want proof that the box is indeed broken, so include a description of the ECN problem.

Recommendation: buzzwords

Anonymous (not verified)
on June 15, 2006 - 8:14pm

Can XML and Web2.0 be brought to bear? ;)

jh (not verified)
on June 16, 2006 - 4:20am

I have an obvious question. David Miller says 'you can't just "turn off window scaling" after it's been negotiated'. But why not set a timeout such that when it passes and no packets seem to get through, the connection is closed/shutdown/whatever and a new one is negotiated with window scaling dropped/modified? The "retry connection" would also have a timeout and only after that would the user be informed that "the server cannot be reached". Seems it should work, although IANAKH, of course.

TCP doesn't work that way

on June 16, 2006 - 7:48am

Once you've opened a socket, whatever daemon got spawned on the other side is attached to it and is waiting for you to start talking (or to talk to you, or both). The connection event's already made its way to the application layer and triggered whatever changes in state are associated with it. If another SYN comes in for that port, it's a new connection.

Normal TCP timeouts will cause the affected connection to terminate anyway. From how this article reads, you can't tell the difference between a window scaling problem and a firewall that decided to eat your packets for lunch, other than you got as far as the 3-way negotiation to bring up the connection before your bits started going to the great beyond. But how do you distinguish that from, say, a flaky link or congestion based packet lossage?

The only way such a "do-over" protocol could possibly work is if the TCP/IP stacks on both ends of the connections agreed to engage in non-standard behavior, which, IMHO, seems like an awful idea.

True.

Anonymous (not verified)
on June 19, 2006 - 5:19am

One can repeat what you have said shorter: TCP connection establishment state machine is known to application and is part of any TCP application logic. You can't reestablish connection w/o application knowing that fact, since that might result in undetected loss of data - something TCP is designed to always detect.

I expect that TCP_WINDOW_CLAMP (found in `man 7 tcp`) is intended for disabling/enabling that on a per-socket basis.
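
For anyone who wants to experiment with this, here is a minimal sketch in Python of setting the option on a single socket before connecting. The helper name `make_clamped_socket` and the 65535-byte default are my own assumptions, not anything from the kernel thread. Per the man page the option bounds the size of the advertised window; whether that is enough to get past a particular broken box is not something this thread establishes.

```python
import socket


def make_clamped_socket(clamp_bytes=65535):
    """Return an unconnected TCP socket with a bounded advertised window.

    Hypothetical helper for illustration only. 65535 is chosen because it is
    the largest window that can be expressed without window scaling.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_WINDOW_CLAMP is Linux-specific; Python exposes the constant only
    # on platforms that define it.
    opt = getattr(socket, "TCP_WINDOW_CLAMP", None)
    if opt is None:
        sock.close()
        raise OSError("TCP_WINDOW_CLAMP is not available on this platform")
    # Set the clamp before connect() so it is in place during the handshake.
    sock.setsockopt(socket.IPPROTO_TCP, opt, clamp_bytes)
    return sock
```

The caller would then `connect()` as usual; the clamp applies only to that one socket, so the rest of the system keeps its normal buffer tuning.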

Anyway, I have to agree with the people arguing that broken hardware has to be fixed.

P.S. Another option to work around the problem is to implement a "blacklist": a list of IP address ranges for which window scaling is disabled or set differently.
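
A rough sketch of the lookup side of such a blacklist, in Python. The names `NO_WSCALE_RANGES` and `window_scaling_allowed` are hypothetical, and the ranges shown are the reserved documentation blocks, not real offenders. This only decides whether a destination is on the list; applying the decision to a connection is a separate step that is not shown here.

```python
import ipaddress

# Hypothetical blacklist: destination ranges known to sit behind boxes that
# mishandle window scaling. These are documentation-only address blocks.
NO_WSCALE_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def window_scaling_allowed(dest):
    """Return False if dest falls inside any blacklisted range."""
    addr = ipaddress.ip_address(dest)
    return not any(addr in net for net in NO_WSCALE_RANGES)
```

For example, `window_scaling_allowed("192.0.2.10")` is false because the address lies in the first listed range, while an address outside both ranges is allowed. The obvious drawback is that somebody has to maintain the list.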

So how to find the offendee?

Anonymous (not verified)
on
June 30, 2006 - 1:13pm

So the problem here is a broken machine in between me and the server I'm trying to reach. How does one go about tracking down the culprit box?

None of the suggested sysctls work for me

on
July 13, 2006 - 6:00am

Hey guys. I have a very similar issue, it seems. I have a server running a 2.6.16 kernel as my "host" box, and running multiple 2.6.17-rc5 kernels as guest UML (User Mode Linux) hosts. From my desk (different geographical area, different network, behind a Cisco PIX 501 firewall), it seems cookies that my website issues me (served from one of the 2.6.17-rc5 UML hosts) are dropped by the PIX firewall.

After pulling my hair out and googling for about a week, I came across a gem, or so I thought, describing a similar issue a guy had, where ECN on the server was the culprit. I checked both the 2.6.16 host kernel (which binds the public IP address) and the 2.6.17-rc5 UML host (which is behind an iptables NAT and uses the TUN/TAP networking driver), and both had ECN disabled. I thought I had another gem here, but after tweaking all the sysctls suggested here, I got the same results.

My PIX is telling me that it's dropping the packets because it does not see an established connection, so something has to be confusing it into closing the connection (which is why I thought ECN for sure would be the culprit). The odd thing, however, is that the full website comes up, minus the cookie, and the cookie is part of the HTTP header, which is supposed to be sent prior to the content. I've tried from behind a few "non-broken" devices, like a Linksys WRT54G and a simple Linux NAT box, and both worked as desired.

What could be confusing the braindead PIX into thinking the connection should be terminated? Any assistance on this would be greatly appreciated. Please shoot me an email at bryce@shellshark.net. Thanks guys (and gals, possibly).

OpenBSD firewalls _do_ understand window scaling

James Goodlet (not verified)
on
August 23, 2007 - 12:02pm

I got garden-pathed by the assertion in the above article (and in the linked-to thread) that OpenBSD firewalls don't understand window scaling. This is just plain wrong. OpenBSD understands window scaling just fine, and its firewall software (pf) takes care to extract the scaling information from the SYN and SYN-ACK packets. What might be at fault is that either the ruleset for the firewall is not recording state information (i.e. no "keep state"), since pf records the window scaling information in the state table, OR the ruleset is badly written and is recording two states per flow on different (e.g. internal/external) interfaces, thus failing to form a complete picture of the scaling factors in use. The former is clueless, and the latter is bad practice (the OpenBSD pf documentation is littered with exhortations to apply rules to a single interface unless you've got a really, really good reason to do otherwise).

So, for the record, shooting yourself in the foot with a bad ruleset isn't the fault of the foot...
