Re: Diagnose co-location networking problem

Previous thread: trunk on re0 interface by Sylvie DUPUY on Tuesday, December 26, 2006 - 1:20 pm. (3 messages)

Next thread: BIND running setuid with interface changes by Eugene M. Kim on Wednesday, December 27, 2006 - 1:35 pm. (3 messages)
To: <freebsd-net@...>
Date: Tuesday, December 26, 2006 - 10:45 pm

I just got a server and put it in a co-location.

It runs FreeBSD 6.1-RELEASE #0, pound, lighttpd and Ruby on Rails.

Most of the time the server responds nicely. But periodically
it doesn't respond properly when I access its web pages: I type the URL
in the browser, hit return, and no page appears. I try again and again,
and after a few attempts it appears.

Other sites are accessible during these problematic times. Also, in
parallel I am connected to the server through ssh, and there are no
problems with that. Even during those times when the web pages don't
appear, I can type and see the result.

Before installing it at the datacentre, the server was working without
problems on the local network.

So I am thinking the problem may be with the co-location operation.

How can I make sure? How can I diagnose this? The only idea I had was
to run tcpdump on my Linux client (tcpdump host stbgo.org), and indeed
I can see entries like this:

18:32:24.970139 sucrose.sugarmotor.net.34503 >
vps-18-138.virtualprivateservers.ca.http: S 2468438613:2468438613(0)
win 5840 <mss 1460,sackOK,timestamp 96924995 0,nop,wscale 0> (DF)
18:32:25.670135 sucrose.sugarmotor.net.34508 >
vps-18-138.virtualprivateservers.ca.http: S 2493895298:2493895298(0)
win 5840 <mss 1460,sackOK,timestamp 96925065 0,nop,wscale 0> (DF)
18:32:49.670152 sucrose.sugarmotor.net.34508 >
vps-18-138.virtualprivateservers.ca.http: S 2493895298:2493895298(0)
win 5840 <mss 1460,sackOK,timestamp 96927465 0,nop,wscale 0> (DF)
[note next is almost half a minute later]
18:33:12.970162 sucrose.sugarmotor.net.34503 >
vps-18-138.virtualprivateservers.ca.http: S 2468438613:2468438613(0)
win 5840 <mss 1460,sackOK,timestamp 96929795 0,nop,wscale 0> (DF)
18:33:12.985071 vps-18-138.virtualprivateservers.ca.http >
sucrose.sugarmotor.net.34503: S 2788301288:2788301288(0) ack
2468438614 win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp
708478538 96929795,sackOK,eol> (DF)

But I am not sure what ...

To: <freebsd-net@...>
Date: Wednesday, December 27, 2006 - 6:18 pm

I troubleshoot issues just like this for a living so I hope I can be
of some help. Others have already suggested some useful strategies so
I'll try to focus on ones that I haven't seen mentioned yet.

Off the bat, based on what you've described, I'd tend to suspect some
sort of transparent proxy, be it a stateful firewall or an intermediary
load balancer of some sort. The fact that your ssh connection from
the same source IP (I'm assuming) isn't showing any symptoms would
tend to de-emphasize layers 1-3 (IP on down to ethernet, ruling out
packetloss due to ethernet duplex mismatch/cabling and bad IP
routing, doesn't rule out rate limiting). However, if you've been
experiencing intermittent pauses with your ssh session, even if
they don't coincide with interruptions in http traffic then you may
still have a packet loss issue.

If you suspect packetloss, confirm with 'netstat -i' and look in
the Ierrs and Oerrs columns, they should both be 0 if everything
is spiff. Also check the TCP retransmit counters in 'netstat -s'
(you will always have some retransmission, you just don't want a
*lot* of it). I should note that I think this is a low probability
based on symptoms.

Actually based on the traffic snip you quoted, I tend to strongly

the source IP is 192.168.2.54 which isn't a routable IP address.
Unless you're coming through a VPN or are local to the network,
this would be clear evidence that there is a box in the middle
that's at least smart enough to do address translation.
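
Whether an address such as 192.168.2.54 sits in private, non-routable
space can be checked mechanically. A minimal sketch, assuming Python 3
is available; the helper name is_private is hypothetical, and the two
public addresses are the ones that appear elsewhere in this thread:

```python
import ipaddress


def is_private(addr):
    """Return True if addr lies in a range reserved for private use."""
    return ipaddress.ip_address(addr).is_private


# 192.168.2.54 is the source address seen in the client-side capture;
# the other two are the client's public address and the server address.
for addr in ("192.168.2.54", "64.114.83.92", "65.110.18.138"):
    print(addr, "private" if is_private(addr) else "public")
# prints:
# 192.168.2.54 private
# 64.114.83.92 public
# 65.110.18.138 public
```

A private source address in a capture taken on the client itself only
shows that translation happens somewhere before the packet reaches the
Internet; it does not say where.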

To troubleshoot everything else I would start with recording a full
traffic capture from both the client and the server and try and
reproduce the problem. It sounds like that shouldn't be a problem.

On the client I'd run:

tcpdump -n -s 1600 -i <outgoing interface> -w clientside.dmp host <serverIP>

On the server I'd run

tcpdump -n -s 1600 -i <external interface> -w serverside.dmp

Plan on clientside.dmp and serverside.dmp files getting large fast. That's ok,
you ju...

To: Matthew Hudson <fbsd@...>
Cc: <freebsd-net@...>
Date: Thursday, December 28, 2006 - 2:08 am

Ok, this is a little unfortunate: I can't run traceroute from the client PC (the service provider doesn't seem to like it). (Nor can I use ping)

The server FreeBSD kernel doesn't support tcpdump. I should recompile it then, but not now.

So I ran the netstat tests, seeing no other suggestion. Below is the output before and after "failed" accesses. If I understand, there seems no indication of lost packets.

At least the problem is rather reproducible: run 'lynx -dump http://stbgo.org > /dev/null' in a loop, 15 times and a failure occurs. I also thought maybe the ssh session might be interfering, rather than showing a live connection; but without it the same occurs.
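
The reproduction loop can also be narrowed down to the TCP handshake
alone, which separates connection setup delays from anything pound,
lighttpd or Rails might be doing. A minimal sketch, assuming Python 3
on the client; the function names time_connect and probe and the
one-second threshold are illustrative choices, not anything taken from
the thread:

```python
import socket
import time


def time_connect(host, port, timeout=120.0):
    """Return the seconds needed to complete a TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return None


def probe(host, port, attempts=15, slow_after=1.0):
    """Try several handshakes and return (attempt, seconds) for slow or failed ones."""
    suspicious = []
    for attempt in range(1, attempts + 1):
        elapsed = time_connect(host, port)
        if elapsed is None or elapsed > slow_after:
            suspicious.append((attempt, elapsed))
    return suspicious


if __name__ == "__main__":
    # Host and port are the ones used in the lynx loop above.
    for attempt, elapsed in probe("stbgo.org", 80):
        print(attempt, "failed" if elapsed is None else f"{elapsed:.1f}s")
```

If the slow attempts cluster around a few distinct durations rather
than being spread out, that would point at retransmitted SYNs rather
than a slow application.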

Thanks a lot to all for now.

Stephan

# Both on client and server:
$ netstat -i > /tmp/before
$ netstat -s | grep -i ret >> /tmp/before
... run test .... recognize failure ...
$ netstat -i > /tmp/after
$ netstat -s | grep -i ret >> /tmp/after

Client first.

$ cat /tmp/before
Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 12471498      0      0      0  8604916     36      0      1 BMRU
eth0:  1500   0                 - no statistics available -                 BMRU
eth0:  1500   0                 - no statistics available -                 BMRU
lo    16436   0   429696      0      0      0   429696      0      0      0 LRU
66656 segments retransmited
TCPLostRetransmit: 0
TCPFastRetrans: 1233
TCPForwardRetrans: 18
TCPSlowStartRetrans: 476
$ cat /tmp/after
Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 12471903      0      0      0  8605107     36      0      1 BMRU
eth0:  1500   0                 - no statistics available -                 BMRU
eth0:  1500   0                 - no statistics available -                 BMRU
lo    16436   0   429786      0      0      0   429786      0      0      0 LRU
66665 segments retransmited
TCPLostRetransmit: 0...

To: Stephan Wehner <stephanwehner@...>
Cc: <freebsd-net@...>
Date: Thursday, December 28, 2006 - 4:31 pm

Ok, that explains the private 192.168 IP address I saw in your earlier

Actually there's significant indication of lost packets and clues that

Generally two TCP connections on different sockets will never interfere
with each other, except in extreme examples of congestion or pathologically

So we're looking at the client here and there are few things of note:
1. No significant interface errors are being recorded so it's
not a layer-2 (ethernet) issue.
2. The retransmit count went up by 9 while the overall transmit count
went up by 191 packets, suggesting an approximate transient packet loss
rate of 4.7% (9/191, fuzzy math) during the test, which is
significantly greater than the system-wide average of 0.8%
(66665/8605107). This possibly suggests that the client
saw an abnormal packet loss rate during the test. It may be
the case that all of the successful connections experienced
no packet loss and only the failed connect generated the
retransmits. I'm not sure if initial SYN retransmits get
counted in this column or not but I believe this still may be
significant. (The assumptions made in these calculations are
so grossly oversimplified that the evidence derived from
them is weak at best).
3. The loopback saw 90 packets of activity. I don't know how
long this test ran but that could be considered a little chatty.
As a longshot, I'd run a tcpdump on loopback and run the test
again, simply to make sure that no traffic is unintentionally
getting diverted over the loopback interface (unlikely but I've
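
The arithmetic behind points 2 and 3 can be reproduced from the
client's before/after netstat snapshots. A minimal sketch in Python 3;
the dictionary keys are made-up labels, while the numbers are copied
from the posted output. As noted above, dividing a TCP segment counter
by an interface packet counter is only a rough indicator:

```python
# Counter values copied from the client's /tmp/before and /tmp/after output.
before = {"tx_ok": 8604916, "retransmits": 66656, "lo_rx_ok": 429696}
after = {"tx_ok": 8605107, "retransmits": 66665, "lo_rx_ok": 429786}

# How much each counter moved while the test ran.
delta = {name: after[name] - before[name] for name in before}

# Rough retransmit ratio during the test versus since boot.
test_rate = delta["retransmits"] / delta["tx_ok"]
lifetime_rate = after["retransmits"] / after["tx_ok"]

print("packets sent on eth0 during test:", delta["tx_ok"])         # 191
print("segments retransmitted during test:", delta["retransmits"])  # 9
print("loopback packets during test:", delta["lo_rx_ok"])           # 90
print(f"retransmit ratio during test: {test_rate:.1%}")             # 4.7%
print(f"retransmit ratio since boot:  {lifetime_rate:.1%}")         # 0.8%
```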

And here are the server stats which seem to show very little but
in fact are quite informative.
1. No significant interface errors, again ruling out layer-2.
2. pflog and pfsyn devices are registered in the kernel,
suggesting PF firewalling has been compiled in. It doesn't
seem that pflog is being used at all but this does beg the
...

To: Matthew Hudson <fbsd@...>
Cc: Stephan Wehner <stephanwehner@...>, <freebsd-net@...>
Date: Monday, January 1, 2007 - 10:00 am

> On Wed, Dec 27, 2006 at 10:08:25PM -0800, Stephan Wehner wrote:
/usr/ports/net/scand/ designed for such a problem
on client side when overactive.

_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

To: <.@...>
Cc: <freebsd-net@...>, Matthew Hudson <fbsd@...>
Date: Tuesday, January 2, 2007 - 12:07 am

Hello again,

To give a short summary: I contacted this list to ask for help with
diagnosing a problem with a co-located server (accessible at
http://stbgo.org) running FreeBSD 6.1. When I swamp it with requests
(simply, lynx -dump http://stbgo.org > /dev/null in a loop)
some responses take 90 seconds -- far too long. This is reproducible,
but seemingly not from other source addresses - only my own (home)
machine, 64.114.83.92

The recommendations have been to look at the traceroute output, and at
tcpdump from both ends; the tcpdump output was so far only available
from the client side, since the server's kernel didn't support
tcpdump. (Also, to "swamp" a different port and see the result -
planned next step)

The traceroute brought up a host with name
a.core.65-110-0-1.van.data-fortress.com which sits between my server
and the Internet. (As far as I can tell, this looks legitimate. The
co-location provider says their machines perform no filtering)

Traceroute from the client is not available; this service provider
blocked it for some reason, and I can live without it. (tcptraceroute
was suggested, but that gave no meaningful output; I may try again
from a different machine on the local network)

Now to where I got in the meantime.

I recompiled the server's kernel so that tcpdump is available.

Client tcpdump command
-----------------------------------------------------------------------------
$ sudo /usr/sbin/tcpdump -n -s 1600 -w clientside.dmp host stbgo.org
tcpdump: listening on eth0

398 packets received by filter
0 packets dropped by kernel

Server tcpdump command
---------------------------------------------------------------
$ sudo /usr/sbin/tcpdump -n -s 1600 -w /tmp/serverside.dmp
tcpdump: listening on bge0, link-type EN10MB (Ethernet), capture size 1600 bytes
Hm, dispatch protocol error: type 3 plen 4
302 packets captured
303 packets received by filter
0 packets dropped by kernel

During the times that these two commands ran, this was executed on t...

To: Stephan Wehner <stephanwehner@...>, <freebsd-net@...>
Date: Thursday, January 4, 2007 - 3:34 pm

Stephan,

I haven't yet looked at the new data you've reported because the
last point that you mention, the point that I brought up in my
previous email, is of critical importance. There is currently
evidence that suggests the possibility that this problem is not
reproducible from anywhere but your home network. If that is indeed
the case then what you have is likely not a problem with the FreeBSD
networking code on your colocated box but rather a problem with
your home network. If so, then this is the wrong forum to seek
assistance in troubleshooting a non-FreeBSD networking problem.

Before we can meaningfully assist you with your issue here, we're
going to need to see evidence that shifts the finger of blame
back towards the FreeBSD networking code and away from your home
network where it is currently pointing.

cheers,
--
Matthew Hudson


To: Matthew Hudson <fbsd@...>
Cc: <freebsd-net@...>
Date: Friday, January 5, 2007 - 2:33 am

Yes, it does look like that - a problem at my home end.

Thanks a lot for the assistance!


To: Stephan Wehner <stephanwehner@...>
Cc: <freebsd-net@...>, Matthew Hudson <fbsd@...>
Date: Thursday, December 28, 2006 - 1:31 pm

/usr/ports/net/tcptraceroute

You should normally be able to use tcptraceroute to get path information
to systems that are listening on a TCP port (e.g. a web or mail server).
You can try ports that aren't open on the other end, but that may be
less useful.


To: Stephan Wehner <stephanwehner@...>
Cc: <freebsd-net@...>, Matthew Hudson <fbsd@...>
Date: Thursday, December 28, 2006 - 8:07 am

On or about Wed, Dec 27, 2006 at 22:08 , while attempting a

So login to the FreeBSD machine and trace back to your client IP -
or as close as you can get. That may mean just to the edge of your
current provider but that may give you some idea.

Bill
--
Bill Vermillion - bv @ wjv . com

To: <bv@...>
Cc: <freebsd-net@...>, Matthew Hudson <fbsd@...>
Date: Thursday, December 28, 2006 - 12:31 pm

Ok, here is the result.

$ traceroute 64.114.83.92
traceroute to 64.114.83.92 (64.114.83.92), 64 hops max, 40 byte packets
1 VPS-18-137.virtualprivateservers.ca (65.110.18.137) 1.098 ms
0.991 ms 1.151 ms
2 a.core.65-110-0-1.van.data-fortress.com (65.110.0.1) 4.357 ms
1.557 ms 1.147 ms
3 64.69.87.37 (64.69.87.37) 1.740 ms 1.255 ms 1.150 ms
4 216.187.88.241 (216.187.88.241) 1.742 ms 2.438 ms 2.182 ms
5 204.239.129.214 (204.239.129.214) 1.910 ms 2.881 ms 3.489 ms
6 nwmrbc01dr02.bb.telus.com (154.11.4.72) 5.095 ms 3.309 ms 2.322 ms
7 64.114.45.106 (64.114.45.106) 6.555 ms 80.103 ms 9.048 ms
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *

What does this tell us?

By the way, other servers look "good". Meaning when I repeatedly
access other websites (not my own) I don't see failures.


To: Stephan Wehner <stephanwehner@...>
Cc: <freebsd-net@...>, Matthew Hudson <fbsd@...>
Date: Thursday, December 28, 2006 - 5:46 pm

Wise men talk because they have something to say, however
on Thu, Dec 28, 2006 at 08:31 , Stephan Wehner just had

Well, there is no name associated with 64.114.45.106. Whois shows
that it is allocated to Telus Communications in Burnaby, British
Columbia. The IP right before that is also a Telus IP.

So the next question is: what connects to 64.114.45.106? Is that
an IP assigned to you, and do you then use NAT and/or PAT to translate
to a local address? Your target IP is in the same block
that Telus is allocated, as they have 64.114.0.0 thru 64.114.255.255.
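
The block membership can be double-checked by turning the whois range
into CIDR form. A minimal sketch in Python 3; the helper name
in_allocation is hypothetical, and the range endpoints are the ones
quoted above:

```python
import ipaddress

# Range endpoints as reported by whois for the Telus allocation.
first = ipaddress.ip_address("64.114.0.0")
last = ipaddress.ip_address("64.114.255.255")

# Collapse the range into the smallest list of CIDR blocks covering it.
blocks = list(ipaddress.summarize_address_range(first, last))


def in_allocation(addr):
    """Return True if addr falls inside any of the summarized blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in block for block in blocks)


print(blocks)                          # [IPv4Network('64.114.0.0/16')]
print(in_allocation("64.114.45.106"))  # True  (last hop that answered)
print(in_allocation("64.114.83.92"))   # True  (traceroute target)
print(in_allocation("65.110.18.138"))  # False (the co-located server)
```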

The target IP does have a name associated with it, and that
is zz83902.cipherkey.net. Cipherkey.net is shown as being
located in Richmond BC. Are they providing services for you?

That sounds like throttling or as another poster said some
firewall/filtering taking place.

I find the same problem as you do tracing to www.buckmaster.ca.
I can't traceroute to it as it stops responding at 64.114.45.106, so
I'd say they are blocking things at that point - which isn't
helping at all :-(

However the site comes up very fast.

--
Bill Vermillion - bv @ wjv . com

To: Stephan Wehner <stephanwehner@...>
Cc: <freebsd-net@...>
Date: Wednesday, December 27, 2006 - 9:45 am

Earlier in the linear time track, on approximately Tue, Dec 26, 2006 at 18:45 ,

That sounds like a transport problem between your machine and the
server. It could be anywhere on the link. Is the colo doing any
rate-limiting?

I see this now and then with dropped packets from my machine to my
servers. And I control the colo with a rack we have in the Level 3
space so I can trace the problems. One of the strangest - with
intermittent long delays in packet returns made me think I had a
problem with Level 3.

I contacted the NOC in the Denver area, and they checked, and saw no
problems on their net, but they checked further, and what was
happening was that my packets were taking a different route back to me
than going to the server. [this is not a bug but it doesn't happen very
often - usually when someone screws things up in routers].

Packets left Orlando via Sprint, went to Texas, crossed over to
Level 3 there, back to Orlando and my rack, and then they would go
out onto Level 3, and then go to a Sprint router in Washington
and come back through Atlanta.

So the first thing I'd suggest is checking your connections via
traceroute. And >>IF<< your provider does not block RECORD ROUTE
and if the hop count is under 8 - you can try ping -R .

That will show you the IP addresses from which the packets are

When you say 'other sites are accessible' do you mean other sites
on your machine, or other sites on the 'net. And what about other

Well there is always the chance that moving it created a problem -
something shook loose. I've had the reverse when I was heading up
a recording studio. Some of the early digital equipment we had
would get flaky. We'd ship it by FedEX to the factory, and they'd
find nothing, but change out something that may have caused it.

Three times FedEX cured the problem in shipping - and each time
another piece was changed. Finally - on number 4 - it worked at
the factory, but they changed ALL the internal cables - and that
fixed it permanent...

To: Stephan Wehner <stephanwehner@...>
Cc: <freebsd-net@...>
Date: Tuesday, December 26, 2006 - 11:11 pm

DNS.

try tcpdump -n

-alex


To: <alex@...>
Cc: <freebsd-net@...>
Date: Wednesday, December 27, 2006 - 1:55 am

Ok, thanks, I now ran tcpdump with -n.

Here I am testing with a little script that continuously accesses one
of the pages.
Right at the beginning it doesn't get very far: first response after 90 seconds.

What kind of DNS problem did you have in mind?

Stephan

$ sudo /usr/sbin/tcpdump -n host stbgo.org
Password:
tcpdump: listening on eth0
21:40:22.162536 192.168.2.54.35932 > 65.110.18.138.80: S 1526509984:1526509984(0) win 5840 <mss 1460,sackOK,timestamp 98052714 0,nop,wscale 0> (DF)
21:40:25.160150 192.168.2.54.35932 > 65.110.18.138.80: S 1526509984:1526509984(0) win 5840 <mss 1460,sackOK,timestamp 98053014 0,nop,wscale 0> (DF)
21:40:31.160150 192.168.2.54.35932 > 65.110.18.138.80: S 1526509984:1526509984(0) win 5840 <mss 1460,sackOK,timestamp 98053614 0,nop,wscale 0> (DF)
21:40:43.160143 192.168.2.54.35932 > 65.110.18.138.80: S 1526509984:1526509984(0) win 5840 <mss 1460,sackOK,timestamp 98054814 0,nop,wscale 0> (DF)
21:41:07.160149 192.168.2.54.35932 > 65.110.18.138.80: S 1526509984:1526509984(0) win 5840 <mss 1460,sackOK,timestamp 98057214 0,nop,wscale 0> (DF)
21:41:55.160152 192.168.2.54.35932 > 65.110.18.138.80: S 1526509984:1526509984(0) win 5840 <mss 1460,sackOK,timestamp 98062014 0,nop,wscale 0> (DF)
21:41:55.174033 65.110.18.138.80 > 192.168.2.54.35932: S 432853648:432853648(0) ack 1526509985 win 65535 <mss 1460,nop,wscale 1,nop,nop,timestamp 719800431 98062014,sackOK,eol> (DF)
21:41:55.174139 192.168.2.54.35932 > 65.110.18.138.80: . ack 1 win 5840 <nop,nop,timestamp 98062015 719800431> (DF)
21:41:55.175738 192.168.2.54.35932 > 65.110.18.138.80: P 1:15(14) ack 1 win 5840 <nop,nop,timestamp 98062015 719800431> (DF)
21:41:55.290528 65.110.18.138.80 > 192.168.2.54.35932: . ack 15 win 33304 <nop,nop,timestamp 719800549 98062015> (DF)
21:41:55.290665 192.168.2.54.35932 > 65.110.18.138.80: P 15:73(58) ack 1 win 5840 <nop,nop,timestamp 98062027 719800549> (DF)
...
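
The spacing of the six SYN packets in this capture is worth measuring.
They all carry the same sequence number (1526509984), so they are
retransmissions of a single connection attempt. A minimal sketch in
Python 3, with the timestamps copied from the capture above and
variable names chosen only for illustration:

```python
from datetime import datetime

# Capture times of the six client SYNs and of the server's SYN-ACK.
syn_times = [
    "21:40:22.162536",
    "21:40:25.160150",
    "21:40:31.160150",
    "21:40:43.160143",
    "21:41:07.160149",
    "21:41:55.160152",
]
syn_ack_time = "21:41:55.174033"


def parse(stamp):
    """Parse an HH:MM:SS.ffffff tcpdump timestamp (the date is irrelevant here)."""
    return datetime.strptime(stamp, "%H:%M:%S.%f")


sent = [parse(stamp) for stamp in syn_times]

# Seconds between consecutive SYNs.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(sent, sent[1:])]

# Time from the first SYN to the SYN-ACK, and from the last SYN to the SYN-ACK.
total_wait = (parse(syn_ack_time) - sent[0]).total_seconds()
reply_delay_ms = (parse(syn_ack_time) - sent[-1]).total_seconds() * 1000

print([round(gap) for gap in gaps])  # [3, 6, 12, 24, 48]
print(round(total_wait, 1))          # 93.0
print(round(reply_delay_ms, 1))      # 13.9
```

The doubling gaps are consistent with ordinary TCP retransmission
backoff on the client, and they add up to the roughly 90 second delay
reported. The capture does not show whether the earlier SYNs or their
SYN-ACKs were lost; that needs a capture on the server side as well.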

To: Stephan Wehner <stephanwehner@...>
Cc: <freebsd-net@...>
Date: Wednesday, December 27, 2006 - 1:51 am

Sorry, this doesn't seem to be DNS.

It may be something related to link/autonegotiation, or it could be
something beyond your control, on the Internet between the two hosts in
question. Are you able to do a traceroute when your connections fail?
Where does the trace stop?

-alex

