RE: poll() blocked / packets not received ?

From: Nicolas Cannasse
Date: Monday, October 20, 2008 - 1:25 am

Hello,

We have an application that uses pthreads and (blocking) sockets.

When the application runs with one single thread in separate processes 
(using fork()) we don't get any problem.

However when it's multithreaded, we sometimes get stuck while poll()ing 
a socket (with events set to POLLIN). Even after the other side of the 
connection has closed its side of the connection, we are still stuck 
here. Adding a timeout only makes the poll() exit with 0, so we loop.

In case we don't loop the next operation is a recv() which will block as 
well (which is consistent).

It seems like nothing is received on the socket any longer, but it's 
difficult to verify with tcpdump since our server outputs something like 
15MB at peak time with 150 hits per second.

We have Shorewall installed and enabled, but what seems strange is that 
the problem depends on multithreading. It also occurs much more often on 
the 4-core machines than on the 2-core ones (both with Hyperthreading 
activated). We're using kernel 2.6.20-15-server (#2 SMP) provided by Ubuntu.

Any tip on how we could fix that or investigate further would be 
appreciated. After one month of debugging we're really out of solutions now.

Best,
Nicolas
--

From: swivel
Date: Monday, October 20, 2008 - 3:15 am

Your usage pattern is a very common one, I highly doubt you are experiencing
a kernel bug here or many people (including myself) would be complaining.

Shorewall sounds like it might be suspect. Are FINs not coming in when the
remote closes?  You can look in the output of netstat to see what state the
TCP is in. Is it still ESTABLISHED?

Have you tried just disabling the firewall to see if the problem
disappears?

Regards,
Vito Caputo
--

From: Nicolas Cannasse
Date: Monday, October 20, 2008 - 3:46 am

Yes, it's still ESTABLISHED, but we can't see the corresponding 
connection on the other machine while running netstat. I'm not a TCP 
expert, so I'm not sure in which case this can occur.

I agree with your comment in general, except that we have been running 
the same application in a single-threaded environment for years without 
running into this very specific problem.

The only logs we get in the dmesg are the following :

either (a few everyday) :

[10742708.006350] TCP: Treason uncloaked! Peer 213.209.177.218:32924/80 
shrinks window 4049064122:4049064123. Repaired.

Or (more often) :

[10755036.856217] Shorewall:net2all:DROP:IN=eth0 OUT= 
MAC=00:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:00 SRC=60.238.83.204 
DST=XX.XX.XX.43 LEN=404 TOS=0x00 PREC=0x00 TTL=114 ID=12366 PROTO=UDP 
SPT=1057 DPT=1434 LEN=384

Neither the SRC nor the DST IPs correspond to the connections that are 
stalled, since those occur on the local network.

Best,
Nicolas
--

From: swivel
Date: Monday, October 20, 2008 - 4:39 am

If the end that's blocking still has the TCP in ESTABLISHED state, and
the other end doesn't have the TCP at all... you've already identified
why the one end is still ESTABLISHED.  ESTABLISHED state won't be left
until the FIN is received from the other end, then entering CLOSE_WAIT
state.

When the other end of the TCP is _gone_ that leads me to believe a FIN
will not be coming, hence the indefinite ESTABLISHED state.  Why it's
gone is a different question, maybe your problem is at the other end?
The end initiating a shutdown has to enter FIN_WAIT_1 then FIN_WAIT_2,
these transitions require the other side to leave ESTABLISHED (receive a

Perhaps when you run in multicore/threaded you are stressing the network
stacks at both ends more, including everything in-between?  The
threading vs. single process relationship is probably not causal, but
just coincidental.

What is the protocol?  Are there any timeouts to take care of these
situations?  Do you schedule an alarm or use SO_RCVTIMEO to shutdown
dead connections and free up consumed threads?

TCP, being reliable, can block indefinitely; you can employ TCP keepalive
to change "indefinitely" to "quite a long time".
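
As an illustration of the keepalive suggestion, here is a minimal sketch of
how a client could turn it on for a connected socket. The function name and
the parameter values are made up for the example; TCP_KEEPIDLE,
TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific socket options. Once the
probes go unanswered, the kernel should report an error on the socket, so a
thread blocked in poll() wakes up and the following recv() fails (typically
with ETIMEDOUT).

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on a socket (hypothetical helper, Linux options).
 *   idle_s:  seconds of silence before the first probe is sent
 *   intvl_s: seconds between two probes
 *   count:   unanswered probes before the connection is declared dead
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_keepalive(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;

    assert(fd >= 0);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
        return -1;
    return 0;
}
```

With enable_keepalive(fd, 60, 10, 5), for instance, a silently dead peer
would be noticed after roughly 60 + 5 * 10 seconds instead of never.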

Regards,
Vito Caputo
--

From: Nicolas Cannasse
Date: Monday, October 20, 2008 - 5:13 am

Not sure why this should happen, since it's the same servers. The only 
thing that changes is the part of the software that we are using to handle 
our server requests. It's either embedded in Apache 1.3 with fork() or a 
standalone multithreaded server which acts as an Apache backend.

So the only difference for networking is that we have additional 
Apache<->MT-Server communications, but they should be on 127.0.0.1 so I 

The protocol is MySQL. Since we had the problem with libmysqlclient, we 
reimplemented it again from scratch to make sure that it was not 
software-related.

What happens at the protocol-level is the following :

a) we connect to the server
b) we make several requests and get answers back
c) at some (random+rare) point - always after making a request - we're 
stuck while waiting for the answer.

Sadly, this can happen inside a transaction while we hold the lock on 
some shared resource. This will lock the whole website until we run out 
of file descriptors due to accept'ed pending connections. In that case we 
get an exception and the server (the multithreaded one, not MySQL) 
restarts, which releases the lock.

In some other cases when we don't hold a lock, the thread remains 
blocked in poll() as I described it. After a timeout (I think it's 28800 
seconds) the MySQL server closes the connection. The client - which is 
waiting in poll() - does not have any timeout activated (it's relying on 
the mysql server). But it doesn't notice that the socket has been closed 
either.

We investigated a lot about signals since poll() can also be interrupted 
by Garbage Collector and child process signals, but we correctly handle 
EINTR everywhere it's needed. So unless there's a possibility that 
interrupting poll() with a signal might somehow consume the data, this 

Sure. We could also use a client timeout, but we don't want to hold the 
lock more than required, and we can't make the difference between a 
given request that would take too much time to complete and a lost ...
From: Nicolas Cannasse
Date: Monday, October 20, 2008 - 5:39 am

Ok, funny thing is that we just found what is occurring...

We had a process that was on a regular basis doing the following :

conntrack -F

This was done in order to prevent the table from growing too big, because 
we were reaching the maximum size as reported by :

/proc/sys/net/ipv4/netfilter/ip_conntrack_max
   and
/proc/sys/net/ipv4/netfilter/ip_conntrack_count

Seems like when there are active connections, this will break netfilter 
and stop delivering packets to the socket.
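
For reference, the two counters mentioned above can be compared from a small
program instead of flushing the table blindly. This is only a sketch: the
helper names are invented for the example, and the /proc paths are the ones
quoted in this message (newer kernels expose the same values under different
names, so the paths are passed in by the caller).

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a /proc-style file (hypothetical helper).
 * Returns 0 on success, -1 if the file is missing or not a number. */
int read_counter(const char *path, long *out)
{
    assert(path != NULL && out != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Percentage of the conntrack table in use, rounded down.
 * Returns -1 if the inputs make no sense (max <= 0 or count < 0). */
int usage_percent(long count, long max)
{
    if (max <= 0 || count < 0)
        return -1;
    return (int)(count * 100 / max);
}
```

A monitoring job could call read_counter() on ip_conntrack_count and
ip_conntrack_max and raise an alert when usage_percent() gets close to 100,
which leaves the entries of live connections alone.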

At least I will have a nice sleep tonight.

Best,
Nicolas
--

From: David Schwartz
Date: Monday, October 20, 2008 - 8:53 am

Note that this solved your symptom, not your problem. You actually have two
problems:

1) You rely on TCP to detect a lost connection even by a side that will
never transmit any data. TCP simply does not do this. If you are not trying
to send data, you are not assured that a lost connection will be detected.
(You either need a timeout, or you need to send or dribble some data,
depending on the protocol.)

2) You hold a lock on a shared resource while you wait for a reply over a
network. If this is a low-level "block and wait indefinitely" lock, this
will cause many threads to line up behind a slow/stuck thread. The right fix
depends on your circumstances, but you need to use a synchronization
primitive that is suitable. (You need to be able to use multiple connections
or defer operations without holding a thread.)

With both of these bugs, you are vulnerable to precisely the scenario you
observed. The TCP connection close packets were lost (in this case due to
premature expiration of the connection tracking, but other things can do
it, such as the server rebooting), TCP could not detect the lost connection
because you never sent any data, so one thread blocked forever, and other
threads got in line behind it.

DS


--

From: Nicolas Cannasse
Date: Monday, October 20, 2008 - 10:24 am

I agree with both points, but I can't modify the MySQL protocol to 
implement that.

For (1) I can't add the timeout since I have no way to differentiate 
between a lost connection and a request that takes time to execute. I'll 
maybe check if the protocol allows pings while waiting for the request 
result, but I'm not sure it does.

For (2) the shared resource is on the database side, not on the server 
side. It's the transaction that has some rows locked. I have no 
solution for that.

Best,
Nicolas
--

From: David Schwartz
Date: Monday, October 20, 2008 - 4:21 pm

Sure you can. For example, you can run a proxy on both the server and the
client, with the two proxies speaking a protocol that carries the MySQL
protocol. The protocol between the server and the client can include two
types of messages, one being regular data (which the proxies pass to the
server and client software) and one being a ping (which the proxies use
internally to decide when to drop their connections). Each proxy can 'ping'
the other as often as required and drop both connections if the ping fails
to go through. This will ensure that your program detects a connection loss
rapidly.
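
To make the two message types concrete, here is one possible framing for
such a proxy-to-proxy protocol. Everything in it is an assumption made for
the example (a 1-byte type, a 4-byte big-endian length, then the payload,
and the names MSG_DATA, MSG_PING, frame_encode, frame_decode, peer_is_dead);
it is a sketch of the idea, not a description of any existing tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MSG_DATA = 0, MSG_PING = 1 };  /* hypothetical message types */
enum { HDR_LEN = 5 };                 /* 1 type byte + 4 length bytes */

/* Write one frame into out[0..cap).
 * Returns the frame size, or 0 if the buffer is too small. */
size_t frame_encode(uint8_t *out, size_t cap, uint8_t type,
                    const void *payload, uint32_t len)
{
    assert(out != NULL && (payload != NULL || len == 0));
    if (cap < HDR_LEN || cap - HDR_LEN < len)
        return 0;
    out[0] = type;
    out[1] = (uint8_t)(len >> 24);
    out[2] = (uint8_t)(len >> 16);
    out[3] = (uint8_t)(len >> 8);
    out[4] = (uint8_t)len;
    if (len > 0)
        memcpy(out + HDR_LEN, payload, len);
    return HDR_LEN + (size_t)len;
}

/* Parse one frame from in[0..n).
 * Returns the bytes consumed, or 0 if the frame is not complete yet. */
size_t frame_decode(const uint8_t *in, size_t n, uint8_t *type,
                    const uint8_t **payload, uint32_t *len)
{
    assert(in != NULL && type != NULL && payload != NULL && len != NULL);
    if (n < HDR_LEN)
        return 0;
    uint32_t l = ((uint32_t)in[1] << 24) | ((uint32_t)in[2] << 16) |
                 ((uint32_t)in[3] << 8) | (uint32_t)in[4];
    if (n - HDR_LEN < l)
        return 0;
    *type = in[0];
    *payload = in + HDR_LEN;
    *len = l;
    return HDR_LEN + (size_t)l;
}

/* The peer counts as lost if nothing (data or ping) arrived recently. */
int peer_is_dead(long long now_ms, long long last_rx_ms, long long limit_ms)
{
    return now_ms - last_rx_ms > limit_ms;
}
```

Each proxy would forward MSG_DATA payloads to its local MySQL client or
server untouched, send an empty MSG_PING whenever the link has been idle,
and close both of its connections once peer_is_dead() becomes true.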


That doesn't fit your problem description. Presumably the server detected
the loss of the connection and so would have released any resources it was
holding that were associated with it. The problem in this case was that the

Good luck.

DS


--

From: Willy Tarreau
Date: Monday, October 20, 2008 - 10:12 pm

Not only can you, but you *must*. Any service assuming an infinite timeout
is doomed to fail. If you know that one request can take as long as one
minute for instance, then use a 2-minute timeout. The day all requests
are automatically cleaned up after a firewall between client and server
fails, you'll be happy not to have to come and restart the service to
flush them.

There's a huge difference between using a very large timeout and none at
all.
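
A large-but-finite timeout can be sketched as follows. The helper name
recv_deadline is invented for the example; it waits in poll() against a
fixed deadline, so that signals interrupting poll() (EINTR) do not restart
the full timeout each time, and it reports expiry as ETIMEDOUT.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Milliseconds on a monotonic clock (unaffected by wall-clock changes). */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Like recv(), but gives up after timeout_ms in total (hypothetical helper).
 * Returns bytes read, 0 if the peer closed, -1 on error.
 * On timeout it returns -1 with errno set to ETIMEDOUT. */
ssize_t recv_deadline(int fd, void *buf, size_t len, int timeout_ms)
{
    assert(fd >= 0 && buf != NULL && timeout_ms >= 0);
    long long deadline = now_ms() + timeout_ms;

    for (;;) {
        long long left = deadline - now_ms();
        if (left < 0)
            left = 0;                 /* one last non-blocking check */
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int rc = poll(&p, 1, (int)left);
        if (rc > 0)                   /* data, EOF or error: recv() tells which */
            return recv(fd, buf, len, 0);
        if (rc == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (errno != EINTR)
            return -1;
        /* interrupted by a signal: retry with the remaining time only */
    }
}
```

With the figures from this message, a request known to take up to one minute
would be read with recv_deadline(fd, buf, sizeof buf, 120000), and a timeout
would be handled by closing the connection and rolling back.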

Willy

--
