Re: sockets affected by IPsec always block (2.6.23)

Previous thread: [PATCH 4/4] netns: prevent usage of flowi with not initialized fl_net in routing (v2) by Denis V. Lunev on Tuesday, December 4, 2007 - 2:52 pm. (1 message)

Next thread: [PATCH (resubmit)]: fix lro_gen_skb() alignment by Andrew Gallatin on Tuesday, December 4, 2007 - 3:55 pm. (2 messages)
To: Linux Kernel Mailing List <linux-kernel@...>, <netdev@...>
Date: Tuesday, December 4, 2007 - 2:53 pm

If I have an IPsec rule like:
	spdadd 192.168.7.8 1.2.3.4 any -P out ipsec esp/transport//require;
(i.e. a remote host 1.2.3.4 which will not respond)

Then any attempt to communicate with 1.2.3.4 will block, even when using non-blocking sockets:

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
fcntl64(3, F_GETFL)                     = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK)  = 0 <-- non-blocking socket
connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("1.2.3.4")}, 16 <-- blocked connect()

[277657.564773] netcat        S b06bcf20     0  9450   9449
[277657.564785]        c8d51d28 00200046 00200046 b06bcf20 00200286 c8d51d14 00200286 c8d51d84
[277657.564814]        b06bcf20 c8d51d28 b013680b c8d51d84 eeae8800 c8d51d78 c8d51dd0 b04d3fc5
[277657.564843]        c8d51da4 00000002 00000001 ede87284 00000002 00000040 e9318ac0 db3f20a0
[277657.564874] Call Trace:
[277657.564881]  [<b04d3fc5>] __xfrm_lookup+0x2f5/0x510
[277657.564905]  [<b0494f9e>] ip_route_output_flow+0x4e/0x80
[277657.564919]  [<b04af303>] tcp_v4_connect+0x183/0x6d0
[277657.564934]  [<b04beaf2>] inet_stream_connect+0x122/0x1c0
[277657.564949]  [<b0471c8e>] sys_connect+0x9e/0xd0
[277657.564963]  [<b0472785>] sys_socketcall+0xa5/0x230
[277657.564973]  [<b01042ba>] syscall_call+0x7/0xb
[277657.564984]  =======================

I had a process using non-blocking sockets stuck in connect() for over 8 hours because of this...

00002630 <__xfrm_lookup>:
...
    290b:   b8 00 00 00 00          mov    $0x0,%eax
            290c: R_386_32  km_waitq
    2910:   e8 fc ff ff ff          call   2911 <__xfrm_lookup+0x2e1>
            2911: R_386_PC32    add_wait_queue
    2915:   a1 00 00 00 00          mov    0x0,%eax
            2916: R_386_32  per_cpu__current_task
    291a:   c7 00 01 00 00 00       movl   $0x1,(%eax)
...
To: Simon Arlott <simon@...>, David S. Miller <davem@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, <netdev@...>
Date: Tuesday, December 4, 2007 - 8:12 pm

This patch should help.

[INET]: Export non-blocking flags to proto connect call

Previously we made connect(2) block on IPsec SA resolution.  This is
good in general but not desirable for non-blocking sockets.

To fix this properly we'd need to implement the larval IPsec dst stuff
that we talked about.  For now let's just revert to the old behaviour
on non-blocking sockets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/include/net/ip.h b/include/net/ip.h
index 83fb9f1..9b4ed7e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -121,7 +121,8 @@ extern void		ip_flush_pending_frames(struct sock *sk);
 
 /* datagram.c */
 extern int		ip4_datagram_connect(struct sock *sk, 
-					     struct sockaddr *uaddr, int addr_len);
+					     struct sockaddr *uaddr,
+					     int addr_len, int flags);
 
 /*
  *	Map a multicast IP onto multicast MAC for type Token Ring.
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index e90f962..2686850 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -567,7 +567,8 @@ extern void			ipv6_packet_init(void);
 extern void			ipv6_packet_cleanup(void);
 
 extern int			ip6_datagram_connect(struct sock *sk, 
-						     struct sockaddr *addr, int addr_len);
+						     struct sockaddr *addr,
+						     int addr_len, int flags);
 
 extern int 			ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len);
 extern void			ipv6_icmp_error(struct sock *sk, struct sk_buff *skb, int err, __be16 port,
diff --git a/include/net/sock.h b/include/net/sock.h
index 43e3cd9..d70b110 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -522,8 +522,8 @@ struct proto {
 	void			(*close)(struct sock *sk, 
 					long timeout);
 	int			(*connect)(struct sock *sk,
-				       ...
To: <herbert@...>
Cc: <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 2:30 am

From: Herbert Xu <herbert@gondor.apana.org.au>

We made an explicit decision not to do things this way.

Non-blocking has a meaning dependent upon the xfrm_larval_drop sysctl
setting, and this is across the board.  If xfrm_larval_drop is zero,
non-blocking semantics do not extend to IPSEC route resolution,
otherwise it does.

If he sets this sysctl to "1" as I detailed in my reply, he'll
get the behavior he wants.
--
To: David Miller <davem@...>
Cc: <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 2:51 am

Does anybody actually need the 0 setting? What would we break if
the default became 1?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To: Herbert Xu <herbert@...>
Cc: David Miller <davem@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 2:39 pm

I'd strongly suggest doing so. AFAIK, behaviour of connect() on nonblocking 
sockets is quite well defined in POSIX. If this is changed for some IP 
sockets, event-driven applications will randomly and subtly break.

Stefan
--
To: <stefan@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 10:25 pm

From: Stefan Rompf <stefan@loplof.de>

You are entitled to your opinion.

POSIX says nothing about the semantics of route resolution.

If this was such a clear cut case we'd have changed things
a long time ago, but it isn't so don't pretend this is the
case.
--
To: David Miller <davem@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 4:49 am

Of course not. Applications must not care about what happens at the transport 

... and as O_CREAT on open() isn't specifically documented to apply to 
filenames starting with 'a', it is perfectly normal that "echo x >ash" always 

Ok, irony aside. Just have a look at
http://www.opengroup.org/onlinepubs/009695399/functions/connect.html (I hope 
009695399 is not a personalization cookie ;-)

"If the connection cannot be established immediately and O_NONBLOCK is set for 
the file descriptor for the socket, connect() shall fail and set errno to 
[EINPROGRESS], but the connection request shall not be aborted, and the 
connection shall be established asynchronously."
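
[Editor's sketch] The POSIX wording quoted above corresponds to the usual userspace pattern below: set O_NONBLOCK, call connect(), and treat EINPROGRESS as "handshake continues in the background". This is only an illustration of what an application expects, not code from the thread; start_nonblocking_connect() is a hypothetical helper name.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Start a non-blocking TCP connect to a dotted-quad IPv4 address.
 * Returns the socket fd if the connect succeeded at once or is in
 * progress (*in_progress is then set to 1), or -1 with errno set.
 * Hypothetical helper, for illustration only.
 */
int start_nonblocking_connect(const char *ip, uint16_t port, int *in_progress)
{
    struct sockaddr_in addr;
    int fd, flags, saved;

    assert(ip != NULL && in_progress != NULL);
    *in_progress = 0;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        errno = EINVAL;         /* not a valid IPv4 literal */
        return -1;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Same F_GETFL / F_SETFL sequence as in the strace output. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        return fd;              /* connected immediately */

    if (errno == EINPROGRESS) {
        *in_progress = 1;       /* completion is reported later */
        return fd;
    }

    saved = errno;              /* immediate hard failure */
    close(fd);
    errno = saved;
    return -1;
}
```

The caller is expected to get control back right away in every branch; the report at the top of this thread is about connect() sleeping inside the kernel instead of reaching the EINPROGRESS branch.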


Well, the only reason this doesn't break on a daily basis is because the code 
isn't in the kernel that long and not many people run applications on an 
IPSEC gateway. This will change if kernel based IPSEC is used for roadwarrior 
connections or dnssec based anonymous IPSEC someday. Trust me, you will 
revert this misbehaviour in -stable then.

For some real life applications that break when nonblocking connect() blocks, 
please look at e.g. squid or Mozilla Firefox.

Stefan
--
To: <stefan@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 4:53 am

From: Stefan Rompf <stefan@loplof.de>

They are, but the context in which they apply is vague.

I can equally generate examples where the non-blocking behavior you
are a proponent of would break non-blocking UDP apps during a
sendmsg() call when we hit IPSEC resolution.  Yet similar language on
blocking semantics exists for sendmsg() in the standards.

The world is shades of gray, implying anything else is foolhardy and

I use IPSEC every single day in this fashion, and I haven't.
--
To: David Miller <davem@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 6:56 am

I am not a good enough kernel hacker to exactly understand the code flow in 
udp_sendmsg(). However, it seems that it first checks destination validity 
via ip_route_output_flow() and queues the message then. The sendmsg() 
documentation only talks about buffer space. I can see your dilemma.

The reason why I'm pushing this issue another time is that I know quite a 
bit about system level application development. A very typical design pattern 
for non-naive single or multi threaded programs is that they set all 
communication sockets to be nonblocking and use a select()/epoll() based loop 
to dispatch IO. This often includes initiating a TCP connect() and 
asynchronously waiting for it to finish or fail from the main loop.
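
[Editor's sketch] The "wait asynchronously from the main loop" step described above is commonly written with poll() for writability followed by a SO_ERROR query. The version below handles a single descriptor for brevity; a real event loop would poll many descriptors at once. finish_nonblocking_connect() is a hypothetical helper name, not something from the thread.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/*
 * Wait up to timeout_ms for a non-blocking connect on fd to complete.
 * Returns 0 if the socket is connected, -1 with errno set otherwise
 * (ETIMEDOUT if poll() expired, or the pending socket error).
 * Hypothetical helper, for illustration only.
 */
int finish_nonblocking_connect(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int err = 0;
    socklen_t len = sizeof(err);
    int n;

    assert(fd >= 0);

    /* A signal restarts the wait with the full timeout (kept simple). */
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);

    if (n < 0)
        return -1;
    if (n == 0) {
        errno = ETIMEDOUT;
        return -1;
    }

    /* Writable or error: SO_ERROR tells which. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

The timeout here only bounds the wait in poll(). It cannot help if the earlier connect() call itself never returns, which is the failure mode being argued about.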

The dangerous situation here is that in 99% of all cases things will just work 
because the phase 2 SA exists. In 0.8%, the SA will be established in <1 sec. 
However, in the rest of time the server application that you have considered 
to be stable will end up sleeping with all threads in a connect() call that 

Even though I consider programmers that ignore the result code on a 
nonblocking UDP sendmsg() fools, I agree. Maybe the best compromise is what 
Herbert Xu suggested in <20071205001230.GA11391@gondor.apana.org.au> in this 
thread: At least, for connect() O_NONBLOCK is ALWAYS respected. Because this 
is where the chance for breakage is highest.
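
[Editor's sketch] The remark about checking the result code of a non-blocking UDP send can be illustrated as follows. The thread does not say which error, if any, the kernel reports while an IPsec SA is unresolved, so this sketch only separates "sent", "would block" and "other error". send_dgram_nonblock() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Send one datagram on a connected datagram socket without blocking.
 * Returns 0 if the datagram was handed to the kernel, 1 if the call
 * would have blocked (EAGAIN/EWOULDBLOCK), -1 on any other error
 * with errno left as set by send().
 * Hypothetical helper, for illustration only.
 */
int send_dgram_nonblock(int fd, const void *buf, size_t len)
{
    ssize_t n;

    assert(fd >= 0 && buf != NULL);

    n = send(fd, buf, len, MSG_DONTWAIT);
    if (n >= 0)
        return 0;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 1;               /* caller should retry from its event loop */
    return -1;
}
```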

Stefan
--
To: <stefan@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 7:13 am

From: Stefan Rompf <stefan@loplof.de>

I meant whether "immediately" means in reference to socket
state or includes auxiliary things like route lookups.

When you do a non-blocking write on a socket, things like
memory allocations can block, potentially for a long time.
It is an example where there are definite boundaries to where
the non-blocking'ness applies.

And therefore it is not so cut and dry and you present this

And that connect() call can hang for a long time due to any memory
allocation done in the connect() path.

You are not avoiding blocking by setting O_NONBLOCK on the socket, it
is quite foolhardy to think that it does so unilaterally.

And that's why this is a grey area.  Why is waiting for memory
allocation on a O_NONBLOCK socket OK but waiting for IPSEC route
resolution is not?
--
To: David Miller <davem@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 7:35 am

Because you just will put enough RAM modules into your server when setting up a 
scalable system. Local resource, manageable by the admin. What you cannot 
control in many cases is the network connection to the remote node. Simon 
Arlott has been talking about an 8 hour network outage.

Stefan
--
To: <stefan@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 7:39 am

From: Stefan Rompf <stefan@loplof.de>

This suggestion is avoiding the important semantic issue, and
won't lead to a real discussion of the core problem.

--
To: David Miller <davem@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 8:30 am

When writing applications for unix operating systems, it has been known for ages 
that stuff can be swapped out and that even things like memory accesses can 
block. So it does not really surprise when a system call has to wait for 
memory - just imagine the kernel code for connect() could be and has been 
swapped out.

Even with moderate swap activity, this memory should be available in much less 
than one second. If on the other hand the system is already thrashing, it is 
no difference if it does so within connect() or while reaching the connect() 
system call in the application flow.

Btw, this is where admin responsibility to size their systems kicks in.

So where I would draw the line: connect() is clearly a network related 
function. Therefore, if a nonblocking connect() has to sleep for a local, 
controllable resource like memory to become available, this is ok. Maybe it 
shouldn't wait for a 128MB buffer if someone configured such an abomination, 
haven't thought deeply about that. But when being told not to wait for the 
connection to complete, it should never ever wait for another network related 
activity like IPSEC SA setup to complete, especially not for hours.

IMHO this is what developers expect, and is also consistent with the fact that 
POSIX does not define O_NONBLOCK behaviour for local files.

Stefan
--
To: <stefan@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 9:55 am

From: Stefan Rompf <stefan@loplof.de>

You keep ignoring the fact that, as Herbert and I discussed, not
blocking for IPSEC resolution will make some connect() cases fail that
would otherwise not fail.

There are two sides to this issue, and we need to consider them
both.

Long term a resolution-packet-queue provides a solution that handles
both angles correctly, but we don't have that code yet.
--
To: David Miller <davem@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 10:31 am

As far as I've understood Herbert's patch, at least TCP connect can be fixed 
so that non blocking connect() will neither fail nor block, but just use the 
first or second retransmission of the SYN packet to complete the handshake 
after IPSEC is up. As this will fix the common breakage case, just do so and 
keep UDP sendmsg() etc for later.

You are looking at this issue too much from the kernel side. Admittedly, this is 
a corner case, but therefore nobody cares if connection completion takes two 
SYNs and three seconds instead of one SYN and maybe two seconds. But 
application developers and users will validly complain if their applications 
block unexpectedly for hours just because some random provider has a network 
outage and IPSEC cannot come up.

Stefan
--
To: <stefan@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Thursday, December 6, 2007 - 11:20 pm

From: Stefan Rompf <stefan@loplof.de>

If IPSEC takes a long time to resolve, and we don't block, the
connect() can hard fail (we will just keep dropping the outgoing SYN
packet send attempts, eventually hitting the retry limit) in cases
where if we did block it would not fail (because we wouldn't send
the first SYN until IPSEC resolved).
--
To: David Miller <davem@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Friday, December 7, 2007 - 5:29 am

David - I'm aware of this, the discussion is which behaviour is ok. Let's go 
back to a real life example. I've already researched that the squid web proxy 
has a poll() based main loop doing nonblocking connects, maybe with multiple 
threads.

Situation: One user wants to access a web page that needs IPSEC. The SA takes 
30 seconds to come up.

a) Non-blocking connect is respected: SYN packets during the first 30 seconds 
will be dropped as you said. Connection can be completed on the next SYN 
retry (timeout in linux: 3 minutes). During this time, the 500 other users 
can continue to browse using the proxy.

b) Non-blocking connect is ignored during IPSEC resolving as you advocate it: 
Connection for the one user can be completed immediately after IPSEC comes up. 
That's the pro. However, until then, the other 500 proxy users CANNOT ACCESS 
THE WEB because squid's threads are stuck in connect()s on sockets they 
configured not to block. If the IPSEC SA never resolves due to some network 
outage, squid will sleep forever or until an admin configures it not to 
connect to the address in question and restarts it.

Don't you realize how broken this behaviour is? Can you give me ONE example of 
an application that works better with b) and why this outweighs the problems 
it creates for everybody else?

Even the DNS example you posted in  
<20071204.231200.117152338.davem@davemloft.net> is wrong because the second 
server will never be queried if the kernel puts the process into a coma while the 
IPSEC SA to the first server cannot be resolved.

Stefan
--
To: <herbert@...>
Cc: <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 3:12 am

From: Herbert Xu <herbert@gondor.apana.org.au>

I bet there are UDP apps out there that would break if we
didn't do this.

Actually, consider even a case like DNS.  Let's say the timeout
is set to 2 seconds or something and you have 3 DNS servers
listed, on different IPSEC destinations, in your resolv.conf

Each IPSEC route that isn't currently resolved will cause packet loss
of the DNS lookup request with xfrm_larval_drop set to '1'.

If all 3 need to be resolved, the DNS lookup will fully fail
which defeats the purpose of listing 3 servers for redundancy
don't you think? :-)

As much as I even personally prefer the xfrm_larval_drop=1
behavior, it is cases like the above that keep me from jumping at
making it the default.

Arguably, potentially blocking forever (which is what can easily
happen with xfrm_larval_drop=0 if your IPSEC daemon cannot resolve the
IPSEC path for whatever reason) is worse than the above, but the
other cases are still something to consider as well.
--
To: David Miller <davem@...>
Cc: <herbert@...>, <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 2:42 pm

In your example, the DNS server might actually stop responding to other 
clients while waiting for the (expected to be non-blocking) connect() to 
return. This is much much worse.

Stefan
--
To: David Miller <davem@...>
Cc: <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 3:16 am

Right.  This is definitely bad for protocols without a retransmission
mechanism.

However, is the 0 setting ever useful for TCP and in particular, TCP's
connect(2) call? Perhaps we can just make that one always drop.

Well, until someone implements queueing to fix all of this properly
that is :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To: <herbert@...>
Cc: <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 3:34 am

From: Herbert Xu <herbert@gondor.apana.org.au>

TCP has some built-in assumptions about characteristics of
internet links and what constitutes a timeout which is "too long"
and should thus result in a full connection failure.

IPSEC changes this because of IPSEC route resolution via
ISAKMP.

With this in mind I can definitely see people preferring
the "block until IPSEC resolves" behavior, especially for
something like, say, periodic remote backups and stuff like
that where you really want the thing to just sit and wait
for the connect() to succeed instead of failing.
--
To: David Miller <davem@...>
Cc: <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 3:39 am

Hmm, but connect(2) should succeed in that case thanks to the
blackhole route, no? The subsequent SYNs will then be dropped
until the IPsec SAs are in place.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To: <herbert@...>
Cc: <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 5:55 am

From: Herbert Xu <herbert@gondor.apana.org.au>

If it hits sysctl_tcp_syn_retries SYN attempts, the connect will hard
fail.
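
[Editor's sketch] To put rough numbers on the retry limit, the arithmetic below models SYN retransmission as a timeout that doubles after each unanswered SYN. The figures used in the note after the code (5 retries, 3 second initial timeout) are assumptions about kernel defaults of that era, not values stated in this thread, and the function names are hypothetical.

```c
#include <assert.h>

/*
 * Time in seconds, measured from connect(), at which SYN number n is
 * sent (n = 0 is the initial SYN). Assumes the timeout starts at
 * init_rto seconds and doubles after each unanswered SYN.
 */
unsigned int syn_send_time(unsigned int n, unsigned int init_rto)
{
    unsigned int t = 0, rto = init_rto, i;

    assert(n <= 16);            /* keep the doubling well inside 32 bits */
    for (i = 0; i < n; i++) {
        t += rto;
        rto *= 2;
    }
    return t;
}

/*
 * Time at which connect() gives up: `retries` retransmissions are sent
 * and the timeout following the last one expires unanswered.
 */
unsigned int syn_give_up_time(unsigned int retries, unsigned int init_rto)
{
    return syn_send_time(retries + 1, init_rto);
}

/*
 * Send time of the first SYN that goes out at or after the moment the
 * IPsec SA is ready (sa_ready seconds after connect()), or -1 if every
 * SYN would have been sent, and dropped, before that.
 */
long first_syn_after_sa(unsigned int sa_ready, unsigned int retries,
                        unsigned int init_rto)
{
    unsigned int n;

    for (n = 0; n <= retries; n++) {
        unsigned int t = syn_send_time(n, init_rto);
        if (t >= sa_ready)
            return (long)t;
    }
    return -1;
}
```

Under those assumed values, syn_give_up_time(5, 3) is 189 seconds, first_syn_after_sa(30, 5, 3) is 45 (an SA that needs 30 seconds is caught by a later retransmission), and first_syn_after_sa(8 * 3600, 5, 3) is -1 (an 8 hour outage ends in a hard failure).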

--
To: David Miller <davem@...>
Cc: <simon@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 5:57 am

Right.  Let's just forget about this until we have a queueing system :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To: <simon@...>
Cc: <linux-kernel@...>, <netdev@...>
Date: Wednesday, December 5, 2007 - 2:06 am

From: Simon Arlott <simon@fire.lp0.eu>

If you don't like this behavior:

	echo "1" >/proc/sys/net/core/xfrm_larval_drop

but those initial connection setup packets will be dropped while
waiting for the IPSEC route to be resolved, and in your 8 hour case
the TCP connect will fail.

Anyways, the choice for different behavior is there, select it
to suit your tastes.
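
[Editor's sketch] An application or test harness that depends on one behaviour or the other could check the current setting before relying on it. The path below is the one from the echo command in this message; read_int_sysctl() is a hypothetical helper name, and the file only exists on kernels that provide this sysctl.

```c
#include <assert.h>
#include <stdio.h>

/* Path taken from the echo command in the message above. */
#define XFRM_LARVAL_DROP_PATH "/proc/sys/net/core/xfrm_larval_drop"

/*
 * Read a single integer from a /proc/sys style file.
 * Returns 0 and stores the value on success, -1 if the file cannot be
 * opened or does not start with an integer.
 * Hypothetical helper, for illustration only.
 */
int read_int_sysctl(const char *path, int *value)
{
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling read_int_sysctl(XFRM_LARVAL_DROP_PATH, &v) and finding v equal to 0 would mean, per the description earlier in the thread, that non-blocking semantics do not extend to IPsec route resolution on that machine.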
--