If I have an IPsec rule like:
spdadd 192.168.7.8 1.2.3.4 any -P out ipsec esp/transport//require;
(i.e. a remote host 1.2.3.4 which will not respond)
Then any attempt to communicate with 1.2.3.4 will block, even when using non-blocking sockets:
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
fcntl64(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <-- non-blocking socket
connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("1.2.3.4")}, 16 <-- blocked connect()
[277657.564773] netcat S b06bcf20 0 9450 9449
[277657.564785] c8d51d28 00200046 00200046 b06bcf20 00200286 c8d51d14 00200286 c8d51d84
[277657.564814] b06bcf20 c8d51d28 b013680b c8d51d84 eeae8800 c8d51d78 c8d51dd0 b04d3fc5
[277657.564843] c8d51da4 00000002 00000001 ede87284 00000002 00000040 e9318ac0 db3f20a0
[277657.564874] Call Trace:
[277657.564881] [<b04d3fc5>] __xfrm_lookup+0x2f5/0x510
[277657.564905] [<b0494f9e>] ip_route_output_flow+0x4e/0x80
[277657.564919] [<b04af303>] tcp_v4_connect+0x183/0x6d0
[277657.564934] [<b04beaf2>] inet_stream_connect+0x122/0x1c0
[277657.564949] [<b0471c8e>] sys_connect+0x9e/0xd0
[277657.564963] [<b0472785>] sys_socketcall+0xa5/0x230
[277657.564973] [<b01042ba>] syscall_call+0x7/0xb
[277657.564984] =======================
I had a process using non-blocking sockets stuck in connect() for over 8 hours because of this...
00002630 <__xfrm_lookup>:
...
290b: b8 00 00 00 00 mov $0x0,%eax
290c: R_386_32 km_waitq
2910: e8 fc ff ff ff call 2911 <__xfrm_lookup+0x2e1>
2911: R_386_PC32 add_wait_queue
2915: a1 00 00 00 00 mov 0x0,%eax
2916: R_386_32 per_cpu__current_task
291a: c7 00 01 00 00 00 movl $0x1,(%eax)
...

This patch should help.

[INET]: Export non-blocking flags to proto connect call

Previously we made connect(2) block on IPsec SA resolution. This is good in general but not desirable for non-blocking sockets.

To fix this properly we'd need to implement the larval IPsec dst stuff that we talked about. For now let's just revert to the old behaviour on non-blocking sockets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/include/net/ip.h b/include/net/ip.h
index 83fb9f1..9b4ed7e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -121,7 +121,8 @@ extern void ip_flush_pending_frames(struct sock *sk);

 /* datagram.c */
 extern int ip4_datagram_connect(struct sock *sk,
-                                struct sockaddr *uaddr, int addr_len);
+                                struct sockaddr *uaddr,
+                                int addr_len, int flags);

 /*
  * Map a multicast IP onto multicast MAC for type Token Ring.
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index e90f962..2686850 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -567,7 +567,8 @@ extern void ipv6_packet_init(void);
 extern void ipv6_packet_cleanup(void);

 extern int ip6_datagram_connect(struct sock *sk,
-                                struct sockaddr *addr, int addr_len);
+                                struct sockaddr *addr,
+                                int addr_len, int flags);

 extern int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len);
 extern void ipv6_icmp_error(struct sock *sk, struct sk_buff *skb, int err, __be16 port,
diff --git a/include/net/sock.h b/include/net/sock.h
index 43e3cd9..d70b110 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -522,8 +522,8 @@ struct proto {
 void (*close)(struct sock *sk, long timeout);
 int (*connect)(struct sock *sk,
- ...
From: Herbert Xu <herbert@gondor.apana.org.au> We made an explicit decision not to do things this way. Non-blocking has a meaning dependent upon the xfrm_larval_drop sysctl setting, and this is across the board. If xfrm_larval_drop is zero, non-blocking semantics do not extend to IPSEC route resolution, otherwise they do. If he sets this sysctl to "1" as I detailed in my reply, he'll get the behavior he wants. --
Does anybody actually need the 0 setting? What would we break if the default became 1? Cheers, -- Herbert Xu <herbert@gondor.apana.org.au> --
I'd strongly suggest doing so. AFAIK, behaviour of connect() on nonblocking sockets is quite well defined in POSIX. If this is changed for some IP sockets, event-driven applications will randomly and subtly break. Stefan --
From: Stefan Rompf <stefan@loplof.de> You are entitled to your opinion. POSIX says nothing about the semantics of route resolution. If this was such a clear cut case we'd have changed things a long time ago, but it isn't so don't pretend this is the case. --
Of course not. Applications must not care about what happens at the transport ... and as O_CREAT on open() isn't specifically documented to apply to filenames starting with 'a', it is perfectly normal that "echo x >ash" always Ok, irony aside. Just have a look at http://www.opengroup.org/onlinepubs/009695399/functions/connect.html (I hope 009695399 is not a personalization cookie ;-) "If the connection cannot be established immediately and O_NONBLOCK is set for the file descriptor for the socket, connect() shall fail and set errno to [EINPROGRESS], but the connection request shall not be aborted, and the connection shall be established asynchronously." Well, the only reason this doesn't break on a daily basis is that the code hasn't been in the kernel that long and not many people run applications on an IPSEC gateway. This will change if kernel based IPSEC is used for roadwarrior connections or dnssec based anonymous IPSEC someday. Trust me, you will revert this misbehaviour in -stable then. For some real life applications that break when nonblocking connect() blocks, please look at, for example, squid or Mozilla Firefox. Stefan --
From: Stefan Rompf <stefan@loplof.de> They are, but the context in which they apply is vague. I can equally generate examples where the non-blocking behavior you are a proponent of would break non-blocking UDP apps during a sendmsg() call when we hit IPSEC resolution. Yet similar language on blocking semantics exists for sendmsg() in the standards. The world is shades of gray, implying anything else is foolhardy and I use IPSEC every single day in this fashion, and I haven't. --
I am not a good enough kernel hacker to exactly understand the code flow in udp_sendmsg(). However, it seems that it first checks destination validity via ip_route_output_flow() and queues the message then. The sendmsg() documentation only talks about buffer space. I can see your dilemma. The reason why I'm pushing this issue another time is that I know quite a bit about system level application development. A very typical design pattern for non-naive single or multi threaded programs is that they set all communication sockets to be nonblocking and use a select()/epoll() based loop to dispatch IO. This often includes initiating a TCP connect() and asynchronously waiting for it to finish or fail from the main loop. The dangerous situation here is that in 99% of all cases things will just work because the phase 2 SA exists. In 0.8%, the SA will be established in <1 sec. However, in the rest of the time the server application that you have considered to be stable will end up sleeping with all threads in a connect() call that Even though I consider programmers that ignore the result code on a nonblocking UDP sendmsg() fools, I agree. Maybe the best compromise is what Herbert Xu suggested in <20071205001230.GA11391@gondor.apana.org.au> in this thread: At least, for connect() O_NONBLOCK is ALWAYS respected. Because this is where the chance for breakage is highest. Stefan --
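The design pattern described above can be sketched roughly as follows. This is a minimal illustration, not code from squid or any other application mentioned in the thread; the helper names start_connect() and finish_connect() are hypothetical. It assumes the behaviour the POSIX text quoted earlier describes: connect() on an O_NONBLOCK socket returns at once with EINPROGRESS, and the outcome is collected later from the event loop via SO_ERROR.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: start a non-blocking TCP connect.
 * Returns the socket on success or when the connect is in progress,
 * -1 on immediate failure (errno is preserved). */
int start_connect(const struct sockaddr_in *addr)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* The application expects this call to return immediately. */
    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0 &&
        errno != EINPROGRESS) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Hypothetical helper: wait up to timeout_ms for the connect to finish.
 * Returns 0 on success, otherwise an errno value
 * (ETIMEDOUT if poll() itself timed out). */
int finish_connect(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return errno;
    if (n == 0)
        return ETIMEDOUT;

    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;
    return err;
}
```

A real event loop would add the socket to its own poll()/epoll set instead of calling finish_connect() directly. The point of the sketch is that the timeout lives in the application: if the kernel sleeps inside connect(), as in the trace at the top of the thread, the application never reaches poll() and its timeout never applies.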
From: Stefan Rompf <stefan@loplof.de> I meant whether "immediately" means in reference to socket state or includes auxiliary things like route lookups. When you do a non-blocking write on a socket, things like memory allocations can block, potentially for a long time. It is an example where there are definite boundaries to where the non-blocking'ness applies. And therefore it is not so cut and dry and you present this And that connect() call can hang for a long time due to any memory allocation done in the connect() path. You are not avoiding blocking by setting O_NONBLOCK on the socket, it is quite foolhardy to think that it does so unilaterally. And that's why this is a grey area. Why is waiting for memory allocation on an O_NONBLOCK socket OK but waiting for IPSEC route resolution is not? --
Because you will just put enough RAM modules into your server when setting up a scalable system. Local resource, manageable by the admin. What you cannot control in many cases is the network connection to the remote node. Simon Arlott has been talking about an 8 hour network outage. Stefan --
From: Stefan Rompf <stefan@loplof.de> This suggestion is avoiding the important semantic issue, and won't lead to a real discussion of the core problem. --
When writing applications for unix operating systems, it has been known for ages that stuff can be swapped out and that even things like memory accesses can block. So it is not really surprising when a system call has to wait for memory - just imagine the kernel code for connect() could be and has been swapped out. Even with moderate swap activity, this memory should be available in much less than one second. If on the other hand the system is already thrashing, it makes no difference if it does so within connect() or while reaching the connect() system call in the application flow. Btw, this is where admin responsibility to size their systems kicks in. So where I would draw the line: connect() is clearly a network related function. Therefore, if a nonblocking connect() has to sleep for a local, controllable resource like memory to become available, this is ok. Maybe it shouldn't wait for a 128MB buffer if someone configured such an abomination, haven't thought deeply about that. But when being told not to wait for the connection to complete, it should never ever wait for another network related activity like IPSEC SA setup to complete, especially not for hours. IMHO this is what developers expect, and is also consistent with the fact that POSIX does not define O_NONBLOCK behaviour for local files. Stefan --
From: Stefan Rompf <stefan@loplof.de> You keep ignoring the fact that, as Herbert and I discussed, not blocking for IPSEC resolution will make some connect() cases fail that would otherwise not fail. There are two sides to this issue, and we need to consider them both. Long term a resolution-packet-queue provides a solution that handles both angles correctly, but we don't have that code yet. --
As far as I've understood Herbert's patch, at least TCP connect can be fixed so that non blocking connect() will neither fail nor block, but just use the first or second retransmission of the SYN packet to complete the handshake after IPSEC is up. As this will fix the common breakage case, just do so and keep UDP sendmsg() etc for later. You are looking at this issue too much from the kernel side. Admittedly, this is a corner case, but therefore nobody cares if connection completion takes two SYNs and three seconds instead of one SYN and maybe two seconds. But application developers and users will validly complain if their applications block unexpectedly for hours just because some random provider has a network outage and IPSEC cannot come up. Stefan --
From: Stefan Rompf <stefan@loplof.de> If IPSEC takes a long time to resolve, and we don't block, the connect() can hard fail (we will just keep dropping the outgoing SYN packet send attempts, eventually hitting the retry limit) in cases where if we did block it would not fail (because we wouldn't send the first SYN until IPSEC resolved). --
David - I'm aware of this, the discussion is which behaviour is ok. Let's go back to a real life example. I've already researched that the squid web proxy has a poll() based main loop doing nonblocking connects, maybe with multiple threads. Situation: One user wants to access a web page that needs IPSEC. The SA takes 30 seconds to come up. a) Non-blocking connect is respected: SYN packets during the first 30 seconds will be dropped as you said. Connection can be completed on the next SYN retry (timeout in linux: 3 minutes). During this time, the 500 other users can continue to browse using the proxy. b) Non-blocking connect is ignored during IPSEC resolving as you advocate it: Connection for the one user can be completed immediately after IPSEC comes up. That's the pro. However, until then, the other 500 proxy users CANNOT ACCESS THE WEB because squid's threads are stuck in connect()s on sockets they configured not to block. If the IPSEC SA never resolves due to some network outage, squid will sleep forever or until an admin configures it not to connect to the address in question and restarts it. Don't you realize how broken this behaviour is? Can you give me ONE example of an application that works better with b) and why this outweighs the problems it creates for everybody else? Even the DNS example you posted in <20071204.231200.117152338.davem@davemloft.net> is wrong because the second server will never be queried if the kernel puts the process into a coma while the IPSEC SA to the first server cannot be resolved. Stefan --
From: Herbert Xu <herbert@gondor.apana.org.au> I bet there are UDP apps out there that would break if we didn't do this. Actually, consider even a case like DNS. Let's say the timeout is set to 2 seconds or something and you have 3 DNS servers listed, on different IPSEC destinations, in your resolv.conf. Each IPSEC route that isn't currently resolved will cause packet loss of the DNS lookup request with xfrm_larval_drop set to '1'. If all 3 need to be resolved, the DNS lookup will fully fail, which defeats the purpose of listing 3 servers for redundancy, don't you think? :-) As much as I even personally prefer the xfrm_larval_drop=1 behavior, it's cases like the above that keep me from jumping at making it the default. Arguably, potentially blocking forever (which is what can easily happen with xfrm_larval_drop=0 if your IPSEC daemon cannot resolve the IPSEC path for whatever reason) is worse than the above, but the other cases are still something to consider as well. --
In your example, the DNS server might actually stop responding to other clients while waiting for the (expected to be non-blocking) connect() to return. This is much much worse. Stefan --
Right. This is definitely bad for protocols without a retransmission mechanism. However, is the 0 setting ever useful for TCP and in particular, TCP's connect(2) call? Perhaps we can just make that one always drop. Well, until someone implements queueing to fix all of this properly that is :) Cheers, -- Herbert Xu <herbert@gondor.apana.org.au> --
From: Herbert Xu <herbert@gondor.apana.org.au> TCP has some built-in assumptions about characteristics of internet links and what constitutes a timeout which is "too long" and should thus result in a full connection failure. IPSEC changes this because of IPSEC route resolution via ISAKMP. With this in mind I can definitely see people preferring the "block until IPSEC resolves" behavior, especially for something like, say, periodic remote backups and stuff like that where you really want the thing to just sit and wait for the connect() to succeed instead of failing. --
Hmm, but connect(2) should succeed in that case thanks to the blackhole route, no? The subsequent SYNs will then be dropped until the IPsec SAs are in place. Cheers, -- Herbert Xu <herbert@gondor.apana.org.au> --
From: Herbert Xu <herbert@gondor.apana.org.au> If it hits sysctl_tcp_syn_retries SYN attempts, the connect will hard fail. --
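The trade-off in this exchange comes down to SYN retransmission arithmetic, which can be sketched as below. This is an illustration under stated assumptions, not the kernel's implementation: it assumes a 3 second initial retransmission timeout that doubles on each retry up to a cap, and it takes the retry count as a parameter standing in for sysctl_tcp_syn_retries. The function names and the example values (3 s initial timeout, 5 retries, 120 s cap) are assumptions to be checked against the kernel actually in use.

```c
/* Seconds after the first SYN at which connect() gives up, assuming the
 * timeout starts at initial_rto, doubles after each transmission and is
 * capped at rto_max. One initial SYN plus syn_retries retransmissions. */
int syn_give_up_time(int initial_rto, int syn_retries, int rto_max)
{
    int t = 0;
    int rto = initial_rto;

    for (int i = 0; i <= syn_retries; i++) {
        t += rto;
        rto = (rto * 2 > rto_max) ? rto_max : rto * 2;
    }
    return t;
}

/* Send time (in seconds) of the first SYN transmitted at or after
 * sa_ready, the moment the IPsec SA becomes usable. Earlier SYNs are
 * assumed to be dropped. Returns -1 if every SYN is sent too early,
 * which is the hard-fail case described above. */
int first_syn_after(int sa_ready, int initial_rto, int syn_retries,
                    int rto_max)
{
    int t = 0;
    int rto = initial_rto;

    for (int i = 0; i <= syn_retries; i++) {
        if (t >= sa_ready)
            return t;
        t += rto;
        rto = (rto * 2 > rto_max) ? rto_max : rto * 2;
    }
    return -1;
}
```

With the assumed values, syn_give_up_time(3, 5, 120) is 189 seconds, which matches the "3 minutes" figure quoted in the squid example. An SA that takes 30 seconds to come up is picked up by the SYN sent at 45 seconds (first_syn_after(30, 3, 5, 120)), while an 8 hour outage (28800 seconds) yields -1: the connect fails instead of blocking.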
Right. Let's just forget about this until we have a queueing system :) Cheers, -- Herbert Xu <herbert@gondor.apana.org.au> --
From: Simon Arlott <simon@fire.lp0.eu> If you don't like this behavior: echo "1" >/proc/sys/net/core/xfrm_larval_drop but those initial connection setup packets will be dropped while waiting for the IPSEC route to be resolved, and in your 8 hour case the TCP connect will fail. Anyways, the choice for different behavior is there, select it to suit your tastes. --
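Before changing the setting it can help to check which mode a machine is currently in. The following is a small sketch of reading an integer sysctl from its /proc/sys file; read_sysctl_int() is a hypothetical helper, and the path is the one given in the message above.

```c
#include <stdio.h>

/* Hypothetical helper: read an integer value from a /proc/sys file.
 * Returns 0 and stores the value in *value on success, -1 if the file
 * cannot be opened or does not start with an integer. */
int read_sysctl_int(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    int ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling read_sysctl_int("/proc/sys/net/core/xfrm_larval_drop", &v) should leave 0 in v when connect() may block on IPsec resolution and 1 when packets are dropped instead, as described in this thread. A return of -1 most likely means the kernel does not provide the sysctl.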
