On Thu, 29 Jan 2009 10:35:44 +0000
The problem is only in the net-next tree (not mainline 2.6.29-rcX).
Bisected down to this commit is the problem:
a9d8f9110d7e953c2f2b521087a4179677843c2a is first bad commit
commit a9d8f9110d7e953c2f2b521087a4179677843c2a
Author: Evgeniy Polyakov <zbr@ioremap.net>
Date: Mon Jan 19 16:46:02 2009 -0800
inet: Allowing more than 64k connections and heavily optimize bind(0) time.
With simple extension to the binding mechanism, which allows to bind more
than 64k sockets (or smaller amount, depending on sysctl parameters),
we have to traverse the whole bind hash table to find out empty bucket.
And while it is not a problem for example for 32k connections, bind()
completion time grows exponentially (since after each successful binding
we have to traverse one bucket more to find empty one) even if we start
each time from random offset inside the hash table.
So, when hash table is full, and we want to add another socket, we have
to traverse the whole table no matter what, so effectivelly this will be
the worst case performance and it will be constant.
Attached picture shows bind() time depending on number of already bound
sockets.
Not sure why but it breaks VNC, see attached screenshot.Hi. Any chance to get a bit more information about what this console does? Like how it binds or get a strace? Will you be able to run a debug patch which will dump lots of info into the dmesg and slow things down a bit? -- Evgeniy Polyakov --
The virt-manager console is basically just a plain old boring VNC client. It uses GTK-VNC to establish its VNC network connection, and that doesn't do anything unusual AFAIK. We use getaddrinfo() to resolve the hostname, and then try each of its results in turn, until we succesfully connect to the VNC server. We don't explicitly bind() to the client port, just let the kernel pick it for us. The code in question, is the "gvnc_open_host" method from gvnc.c, which starts at about line 2910 http://freehg.org/u/aliguori/gtk-vnc.hg/file/d68935d582f0/src/gvnc.c Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| --
So it is not explicit bind call, but port autoselection in the connect(). Can you check what errno is returned? Did I understand it right, that connect fails, you try different address, but then suddenly all those sockets become 'alive'? -- Evgeniy Polyakov --
Yes, I think a good strace vs. a bad strace would be really helpful in these cases. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
On Fri, 30 Jan 2009 23:53:37 +1100
I have the strace but it comes up no different.
What is different is that in the broken case (net-next), I see
IPV6 being used:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 23769 0 ::ffff:127.0.0.1:5900 ::ffff:127.0.0.1:55987
ESTAB 0 0 127.0.0.1:55987 127.0.0.1:5900
and in the working case (2.6.29-rc3), IPV4 is being used
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.1:58894 127.0.0.1:5901
ESTAB 0 0 127.0.0.1:5901 127.0.0.1:58894
Relevant bits of strace in broken case are:
7276 socket(PF_NETLINK, SOCK_RAW, 0) = 21
7276 bind(21, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
7276 getsockname(21, {sa_family=AF_NETLINK, pid=7276, groups=00000000}, [66387309494284]) = 0
7276 sendto(21, "\24\0\0\0\26\0\1\3\353<\203I\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
7276 recvmsg(21, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"0\0\0\0\24\0\2\0\353<\203Il\34\0\0\2\10\200\376\1\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 168
7276 recvmsg(21, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"@\0\0\0\24\0\2\0\353<\203Il\34\0\0\n\200\200\376\1\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 256
7276 recvmsg(21, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\353<\203Il\34\0\0\0\0\0\0\1\0\0\0\24"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
7276 close(21) = 0
7276 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 21
7276 fcntl(21, F_GETFL) = 0x2 (flags O_RDWR)
7276 fcntl(21, F_SETFL, O_RDWR|O_NONBLOCK) = 0
7276 fstat(21, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
7276 fcntl(21, F_GETFL) = 0x802 (flags ...Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a
I see use of a hashinfo->bsockets field that :
- lacks proper lock/synchronization
- suffers from cache line ping pongs on SMP
Also there might be a problem at line 175
if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
spin_unlock(&head->lock);
goto again;
If we entered inet_csk_get_port() with a non null snum, we can "goto again"
while it was not expected.
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index df8e72f..752c6b2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -172,7 +172,8 @@ tb_found:
} else {
ret = 1;
if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
- if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+ if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+ smallest_size == -1 && --attempts >= 0) {
spin_unlock(&head->lock);
goto again;
}
--
It should contain rough number of sockets, there is no need to be very I used free alignment slot so that socket structure would not be I think it should be smallest_size != -1, since we really want to goto to the again label when hueristic is used, which in turn changes smallest_size. -- Evgeniy Polyakov --
Denying there is a bug is... well... I dont know what to say.
Are you kidding ?
bsockets is not part of socket structure, but part of "struct inet_hashinfo",
shared by all cpus and accessed several thousand times per second on many
machines.
Please read the comment three lines after 'the free alignemnt slot'
you chose.... You just introduced one write on a cache line
that is supposed to *not* be written.
unsigned int bhash_size;
int bsockets;
struct kmem_cache *bind_bucket_cachep;
/* All the above members are written once at bootup and
* never written again _or_ are predominantly read-access.
*
* Now align to a new cache line as all the following members
* might be often dirty.
Yep
--
It is not a bug. It is not supposed to be precise. At all.
I implemented a simple heuristic on when diferent bind port selection
algorithm should start: roughly when number of opened sockets equals to
some predefined value (sysctl at the moment, but it could be 64k or
anything else), so if that number is loosely maintained and does not
precisely corresponds to the number of sockets, it is not a problem.
You also saw 'again' lavel which has magic 5 number - it is another
heuristic - since lock is dropped atfer the bind bucket check, and we
selected it, it is possible that non-reuse socket will be added into the
bucket, so we will have to rerun the process again. I limited this to
the 5 attempts only, since it is better than what we have right now (I
never saw more than 2 attempts needed in the tests), when number of
I have no objection on moving this anywhere at the end of the structure
like after bind_bucket_cachep.
--- ./include/net/inet_hashtables.h~ 2009-01-19 22:19:11.000000000 +0300
+++ ./include/net/inet_hashtables.h 2009-01-31 01:48:21.000000000 +0300
@@ -134,7 +134,6 @@
struct inet_bind_hashbucket *bhash;
unsigned int bhash_size;
- int bsockets;
struct kmem_cache *bind_bucket_cachep;
@@ -148,6 +147,7 @@
* table where wildcard'd TCP sockets can exist. Hash function here
* is just local port number.
*/
+ int bsockets;
struct inet_listen_hashbucket listening_hash[INET_LHTABLE_SIZE]
____cacheline_aligned_in_smp;
--- ./net/ipv4/inet_connection_sock.c~ 2009-01-19 22:21:08.000000000 +0300
+++ ./net/ipv4/inet_connection_sock.c 2009-01-31 01:50:20.000000000 +0300
@@ -172,7 +172,8 @@
} else {
ret = 1;
if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
- if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+ if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+ smallest_size != -1 && --attempts >= 0) {
spin_unlock(&head->lock);
goto again;
}
--
Evgeniy ...On Sat, 31 Jan 2009 01:51:14 +0300 How is any of this supposed to fix the bug? --
Nothing from above fixes the bug. It was an explaination of how things work. Patch is based on Eric's observation about unconditional (compared to old code) attempt to get the new socket bucket when code should just return. -- Evgeniy Polyakov --
My working hypothesis is:
1. Something about Evgeniy's patch makes IPV6 (actually IPV4 in IPV6) be
preferred over plain IPV4.
2. Vino server (VNC) doesn't think ::ffff::127.0.0.1 is really the localhost
3. protocol gets screwed up after that.
It is probably reproducible with other services that support IPV6.
--
getaddrinfo() returns list of addresses and IPv6 was the first one iirc. Previously it bailed out, but with my change it will try again without reason for doing this. With the patch I sent based on Eric's observation things should be fine. -- Evgeniy Polyakov --
Problem is your patch is wrong Evgeniy, please think about it litle bit more
and resubmit it.
Take the time to run this $0.02 program, before and after your upcoming fix :
$ cat size.c
#include <net/inet_hashtables.h>
extern int printf(const char *, ...);
int main(int argc, char *argv[])
{
printf("offsetof(struct inet_hashinfo, bsockets)=0x%x\n",
offsetof(struct inet_hashinfo, bsockets));
return 0;
}
$ make size.o ; gcc -o size size.o ; ./size
CHK include/linux/version.h
CHK include/linux/utsrelease.h
SYMLINK include/asm -> include/asm-x86
CALL scripts/checksyscalls.sh
CC size.o
offsetof(struct inet_hashinfo, bsockets)=0x18
offset of bsockets being 0x18 or 0x20 is same result : bad because in
same cache line than ehash, ehash_locks, ehash_size, ehash_locks_mask,
bhash, bhash_size, unless your cpu is a Pentium.
Also, I suggest you change bsockets to something more appropriate, eg a
percpu counter.
Thank you.
Eric
--
No, patch should be ok. And its part which moves bsockets around was added because of your complaints, that it is written into read-mostly It is not a fix, but enhancement, which really has nothing with the bug in question :) Attached patch makes difference, I'm curious if it ever make any I thought on that first, but found that looping over every cpu and summing the total number of allocated/freed sockets will have noticebly bigger overhead than having loosely maintaned number of sockets. For the reference. This patch has nothing with the bug we discuss here, the proper patch (without need to move bsockets around) was sent earlier, which forces port selection codepath to return error when new selection heuristic is not used. --- ./include/net/inet_hashtables.h.orig 2009-01-31 12:27:41.000000000 +0300 +++ ./include/net/inet_hashtables.h 2009-01-31 12:28:15.000000000 +0300 @@ -134,7 +134,6 @@ struct inet_bind_hashbucket *bhash; unsigned int bhash_size; - int bsockets; struct kmem_cache *bind_bucket_cachep; @@ -150,6 +149,8 @@ */ struct inet_listen_hashbucket listening_hash[INET_LHTABLE_SIZE] ____cacheline_aligned_in_smp; + + int bsockets ____cacheline_aligned_in_smp; }; -- Evgeniy Polyakov --
It appears you are always right, I have nothing to say then. Stupid I am. I vote for plain revert of your initial patch, since you are anaware of performance problems it introduces. Then, probably nobody cares of my complaints, so dont worry. --
Eric, do not get it soo personally :) After all it is only a matter of how we enjoy the process and have fun with the development. Really, I appreciate your work and help, and likely this misunderstanding happened because of a bad mix of the original bug and this performance implication. Original bug has really nothing with what we discuss here. And while the performance problem with bound sockets creation may be visible, I did not observe it, while the idea implemented with this approach shows up clearly in the graph I posted. So I vote by both hands to further improve it by moving things around so that there would be no unneded cache flushes during update of this field. -- Evgeniy Polyakov --
OK OK, as I said, dont worry, it was not a strong feeling from me, only a litle bit upset, thats all. We only need to know if the *fix* is solving Stephen problem About performance effects of careful variable placement and percpu counter strategy you might consult as an example : http://lkml.indiana.edu/hypermail/linux/kernel/0812.1/01624.html Now, with these patches applied, try to see effect of your new bsockets field on a network workload doing lot of socket bind()/unbind() calls... With current kernels, you probably wont notice because of inode/dcache hot cache lines, but it might change eventually... --
Hi Eric. Impressive, but to be 100% fair it is not only because of the cache line David applied the patch which fixed the problem, so we can return to the cache line issues. What do you think about the last version where bsockets field was placed at the very end of the structure and with cacheline_aligned_on_smp attribute? -- Evgeniy Polyakov --
Yes, at a minimum, move it away from first cache line.
And using atomic_t so that we dont have to discuss about accumulated
errors on SMP on this variable. We will see later if percpu counter
is wanted or not.
Thank you
[PATCH] net: move bsockets outside of read only beginning of struct inet_hashinfo
And switch bsockets to atomic_t since it might be changed in parallel.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
include/net/inet_hashtables.h | 3 ++-
net/ipv4/inet_connection_sock.c | 2 +-
net/ipv4/inet_hashtables.c | 5 +++--
3 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 8d98dc7..a44e224 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -134,7 +134,7 @@ struct inet_hashinfo {
struct inet_bind_hashbucket *bhash;
unsigned int bhash_size;
- int bsockets;
+ /* 4 bytes hole on 64 bit */
struct kmem_cache *bind_bucket_cachep;
@@ -151,6 +151,7 @@ struct inet_hashinfo {
struct inet_listen_hashbucket listening_hash[INET_LHTABLE_SIZE]
____cacheline_aligned_in_smp;
+ atomic_t bsockets;
};
static inline struct inet_ehash_bucket *inet_ehash_bucket(
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 9bc6a18..22cd19e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -119,7 +119,7 @@ again:
(tb->num_owners < smallest_size || smallest_size == -1)) {
smallest_size = tb->num_owners;
smallest_rover = rover;
- if (hashinfo->bsockets > (high - low) + 1) {
+ if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
spin_unlock(&head->lock);
snum = smallest_rover;
goto have_snum;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index d7b6178..625cc5f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -62,7 +62,7 @@ void ...Ok, let's do it this way. Ack. Thank you Eric. -- Evgeniy Polyakov --
From: Evgeniy Polyakov <zbr@ioremap.net> Applied, thanks everyone. --
On Sat, 31 Jan 2009 00:50:08 +0300
Yes, this fixes the problem, not sure who wants the honors for sending a signed off
version.
--- a/net/ipv4/inet_connection_sock.c 2009-01-31 21:18:45.433239861 -0800
+++ b/net/ipv4/inet_connection_sock.c 2009-01-31 21:30:14.720825414 -0800
@@ -172,7 +172,8 @@ tb_found:
} else {
ret = 1;
if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
- if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+ if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+ smallest_size != -1 && --attempts >= 0) {
spin_unlock(&head->lock);
goto again;
}
--
From: Stephen Hemminger <shemminger@vyatta.com> I'll sort it out. --
Thanks for testing Stephen. David applied that version and cut the Gordian knot :) -- Evgeniy Polyakov --
On Fri, 30 Jan 2009 19:41:59 +0100 That didn't fix it. --
