Re: virt-manager broken by bind(0) in net-next.

Previous thread: Re: netxen: fix memory leak in drivers/net/netxen_nic_init.c by David Miller on Thursday, January 29, 2009 - 5:56 pm. (1 message)

Next thread: virt-manager broken by bind(0) in net-next. by Stephen Hemminger on Thursday, January 29, 2009 - 11:50 pm. (1 message)
From: Stephen Hemminger
Date: Thursday, January 29, 2009 - 10:35 pm

On Thu, 29 Jan 2009 10:35:44 +0000

The problem is only in the net-next tree (not mainline 2.6.29-rcX).
Bisected down to this commit is the problem:

a9d8f9110d7e953c2f2b521087a4179677843c2a is first bad commit
commit a9d8f9110d7e953c2f2b521087a4179677843c2a
Author: Evgeniy Polyakov <zbr@ioremap.net>
Date:   Mon Jan 19 16:46:02 2009 -0800

    inet: Allowing more than 64k connections and heavily optimize bind(0) time.
    
    With simple extension to the binding mechanism, which allows to bind more
    than 64k sockets (or smaller amount, depending on sysctl parameters),
    we have to traverse the whole bind hash table to find out empty bucket.
    And while it is not a problem for example for 32k connections, bind()
    completion time grows exponentially (since after each successful binding
    we have to traverse one bucket more to find empty one) even if we start
    each time from random offset inside the hash table.
    
    So, when hash table is full, and we want to add another socket, we have
    to traverse the whole table no matter what, so effectivelly this will be
    the worst case performance and it will be constant.
    
    Attached picture shows bind() time depending on number of already bound
    sockets.

Not sure why but it breaks VNC, see attached screenshot.
From: Evgeniy Polyakov
Date: Friday, January 30, 2009 - 1:16 am

Hi.


Any chance to get a bit more information about what this console does?
Like how it binds or get a strace?
Will you be able to run a debug patch which will dump lots of info into
the dmesg and slow things down a bit?

-- 
	Evgeniy Polyakov
--

From: Daniel P. Berrange
Date: Friday, January 30, 2009 - 3:27 am

The virt-manager console is basically just a plain old boring VNC client.
It uses GTK-VNC to establish its VNC network connection, and that doesn't
do anything unusual AFAIK. We use getaddrinfo() to resolve the hostname,
and then try each of its results in turn, until we succesfully connect
to the VNC server. We don't explicitly bind() to the client port, just
let the kernel pick it for us. The code in question, is the "gvnc_open_host"
method from gvnc.c,  which starts at about line 2910

http://freehg.org/u/aliguori/gtk-vnc.hg/file/d68935d582f0/src/gvnc.c

Regards,
Daniel 
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--

From: Evgeniy Polyakov
Date: Friday, January 30, 2009 - 4:21 am

So it is not explicit bind call, but port autoselection in the
connect(). Can you check what errno is returned?
Did I understand it right, that connect fails, you try different
address, but then suddenly all those sockets become 'alive'?

-- 
	Evgeniy Polyakov
--

From: Herbert Xu
Date: Friday, January 30, 2009 - 5:53 am

Yes, I think a good strace vs. a bad strace would be really helpful
in these cases.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Stephen Hemminger
Date: Friday, January 30, 2009 - 10:57 am

On Fri, 30 Jan 2009 23:53:37 +1100

I have the strace but it comes up no different.
What is different is that in the broken case (net-next), I see
IPV6 being used:

State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
ESTAB      23769  0        ::ffff:127.0.0.1:5900      ::ffff:127.0.0.1:55987   
ESTAB      0      0               127.0.0.1:55987            127.0.0.1:5900

and in the working case (2.6.29-rc3), IPV4 is being used
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
ESTAB      0      0               127.0.0.1:58894            127.0.0.1:5901    
ESTAB      0      0               127.0.0.1:5901             127.0.0.1:58894 

Relevant bits of strace in broken case are:

7276  socket(PF_NETLINK, SOCK_RAW, 0)   = 21
7276  bind(21, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
7276  getsockname(21, {sa_family=AF_NETLINK, pid=7276, groups=00000000}, [66387309494284]) = 0
7276  sendto(21, "\24\0\0\0\26\0\1\3\353<\203I\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
7276  recvmsg(21, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"0\0\0\0\24\0\2\0\353<\203Il\34\0\0\2\10\200\376\1\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 168
7276  recvmsg(21, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"@\0\0\0\24\0\2\0\353<\203Il\34\0\0\n\200\200\376\1\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 256
7276  recvmsg(21, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\353<\203Il\34\0\0\0\0\0\0\1\0\0\0\24"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
7276  close(21)                         = 0
7276  socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 21
7276  fcntl(21, F_GETFL)                = 0x2 (flags O_RDWR)
7276  fcntl(21, F_SETFL, O_RDWR|O_NONBLOCK) = 0
7276  fstat(21, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
7276  fcntl(21, F_GETFL)                = 0x802 (flags ...
From: Eric Dumazet
Date: Friday, January 30, 2009 - 11:41 am

Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a

I see use of a hashinfo->bsockets field that :

- lacks proper lock/synchronization
- suffers from cache line ping pongs on SMP

Also there might be a problem at line 175

if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
	spin_unlock(&head->lock);
	goto again;

If we entered inet_csk_get_port() with a non null snum, we can "goto again"
while it was not expected.

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index df8e72f..752c6b2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -172,7 +172,8 @@ tb_found:
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+					smallest_size == -1 &&  --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
 				}


--

From: Evgeniy Polyakov
Date: Friday, January 30, 2009 - 2:50 pm

It should contain rough number of sockets, there is no need to be very

I used free alignment slot so that socket structure would not be

I think it should be smallest_size != -1, since we really want to goto
to the again label when hueristic is used, which in turn changes
smallest_size.

-- 
	Evgeniy Polyakov
--

From: Eric Dumazet
Date: Friday, January 30, 2009 - 3:30 pm

Denying there is a bug is... well... I dont know what to say.


Are you kidding ?

bsockets is not part of socket structure, but part of "struct inet_hashinfo",
shared by all cpus and accessed several thousand times per second on many
machines.

Please read the comment three lines after 'the free alignemnt slot'
you chose.... You just introduced one write on a cache line
that is supposed to *not* be written.

        unsigned int                    bhash_size;
        int                             bsockets;

        struct kmem_cache               *bind_bucket_cachep;

        /* All the above members are written once at bootup and
         * never written again _or_ are predominantly read-access.
         *
         * Now align to a new cache line as all the following members
         * might be often dirty.

Yep


--

From: Evgeniy Polyakov
Date: Friday, January 30, 2009 - 3:51 pm

It is not a bug. It is not supposed to be precise. At all.
I implemented a simple heuristic on when diferent bind port selection
algorithm should start: roughly when number of opened sockets equals to
some predefined value (sysctl at the moment, but it could be 64k or
anything else), so if that number is loosely maintained and does not
precisely corresponds to the number of sockets, it is not a problem.

You also saw 'again' lavel which has magic 5 number - it is another
heuristic - since lock is dropped atfer the bind bucket check, and we
selected it, it is possible that non-reuse socket will be added into the
bucket, so we will have to rerun the process again. I limited this to
the 5 attempts only, since it is better than what we have right now (I
never saw more than 2 attempts needed in the tests), when number of


I have no objection on moving this anywhere at the end of the structure
like after bind_bucket_cachep.

--- ./include/net/inet_hashtables.h~	2009-01-19 22:19:11.000000000 +0300
+++ ./include/net/inet_hashtables.h	2009-01-31 01:48:21.000000000 +0300
@@ -134,7 +134,6 @@
 	struct inet_bind_hashbucket	*bhash;
 
 	unsigned int			bhash_size;
-	int				bsockets;
 
 	struct kmem_cache		*bind_bucket_cachep;
 
@@ -148,6 +147,7 @@
 	 * table where wildcard'd TCP sockets can exist.  Hash function here
 	 * is just local port number.
 	 */
+	int				bsockets;
 	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
 					____cacheline_aligned_in_smp;
 
--- ./net/ipv4/inet_connection_sock.c~	2009-01-19 22:21:08.000000000 +0300
+++ ./net/ipv4/inet_connection_sock.c	2009-01-31 01:50:20.000000000 +0300
@@ -172,7 +172,8 @@
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+					smallest_size != -1 && --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
 				}


-- 
	Evgeniy ...
From: Stephen Hemminger
Date: Friday, January 30, 2009 - 5:36 pm

On Sat, 31 Jan 2009 01:51:14 +0300

How is any of this supposed to fix the bug?
--

From: Evgeniy Polyakov
Date: Saturday, January 31, 2009 - 1:35 am

Nothing from above fixes the bug. It was an explaination of how things
work. Patch is based on Eric's observation about unconditional (compared
to old code) attempt to get the new socket bucket when code should just
return.

-- 
	Evgeniy Polyakov
--

From: Stephen Hemminger
Date: Friday, January 30, 2009 - 7:52 pm

My working hypothesis is:
  1. Something about Evgeniy's patch makes IPV6 (actually IPV4 in IPV6) be
     preferred over plain IPV4.
  2. Vino server (VNC) doesn't think ::ffff::127.0.0.1 is really the localhost
  3. protocol gets screwed up after that.

It is probably reproducible with other services that support IPV6.
--

From: Evgeniy Polyakov
Date: Saturday, January 31, 2009 - 1:37 am

getaddrinfo() returns list of addresses and IPv6 was the first one iirc.
Previously it bailed out, but with my change it will try again without
reason for doing this. With the patch I sent based on Eric's observation
things should be fine.

-- 
	Evgeniy Polyakov
--

From: Eric Dumazet
Date: Saturday, January 31, 2009 - 2:17 am

Problem is your patch is wrong Evgeniy, please think about it litle bit more
and resubmit it. 

Take the time to run this $0.02 program, before and after your upcoming fix :


$ cat size.c
#include <net/inet_hashtables.h>
extern int printf(const char *, ...);
int main(int argc, char *argv[])
{
        printf("offsetof(struct inet_hashinfo, bsockets)=0x%x\n",
                offsetof(struct inet_hashinfo, bsockets));
        return 0;
}
$ make size.o ; gcc -o size size.o ; ./size
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  SYMLINK include/asm -> include/asm-x86
  CALL    scripts/checksyscalls.sh
  CC      size.o
offsetof(struct inet_hashinfo, bsockets)=0x18


offset of bsockets being 0x18 or 0x20 is same result : bad because in
same cache line than ehash, ehash_locks, ehash_size, ehash_locks_mask,
bhash, bhash_size, unless your cpu is a Pentium.

Also, I suggest you change bsockets to something more appropriate, eg a
percpu counter.

Thank you.
Eric

--

From: Evgeniy Polyakov
Date: Saturday, January 31, 2009 - 2:31 am

No, patch should be ok. And its part which moves bsockets around was
added because of your complaints, that it is written into read-mostly

It is not a fix, but enhancement, which really has nothing with the bug
in question :)

Attached patch makes difference, I'm curious if it ever make any

I thought on that first, but found that looping over every cpu and
summing the total number of allocated/freed sockets will have noticebly
bigger overhead than having loosely maintaned number of sockets.

For the reference. This patch has nothing with the bug we discuss here,
the proper patch (without need to move bsockets around) was sent
earlier, which forces port selection codepath to return error when new
selection heuristic is not used.

--- ./include/net/inet_hashtables.h.orig	2009-01-31 12:27:41.000000000 +0300
+++ ./include/net/inet_hashtables.h	2009-01-31 12:28:15.000000000 +0300
@@ -134,7 +134,6 @@
 	struct inet_bind_hashbucket	*bhash;
 
 	unsigned int			bhash_size;
-	int				bsockets;
 
 	struct kmem_cache		*bind_bucket_cachep;
 
@@ -150,6 +149,8 @@
 	 */
 	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
 					____cacheline_aligned_in_smp;
+	
+	int				bsockets ____cacheline_aligned_in_smp;
 
 };
 

-- 
	Evgeniy Polyakov
--

From: Eric Dumazet
Date: Saturday, January 31, 2009 - 2:49 am

It appears you are always right, I have nothing to say then.

Stupid I am.

I vote for plain revert of your initial patch, since you are anaware
of performance problems it introduces. Then, probably nobody cares
of my complaints, so dont worry.


--

From: Evgeniy Polyakov
Date: Saturday, January 31, 2009 - 2:56 am

Eric, do not get it soo personally :) After all it is only a matter of
how we enjoy the process and have fun with the development.

Really, I appreciate your work and help, and likely this
misunderstanding happened because of a bad mix of the original bug and
this performance implication. Original bug has really nothing with what
we discuss here. And while the performance problem with bound sockets
creation may be visible, I did not observe it, while the idea
implemented with this approach shows up clearly in the graph I posted.
So I vote by both hands to further improve it by moving things around so
that there would be no unneded cache flushes during update of this
field.

-- 
	Evgeniy Polyakov
--

From: Eric Dumazet
Date: Saturday, January 31, 2009 - 3:17 am

OK OK, as I said, dont worry, it was not a strong feeling from me, only
a litle bit upset, thats all.

We only need to know if the *fix* is solving Stephen problem

About performance effects of careful variable placement and percpu counter
strategy you might consult as an example :

http://lkml.indiana.edu/hypermail/linux/kernel/0812.1/01624.html

Now, with these patches applied, try to see effect of your new bsockets field
on a network workload doing lot of socket bind()/unbind() calls...

With current kernels, you probably wont notice because of inode/dcache hot
cache lines, but it might change eventually...


--

From: Evgeniy Polyakov
Date: Sunday, February 1, 2009 - 5:42 am

Hi Eric.


Impressive, but to be 100% fair it is not only because of the cache line

David applied the patch which fixed the problem, so we can return to the
cache line issues. What do you think about the last version where
bsockets field was placed at the very end of the structure and with
cacheline_aligned_on_smp attribute?

-- 
	Evgeniy Polyakov
--

From: Eric Dumazet
Date: Sunday, February 1, 2009 - 9:12 am

Yes, at a minimum, move it away from first cache line.

And using atomic_t so that we dont have to discuss about accumulated
errors on SMP on this variable. We will see later if percpu counter
is wanted or not.

Thank you

[PATCH] net: move bsockets outside of read only beginning of struct inet_hashinfo

And switch bsockets to atomic_t since it might be changed in parallel.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/inet_hashtables.h   |    3 ++-
 net/ipv4/inet_connection_sock.c |    2 +-
 net/ipv4/inet_hashtables.c      |    5 +++--
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 8d98dc7..a44e224 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -134,7 +134,7 @@ struct inet_hashinfo {
 	struct inet_bind_hashbucket	*bhash;
 
 	unsigned int			bhash_size;
-	int				bsockets;
+	/* 4 bytes hole on 64 bit */
 
 	struct kmem_cache		*bind_bucket_cachep;
 
@@ -151,6 +151,7 @@ struct inet_hashinfo {
 	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
 					____cacheline_aligned_in_smp;
 
+	atomic_t			bsockets;
 };
 
 static inline struct inet_ehash_bucket *inet_ehash_bucket(
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 9bc6a18..22cd19e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -119,7 +119,7 @@ again:
 					    (tb->num_owners < smallest_size || smallest_size == -1)) {
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
-						if (hashinfo->bsockets > (high - low) + 1) {
+						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
 							spin_unlock(&head->lock);
 							snum = smallest_rover;
 							goto have_snum;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index d7b6178..625cc5f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -62,7 +62,7 @@ void ...
From: Evgeniy Polyakov
Date: Sunday, February 1, 2009 - 10:40 am

Ok, let's do it this way. Ack.
Thank you Eric.

-- 
	Evgeniy Polyakov
--

From: David Miller
Date: Sunday, February 1, 2009 - 1:31 pm

From: Evgeniy Polyakov <zbr@ioremap.net>

Applied, thanks everyone.
--

From: Stephen Hemminger
Date: Saturday, January 31, 2009 - 10:58 pm

On Sat, 31 Jan 2009 00:50:08 +0300

Yes, this fixes the problem, not sure who wants the honors for sending a signed off
version.




--- a/net/ipv4/inet_connection_sock.c	2009-01-31 21:18:45.433239861 -0800
+++ b/net/ipv4/inet_connection_sock.c	2009-01-31 21:30:14.720825414 -0800
@@ -172,7 +172,8 @@ tb_found:
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+				    smallest_size != -1 && --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
 				}
--

From: David Miller
Date: Sunday, February 1, 2009 - 2:07 am

From: Stephen Hemminger <shemminger@vyatta.com>

I'll sort it out.
--

From: Evgeniy Polyakov
Date: Sunday, February 1, 2009 - 5:44 am

Thanks for testing Stephen. David applied that version and cut the
Gordian knot :)

-- 
	Evgeniy Polyakov
--

From: Stephen Hemminger
Date: Saturday, January 31, 2009 - 10:29 pm

On Fri, 30 Jan 2009 19:41:59 +0100

That didn't fix it.
--

Previous thread: Re: netxen: fix memory leak in drivers/net/netxen_nic_init.c by David Miller on Thursday, January 29, 2009 - 5:56 pm. (1 message)

Next thread: virt-manager broken by bind(0) in net-next. by Stephen Hemminger on Thursday, January 29, 2009 - 11:50 pm. (1 message)