Re: [PATCH] netfilter: finer grained nf_conn locking

Previous thread: [RFT 0/4] Netfilter/iptables performance improvements by Stephen Hemminger on Tuesday, February 17, 2009 - 10:19 pm. (2 messages)

Next thread: [RFT 3/4] Use mod_timer_noact to remove nf_conntrack_lock by Stephen Hemminger on Tuesday, February 17, 2009 - 10:19 pm. (8 messages)
From: Stephen Hemminger
Date: Tuesday, February 17, 2009 - 10:19 pm

TCP connection tracking suffers of huge contention on a global rwlock,
is used for protecting the  tcp conntracking state.
As each tcp conntrack state have no relations between each others, we
can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"

tcp_print_conntrack() dont need to lock anything to read ct->proto.tcp.state,
so speedup /proc/net/ip_conntrack as well.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>

---
 include/linux/netfilter/nf_conntrack_tcp.h   |    1 
 include/net/netfilter/nf_conntrack_helper.h  |    2 -
 include/net/netfilter/nf_conntrack_l4proto.h |    3 --
 net/netfilter/nf_conntrack_netlink.c         |    6 ++--
 net/netfilter/nf_conntrack_proto_dccp.c      |    2 -
 net/netfilter/nf_conntrack_proto_sctp.c      |    2 -
 net/netfilter/nf_conntrack_proto_tcp.c       |   37 ++++++++++++---------------
 7 files changed, 25 insertions(+), 28 deletions(-)

--- a/include/linux/netfilter/nf_conntrack_tcp.h	2009-02-17 11:07:16.884086452 -0800
+++ b/include/linux/netfilter/nf_conntrack_tcp.h	2009-02-17 11:07:31.643846743 -0800
@@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
 
 struct ip_ct_tcp
 {
+	spinlock_t	lock;
 	struct ip_ct_tcp_state seen[2];	/* connection parameters per direction */
 	u_int8_t	state;		/* state of the connection (enum tcp_conntrack) */
 	/* For detecting stale connections */
--- a/net/netfilter/nf_conntrack_proto_tcp.c	2009-02-17 11:07:16.870763455 -0800
+++ b/net/netfilter/nf_conntrack_proto_tcp.c	2009-02-17 11:21:57.528485882 -0800
@@ -26,9 +26,6 @@
 #include <net/netfilter/nf_conntrack_ecache.h>
 #include <net/netfilter/nf_log.h>
 
-/* Protects ct->proto.tcp */
-static DEFINE_RWLOCK(tcp_lock);
-
 /* "Be conservative in what you do,
     be liberal in what you accept from others."
     If it's non-zero, we mark only out of window RST segments as INVALID. */
@@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct se
 {
 	enum tcp_conntrack state;
 ...
From: Patrick McHardy
Date: Wednesday, February 18, 2009 - 2:56 am

Eric already posted a patch to use an array of locks, which is
a better approach IMO since it keeps the size of the conntrack
entries down.
--

From: Eric Dumazet
Date: Wednesday, February 18, 2009 - 7:17 am

Yes, we probably can use an array for short lived lock sections.

The extra cost to compute the lock address is quite small, if
the array size is known at compile time.

Stephen we might also use this same array to protect the nf_conn_acct_find(ct)
thing as well (I am referring to your RFT 3/4 patch :
Use mod_timer_noact to remove nf_conntrack_lock)

Here is a repost of patch Patrick is referring to :


[PATCH] netfilter: Get rid of central rwlock in tcp conntracking

TCP connection tracking suffers of huge contention on a global rwlock,
used to protect tcp conntracking state.

As each tcp conntrack state have no relations between each others, we
can switch to fine grained lock. Using an array of spinlocks avoids
enlarging size of connection tracking structures, yet giving reasonable
fanout.

tcp_print_conntrack() doesnt need to lock anything to read
ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well.

nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>
---
 include/net/netfilter/nf_conntrack.h   |   32 +++++++++++++++++++++
 net/netfilter/nf_conntrack_core.c      |   10 ++++--
 net/netfilter/nf_conntrack_proto_tcp.c |   34 ++++++++++-------------
 3 files changed, 53 insertions(+), 23 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 2e0c536..288aff5 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -129,6 +129,38 @@ struct nf_conn
 	struct rcu_head rcu;
 };
 
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
+	defined(CONFIG_PROVE_LOCKING)
+
+/*
+ * We reserve 16 locks per cpu (typical cache line size is 64 bytes)
+ * maxed to 512 locks so that ct_hlock[] uses at most 2048 bytes of memory.
+ * (on lockdep we have a quite big spinlock_t, so keep the size down there)
+ */
+#ifdef CONFIG_LOCKDEP
+#define ...
From: Stephen Hemminger
Date: Thursday, February 19, 2009 - 3:03 pm

TCP connection tracking suffers of huge contention on a global rwlock,
used to protect tcp conntracking state.
As each tcp conntrack state have no relations between each others, we
can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"

tcp_print_conntrack() dont need to lock anything to read ct->proto.tcp.state,
so speedup /proc/net/ip_conntrack as well.

In this version the lock is placed in a 4 byte whole in the nf_conntrack
structure. This means no size change, and same method can later be used
for UDP, SCTP, DCCP conntrack.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>

---
 include/linux/skbuff.h                       |    1 
 include/net/netfilter/nf_conntrack_helper.h  |    2 -
 include/net/netfilter/nf_conntrack_l4proto.h |    3 --
 net/netfilter/nf_conntrack_core.c            |    1 
 net/netfilter/nf_conntrack_netlink.c         |    6 ++--
 net/netfilter/nf_conntrack_proto_dccp.c      |    2 -
 net/netfilter/nf_conntrack_proto_sctp.c      |    2 -
 net/netfilter/nf_conntrack_proto_tcp.c       |   37 ++++++++++++---------------
 8 files changed, 26 insertions(+), 28 deletions(-)

--- a/include/net/netfilter/nf_conntrack_helper.h	2009-02-19 13:45:26.103408544 -0800
+++ b/include/net/netfilter/nf_conntrack_helper.h	2009-02-19 13:45:56.136167400 -0800
@@ -34,7 +34,7 @@ struct nf_conntrack_helper
 
 	void (*destroy)(struct nf_conn *ct);
 
-	int (*to_nlattr)(struct sk_buff *skb, const struct nf_conn *ct);
+	int (*to_nlattr)(struct sk_buff *skb, struct nf_conn *ct);
 	unsigned int expect_class_max;
 };
 
--- a/include/net/netfilter/nf_conntrack_l4proto.h	2009-02-19 13:45:26.103408544 -0800
+++ b/include/net/netfilter/nf_conntrack_l4proto.h	2009-02-19 13:45:56.136167400 -0800
@@ -62,8 +62,7 @@ struct nf_conntrack_l4proto
 	int (*print_conntrack)(struct seq_file *s, const struct nf_conn *);
 
 	/* convert protoinfo to nfnetink attributes ...
From: Eric Dumazet
Date: Saturday, March 28, 2009 - 9:55 am

Hi Patrick

Apparently we could not finish the removal of tcp_lock for 2.6.30 :(

Stephen suggested using a 4 bytes hole in struct nf_conntrack,
but this is ok only if sizeof(spinlock_t) <= 4 and 64 bit arches.

We could do an hybrid thing : use nf_conn.ct_general.lock if 64 bit arches
and sizeof(spinlock_t) <= 4.

Other cases would use a carefuly sized array of spinlocks...

Thank you

[PATCH] netfilter: finer grained nf_conn locking

Introduction of fine grained lock infrastructure for nf_conn.
If possible, we use a 32bit hole on 64bit arches.
Else we use a global array of hashed spinlocks, so we dont
change size of "struct nf_conn"

Get rid of central tcp_lock rwlock used in TCP conntracking
using this infrastructure for better performance on SMP.

"tbench 8" results on my 8 core machine (32bit kernel, with
conntracking on) : 2319 MB/s instead of 2284 MB/s


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/skbuff.h                       |    9 ++-
 include/net/netfilter/nf_conntrack_l4proto.h |    2
 net/netfilter/nf_conntrack_core.c            |   47 +++++++++++++++++
 net/netfilter/nf_conntrack_netlink.c         |    4 -
 net/netfilter/nf_conntrack_proto_tcp.c       |   35 +++++-------
 5 files changed, 74 insertions(+), 23 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb1981f..c737f47 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -96,7 +96,14 @@ struct pipe_inode_info;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
-	atomic_t use;
+	atomic_t	use;
+#if BITS_PER_LONG == 64
+	/*
+	 * On 64bit arches, we might use this 32bit hole for spinlock
+	 * (if a spinlock_t fits)
+	 */
+	int		pad;
+#endif
 };
 #endif
 
diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
index ba32ed7..d66bea9 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ ...
From: Stephen Hemminger
Date: Saturday, March 28, 2009 - 5:48 pm

On Sat, 28 Mar 2009 17:55:38 +0100

I am not a fan of the array of locks. Sizing it is awkward and
it is vulnerable to hash collisions. Let's see if there is another
better way.
--

From: Eric Dumazet
Date: Monday, March 30, 2009 - 12:57 pm

On normal machines, (no debugging spinlocks), patch uses an embedded
spinlock. We probably can use this even on 32bit kernels, considering
previous patch removed the rcu_head (8 bytes on 32bit arches) from
nf_conn :)

if LOCKDEP is on, size of a spinlock is 64 bytes on x86_64.
Adding a spinlock on each nf_conn would be too expensive. In this
case, an array of spinlock is a good compromise, as done in
IP route cache, tcp ehash, ...

I agree sizing of this hash table is not pretty, and should be
a generic kernel service (I wanted such service for futexes for example)

--

From: Stephen Hemminger
Date: Monday, March 30, 2009 - 1:05 pm

On Mon, 30 Mar 2009 21:57:15 +0200

IMO having different locking based on lockdep and architecture is an invitation
to future obscure problems. Perhaps some other locking method or shrinking
ct entry would be better.
--

From: Patrick McHardy
Date: Monday, April 6, 2009 - 5:07 am

I agree. Do people enable lockdep on production machines? Otherwise
I'd say the size increase doesn't really matter.
--

From: Stephen Hemminger
Date: Monday, April 6, 2009 - 10:25 am

On Mon, 6 Apr 2009 14:32:54 +0200 (CEST)

IMHO If they enable lockdep, they can expect that the cost is non-zero.
--

From: Rick Jones
Date: Monday, March 30, 2009 - 11:57 am

Is this an implicit request for me to try to resurrect the 32-core setup?

rick jones
--

From: Eric Dumazet
Date: Monday, March 30, 2009 - 12:20 pm

Not at all, just to keep you informed of work in progress :)


--

From: Jesper Dangaard Brouer
Date: Monday, March 30, 2009 - 12:38 pm

How do you achieve this impressing numbers?
Is it against localhost? (10Gbit/s is max 1250 MB/s)

Hilsen
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--

From: Eric Dumazet
Date: Monday, March 30, 2009 - 12:54 pm

tbench is a tcp test on localhost yes :)

Good to test tcp stack without going to NIC hardware


--

From: Jesper Dangaard Brouer
Date: Monday, March 30, 2009 - 1:34 pm

On Mon, 30 Mar 2009, Eric Dumazet wrote:

> Jesper Dangaard Brouer a 
From: Eric Dumazet
Date: Monday, March 30, 2009 - 1:41 pm

Yes, my dev machine is a dual E5420 (8 cores) at 3.00 GHz

Indeed, tbench is a mix of tcp and process scheduler test/bench



--

From: Jesper Dangaard Brouer
Date: Monday, March 30, 2009 - 2:25 pm

On Mon, 30 Mar 2009, Eric Dumazet wrote:

> Jesper Dangaard Brouer a 
From: Rick Jones
Date: Monday, March 30, 2009 - 3:44 pm

> Indeed, tbench is a mix of tcp and process scheduler test/bench

If I were inclined to run networking tests (eg netperf) over loopback and wanted 
to maximize the trips up and down the protocol stack while minimizing scheduler 
overheads, I might be inclinded to configure --enable-burst with netperf and then 
run N/2 concurrent instances of something like:

netperf -T M,N -t TCP_RR -l 30 -- -b 128 -D &

where M and N were chosen to have each netperf and netserver pair bound to a pair 
of suitable cores, and the value in the -b option wash picked to maximize the CPU 
utilization on those cores.  Then, in theory there would be little to no process 
to process context switching and presumably little in the way of scheduler effect.

What I don't know is if such a setup would have both netperf and netserver each 
consuming 100% of a CPU or if one of them might "peg" before the other.  If one 
did peg before the other, I might be inclined to switch to running N concurrent 
instances, with -T M to bind each  netperf/netserver pair to the same core. 
There would then be the process to process context switching though it would be 
limited to "related" processes.

rick jones
--

From: Jesper Dangaard Brouer
Date: Tuesday, March 31, 2009 - 12:52 pm

On Mon, 30 Mar 2009, Eric Dumazet wrote:
> Jesper Dangaard Brouer a 
From: Eric Dumazet
Date: Tuesday, March 31, 2009 - 1:23 pm

warning, because on my machine, 

cpu0 is on physical id 0, core 0
cpu1 is on physical id 1, core 0
cpu2 is on physical id 0, core 1
cpu3 is on physical id 1, core 1
cpu3 is on physical id 0, core 2
cpu4 is on physical id 1, core 2
cpu5 is on physical id 0, core 3
cpu6 is on physical id 1, core 3

So -T 0,1 might not do what you think...

$ netperf -T 0,1 -l 120 -H localhost
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    120.00   7423.16
$ netperf -T 0,2 -l 120 -H localhost
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec



--

From: Rick Jones
Date: Tuesday, March 31, 2009 - 1:35 pm

Thanks.  That makes a decent quantity of sense for a unidirectional test I 
suppose.  If you get bored you might see what happens with a burst 
request/response test.

./configure --enbable-burst
make netperf
netperf -T 0,2 -t TCP_RR -c -C -- -b 64 -D

altering the parms to either -T or -b as you find necessary/desireable.

rick
--

From: Jesper Dangaard Brouer
Date: Tuesday, March 31, 2009 - 1:52 pm

Thanks a lot, I didn't know that!
(hint: cat /proc/cpuinfo | egrep -e 'processor|physical id|core id')

I sort of got a "too large" speedup on the Xeon system:

I want to use core 0 and 1 on physical CPU 0, for some strange reason CPU0 
and CPU4 is these respective cores...

netperf -T 0,4 -l 60  -H localhost
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : demo : cpu bind
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

  87380  16384  16384    60.00    19797.42


Hilsen
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------
--

From: David Miller
Date: Wednesday, February 18, 2009 - 2:55 pm

From: Patrick McHardy <kaber@trash.net>

Just as a side note, we generally frown upon the
hash-array-of-spinlocks approach to scalability.

If you need proof that in the long term it's suboptimal, note that:

1) this is Solaris's approach to locking scalability :-)

2) every such case in the kernel eventually gets transformed into
   RCU, a tree/trie based scheme, or some combination of the two

So maybe for now it's ok, but keep in mind that eventually
this is certain to change. :)
--

From: Patrick McHardy
Date: Wednesday, February 18, 2009 - 4:23 pm

This case might be different in that a normal firewall use case
probably doesn't have more than 16 cpus, even than would be quite
a lot. So for bigger machines this is probably more about keeping
the "non-use" costs low.

I'll keep it in mind though and I'm interested in seeing how it
turns out in the long term :)
--

From: Stephen Hemminger
Date: Wednesday, February 18, 2009 - 4:35 pm

On Thu, 19 Feb 2009 00:23:45 +0100

It doesn't help that spinlock_t keeps growing! In good old days,
a spin lock could fit in one byte.
--

Previous thread: [RFT 0/4] Netfilter/iptables performance improvements by Stephen Hemminger on Tuesday, February 17, 2009 - 10:19 pm. (2 messages)

Next thread: [RFT 3/4] Use mod_timer_noact to remove nf_conntrack_lock by Stephen Hemminger on Tuesday, February 17, 2009 - 10:19 pm. (8 messages)