TCP connection tracking suffers of huge contention on a global rwlock,
is used for protecting the tcp conntracking state.
As each tcp conntrack state have no relations between each others, we
can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"
tcp_print_conntrack() dont need to lock anything to read ct->proto.tcp.state,
so speedup /proc/net/ip_conntrack as well.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>
---
include/linux/netfilter/nf_conntrack_tcp.h | 1
include/net/netfilter/nf_conntrack_helper.h | 2 -
include/net/netfilter/nf_conntrack_l4proto.h | 3 --
net/netfilter/nf_conntrack_netlink.c | 6 ++--
net/netfilter/nf_conntrack_proto_dccp.c | 2 -
net/netfilter/nf_conntrack_proto_sctp.c | 2 -
net/netfilter/nf_conntrack_proto_tcp.c | 37 ++++++++++++---------------
7 files changed, 25 insertions(+), 28 deletions(-)
--- a/include/linux/netfilter/nf_conntrack_tcp.h 2009-02-17 11:07:16.884086452 -0800
+++ b/include/linux/netfilter/nf_conntrack_tcp.h 2009-02-17 11:07:31.643846743 -0800
@@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
struct ip_ct_tcp
{
+ spinlock_t lock;
struct ip_ct_tcp_state seen[2]; /* connection parameters per direction */
u_int8_t state; /* state of the connection (enum tcp_conntrack) */
/* For detecting stale connections */
--- a/net/netfilter/nf_conntrack_proto_tcp.c 2009-02-17 11:07:16.870763455 -0800
+++ b/net/netfilter/nf_conntrack_proto_tcp.c 2009-02-17 11:21:57.528485882 -0800
@@ -26,9 +26,6 @@
#include <net/netfilter/nf_conntrack_ecache.h>
#include <net/netfilter/nf_log.h>
-/* Protects ct->proto.tcp */
-static DEFINE_RWLOCK(tcp_lock);
-
/* "Be conservative in what you do,
be liberal in what you accept from others."
If it's non-zero, we mark only out of window RST segments as INVALID. */
@@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct se
{
enum tcp_conntrack state;
...Eric already posted a patch to use an array of locks, which is a better approach IMO since it keeps the size of the conntrack entries down. --
Yes, we probably can use an array for short lived lock sections. The extra cost to compute the lock address is quite small, if the array size is known at compile time. Stephen we might also use this same array to protect the nf_conn_acct_find(ct) thing as well (I am referring to your RFT 3/4 patch : Use mod_timer_noact to remove nf_conntrack_lock) Here is a repost of patch Patrick is referring to : [PATCH] netfilter: Get rid of central rwlock in tcp conntracking TCP connection tracking suffers of huge contention on a global rwlock, used to protect tcp conntracking state. As each tcp conntrack state have no relations between each others, we can switch to fine grained lock. Using an array of spinlocks avoids enlarging size of connection tracking structures, yet giving reasonable fanout. tcp_print_conntrack() doesnt need to lock anything to read ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Reported-by: Rick Jones <rick.jones2@hp.com> --- include/net/netfilter/nf_conntrack.h | 32 +++++++++++++++++++++ net/netfilter/nf_conntrack_core.c | 10 ++++-- net/netfilter/nf_conntrack_proto_tcp.c | 34 ++++++++++------------- 3 files changed, 53 insertions(+), 23 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 2e0c536..288aff5 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -129,6 +129,38 @@ struct nf_conn struct rcu_head rcu; }; +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \ + defined(CONFIG_PROVE_LOCKING) + +/* + * We reserve 16 locks per cpu (typical cache line size is 64 bytes) + * maxed to 512 locks so that ct_hlock[] uses at most 2048 bytes of memory. + * (on lockdep we have a quite big spinlock_t, so keep the size down there) + */ +#ifdef CONFIG_LOCKDEP +#define ...
TCP connection tracking suffers of huge contention on a global rwlock, used to protect tcp conntracking state. As each tcp conntrack state have no relations between each others, we can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp" tcp_print_conntrack() dont need to lock anything to read ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. In this version the lock is placed in a 4 byte whole in the nf_conntrack structure. This means no size change, and same method can later be used for UDP, SCTP, DCCP conntrack. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Reported-by: Rick Jones <rick.jones2@hp.com> --- include/linux/skbuff.h | 1 include/net/netfilter/nf_conntrack_helper.h | 2 - include/net/netfilter/nf_conntrack_l4proto.h | 3 -- net/netfilter/nf_conntrack_core.c | 1 net/netfilter/nf_conntrack_netlink.c | 6 ++-- net/netfilter/nf_conntrack_proto_dccp.c | 2 - net/netfilter/nf_conntrack_proto_sctp.c | 2 - net/netfilter/nf_conntrack_proto_tcp.c | 37 ++++++++++++--------------- 8 files changed, 26 insertions(+), 28 deletions(-) --- a/include/net/netfilter/nf_conntrack_helper.h 2009-02-19 13:45:26.103408544 -0800 +++ b/include/net/netfilter/nf_conntrack_helper.h 2009-02-19 13:45:56.136167400 -0800 @@ -34,7 +34,7 @@ struct nf_conntrack_helper void (*destroy)(struct nf_conn *ct); - int (*to_nlattr)(struct sk_buff *skb, const struct nf_conn *ct); + int (*to_nlattr)(struct sk_buff *skb, struct nf_conn *ct); unsigned int expect_class_max; }; --- a/include/net/netfilter/nf_conntrack_l4proto.h 2009-02-19 13:45:26.103408544 -0800 +++ b/include/net/netfilter/nf_conntrack_l4proto.h 2009-02-19 13:45:56.136167400 -0800 @@ -62,8 +62,7 @@ struct nf_conntrack_l4proto int (*print_conntrack)(struct seq_file *s, const struct nf_conn *); /* convert protoinfo to nfnetink attributes ...
Hi Patrick
Apparently we could not finish the removal of tcp_lock for 2.6.30 :(
Stephen suggested using a 4 bytes hole in struct nf_conntrack,
but this is ok only if sizeof(spinlock_t) <= 4 and 64 bit arches.
We could do an hybrid thing : use nf_conn.ct_general.lock if 64 bit arches
and sizeof(spinlock_t) <= 4.
Other cases would use a carefuly sized array of spinlocks...
Thank you
[PATCH] netfilter: finer grained nf_conn locking
Introduction of fine grained lock infrastructure for nf_conn.
If possible, we use a 32bit hole on 64bit arches.
Else we use a global array of hashed spinlocks, so we dont
change size of "struct nf_conn"
Get rid of central tcp_lock rwlock used in TCP conntracking
using this infrastructure for better performance on SMP.
"tbench 8" results on my 8 core machine (32bit kernel, with
conntracking on) : 2319 MB/s instead of 2284 MB/s
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
include/linux/skbuff.h | 9 ++-
include/net/netfilter/nf_conntrack_l4proto.h | 2
net/netfilter/nf_conntrack_core.c | 47 +++++++++++++++++
net/netfilter/nf_conntrack_netlink.c | 4 -
net/netfilter/nf_conntrack_proto_tcp.c | 35 +++++-------
5 files changed, 74 insertions(+), 23 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb1981f..c737f47 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -96,7 +96,14 @@ struct pipe_inode_info;
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
struct nf_conntrack {
- atomic_t use;
+ atomic_t use;
+#if BITS_PER_LONG == 64
+ /*
+ * On 64bit arches, we might use this 32bit hole for spinlock
+ * (if a spinlock_t fits)
+ */
+ int pad;
+#endif
};
#endif
diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
index ba32ed7..d66bea9 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ ...On Sat, 28 Mar 2009 17:55:38 +0100 I am not a fan of the array of locks. Sizing it is awkward and it is vulnerable to hash collisions. Let's see if there is another better way. --
On normal machines, (no debugging spinlocks), patch uses an embedded spinlock. We probably can use this even on 32bit kernels, considering previous patch removed the rcu_head (8 bytes on 32bit arches) from nf_conn :) if LOCKDEP is on, size of a spinlock is 64 bytes on x86_64. Adding a spinlock on each nf_conn would be too expensive. In this case, an array of spinlock is a good compromise, as done in IP route cache, tcp ehash, ... I agree sizing of this hash table is not pretty, and should be a generic kernel service (I wanted such service for futexes for example) --
On Mon, 30 Mar 2009 21:57:15 +0200 IMO having different locking based on lockdep and architecture is an invitation to future obscure problems. Perhaps some other locking method or shrinking ct entry would be better. --
I agree. Do people enable lockdep on production machines? Otherwise I'd say the size increase doesn't really matter. --
They do not.[1] [1] http://git.opensuse.org/?p=people/jblunck/kernel-source.git;a=blob;f=config/x86_64/def... --
On Mon, 6 Apr 2009 14:32:54 +0200 (CEST) IMHO If they enable lockdep, they can expect that the cost is non-zero. --
Is this an implicit request for me to try to resurrect the 32-core setup? rick jones --
Not at all, just to keep you informed of work in progress :) --
How do you achieve this impressing numbers? Is it against localhost? (10Gbit/s is max 1250 MB/s) Hilsen Jesper Brouer -- ------------------------------------------------------------------- MSc. Master of Computer Science Dept. of Computer Science, University of Copenhagen Author of http://www.adsl-optimizer.dk ------------------------------------------------------------------- --
tbench is a tcp test on localhost yes :) Good to test tcp stack without going to NIC hardware --
On Mon, 30 Mar 2009, Eric Dumazet wrote: > Jesper Dangaard Brouer a
Yes, my dev machine is a dual E5420 (8 cores) at 3.00 GHz Indeed, tbench is a mix of tcp and process scheduler test/bench --
On Mon, 30 Mar 2009, Eric Dumazet wrote: > Jesper Dangaard Brouer a
> Indeed, tbench is a mix of tcp and process scheduler test/bench If I were inclined to run networking tests (eg netperf) over loopback and wanted to maximize the trips up and down the protocol stack while minimizing scheduler overheads, I might be inclinded to configure --enable-burst with netperf and then run N/2 concurrent instances of something like: netperf -T M,N -t TCP_RR -l 30 -- -b 128 -D & where M and N were chosen to have each netperf and netserver pair bound to a pair of suitable cores, and the value in the -b option wash picked to maximize the CPU utilization on those cores. Then, in theory there would be little to no process to process context switching and presumably little in the way of scheduler effect. What I don't know is if such a setup would have both netperf and netserver each consuming 100% of a CPU or if one of them might "peg" before the other. If one did peg before the other, I might be inclined to switch to running N concurrent instances, with -T M to bind each netperf/netserver pair to the same core. There would then be the process to process context switching though it would be limited to "related" processes. rick jones --
On Mon, 30 Mar 2009, Eric Dumazet wrote: > Jesper Dangaard Brouer a
warning, because on my machine, cpu0 is on physical id 0, core 0 cpu1 is on physical id 1, core 0 cpu2 is on physical id 0, core 1 cpu3 is on physical id 1, core 1 cpu3 is on physical id 0, core 2 cpu4 is on physical id 1, core 2 cpu5 is on physical id 0, core 3 cpu6 is on physical id 1, core 3 So -T 0,1 might not do what you think... $ netperf -T 0,1 -l 120 -H localhost TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : cpu bind Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 120.00 7423.16 $ netperf -T 0,2 -l 120 -H localhost TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : cpu bind Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec --
Thanks. That makes a decent quantity of sense for a unidirectional test I suppose. If you get bored you might see what happens with a burst request/response test. ./configure --enbable-burst make netperf netperf -T 0,2 -t TCP_RR -c -C -- -b 64 -D altering the parms to either -T or -b as you find necessary/desireable. rick --
Thanks a lot, I didn't know that! (hint: cat /proc/cpuinfo | egrep -e 'processor|physical id|core id') I sort of got a "too large" speedup on the Xeon system: I want to use core 0 and 1 on physical CPU 0, for some strange reason CPU0 and CPU4 is these respective cores... netperf -T 0,4 -l 60 -H localhost TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET : demo : cpu bind Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 60.00 19797.42 Hilsen Jesper Brouer -- ------------------------------------------------------------------- MSc. Master of Computer Science Dept. of Computer Science, University of Copenhagen Author of http://www.adsl-optimizer.dk ------------------------------------------------------------------- --
From: Patrick McHardy <kaber@trash.net> Just as a side note, we generally frown upon the hash-array-of-spinlocks approach to scalability. If you need proof that in the long term it's suboptimal, note that: 1) this is Solaris's approach to locking scalability :-) 2) every such case in the kernel eventually gets transformed into RCU, a tree/trie based scheme, or some combination of the two So maybe for now it's ok, but keep in mind that eventually this is certain to change. :) --
This case might be different in that a normal firewall use case probably doesn't have more than 16 cpus, even than would be quite a lot. So for bigger machines this is probably more about keeping the "non-use" costs low. I'll keep it in mind though and I'm interested in seeing how it turns out in the long term :) --
On Thu, 19 Feb 2009 00:23:45 +0100 It doesn't help that spinlock_t keeps growing! In good old days, a spin lock could fit in one byte. --
