Folks - Under: ftp://ftp.netperf.org/iptable_scaling can be found netperf results and Caliper profiles for three scenarios on a 32-core, 1.6 GHz 'Montecito' rx8640 system. An rx8640 is what HP call a "cell based" system in that it is comprised of "cell boards" on which reside CPU and memory resources. In this case there are four cell boards, each with 4, dual-core Montecito processors and 1/4 of the overall RAM. The system was configured with a mix of cell-local and global interleaved memory, where the global interleave is on a cacheline (128 byte) boundary (IIRC). Total RAM in the system is 256 GB. The cells are joined via cross-bar connections. (numactl --hardware output is available under the URL above) There was an "I/O expander" connected to the system. This meant there were as many distinct PCI-X domains as there were cells, and every cell had a "local" set of PCI-X slots. Into those slots I placed four HP AD385A PCI-X 10Gbit Ethernet NICs - aka Neterion XFrame IIs. These were then connected to an HP ProCurve 5806 switch, which was in turn connected to three, 4P/16C, 2.3 GHz HP DL585 G5s, each of which had a pair of HP AD386A PCIe 10Gbit Ethernet NICs (Aka Chelsio T3C-based). They were running RHEL 5.2 I think. Each NIC was in either a PCI-X 2.0 266 MHz slot (rx8640) or a PCIe 1.mumble x8 slot (DL585 G5) The kernel is from DaveM's net-next tree ca last week, multiq enabled. The s2io driver is Neterion's out-of-tree version 2.0.36.15914 to get multiq support. It was loaded into the kernel via: insmod ./s2io.ko tx_steering_type=3 tx_fifo_num=8 There were then 8 tx queues and 8 rx queues per interface in the rx8640. The "setaffinity.txt" script was used to set the IRQ affinities to cores "closest" to the physical NIC. In all three tests all 32 cores went to 100% utilization. At least for all incense and porpoises. (there was some occasional idle reported by top on the full_iptables run) A set of 64, concurrent "burst mode" ...
Hi Rick, nice hardware you have :) Stephen had a patch to nuke read_lock() from iptables, using RCU and seqlocks. I hit this contention point even with low cost hardware, and quite standard application. I pinged him few days ago to try to finish the job with him, but it seems Stephen is busy at the moment. Then conntrack (tcp sessions) is awfull, since it uses a single rwlock_t tcp_lock that must be write_locked() for basically every handled tcp frame... How long is "not indefinitely" ? --
On Tue, 27 Jan 2009 00:10:44 +0100 Hey, I just got back from Linux Conf Au, haven't had time to catch up yet. It is on my list, after dealing with the other work related stuff. --
The system I am using is being borrowed under an open ended loan. However, my use of it can be thought of as being the "null process" in VMS - once anyone else wants it I have to get off of it. That said, I would guess that the chances of someone else trying to get that system are pretty small for the next four+ weeks. I had a similar system (PCIe I/O rather than PCI-X) for quite a few weeks before it got pulled-out from under. rick --
As Stephen will probably send the iptable part, I can do the tcp conntrack
improvment.
Let me know how it helps your last test (connection tracking enabled)
Thanks
[PATCH] netfilter: Get rid of central rwlock in tcp conntracking
TCP connection tracking suffers of huge contention on a global rwlock,
used to protect tcp conntracking state.
As each tcp conntrack state have no relations between each others, we
can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"
tcp_print_conntrack() dont need to lock anything to read ct->proto.tcp.state,
so speedup /proc/net/ip_conntrack as well.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>
---
diff --git a/include/linux/netfilter/nf_conntrack_tcp.h b/include/linux/netfilter/nf_conntrack_tcp.h
index a049df4..3dc72c6 100644
--- a/include/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/linux/netfilter/nf_conntrack_tcp.h
@@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
struct ip_ct_tcp
{
+ spinlock_t lock;
struct ip_ct_tcp_state seen[2]; /* connection parameters per direction */
u_int8_t state; /* state of the connection (enum tcp_conntrack) */
/* For detecting stale connections */
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index a1edb9c..1bc8fc1 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -26,9 +26,6 @@
#include <net/netfilter/nf_conntrack_ecache.h>
#include <net/netfilter/nf_log.h>
-/* Protects ct->proto.tcp */
-static DEFINE_RWLOCK(tcp_lock);
-
/* "Be conservative in what you do,
be liberal in what you accept from others."
If it's non-zero, we mark only out of window RST segments as INVALID. */
@@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct)
{
enum tcp_conntrack state;
- read_lock_bh(&tcp_lock);
state = ct->proto.tcp.state;
- read_unlock_bh(&tcp_lock);
return ...Thats an interesting test-case, but one lock per conntrack just for TCP tracking seems like overkill. We're trying to keep the conntrack stuctures as small as possible, so I'd prefer an array of spinlocks or something like that. --
Yes, this is wise. Current sizeof(struct nf_conn) is 220 (0xdc) on 32 bits,
probably rounded to 0xE0 by SLAB/SLUB. I will provide a new patch using
an array of say 512 spinlocks. (512 spinlocks use 2048 bytes if non
debuging spinlocks, that spread to 32 x 64bytes cache lines)
However I wonder if for very large number of cpus we should at least ask conntrack
to use hardware aligned "struct nf_conn" to avoid false sharing
We might also use a generic SLAB_HWCACHE_ALIGN_IFMANYCPUS flag if same tactic
could help other kmem_cache_create() users
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 90ce9dd..82332ce 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1167,8 +1167,10 @@ static int nf_conntrack_init_init_net(void)
nf_conntrack_max);
nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
- sizeof(struct nf_conn),
- 0, 0, NULL);
+ sizeof(struct nf_conn), 0,
+ num_possible_cpus() >= 32 ?
+ SLAB_HWCACHE_ALIGN : 0,
+ NULL);
if (!nf_conntrack_cachep) {
printk(KERN_ERR "Unable to create nf_conn slab cache\n");
ret = -ENOMEM;
--
Please find a new version of patch, using an array of spinlocks. I chose to define an array at nf_conntrack level, since tcp conntracking is one user of this array but we might find other users of this array. Thanks [PATCH] netfilter: Get rid of central rwlock in tcp conntracking TCP connection tracking suffers of huge contention on a global rwlock, used to protect tcp conntracking state. As each tcp conntrack state have no relations between each others, we can switch to fine grained lock. Using an array of spinlocks avoids enlarging size of connection tracking structures, yet giving reasonable fanout. tcp_print_conntrack() doesnt need to lock anything to read ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well. nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Reported-by: Rick Jones <rick.jones2@hp.com> --- include/net/netfilter/nf_conntrack.h | 32 +++++++++++++++++++++ net/netfilter/nf_conntrack_core.c | 10 ++++-- net/netfilter/nf_conntrack_proto_tcp.c | 34 ++++++++++------------- 3 files changed, 53 insertions(+), 23 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 2e0c536..288aff5 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -129,6 +129,38 @@ struct nf_conn struct rcu_head rcu; }; +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \ + defined(CONFIG_PROVE_LOCKING) + +/* + * We reserve 16 locks per cpu (typical cache line size is 64 bytes) + * maxed to 512 locks so that ct_hlock[] uses at most 2048 bytes of memory. + * (on lockdep we have a quite big spinlock_t, so keep the size down there) + */ +#ifdef CONFIG_LOCKDEP +#define CT_HASH_LOCK_SZ 64 +#elif NR_CPUS >= 32 +#define CT_HASH_LOCK_SZ 512 +#else +#define CT_HASH_LOCK_SZ (roundup_pow_of_two(NR_CPUS) * 16) +#endif + +extern spinlock_t ...
This looks good to me. Rick, would you like to give it a try? I'll convert the remaining conntrack protocols when applying it. --
I will give it a try and let folks know the results - unless told otherwise, I will ass-u-me I only need rerun the "full_iptables" test case. rick jones --
The runemomniagg2.sh script is still running, but the initial cycles profile suggests that the main change is converting the write_lock time into spinlock contention time with 78.39% of the cycles spent in ia64_spinlock_contention. When the script completes I'll upload the profiles and the netperf results to the same base URL as in the basenote under "contrack01/" rick jones --
The script completed - although at some point I hit an fd limit - I think I have an fd leak in netperf somewhere :( . Anyhow, there are still some netperfs that end-up kicking the bucket during the run - I suspect starvation because where in the other configs (no iptables, and empty iptables) each netperf seems to consume about 50% of a CPU - stands to reason - 64 netperfs, 32 cores - in the "full" case I see many netperfs consuming 100% of a CPU. My gut is thinking that one or more netperf contexts gets stuck doing something on behalf of others. There is also ksoftirqd time for a few of those processes. Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which does represent an improvement. rick jones PS - just to be certain that running-out of fd's didn't skew the results I'm rerunning the script with ulimit -n 10240 and will see if that changes the results any. And I suppose I need to go fd leak hunting in netperf omni code :( --
Thanks for the report If you have so much contention on spinlocks, maybe hash function is not good at all... hash = (unsigned long)ct; hash ^= hash >> 16; hash ^= hash >> 8; I ass-u-me you compiled your kernel with NR_CPUS=32 ? --
I believe so - CONFIG_NR_CPUS in .config is 64 anyway. If there is a more definitive place to check I'd be happy to look. rick --
By any chance, are you using SLAB instead of SLUB ? While running your netperf session, issuing : grep conntrack /proc/slabinfo Would help to know hash dispertion on your machine. --
Seems that way: sut42:~/net-next-2.6# grep SLAB .config CONFIG_SLAB=y CONFIG_SLABINFO=y # CONFIG_DEBUG_SLAB is not set sut42:~/net-next-2.6# grep SLUB .config # CONFIG_SLUB is not set The config file should be up under the base URL: ftp://ftp.netperf.org/iptable_scaling The system started as a Debian Lenny with a 2.6.26 kernel, and I did a make old sut42:~/net-next-2.6# grep conntrack /proc/slabinfo nf_conntrack_expect 0 0 240 66 1 : tunables 120 60 8 : slabdata 0 0 0 nf_conntrack 276 742 304 53 1 : tunables 54 27 8 : slabdata 14 14 0 about a minute later it is: grep conntrack /proc/slabinfo nf_conntrack_expect 0 0 240 66 1 : tunables 120 60 8 : slabdata 0 0 0 nf_conntrack 315 742 304 53 1 : tunables 54 27 8 : slabdata 14 14 0 after about another 40 to 60 seconds it was: grep conntrack /proc/slabinfo nf_conntrack_expect 0 0 240 66 1 : tunables 120 60 8 : slabdata 0 0 0 nf_conntrack 314 742 304 53 1 : tunables 54 27 8 : slabdata 14 14 7 --
Yes indeed you have a speedup, tcp conntracking is OK. You now hit the nf_conntrack_lock spinlock we have in generic conntrack code (net/netfilter/nf_conntrack_core.c) nf_ct_refresh_acct() for instance has to lock it. We really want some finer locking here. --
That looks more complicated since it requires to take multiple locks occasionally (f.i. hash insertion, potentially helper-related and expectation-related stuff), and there is the unconfirmed_list, where fine-grained locking can't really be used without changing it to a hash. --
Yes its more complicated, but look what we did in 2.6.29 for tcp/udp sockets, using RCU to have lockless lookups. Yes, we still take a lock when doing an insert or delete at socket bind/unbind time. We could keep a central nf_conntrack_lock to guard insertions/deletes from hash and unconfirmed_list. But *normal* packets that only need to change state of one particular connection could use RCU (without spinlock) to locate the conntrack, then lock the found conntrack to perform all state changes. --
Well... RCU is already used by conntrack :) Maybe only __nf_ct_refresh_acct() needs not taking nf_conntrack_lock --
While doing oprofile tests I noticed two loops are not properly unrolled by gcc
Using a hand coded unrolled loop provides nice speedup : ipt_do_table
credited of 2.52 % of cpu instead of 3.29 % in tbench.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
net/ipv4/netfilter/ip_tables.c | 32 +++++++++++++++++--------------
1 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index ef8b6ca..7bdb2cf 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -65,6 +65,22 @@ do { \
#define inline
#endif
+static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask)
+{
+ const unsigned long *a = (const unsigned long *)_a;
+ const unsigned long *b = (const unsigned long *)_b;
+ const unsigned long *mask = (const unsigned long *)_mask;
+ unsigned long ret;
+
+ ret = (a[0] ^ b[0]) & mask[0];
+ ret |= (a[1] ^ b[1]) & mask[1];
+ if (IFNAMSIZ > 2 * sizeof(unsigned long))
+ ret |= (a[2] ^ b[2]) & mask[2];
+ if (IFNAMSIZ > 3 * sizeof(unsigned long))
+ ret |= (a[3] ^ b[3]) & mask[3];
+ return ret;
+}
+
/*
We keep a set of rules for each CPU, so we can avoid write-locking
them in the softirq when updating the counters and therefore
@@ -83,7 +99,6 @@ ip_packet_match(const struct iphdr *ip,
const struct ipt_ip *ipinfo,
int isfrag)
{
- size_t i;
unsigned long ret;
#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
@@ -103,13 +118,7 @@ ip_packet_match(const struct iphdr *ip,
return false;
}
- /* Look for ifname matches; this should unroll nicely. */
- for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
- ret |= (((const unsigned long *)indev)[i]
- ^ ((const unsigned long *)ipinfo->iniface)[i])
- & ((const unsigned long *)ipinfo->iniface_mask)[i];
- }
-
+ ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask);
if (FWINV(ret != 0, IPT_INV_VIA_IN)) ...That's because nobody passed -funroll-loops. Did you try that for That will silently break for IFNAMSIZ >= 4*sizeof(unsigned long) You should add a dummy loop for that or at least a BUILD_BUG_ON -Andi -- ak@linux.intel.com -- Speaking for myself only. --
I dont want to unroll all loops, only those two :)
I wish gcc (4.3.2 here) was litle bit smarter :(
Without using -funroll-loops
size ipv4/netfilter/ip_tables.o
text data bss dec hex filename
6424 368 16 6808 1a98 net/ipv4/netfilter/ip_tables.o
With -funroll-loops :
size ipv4/netfilter/ip_tables.o
text data bss dec hex filename
7144 368 16 7528 1d68 net/ipv4/netfilter/ip_tables.o
With my patch and no -funroll-loops
text data bss dec hex filename
It will also break the day we port linux to a 128 bits machine :)
Thanks Andi
(By the way, I still use the patch on arch/x86/oprofile/op_model_ppro.c
to have a working oprofile on my dev machine...)
[PATCH] netfilter: unfold two critical loops in ip_packet_match()
While doing oprofile tests I noticed two loops are not properly unrolled by gcc
Using hand coded unrolled loop provides nice speedup : ipt_do_table
credited of 2.52 % of cpu instead of 3.29 % in tbench, for a small
text size increase (62 bytes for both loops)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
net/ipv4/netfilter/ip_tables.c | 34 ++++++++++++++++++-------------
1 files changed, 20 insertions(+), 14 deletions(-)
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index ef8b6ca..9298d0a 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -65,6 +65,24 @@ do { \
#define inline
#endif
+static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask)
+{
+ const unsigned long *a = (const unsigned long *)_a;
+ const unsigned long *b = (const unsigned long *)_b;
+ const unsigned long *mask = (const unsigned long *)_mask;
+ unsigned long ret;
+
+ ret = (a[0] ^ b[0]) & mask[0];
+ if (IFNAMSIZ > sizeof(unsigned long))
+ ret |= (a[1] ^ b[1]) & mask[1];
+ if (IFNAMSIZ > 2 * sizeof(unsigned long))
+ ret |= (a[2] ^ b[2]) & mask[2];
+ if (IFNAMSIZ ...gcc 4.4 will have a way to do that per function, but earlier you would need to move it to a separate file and specify the option only for that. Doing so would be still a good idea compared to your patch because the code will be cleaner and might be more adaptable to future architectures It cannot do much without profile feedback because Yes I know, sorry for that. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
So... you suggest me to split file, and use a "-funroll-loops", that might doing strange things on some arches / compilers... This is far too complicated and risky imho. Check compare_ether_addr_64bits() definition in include/linux/etherdevice.h for a truly unreadable code :) --
I don't think it will. -funroll-loops is pretty clear defined. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
The interface name matching has shown up in profiles forever though and we've actually already tried to optimize it IIRC. Eric, I'm trying to keep all the *tables files synchronized, could you send me a patch updating the other ones as well please? --
While doing this, I found arp_tables is still using loop using byte operations. Also, I could not find how iniface_mask[], outiface_mask[], iniface[] and outiface[] were forced to long word alignment ... (in struct ipt_ip, struct ip6t_ip6, struct arpt_arp) --
In case of IPv4 and IPv6 they are already suitable aligned, it simply performing the comparison in unsigned long quantities. struct arpt_arp unfortunately doesn't properly align the interface names, so we need to continue to do byte-wise comparisons. --
I see, but #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS can help here ;)
ifname_compare() should be static in three files (ipv4_ip_tables, ipv6_ip_tables and arp_tables),
since only arp_tables variant has the alignement problem.
[PATCH] netfilter: unfold two critical loops in arp_packet_match()
x86 and powerpc can perform long word accesses in an efficient maner.
We can use this to unroll two loops in arp_packet_match(), to
perform arithmetic on long words instead of bytes. This is a win
on x86_64 for example.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
net/ipv4/netfilter/arp_tables.c | 44 +++++++++++++++++++++++-------
1 files changed, 34 insertions(+), 10 deletions(-)
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 7ea88b6..b5db463 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -73,6 +73,36 @@ static inline int arp_devaddr_compare(const struct arpt_devaddr_info *ap,
return (ret != 0);
}
+/*
+ * Unfortunatly, _b and _mask are not aligned to an int (or long int)
+ * Some arches dont care, unrolling the loop is a win on them.
+ */
+static unsigned long ifname_compare(const char *_a, const char *_b, const char *_mask)
+{
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ const unsigned long *a = (const unsigned long *)_a;
+ const unsigned long *b = (const unsigned long *)_b;
+ const unsigned long *mask = (const unsigned long *)_mask;
+ unsigned long ret;
+
+ ret = (a[0] ^ b[0]) & mask[0];
+ if (IFNAMSIZ > sizeof(unsigned long))
+ ret |= (a[1] ^ b[1]) & mask[1];
+ if (IFNAMSIZ > 2 * sizeof(unsigned long))
+ ret |= (a[2] ^ b[2]) & mask[2];
+ if (IFNAMSIZ > 3 * sizeof(unsigned long))
+ ret |= (a[3] ^ b[3]) & mask[3];
+ BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long));
+#else
+ unsigned long ret = 0;
+ int i;
+
+ for (i = 0; i < IFNAMSIZ; i++)
+ ret |= (_a[i] ^ _b[i]) & _mask[i];
+#endif
+ return ret;
+}
+
/* Returns whether packet matches rule or not. */
...get_unaligned() would work as well I guess. But we don't seem to This looks good to me. Applied, thanks. --
Hi Patrick
Please find following patch to prepare next one (loop unrolling)
I am not sure xt_physdev_info is aligned, either on an int or a long,
so please check my assertion before applying :)
Thank you
[PATCH] netfilter: xt_physdev fixes
1) physdev_mt() incorrectly assumes nulldevname[] is aligned on an int
2) It also uses word comparisons, while it could use long word ones.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index 1bcdfc1..4b13ef7 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -24,9 +24,9 @@ static bool
physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par)
{
int i;
- static const char nulldevname[IFNAMSIZ];
+ static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long))));
const struct xt_physdev_info *info = par->matchinfo;
- bool ret;
+ unsigned long ret;
const char *indev, *outdev;
const struct nf_bridge_info *nf_bridge;
@@ -68,10 +68,10 @@ physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par)
if (!(info->bitmask & XT_PHYSDEV_OP_IN))
goto match_outdev;
indev = nf_bridge->physindev ? nf_bridge->physindev->name : nulldevname;
- for (i = 0, ret = false; i < IFNAMSIZ/sizeof(unsigned int); i++) {
- ret |= (((const unsigned int *)indev)[i]
- ^ ((const unsigned int *)info->physindev)[i])
- & ((const unsigned int *)info->in_mask)[i];
+ for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
+ ret |= (((const unsigned long *)indev)[i]
+ ^ ((const unsigned long *)info->physindev)[i])
+ & ((const unsigned long *)info->in_mask)[i];
}
if (!ret ^ !(info->invert & XT_PHYSDEV_OP_IN))
@@ -82,13 +82,12 @@ match_outdev:
return true;
outdev = nf_bridge->physoutdev ?
nf_bridge->physoutdev->name : nulldevname;
- for (i = 0, ret = false; i < IFNAMSIZ/sizeof(unsigned int); i++) {
- ret |= (((const unsigned int *)outdev)[i]
- ^ ((const unsigned ...Yes, the private structures can assume alignment suitable for Applied, thanks. --
xt_physdev netfilter module can use an ifname_compare() helper
so that two loops are unfolded.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
net/netfilter/xt_physdev.c | 32 +++++++++++++++++++++-----------
1 files changed, 21 insertions(+), 11 deletions(-)
diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index 4b13ef7..44a234e 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -20,10 +20,27 @@ MODULE_DESCRIPTION("Xtables: Bridge physical device match");
MODULE_ALIAS("ipt_physdev");
MODULE_ALIAS("ip6t_physdev");
+static unsigned long ifname_compare(const char *_a, const char *_b, const char *_mask)
+{
+ const unsigned long *a = (const unsigned long *)_a;
+ const unsigned long *b = (const unsigned long *)_b;
+ const unsigned long *mask = (const unsigned long *)_mask;
+ unsigned long ret;
+
+ ret = (a[0] ^ b[0]) & mask[0];
+ if (IFNAMSIZ > sizeof(unsigned long))
+ ret |= (a[1] ^ b[1]) & mask[1];
+ if (IFNAMSIZ > 2 * sizeof(unsigned long))
+ ret |= (a[2] ^ b[2]) & mask[2];
+ if (IFNAMSIZ > 3 * sizeof(unsigned long))
+ ret |= (a[3] ^ b[3]) & mask[3];
+ BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long));
+ return ret;
+}
+
static bool
physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par)
{
- int i;
static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long))));
const struct xt_physdev_info *info = par->matchinfo;
unsigned long ret;
@@ -68,11 +85,7 @@ physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par)
if (!(info->bitmask & XT_PHYSDEV_OP_IN))
goto match_outdev;
indev = nf_bridge->physindev ? nf_bridge->physindev->name : nulldevname;
- for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
- ret |= (((const unsigned long *)indev)[i]
- ^ ((const unsigned long *)info->physindev)[i])
- & ((const unsigned long *)info->in_mask)[i];
- }
+ ret = ifname_compare(indev, info->physindev, info->in_mask);
if (!ret ^ ...ip6_tables netfilter module can use an ifname_compare() helper
so that two loops are unfolded.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
net/ipv6/netfilter/ip6_tables.c | 33 +++++++++++++++++++-----------
1 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index a33485d..d64594b 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -89,6 +89,25 @@ ip6t_ext_hdr(u8 nexthdr)
(nexthdr == IPPROTO_DSTOPTS) );
}
+static unsigned long ifname_compare(const char *_a, const char *_b,
+ const unsigned char *_mask)
+{
+ const unsigned long *a = (const unsigned long *)_a;
+ const unsigned long *b = (const unsigned long *)_b;
+ const unsigned long *mask = (const unsigned long *)_mask;
+ unsigned long ret;
+
+ ret = (a[0] ^ b[0]) & mask[0];
+ if (IFNAMSIZ > sizeof(unsigned long))
+ ret |= (a[1] ^ b[1]) & mask[1];
+ if (IFNAMSIZ > 2 * sizeof(unsigned long))
+ ret |= (a[2] ^ b[2]) & mask[2];
+ if (IFNAMSIZ > 3 * sizeof(unsigned long))
+ ret |= (a[3] ^ b[3]) & mask[3];
+ BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long));
+ return ret;
+}
+
/* Returns whether matches rule or not. */
/* Performance critical - called for every packet */
static inline bool
@@ -99,7 +118,6 @@ ip6_packet_match(const struct sk_buff *skb,
unsigned int *protoff,
int *fragoff, bool *hotdrop)
{
- size_t i;
unsigned long ret;
const struct ipv6hdr *ipv6 = ipv6_hdr(skb);
@@ -120,12 +138,7 @@ ip6_packet_match(const struct sk_buff *skb,
return false;
}
- /* Look for ifname matches; this should unroll nicely. */
- for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
- ret |= (((const unsigned long *)indev)[i]
- ^ ((const unsigned long *)ip6info->iniface)[i])
- & ((const unsigned long *)ip6info->iniface_mask)[i];
- }
+ ret = ifname_compare(indev, ip6info->iniface, ip6info->iniface_mask);
if (FWINV(ret != 0, ...Also applied, thanks Eric. Are you going to send a new patch for ip_tables.c or should I just take the old one? --
While doing oprofile tests I noticed two loops are not properly unrolled by gcc
Using a hand coded unrolled loop provides nice speedup : ipt_do_table
credited of 2.52 % of cpu instead of 3.29 % in tbench.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 08cde5b..e5294ae 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -74,6 +74,25 @@ do { \
Hence the start of any table is given by get_table() below. */
+static unsigned long ifname_compare(const char *_a, const char *_b,
+ const unsigned char *_mask)
+{
+ const unsigned long *a = (const unsigned long *)_a;
+ const unsigned long *b = (const unsigned long *)_b;
+ const unsigned long *mask = (const unsigned long *)_mask;
+ unsigned long ret;
+
+ ret = (a[0] ^ b[0]) & mask[0];
+ if (IFNAMSIZ > sizeof(unsigned long))
+ ret |= (a[1] ^ b[1]) & mask[1];
+ if (IFNAMSIZ > 2 * sizeof(unsigned long))
+ ret |= (a[2] ^ b[2]) & mask[2];
+ if (IFNAMSIZ > 3 * sizeof(unsigned long))
+ ret |= (a[3] ^ b[3]) & mask[3];
+ BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long));
+ return ret;
+}
+
/* Returns whether matches rule or not. */
/* Performance critical - called for every packet */
static inline bool
@@ -83,7 +102,6 @@ ip_packet_match(const struct iphdr *ip,
const struct ipt_ip *ipinfo,
int isfrag)
{
- size_t i;
unsigned long ret;
#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
@@ -103,12 +121,7 @@ ip_packet_match(const struct iphdr *ip,
return false;
}
- /* Look for ifname matches; this should unroll nicely. */
- for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
- ret |= (((const unsigned long *)indev)[i]
- ^ ((const unsigned long *)ipinfo->iniface)[i])
- & ((const unsigned long *)ipinfo->iniface_mask)[i];
- }
+ ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask);
if (FWINV(ret != 0, ...The lock is currently used to avoid timer update races. Martin has two old patches two remove it, but they require changes to the timer code to support a mod_timer variant that doesn't enable an inactive timer: http://people.netfilter.org/gandalf/performance/patches/mod_timer_noact http://people.netfilter.org/gandalf/performance/patches/__nf_ct_refresh_acct-locking --
On Tue, 27 Jan 2009 00:10:44 +0100 Does anyone have a fix for this bottleneck. --
