Re: [PATCH] netfilter: unfold two critical loops in ip_packet_match()

Previous thread: [PATCH 1/1] Fix getsockname (ipv4/ipv6) by John Reumann on Friday, January 23, 2009 - 2:45 pm. (4 messages)

Next thread: [patch 2.6.28-rc8-davinci-git] dm355: dm9000 by Narnakaje, Snehaprabha on Monday, January 26, 2009 - 3:36 pm. (3 messages)
From: Rick Jones
Date: Monday, January 26, 2009 - 3:15 pm

Folks -

Under:

ftp://ftp.netperf.org/iptable_scaling

can be found netperf results and Caliper profiles for three scenarios on a 
32-core, 1.6 GHz 'Montecito' rx8640 system.  An rx8640 is what HP call a "cell 
based" system in that it is comprised of "cell boards" on which reside CPU and 
memory resources.  In this case there are four cell boards, each with 4, 
dual-core Montecito processors and 1/4 of the overall RAM.  The system was 
configured with a mix of cell-local and global interleaved memory, where the 
global interleave is on a cacheline (128 byte) boundary (IIRC).  Total RAM in the 
system is 256 GB.  The cells are joined via cross-bar connections. (numactl 
--hardware output is available under the URL above)

There was an "I/O expander" connected to the system.  This meant there were as 
many distinct PCI-X domains as there were cells, and every cell had a "local" set 
of PCI-X slots.

Into those slots I placed four HP AD385A PCI-X 10Gbit Ethernet NICs - aka 
Neterion XFrame IIs.  These were then connected to an HP ProCurve 5806 switch, 
which was in turn connected to three, 4P/16C, 2.3 GHz HP DL585 G5s, each of which 
had a pair of HP AD386A PCIe 10Gbit Ethernet NICs (Aka Chelsio T3C-based).  They 
were running RHEL 5.2 I think.  Each NIC was in either a PCI-X 2.0 266 MHz slot 
(rx8640) or a PCIe 1.mumble x8 slot (DL585 G5)

The kernel is from DaveM's net-next tree ca last week, multiq enabled.  The s2io 
driver is Neterion's out-of-tree version 2.0.36.15914 to get multiq support.  It 
was loaded into the kernel via:

insmod ./s2io.ko tx_steering_type=3 tx_fifo_num=8

There were then 8 tx queues and 8 rx queues per interface in the rx8640.  The 
"setaffinity.txt" script was used to set the IRQ affinities to cores "closest" to 
the physical NIC. In all three tests all 32 cores went to 100% utilization. At 
least for all incense and porpoises. (there was some occasional idle reported by 
top on the full_iptables run)

A set of 64, concurrent "burst mode" ...
From: Eric Dumazet
Date: Monday, January 26, 2009 - 4:10 pm

Hi Rick, nice hardware you have :)

Stephen had a patch to nuke read_lock() from iptables, using RCU and seqlocks.
I hit this contention point even with low cost hardware, and quite standard application.

I pinged him few days ago to try to finish the job with him, but it seems Stephen
is busy at the moment.

Then conntrack (tcp sessions) is awfull, since it uses a single rwlock_t tcp_lock
 that must be write_locked() for basically every handled tcp frame...

How long is "not indefinitely" ? 


--

From: Stephen Hemminger
Date: Monday, January 26, 2009 - 4:14 pm

On Tue, 27 Jan 2009 00:10:44 +0100

Hey, I just got back from Linux Conf Au, haven't had time to catch up yet.
It is on my list, after dealing with the other work related stuff.
--

From: Rick Jones
Date: Monday, January 26, 2009 - 4:19 pm

The system I am using is being borrowed under an open ended loan.  However, my 
use of it can be thought of as being the "null process" in VMS - once anyone else 
wants it I have to get off of it.

That said, I would guess that the chances of someone else trying to get that 
system are pretty small for the next four+ weeks.  I had a similar system (PCIe 
I/O rather than PCI-X) for quite a few weeks before it got pulled-out from under.

rick
--

From: Eric Dumazet
Date: Tuesday, January 27, 2009 - 2:10 am

As Stephen will probably send the iptable part, I can do the tcp conntrack
improvment.

Let me know how it helps your last test (connection tracking enabled)

Thanks

[PATCH] netfilter: Get rid of central rwlock in tcp conntracking

TCP connection tracking suffers of huge contention on a global rwlock,
used to protect tcp conntracking state.
As each tcp conntrack state have no relations between each others, we
can switch to fine grained lock, using a spinlock per "struct ip_ct_tcp"

tcp_print_conntrack() dont need to lock anything to read ct->proto.tcp.state,
so speedup /proc/net/ip_conntrack as well.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>
---
diff --git a/include/linux/netfilter/nf_conntrack_tcp.h b/include/linux/netfilter/nf_conntrack_tcp.h
index a049df4..3dc72c6 100644
--- a/include/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/linux/netfilter/nf_conntrack_tcp.h
@@ -50,6 +50,7 @@ struct ip_ct_tcp_state {
 
 struct ip_ct_tcp
 {
+	spinlock_t	lock;
 	struct ip_ct_tcp_state seen[2];	/* connection parameters per direction */
 	u_int8_t	state;		/* state of the connection (enum tcp_conntrack) */
 	/* For detecting stale connections */
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index a1edb9c..1bc8fc1 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -26,9 +26,6 @@
 #include <net/netfilter/nf_conntrack_ecache.h>
 #include <net/netfilter/nf_log.h>
 
-/* Protects ct->proto.tcp */
-static DEFINE_RWLOCK(tcp_lock);
-
 /* "Be conservative in what you do,
     be liberal in what you accept from others."
     If it's non-zero, we mark only out of window RST segments as INVALID. */
@@ -297,9 +294,7 @@ static int tcp_print_conntrack(struct seq_file *s, const struct nf_conn *ct)
 {
 	enum tcp_conntrack state;
 
-	read_lock_bh(&tcp_lock);
 	state = ct->proto.tcp.state;
-	read_unlock_bh(&tcp_lock);
 
 	return ...
From: Patrick McHardy
Date: Tuesday, January 27, 2009 - 2:15 am

Thats an interesting test-case, but one lock per conntrack just for
TCP tracking seems like overkill. We're trying to keep the conntrack
stuctures as small as possible, so I'd prefer an array of spinlocks
or something like that.

--

From: Eric Dumazet
Date: Tuesday, January 27, 2009 - 4:29 am

Yes, this is wise. Current sizeof(struct nf_conn) is 220 (0xdc) on 32 bits,
probably rounded to 0xE0 by SLAB/SLUB. I will provide a new patch using
an array of say 512 spinlocks. (512 spinlocks use 2048 bytes if non
debuging spinlocks, that spread to 32 x 64bytes cache lines)

However I wonder if for very large number of cpus we should at least ask conntrack 
to use hardware aligned "struct nf_conn" to avoid false sharing

We might also use a generic SLAB_HWCACHE_ALIGN_IFMANYCPUS flag if same tactic
could help other kmem_cache_create() users


diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 90ce9dd..82332ce 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1167,8 +1167,10 @@ static int nf_conntrack_init_init_net(void)
 	       nf_conntrack_max);
 
 	nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
-						sizeof(struct nf_conn),
-						0, 0, NULL);
+						sizeof(struct nf_conn), 0,
+						num_possible_cpus() >= 32 ?
+							SLAB_HWCACHE_ALIGN : 0,
+						NULL);
 	if (!nf_conntrack_cachep) {
 		printk(KERN_ERR "Unable to create nf_conn slab cache\n");
 		ret = -ENOMEM;

--

From: Patrick McHardy
Date: Tuesday, January 27, 2009 - 4:37 am

[Empty message]
From: Eric Dumazet
Date: Tuesday, January 27, 2009 - 9:23 am

Please find a new version of patch, using an array of spinlocks.

I chose to define an array at nf_conntrack level, since tcp
conntracking is one user of this array but we might find
other users of this array.

Thanks

[PATCH] netfilter: Get rid of central rwlock in tcp conntracking

TCP connection tracking suffers of huge contention on a global rwlock,
used to protect tcp conntracking state.

As each tcp conntrack state have no relations between each others, we
can switch to fine grained lock. Using an array of spinlocks avoids
enlarging size of connection tracking structures, yet giving reasonable
fanout.

tcp_print_conntrack() doesnt need to lock anything to read
ct->proto.tcp.state, so speedup /proc/net/ip_conntrack as well.

nf_conntrack_hash_rnd_initted & nf_conntrack_hash_rnd declared read_mostly

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reported-by: Rick Jones <rick.jones2@hp.com>
---
 include/net/netfilter/nf_conntrack.h   |   32 +++++++++++++++++++++
 net/netfilter/nf_conntrack_core.c      |   10 ++++--
 net/netfilter/nf_conntrack_proto_tcp.c |   34 ++++++++++-------------
 3 files changed, 53 insertions(+), 23 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 2e0c536..288aff5 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -129,6 +129,38 @@ struct nf_conn
 	struct rcu_head rcu;
 };
 
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
+	defined(CONFIG_PROVE_LOCKING)
+
+/*
+ * We reserve 16 locks per cpu (typical cache line size is 64 bytes)
+ * maxed to 512 locks so that ct_hlock[] uses at most 2048 bytes of memory.
+ * (on lockdep we have a quite big spinlock_t, so keep the size down there)
+ */
+#ifdef CONFIG_LOCKDEP
+#define CT_HASH_LOCK_SZ 64
+#elif NR_CPUS >= 32
+#define CT_HASH_LOCK_SZ 512
+#else
+#define CT_HASH_LOCK_SZ (roundup_pow_of_two(NR_CPUS) * 16)
+#endif
+
+extern spinlock_t ...
From: Patrick McHardy
Date: Tuesday, January 27, 2009 - 10:33 am

This looks good to me. Rick, would you like to give it a try?

I'll convert the remaining conntrack protocols when applying it.
--

From: Rick Jones
Date: Tuesday, January 27, 2009 - 11:02 am

I will give it a try and let folks know the results - unless told otherwise, I 
will ass-u-me I only need rerun the "full_iptables" test case.

rick jones

--

From: Rick Jones
Date: Tuesday, January 27, 2009 - 12:09 pm

The runemomniagg2.sh script is still running, but the initial cycles profile 
suggests that the main change is converting the write_lock time into spinlock 
contention time with 78.39% of the cycles spent in ia64_spinlock_contention. 
When the script completes I'll upload the profiles and the netperf results to the 
same base URL as in the basenote under "contrack01/"

rick jones
--

From: Rick Jones
Date: Tuesday, January 27, 2009 - 12:24 pm

The script completed - although at some point I hit an fd limit - I think I have 
an fd leak in netperf somewhere :( .

Anyhow, there are still some netperfs that end-up kicking the bucket during the 
run - I suspect starvation because where in the other configs (no iptables, and 
empty iptables) each netperf seems to consume about 50% of a CPU - stands to 
reason - 64 netperfs, 32 cores - in the "full" case I see many netperfs consuming 
100% of a CPU.  My gut is thinking that one or more netperf contexts gets stuck 
doing something on behalf of others.  There is also ksoftirqd time for a few of 
those processes.

Anyhow, the spread on trans/s/netperf is now 600 to 500 or 6000, which does 
represent an improvement.

rick jones

PS - just to be certain that running-out of fd's didn't skew the results I'm 
rerunning the script with ulimit -n 10240 and will see if that changes the 
results any.  And I suppose I need to go fd leak hunting in netperf omni code :(
--

From: Eric Dumazet
Date: Tuesday, January 27, 2009 - 3:17 pm

Thanks for the report

If you have so much contention on spinlocks, maybe hash function is not 
good at all...

hash = (unsigned long)ct;
hash ^= hash >> 16;
hash ^= hash >> 8;

I ass-u-me you compiled your kernel with NR_CPUS=32 ?

--

From: Rick Jones
Date: Tuesday, January 27, 2009 - 3:29 pm

I believe so - CONFIG_NR_CPUS in .config is 64 anyway.  If there is a more 
definitive place to check I'd be happy to look.

rick

--

From: Eric Dumazet
Date: Tuesday, January 27, 2009 - 3:34 pm

By any chance, are you using SLAB instead of SLUB ?

While running your netperf session, issuing :

grep conntrack /proc/slabinfo

Would help to know hash dispertion on your machine.


--

From: Rick Jones
Date: Tuesday, January 27, 2009 - 3:43 pm

Seems that way:

sut42:~/net-next-2.6# grep SLAB .config
CONFIG_SLAB=y
CONFIG_SLABINFO=y
# CONFIG_DEBUG_SLAB is not set
sut42:~/net-next-2.6# grep SLUB .config
# CONFIG_SLUB is not set

The config file should be up under the base URL:

ftp://ftp.netperf.org/iptable_scaling

The system started as a Debian Lenny with a 2.6.26 kernel, and I did a make old 


sut42:~/net-next-2.6# grep conntrack /proc/slabinfo
nf_conntrack_expect      0      0    240   66    1 : tunables  120   60    8 : 
slabdata      0      0      0
nf_conntrack         276    742    304   53    1 : tunables   54   27    8 : 
slabdata     14     14      0

about a minute later it is:

grep conntrack /proc/slabinfo
nf_conntrack_expect      0      0    240   66    1 : tunables  120   60    8 : 
slabdata      0      0      0
nf_conntrack         315    742    304   53    1 : tunables   54   27    8 : 
slabdata     14     14      0

after about another 40 to 60 seconds it was:

grep conntrack /proc/slabinfo
nf_conntrack_expect      0      0    240   66    1 : tunables  120   60    8 : 
slabdata      0      0      0
nf_conntrack         314    742    304   53    1 : tunables   54   27    8 : 
slabdata     14     14      7

--

From: Eric Dumazet
Date: Wednesday, January 28, 2009 - 6:55 am

Yes indeed you have a speedup, tcp conntracking is OK.

You now hit the nf_conntrack_lock spinlock we have in generic conntrack code 
(net/netfilter/nf_conntrack_core.c)

nf_ct_refresh_acct() for instance has to lock it.

We really want some finer locking here.

--

From: Patrick McHardy
Date: Wednesday, January 28, 2009 - 9:25 am

That looks more complicated since it requires to take multiple locks
occasionally (f.i. hash insertion, potentially helper-related and
expectation-related stuff), and there is the unconfirmed_list, where
fine-grained locking can't really be used without changing it to
a hash.
--

From: Eric Dumazet
Date: Wednesday, January 28, 2009 - 10:07 am

Yes its more complicated, but look what we did in 2.6.29 for tcp/udp
 sockets, using RCU to have lockless lookups.
Yes, we still take a lock when doing an insert or delete at socket
bind/unbind time.

We could keep a central nf_conntrack_lock to guard insertions/deletes
from hash and unconfirmed_list.

But *normal* packets that only need to change state of one particular
connection could use RCU (without spinlock) to locate the conntrack,
then lock the found conntrack to perform all state changes.


--

From: Eric Dumazet
Date: Wednesday, January 28, 2009 - 10:34 am

Well... RCU is already used by conntrack :)

Maybe only __nf_ct_refresh_acct() needs not taking nf_conntrack_lock



--

From: Eric Dumazet
Date: Thursday, January 29, 2009 - 8:31 am

While doing oprofile tests I noticed two loops are not properly unrolled by gcc

Using a hand coded unrolled loop provides nice speedup : ipt_do_table
credited of 2.52 % of cpu instead of 3.29 % in tbench.


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 net/ipv4/netfilter/ip_tables.c |   32 +++++++++++++++++--------------
 1 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index ef8b6ca..7bdb2cf 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -65,6 +65,22 @@ do {								\
 #define inline
 #endif
 
+static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask)
+{
+	const unsigned long *a = (const unsigned long *)_a;
+	const unsigned long *b = (const unsigned long *)_b;
+	const unsigned long *mask = (const unsigned long *)_mask;
+	unsigned long ret;
+
+	ret = (a[0] ^ b[0]) & mask[0];
+	ret |= (a[1] ^ b[1]) & mask[1];
+	if (IFNAMSIZ > 2 * sizeof(unsigned long))
+		ret |= (a[2] ^ b[2]) & mask[2];
+	if (IFNAMSIZ > 3 * sizeof(unsigned long))
+		ret |= (a[3] ^ b[3]) & mask[3];
+	return ret;
+}
+
 /*
    We keep a set of rules for each CPU, so we can avoid write-locking
    them in the softirq when updating the counters and therefore
@@ -83,7 +99,6 @@ ip_packet_match(const struct iphdr *ip,
 		const struct ipt_ip *ipinfo,
 		int isfrag)
 {
-	size_t i;
 	unsigned long ret;
 
 #define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
@@ -103,13 +118,7 @@ ip_packet_match(const struct iphdr *ip,
 		return false;
 	}
 
-	/* Look for ifname matches; this should unroll nicely. */
-	for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
-		ret |= (((const unsigned long *)indev)[i]
-			^ ((const unsigned long *)ipinfo->iniface)[i])
-			& ((const unsigned long *)ipinfo->iniface_mask)[i];
-	}
-
+	ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask);
 	if (FWINV(ret != 0, IPT_INV_VIA_IN)) ...
From: Andi Kleen
Date: Friday, January 30, 2009 - 8:47 am

That's because nobody passed -funroll-loops. Did you try that for

That will silently break for IFNAMSIZ >= 4*sizeof(unsigned long)
You should add a dummy loop for that or at least a BUILD_BUG_ON

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Eric Dumazet
Date: Friday, January 30, 2009 - 9:54 am

I dont want to unroll all loops, only those two :)
I wish gcc (4.3.2 here) was litle bit smarter :(

Without using -funroll-loops

size ipv4/netfilter/ip_tables.o
   text    data     bss     dec     hex filename
   6424     368      16    6808    1a98 net/ipv4/netfilter/ip_tables.o

With -funroll-loops :

size ipv4/netfilter/ip_tables.o
   text    data     bss     dec     hex filename
   7144     368      16    7528    1d68 net/ipv4/netfilter/ip_tables.o

With my patch and no -funroll-loops
   text    data     bss     dec     hex filename

It will also break the day we port linux to a 128 bits machine :)

Thanks Andi

(By the way, I still use the patch on arch/x86/oprofile/op_model_ppro.c
to have a working oprofile on my dev machine...)


[PATCH] netfilter: unfold two critical loops in ip_packet_match()

While doing oprofile tests I noticed two loops are not properly unrolled by gcc

Using hand coded unrolled loop provides nice speedup : ipt_do_table
credited of 2.52 % of cpu instead of 3.29 % in tbench, for a small
text size increase (62 bytes for both loops)


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 net/ipv4/netfilter/ip_tables.c |   34 ++++++++++++++++++-------------
 1 files changed, 20 insertions(+), 14 deletions(-)


diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index ef8b6ca..9298d0a 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -65,6 +65,24 @@ do {								\
 #define inline
 #endif
 
+static unsigned long ifname_compare(const void *_a, const void *_b, const void *_mask)
+{
+	const unsigned long *a = (const unsigned long *)_a;
+	const unsigned long *b = (const unsigned long *)_b;
+	const unsigned long *mask = (const unsigned long *)_mask;
+	unsigned long ret;
+
+	ret = (a[0] ^ b[0]) & mask[0];
+	if (IFNAMSIZ > sizeof(unsigned long))
+		ret |= (a[1] ^ b[1]) & mask[1];
+	if (IFNAMSIZ > 2 * sizeof(unsigned long))
+		ret |= (a[2] ^ b[2]) & mask[2];
+	if (IFNAMSIZ ...
From: Andi Kleen
Date: Friday, January 30, 2009 - 10:27 am

gcc 4.4 will have a way to do that per function, but earlier
you would need to move it to a separate file and specify
the option only for that.

Doing so would be still a good idea compared to your
patch because the code will be cleaner and might
be more adaptable to future architectures

It cannot do much without profile feedback because

Yes I know, sorry for that.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Eric Dumazet
Date: Friday, January 30, 2009 - 10:27 am

So... you suggest me to split file, and use a "-funroll-loops",
that might doing strange things on some arches / compilers...

This is far too complicated and risky imho.

Check compare_ether_addr_64bits() definition in 
include/linux/etherdevice.h for a truly unreadable code :)


--

From: Andi Kleen
Date: Friday, January 30, 2009 - 10:50 am

I don't think it will. -funroll-loops is pretty clear defined.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Patrick McHardy
Date: Monday, February 9, 2009 - 6:41 am

The interface name matching has shown up in profiles forever
though and we've actually already tried to optimize it IIRC.

Eric, I'm trying to keep all the *tables files synchronized,
could you send me a patch updating the other ones as well
please?
--

From: Eric Dumazet
Date: Wednesday, February 18, 2009 - 8:10 am

While doing this, I found arp_tables is still using loop using
byte operations.

Also, I could not find how iniface_mask[], outiface_mask[], iniface[]
and outiface[] were forced to long word alignment ... 

(in struct ipt_ip, struct ip6t_ip6, struct arpt_arp)


--

From: Patrick McHardy
Date: Wednesday, February 18, 2009 - 8:21 am

In case of IPv4 and IPv6 they are already suitable aligned, it
simply performing the comparison in unsigned long quantities.
struct arpt_arp unfortunately doesn't properly align the interface
names, so we need to continue to do byte-wise comparisons.
--

From: Eric Dumazet
Date: Wednesday, February 18, 2009 - 9:33 am

I see, but #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS can help here ;)

ifname_compare() should be static in three files (ipv4_ip_tables, ipv6_ip_tables and arp_tables),
since only arp_tables variant has the alignement problem.

[PATCH] netfilter: unfold two critical loops in arp_packet_match()

x86 and powerpc can perform long word accesses in an efficient maner.
We can use this to unroll two loops in arp_packet_match(), to
perform arithmetic on long words instead of bytes. This is a win
on x86_64 for example.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 net/ipv4/netfilter/arp_tables.c |   44 +++++++++++++++++++++++-------
 1 files changed, 34 insertions(+), 10 deletions(-)
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 7ea88b6..b5db463 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -73,6 +73,36 @@ static inline int arp_devaddr_compare(const struct arpt_devaddr_info *ap,
 	return (ret != 0);
 }
 
+/*
+ * Unfortunatly, _b and _mask are not aligned to an int (or long int)
+ * Some arches dont care, unrolling the loop is a win on them.
+ */
+static unsigned long ifname_compare(const char *_a, const char *_b, const char *_mask)
+{
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	const unsigned long *a = (const unsigned long *)_a;
+	const unsigned long *b = (const unsigned long *)_b;
+	const unsigned long *mask = (const unsigned long *)_mask;
+	unsigned long ret;
+
+	ret = (a[0] ^ b[0]) & mask[0];
+	if (IFNAMSIZ > sizeof(unsigned long))
+		ret |= (a[1] ^ b[1]) & mask[1];
+	if (IFNAMSIZ > 2 * sizeof(unsigned long))
+		ret |= (a[2] ^ b[2]) & mask[2];
+	if (IFNAMSIZ > 3 * sizeof(unsigned long))
+		ret |= (a[3] ^ b[3]) & mask[3];
+	BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long));
+#else
+	unsigned long ret = 0;
+	int i;
+
+	for (i = 0; i < IFNAMSIZ; i++)
+		ret |= (_a[i] ^ _b[i]) & _mask[i];
+#endif
+	return ret;
+}
+
 /* Returns whether packet matches rule or not. */
 ...
From: Patrick McHardy
Date: Wednesday, February 18, 2009 - 9:52 am

get_unaligned() would work as well I guess. But we don't seem to

This looks good to me. Applied, thanks.
--

From: Eric Dumazet
Date: Wednesday, February 18, 2009 - 10:36 am

Hi Patrick

Please find following patch to prepare next one (loop unrolling)

I am not sure xt_physdev_info is aligned, either on an int or a long,
so please check my assertion before applying :)



Thank you

[PATCH] netfilter: xt_physdev fixes

1) physdev_mt() incorrectly assumes nulldevname[] is aligned on an int

2) It also uses word comparisons, while it could use long word ones.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index 1bcdfc1..4b13ef7 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -24,9 +24,9 @@ static bool
 physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par)
 {
 	int i;
-	static const char nulldevname[IFNAMSIZ];
+	static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long))));
 	const struct xt_physdev_info *info = par->matchinfo;
-	bool ret;
+	unsigned long ret;
 	const char *indev, *outdev;
 	const struct nf_bridge_info *nf_bridge;
 
@@ -68,10 +68,10 @@ physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par)
 	if (!(info->bitmask & XT_PHYSDEV_OP_IN))
 		goto match_outdev;
 	indev = nf_bridge->physindev ? nf_bridge->physindev->name : nulldevname;
-	for (i = 0, ret = false; i < IFNAMSIZ/sizeof(unsigned int); i++) {
-		ret |= (((const unsigned int *)indev)[i]
-			^ ((const unsigned int *)info->physindev)[i])
-			& ((const unsigned int *)info->in_mask)[i];
+	for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
+		ret |= (((const unsigned long *)indev)[i]
+			^ ((const unsigned long *)info->physindev)[i])
+			& ((const unsigned long *)info->in_mask)[i];
 	}
 
 	if (!ret ^ !(info->invert & XT_PHYSDEV_OP_IN))
@@ -82,13 +82,12 @@ match_outdev:
 		return true;
 	outdev = nf_bridge->physoutdev ?
 		 nf_bridge->physoutdev->name : nulldevname;
-	for (i = 0, ret = false; i < IFNAMSIZ/sizeof(unsigned int); i++) {
-		ret |= (((const unsigned int *)outdev)[i]
-			^ ((const unsigned ...
From: Patrick McHardy
Date: Wednesday, February 18, 2009 - 11:14 am

Yes, the private structures can assume alignment suitable for

Applied, thanks.
--

From: Eric Dumazet
Date: Thursday, February 19, 2009 - 1:00 am

xt_physdev netfilter module can use an ifname_compare() helper
so that two loops are unfolded.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 net/netfilter/xt_physdev.c |   32 +++++++++++++++++++++-----------
 1 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index 4b13ef7..44a234e 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -20,10 +20,27 @@ MODULE_DESCRIPTION("Xtables: Bridge physical device match");
 MODULE_ALIAS("ipt_physdev");
 MODULE_ALIAS("ip6t_physdev");
 
+static unsigned long ifname_compare(const char *_a, const char *_b, const char *_mask)
+{
+	const unsigned long *a = (const unsigned long *)_a;
+	const unsigned long *b = (const unsigned long *)_b;
+	const unsigned long *mask = (const unsigned long *)_mask;
+	unsigned long ret;
+
+	ret = (a[0] ^ b[0]) & mask[0];
+	if (IFNAMSIZ > sizeof(unsigned long))
+		ret |= (a[1] ^ b[1]) & mask[1];
+	if (IFNAMSIZ > 2 * sizeof(unsigned long))
+		ret |= (a[2] ^ b[2]) & mask[2];
+	if (IFNAMSIZ > 3 * sizeof(unsigned long))
+		ret |= (a[3] ^ b[3]) & mask[3];
+	BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long));
+	return ret;
+}
+
 static bool
 physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par)
 {
-	int i;
 	static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long))));
 	const struct xt_physdev_info *info = par->matchinfo;
 	unsigned long ret;
@@ -68,11 +85,7 @@ physdev_mt(const struct sk_buff *skb, const struct xt_match_param *par)
 	if (!(info->bitmask & XT_PHYSDEV_OP_IN))
 		goto match_outdev;
 	indev = nf_bridge->physindev ? nf_bridge->physindev->name : nulldevname;
-	for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
-		ret |= (((const unsigned long *)indev)[i]
-			^ ((const unsigned long *)info->physindev)[i])
-			& ((const unsigned long *)info->in_mask)[i];
-	}
+	ret = ifname_compare(indev, info->physindev, info->in_mask);
 
 	if (!ret ^ ...
From: Eric Dumazet
Date: Thursday, February 19, 2009 - 1:14 am

ip6_tables netfilter module can use an ifname_compare() helper
so that two loops are unfolded.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 net/ipv6/netfilter/ip6_tables.c |   33 +++++++++++++++++++-----------
 1 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index a33485d..d64594b 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -89,6 +89,25 @@ ip6t_ext_hdr(u8 nexthdr)
 		 (nexthdr == IPPROTO_DSTOPTS) );
 }
 
+static unsigned long ifname_compare(const char *_a, const char *_b,
+				    const unsigned char *_mask)
+{
+	const unsigned long *a = (const unsigned long *)_a;
+	const unsigned long *b = (const unsigned long *)_b;
+	const unsigned long *mask = (const unsigned long *)_mask;
+	unsigned long ret;
+
+	ret = (a[0] ^ b[0]) & mask[0];
+	if (IFNAMSIZ > sizeof(unsigned long))
+		ret |= (a[1] ^ b[1]) & mask[1];
+	if (IFNAMSIZ > 2 * sizeof(unsigned long))
+		ret |= (a[2] ^ b[2]) & mask[2];
+	if (IFNAMSIZ > 3 * sizeof(unsigned long))
+		ret |= (a[3] ^ b[3]) & mask[3];
+	BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long));
+	return ret;
+}
+
 /* Returns whether matches rule or not. */
 /* Performance critical - called for every packet */
 static inline bool
@@ -99,7 +118,6 @@ ip6_packet_match(const struct sk_buff *skb,
 		 unsigned int *protoff,
 		 int *fragoff, bool *hotdrop)
 {
-	size_t i;
 	unsigned long ret;
 	const struct ipv6hdr *ipv6 = ipv6_hdr(skb);
 
@@ -120,12 +138,7 @@ ip6_packet_match(const struct sk_buff *skb,
 		return false;
 	}
 
-	/* Look for ifname matches; this should unroll nicely. */
-	for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
-		ret |= (((const unsigned long *)indev)[i]
-			^ ((const unsigned long *)ip6info->iniface)[i])
-			& ((const unsigned long *)ip6info->iniface_mask)[i];
-	}
+	ret = ifname_compare(indev, ip6info->iniface, ip6info->iniface_mask);
 
 	if (FWINV(ret != 0, ...
From: Patrick McHardy
Date: Thursday, February 19, 2009 - 3:19 am

Also applied, thanks Eric.

Are you going to send a new patch for ip_tables.c or should I
just take the old one?
--

From: Patrick McHardy
Date: Thursday, February 19, 2009 - 3:17 am

Applied, thanks.

--

From: Eric Dumazet
Date: Friday, February 20, 2009 - 3:02 am

While doing oprofile tests I noticed two loops are not properly unrolled by gcc

Using a hand coded unrolled loop provides nice speedup : ipt_do_table
credited of 2.52 % of cpu instead of 3.29 % in tbench.


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 08cde5b..e5294ae 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -74,6 +74,25 @@ do {								\
 
    Hence the start of any table is given by get_table() below.  */
 
+static unsigned long ifname_compare(const char *_a, const char *_b,
+				    const unsigned char *_mask)
+{
+	const unsigned long *a = (const unsigned long *)_a;
+	const unsigned long *b = (const unsigned long *)_b;
+	const unsigned long *mask = (const unsigned long *)_mask;
+	unsigned long ret;
+
+	ret = (a[0] ^ b[0]) & mask[0];
+	if (IFNAMSIZ > sizeof(unsigned long))
+		ret |= (a[1] ^ b[1]) & mask[1];
+	if (IFNAMSIZ > 2 * sizeof(unsigned long))
+		ret |= (a[2] ^ b[2]) & mask[2];
+	if (IFNAMSIZ > 3 * sizeof(unsigned long))
+		ret |= (a[3] ^ b[3]) & mask[3];
+	BUILD_BUG_ON(IFNAMSIZ > 4 * sizeof(unsigned long));
+	return ret;
+}
+
 /* Returns whether matches rule or not. */
 /* Performance critical - called for every packet */
 static inline bool
@@ -83,7 +102,6 @@ ip_packet_match(const struct iphdr *ip,
 		const struct ipt_ip *ipinfo,
 		int isfrag)
 {
-	size_t i;
 	unsigned long ret;
 
 #define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg)))
@@ -103,12 +121,7 @@ ip_packet_match(const struct iphdr *ip,
 		return false;
 	}
 
-	/* Look for ifname matches; this should unroll nicely. */
-	for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long); i++) {
-		ret |= (((const unsigned long *)indev)[i]
-			^ ((const unsigned long *)ipinfo->iniface)[i])
-			& ((const unsigned long *)ipinfo->iniface_mask)[i];
-	}
+	ret = ifname_compare(indev, ipinfo->iniface, ipinfo->iniface_mask);
 
 	if (FWINV(ret != 0, ...
From: Patrick McHardy
Date: Friday, February 20, 2009 - 3:04 am

Applied, thanks Eric.
--

From: Patrick McHardy
Date: Monday, February 9, 2009 - 7:57 am

The lock is currently used to avoid timer update races.
Martin has two old patches two remove it, but they require
changes to the timer code to support a mod_timer variant
that doesn't enable an inactive timer:

http://people.netfilter.org/gandalf/performance/patches/mod_timer_noact
http://people.netfilter.org/gandalf/performance/patches/__nf_ct_refresh_acct-locking
--

From: Stephen Hemminger
Date: Tuesday, February 10, 2009 - 11:44 am

On Tue, 27 Jan 2009 00:10:44 +0100

Does anyone have a fix for this bottleneck.
--

Previous thread: [PATCH 1/1] Fix getsockname (ipv4/ipv6) by John Reumann on Friday, January 23, 2009 - 2:45 pm. (4 messages)

Next thread: [patch 2.6.28-rc8-davinci-git] dm355: dm9000 by Narnakaje, Snehaprabha on Monday, January 26, 2009 - 3:36 pm. (3 messages)