The reader/writer lock in ip_tables is acquired in the critical path of processing packets and is one of the reasons just loading iptables can cause a 20% performance loss. The rwlock serves two functions: 1) it prevents changes to table state (xt_replace) while table is in use. This is now handled by doing rcu on the xt_table. When table is replaced, the new table(s) are put in and the old one table(s) are freed after RCU period. 2) it provides synchronization when accesing the counter values. This is now handled by swapping in new table_info entries for each cpu then summing the old values, and putting the result back onto one cpu. On a busy system it may cause sampling to occur at different times on each cpu, but no packet/byte counts are lost in the process. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> --- include/linux/netfilter/x_tables.h | 6 + net/ipv4/netfilter/arp_tables.c | 114 ++++++++++++++++++++++++++--------- net/ipv4/netfilter/ip_tables.c | 120 ++++++++++++++++++++++++++----------- net/ipv6/netfilter/ip6_tables.c | 119 +++++++++++++++++++++++++----------- net/netfilter/x_tables.c | 26 ++++++-- 5 files changed, 283 insertions(+), 102 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-02-17 11:04:08.112056798 -0800 +++ b/include/linux/netfilter/x_tables.h 2009-02-17 11:04:09.639778762 -0800 @@ -353,7 +353,7 @@ struct xt_table unsigned int valid_hooks; /* Lock for the curtain */ - rwlock_t lock; + struct mutex lock; /* Man behind the curtain... */ struct xt_table_info *private; @@ -385,7 +385,7 @@ struct xt_table_info /* ipt_entry tables: one per CPU */ /* Note : this field MUST be the last one, see XT_TABLE_INFO_SZ */ - char *entries[1]; + void *entries[1]; }; #define XT_TABLE_INFO_SZ (offsetof(struct xt_table_info, entries) \ @@ -432,6 +432,8 @@ extern void xt_proto_fini(struct net *ne extern struct xt_table_info *xt_alloc_table_info(unsigned ...
This part (arptables.c) seems to be missing a preempt_enable(). Other than this, the patch looks good to me, if you want I can already apply this one. --
The reader/writer lock in ip_tables is acquired in the critical path of processing packets and is one of the reasons just loading iptables can cause a 20% performance loss. The rwlock serves two functions: 1) it prevents changes to table state (xt_replace) while table is in use. This is now handled by doing rcu on the xt_table. When table is replaced, the new table(s) are put in and the old one table(s) are freed after RCU period. 2) it provides synchronization when accesing the counter values. This is now handled by swapping in new table_info entries for each cpu then summing the old values, and putting the result back onto one cpu. On a busy system it may cause sampling to occur at different times on each cpu, but no packet/byte counts are lost in the process. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> --- Added missing preempt_enable. Patch against nf-next-2.6 git tree. include/linux/netfilter/x_tables.h | 6 + net/ipv4/netfilter/arp_tables.c | 115 +++++++++++++++++++++++++++-------- net/ipv4/netfilter/ip_tables.c | 120 ++++++++++++++++++++++++++----------- net/ipv6/netfilter/ip6_tables.c | 119 +++++++++++++++++++++++++----------- net/netfilter/x_tables.c | 26 ++++++-- 5 files changed, 284 insertions(+), 102 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-02-19 11:42:43.060110657 -0800 +++ b/include/linux/netfilter/x_tables.h 2009-02-19 11:42:58.863663575 -0800 @@ -353,7 +353,7 @@ struct xt_table unsigned int valid_hooks; /* Lock for the curtain */ - rwlock_t lock; + struct mutex lock; /* Man behind the curtain... */ struct xt_table_info *private; @@ -385,7 +385,7 @@ struct xt_table_info /* ipt_entry tables: one per CPU */ /* Note : this field MUST be the last one, see XT_TABLE_INFO_SZ */ - char *entries[1]; + void *entries[1]; }; #define XT_TABLE_INFO_SZ (offsetof(struct xt_table_info, entries) \ @@ -432,6 +432,8 @@ extern void xt_proto_fini(struct net ...
Acked-by: Eric Dumazet <dada1@cosmosbay.com> Sucessfully tested on my dual quad core machine too, but iptables only (no ipv6 here) BTW, my new "tbench 8" result is 2450 MB/s, (it was 2150 MB/s not so long ago) --
Do you folks need/want further testing against the 32-core setup? --
On Thu, 19 Feb 2009 15:56:18 -0800 It would be good to combine all 3 (iptables-rcu, timer change, and conntrack lock) to see what the overhead change is. --
Fair enough. Is there a tree somewhere I can pull with all those in it, or do I need to go back through the emails and apply patches? rick jones --
You can use my nf-next.git tree from: git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6.git It contains the lock free counters, as well as smaller optimizations from Eric. The last timer patch I've seen missed the actual conversion to use mod_timer_pending(), but it would be great to have some numbers on the conntrack lock changes. Thanks Rick! --
So, by the time this hits inboxes, under: ftp://ftp.netperf.org/nf-next-2.6-results should be a directory called "baseline" which are the results from just a clone of your tree from earlier today. There you will find the config file, the log of the build and then three subdirectories: none - results without doing iptables --list empty - results after doing iptables --list full - results after doing an iptables-restore of a config from the "iptables" file also up there In each will be the netperf results in csv format, and four different caliper (using the perfmon interface) profiles: "cycles" uses a profile which is able to take samples with interrupts disabled "fprof" is a plain flat profile that does not see things happening with interrupt s disabled - comparing an fprof to cycles is sometimes interesting "dcache" tries to take cache miss profiles. iirc that uses the data ear in the Itanium PMU to do its thing - I cannot recall the effect of interrupt disabling there "scgprof" is a sampled call graph profile - likely as not with interrupt I will go back through my email now and try to find the conntrack lock changes and apply them to the tree and turn the crank. happy benchmarking rick jones --
Under the base URL above there is now a "conntrack" subdir with the usual "none," "empty," and "full" subdirs. This is with the patch from message ID <20090219140303.4329f860@extreme> titled "Re: [RFT 4/4] netfilter: Get rid of central rwlock in tcp conntracking" which my mail client says has a date of 02/19/09 14:03. On the plus side, only one of the 64 concurrent netperfs died during the "full" test compared with more than 10 without the patch. Also, there were no soft lockups reported as there were without the patch. The rwlock time is gone, naturally, replaced with boatloads of spinlock contention. Hopefully the scgprof profile will help show the source. Perhaps there is yet another patch I should have applied :) happy benchmarking, rick jones --
Applied, thanks everyone. I've also addes Eric's tbench results to the changelog. --
Damned this broke xt_hashlimit, version=0
Look file "net/netfilter/xt_hashlimit.c" line 706
/* Ugly hack: For SMP, we only want to use one set */
r->u.master = r;
File "include/linux/netfilter/xt_hashlimit.h"
struct xt_hashlimit_info {
char name [IFNAMSIZ]; /* name */
struct hashlimit_cfg cfg;
/* Used internally by the kernel */
struct xt_hashlimit_htable *hinfo;
union {
void *ptr;
struct xt_hashlimit_info *master;
} u;
};
So, it appears some modules are using pointers to themselves, what a hack :(
We probably need an audit of other modules.
(net/netfilter/xt_statistic.c, net/netfilter/xt_quota.c,
net/netfilter/xt_limit.c ...)
Unfortunatly I wont have time to do this in following days, any volunteer ?
Thank you
[PATCH] netfilter: xt_hashlimit fix
Commit 784544739a25c30637397ace5489eeb6e15d7d49
(netfilter: iptables: lock free counters) broke xt_hashlimit netfilter module :
This module was storing a pointer inside its xt_hashlimit_info, and this pointer
is not relocated when we temporarly switch tables (iptables -L).
This hack is not not needed at all (probably a leftover from
ancient time), as each cpu should and can access to its own copy.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 2482055..a5b5369 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -565,8 +565,7 @@ hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo,
static bool
hashlimit_mt_v0(const struct sk_buff *skb, const struct xt_match_param *par)
{
- const struct xt_hashlimit_info *r =
- ((const struct xt_hashlimit_info *)par->matchinfo)->u.master;
+ const struct xt_hashlimit_info *r = par->matchinfo;
struct xt_hashlimit_htable *hinfo = r->hinfo;
unsigned long now = jiffies;
struct dsthash_ent *dh;
@@ -702,8 +701,6 @@ static bool ...Please have a look!
---8<---
parent b2bd9ab764d65237232c4aad5ab8d5d8b5714f72 (v2.6.29-rc6-31-gb2bd9ab)
commit 3e7ee1dcc808b8eec82eddfa0436f78e31d2004a
Author: Jan Engelhardt <jengelh@medozas.de>
Date: Sat Feb 28 02:49:28 2009 +0100
netfilter: xtables: avoid pointer to self
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
---
include/linux/netfilter/xt_limit.h | 9 +++--
include/linux/netfilter/xt_statistic.h | 7 ++--
net/netfilter/xt_limit.c | 40 +++++++++++++++++------
net/netfilter/xt_statistic.c | 28 +++++++++++++---
4 files changed, 61 insertions(+), 23 deletions(-)
diff --git a/include/linux/netfilter/xt_limit.h b/include/linux/netfilter/xt_limit.h
index b3ce653..fda222c 100644
--- a/include/linux/netfilter/xt_limit.h
+++ b/include/linux/netfilter/xt_limit.h
@@ -4,6 +4,8 @@
/* timings are in milliseconds. */
#define XT_LIMIT_SCALE 10000
+struct xt_limit_priv;
+
/* 1/10,000 sec period => max of 10,000/sec. Min rate is then 429490
seconds, or one every 59 hours. */
struct xt_rateinfo {
@@ -11,11 +13,10 @@ struct xt_rateinfo {
u_int32_t burst; /* Period multiplier for upper limit. */
/* Used internally by the kernel */
- unsigned long prev;
- u_int32_t credit;
+ unsigned long prev; /* moved to xt_limit_priv */
+ u_int32_t credit; /* moved to xt_limit_priv */
u_int32_t credit_cap, cost;
- /* Ugly, ugly fucker. */
- struct xt_rateinfo *master;
+ struct xt_limit_priv *master;
};
#endif /*_XT_RATE_H*/
diff --git a/include/linux/netfilter/xt_statistic.h b/include/linux/netfilter/xt_statistic.h
index 3d38bc9..8f521ab 100644
--- a/include/linux/netfilter/xt_statistic.h
+++ b/include/linux/netfilter/xt_statistic.h
@@ -13,6 +13,8 @@ enum xt_statistic_flags {
};
#define XT_STATISTIC_MASK 0x1
+struct xt_statistic_priv;
+
struct xt_statistic_info {
u_int16_t mode;
u_int16_t flags;
@@ -23,11 +25,10 @@ struct xt_statistic_info {
struct {
u_int32_t every;
...Seems good to me ! Thanks Jan ! Reviewed-by: Eric Dumazet <dada1@cosmosbay.com> Are you sure xt_quota doesnt need some tweak, or should we not care of changes --
Oh right, xt_quota.. thanks for noticing! Patch will follow. --
Indeed. This is unfortunately necessary in some cases to make sure that modules using global state actually use global state instead This seems fine in case of hashlimit since it the match data is read-only. In case of statistic and quota I think we still Applied, thanks. --
While testing multicast flooding stuff, I found that "iptables -nvL" can have a *very* slow response time on my dual quad core machine... LatencyTOP version 0.5 (C) 2008 Intel Corporation Cause Maximum Percentage synchronize_rcu synchronize_net do_ipt_get_ctl nf_1878.6 msec 3.1 % Scheduler: waiting for cpu 160.3 msec 13.6 % do_get_write_access journal_get_write_access __ext 11.0 msec 0.0 % do_get_write_access journal_get_write_access __ext 7.7 msec 0.0 % poll_schedule_timeout do_select core_sys_select sy 4.9 msec 0.0 % do_wait sys_wait4 sys_waitpid sysenter_do_call 3.4 msec 0.1 % call_usermodehelper_exec request_module netlink_cr 1.6 msec 0.0 % __skb_recv_datagram skb_recv_datagram raw_recvmsg 1.5 msec 0.0 % do_wait sys_wait4 sysenter_do_call 0.7 msec 0.0 % # time iptables -nvL Chain INPUT (policy ACCEPT 416M packets, 64G bytes) pkts bytes target prot opt in out source destination Chain FORWARD (policy ACCEPT 0 packets, 0 bytes) pkts bytes target prot opt in out source destination Chain OUTPUT (policy ACCEPT 401M packets, 62G bytes) pkts bytes target prot opt in out source destination real 0m1.810s user 0m0.000s sys 0m0.001s CONFIG_NO_HZ=y CONFIG_HZ_1000=y CONFIG_HZ=1000 One cpu is 100% handling softirqs, could it be the problem ? Cpu0 : 1.0%us, 14.7%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st Cpu1 : 3.6%us, 23.2%sy, 0.0%ni, 71.6%id, 0.0%wa, 0.0%hi, 1.7%si, 0.0%st Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi,100.0%si, 0.0%st Cpu3 : 2.7%us, 23.9%sy, 0.0%ni, 71.1%id, 0.7%wa, 0.0%hi, 1.7%si, 0.0%st Cpu4 : 1.3%us, 14.3%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st Cpu5 : 1.0%us, 14.2%sy, 0.0%ni, 83.4%id, 0.0%wa, ...
Hi Paul I found following patch helps if one cpu is looping inside ksoftirqd() synchronize_rcu() now completes in 40 ms instead of 1800 ms. Thank you [PATCH] rcu: increment quiescent state counter in ksoftirqd() If a machine is flooded by network frames, a cpu can loop 100% of its time inside ksoftirqd() without calling schedule(). This can delay RCU grace period to insane values. Adding rcu_qsctr_inc() call in ksoftirqd() solves this problem. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- diff --git a/kernel/softirq.c b/kernel/softirq.c index bdbe9de..9041ea7 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -626,6 +626,7 @@ static int ksoftirqd(void * __bind_cpu) preempt_enable_no_resched(); cond_resched(); preempt_disable(); + rcu_qsctr_inc((long)__bind_cpu); } preempt_enable(); set_current_state(TASK_INTERRUPTIBLE); --
Good catch!!! This regression was a result of the recent change from "schedule()" to "cond_resched()", which got rid of that quiescent state in the common case where a reschedule is not needed. --
Is this fixed by your RCU quiescent state fix? --
On Mon, 02 Mar 2009 22:56:39 +0100 I wonder if the RCU quiescent fix should go in 2.6.29 because it fixes other issues like route changing RCU latency under Dos attack. --
Yes probably, and on stable versions too, since this problem is quite old... --
