From: Stephen Hemminger <shemminger@vyatta.com> Date: Fri, 10 Apr 2009 09:52:46 -0700 --
I wonder if we should bring in the RCU people too, for them to tell you that the networking people are beign silly, and should not synchronize with the very heavy-handed synchronize_net() but instead of doing synchronization (which is probably why adding a few hundred rules then takes several seconds - each synchronizes and that takes a timer tick or so), add the rules to be free'd on some rcu-freeing list for later freeing. Or whatever. Paul? synchronize_net() just calls synchronize_rcu(), and with that knowledge and a simple git show 784544739a25c30637397ace5489eeb6e15d7d49 I bet you can already tell people how to fix their performance issue. Linus --
Well, I am certainly happy to demonstrate my ignorance of the networking code by throwing out a few suggestions. So, Dave and Steve, you might want to get out your barf bag before reading further. You have been warned! ;-) 1. Assuming that the synchronize_net() is intended to guarantee that the new rules will be in effect before returning to user space: a. Split this functionality, so that there is a new user-space primitive that installs a new rule, but without waiting. They provide an additional user-space primitive that waits for the rules to take effect. Then, when loading a long list of rules, load them using the non-waiting primitive, and wait at the end. b. As above, but provide a flag that says whether or not to wait. Same general effect. But I am not seeing the direct connection between this patch and netfilter, so... 2. For the xt_replace_table() case, it would be necessary to add an rcu_head to the xt_table_info, and replace each caller's direct calls to xt_free_table_info() with call_rcu(). Now this has an issue in that the caller wants to return the final counter values. My assumption is that these values do not in fact need to be exact. If I am wrong about that, then my suggestion would lose the counts from late readers. I must defer to the networking guys as to whether this is acceptable or not. If not, more head-scratching would be required. (But it looks to me that the rule is being trashed, so who cares about the extra counts?) In addition, a malicious user might be able to force this to happen extremely frequently, running the system out of memory. One way to fix this is to invoke synchronize_net() one out of 20 times or some such. 3. For the alloc_counters() case, the comments indicate that we really truly do want an atomic sampling of the counters. The counters are 64-bit entities, which is a bit inconvenient. Though people using this functionality are no doubt quite happy to never have to ...
iptables works in whole tables. Userspace submits a table, checkentry is called for all rules in the new table, things are swapped, then destroy is called for all rules in the old table. By that logic (which existed since dawn I think), only the swap operation needs to be locked. The fact that `iptables -A` is called a hundred times means you are doing 100 table replacements -- instead of one. And calling synchronize_net at least a 100 times. As I read the new code, it seems that synchronize_net is only used on copying the rules from kernel into userspace; not when updating them from userspace: IPT_SO_GET_ENTRIES -> get_entries -> copy_entries_to_user -> Would a seqlock suffice, as it does for the 64-bit jiffies? --
The 64-bit jiffies counter is not updated often, so write-acquiring a seqlock on each update is OK. From what I understand, these counters are updated quite often (one each packet transmission or reception?), so write-acquiring on each update would be quite painful. Or did you have something else in mind here? Thanx, Paul --
From: Jan Engelhardt <jengelh@medozas.de> I want to derail this line of thinking as fast as possible. This is not an acceptable response to this problem. We made something fundamentally slower by several orders of magnitude. Therefore, saying "Don't insert your firewall rules like that." is not a valid response for this regression. We really have to fix it or revert. --
Let me start by saying that I agree that for most systems this patch provided a bad performance tradeoff that needs to get fixed. On the other hand I have certain systems where I would much rather reduce the per-packet load by a few percent... even if it increases the effort to load a new ruleset by many orders of magnitude!!! Quite simply the boxes only reboot a few times a year but in-between times they forward many terabytes of low-latency network traffic. So... to play devils advocate: Almost all of the standard firewall tools (such as shorewall, etc) are already using iptables-restore command to load firewall rules, primarily because using separate iptables commands was *already* way too slow. There's also the serious race-condition of doing a firewall restart that way where you only have half your rules loaded for a bit. The "iptables" command is fine for fiddling around with the command line and making minor tweaks, but it simply doesn't cut it for large-scale rules. I remember when switching from a shell-based shorewall to a perl-based shorewall. The time to build my rule lists with the perl-based version was about 20% of what it had been, but the time to load the rules into the kernel with iptables-restore was easily 2% or perhaps less. Finally, if you really are loading a couple hundred IPs into a linear ruleset, you're adding a fair amount of packet latency to each and every packet that goes through totally independent of iptables load time. It would be much better to use ipsets (or similar) because they load all of the IP ranges into an appropriate tree datastructure with O(small-constant * log(N) + large-constant * 1) lookup instead of O(large-constant * N). Cheers, Kyle Moffett --
Some time ago there was batch patch that allowed to use standard shell format of calling iptables but did everything at once: http://lists.netfilter.org/pipermail/netfilter-devel/2004- September/016704.html It didn't get merged - no idea why. -- Arkadiusz Miśkiewicz PLD/Linux Team arekm / maven.pl http://ftp.pld-linux.org/ --
>> This is not an acceptable response to this problem.
Not true... The iptables-restore format is pretty well documented and not far off the standard command-line argument format. For instance the "shorewall" tool does a sort of "compile" of its high-level firewall language into an input file for the "iptables-restore" command. The basic format to atomically load a table is: -A SOMECHAIN --rule-arguments-here -A customchain --rule-arguments-here COMMIT At the end of this email you can find some sample data cut-n-pasted from the iptables-restore file from one of my boxes running shorewall. The full file is 645 lines but takes at most a second or so to load once compiled. You could also do an iptables-save -c on one of your configured systems to see what various constructions you use look like in the iptables format. It's all pretty straightforward. Cheers, Kyle Moffett COMMIT -A POSTROUTING -o world -j world_masq -A excl0 -d 10.0.0.0/8 -j RETURN -A excl0 -d 192.168.0.0/16 -j RETURN -A excl0 -d 172.16.0.0/12 -j RETURN -A excl0 -j MASQUERADE -A excl1 -d 10.0.0.0/8 -j RETURN -A excl1 -d 192.168.0.0/16 -j RETURN -A excl1 -d 172.16.0.0/12 -j RETURN -A excl1 -j MASQUERADE -A excl2 -d 10.0.0.0/8 -j RETURN -A excl2 -d 192.168.0.0/16 -j RETURN -A excl2 -d 172.16.0.0/12 -j RETURN -A excl2 -j MASQUERADE -A world_masq -s 10.0.0.0/8 -j excl0 -A world_masq -s 172.16.0.0/12 -j excl1 -A world_masq -s 192.168.0.0/16 -j excl2 COMMIT -A PREROUTING -j tcpre -A FORWARD -j tcfor -A FORWARD -j fortos -A OUTPUT -j tcout -A POSTROUTING -j tcpost -A fortos -p 17 -i dmz -s 72.214.41.2 -j TOS --set-tos 16 -A fortos -p 17 -o dmz -d 72.214.41.2 -j TOS --set-tos 16 COMMIT [...many more lines snipped...] --
Hi all, That's what I implemented as "iptables-restore --noflush" a number of years ago. It doesn't flush the current uleset and swaps in a new one, but=20 reads the current rules from kernel, applies any number of changes and swaps the new ruleset in. The syntax of iptables-restore is almost identical to iptables commands, you just specify the table in a different way. So you would just create your desired changes in that format, and echo that into iptables-restore. If it= 's an entire new ruleset, you use no '--noflush' and it does automatic flushing of all old rules. If your stdin-file contains only incremental changes, you use --noflush. In the netfilter project, we knew for many years that the 'swap the entire table atomically in' is a bad design choice. This is what various develope= rs have been trying to address at different times, and which finally resulted in the nftables implementation of Patrick McHardy. So for the mid- to long term there is a clear design that moves away from that. But so far, we have to live with the API and its semantics. iptables users= pace has been improved a number of times, and things like iptables-restore with = or without --noflush can be used as an intermediate solution - and have been u= sed by many systems out there. --=20 - Harald Welte <laforge@netfilter.org> http://netfilter.org/ =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D "Fragmentation is like classful addressing -- an interesting early architectural error that shows how much experimentation was going on while IP was being designed." -- Paul Vixie
Well, there is an extra tool in SUSE's iptables, which collects rules added this way, and then commits them in one go when you are done. Perhaps that is an "adequeate" way? --
On Sat, 11 Apr 2009 07:14:50 +0200 (CEST) Part of the overhead is the API choice to take counter values from user space during the replace. If the rule replacement just always started with zero counters it could be done with less overhead. --
On Sat, Apr 11, 2009 at 11:07 PM, Stephen Hemminger It's always good practice to start from zero with these ... # iptables -F # iptables -t nat -F # iptables -X And most of the time, rules should be put into a file so that it can rerun easily after reboot. So if it can be speed up for just this case, it'll help many out there. Thanks, Jeff. --
Not really. Yes, iptables as a single command works in whole tables. USERS, on the other hand, often work in multiple iptables commands, ie they just add things to the tables. And in fact, I think this is the exact workload that Jeff complains about - doing two hundred "update table" The problem is, the new code makes the "wait after swap" thing happen after every switch. And if you do two hundred "update table" commands, you now take a _long_ time to update. Sure, you could tell people to just do everything as one single table update, but that isn't what they do. Linus --
Note, i have recently implemented full atomic64_t support on 32-bit x86, for the perfcounters code, based on the CMPXCHG8B instruction. Which, while not the lightest of instructions, is still much better than the sequence above. So i think a better approach would be to also add a dumb generic implementation for atomic64_t (using a global lock or so), and then generic code could just assume that atomic64_t always exists. It is far nicer - and faster as well - as the hack above, even on 32-bit x86. Ingo --
On Sat, 11 Apr 2009 09:08:54 +0200 The iptables counters are write mostly, read rarely so they don't fit the seq counter or atomic use case. Also, it is important to get a consistent snapshot of the whole set not just each individual counter. --
If the generic implementation is needed only on !SMP systems, that could work. The architectures I would be worried about include powerpc and ia64, which I believe support 32-bit SMP builds. Thanx, Paul --
ia64 would naturally support the CMPXCHG8B instructions. Not sure about powerpc32. Having a lock for the library implementation is not _that_ much of a problem. We obviously dont want the design of Linux to be dictated by the weakest link of all platforms, right? Ingo --
32-bit powerpc doesn't have 64-bit atomic operations and does support SMP. What about ARM? I thought they had 32-bit SMP these days as well. Paul. --
Some of Steve Hemminger's recent suggestions in this thread seem to me to avoid this whole issue nicely. But we will see! ;-) Thanx, Paul --
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> I hope so. Eventually it seems that all of the older 32-bit SMP platforms will be run under a bus having to execute some many "efficient" primitives using the "hash table of spinlocks" scheme for synchronization. --
Or, in some cases, per-CPU locks. It might also be that 32-bit SMP systems use stop-the-world approaches. Or that we get a variant of RCU with shorter grace periods. I don't recommend holding your breath, but I have not given up on this. For one only slightly crazy example, some of the user-level RCU variants could be adapted to in-kernel use. Some of these have sub-microsecond grace-period latencies, but also have some limitations that would make it difficult for them to replace RCU wholesale in the Linux kernel. And it is not like we are suffering from any lack of distinct RCU implementations in the Linux kernel just now. :-/ However, they might be useful in isolated situations. Thanx, Paul --
This is an alternative version of ip/ip6/arp tables locking using per-cpu locks. This avoids the overhead of synchronize_net() during update but still removes the expensive rwlock in earlier versions. The idea for this came from an earlier version done by Eric Duzamet. Locking is done per-cpu, the fast path locks on the current cpu and updates counters. The slow case involves acquiring the locks on all cpu's. The mutex that was added for 2.6.30 in xt_table is unnecessary since there already is a mutex for xt[af].mutex that is held. Tested basic functionality (add/remove/list), but don't have test cases for stress, ip6tables or arptables. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> --- Patch against 2.6.30-rc1 include/linux/netfilter/x_tables.h | 5 - net/ipv4/netfilter/arp_tables.c | 122 ++++++++++------------------------ net/ipv4/netfilter/ip_tables.c | 129 +++++++++++-------------------------- net/ipv6/netfilter/ip6_tables.c | 127 +++++++++++------------------------- net/netfilter/x_tables.c | 21 ------ 5 files changed, 116 insertions(+), 288 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-04-13 08:27:45.698412446 -0700 +++ b/include/linux/netfilter/x_tables.h 2009-04-13 08:34:05.499348295 -0700 @@ -354,9 +354,6 @@ struct xt_table /* What hooks you will enter on */ unsigned int valid_hooks; - /* Lock for the curtain */ - struct mutex lock; - /* Man behind the curtain... */ struct xt_table_info *private; @@ -434,8 +431,6 @@ extern void xt_proto_fini(struct net *ne extern struct xt_table_info *xt_alloc_table_info(unsigned int size); extern void xt_free_table_info(struct xt_table_info *info); -extern void xt_table_entry_swap_rcu(struct xt_table_info *old, - struct xt_table_info *new); /* * This helper is performance critical and must be inlined --- a/net/ipv4/netfilter/ip_tables.c 2009-04-13 08:27:45.684411824 -0700 +++ b/net/ipv4/netfilter/ip_tables.c 2009-04-13 ...
Patch seems good to me, but apparently xt_replace_table() misses the "acquiring the locks on all cpus" you mentioned in ChangeLog ? I am still off-computers until tomorrow so cannot provide a patch for this, sorry. Some form of local_bh_disable(); for_each_possible_cpu(cpu) spin_lock(&per_cpu(ip_tables_lock, cpu)); oldinfo = private; /* do the substitution */ table->private = newinfo; newinfo->initial_entries = oldinfo->initial_entries; for_each_possible_cpu(cpu) spin_unlock(&per_cpu(ip_tables_lock, cpu)); local_bh_enable(); But I wonder if this could hit a limit of max spinlocks held by this cpu, say on a 4096 cpu machine ? --
On Mon, 13 Apr 2009 19:40:24 +0200 --
Doesn't spin_lock() result in a pipeline flush on x86? iirc there was a benchmark in an RCU paper that tested using per cpu spin_locks and the result was that it didn't scale well at all. /Martin --
It's about 20-40 cycles when cached and uncontended, with some outliers (50+ cycles on P4, 12 cycles on some AMD opterons). Spinlocks scale wonderfully well if you only touch them on one CPU. Of course, if you truly only touch them on one CPU they are pointless, but a "all normal code only touches the local CPU spinlock, the really odd cases take all locks" approach works fine. It makes the uncommon case really quite slow, but if it truly is uncommon, that's fine. Linus --
On Mon, 13 Apr 2009 09:53:09 -0700 This would be racy against cpu hotplug if this code was hotplug-aware. And it should be hotplug aware, really. num_possible_cpus() can exceed num_online_cpus(). The extent by which possible>online is controversial, but one can conceive of situations where it is "lots". Is lib/percpu_counter.c no good for this application? Unfixably no good? That code automagically handles cpu hotplug. --
On Mon, 13 Apr 2009 15:24:37 -0700 No. get_cpu_var implies smp_processor_id which is not safe It is doing right thing already with hotplug. This code still needs to count packets processed by previously online percpu_counter can't deal with the layout/load here. --
On Mon, 13 Apr 2009 16:20:00 -0700 spin_lock_bh() will dtrt, but spelling it out seems a good idea. Those counts could be migrated off that CPU when it is offlined. As Insufficient detail here for anyone to understand why percpucounter cannot be adapted to this requirement. --
No, spin_lock_bh() will _not_ do the right thing. On UP it will actually work for two reasons: it will work because (a) it's UP, so there are no issues with smp_processor_id() to beging with, but also because even if there _were_ issues, it would still work because it would all expand as a macro, and the preempt_disable() will actually happen before the argument is evaluated. But on SMP, spin_lock_bh() expands to just _spin_lock_bh(), and is a real function - and the argument will be evaluated before the call (obviously), and thus before the preempt_disable(). So local_bh_disable(); spin_lock(&__get_cpu_var(ip_tables_lock)); is correct, and spin_lock_bh(&__get_cpu_var(ip_tables_lock)); is _not_ correct. The latter will do "&__get_cpu_var(ip_tables_lock)" with no protection from the process being switched to another CPU. Linus --
One option would be to make it more robust for such use. The downside would be all the other cases where the expression is really constant (but still takes a few instructions to calculate) and could be (and should be) evaluated outside of that critical section. So i'd tend to leave it in its current (optimistic) form: we've got 1283 uses of spin_lock_bh(), and just a quick look at git grep suggests that the current optimistic optimization matters in practice. Ingo --
Thanks Stephen, I'll do some testing with ip6tables. --
Here is the patch I cooked on top of Stephen one to get proper locking. In the "iptables -L" case, we freeze updates on all cpus to get previous RCU behavior (not sure it is mandatory, but anyway...) And xt_replace_table() uses same logic to make sure a cpu wont try to parse rules and update counters while a writer is replacing tables (and thus calling vfree() and unmapping in-use pages) Feel free to merge this patch to Stephen one before upstream submission Thank you Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/linux/netfilter/x_tables.h | 5 ++ net/ipv4/netfilter/arp_tables.c | 20 +++------ net/ipv4/netfilter/ip_tables.c | 24 ++++------- net/ipv6/netfilter/ip6_tables.c | 24 ++++------- net/netfilter/x_tables.c | 55 ++++++++++++++++++++++++++- 5 files changed, 84 insertions(+), 44 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 1ff1a76..a5840a4 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -426,6 +426,11 @@ extern struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af, const char *name); extern void xt_table_unlock(struct xt_table *t); +extern void xt_tlock_lockall(void); +extern void xt_tlock_unlockall(void); +extern void xt_tlock_lock(void); +extern void xt_tlock_unlock(void); + extern int xt_proto_init(struct net *net, u_int8_t af); extern void xt_proto_fini(struct net *net, u_int8_t af); diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index c60cc11..b561e1e 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -231,8 +231,6 @@ static inline struct arpt_entry *get_entry(void *base, unsigned int offset) return (struct arpt_entry *)(base + offset); } -static DEFINE_PER_CPU(spinlock_t, arp_tables_lock); - unsigned int arpt_do_table(struct sk_buff *skb, unsigned int hook, const struct net_device ...
On Tue, 14 Apr 2009 16:23:33 +0200
I see no demonstrated problem with locking in my version.
The reader/writer race is already handled. On replace the race of
CPU 0 CPU 1
lock (iptables(1))
refer to oldinfo
swap in new info
foreach CPU
lock iptables(i)
(spin) unlock(iptables(1))
read oldinfo
unlock
...
The point is my locking works, you just seem to feel more comfortable with
With RCU type semantics this isn't an issue. A CPU always sees consistent
--
Oh right, I missed that xt_replace_table() was *followed* by a get_counters() call, but I am pretty sure something is needed in xt_replace_table(). A memory barrier at least (smp_wmb()) As soon as we do "table->private = newinfo;", other cpus might fetch incorrect values for newinfo->fields. In the past, we had a write_lock_bh()/write_unlock_bh() pair that was doing this for us. Then we had rcu_assign_pointer() that also had this memory barrier implied. Even if vmalloc() calls we do before calling xt_replace_table() probably already force barriers, add one for reference, just in case we change callers Previous to RCU conversion, we had a rwlock. Doing a write_lock_bh() on it while reading counters (iptables -L) *did* stop all cpus from doing their read_lock_bh() and counters updates. After RCU and your last patch, an "iptables -L" locks each table one by one. This is correct, since a cpu wont update its table while we are fetching it, but we lost previous "rwlock freeze all" behavior, and some apps/users could complain about it, this is why I said "not sure it is mandatory"... Here is an updated patch ontop of yours, with the smp_wmb() in xt_replace_table() : Thank you include/linux/netfilter/x_tables.h | 5 ++ net/ipv4/netfilter/arp_tables.c | 20 +++------ net/ipv4/netfilter/ip_tables.c | 24 ++++------- net/ipv6/netfilter/ip6_tables.c | 24 ++++------- net/netfilter/x_tables.c | 57 ++++++++++++++++++++++++++- 5 files changed, 86 insertions(+), 44 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 1ff1a76..a5840a4 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -426,6 +426,11 @@ extern struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af, const char *name); extern void xt_table_unlock(struct xt_table *t); +extern void xt_tlock_lockall(void); +extern void xt_tlock_unlockall(void); +extern void ...
Tested as well. Loaded as fast as 2.6.29. Thanks, Jeff. --
Subject: iptables: This is an alternative version of ip/ip6/arp tables locking using per-cpu locks. This avoids the overhead of synchronize_net() during update but still removes the expensive rwlock in earlier versions. The idea for this came from an earlier version done by Eric Duzamet. Locking is done per-cpu, the fast path locks on the current cpu and updates counters. The slow case involves acquiring the locks on all cpu's. The mutex that was added for 2.6.30 in xt_table is unnecessary since there already is a mutex for xt[af].mutex that is held. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> --- include/linux/netfilter/x_tables.h | 5 - net/ipv4/netfilter/arp_tables.c | 112 +++++++++------------------------ net/ipv4/netfilter/ip_tables.c | 123 +++++++++++-------------------------- net/ipv6/netfilter/ip6_tables.c | 119 +++++++++++------------------------ net/netfilter/x_tables.c | 28 -------- 5 files changed, 110 insertions(+), 277 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-04-14 10:13:59.932292529 -0700 +++ b/include/linux/netfilter/x_tables.h 2009-04-14 10:14:04.121043331 -0700 @@ -354,9 +354,6 @@ struct xt_table /* What hooks you will enter on */ unsigned int valid_hooks; - /* Lock for the curtain */ - struct mutex lock; - /* Man behind the curtain... */ struct xt_table_info *private; @@ -434,8 +431,6 @@ extern void xt_proto_fini(struct net *ne extern struct xt_table_info *xt_alloc_table_info(unsigned int size); extern void xt_free_table_info(struct xt_table_info *info); -extern void xt_table_entry_swap_rcu(struct xt_table_info *old, - struct xt_table_info *new); /* * This helper is performance critical and must be inlined --- a/net/ipv4/netfilter/ip_tables.c 2009-04-14 10:13:59.923167357 -0700 +++ b/net/ipv4/netfilter/ip_tables.c 2009-04-14 11:13:51.066830533 -0700 @@ -297,6 +297,8 @@ static void trace_packet(struct sk_buff } #endif +static ...
Oh well, it seems factorization of this stuff is not what you want, so I'll stop arguing. Please check spelling of my name in ChangeLog, and more importantly : initialize arp_tables_lock that is missing in V1/2 for_each_possible_cpu(cpu) spin_lock_init(&per_cpu(arp_tables_lock, cpu)); Then please add my : Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Thanks --
That's no fun.. I didn't mind the refactorizing, but was concerned that it caused the locking to be out of line in fast path. --
This is an alternative version of ip/ip6/arp tables locking using per-cpu locks. This avoids the overhead of synchronize_net() during update but still removes the expensive rwlock in earlier versions. The idea for this came from an earlier version done by Eric Dumazet. Locking is done per-cpu, the fast path locks on the current cpu and updates counters. The slow case involves acquiring the locks on all cpu's. The mutex that was added for 2.6.30 in xt_table is unnecessary since there already is a mutex for xt[af].mutex that is held. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> --- include/linux/netfilter/x_tables.h | 5 - net/ipv4/netfilter/arp_tables.c | 112 +++++++++------------------------ net/ipv4/netfilter/ip_tables.c | 123 +++++++++++-------------------------- net/ipv6/netfilter/ip6_tables.c | 119 +++++++++++------------------------ net/netfilter/x_tables.c | 28 -------- 5 files changed, 110 insertions(+), 277 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-04-14 10:13:59.932292529 -0700 +++ b/include/linux/netfilter/x_tables.h 2009-04-14 10:14:04.121043331 -0700 @@ -354,9 +354,6 @@ struct xt_table /* What hooks you will enter on */ unsigned int valid_hooks; - /* Lock for the curtain */ - struct mutex lock; - /* Man behind the curtain... */ struct xt_table_info *private; @@ -434,8 +431,6 @@ extern void xt_proto_fini(struct net *ne extern struct xt_table_info *xt_alloc_table_info(unsigned int size); extern void xt_free_table_info(struct xt_table_info *info); -extern void xt_table_entry_swap_rcu(struct xt_table_info *old, - struct xt_table_info *new); /* * This helper is performance critical and must be inlined --- a/net/ipv4/netfilter/ip_tables.c 2009-04-14 10:13:59.923167357 -0700 +++ b/net/ipv4/netfilter/ip_tables.c 2009-04-14 11:13:51.066830533 -0700 @@ -297,6 +297,8 @@ static void trace_packet(struct sk_buff } #endif +static DEFINE_PER_CPU(spinlock_t, ...
Tested successfuly on my dev machine, thanks Stephen.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
"strace -tt -T iptables -L" on this 8 cpu machine show quite fast operations now
(sub 1/HZ)
23:37:26.629489 getsockopt(3, SOL_IP, 0x40 /* IP_??? */, "filter\0\300\\\236@\365\271\231\"\300`\225T\367\1\0\0\0\0vm\300`\225T\367\
16"..., [84]) = 0 <0.000008>
23:37:26.629579 brk(0) = 0x8054000 <0.000006>
23:37:26.629613 brk(0x8075000) = 0x8075000 <0.000007>
23:37:26.629660 getsockopt(3, SOL_IP, 0x41 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0L"..., [10608])
= 0 <0.000031>
23:37:26.629772 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000007>
same fast operation for an "iptables -A INPUT" operation
23:37:02.180313 close(4) = 0 <0.000006>
23:37:02.180382 setsockopt(3, SOL_IP, 0x40 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\16"..., 10664)
= 0 <0.000114>
23:37:02.180552 setsockopt(3, SOL_IP, 0x41 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20"..., 292) =
0 <0.000015>
23:37:02.180635 close(3) = 0 <0.000010>
--
Applied, thanks everyone. I'll give it some testing myself and will send it upstream tonight. --
On Wed, 15 Apr 2009 12:59:03 +0200 I am running it with LOCKDEP now to check for any issues. It also needs to be validated with SMP configured kernel running on UP. --
Looks like there is some recursive path into ip_tables that makes the per-cpu spinlock break. I get lockup's with KVM networking. Suggestions? [ 1931.788787] virbr3: port 1(vnet2) entering forwarding state [ 2106.068500] Pid: 12397, comm: kvm Not tainted 2.6.30-rc1 #6 [ 2106.068501] Call Trace: [ 2106.068503] <IRQ> [<ffffffff80393a09>] ? _raw_spin_lock+0xdd/0x105 [ 2106.068513] [<ffffffffa024041d>] ? ipt_do_table+0x102/0x5f1 [ip_tables] [ 2106.068516] [<ffffffff8048d4e0>] ? skb_checksum+0x4c/0x257 [ 2106.068521] [<ffffffffa025f1fc>] ? manip_pkt+0x80/0xf4 [nf_nat] [ 2106.068525] [<ffffffffa025f432>] ? nf_nat_icmp_reply_translation+0x1c2/0x245 [nf_nat] [ 2106.068528] [<ffffffff8048f632>] ? __alloc_skb+0x6b/0x13d [ 2106.068536] [<ffffffffa0246cfe>] ? nf_conntrack_in+0x45e/0x5ac [nf_conntrack] [ 2106.068540] [<ffffffffa01602b1>] ? nf_nat_fn+0xc1/0x14d [iptable_nat] [ 2106.068544] [<ffffffff804b00d0>] ? nf_iterate+0x41/0x7d [ 2106.068547] [<ffffffff804b8ed0>] ? dst_output+0x0/0xb [ 2106.068550] [<ffffffff804b0195>] ? nf_hook_slow+0x89/0x104 [ 2106.068552] [<ffffffff804b8ed0>] ? dst_output+0x0/0xb [ 2106.068555] [<ffffffff80393925>] ? _raw_spin_unlock+0x8b/0x92 [ 2106.068557] [<ffffffff804ba8c7>] ? __ip_local_out+0x98/0x9a [ 2106.068559] [<ffffffff804ba8d2>] ? ip_local_out+0x9/0x1f [ 2106.068562] [<ffffffff804babb4>] ? ip_push_pending_frames+0x2cc/0x33e [ 2106.068566] [<ffffffff804dac79>] ? icmp_send+0x559/0x588 [ 2106.068569] [<ffffffff8022d3a0>] ? task_rq_lock+0x46/0x79 [ 2106.068571] [<ffffffff8023004f>] ? enqueue_task_fair+0x23b/0x293 [ 2106.068575] [<ffffffffa00f5083>] ? reject_tg+0x41/0x30e [ipt_REJECT] [ 2106.068578] [<ffffffffa024084f>] ? ipt_do_table+0x534/0x5f1 [ip_tables] [ 2106.068581] [<ffffffff804b37a0>] ? rt_intern_hash+0x46f/0x48a [ 2106.068584] [<ffffffff804b00d0>] ? nf_iterate+0x41/0x7d [ 2106.068587] [<ffffffff804b7d94>] ? ip_forward_finish+0x0/0x3b [ 2106.068589] [<ffffffff804b0195>] ? nf_hook_slow+0x89/0x104 [ 2106.068591] ...
Well, it seems original patch was not so bad after all http://lists.netfilter.org/pipermail/netfilter-devel/2006-January/023175.html So change per-cpu spinlocks to per-cpu rwlocks and use read_lock() in ipt_do_table() to allow recursion... --
iptables cannot quite recurse into itself due to the comefrom stuff. --
Yes, but it has to return an absolute verdict (which REJECT does), so it's not really a recursion, it's more like a goto without return. --
Its recursion in the sense that we reenter the same code path, while holding a lock. The verdict is issued *after* recursing. A (quite ugly) workaround would be to have ipt_REJECT queue the packets to a temporary queue and have ipt_do_table call dst_output() after dropping the lock. --
Yet another alternative version of ip/ip6/arp tables locking using per-cpu locks. This avoids the overhead of synchronize_net() during update but still removes the expensive rwlock in earlier versions. The idea for this came from an earlier version done by Eric Dumazet. Locking is done per-cpu, the fast path locks on the current cpu and updates counters. The slow case involves acquiring the locks on all cpu's. This version uses RCU for the table->base reference but per-cpu-rwlock for counters. The mutex that was added for 2.6.30 in xt_table is unnecessary since there already is a mutex for xt[af].mutex that is held. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com --- include/linux/netfilter/x_tables.h | 5 - net/ipv4/netfilter/arp_tables.c | 122 +++++++++++-------------------------- net/ipv4/netfilter/ip_tables.c | 122 ++++++++++--------------------------- net/ipv6/netfilter/ip6_tables.c | 118 ++++++++++------------------------- net/netfilter/x_tables.c | 26 ------- 5 files changed, 112 insertions(+), 281 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-04-15 08:44:01.449318844 -0700 +++ b/include/linux/netfilter/x_tables.h 2009-04-15 14:55:27.387131493 -0700 @@ -354,9 +354,6 @@ struct xt_table /* What hooks you will enter on */ unsigned int valid_hooks; - /* Lock for the curtain */ - struct mutex lock; - /* Man behind the curtain... */ struct xt_table_info *private; @@ -434,8 +431,6 @@ extern void xt_proto_fini(struct net *ne extern struct xt_table_info *xt_alloc_table_info(unsigned int size); extern void xt_free_table_info(struct xt_table_info *info); -extern void xt_table_entry_swap_rcu(struct xt_table_info *old, - struct xt_table_info *new); /* * This helper is performance critical and must be inlined --- a/net/ipv4/netfilter/ip_tables.c 2009-04-15 08:44:01.441318723 -0700 +++ b/net/ipv4/netfilter/ip_tables.c 2009-04-15 14:34:42.688131823 -0700 @@ -297,6 +297,8 @@ static void ...
From: Eric Dumazet <dada1@cosmosbay.com> Grumble, one more barrier to getting rid of rwlocks in the whole tree. :-/ I really think we should entertain the idea where we don't RCU quiesce when adding rules. That was dismissed as not workable because the new rule must be "visible" as soon as we return to userspace but let's get real, effectively it will be. If there are any stale object reference issues, we can use RCU object destruction to handle that kind of thing. I almost cringed when the per-spinlock idea was proposed, but per-cpu rwlocks just takes things too far for my tastes. --
On Wed, 15 Apr 2009 16:48:11 -0700 (PDT) Hey, we are reinventing your brwlock ;-< The other option is use spinlock, over a smaller area (only counters), The counters are the bigger problem, otherwise we could just free table info via rcu. Do we really have to support: replace where the counter values coming out to user space are always exactly accurate, or is it allowed to replace a rule and maybe lose some counter ticks (worst case The problem is pulling the counter values out of the object in the replace case. It could be changed to use some form of counting semaphore --
From: Stephen Hemminger <shemminger@vyatta.com> I say this case doesn't matter until someone can prove that it's any different from the IPTABLES replace operation system call executing a few microseconds earlier or later. There really is no difference, and we're making complexity out of nothing just to ensure something which isn't actually guarenteed right now. --
Actually I believe it does work right now. Userspace maps the old counter values to the replacement rules and the kernel adds them up, so in the end we currently should always have accurate counters, independant of the exact time when a replacement took place. --
Why not just read the counters fromt he old one at RCU free time (they are guaranteed to be stable at that point, since we're all done with those entries), and apply them at that point to the current setup? Linus --
This is an alternative version of ip/ip6/arp tables locking using per-cpu locks. This avoids the overhead of synchronize_net() during update but still removes the expensive rwlock in earlier versions. The idea for this came from an earlier version done by Eric Dumazet. Locking is done per-cpu, the fast path locks on the current cpu and updates counters. The slow case involves acquiring the locks on all cpu's. This version uses RCU for the table->base reference but per-cpu-lock for counters. The mutex that was added for 2.6.30 in xt_table is unnecessary since there already is a mutex for xt[af].mutex that is held. This version does not do coarse locking or synchronize_net() during the __do_replace function, so there is a small race which allows for some of the old counter values to be incorrect (Ncpu -1). Scenario would be replacing a rule set and the same rules are inflight on other CPU. The other CPU might still be looking at the old rules (and update those counters), after counter values have been captured. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> --- include/linux/netfilter/x_tables.h | 11 +-- net/ipv4/netfilter/arp_tables.c | 121 +++++++++++-------------------------- net/ipv4/netfilter/ip_tables.c | 121 ++++++++++--------------------------- net/ipv6/netfilter/ip6_tables.c | 118 +++++++++++------------------------- net/netfilter/x_tables.c | 45 +++++++------ 5 files changed, 137 insertions(+), 279 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-04-15 08:44:01.449318844 -0700 +++ b/include/linux/netfilter/x_tables.h 2009-04-15 17:08:35.303217128 -0700 @@ -354,9 +354,6 @@ struct xt_table /* What hooks you will enter on */ unsigned int valid_hooks; - /* Lock for the curtain */ - struct mutex lock; - /* Man behind the curtain... */ struct xt_table_info *private; @@ -385,6 +382,12 @@ struct xt_table_info unsigned int hook_entry[NF_INET_NUMHOOKS]; unsigned int ...
This version is a regression over 2.6.2[0-9], because of two points 1) Much more atomic ops : added on each counter updates. On many setups, each packet coming in or out of the machine has to update between 2 to 20 rule counters. So to avoid *one* atomic ops of read_unlock(), this v4 version adds 2 to 20 atomic ops... I still not see the problem between the previous version (2.6.2[0-8]) that had a central rwlock, that hurted performance on SMP because of cache line ping pong, and the solution having one rwlock per cpu. We wanted to reduce the cache line ping pong first. This *is* the hurting point, by an order of magnitude. We tried a full RCU solution, it took us three years and we failed. Lets take an easy solution, before whole replacement of x_table by new Patrick infrastructure. Then, if it appears the rwlock itself and its two atomic ops are *really* a problem, we can go further, but I doubt modern cpus really care about atomic ops on an integer already hot in L1 cache. 2) Second problem : Potential OOM About freeing old rules with call_rcu() and/or schedule_work(), this is going to OOM pretty fast on small appliances with basic firewall setups loading rules one by one, as done by original topic reporter. We had reports from guys using linux with 4MB of available ram (French provider free.fr on their applicance box), and we had to use SLAB_DESTROY_BY_RCU thing on conntrack to avoid OOM for their setups. We dont want to use call_rcu() and queue 100 or 200 vfree(). So I prefer your v3 version, even if I didnt tested yet. --
Dave doesn't seem to like the rwlock approach. I don't see a way to do anything asynchronously like call_rcu() to improve this, so to We could use this to replace the counters, presuming it is Agreed. --
Well, we don't really need an rwlock, especially given that we really
don't want two "readers" incrementing the same counter concurrently.
A safer approach would be to maintain a flag in the task structure
tracking which (if any) of the per-CPU locks you hold. Also maintain
a recursion-depth counter. If the flag says you don't already hold
the lock, set it and acquire the lock. Either way, increment the
recursion-depth counter:
if (current->netfilter_lock_held != cur_cpu) {
BUG_ON(current->netfilter_lock_held != CPU_NONE);
spin_lock(per_cpu(..., cur_cpu));
current->netfilter_lock_held = cur_cpu;
}
current->netfilter_lock_nesting++;
And reverse the process to unlock:
if (--current->netfilter_lock_nesting == 0) {
spin_unlock(per_cpu(..., cur_cpu));
current->netfilter_lock_held = CPU_NONE;
One way to accomplish this is to take Mathieu Desnoyers's user-level
RCU implementation and drop it into the kernel, replacing the POSIX
This is not a real problem be handled by doing a synchronize_rcu() every
so often as noted in a prior email elsewhere in this thread:
call_rcu(...);
if (++count > 50) {
synchronize_rcu();
count = 0;
}
This choice of constant would reduce the grace-period pain to 2% of the
full effect, which should be acceptable, at least if I remember the
original problem report of 0.2 seconds growing to 6.0 seconds -- this
would give you:
(6.0-0.2)/50+0.2 = .316
I would argue that 100 milliseconds is an OK penalty for a deprecated
feature. But of course the per-CPU lock approach should avoid even that
penalty, albeit at some per-packet penalty. However, my guess is that
this per-packet penalty is not measureable at the system level.
And if the penalty of a single uncontended lock -is- measureable, I will
be very quick to offer my congratulations, at least once I get my jaw
off my keyboard. ;-)
Thanx, Paul
--
Yes, you are right, we can avoid rwlock, but use a 'recursive' lock
or spin_trylock()
We can use one counter close to the spinlock,
no need to add one or two fields to every "thread_info"
struct rec_lock {
spinlock_t lock;
int count;
};
static DEFINE_PER_CPU(struct rec_lock, ip_tables_lock);
I also considered using regular spinlocks and spin_trylock() to "detect"
the recurse case without a global counter.
lock :
local_bh_disable();
int locked = spin_trylock(&__get_cpu_var(arp_tables_lock);
unlock:
if (likely(locked))
spin_unlock(&__get_cpu_var(arp_tables_lock));
local_bh_enable();
But we would lose some runtime features, I dont feel comfortable about
this trylock version. What others people think ?
Here is the resulting patch, based on Stephen v4
(Not sure we *need* recursive spinlock for the arp_tables, but it seems
better to have an uniform implementation)
[PATCH] netfilter: use per-cpu recursive spinlock (v6)
Yet another alternative version of ip/ip6/arp tables locking using
per-cpu locks. This avoids the overhead of synchronize_net() during
update but still removes the expensive rwlock in earlier versions.
The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters. The slow case involves acquiring the locks on
all cpu's.
The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.
We have to use recursive spinlocks because netfilter can sometimes
nest several calls to ipt_do_table() for a given cpu.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
include/linux/netfilter/x_tables.h | 5 -
net/ipv4/netfilter/arp_tables.c | 131 +++++++++------------------
net/ipv4/netfilter/ip_tables.c | 130 +++++++++-----------------
net/ipv6/netfilter/ip6_tables.c | 127 ...Oh well, this wont work of course, forget about this trylock thing :) --
What the _fuck_ are you doing? Stop sending these shit-for-brains crazy patches out. That's not a lock, that's a messy way of saying "I don't know what the hell I'm doing, but I'll mess things up". Don't do recursive locks (or your own locking primitives in general), but goddammit, if you have to, at least know what the hell you're doing. Your thing is a piece of shit. A recursive lock needs an owner, or it's not a lock at all. It's some random data structure that contains a lock that may or may not be taken, and that may actually _work_ depending on the exact patterns of taking the lock, but that's not an excuse. The fact that code "happens to work by mistake" (and I'm not saying that your does - but it might just because of the per-cpu'ness of it, and I'm not even going to look at crap like that it closer to try to prove it one way or the other) does not make that code acceptable. Because even if it works today, it's just a bug waiting to happen. The thing you did is _not_ a generic recursive lock, and it does _not_ work in general. Don't call it a "rec_lock". Don't write code that accesses it without any comments as if it was simple. Just DON'T. Guys, this whole discussion has just been filled with crazy crap. Can somebody even explain why we care so deeply about some counters for something that we just _deleted_ and that have random values anyway? I can see the counters being interesting while a firewall is active, but I sure don't see what's so wonderfully interesting after-the-fact about a counter on something that NO LONGER EXISTS that it has to be somehow "exactly right". And it's certainly not interesting enough to merit this kind of random fragile crazy code. Please. Get a grip, people! Show of hands, here: tell me a single use that really _requires_ those exact counters of a netfilter rule that got deleted and is no longer active? Linus --
They're copied to userspace after replacing the ruleset, associated with the rules that are still active after the change and then added to the current counters in a second operation. The end result is that the People use netfilter for accounting quite a lot. Having dynamic updates is also not uncommon, so this might actually matter. --
Hmmm...
What happens if some other CPU is actually holding the lock? For
I do not believe that it actually works.
I much prefer your earlier idea of associating a counter with the lock.
But an owner field is also required, please see below. Or please let me
know what I am missing.
We also need an owner field:
But if some other CPU holds the lock, this code would fail to wait for
that other CPU to release the lock, right? It also might corrupt the
rl->count field due to two CPUs accessing it concurrently non-atomically.
I suggest the following, preferably in a function or macro or something:
cur_cpu = smp_processor_id();
if (likely(rl->owner != cur_cpu) {
spin_lock(&rl->lock);
rl->owner = smp_processor_id();
rl->count = 1;
} else {
rl->count++;
}
And the inverse for unlock.
--
If another cpu holds the lock, this cpu will spin on its own lock. Remember other cpus dont touch rl->count. This is a private field, only touched by the cpu on its own per_cpu data. There is no possible 'corruption' So the owner of the per_cpu data does : /* * disable preemption, get rl = &__get_cpu_var(arp_tables_lock); * then : */ lock_time : if (++rl->count == 0) spin_lock(&rl->lock); unlock_time: if (likely(--rl->count == 0)) spin_unlock(&rl->lock); while other cpus only do : spin_lock(&rl->lock); /* work on data */ spin_unlock(&rl->lock); Apparently Linus missed it too, and reacted badly to my mail. I dont know why we discuss of this stuff on lkml either... I stop working on this subject and consider drinking dome hard stuf and watching tv :) See you --
This version of x_tables (ip/ip6/arp) locking uses a per-cpu rwlock that can be nested. It is sort of like earlier brwlock (fast reader, slow writer). The locking is isolated so future improvements can concentrate on measuring/optimizing xt_table_info_lock. I tried other versions based on recursive spin locks and sequence counters and for me, the risk of inventing own locking primitives not worth it at this time. The idea for this came from an earlier version done by Eric Dumazet. Locking is done per-cpu, the fast path locks on the current cpu and updates counters. This reduces the contention of a single reader lock (in 2.6.29) without the delay of synchronize_net() (in 2.6.30-rc2). The mutex that was added for 2.6.30 in xt_table is unnecessary since there already is a mutex for xt[af].mutex that is held. Lockdep reports bogus warnings on this, so using raw_write_lock might be necessary. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com --- include/linux/netfilter/x_tables.h | 34 +++++++++-- net/ipv4/netfilter/arp_tables.c | 110 ++++++++----------------------------- net/ipv4/netfilter/ip_tables.c | 110 +++++++------------------------------ net/ipv6/netfilter/ip6_tables.c | 108 ++++++++---------------------------- net/netfilter/x_tables.c | 63 ++++++++++++++------- 5 files changed, 144 insertions(+), 281 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-04-16 13:40:57.256734671 -0700 +++ b/include/linux/netfilter/x_tables.h 2009-04-16 13:40:58.858044088 -0700 @@ -354,9 +354,6 @@ struct xt_table /* What hooks you will enter on */ unsigned int valid_hooks; - /* Lock for the curtain */ - struct mutex lock; - /* Man behind the curtain... */ struct xt_table_info *private; @@ -434,8 +431,35 @@ extern void xt_proto_fini(struct net *ne extern struct xt_table_info *xt_alloc_table_info(unsigned int size); extern void xt_free_table_info(struct xt_table_info *info); -extern void ...
This is stil scary. Do we guarantee that read-locks nest in the presense of a waiting writer on another CPU? Now, I know we used to (ie readers always nested happily with readers even if there were pending writers), and then we broke it. I don't know that we ever unbroke it. IOW, at least at some point we deadlocked on this (due to trying to be fair, and not lettign in readers while earlier writers were waiting): CPU#1 CPU#2 read_lock write_lock .. spins with write bit set, waiting for readers to go away .. recursive read_lock .. spins due to the write bit being. BOOM: deadlock .. Now, I _think_ we avoid this, but somebody should double-check. Also, I have still yet to hear the answer to why we care about stale counters of dead rules so much that we couldn't just free them later with RCU. Linus --
This is a narrow special-case where the spin-rwlock is safe, and the
rwsem is unsafe.
But it should work for rwlocks - it always worked and the networking
code always relied on that AFAIK.
Here's the x86 assembly code of the write-side slowpath:
ENTRY(__write_lock_failed)
CFI_STARTPROC
LOCK_PREFIX
addl $RW_LOCK_BIAS,(%rdi)
1: rep
nop
cmpl $RW_LOCK_BIAS,(%rdi)
jne 1b
LOCK_PREFIX
subl $RW_LOCK_BIAS,(%rdi)
jnz __write_lock_failed
ret
CFI_ENDPROC
the fastpath decreased the value with RW_LOCK_BIAS, and when we
enter this function we undo that effect by adding RW_LOCK_BIAS. Then
we spin (without signalling our write-intent) passively until the
count reaches RW_LOCK_BIAS. Then we try to lock it again and bring
it to zero (meaning no other readers or writers - we got the lock).
This is pretty much the most unfair strategy possible for writers -
but this is how rwlocks always behaved - and they do so mostly for
recursive use within networking.
This is why the tasklist_lock was always so suspect to insane
starvation symptoms on really large SMP systems, and this is why
write_lock_irq(&tasklist_lock) was always a dangerous operation to
do. (it can spin for a _long_ time with irqs off.)
It's not the most optimal of situations. Some patches are in the
works to fix the irqs-off artifact (on ia64 - no x86 patches yet
AFAICS) - but that's just papering it over.
Ingo
--
Encapsulating them so that they appear in the same place might (or might not) have gotten the fact that you were not doing a recursive lock through my head. Even so, the name "rec_lock" might have overwhelmed That -is- extreme! ;-) --
We need the counters immediately to copy them to userspace, so waiting for an asynchronous RCU free is not going to work. --
From: Patrick McHardy <kaber@trash.net> It just occurred to me that since all netfilter packet handling goes through one place, we could have a sort-of "netfilter RCU" of sorts to solve this problem. --
OK, I am putting one together... It will be needed sooner or later, though I suspect per-CPU locking would work fine in this case. Thanx, Paul --
This version of x_tables (ip/ip6/arp) locking uses a per-cpu
recursive lock that can be nested. It is sort of like existing kernel_lock,
rwlock_t and even old 2.4 brlock.
"Reader" is ip/arp/ip6 tables rule processing which runs per-cpu.
It needs to ensure that the rules are not being changed while packet
is being processed.
"Writer" is used in two cases: first is replacing rules in which case
all packets in flight have to be processed before rules are swapped,
then counters are read from the old (stale) info. Second case is where
counters need to be read on the fly, in this case all CPU's are blocked
from further rule processing until values are aggregated.
The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters. This reduces the contention of a
single reader lock (in 2.6.29) without the delay of synchronize_net()
(in 2.6.30-rc2).
The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.
Future optimizations possible:
- Lockdep doesn't really handle this well
- hot plug CPU case, if kernel is built with large # of CPU's, skip
the inactive ones; migrate values when CPU is removed.
- reading counters could be incremental by CPU.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com
---
include/linux/netfilter/x_tables.h | 10 +--
net/ipv4/netfilter/arp_tables.c | 110 ++++++++----------------------------
net/ipv4/netfilter/ip_tables.c | 110 +++++++-----------------------------
net/ipv6/netfilter/ip6_tables.c | 108 ++++++++---------------------------
net/netfilter/x_tables.c | 113 ++++++++++++++++++++++++++++++-------
5 files changed, 170 insertions(+), 281 deletions(-)
--- a/include/linux/netfilter/x_tables.h 2009-04-16 15:09:53.082406828 -0700
+++ b/include/linux/netfilter/x_tables.h 2009-04-16 15:10:17.154966874 -0700
@@ -354,9 +354,6 @@ struct ...On Fri, Apr 17, 2009 at 7:52 AM, Stephen Hemminger Tested and working. As fast as before. Thanks, Jeff. --
Quite so, this is the old MAX_LOCK_DEPTH < NR_CPUS issue for large systems. Last time this came up David found another way of solving the problem. Not having fully read this thread, I cannot suggest one myself -- except that RCU domains as suggested by David sound good. --
I like this version 8 of the patch, as it mixes all ideas we had, but have two questions. Previous netfilter code (and 2.6.30-rc2 one too) disable BH, not only preemption. I see xt_table_info_lock_all(void) does block BH, so this one is safe. I let Patrick or other tell us if its safe to run ipt_do_table() with preemption disabled but BH enabled, I really dont know. Also, please dont call this a 'recursive lock', since it is not a general recursive lock, as pointed by Linus and Paul. Second question is about MAX_LOCK_DEPTH Why dont use this kind of construct to get rid of this limit ? --
Very good point, so 256 nested spin_lock() instances will make the kernel unhappy -- since we now (almost?) support up to 4096 cpus, this seems like a no-no. --
No, on jumps the return position is stored in the per-cpu copy of the ruleset and we must prevent BH context corrupting the value of something running in process context. --
And here is a crude first cut. Untested, probably does not even compile.
Straight conversion of Mathieu Desnoyers's user-space RCU implementation
at git://lttng.org/userspace-rcu.git to the kernel (and yes, I did help
a little, but he must bear the bulk of the guilt). Pick on srcu.h
and srcu.c out of sheer laziness. User-space testing gives deep
sub-microsecond grace-period latencies, so should be fast enough, at
least if you don't mind two smp_call_function() invocations per grace
period and spinning on each instance of a per-CPU variable.
Again, I believe per-CPU locking should work fine for the netfilter
counters, but I guess "friends don't let friends use hashed locks".
(I would not know for sure, never having used them myself, except of
course to protect hash tables.)
Most definitely -not- for inclusion at this point. Next step is to hack
up the relevant rcutorture code and watch it explode on contact. ;-)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
include/linux/srcu.h | 30 ++++++++++++++++++++++++
kernel/srcu.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 93 insertions(+)
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index aca0eee..4577cd0 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -50,4 +50,34 @@ void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
void synchronize_srcu(struct srcu_struct *sp);
long srcu_batches_completed(struct srcu_struct *sp);
+/* Single bit for grace-period index, low-order bits are nesting counter. */
+#define RCU_FGP_COUNT 1UL
+#define RCU_FGP_PARITY (1UL << (sizeof(long) << 2))
+#define RCU_FGP_NEST_MASK (RCU_FGP_PARITY - 1)
+
+extern long rcu_fgp_ctr;
+DECLARE_PER_CPU(long, rcu_fgp_active_readers);
+
+static inline void rcu_read_lock_fgp(void)
+{
+ long tmp;
+ long *uarp;
+
+ preempt_disable();
+ uarp = &__get_cpu_var(rcu_fgp_active_readers);
+ tmp = *uarp;
+ if (likely(!(tmp & ...I'm innocent, I swear ;-) That should give very impressive performance results. OK, so we are translating the original implementation from per-thread to per-cpu, with preemption disabled. Fine with me if we can't afford the per-thread unsigned long nor can't afford to iterate on each thread when ACCESS_ONCE(rcu_fgp_ctr) could not hurt here I think. Given the surrounding code, that does not seem like a necessity, but that would I kind of expect an IPI with a smp_mb() to promote this barrier() to a smp_mb() when the update side needs to wait for a quiescent state. I Saying why we populate the value 1 here (RCU_FGP_COUNT) as an I would be tempted to add a comment here telling hot cpu hotunplug cannot let us wait forever, given all read-side critical sections we can be busy-waiting for are required to have preemption disabled, and are Ah, here it is. Commenting that it matches the two barrier()s I identified /* * Call a function on all other processors */ int smp_call_function(void(*func)(void *info), void *info, int wait); I guess you meant on_each_cpu ? That should include "self", given we Same as above. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I wouldn't expect more than about three or four orders of magnitude I believe that it is safe. Only one bit of rcu_fgp_ctr ever changes, so we should be immune from load tearing. We only load it once and only do one thing with it, and we have a barrier() before (as part of preempt_disable()) and after, so I don't think that the compiler has much latitude here. In theory, we could get store tearing through *uarp, but if gcc did that, much of the kernel would go down in flames. In contrast, in the user-mode version, there was no barrier() on entry, Hmmm... Why do we need "self"? Because synchronize_rcu_fgp() blocks, it had jolly well better not be within a read-side critical section. Yeah, I never was all that good at disguising code anyway. But I did keep a couple of changes. ;-) Updated patch below. Thanx, Paul ------------------------------------------------------------------------ And here is a crude second cut. Untested, probably does not even compile. Straight conversion of Mathieu Desnoyers's user-space RCU implementation at git://lttng.org/userspace-rcu.git to the kernel (and yes, I did help a little, but he must bear the bulk of the guilt). Pick on srcu.h and srcu.c out of sheer laziness. User-space testing gives deep sub-microsecond grace-period latencies, so should be fast enough, at least if you don't mind two smp_call_function() invocations per grace period and spinning on each instance of a per-CPU variable. Again, I believe per-CPU locking should work fine for the netfilter counters, but I guess "friends don't let friends use hashed locks". (I would not know for sure, never having used them myself, except of course to protect hash tables.) Most definitely -not- for inclusion at this point. Next step is to hack up the relevant rcutorture code and watch it explode on contact. ;-) Changes since v1: o Applied Mathieu's feedback. o Added docbook headers and other comments. o Added the ...
I mean that I think we also need some smp_mb()s on the writer side, don't we ? If we want the changes done by the writer (assign pointer) to be shown to the readers before the writer starts flipping the parity, a smp_mb() is needed at the beginning of synchronize_rcu_fgp() (actually at the same location where you call the rcu_fgp_do_mb ipis), same at the end (so we order parity flipping with the next assign pointer). Or maybe it's getting late and I am missing the obvious. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
The smp_call_function() itself must have barriers in order to ensure that the other CPUs see the updates to its parameter block. But see my upcoming response to Dave and Peter. --
On Thu, 16 Apr 2009 18:28:12 -0700 I am glad to see this worked on, but would rather not use RCU in this case of iptables. It would be good for some of the other long grace period sutff. The code to per-cpu entry consolidation by alloc/flip in 2.6.30-rc2 was hard to debug and more convoluted so it probably would be a long term maintaince nightmare. The issue was the variable size skip structure so it made for lots of iterators, etc. If the non-RCU per-cpu spinlock version is just as fast, it is easier to understand. --
Agreed, as noted above. Mostly just getting tired of people complaining about long grace periods. Again, this patch cannot replace standard RCU Your per-CPU-lock patch looked more straightforward to me than did the RCU patch. Thanx, Paul --
I agree that for 2.6.30, we could use a per-cpu spinlock as your last patch did,
this would be very risky to push this new RCU right now.
But this new stuff looks very promising, (no more locked ops on fast path),
and considering new percpu_{add|sub...} infra, very fast :
static inline void rcu_read_unlock_fgp(void)
{
barrier();
percpu_sub(rcu_fgp_active_readers, 1); /* one instruction on x86 */
preempt_enable();
}
I wonder if IPI are really necessary on x86 if we use percpu_sub() since
it already contains a barrier, and rcu_read_lock_fgp(void) also ends with
a barrier() call...
--
Very cool!!! If I had seen this, I had forgotten about it. I will give it a try, but only after getting it working the old way. (What, Hmmmm... But x86 can still execute a later load before an earlier store, so it seems to me that there would be the potential for even an x86 CPU to pull loads from the critical section up before the final store of the percpu_sub(), right? If so, we really do still need the IPIs on x86. Thanx, Paul --
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> I don't understand why we're writing such complicated code. Oh I see why, it's because not every arch uses the generic SMP helpers yet :-) Because if they did universally, we could solve this problem so simply, by merely sending a remote softirq to every online cpu. Once those all complete we have enough of a quiesce period, every cpu must have exited any netfilter packet processing code path they were in. And we could know they complete using an atomic counter or something. --
I was with you until you got to the atomic counter, which would require
dealing with CPU hotplug.
But your point is a very good one. We already do have a flavor of
RCU that waits for softirq code, namely rcu-bh. And it is used
exclusively by the networking code:
14 net/ipv4/
13 net/decnet/
10 net/core/
6 net/ipv6/
4 kernel/ [rcutorture, so these four uses don't count.]
3 net/netfilter/
2 net/mac80211/
2 net/packet/
So both yours and Peter's implicit point is quite correct -- the kernel
really does not need yet another flavor of RCU. So maybe I should instead
be thinking in terms of making the existing rcu-bh be better adapted to
the networking code, like maybe a fast synchronize_rcu_bh().
Or do the networking uses of rcu-bh need it to work exactly the way that
it does now?
Thanx, Paul
kernel/rcutorture.c __acquires 392 rcu_read_lock_bh();
kernel/rcutorture.c __releases 398 rcu_read_unlock_bh();
kernel/rcutorture.c rcu_bh_torture_deferred_free 408 call_rcu_bh(&p->rtort_rcu, rcu_torture_cb);
kernel/rcutorture.c rcu_bh_torture_synchronize 429 call_rcu_bh(&rcu.head, rcu_bh_torture_wakeme_after_cb);
net/core/dev.c dev_queue_xmit 1844 rcu_read_lock_bh();
net/core/dev.c dev_queue_xmit 1909 rcu_read_unlock_bh();
net/core/dev.c dev_queue_xmit 1915 rcu_read_unlock_bh();
net/core/filter.c sk_filter 88 rcu_read_lock_bh();
net/core/filter.c sk_filter 95 rcu_read_unlock_bh();
net/core/filter.c sk_filter_delayed_uncharge 477 call_rcu_bh(&fp->rcu, sk_filter_rcu_release);
net/core/filter.c sk_attach_filter 517 rcu_read_lock_bh();
net/core/filter.c sk_attach_filter 520 rcu_read_unlock_bh();
net/core/filter.c sk_detach_filter 532 rcu_read_lock_bh();
net/core/filter.c sk_detach_filter 539 rcu_read_unlock_bh();
net/decnet/dn_route.c dnrt_free 148 call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
net/decnet/dn_route.c dnrt_drop 154 call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
net/decnet/dn_route.c __dn_route_output_key 1161 ...Since its a full broadcast, we can do that _today_ using on_each_cpu(). But whatever way we turn this, this will be a very expensive operation. Imagine doing that on your 256-way for every iptables rules change. --
Well, you _could_ just have a per-CPU bit of "have I used nf rules since the last update", and skip those CPU's. Use memory ordering to check the bits (set the bit _before_ looking up a NF rule, and check them _after_ doing the update, and have a barrier in between if you really think it matters). Remember: the cost was never about a single filter rule update. The cost of a single one is almost immaterial, as long as it's not in hundreds of milliseconds. It's the cost of people building up things incrementally that caused this thing. So if you have 200 "iptables" commands in a sequence, and especially during bootup, a trivial "has the old rule been ever even looked at on this CPU" would already fix the issue. Because it would always be zero in the only case where it matters. This is, of course, what we do for the TLB flushing issue. We don't want to send IPI's to all CPU's, and in 99.999% of all cases we don't need to, because the other CPU's never even loaded the MM. Linus --
One comment, its again a global thing.. I've been playing with the idea for a while now to make all RCU implementations into proper objects so that you can do things like: struct atomic_rcu_domain my_rcu_domain = create_atomic_rcu(); atomic_rcu_read_lock(&my_rcu_domain()); ... atomic_rcu_read_unlock(&my_rcu_domain()); and call_atomic_rcu(&my_rcu_domain, &my_obj->rcu_head, do_something); etc.. We would have: atomic_rcu -- 'classic' non preemptible RCU (treercu these days) sleep_rcu -- 'preemptible' RCU Then have 3 default domains: sched_rcu -- always atomic_rcu rcu -- depends on PREEMPT_RCU preempt_rcu -- always sleep_rcu This would allow generic code to: 1) use preemptible RCU for those cases where needed 2) create smaller RCU domains where needed, such as in this case 3) mostly do away with SRCU Now I realize that the presented RCU implementation has a different grace period method than the existing ones that use the timer tick to drive the state machine, so 2) might not be too relevant here. But maybe we can do something with different grace periods too. Anyway, just an idea because I always get a little offended at the hard coded global variables in all these RCU implementations :-) --
I guess that this one could allow sleeping on mutexes... Does anyone #3 would be good! But... At an API level, there are two differences between SRCU and the other RCU implementations: a. The return value from srcu_read_lock() is passed to srcu_read_unlock(). b. There is a control block passed in to each SRCU primitive. Difference (a) could potentially be taken care of with a few tricks I am trying in the process of getting preemptrcu merged into treercu. Your approach to (b) certainly makes it uniform, there are >500 occurrences of rcu_read_lock() and rcu_read_unlock() each, but only a very few occurrences of srcu_read_lock() and srcu_read_unlock() (like exactly one each!). So adding an argument to rcu_read_lock() I am thinking in terms of adding a synchronize_rcu_bh() with the desired properties. That way we avoid yet another RCU flavor. (What can I say? I got carried away!) Also, since the rcu-bh flavor is used only by networking, we have a fair amount of freedom to tweak it. It will take longer than introducing a new flavor, but Steve Hemminger has a good solution already, and RCU really isn't the thing to do quick hacks on. Thanx, Paul --
I almost did a few times, but never quite got the code that needed it
Right, incrementing one cpu and decrementing another doesn't change the
static inline void rcu_read_lock(void)
{
atomic_rcu_read_lock(&global_atomic_rcu_context);
}
static inline void rcu_read_unlock(void)
{
atomic_rcu_read_unlock(&global_atomic_rcu_context);
}
static inline void call_rcu(struct rcu_head *rcuhead, void (*func)(struct rcu_head *))
{
call_atomic_rcu(&global_atomic_rcu_context, rcuhead, func);
}
Right. I was thinking along the way of providing a watermark (either
nr_queued based, or time based) and once it exceeds try to drive it from
read_unlock. Or similar. unlock driven RCU implementations have the best
Ok, back on topic :-)
I wouldn't exactly call it a good solution, it does a
for_each_possible_cpu() spin_lock();
1) that should probably be for_each_online_cpu()
2) that doesn't scale at all, I'm sure dave's 256-way hurts like mad
when inserting tons of rules and we do that for every iptable
modification.
3) there is no way lockdep can track all that :(
Do we _really_ _really_ __HAVE__ to serialize this? So far I've heard
Patrick say: there might be, a use case. That doesn't sound like: we
should make it dead slow for everybody else.
--
It probably would not be hard to enable preemptable RCU in a !CONFIG_PREEMPT configuration, which would allow mutexes to be acquired in these read-side critical sections. After I fix any relevant bugs, Well, I am trying to get rid of the summing over all CPUs -- really hard Given that for classic and hierarchical RCU, rcu_read_lock() and rcu_read_unlock() just map to preempt_disable() and preempt_enable(), Jim Houston did an unlock-driven implementation some years back: http://marc.theaimsgroup.com/?l=linux-kernel&m=109387402400673&w=2 The read-side overhead can be a problem. And I have gotten grace-period latencies under 100ns without driving the grace period from the update In principle I agree. In practice, this is an infrequently executed slow path, right? Or are you concerned about real-time latencies while loading new There of course is a point beyond which this method is slower than a full RCU grace period. But I bet that Dave's 256-way machine is not anywhere near big enough to reach that point. Maybe he can This is a good point. I understand the need to acquire the locks, but am not fully clear on why we cannot acquire one CPU's lock, gather its counters, release that lock, acquire the next CPU's lock, and so on. Maybe a code-complexity issue? Please keep in mind that we are trying to hit 2.6.30 with this fix, so simplicity is even more important than it usually be. Yes, I have some We are making it faster than it used to be by quite a bit by getting rid of the global lock, so this does sound like a good approach. Here is my reasoning: 1. The update-side performance is good, as verified by Jeff Chua. 2. The per-packet read-side performance is slowed by roughly the overhead of an uncontended lock, which comes to about 60ns on my laptop. At some point, this 60ns will become critical, but I do not believe that we are there yet. When it does become critical, a new patch can be produced. Such a patch can of course be backported ...
Hi. Shouldn't it be rcu_fgp_active_readers - RCU_FGP_COUNT? Although it is 1 by definition, it is more clear when understanding what's going on here. -- Evgeniy Polyakov --
Excellent point, fixed! Thanx, Paul --
This version of x_tables (ip/ip6/arp) locking uses a per-cpu
recursive lock that can be nested. It is sort of like existing kernel_lock,
rwlock_t and even old 2.4 brlock.
"Reader" is ip/arp/ip6 tables rule processing which runs per-cpu.
It needs to ensure that the rules are not being changed while packet
is being processed.
"Writer" is used in two cases: first is replacing rules in which case
all packets in flight have to be processed before rules are swapped,
then counters are read from the old (stale) info. Second case is where
counters need to be read on the fly, in this case all CPU's are blocked
from further rule processing until values are aggregated.
The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters. This reduces the contention of a
single reader lock (in 2.6.29) without the delay of synchronize_net()
(in 2.6.30-rc2).
The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com
---
Changes from earlier patches.
- function name changes
- disable bottom half in info_rdlock
These should still be addressed but beyond the scope of the problem
- lockdep mapping; really a tradeoff between LOCKDEP special clutter
and clarity
- Figure out how to stop sparse warning
- hot plug CPU case, if kernel is built with large # of CPU's, skip
the inactive ones; migrate values when CPU is removed.
include/linux/netfilter/x_tables.h | 10 +--
net/ipv4/netfilter/arp_tables.c | 110 +++++++---------------------------
net/ipv4/netfilter/ip_tables.c | 110 +++++++---------------------------
net/ipv6/netfilter/ip6_tables.c | 108 +++++++---------------------------
net/netfilter/x_tables.c | 117 ++++++++++++++++++++++++++++++-------
5 files changed, 174 insertions(+), 281 deletions(-)
--- ...This looks good to me! But I have to ask the stupid question... Would it be possible for the "write" side to acquire the locks one at a time, accumulating and resetting the counts for that CPU, then advancing to the next CPU? If this is possible, then lockdep is much happier and the disruption to readers is much less. --
OK, but we still have a problem on machines with >= 250 cpus,
because calling 250 times spin_lock() is going to overflow preempt_count,
as each spin_lock() increases preempt_count by one.
PREEMPT_MASK: 0x000000ff
add_preempt_count() should warn us about this overflow if CONFIG_DEBUG_PREEMPT is set
#ifdef CONFIG_DEBUG_PREEMPT
/*
* Spinlock count overflowing soon?
*/
DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
PREEMPT_MASK - 10);
#endif
My suggestion (in a previous mail) was to call preempt_disable() after each spin_lock(),
--
On Mon, 20 Apr 2009 20:25:14 +0200 Ok, not that I have one of those. The problem which lockdep has is that it seems to associate all the locks with same name. --
On Mon, 20 Apr 2009 20:25:14 +0200 Wouldn't 256 or higher CPU system be faster without preempt? If there are that many CPU's, it is faster to do the work on other cpu and avoid the overhead of a hotly updated preempt count. --
The preempt count is maintained per-CPU, so has low overhead. The problem is that for CONFIG_PREEMPT builds, the preempt disabing is built into spin_lock(). Thanx, Paul --
Huh? Each cpu has its own separate preempt_count. Paul. --
But a single CPU is acquiring one lock per CPU, so all the increments are to one CPU's preempt_count. :-( Thanx, Paul --
OK, I see, so a task can't take more than 255 spinlocks without overflowing the preempt count, which seems a bit limiting. There are 6 free bits in the preempt_count currently, so the preempt count could be expanded to 14 bits, which would be enough for all current systems. Beyond that I guess we could make preempt_count be a long and allow bigger counts on 64-bit architectures. Paul. --
This version of x_tables (ip/ip6/arp) locking uses a per-cpu
recursive lock that can be nested. It is sort of like existing kernel_lock,
rwlock_t and even old 2.4 brlock.
"Reader" is ip/arp/ip6 tables rule processing which runs per-cpu.
It needs to ensure that the rules are not being changed while packet
is being processed.
"Writer" is used in two cases: first is replacing rules in which case
all packets in flight have to be processed before rules are swapped,
then counters are read from the old (stale) info. Second case is where
counters need to be read on the fly, in this case all CPU's are blocked
from further rule processing until values are aggregated.
The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters. This reduces the contention of a
single reader lock (in 2.6.29) without the delay of synchronize_net()
(in 2.6.30-rc2).
The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com
---
CHANGES
- optimize for UP
- disable bottom half in info_rdlock
- prevent preempt count overflow
- turn off lockdep in writer to avoid bogus warning
- optimize unlock_bh
TODO
- Figure out how to stop sparse warnings
- hot plug CPU case, if kernel is built with large # of CPU's, skip
the inactive ones; migrate values when CPU is removed.
include/linux/netfilter/x_tables.h | 19 +++--
net/ipv4/netfilter/arp_tables.c | 110 ++++++-----------------------
net/ipv4/netfilter/ip_tables.c | 110 ++++++-----------------------
net/ipv6/netfilter/ip6_tables.c | 108 ++++++----------------------
net/netfilter/x_tables.c | 138 +++++++++++++++++++++++++++++++------
5 files changed, 204 insertions(+), 281 deletions(-)
--- a/include/linux/netfilter/x_tables.h 2009-04-20 13:57:28.281199339 -0700
+++ ...Maybe I missed something. I think softirq may be still enabled here.
Is this OK for you:
void xt_info_rdlock_bh(void)
{
struct xt_info_lock *lock;
local_bh_disable();
lock = &__get_cpu_var(xt_info_locks);
if (likely(++lock->depth == 0))
spin_lock(&lock->lock);
}
Lai.
--
well, first time its called, you are right softirqs are enabled until After this line, both softirqs and preempt are disabled. Future calls to this function temporarly raise preemptcount and decrease it. well, Stephen was trying to not change preempt count for the 2nd, 3rd, 4th?... invocation of this function. Thanks for reviewing Lai --
On Tue, 21 Apr 2009 05:56:55 +0200 In this version, I was trying to use/preserve the optimizations that are done in spin_unlock_bh(). --
xt_info_rdlock_bh() called recursively here will enter the
critical region without &__get_cpu_var(xt_info_locks)->lock.
Because xt_info_rdlock_bh() called recursively here sees
Sorry for it.
Is this OK:
void xt_info_rdlock_bh(void)
{
struct xt_info_lock *lock;
local_bh_disable();
lock = &__get_cpu_var(xt_info_locks);
if (likely(++lock->depth == 0))
spin_lock(&lock->lock);
else
local_bh_enable();
}
I did not think things carefully enough, and I do know
nothing about ip/ip6/arp.
Lai
--
On Tue, 21 Apr 2009 13:22:27 +0800
NO spin_lock_bh always does a preempt_disable
xt_info_rdlock_bh (depth = -1)
+1 preempt_disable
spin_lock_bh
+1 preempt_disable
-1 preempt_enable_no_resched
---
+1
Second call preempt_count=1 (depth = 0)
xt_info_rdlock_bh
+1 preempt_disable
-1 preempt_enable_no_resched
---
Result is preempt_count=1 (depth = 1)
Now lets do unlocks
xt_info_rdunlock_bh preempt_count=1 depth=1
does nothing
xt_info_rdunlock_bh preempt_count=1 depth = 0
-1 spin_unlock_bh
Resulting preempt_count=0 depth = -1
--
I think I had made you mistook my email.
The preempt_count is correct, but I thought about lock-save:
we must hold &__get_cpu_var(xt_info_locks)->lock
when we enter the read-side critical region.
--------------------
xt_info_rdlock_bh() (depth = -1)
preempt_disable()
depth++
==========>interrupt here
==========>
==========>xt_info_rdlock_bh() (depth = 0)
==========> preempt_disable()
==========> depth++
==========> preempt_enable_no_resched()
==========>
==========>enter the read-side critical region *without* lock.
==========> it may get trashy data.
==========>
==========>xt_info_rdunlock_bh()
==========>
==========>interrupt return.
spin_lock_bh()
preempt_enable_no_resched()
enter the read-side critical region *with* lock.
xt_info_rdunlock_bh().
----------------------
----------
Is this OK? (Now I suppose we can enter the read-side critical region
in irq context)
void xt_info_rdlock_bh(void)
{
unsigned long flags;
struct xt_info_lock *lock;
local_irq_save(flags);
lock = &__get_cpu_var(xt_info_locks);
if (likely(++lock->depth == 0))
spin_lock_bh(&lock->lock);
local_irq_restore(flags);
}
Lai
--
Hi. Netfilter as long as other generic network pathes are never accessed from interrupt context, but your analysis looks right for the softirq case. Stephen, should preempt_disable() be replaced with local_bh_disable() to prevent softirq to race on the same cpu for the lock's depth field? Or can it be made atomic? -- Evgeniy Polyakov --
A question: softirq is always not nesting. Why we need recursive lock? Lai. --
From: Lai Jiangshan <laijs@cn.fujitsu.com> Netfilter itself, is nesting. When using bridging netfilter, iptables can be entered twice in the same call chain. --
Maybe just dont care about calling several time local_bh_disable()
(since we were doing this in previous kernels anyway, we used to call read_lock_bh())
This shortens fastpath, is faster than local_irq_save()/local_irq_restore(),
and looks better.
void xt_info_rdlock_bh(void)
{
struct xt_info_lock *lock;
local_bh_disable();
lock = &__get_cpu_var(xt_info_locks);
if (likely(++lock->depth == 0))
spin_lock(&lock->lock);
}
void xt_info_rdunlock_bh(void)
{
struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks);
BUG_ON(lock->depth < 0);
if (likely(--lock->depth < 0))
spin_unlock(&lock->lock);
local_bh_enable();
}
--
Yeah, given that non-nested locking is more likely condition, it will be -- Evgeniy Polyakov --
David said:
Netfilter itself, is nesting.
When using bridging netfilter, iptables can be entered twice
in the same call chain.
And Stephen said:
In this version, I was trying to use/preserve the optimizations that
are done in spin_unlock_bh().
So:
void xt_info_rdlock_bh(void)
{
struct xt_info_lock *lock;
preempt_disable();
lock = &__get_cpu_var(xt_info_locks);
if (likely(lock->depth < 0))
spin_lock_bh(&lock->lock);
/* softirq is disabled now */
++lock->depth;
preempt_enable_no_resched();
}
xt_info_rdunlock_bh() is the same as v11.
--
Which context can enter the critical region? Can irq and softirq? or softirq only? --
I reviewed this patch believe its in quite good shape, thanks Stephen. Then I tested it on a x86_32 8 cpus machine and got no obvious problem. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Hopefully, next rcu_bh (or whatever name is used) will permit us to switch back to pure RCU in 2.6.31 oprofile snapshot of a tbench session, with light iptables rules. (4 rules in INPUT chain, 3 rules on OUTPUT) xt_info_rdlock_bh() uses 0.6786 % of cpu xt_info_rdunlock_bh() uses 0.1743 % of cpu CPU: Core 2, speed 3000.77 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % symbol name 1248350 1248350 11.3285 11.3285 copy_from_user 534049 1782399 4.8464 16.1749 copy_to_user 480898 2263297 4.3641 20.5390 __schedule 325581 2588878 2.9546 23.4936 ipt_do_table 312697 2901575 2.8377 26.3312 tcp_ack 309381 3210956 2.8076 29.1388 tcp_sendmsg 248238 3459194 2.2527 31.3915 tcp_v4_rcv 230405 3689599 2.0909 33.4824 tcp_transmit_skb 220638 3910237 2.0022 35.4847 ip_queue_xmit 217099 4127336 1.9701 37.4548 tcp_recvmsg 175885 4303221 1.5961 39.0509 tcp_rcv_established 173112 4476333 1.5710 40.6219 __switch_to 165138 4641471 1.4986 42.1205 sysenter_past_esp 149367 4790838 1.3555 43.4759 dst_release 138619 4929457 1.2579 44.7339 sched_clock_cpu 132724 5062181 1.2044 45.9383 lock_sock_nested 121353 5183534 1.1013 47.0396 nf_iterate 119205 5302739 1.0818 48.1214 netif_receive_skb 118859 5421598 1.0786 49.2000 release_sock 112597 5534195 1.0218 50.2218 __inet_lookup_established 112195 5646390 1.0181 51.2399 sys_socketcall 110018 5756408 0.9984 52.2383 tcp_write_xmit 106466 5862874 ...
On Tue, Apr 21, 2009 at 06:59:29AM +0200, Eric Dumazet wrote: My excursion in to rcu_fgp can be thought of as a training exercise for doing this, though perhaps others are also working on an upgraded rcu_bh as well. Thanx, Paul --
Ok, so others already pointed out how dangerous/buggy this looks, but I'd like to strengthen that a bit more: This function is FUCKED UP. It's total crap for several reasons" - the already-mentioned race with bottom half locking. If bottom halfs aren't already disabled, then if a bottom half comes in after the "++lock->depth" and before the spin_lock_bh(), then you will have no locking AT ALL for the processing of that bottom half - it will just increment the lock depth again, and nobody will have locked anything at all. And if for some reason, you can prove that bottom half processing is already disabled, then ALL THAT OTHER CRAP is just that - CRAP. The whole preemption disabling, the whole "_bh()" thing, everything. So either it's horribly buggy, or it's horribly broken and pointless. Take your pick. - the total lack of comments. Why does that "count" protect anything? It's not a recursive lock, since there is no ownership (two independent accessors could call this and both "get" the lock), so you had damn well better create some big-ass comments about why it's ok in this case, and what the rules are that make it ok. So DON'T GO AROUND CALLING IT A RECURSIVE LOCK! Don't write comments that are TOTAL AND UTTER SH*T! Just DON'T! It's a "reader lock". It's not "recursive". It never was recursive, it never will be, and calling it that is just a sign that whoever wrote the function is a moron and doesn't know what he is doing. STOP DOING THIS! - that _idiotic_ "preempt_enable_no_resched()". F*ck me, but the comment already says that preemption is disabled when exiting, so why does it even bother to enable it? Why play those mindless games with preemption counts, knowing that they are bogus? Do it readably. Disable preemption first, and just re-enable it at UNLOCK time. No odd pseudo-reenables anywhere. Oh, I know very well that the spli-locking will disable ...
On Tue, 21 Apr 2009 09:13:52 -0700 (PDT) Ah a nice day, with Linus giving constructive feedback. Too bad he has to channel it out of the dark side. --
I already flamed that patch at least once before. People didn't react. What do I have to do to make people listen? I'm sorry, but I'm not going to send you flowers with a card saying "Hope you do better next time!". I realize that the flowers might be friendlier, but I have absolutely no incentive to be friendly to just bad code. I have even _less_ incentive when my first "that sucks" is apparently totally ignored. So now I spelled it out why it sucks, but I sure as hell didn't have any reason to be polite about it. Linus --
i think there might be (are?) uses of: spin_lock_bh(&some->lock); ... spin_unlock(&some->lock); ... local_bh_enable(); So we have to have two preemption control levels for that, as there's no knowledge at the spin_lock_bh() place whether it will be followed by a spin_unlock_bh() [in which case it would be safe to do SOFTIRQ_OFFSET only] - or by a spin_unlock() + local_bh_enable() pair.. [ That locking pattrn even makes a certain amount of sense: keep the lock held for a short amount of time - then weaken locking to bh context exclusion only. ] What we could do is an optimization to do a compound increase the preempt count by SOFTIRQ_OFFSET+1 - instead of a local_bh_disable() + preempt_disable()? Symmetrically we could do a compound decrease in the unlock case. It might even be called: local_bh_preempt_disable() or so? Ingo --
Subject: netfilter: use per-cpu recursive lock (v12) This version of x_tables (ip/ip6/arp) locking uses a per-cpu recursive lock that can be nested. The idea for this came from an earlier version done by Eric Dumazet. Locking is done per-cpu, the fast path locks on the current cpu and updates counters. This reduces the contention of a single reader lock (in 2.6.29) without the delay of synchronize_net() (in 2.6.30-rc2). The mutex that was added for 2.6.30 in xt_table is unnecessary since there already is a mutex for xt[af].mutex that is held. I don't like the NR_CPUS preempt count kludge, it is working around an issue that should be handled in a more generic way. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com --- CHANGES - go back to read/write lock - reader case is now inline - more comments - always disables bottom half - only play preempt count mindgames if really necessary - add lockdep keys include/linux/netfilter/x_tables.h | 42 ++++++++++++-- net/ipv4/netfilter/arp_tables.c | 110 ++++++++----------------------------- net/ipv4/netfilter/ip_tables.c | 110 +++++++------------------------------ net/ipv6/netfilter/ip6_tables.c | 108 ++++++++---------------------------- net/netfilter/x_tables.c | 77 ++++++++++++++++++------- 5 files changed, 165 insertions(+), 282 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-04-21 07:57:06.668582345 -0700 +++ b/include/linux/netfilter/x_tables.h 2009-04-21 11:06:10.381487605 -0700 @@ -354,9 +354,6 @@ struct xt_table /* What hooks you will enter on */ unsigned int valid_hooks; - /* Lock for the curtain */ - struct mutex lock; - /* Man behind the curtain... */ struct xt_table_info *private; @@ -434,8 +431,43 @@ extern void xt_proto_fini(struct net *ne extern struct xt_table_info *xt_alloc_table_info(unsigned int size); extern void xt_free_table_info(struct xt_table_info *info); -extern void xt_table_entry_swap_rcu(struct ...
hm, this is rather ugly and it will make a lot of instrumentation
code explode.
Why not use the obvious solution: a _single_ wrlock for global
access and read_can_lock() plus per cpu locks in the fastpath?
That way there's no global cacheline bouncing (just the _reading_ of
a global cacheline - which will be nicely localized - on NUMA too) -
and we will hold at most 1-2 locks at once!
Something like:
__cacheline_aligned DEFINE_RWLOCK(global_wrlock);
DEFINE_PER_CPU(rwlock_t local_lock);
void local_read_lock(void)
{
again:
read_lock(&per_cpu(local_lock, this_cpu));
if (unlikely(!read_can_lock(&global_wrlock))) {
read_unlock(&per_cpu(local_lock, this_cpu));
/*
* Just wait for any global write activity:
*/
read_unlock_wait(&global_wrlock);
goto again;
}
}
void global_write_lock(void)
{
write_lock(&global_wrlock);
for_each_possible_cpu(i)
write_unlock_wait(&per_cpu(local_lock, i));
}
Note how nesting friendly this construct is: we dont actually _hold_
NR_CPUS locks all at once, we simply cycle through all CPUs and make
sure they have our attention.
No preempt overflow. No lockdep explosion. A very fast and scalable
read path.
Okay - we need to implement read_unlock_wait() and
write_unlock_wait() which is similar to spin_unlock_wait(). The
trivial first-approximation is:
read_unlock_wait(x)
{
read_lock(x);
read_unlock(x);
}
write_unlock_wait(x)
{
write_lock(x);
write_unlock(x);
}
Hm?
Ingo
--
Obvious is not the qualifier I would use :)
Hmm... here we can see global_wrlock locked by on writer, while
this cpu already called local_read_lock(), and calls again this
Very interesting and could be changed to use spinlock + depth per cpu.
-> we can detect recursion and avoid the deadlock, and we only use one
atomic operation per lock/unlock pair in fastpath (this was the reason we
tried hard to use a percpu spinlock during this thread)
__cacheline_aligned DEFINE_RWLOCK(global_wrlock);
struct ingo_local_lock {
spinlock_t lock;
int depth;
};
DEFINE_PER_CPU(struct ingo_local_lock local_lock);
void local_read_lock(void)
{
struct ingo_local_lock *lck;
local_bh_and_preempt_disable();
lck = &get_cpu_var(local_lock);
if (++lck->depth > 0) /* already locked */
return;
again:
spin_lock(&lck->lock);
if (unlikely(!read_can_lock(&global_wrlock))) {
spin_unlock(&lck->lock);
/*
* Just wait for any global write activity:
*/
read_unlock_wait(&global_wrlock);
goto again;
}
}
void global_write_lock(void)
{
write_lock(&global_wrlock);
for_each_possible_cpu(i)
spin_unlock_wait(&per_cpu(local_lock, i));
}
Hmm ?
--
Yeah, indeed.
I wasnt really concentrating on the nested case, i was concentrating
on the scalability and lock nesting angle. I think the code
submitted here looks rather incestous in that regard.
Allowing nested locking _on the same CPU_ is asking for trouble. Use
short critical sections and if there's any exclusion needed, use an
Yeah, this looks IRQ-nesting safe. But i really have to ask: why
does this code try so hard to allow same-CPU nesting?
Nesting on the same CPU is _bad_ for several reasons:
1) Performance: it rips apart critical sections cache-wise. Instead
of a nice:
C1 ... C2 ... C3 ... C4
serial sequence of critical sections, we get:
C1A ... ( C2 ) ... C1B ... C3 ... C4
Note that C1 got "ripped apart" into C1A and C1B with C2 injected
- reducing cache locality between C1A and C1B. We have to execute
C1B no matter what, so we didnt actually win anything in terms of
total work to do, by processing C2 out of order.
[ Preemption of work (which this kind of nesting is really about)
is _the anti-thesis of performance_, and we try to delay it as
much as possible and we try to batch up as much as possible.
For example the CPU scheduler will try _real_ hard to not
preempt a typical workload, as long as external latency
boundaries allow that. ]
2) Locking complexity and robustness. Nested locking is rank #1 in
terms of introducing regressions into the kernel.
3) Instrumentation/checking complexity. Checking locking
dependencies is good and catches a boatload of bugs before they
hit upstream, and nested locks are supported but cause an
exponential explosion in terms of dependencies to check.
Also, whenever instrumentation explodes is typically the sign of
some true, physical complexity that has been introduced into the
code. So it often is a canary for a misdesign at a fundamental
level, not a failure in the instrumentation framework.
In ...The netfilter case is real simple Ingo, (note I did not use "obvious" here ;) )
netfilter in 2.6.2[0-9] used :
CPU 1
sofirq handles one packet from a NIC
ipt_do_table() /* to handle INPUT table for example, or FORWARD */
read_lock_bh(&a_central_and_damn_rwlock)
... parse rules
-> calling some netfilter sub-function
re-entering network IP stack to send some packet (say a RST packet)
...
ipt_do_table() /* to handle OUPUT table rules for example */
read_lock_bh() ; /* WE RECURSE here, but once in a while (if ever) */
This is one of the case, but other can happens with virtual networks, tunnels, ...
and so on. (Stephen had some cases with KVM if I remember well)
If this could be done without recursion, I am pretty sure netfilter
and network guys would have done it. I found Linus reaction quite
shocking IMHO, considering hard work done by all people on this.
I was pleased by your locking schem, that was *very* interesting, even
if not yet ready.
1) We can discuss of how bad recursion is.
We know loopback_xmit() could be way faster if we could avoid queeing packet
to softirq handler.
(Remember you and I suggested this patch some months ago ? Please remember
David rejected this because of recursion and possibility to overflow stack)
So yes, people are aware of recursion problems.
2) We can discuss how bad rwlocks are
We did lot of work on last months to delete some of rwlocks we had in kernel.
UDP stack for example dont use them anymore.
We tried to delete them on x_tables, but we must take care of ipt_do_table() nesting,
that is legal on 2.6.30.
Maybe netfilter guys can work to avoid this nesting on 2.6.31,
I dont know how hard it is, definitly not 2.6.30 material.
Solutions were discussed many times and Stephen provided 13 versions of the patch.
If this was that obvious, one or two iterations would have been OK.
3) About last patch (v13)
Stephen did not agreed with you (well... maybe after all ..),
he only ...On 22-04-2009 10:53, Eric Dumazet wrote: Right, looks like 100% (Scandinavian?!) troll. I wonder where is the admin of this mess... Jarek P. --
You might disagree with me on an honest technical basis, but note that this kind of childish, knee-jerk reaction against dissenting people is what gives netdev its bad reputation. Ingo --
How did you find I disagree with you? There was nothing about you, neither anything technical. Actually, I forgot to mention I agree with Eric about your locking proposal. Jarek P. --
Again ... i dont know this code well, but you yourself describe it
as "a_central_and_damn_rwlock" above.
"Central and damn" locks of any type, anywhere tend to cause such
trouble.
Often they are mini-BKL's in the making. Here's some historic
patterns:
1- First there's a convenient lock around a popular and useful
piece of data and code.
2- As popularity (and reach of code) grows, and some non-trivial
interaction ensues, a little bit of self-recursion is added to
the mix (this is often easier to do than to fix the root cause
of the problem) - making critical sections even easier to grow
in size.
3- Then attempts are made to make it scale all (it's a popular
piece of code), it's extended along a per-cpu axis, but
complexity of locking explode by taking a ton of locks all at
once. The solution: yell at the lock validator for not allowing
this (or for exploding due to the sheer mathematically
large complexity of the locking rules). Frequent requests for
an exclusion from those pesky validations are made, and various
hacks are done to work it around. It's all the fault of the lock
validator, of course.
4- At this point it's rarely a clean, tidy data lock anymore - it
tends to grow into a code lock nobody really knows how it works,
except that it better be taken more often than not, badness may
ensue otherwise. Nobody really knows when to take it, only that
it should be taken widely enough, that it should be recursive
enough to call even _more_ code from under it. Efforts to reduce
critical section length are rebuffed with: "this adds unlocking
and relocking overhead". And it should then all scale as well.
5- In the end it's a lock everyone curses but nobody is able to fix
anymore. "If it was easy to fix we'd have fixed it long ago
already" kind of fatalist thinking becomes widespread.
We saw many examples of this in the past: beyond the BKL (which is a ...You don't _understand_ do you?
There is a huge difference between recursive code, and a recursive lock.
The netfilter code may need to occasionally re-enter itself. Nobody ever
contested _that_ part.
What I have disagreed with the whole time is
(a) doing local ad-hoc locking primitives without any comments
what-so-ever.
(b) Doing them _wrong_ in many cases
(c) Calling the _lock_ a "recursive" lock.
The fact that a lock works with recursion doesn't make it "recursive".
That generally has a very special meaning for locking primitives, and
means something else.
In contrast, a read-write lock actually has known properties, and we have
existing locking mechanisms for those. And we call them read-write locks
DESPITE THE FACT that the reading part can be done recursively.
If you call a read-write lock a "recursive" lock, then you're a moron.
It simply is _not_ a recursive lock. And neither is the lock you actually
implemented, even though you (and Stephen) continually call it that.
SO STOP CALLING IT A RECURSIVE LOCK. Look at your very own code: you can
actually only use that lock in a recursive context in a _very_ specific
place. Notice how it's only "recursive" when taken in the per-CPU context,
but _not_ recursive when the filter-updating code ("writer") takes it?
Do you understand now? It really shouldn't be so hard for you.
Naming is important. Locking is important. You did both things wrong. You
named your locks something incorrect and mis-leading that didn't actually
describe them, and you did your own private locking code without then
documenting what the rules for this special lock were.
Maybe in your world that's ok. But no, in mine it's not. I've seen too
many damn _non-functioning_ locks to ever want to see stuff like that
again.
Linus
--
Linus, I actually sent *one* buggy patch, and you already gave your feedback and NACK. Fine I even relayed this to Stephen suggesting him not calling this a recursive lock. (Note how I use 'suggesting' here) So, what do you want from me ? Should I copy 100 times : "I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." "I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." "I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." "I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." "I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." "I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." ... OK done Can we now proceed and continue ? Thank you --
Actually, the thing is, I don't even think your original patch was even buggy. The bug crept in later. I NAK'ed it not because it was buggy, but because of the ad-hoc'ness and the naming. Really. And I actually even said so in my original rant: 'The fact that code "happens to work by mistake" (and I'm not saying that your does - but it might just because of the per-cpu'ness of it [..]' because your original patch still had the rcu_read_lock_bh(); in place before the whole rl = &__get_cpu_var(arp_tables_lock); if (likely(rl->count++ == 0)) spin_lock(&rl->lock); and that should have protected against both BH callers and preemption. So I actually believe that your original patch probably worked fine (but as I said in my reaction to it, I thought it was almost by mistake and I wasn't going to review it) So the actual _bug_ crept in later, when the RCU lock was removed, and the lock was cleaned up and separated into a function of its own. And in fact, that is kind of my point: "uncommented locking with ad-hoc semantics is very fragile". Even _correct_ code ends up not being correct So I consider this thread ended from a technical standpoint. [ That said, I will not be at all shocked to hear if people decide later that the RCU method was better after all, and that even the per-cpu rwlock or spinlock is just too expensive. ] My problem today (apart from the relatively minor issue of also wanting to get the commit log fixed up) is just that I see emails from you finding my reaction shocking and from Jarek Poplawski that seem to still think that I'm a troll. Just because I pointed out real technical problems? Is that shocking or trolling? Really - please go back to my _original_ email. No, it was not polite. But here's another quote from it: "Because even if it works today, it's just a bug waiting to happen." and dammit, I sent that out _before_ the very next version of the patch that actually _did_ introduce that ...
Linus Torvalds wrote, On 04/22/2009 07:18 PM: Very funny! For newbies only: http://en.wikipedia.org/wiki/Troll_(Internet) Actually, I admit you could be the best "Ozzy" Osbourne imposter instead. Jarek P. --
Btw., i didnt find Linus's reaction shocking at all, nor did i understand it as any criticism of prior (and future) good work of the people involved. I found it to be what it was: a forceful (because repeated) criticism of a bad patch. Ingo --
On Tue, 21 Apr 2009 21:10:07 +0200
Better general solutions:
* use raw_spin_lock
* increase PREEMPT_BITS on 64 bit machine
and limit to 128 CPU or less on 32 bit
* get rid of default preempt_disable in spin_lock
Quit trying to be so damn f*cking cool. We don't build the software
for locking instrumentation. Locking instrumentation needs to serve
the kernel, not the other way around.
Your version fails for the case of nested local rules. Which was
the whole reason I switched to read/writer locks.
CPU 1 CPU 2
local_read_lock
... global_write_lock
local_read_lock
Stuck...
CPU 2 is waiting for CPU 1 to get out of its nested table processing
CPU 1 is waiting for write lock to get done
--
You make it quite hard to give reasonable feedback to your code :-/
First you attack me personally here, then - 30 minutes later - in
the next iteration of your patch, you do:
+ /* wait for each other cpu to see new table */
+ for_each_possible_cpu(i)
+ if (i != smp_processor_id()) {
+ xt_info_wrlock(i);
+ xt_info_wrunlock(i);
+ }
... which i have not seen in your previous patch and which looks
awfully similar to the write_lock_wait() based primitive i
suggested.
( Just open-coded in an ugly fashion and slower than a real, proper
wait-unlock would be, because it dirties all those locks
needlessly. )
So you must have agreed with me to a certain degree - i just dont
see that in any of the discussion. (you seem to totally disagree
with me to the level of ridiculing me.) Which makes it hard to
Yes, as Eric pointed it out, more than an hour before your reply. I
find the nesting uninteresting (in fact i find it harmful - see my
reply to Eric). If you were only interested in nesting then a plain
old-fashioned rwlock would have done the job.
The detail that is interesting here is how to avoid the global
rwlock cacheline bounce - not the recursion. (the same-CPU recursion
is avoidable via proper design or workaround-able via a counter in
so many ways)
Anyway, i'm back into lurker mode.
Ingo
--
In the global/local lock scheme i proposed this would become:
global_write_unlock(void)
{
write_unlock(&global_lock);
}
As we dont hold the local locks during the write-locked critical
section. No loop needed over CPUs, no preempt nesting complications,
no lockdep complications, etc.
Ingo
--
This version of x_tables (ip/ip6/arp) locking uses a per-cpu recursive lock that can be nested. The idea for this came from an earlier version done by Eric Dumazet. Locking is done per-cpu, the fast path locks on the current cpu and updates counters. This reduces the contention of a single reader lock (in 2.6.29) without the delay of synchronize_net() (in 2.6.30-rc2). The mutex that was added for 2.6.30 in xt_table is unnecessary since there already is a mutex for xt[af].mutex that is held. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com --- CHANGES - reader and write now inline - only acquire one cpu write lock at a time - write lock pushed down into get_counters include/linux/netfilter/x_tables.h | 50 +++++++++++++-- net/ipv4/netfilter/arp_tables.c | 121 ++++++++++--------------------------- net/ipv4/netfilter/ip_tables.c | 120 +++++++++--------------------------- net/ipv6/netfilter/ip6_tables.c | 120 ++++++++++-------------------------- net/netfilter/x_tables.c | 55 +++++++++------- 5 files changed, 174 insertions(+), 292 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-04-21 07:57:06.668582345 -0700 +++ b/include/linux/netfilter/x_tables.h 2009-04-21 14:24:03.295299154 -0700 @@ -354,9 +354,6 @@ struct xt_table /* What hooks you will enter on */ unsigned int valid_hooks; - /* Lock for the curtain */ - struct mutex lock; - /* Man behind the curtain... */ struct xt_table_info *private; @@ -434,8 +431,51 @@ extern void xt_proto_fini(struct net *ne extern struct xt_table_info *xt_alloc_table_info(unsigned int size); extern void xt_free_table_info(struct xt_table_info *info); -extern void xt_table_entry_swap_rcu(struct xt_table_info *old, - struct xt_table_info *new); + + +/* + * Per-CPU read/write lock. This makes reader locking fast since + * there is no shared variable to cause cache ping-pong; but adds an + * additional write-side penalty since write must iterate over ...
Looks good from a concurrency viewpoint! This is very good -- gets rid of problems with preemption nesting depth limitations and lockdep limitations. In addition, it gets rid of the period of time during which all packet processing might otherwise be blocked. Not a big deal on a small machine, but could be a real problem on a large one. Good, you do indeed need to prevent migration before you acquire this CPU's lock. Otherwise, you could have more than one CPU attempting to Good. Of course, the reason we care about preemption here is that otherwise some other task could mess up this CPU's counters. Bad for real-time response, but then again, what the heck are you doing updating netfilter rules while a system is running a real-time Hmmm... Not sure I want to know what module uses this. Good, disabling preemption prevents other readers from messing with this --
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Tested today on various machines and no problem so far
tbench/oprofile results, 3.7236% cpu spent in ipt_do_table, and 0.84% used
on read_lock/read_unlock
c04a5c30 <ipt_do_table>: /* ipt_do_table total: 217134 3.7236 */
...
349 0.0060 :c04a5ccc: call c04ce380 <_read_lock>
23914 0.4101 :c04a5cd1: mov 0xc(%edi),%eax
...
:c04a5ecb: call c04ce5f0 <_read_unlock_bh>
25279 0.4335 :c04a5ed0: cmpb $0x0,-0xd(%ebp)
"iptables -L" fetches its data very fast too.
150 us on a 8 cpus machine, small firewall rules.
400-700 us on same machine, 1000 fw rules set (160000 bytes per cpu)
depending on network trafic (from light to flood)
--
Ack on the code. But the comment is _still_ crap. Please update. It's not a recursive lock, as clearly shown by the code itself. It's a per-cpu read-write lock, and only the reader is "recursive" (but that's how read-write locks with in Linux, and that has nothing to do with anything). So make the explanations match the code and the intent. Write it something like This version of x_tables (ip/ip6/arp) locking uses a per-cpu reader-writer lock lock where the readers can nest. and don't confuse it with incorrect commit messages. The lock is very much not recursive - on purpose - for half the people taking it. [ That, btw, was always true, even in the original random open-coded version. Because you can't actually do a real recursive lock without having notion of "current ownership" either by making the count be <per-thread,per-lock> - like the BKL - or by saving the ownership information in the lock. A plain counter simply doesn't do it. ] Linus --
In days of old in 2.6.29, netfilter did locketh using a
lock of the reader kind when doing its table business, and do
a writer when with pen in hand like a overworked accountant
did replace the tables. This sucketh and caused the single
lock to fly back and forth like a poor errant boy.
But then netfilter was blessed with RCU and the performance
was divine, but alas there were those that suffered for
trying to replace their many rules one at a time.
So now RCU must be vanquished from the scene, and better
chastity belts be placed upon this valuable asset most dear.
The locks that were but one are now replaced by one per suitor.
The repair was made after much discussion involving
Eric the wise, and Linus the foul. With flowers springing
up amid the thorns some peace has finally prevailed and
all is soothed. This patch and purple prose was penned by
in honor of "Talk like Shakespeare" day.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
What hath changed over the last two setting suns:
* more words, mostly correct...
* no need to locketh for writeh on current cpu tis
always so
* the locking of all cpu's on replace is always done as
part of the get_counters cycle, so the sychronize swip
in replace tables is gone with only a comment remaing
include/linux/netfilter/x_tables.h | 55 ++++++++++++++--
net/ipv4/netfilter/arp_tables.c | 125 ++++++++++--------------------------
net/ipv4/netfilter/ip_tables.c | 126 ++++++++++---------------------------
net/ipv6/netfilter/ip6_tables.c | 123 ++++++++++--------------------------
net/netfilter/x_tables.c | 55 ++++++++--------
5 files changed, 188 insertions(+), 296 deletions(-)
--- a/include/linux/netfilter/x_tables.h 2009-04-23 19:59:36.076558199 -0700
+++ b/include/linux/netfilter/x_tables.h 2009-04-23 20:22:06.566001575 -0700
@@ -354,9 +354,6 @@ struct xt_table
/* What hooks you will enter on */
unsigned int valid_hooks;
- /* Lock for the curtain ...Philip Davis of the university’s School of English said :
"Shakespeare surprises the brain and catches it off guard in
a manner that produces a sudden burst of activity - a sense
of drama created out of the simplest of things."
Did you tried :
static DECLARE_PER_CPU(struct lock_class_key, xt_locks_key);
static int __init xt_init(void)
{
unsigned int i;
int rv;
for_each_possible_cpu(i) {
rwlock_t *lock = &per_cpu(xt_info_locks, i);
rwlock_init(lock);
lockdep_set_class(lock, &per_cpu(&xt_locks_key, i));
}
...
Thanks
--
Either way is fine with me, I'll wait for Stephen to state his opinion. --
On Fri, 24 Apr 2009 06:58:39 +0200 The lock keys are really only used by lock dep, and I thought per cpu space was more scarce on some arch. --
Maybe I'm wrong but after this change: "- only acquire one cpu write lock at a time" lockdep_set_class() might be unnecessary. Alas I'm not able to test it. Jarek P. --
Epilogue due to master Jarek. Lockdep carest not about the locking
doth bestowed. Therefore no keys are needed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
net/netfilter/x_tables.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
--- a/net/netfilter/x_tables.c 2009-04-25 13:25:48.115026283 -0700
+++ b/net/netfilter/x_tables.c 2009-04-25 13:26:15.646215635 -0700
@@ -1145,14 +1145,9 @@ static int __init xt_init(void)
{
unsigned int i;
int rv;
- static struct lock_class_key xt_lock_key[NR_CPUS];
- for_each_possible_cpu(i) {
- rwlock_t *lock = &per_cpu(xt_info_locks, i);
-
- rwlock_init(lock);
- lockdep_set_class(lock, xt_lock_key+i);
- }
+ for_each_possible_cpu(i)
+ rwlock_init(&per_cpu(xt_info_locks, i));
xt = kmalloc(sizeof(struct xt_af) * NFPROTO_NUMPROTO, GFP_KERNEL);
if (!xt)
--
Very nice! I guess this Shakespeare guy Will sign off this all too. ;-) Thanks, --
So far, so good, should be ready for inclusion now, nobody complained :)
I include the final patch, merge of your last two patches.
David, could you please review it once again and apply it if it's OK ?
Thanks to all for your help and patience
[PATCH] netfilter: use per-CPU recursive lock {XV}
In days of old in 2.6.29, netfilter did locketh using a
lock of the reader kind when doing its table business, and do
a writer when with pen in hand like a overworked accountant
did replace the tables. This sucketh and caused the single
lock to fly back and forth like a poor errant boy.
But then netfilter was blessed with RCU and the performance
was divine, but alas there were those that suffered for
trying to replace their many rules one at a time.
So now RCU must be vanquished from the scene, and better
chastity belts be placed upon this valuable asset most dear.
The locks that were but one are now replaced by one per suitor.
The repair was made after much discussion involving
Eric the wise, and Linus the foul. With flowers springing
up amid the thorns some peace has finally prevailed and
all is soothed. This patch and purple prose was penned by
in honor of "Talk like Shakespeare" day.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
What hath changed over the last two setting suns:
* more words, mostly correct...
* no need to locketh for writeh on current cpu tis
always so
* the locking of all cpu's on replace is always done as
part of the get_counters cycle, so the sychronize swip
in replace tables is gone with only a comment remaing
* Epilogue due to master Jarek. Lockdep carest not about
the locking doth bestowed. Therefore no keys are needed.
include/linux/netfilter/x_tables.h | 55 ++++++++++-
net/ipv4/netfilter/arp_tables.c | 125 +++++++-------------------
net/ipv4/netfilter/ip_tables.c | 126 +++++++--------------------
...Hi Eric, Please... could you rename this patch according to Linus'comments ? Suitable name would probably be : Stating that the write lock must _always_ be taken with bh disabled Did you really need to move add_counter_to_entry and put_counters in this patch ? This also seems more like a cleanup to me, if it is even Whiteline..... -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
On Sun, 26 Apr 2009 14:56:46 -0400 But Linus is trying to delude himself. This usage is recursive even if he doesn't like the terminology. The same CPU has to be able to reacquire the read lock without deadlocking. If reader/writer locks were implemented in a pure writer gets priority method, then this code would break! So yes read locks can be used recursively now in Linux, but if the were implemented differently then this code would break. For example, the -rt kernel turns all read/write locks into mutexs, so the -rt kernel developers will have to address this. --
Hi Stephen, [I see that you have cutted my name proposal from the original email, which might make it difficult for others to follow. I will assume you did it by mistake.] (re-added) Reading Documentation/spinlocks.txt, which states the lock usage guidelines : "Note that you can be clever with read-write locks and interrupts. For example, if you know that the interrupt only ever gets a read-lock, then you can use a non-irq version of read locks everywhere - because they don't block on each other (and thus there is no dead-lock wrt interrupts. But when you do the write-lock, you have to use the irq-safe version." So it's assumed in the kernel-wide read lock usage that nested read locks are OK. I'm adding Thomas and Steven in CC, but I'm quite sure they must have dealt with nested read-lock transformation into mutexes by detecting nesting in some way in -rt. But I'll let them confirm this. So I don't see why you are dreaming about a different semantic than the one of the primitives you are using. I guess I'll leave the semantics to you. I just find it astonishing that you persist saying that everbody is wrong on a topic like semantic, which is in the end a mean to communicate ideas clearly within the overall community you disagree with. Good luck ! Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
A recursive lock has the property:
lock()
{
if (lock->owner == current) {
lock->depth++;
return;
}
/* regular lock stuff */
}
unlock()
{
if (!--lock->depth)
/* regular unlock */
}
non of the linux kernel locking primitives have this -- with the
possible exception of the cpu-hotplug lock.
What rwlock_t has, is reader bias to the point where you can utterly
starve writers, with the side effect that you can obtain multiple read
ownerships without causing a deadlock.
This is not what is called a recursive lock. A recursive lock would have
each owner only once, this rwlock_t thing is simply so unfair that it
can have unlimited owners, including multiple copies of the same one.
rwsem has fifo fairness, and therefore can deadlock in this scenario,
suppose thread A does a read, thread B tries a write and blocks, then
thread A recurses and tries to obtain another read ownership --
deadlock, as the FIFO fairness will demand the second read ownership
will wait on the pending writer, which will wait on the outstanding read
owner.
Now if rwsem were a fifo-fair recursive lock, the above would not
deadlock, since it would detect that the task already had (read)
ownership and simply increment the depth, instead of trying to acquire a
second ownership.
This is all common and well understood terminology, not something Linus
invented just to harras you with.
Generally speaking we do not condone recursive locking strategies -- and
afaik reiserfs (as per the BKL) and the network code (as per abusing
rwlock_t unfairness) are the only offenders.
Like Linus stated, recursive locking is generally poor taste and
indicates you basically gave up on trying to find a proper locking
scheme. We should very much work towards getting rid of these
abberations instead of adding new ones.
Linus is very much right on what he said, and you calling him delusional
only high-lights your ignorance on the issue.
[ PS. -rt implements rwlock_t as a proper recursive ...On Mon, 27 Apr 2009 19:44:57 +0200 Only on Linux, and only because you look at locking from the point of view of the magic variable "current" process In Documentation/ ? online ? Where is the definition? The only reference The people complaining about naming never seem to be the ones providing Name it "dog's breath locking" for all I care. I am not bothering with arguments over names; there is real work to do elsewhere. --
Sure, see: The thing is, while you now have named your locking primitive correctly, you are still abusing it by using it recursively. So it's not 'just about naming'. You should not use read-locks as recursive locks. It's poor code. Ingo --
On Mon, 27 Apr 2009 20:54:23 +0200 All those references support my argument that the lock is being used recursively in this case. It is being acquired multiple times by the same CPU. This is not new, it has always been acquired multiple times, so my change does not break anything. If other implementations of reader locks to not nest the same way, then on those system iptables can deadlock. Nothing was If you don't like the proposal please, think of a better alternative. Not just pseudo code that is broken with handwaving arguments. --
What's so hard between understanding the difference between "used recursively" and "recursive lock"? THEY MEAN TWO TOTALLY DIFFERENT THINGS! The fact that you don't seem to understand that is one of the things I've been complaining about all along. So here's a final set of clue-bat: Clue bat #1: - "can be used in a recursive context" is not the same as "is recursive". Analogy time: Look at "strtok_r()", and the difference between that and "strtok()". "strtok_r()" can be used in a threaded environment. Does that mean that 'strtok_r()' is threaded? No. When you call 'strtok_r()', it's still signle-threaded - it's just that it _can_ be called in a threaded environment and then still has sane semantics. Now, replace "threaded" with "recursive". Do you get it? Clue bat #2: - a lock that can count does not mean that it is recursive. Counting and recursion are TWO TOTALLY DIFFERENT ISSUES. The "countingness" means that there can be multiple users inside of it, and that, in turn, implies that maybe it can be used in a recursive context. But again, counting and recursion are not about the same thing at all. Put another way: a read-write lock is not a "recursive lock" despite the fact that you can recursively take the lock for reading. It just happens to count readers, and allow more than one (from _any_ context, not just from the "same context"). Clue bat #3: - A recursive lock is very much _about_ recursion. It is explicitly about the fact that the _same_ thread re-enters the lock, not about another thread being in the locked region at the same time. See the difference? Big, big difference. A recursive lock will lock out other thread contexts, even if it allows the current one to recurse. Notice how the _only_ thing a recursive lock allows is that recursive behavior, and nothing else. IOW, a "recursive lock" is explicitly designed for recursion. But that ...
Note to self: learn to count beyond three one of these days. Linus --
Hi. Just ot be sure readers will not lose the discssion topic: do you object against naming or realizaion? If its about the former, does 'dog's breath lock' proposed by Stephen fit? -- Evgeniy Polyakov --
I said _long_ ago that I thought the patch was fine. What I object to is people mis-representing the lock, and apparently having a really hard time admitting that having a good grounding in networking doesn't necessarily mean that you know everything about Is that just an attempt to avoid admitting that they were wrong about lock naming? And then trying to trivialize it by calling things by a _different_ wrong name, but silly enough that they hope people won't call them on it? Why not just use the correct name? I think it was Mathieu that just suggested: [PATCH] netfilter: use bh disabling with per-cpu read-write lock or just call it "netfilter: use per-CPU read-write lock". Why are people so against calling things by their correct names? Why do certain network people seem to force a stupid and incorrect description, when they have been told (a) that it's wrong and (b) why it's wrong several times? What's so hard with just doing the TechnicallyRightThing(tm) and describing it as such? And btw - don't get me wrong. There are _other_ ways to do that locking too. You don't have to use a rwlock. You can do it with explicit counting, the way Eric's original patch did. But it would be wrong to call that one "recursive lock" too. Or you _could_ use a recursive lock, but nobody has actually posted such patches. It would work. No question about it. And if it actually _were_ a recursive lock, I wouldn't have objected about the description saying it is (although I would probably have objected to it being unnecessarily complex, when a simpler rwlock or the explicit count thing would be sufficient). But since the current simple patch is using a rwlock, why not just say that? Why call it something incorrect ("recursive lock") or nonsensical ("dog's breath lock"). As I tried to explain with an analogy, networking people would (quite correctly) object to me calling a serial line an "ethernet cable". Why is it so hard for netfilter people to ...
On Mon, 27 Apr 2009 13:58:48 -0700 (PDT) The part that concerns me is that the reader lock used in a nested manner on same cpu may still be broken on -rt. Other than that it is just language lawyering; violent agreement that the lock gets used multiple times by same CPU. I never had the occasion to address this before (and avoided such usage), but this legacy code exists and needs to [PATCH] netfilter: Ceci n'est pas une serrure récurrente Because meaning comes from context, and my meaning comes from different --
I think that's a valid concern, and I don't actually object to not using a rwlock, but using the "explicit counting + spinlock" that we had at one point. It _might_ even be faster, since a spinlock can be faster than a rwlock, and the (rare) case where you recurse you can then avoid the atomic entirely. But EVEN IF YOU DO THAT, it's still wrong to call it a "recursive lock". Because it still wouldn't be one. That's kind of my point, and always has been. It was why I objected to the original patch description by Dumazet. It wasn't a recursive lock back then _either_. For all the reasons I tried to explain to you, and you seem to not care about. Btw, if you do use the "explicit count" case, then please please please make sure it's documented and bug-free. And dammit, correct documentation in this case very much talks about how it is _not_ a "recursive lock", but a spinlock that then has an explicit counter that avoids taking the lock entirely in one very specific path (that happens to be recursive). The thing is, using a rwlock kind of implicitly documents all of that, because you can tell somebody who doesn't even read the code that it's a "per-cpu rwlock", and they then know what the semantics are (given that they know the kernel semantics for locking in the first place). But once you start doing your own locking rules, you really have to explain why it works, and what it does. And you do have to be very If you don't care about the naming, then use the right one. And if you don't care about the naming, why do you then say I'm deluding myself, when I'm _correct_, and I _do_ happen to care about correct naming. Locking really is important. Code that gets locking wrong breaks in really subtle and nasty ways. And it sadly tends to "work" in testing, since the breakage cases require just the right timing. So locking should be robust and as "obviously correct" as possible. And naming really is important. Misnaming things makes people ...
So, just as an example, I would not object to the counter approach, as long as it really talks about the issues, and why it works (and as long that doesn't call the locks "recursive locks"). So if you are just nervous about rwlocks, then something like this might work (needs testing, and other people to sanity-check it). I left the commentary about "readers" and "writers", because in many ways it's correct, and what the code actually does is very much to emulate a reader-writer lock. I put quotes around the uses in the comments to high-light that it largely _acts_ as a reader-writer lock. Now, it would actually be a really _bad_ reader-writer lock in general, exactly because it's not very "atomic" (ie this would be a truly sucky and broken lock if we didn't have the strict per-cpu usage rules). So it just so happens that the per-cpu'ness of the lock, together with the very limited way in which it is used, make that sucky implementation possible - and indeed possibly faster than the standard (and generic) kernel rwlock implementation. So it's basically a special-case rwlock that happens to work due to the specific rules. And exactly because it really wouldn't work in the absense of those rules, those rules really do need to have big comments on them so that people don't then later forget about the limitations. BTW: THIS IS TOTALLY UNTESTED. I just cut-and-pasted the existing rwlock version from one of the later patches, used the counting approach from one of the earlier ones, and basically just added what I think would be minimal required comments for this special case. All of it was written inside the mail reader, none of it has been compiled or tested in any other way. Because it's exactly the "lock semantics awareness" that I think is so important here (and that's why all the naming and commentary is so critical). Btw, the "counter" could probably be just a single bit, since apparently the nesting level is always supposed to be just one. I ...
This one was missing the "local_bh_enable()" at the end. There may be other bugs, but that's the one I noticed immediately when reading what I sent out. Oops. Linus --
I am not sure my day job will permit me to polish a patch mixing all
the bits and comments. But I am glad we eventually got back spinlocks
which are probably better than rwlocks for implementing this stuff.
Instead of submitting a full patch again, we could first submit a new
include file containg all comments and inline functions ?
This include file could be local to netfilter, with a big stick on
it to forbids its use on other areas (No changes in Documentation/ )
Then, as soon as we can go back to pure RCU solution, we can safely
delete this controversial-locking-nesting-per-cpu-thing ?
Instead of local/global name that Paul suggested, that was about
'global' locking all locks at the same time. Not any more the good
name IMHO
Maybe something like local/remote or owner/foreigner ?
xt_info_owner_lock_bh(), xt_info_owner_unlock_bh()
xt_info_foreigner_lock(), xt_info_foreigner_unlock()
One comment about this comment you wrote :
/*
* The "writer" side needs to get exclusive access to the lock,
* regardless of readers. This must be called with bottom half
* processing (and thus also preemption) disabled.
*/
static inline void xt_info_wrlock(unsigned int cpu)
{
spin_lock(&per_cpu(xt_info_locks, cpu).lock);
}
static inline void xt_info_wrunlock(unsigned int cpu)
{
spin_unlock(&per_cpu(xt_info_locks, cpu).lock);
}
Its true that BH should be disabled if caller runs
on the cpu it wants to lock.
For other ones (true foreigners), there is
no requirement about BH (current cpu could be interrupted
by a softirq and packets could fly)
We could use following construct and not require disabling BH
more than a short period of time.
(But preemption disabled for the whole duration)
preempt_disable(); // could be cpu_migration_disable();
int curcpu = smp_processor_id();
/*
* Gather stats for current cpu : must disable BH
* before trying to lock.
*/
local_bh_disable();
xt_info_wrlock(curcpu);
// copy stats of this cpu on my private data (not ...From: Eric Dumazet <dada1@cosmosbay.com> I say we merge Linus's locking idea into the XV patch, fixup the commit message wording, and move on with life. For something that's going to get deleted as soon as the faster grace period RCU stuff is available, it has consumed an inordinate amount of our time :-) I might take a stab at this before hittng bed tonight, no promises :) --
One more reason to factor out this code into general locking code. The latest code looks a bit similar to the old big-reader-locks hack (which got dropped for good many eons ago and with which i deny any involvement with, such as having authored it. [oh, did i say that out loud? crap.]), implemented cleanly and properly. IMHO this locking construct should be considered for linux/local_lock.h and kernel/local_lock.c. Even if the netfilter code drops its use soon afterwards ;-) [ The _only_ thing i am worried about is the apparent fact that there's so much confusion about recursion versus read-access. Recursion might be hard to factor out of the netfilter code, and maybe it's not even possible realistically (we fought years with the BKL and are still fighting it) but if its harms are not even _realized_ that difficult task turns into an impossible task ;-) ] Ingo --
From: Ingo Molnar <mingo@elte.hu> If you can show me have to pass a per-cpu variable (the variable, not a dereference of it) as an argument to an inline function, I'll implement this :-) It has to be dereferenced after local_bh_disable() for the read side acquisition. --
The local_bh_disable() could be outside of the locking construct. This would make it easier to adapt it to various users (irq disable, bh disable, preempt disable) depending on the contexts from which they much be protected. And if it still does not work for some reason, using a #define is discouraged, but could work. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> That's what I was hoping to avoid, things like macros and having the callers of this thing expand the two parts of the operation. What's the point in making this generic if it ends up being ugly as hell? --
.. and what's the point in making it generic if it can be replaced by a proper RCU implementation ? :-) I am not convinced of the added value we get in making it a generic header this soon. I would wait for other users to express similar needs, otherwise this could soon become an orphaned piece of locking code. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
That is my opinion as well.
Anyways, here is a patch that builds, I haven't started working
on the commit message yet.
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 7b1a652..086e976 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -354,9 +354,6 @@ struct xt_table
/* What hooks you will enter on */
unsigned int valid_hooks;
- /* Lock for the curtain */
- struct mutex lock;
-
/* Man behind the curtain... */
struct xt_table_info *private;
@@ -434,8 +431,79 @@ extern void xt_proto_fini(struct net *net, u_int8_t af);
extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
- struct xt_table_info *new);
+
+/*
+ * Per-CPU spinlock associated with per-cpu table entries, and
+ * with a counter for the "reading" side that allows a recursive
+ * reader to avoid taking the lock and deadlocking.
+ *
+ * "reading" is used by ip/arp/ip6 tables rule processing which runs per-cpu.
+ * It needs to ensure that the rules are not being changed while the packet
+ * is being processed. In some cases, the read lock will be acquired
+ * twice on the same CPU; this is okay because of the count.
+ *
+ * The write lock is used in two cases:
+ * 1. reading counter values
+ * all rule processing need to be stopped and the per-CPU values are summed.
+ *
+ * 2. replacing tables
+ * any readers that are using the old tables have to complete
+ * before freeing the old table. This is handled by reading
+ * as a side effect of reading counters
+ */
+struct xt_info_lock {
+ spinlock_t lock;
+ unsigned char readers;
+};
+DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks);
+
+/*
+ * Note: we need to ensure that preemption is disabled before acquiring
+ * the ...The x_tables are organized with a table structure and a per-cpu copies of the counters and rules. On older kernels there was a reader/writer lock per table which was a performance bottleneck. In 2.6.30-rc, this was converted to use RCU and the counters/rules which solved the performance problems for do_table but made replacing rules much slower because of the necessary RCU grace period. This version uses a per-cpu set of spinlocks and counters to allow to table processing to proceed without the cache thrashing of a global reader lock and keeps the same performance for table updates. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> --- Probably same as dave/linus but with different comments. include/linux/netfilter/x_tables.h | 73 +++++++++++++++++++-- net/ipv4/netfilter/arp_tables.c | 125 ++++++++++-------------------------- net/ipv4/netfilter/ip_tables.c | 126 ++++++++++--------------------------- net/ipv6/netfilter/ip6_tables.c | 123 ++++++++++-------------------------- net/netfilter/x_tables.c | 53 ++++++++------- 5 files changed, 204 insertions(+), 296 deletions(-) --- a/include/linux/netfilter/x_tables.h 2009-04-28 08:01:59.942151297 -0700 +++ b/include/linux/netfilter/x_tables.h 2009-04-28 09:15:09.240990339 -0700 @@ -354,9 +354,6 @@ struct xt_table /* What hooks you will enter on */ unsigned int valid_hooks; - /* Lock for the curtain */ - struct mutex lock; - /* Man behind the curtain... */ struct xt_table_info *private; @@ -434,8 +431,74 @@ extern void xt_proto_fini(struct net *ne extern struct xt_table_info *xt_alloc_table_info(unsigned int size); extern void xt_free_table_info(struct xt_table_info *info); -extern void xt_table_entry_swap_rcu(struct xt_table_info *old, - struct xt_table_info *new); + +/* + * Per-CPU spinlock associated with per-cpu table entries, and + * with a counter for the "reading" side that allows a recursive + * reader to avoid taking the lock and deadlocking. + ...
Ack. It could do with the update from Eric about how non-current CPU writelocks only require preemp-disable around get_counters() (and then the local_bh_disable() only around the current-CPU case). I _think_ get_counters() is the only case that can use that optimization, but it's quite possible that it's worth doing especially for machines with lots of cores, if BH latency is an issue (and it might be). Of course, for the lots-and-lots of cores case, even the preemption disable might be an issue. And then it really does get much more complicated. At that point, you probably want the RCU thing. Linus --
Btw, regardless, that's an incremental improvement, and does not negate the "Ack" part. Linus --
From: Linus Torvalds <torvalds@linux-foundation.org> I've applied this, thanks everyone! --
I see the patch is already in Linus's tree that I just git pulled. Tested for 200 iptables rules ... as fast as before the slow down. real 0m0.211s user 0m0.060s sys 0m0.144s Thank you all for fixing this bug! Rafael, it's fixed. Please close the case. Thanks, Jeff. --
From: Jeff Chua <jeff.chua.linux@gmail.com> Thanks for testing. --
Small followup on this one, since the likely() were forgotten.
(I trimmed down CCed list, which was insane)
It makes a difference on my x86_32 machine, gcc-4.4.0
Thank you
[PATCH] netfilter: use likely() in xt_info_rdlock_bh()
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 1b2e435..c9efe03 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -472,7 +472,7 @@ static inline void xt_info_rdlock_bh(void)
local_bh_disable();
lock = &__get_cpu_var(xt_info_locks);
- if (!lock->readers++)
+ if (likely(!lock->readers++))
spin_lock(&lock->lock);
}
@@ -480,7 +480,7 @@ static inline void xt_info_rdunlock_bh(void)
{
struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks);
- if (!--lock->readers)
+ if (likely(!--lock->readers))
spin_unlock(&lock->lock);
local_bh_enable();
}
--
From: Eric Dumazet <dada1@cosmosbay.com> Applied. --
The way I did this in treercu.c was to create an array of references to the per-CPU data in question. Not necessarily recommended, but one way of doing it. That said, one could argue that we should wait until we have at least three users before creating a generic primitive. And I just know that I am going to regret this deeply, but I cannot resist posting the following URL: http://en.wikipedia.org/wiki/Wikipedia:Avoid_Parkinson's_Bicycle_Shed_Effect Thanx, Paul --
The new percpu allocator allows you to create a per cpu pointer and pass it to functions. per_cpu_ptr(pointer,cpu) is used to select an instance. --
Well, I actually already suggested to David that he should just merge the last patch I saw floating around (with the "recursive" -> "readwrite" fix in the comment ;), so that we can at least get the basic issue fixed, and then we can tweak it a bit with smaller patches flying around. And at least right now, the difference between the rwlock and the "count+spinlock" should be basically almost unnoticeable, and a very small I don't have any strogn preferences, but I'd almost prefer to not abstract things out even that much. It's already pretty well hidden inside <netfilter/x_tables.h>, I'd hate to add a new file just for this. As to just adding more commenting that it must not be used anywhere else, local/remote works for me, and yes, since we only take the remote side one CPU at a time, I guess "global" is misleading. But "owner/foreigner" Ack. Linus --
Btw, I think it was Paul who pointed out that technically it's probably better to call them "local" and "global" lockers instead of "readers" and "writers". That also probably clarifies the rules on when you use one over the other (ie reading off all the statistics is a "global" operation, as is obviously replacing the tables). Of course, "readers" and "writers" is something most Linux lock people are more used to. Or "brlock" for the old-timers, but that involves a heavy dose of bad taste. The new use is much nicer, especially since it never takes the global lock on _all_ cpu's (which was really a killer in so many ways). Linus --
exclusive vs non-exclusive is what the literature would call them in --
I would argue that the non-exclusive category includes both reader-writer locking and local-global locking. That said, we have an unusual variant of local-global in this case, as the global processing acquires only one of the locks at a time. Thanx, Paul --
It could be worse. He could be running Ethernet over serial, e.g. L2TP. Or his serial line is a TP cable with RJ45 plugs - consumers like to call that Ethernet (cable) too. --
Why do you need to disable bottom halves on the read-side ? You could probably just disable preemption, given this lock is nestable on the [...] Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
It may not be obvious, but subject already raised on this list, so I'll try to be as precise as possible (But may be wrong on some points, I'll let Patrick correct me if necessary) ipt_do_table() is not a readonly function returning a verdict. 1) It handles a stack (check how is used next->comefrom) that seems to be stored on rules themselves. (This is how I understand this code) This is safe as each cpu has its own copy of rules/counters, and BH protected. 2) It also updates two 64 bit counters (bytes/packets) on each matched rule. 3) Some netfilter matches/targets probably rely on the fact their handlers are run with BH disabled by their caller (ipt_do_table()/arp/ip6...) These must be BH protected (and preempt disabled too), or else : 1) A softirq could interrupt a process in the middle of ipt_do_table() and corrupt its "stack". 2) A softirq could interrupt a process in ipt_do_table() in the middle of the ADD_COUNTER(). Some counters could be corrupted. 3) Some netfiler extensions would break. Previous linux versions already used a read_lock_bh() here, on a single and shared rwlock, there is nothing new on this BH locking AFAIK. Thank you --
Thanks for the explanation. It might help to document the role of bh disabling for the reader in a supplementary code comment, otherwise one might think it's been put there to match the bottom half disabling used on the write-side, which has the supplementary role of making sure bh will not deadlock (and this precise behavior is not needed usually on the read-side). One more point : * 1. reading counter values * all readers need to be stopped and the per-CPU values are summed. Maybe it's just me, but this sentence does not seem to clearly indicate that we have : for_each_cpu() write lock() read data write unlock() One might interpret it as : for_each_cpu() write lock() read data for_each_cpu() write unlock() Or maybe it's just my understanding of English that's not perfect. Anyhow, rewording this sentence might not hurt. Something along the lines of : "reading counter values all readers are iteratively stopped to have their per-CPU values summed" This is an important difference, as this behaves more like a RCU-based mechanism than a global per-cpu read/write lock where all the write locks would be taken at once. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Sigh!!! Part of the problem is that back in 1971, Courtois, Heymans, and Parnas foolishly named their article "Concurrent Control with 'Readers' and 'Writers'", which lead to the name "reader-writer lock". This started really biting around 1991, when Hsieh and Weihl created a reader-optimized lock similar to brlock, but with each of the per-CPU locks being exclusive rather than each being an rwlock. The problem was that Hsieh's and Weihl's lock was more than just a reader-writer lock. It could also be used (and -was- used) as a local/global lock, where for example you acquire your own lock to make local changes, and acquire all of the locks to obtain a consistent view of global state. In which case, you would read-acquire the lock in order to write, and write-acquire the lock in order to read. Blech. So, would it help if the function names names in this patch said something about "local" and "global" rather than "read" and "write"? --
Maybe. I do agree that the way netfilter would use the lock is somewhat different from a normal 'reader-writer' lock, since this special case of readers is Oh, I would have no problem at all with 'local' and 'global', in fact it would explain _why_ that read-write lock works. The problem with naming I have is with the 'recursive' part. There is no ambiguity what-so-ever about what a "recursive lock" is (at least of the traditional kind), and the lock described here is not it. So don't get me wrong - I could certainly live with a special lock in the networking. BUT: - it had damn well be documented as to what it does, and why it works - and it had better actually _work_, and not be buggy. I suspect that using our regular reader-writer locks works well enough, and yes, it's probably worth making it really clear that the reader variety can only ever be used in the "local" form. That kind of is implied by the whole function, though. And if somebody wants to open-code it as a per-cpu spinlock and a per-cpu 'local counter', I can live with that too, but at that point I want people to just be a lot more careful, and also document it a lot more. Linus --
Or we use the trick Eric suggested and Steve employed in the most recent patch. ;-) An alternative would be for the update code to acquire but one lock at a time, but this would likely require another lock to exclude other updaters and I believe would also require restructuring the count accumulation. So Steve's current patch seems a bit less intrusive, overall. Thanx, Paul --
I never understood that dismissal. The new rule _will_ be visible as we return to user space. It's just that old packets may still be in flight in other queues. But that is true even _without_ the "synchronize_net()". The old packets just had to make it slightly further in the queueing - but as far as user space is concerned, there is absolutely _zero_ difference between the two. I really personally would prefer the RCU approach too. I don't think rwlocks are any more cringe-worthy than spinlocks, although it is true that they tend to be slightly more expensive. The pure RCU "just get rid of the unnecessary 'serialze_net()'" approach seems to be clearly superior to either. Linus --
We had to RCU quiesce to be sure old rules were not any more used before freeing them. Alternative is to defer freeing via call_rcu() but subject to OOM. With 200 basic rules, size of rules table is about 40960 bytes per cpu. (88 pages taken on vmalloc virtual space on my 8 cpus machine) 0xfcaf8000-0xfcb03000 45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc 0xfcb04000-0xfcb0f000 45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc 0xfcb10000-0xfcb1b000 45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc 0xfcb1c000-0xfcb27000 45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc 0xfcb28000-0xfcb33000 45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc 0xfcb34000-0xfcb3f000 45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc 0xfcb40000-0xfcb4b000 45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc 0xfcb4c000-0xfcb57000 45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc This kind of monolithic huge object is hard to handle with RCU semantic, more suitable for handling set of small objects (struct file for example), In my humble opinion, this is a reasonnable compromise, and Stephen patch version 4 is ok for me. --
To be honest, the per-CPU-locking approach looks pretty good to me for
this particular case. That said, the problem you mention above does
have some straightforward solutions.
One solution to consider would be to do the call_rcu(), but to keep a
counter of the number of calls, perhaps something like the following:
call_rcu(...);
if (++count > 50) {
synchronize_rcu();
count = 0;
}
Of course, you might (or might not) need to atomically increment count,
and you of course would want to replace the "50" with some symbolic
constant, or perhaps even a variable whose value might be determined by
the size of the object and/or the amount of memory available.
Again, the per-CPU-locking approach looks good to me, as well.
But if it turns out that we really do need an RCU implementation with
really short grace periods (tens of microseconds typical latency on
mid-range multiprocessors, those with SGI Altix systems would suffer
a bit more), then it can be done. It would need to be yet another
implementation of RCU for the following reasons:
o High update-side overhead (broadcast IPIs via
smp_call_function(). This is not a problem in this case,
but would be a showstopper for (say) dcache. I don't know
of any way of fixing this.
o Defeats power-conservation measures by waking up every CPU
at every grace period. (Might be fixable, for example, by using
the same dyntick tricks used by preemptable and hierarchical RCU.
But not recommended for first implementation.)
o Poor update-side scalability. (Definitely fixable, but the
fix should be to the underlying smp_call_function() primitives.)
o No ability to share grace periods among concurrent
synchronize_rcu() primitves. (Definitely fixable, but not
recommended until needed. Unlikely to be needed -- after all,
if your grace period completes in 10 microseconds, just how many
concurrent updates do you expect there to be???)
o No call_rcu() style primtive. (Definitely fixable, but not
recommended ...From: Stephen Hemminger <shemminger@vyatta.com> Things seem to be winding down, good. :-) I'll let Patrick McHardy merge this to me with his other pending netfilter fixes. Thanks! --
On Tue, 14 Apr 2009 17:49:57 +0200
You are right, doing something with barrier would be safer there.
How about using xchg?
@@ -682,26 +668,19 @@ xt_replace_table(struct xt_table *table,
struct xt_table_info *newinfo,
int *error)
{
- struct xt_table_info *oldinfo, *private;
+ struct xt_table_info *private;
/* Do the substitution. */
- mutex_lock(&table->lock);
private = table->private;
/* Check inside lock: is the old number correct? */
if (num_counters != private->number) {
duprintf("num_counters != table->private->number (%u/%u)\n",
num_counters, private->number);
- mutex_unlock(&table->lock);
*error = -EAGAIN;
return NULL;
}
- oldinfo = private;
- rcu_assign_pointer(table->private, newinfo);
- newinfo->initial_entries = oldinfo->initial_entries;
- mutex_unlock(&table->lock);
-
- synchronize_net();
- return oldinfo;
+ newinfo->initial_entries = private->initial_entries;
+ return xchg(&table->private, newinfo);
}
EXPORT_SYMBOL_GPL(xt_replace_table);
P.s: we all missed the ordering bug in the RCU version??
--
On Fri, 10 Apr 2009 21:15:33 -0700 In this case it is to make sure that the old counter table is no And we need snapshot of all counters (which are not even an array but --
Ah!!! Is this particular code path one of the ones responsible for the OK. However, the code seems to swap in a new set of counters intended to account for subsequent packets. So I was assuming that "snapshot" meant that a given packet had to be accounted for precisely, but that it was OK to do so either in the old set or the new set, as long as it appeared in exactly one of the two sets. If this assumption is accurate, then something like the following should work. If my assumption is wrong, what exactly does this snapshot need to do? --
Btw, I think that's a bad assumption. The thing is, nobody can really care if the new rules are in effect or not, because the thing you race with is not the "return to user space" part, but the incoming packets. And those incoming packets might have been incoming before the rules were set up too. So I seriously doubt you need to synchronize with any returning to user space. What you want to synchronize with is then later actions that do things like turning on the interface that the rules are attached to etc! So I would suggest: - remove the synchronize_net() entirely. Replace it with just freeing the old rules using RCU. - new packets will always end up seeing the new rules. That includes the case of somebody doing "ifconfig eth0 up" that enables a new source of packets, so there are no real security issues. - if you enabled your network interfaces before you updated your packet filtering rules, you already had a window where packets would come in with the old rules, so doing a "synchronize_net()" in no way protects against any race conditions anyway. Am I missing something? Linus --
The issue at this point seems to be the need to get accurate snapshots of various counters -- there are a number of Linux networking users who need to account for every byte flowing through their systems. However, it is also necessary to update these counters very efficiently, given that they are updated on a per-packet basis. The current approach is as follows: 1. Install a new set of counters. 2. Wait for a grace period to elapse. 3. At this point, we know that all subsequent counting will happen on the new set of counters. 4. Add the value of the old set of counters to the new set of counters. 5. Copy the old set of counters up to user space. So we get a good snapshot in #5, while #4 ensures that we don't lose any counts when taking future snapshots. Unfortunately, #2 hits us with grace-period latencies on the critical path. We are going through the following possibilities: o Stick with the current approach, and ask people to move to new batch-oriented interfaces. However, a 30x decrease in performance is pretty grim, even for an old-style interface. o Use various atomic tricks to get an immediate snapshot of the old counters after step 1. Make step 3 use call_rcu() instead of synchronize_rcu(), and then step 4 happens off the critical path. This approach moves the RCU grace period off of the critical path, but the atomic tricks are extremely ugly on 32-bit SMP machines. 32-bit UP machines and 64-bit machines are not too bad, though the 32-bit UP case does add preemption-disable overhead on the counter-update fastpath. o Provide some sort of expedited synchronize_rcu(). This might be able to decrease the hit from 30x down to maybe 5x. But I might need to do this for the fast-boot folks anyway, though I am first trying to get away with just speeding up synchronized_rcu(). Though I was not thinking in terms of 6x, let alone 30x. Please note that this would not be a drop-in replacement for synchronize_rcu(). One would ...
Hi. If we add or change the rule we can not know if iptables' return to userspace does mean that rule started to act. There may be other queues already filled with the packets which could match the new rule (like receiving socket buffer). So effectively we want it to take effect very soon. What if there will be a timer which will synchronize RCU-added states, so if we update single rule - it will take effect in a second, and if we update bunch of them - during the delay second we pretty much can load them all. -- Evgeniy Polyakov --
On Sat, 11 Apr 2009 17:34:45 -0700
We could also try:
* per-cpu spinlock on counters (instead of synchronize_net).
When doing update, just acquire
lock on that cpu and futz with counters then. Overhead should
still be less than 2.6.29 and earlier global rwlock
* synchonize_rcu/synchronize_net is more guarantee than needed?
* use on_each_cpu() somehow to do grace periood?
* Add a cond_resched() into net_rx_action which might cause rx processing
to get out of rcu sooner? also in transmit packet scheduler.
--
This one makes a lot of sense to me. The overhead of an uncontended lock is pretty small on most systems. This would also mean that you If you really do need to swap the counters themselves, you -might- also need call_rcu() to dispose of them. But it should possible to do that You could certainly use something like smp_call_function() to collect the other CPUs' counter values -- just disable interrupts across the increments for architectures that cannot atomically increment a 64-bit value. (And it only needs to be atomic with respect to an interrupt, This might help some, but would probably only give a few tens of percent improvement. Thanx, Paul --
