Re: [PATCH] netfilter: use per-CPU r**ursive lock {XV}

Previous thread: Dear Email User by Cyberus UpgradeTeam on Friday, April 10, 2009 - 8:56 am. (1 message)

Next thread: [PATCH] Bonding: fix zero address hole bug in arp_ip_target list by Brian Haley on Friday, April 10, 2009 - 7:41 pm. (3 messages)
From: David Miller
Date: Friday, April 10, 2009 - 6:25 pm

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Fri, 10 Apr 2009 09:52:46 -0700

--

From: Linus Torvalds
Date: Friday, April 10, 2009 - 6:39 pm

I wonder if we should bring in the RCU people too, for them to tell you 
that the networking people are beign silly, and should not synchronize 
with the very heavy-handed

	synchronize_net()

but instead of doing synchronization (which is probably why adding a few 
hundred rules then takes several seconds - each synchronizes and that 
takes a timer tick or so), add the rules to be free'd on some rcu-freeing 
list for later freeing.

Or whatever. Paul? synchronize_net() just calls synchronize_rcu(), and 
with that knowledge and a simple

	git show 784544739a25c30637397ace5489eeb6e15d7d49

I bet you can already tell people how to fix their performance issue.

		Linus

--

From: Paul E. McKenney
Date: Friday, April 10, 2009 - 9:15 pm

Well, I am certainly happy to demonstrate my ignorance of the networking
code by throwing out a few suggestions.

So, Dave and Steve, you might want to get out your barf bag before
reading further.  You have been warned!  ;-)

1.	Assuming that the synchronize_net() is intended to guarantee
	that the new rules will be in effect before returning to
	user space:

	a.	Split this functionality, so that there is a new
		user-space primitive that installs a new rule, but
		without waiting.  They provide an additional user-space
		primitive that waits for the rules to take effect.
		Then, when loading a long list of rules, load them
		using the non-waiting primitive, and wait at the end.

	b.	As above, but provide a flag that says whether or not
		to wait.  Same general effect.

	But I am not seeing the direct connection between this patch
	and netfilter, so...

2.	For the xt_replace_table() case, it would be necessary to add an
	rcu_head to the xt_table_info, and replace each caller's direct
	calls to xt_free_table_info() with call_rcu().

	Now this has an issue in that the caller wants to return the
	final counter values.  My assumption is that these values do
	not in fact need to be exact.  If I am wrong about that, then
	my suggestion would lose the counts from late readers.
	I must defer to the networking guys as to whether this is
	acceptable or not.  If not, more head-scratching would be
	required.  (But it looks to me that the rule is being trashed,
	so who cares about the extra counts?)

	In addition, a malicious user might be able to force this to
	happen extremely frequently, running the system out of memory.
	One way to fix this is to invoke synchronize_net() one out of
	20 times or some such.

3.	For the alloc_counters() case, the comments indicate that we
	really truly do want an atomic sampling of the counters.
	The counters are 64-bit entities, which is a bit inconvenient.
	Though people using this functionality are no doubt quite happy
	to never have to ...
From: Jan Engelhardt
Date: Friday, April 10, 2009 - 10:14 pm

iptables works in whole tables. Userspace submits a table, checkentry is 
called for all rules in the new table, things are swapped, then destroy 
is called for all rules in the old table. By that logic (which existed
since dawn I think), only the swap operation needs to be locked.


The fact that `iptables -A` is called a hundred times means you are 
doing 100 table replacements -- instead of one. And calling
synchronize_net at least a 100 times.


As I read the new code, it seems that synchronize_net is only
used on copying the rules from kernel into userspace;
not when updating them from userspace:

IPT_SO_GET_ENTRIES -> get_entries -> copy_entries_to_user -> 

Would a seqlock suffice, as it does for the 64-bit jiffies?
--

From: Paul E. McKenney
Date: Friday, April 10, 2009 - 10:42 pm

The 64-bit jiffies counter is not updated often, so write-acquiring a
seqlock on each update is OK.  From what I understand, these counters
are updated quite often (one each packet transmission or reception?),
so write-acquiring on each update would be quite painful.

Or did you have something else in mind here?

							Thanx, Paul
--

From: David Miller
Date: Friday, April 10, 2009 - 11:00 pm

From: Jan Engelhardt <jengelh@medozas.de>

I want to derail this line of thinking as fast as possible.

This is not an acceptable response to this problem.  We made something
fundamentally slower by several orders of magnitude.

Therefore, saying "Don't insert your firewall rules like that." is not
a valid response for this regression.

We really have to fix it or revert.
--

From: Kyle Moffett
Date: Saturday, April 11, 2009 - 11:12 am

Let me start by saying that I agree that for most systems this patch
provided a bad performance tradeoff that needs to get fixed.

On the other hand I have certain systems where I would much rather
reduce the per-packet load by a few percent...  even if it increases
the effort to load a new ruleset by many orders of magnitude!!!  Quite
simply the boxes only reboot a few times a year but in-between times
they forward many terabytes of low-latency network traffic.

So... to play devils advocate:

Almost all of the standard firewall tools (such as shorewall, etc) are
already using iptables-restore command to load firewall rules,
primarily because using separate iptables commands was *already* way
too slow.  There's also the serious race-condition of doing a firewall
restart that way where you only have half your rules loaded for a bit.
 The "iptables" command is fine for fiddling around with the command
line and making minor tweaks, but it simply doesn't cut it for
large-scale rules.

I remember when switching from a shell-based shorewall to a perl-based
shorewall.  The time to build my rule lists with the perl-based
version was about 20% of what it had been, but the time to load the
rules into the kernel with iptables-restore was easily 2% or perhaps
less.

Finally, if you really are loading a couple hundred IPs into a linear
ruleset, you're adding a fair amount of packet latency to each and
every packet that goes through totally independent of iptables load
time.  It would be much better to use ipsets (or similar) because they
load all of the IP ranges into an appropriate tree datastructure with
O(small-constant * log(N) + large-constant * 1) lookup instead of
O(large-constant * N).

Cheers,
Kyle Moffett
--

From: Arkadiusz Miskiewicz
Date: Saturday, April 11, 2009 - 11:32 am

Some time ago there was batch patch that allowed to use standard shell format 
of calling iptables but did everything at once:

http://lists.netfilter.org/pipermail/netfilter-devel/2004-
September/016704.html

It didn't get merged - no idea why.

-- 
Arkadiusz Miśkiewicz        PLD/Linux Team
arekm / maven.pl            http://ftp.pld-linux.org/

--

From: david
Date: Saturday, April 11, 2009 - 5:54 pm

>> This is not an acceptable response to this problem. 
From: Kyle Moffett
Date: Saturday, April 11, 2009 - 10:05 pm

Not true...  The iptables-restore format is pretty well documented and
not far off the standard command-line argument format.  For instance
the "shorewall" tool does a sort of "compile" of its high-level
firewall language into an input file for the "iptables-restore"
command.

The basic format to atomically load a table is:
-A SOMECHAIN --rule-arguments-here
-A customchain --rule-arguments-here
COMMIT

At the end of this email you can find some sample data cut-n-pasted
from the iptables-restore file from one of my boxes running shorewall.
 The full file is 645 lines but takes at most a second or so to load
once compiled.

You could also do an iptables-save -c on one of your configured
systems to see what various constructions you use look like in the
iptables format.  It's all pretty straightforward.

Cheers,
Kyle Moffett


COMMIT
-A POSTROUTING -o world -j world_masq
-A excl0 -d 10.0.0.0/8 -j RETURN
-A excl0 -d 192.168.0.0/16 -j RETURN
-A excl0 -d 172.16.0.0/12 -j RETURN
-A excl0 -j MASQUERADE
-A excl1 -d 10.0.0.0/8 -j RETURN
-A excl1 -d 192.168.0.0/16 -j RETURN
-A excl1 -d 172.16.0.0/12 -j RETURN
-A excl1 -j MASQUERADE
-A excl2 -d 10.0.0.0/8 -j RETURN
-A excl2 -d 192.168.0.0/16 -j RETURN
-A excl2 -d 172.16.0.0/12 -j RETURN
-A excl2 -j MASQUERADE
-A world_masq -s 10.0.0.0/8 -j excl0
-A world_masq -s 172.16.0.0/12 -j excl1
-A world_masq -s 192.168.0.0/16 -j excl2
COMMIT
-A PREROUTING  -j tcpre
-A FORWARD -j tcfor
-A FORWARD -j fortos
-A OUTPUT  -j tcout
-A POSTROUTING -j tcpost
-A fortos -p 17 -i dmz -s 72.214.41.2 -j TOS --set-tos 16
-A fortos -p 17 -o dmz -d 72.214.41.2 -j TOS --set-tos 16
COMMIT
[...many more lines snipped...]
--

From: Harald Welte
Date: Sunday, April 12, 2009 - 5:30 am

Hi all,


That's what I implemented as "iptables-restore --noflush" a number of years
ago. It doesn't flush the current uleset and swaps in a new one, but=20
reads the current rules from kernel, applies any number of changes and swaps
the new ruleset in.

The syntax of iptables-restore is almost identical to iptables commands, you
just specify the table in a different way.  So you would just create your
desired changes in that format, and echo that into iptables-restore.  If it=
's
an entire new ruleset, you use no '--noflush' and it does automatic flushing
of all old rules.  If your stdin-file contains only incremental changes, you
use --noflush.

In the netfilter project, we knew for many years that the 'swap the entire
table atomically in' is a bad design choice.  This is what various develope=
rs
have been trying to address at different times, and which finally resulted
in the nftables implementation of Patrick McHardy.  So for the mid- to long
term there is a clear design that moves away from that.

But so far, we have to live with the API and its semantics.  iptables users=
pace
has been improved a number of times, and things like iptables-restore with =
or
without --noflush can be used as an intermediate solution - and have been u=
sed
by many systems out there.

--=20
- Harald Welte <laforge@netfilter.org>                 http://netfilter.org/
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D
  "Fragmentation is like classful addressing -- an interesting early
   architectural error that shows how much experimentation was going
   on while IP was being designed."                    -- Paul Vixie
From: Jan Engelhardt
Date: Sunday, April 12, 2009 - 9:38 am

Well, there is an extra tool in SUSE's iptables, which collects
rules added this way, and then commits them in one go when you
are done. Perhaps that is an "adequeate" way?
--

From: Stephen Hemminger
Date: Saturday, April 11, 2009 - 8:07 am

On Sat, 11 Apr 2009 07:14:50 +0200 (CEST)

Part of the overhead is the API choice to take counter values from user
space during the replace.  If the rule replacement just always started with
zero counters it could be done with less overhead.
--

From: Jeff Chua
Date: Saturday, April 11, 2009 - 9:05 am

On Sat, Apr 11, 2009 at 11:07 PM, Stephen Hemminger

It's always good practice to start from zero with these ...

# iptables -F
# iptables -t nat -F
# iptables -X

And most of the time, rules should be put into a file so that it can
rerun easily after reboot. So if it can be speed up for just this
case, it'll help many out there.

Thanks,
Jeff.
--

From: Linus Torvalds
Date: Saturday, April 11, 2009 - 10:51 am

Not really.

Yes, iptables as a single command works in whole tables.

USERS, on the other hand, often work in multiple iptables commands, ie 
they just add things to the tables. And in fact, I think this is the exact 
workload that Jeff complains about - doing two hundred "update table" 

The problem is, the new code makes the "wait after swap" thing happen 
after every switch. And if you do two hundred "update table" commands, you 
now take a _long_ time to update.

Sure, you could tell people to just do everything as one single table 
update, but that isn't what they do.

		Linus
--

From: Ingo Molnar
Date: Saturday, April 11, 2009 - 12:08 am

Note, i have recently implemented full atomic64_t support on 32-bit 
x86, for the perfcounters code, based on the CMPXCHG8B instruction.

Which, while not the lightest of instructions, is still much better 
than the sequence above.

So i think a better approach would be to also add a dumb generic 
implementation for atomic64_t (using a global lock or so), and then 
generic code could just assume that atomic64_t always exists.

It is far nicer - and faster as well - as the hack above, even on 
32-bit x86.

	Ingo
--

From: Stephen Hemminger
Date: Saturday, April 11, 2009 - 8:05 am

On Sat, 11 Apr 2009 09:08:54 +0200

The iptables counters are write mostly, read rarely so they don't
fit the seq counter or atomic use case. Also, it is important
to get a consistent snapshot of the whole set not just each
individual counter.
--

From: Paul E. McKenney
Date: Saturday, April 11, 2009 - 10:48 am

If the generic implementation is needed only on !SMP systems, that
could work.  The architectures I would be worried about include
powerpc and ia64, which I believe support 32-bit SMP builds.

						Thanx, Paul
--

From: Ingo Molnar
Date: Sunday, April 12, 2009 - 3:54 am

ia64 would naturally support the CMPXCHG8B instructions.

Not sure about powerpc32. Having a lock for the library 
implementation is not _that_ much of a problem. We obviously dont 
want the design of Linux to be dictated by the weakest link of all 
platforms, right?

	Ingo
--

From: Paul Mackerras
Date: Sunday, April 12, 2009 - 4:34 am

32-bit powerpc doesn't have 64-bit atomic operations and does support
SMP.

What about ARM?  I thought they had 32-bit SMP these days as well.

Paul.
--

From: Paul E. McKenney
Date: Sunday, April 12, 2009 - 10:31 am

Some of Steve Hemminger's recent suggestions in this thread seem to me
to avoid this whole issue nicely.  But we will see!  ;-)

							Thanx, Paul
--

From: David Miller
Date: Sunday, April 12, 2009 - 6:13 pm

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

I hope so.

Eventually it seems that all of the older 32-bit SMP platforms
will be run under a bus having to execute some many "efficient"
primitives using the "hash table of spinlocks" scheme for
synchronization.
--

From: Paul E. McKenney
Date: Sunday, April 12, 2009 - 9:04 pm

Or, in some cases, per-CPU locks.  It might also be that 32-bit SMP
systems use stop-the-world approaches.

Or that we get a variant of RCU with shorter grace periods.  I don't
recommend holding your breath, but I have not given up on this.  For one
only slightly crazy example, some of the user-level RCU variants could be
adapted to in-kernel use.  Some of these have sub-microsecond grace-period
latencies, but also have some limitations that would make it difficult
for them to replace RCU wholesale in the Linux kernel.  And it is not
like we are suffering from any lack of distinct RCU implementations in
the Linux kernel just now.  :-/

However, they might be useful in isolated situations.

							Thanx, Paul
--

From: Stephen Hemminger
Date: Monday, April 13, 2009 - 9:53 am

This is an alternative version of ip/ip6/arp tables locking using
per-cpu locks.  This avoids the overhead of synchronize_net() during
update but still removes the expensive rwlock in earlier versions.

The idea for this came from an earlier version done by Eric Duzamet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  The slow case involves acquiring the locks on
all cpu's.

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

Tested basic functionality (add/remove/list), but don't have test cases
for stress, ip6tables or arptables.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
Patch against 2.6.30-rc1

 include/linux/netfilter/x_tables.h |    5 -
 net/ipv4/netfilter/arp_tables.c    |  122 ++++++++++------------------------
 net/ipv4/netfilter/ip_tables.c     |  129 +++++++++++--------------------------
 net/ipv6/netfilter/ip6_tables.c    |  127 +++++++++++-------------------------
 net/netfilter/x_tables.c           |   21 ------
 5 files changed, 116 insertions(+), 288 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-13 08:27:45.698412446 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-13 08:34:05.499348295 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,6 @@ extern void xt_proto_fini(struct net *ne
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
-				    struct xt_table_info *new);
 
 /*
  * This helper is performance critical and must be inlined
--- a/net/ipv4/netfilter/ip_tables.c	2009-04-13 08:27:45.684411824 -0700
+++ b/net/ipv4/netfilter/ip_tables.c	2009-04-13 ...
From: Eric Dumazet
Date: Monday, April 13, 2009 - 10:40 am

Patch seems good to me, but apparently xt_replace_table()
misses the "acquiring the locks on all cpus" you mentioned in ChangeLog ?

I am still off-computers until tomorrow so cannot provide a patch for this, sorry.

Some form of

local_bh_disable();
for_each_possible_cpu(cpu)
	spin_lock(&per_cpu(ip_tables_lock, cpu));

oldinfo = private;
/* do the substitution */
table->private = newinfo;
newinfo->initial_entries = oldinfo->initial_entries;

for_each_possible_cpu(cpu)
	spin_unlock(&per_cpu(ip_tables_lock, cpu));
local_bh_enable();


But I wonder if this could hit a limit of max spinlocks held by this cpu, say on a 4096 cpu machine ?



--

From: Stephen Hemminger
Date: Monday, April 13, 2009 - 11:11 am

On Mon, 13 Apr 2009 19:40:24 +0200

--

From: Martin Josefsson
Date: Monday, April 13, 2009 - 12:06 pm

Doesn't spin_lock() result in a pipeline flush on x86?

iirc there was a benchmark in an RCU paper that tested using per cpu 
spin_locks and the result was that it didn't scale well at all.

/Martin
--

From: Linus Torvalds
Date: Monday, April 13, 2009 - 12:17 pm

It's about 20-40 cycles when cached and uncontended, with some outliers 
(50+ cycles on P4, 12 cycles on some AMD opterons).


Spinlocks scale wonderfully well if you only touch them on one CPU.

Of course, if you truly only touch them on one CPU they are pointless, but 
a "all normal code only touches the local CPU spinlock, the really odd 
cases take all locks" approach works fine. It makes the uncommon case 
really quite slow, but if it truly is uncommon, that's fine.

			Linus
--

From: Andrew Morton
Date: Monday, April 13, 2009 - 3:24 pm

On Mon, 13 Apr 2009 09:53:09 -0700


This would be racy against cpu hotplug if this code was hotplug-aware.

And it should be hotplug aware, really.  num_possible_cpus() can exceed
num_online_cpus().  The extent by which possible>online is
controversial, but one can conceive of situations where it is "lots".

Is lib/percpu_counter.c no good for this application?  Unfixably no
good?  That code automagically handles cpu hotplug.


--

From: Stephen Hemminger
Date: Monday, April 13, 2009 - 4:20 pm

On Mon, 13 Apr 2009 15:24:37 -0700

No. get_cpu_var implies smp_processor_id which is not safe

It is doing right thing already with hotplug.
This code still needs to count packets processed by previously online

percpu_counter can't deal with the layout/load here.
--

From: Andrew Morton
Date: Monday, April 13, 2009 - 4:26 pm

On Mon, 13 Apr 2009 16:20:00 -0700

spin_lock_bh() will dtrt, but spelling it out seems a good idea.



Those counts could be migrated off that CPU when it is offlined.  As

Insufficient detail here for anyone to understand why percpucounter
cannot be adapted to this requirement.

--

From: Linus Torvalds
Date: Monday, April 13, 2009 - 4:37 pm

No, spin_lock_bh() will _not_ do the right thing. 

On UP it will actually work for two reasons: it will work because (a) it's 
UP, so there are no issues with smp_processor_id() to beging with, but 
also because even if there _were_ issues, it would still work because it 
would all expand as a macro, and the preempt_disable() will actually 
happen before the argument is evaluated.

But on SMP, spin_lock_bh() expands to just _spin_lock_bh(), and is a real 
function - and the argument will be evaluated before the call (obviously), 
and thus before the preempt_disable().

So

	local_bh_disable();
	spin_lock(&__get_cpu_var(ip_tables_lock));

is correct, and 

	spin_lock_bh(&__get_cpu_var(ip_tables_lock));

is _not_ correct. The latter will do "&__get_cpu_var(ip_tables_lock)"
with no protection from the process being switched to another CPU.

			Linus
--

From: Ingo Molnar
Date: Monday, April 13, 2009 - 4:52 pm

One option would be to make it more robust for such use. The 
downside would be all the other cases where the expression is really 
constant (but still takes a few instructions to calculate) and could 
be (and should be) evaluated outside of that critical section.

So i'd tend to leave it in its current (optimistic) form: we've got 
1283 uses of spin_lock_bh(), and just a quick look at git grep 
suggests that the current optimistic optimization matters in 
practice.

	Ingo
--

From: Patrick McHardy
Date: Tuesday, April 14, 2009 - 5:27 am

Thanks Stephen, I'll do some testing with ip6tables.

--

From: Eric Dumazet
Date: Tuesday, April 14, 2009 - 7:23 am

Here is the patch I cooked on top of Stephen one to get proper locking.

In the "iptables -L" case, we freeze updates on all cpus to get previous
RCU behavior (not sure it is mandatory, but anyway...)

And xt_replace_table() uses same logic to make sure a cpu wont try to parse rules
and update counters while a writer is replacing tables (and thus calling
vfree() and unmapping in-use pages)

Feel free to merge this patch to Stephen one before upstream submission

Thank you

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/netfilter/x_tables.h |    5 ++
 net/ipv4/netfilter/arp_tables.c    |   20 +++------
 net/ipv4/netfilter/ip_tables.c     |   24 ++++-------
 net/ipv6/netfilter/ip6_tables.c    |   24 ++++-------
 net/netfilter/x_tables.c           |   55 ++++++++++++++++++++++++++-
 5 files changed, 84 insertions(+), 44 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 1ff1a76..a5840a4 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -426,6 +426,11 @@ extern struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af,
 					   const char *name);
 extern void xt_table_unlock(struct xt_table *t);
 
+extern void xt_tlock_lockall(void);
+extern void xt_tlock_unlockall(void);
+extern void xt_tlock_lock(void);
+extern void xt_tlock_unlock(void);
+
 extern int xt_proto_init(struct net *net, u_int8_t af);
 extern void xt_proto_fini(struct net *net, u_int8_t af);
 
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index c60cc11..b561e1e 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -231,8 +231,6 @@ static inline struct arpt_entry *get_entry(void *base, unsigned int offset)
 	return (struct arpt_entry *)(base + offset);
 }
 
-static DEFINE_PER_CPU(spinlock_t, arp_tables_lock);
-
 unsigned int arpt_do_table(struct sk_buff *skb,
 			   unsigned int hook,
 			   const struct net_device ...
From: Stephen Hemminger
Date: Tuesday, April 14, 2009 - 7:45 am

On Tue, 14 Apr 2009 16:23:33 +0200

I see no demonstrated problem with locking in my version.
The reader/writer race is already handled. On replace the race of

CPU 0                          CPU 1
                           lock (iptables(1))
                           refer to oldinfo
swap in new info
foreach CPU
   lock iptables(i)
   (spin)                  unlock(iptables(1))
   read oldinfo
   unlock
...

The point is my locking works, you just seem to feel more comfortable with


With RCU type semantics this isn't an issue. A CPU always sees consistent
--

From: Eric Dumazet
Date: Tuesday, April 14, 2009 - 8:49 am

Oh right, I missed that xt_replace_table() was *followed* by a get_counters()
call, but I am pretty sure something is needed in xt_replace_table().

A memory barrier at least (smp_wmb())

As soon as we do "table->private = newinfo;", other cpus might fetch incorrect
values for newinfo->fields.

In the past, we had a write_lock_bh()/write_unlock_bh() pair that was
doing this for us.
Then we had rcu_assign_pointer() that also had this memory barrier implied.

Even if vmalloc() calls we do before calling xt_replace_table() probably
already force barriers, add one for reference, just in case we change callers

Previous to RCU conversion, we had a rwlock.

Doing a write_lock_bh() on it while reading counters (iptables -L)
*did* stop all cpus from doing their read_lock_bh() and counters updates.

After RCU and your last patch, an "iptables -L" locks each table one by one.

This is correct, since a cpu wont update its table while we are fetching it,
but we lost previous "rwlock freeze all" behavior, and some apps/users could
complain about it, this is why I said "not sure it is mandatory"...

Here is an updated patch ontop of yours, with the smp_wmb() in xt_replace_table() :

Thank you

 include/linux/netfilter/x_tables.h |    5 ++
 net/ipv4/netfilter/arp_tables.c    |   20 +++------
 net/ipv4/netfilter/ip_tables.c     |   24 ++++-------
 net/ipv6/netfilter/ip6_tables.c    |   24 ++++-------
 net/netfilter/x_tables.c           |   57 ++++++++++++++++++++++++++-
 5 files changed, 86 insertions(+), 44 deletions(-)

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 1ff1a76..a5840a4 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -426,6 +426,11 @@ extern struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af,
 					   const char *name);
 extern void xt_table_unlock(struct xt_table *t);
 
+extern void xt_tlock_lockall(void);
+extern void xt_tlock_unlockall(void);
+extern void ...
From: Jeff Chua
Date: Tuesday, April 14, 2009 - 9:51 am

Tested as well. Loaded as fast as 2.6.29.

Thanks,
Jeff.
--

From: Stephen Hemminger
Date: Tuesday, April 14, 2009 - 11:17 am

Subject: iptables: 

This is an alternative version of ip/ip6/arp tables locking using
per-cpu locks.  This avoids the overhead of synchronize_net() during
update but still removes the expensive rwlock in earlier versions.

The idea for this came from an earlier version done by Eric Duzamet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  The slow case involves acquiring the locks on
all cpu's.

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
 include/linux/netfilter/x_tables.h |    5 -
 net/ipv4/netfilter/arp_tables.c    |  112 +++++++++------------------------
 net/ipv4/netfilter/ip_tables.c     |  123 +++++++++++--------------------------
 net/ipv6/netfilter/ip6_tables.c    |  119 +++++++++++------------------------
 net/netfilter/x_tables.c           |   28 --------
 5 files changed, 110 insertions(+), 277 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-14 10:13:59.932292529 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-14 10:14:04.121043331 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,6 @@ extern void xt_proto_fini(struct net *ne
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
-				    struct xt_table_info *new);
 
 /*
  * This helper is performance critical and must be inlined
--- a/net/ipv4/netfilter/ip_tables.c	2009-04-14 10:13:59.923167357 -0700
+++ b/net/ipv4/netfilter/ip_tables.c	2009-04-14 11:13:51.066830533 -0700
@@ -297,6 +297,8 @@ static void trace_packet(struct sk_buff 
 }
 #endif
 
+static ...
From: Eric Dumazet
Date: Tuesday, April 14, 2009 - 12:28 pm

Oh well, it seems factorization of this stuff is not what you want, so
I'll stop arguing.

Please check spelling of my name in ChangeLog, and more importantly :
initialize arp_tables_lock that is missing in V1/2

	for_each_possible_cpu(cpu)
		spin_lock_init(&per_cpu(arp_tables_lock, cpu));

Then please add my :

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Thanks


--

From: Stephen Hemminger
Date: Tuesday, April 14, 2009 - 2:11 pm

That's no fun..  I didn't mind the refactorizing, but was concerned
that it caused the locking to be out of line in fast path. 
--

From: Stephen Hemminger
Date: Tuesday, April 14, 2009 - 2:13 pm

This is an alternative version of ip/ip6/arp tables locking using
per-cpu locks.  This avoids the overhead of synchronize_net() during
update but still removes the expensive rwlock in earlier versions.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  The slow case involves acquiring the locks on
all cpu's.

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
 include/linux/netfilter/x_tables.h |    5 -
 net/ipv4/netfilter/arp_tables.c    |  112 +++++++++------------------------
 net/ipv4/netfilter/ip_tables.c     |  123 +++++++++++--------------------------
 net/ipv6/netfilter/ip6_tables.c    |  119 +++++++++++------------------------
 net/netfilter/x_tables.c           |   28 --------
 5 files changed, 110 insertions(+), 277 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-14 10:13:59.932292529 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-14 10:14:04.121043331 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,6 @@ extern void xt_proto_fini(struct net *ne
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
-				    struct xt_table_info *new);
 
 /*
  * This helper is performance critical and must be inlined
--- a/net/ipv4/netfilter/ip_tables.c	2009-04-14 10:13:59.923167357 -0700
+++ b/net/ipv4/netfilter/ip_tables.c	2009-04-14 11:13:51.066830533 -0700
@@ -297,6 +297,8 @@ static void trace_packet(struct sk_buff 
 }
 #endif
 
+static DEFINE_PER_CPU(spinlock_t, ...
From: Eric Dumazet
Date: Tuesday, April 14, 2009 - 2:40 pm

Tested successfuly on my dev machine, thanks Stephen.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


"strace -tt -T iptables -L" on this 8 cpu machine show quite fast operations now
(sub 1/HZ) 

23:37:26.629489 getsockopt(3, SOL_IP, 0x40 /* IP_??? */, "filter\0\300\\\236@\365\271\231\"\300`\225T\367\1\0\0\0\0vm\300`\225T\367\
16"..., [84]) = 0 <0.000008>
23:37:26.629579 brk(0)                  = 0x8054000 <0.000006>
23:37:26.629613 brk(0x8075000)          = 0x8075000 <0.000007>
23:37:26.629660 getsockopt(3, SOL_IP, 0x41 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0L"..., [10608])
= 0 <0.000031>
23:37:26.629772 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000007>

same fast operation for an "iptables -A INPUT" operation


23:37:02.180313 close(4)                = 0 <0.000006>
23:37:02.180382 setsockopt(3, SOL_IP, 0x40 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\16"..., 10664)
= 0 <0.000114>
23:37:02.180552 setsockopt(3, SOL_IP, 0x41 /* IP_??? */, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20"..., 292) =
0 <0.000015>
23:37:02.180635 close(3)                = 0 <0.000010>

--

From: Patrick McHardy
Date: Wednesday, April 15, 2009 - 3:59 am

Applied, thanks everyone. I'll give it some testing myself and
will send it upstream tonight.
--

From: Stephen Hemminger
Date: Wednesday, April 15, 2009 - 9:31 am

On Wed, 15 Apr 2009 12:59:03 +0200

I am running it with LOCKDEP now to check for any issues.
It also needs to be validated with SMP configured kernel running on UP.
--

From: Stephen Hemminger
Date: Wednesday, April 15, 2009 - 1:55 pm

Looks like there is some recursive path into ip_tables that makes the
per-cpu spinlock break.  I get lockup's with KVM networking.

Suggestions?

[ 1931.788787] virbr3: port 1(vnet2) entering forwarding state
[ 2106.068500] Pid: 12397, comm: kvm Not tainted 2.6.30-rc1 #6
[ 2106.068501] Call Trace:
[ 2106.068503]  <IRQ>  [<ffffffff80393a09>] ? _raw_spin_lock+0xdd/0x105
[ 2106.068513]  [<ffffffffa024041d>] ? ipt_do_table+0x102/0x5f1 [ip_tables]
[ 2106.068516]  [<ffffffff8048d4e0>] ? skb_checksum+0x4c/0x257
[ 2106.068521]  [<ffffffffa025f1fc>] ? manip_pkt+0x80/0xf4 [nf_nat]
[ 2106.068525]  [<ffffffffa025f432>] ? nf_nat_icmp_reply_translation+0x1c2/0x245 [nf_nat]
[ 2106.068528]  [<ffffffff8048f632>] ? __alloc_skb+0x6b/0x13d
[ 2106.068536]  [<ffffffffa0246cfe>] ? nf_conntrack_in+0x45e/0x5ac [nf_conntrack]
[ 2106.068540]  [<ffffffffa01602b1>] ? nf_nat_fn+0xc1/0x14d [iptable_nat]
[ 2106.068544]  [<ffffffff804b00d0>] ? nf_iterate+0x41/0x7d
[ 2106.068547]  [<ffffffff804b8ed0>] ? dst_output+0x0/0xb
[ 2106.068550]  [<ffffffff804b0195>] ? nf_hook_slow+0x89/0x104
[ 2106.068552]  [<ffffffff804b8ed0>] ? dst_output+0x0/0xb
[ 2106.068555]  [<ffffffff80393925>] ? _raw_spin_unlock+0x8b/0x92
[ 2106.068557]  [<ffffffff804ba8c7>] ? __ip_local_out+0x98/0x9a
[ 2106.068559]  [<ffffffff804ba8d2>] ? ip_local_out+0x9/0x1f
[ 2106.068562]  [<ffffffff804babb4>] ? ip_push_pending_frames+0x2cc/0x33e
[ 2106.068566]  [<ffffffff804dac79>] ? icmp_send+0x559/0x588
[ 2106.068569]  [<ffffffff8022d3a0>] ? task_rq_lock+0x46/0x79
[ 2106.068571]  [<ffffffff8023004f>] ? enqueue_task_fair+0x23b/0x293
[ 2106.068575]  [<ffffffffa00f5083>] ? reject_tg+0x41/0x30e [ipt_REJECT]
[ 2106.068578]  [<ffffffffa024084f>] ? ipt_do_table+0x534/0x5f1 [ip_tables]
[ 2106.068581]  [<ffffffff804b37a0>] ? rt_intern_hash+0x46f/0x48a
[ 2106.068584]  [<ffffffff804b00d0>] ? nf_iterate+0x41/0x7d
[ 2106.068587]  [<ffffffff804b7d94>] ? ip_forward_finish+0x0/0x3b
[ 2106.068589]  [<ffffffff804b0195>] ? nf_hook_slow+0x89/0x104
[ 2106.068591]  ...
From: Eric Dumazet
Date: Wednesday, April 15, 2009 - 2:07 pm

Well, it seems original patch was not so bad after all

http://lists.netfilter.org/pipermail/netfilter-devel/2006-January/023175.html

So change per-cpu spinlocks to per-cpu rwlocks 

and use read_lock() in ipt_do_table() to allow recursion...


--

From: Jan Engelhardt
Date: Wednesday, April 15, 2009 - 2:55 pm

iptables cannot quite recurse into itself due to the comefrom stuff.

--

From: Patrick McHardy
Date: Thursday, April 16, 2009 - 5:12 am

From: Jan Engelhardt
Date: Thursday, April 16, 2009 - 5:24 am

Yes, but it has to return an absolute verdict (which REJECT does),
so it's not really a recursion, it's more like a goto without return.
--

From: Patrick McHardy
Date: Thursday, April 16, 2009 - 5:31 am

Its recursion in the sense that we reenter the same code path,
while holding a lock. The verdict is issued *after* recursing.

A (quite ugly) workaround would be to have ipt_REJECT queue
the packets to a temporary queue and have ipt_do_table call
dst_output() after dropping the lock.



--

From: Stephen Hemminger
Date: Wednesday, April 15, 2009 - 2:57 pm

Yet another alternative version of ip/ip6/arp tables locking using
per-cpu locks.  This avoids the overhead of synchronize_net() during
update but still removes the expensive rwlock in earlier versions.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  The slow case involves acquiring the locks on
all cpu's. This version uses RCU for the table->base reference
but per-cpu-rwlock for counters.

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com

---
 include/linux/netfilter/x_tables.h |    5 -
 net/ipv4/netfilter/arp_tables.c    |  122 +++++++++++--------------------------
 net/ipv4/netfilter/ip_tables.c     |  122 ++++++++++---------------------------
 net/ipv6/netfilter/ip6_tables.c    |  118 ++++++++++-------------------------
 net/netfilter/x_tables.c           |   26 -------
 5 files changed, 112 insertions(+), 281 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-15 08:44:01.449318844 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-15 14:55:27.387131493 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,6 @@ extern void xt_proto_fini(struct net *ne
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
-				    struct xt_table_info *new);
 
 /*
  * This helper is performance critical and must be inlined
--- a/net/ipv4/netfilter/ip_tables.c	2009-04-15 08:44:01.441318723 -0700
+++ b/net/ipv4/netfilter/ip_tables.c	2009-04-15 14:34:42.688131823 -0700
@@ -297,6 +297,8 @@ static void ...
From: David Miller
Date: Wednesday, April 15, 2009 - 4:48 pm

From: Eric Dumazet <dada1@cosmosbay.com>

Grumble, one more barrier to getting rid of rwlocks in the whole
tree. :-/

I really think we should entertain the idea where we don't RCU quiesce
when adding rules.  That was dismissed as not workable because the new
rule must be "visible" as soon as we return to userspace but let's get
real, effectively it will be.

If there are any stale object reference issues, we can use RCU object
destruction to handle that kind of thing.

I almost cringed when the per-spinlock idea was proposed, but per-cpu
rwlocks just takes things too far for my tastes.
--

From: Stephen Hemminger
Date: Wednesday, April 15, 2009 - 5:01 pm

On Wed, 15 Apr 2009 16:48:11 -0700 (PDT)

Hey, we are reinventing your brwlock ;-<

The other option is use spinlock, over a smaller area (only counters),

The counters are the bigger problem, otherwise we could just free table
info via rcu.  Do we really have to support: replace where the counter
values coming out to user space are always exactly accurate, or is it
allowed to replace a rule and maybe lose some counter ticks (worst case

The problem is pulling the counter values out of the object in the
replace case. It could be changed to use some form of counting semaphore

--

From: David Miller
Date: Wednesday, April 15, 2009 - 5:05 pm

From: Stephen Hemminger <shemminger@vyatta.com>

I say this case doesn't matter until someone can prove that it's
any different from the IPTABLES replace operation system call
executing a few microseconds earlier or later.

There really is no difference, and we're making complexity out of
nothing just to ensure something which isn't actually guarenteed right
now.
--

From: Patrick McHardy
Date: Thursday, April 16, 2009 - 5:28 am

Actually I believe it does work right now. Userspace maps the old
counter values to the replacement rules and the kernel adds them
up, so in the end we currently should always have accurate counters,
independant of the exact time when a replacement took place.

--

From: Linus Torvalds
Date: Wednesday, April 15, 2009 - 5:10 pm

Why not just read the counters fromt he old one at RCU free time (they are 
guaranteed to be stable at that point, since we're all done with those 
entries), and apply them at that point to the current setup?

			Linus
--

From: Stephen Hemminger
Date: Wednesday, April 15, 2009 - 5:45 pm

This is an alternative version of ip/ip6/arp tables locking using
per-cpu locks.  This avoids the overhead of synchronize_net() during
update but still removes the expensive rwlock in earlier versions.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  The slow case involves acquiring the locks on
all cpu's. This version uses RCU for the table->base reference
but per-cpu-lock for counters.

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

This version does not do coarse locking or synchronize_net() during
the __do_replace function, so there is a small race which allows for
some of the old counter values to be incorrect (Ncpu -1). Scenario
would be replacing a rule set and the same rules are inflight on other
CPU.  The other CPU might still be looking at the old rules (and
update those counters), after counter values have been captured.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
 include/linux/netfilter/x_tables.h |   11 +--
 net/ipv4/netfilter/arp_tables.c    |  121 +++++++++++--------------------------
 net/ipv4/netfilter/ip_tables.c     |  121 ++++++++++---------------------------
 net/ipv6/netfilter/ip6_tables.c    |  118 +++++++++++-------------------------
 net/netfilter/x_tables.c           |   45 +++++++------
 5 files changed, 137 insertions(+), 279 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-15 08:44:01.449318844 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-15 17:08:35.303217128 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -385,6 +382,12 @@ struct xt_table_info
 	unsigned int hook_entry[NF_INET_NUMHOOKS];
 	unsigned int ...
From: Eric Dumazet
Date: Wednesday, April 15, 2009 - 10:01 pm

This version is a regression over 2.6.2[0-9], because of two points

1) Much more atomic ops :


added on each counter updates.

On many setups, each packet coming in or out of the machine has
to update between 2 to 20 rule counters. So to avoid *one* atomic ops
of read_unlock(), this v4 version adds 2 to 20 atomic ops...


I still not see the problem between the previous version (2.6.2[0-8]) that had a central
 rwlock, that hurted performance on SMP because of cache line ping pong, and the solution
having one rwlock per cpu.

We wanted to reduce the cache line ping pong first. This *is* the hurting point,
by an order of magnitude.

We tried a full RCU solution, it took us three years and we failed.
Lets take an easy solution, before whole replacement of x_table by new Patrick
infrastructure.

Then, if it appears the rwlock itself and its two atomic ops are *really* a problem,
we can go further, but I doubt modern cpus really care about atomic ops on an integer
already hot in L1 cache.

2) Second problem : Potential OOM

About freeing old rules with call_rcu() and/or schedule_work(), this is going
to OOM pretty fast on small appliances with basic firewall setups loading
rules one by one, as done by original topic reporter.

We had reports from guys using linux with 4MB of available ram (French provider free.fr on
their applicance box), and we had to use SLAB_DESTROY_BY_RCU thing on conntrack
 to avoid OOM for their setups. We dont want to use call_rcu() and queue 100 or 200 vfree().


So I prefer your v3 version, even if I didnt tested yet.



--

From: Patrick McHardy
Date: Thursday, April 16, 2009 - 6:53 am

Dave doesn't seem to like the rwlock approach. I don't see a way to
do anything asynchronously like call_rcu() to improve this, so to

We could use this to replace the counters, presuming it is

Agreed.
--

From: Paul E. McKenney
Date: Thursday, April 16, 2009 - 7:47 am

Well, we don't really need an rwlock, especially given that we really
don't want two "readers" incrementing the same counter concurrently.

A safer approach would be to maintain a flag in the task structure
tracking which (if any) of the per-CPU locks you hold.  Also maintain
a recursion-depth counter.  If the flag says you don't already hold
the lock, set it and acquire the lock.  Either way, increment the
recursion-depth counter:

	if (current->netfilter_lock_held != cur_cpu) {
		BUG_ON(current->netfilter_lock_held != CPU_NONE);
		spin_lock(per_cpu(..., cur_cpu));
		current->netfilter_lock_held = cur_cpu;
	}
	current->netfilter_lock_nesting++;

And reverse the process to unlock:

	if (--current->netfilter_lock_nesting == 0) {
		spin_unlock(per_cpu(..., cur_cpu));
		current->netfilter_lock_held = CPU_NONE;

One way to accomplish this is to take Mathieu Desnoyers's user-level
RCU implementation and drop it into the kernel, replacing the POSIX

This is not a real problem be handled by doing a synchronize_rcu() every
so often as noted in a prior email elsewhere in this thread:

	call_rcu(...);
	if (++count > 50) {
		synchronize_rcu();
		count = 0;
	}

This choice of constant would reduce the grace-period pain to 2% of the
full effect, which should be acceptable, at least if I remember the
original problem report of 0.2 seconds growing to 6.0 seconds -- this
would give you:

	(6.0-0.2)/50+0.2 = .316

I would argue that 100 milliseconds is an OK penalty for a deprecated
feature.  But of course the per-CPU lock approach should avoid even that
penalty, albeit at some per-packet penalty.  However, my guess is that
this per-packet penalty is not measureable at the system level.

And if the penalty of a single uncontended lock -is- measureable, I will
be very quick to offer my congratulations, at least once I get my jaw
off my keyboard.  ;-)

							Thanx, Paul
--

From: Eric Dumazet
Date: Thursday, April 16, 2009 - 9:10 am

Yes, you are right, we can avoid rwlock, but use a 'recursive' lock
or spin_trylock()

We can use one counter close to the spinlock, 
no need to add one or two fields to every "thread_info"

struct rec_lock {
	spinlock_t lock;
	int        count;
};
static DEFINE_PER_CPU(struct rec_lock, ip_tables_lock);


I also considered using regular spinlocks and spin_trylock() to "detect"
the recurse case without a global counter.

lock :
local_bh_disable();
int locked = spin_trylock(&__get_cpu_var(arp_tables_lock);

unlock:

if (likely(locked))
	spin_unlock(&__get_cpu_var(arp_tables_lock));
local_bh_enable();

But we would lose some runtime features, I dont feel comfortable about
this trylock version. What others people think ?


Here is the resulting patch, based on Stephen v4

(Not sure we *need* recursive spinlock for the arp_tables, but it seems
better to have an uniform implementation)


[PATCH] netfilter: use per-cpu recursive spinlock (v6)

Yet another alternative version of ip/ip6/arp tables locking using
per-cpu locks.  This avoids the overhead of synchronize_net() during
update but still removes the expensive rwlock in earlier versions.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  The slow case involves acquiring the locks on
all cpu's.

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

We have to use recursive spinlocks because netfilter can sometimes
nest several calls to ipt_do_table() for a given cpu.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/netfilter/x_tables.h |    5 -
 net/ipv4/netfilter/arp_tables.c    |  131 +++++++++------------------
 net/ipv4/netfilter/ip_tables.c     |  130 +++++++++-----------------
 net/ipv6/netfilter/ip6_tables.c    |  127 ...
From: Eric Dumazet
Date: Thursday, April 16, 2009 - 9:20 am

Oh well, this wont work of course, forget about this trylock thing :)

--

From: Linus Torvalds
Date: Thursday, April 16, 2009 - 9:37 am

What the _fuck_ are you doing?

Stop sending these shit-for-brains crazy patches out. That's not a lock, 
that's a messy way of saying "I don't know what the hell I'm doing, but 
I'll mess things up".

Don't do recursive locks (or your own locking primitives in general), but 
goddammit, if you have to, at least know what the hell you're doing. Your 
thing is a piece of shit.

A recursive lock needs an owner, or it's not a lock at all. It's some 
random data structure that contains a lock that may or may not be taken, 
and that may actually _work_ depending on the exact patterns of taking the 
lock, but that's not an excuse.

The fact that code "happens to work by mistake" (and I'm not saying that 
your does - but it might just because of the per-cpu'ness of it, and I'm 
not even going to look at crap like that it closer to try to prove it one 
way or the other) does not make that code acceptable.

Because even if it works today, it's just a bug waiting to happen. The 
thing you did is _not_ a generic recursive lock, and it does _not_ work in 
general. Don't call it a "rec_lock". Don't write code that accesses it 
without any comments as if it was simple. Just DON'T.

Guys, this whole discussion has just been filled with crazy crap. Can 
somebody even explain why we care so deeply about some counters for 
something that we just _deleted_ and that have random values anyway?

I can see the counters being interesting while a firewall is active, but I 
sure don't see what's so wonderfully interesting after-the-fact about a 
counter on something that NO LONGER EXISTS that it has to be somehow 
"exactly right".

And it's certainly not interesting enough to merit this kind of random 
fragile crazy code.

Please. Get a grip, people!

Show of hands, here: tell me a single use that really _requires_ those 
exact counters of a netfilter rule that got deleted and is no longer 
active?

			Linus
--

From: Patrick McHardy
Date: Thursday, April 16, 2009 - 9:59 am

They're copied to userspace after replacing the ruleset, associated with
the rules that are still active after the change and then added to the
current counters in a second operation. The end result is that the

People use netfilter for accounting quite a lot. Having dynamic updates
is also not uncommon, so this might actually matter.


--

From: Paul E. McKenney
Date: Thursday, April 16, 2009 - 10:58 am

Hmmm...

What happens if some other CPU is actually holding the lock?  For

I do not believe that it actually works.

I much prefer your earlier idea of associating a counter with the lock.

But an owner field is also required, please see below.  Or please let me
know what I am missing.


We also need an owner field:


But if some other CPU holds the lock, this code would fail to wait for
that other CPU to release the lock, right?  It also might corrupt the
rl->count field due to two CPUs accessing it concurrently non-atomically.

I suggest the following, preferably in a function or macro or something:

	cur_cpu = smp_processor_id();
	if (likely(rl->owner != cur_cpu) {
		spin_lock(&rl->lock);
		rl->owner = smp_processor_id();
		rl->count = 1;
	} else {
		rl->count++;
	}

And the inverse for unlock.

--

From: Eric Dumazet
Date: Thursday, April 16, 2009 - 11:41 am

If another cpu holds the lock, this cpu will spin on its own lock.

Remember other cpus dont touch rl->count. This is a private field, only touched
by the cpu on its own per_cpu data. There is no possible 'corruption'


So the owner of the per_cpu data does :

/*
 * disable preemption, get rl = &__get_cpu_var(arp_tables_lock);
 * then :
 */
lock_time :
if (++rl->count == 0)
	spin_lock(&rl->lock);

unlock_time:
if (likely(--rl->count == 0))
	spin_unlock(&rl->lock);


while other cpus only do :

spin_lock(&rl->lock);
/* work on data */
spin_unlock(&rl->lock);


Apparently Linus missed it too, and reacted badly to my mail.
I dont know why we discuss of this stuff on lkml either...

I stop working on this subject and consider drinking dome hard stuf and
watching tv :)

See you

--

From: Stephen Hemminger
Date: Thursday, April 16, 2009 - 1:49 pm

This version of x_tables (ip/ip6/arp) locking uses a per-cpu
rwlock that can be nested. It is sort of like earlier brwlock 
(fast reader, slow writer). The locking is isolated so future improvements
can concentrate on measuring/optimizing xt_table_info_lock. I tried
other versions based on recursive spin locks and sequence counters and 
for me, the risk of inventing own locking primitives not worth it at this time.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  This reduces the contention of a
single reader lock (in 2.6.29) without the delay of synchronize_net()
(in 2.6.30-rc2).

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

Lockdep reports bogus warnings on this, so using raw_write_lock
might be necessary.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com

---
 include/linux/netfilter/x_tables.h |   34 +++++++++--
 net/ipv4/netfilter/arp_tables.c    |  110 ++++++++-----------------------------
 net/ipv4/netfilter/ip_tables.c     |  110 +++++++------------------------------
 net/ipv6/netfilter/ip6_tables.c    |  108 ++++++++----------------------------
 net/netfilter/x_tables.c           |   63 ++++++++++++++-------
 5 files changed, 144 insertions(+), 281 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-16 13:40:57.256734671 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-16 13:40:58.858044088 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,35 @@ extern void xt_proto_fini(struct net *ne
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void ...
From: Linus Torvalds
Date: Thursday, April 16, 2009 - 2:02 pm

This is stil scary.

Do we guarantee that read-locks nest in the presense of a waiting writer 
on another CPU? Now, I know we used to (ie readers always nested happily 
with readers even if there were pending writers), and then we broke it. I 
don't know that we ever unbroke it.

IOW, at least at some point we deadlocked on this (due to trying to be 
fair, and not lettign in readers while earlier writers were waiting):

	CPU#1			CPU#2

	read_lock

				write_lock
				.. spins with write bit set, waiting for
				   readers to go away ..

	recursive read_lock
	.. spins due to the write bit
	   being. BOOM: deadlock  ..

Now, I _think_ we avoid this, but somebody should double-check.

Also, I have still yet to hear the answer to why we care about stale 
counters of dead rules so much that we couldn't just free them later with 
RCU.

			Linus
--

From: Ingo Molnar
Date: Thursday, April 16, 2009 - 4:04 pm

This is a narrow special-case where the spin-rwlock is safe, and the 
rwsem is unsafe.

But it should work for rwlocks - it always worked and the networking 
code always relied on that AFAIK.

Here's the x86 assembly code of the write-side slowpath:

ENTRY(__write_lock_failed)
        CFI_STARTPROC
        LOCK_PREFIX
        addl $RW_LOCK_BIAS,(%rdi)
1:      rep
        nop
        cmpl $RW_LOCK_BIAS,(%rdi)
        jne 1b
        LOCK_PREFIX
        subl $RW_LOCK_BIAS,(%rdi)
        jnz  __write_lock_failed
        ret
        CFI_ENDPROC

the fastpath decreased the value with RW_LOCK_BIAS, and when we 
enter this function we undo that effect by adding RW_LOCK_BIAS. Then 
we spin (without signalling our write-intent) passively until the 
count reaches RW_LOCK_BIAS. Then we try to lock it again and bring 
it to zero (meaning no other readers or writers - we got the lock).

This is pretty much the most unfair strategy possible for writers - 
but this is how rwlocks always behaved - and they do so mostly for 
recursive use within networking.

This is why the tasklist_lock was always so suspect to insane 
starvation symptoms on really large SMP systems, and this is why 
write_lock_irq(&tasklist_lock) was always a dangerous operation to 
do. (it can spin for a _long_ time with irqs off.)

It's not the most optimal of situations. Some patches are in the 
works to fix the irqs-off artifact (on ia64 - no x86 patches yet 
AFAICS) - but that's just papering it over.

	Ingo
--

From: Paul E. McKenney
Date: Thursday, April 16, 2009 - 5:13 pm

Encapsulating them so that they appear in the same place might (or might
not) have gotten the fact that you were not doing a recursive lock
through my head.  Even so, the name "rec_lock" might have overwhelmed

That -is- extreme!  ;-)

--

From: Patrick McHardy
Date: Thursday, April 16, 2009 - 6:11 am

We need the counters immediately to copy them to userspace, so waiting
for an asynchronous RCU free is not going to work.
--

From: David Miller
Date: Thursday, April 16, 2009 - 3:33 pm

From: Patrick McHardy <kaber@trash.net>

It just occurred to me that since all netfilter packet handling
goes through one place, we could have a sort-of "netfilter RCU"
of sorts to solve this problem.
--

From: Paul E. McKenney
Date: Thursday, April 16, 2009 - 4:49 pm

OK, I am putting one together...

It will be needed sooner or later, though I suspect per-CPU locking
would work fine in this case.

							Thanx, Paul
--

From: Stephen Hemminger
Date: Thursday, April 16, 2009 - 4:52 pm

This version of x_tables (ip/ip6/arp) locking uses a per-cpu
recursive lock that can be nested. It is sort of like existing kernel_lock,
rwlock_t and even old 2.4 brlock.

"Reader" is ip/arp/ip6 tables rule processing which runs per-cpu.
It needs to ensure that the rules are not being changed while packet
is being processed.

"Writer" is used in two cases: first is replacing rules in which case
all packets in flight have to be processed before rules are swapped,
then counters are read from the old (stale) info. Second case is where
counters need to be read on the fly, in this case all CPU's are blocked
from further rule processing until values are aggregated.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  This reduces the contention of a
single reader lock (in 2.6.29) without the delay of synchronize_net()
(in 2.6.30-rc2). 


The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

Future optimizations possible:
  - Lockdep doesn't really handle this well
  - hot plug CPU case, if kernel is built with large # of CPU's, skip
    the inactive ones; migrate values when CPU is removed.
  - reading counters could be incremental by CPU.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com

---
 include/linux/netfilter/x_tables.h |   10 +--
 net/ipv4/netfilter/arp_tables.c    |  110 ++++++++----------------------------
 net/ipv4/netfilter/ip_tables.c     |  110 +++++++-----------------------------
 net/ipv6/netfilter/ip6_tables.c    |  108 ++++++++---------------------------
 net/netfilter/x_tables.c           |  113 ++++++++++++++++++++++++++++++-------
 5 files changed, 170 insertions(+), 281 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-16 15:09:53.082406828 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-16 15:10:17.154966874 -0700
@@ -354,9 +354,6 @@ struct ...
From: Jeff Chua
Date: Thursday, April 16, 2009 - 5:15 pm

On Fri, Apr 17, 2009 at 7:52 AM, Stephen Hemminger

Tested and working. As fast as before.

Thanks,
Jeff.
--

From: Peter Zijlstra
Date: Thursday, April 16, 2009 - 10:55 pm

Quite so, this is the old MAX_LOCK_DEPTH < NR_CPUS issue for large
systems.

Last time this came up David found another way of solving the problem.
Not having fully read this thread, I cannot suggest one myself -- except
that RCU domains as suggested by David sound good.



--

From: Eric Dumazet
Date: Thursday, April 16, 2009 - 11:03 pm

I like this version 8 of the patch, as it mixes all ideas we had,
but have two questions.

Previous netfilter code (and 2.6.30-rc2 one too) disable BH, not only preemption.

I see xt_table_info_lock_all(void) does block BH, so this one is safe.

I let Patrick or other tell us if its safe to run ipt_do_table()
with preemption disabled but BH enabled, I really dont know.

Also, please dont call this a 'recursive lock', since it is not a general
recursive lock, as pointed by Linus and Paul.

Second question is about MAX_LOCK_DEPTH

Why dont use this kind of construct to get rid of this limit ?


--

From: Eric Dumazet
Date: Thursday, April 16, 2009 - 11:14 pm

From: Peter Zijlstra
Date: Friday, April 17, 2009 - 10:08 am

Very good point, so 256 nested spin_lock() instances will make the
kernel unhappy -- since we now (almost?) support up to 4096 cpus, this
seems like a no-no.

--

From: Patrick McHardy
Date: Friday, April 17, 2009 - 4:17 am

No, on jumps the return position is stored in the per-cpu copy
of the ruleset and we must prevent BH context corrupting the
value of something running in process context.
--

From: Paul E. McKenney
Date: Thursday, April 16, 2009 - 6:28 pm

And here is a crude first cut.  Untested, probably does not even compile.

Straight conversion of Mathieu Desnoyers's user-space RCU implementation
at git://lttng.org/userspace-rcu.git to the kernel (and yes, I did help
a little, but he must bear the bulk of the guilt).  Pick on srcu.h
and srcu.c out of sheer laziness.  User-space testing gives deep
sub-microsecond grace-period latencies, so should be fast enough, at
least if you don't mind two smp_call_function() invocations per grace
period and spinning on each instance of a per-CPU variable.

Again, I believe per-CPU locking should work fine for the netfilter
counters, but I guess "friends don't let friends use hashed locks".
(I would not know for sure, never having used them myself, except of
course to protect hash tables.)

Most definitely -not- for inclusion at this point.  Next step is to hack
up the relevant rcutorture code and watch it explode on contact.  ;-)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 include/linux/srcu.h |   30 ++++++++++++++++++++++++
 kernel/srcu.c        |   63 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index aca0eee..4577cd0 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -50,4 +50,34 @@ void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
 void synchronize_srcu(struct srcu_struct *sp);
 long srcu_batches_completed(struct srcu_struct *sp);
 
+/* Single bit for grace-period index, low-order bits are nesting counter. */
+#define RCU_FGP_COUNT		1UL
+#define RCU_FGP_PARITY		(1UL << (sizeof(long) << 2))
+#define RCU_FGP_NEST_MASK	(RCU_FGP_PARITY - 1)
+
+extern long rcu_fgp_ctr;
+DECLARE_PER_CPU(long, rcu_fgp_active_readers);
+
+static inline void rcu_read_lock_fgp(void)
+{
+	long tmp;
+	long *uarp;
+
+	preempt_disable();
+	uarp = &__get_cpu_var(rcu_fgp_active_readers);
+	tmp = *uarp;
+	if (likely(!(tmp & ...
From: Mathieu Desnoyers
Date: Thursday, April 16, 2009 - 7:19 pm

I'm innocent, I swear ;-)

That should give very impressive performance results.


OK, so we are translating the original implementation from per-thread to
per-cpu, with preemption disabled. Fine with me if we can't afford the
per-thread unsigned long nor can't afford to iterate on each thread when

ACCESS_ONCE(rcu_fgp_ctr) could not hurt here I think. Given the
surrounding code, that does not seem like a necessity, but that would

I kind of expect an IPI with a smp_mb() to promote this barrier() to a
smp_mb() when the update side needs to wait for a quiescent state. I


Saying why we populate the value 1 here (RCU_FGP_COUNT) as an

I would be tempted to add a comment here telling hot cpu hotunplug
cannot let us wait forever, given all read-side critical sections we can
be busy-waiting for are required to have preemption disabled, and are

Ah, here it is. Commenting that it matches the two barrier()s I identified

/*
 * Call a function on all other processors
 */
int smp_call_function(void(*func)(void *info), void *info, int wait);

I guess you meant on_each_cpu ? That should include "self", given we

Same as above.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Paul E. McKenney
Date: Thursday, April 16, 2009 - 10:05 pm

I wouldn't expect more than about three or four orders of magnitude


I believe that it is safe.  Only one bit of rcu_fgp_ctr ever changes,
so we should be immune from load tearing.  We only load it once and
only do one thing with it, and we have a barrier() before (as part
of preempt_disable()) and after, so I don't think that the compiler
has much latitude here.  In theory, we could get store tearing through
*uarp, but if gcc did that, much of the kernel would go down in flames.

In contrast, in the user-mode version, there was no barrier() on entry,






Hmmm...  Why do we need "self"?  Because synchronize_rcu_fgp() blocks,
it had jolly well better not be within a read-side critical section.



Yeah, I never was all that good at disguising code anyway.  But I did
keep a couple of changes.  ;-)

Updated patch below.

							Thanx, Paul

------------------------------------------------------------------------

And here is a crude second cut.  Untested, probably does not even compile.

Straight conversion of Mathieu Desnoyers's user-space RCU implementation
at git://lttng.org/userspace-rcu.git to the kernel (and yes, I did help
a little, but he must bear the bulk of the guilt).  Pick on srcu.h
and srcu.c out of sheer laziness.  User-space testing gives deep
sub-microsecond grace-period latencies, so should be fast enough, at
least if you don't mind two smp_call_function() invocations per grace
period and spinning on each instance of a per-CPU variable.

Again, I believe per-CPU locking should work fine for the netfilter
counters, but I guess "friends don't let friends use hashed locks".
(I would not know for sure, never having used them myself, except of
course to protect hash tables.)

Most definitely -not- for inclusion at this point.  Next step is to hack
up the relevant rcutorture code and watch it explode on contact.  ;-)

Changes since v1:

o	Applied Mathieu's feedback.

o	Added docbook headers and other comments.

o	Added the ...
From: Mathieu Desnoyers
Date: Thursday, April 16, 2009 - 10:44 pm

I mean that I think we also need some smp_mb()s on the writer side,
don't we ? If we want the changes done by the writer (assign pointer) to
be shown to the readers before the writer starts flipping the parity, a
smp_mb() is needed at the beginning of synchronize_rcu_fgp() (actually
at the same location where you call the rcu_fgp_do_mb ipis), same at the
end (so we order parity flipping with the next assign pointer).

Or maybe it's getting late and I am missing the obvious.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Paul E. McKenney
Date: Friday, April 17, 2009 - 7:51 am

The smp_call_function() itself must have barriers in order to ensure
that the other CPUs see the updates to its parameter block.

But see my upcoming response to Dave and Peter.

--

From: Stephen Hemminger
Date: Thursday, April 16, 2009 - 9:50 pm

On Thu, 16 Apr 2009 18:28:12 -0700

I am glad to see this worked on, but would rather not use RCU in this case
of iptables. It would be good for some of the other long grace period sutff.

The code to per-cpu entry consolidation by alloc/flip in 2.6.30-rc2 was
hard to debug and more convoluted so it probably would be a long term maintaince
nightmare.  The issue was the variable size skip structure so it made
for lots of iterators, etc. If the non-RCU per-cpu spinlock version is just
as fast, it is easier to understand.

--

From: Paul E. McKenney
Date: Thursday, April 16, 2009 - 10:08 pm

Agreed, as noted above.  Mostly just getting tired of people complaining
about long grace periods.  Again, this patch cannot replace standard RCU

Your per-CPU-lock patch looked more straightforward to me than did the
RCU patch.

							Thanx, Paul
--

From: Eric Dumazet
Date: Thursday, April 16, 2009 - 10:16 pm

I agree that for 2.6.30, we could use a per-cpu spinlock as your last patch did,
this would be very risky to push this new RCU right now. 

But this new stuff looks very promising, (no more locked ops on fast path),
and considering new percpu_{add|sub...} infra, very fast :

static inline void rcu_read_unlock_fgp(void)
{
	barrier();
	percpu_sub(rcu_fgp_active_readers, 1); /* one instruction on x86 */
	preempt_enable();
}

I wonder if IPI are really necessary on x86 if we use percpu_sub() since
it already contains a barrier, and rcu_read_lock_fgp(void) also ends with
a barrier() call...

--

From: Paul E. McKenney
Date: Thursday, April 16, 2009 - 10:40 pm

Very cool!!!  If I had seen this, I had forgotten about it.  I will
give it a try, but only after getting it working the old way.  (What,

Hmmmm...  But x86 can still execute a later load before an earlier
store, so it seems to me that there would be the potential for even
an x86 CPU to pull loads from the critical section up before the final
store of the percpu_sub(), right?  If so, we really do still need the
IPIs on x86.

							Thanx, Paul
--

From: David Miller
Date: Friday, April 17, 2009 - 1:07 am

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

I don't understand why we're writing such complicated code.

Oh I see why, it's because not every arch uses the generic SMP helpers
yet :-)

Because if they did universally, we could solve this problem so
simply, by merely sending a remote softirq to every online cpu.  Once
those all complete we have enough of a quiesce period, every cpu must
have exited any netfilter packet processing code path they were in.

And we could know they complete using an atomic counter or something.
--

From: Paul E. McKenney
Date: Friday, April 17, 2009 - 8:00 am

I was with you until you got to the atomic counter, which would require
dealing with CPU hotplug.

But your point is a very good one.  We already do have a flavor of
RCU that waits for softirq code, namely rcu-bh.  And it is used
exclusively by the networking code:

     14 net/ipv4/
     13 net/decnet/
     10 net/core/
      6 net/ipv6/
      4 kernel/		[rcutorture, so these four uses don't count.]
      3 net/netfilter/
      2 net/mac80211/
      2 net/packet/

So both yours and Peter's implicit point is quite correct -- the kernel
really does not need yet another flavor of RCU.  So maybe I should instead
be thinking in terms of making the existing rcu-bh be better adapted to
the networking code, like maybe a fast synchronize_rcu_bh().

Or do the networking uses of rcu-bh need it to work exactly the way that
it does now?

							Thanx, Paul

kernel/rcutorture.c __acquires 392 rcu_read_lock_bh();
kernel/rcutorture.c __releases 398 rcu_read_unlock_bh();
kernel/rcutorture.c rcu_bh_torture_deferred_free 408 call_rcu_bh(&p->rtort_rcu, rcu_torture_cb);
kernel/rcutorture.c rcu_bh_torture_synchronize 429 call_rcu_bh(&rcu.head, rcu_bh_torture_wakeme_after_cb);
net/core/dev.c dev_queue_xmit 1844 rcu_read_lock_bh();
net/core/dev.c dev_queue_xmit 1909 rcu_read_unlock_bh();
net/core/dev.c dev_queue_xmit 1915 rcu_read_unlock_bh();
net/core/filter.c sk_filter 88 rcu_read_lock_bh();
net/core/filter.c sk_filter 95 rcu_read_unlock_bh();
net/core/filter.c sk_filter_delayed_uncharge 477 call_rcu_bh(&fp->rcu, sk_filter_rcu_release);
net/core/filter.c sk_attach_filter 517 rcu_read_lock_bh();
net/core/filter.c sk_attach_filter 520 rcu_read_unlock_bh();
net/core/filter.c sk_detach_filter 532 rcu_read_lock_bh();
net/core/filter.c sk_detach_filter 539 rcu_read_unlock_bh();
net/decnet/dn_route.c dnrt_free 148 call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
net/decnet/dn_route.c dnrt_drop 154 call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
net/decnet/dn_route.c __dn_route_output_key 1161 ...
From: Peter Zijlstra
Date: Friday, April 17, 2009 - 10:22 am

Since its a full broadcast, we can do that _today_ using on_each_cpu().
But whatever way we turn this, this will be a very expensive operation.
Imagine doing that on your 256-way for every iptables rules change.



--

From: Linus Torvalds
Date: Friday, April 17, 2009 - 10:32 am

Well, you _could_ just have a per-CPU bit of "have I used nf rules since 
the last update", and skip those CPU's. Use memory ordering to check the 
bits (set the bit _before_ looking up a NF rule, and check them _after_ 
doing the update, and have a barrier in between if you really think it 
matters).

Remember: the cost was never about a single filter rule update. The cost 
of a single one is almost immaterial, as long as it's not in hundreds of 
milliseconds. It's the cost of people building up things incrementally 
that caused this thing.

So if you have 200 "iptables" commands in a sequence, and especially 
during bootup, a trivial "has the old rule been ever even looked at on 
this CPU" would already fix the issue. Because it would always be zero in 
the only case where it matters.

This is, of course, what we do for the TLB flushing issue. We don't want 
to send IPI's to all CPU's, and in 99.999% of all cases we don't need to, 
because the other CPU's never even loaded the MM.

			Linus
--

From: Peter Zijlstra
Date: Thursday, April 16, 2009 - 11:12 pm

One comment, its again a global thing..

I've been playing with the idea for a while now to make all RCU
implementations into proper objects so that you can do things like:

  struct atomic_rcu_domain my_rcu_domain = create_atomic_rcu();

  atomic_rcu_read_lock(&my_rcu_domain());
  ...

  atomic_rcu_read_unlock(&my_rcu_domain());

and

  call_atomic_rcu(&my_rcu_domain, &my_obj->rcu_head, do_something);

etc..

We would have:

  atomic_rcu  --  'classic' non preemptible RCU (treercu these days)
  sleep_rcu   --  'preemptible' RCU

Then have 3 default domains:

sched_rcu     -- always atomic_rcu
rcu           -- depends on PREEMPT_RCU
preempt_rcu   -- always sleep_rcu

This would allow generic code to:
  1) use preemptible RCU for those cases where needed
  2) create smaller RCU domains where needed, such as in this case
  3) mostly do away with SRCU

Now I realize that the presented RCU implementation has a different
grace period method than the existing ones that use the timer tick to
drive the state machine, so 2) might not be too relevant here. But maybe
we can do something with different grace periods too.

Anyway, just an idea because I always get a little offended at the hard
coded global variables in all these RCU implementations :-)



--

From: Paul E. McKenney
Date: Friday, April 17, 2009 - 9:33 am

I guess that this one could allow sleeping on mutexes...  Does anyone

#3 would be good!  But...

At an API level, there are two differences between SRCU and the other
RCU implementations:

a.	The return value from srcu_read_lock() is passed to
	srcu_read_unlock().

b.	There is a control block passed in to each SRCU primitive.

Difference (a) could potentially be taken care of with a few tricks I
am trying in the process of getting preemptrcu merged into treercu.

Your approach to (b) certainly makes it uniform, there are >500
occurrences of rcu_read_lock() and rcu_read_unlock() each, but only
a very few occurrences of srcu_read_lock() and srcu_read_unlock()
(like exactly one each!).  So adding an argument to rcu_read_lock()

I am thinking in terms of adding a synchronize_rcu_bh() with the desired
properties.  That way we avoid yet another RCU flavor.  (What can I say?
I got carried away!)  Also, since the rcu-bh flavor is used only by
networking, we have a fair amount of freedom to tweak it.  It will take
longer than introducing a new flavor, but Steve Hemminger has a good
solution already, and RCU really isn't the thing to do quick hacks on.

							Thanx, Paul
--

From: Peter Zijlstra
Date: Friday, April 17, 2009 - 9:51 am

I almost did a few times, but never quite got the code that needed it

Right, incrementing one cpu and decrementing another doesn't change the

static inline void rcu_read_lock(void)
{
	atomic_rcu_read_lock(&global_atomic_rcu_context);
}

static inline void rcu_read_unlock(void)
{
	atomic_rcu_read_unlock(&global_atomic_rcu_context);
}

static inline void call_rcu(struct rcu_head *rcuhead, void (*func)(struct rcu_head *))
{
	call_atomic_rcu(&global_atomic_rcu_context, rcuhead, func);
}


Right. I was thinking along the way of providing a watermark (either
nr_queued based, or time based) and once it exceeds try to drive it from
read_unlock. Or similar. unlock driven RCU implementations have the best

Ok, back on topic :-)

I wouldn't exactly call it a good solution, it does a
for_each_possible_cpu() spin_lock();

 1) that should probably be for_each_online_cpu()

 2) that doesn't scale at all, I'm sure dave's 256-way hurts like mad
    when inserting tons of rules and we do that for every iptable
    modification.

 3) there is no way lockdep can track all that :(

Do we _really_ _really_ __HAVE__ to serialize this? So far I've heard
Patrick say: there might be, a use case. That doesn't sound like: we
should make it dead slow for everybody else.

--

From: Paul E. McKenney
Date: Friday, April 17, 2009 - 2:29 pm

It probably would not be hard to enable preemptable RCU in a
!CONFIG_PREEMPT configuration, which would allow mutexes to be acquired
in these read-side critical sections.  After I fix any relevant bugs,

Well, I am trying to get rid of the summing over all CPUs -- really hard

Given that for classic and hierarchical RCU, rcu_read_lock() and
rcu_read_unlock() just map to preempt_disable() and preempt_enable(),

Jim Houston did an unlock-driven implementation some years back:

http://marc.theaimsgroup.com/?l=linux-kernel&m=109387402400673&w=2

The read-side overhead can be a problem.  And I have gotten grace-period
latencies under 100ns without driving the grace period from the update


In principle I agree.  In practice, this is an infrequently executed
slow path, right?

Or are you concerned about real-time latencies while loading new

There of course is a point beyond which this method is slower than
a full RCU grace period.  But I bet that Dave's 256-way machine
is not anywhere near big enough to reach that point.  Maybe he can

This is a good point.  I understand the need to acquire the locks, but
am not fully clear on why we cannot acquire one CPU's lock, gather its
counters, release that lock, acquire the next CPU's lock, and so on.
Maybe a code-complexity issue?

Please keep in mind that we are trying to hit 2.6.30 with this fix, so
simplicity is even more important than it usually be.  Yes, I have some

We are making it faster than it used to be by quite a bit by getting rid
of the global lock, so this does sound like a good approach.  Here is my
reasoning:

1.	The update-side performance is good, as verified by Jeff Chua.

2.	The per-packet read-side performance is slowed by roughly the
	overhead of an uncontended lock, which comes to about 60ns
	on my laptop.  At some point, this 60ns will become critical,
	but I do not believe that we are there yet.

	When it does become critical, a new patch can be produced.
	Such a patch can of course be backported ...
From: Evgeniy Polyakov
Date: Saturday, April 18, 2009 - 2:40 am

Hi.


Shouldn't it be rcu_fgp_active_readers - RCU_FGP_COUNT?
Although it is 1 by definition, it is more clear when understanding
what's going on here.

-- 
	Evgeniy Polyakov
--

From: Paul E. McKenney
Date: Saturday, April 18, 2009 - 7:14 am

Excellent point, fixed!

						Thanx, Paul
--

From: Stephen Hemminger
Date: Monday, April 20, 2009 - 10:34 am

This version of x_tables (ip/ip6/arp) locking uses a per-cpu
recursive lock that can be nested. It is sort of like existing kernel_lock,
rwlock_t and even old 2.4 brlock.

"Reader" is ip/arp/ip6 tables rule processing which runs per-cpu.
It needs to ensure that the rules are not being changed while packet
is being processed.

"Writer" is used in two cases: first is replacing rules in which case
all packets in flight have to be processed before rules are swapped,
then counters are read from the old (stale) info. Second case is where
counters need to be read on the fly, in this case all CPU's are blocked
from further rule processing until values are aggregated.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  This reduces the contention of a
single reader lock (in 2.6.29) without the delay of synchronize_net()
(in 2.6.30-rc2). 

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com

---
Changes from earlier patches.
  - function name changes
  - disable bottom half in info_rdlock

These should still be addressed but beyond the scope of the problem
  - lockdep mapping; really a tradeoff between LOCKDEP special clutter
    and clarity
  - Figure out how to stop sparse warning
  - hot plug CPU case, if kernel is built with large # of CPU's, skip
    the inactive ones; migrate values when CPU is removed.

 include/linux/netfilter/x_tables.h |   10 +--
 net/ipv4/netfilter/arp_tables.c    |  110 +++++++---------------------------
 net/ipv4/netfilter/ip_tables.c     |  110 +++++++---------------------------
 net/ipv6/netfilter/ip6_tables.c    |  108 +++++++---------------------------
 net/netfilter/x_tables.c           |  117 ++++++++++++++++++++++++++++++-------
 5 files changed, 174 insertions(+), 281 deletions(-)

--- ...
From: Paul E. McKenney
Date: Monday, April 20, 2009 - 11:21 am

This looks good to me!

But I have to ask the stupid question...  Would it be possible for
the "write" side to acquire the locks one at a time, accumulating and
resetting the counts for that CPU, then advancing to the next CPU?

If this is possible, then lockdep is much happier and the disruption
to readers is much less.

--

From: Eric Dumazet
Date: Monday, April 20, 2009 - 11:25 am

OK, but we still have a problem on machines with >= 250 cpus,
because calling 250 times spin_lock() is going to overflow preempt_count,
as each spin_lock() increases preempt_count by one.

PREEMPT_MASK: 0x000000ff

add_preempt_count() should warn us about this overflow if CONFIG_DEBUG_PREEMPT is set

#ifdef CONFIG_DEBUG_PREEMPT
        /*
         * Spinlock count overflowing soon?
         */
        DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
                                PREEMPT_MASK - 10);
#endif


My suggestion (in a previous mail) was to call preempt_disable() after each spin_lock(),
 
	


--

From: Stephen Hemminger
Date: Monday, April 20, 2009 - 1:32 pm

On Mon, 20 Apr 2009 20:25:14 +0200

Ok, not that I have one of those.

The problem which lockdep has is that it seems to associate all the
locks with same name.
--

From: Stephen Hemminger
Date: Monday, April 20, 2009 - 1:42 pm

On Mon, 20 Apr 2009 20:25:14 +0200

Wouldn't 256 or higher CPU system be faster without preempt?  If there are that many
CPU's, it is faster to do the work on other cpu and avoid the overhead of a hotly
updated preempt count.
--

From: Paul E. McKenney
Date: Monday, April 20, 2009 - 2:05 pm

The preempt count is maintained per-CPU, so has low overhead.  The
problem is that for CONFIG_PREEMPT builds, the preempt disabing is
built into spin_lock().

							Thanx, Paul
--

From: Paul Mackerras
Date: Monday, April 20, 2009 - 2:23 pm

Huh?  Each cpu has its own separate preempt_count.

Paul.
--

From: Paul E. McKenney
Date: Monday, April 20, 2009 - 2:58 pm

But a single CPU is acquiring one lock per CPU, so all the increments
are to one CPU's preempt_count.  :-(

							Thanx, Paul
--

From: Paul Mackerras
Date: Monday, April 20, 2009 - 3:41 pm

OK, I see, so a task can't take more than 255 spinlocks without
overflowing the preempt count, which seems a bit limiting.

There are 6 free bits in the preempt_count currently, so the preempt
count could be expanded to 14 bits, which would be enough for all
current systems.  Beyond that I guess we could make preempt_count be a
long and allow bigger counts on 64-bit architectures.

Paul.
--

From: Stephen Hemminger
Date: Monday, April 20, 2009 - 4:01 pm

This version of x_tables (ip/ip6/arp) locking uses a per-cpu
recursive lock that can be nested. It is sort of like existing kernel_lock,
rwlock_t and even old 2.4 brlock.

"Reader" is ip/arp/ip6 tables rule processing which runs per-cpu.
It needs to ensure that the rules are not being changed while packet
is being processed.

"Writer" is used in two cases: first is replacing rules in which case
all packets in flight have to be processed before rules are swapped,
then counters are read from the old (stale) info. Second case is where
counters need to be read on the fly, in this case all CPU's are blocked
from further rule processing until values are aggregated.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  This reduces the contention of a
single reader lock (in 2.6.29) without the delay of synchronize_net()
(in 2.6.30-rc2). 

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com

---
CHANGES 
  - optimize for UP
  - disable bottom half in info_rdlock
  - prevent preempt count overflow
  - turn off lockdep in writer to avoid bogus warning
  - optimize unlock_bh

TODO
  - Figure out how to stop sparse warnings
  - hot plug CPU case, if kernel is built with large # of CPU's, skip
    the inactive ones; migrate values when CPU is removed.


 include/linux/netfilter/x_tables.h |   19 +++--
 net/ipv4/netfilter/arp_tables.c    |  110 ++++++-----------------------
 net/ipv4/netfilter/ip_tables.c     |  110 ++++++-----------------------
 net/ipv6/netfilter/ip6_tables.c    |  108 ++++++----------------------
 net/netfilter/x_tables.c           |  138 +++++++++++++++++++++++++++++++------
 5 files changed, 204 insertions(+), 281 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-20 13:57:28.281199339 -0700
+++ ...
From: Lai Jiangshan
Date: Monday, April 20, 2009 - 8:41 pm

Maybe I missed something. I think softirq may be still enabled here.

Is this OK for you:

void xt_info_rdlock_bh(void)
{
	struct xt_info_lock *lock;

	local_bh_disable();
	lock = &__get_cpu_var(xt_info_locks);
	if (likely(++lock->depth == 0))
		spin_lock(&lock->lock);
}

Lai.

--

From: Eric Dumazet
Date: Monday, April 20, 2009 - 8:56 pm

well, first time its called, you are right softirqs are enabled until

After this line, both softirqs and preempt are disabled.

Future calls to this function temporarly raise preemptcount and decrease it.

well, Stephen was trying to not change preempt count for the 2nd, 3rd, 4th?... invocation of this function.


Thanks for reviewing Lai

--

From: Stephen Hemminger
Date: Monday, April 20, 2009 - 9:15 pm

On Tue, 21 Apr 2009 05:56:55 +0200

In this version, I was trying to use/preserve the optimizations that
are done in spin_unlock_bh().
--

From: Lai Jiangshan
Date: Monday, April 20, 2009 - 10:22 pm

xt_info_rdlock_bh() called recursively here will enter the
critical region without &__get_cpu_var(xt_info_locks)->lock.

Because xt_info_rdlock_bh() called recursively here sees

Sorry for it.
Is this OK:

void xt_info_rdlock_bh(void)
{
	struct xt_info_lock *lock;

	local_bh_disable();
	lock = &__get_cpu_var(xt_info_locks);
	if (likely(++lock->depth == 0))
		spin_lock(&lock->lock);
	else
		local_bh_enable();
}

I did not think things carefully enough, and I do know
nothing about ip/ip6/arp.

Lai



--

From: Stephen Hemminger
Date: Monday, April 20, 2009 - 10:45 pm

On Tue, 21 Apr 2009 13:22:27 +0800

NO spin_lock_bh always does a preempt_disable

  xt_info_rdlock_bh            (depth = -1)
+1         preempt_disable
           spin_lock_bh
+1             preempt_disable
-1         preempt_enable_no_resched
---
+1

Second call preempt_count=1   (depth = 0)
       xt_info_rdlock_bh
+1         preempt_disable
-1         preempt_enable_no_resched
---

Result is preempt_count=1 (depth = 1)


Now lets do unlocks
       xt_info_rdunlock_bh  preempt_count=1 depth=1
          does nothing
   xt_info_rdunlock_bh      preempt_count=1 depth = 0
-1      spin_unlock_bh

Resulting preempt_count=0  depth = -1



--

From: Lai Jiangshan
Date: Monday, April 20, 2009 - 11:52 pm

I think I had made you mistook my email.

The preempt_count is correct, but I thought about lock-save:
we must hold &__get_cpu_var(xt_info_locks)->lock
when we enter the read-side critical region.

--------------------
xt_info_rdlock_bh() (depth = -1)
  preempt_disable()
  depth++
  ==========>interrupt here
  ==========>
  ==========>xt_info_rdlock_bh() (depth = 0)
  ==========>  preempt_disable()
  ==========>  depth++
  ==========>  preempt_enable_no_resched()
  ==========>
  ==========>enter the read-side critical region *without* lock.
  ==========>  it may get trashy data.
  ==========>
  ==========>xt_info_rdunlock_bh()
  ==========>
  ==========>interrupt return.
  spin_lock_bh()
  preempt_enable_no_resched()

enter the read-side critical region *with* lock.

xt_info_rdunlock_bh().
----------------------



----------
Is this OK? (Now I suppose we can enter the read-side critical region
in irq context)

void xt_info_rdlock_bh(void)
{
	unsigned long flags;
	struct xt_info_lock *lock;

	local_irq_save(flags);
	lock = &__get_cpu_var(xt_info_locks);
	if (likely(++lock->depth == 0))
		spin_lock_bh(&lock->lock);
	local_irq_restore(flags);
}

Lai

--

From: Evgeniy Polyakov
Date: Tuesday, April 21, 2009 - 1:16 am

Hi.


Netfilter as long as other generic network pathes are never accessed
from interrupt context, but your analysis looks right for the softirq
case.

Stephen, should preempt_disable() be replaced with local_bh_disable() to
prevent softirq to race on the same cpu for the lock's depth field? Or
can it be made atomic?

-- 
	Evgeniy Polyakov
--

From: Lai Jiangshan
Date: Tuesday, April 21, 2009 - 1:42 am

A question:

softirq is always not nesting. Why we need recursive lock?

Lai.

--

From: David Miller
Date: Tuesday, April 21, 2009 - 1:49 am

From: Lai Jiangshan <laijs@cn.fujitsu.com>

Netfilter itself, is nesting.

When using bridging netfilter, iptables can be entered twice
in the same call chain.
--

From: Eric Dumazet
Date: Tuesday, April 21, 2009 - 1:55 am

Maybe just dont care about calling several time local_bh_disable()
(since we were doing this in previous kernels anyway, we used to call read_lock_bh())

This shortens fastpath, is faster than local_irq_save()/local_irq_restore(),
and looks better.

void xt_info_rdlock_bh(void)
{
	struct xt_info_lock *lock;

	local_bh_disable();
 	lock = &__get_cpu_var(xt_info_locks);
 	if (likely(++lock->depth == 0))
 		spin_lock(&lock->lock);
}

void xt_info_rdunlock_bh(void)
{
	struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks);

	BUG_ON(lock->depth < 0);
	if (likely(--lock->depth < 0))
		 spin_unlock(&lock->lock);
	local_bh_enable();
}



--

From: Evgeniy Polyakov
Date: Tuesday, April 21, 2009 - 2:22 am

Yeah, given that non-nested locking is more likely condition, it will be

-- 
	Evgeniy Polyakov
--

From: Lai Jiangshan
Date: Tuesday, April 21, 2009 - 2:34 am

David said:
Netfilter itself, is nesting.

When using bridging netfilter, iptables can be entered twice
in the same call chain.

And Stephen said:
In this version, I was trying to use/preserve the optimizations that
are done in spin_unlock_bh().

So:

void xt_info_rdlock_bh(void)
{
	struct xt_info_lock *lock;

	preempt_disable();
 	lock = &__get_cpu_var(xt_info_locks);
 	if (likely(lock->depth < 0))
 		spin_lock_bh(&lock->lock);
	/* softirq is disabled now */
	++lock->depth;
	preempt_enable_no_resched();
}

xt_info_rdunlock_bh() is the same as v11.

--

From: Lai Jiangshan
Date: Monday, April 20, 2009 - 10:34 pm

Which context can enter the critical region?
Can irq and softirq? or softirq only?

--

From: Eric Dumazet
Date: Monday, April 20, 2009 - 9:59 pm

I reviewed this patch believe its in quite good shape, thanks Stephen.

Then I tested it on a x86_32 8 cpus machine and got no obvious problem.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Hopefully, next rcu_bh (or whatever name is used) will permit us
to switch back to pure RCU in 2.6.31
 
oprofile snapshot of a tbench session, with light iptables rules.
(4 rules in INPUT chain, 3 rules on OUTPUT)

xt_info_rdlock_bh() uses 0.6786 % of cpu
xt_info_rdunlock_bh() uses 0.1743 % of cpu


CPU: Core 2, speed 3000.77 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
1248350  1248350       11.3285  11.3285    copy_from_user
534049   1782399        4.8464  16.1749    copy_to_user
480898   2263297        4.3641  20.5390    __schedule
325581   2588878        2.9546  23.4936    ipt_do_table
312697   2901575        2.8377  26.3312    tcp_ack
309381   3210956        2.8076  29.1388    tcp_sendmsg
248238   3459194        2.2527  31.3915    tcp_v4_rcv
230405   3689599        2.0909  33.4824    tcp_transmit_skb
220638   3910237        2.0022  35.4847    ip_queue_xmit
217099   4127336        1.9701  37.4548    tcp_recvmsg
175885   4303221        1.5961  39.0509    tcp_rcv_established
173112   4476333        1.5710  40.6219    __switch_to
165138   4641471        1.4986  42.1205    sysenter_past_esp
149367   4790838        1.3555  43.4759    dst_release
138619   4929457        1.2579  44.7339    sched_clock_cpu
132724   5062181        1.2044  45.9383    lock_sock_nested
121353   5183534        1.1013  47.0396    nf_iterate
119205   5302739        1.0818  48.1214    netif_receive_skb
118859   5421598        1.0786  49.2000    release_sock
112597   5534195        1.0218  50.2218    __inet_lookup_established
112195   5646390        1.0181  51.2399    sys_socketcall
110018   5756408        0.9984  52.2383    tcp_write_xmit
106466   5862874  ...
From: Paul E. McKenney
Date: Tuesday, April 21, 2009 - 9:37 am

On Tue, Apr 21, 2009 at 06:59:29AM +0200, Eric Dumazet wrote:


My excursion in to rcu_fgp can be thought of as a training exercise for
doing this, though perhaps others are also working on an upgraded rcu_bh
as well.

							Thanx, Paul
--

From: Lai Jiangshan
Date: Monday, April 20, 2009 - 10:46 pm

And remove lockdep_xxx()s.


--

From: Linus Torvalds
Date: Tuesday, April 21, 2009 - 9:13 am

Ok, so others already pointed out how dangerous/buggy this looks, but I'd 
like to strengthen that a bit more:


This function is FUCKED UP.

It's total crap for several reasons"

 - the already-mentioned race with bottom half locking.

   If bottom halfs aren't already disabled, then if a bottom half comes in 
   after the "++lock->depth" and before the spin_lock_bh(), then you will 
   have no locking AT ALL for the processing of that bottom half - it will 
   just increment the lock depth again, and nobody will have locked 
   anything at all.

   And if for some reason, you can prove that bottom half processing is 
   already disabled, then ALL THAT OTHER CRAP is just that - CRAP. The 
   whole preemption disabling, the whole "_bh()" thing, everything.

   So either it's horribly buggy, or it's horribly broken and pointless. 
   Take your pick.

 - the total lack of comments. Why does that "count" protect anything? 
   It's not a recursive lock, since there is no ownership (two 
   independent accessors could call this and both "get" the lock), so you 
   had damn well better create some big-ass comments about why it's ok in 
   this case, and what the rules are that make it ok.

   So DON'T GO AROUND CALLING IT A RECURSIVE LOCK! Don't write comments 
   that are TOTAL AND UTTER SH*T! Just DON'T!

   It's a "reader lock". It's not "recursive".  It never was recursive, it 
   never will be, and calling it that is just a sign that whoever wrote 
   the function is a moron and doesn't know what he is doing. STOP DOING THIS!

 - that _idiotic_ "preempt_enable_no_resched()". F*ck me, but the comment 
   already says that preemption is disabled when exiting, so why does it 
   even bother to enable it? Why play those mindless games with preemption 
   counts, knowing that they are bogus?

   Do it readably. Disable preemption first, and just re-enable it at 
   UNLOCK time. No odd pseudo-reenables anywhere.

Oh, I know very well that the spli-locking will disable ...
From: Stephen Hemminger
Date: Tuesday, April 21, 2009 - 9:43 am

On Tue, 21 Apr 2009 09:13:52 -0700 (PDT)

Ah a nice day, with Linus giving constructive feedback. Too bad he has
to channel it out of the dark side.

--

From: Linus Torvalds
Date: Tuesday, April 21, 2009 - 9:50 am

I already flamed that patch at least once before. People didn't react. 

What do I have to do to make people listen? I'm sorry, but I'm not going 
to send you flowers with a card saying "Hope you do better next time!".

I realize that the flowers might be friendlier, but I have absolutely no 
incentive to be friendly to just bad code. I have even _less_ incentive 
when my first "that sucks" is apparently totally ignored. So now I spelled 
it out why it sucks, but I sure as hell didn't have any reason to be 
polite about it.

			Linus
--

From: Ingo Molnar
Date: Tuesday, April 21, 2009 - 11:02 am

i think there might be (are?) uses of:

	spin_lock_bh(&some->lock);
	...
	spin_unlock(&some->lock);
	...
	local_bh_enable();

So we have to have two preemption control levels for that, as 
there's no knowledge at the spin_lock_bh() place whether it will be 
followed by a spin_unlock_bh() [in which case it would be safe to do 
SOFTIRQ_OFFSET only] - or by a spin_unlock() + local_bh_enable() 
pair..

[ That locking pattrn even makes a certain amount of sense: keep the 
  lock held for a short amount of time - then weaken locking to bh 
  context exclusion only. ]

What we could do is an optimization to do a compound increase the 
preempt count by SOFTIRQ_OFFSET+1 - instead of a local_bh_disable() 
+ preempt_disable()? Symmetrically we could do a compound decrease 
in the unlock case.

It might even be called: local_bh_preempt_disable() or so?

	Ingo
--

From: Stephen Hemminger
Date: Tuesday, April 21, 2009 - 11:15 am

Subject: netfilter: use per-cpu recursive lock (v12)

This version of x_tables (ip/ip6/arp) locking uses a per-cpu
recursive lock that can be nested.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  This reduces the contention of a
single reader lock (in 2.6.29) without the delay of synchronize_net()
(in 2.6.30-rc2). 

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

I don't like the NR_CPUS preempt count kludge, it is working around an
issue that should be handled in a more generic way.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com

---
CHANGES 
  - go back to read/write lock
  - reader case is now inline
  - more comments
  - always disables bottom half
  - only play preempt count mindgames if really necessary
  - add lockdep keys

 include/linux/netfilter/x_tables.h |   42 ++++++++++++--
 net/ipv4/netfilter/arp_tables.c    |  110 ++++++++-----------------------------
 net/ipv4/netfilter/ip_tables.c     |  110 +++++++------------------------------
 net/ipv6/netfilter/ip6_tables.c    |  108 ++++++++----------------------------
 net/netfilter/x_tables.c           |   77 ++++++++++++++++++-------
 5 files changed, 165 insertions(+), 282 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-21 07:57:06.668582345 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-21 11:06:10.381487605 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,43 @@ extern void xt_proto_fini(struct net *ne
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct ...
From: Ingo Molnar
Date: Tuesday, April 21, 2009 - 12:10 pm

hm, this is rather ugly and it will make a lot of instrumentation 
code explode.

Why not use the obvious solution: a _single_ wrlock for global 
access and read_can_lock() plus per cpu locks in the fastpath?

That way there's no global cacheline bouncing (just the _reading_ of 
a global cacheline - which will be nicely localized - on NUMA too) - 
and we will hold at most 1-2 locks at once!

Something like:

	__cacheline_aligned DEFINE_RWLOCK(global_wrlock);

	DEFINE_PER_CPU(rwlock_t local_lock);


	void local_read_lock(void)
	{
	again:
		read_lock(&per_cpu(local_lock, this_cpu));

		if (unlikely(!read_can_lock(&global_wrlock))) {
			read_unlock(&per_cpu(local_lock, this_cpu));
			/*
			 * Just wait for any global write activity:
			 */
			read_unlock_wait(&global_wrlock);
			goto again;
		}
	}

	void global_write_lock(void)
	{
		write_lock(&global_wrlock);

		for_each_possible_cpu(i)
			write_unlock_wait(&per_cpu(local_lock, i));
	}

Note how nesting friendly this construct is: we dont actually _hold_ 
NR_CPUS locks all at once, we simply cycle through all CPUs and make 
sure they have our attention.

No preempt overflow. No lockdep explosion. A very fast and scalable 
read path.

Okay - we need to implement read_unlock_wait() and 
write_unlock_wait() which is similar to spin_unlock_wait(). The 
trivial first-approximation is:

	read_unlock_wait(x)
	{
		read_lock(x);
		read_unlock(x);
	}

	write_unlock_wait(x)
	{
		write_lock(x);
		write_unlock(x);
	}

Hm?

	Ingo
--

From: Eric Dumazet
Date: Tuesday, April 21, 2009 - 12:46 pm

Obvious is not the qualifier I would use :)


Hmm... here we can see global_wrlock locked by on writer, while
this cpu already called local_read_lock(), and calls again this

Very interesting and could be changed to use spinlock + depth per cpu.

-> we can detect recursion and avoid the deadlock, and we only use one
atomic operation per lock/unlock pair in fastpath (this was the reason we
tried hard to use a percpu spinlock during this thread)


__cacheline_aligned DEFINE_RWLOCK(global_wrlock);

struct ingo_local_lock {
	spinlock_t lock;
	int depth;
};
DEFINE_PER_CPU(struct ingo_local_lock local_lock);


void local_read_lock(void)
{
	struct ingo_local_lock *lck;

	local_bh_and_preempt_disable();
	lck = &get_cpu_var(local_lock);
	if (++lck->depth > 0) /* already locked */
		return;
again:
	spin_lock(&lck->lock);

	if (unlikely(!read_can_lock(&global_wrlock))) {
		spin_unlock(&lck->lock);
		/*
		 * Just wait for any global write activity:
		 */
		read_unlock_wait(&global_wrlock);
		goto again;
	}
}

void global_write_lock(void)
{
	write_lock(&global_wrlock);

	for_each_possible_cpu(i)
		spin_unlock_wait(&per_cpu(local_lock, i));
}



Hmm ?

--

From: Ingo Molnar
Date: Wednesday, April 22, 2009 - 12:35 am

Yeah, indeed.

I wasnt really concentrating on the nested case, i was concentrating 
on the scalability and lock nesting angle. I think the code 
submitted here looks rather incestous in that regard.

Allowing nested locking _on the same CPU_ is asking for trouble. Use 
short critical sections and if there's any exclusion needed, use an 

Yeah, this looks IRQ-nesting safe. But i really have to ask: why 
does this code try so hard to allow same-CPU nesting?

Nesting on the same CPU is _bad_ for several reasons:

1) Performance: it rips apart critical sections cache-wise. Instead 
   of a nice:

       C1 ... C2 ... C3 ... C4

   serial sequence of critical sections, we get:
 
       C1A ... ( C2 ) ... C1B ... C3 ... C4

   Note that C1 got "ripped apart" into C1A and C1B with C2 injected 
   - reducing cache locality between C1A and C1B. We have to execute
   C1B no matter what, so we didnt actually win anything in terms of 
   total work to do, by processing C2 out of order.

   [ Preemption of work (which this kind of nesting is really about)
     is _the anti-thesis of performance_, and we try to delay it as
     much as possible and we try to batch up as much as possible. 
     For example the CPU scheduler will try _real_ hard to not 
     preempt a typical workload, as long as external latency 
     boundaries allow that. ]

2) Locking complexity and robustness. Nested locking is rank #1 in
   terms of introducing regressions into the kernel.

3) Instrumentation/checking complexity. Checking locking 
   dependencies is good and catches a boatload of bugs before they 
   hit upstream, and nested locks are supported but cause an 
   exponential explosion in terms of dependencies to check.

   Also, whenever instrumentation explodes is typically the sign of 
   some true, physical complexity that has been introduced into the 
   code. So it often is a canary for a misdesign at a fundamental 
   level, not a failure in the instrumentation framework.

In ...
From: Eric Dumazet
Date: Wednesday, April 22, 2009 - 1:53 am

The netfilter case is real simple Ingo, (note I did not use "obvious" here ;) )

netfilter in 2.6.2[0-9] used :

CPU 1

sofirq handles one packet from a NIC

ipt_do_table() /* to handle INPUT table for example, or FORWARD */
read_lock_bh(&a_central_and_damn_rwlock)
... parse rules
    -> calling some netfilter sub-function
       re-entering network IP stack to send some packet (say a RST packet)
       ...
       ipt_do_table() /* to handle OUPUT table rules for example */
       read_lock_bh() ; /* WE RECURSE here, but once in a while (if ever) */


This is one of the case, but other can happens with virtual networks, tunnels, ...
and so on. (Stephen had some cases with KVM if I remember well)

If this could be done without recursion, I am pretty sure netfilter
and network guys would have done it. I found Linus reaction quite
shocking IMHO, considering hard work done by all people on this.

I was pleased by your locking schem, that was *very* interesting, even
if not yet ready.

1) We can discuss of how bad recursion is.

We know loopback_xmit() could be way faster if we could avoid queeing packet
to softirq handler.
(Remember you and I suggested this patch some months ago ? Please remember
David rejected this because of recursion and possibility to overflow stack)

So yes, people are aware of recursion problems.

2) We can discuss how bad rwlocks are

We did lot of work on last months to delete some of rwlocks we had in kernel.
UDP stack for example dont use them anymore.

We tried to delete them on x_tables, but we must take care of ipt_do_table() nesting,
that is legal on 2.6.30.
Maybe netfilter guys can work to avoid this nesting on 2.6.31, 
I dont know how hard it is, definitly not 2.6.30 material.

Solutions were discussed many times and Stephen provided 13 versions of the patch.
If this was that obvious, one or two iterations would have been OK.

3) About last patch (v13)
Stephen did not agreed with you (well... maybe after all ..),
he only ...
From: Jarek Poplawski
Date: Wednesday, April 22, 2009 - 3:13 am

On 22-04-2009 10:53, Eric Dumazet wrote:

Right, looks like 100% (Scandinavian?!) troll. I wonder where is the
admin of this mess...

Jarek P.
--

From: Ingo Molnar
Date: Wednesday, April 22, 2009 - 4:26 am

You might disagree with me on an honest technical basis, but note 
that this kind of childish, knee-jerk reaction against dissenting 
people is what gives netdev its bad reputation.

	Ingo
--

From: Jarek Poplawski
Date: Wednesday, April 22, 2009 - 4:39 am

How did you find I disagree with you? There was nothing about you,
neither anything technical. Actually, I forgot to mention I agree with
Eric about your locking proposal.

Jarek P.
--

From: Ingo Molnar
Date: Wednesday, April 22, 2009 - 4:18 am

Again ... i dont know this code well, but you yourself describe it 
as "a_central_and_damn_rwlock" above.

"Central and damn" locks of any type, anywhere tend to cause such 
trouble.

Often they are mini-BKL's in the making. Here's some historic 
patterns:

 1- First there's a convenient lock around a popular and useful 
    piece of data and code.

 2- As popularity (and reach of code) grows, and some non-trivial 
    interaction ensues, a little bit of self-recursion is added to 
    the mix (this is often easier to do than to fix the root cause 
    of the problem) - making critical sections even easier to grow 
    in size.

 3- Then attempts are made to make it scale all (it's a popular 
    piece of code), it's extended along a per-cpu axis, but 
    complexity of locking explode by taking a ton of locks all at 
    once. The solution: yell at the lock validator for not allowing
    this (or for exploding due to the sheer mathematically 
    large complexity of the locking rules). Frequent requests for
    an exclusion from those pesky validations are made, and various
    hacks are done to work it around. It's all the fault of the lock
    validator, of course.

 4- At this point it's rarely a clean, tidy data lock anymore - it 
    tends to grow into a code lock nobody really knows how it works, 
    except that it better be taken more often than not, badness may
    ensue otherwise. Nobody really knows when to take it, only that 
    it should be taken widely enough, that it should be recursive 
    enough to call even _more_ code from under it. Efforts to reduce 
    critical section length are rebuffed with: "this adds unlocking
    and relocking overhead". And it should then all scale as well.

 5- In the end it's a lock everyone curses but nobody is able to fix 
    anymore. "If it was easy to fix we'd have fixed it long ago 
    already" kind of fatalist thinking becomes widespread.

We saw many examples of this in the past: beyond the BKL (which is a ...
From: Linus Torvalds
Date: Wednesday, April 22, 2009 - 8:19 am

You don't _understand_ do you?

There is a huge difference between recursive code, and a recursive lock.

The netfilter code may need to occasionally re-enter itself. Nobody ever 
contested _that_ part.

What I have disagreed with the whole time is 

 (a) doing local ad-hoc locking primitives without any comments 
     what-so-ever.

 (b) Doing them _wrong_ in many cases

 (c) Calling the _lock_ a "recursive" lock.

The fact that a lock works with recursion doesn't make it "recursive". 
That generally has a very special meaning for locking primitives, and 
means something else.

In contrast, a read-write lock actually has known properties, and we have 
existing locking mechanisms for those. And we call them read-write locks 
DESPITE THE FACT that the reading part can be done recursively. 

If you call a read-write lock a "recursive" lock, then you're a moron. 
It simply is _not_ a recursive lock. And neither is the lock you actually 
implemented, even though you (and Stephen) continually call it that. 

SO STOP CALLING IT A RECURSIVE LOCK. Look at your very own code: you can 
actually only use that lock in a recursive context in a _very_ specific 
place. Notice how it's only "recursive" when taken in the per-CPU context, 
but _not_ recursive when the filter-updating code ("writer") takes it?

Do you understand now? It really shouldn't be so hard for you. 

Naming is important. Locking is important. You did both things wrong. You 
named your locks something incorrect and mis-leading that didn't actually 
describe them, and you did your own private locking code without then 
documenting what the rules for this special lock were.

Maybe in your world that's ok. But no, in mine it's not. I've seen too 
many damn _non-functioning_ locks to ever want to see stuff like that 
again.

			Linus
--

From: Eric Dumazet
Date: Wednesday, April 22, 2009 - 9:57 am

Linus,

I actually sent *one* buggy patch, and you already gave your feedback and NACK.

Fine

I even relayed this to Stephen suggesting him not calling this a recursive lock.
(Note how I use 'suggesting' here)

So, what do you want from me ? Should I copy 100 times :

"I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." 
"I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." 
"I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." 
"I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." 
"I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." 
"I should not call it a recursive lock. I shall not invent new locking infra. I am a moron." 
...

OK done

Can we now proceed and continue ?

Thank you

--

From: Linus Torvalds
Date: Wednesday, April 22, 2009 - 10:18 am

Actually, the thing is, I don't even think your original patch was even 
buggy. The bug crept in later. I NAK'ed it not because it was buggy, but 
because of the ad-hoc'ness and the naming. Really.

And I actually even said so in my original rant:

 'The fact that code "happens to work by mistake" (and I'm not saying that
  your does - but it might just because of the per-cpu'ness of it [..]'

because your original patch still had the

	rcu_read_lock_bh();

in place before the whole

	rl = &__get_cpu_var(arp_tables_lock);
	if (likely(rl->count++ == 0))
		spin_lock(&rl->lock);

and that should have protected against both BH callers and preemption.

So I actually believe that your original patch probably worked fine (but 
as I said in my reaction to it, I thought it was almost by mistake and I 
wasn't going to review it)

So the actual _bug_ crept in later, when the RCU lock was removed, and the 
lock was cleaned up and separated into a function of its own.

And in fact, that is kind of my point: "uncommented locking with ad-hoc 
semantics is very fragile". Even _correct_ code ends up not being correct 

So I consider this thread ended from a technical standpoint.

[ That said, I will not be at all shocked to hear if people decide later 
  that the RCU method was better after all, and that even the per-cpu
  rwlock or spinlock is just too expensive. ]

My problem today (apart from the relatively minor issue of also wanting to 
get the commit log fixed up) is just that I see emails from you finding my 
reaction shocking and from Jarek Poplawski that seem to still think that 
I'm a troll.

Just because I pointed out real technical problems? Is that shocking or 
trolling?

Really - please go back to my _original_ email. No, it was not polite. But 
here's another quote from it:

  "Because even if it works today, it's just a bug waiting to happen."

and dammit, I sent that out _before_ the very next version of the patch 
that actually _did_ introduce that ...
From: Jarek Poplawski
Date: Wednesday, April 22, 2009 - 1:46 pm

Linus Torvalds wrote, On 04/22/2009 07:18 PM:


Very funny! For newbies only: http://en.wikipedia.org/wiki/Troll_(Internet)


Actually, I admit you could be the best "Ozzy" Osbourne imposter instead.

Jarek P.
--

From: Ingo Molnar
Date: Wednesday, April 22, 2009 - 10:48 am

Btw., i didnt find Linus's reaction shocking at all, nor did i 
understand it as any criticism of prior (and future) good work of 
the people involved. I found it to be what it was: a forceful 
(because repeated) criticism of a bad patch.

	Ingo
--

From: Stephen Hemminger
Date: Tuesday, April 21, 2009 - 2:04 pm

On Tue, 21 Apr 2009 21:10:07 +0200

Better general solutions:
    * use raw_spin_lock
    * increase PREEMPT_BITS on 64 bit machine 
      and limit to 128 CPU or less on 32 bit
    * get rid of default preempt_disable in spin_lock


Quit trying to be so damn f*cking cool. We don't build the software
for locking instrumentation. Locking instrumentation needs to serve
the kernel, not the other way around. 

Your version fails for the case of nested local rules. Which was
the whole reason I switched to read/writer locks.

        CPU 1                            CPU 2
        local_read_lock
        ...                              global_write_lock
            local_read_lock
Stuck...

CPU 2 is waiting for CPU 1 to get out of its nested table processing
CPU 1 is waiting for write lock to get done

--

From: Ingo Molnar
Date: Wednesday, April 22, 2009 - 1:00 am

You make it quite hard to give reasonable feedback to your code :-/
First you attack me personally here, then - 30 minutes later - in 
the next iteration of your patch, you do:

+       /* wait for each other cpu to see new table */
+       for_each_possible_cpu(i)
+               if (i != smp_processor_id()) {
+                       xt_info_wrlock(i);
+                       xt_info_wrunlock(i);
+               }

... which i have not seen in your previous patch and which looks 
awfully similar to the write_lock_wait() based primitive i 
suggested.

( Just open-coded in an ugly fashion and slower than a real, proper
  wait-unlock would be, because it dirties all those locks 
  needlessly. )

So you must have agreed with me to a certain degree - i just dont 
see that in any of the discussion. (you seem to totally disagree 
with me to the level of ridiculing me.) Which makes it hard to 

Yes, as Eric pointed it out, more than an hour before your reply. I 
find the nesting uninteresting (in fact i find it harmful - see my 
reply to Eric). If you were only interested in nesting then a plain 
old-fashioned rwlock would have done the job.

The detail that is interesting here is how to avoid the global 
rwlock cacheline bounce - not the recursion. (the same-CPU recursion 
is avoidable via proper design or workaround-able via a counter in 
so many ways)

Anyway, i'm back into lurker mode.

	Ingo
--

From: Ingo Molnar
Date: Tuesday, April 21, 2009 - 12:39 pm

In the global/local lock scheme i proposed this would become:

	global_write_unlock(void)
	{
		write_unlock(&global_lock);
	}

As we dont hold the local locks during the write-locked critical 
section. No loop needed over CPUs, no preempt nesting complications, 
no lockdep complications, etc.

	Ingo
--

From: Stephen Hemminger
Date: Tuesday, April 21, 2009 - 2:39 pm

This version of x_tables (ip/ip6/arp) locking uses a per-cpu
recursive lock that can be nested.

The idea for this came from an earlier version done by Eric Dumazet.
Locking is done per-cpu, the fast path locks on the current cpu
and updates counters.  This reduces the contention of a
single reader lock (in 2.6.29) without the delay of synchronize_net()
(in 2.6.30-rc2). 

The mutex that was added for 2.6.30 in xt_table is unnecessary since
there already is a mutex for xt[af].mutex that is held.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com

---
CHANGES 
  - reader and write now inline
  - only acquire one cpu write lock at a time
  - write lock pushed down into get_counters

 include/linux/netfilter/x_tables.h |   50 +++++++++++++--
 net/ipv4/netfilter/arp_tables.c    |  121 ++++++++++---------------------------
 net/ipv4/netfilter/ip_tables.c     |  120 +++++++++---------------------------
 net/ipv6/netfilter/ip6_tables.c    |  120 ++++++++++--------------------------
 net/netfilter/x_tables.c           |   55 +++++++++-------
 5 files changed, 174 insertions(+), 292 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-21 07:57:06.668582345 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-21 14:24:03.295299154 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,51 @@ extern void xt_proto_fini(struct net *ne
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
-				    struct xt_table_info *new);
+
+
+/*
+ * Per-CPU read/write lock. This makes reader locking fast since
+ * there is no shared variable to cause cache ping-pong; but adds an
+ * additional write-side penalty since write must iterate over ...
From: Paul E. McKenney
Date: Tuesday, April 21, 2009 - 9:17 pm

Looks good from a concurrency viewpoint!


This is very good -- gets rid of problems with preemption nesting
depth limitations and lockdep limitations.  In addition, it gets rid of
the period of time during which all packet processing might otherwise
be blocked.  Not a big deal on a small machine, but could be a real
problem on a large one.


Good, you do indeed need to prevent migration before you acquire this
CPU's lock.  Otherwise, you could have more than one CPU attempting to

Good.  Of course, the reason we care about preemption here is that
otherwise some other task could mess up this CPU's counters.

Bad for real-time response, but then again, what the heck are you
doing updating netfilter rules while a system is running a real-time

Hmmm...  Not sure I want to know what module uses this.


Good, disabling preemption prevents other readers from messing with this



--

From: Eric Dumazet
Date: Wednesday, April 22, 2009 - 7:57 am

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Tested today on various machines and no problem so far
tbench/oprofile results, 3.7236% cpu spent in ipt_do_table, and 0.84% used
on read_lock/read_unlock

c04a5c30 <ipt_do_table>: /* ipt_do_table total: 217134  3.7236 */
...
   349  0.0060 :c04a5ccc:   call   c04ce380 <_read_lock>
 23914  0.4101 :c04a5cd1:   mov    0xc(%edi),%eax
...
               :c04a5ecb:   call   c04ce5f0 <_read_unlock_bh>
 25279  0.4335 :c04a5ed0:   cmpb   $0x0,-0xd(%ebp)

"iptables -L" fetches its data very fast too.
150 us on a 8 cpus machine, small firewall rules.
400-700 us on same machine, 1000 fw rules set (160000 bytes per cpu)
 depending on network trafic (from light to flood)



--

From: Linus Torvalds
Date: Wednesday, April 22, 2009 - 8:32 am

Ack on the code.

But the comment is _still_ crap. Please update. It's not a recursive lock, 
as clearly shown by the code itself. It's a per-cpu read-write lock, and 
only the reader is "recursive" (but that's how read-write locks with in 
Linux, and that has nothing to do with anything).

So make the explanations match the code and the intent. Write it something 
like

	This version of x_tables (ip/ip6/arp) locking uses a per-cpu
	reader-writer lock lock where the readers can nest.

and don't confuse it with incorrect commit messages. The lock is very much 
not recursive - on purpose - for half the people taking it.

[ That, btw, was always true, even in the original random open-coded 
  version. Because you can't actually do a real recursive lock without 
  having notion of "current ownership" either by making the count be 
  <per-thread,per-lock> - like the BKL - or by saving the ownership 
  information in the lock. A plain counter simply doesn't do it. ]

		Linus
--

From: Stephen Hemminger
Date: Thursday, April 23, 2009 - 9:09 pm

In days of old in 2.6.29, netfilter did locketh using a 
lock of the reader kind when doing its table business, and do
a writer when with pen in hand like a overworked accountant
did replace the tables. This sucketh and caused the single
lock to fly back and forth like a poor errant boy.

But then netfilter was blessed with RCU and the performance
was divine, but alas there were those that suffered for
trying to replace their many rules one at a time.

So now RCU must be vanquished from the scene, and better
chastity belts be placed upon this valuable asset most dear.
The locks that were but one are now replaced by one per suitor.

The repair was made after much discussion involving
Eric the wise, and Linus the foul. With flowers springing
up amid the thorns some peace has finally prevailed and
all is soothed. This patch and purple prose was penned by
in honor of "Talk like Shakespeare" day.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
What hath changed over the last two setting suns:
  * more words, mostly correct...

  * no need to locketh for writeh on current cpu tis 
    always so

  * the locking of all cpu's on replace is always done as
    part of the get_counters cycle, so the sychronize swip
    in replace tables is gone with only a comment remaing

 include/linux/netfilter/x_tables.h |   55 ++++++++++++++--
 net/ipv4/netfilter/arp_tables.c    |  125 ++++++++++--------------------------
 net/ipv4/netfilter/ip_tables.c     |  126 ++++++++++---------------------------
 net/ipv6/netfilter/ip6_tables.c    |  123 ++++++++++--------------------------
 net/netfilter/x_tables.c           |   55 ++++++++--------
 5 files changed, 188 insertions(+), 296 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-23 19:59:36.076558199 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-23 20:22:06.566001575 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain ...
From: Eric Dumazet
Date: Thursday, April 23, 2009 - 9:58 pm

Philip Davis of the university’s School of English said :

  "Shakespeare surprises the brain and catches it off guard in
  a manner that produces a sudden burst of activity - a sense 
  of drama created out of the simplest of things."




Did you tried :

static DECLARE_PER_CPU(struct lock_class_key, xt_locks_key);

static int __init xt_init(void)
 {
	unsigned int i;
	int rv;

	for_each_possible_cpu(i) {
		rwlock_t *lock = &per_cpu(xt_info_locks, i);

		rwlock_init(lock);
		lockdep_set_class(lock, &per_cpu(&xt_locks_key, i));
	}
...

Thanks

--

From: Patrick McHardy
Date: Friday, April 24, 2009 - 8:33 am

Either way is fine with me, I'll wait for Stephen to state his opinion.
--

From: Stephen Hemminger
Date: Friday, April 24, 2009 - 9:18 am

On Fri, 24 Apr 2009 06:58:39 +0200

The lock keys are really only used by lock dep, and I thought per cpu
space was more scarce on some arch.

--

From: Jarek Poplawski
Date: Friday, April 24, 2009 - 1:43 pm

Maybe I'm wrong but after this change: "- only acquire one cpu write
lock at a time" lockdep_set_class() might be unnecessary. Alas I'm
not able to test it.

Jarek P.
--

From: Stephen Hemminger
Date: Saturday, April 25, 2009 - 1:30 pm

Epilogue due to master Jarek. Lockdep carest not about the locking
doth bestowed. Therefore no keys are needed.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
 net/netfilter/x_tables.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

--- a/net/netfilter/x_tables.c	2009-04-25 13:25:48.115026283 -0700
+++ b/net/netfilter/x_tables.c	2009-04-25 13:26:15.646215635 -0700
@@ -1145,14 +1145,9 @@ static int __init xt_init(void)
 {
 	unsigned int i;
 	int rv;
-	static struct lock_class_key xt_lock_key[NR_CPUS];
 
-	for_each_possible_cpu(i) {
-		rwlock_t *lock = &per_cpu(xt_info_locks, i);
-
-		rwlock_init(lock);
-		lockdep_set_class(lock, xt_lock_key+i);
-	}
+	for_each_possible_cpu(i)
+		rwlock_init(&per_cpu(xt_info_locks, i));
 
 	xt = kmalloc(sizeof(struct xt_af) * NFPROTO_NUMPROTO, GFP_KERNEL);
 	if (!xt)
--

From: Jarek Poplawski
Date: Sunday, April 26, 2009 - 1:18 am

Very nice! I guess this Shakespeare guy Will sign off this all too. ;-)

Thanks,
--

From: Eric Dumazet
Date: Sunday, April 26, 2009 - 11:24 am

So far, so good, should be ready for inclusion now, nobody complained :)

I include the final patch, merge of your last two patches.

David, could you please review it once again and apply it if it's OK ?

Thanks to all for your help and patience

[PATCH] netfilter: use per-CPU recursive lock {XV}

In days of old in 2.6.29, netfilter did locketh using a 
lock of the reader kind when doing its table business, and do
a writer when with pen in hand like a overworked accountant
did replace the tables. This sucketh and caused the single
lock to fly back and forth like a poor errant boy.

But then netfilter was blessed with RCU and the performance
was divine, but alas there were those that suffered for
trying to replace their many rules one at a time.

So now RCU must be vanquished from the scene, and better
chastity belts be placed upon this valuable asset most dear.
The locks that were but one are now replaced by one per suitor.

The repair was made after much discussion involving
Eric the wise, and Linus the foul. With flowers springing
up amid the thorns some peace has finally prevailed and
all is soothed. This patch and purple prose was penned by
in honor of "Talk like Shakespeare" day.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
What hath changed over the last two setting suns:

  * more words, mostly correct...

  * no need to locketh for writeh on current cpu tis 
    always so

  * the locking of all cpu's on replace is always done as
    part of the get_counters cycle, so the sychronize swip
    in replace tables is gone with only a comment remaing

  * Epilogue due to master Jarek. Lockdep carest not about
    the locking doth bestowed. Therefore no keys are needed.

 include/linux/netfilter/x_tables.h |   55 ++++++++++-
 net/ipv4/netfilter/arp_tables.c    |  125 +++++++-------------------
 net/ipv4/netfilter/ip_tables.c     |  126 +++++++--------------------
 ...
From: Mathieu Desnoyers
Date: Sunday, April 26, 2009 - 11:56 am

Hi Eric,

Please... could you rename this patch according to Linus'comments ?

Suitable name would probably be :


Stating that the write lock must _always_ be taken with bh disabled



Did you really need to move add_counter_to_entry and put_counters in
this patch ? This also seems more like a cleanup to me, if it is even



Whiteline.....


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Stephen Hemminger
Date: Sunday, April 26, 2009 - 2:57 pm

On Sun, 26 Apr 2009 14:56:46 -0400

But Linus is trying to delude himself.

This usage is recursive even if he doesn't like the terminology.
The same CPU has to be able to reacquire the read lock without deadlocking.
If reader/writer locks were implemented in a pure writer gets priority
method, then this code would break!  So yes read locks can be used recursively
now in Linux, but if the were implemented differently then this code
would break.  For example, the -rt kernel turns all read/write locks into
mutexs, so the -rt kernel developers will have to address this.
--

From: Mathieu Desnoyers
Date: Sunday, April 26, 2009 - 3:32 pm

Hi Stephen,

[I see that you have cutted my name proposal from the original email,
which might make it difficult for others to follow. I will assume you
did it by mistake.]

(re-added)

Reading Documentation/spinlocks.txt, which states the lock usage
guidelines :

"Note that you can be clever with read-write locks and interrupts. For
example, if you know that the interrupt only ever gets a read-lock, then
you can use a non-irq version of read locks everywhere - because they
don't block on each other (and thus there is no dead-lock wrt interrupts.
But when you do the write-lock, you have to use the irq-safe version."

So it's assumed in the kernel-wide read lock usage that nested read
locks are OK. I'm adding Thomas and Steven in CC, but I'm quite sure
they must have dealt with nested read-lock transformation into mutexes
by detecting nesting in some way in -rt. But I'll let them confirm this.

So I don't see why you are dreaming about a different semantic than the
one of the primitives you are using. I guess I'll leave the semantics to
you. I just find it astonishing that you persist saying that everbody is
wrong on a topic like semantic, which is in the end a mean to
communicate ideas clearly within the overall community you disagree
with.

Good luck !

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Peter Zijlstra
Date: Monday, April 27, 2009 - 10:44 am

A recursive lock has the property:

lock()
{
  if (lock->owner == current) {
    lock->depth++;
    return;
  }

  /* regular lock stuff */
}

unlock()
{
  if (!--lock->depth)
    /* regular unlock */
}

non of the linux kernel locking primitives have this -- with the
possible exception of the cpu-hotplug lock.

What rwlock_t has, is reader bias to the point where you can utterly
starve writers, with the side effect that you can obtain multiple read
ownerships without causing a deadlock.

This is not what is called a recursive lock. A recursive lock would have
each owner only once, this rwlock_t thing is simply so unfair that it
can have unlimited owners, including multiple copies of the same one.

rwsem has fifo fairness, and therefore can deadlock in this scenario,
suppose thread A does a read, thread B tries a write and blocks, then
thread A recurses and tries to obtain another read ownership --
deadlock, as the FIFO fairness will demand the second read ownership
will wait on the pending writer, which will wait on the outstanding read
owner.

Now if rwsem were a fifo-fair recursive lock, the above would not
deadlock, since it would detect that the task already had (read)
ownership and simply increment the depth, instead of trying to acquire a
second ownership.

This is all common and well understood terminology, not something Linus
invented just to harras you with.

Generally speaking we do not condone recursive locking strategies -- and
afaik reiserfs (as per the BKL) and the network code (as per abusing
rwlock_t unfairness) are the only offenders.

Like Linus stated, recursive locking is generally poor taste and
indicates you basically gave up on trying to find a proper locking
scheme. We should very much work towards getting rid of these
abberations instead of adding new ones.

Linus is very much right on what he said, and you calling him delusional
only high-lights your ignorance on the issue.

[ PS. -rt implements rwlock_t as a proper recursive ...
From: Stephen Hemminger
Date: Monday, April 27, 2009 - 11:30 am

On Mon, 27 Apr 2009 19:44:57 +0200

Only on Linux, and only because you look at locking from
the point of view of the magic variable "current" process


In Documentation/ ?  online ?  Where is the definition? The only reference

The people complaining about naming never seem to be the ones providing

Name it "dog's breath locking" for all I care. I am not bothering
with arguments over names; there is real work to do elsewhere.
--

From: Ingo Molnar
Date: Monday, April 27, 2009 - 11:54 am

Sure, see:


The thing is, while you now have named your locking primitive 
correctly, you are still abusing it by using it recursively.

So it's not 'just about naming'. You should not use read-locks as 
recursive locks. It's poor code.

	Ingo
--

From: Stephen Hemminger
Date: Monday, April 27, 2009 - 12:06 pm

On Mon, 27 Apr 2009 20:54:23 +0200

All those references support my argument that the lock is being
used recursively in this case. It is being acquired multiple
times by the same CPU. This is not new, it has always been
acquired multiple times, so my change does not break anything.

If other implementations of reader locks to not nest the same
way, then on those system iptables can deadlock. Nothing was

If you don't like the proposal please, think of a better alternative.
Not just pseudo code that is broken with handwaving arguments.
--

From: Linus Torvalds
Date: Monday, April 27, 2009 - 12:46 pm

What's so hard between understanding the difference between "used 
recursively" and "recursive lock"?

THEY MEAN TWO TOTALLY DIFFERENT THINGS!

The fact that you don't seem to understand that is one of the things I've 
been complaining about all along.

So here's a final set of clue-bat:

Clue bat #1:

 - "can be used in a recursive context" is not the same as "is recursive".

   Analogy time: Look at "strtok_r()", and the difference between that and 
   "strtok()". "strtok_r()" can be used in a threaded environment. Does 
   that mean that 'strtok_r()' is threaded? No. When you call 
   'strtok_r()', it's still signle-threaded - it's just that it _can_ be 
   called in a threaded environment and then still has sane semantics.

   Now, replace "threaded" with "recursive". Do you get it?

Clue bat #2:

 - a lock that can count does not mean that it is recursive. 

   Counting and recursion are TWO TOTALLY DIFFERENT ISSUES. The 
   "countingness" means that there can be multiple users inside of it, and 
   that, in turn, implies that maybe it can be used in a recursive 
   context. But again, counting and recursion are not about the same thing 
   at all.

   Put another way: a read-write lock is not a "recursive lock" despite 
   the fact that you can recursively take the lock for reading. It just 
   happens to count readers, and allow more than one (from _any_ context, 
   not just from the "same context").

Clue bat #3:

 - A recursive lock is very much _about_ recursion. It is explicitly about 
   the fact that the _same_ thread re-enters the lock, not about another 
   thread being in the locked region at the same time.

   See the difference? Big, big difference. A recursive lock will lock out 
   other thread contexts, even if it allows the current one to recurse. 
   Notice how the _only_ thing a recursive lock allows is that recursive 
   behavior, and nothing else.

   IOW, a "recursive lock" is explicitly designed for recursion. But that 
   ...
From: Linus Torvalds
Date: Monday, April 27, 2009 - 12:48 pm

Note to self: learn to count beyond three one of these days.

		Linus
--

From: Evgeniy Polyakov
Date: Monday, April 27, 2009 - 1:36 pm

Hi.


Just ot be sure readers will not lose the discssion topic: do you object
against naming or realizaion? If its about the former, does 'dog's
breath lock' proposed by Stephen fit?

-- 
	Evgeniy Polyakov
--

From: Linus Torvalds
Date: Monday, April 27, 2009 - 1:58 pm

I said _long_ ago that I thought the patch was fine. 

What I object to is people mis-representing the lock, and apparently 
having a really hard time admitting that having a good grounding in 
networking doesn't necessarily mean that you know everything about 

Is that just an attempt to avoid admitting that they were wrong about lock 
naming? And then trying to trivialize it by calling things by a 
_different_ wrong name, but silly enough that they hope people won't call 
them on it?

Why not just use the correct name? I think it was Mathieu that just 
suggested:

	[PATCH] netfilter: use bh disabling with per-cpu read-write lock

or just call it "netfilter: use per-CPU read-write lock".

Why are people so against calling things by their correct names? Why do 
certain network people seem to force a stupid and incorrect description, 
when they have been told (a) that it's wrong and (b) why it's wrong 
several times?

What's so hard with just doing the TechnicallyRightThing(tm) and 
describing it as such?

And btw - don't get me wrong. There are _other_ ways to do that locking 
too. You don't have to use a rwlock. You can do it with explicit counting, 
the way Eric's original patch did. But it would be wrong to call that one 
"recursive lock" too.

Or you _could_ use a recursive lock, but nobody has actually posted such 
patches. It would work. No question about it. And if it actually _were_ a 
recursive lock, I wouldn't have objected about the description saying it 
is (although I would probably have objected to it being unnecessarily 
complex, when a simpler rwlock or the explicit count thing would be 
sufficient).

But since the current simple patch is using a rwlock, why not just say 
that? Why call it something incorrect ("recursive lock") or nonsensical 
("dog's breath lock").

As I tried to explain with an analogy, networking people would (quite 
correctly) object to me calling a serial line an "ethernet cable". Why is 
it so hard for netfilter people to ...
From: Stephen Hemminger
Date: Monday, April 27, 2009 - 2:40 pm

On Mon, 27 Apr 2009 13:58:48 -0700 (PDT)

The part that concerns me is that the reader lock used in a nested manner on
same cpu may still be broken on -rt. Other than that it is just language
lawyering; violent agreement that the lock gets used multiple times by
same CPU. I never had the occasion to address this before
(and avoided such usage), but this legacy code exists and needs to

[PATCH] netfilter: Ceci n'est pas une serrure récurrente


Because meaning comes from context, and my meaning comes from different
--

From: Linus Torvalds
Date: Monday, April 27, 2009 - 3:24 pm

I think that's a valid concern, and I don't actually object to not using a 
rwlock, but using the "explicit counting + spinlock" that we had at one 
point. It _might_ even be faster, since a spinlock can be faster than a 
rwlock, and the (rare) case where you recurse you can then avoid the 
atomic entirely.

But EVEN IF YOU DO THAT, it's still wrong to call it a "recursive lock". 
Because it still wouldn't be one.

That's kind of my point, and always has been. It was why I objected to the 
original patch description by Dumazet. It wasn't a recursive lock back 
then _either_. For all the reasons I tried to explain to you, and you seem 
to not care about.

Btw, if you do use the "explicit count" case, then please please please 
make sure it's documented and bug-free. And dammit, correct documentation 
in this case very much talks about how it is _not_ a "recursive lock", but 
a spinlock that then has an explicit counter that avoids taking the lock 
entirely in one very specific path (that happens to be recursive).

The thing is, using a rwlock kind of implicitly documents all of that, 
because you can tell somebody who doesn't even read the code that it's a 
"per-cpu rwlock", and they then know what the semantics are (given that 
they know the kernel semantics for locking in the first place).

But once you start doing your own locking rules, you really have to 
explain why it works, and what it does. And you do have to be very 

If you don't care about the naming, then use the right one. 

And if you don't care about the naming, why do you then say I'm deluding 
myself, when I'm _correct_, and I _do_ happen to care about correct 
naming.

Locking really is important.  Code that gets locking wrong breaks in
really subtle and nasty ways.  And it sadly tends to "work" in testing,
since the breakage cases require just the right timing.  So locking
should be robust and as "obviously correct" as possible.

And naming really is important.  Misnaming things makes people ...
From: Linus Torvalds
Date: Monday, April 27, 2009 - 4:01 pm

So, just as an example, I would not object to the counter approach, as 
long as it really talks about the issues, and why it works (and as long 
that doesn't call the locks "recursive locks").

So if you are just nervous about rwlocks, then something like this might
work (needs testing, and other people to sanity-check it).

I left the commentary about "readers" and "writers", because in many
ways it's correct, and what the code actually does is very much to
emulate a reader-writer lock.  I put quotes around the uses in the
comments to high-light that it largely _acts_ as a reader-writer lock. 

Now, it would actually be a really _bad_ reader-writer lock in general, 
exactly because it's not very "atomic" (ie this would be a truly sucky and 
broken lock if we didn't have the strict per-cpu usage rules).

So it just so happens that the per-cpu'ness of the lock, together with the 
very limited way in which it is used, make that sucky implementation 
possible - and indeed possibly faster than the standard (and generic) 
kernel rwlock implementation.

So it's basically a special-case rwlock that happens to work due to the
specific rules. 

And exactly because it really wouldn't work in the absense of those
rules, those rules really do need to have big comments on them so that 
people don't then later forget about the limitations.

BTW: THIS IS TOTALLY UNTESTED.  I just cut-and-pasted the existing
rwlock version from one of the later patches, used the counting approach
from one of the earlier ones, and basically just added what I think
would be minimal required comments for this special case.  All of it was
written inside the mail reader, none of it has been compiled or tested
in any other way. 

Because it's exactly the "lock semantics awareness" that I think is so
important here (and that's why all the naming and commentary is so
critical). 

Btw, the "counter" could probably be just a single bit, since apparently
the nesting level is always supposed to be just one.  I ...
From: Linus Torvalds
Date: Monday, April 27, 2009 - 4:03 pm

This one was missing the "local_bh_enable()" at the end.

There may be other bugs, but that's the one I noticed immediately when 
reading what I sent out. Oops.

			Linus
--

From: Eric Dumazet
Date: Monday, April 27, 2009 - 11:58 pm

I am not sure my day job will permit me to polish a patch mixing all
the bits and comments. But I am glad we eventually got back spinlocks
which are probably better than rwlocks for implementing this stuff.

Instead of submitting a full patch again, we could first submit a new
 include file containg all comments and inline functions ?

This include file could be local to netfilter, with a big stick on
it to forbids its use on other areas (No changes in Documentation/ )

Then, as soon as we can go back to pure RCU solution, we can safely
delete this controversial-locking-nesting-per-cpu-thing ?


Instead of local/global name that Paul suggested, that was about
'global' locking all locks at the same time. Not any more the good
name IMHO

Maybe something like local/remote or owner/foreigner ?

xt_info_owner_lock_bh(), xt_info_owner_unlock_bh()
xt_info_foreigner_lock(), xt_info_foreigner_unlock()

One comment about this comment you wrote :

/*
 * The "writer" side needs to get exclusive access to the lock,
 * regardless of readers.  This must be called with bottom half
 * processing (and thus also preemption) disabled. 
 */
static inline void xt_info_wrlock(unsigned int cpu)
{
	spin_lock(&per_cpu(xt_info_locks, cpu).lock);
}

static inline void xt_info_wrunlock(unsigned int cpu)
{
	spin_unlock(&per_cpu(xt_info_locks, cpu).lock);
}

Its true that BH should be disabled if caller runs
on the cpu it wants to lock. 
For other ones (true foreigners), there is
no requirement about BH (current cpu could be interrupted
by a softirq and packets could fly)

We could use following construct and not require disabling BH
more than a short period of time.
(But preemption disabled for the whole duration)

preempt_disable(); // could be cpu_migration_disable();

int curcpu = smp_processor_id();
/*
 * Gather stats for current cpu : must disable BH
 * before trying to lock.
 */
local_bh_disable();
xt_info_wrlock(curcpu);
// copy stats of this cpu on my private data (not ...
From: David Miller
Date: Tuesday, April 28, 2009 - 4:53 am

From: Eric Dumazet <dada1@cosmosbay.com>

I say we merge Linus's locking idea into the XV patch, fixup the
commit message wording, and move on with life.

For something that's going to get deleted as soon as the faster grace
period RCU stuff is available, it has consumed an inordinate amount of
our time :-)

I might take a stab at this before hittng bed tonight, no promises :)
--

From: Ingo Molnar
Date: Tuesday, April 28, 2009 - 5:40 am

One more reason to factor out this code into general locking code.

The latest code looks a bit similar to the old big-reader-locks hack 
(which got dropped for good many eons ago and with which i deny any 
involvement with, such as having authored it. [oh, did i say that 
out loud? crap.]), implemented cleanly and properly.

IMHO this locking construct should be considered for 
linux/local_lock.h and kernel/local_lock.c. Even if the netfilter 
code drops its use soon afterwards ;-)

[ The _only_ thing i am worried about is the apparent fact that
  there's so much confusion about recursion versus read-access.
  Recursion might be hard to factor out of the netfilter code, and
  maybe it's not even possible realistically (we fought years with
  the BKL and are still fighting it) but if its harms are not even
  _realized_ that difficult task turns into an impossible task ;-) ]

	Ingo
--

From: David Miller
Date: Tuesday, April 28, 2009 - 6:43 am

From: Ingo Molnar <mingo@elte.hu>

If you can show me have to pass a per-cpu variable (the variable,
not a dereference of it) as an argument to an inline function,
I'll implement this :-)

It has to be dereferenced after local_bh_disable() for the
read side acquisition.
--

From: Mathieu Desnoyers
Date: Tuesday, April 28, 2009 - 6:52 am

The local_bh_disable() could be outside of the locking construct. This
would make it easier to adapt it to various users (irq disable, bh
disable, preempt disable) depending on the contexts from which they much
be protected.

And if it still does not work for some reason, using a #define is
discouraged, but could work.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: David Miller
Date: Tuesday, April 28, 2009 - 7:37 am

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

That's what I was hoping to avoid, things like macros and having
the callers of this thing expand the two parts of the operation.

What's the point in making this generic if it ends up being ugly
as hell?
--

From: Mathieu Desnoyers
Date: Tuesday, April 28, 2009 - 7:49 am

.. and what's the point in making it generic if it can be replaced
by a proper RCU implementation ? :-) I am not convinced of the added
value we get in making it a generic header this soon. I would wait for
other users to express similar needs, otherwise this could soon become
an orphaned piece of locking code.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: David Miller
Date: Tuesday, April 28, 2009 - 8:00 am

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

That is my opinion as well.

Anyways, here is a patch that builds, I haven't started working
on the commit message yet.

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 7b1a652..086e976 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,79 @@ extern void xt_proto_fini(struct net *net, u_int8_t af);
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
-				    struct xt_table_info *new);
+
+/*
+ * Per-CPU spinlock associated with per-cpu table entries, and
+ * with a counter for the "reading" side that allows a recursive
+ * reader to avoid taking the lock and deadlocking.
+ *
+ * "reading" is used by ip/arp/ip6 tables rule processing which runs per-cpu.
+ * It needs to ensure that the rules are not being changed while the packet
+ * is being processed. In some cases, the read lock will be acquired
+ * twice on the same CPU; this is okay because of the count.
+ *
+ * The write lock is used in two cases:
+ *    1. reading counter values
+ *       all rule processing need to be stopped and the per-CPU values are summed.
+ *
+ *    2. replacing tables
+ *       any readers that are using the old tables have to complete
+ *       before freeing the old table. This is handled by reading
+ *       as a side effect of reading counters
+ */
+struct xt_info_lock {
+	spinlock_t lock;
+	unsigned char readers;
+};
+DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks);
+
+/*
+ * Note: we need to ensure that preemption is disabled before acquiring
+ * the ...
From: Stephen Hemminger
Date: Tuesday, April 28, 2009 - 9:24 am

The x_tables are organized with a table structure and a per-cpu copies
of the counters and rules. On older kernels there was a reader/writer 
lock per table which was a performance bottleneck. In 2.6.30-rc, this
was converted to use RCU and the counters/rules which solved the performance
problems for do_table but made replacing rules much slower because of
the necessary RCU grace period.

This version uses a per-cpu set of spinlocks and counters to allow to
table processing to proceed without the cache thrashing of a global
reader lock and keeps the same performance for table updates.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
Probably same as dave/linus but with different comments.

 include/linux/netfilter/x_tables.h |   73 +++++++++++++++++++--
 net/ipv4/netfilter/arp_tables.c    |  125 ++++++++++--------------------------
 net/ipv4/netfilter/ip_tables.c     |  126 ++++++++++---------------------------
 net/ipv6/netfilter/ip6_tables.c    |  123 ++++++++++--------------------------
 net/netfilter/x_tables.c           |   53 ++++++++-------
 5 files changed, 204 insertions(+), 296 deletions(-)

--- a/include/linux/netfilter/x_tables.h	2009-04-28 08:01:59.942151297 -0700
+++ b/include/linux/netfilter/x_tables.h	2009-04-28 09:15:09.240990339 -0700
@@ -354,9 +354,6 @@ struct xt_table
 	/* What hooks you will enter on */
 	unsigned int valid_hooks;
 
-	/* Lock for the curtain */
-	struct mutex lock;
-
 	/* Man behind the curtain... */
 	struct xt_table_info *private;
 
@@ -434,8 +431,74 @@ extern void xt_proto_fini(struct net *ne
 
 extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
 extern void xt_free_table_info(struct xt_table_info *info);
-extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
-				    struct xt_table_info *new);
+
+/*
+ * Per-CPU spinlock associated with per-cpu table entries, and
+ * with a counter for the "reading" side that allows a recursive
+ * reader to avoid taking the lock and deadlocking.
+ ...
From: Linus Torvalds
Date: Tuesday, April 28, 2009 - 9:50 am

Ack. 

It could do with the update from Eric about how non-current CPU writelocks 
only require preemp-disable around get_counters() (and then the 
local_bh_disable() only around the current-CPU case).

I _think_ get_counters() is the only case that can use that optimization, 
but it's quite possible that it's worth doing especially for machines with 
lots of cores, if BH latency is an issue (and it might be).

Of course, for the lots-and-lots of cores case, even the preemption 
disable might be an issue. And then it really does get much more 
complicated. At that point, you probably want the RCU thing.

			Linus			
--

From: Linus Torvalds
Date: Tuesday, April 28, 2009 - 9:55 am

Btw, regardless, that's an incremental improvement, and does not negate 
the "Ack" part.

			Linus
--

From: David Miller
Date: Tuesday, April 28, 2009 - 10:37 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

I've applied this, thanks everyone!
--

From: Jeff Chua
Date: Wednesday, April 29, 2009 - 8:26 pm

I see the patch is already in Linus's tree that I just git pulled.
Tested for 200 iptables rules ... as fast as before the slow down.

real    0m0.211s
user    0m0.060s
sys     0m0.144s

Thank you all for fixing this bug!



Rafael, it's fixed. Please close the case.


Thanks,
Jeff.
--

From: David Miller
Date: Wednesday, April 29, 2009 - 8:31 pm

From: Jeff Chua <jeff.chua.linux@gmail.com>

Thanks for testing.
--

From: Eric Dumazet
Date: Friday, May 1, 2009 - 1:38 am

Small followup on this one, since the likely() were forgotten.
(I trimmed down CCed list, which was insane)

It makes a difference on my x86_32 machine, gcc-4.4.0

Thank you

[PATCH] netfilter: use likely() in xt_info_rdlock_bh()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 1b2e435..c9efe03 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -472,7 +472,7 @@ static inline void xt_info_rdlock_bh(void)
 
 	local_bh_disable();
 	lock = &__get_cpu_var(xt_info_locks);
-	if (!lock->readers++)
+	if (likely(!lock->readers++))
 		spin_lock(&lock->lock);
 }
 
@@ -480,7 +480,7 @@ static inline void xt_info_rdunlock_bh(void)
 {
 	struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks);
 
-	if (!--lock->readers)
+	if (likely(!--lock->readers))
 		spin_unlock(&lock->lock);
 	local_bh_enable();
 }

--

From: David Miller
Date: Friday, May 1, 2009 - 9:10 am

From: Eric Dumazet <dada1@cosmosbay.com>

Applied.
--

From: Paul E. McKenney
Date: Tuesday, April 28, 2009 - 8:42 am

The way I did this in treercu.c was to create an array of references
to the per-CPU data in question.  Not necessarily recommended, but
one way of doing it.  That said, one could argue that we should wait
until we have at least three users before creating a generic primitive.

And I just know that I am going to regret this deeply, but I cannot
resist posting the following URL:

http://en.wikipedia.org/wiki/Wikipedia:Avoid_Parkinson's_Bicycle_Shed_Effect

							Thanx, Paul
--

From: Christoph Lameter
Date: Tuesday, April 28, 2009 - 10:35 am

The new percpu allocator allows you to create a per cpu pointer and pass
it to functions.

per_cpu_ptr(pointer,cpu) is used to select an instance.
--

From: Linus Torvalds
Date: Tuesday, April 28, 2009 - 8:09 am

Well, I actually already suggested to David that he should just merge the 
last patch I saw floating around (with the "recursive" -> "readwrite" fix 
in the comment ;), so that we can at least get the basic issue fixed, and 
then we can tweak it a bit with smaller patches flying around.

And at least right now, the difference between the rwlock and the 
"count+spinlock" should be basically almost unnoticeable, and a very small 

I don't have any strogn preferences, but I'd almost prefer to not abstract 
things out even that much. It's already pretty well hidden inside 
<netfilter/x_tables.h>, I'd hate to add a new file just for this. 

As to just adding more commenting that it must not be used anywhere else, 

local/remote works for me, and yes, since we only take the remote side one 
CPU at a time, I guess "global" is misleading. But "owner/foreigner" 



Ack.

			Linus
--

From: Linus Torvalds
Date: Monday, April 27, 2009 - 4:32 pm

Btw, I think it was Paul who pointed out that technically it's probably 
better to call them "local" and "global" lockers instead of "readers" and 
"writers".

That also probably clarifies the rules on when you use one over the other 
(ie reading off all the statistics is a "global" operation, as is 
obviously replacing the tables).

Of course, "readers" and "writers" is something most Linux lock people are 
more used to. Or "brlock" for the old-timers, but that involves a heavy 
dose of bad taste. The new use is much nicer, especially since it never 
takes the global lock on _all_ cpu's (which was really a killer in so 
many ways).

			Linus
--

From: Peter Zijlstra
Date: Tuesday, April 28, 2009 - 12:41 am

exclusive vs non-exclusive is what the literature would call them in
--

From: Paul E. McKenney
Date: Tuesday, April 28, 2009 - 7:22 am

I would argue that the non-exclusive category includes both reader-writer
locking and local-global locking.  That said, we have an unusual variant
of local-global in this case, as the global processing acquires only one
of the locks at a time.

							Thanx, Paul
--

From: Jan Engelhardt
Date: Tuesday, April 28, 2009 - 12:42 am

It could be worse. He could be running Ethernet over serial, e.g. L2TP.
Or his serial line is a TP cable with RJ45 plugs - consumers like
to call that Ethernet (cable) too.
--

From: Mathieu Desnoyers
Date: Sunday, April 26, 2009 - 12:31 pm

Why do you need to disable bottom halves on the read-side ? You could
probably just disable preemption, given this lock is nestable on the

[...]

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Eric Dumazet
Date: Sunday, April 26, 2009 - 1:55 pm

It may not be obvious, but subject already raised on this list, so I'll
try to be as precise as possible (But may be wrong on some points, I'll
let Patrick correct me if necessary)

ipt_do_table() is not a readonly function returning a verdict.

1) It handles a stack (check how is used next->comefrom) that seems to
be stored on rules themselves. (This is how I understand this code)
This is safe as each cpu has its own copy of rules/counters, and BH protected.

2) It also updates two 64 bit counters (bytes/packets) on each matched rule.

3) Some netfilter matches/targets probably rely on the fact their handlers
are run with BH disabled by their caller (ipt_do_table()/arp/ip6...)

These must be BH protected (and preempt disabled too), or else :

1) A softirq could interrupt a process in the middle of ipt_do_table()
and corrupt its "stack".

2) A softirq could interrupt a process in ipt_do_table() in the middle
 of the ADD_COUNTER(). Some counters could be corrupted.

3) Some netfiler extensions would break.

Previous linux versions already used a read_lock_bh() here, on a single
and shared rwlock, there is nothing new on this BH locking AFAIK.

Thank you

--

From: Mathieu Desnoyers
Date: Sunday, April 26, 2009 - 2:39 pm

Thanks for the explanation. It might help to document the role of bh
disabling for the reader in a supplementary code comment, otherwise one
might think it's been put there to match the bottom half disabling used
on the write-side, which has the supplementary role of making sure bh
will not deadlock (and this precise behavior is not needed usually on
the read-side).

One more point :

 *    1. reading counter values
 *       all readers need to be stopped and the per-CPU values are summed.

Maybe it's just me, but this sentence does not seem to clearly indicate
that we have :

for_each_cpu()
  write lock()
  read data
  write unlock()

One might interpret it as :

for_each_cpu()
  write lock()

read data

for_each_cpu()
  write unlock()

Or maybe it's just my understanding of English that's not perfect.
Anyhow, rewording this sentence might not hurt. Something along the
lines of :

"reading counter values
 all readers are iteratively stopped to have their per-CPU values
 summed"

This is an important difference, as this behaves more like a RCU-based
mechanism than a global per-cpu read/write lock where all the write
locks would be taken at once.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Paul E. McKenney
Date: Tuesday, April 21, 2009 - 11:34 am

Sigh!!!

Part of the problem is that back in 1971, Courtois, Heymans, and Parnas
foolishly named their article "Concurrent Control with 'Readers' and
'Writers'", which lead to the name "reader-writer lock".  This started
really biting around 1991, when Hsieh and Weihl created a reader-optimized
lock similar to brlock, but with each of the per-CPU locks being
exclusive rather than each being an rwlock.

The problem was that Hsieh's and Weihl's lock was more than just
a reader-writer lock.  It could also be used (and -was- used) as a
local/global lock, where for example you acquire your own lock to make
local changes, and acquire all of the locks to obtain a consistent view
of global state.  In which case, you would read-acquire the lock in
order to write, and write-acquire the lock in order to read.  Blech.

So, would it help if the function names names in this patch said something
about "local" and "global" rather than "read" and "write"?

--

From: Linus Torvalds
Date: Tuesday, April 21, 2009 - 1:14 pm

Maybe.

I do agree that the way netfilter would use the lock is somewhat different 
from a normal 'reader-writer' lock, since this special case of readers is 

Oh, I would have no problem at all with 'local' and 'global', in fact it 
would explain _why_ that read-write lock works.

The problem with naming I have is with the 'recursive' part. There is no 
ambiguity what-so-ever about what a "recursive lock" is (at least of the 
traditional kind), and the lock described here is not it.

So don't get me wrong - I could certainly live with a special lock in the 
networking. BUT:

 - it had damn well be documented as to what it does, and why it works

 - and it had better actually _work_, and not be buggy.

I suspect that using our regular reader-writer locks works well enough, 
and yes, it's probably worth making it really clear that the reader 
variety can only ever be used in the "local" form. That kind of is implied 
by the whole function, though.

And if somebody wants to open-code it as a per-cpu spinlock and a per-cpu 
'local counter', I can live with that too, but at that point I want people 
to just be a lot more careful, and also document it a lot more. 

		Linus
--

From: Paul E. McKenney
Date: Monday, April 20, 2009 - 4:44 pm

Or we use the trick Eric suggested and Steve employed in the most recent
patch.  ;-)

An alternative would be for the update code to acquire but one lock at a
time, but this would likely require another lock to exclude other
updaters and I believe would also require restructuring the count
accumulation.

So Steve's current patch seems a bit less intrusive, overall.

							Thanx, Paul
--

From: Linus Torvalds
Date: Wednesday, April 15, 2009 - 5:02 pm

I never understood that dismissal.

The new rule _will_ be visible as we return to user space. It's just that 
old packets may still be in flight in other queues.

But that is true even _without_ the "synchronize_net()". The old packets 
just had to make it slightly further in the queueing - but as far as user 
space is concerned, there is absolutely _zero_ difference between the two. 

I really personally would prefer the RCU approach too. I don't think 
rwlocks are any more cringe-worthy than spinlocks, although it is true 
that they tend to be slightly more expensive.

The pure RCU "just get rid of the unnecessary 'serialze_net()'" approach 
seems to be clearly superior to either.

		Linus
--

From: Eric Dumazet
Date: Wednesday, April 15, 2009 - 11:26 pm

We had to RCU quiesce to be sure old rules were not any more used before
freeing them. Alternative is to defer freeing via call_rcu() but
subject to OOM.

With 200 basic rules, size of rules table is about 40960 bytes per cpu.
(88 pages taken on vmalloc virtual space on my 8 cpus machine)
0xfcaf8000-0xfcb03000   45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc
0xfcb04000-0xfcb0f000   45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc
0xfcb10000-0xfcb1b000   45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc
0xfcb1c000-0xfcb27000   45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc
0xfcb28000-0xfcb33000   45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc
0xfcb34000-0xfcb3f000   45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc
0xfcb40000-0xfcb4b000   45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc
0xfcb4c000-0xfcb57000   45056 xt_alloc_table_info+0xa8/0xd0 pages=10 vmalloc

This kind of monolithic huge object is hard to handle with RCU semantic,
more suitable for handling set of small objects (struct file for example),


In my humble opinion, this is a reasonnable compromise, and Stephen patch
version 4 is ok for me.


--

From: Paul E. McKenney
Date: Thursday, April 16, 2009 - 7:33 am

To be honest, the per-CPU-locking approach looks pretty good to me for
this particular case.  That said, the problem you mention above does
have some straightforward solutions.

One solution to consider would be to do the call_rcu(), but to keep a
counter of the number of calls, perhaps something like the following:

	call_rcu(...);
	if (++count > 50) {
		synchronize_rcu();
		count = 0;
	}

Of course, you might (or might not) need to atomically increment count,
and you of course would want to replace the "50" with some symbolic
constant, or perhaps even a variable whose value might be determined by
the size of the object and/or the amount of memory available.


Again, the per-CPU-locking approach looks good to me, as well.

But if it turns out that we really do need an RCU implementation with
really short grace periods (tens of microseconds typical latency on
mid-range multiprocessors, those with SGI Altix systems would suffer
a bit more), then it can be done.  It would need to be yet another
implementation of RCU for the following reasons:

o	High update-side overhead (broadcast IPIs via
	smp_call_function().  This is not a problem in this case,
	but would be a showstopper for (say) dcache.  I don't know
	of any way of fixing this.

o	Defeats power-conservation measures by waking up every CPU
	at every grace period.	(Might be fixable, for example, by using
	the same dyntick tricks used by preemptable and hierarchical RCU.
	But not recommended for first implementation.)

o	Poor update-side scalability.  (Definitely fixable, but the
	fix should be to the underlying smp_call_function() primitives.)

o	No ability to share grace periods among concurrent
	synchronize_rcu() primitves.  (Definitely fixable, but not
	recommended until needed.  Unlikely to be needed -- after all,
	if your grace period completes in 10 microseconds, just how many
	concurrent updates do you expect there to be???)

o	No call_rcu() style primtive.  (Definitely fixable, but not
	recommended ...
From: David Miller
Date: Tuesday, April 14, 2009 - 8:23 pm

From: Stephen Hemminger <shemminger@vyatta.com>

Things seem to be winding down, good. :-)

I'll let Patrick McHardy merge this to me with his other pending
netfilter fixes.

Thanks!
--

From: Stephen Hemminger
Date: Tuesday, April 14, 2009 - 10:19 am

On Tue, 14 Apr 2009 17:49:57 +0200

You are right, doing something with barrier would be safer there.
How about using xchg?

@@ -682,26 +668,19 @@ xt_replace_table(struct xt_table *table,
 	      struct xt_table_info *newinfo,
 	      int *error)
 {
-	struct xt_table_info *oldinfo, *private;
+	struct xt_table_info *private;
 
 	/* Do the substitution. */
-	mutex_lock(&table->lock);
 	private = table->private;
 	/* Check inside lock: is the old number correct? */
 	if (num_counters != private->number) {
 		duprintf("num_counters != table->private->number (%u/%u)\n",
 			 num_counters, private->number);
-		mutex_unlock(&table->lock);
 		*error = -EAGAIN;
 		return NULL;
 	}
-	oldinfo = private;
-	rcu_assign_pointer(table->private, newinfo);
-	newinfo->initial_entries = oldinfo->initial_entries;
-	mutex_unlock(&table->lock);
-
-	synchronize_net();
-	return oldinfo;
+	newinfo->initial_entries = private->initial_entries;
+	return xchg(&table->private, newinfo);
 }
 EXPORT_SYMBOL_GPL(xt_replace_table);
 
P.s: we all missed the ordering bug in the RCU version??

--

From: Stephen Hemminger
Date: Saturday, April 11, 2009 - 8:50 am

On Fri, 10 Apr 2009 21:15:33 -0700

In this case it is to make sure that the old counter table is no



And we need snapshot of all counters (which are not even an array but
--

From: Paul E. McKenney
Date: Saturday, April 11, 2009 - 10:43 am

Ah!!!

Is this particular code path one of the ones responsible for the


OK.  However, the code seems to swap in a new set of counters intended
to account for subsequent packets.  So I was assuming that "snapshot"
meant that a given packet had to be accounted for precisely, but that
it was OK to do so either in the old set or the new set, as long as it
appeared in exactly one of the two sets.

If this assumption is accurate, then something like the following should
work.

If my assumption is wrong, what exactly does this snapshot need to do?

--

From: Linus Torvalds
Date: Saturday, April 11, 2009 - 11:57 am

Btw, I think that's a bad assumption.

The thing is, nobody can really care if the new rules are in effect or 
not, because the thing you race with is not the "return to user space" 
part, but the incoming packets.

And those incoming packets might have been incoming before the rules were 
set up too.

So I seriously doubt you need to synchronize with any returning to user 
space. What you want to synchronize with is then later actions that do 
things like turning on the interface that the rules are attached to etc!

So I would suggest:

 - remove the synchronize_net() entirely. Replace it with just freeing the 
   old rules using RCU.

 - new packets will always end up seeing the new rules. That includes the 
   case of somebody doing "ifconfig eth0 up" that enables a new source of 
   packets, so there are no real security issues.

 - if you enabled your network interfaces before you updated your packet 
   filtering rules, you already had a window where packets would come in 
   with the old rules, so doing a "synchronize_net()" in no way protects 
   against any race conditions anyway.

Am I missing something?

			Linus
--

From: Paul E. McKenney
Date: Saturday, April 11, 2009 - 5:34 pm

The issue at this point seems to be the need to get accurate snapshots
of various counters -- there are a number of Linux networking users who
need to account for every byte flowing through their systems.  However,
it is also necessary to update these counters very efficiently, given
that they are updated on a per-packet basis.  The current approach is
as follows:

1.	Install a new set of counters.

2.	Wait for a grace period to elapse.

3.	At this point, we know that all subsequent counting will happen
	on the new set of counters.

4.	Add the value of the old set of counters to the new set of
	counters.

5.	Copy the old set of counters up to user space.

So we get a good snapshot in #5, while #4 ensures that we don't lose
any counts when taking future snapshots.  Unfortunately, #2 hits us
with grace-period latencies on the critical path.

We are going through the following possibilities:

o	Stick with the current approach, and ask people to move to
	new batch-oriented interfaces.  However, a 30x decrease in
	performance is pretty grim, even for an old-style interface.

o	Use various atomic tricks to get an immediate snapshot of the
	old counters after step 1.  Make step 3 use call_rcu() instead
	of synchronize_rcu(), and then step 4 happens off the
	critical path.

	This approach moves the RCU grace period off of the critical
	path, but the atomic tricks are extremely ugly on 32-bit SMP
	machines.  32-bit UP machines and 64-bit machines are not
	too bad, though the 32-bit UP case does add preemption-disable
	overhead on the counter-update fastpath.

o	Provide some sort of expedited synchronize_rcu().  This might
	be able to decrease the hit from 30x down to maybe 5x.
	But I might need to do this for the fast-boot folks anyway,
	though I am first trying to get away with just speeding
	up synchronized_rcu().  Though I was not thinking in terms
	of 6x, let alone 30x.

	Please note that this would not be a drop-in replacement for
	synchronize_rcu().  One would ...
From: Evgeniy Polyakov
Date: Sunday, April 12, 2009 - 12:23 am

Hi.


If we add or change the rule we can not know if iptables' return to
userspace does mean that rule started to act. There may be other queues
already filled with the packets which could match the new rule (like
receiving socket buffer). So effectively we want it to take effect very
soon. What if there will be a timer which will synchronize RCU-added states,
so if we update single rule - it will take effect in a second, and if we
update bunch of them - during the delay second we pretty much can load
them all.

-- 
	Evgeniy Polyakov
--

From: Stephen Hemminger
Date: Sunday, April 12, 2009 - 9:06 am

On Sat, 11 Apr 2009 17:34:45 -0700

We could also try:
  * per-cpu spinlock on counters (instead of synchronize_net). 
    When doing update, just acquire
    lock on that cpu and futz with counters then. Overhead should
    still be less than 2.6.29 and earlier global rwlock

  * synchonize_rcu/synchronize_net is more guarantee than needed?

  * use on_each_cpu() somehow to do grace periood?

  * Add a cond_resched() into net_rx_action which might cause rx processing
    to get out of rcu sooner? also in transmit packet scheduler.
--

From: Paul E. McKenney
Date: Sunday, April 12, 2009 - 10:30 am

This one makes a lot of sense to me.  The overhead of an uncontended
lock is pretty small on most systems.  This would also mean that you

If you really do need to swap the counters themselves, you -might- also
need call_rcu() to dispose of them.  But it should possible to do that

You could certainly use something like smp_call_function() to collect
the other CPUs' counter values -- just disable interrupts across the
increments for architectures that cannot atomically increment a 64-bit
value.  (And it only needs to be atomic with respect to an interrupt,

This might help some, but would probably only give a few tens of percent
improvement.

							Thanx, Paul
--

Previous thread: Dear Email User by Cyberus UpgradeTeam on Friday, April 10, 2009 - 8:56 am. (1 message)

Next thread: [PATCH] Bonding: fix zero address hole bug in arp_ip_target list by Brian Haley on Friday, April 10, 2009 - 7:41 pm. (3 messages)