Change the UDP hash lock from an rwlock to RCU.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
---
include/net/udp.h | 9 +++++----
net/ipv4/udp.c | 47 +++++++++++++++++++++++++++--------------------
net/ipv6/udp.c | 17 +++++++++--------
3 files changed, 41 insertions(+), 32 deletions(-)
diff --git a/include/net/udp.h b/include/net/udp.h
index addcdc6..35aa104 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -51,7 +51,7 @@ struct udp_skb_cb {
#define UDP_SKB_CB(__skb) ((struct udp_skb_cb *)((__skb)->cb))
extern struct hlist_head udp_hash[UDP_HTABLE_SIZE];
-extern rwlock_t udp_hash_lock;
+extern spinlock_t udp_hash_wlock;
/* Note: this must match 'valbool' in sock_setsockopt */
@@ -112,12 +112,13 @@ static inline void udp_lib_hash(struct sock *sk)
static inline void udp_lib_unhash(struct sock *sk)
{
- write_lock_bh(&udp_hash_lock);
- if (sk_del_node_init(sk)) {
+ spin_lock_bh(&udp_hash_wlock);
+ if (sk_del_node_init_rcu(sk)) {
inet_sk(sk)->num = 0;
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
}
- write_unlock_bh(&udp_hash_lock);
+ spin_unlock_bh(&udp_hash_wlock);
+ synchronize_rcu();
}
static inline void udp_lib_close(struct sock *sk, long timeout)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 57e26fa..1b65cb6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -112,7 +112,8 @@ DEFINE_SNMP_STAT(struct udp_mib, udp_stats_in6) __read_mostly;
EXPORT_SYMBOL(udp_stats_in6);
struct hlist_head udp_hash[UDP_HTABLE_SIZE];
-DEFINE_RWLOCK(udp_hash_lock);
+DEFINE_SPINLOCK(udp_hash_wlock);
+EXPORT_SYMBOL(udp_hash_wlock);
int sysctl_udp_mem[3] __read_mostly;
int sysctl_udp_rmem_min __read_mostly;
@@ -155,7 +156,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
int error = 1;
struct net *net = sock_net(sk);
- write_lock_bh(&udp_hash_lock);
+ spin_lock_bh(&udp_hash_wlock);
if (!snum) {
int i, low, high, remaining;
@@ -225,12 +226,12 @@ gotit:
sk->sk_hash = ...UDP central rwlock can hurt performance, because of cache line ping pong, so your patch really makes sense. Me wondering what impact this synchronize_rcu() can have on mono-threaded VOIP applications using lot of UDP sockets. What is the maximum delay of this function ? For "struct file" freeing, we chose call_rcu() instead of synchronize_rcu() Maybe we could add a generic rcu head to struct sock, and use call_rcu() in sk_prot_free() for sockets needing RCU (udp after your patch is applied, maybe tcp on future patches, while I believe previous work on the subject concluded RCU was not giving good results for short lived http sessions) ? Or just add SLAB_DESTROY_BY_RCU to slab creation in proto_register() for "struct proto udp_prot/udpv6_prot" so that kmem_cache_free() done in sk_prot_free() can defer freeing to RCU... --
From: Eric Dumazet <dada1@cosmosbay.com> The cost is enormous, we really can't use it here. I have a patch that did top-level socket destruction using RCU, and that didn't use synchronize_rcu(), and that killed connection rates by up to %20. I can only imagine what the cost would be if I had to add such a call in there. Really, I can't consider these changes seriously, as-is. --
Would using SLAB_DESTROY_BY_RCU be ok, or would that be too expensive? -corey --
I tested skb destruction via RCU path, and got 2.5 times worse numbers with small-packets-bulk-transfer workload. For more details http://tservice.net.ru/~s0mbre/blog/devel/networking/2006_12_05_1.html -- Evgeniy Polyakov --
Was this with regular RCU freeing? This will cool down the cacheline before frees. You need SLAB_DESTROY_BY_RCU to keep the objects cache hot. --
Indeed! But care is required -- SLAB_DESTROY_BY_RCU permits objects to be freed and reallocated while a reader holds a reference. The only guarantee is that the -type- of the data structure will not change while a reader holds a reference. With something like UDP, this might well be sufficient. Just be careful! ;-) Thanx, Paul --
Right so after the hash lookup operation you are not assured that the object has not been freed or even reallocated for a different purpose. So after finding the pointer to the object two things need to happen (under rcu_lock): 1. Verify that the object is still in use 2. Verify that the object is matching the hash If not then the operation needs to be redone because we have a stale hash pointer. --
OK... so restart full lookup at the begining of hash chain if we detect points 1 or 2 invalid. Not that expensive since everything should be cache hot :) One has to take care to group all components (keys to compute the hash, and the *next* pointer) in one cache line to minimize cache misses, since we now need to access them all to compute/check hash value. Now if a freed object is re-inserted with same hash value, same hash chain, we'll also restart the lookup, but it is harmless. --
There is also the possibility that the element will be reused, but placed in the same list that it resided in last time. A reader referencing that item during this process might then be relocated in the list, which could either cause the reader to skip elements in the list (if the new element is relocated farther down the list) or endlessly loop through the list (if new elements were relocated closer to the list head and this free-reallocate process repeated and the reader was insanely unlucky). Thanx, Paul --
I believe there were no SLAB_DESTROY_BY_RCU 2 yars ago :) It was pure call_rcu(&skb->rcu, real_skb_freeing), where real_skb_freeing() just did usual kfree(). -- Evgeniy Polyakov --
Right. That results in cacheline cooldown. You'd want to recycle the object as they are cache hot on a per cpu basis. That is screwed up by the delayed regular rcu processing. We have seen multiple regressions due to cacheline cooldown. The only choice in cacheline hot sensitive areas is to deal with the complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU. --
Yes, I suppose you are right about TCP sessions, that should stay as they are.
Then if we use call_rcu() RCU freeing only for UDP sockets, we should get rid of
taking this rwlock each time we handle an incoming datagram, and introduce
no extra cost for other sockets.
Most UDP sockets are setup for long periods (RTP trafic), or if an application really
wants to {open/send or receive one UDP frame/close} many sockets, it already hits
RCU handling of its file structures and should not be slowed down that much.
By 'long period' I mean thousand of packets sent/received by each RTP session, being
voice (50 packets/second) or even worse video...
--
Does DNS with port randomization need short lived sockets? /Benny --
Yes very true, but current allocation of a random port can be very expensive, since we scan all the UDP hash table to select the smaller hash chain. We stop the scan if we find an empty slot, but on machines with say more than 200 bound UDP sockets, they are probably no empty slots. (UDP_HTABLE_SIZE is 128) bind(NULL port) algo is then O(N), N being number of bound UDP sockets. So heavy DNS servers/proxies probably use a pool/range of pre-allocated sockets to avoid costs of allocating/freeing them ? If they dont care about that cost, the extra call_rcu() will be unnoticed. For pathological (yet very common :) ) cases like single DNS query/answer, RCU would mean : Pros : - one few rwlock hit when receiving the answer (if any) Cons : - one call_rcu() to delay socket freeing/reuse after RCU period. So it might be a litle bit more expensive than without RCU I agree I am more interested in optimizing UDP stack for heavy users like RTP servers/proxies handling xxx.000 packets/second than DNS users/servers. Shame on me :) (2 weeks ago, Corey mentioned a 10x increase on UDP throughput on a 16-way machine, that sounds promising) --
On Tue, 07 Oct 2008 14:59:20 +0200 The idea of keeping chains short is the problem. That code should just be pulled because it doesn't help that much, and also creates bias on the port randomization. --
From: Stephen Hemminger <shemminger@vyatta.com> I have that patch from Vitaly Mayatskikh which does exactly this. I keep looking at it, but I can't bring myself to apply it since I'm not completely convinced. --
On Tue, 07 Oct 2008 13:55:48 -0700 (PDT) Some one on a busy server should run it and measure the delta in hash chains. I would, but don't run anything that has more than a few UDP's open. --
Vitaly patch might be appropriate if only few UDP ports are opened. We could zap the code to search short chains and extend Vitaly's idea with following patch : [PATCH] udp: Improve port randomization Current UDP port allocation is suboptimal. We select the shortest chain to chose a port (out of 512) that will hash in this shortest chain. First, it can lead to give not so ramdom ports and ease give attackers more opportunities to break the system. Second, it can consume a lot of CPU to scan all table in order to find the shortest chain. Third, in some pathological cases we can fail to find a free port even if they are plenty of them. This patch zap the search for a short chain and only use one random seed. Problem of getting long chains should be addressed in another way, since we can obtain long chains with non random ports. Based on a report and patch from Vitaly Mayatskikh Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
From: Eric Dumazet <dada1@cosmosbay.com> I really like this, and I've applied it to net-next-2.6 I think the "increment until back in range" do/while loop can be improved a bit. It can spin for more than 60,000 iterations in some edge case scenerios as-is :-) Ugh, there's also that expensive divide in there for the modulus. --
RCUification of UDP hash tables
Goals are :
1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path. Using an array of rwlocks (one per
slot for example is not an option in this regard)
Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.
2) No expensive operations in the socket bind/unhash phases :
- No expensive synchronize_rcu() calls.
- No added rcu_head in socket structure, increasing memory needs,
but more important, forcing us to use call_rcu() calls,
that have the bad property of making sockets structure cold.
(rcu grace period between socket freeing and its potential reuse
make this socket being cold in CPU cache).
David did a previous patch using call_rcu() and noticed a 20%
impact on TCP connection rates.
Quoting Cristopher Lameter :
"Right. That results in cacheline cooldown. You'd want to recycle
the object as they are cache hot on a per cpu basis. That is screwed
up by the delayed regular rcu processing. We have seen multiple
regressions due to cacheline cooldown.
The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
- Because udp sockets are allocated from dedicated kmem_cache,
use of SLAB_DESTROY_BY_RCU can help here.
Theory of operation :
---------------------
As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.
Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups in the same time.
In order to avoid loops, a reader must check each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, lookup must start again at the ...On ipv6 side, I forgot to add a check before compute_score(), like I did on ipv4
+ rcu_read_lock();
+begin:
+ result = NULL;
+ badness = -1;
+ sk_for_each_rcu(sk, node, &hslot->head) {
< BEGIN HERE missing part --->
/*
* lockless reader, and SLAB_DESTROY_BY_RCU items:
* We must check this item was not moved to another chain
*/
if (udp_hashfn(net, sk->sk_hash) != hash)
goto begin;
< END missing part --->
score = compute_score(sk, net, hnum, saddr, sport, daddr, dport, dif);
if (score > badness) {
result = sk;
badness = score;
}
}
- if (result)
- sock_hold(result);
- read_unlock(&hslot->lock);
+ if (result) {
+ if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
+ result = NULL;
+ else if (unlikely(compute_score(result, net, hnum, saddr, sport,
+ daddr, dport, dif) < badness)) {
+ sock_put(result);
+ goto begin;
+ }
+ }
I will submit a new patch serie tomorrow, with :
Patch 1 : spinlocks instead of rwlocks, and bug spotted by Christian Bell
Patch 2 : splited on two parts (2 & 3) , one for IPV4, one for IPV6,
Thanks
--
From: Eric Dumazet <dada1@cosmosbay.com> I very much look forward to this :-) I like these changes and can't wait to add them to net-next-2.6 --
Thanks David, please find first updated patch 1 Thanks to Christian Bell and Stephen for their usefull review. [PATCH] udp: introduce struct udp_table and multiple spinlocks UDP sockets are hashed in a 128 slots hash table. This hash table is protected by *one* rwlock. This rwlock is readlocked each time an incoming UDP message is handled. This rwlock is writelocked each time a socket must be inserted in hash table (bind time), or deleted from this table (close time) This is not scalable on SMP machines : 1) Even in read mode, lock() and unlock() are atomic operations and must dirty a contended cache line, shared by all cpus. 2) A writer might be starved if many readers are 'in flight'. This can happen on a machine with some NIC receiving many UDP messages. User process can be delayed a long time at socket creation/dismantle time. This patch prepares RCU migration, by introducing 'struct udp_table and struct udp_hslot', and using one spinlock per chain, to reduce contention on central rwlock. Introducing one spinlock per chain reduces latencies, for port randomization on heavily loaded UDP servers. This also speedup bindings to specific ports. udp_lib_unhash() was uninlined, becoming to big. Some cleanups were done to ease review of following patch (RCUification of UDP Unicast lookups) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/sock.h | 2 include/net/udp.h | 25 ++-- include/net/udplite.h | 2 net/ipv4/udp.c | 209 +++++++++++++++++++++++----------------- net/ipv4/udp_impl.h | 4 net/ipv4/udplite.c | 13 +- net/ipv6/udp.c | 112 +++++++++++---------- net/ipv6/udp_impl.h | 4 net/ipv6/udplite.c | 8 - 9 files changed, 215 insertions(+), 164 deletions(-)
From: Eric Dumazet <dada1@cosmosbay.com> Applied, please (re-)send the current version of patch 2 as well. Thanks. --
I found a fatal bug in /proc/net/udp handling, sorry. [PATCH] udp: udp_get_next() should use spin_unlock_bh() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
From: Eric Dumazet <dada1@cosmosbay.com> Applied, thanks Eric. --
Please find updated patch 2
Missing check in __udp6_lib_lookup() was added,
and based on spinlock version ([PATCH] udp: introduce struct udp_table an=
d multiple spinlocks)
Thank you
[PATCH] udp: RCU handling for Unicast packets.
Goals are :
1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path.
Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.
2) No expensive operations in the socket bind/unhash phases :
- No expensive synchronize_rcu() calls.
- No added rcu_head in socket structure, increasing memory needs,
but more important, forcing us to use call_rcu() calls,
that have the bad property of making sockets structure cold.
(rcu grace period between socket freeing and its potential reuse
make this socket being cold in CPU cache).
David did a previous patch using call_rcu() and noticed a 20%
impact on TCP connection rates.
Quoting Cristopher Lameter :
"Right. That results in cacheline cooldown. You'd want to recycle
the object as they are cache hot on a per cpu basis. That is screwed
up by the delayed regular rcu processing. We have seen multiple
regressions due to cacheline cooldown.
The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
- Because udp sockets are allocated from dedicated kmem_cache,
use of SLAB_DESTROY_BY_RCU can help here.
Theory of operation :
---------------------
As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.
Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups in the same time.
In order to avoid loops, a reader must check each socket found in a chain=
really ...From: Eric Dumazet <dada1@cosmosbay.com> Applied, thanks Eric. --
I believe there is a race in this patch:
+ sk_for_each_rcu(sk, node, &hslot->head) {
+ /*
+ * lockless reader, and SLAB_DESTROY_BY_RCU items:
+ * We must check this item was not moved to another chain
+ */
+ if (udp_hashfn(net, sk->sk_hash) != hash)
+ goto begin;
score = compute_score(sk, net, hnum, saddr, sport, daddr, dport, dif);
if (score > badness) {
result = sk;
badness = score;
}
}
If the socket is moved from one list to another list in-between the time
the hash is calculated and the next field is accessed, and the socket
has moved to the end of the new list, the traversal will not complete
properly on the list it should have, since the socket will be on the end
of the new list and there's not a way to tell it's on a new list and
restart the list traversal. I think that this can be solved by
pre-fetching the "next" field (with proper barriers) before checking the
hash.
I also might be nice to have a way to avoid recomputing the score the
second time, perhaps using a sequence number of some type.
-corey
--
You are absolutely right. As we *need* next pointer anyway for the prefet= ch(), Well, computing score is really cheap, everything is in cache, granted the chain was not really huge and high score item at the beginning. Adding yet another field in sock structure should be avoided if possible.= Thanks [PATCH] udp: introduce sk_for_each_rcu_safenext() Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149= 933110672d (udp: RCU handling for Unicast packets.) "If the socket is moved from one list to another list in-between the tim= e=20 the hash is calculated and the next field is accessed, and the socket=20 has moved to the end of the new list, the traversal will not complete=20 properly on the list it should have, since the socket will be on the en= d=20 of the new list and there's not a way to tell it's on a new list and=20 restart the list traversal. I think that this can be solved by=20 pre-fetching the "next" field (with proper barriers) before checking th= e=20 hash." This patch corrects this problem, introducing a new sk_for_each_rcu_safen= ext() macro. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/linux/rculist.h | 17 +++++++++++++++++ include/net/sock.h | 4 ++-- net/ipv4/udp.c | 4 ++-- net/ipv6/udp.c | 4 ++-- 4 files changed, 23 insertions(+), 6 deletions(-)
You also need the appropriate smp_wmb() in udp_lib_get_port() after sk_hash is set, I think, so the next field is guaranteed to be changed after the hash value is changed. -corey --
Not sure about this one Corey.
If a reader catches previous value of item->sk_hash, two cases are to be taken into :
1) its udp_hashfn(net, sk->sk_hash) is != hash
-> goto begin : Reader will redo its scan
2) its udp_hashfn(net, sk->sk_hash) is == hash
-> next pointer is good enough : it points to next item in same hash chain.
No need to rescan the chain at this point.
Yes we could miss the fact that a new port was bound and this UDP message could be lost.
If we force a smp_wmb(), reader would fetch pointer to begin of list.
1) its udp_hashfn(net, sk->sk_hash) is != hash
-> goto begin : Reader will redo its scan : next pointer value had no meaning
2) its udp_hashfn(net, sk->sk_hash) is == hash
->next pointer "force reader" to rescan the chain, but wont find new items.
In any case, we cannot lost an UDP message sent to a stable port (previously bound)
Thanks
--
3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was removed, freed, reallocated, and then readded with the same hash value, possibly carrying the reader to a new position in the same list. You might well cover this (will examine your code in detail on my plane flight starting about 20 hours from now), but thought I should point it out. ;-) --
If I understand this, without the smp_wmb(), it is possible that the
next field can be written to main memory before the hash value is
written. If that happens, the following can occur:
CPU1 CPU2
next is set to NULL (end of new list)
fetch next
calculate hash and compare to sk_hash
sk_hash is set to new value
So I think in the above cases, your case #2 is not necessarily valid
without the barrier.
And another possible issue. If sk_hash is written before next, and CPU1
is interrupted before CPU2, CPU2 will continually spin on the list until
CPU1 comes back and moves it to the new list. Note sure if that is an
issue.
-corey
--
Well, if this item is injected to the same chain, next wont be set to NULL. That would mean previous writers deleted all items from the chain. In this case, readers can see NULL, it is not a problem at all. List is/was empty. An application cannot complain a packet is not handled if its bind() syscall is not yet completed :) Probably not. Previously, readers were spining on read_lock(), when a writer was inside its critical section (write_lock()/write_unlock()). So instead of spining inside read_unlock(), issuing stupid memory transactions, the readers can now spin reading hash chain and populate cpu cache :) --
I put my comment in the wrong place, I wasn't talking about being If the item is injected onto the end of another chain, the next field will be NULL and you won't detect a hash mismatch. It's basically the same issue as the previous race, though a lot more subtle and unlikely. If you get (from the previous socket) an old value of "sk_hash" and (from the new socket) a new value of "next" that is NULL, you will terminate the loop when you should have restarted it. I'm pretty sure Yes, I thought about that and thought I would point it out, but I agree, what you have is certainly better than spinning on a lock :). -corey --
One way of dealing with this is to keep a tail pointer. Then, if the element containing the NULL pointer doesn't match the tail pointer seen at the start of the search, or if the tail pointer has changed, restart the search. Memory barriers will be needed. ;-) --
Hum... Another way of handling all those cases and avoid memory barriers would be to have different "NULL" pointers. Each hash chain should have a unique "NULL" pointer (in the case of UDP, it can be the 128 values : [ (void*)0 .. (void *)127 ] Then, when performing a lookup, a reader should check the "NULL" pointer it get at the end of its lookup has is the "hash" value of its chain. If not -> restart the loop, aka "goto begin;" :) We could avoid memory barriers then. In the two cases Corey mentioned, this trick could let us avoid memory barriers. (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough) What do you think ? --
Kinky!!! ;-) Then the rcu_dereference() would be supplying the needed memory barriers. Hmmm... I guess that the only confusion would be if the element got removed and then added to the same list. But then if its pointer was pseudo-NULL, then that would mean that all subsequent elements had been removed, and all preceding ones added after the scan started. Which might well be harmless, but I must defer to you on this one at the moment. If you need a larger hash table, another approach would be to set the pointer's low-order bit, allowing the upper bits to be a full-sized index -- or even a pointer to the list header. Just make very sure to clear the pointer when freeing, or an element on the freelist could end up looking like a legitimate end of list... Which again might well be safe, but why inflict this on oneself? Thanx, Paul --
Kind of my thought, too. That's a lot of work to avoid a single smb_wmb() on the socket creation path. Plus this could be extra confusing. -corey --
Just to be clear, I was fulminating against any potential failure to clear the pseudo-NULL pointer, not against the pseudo-pointer itself. This sort of trick is already used in some of the RCU-protected trees (for example, FIB tree, IIRC), so I would look a bit funny fulminating too hard against it. ;-) The only other high-level approach I have come up with thus far is to maintain separate hash tables for the long-lived UDP sockets (protected by RCU) and for the short-lived UDP sockets (protected by locking). Given the usual bimodal traffic pattern, most of the sockets are short lived, but most of the data is transmitted over long-lived sockets. If a socket receives more than N packets (10? 50? 100?), it is moved from the short-lived table to the long-lived table. Sockets on the short-lived table may be freed directly, while sockets on the long-lived table must be RCU freed -- but this added overhead should be in the noise for a long-lived connection. Lookups hit the RCU-protected table, then the lock protected table, then the RCU-protected table again, but still holding the lock. (Clearly, only search until you find the desired socket.) However, I am not certain that this short-term/long-term approach is better than the approach that Eric is proposing. It might in fact be worse. But I throw it out anyway on the off-chance that it is helpful as a comparison or as a solution to some future problem. Thanx, Paul --
Sure this smp_wmb() seems harmless (but, remember this infrastructure will next be deployed for TCP sockets as well ;) ) But saving smp_rmb() at lookup time, for each item is a clear win, no ? --
Well, for UDP case, hash table will be <=3D 65536 anyway, we can assume no dynamic kernel memory is in the range [0 .. 65535] Here is a patch (untested yet, its really time for a sleep for me ;) ) [PATCH] udp: Introduce special NULL pointers for hlist termination In order to safely detect changes in chains, we would like to have differ= ent 'NULL' pointers. Each chain in hash table is terminated by an unique 'NUL= L' value, so that the lockless readers can detect their lookups evaded from their starting chain. We define 'NULL' values as ((unsigned long)values < UDP_HTABLE_SIZE) This saves memory barriers (a read barrier to fetch 'next' pointers *before* checking key values) we added in commit=20 96631ed16c514cf8b28fab991a076985ce378c26 (udp: introduce=20 sk_for_each_rcu_safenext())=20 This also saves a missing memory barrier spotted by Corey Minyard=20 (a write one in udp_lib_get_port(), between sk_hash update and ->next upd= ate) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/linux/list.h | 32 ++++++++++++++++++++++++++++++++ include/linux/rculist.h | 15 ++++++++------- include/net/sock.h | 9 +++++++-- net/ipv4/udp.c | 19 +++++++++++++------ net/ipv6/udp.c | 16 ++++++++++++---- 5 files changed, 72 insertions(+), 19 deletions(-)
I think you are right, this will certainly perform a lot better without the read barrier in the list traversal. I haven't seen any problems with this approach, though it's unusual enough to perhaps warrant some extra comments in the code. You do need to modify udp_lib_unhash(), as sk_del_node_init_rcu() will do a NULL check on the ->next value, so you will need a special version of that as well. -corey --
Yes, we need many new macros, like sk_next_nulls(), sk_head_nulls(), ... I have a working patch now, but not yet presentable for lkml :) This patch need to touch files outside of netdev scope, so will need really good shape and documentation. (Probably a new file : include/linux/list_nulls.h ?) Maybe in the meantime, we can commit a temporary patch doing the smp_wmb(= ) you suggested ? Thanks [PATCH] udp: add a missing smp_wmb() in udp_lib_get_port() Corey Minyard spotted a missing memory barrier in udp_lib_get_port() We need to make sure a reader cannot read the new 'sk->sk_next' value and previous value of 'sk->sk_hash'. Or else, an item could be deleted from a chain, and inserted into another chain. If new chain was empty before the move, 'next' pointer is NULL, and lockless reader can not detect it missed following items in original chain. This patch is temporary, since we expect an upcoming patch to introduce another way of handling the problem. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
From: Eric Dumazet <dada1@cosmosbay.com> I've applied this to net-next-2.6 --
From: Eric Dumazet <dada1@cosmosbay.com> Please hide this behind some list.h interface macro, even something as simple as INIT_HLIST_HEAD_NULLS(X, index) would suffice. And as Corey said, the code needs more comments for something as clever as this! :-) Thanks. --
Yes I agree 100%, please give me one day to prepare a real patch, or else akpm will kill us :) --
If we design something that could be reused, say for TCP sockets, we need to be able to handle very large number of 'NULL' pointers, say, up to 64*1024*1024 So lets use the low order bit, set to one for "NULL" pointers, and 0 for regular pointers. This gives us 31 bits (or 63 bits) to store any valuable info :) and all ...._nulls() macros would not need to know the max value (UDP_HTABLE_SIZE in UDP case), since all they have to do is a test ((unsigned long)pos & 1) At iterator exit, pos would contain the 'index' value, (pos >> 1), to hide this implementation detail. --
From: Eric Dumazet <dada1@cosmosbay.com> This sound fine to me. Quite an improvement in fact :) --
Ok, here is an updated and tested patch. Thanks everybody [PATCH] udp: Introduce special NULL pointers for hlist termination In order to safely detect changes in chains, we would like to have differ= ent 'NULL' pointers. Each chain in hash table is terminated by an unique 'NUL= L' value, so that the lockless readers can detect their lookups evaded from their starting chain. We introduce a new type of hlist implementation, named hlist_nulls, were we use the least significant bit of the 'ptr' to tell if its a "NULL" val= ue or a pointer to an object. We expect to use this new hlist variant for TC= P as well. For UDP/UDP-Lite hash table, we use 128 different "NULL" values, (UDP_HTABLE_SIZE=3D128) Using hlist_nulls saves memory barriers (a read barrier to fetch 'next' pointers *before* checking key values) we added in commit=20 96631ed16c514cf8b28fab991a076985ce378c26 (udp: introduce sk_for_each_rcu_safenext()) This also saves a write memory barrier in udp_lib_get_port(), between sk->sk_hash update and sk->next update) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/linux/list_nulls.h | 97 ++++++++++++++++++++++++++++++++ include/linux/rculist.h | 17 ----- include/linux/rculist_nulls.h | 55 ++++++++++++++++++ include/net/sock.h | 50 ++++++++++++---- include/net/udp.h | 2 net/ipv4/udp.c | 40 ++++++------- net/ipv6/udp.c | 22 ++++--- 7 files changed, 228 insertions(+), 55 deletions(-)
On Thu, 30 Oct 2008 16:40:01 +0100 IMHO this goes over the edge into tricky hack. Is it really worth it? Is there a better simpler way? --
The only think I've thought of is to do a single smp_rmb() after the loop scanning the list and check the sk_hash value again. That's better than the read barrier for every list element, but still not as good as this list from a performance point of view. IMHO, this is a tricky hack, but it if is well abstracted and documented I think it's ok. I'd guess something like this will become more often used as we get larger numbers of processors on systems. It is annoying that it doesn't help the performance for multicast. However, I think the current patch will solve the DOS issue for multicast, since it switches to a normal spinlock and has a per-list lock. -corey --
Corey Minyard a =C3=A9crit : ck. About multicast, it should be possible to do something about it, if it ha= ppens to be an issue. That is, do a lockless lookup and accumulate matching sockets ptr in a ta= ble (incrementing their refcount if not zero, checking key, adding in a local= stack). If lookup must be restarted, forget all accumulated sockets (sock_put(ptr= s)) goto begin; Then, send the (cloned) packet to all accumulated sockets, and sock_put() them to release the refcount. Well, looking at current implementation, I found that udp_v4_mcast_next()= doesnt take into account the 'struct net *net', so we have a bug here... udp_v6_mcast_next() is buggy too (or at least its caller is) David, please find a patch against net-2.6 Thanks [PATCH] udp: multicast packets need to check namespace Current UDP multicast delivery is not namespace aware. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- net/ipv4/udp.c | 14 ++++++++------ net/ipv6/udp.c | 8 ++++---- 2 files changed, 12 insertions(+), 10 deletions(-)
Acked-by: Pavel Emelyanov <xemul@openvz.org> --
From: Pavel Emelyanov <xemul@openvz.org> Applied, thanks everyone. --
rwlocks , spinlocks, seqlocks :) More seriously Stephen, if the infrastructure is clean, and well tested on a relative simple case (UDP), it can then be deployed on a much more interesting protocol : TCP The moment we switch to RCU, we have to accept the pain of really understand what we did. Details are scary yes. --
From: Eric Dumazet <dada1@cosmosbay.com> Agreed. --
If we're going to do this, It'd be good to have the list_nulls stuff in their own patch, as clearly they are not UDP specific. Also, I think it would be very good to have some extensive comments in the list_nulls files describing their use in clear and concise language, because the above changelog doesn't even begin to explain things for those not following this thread. --
Hi all
Here is a serie of three patches (based on net-next-2.6), to continue work
with RCU on UDP/TCP/DCCP stacks
Many thanks for all usefull reviews and comments, especially from Paul and Corey.
1) Introduce hlist_nulls variant of hlist
hlist uses NULL value to finish a chain.
hlist_nulls variant use the low order bit set to 1 to signal an end marker.
This allows to store many different end markers, so that some RCU lockless
algos (used in TCP/UDP stack for example) can save some memory barriers in
fast paths.
2) Use hlist_nulls in UDP RCU code
This is a straightforward patch, using hlist_nulls infrastructure.
RCU-ification already done on UDP two weeks ago, so hlist_nulls
permits us to avoid some memory barriers, both at lookup time
and delete time. Patch is large because it adds new macros to
include/net/sock.h. These macros will be used by TCP & DCCP too.
3) Convert TCP & DCCP hash tables to use RCU & hlist_nulls
RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.
This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.
Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.
__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)
Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
--
Do you have any numbers that demonstrate the read memory barriers being a performance problem? At least on x86 they should be very cheap because they're normally nops. -Andi -- ak@linux.intel.com --
From: Eric Dumazet <dada1@cosmosbay.com> All 4 patches applied to net-next-2.6, thanks a lot Eric! These patches are incredibly cool! --
How much does this give us on the aim9 udp test? --
hlist uses NULL value to finish a chain.
hlist_nulls variant use the low order bit set to 1 to signal an end-of-list marker.
This allows to store many different end markers, so that some RCU lockless
algos (used in TCP/UDP stack for example) can save some memory barriers in
fast paths.
Two new files are added :
include/linux/list_nulls.h
- mimics hlist part of include/linux/list.h, derived to hlist_nulls variant
include/linux/rculist_nulls.h
- mimics hlist part of include/linux/rculist.h, derived to hlist_nulls variant
Only four helpers are declared for the moment :
hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
prefetches() were removed, since an end of list is not anymore NULL value.
prefetches() could trigger useless (and possibly dangerous) memory transactions.
Example of use (extracted from __udp4_lib_lookup())
struct sock *sk, *result;
struct hlist_nulls_node *node;
unsigned short hnum = ntohs(dport);
unsigned int hash = udp_hashfn(net, hnum);
struct udp_hslot *hslot = &udptable->hash[hash];
int score, badness;
rcu_read_lock();
begin:
result = NULL;
badness = -1;
sk_nulls_for_each_rcu(sk, node, &hslot->head) {
score = compute_score(sk, net, saddr, hnum, sport,
daddr, dport, dif);
if (score > badness) {
result = sk;
badness = score;
}
}
/*
* if the nulls value we got at the end of this lookup is
* not the expected one, we must restart lookup.
* We probably met an item that was moved to another chain.
*/
if (get_nulls_value(node) != hash)
goto begin;
if (result) {
if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
result = ...So by not using some memory barriers (would be nice to have it illustrated which ones), we can race and end up on the wrong chain, in case that happens we detect this by using this per-chain terminator and try again. It would be really good to have it explained in the rculist_nulls.h comments what memory barriers are missing, what races they open, and how the this special terminator trick closes that race. I'm sure most of us understand it now, but will we still in a few months? - how about new people? Other than that, very cool stuff! :-) --
For example, if a process holds a lock, it doesnt need rcu version. OK, maybe I should add a Documentation/RCU/rculist_nulls.txt file with appropriate examples and documentation. (Say the lookup/insert algorithms, with standard hlist and memory barriers, and with hlist_nulls without those two memory barriers. (These two memory barriers can be found in commits : c37ccc0d4e2a4ee52f1a40cff1be0049f2104bba : udp: add a missing smp_wmb() in udp_lib_get_port() Corey Minyard spotted a missing memory barrier in udp_lib_get_port() We need to make sure a reader cannot read the new 'sk->sk_next' value and previous value of 'sk->sk_hash'. Or else, an item could be deleted from a chain, and inserted into another chain. If new chain was empty before the move, 'next' pointer is NULL, and lockless reader can not detect it missed following items in original chain. This patch is temporary, since we expect an upcoming patch to introduce another way of handling the problem. And commit 96631ed16c514cf8b28fab991a076985ce378c26 : udp: introduce sk_for_each_rcu_safenext() Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149933110672d (udp: RCU handling for Unicast packets.) "If the socket is moved from one list to another list in-between the time the hash is calculated and the next field is accessed, and the socket has moved to the end of the new list, the traversal will not complete properly on the list it should have, since the socket will be on the end of the new list and there's not a way to tell it's on a new list and restart the list traversal. I think that this can be solved by pre-fetching the "next" field (with proper barriers) before checking the hash." This patch corrects this problem, introducing a new Thanks Peter ;) --
[PATCH 4/3] rcu: documents rculist_nulls Adds Documentation/RCU/rculist_nulls.txt file to describe how 'nulls' end-of-list can help in some RCU algos. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- Documentation/RCU/rculist_nulls.txt | 167 ++++++++++++++++++++++++++ 1 files changed, 167 insertions(+)
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --
From: Peter Zijlstra <a.p.zijlstra@chello.nl> Since there seems to be consensus for these changes I'm going to merge this stuff into net-next-2.6 so that I can add in the users that Eric has written. Thanks everyone for the review and feedback. --
Very good -- only one small suggestion below. In some cases, a "generation number" will be needed. For example, there are algorithms that must detect when an object has been removed and then re-inserted with the same key. One increments the generation number --
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --
Looks good, but a few questions and suggestions interspersed below. And @pos is the starting point, correct? Suggest something like: The point here is to allow an RCU reader to grab the update-side lock while holding a reference to an hlist_nulls_node, and then be able to blindly call hlist_nulls_del_init_rcu() without having to do any complex check to see if the element has already been deleted? But this only works if each free operation waits for a grace period. If using SLAB_DESTROY_BY_RCU, the would-be deleter still needs to Any chance of using a trick like Oleg used to get rid of the "pos" argument? http://lkml.org/lkml/2008/3/12/47 The hlist_nulls_node must always be at an even address, correct? Couldn't this fact be used to allow testing for is_a_nulls() on tpos rather than on pos? Or is there a better way to approach this? --
Yes, comment was copied from hlist_for_each_entry_from() comment, this one
<start a brain refresh cycle>
<read again your questions>
Tilt...
hlist_nulls_del_init_rcu() is only used by a writer, exactly
like hlist_del_init_rcu().
I see nothing special about SLAB_DESTROY_BY_RCU here.
static inline void hlist_del_init_rcu(struct hlist_node *n)
{
if (!hlist_unhashed(n)) {
__hlist_del(n);
n->pprev = NULL;
}
#define sk_nulls_for_each_rcu(__sk, node, list) \
hlist_nulls_for_each_entry_rcu(__sk, node, list, sk_nulls_node)
1) __sk is the pointer to found item if any is found in the loop
2) node will contain the end value of chain, in case we find nothing in loop
because we need to check it after the loop.
if (get_nulls_value(node) != hash)
goto begin;
I dont know, it seems quite complex to try to use only three args ?
This algo is not very easy to read as is already ...
--
Not a problem, as you don't use it the way I was thinking.
For whatever it is worth, here is a more complete use case, on the
off-chance that it becomes useful some time:
retry:
rcu_read_lock();
hlist_nulls_for_each_entry_rcu(tpos, pos, head, hn_node) {
if (!(curgen = still_valid(tpos)))
goto retry;
if (needs_deletion(tpos)) {
spin_lock(&update_side_lock);
if (still_valid(tpos) == curgen)
hlist_nulls_del_init_rcu(pos);
spin_unlock(&update_side_lock);
}
}
rcu_read_unlock();
This approach requires that the key and a generation number be encoded
into a single word, and that the generation number be changed on each
One way around #2 would be for get_nulls_value() to undo the
hlist_nulls_entry(). Not sure whether it is worth it, but a
thought.
Thanx, Paul
--
Indeed! Either that or have an rcu_read_unlock() before the goto retry. Good catch! --
Hum, we should add this template in Documentation/RCU I guess Thanks --
With Arnaldo's change -- probably should prototype and test to find the other inevitable bugs. :-/ Thanx, Paul --
This is a straightforward patch, using hlist_nulls infrastructure. RCUification already done on UDP two weeks ago. Using hlist_nulls permits us to avoid some memory barriers, both at lookup time and delete time. Patch is large because it adds new macros to include/net/sock.h. These macros will be used by TCP & DCCP in next patch. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/linux/rculist.h | 17 ----------- include/net/sock.h | 57 ++++++++++++++++++++++++++++++-------- include/net/udp.h | 2 - net/ipv4/udp.c | 47 ++++++++++++++----------------- net/ipv6/udp.c | 26 +++++++++-------- 5 files changed, 83 insertions(+), 66 deletions(-)
Looks good, one question below about the lockless searches. If the answer is that the search must complete undisturbed by deletions and additions, then: Shouldn't this check go -after- the check for "result"? Or is this a case where the readers absolutely must have traversed a chain without --
Very good question Yes, we really have to look at all the sockets, to find the one with higher score, not one with a low score, and the one with higher score being ignored because we didnt examined it. So we really must check we finished our lookup on the right chain end, not an alien one. (Previous UDP code had a shortcut if we found the socket with the maximal possible score, I deleted this test as it had basically 0.0001 % of chance being hit) Thanks a lot for your patient review Paul. --
RCU was added to UDP lookups, using a fast infrastructure : - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the price of call_rcu() at freeing time. - hlist_nulls permits to use few memory barriers. This patch uses same infrastructure for TCP/DCCP established and timewait sockets. Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications using short lived TCP connections. A followup patch, converting rwlocks to spinlocks will even speedup this case. __inet_lookup_established() is pretty fast now we dont have to dirty a contended cache line (read_lock/read_unlock) Only established and timewait hashtable are converted to RCU (bind table and listen table are still using traditional locking) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/inet_hashtables.h | 4 - include/net/inet_timewait_sock.h | 10 +-- net/core/sock.c | 4 + net/dccp/ipv4.c | 1 net/dccp/ipv6.c | 1 net/dccp/proto.c | 4 - net/ipv4/inet_diag.c | 6 +- net/ipv4/inet_hashtables.c | 78 ++++++++++++++++++++--------- net/ipv4/inet_timewait_sock.c | 26 +++++---- net/ipv4/tcp.c | 4 - net/ipv4/tcp_ipv4.c | 25 ++++----- net/ipv6/inet6_hashtables.c | 70 +++++++++++++++++--------- net/ipv6/tcp_ipv6.c | 1 13 files changed, 150 insertions(+), 84 deletions(-)
This is the validation step that verifies the race opened by using SLAB_DESTROY_BY_RCU, right? Does it make sense to add a little comment to these validation steps to --
The atomic_inc_not_zero() is not related to SLAB_DESTROY_BY_RCU but classic RCU lookup. A writer can delete the item right before we try to use it. Next step is necessary in case the deleted item was re-allocated and inserted in a hash chain (this one or another one, it doesnt matter). In this case, previous atomic_inc_not_zero test will succeed. So we must check again the item we selected (and refcounted) is the one we were searching. So yes, this bit should be documented, since SLAB_DESTROY_BY_RCU is Yes, you are right. Thanks --
It is used for the anonymous vmas. That is the purpose that Hugh introduced it for since he saw regression if he would use straight rcu freeing. See mm/rmap.c. SLAB_DESTROY_BY_RCU a pretty strange way of using RCU and slab so it should always be documented in detail. --
We have one user, anon_vma, and one thing that is very nearly identical the lockless pagecache. See page_cache_get_speculative() and find_get_page(). The pagecache gets away with this due to the simple fact that the page frames are never freed. Hmm, I once wrote a comment to go with SLAB_DESTROY_BY_RCU, which seems to have gotten lost... /me goes dig. Found it: http://lkml.org/lkml/2008/4/2/143 I guess I'd better re-submit that.. --
Looks good. CC Pekka and he will merge it. --
Hi David Please find patch to convert TCP/DCCP listening hash tables to RCU. A followup patch will cleanup all sk_node fields and macros that are not used anymore. Thanks [PATCH] net: Convert TCP/DCCP listening hash tables to use RCU This is the last step to be able to perform full RCU lookups in __inet_lookup() : After established/timewait tables, we add RCU lookups to listening hash table. The only trick here is that a socket of a given type (TCP ipv4, TCP ipv6, ...) can now flight between two different tables (established and listening) during a RCU grace period, so we must use different 'nulls' end-of-chain values for two tables. We define a large value : #define LISTENING_NULLS_BASE (1U << 29) So that slots in listening table are guaranteed to have different end-of-chain values than slots in established table. A reader can still detect it finished its lookup in the right chain. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/inet_hashtables.h | 9 + net/ipv4/inet_diag.c | 4 net/ipv4/inet_hashtables.c | 148 ++++++++++++++++---------------- net/ipv4/tcp_ipv4.c | 8 - net/ipv6/inet6_hashtables.c | 94 ++++++++++++-------- 5 files changed, 147 insertions(+), 116 deletions(-)
I do like this use of the full set up upper bits! However, wouldn't it be a good idea to use a larger base value for 64-bit systems, perhaps using CONFIG_64BIT to choose? 500M entries might not seem like that many in a few years time... --
Well, this value is correct up to 2^29 slots, and a hash table of 2^32 bytes (8 bytes per pointer) A TCP socket uses about 1472 bytes on 64bit arches, so 2^29 sessions would need 800 GB of ram, not counting dentries, inodes, ... I really doubt a machine, even with 4096 cpus should/can handle so many tcp sessions :) --
200MB per CPU, right? But yes, now that you mention it, 800GB of memory dedicated to TCP connections sounds almost as ridiculous as did 640K of memory in the late 1970s. ;-) Nevertheless, I don't have an overwhelming objection to the current code. Easy enough to change should it become a problem, right? Thanx, Paul --
Sure. By that time, cpus might be 128 bits or 256 bits anyway :) --
Or even 640K bits. ;-) Thanx, Paul --
From: Eric Dumazet <dada1@cosmosbay.com> Applied, thanks Eric. --
Why not use the bucket pointer as terminating condition? Because all you really need is a pointer that is specific per bucket, and not a valid element, right? --
Yes, but it forces compiler to keep around the bucket pointer. Big chance this value will be stored on stack. Next patch will use the least significant bit to distinguish a valid pointer from a "NULL pointer" --
As you might guess, I do like that idea. ;-) On my plane trip, so am reviewing what I believe to be your current combined patchset consisting of six patches transmitted in this thread: Message-ID: <49081D67.3050502@cosmosbay.com> Message-ID: <49082718.2030201@cosmosbay.com> Message-ID: <490874F2.2060306@cosmosbay.com> Message-ID: <4908DEDE.5030706@cosmosbay.com> Message-ID: <49094B0F.2090208@cosmosbay.com> Message-ID: <490838C6.4060304@cosmosbay.com> This probably won't be your latest and greatest by the time you receive this, but it appears to be the latest and greatest that I have. ;-) A few comments, search for blank lines. It might be good to take a predicate macro/function instead of nullval, in order to allow this primitive to be used for a number of different pseudo-NULL-pointer schemes, and ditto for the similar macros you define below. Might be too early to know exactly how such a primitive should look, though. So just a random thought at this point. In any case, this macro cannot be used in read-side critical sections. Used by sk_for_each_nulls(), which is called by udp_lib_lport_inuse() and udp_get_first(). udp_lib_lport_inuse() is called by udp_lib_get_port(), which holds the hslot->lock, so should be OK. This also must be called with the update-side lock held. It is wrappered by sk_for_each_from_nulls(), which in turn is called from udp_v4_mcast_next() and udp_v6_mcast_next(). According to messages earlier in this thread, that is the case. udp_v4_mcast_next() is called from __udp4_lib_mcast_deliver(), which does hold hslot->lock, so OK. Ditto for the calls from udp_v6_mcast_next(). Interestingly enough, these two functions use sk_head(), which calls __sk_head(), which do not do rcu_dereference(). Which is another reason Looks good. It should be possible to get rid of the "pos" argument in the upcoming version, as all of the offsets should be even, so the bottom bit would come through unchanged. In fact, it should be ...
Well, we have 65536(=2^16) possible port values, and while 'rand' is random, it has the interesting property/bias of being odd. We know (thanks modular arithmetic / congruence relation) we will hit all 65356 values exactly once, after exactly 65536 iterations. --
Ah, got it! Thank you for the explanation! I was fixating on the low..high interval. ;-) Thanx, Paul --
yes, but 'new position' is 'before any not yet examined objects', since Yes, I'll double check too, this seems tricky :) About SLAB_DESTROY_BY_RCU effect, we now have two different kmem_cache for "UDP-Lite" and "UDP". This is expected, but we could avoid that and alias these caches, since these objects have the same *type* . (The fields used for the RCU lookups, deletes and inserts are the same) Maybe a hack in net/ipv4/udplite.c before calling proto_register(), to copy the kmem_cache from UDP. --
OK. However, this reasoning assumes that a socket with a given udp_hashfn() value will appear on one and only one list. There are no As long as this preserves the aforementioned assumption that a socket with a given hash can appear on one and only one list. ;-) Thanx, Paul --
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Nope, with UDP things are very simple, just one hash table. --
Cool! That does make things easier. ;-) Thanx, Paul --
Ouch, thanks Paul, that is indeed the point, well sort of. If a UDP socket is freed, and re-allocated as an UDP-Lite socket, inserted on the udplite_table, then we would have a problem with current implementation. A reader could be directed to the chain of the other hash table, without noticing it should restart its lookup... Not worth adding a check to detect such a scenario, we can live with two different kmem_cache after all, they are not that expensive. --
From: Eric Dumazet <dada1@cosmosbay.com> Also applied, thanks Eric. --
why _safenext and not _safe like hlist_for_each_entry_safe() which also keeps a next pointer? Also note the difference in argument order between these two. --
Yes, this one is going to vanish soon. --
UDP sockets are hashed in a 128 slots hash table. This hash table is protected by *one* rwlock. This rwlock is readlocked each time an incoming UDP message is handled. This rwlock is writelocked each time a socket must be inserted in hash table (bind time), or deleted from this table (close time) This is not scalable on SMP machines : 1) Even in read mode, lock() and unlock() are atomic operations and must dirty a contended cache line, shared by all cpus. 2) A writer might be starved if many readers are 'in flight'. This can happen on a machine with some NIC receiving many UDP messages. User process can be delayed a long time at socket creation/dismantle time. This patch prepares RCU migration, by introducing 'struct udp_table and struct udp_hslot', and using one rwlock per chain, to reduce contention on central rwlock. Introducing one rwlock per chain reduces latencies, for port randomization on heavily loaded UDP servers. Some cleanups were done to ease review of following patch. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/net/sock.h | 2 include/net/udp.h | 25 ++-- include/net/udplite.h | 2 net/ipv4/udp.c | 208 +++++++++++++++++++++++----------------- net/ipv4/udp_impl.h | 4 net/ipv4/udplite.c | 13 +- net/ipv6/udp.c | 112 +++++++++++---------- net/ipv6/udp_impl.h | 4 net/ipv6/udplite.c | 8 - 9 files changed, 214 insertions(+), 164 deletions(-)
This structure should be aligned up to cacheline to reduce false The fail: label below should still unlock_bh when the above condition cheers, . . christian --
This should be new label actually, since it is also used for non-locked fail. -- Evgeniy Polyakov --
Yes, I though about that. But : a full cache line is a waste of memory, and choosing a power of two alignement is not easy because of 32bit/64bit arches, Good spoting, the write_unlock_bh(&hslot->lock); must be moved after the "fail:" label. Thanks a lot --
Hi. Ugh-ogh, that's a very cool change! So far reviewed only this and things look very promising. -- Evgeniy Polyakov --
UDP sockets are hashed in a 128 slots hash table. This hash table is protected by *one* rwlock. This rwlock is readlocked each time an incoming UDP message is handled. This rwlock is writelocked each time a socket must be inserted in hash table (bind time), or deleted from this table (unbind time) This is not scalable on SMP machines : 1) Even in read mode, lock() and unlock() are atomic operations and must dirty a contended cache line, shared by all cpus. 2) A writer might be starved if many readers are 'in flight'. This can happen on a machine with some NIC receiving many UDP messages. User process can be delayed a long time at socket creation/dismantle time. What Corey and I propose is to use RCU to protect this hash table. Goals are : 1) Optimizing handling of incoming Unicast UDP frames, so that no memory writes should happen in the fast path. Using an array of rwlocks (one per slot for example is not an option in this regard) Note: Multicasts and broadcasts still will need to take a lock, because doing a full lockless lookup in this case is difficult. 2) No expensive operations in the socket bind/unhash phases : - No expensive synchronize_rcu() calls. - No added rcu_head in socket structure, increasing memory needs, but more important, forcing us to use call_rcu() calls, that have the bad property of making sockets structure cold. (rcu grace period between socket freeing and its potential reuse make this socket being cold in CPU cache). David did a previous patch using call_rcu() and noticed a 20% impact on TCP connection rates. Quoting Cristopher Lameter : "Right. That results in cacheline cooldown. You'd want to recycle the object as they are cache hot on a per cpu basis. That is screwed up by the delayed regular rcu processing. We have seen multiple regressions due to cacheline cooldown. The only choice in cacheline hot sensitive areas is to deal with the complexity that comes with SLAB_DESTROY_BY_RCU or ...
On Tue, 28 Oct 2008 21:37:15 +0100 We should just make it a spin_lock later and speed up udp socket creation. --
Just to be clear, that was 10x with preempt RT, which converts rwlocks into PI mutexes. So 16 processors going for the same lock is pretty ugly. Under heavy loads there is also a writer starvation problem, I believe in non-RT. You can't actually create or destroy a UDP socket when the load is high because there's always a reader holding the lock. RCU also solves that problem. -corey --
From: Eric Dumazet <dada1@cosmosbay.com> As stated, I added RCU destruction generically for socket objects, and it showed up clearly. So "not be slowed down that much" has been disproven, at least to me, already :-) --
RCU in hash table is ok for managing read mostly data, since only during the read access you avoid to dirty a rwlock... If we have a workload that insert/delete sockets as hell but receive few frames (that hit the hash table in a read only way), then you defeat the purpose of RCU, and pay the price of throwing away (in rcu queue) hot data that will become cold before reuse... BTW is there any chance your results were obtained before October 2005 ? At that time, RCU was able to queue an unlimited number of events. a single loop doing close(open("/dev/null",0)) could exhaust RAM... Refs: commit 5ee832dbc6770135ec8d63296af0a4374557bb79 and many others... Anyway we can probably code something without call_rcu() cache blower for UDP, if time permits :) --
From: Eric Dumazet <dada1@cosmosbay.com> No, the timestamp on the saved patch file I have is June 16, 2008 :-) --
Yeah, sync_rcu() is rediculously expensive, at best 3 jiffies IIRC. --
I could make it -much- faster, but at the expense of -serious- CPU overhead. Still, might be useful during boot time (when the system can't do anything useful anyway) to accelerate getting data structures initialized. Thanx, Paul --
From: Peter Zijlstra <a.p.zijlstra@chello.nl> Probably the RCU delay on a 128 cpu machine :-) Also I bet batching the socket destruction eliminates all of the cached local state we have in the cpu at the actual socket destruction time. --
It delays until all currently executing RCU read-side sections have executed (new ones don't count, just currently executing ones). I'm not sure what this delay is, but I would expect it to be fairly small. This function is only called when a socket is closed, too, so it's not a I'd prefer that, too, but that would mean adding another member to the RCU probably wouldn't be a good choice for short-lived http sessions, since you will only get a couple of messages that would matter. I'm not against adding an item to struct sock, but this is not a common thing That's an interesting thought; I didn't know that capability was there. I can look at that. With this, the short-lived TCP sessions might not matter, though that's a different issue. -corey --
Be careful!, SLAB_DESTROY_BY_RCU just means the slab page gets RCU-freed, this means that slab object pointers stay pointing to valid memory, but it does _NOT_ mean those slab objects themselves remain valid. The slab allocator is free to re-use those objects at any time - irrespective of the rcu-grace period. Therefore you will have to be able to validate that the object you point to is indeed the object you expect, otherwise strange and wonderful things will happen. --
Thanks for this clarification. I guess we really need a rcu head then :) --
No you just need to make sure that the object you located is still active (f.e. refcount > 0) and that it is really a match (hash pointers may be updated asynchronously and therefore point to the object that has been reused for something else). Generally it is advisable to use SLAB_DESTROY_BY_RCU because it preserves the cache hot advantages of the objects. Regular RCU freeing will let the object expire for a tick or so which will result in the cacheline cooling down. --
In some cases, you might be able to not care, but yes, most of the time, And SLAB_DESTROY_BY_RCU guarantees that the type of the object will remain the same during any given RCU read-side critical section. Thanx, Paul --
Seems really good to master this SLAB_DESTROY_BY_RCU thing (I see almost no use of it in current kernel) 1) Hum, do you know why "struct file" objects dont use SLAB_DESTROY_BY_RCU then, since we noticed a performance regression for several workloads at RCUification of file structures ? 2) What prevents an object to be *freed* (and deleted from a hash chain), then re-allocated and inserted to another chain (different keys) ? (final refcount=1) If the lookup detects a key mismatch, how will it continue to the next item, since 'next' pointer will have been reused for the new chain insertion... Me confused... --
Nothing prevents this from happening. You either have to have some sort of validation step based on object identity (e.g., a generation number that is incremented on each allocation), or have an algorithm that doesn't care if searches sometimes spuriously fail to find something that really is in the list. Which is one of the reasons that its use is rare. But perhaps more One way to approach this is to have a generation number that is incremented each time the object is recycled along with a pointer to the list header. The list header contains the most recent generation number of any element in the list. Then if either the generation number of a give element is greater than that of the header when you started the search, or the element's pointer no longer references the list header you started your search from, restart the search. Read-side memory barriers may also be required in some cases. It may be possible to simplify this in some special cases. There are probably better ways to approach this, but that is one way. Thanx, Paul --
Right, you can't have lists with items like that. You can only do
matching lookups. What you do is:
find_get_obj(key)
{
rcu_read_lock()
again:
obj = lookup(key);
if (!obj)
goto out;
/*
* if we can't get a ref, the item got freed concurrently
* try again
*/
if (!get_ref_unless_zero(obj))
goto again;
/*
* if we did get a ref, but its not the object we expected
* try again
*/
if (obj->key != key) {
put_ref(obj);
goto again;
}
out:
rcu_read_unlock();
return obj;
}
Which is basically what we do with the lockless pagecache, where we
don't need the RCU because the page-frames are never freed.
--
Because my patches were not accepted that fix the issue. If there is a mismatch then you have to do another hash lookup. Do an rcu unlock and start over. --
| Jesse Barnes | Re: [stable] [BUG][PATCH] cpqphp: fix kernel NULL pointer dereference |
| Greg KH | [003/136] p54usb: add Zcomax XG-705A usbid |
| Magnus Damm | [PATCH 03/07] ARM: Use shared GIC entry macros on Realview |
| Oliver Neukum | Re: [Bug #13682] The webcam stopped working when upgrading from 2.6.29 to 2.6.30 |
| Martin Schwidefsky | Re: [PATCH] optimized ktime_get[_ts] for GENERIC_TIME=y |
git: | |
| Junio C Hamano | Re: Some advanced index playing |
| Jeff King | Re: confusion over the new branch and merge config |
| Robin Rosenberg | Re: cvs2svn conversion directly to git ready for experimentation |
| Linus Torvalds | git binary size... |
| Ævar Arnfjörð Bjarmason | Re: Challenge with Git-Bash |
| Linux Kernel Mailing List | md: move allocation of ->queue from mddev_find to md_probe |
| Linux Kernel Mailing List | md: raid0: Represent zone->zone_offset in sectors. |
| Linux Kernel Mailing List | [ARM] S3C24XX: Add gpio_to_irq() facility |
| Linux Kernel Mailing List | md: raid0_make_request(): Replace |
