Re: [PATCH 2/2] udp: RCU handling for Unicast packets.

Previous thread: [PATCH 2/3] Add RCU versions of socket hlist handling by Corey Minyard on Monday, October 6, 2008 - 11:50 am. (1 message)

Next thread: IGMP sent to Foreign VLAN by Jarod Neuner on Monday, October 6, 2008 - 12:51 pm. (8 messages)
From: Corey Minyard
Date: Monday, October 6, 2008 - 11:50 am

Change the UDP hash lock from an rwlock to RCU.

Signed-off-by: Corey Minyard <cminyard@mvista.com>
---
 include/net/udp.h |    9 +++++----
 net/ipv4/udp.c    |   47 +++++++++++++++++++++++++++--------------------
 net/ipv6/udp.c    |   17 +++++++++--------
 3 files changed, 41 insertions(+), 32 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index addcdc6..35aa104 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -51,7 +51,7 @@ struct udp_skb_cb {
 #define UDP_SKB_CB(__skb)	((struct udp_skb_cb *)((__skb)->cb))
 
 extern struct hlist_head udp_hash[UDP_HTABLE_SIZE];
-extern rwlock_t udp_hash_lock;
+extern spinlock_t udp_hash_wlock;
 
 
 /* Note: this must match 'valbool' in sock_setsockopt */
@@ -112,12 +112,13 @@ static inline void udp_lib_hash(struct sock *sk)
 
 static inline void udp_lib_unhash(struct sock *sk)
 {
-	write_lock_bh(&udp_hash_lock);
-	if (sk_del_node_init(sk)) {
+	spin_lock_bh(&udp_hash_wlock);
+	if (sk_del_node_init_rcu(sk)) {
 		inet_sk(sk)->num = 0;
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 	}
-	write_unlock_bh(&udp_hash_lock);
+	spin_unlock_bh(&udp_hash_wlock);
+	synchronize_rcu();
 }
 
 static inline void udp_lib_close(struct sock *sk, long timeout)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 57e26fa..1b65cb6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -112,7 +112,8 @@ DEFINE_SNMP_STAT(struct udp_mib, udp_stats_in6) __read_mostly;
 EXPORT_SYMBOL(udp_stats_in6);
 
 struct hlist_head udp_hash[UDP_HTABLE_SIZE];
-DEFINE_RWLOCK(udp_hash_lock);
+DEFINE_SPINLOCK(udp_hash_wlock);
+EXPORT_SYMBOL(udp_hash_wlock);
 
 int sysctl_udp_mem[3] __read_mostly;
 int sysctl_udp_rmem_min __read_mostly;
@@ -155,7 +156,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 	int    error = 1;
 	struct net *net = sock_net(sk);
 
-	write_lock_bh(&udp_hash_lock);
+	spin_lock_bh(&udp_hash_wlock);
 
 	if (!snum) {
 		int i, low, high, remaining;
@@ -225,12 +226,12 @@ gotit:
 	sk->sk_hash = ...
From: Eric Dumazet
Date: Monday, October 6, 2008 - 2:22 pm

UDP central rwlock can hurt performance, because of cache line ping pong,
so your patch really makes sense.

Me wondering what impact this synchronize_rcu() can have on mono-threaded
VOIP applications using lot of UDP sockets. What is the maximum delay of
this function ?

For "struct file" freeing, we chose call_rcu() instead of synchronize_rcu()

Maybe we could add a generic rcu head to struct sock, and use call_rcu() in
sk_prot_free() for sockets needing RCU (udp after your patch is applied, maybe
tcp on future patches, while I believe previous work on the subject concluded
RCU was not giving good results for short lived http sessions) ?

Or just add SLAB_DESTROY_BY_RCU to slab creation in proto_register()
for "struct proto udp_prot/udpv6_prot" so that kmem_cache_free() 
done in sk_prot_free() can defer freeing to RCU...




--

From: David Miller
Date: Monday, October 6, 2008 - 2:40 pm

From: Eric Dumazet <dada1@cosmosbay.com>

The cost is enormous, we really can't use it here.

I have a patch that did top-level socket destruction using RCU,
and that didn't use synchronize_rcu(), and that killed connection
rates by up to %20.

I can only imagine what the cost would be if I had to add such a call
in there.

Really, I can't consider these changes seriously, as-is.
--

From: Corey Minyard
Date: Monday, October 6, 2008 - 4:08 pm

Would using SLAB_DESTROY_BY_RCU be ok, or would that be too expensive?

-corey
--

From: Evgeniy Polyakov
Date: Tuesday, October 7, 2008 - 1:37 am

I tested skb destruction via RCU path, and got 2.5 times worse numbers
with small-packets-bulk-transfer workload.

For more details
http://tservice.net.ru/~s0mbre/blog/devel/networking/2006_12_05_1.html

-- 
	Evgeniy Polyakov
--

From: Christoph Lameter
Date: Tuesday, October 7, 2008 - 7:16 am

Was this with regular RCU freeing? This will cool down the cacheline before
frees. You need SLAB_DESTROY_BY_RCU to keep the objects cache hot.


--

From: Paul E. McKenney
Date: Tuesday, October 7, 2008 - 7:33 am

Indeed!

But care is required -- SLAB_DESTROY_BY_RCU permits objects to be freed
and reallocated while a reader holds a reference.  The only guarantee is
that the -type- of the data structure will not change while a reader holds
a reference.  With something like UDP, this might well be sufficient.

Just be careful!  ;-)

							Thanx, Paul
--

From: Christoph Lameter
Date: Tuesday, October 7, 2008 - 7:45 am

Right so after the hash lookup operation you are not assured that the object
has not been freed or even reallocated for a different purpose. So after
finding the pointer to the object two things need to happen (under rcu_lock):

1. Verify that the object is still in use
2. Verify that the object is matching the hash

If not then the operation needs to be redone because we have a stale hash pointer.
--

From: Eric Dumazet
Date: Tuesday, October 7, 2008 - 8:07 am

OK... so restart full lookup at the begining of hash chain if we detect
points 1 or 2 invalid.

Not that expensive since everything should be cache hot :)

One has to take care to group all components (keys to compute the hash, 
and the *next* pointer) in one cache line to minimize cache misses,
since we now need to access them all to compute/check hash value.

Now if a freed object is re-inserted with same hash value, 
same hash chain, we'll also restart the lookup, but it is harmless.




--

From: Paul E. McKenney
Date: Tuesday, October 7, 2008 - 8:07 am

There is also the possibility that the element will be reused, but placed
in the same list that it resided in last time.  A reader referencing
that item during this process might then be relocated in the list, which
could either cause the reader to skip elements in the list (if the new
element is relocated farther down the list) or endlessly loop through
the list (if new elements were relocated closer to the list head and this
free-reallocate process repeated and the reader was insanely unlucky).

							Thanx, Paul
--

From: Evgeniy Polyakov
Date: Tuesday, October 7, 2008 - 7:29 am

I believe there were no SLAB_DESTROY_BY_RCU 2 yars ago :)

It was pure call_rcu(&skb->rcu, real_skb_freeing), where
real_skb_freeing() just did usual kfree().

-- 
	Evgeniy Polyakov
--

From: Christoph Lameter
Date: Tuesday, October 7, 2008 - 7:38 am

Right. That results in cacheline cooldown. You'd want to recycle the object as
they are cache hot on a per cpu basis. That is screwed up by the delayed
regular rcu processing. We have seen multiple regressions due to cacheline
cooldown.

The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU.






--

From: Eric Dumazet
Date: Monday, October 6, 2008 - 10:24 pm

Yes, I suppose you are right about TCP sessions, that should stay as they are.

Then if we use call_rcu() RCU freeing only for UDP sockets, we should get rid of
taking this rwlock each time we handle an incoming datagram, and introduce
no extra cost for other sockets.

Most UDP sockets are setup for long periods (RTP trafic), or if an application really
wants to {open/send or receive one UDP frame/close} many sockets, it already hits
RCU handling of its file structures and should not be slowed down that much.

By 'long period' I mean thousand of packets sent/received by each RTP session, being
voice (50 packets/second) or even worse video...




--

From: Benny Amorsen
Date: Tuesday, October 7, 2008 - 1:54 am

Does DNS with port randomization need short lived sockets?


/Benny

--

From: Eric Dumazet
Date: Tuesday, October 7, 2008 - 5:59 am

Yes very true, but current allocation of a random port can be very expensive, 
since we scan all the UDP hash table to select the smaller hash chain.

We stop the scan if we find an empty slot, but on machines with say more than 200
bound UDP sockets, they are probably no empty slots. (UDP_HTABLE_SIZE is 128)

bind(NULL port) algo is then O(N), N being number of bound UDP sockets.

So heavy DNS servers/proxies probably use a pool/range of pre-allocated sockets
to avoid costs of allocating/freeing them ? If they dont care about that cost,
the extra call_rcu() will be unnoticed.

For pathological (yet very common :) ) cases like single DNS query/answer, RCU
would mean :

Pros :
- one few rwlock hit when receiving the answer (if any)
Cons :
- one call_rcu() to delay socket freeing/reuse after RCU period.

So it might be a litle bit more expensive than without RCU

I agree I am more interested in optimizing UDP stack for heavy users like RTP 
servers/proxies handling xxx.000 packets/second than DNS users/servers.
Shame on me :)

(2 weeks ago, Corey mentioned a 10x increase on UDP throughput on a 16-way machine,
that sounds promising)





--

From: Stephen Hemminger
Date: Tuesday, October 7, 2008 - 7:07 am

On Tue, 07 Oct 2008 14:59:20 +0200

The idea of keeping chains short is the problem. That code should just be pulled because
it doesn't help that much, and also creates bias on the port randomization.
--

From: David Miller
Date: Tuesday, October 7, 2008 - 1:55 pm

From: Stephen Hemminger <shemminger@vyatta.com>

I have that patch from Vitaly Mayatskikh which does exactly this.

I keep looking at it, but I can't bring myself to apply it since
I'm not completely convinced.
--

From: Stephen Hemminger
Date: Tuesday, October 7, 2008 - 2:20 pm

On Tue, 07 Oct 2008 13:55:48 -0700 (PDT)

Some one on a busy server should run it and measure the delta in hash chains.
I would, but don't run anything that has more than a few UDP's open.
--

From: Eric Dumazet
Date: Wednesday, October 8, 2008 - 6:55 am

Vitaly patch might be appropriate if only few UDP ports are opened.

We could zap the code to search short chains and extend Vitaly's
idea with following patch :

[PATCH] udp: Improve port randomization

Current UDP port allocation is suboptimal.
We select the shortest chain to chose a port (out of 512)
that will hash in this shortest chain.

First, it can lead to give not so ramdom ports and ease
give attackers more opportunities to break the system.

Second, it can consume a lot of CPU to scan all table
in order to find the shortest chain.

Third, in some pathological cases we can fail to find
a free port even if they are plenty of them.

This patch zap the search for a short chain and only
use one random seed. Problem of getting long chains
should be addressed in another way, since we can
obtain long chains with non random ports.

Based on a report and patch from Vitaly Mayatskikh

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


From: David Miller
Date: Wednesday, October 8, 2008 - 11:45 am

From: Eric Dumazet <dada1@cosmosbay.com>

I really like this, and I've applied it to net-next-2.6

I think the "increment until back in range" do/while loop can
be improved a bit.  It can spin for more than 60,000 iterations
in some edge case scenerios as-is :-)

Ugh, there's also that expensive divide in there for the modulus.
--

From: Eric Dumazet
Date: Tuesday, October 28, 2008 - 1:42 pm

RCUification of UDP hash tables

Goals are :

1) Optimizing handling of incoming Unicast UDP frames, so that no memory
  writes should happen in the fast path. Using an array of rwlocks (one per
  slot for example is not an option in this regard)

  Note: Multicasts and broadcasts still will need to take a lock,
  because doing a full lockless lookup in this case is difficult.

2) No expensive operations in the socket bind/unhash phases :
   - No expensive synchronize_rcu() calls.

   - No added rcu_head in socket structure, increasing memory needs,
   but more important, forcing us to use call_rcu() calls,
   that have the bad property of making sockets structure cold.
   (rcu grace period between socket freeing and its potential reuse
    make this socket being cold in CPU cache).
   David did a previous patch using call_rcu() and noticed a 20%
   impact on TCP connection rates.
   Quoting Cristopher Lameter :
    "Right. That results in cacheline cooldown. You'd want to recycle
     the object as they are cache hot on a per cpu basis. That is screwed
     up by the delayed regular rcu processing. We have seen multiple
     regressions due to cacheline cooldown.
     The only choice in cacheline hot sensitive areas is to deal with the
     complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."

   - Because udp sockets are allocated from dedicated kmem_cache,
   use of SLAB_DESTROY_BY_RCU can help here.

Theory of operation :
---------------------

As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.

Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups in the same time.

In order to avoid loops, a reader must check each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, lookup must start again at the ...
From: Eric Dumazet
Date: Tuesday, October 28, 2008 - 3:45 pm

On ipv6 side, I forgot to add a check before compute_score(), like I did on ipv4


+	rcu_read_lock();
+begin:
+	result = NULL;
+	badness = -1;
+	sk_for_each_rcu(sk, node, &hslot->head) {

< BEGIN HERE missing part --->
		/*
		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
		 * We must check this item was not moved to another chain
		 */
		if (udp_hashfn(net, sk->sk_hash) != hash)
			goto begin;

< END missing part --->
 		score = compute_score(sk, net, hnum, saddr, sport, daddr, dport, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
 		}
 	}
-	if (result)
-		sock_hold(result);
-	read_unlock(&hslot->lock);
+	if (result) {
+		if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
+			result = NULL;
+		else if (unlikely(compute_score(result, net, hnum, saddr, sport,
+					daddr, dport, dif) < badness)) {
+			sock_put(result);
+			goto begin;
+ 		}
+	}


I will submit a new patch serie tomorrow, with :

Patch 1 : spinlocks instead of rwlocks, and bug spotted by Christian Bell

Patch 2 : splited on two parts (2 & 3) , one for IPV4, one for IPV6, 

Thanks

--

From: David Miller
Date: Tuesday, October 28, 2008 - 10:05 pm

From: Eric Dumazet <dada1@cosmosbay.com>

I very much look forward to this :-)

I like these changes and can't wait to add them to net-next-2.6
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 1:23 am

Thanks David, please find first updated patch 1

Thanks to Christian Bell and Stephen for their usefull review.


[PATCH] udp: introduce struct udp_table and multiple spinlocks

UDP sockets are hashed in a 128 slots hash table.

This hash table is protected by *one* rwlock.

This rwlock is readlocked each time an incoming UDP message is handled.

This rwlock is writelocked each time a socket must be inserted in
hash table (bind time), or deleted from this table (close time)

This is not scalable on SMP machines :

1) Even in read mode, lock() and unlock() are atomic operations and
 must dirty a contended cache line, shared by all cpus.

2) A writer might be starved if many readers are 'in flight'. This can
 happen on a machine with some NIC receiving many UDP messages. User
 process can be delayed a long time at socket creation/dismantle time.

This patch prepares RCU migration, by introducing 'struct udp_table
and struct udp_hslot', and using one spinlock per chain, to reduce
contention on central rwlock.

Introducing one spinlock per chain reduces latencies, for port
randomization on heavily loaded UDP servers. This also speedup
bindings to specific ports.

udp_lib_unhash() was uninlined, becoming to big.

Some cleanups were done to ease review of following patch
(RCUification of UDP Unicast lookups)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/sock.h    |    2
 include/net/udp.h     |   25 ++--
 include/net/udplite.h |    2
 net/ipv4/udp.c        |  209 +++++++++++++++++++++++-----------------
 net/ipv4/udp_impl.h   |    4
 net/ipv4/udplite.c    |   13 +-
 net/ipv6/udp.c        |  112 +++++++++++----------
 net/ipv6/udp_impl.h   |    4
 net/ipv6/udplite.c    |    8 -
 9 files changed, 215 insertions(+), 164 deletions(-)
From: David Miller
Date: Wednesday, October 29, 2008 - 1:56 am

From: Eric Dumazet <dada1@cosmosbay.com>

Applied, please (re-)send the current version of patch 2 as well.

Thanks.
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 3:19 am

I found a fatal bug in /proc/net/udp handling, sorry.

[PATCH] udp: udp_get_next() should use spin_unlock_bh()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
From: David Miller
Date: Wednesday, October 29, 2008 - 11:19 am

From: Eric Dumazet <dada1@cosmosbay.com>

Applied, thanks Eric.
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 2:04 am

Please find updated patch 2

Missing check in __udp6_lib_lookup() was added,
and based on spinlock version ([PATCH] udp: introduce struct udp_table an=
d multiple spinlocks)

Thank you

[PATCH] udp: RCU handling for Unicast packets.

Goals are :

1) Optimizing handling of incoming Unicast UDP frames, so that no memory
 writes should happen in the fast path.

 Note: Multicasts and broadcasts still will need to take a lock,
 because doing a full lockless lookup in this case is difficult.

2) No expensive operations in the socket bind/unhash phases :
  - No expensive synchronize_rcu() calls.

  - No added rcu_head in socket structure, increasing memory needs,
  but more important, forcing us to use call_rcu() calls,
  that have the bad property of making sockets structure cold.
  (rcu grace period between socket freeing and its potential reuse
   make this socket being cold in CPU cache).
  David did a previous patch using call_rcu() and noticed a 20%
  impact on TCP connection rates.
  Quoting Cristopher Lameter :
   "Right. That results in cacheline cooldown. You'd want to recycle
    the object as they are cache hot on a per cpu basis. That is screwed
    up by the delayed regular rcu processing. We have seen multiple
    regressions due to cacheline cooldown.
    The only choice in cacheline hot sensitive areas is to deal with the
    complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."

  - Because udp sockets are allocated from dedicated kmem_cache,
  use of SLAB_DESTROY_BY_RCU can help here.

Theory of operation :
---------------------

As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.

Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups in the same time.

In order to avoid loops, a reader must check each socket found in a chain=

really ...
From: David Miller
Date: Wednesday, October 29, 2008 - 2:17 am

From: Eric Dumazet <dada1@cosmosbay.com>

Applied, thanks Eric.
--

From: Corey Minyard
Date: Wednesday, October 29, 2008 - 6:17 am

I believe there is a race in this patch:

+	sk_for_each_rcu(sk, node, &hslot->head) {
+		/*
+		 * lockless reader, and SLAB_DESTROY_BY_RCU items:
+		 * We must check this item was not moved to another chain
+		 */
+		if (udp_hashfn(net, sk->sk_hash) != hash)
+			goto begin;
 		score = compute_score(sk, net, hnum, saddr, sport, daddr, dport, dif);
 		if (score > badness) {
 			result = sk;
 			badness = score;
 		}
 	}

If the socket is moved from one list to another list in-between the time 
the hash is calculated and the next field is accessed, and the socket 
has moved to the end of the new list, the traversal will not complete 
properly on the list it should have, since the socket will be on the end 
of the new list and there's not a way to tell it's on a new list and 
restart the list traversal.  I think that this can be solved by 
pre-fetching the "next" field (with proper barriers) before checking the 
hash.

I also might be nice to have a way to avoid recomputing the score the 
second time, perhaps using a sequence number of some type.

-corey
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 7:36 am

You are absolutely right. As we *need* next pointer anyway for the prefet=
ch(),

Well, computing score is really cheap, everything is in cache, granted
the chain was not really huge and high score item at the beginning.

Adding yet another field in sock structure should be avoided if possible.=


Thanks

[PATCH] udp: introduce sk_for_each_rcu_safenext()

Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149=
933110672d
(udp: RCU handling for Unicast packets.)

 "If the socket is moved from one list to another list in-between the tim=
e=20
  the hash is calculated and the next field is accessed, and the socket=20
  has moved to the end of the new list, the traversal will not complete=20
  properly on the list it should have, since the socket will be on the en=
d=20
  of the new list and there's not a way to tell it's on a new list and=20
  restart the list traversal.  I think that this can be solved by=20
  pre-fetching the "next" field (with proper barriers) before checking th=
e=20
  hash."

This patch corrects this problem, introducing a new sk_for_each_rcu_safen=
ext()
macro.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/rculist.h |   17 +++++++++++++++++
 include/net/sock.h      |    4 ++--
 net/ipv4/udp.c          |    4 ++--
 net/ipv6/udp.c          |    4 ++--
 4 files changed, 23 insertions(+), 6 deletions(-)

From: Corey Minyard
Date: Wednesday, October 29, 2008 - 8:34 am

You also need the appropriate smp_wmb() in udp_lib_get_port() after 
sk_hash is set, I think, so the next field is guaranteed to be changed 
after the hash value is changed.


-corey
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 9:09 am

Not sure about this one Corey.

If a reader catches previous value of item->sk_hash, two cases are to be taken into :

1) its udp_hashfn(net, sk->sk_hash) is != hash  
  -> goto begin : Reader will redo its scan

2) its udp_hashfn(net, sk->sk_hash) is == hash
  -> next pointer is good enough : it points to next item in same hash chain.
     No need to rescan the chain at this point.
     Yes we could miss the fact that a new port was bound and this UDP message could be lost.


If we force a smp_wmb(), reader would fetch pointer to begin of list.


1) its udp_hashfn(net, sk->sk_hash) is != hash  
  -> goto begin : Reader will redo its scan : next pointer value had no meaning

2) its udp_hashfn(net, sk->sk_hash) is == hash
  ->next pointer "force reader" to rescan the chain, but wont find new items.

In any case, we cannot lost an UDP message sent to a stable port (previously bound)


Thanks

--

From: Paul E. McKenney
Date: Wednesday, October 29, 2008 - 9:37 am

3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
removed, freed, reallocated, and then readded with the same hash value,
possibly carrying the reader to a new position in the same list.

You might well cover this (will examine your code in detail on my plane
flight starting about 20 hours from now), but thought I should point it
out.  ;-)

--

From: Corey Minyard
Date: Wednesday, October 29, 2008 - 10:22 am

If I understand this, without the smp_wmb(), it is possible that the 
next field can be written to main memory before the hash value is 
written.  If that happens, the following can occur:

  CPU1                    CPU2
  next is set to NULL (end of new list)
                          fetch next
                          calculate hash and compare to sk_hash
  sk_hash is set to new value

So I think in the above cases, your case #2 is not necessarily valid 
without the barrier.

And another possible issue.  If sk_hash is written before next, and CPU1 
is interrupted before CPU2, CPU2 will continually spin on the list until 
CPU1 comes back and moves it to the new list.  Note sure if that is an 
issue.

-corey
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 10:45 am

Well, if this item is injected to the same chain, next wont be set to NULL.

That would mean previous writers deleted all items from the chain.

In this case, readers can see NULL, it is not a problem at all.
List is/was empty.
An application cannot complain a packet is not
handled if its bind() syscall is not yet completed :)


Probably not. Previously, readers were spining on read_lock(), when 
a writer was inside its critical section (write_lock()/write_unlock()).
So instead of spining inside read_unlock(), issuing stupid memory 
transactions, the readers can now spin reading hash chain and populate
cpu cache :)



--

From: Corey Minyard
Date: Wednesday, October 29, 2008 - 11:28 am

I put my comment in the wrong place, I wasn't talking about being 
If the item is injected onto the end of another chain, the next field 
will be NULL and you won't detect a hash mismatch.  It's basically the 
same issue as the previous race, though a lot more subtle and unlikely.  
If you get (from the previous socket) an old value of "sk_hash" and 
(from the new socket) a new value of "next" that is NULL, you will 
terminate the loop when you should have restarted it.  I'm pretty sure 
Yes, I thought about that and thought I would point it out, but I agree, 
what you have is certainly better than spinning on a lock :).


-corey
--

From: Paul E. McKenney
Date: Wednesday, October 29, 2008 - 11:52 am

One way of dealing with this is to keep a tail pointer.  Then, if the
element containing the NULL pointer doesn't match the tail pointer seen
at the start of the search, or if the tail pointer has changed,
restart the search.  Memory barriers will be needed.  ;-)

--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 1:00 pm

Hum... Another way of handling all those cases and avoid memory barriers
would be to have different "NULL" pointers.

Each hash chain should have a unique "NULL" pointer (in the case of UDP, it
can be the 128 values : [ (void*)0 .. (void *)127 ]

Then, when performing a lookup, a reader should check the "NULL" pointer
it get at the end of its lookup has is the "hash" value of its chain.

If not -> restart the loop, aka "goto begin;" :)

We could avoid memory barriers then.

In the two cases Corey mentioned, this trick could let us avoid memory barriers.
(existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)

What do you think ?


--

From: Paul E. McKenney
Date: Wednesday, October 29, 2008 - 1:17 pm

Kinky!!!  ;-)

Then the rcu_dereference() would be supplying the needed memory barriers.

Hmmm...  I guess that the only confusion would be if the element got
removed and then added to the same list.  But then if its pointer was
pseudo-NULL, then that would mean that all subsequent elements had been
removed, and all preceding ones added after the scan started.

Which might well be harmless, but I must defer to you on this one at
the moment.

If you need a larger hash table, another approach would be to set the
pointer's low-order bit, allowing the upper bits to be a full-sized
index -- or even a pointer to the list header.  Just make very sure
to clear the pointer when freeing, or an element on the freelist
could end up looking like a legitimate end of list...  Which again
might well be safe, but why inflict this on oneself?

							Thanx, Paul
--

From: Corey Minyard
Date: Wednesday, October 29, 2008 - 2:29 pm

Kind of my thought, too.  That's a lot of work to avoid a single 
smb_wmb() on the socket creation path.  Plus this could be extra confusing.

-corey
--

From: Paul E. McKenney
Date: Wednesday, October 29, 2008 - 2:58 pm

Just to be clear, I was fulminating against any potential failure to
clear the pseudo-NULL pointer, not against the pseudo-pointer itself.
This sort of trick is already used in some of the RCU-protected trees
(for example, FIB tree, IIRC), so I would look a bit funny fulminating
too hard against it.  ;-)

The only other high-level approach I have come up with thus far is to
maintain separate hash tables for the long-lived UDP sockets (protected
by RCU) and for the short-lived UDP sockets (protected by locking).
Given the usual bimodal traffic pattern, most of the sockets are short
lived, but most of the data is transmitted over long-lived sockets.  If a
socket receives more than N packets (10? 50? 100?), it is moved from the
short-lived table to the long-lived table.  Sockets on the short-lived
table may be freed directly, while sockets on the long-lived table must
be RCU freed -- but this added overhead should be in the noise for a
long-lived connection.  Lookups hit the RCU-protected table, then the lock
protected table, then the RCU-protected table again, but still holding
the lock.  (Clearly, only search until you find the desired socket.)

However, I am not certain that this short-term/long-term approach is
better than the approach that Eric is proposing.  It might in fact be
worse.  But I throw it out anyway on the off-chance that it is helpful
as a comparison or as a solution to some future problem.

							Thanx, Paul
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 2:57 pm

Sure this smp_wmb() seems harmless (but, remember this infrastructure will
next be deployed for TCP sockets as well ;) )

But saving smp_rmb() at lookup time, for each item is a clear win, no ?



--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 3:08 pm

Well, for UDP case, hash table will be <=3D 65536 anyway, we can assume
no dynamic kernel memory is in the range [0 .. 65535]

Here is a patch (untested yet, its really time for a sleep for me ;) )

[PATCH] udp: Introduce special NULL pointers for hlist termination

In order to safely detect changes in chains, we would like to have differ=
ent
'NULL' pointers. Each chain in hash table is terminated by an unique 'NUL=
L'
value, so that the lockless readers can detect their lookups evaded from
their starting chain.

We define 'NULL' values as ((unsigned long)values < UDP_HTABLE_SIZE)

This saves memory barriers (a read barrier to fetch 'next' pointers
*before* checking key values) we added in commit=20
96631ed16c514cf8b28fab991a076985ce378c26 (udp: introduce=20
sk_for_each_rcu_safenext())=20

This also saves a missing memory barrier spotted by Corey Minyard=20
(a write one in udp_lib_get_port(), between sk_hash update and ->next upd=
ate)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
include/linux/list.h    |   32 ++++++++++++++++++++++++++++++++
include/linux/rculist.h |   15 ++++++++-------
include/net/sock.h      |    9 +++++++--
net/ipv4/udp.c          |   19 +++++++++++++------
net/ipv6/udp.c          |   16 ++++++++++++----
5 files changed, 72 insertions(+), 19 deletions(-)


From: Corey Minyard
Date: Wednesday, October 29, 2008 - 8:22 pm

I think you are right, this will certainly perform a lot better without 
the read barrier in the list traversal.  I haven't seen any problems 
with this approach, though it's unusual enough to perhaps warrant some 
extra comments in the code.

You do need to modify udp_lib_unhash(), as sk_del_node_init_rcu() will 
do a NULL check on the ->next value, so you will need a special version 
of that as well.

-corey
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 10:50 pm

Yes, we need many new macros, like sk_next_nulls(), sk_head_nulls(), ...

I have a working patch now, but not yet presentable for lkml :)

This patch need to touch files outside of netdev scope, so will need
really good shape and documentation.

(Probably a new file : include/linux/list_nulls.h ?)

Maybe in the meantime, we can commit a temporary patch doing the smp_wmb(=
)
you suggested ?

Thanks

[PATCH] udp: add a missing smp_wmb() in udp_lib_get_port()

Corey Minyard spotted a missing memory barrier in udp_lib_get_port()

We need to make sure a reader cannot read the new 'sk->sk_next' value
and previous value of 'sk->sk_hash'. Or else, an item could be deleted
from a chain, and inserted into another chain. If new chain was empty
before the move, 'next' pointer is NULL, and lockless reader can
not detect it missed following items in original chain.

This patch is temporary, since we expect an upcoming patch
to introduce another way of handling the problem.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
From: David Miller
Date: Saturday, November 1, 2008 - 9:19 pm

From: Eric Dumazet <dada1@cosmosbay.com>

I've applied this to net-next-2.6
--

From: David Miller
Date: Wednesday, October 29, 2008 - 10:40 pm

From: Eric Dumazet <dada1@cosmosbay.com>

Please hide this behind some list.h interface macro, even something
as simple as INIT_HLIST_HEAD_NULLS(X, index) would suffice.

And as Corey said, the code needs more comments for something as
clever as this! :-)

Thanks.
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 10:51 pm

Yes I agree 100%, please give me one day to prepare a real patch,
or else akpm will kill us :)



--

From: Eric Dumazet
Date: Thursday, October 30, 2008 - 12:04 am

If we design something that could be reused, say for TCP sockets, we need
to be able to handle very large number of 'NULL' pointers, say, up to 64*1024*1024

So lets use the low order bit, set to one for "NULL" pointers, and 0 for regular pointers.

This gives us 31 bits (or 63 bits) to store any valuable info :)

and all ...._nulls() macros would not need to know the max value (UDP_HTABLE_SIZE in UDP case),
since all they have to do is a test ((unsigned long)pos & 1)

At iterator exit, pos would contain the 'index' value, (pos >> 1), to hide this
implementation detail.


--

From: David Miller
Date: Thursday, October 30, 2008 - 12:05 am

From: Eric Dumazet <dada1@cosmosbay.com>

This sound fine to me.  Quite an improvement in fact :)

--

From: Eric Dumazet
Date: Thursday, October 30, 2008 - 8:40 am

Ok, here is an updated and tested patch.

Thanks everybody

[PATCH] udp: Introduce special NULL pointers for hlist termination

In order to safely detect changes in chains, we would like to have differ=
ent
'NULL' pointers. Each chain in hash table is terminated by an unique 'NUL=
L'
value, so that the lockless readers can detect their lookups evaded from
their starting chain.

We introduce a new type of hlist implementation, named hlist_nulls, were
we use the least significant bit of the 'ptr' to tell if its a "NULL" val=
ue
or a pointer to an object. We expect to use this new hlist variant for TC=
P
as well.

For UDP/UDP-Lite hash table, we use 128 different "NULL" values,
(UDP_HTABLE_SIZE=3D128)

Using hlist_nulls saves memory barriers (a read barrier to fetch 'next'
pointers *before* checking key values) we added in commit=20
96631ed16c514cf8b28fab991a076985ce378c26
(udp: introduce sk_for_each_rcu_safenext())

This also saves a write memory barrier in udp_lib_get_port(), between
sk->sk_hash update and sk->next update)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/list_nulls.h    |   97 ++++++++++++++++++++++++++++++++
 include/linux/rculist.h       |   17 -----
 include/linux/rculist_nulls.h |   55 ++++++++++++++++++
 include/net/sock.h            |   50 ++++++++++++----
 include/net/udp.h             |    2
 net/ipv4/udp.c                |   40 ++++++-------
 net/ipv6/udp.c                |   22 ++++---
 7 files changed, 228 insertions(+), 55 deletions(-)

From: Stephen Hemminger
Date: Thursday, October 30, 2008 - 8:51 am

On Thu, 30 Oct 2008 16:40:01 +0100

IMHO this goes over the edge into tricky hack. Is it really worth it?
Is there a better simpler way?
--

From: Corey Minyard
Date: Thursday, October 30, 2008 - 9:28 am

The only think I've thought of is to do a single smp_rmb() after the 
loop scanning the list and check the sk_hash value again.  That's better 
than the read barrier for every list element, but still not as good as 
this list from a performance point of view.

IMHO, this is a tricky hack, but it if is well abstracted and documented 
I think it's ok.  I'd guess something like this will become more often 
used as we get larger numbers of processors on systems.

It is annoying that it doesn't help the performance for multicast.  
However, I think the current patch will solve the DOS issue for 
multicast, since it switches to a normal spinlock and has a per-list lock.

-corey
--

From: Eric Dumazet
Date: Friday, October 31, 2008 - 7:37 am

Corey Minyard a =C3=A9crit :
ck.

About multicast, it should be possible to do something about it, if it ha=
ppens
to be an issue.

That is, do a lockless lookup and accumulate matching sockets ptr in a ta=
ble

(incrementing their refcount if not zero, checking key, adding in a local=
 stack).

If lookup must be restarted, forget all accumulated sockets (sock_put(ptr=
s))
   goto begin;

Then, send the (cloned) packet to all accumulated sockets, and
 sock_put() them to release the refcount.



Well, looking at current implementation, I found that udp_v4_mcast_next()=

doesnt take into account the 'struct net *net', so we have a bug here...

udp_v6_mcast_next() is buggy too (or at least its caller is)

David, please find a patch against net-2.6

Thanks

[PATCH] udp: multicast packets need to check namespace

Current UDP multicast delivery is not namespace aware.


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 net/ipv4/udp.c |   14 ++++++++------
 net/ipv6/udp.c |    8 ++++----
 2 files changed, 12 insertions(+), 10 deletions(-)
From: Pavel Emelyanov
Date: Friday, October 31, 2008 - 7:55 am

Acked-by: Pavel Emelyanov <xemul@openvz.org>


--

From: David Miller
Date: Saturday, November 1, 2008 - 9:22 pm

From: Pavel Emelyanov <xemul@openvz.org>

Applied, thanks everyone.
--

From: Eric Dumazet
Date: Thursday, October 30, 2008 - 10:12 am

rwlocks , spinlocks, seqlocks :)

More seriously Stephen, if the infrastructure is clean, and well tested on a relative
simple case (UDP), it can then be deployed on a much more interesting protocol : TCP

The moment we switch to RCU, we have to accept the pain of really understand what
we did. Details are scary yes.


--

From: David Miller
Date: Friday, October 31, 2008 - 12:51 am

From: Eric Dumazet <dada1@cosmosbay.com>


Agreed.
--

From: Peter Zijlstra
Date: Thursday, October 30, 2008 - 9:01 am

If we're going to do this, It'd be good to have the list_nulls stuff in
their own patch, as clearly they are not UDP specific.

Also, I think it would be very good to have some extensive comments in
the list_nulls files describing their use in clear and concise language,
because the above changelog doesn't even begin to explain things for
those not following this thread.



--

From: Keith Owens
Date: Thursday, October 30, 2008 - 5:14 pm

From: Eric Dumazet
Date: Thursday, November 13, 2008 - 6:13 am

Hi all

Here is a serie of three patches (based on net-next-2.6), to continue work
with RCU on UDP/TCP/DCCP stacks

Many thanks for all usefull reviews and comments, especially from Paul and Corey.

1) Introduce hlist_nulls variant of hlist

   hlist uses NULL value to finish a chain.
   hlist_nulls variant use the low order bit set to 1 to signal an end marker.
   This allows to store many different end markers, so that some RCU lockless
   algos (used in TCP/UDP stack for example) can save some memory barriers in
   fast paths.

2) Use hlist_nulls in UDP RCU code

   This is a straightforward patch, using hlist_nulls infrastructure.
   RCU-ification already done on UDP two weeks ago, so hlist_nulls
   permits us to avoid some memory barriers, both at lookup time
   and delete time. Patch is large because it adds new macros to
   include/net/sock.h. These macros will be used by TCP & DCCP too.

3) Convert TCP & DCCP hash tables to use RCU & hlist_nulls

   RCU was added to UDP lookups, using a fast infrastructure :
   - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
     price of call_rcu() at freeing time.
   - hlist_nulls permits to use few memory barriers.

   This patch uses same infrastructure for TCP/DCCP established
   and timewait sockets.

   Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
   using short lived TCP connections. A followup patch, converting
   rwlocks to spinlocks will even speedup this case.

   __inet_lookup_established() is pretty fast now we dont have to
   dirty a contended cache line (read_lock/read_unlock)

   Only established and timewait hashtable are converted to RCU
  (bind table and listen table are still using traditional locking)


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
--

From: Andi Kleen
Date: Thursday, November 13, 2008 - 10:20 am

Do you have any numbers that demonstrate the read memory barriers being
a performance problem? At least on x86 they should be very cheap
because they're normally nops.

-Andi

-- 
ak@linux.intel.com
--

From: David Miller
Date: Sunday, November 16, 2008 - 8:41 pm

From: Eric Dumazet <dada1@cosmosbay.com>

All 4 patches applied to net-next-2.6, thanks a lot Eric!

These patches are incredibly cool!

--

From: Christoph Lameter
Date: Wednesday, November 19, 2008 - 12:52 pm

How much does this give us on the aim9 udp test?
--

From: Eric Dumazet
Date: Thursday, November 13, 2008 - 6:14 am

hlist uses NULL value to finish a chain.

hlist_nulls variant use the low order bit set to 1 to signal an end-of-list marker.

This allows to store many different end markers, so that some RCU lockless
algos (used in TCP/UDP stack for example) can save some memory barriers in
fast paths.

Two new files are added :

include/linux/list_nulls.h
  - mimics hlist part of include/linux/list.h, derived to hlist_nulls variant

include/linux/rculist_nulls.h
  - mimics hlist part of include/linux/rculist.h, derived to hlist_nulls variant

   Only four helpers are declared for the moment :

     hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
     hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()

prefetches() were removed, since an end of list is not anymore NULL value.
prefetches() could trigger useless (and possibly dangerous) memory transactions.

Example of use (extracted from __udp4_lib_lookup())

	struct sock *sk, *result;
        struct hlist_nulls_node *node;
        unsigned short hnum = ntohs(dport);
        unsigned int hash = udp_hashfn(net, hnum);
        struct udp_hslot *hslot = &udptable->hash[hash];
        int score, badness;

        rcu_read_lock();
begin:
        result = NULL;
        badness = -1;
        sk_nulls_for_each_rcu(sk, node, &hslot->head) {
                score = compute_score(sk, net, saddr, hnum, sport,
                                      daddr, dport, dif);
                if (score > badness) {
                        result = sk;
                        badness = score;
                }
        }
        /*
         * if the nulls value we got at the end of this lookup is
         * not the expected one, we must restart lookup.
         * We probably met an item that was moved to another chain.
         */
        if (get_nulls_value(node) != hash)
                goto begin;

        if (result) {
                if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
                        result = ...
From: Peter Zijlstra
Date: Thursday, November 13, 2008 - 6:29 am

So by not using some memory barriers (would be nice to have it
illustrated which ones), we can race and end up on the wrong chain, in
case that happens we detect this by using this per-chain terminator and
try again.

It would be really good to have it explained in the rculist_nulls.h
comments what memory barriers are missing, what races they open, and how
the this special terminator trick closes that race.

I'm sure most of us understand it now, but will we still in a few
months? - how about new people?

Other than that, very cool stuff! :-)

--

From: Eric Dumazet
Date: Thursday, November 13, 2008 - 6:44 am

For example, if a process holds a lock, it doesnt need rcu version.


OK, maybe I should add a Documentation/RCU/rculist_nulls.txt file with
appropriate examples and documentation.

(Say the lookup/insert algorithms, with standard hlist and memory barriers,
and with hlist_nulls without those two memory barriers.

(These two memory barriers can be found in commits :

c37ccc0d4e2a4ee52f1a40cff1be0049f2104bba :

udp: add a missing smp_wmb() in udp_lib_get_port()

Corey Minyard spotted a missing memory barrier in udp_lib_get_port()

We need to make sure a reader cannot read the new 'sk->sk_next' value
and previous value of 'sk->sk_hash'. Or else, an item could be deleted
from a chain, and inserted into another chain. If new chain was empty
before the move, 'next' pointer is NULL, and lockless reader can
not detect it missed following items in original chain.

This patch is temporary, since we expect an upcoming patch
to introduce another way of handling the problem.


And commit 96631ed16c514cf8b28fab991a076985ce378c26 :

udp: introduce sk_for_each_rcu_safenext()

Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149933110672d
(udp: RCU handling for Unicast packets.)

 "If the socket is moved from one list to another list in-between the
 time the hash is calculated and the next field is accessed, and the
 socket has moved to the end of the new list, the traversal will not
 complete properly on the list it should have, since the socket will
 be on the end of the new list and there's not a way to tell it's on a
 new list and restart the list traversal.  I think that this can be
 solved by pre-fetching the "next" field (with proper barriers) before
 checking the hash."

This patch corrects this problem, introducing a new

Thanks Peter ;)


--

From: Eric Dumazet
Date: Thursday, November 13, 2008 - 9:02 am

[PATCH 4/3] rcu: documents rculist_nulls

Adds Documentation/RCU/rculist_nulls.txt file to describe how 'nulls'
end-of-list can help in some RCU algos.


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 Documentation/RCU/rculist_nulls.txt |  167 ++++++++++++++++++++++++++
 1 files changed, 167 insertions(+)
From: Peter Zijlstra
Date: Friday, November 14, 2008 - 8:16 am

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

--

From: David Miller
Date: Sunday, November 16, 2008 - 8:36 pm

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Since there seems to be consensus for these changes I'm going to merge this
stuff into net-next-2.6 so that I can add in the users that Eric has written.

Thanks everyone for the review and feedback.
--

From: Paul E. McKenney
Date: Wednesday, November 19, 2008 - 10:07 am

Very good -- only one small suggestion below.


In some cases, a "generation number" will be needed.  For example, there
are algorithms that must detect when an object has been removed and then
re-inserted with the same key.  One increments the generation number

--

From: Peter Zijlstra
Date: Friday, November 14, 2008 - 8:16 am

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

--

From: Paul E. McKenney
Date: Wednesday, November 19, 2008 - 10:01 am

Looks good, but a few questions and suggestions interspersed below.


And @pos is the starting point, correct?  Suggest something like:


The point here is to allow an RCU reader to grab the update-side lock
while holding a reference to an hlist_nulls_node, and then be able to
blindly call hlist_nulls_del_init_rcu() without having to do any complex
check to see if the element has already been deleted?

But this only works if each free operation waits for a grace period.
If using SLAB_DESTROY_BY_RCU, the would-be deleter still needs to

Any chance of using a trick like Oleg used to get rid of the "pos"
argument?  http://lkml.org/lkml/2008/3/12/47

The hlist_nulls_node must always be at an even address, correct?
Couldn't this fact be used to allow testing for is_a_nulls() on tpos
rather than on pos?  Or is there a better way to approach this?
--

From: Eric Dumazet
Date: Wednesday, November 19, 2008 - 10:53 am

Yes, comment was copied from hlist_for_each_entry_from() comment, this one

<start a brain refresh cycle>
  <read again your questions>
    Tilt... 


hlist_nulls_del_init_rcu() is only used by a writer, exactly
like hlist_del_init_rcu().
I see nothing special about SLAB_DESTROY_BY_RCU here.

static inline void hlist_del_init_rcu(struct hlist_node *n)
{
        if (!hlist_unhashed(n)) {
                __hlist_del(n);
                n->pprev = NULL;
        }

#define sk_nulls_for_each_rcu(__sk, node, list) \
	hlist_nulls_for_each_entry_rcu(__sk, node, list, sk_nulls_node)

1) __sk is the pointer to found item if any is found in the loop

2) node will contain the end value of chain, in case we find nothing in loop
   because we need to check it after the loop.

if (get_nulls_value(node) != hash)
	 goto begin;

I dont know, it seems quite complex to try to use only three args ?

This algo is not very easy to read as is already ...


--

From: Paul E. McKenney
Date: Wednesday, November 19, 2008 - 11:46 am

Not a problem, as you don't use it the way I was thinking.

For whatever it is worth, here is a more complete use case, on the
off-chance that it becomes useful some time:

	retry:
	rcu_read_lock();
	hlist_nulls_for_each_entry_rcu(tpos, pos, head, hn_node) {
		if (!(curgen = still_valid(tpos)))
			goto retry;
		if (needs_deletion(tpos)) {
			spin_lock(&update_side_lock);
			if (still_valid(tpos) == curgen)
				hlist_nulls_del_init_rcu(pos);
			spin_unlock(&update_side_lock);
		}
	}
	rcu_read_unlock();

This approach requires that the key and a generation number be encoded
into a single word, and that the generation number be changed on each

One way around #2 would be for get_nulls_value() to undo the
hlist_nulls_entry().  Not sure whether it is worth it, but a
thought.

							Thanx, Paul
--

From: Arnaldo Carvalho de Melo
Date: Wednesday, November 19, 2008 - 11:53 am

- Arnaldo
--

From: Paul E. McKenney
Date: Wednesday, November 19, 2008 - 2:17 pm

Indeed!  Either that or have an rcu_read_unlock() before the
goto retry.

Good catch!

--

From: Eric Dumazet
Date: Wednesday, November 19, 2008 - 1:39 pm

Hum, we should add this template in Documentation/RCU  I guess

Thanks


--

From: Paul E. McKenney
Date: Wednesday, November 19, 2008 - 2:21 pm

With Arnaldo's change -- probably should prototype and test to find
the other inevitable bugs.  :-/

							Thanx, Paul
--

From: Eric Dumazet
Date: Thursday, November 13, 2008 - 6:15 am

This is a straightforward patch, using hlist_nulls infrastructure.

RCUification already done on UDP two weeks ago.

Using hlist_nulls permits us to avoid some memory barriers, both
at lookup time and delete time.

Patch is large because it adds new macros to include/net/sock.h.
These macros will be used by TCP & DCCP in next patch.


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/rculist.h |   17 -----------
 include/net/sock.h      |   57 ++++++++++++++++++++++++++++++--------
 include/net/udp.h       |    2 -
 net/ipv4/udp.c          |   47 ++++++++++++++-----------------
 net/ipv6/udp.c          |   26 +++++++++--------
 5 files changed, 83 insertions(+), 66 deletions(-)

From: Paul E. McKenney
Date: Wednesday, November 19, 2008 - 10:29 am

Looks good, one question below about the lockless searches.  If the
answer is that the search must complete undisturbed by deletions and
additions, then:



Shouldn't this check go -after- the check for "result"?  Or is this a
case where the readers absolutely must have traversed a chain without


--

From: Eric Dumazet
Date: Wednesday, November 19, 2008 - 10:53 am

Very good question

Yes, we really have to look at all the sockets, to find the one with higher score, not
one with a low score, and the one with higher score being ignored because we didnt examined it.

So we really must check we finished our lookup on the right chain end, not an alien one.

(Previous UDP code had a shortcut if we found the socket with the maximal possible score,
I deleted this test as it had basically 0.0001 % of chance being hit)

Thanks a lot for your patient review Paul.

--

From: Eric Dumazet
Date: Thursday, November 13, 2008 - 6:15 am

RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
  price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.

This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.

Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.

__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)

Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/inet_hashtables.h    |    4 -
 include/net/inet_timewait_sock.h |   10 +--
 net/core/sock.c                  |    4 +
 net/dccp/ipv4.c                  |    1
 net/dccp/ipv6.c                  |    1
 net/dccp/proto.c                 |    4 -
 net/ipv4/inet_diag.c             |    6 +-
 net/ipv4/inet_hashtables.c       |   78 ++++++++++++++++++++---------
 net/ipv4/inet_timewait_sock.c    |   26 +++++----
 net/ipv4/tcp.c                   |    4 -
 net/ipv4/tcp_ipv4.c              |   25 ++++-----
 net/ipv6/inet6_hashtables.c      |   70 +++++++++++++++++---------
 net/ipv6/tcp_ipv6.c              |    1
 13 files changed, 150 insertions(+), 84 deletions(-)
From: Peter Zijlstra
Date: Thursday, November 13, 2008 - 6:34 am

This is the validation step that verifies the race opened by using
SLAB_DESTROY_BY_RCU, right?

Does it make sense to add a little comment to these validation steps to

--

From: Eric Dumazet
Date: Thursday, November 13, 2008 - 6:51 am

The atomic_inc_not_zero() is not related to SLAB_DESTROY_BY_RCU but
classic RCU lookup. A writer can delete the item right before we try to use it.

Next step is necessary in case the deleted item was re-allocated and inserted
in a hash chain (this one or another one, it doesnt matter). In this case,
previous atomic_inc_not_zero test will succeed. So we must check again the item
we selected (and refcounted) is the one we were searching.

So yes, this bit should be documented, since SLAB_DESTROY_BY_RCU is

Yes, you are right.

Thanks


--

From: Christoph Lameter
Date: Thursday, November 13, 2008 - 7:08 am

It is used for the anonymous vmas. That is the purpose that Hugh
introduced it for since he saw regression if he would use straight rcu
freeing. See mm/rmap.c. SLAB_DESTROY_BY_RCU a pretty strange way of
using RCU and slab so it should always be documented in detail.

--

From: Peter Zijlstra
Date: Thursday, November 13, 2008 - 7:22 am

We have one user, anon_vma, and one thing that is very nearly identical
the lockless pagecache.

See page_cache_get_speculative() and find_get_page(). The pagecache gets
away with this due to the simple fact that the page frames are never
freed.

Hmm, I once wrote a comment to go with SLAB_DESTROY_BY_RCU, which seems
to have gotten lost... /me goes dig.

Found it:  http://lkml.org/lkml/2008/4/2/143

I guess I'd better re-submit that..

--

From: Christoph Lameter
Date: Thursday, November 13, 2008 - 7:27 am

Looks good. CC Pekka and he will merge it.


--

From: Paul E. McKenney
Date: Wednesday, November 19, 2008 - 10:53 am

Looks good to me!


--

From: Eric Dumazet
Date: Sunday, November 23, 2008 - 2:33 am

Hi David

Please find patch to convert TCP/DCCP listening hash tables
to RCU.

A followup patch will cleanup all sk_node fields and macros
that are not used anymore.

Thanks

[PATCH] net: Convert TCP/DCCP listening hash tables to use RCU

This is the last step to be able to perform full RCU lookups
in __inet_lookup() : After established/timewait tables, we
add RCU lookups to listening hash table.

The only trick here is that a socket of a given type (TCP ipv4,
TCP ipv6, ...) can now flight between two different tables
(established and listening) during a RCU grace period, so we
must use different 'nulls' end-of-chain values for two tables.

We define a large value :

#define LISTENING_NULLS_BASE (1U << 29)

So that slots in listening table are guaranteed to have different
end-of-chain values than slots in established table. A reader can
still detect it finished its lookup in the right chain.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/inet_hashtables.h |    9 +
 net/ipv4/inet_diag.c          |    4
 net/ipv4/inet_hashtables.c    |  148 ++++++++++++++++----------------
 net/ipv4/tcp_ipv4.c           |    8 -
 net/ipv6/inet6_hashtables.c   |   94 ++++++++++++--------
 5 files changed, 147 insertions(+), 116 deletions(-)
From: Paul E. McKenney
Date: Sunday, November 23, 2008 - 8:59 am

I do like this use of the full set up upper bits!  However, wouldn't it
be a good idea to use a larger base value for 64-bit systems, perhaps
using CONFIG_64BIT to choose?  500M entries might not seem like that
many in a few years time...


--

From: Eric Dumazet
Date: Sunday, November 23, 2008 - 11:42 am

Well, this value is correct up to 2^29 slots, and a hash table of 2^32 bytes
(8 bytes per pointer)

A TCP socket uses about 1472 bytes on 64bit arches, so 2^29 sessions
would need 800 GB of ram, not counting dentries, inodes, ...

I really doubt a machine, even with 4096 cpus should/can handle so many
tcp sessions :)


--

From: Paul E. McKenney
Date: Sunday, November 23, 2008 - 12:17 pm

200MB per CPU, right?

But yes, now that you mention it, 800GB of memory dedicated to TCP
connections sounds almost as ridiculous as did 640K of memory in the
late 1970s.  ;-)

Nevertheless, I don't have an overwhelming objection to the current
code.  Easy enough to change should it become a problem, right?

						Thanx, Paul
--

From: Eric Dumazet
Date: Sunday, November 23, 2008 - 1:18 pm

Sure. By that time, cpus might be 128 bits or 256 bits anyway :)


--

From: Paul E. McKenney
Date: Sunday, November 23, 2008 - 3:33 pm

Or even 640K bits.  ;-)

							Thanx, Paul
--

From: David Miller
Date: Sunday, November 23, 2008 - 6:23 pm

From: Eric Dumazet <dada1@cosmosbay.com>

Applied, thanks Eric.
--

From: Peter Zijlstra
Date: Thursday, October 30, 2008 - 4:04 am

Why not use the bucket pointer as terminating condition?

Because all you really need is a pointer that is specific per bucket,
and not a valid element, right?


--

From: Eric Dumazet
Date: Thursday, October 30, 2008 - 4:30 am

Yes, but it forces compiler to keep around the bucket pointer.

Big chance this value will be stored on stack.

Next patch will use the least significant bit to distinguish a valid
pointer from a "NULL pointer"

--

From: Paul E. McKenney
Date: Thursday, October 30, 2008 - 11:25 am

As you might guess, I do like that idea.  ;-)

On my plane trip, so am reviewing what I believe to be your current
combined patchset consisting of six patches transmitted in this thread:

Message-ID: <49081D67.3050502@cosmosbay.com>
Message-ID: <49082718.2030201@cosmosbay.com>
Message-ID: <490874F2.2060306@cosmosbay.com>
Message-ID: <4908DEDE.5030706@cosmosbay.com>
Message-ID: <49094B0F.2090208@cosmosbay.com>
Message-ID: <490838C6.4060304@cosmosbay.com>

This probably won't be your latest and greatest by the time you receive
this, but it appears to be the latest and greatest that I have.  ;-)

A few comments, search for blank lines.


It might be good to take a predicate macro/function instead of nullval,
in order to allow this primitive to be used for a number of different
pseudo-NULL-pointer schemes, and ditto for the similar macros you define
below.  Might be too early to know exactly how such a primitive should
look, though.

So just a random thought at this point.

In any case, this macro cannot be used in read-side critical sections.

Used by sk_for_each_nulls(), which is called by udp_lib_lport_inuse()
and udp_get_first().

udp_lib_lport_inuse() is called by udp_lib_get_port(), which holds the
hslot->lock, so should be OK.


This also must be called with the update-side lock held.  It is
wrappered by sk_for_each_from_nulls(), which in turn is called from
udp_v4_mcast_next() and udp_v6_mcast_next().  According to messages
earlier in this thread, that is the case.

udp_v4_mcast_next() is called from __udp4_lib_mcast_deliver(), which does
hold hslot->lock, so OK.  Ditto for the calls from udp_v6_mcast_next().

Interestingly enough, these two functions use sk_head(), which calls
__sk_head(), which do not do rcu_dereference().  Which is another reason

Looks good.  It should be possible to get rid of the "pos" argument in
the upcoming version, as all of the offsets should be even, so the
bottom bit would come through unchanged.  In fact, it should be ...
From: Eric Dumazet
Date: Friday, October 31, 2008 - 9:40 am

Well, we have 65536(=2^16) possible port values, and while 'rand' is random,
it has the interesting property/bias of being odd.

We know (thanks modular arithmetic / congruence relation) we will hit
all 65356 values exactly once, after exactly 65536 iterations.

--

From: Paul E. McKenney
Date: Friday, October 31, 2008 - 8:10 pm

Ah, got it!  Thank you for the explanation!

I was fixating on the low..high interval.  ;-)

							Thanx, Paul
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 10:32 am

yes, but 'new position' is 'before any not yet examined objects', since

Yes, I'll double check too, this seems tricky :)

About SLAB_DESTROY_BY_RCU effect, we now have two different kmem_cache for "UDP-Lite"
and "UDP".

This is expected, but we could avoid that and alias these caches, since
these objects have the same *type* . (The fields used for the RCU lookups,
deletes and inserts are the same)

Maybe a hack in net/ipv4/udplite.c before calling proto_register(), to
copy the kmem_cache from UDP.

--

From: Paul E. McKenney
Date: Wednesday, October 29, 2008 - 11:11 am

OK.  However, this reasoning assumes that a socket with a given
udp_hashfn() value will appear on one and only one list.  There are no


As long as this preserves the aforementioned assumption that a socket
with a given hash can appear on one and only one list.  ;-)

							Thanx, Paul
--

From: David Miller
Date: Wednesday, October 29, 2008 - 11:29 am

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Nope, with UDP things are very simple, just one hash table.
--

From: Paul E. McKenney
Date: Wednesday, October 29, 2008 - 11:38 am

Cool!  That does make things easier.  ;-)

							Thanx, Paul
--

From: Eric Dumazet
Date: Wednesday, October 29, 2008 - 11:36 am

Ouch, thanks Paul, that is indeed the point, well sort of.

If a UDP socket is freed, and re-allocated as an UDP-Lite socket, inserted on
the udplite_table, then we would have a problem with current implementation.

A reader could be directed to the chain of the other hash table, without
noticing it should restart its lookup...

Not worth adding a check to detect such a scenario, we can live with two different
kmem_cache after all, they are not that expensive.


--

From: David Miller
Date: Wednesday, October 29, 2008 - 11:20 am

From: Eric Dumazet <dada1@cosmosbay.com>

Also applied, thanks Eric.
--

From: Peter Zijlstra
Date: Thursday, October 30, 2008 - 4:12 am

why _safenext and not _safe like hlist_for_each_entry_safe() which also
keeps a next pointer?

Also note the difference in argument order between these two.

--

From: Eric Dumazet
Date: Thursday, October 30, 2008 - 4:29 am

Yes, this one is going to vanish soon.


--

From: Eric Dumazet
Date: Tuesday, October 28, 2008 - 1:37 pm

UDP sockets are hashed in a 128 slots hash table.

This hash table is protected by *one* rwlock.

This rwlock is readlocked each time an incoming UDP message is handled.

This rwlock is writelocked each time a socket must be inserted in
hash table (bind time), or deleted from this table (close time)

This is not scalable on SMP machines :

1) Even in read mode, lock() and unlock() are atomic operations and
  must dirty a contended cache line, shared by all cpus.

2) A writer might be starved if many readers are 'in flight'. This can
  happen on a machine with some NIC receiving many UDP messages. User
  process can be delayed a long time at socket creation/dismantle time.

This patch prepares RCU migration, by introducing 'struct udp_table
and struct udp_hslot', and using one rwlock per chain, to reduce
contention on central rwlock.

Introducing one rwlock per chain reduces latencies, for port
randomization on heavily loaded UDP servers.

Some cleanups were done to ease review of following patch.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/sock.h    |    2
 include/net/udp.h     |   25 ++--
 include/net/udplite.h |    2
 net/ipv4/udp.c        |  208 +++++++++++++++++++++++-----------------
 net/ipv4/udp_impl.h   |    4
 net/ipv4/udplite.c    |   13 +-
 net/ipv6/udp.c        |  112 +++++++++++----------
 net/ipv6/udp_impl.h   |    4
 net/ipv6/udplite.c    |    8 -
 9 files changed, 214 insertions(+), 164 deletions(-)




From: Christian Bell
Date: Tuesday, October 28, 2008 - 2:23 pm

This structure should be aligned up to cacheline to reduce false  

The fail: label below should still unlock_bh when the above condition  

cheers,

	. . christian
--

From: Evgeniy Polyakov
Date: Tuesday, October 28, 2008 - 2:31 pm

This should be new label actually, since it is also used for non-locked
fail.

-- 
	Evgeniy Polyakov
--

From: Eric Dumazet
Date: Tuesday, October 28, 2008 - 2:48 pm

Yes, I though about that. But : a full cache line is a waste of memory, and
choosing a power of two alignement is not easy because of 32bit/64bit arches,

Good spoting, the write_unlock_bh(&hslot->lock); must be moved after the "fail:" label.

Thanks a lot


--

From: Evgeniy Polyakov
Date: Tuesday, October 28, 2008 - 2:28 pm

Hi.


Ugh-ogh, that's a very cool change!
So far reviewed only this and things look very promising.

-- 
	Evgeniy Polyakov
--

From: Eric Dumazet
Date: Tuesday, October 28, 2008 - 1:37 pm

UDP sockets are hashed in a 128 slots hash table.

This hash table is protected by *one* rwlock.

This rwlock is readlocked each time an incoming UDP message is handled.

This rwlock is writelocked each time a socket must be inserted in
hash table (bind time), or deleted from this table (unbind time)

This is not scalable on SMP machines :

1) Even in read mode, lock() and unlock() are atomic operations and
must dirty a contended cache line, shared by all cpus.

2) A writer might be starved if many readers are 'in flight'. This can
happen on a machine with some NIC receiving many UDP messages. User
process can be delayed a long time at socket creation/dismantle time.


What Corey and I propose is to use RCU to protect this hash table.

Goals are :

1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path. Using an array of rwlocks (one per
slot for example is not an option in this regard)

Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.

2) No expensive operations in the socket bind/unhash phases :
  - No expensive synchronize_rcu() calls.

  - No added rcu_head in socket structure, increasing memory needs,
  but more important, forcing us to use call_rcu() calls,
  that have the bad property of making sockets structure cold.
  (rcu grace period between socket freeing and its potential reuse
   make this socket being cold in CPU cache).
  David did a previous patch using call_rcu() and noticed a 20%
  impact on TCP connection rates.

  Quoting Cristopher Lameter :
  "Right. That results in cacheline cooldown. You'd want to recycle
   the object as they are cache hot on a per cpu basis. That is screwed
   up by the delayed regular rcu processing. We have seen multiple
   regressions due to cacheline cooldown.
   The only choice in cacheline hot sensitive areas is to deal with the
   complexity that comes with SLAB_DESTROY_BY_RCU or ...
From: Stephen Hemminger
Date: Tuesday, October 28, 2008 - 2:28 pm

On Tue, 28 Oct 2008 21:37:15 +0100

We should just make it a spin_lock later and speed up udp socket creation.
--

From: Eric Dumazet
Date: Tuesday, October 28, 2008 - 2:50 pm

Indeed

--

From: Corey Minyard
Date: Tuesday, October 7, 2008 - 9:43 am

Just to be clear, that was 10x with preempt RT, which converts rwlocks 
into PI mutexes.  So 16 processors going for the same lock is pretty ugly.

Under heavy loads there is also a writer starvation problem, I believe 
in non-RT.  You can't actually create or destroy a UDP socket when the 
load is high because there's always a reader holding the lock.  RCU also 
solves that problem.

-corey

--

From: David Miller
Date: Tuesday, October 7, 2008 - 11:26 am

From: Eric Dumazet <dada1@cosmosbay.com>

As stated, I added RCU destruction generically for socket objects, and it
showed up clearly.

So "not be slowed down that much" has been disproven, at least to me,
already :-)
--

From: Eric Dumazet
Date: Wednesday, October 8, 2008 - 1:35 am

RCU in hash table is ok for managing read mostly data, since
only during the read access you avoid to dirty a rwlock...

If we have a workload that insert/delete sockets as hell but
receive few frames (that hit the hash table in a read only way),
then you defeat the purpose of RCU, and pay the price of
throwing away (in rcu queue) hot data that will become cold
before reuse...

BTW is there any chance your results were obtained before October 2005 ?

At that time, RCU was able to queue an unlimited number of events.
a single loop doing close(open("/dev/null",0)) could exhaust RAM...


Refs: commit  5ee832dbc6770135ec8d63296af0a4374557bb79
and many others...

Anyway we can probably code something without call_rcu() cache blower
for UDP, if time permits :)



--

From: David Miller
Date: Wednesday, October 8, 2008 - 9:38 am

From: Eric Dumazet <dada1@cosmosbay.com>

No, the timestamp on the saved patch file I have is June 16, 2008 :-)
--

From: Peter Zijlstra
Date: Tuesday, October 7, 2008 - 1:31 am

Yeah, sync_rcu() is rediculously expensive, at best 3 jiffies IIRC.

--

From: Paul E. McKenney
Date: Tuesday, October 7, 2008 - 7:36 am

I could make it -much- faster, but at the expense of -serious- CPU
overhead.  Still, might be useful during boot time (when the system
can't do anything useful anyway) to accelerate getting data structures
initialized.

							Thanx, Paul
--

From: David Miller
Date: Tuesday, October 7, 2008 - 11:29 am

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Probably the RCU delay on a 128 cpu machine :-)

Also I bet batching the socket destruction eliminates all of
the cached local state we have in the cpu at the actual socket
destruction time.
--

From: Corey Minyard
Date: Monday, October 6, 2008 - 3:07 pm

It delays until all currently executing RCU read-side sections have 
executed (new ones don't count, just currently executing ones).  I'm not 
sure what this delay is, but I would expect it to be fairly small.  This 
function is only called when a socket is closed, too, so it's not a 
I'd prefer that, too, but that would mean adding another member to the 
RCU probably wouldn't be a good choice for short-lived http sessions, 
since you will only get a couple of messages that would matter.  I'm not 
against adding an item to struct sock, but this is not a common thing 
That's an interesting thought; I didn't know that capability was there.  
I can look at that.  With this, the short-lived TCP sessions might not 
matter, though that's a different issue.

-corey
--

From: Peter Zijlstra
Date: Tuesday, October 7, 2008 - 1:17 am

Be careful!, SLAB_DESTROY_BY_RCU just means the slab page gets
RCU-freed, this means that slab object pointers stay pointing to valid
memory, but it does _NOT_ mean those slab objects themselves remain
valid.

The slab allocator is free to re-use those objects at any time -
irrespective of the rcu-grace period. Therefore you will have to be able
to validate that the object you point to is indeed the object you
expect, otherwise strange and wonderful things will happen.

--

From: Eric Dumazet
Date: Tuesday, October 7, 2008 - 2:24 am

Thanks for this clarification. I guess we really need a rcu head then :)



--

From: Christoph Lameter
Date: Tuesday, October 7, 2008 - 7:15 am

No you just need to make sure that the object you located is still active
(f.e. refcount > 0) and that it is really a match (hash pointers may be
updated asynchronously and therefore point to the object that has been reused
for something else).

Generally it is advisable to use SLAB_DESTROY_BY_RCU because it preserves the
cache hot advantages of the objects. Regular RCU freeing will let the object
expire for a tick or so which will result in the cacheline cooling down.


--

From: Paul E. McKenney
Date: Tuesday, October 7, 2008 - 7:38 am

In some cases, you might be able to not care, but yes, most of the time,

And SLAB_DESTROY_BY_RCU guarantees that the type of the object will
remain the same during any given RCU read-side critical section.

							Thanx, Paul
--

From: Eric Dumazet
Date: Tuesday, October 7, 2008 - 7:50 am

Seems really good to master this SLAB_DESTROY_BY_RCU thing (I see almost no use
of it in current kernel)

1) Hum, do you know why "struct file" objects dont use SLAB_DESTROY_BY_RCU then,
since we noticed a performance regression for several workloads at RCUification
of file structures ?

2) What prevents an object to be *freed* (and deleted from a hash chain), then
re-allocated and inserted to another chain (different keys) ? (final refcount=1)

If the lookup detects a key mismatch, how will it continue to the next item,
since 'next' pointer will have been reused for the new chain insertion...

Me confused...



--

From: Paul E. McKenney
Date: Tuesday, October 7, 2008 - 8:05 am

Nothing prevents this from happening.  You either have to have some sort
of validation step based on object identity (e.g., a generation number
that is incremented on each allocation), or have an algorithm that
doesn't care if searches sometimes spuriously fail to find something
that really is in the list.

Which is one of the reasons that its use is rare.  But perhaps more

One way to approach this is to have a generation number that is
incremented each time the object is recycled along with a pointer to the
list header.  The list header contains the most recent generation number
of any element in the list.  Then if either the generation number of
a give element is greater than that of the header when you started the
search, or the element's pointer no longer references the list header
you started your search from, restart the search.  Read-side memory
barriers may also be required in some cases.

It may be possible to simplify this in some special cases.

There are probably better ways to approach this, but that is one way.

						Thanx, Paul
--

From: Peter Zijlstra
Date: Tuesday, October 7, 2008 - 8:09 am

Right, you can't have lists with items like that. You can only do
matching lookups. What you do is:

find_get_obj(key)
{
 rcu_read_lock()
again:
 obj = lookup(key);
 if (!obj)
  goto out;

 /* 
  * if we can't get a ref, the item got freed concurrently
  * try again
  */
 if (!get_ref_unless_zero(obj))
  goto again;

 /*
  * if we did get a ref, but its not the object we expected
  * try again
  */
 if (obj->key != key) {
   put_ref(obj);
   goto again;
 }
out:
 rcu_read_unlock();
 return obj;
}

Which is basically what we do with the lockless pagecache, where we
don't need the RCU because the page-frames are never freed.



--

From: Christoph Lameter
Date: Tuesday, October 7, 2008 - 8:23 am

Because my patches were not accepted that fix the issue.


If there is a mismatch then you have to do another hash lookup. Do an rcu
unlock and start over.



--

Previous thread: [PATCH 2/3] Add RCU versions of socket hlist handling by Corey Minyard on Monday, October 6, 2008 - 11:50 am. (1 message)

Next thread: IGMP sent to Foreign VLAN by Jarod Neuner on Monday, October 6, 2008 - 12:51 pm. (8 messages)