Re: [RFC,PATCH] ipvs: Fix race condition in lblb and lblcr schedulers

Previous thread: [NET-NEXT PATCH 2/3] e1000e: add support for 82567LM-3 and 82567LF-3 (ICH10D) parts by Jeff Kirsher on Monday, August 11, 2008 - 3:39 pm. (4 messages)

Next thread: [Qdisc] Prioritizing traffic per vlan using qdisc. by Pravin Bathija on Monday, August 11, 2008 - 5:48 pm. (4 messages)
From: Sven Wegener
Date: Monday, August 11, 2008 - 3:57 pm

Both schedulers have a race condition that happens in the following 
situation:

We have an entry in our table that already has expired according to it's 
last use time. Then we need to schedule a new connection that uses this 
entry.

CPU 1                           CPU 2

ip_vs_lblc_schedule()
  ip_vs_lblc_get()
    lock table for read
    find entry
    unlock table
                                ip_vs_lblc_check_expire()
                                  lock table for write
                                  kfree() expired entry
                                  unlock table
    return invalid entry

Problem is that we assign the last use time outside of our critical 
region. We can make hitting this race more difficult, if not impossible, 
if we assign the last use time while still holding the lock for reading. 
That gives us six minutes during which it's save to use the entry, which 
should be enough for our use case, as we're going to use it immediately 
and don't keep a long reference to it.

We're holding the lock for reading and not for writing. The last use time 
is an unsigned long, so the assignment should be atomic by itself. And we 
don't care, if some other user sets it to a slightly different value. The 
read_unlock() implies a barrier so that other CPUs see the new last use 
time during cleanup, even if we're just using a read lock.

Other solutions would be: 1) protect the whole ip_vs_lblc_schedule() with 
write_lock()ing the lock, 2) add reference counting for the entries, 3) 
protect each entry with it's own lock. And all are bad for performance.

Comments? Ideas?

---
 net/ipv4/ipvs/ip_vs_lblc.c  |    4 +++-
 net/ipv4/ipvs/ip_vs_lblcr.c |    4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipvs/ip_vs_lblc.c b/net/ipv4/ipvs/ip_vs_lblc.c
index 0efa3db..65f8414 100644
--- a/net/ipv4/ipvs/ip_vs_lblc.c
+++ b/net/ipv4/ipvs/ip_vs_lblc.c
@@ -144,6 +144,8 @@ ip_vs_lblc_new(__be32 daddr, struct ip_vs_dest *dest)
 ...
From: Simon Horman
Date: Monday, August 11, 2008 - 7:10 pm

Is there a pathological case here if sysctl_ip_vs_lblc_expiration is
set to be very short and we happen to hit ip_vs_lblc_full_check()?

To be honest I think that I like the reference count approach best,
as it seems safe and simple. Is it really going to be horrible
for performance?

If so, I wonder if a workable solution would be to provide a more fine-grained
lock on tbl. Something like the way that ct_read_lock/unlock() works.
--

From: Sven Wegener
Date: Monday, August 11, 2008 - 9:27 pm

Also possible. But I guess I was thinking too complicated last night. What 
I was after with the "protect the whole ip_vs_lblc_schedule() with 
write_lock()ing the lock" was also to simply prevent someone adding 
duplicate entries. If we just extend the read_lock() region to cover the 
whole usage of the entry and do an additional duplicate check during 
inserting the entry under write_lock(), we fix the issue and also fix the 
race that someone may add duplicate entries. We have a bit overhead, 
because we unlock/lock and also there is a chance of doing one or more 
useless __ip_vs_wlc_schedule(), but in the end this should work. More 
fine-grained locking could help to lower the impact.

Sven
--

From: Sven Wegener
Date: Tuesday, August 12, 2008 - 6:07 am

I wondered if this whole thing can ever be totally race condition free, 
without changing how destinations are purged from the trash. Initial 
version of the patch below. Basically it pushes the locking up into 
ip_vs_lblc_schedule() and rearranges the code to be safe that we have a 
valid destination.

diff --git a/net/ipv4/ipvs/ip_vs_lblc.c b/net/ipv4/ipvs/ip_vs_lblc.c
index 7a6a319..67f7b04 100644
--- a/net/ipv4/ipvs/ip_vs_lblc.c
+++ b/net/ipv4/ipvs/ip_vs_lblc.c
@@ -123,31 +123,6 @@ static ctl_table vs_vars_table[] = {
 
 static struct ctl_table_header * sysctl_header;
 
-/*
- *      new/free a ip_vs_lblc_entry, which is a mapping of a destionation
- *      IP address to a server.
- */
-static inline struct ip_vs_lblc_entry *
-ip_vs_lblc_new(__be32 daddr, struct ip_vs_dest *dest)
-{
-	struct ip_vs_lblc_entry *en;
-
-	en = kmalloc(sizeof(struct ip_vs_lblc_entry), GFP_ATOMIC);
-	if (en == NULL) {
-		IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
-		return NULL;
-	}
-
-	INIT_LIST_HEAD(&en->list);
-	en->addr = daddr;
-
-	atomic_inc(&dest->refcnt);
-	en->dest = dest;
-
-	return en;
-}
-
-
 static inline void ip_vs_lblc_free(struct ip_vs_lblc_entry *en)
 {
 	list_del(&en->list);
@@ -189,10 +164,8 @@ ip_vs_lblc_hash(struct ip_vs_lblc_table *tbl, struct ip_vs_lblc_entry *en)
 	 */
 	hash = ip_vs_lblc_hashkey(en->addr);
 
-	write_lock(&tbl->lock);
 	list_add(&en->list, &tbl->bucket[hash]);
 	atomic_inc(&tbl->entries);
-	write_unlock(&tbl->lock);
 
 	return 1;
 }
@@ -209,23 +182,54 @@ ip_vs_lblc_get(struct ip_vs_lblc_table *tbl, __be32 addr)
 
 	hash = ip_vs_lblc_hashkey(addr);
 
-	read_lock(&tbl->lock);
-
 	list_for_each_entry(en, &tbl->bucket[hash], list) {
 		if (en->addr == addr) {
 			/* HIT */
-			read_unlock(&tbl->lock);
 			return en;
 		}
 	}
 
-	read_unlock(&tbl->lock);
-
 	return NULL;
 }
 
 
 /*
+ *      new/free a ip_vs_lblc_entry, which is a mapping of a destionation
+ *      IP address to a server.
+ */
+static inline struct ...
From: Simon Horman
Date: Tuesday, August 12, 2008 - 10:42 pm

Hi Sven,

I think that I am happy with this, except for updating lastuse
under a read lock, is it really safe? Do you still have any concerns?

I've annotated the code with my thoughts on how your locking is working

Setting en->lastuse with a read lock held makes me feel uncomfortable.

At this point we know that dest exists and is not in the trash.
We know that dest can't disapear given that its not already
in the trash and our caller holds a read lock on __ip_vs_svc_lock.
So dest will be safe until after ip_vs_lblc_schedule() returns.

Dest seems ok :-)

Ok, that seems complex but non-racy to me :-)


After the lock is released we may now lose en, but we don't care
as its not used again inside ip_vs_lblc_schedule().


This new dest must also not be in the trash by virtue of
being returned by __ip_vs_wlc_schedule() and delitions from
svc->destinations being protected by __ip_vs_svc_lock, which
is held by the caller.

Also good :-)

BTW, the name of __ip_vs_wlc_schedule, really ought to be changed

This write lock looks obviously correct as the destructors
--

From: Sven Wegener
Date: Tuesday, August 12, 2008 - 11:59 pm

It looks suspicous, but word-size writes should be atomic by itself. So 
even doing it under read lock should not lead to corrupted values. We 
don't care if the values that are set concurrently are overwritten, 
because they should only differ in a couple of jiffies. Might be good to 

Yeah, that's the point where we prevent the race with the trash purging. 
We could change the purging of destinations from the trash to be under 
__ip_vs_svc_lock write locked, then we know that all destinations, even 
the ones in the trash are valid. Might make more sense than duplicating 
--

From: Simon Horman
Date: Wednesday, August 13, 2008 - 12:12 am

Ok. I was only worried about the atomicity. I agree that having the

--

From: Sven Wegener
Date: Wednesday, August 13, 2008 - 3:36 am

Looking at the code it doesn't seem that simple. We lock __ip_vs_svc_lock 
for writing and then wait for all current users of the service to go away. 
But it's still possible that we clean up destinations from other services. 
And when we restrict it to only clean up the destinations that match our 
service, we have a chance of leaving destinations in the trash for good.

Sven
--

From: Sven Wegener
Date: Thursday, August 14, 2008 - 11:03 pm

Still thinking about it, I'm against doing it globally. It's too much 
hassle and in the end we still need to check for IP_VS_DEST_F_AVAILABLE in 
the schedulers, because destinations in the trash shouldn't be used for 
further connections.

Sven
--

Previous thread: [NET-NEXT PATCH 2/3] e1000e: add support for 82567LM-3 and 82567LF-3 (ICH10D) parts by Jeff Kirsher on Monday, August 11, 2008 - 3:39 pm. (4 messages)

Next thread: [Qdisc] Prioritizing traffic per vlan using qdisc. by Pravin Bathija on Monday, August 11, 2008 - 5:48 pm. (4 messages)