Both schedulers have a race condition that happens in the following
situation:
We have an entry in our table that already has expired according to it's
last use time. Then we need to schedule a new connection that uses this
entry.
CPU 1 CPU 2
ip_vs_lblc_schedule()
ip_vs_lblc_get()
lock table for read
find entry
unlock table
ip_vs_lblc_check_expire()
lock table for write
kfree() expired entry
unlock table
return invalid entry
Problem is that we assign the last use time outside of our critical
region. We can make hitting this race more difficult, if not impossible,
if we assign the last use time while still holding the lock for reading.
That gives us six minutes during which it's save to use the entry, which
should be enough for our use case, as we're going to use it immediately
and don't keep a long reference to it.
We're holding the lock for reading and not for writing. The last use time
is an unsigned long, so the assignment should be atomic by itself. And we
don't care, if some other user sets it to a slightly different value. The
read_unlock() implies a barrier so that other CPUs see the new last use
time during cleanup, even if we're just using a read lock.
Other solutions would be: 1) protect the whole ip_vs_lblc_schedule() with
write_lock()ing the lock, 2) add reference counting for the entries, 3)
protect each entry with it's own lock. And all are bad for performance.
Comments? Ideas?
---
net/ipv4/ipvs/ip_vs_lblc.c | 4 +++-
net/ipv4/ipvs/ip_vs_lblcr.c | 4 +++-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/ipvs/ip_vs_lblc.c b/net/ipv4/ipvs/ip_vs_lblc.c
index 0efa3db..65f8414 100644
--- a/net/ipv4/ipvs/ip_vs_lblc.c
+++ b/net/ipv4/ipvs/ip_vs_lblc.c
@@ -144,6 +144,8 @@ ip_vs_lblc_new(__be32 daddr, struct ip_vs_dest *dest)
...Is there a pathological case here if sysctl_ip_vs_lblc_expiration is set to be very short and we happen to hit ip_vs_lblc_full_check()? To be honest I think that I like the reference count approach best, as it seems safe and simple. Is it really going to be horrible for performance? If so, I wonder if a workable solution would be to provide a more fine-grained lock on tbl. Something like the way that ct_read_lock/unlock() works. --
Also possible. But I guess I was thinking too complicated last night. What I was after with the "protect the whole ip_vs_lblc_schedule() with write_lock()ing the lock" was also to simply prevent someone adding duplicate entries. If we just extend the read_lock() region to cover the whole usage of the entry and do an additional duplicate check during inserting the entry under write_lock(), we fix the issue and also fix the race that someone may add duplicate entries. We have a bit overhead, because we unlock/lock and also there is a chance of doing one or more useless __ip_vs_wlc_schedule(), but in the end this should work. More fine-grained locking could help to lower the impact. Sven --
I wondered if this whole thing can ever be totally race condition free,
without changing how destinations are purged from the trash. Initial
version of the patch below. Basically it pushes the locking up into
ip_vs_lblc_schedule() and rearranges the code to be safe that we have a
valid destination.
diff --git a/net/ipv4/ipvs/ip_vs_lblc.c b/net/ipv4/ipvs/ip_vs_lblc.c
index 7a6a319..67f7b04 100644
--- a/net/ipv4/ipvs/ip_vs_lblc.c
+++ b/net/ipv4/ipvs/ip_vs_lblc.c
@@ -123,31 +123,6 @@ static ctl_table vs_vars_table[] = {
static struct ctl_table_header * sysctl_header;
-/*
- * new/free a ip_vs_lblc_entry, which is a mapping of a destionation
- * IP address to a server.
- */
-static inline struct ip_vs_lblc_entry *
-ip_vs_lblc_new(__be32 daddr, struct ip_vs_dest *dest)
-{
- struct ip_vs_lblc_entry *en;
-
- en = kmalloc(sizeof(struct ip_vs_lblc_entry), GFP_ATOMIC);
- if (en == NULL) {
- IP_VS_ERR("ip_vs_lblc_new(): no memory\n");
- return NULL;
- }
-
- INIT_LIST_HEAD(&en->list);
- en->addr = daddr;
-
- atomic_inc(&dest->refcnt);
- en->dest = dest;
-
- return en;
-}
-
-
static inline void ip_vs_lblc_free(struct ip_vs_lblc_entry *en)
{
list_del(&en->list);
@@ -189,10 +164,8 @@ ip_vs_lblc_hash(struct ip_vs_lblc_table *tbl, struct ip_vs_lblc_entry *en)
*/
hash = ip_vs_lblc_hashkey(en->addr);
- write_lock(&tbl->lock);
list_add(&en->list, &tbl->bucket[hash]);
atomic_inc(&tbl->entries);
- write_unlock(&tbl->lock);
return 1;
}
@@ -209,23 +182,54 @@ ip_vs_lblc_get(struct ip_vs_lblc_table *tbl, __be32 addr)
hash = ip_vs_lblc_hashkey(addr);
- read_lock(&tbl->lock);
-
list_for_each_entry(en, &tbl->bucket[hash], list) {
if (en->addr == addr) {
/* HIT */
- read_unlock(&tbl->lock);
return en;
}
}
- read_unlock(&tbl->lock);
-
return NULL;
}
/*
+ * new/free a ip_vs_lblc_entry, which is a mapping of a destionation
+ * IP address to a server.
+ */
+static inline struct ...Hi Sven, I think that I am happy with this, except for updating lastuse under a read lock, is it really safe? Do you still have any concerns? I've annotated the code with my thoughts on how your locking is working Setting en->lastuse with a read lock held makes me feel uncomfortable. At this point we know that dest exists and is not in the trash. We know that dest can't disapear given that its not already in the trash and our caller holds a read lock on __ip_vs_svc_lock. So dest will be safe until after ip_vs_lblc_schedule() returns. Dest seems ok :-) Ok, that seems complex but non-racy to me :-) After the lock is released we may now lose en, but we don't care as its not used again inside ip_vs_lblc_schedule(). This new dest must also not be in the trash by virtue of being returned by __ip_vs_wlc_schedule() and delitions from svc->destinations being protected by __ip_vs_svc_lock, which is held by the caller. Also good :-) BTW, the name of __ip_vs_wlc_schedule, really ought to be changed This write lock looks obviously correct as the destructors --
It looks suspicous, but word-size writes should be atomic by itself. So even doing it under read lock should not lead to corrupted values. We don't care if the values that are set concurrently are overwritten, because they should only differ in a couple of jiffies. Might be good to Yeah, that's the point where we prevent the race with the trash purging. We could change the purging of destinations from the trash to be under __ip_vs_svc_lock write locked, then we know that all destinations, even the ones in the trash are valid. Might make more sense than duplicating --
Ok. I was only worried about the atomicity. I agree that having the --
Looking at the code it doesn't seem that simple. We lock __ip_vs_svc_lock for writing and then wait for all current users of the service to go away. But it's still possible that we clean up destinations from other services. And when we restrict it to only clean up the destinations that match our service, we have a chance of leaving destinations in the trash for good. Sven --
Still thinking about it, I'm against doing it globally. It's too much hassle and in the end we still need to check for IP_VS_DEST_F_AVAILABLE in the schedulers, because destinations in the trash shouldn't be used for further connections. Sven --
