Scalable reader/writer lock.
Its scalable in that the read count is a percpu counter and the reader fast
path does not write to a shared cache-line.
Its not FIFO fair, but starvation proof by alternating readers and writers.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/rwmutex.h | 103 +++++++++++++++++++++++++++++++++++++
kernel/Makefile | 3 -
kernel/rwmutex.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 237 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/rwmutex.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/rwmutex.h 2007-05-11 14:59:09.000000000 +0200
@@ -0,0 +1,103 @@
+/*
+ * Scalable reader/writer lock.
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+#ifndef _LINUX_RWMUTEX_H
+#define _LINUX_RWMUTEX_H
+
+#include <linux/preempt.h>
+#include <linux/wait.h>
+#include <linux/percpu_counter.h>
+#include <linux/lockdep.h>
+#include <linux/mutex.h>
+#include <asm/atomic.h>
+
+struct rw_mutex {
+ /* Read mostly global */
+ struct percpu_counter readers;
+ unsigned int status;
+
+ /* The following variables are only for the slowpath */
+ struct mutex read_mutex; /* r -> w waiting */
+ struct mutex write_mutex; /* w -> w waiting */
+ wait_queue_head_t wait_queue; /* w -> r waiting */
+ atomic_t read_waiters;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+extern void __rw_mutex_init(struct rw_mutex *rw_mutex, const char * name,
+ struct lock_class_key *key);
+extern void rw_mutex_destroy(struct rw_mutex *rw_mutex);
+
+#define rw_mutex_init(rw_mutex) \
+ do { \
+ static struct lock_class_key __key; \
+ __rw_mutex_init((rw_mutex), #rw_mutex, &__key); \
+ } while (0)
+
+extern void ...While this implementation looks pretty nice I really hate growing more and more locking primitives. Do we have any rwsem user that absolutley needs FIFO semantics or could we convert all user over (in which case the objection above is of course completely moot) -
A nice comment describing the overall design and the runtime dynamics and the lock's characteristics would be useful. It should include a prominent and other times you don't use extern. I think it's pretty pointless down_foo(mmap_sem) was previously accessible to non-gpl modules, so the GPL yipes. percpu_counter_sum() is expensive. -
Capable of triggering NMI watchdog on 4096+ processors? -
On Fri, 11 May 2007 10:07:17 -0700 (PDT) Well. That would be a millisecond per cpu which sounds improbable. And we'd need to be calling it under local_irq_save() which we presently don't. And nobody has reported any problems against the existing callsites. But it's no speed demon, that's for sure. -
There is one possible optimization for this I did some time ago. You don't really need to sum all over the possible map, but only all CPUs that were ever online. But this only helps on systems where the possible map is bigger than online map in the common case. But that shouldn't be the case anymore on x86 -- it just used to be. If it's true on some other architectures it might be still worth it. Old patches with an network example for reference appended. -Andi Index: linux-2.6.21-git2-net/include/linux/cpumask.h =================================================================== --- linux-2.6.21-git2-net.orig/include/linux/cpumask.h +++ linux-2.6.21-git2-net/include/linux/cpumask.h @@ -380,6 +380,7 @@ static inline void __cpus_remap(cpumask_ extern cpumask_t cpu_possible_map; extern cpumask_t cpu_online_map; extern cpumask_t cpu_present_map; +extern cpumask_t cpu_everonline_map; #if NR_CPUS > 1 #define num_online_cpus() cpus_weight(cpu_online_map) @@ -388,6 +389,7 @@ extern cpumask_t cpu_present_map; #define cpu_online(cpu) cpu_isset((cpu), cpu_online_map) #define cpu_possible(cpu) cpu_isset((cpu), cpu_possible_map) #define cpu_present(cpu) cpu_isset((cpu), cpu_present_map) +#define cpu_ever_online(cpu) cpu_isset((cpu), cpu_everonline_map) #else #define num_online_cpus() 1 #define num_possible_cpus() 1 @@ -395,6 +397,7 @@ extern cpumask_t cpu_present_map; #define cpu_online(cpu) ((cpu) == 0) #define cpu_possible(cpu) ((cpu) == 0) #define cpu_present(cpu) ((cpu) == 0) +#define cpu_ever_online(cpu) ((cpu) == 0) #endif #ifdef CONFIG_SMP @@ -409,5 +412,6 @@ int __any_online_cpu(const cpumask_t *ma #define for_each_possible_cpu(cpu) for_each_cpu_mask((cpu), cpu_possible_map) #define for_each_online_cpu(cpu) for_each_cpu_mask((cpu), cpu_online_map) #define for_each_present_cpu(cpu) for_each_cpu_mask((cpu), cpu_present_map) +#define for_each_everonline_cpu(cpu) for_each_cpu_mask((cpu), cpu_everonline_map) #endif /* __LINUX_CPUMASK_H ...
hm, yeah. We could put a cpumask in percpu_counter, initialise it to cpu_possible_map. Then, those callsites which have hotplug notifiers can call into new percpu_counter functions which clear and set bits in that cpumask and which drain percpu_counter.counts[cpu] into percpu_counter.count. And percpu_counter_sum() gets taught to do for_each_cpu_mask(fbc->cpumask). -
Perhaps we could have a single cpu hotplug notifier in the percpu_counter library. Add register/deregister functions to the percpu_counter API so that all percpu_counters in the machine can be linked together. One _could_ just register and deregister the counters in percpu_counter_init() and percpu_counter_destroy(), but perhaps that wouldn't suit all callers, dunno. -
On Sat, 12 May 2007 11:06:24 -0700
Like this:
From: Andrew Morton <akpm@linux-foundation.org>
per-cpu counters presently must iterate over all possible CPUs in the
exhaustive percpu_counter_sum().
But it can be much better to only iterate over the presently-online CPUs. To
do this, we must arrange for an offlined CPU's count to be spilled into the
counter's central count.
We can do this for all percpu_counters in the machine by linking them into a
single global list and walking that list at CPU_DEAD time.
(I hope. Might have race windows in which the percpu_counter_sum() count is
inaccurate?)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/percpu_counter.h | 18 ++------
lib/percpu_counter.c | 66 +++++++++++++++++++++++++++++++
2 files changed, 72 insertions(+), 12 deletions(-)
diff -puN lib/percpu_counter.c~percpu_counters-use-cpu-notifiers lib/percpu_counter.c
--- a/lib/percpu_counter.c~percpu_counters-use-cpu-notifiers
+++ a/lib/percpu_counter.c
@@ -3,8 +3,17 @@
*/
#include <linux/percpu_counter.h>
+#include <linux/notifier.h>
+#include <linux/mutex.h>
+#include <linux/init.h>
+#include <linux/cpu.h>
#include <linux/module.h>
+#ifdef CONFIG_HOTPLUG_CPU
+static LIST_HEAD(percpu_counters);
+static DEFINE_MUTEX(percpu_counters_lock);
+#endif
+
void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
{
long count;
@@ -44,3 +53,60 @@ s64 percpu_counter_sum(struct percpu_cou
return ret < 0 ? 0 : ret;
}
EXPORT_SYMBOL(percpu_counter_sum);
+
+void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+{
+ spin_lock_init(&fbc->lock);
+ fbc->count = amount;
+ fbc->counters = alloc_percpu(s32);
+#ifdef CONFIG_HOTPLUG_CPU
+ mutex_lock(&percpu_counters_lock);
+ list_add(&fbc->list, &percpu_counters);
+ mutex_unlock(&percpu_counters_lock);
+#endif
+}
+EXPORT_SYMBOL(percpu_counter_init);
+
+void percpu_counter_destroy(struct percpu_counter ...The question is how do these race windows affect the locking scheme? -
The race to which I refer here is if another CPU is running percpu_counter_sum() in the window between the clearing of the bit in cpu_online_map and the CPU_DEAD callout. Maybe that's too small to care about in the short-term, dunno. Officially we should fix that by taking lock_cpu_hotplug() in percpu_counter_sum(), but I hate that thing. I was thinking of putting a cpumask into the counter. If we do that then there's no race at all: everything happens under fbc->lock. This would be a preferable fix, if we need to fix it. But I'd prefer that freezer-based cpu-hotplug comes along and saves us again. umm, actually, we can fix the race by using CPU_DOWN_PREPARE instead of CPU_DEAD. Because it's OK if percpu_counter_sum() looks at a gone-away CPU's slot. -
This also allows us to de-uglify workqueue.c a little bit, it uses a home-grown cpu_populated_map. Oleg. -
It might be obsolete iff more and more architecture don't use NR_CPUS filled possible_map anymore (haven't checked them all to know if it's true or not) If not there are a couple of more optimizations that can be done, e.g. in networking by converting more code to hotplug notifier. -Andi -
As for workqueue.c, it is not an optimization. It is a documentation. For example, if CPU-hotplug use freezer, we can just do s/cpu_populated_map/cpu_online_map/ workqueue.c has a hotplug notifier, but we can't migrate work_structs currently in a race-free manner. So I vote for your patch in any case. Oleg. -
Yes, storage-wise it is a tad heavy. /me adds documentation pointing to the smp_wmb() in Good question; and while writing up the answer I had for myself I found it wrong. So this might very well be a race where a read lock succeeds concurrently with a writer - bad! Yeah, this could use some more documentation; it wakes _write_unlock() which can wait holding off new writers until at least a single reader Right, and this instance is not strictly needed for correctness. It might be possible to remove the other from the wait_event() loop, if that makes any difference. If we fold the counter when switching to the slow path, and use the shared counter there so it doesn't diverge again. -
This look a bit suspicious, can't mutex_write_lock() set RW_MUTEX_READER_SLOW The same. __rw_mutex_status_set()->wmb() in rw_mutex_write_lock below is not enough. percpu_counter_mod() doesn't take fbc->lock if < FBC_BATCH, so we don't have a proper serialization. write_lock() sets RW_MUTEX_READER_SLOW, finds percpu_counter_sum() != 0, and sleeps. rw_mutex_read_unlock() decrements cpu-local var, does not Looks like we can have only one task on rw_mutex->wait_queue, and it holds ->write_mutex. Can't we use just a "task_struct *write_waiter" instead of ->wait_queue ? This makes rw_mutex smaller. Oleg. -
Yeah, I found that one when Andrew asked me about that preempt_disable()
thing.
How about:
int __rw_mutex_read_trylock(struct rw_mutex *rw_mutex)
{
percpu_counter_inc(&rw_mutex->readers);
if (unlikely(rw_mutex_reader_slow(rw_mutex))) {
percpu_counter_dec(&rw_mutex->readers);
/*
* possibly wake up a writer waiting for this reference to
* disappear
*/
wake_up(&rw_mutex->wait_queue);
return 0;
}
return 1;
Indeed; however with the above having the reverse sequence this has, it
write lock read lock read unlock
a) state = slow 1) readers++ I) readers--
b) wait(readers == 0) 2) if (state == slow) II) if (state == slow)
That looks pretty safe to me; however are you suggesting the
percpu_counter_inc() needs some sort of barrier in order to be reliably
picked up by the percpu_counter_sum()?
something like this:
percpu_counter_{inc,dec}
smp_wmb()
vs
smp_rmb()
Good point; I'll try and figure out how to sleep and wake a single task
without the waitqueue.
-
It has grown a few undocumented barriers again; but I'd like some
feedback on them. /me still hopes some can go.. but these things still
mess my head up.
---
Scalable reader/writer lock.
Its scalable in that the read count is a percpu counter and the reader fast
path does not write to a shared cache-line.
Its not FIFO fair, but starvation proof by alternating readers and writers.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/rwmutex.h | 83 +++++++++++++++
kernel/Makefile | 3
kernel/rwmutex.c | 254 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 339 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/rwmutex.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/rwmutex.h 2007-05-12 13:39:51.000000000 +0200
@@ -0,0 +1,83 @@
+/*
+ * Scalable reader/writer lock.
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+#ifndef _LINUX_RWMUTEX_H
+#define _LINUX_RWMUTEX_H
+
+#include <linux/preempt.h>
+#include <linux/wait.h>
+#include <linux/percpu_counter.h>
+#include <linux/lockdep.h>
+#include <linux/mutex.h>
+#include <asm/atomic.h>
+
+struct rw_mutex {
+ /* Read mostly global */
+ struct percpu_counter readers;
+ unsigned int status;
+
+ /* The following variables are only for the slowpath */
+ struct mutex read_mutex; /* r -> w waiting */
+ struct mutex write_mutex; /* w -> w waiting */
+ struct task_struct *waiter; /* w -> r waiting */
+ atomic_t read_waiters;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+void __rw_mutex_init(struct rw_mutex *rw_mutex, const char * name,
+ struct lock_class_key *key);
+void rw_mutex_destroy(struct rw_mutex *rw_mutex);
+
+#define rw_mutex_init(rw_mutex) \
+ do { \
+ static struct ...I think this is still not right, but when it comes to barriers we need a true expert (Paul cc-ed). this code roughly does (the only reader does unlock) READER WRITER readers = 0; state = 1; wmb(); wmb(); CHECK(state != 0) CHECK(readers == 0) We need to ensure that we can't miss both CHECKs. Either reader should see RW_MUTEX_READER_SLOW, o writer sees "readers == 0" and does not sleep. In that case both barriers should be converted to smp_mb(). There was a _long_ discussion about STORE-MB-LOAD behaviour, and experts seem to believe everething is ok. Another question. Isn't it possible to kill rw_mutex->status ? I have a vague feeling you can change the code so that rw_mutex_reader_slow() <=> "->waiter != NULL" , but I am not sure. Oleg. -
Ah, but note that both those CHECK()s have a rmb(), so that ends up being: READER WRITER readers = 0; state = 1; wmb(); wmb(); rmb(); rmb(); if (state != 0) if (readers == 0) If not now, it might be possible to make it so. Thanks for the suggestion. Peter -
I used to think the same, but this is wrong: wmb+rmb != mb. wmb+rmb doesn't provide LOAD,STORE or STORE,LOAD ordering. for example, LOAD; rmb(); wmb(); STORE; it is still possible that STORE comes before LOAD. At least this is my understanding. Oleg. -
Changes include:
- wmb+rmb != mb
- ->state folded into ->waiter
---
Subject: scalable rw_mutex
Scalable reader/writer lock.
Its scalable in that the read count is a percpu counter and the reader fast
path does not write to a shared cache-line.
Its not FIFO fair, but starvation proof by alternating readers and writers.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/rwmutex.h | 82 ++++++++++++++++
kernel/Makefile | 3
kernel/rwmutex.c | 232 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 316 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/rwmutex.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/rwmutex.h 2007-05-14 10:34:32.000000000 +0200
@@ -0,0 +1,82 @@
+/*
+ * Scalable reader/writer lock.
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+#ifndef _LINUX_RWMUTEX_H
+#define _LINUX_RWMUTEX_H
+
+#include <linux/preempt.h>
+#include <linux/wait.h>
+#include <linux/percpu_counter.h>
+#include <linux/lockdep.h>
+#include <linux/mutex.h>
+#include <asm/atomic.h>
+
+struct rw_mutex {
+ /* Read mostly global */
+ struct percpu_counter readers;
+
+ /* The following variables are only for the slowpath */
+ struct task_struct *waiter; /* w -> r waiting */
+ struct mutex read_mutex; /* r -> w waiting */
+ struct mutex write_mutex; /* w -> w waiting */
+ atomic_t read_waiters;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+void __rw_mutex_init(struct rw_mutex *rw_mutex, const char * name,
+ struct lock_class_key *key);
+void rw_mutex_destroy(struct rw_mutex *rw_mutex);
+
+#define rw_mutex_init(rw_mutex) \
+ do { \
+ static struct lock_class_key __key; \
+ __rw_mutex_init((rw_mutex), #rw_mutex, &__key); \
+ } while ...The store to percpu readers AFAIKS may not become visible until after the
wakeup and therefore after the waiter checks for readers. So I think this
needs a full smp_mb, doesn't it? (you seem to have the barrier in unlock,
Pretty sure you need to be more careful here: the waiter might have left
the locking code and have exitted by this time, no? (ditto for the rest of
Hmm, if you have set rw_mutex->waiter to NULL _after_ waiting for
read_waiters to be decremented below value X, don't you have a starvation
problem?
What I believe you need to do is this:
set_task_state(task_uninterruptible);
rw_mutex->waiter = NULL;
smp_mb();
if (read_waiters >= waiters)
schedule();
-
Hmmm... brlock reincarnates, but as sleeplock. ;-) I believe that there are a few severe problems in this code, search for "!!!" to quickly find the specific areas that concern me. List of races that must be resolved: 1. Read acquire vs. write acquire. rw_mutex_read_lock() invokes __rw_mutex_read_trylock(), if this fails, invokes rw_mutex_read_lock_slow(). __rw_mutex_read_trylock() increments the per-CPU counter, does smp_mb(), picks up ->waiter: if non-NULL decrements the per-CPU counter, does a barrier(), does wake_up_process() on the task fetched from ->waiter. Return failure. Otherwise, return success. rw_mutex_read_lock_slow() increments ->read_waiters, acquires ->read_mutex, increments the ->readers counter, and decrements the ->read_waiters counter. It then fetches ->waiter, and, if non-NULL, wakes up the tasks. Either way, releases ->read_mutex. rw_mutex_write_lock_nested(): acquires ->write_mutex, which prevents any writer-writer races. Acquires ->read_mutex, which does -not- prevent readers from continuing to acquire. Sets ->waiter to current, which -does- (eventually) stop readers. smp_mb(), then invokes rw_mutex_writer_wait() for the sum of the per-CPU counters to go to zero. !In principle, this could be indefinitely postponed, but in practice would require an infinite number of reading tasks, so probably OK to ignore. ;-) This can occur because the readers unconditionally increment their per-CPU counters, and decrement it only later. The smp_mb()s currently in the reader and the writer code forms a Dekker-algorithm-like barrier, preventing both the reader and writer from entering their critical section, as required. 2. Read acquire vs. write release (need to avoid reader sleeping forever, even in the case where no one ever uses the lock again). rw_mutex_read_lock() invokes __rw_mutex_read_trylock(), if this fails, invokes ...
You could short circuit the no readers case for a full brlock like affair. But yeah, the write side will be horribly heavy when you have to It will actually block the task. schedule() will make it sleep when Blame Oleg for this :-) He suggested I not use waitqueues since I only ever have a single It does really sleep. And Nick thinks the suggested wmb + rmb will Yes, this is a real problem, I'll borrow a spinlock from one of the Ah, yeah. Ouch! I missed this detail when I got rid of ->state. I used to clear the Yes, this seems like a very good suggestion, I'll give it a shot. -
Color me confused... :-( Sorry for the noise!!! But there were a couple of places that I thought were correct only I will defer to Oleg on this, although I could have sworn that the My concern would be the possibility that someone else also woke the sleeper up just as we were preparing to do so. This might happen in the case where multiple readers executed this code simultaneously. Seems to me that this might be vulnerable to the following sequence of events: 1. Reader-wannabe #1 wakes up the writer. 2. The writer wakes up and sees that the sum of the per-cpu counters is still non-zero, due to the presence of reader-wannabe #2. 3. Reader-wannabe #2 decrements its counter. Note that reader-wannabe #2 could do all the memory barriers it wanted -- this vulnerability would still exist. 4. Reader-wannabe #2 wakes up the writer, which has no effect, since the writer is already awake (but is perhaps preempted, interrupted, or something). 5. The writer goes to sleep, never to awaken again (unless some other reader comes along, which might or might not happen). Not sure exactly what you are intending to do, so will await your -
Ugh, nasty. Will have to ponder this a bit. It looks like a spinlock is
needed somewhere.
In the mean time, here is the latest code I have:
----
Subject: scalable rw_mutex
Scalable reader/writer lock.
Its scalable in that the read count is a percpu counter and the reader fast
path does not write to a shared cache-line.
Its not FIFO fair, but starvation proof by alternating readers and writers.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/rwmutex.h | 82 ++++++++++++++++
kernel/Makefile | 3
kernel/rwmutex.c | 240 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 324 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/rwmutex.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/rwmutex.h 2007-05-15 09:58:03.000000000 +0200
@@ -0,0 +1,82 @@
+/*
+ * Scalable reader/writer lock.
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+#ifndef _LINUX_RWMUTEX_H
+#define _LINUX_RWMUTEX_H
+
+#include <linux/preempt.h>
+#include <linux/wait.h>
+#include <linux/percpu_counter.h>
+#include <linux/lockdep.h>
+#include <linux/mutex.h>
+#include <asm/atomic.h>
+
+struct rw_mutex {
+ /* Read mostly global */
+ struct percpu_counter readers;
+
+ /* The following variables are only for the slowpath */
+ struct task_struct *waiter; /* w -> r waiting */
+ struct mutex read_mutex; /* r -> w waiting */
+ struct mutex write_mutex; /* w -> w waiting */
+ atomic_t reader_seq;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+void __rw_mutex_init(struct rw_mutex *rw_mutex, const char * name,
+ struct lock_class_key *key);
+void rw_mutex_destroy(struct rw_mutex *rw_mutex);
+
+#define rw_mutex_init(rw_mutex) \
+ do { \
+ static struct lock_class_key ...Ah -- the usual approach is for the writer to hold a spinlock across the initial check, the prepare_to_wait(), and last-minute re-checks to see if blocking is still warranted. Then drop the lock, call schedule(), and call finish_wait(). The reader would hold this same spinlock across both the change that would prevent the writer from sleeping and the wake-up call. Then if the reader shows up just after the writer drops the lock, the reader's wakeup is guaranteed to do something useful. If two readers show up, they cannot run concurrently, so the above situation cannot occur. There are a number of variations on this theme. Easy to get wrong, though -- I have proven that to myself more times than I care to admit... -
