Re: [PATCH 1/2] scalable rw_mutex

Previous thread: [PATCH 0/2] convert mmap_sem to a scalable rw_mutex by Peter Zijlstra on Friday, May 11, 2007 - 6:15 am. (17 messages)

Next thread: [PATCH 2/2] mm: change mmap_sem over to the scalable rw_mutex by Peter Zijlstra on Friday, May 11, 2007 - 6:15 am. (5 messages)
From: Peter Zijlstra
Date: Friday, May 11, 2007 - 6:15 am

Scalable reader/writer lock.

Its scalable in that the read count is a percpu counter and the reader fast
path does not write to a shared cache-line.

Its not FIFO fair, but starvation proof by alternating readers and writers.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rwmutex.h |  103 +++++++++++++++++++++++++++++++++++++
 kernel/Makefile         |    3 -
 kernel/rwmutex.c        |  132 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 237 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/rwmutex.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/rwmutex.h	2007-05-11 14:59:09.000000000 +0200
@@ -0,0 +1,103 @@
+/*
+ * Scalable reader/writer lock.
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+#ifndef _LINUX_RWMUTEX_H
+#define _LINUX_RWMUTEX_H
+
+#include <linux/preempt.h>
+#include <linux/wait.h>
+#include <linux/percpu_counter.h>
+#include <linux/lockdep.h>
+#include <linux/mutex.h>
+#include <asm/atomic.h>
+
+struct rw_mutex {
+	/* Read mostly global */
+	struct percpu_counter	readers;
+	unsigned int		status;
+
+	/* The following variables are only for the slowpath */
+	struct mutex		read_mutex;	/* r -> w waiting */
+	struct mutex		write_mutex;	/* w -> w waiting */
+	wait_queue_head_t	wait_queue;	/* w -> r waiting */
+	atomic_t		read_waiters;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif
+};
+
+extern void __rw_mutex_init(struct rw_mutex *rw_mutex, const char * name,
+		struct lock_class_key *key);
+extern void rw_mutex_destroy(struct rw_mutex *rw_mutex);
+
+#define rw_mutex_init(rw_mutex)					\
+	do {							\
+		static struct lock_class_key __key;		\
+		__rw_mutex_init((rw_mutex), #rw_mutex, &__key);	\
+	} while (0)
+
+extern void ...
From: Christoph Hellwig
Date: Friday, May 11, 2007 - 7:03 am

While this implementation looks pretty nice I really hate growing more
and more locking primitives.  Do we have any rwsem user that absolutley
needs FIFO semantics or could we convert all user over (in which case
the objection above is of course completely moot)

-

From: Andrew Morton
Date: Friday, May 11, 2007 - 9:31 am

A nice comment describing the overall design and the runtime dynamics and
the lock's characteristics would be useful.  It should include a prominent




and other times you don't use extern.  I think it's pretty pointless

down_foo(mmap_sem) was previously accessible to non-gpl modules, so the GPL



yipes.  percpu_counter_sum() is expensive.


-

From: Christoph Lameter
Date: Friday, May 11, 2007 - 10:07 am

Capable of triggering NMI watchdog on 4096+ processors?

-

From: Andrew Morton
Date: Friday, May 11, 2007 - 11:05 am

On Fri, 11 May 2007 10:07:17 -0700 (PDT)

Well.  That would be a millisecond per cpu which sounds improbable.  And
we'd need to be calling it under local_irq_save() which we presently don't.
And nobody has reported any problems against the existing callsites.

But it's no speed demon, that's for sure.
-

From: Andi Kleen
Date: Saturday, May 12, 2007 - 11:55 am

There is one possible optimization for this I did some time ago. You don't really
need to sum all over the possible map, but only all CPUs that were ever 
online. But this only helps on systems where the possible map is bigger
than online map in the common case. But that shouldn't be the case anymore on x86
-- it just used to be. If it's true on some other architectures it might
be still worth it.

Old patches with an network example for reference appended.

-Andi

Index: linux-2.6.21-git2-net/include/linux/cpumask.h
===================================================================
--- linux-2.6.21-git2-net.orig/include/linux/cpumask.h
+++ linux-2.6.21-git2-net/include/linux/cpumask.h
@@ -380,6 +380,7 @@ static inline void __cpus_remap(cpumask_
 extern cpumask_t cpu_possible_map;
 extern cpumask_t cpu_online_map;
 extern cpumask_t cpu_present_map;
+extern cpumask_t cpu_everonline_map;
 
 #if NR_CPUS > 1
 #define num_online_cpus()	cpus_weight(cpu_online_map)
@@ -388,6 +389,7 @@ extern cpumask_t cpu_present_map;
 #define cpu_online(cpu)		cpu_isset((cpu), cpu_online_map)
 #define cpu_possible(cpu)	cpu_isset((cpu), cpu_possible_map)
 #define cpu_present(cpu)	cpu_isset((cpu), cpu_present_map)
+#define cpu_ever_online(cpu)	cpu_isset((cpu), cpu_everonline_map)
 #else
 #define num_online_cpus()	1
 #define num_possible_cpus()	1
@@ -395,6 +397,7 @@ extern cpumask_t cpu_present_map;
 #define cpu_online(cpu)		((cpu) == 0)
 #define cpu_possible(cpu)	((cpu) == 0)
 #define cpu_present(cpu)	((cpu) == 0)
+#define cpu_ever_online(cpu)	((cpu) == 0)
 #endif
 
 #ifdef CONFIG_SMP
@@ -409,5 +412,6 @@ int __any_online_cpu(const cpumask_t *ma
 #define for_each_possible_cpu(cpu)  for_each_cpu_mask((cpu), cpu_possible_map)
 #define for_each_online_cpu(cpu)  for_each_cpu_mask((cpu), cpu_online_map)
 #define for_each_present_cpu(cpu) for_each_cpu_mask((cpu), cpu_present_map)
+#define for_each_everonline_cpu(cpu) for_each_cpu_mask((cpu), cpu_everonline_map)
 
 #endif /* __LINUX_CPUMASK_H ...
From: Andrew Morton
Date: Saturday, May 12, 2007 - 11:06 am

hm, yeah.

We could put a cpumask in percpu_counter, initialise it to
cpu_possible_map.  Then, those callsites which have hotplug notifiers can
call into new percpu_counter functions which clear and set bits in that
cpumask and which drain percpu_counter.counts[cpu] into
percpu_counter.count.

And percpu_counter_sum() gets taught to do for_each_cpu_mask(fbc->cpumask).
-

From: Andrew Morton
Date: Saturday, May 12, 2007 - 11:11 am

Perhaps we could have a single cpu hotplug notifier in the percpu_counter
library.  Add register/deregister functions to the percpu_counter API so
that all percpu_counters in the machine can be linked together.

One _could_ just register and deregister the counters in
percpu_counter_init() and percpu_counter_destroy(), but perhaps that
wouldn't suit all callers, dunno.

-

From: Andrew Morton
Date: Wednesday, May 16, 2007 - 4:28 pm

On Sat, 12 May 2007 11:06:24 -0700

Like this:


From: Andrew Morton <akpm@linux-foundation.org>

per-cpu counters presently must iterate over all possible CPUs in the
exhaustive percpu_counter_sum().

But it can be much better to only iterate over the presently-online CPUs.  To
do this, we must arrange for an offlined CPU's count to be spilled into the
counter's central count.

We can do this for all percpu_counters in the machine by linking them into a
single global list and walking that list at CPU_DEAD time.

(I hope.  Might have race windows in which the percpu_counter_sum() count is
inaccurate?)


Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/percpu_counter.h |   18 ++------
 lib/percpu_counter.c           |   66 +++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 12 deletions(-)

diff -puN lib/percpu_counter.c~percpu_counters-use-cpu-notifiers lib/percpu_counter.c
--- a/lib/percpu_counter.c~percpu_counters-use-cpu-notifiers
+++ a/lib/percpu_counter.c
@@ -3,8 +3,17 @@
  */
 
 #include <linux/percpu_counter.h>
+#include <linux/notifier.h>
+#include <linux/mutex.h>
+#include <linux/init.h>
+#include <linux/cpu.h>
 #include <linux/module.h>
 
+#ifdef CONFIG_HOTPLUG_CPU
+static LIST_HEAD(percpu_counters);
+static DEFINE_MUTEX(percpu_counters_lock);
+#endif
+
 void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
 {
 	long count;
@@ -44,3 +53,60 @@ s64 percpu_counter_sum(struct percpu_cou
 	return ret < 0 ? 0 : ret;
 }
 EXPORT_SYMBOL(percpu_counter_sum);
+
+void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+{
+	spin_lock_init(&fbc->lock);
+	fbc->count = amount;
+	fbc->counters = alloc_percpu(s32);
+#ifdef CONFIG_HOTPLUG_CPU
+	mutex_lock(&percpu_counters_lock);
+	list_add(&fbc->list, &percpu_counters);
+	mutex_unlock(&percpu_counters_lock);
+#endif
+}
+EXPORT_SYMBOL(percpu_counter_init);
+
+void percpu_counter_destroy(struct percpu_counter ...
From: Christoph Lameter
Date: Wednesday, May 16, 2007 - 4:40 pm

The question is how do these race windows affect the locking scheme?
-

From: Andrew Morton
Date: Wednesday, May 16, 2007 - 5:24 pm

The race to which I refer here is if another CPU is running
percpu_counter_sum() in the window between the clearing of the bit in
cpu_online_map and the CPU_DEAD callout.  Maybe that's too small to care
about in the short-term, dunno.

Officially we should fix that by taking lock_cpu_hotplug() in
percpu_counter_sum(), but I hate that thing.

I was thinking of putting a cpumask into the counter.  If we do that then
there's no race at all: everything happens under fbc->lock.  This would be
a preferable fix, if we need to fix it.

But I'd prefer that freezer-based cpu-hotplug comes along and saves us
again.



umm, actually, we can fix the race by using CPU_DOWN_PREPARE instead of
CPU_DEAD.  Because it's OK if percpu_counter_sum() looks at a gone-away
CPU's slot.
-

From: Oleg Nesterov
Date: Saturday, May 12, 2007 - 11:12 am

This also allows us to de-uglify workqueue.c a little bit, it uses
a home-grown cpu_populated_map.

Oleg.

-

From: Andi Kleen
Date: Saturday, May 12, 2007 - 12:21 pm

It might be obsolete iff more and more architecture don't use NR_CPUS filled
possible_map anymore (haven't checked them all to know if it's true or not)

If not there are a couple of more optimizations that can be done, e.g.
in networking by converting more code to hotplug notifier.

-Andi
-

From: Oleg Nesterov
Date: Saturday, May 12, 2007 - 2:42 pm

As for workqueue.c, it is not an optimization. It is a documentation.
For example, if CPU-hotplug use freezer, we can just do

	s/cpu_populated_map/cpu_online_map/

workqueue.c has a hotplug notifier, but we can't migrate work_structs
currently in a race-free manner.

So I vote for your patch in any case.

Oleg.

-

From: Peter Zijlstra
Date: Friday, May 11, 2007 - 10:57 am

Yes, storage-wise it is a tad heavy.



/me adds documentation pointing to the smp_wmb() in

Good question; and while writing up the answer I had for myself I found
it wrong. So this might very well be a race where a read lock succeeds
concurrently with a writer - bad!



Yeah, this could use some more documentation; it wakes _write_unlock()
which can wait holding off new writers until at least a single reader


Right, and this instance is not strictly needed for correctness.

It might be possible to remove the other from the wait_event() loop, if
that makes any difference. If we fold the counter when switching to the
slow path, and use the shared counter there so it doesn't diverge again.

-

From: Oleg Nesterov
Date: Friday, May 11, 2007 - 4:00 pm

This look a bit suspicious, can't mutex_write_lock() set RW_MUTEX_READER_SLOW

The same. __rw_mutex_status_set()->wmb() in rw_mutex_write_lock below
is not enough. percpu_counter_mod() doesn't take fbc->lock if < FBC_BATCH,
so we don't have a proper serialization.

write_lock() sets RW_MUTEX_READER_SLOW, finds percpu_counter_sum() != 0,
and sleeps. rw_mutex_read_unlock() decrements cpu-local var, does not

Looks like we can have only one task on rw_mutex->wait_queue, and it holds
->write_mutex. Can't we use just a "task_struct *write_waiter" instead of
->wait_queue ? This makes rw_mutex smaller.

Oleg.

-

From: Peter Zijlstra
Date: Saturday, May 12, 2007 - 12:39 am

Yeah, I found that one when Andrew asked me about that preempt_disable()
thing.

How about:

int __rw_mutex_read_trylock(struct rw_mutex *rw_mutex)
{
	percpu_counter_inc(&rw_mutex->readers);
	if (unlikely(rw_mutex_reader_slow(rw_mutex))) {
		percpu_counter_dec(&rw_mutex->readers);
		/*
		 * possibly wake up a writer waiting for this reference to
		 * disappear
		 */
		wake_up(&rw_mutex->wait_queue);
		return 0;
	}
	return 1;

Indeed; however with the above having the reverse sequence this has, it


write lock              read lock               read unlock

a) state = slow         1) readers++            I)  readers--
b) wait(readers == 0)   2) if (state == slow)   II) if (state == slow)

That looks pretty safe to me; however are you suggesting the
percpu_counter_inc() needs some sort of barrier in order to be reliably
picked up by the percpu_counter_sum()?

something like this:

percpu_counter_{inc,dec}
smp_wmb()

vs

smp_rmb()

Good point; I'll try and figure out how to sleep and wake a single task
without the waitqueue.

-

From: Peter Zijlstra
Date: Saturday, May 12, 2007 - 6:41 am

It has grown a few undocumented barriers again; but I'd like some
feedback on them. /me still hopes some can go.. but these things still
mess my head up.

---

Scalable reader/writer lock.

Its scalable in that the read count is a percpu counter and the reader fast
path does not write to a shared cache-line.

Its not FIFO fair, but starvation proof by alternating readers and writers.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rwmutex.h |   83 +++++++++++++++
 kernel/Makefile         |    3 
 kernel/rwmutex.c        |  254 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 339 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/rwmutex.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/rwmutex.h	2007-05-12 13:39:51.000000000 +0200
@@ -0,0 +1,83 @@
+/*
+ * Scalable reader/writer lock.
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+#ifndef _LINUX_RWMUTEX_H
+#define _LINUX_RWMUTEX_H
+
+#include <linux/preempt.h>
+#include <linux/wait.h>
+#include <linux/percpu_counter.h>
+#include <linux/lockdep.h>
+#include <linux/mutex.h>
+#include <asm/atomic.h>
+
+struct rw_mutex {
+	/* Read mostly global */
+	struct percpu_counter	readers;
+	unsigned int		status;
+
+	/* The following variables are only for the slowpath */
+	struct mutex		read_mutex;	/* r -> w waiting */
+	struct mutex		write_mutex;	/* w -> w waiting */
+	struct task_struct	*waiter;	/* w -> r waiting */
+	atomic_t		read_waiters;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif
+};
+
+void __rw_mutex_init(struct rw_mutex *rw_mutex, const char * name,
+		struct lock_class_key *key);
+void rw_mutex_destroy(struct rw_mutex *rw_mutex);
+
+#define rw_mutex_init(rw_mutex)					\
+	do {							\
+		static struct ...
From: Oleg Nesterov
Date: Saturday, May 12, 2007 - 9:04 am

I think this is still not right, but when it comes to barriers we
need a true expert (Paul cc-ed).

this code roughly does (the only reader does unlock)

	READER			WRITER

	readers = 0;		state = 1;
	wmb();			wmb();
	CHECK(state != 0)	CHECK(readers == 0)

We need to ensure that we can't miss both CHECKs. Either reader
should see RW_MUTEX_READER_SLOW, o writer sees "readers == 0"
and does not sleep.

In that case both barriers should be converted to smp_mb(). There
was a _long_ discussion about STORE-MB-LOAD behaviour, and experts
seem to believe everething is ok.

Another question. Isn't it possible to kill rw_mutex->status ?

I have a vague feeling you can change the code so that

	rw_mutex_reader_slow() <=> "->waiter != NULL"

, but I am not sure.

Oleg.

-

From: Peter Zijlstra
Date: Saturday, May 12, 2007 - 9:57 am

Ah, but note that both those CHECK()s have a rmb(), so that ends up
being:

	READER				WRITER

	readers = 0;			state = 1;
	wmb();				wmb();

	rmb();				rmb();		
	if (state != 0)			if (readers == 0)


If not now, it might be possible to make it so. Thanks for the
suggestion.

Peter

-

From: Oleg Nesterov
Date: Saturday, May 12, 2007 - 11:03 am

I used to think the same, but this is wrong: wmb+rmb != mb. wmb+rmb
doesn't provide LOAD,STORE or STORE,LOAD ordering.

for example,

	LOAD;
	rmb(); wmb();
	STORE;

it is still possible that STORE comes before LOAD. At least this
is my understanding.

Oleg.

-

From: Peter Zijlstra
Date: Monday, May 14, 2007 - 3:59 am

Changes include:

 - wmb+rmb != mb
 - ->state folded into ->waiter

---
Subject: scalable rw_mutex

Scalable reader/writer lock.

Its scalable in that the read count is a percpu counter and the reader fast
path does not write to a shared cache-line.

Its not FIFO fair, but starvation proof by alternating readers and writers.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rwmutex.h |   82 ++++++++++++++++
 kernel/Makefile         |    3 
 kernel/rwmutex.c        |  232 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 316 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/rwmutex.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/rwmutex.h	2007-05-14 10:34:32.000000000 +0200
@@ -0,0 +1,82 @@
+/*
+ * Scalable reader/writer lock.
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+#ifndef _LINUX_RWMUTEX_H
+#define _LINUX_RWMUTEX_H
+
+#include <linux/preempt.h>
+#include <linux/wait.h>
+#include <linux/percpu_counter.h>
+#include <linux/lockdep.h>
+#include <linux/mutex.h>
+#include <asm/atomic.h>
+
+struct rw_mutex {
+	/* Read mostly global */
+	struct percpu_counter	readers;
+
+	/* The following variables are only for the slowpath */
+	struct task_struct	*waiter;	/* w -> r waiting */
+	struct mutex		read_mutex;	/* r -> w waiting */
+	struct mutex		write_mutex;	/* w -> w waiting */
+	atomic_t		read_waiters;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif
+};
+
+void __rw_mutex_init(struct rw_mutex *rw_mutex, const char * name,
+		struct lock_class_key *key);
+void rw_mutex_destroy(struct rw_mutex *rw_mutex);
+
+#define rw_mutex_init(rw_mutex)					\
+	do {							\
+		static struct lock_class_key __key;		\
+		__rw_mutex_init((rw_mutex), #rw_mutex, &__key);	\
+	} while ...
From: Nick Piggin
Date: Monday, May 14, 2007 - 4:36 am

The store to percpu readers AFAIKS may not become visible until after the
wakeup and therefore after the waiter checks for readers. So I think this
needs a full smp_mb, doesn't it? (you seem to have the barrier in unlock,

Pretty sure you need to be more careful here: the waiter might have left
the locking code and have exitted by this time, no? (ditto for the rest of

Hmm, if you have set rw_mutex->waiter to NULL _after_ waiting for
read_waiters to be decremented below value X, don't you have a starvation
problem?

What I believe you need to do is this:

  set_task_state(task_uninterruptible);
  rw_mutex->waiter = NULL;
  smp_mb();
  if (read_waiters >= waiters)
    schedule();

-

From: Paul E. McKenney
Date: Monday, May 14, 2007 - 5:36 pm

Hmmm...  brlock reincarnates, but as sleeplock.  ;-)

I believe that there are a few severe problems in this code, search
for "!!!" to quickly find the specific areas that concern me.


List of races that must be resolved:

1.	Read acquire vs. write acquire.

	rw_mutex_read_lock() invokes __rw_mutex_read_trylock(), if
		this fails, invokes rw_mutex_read_lock_slow().

		__rw_mutex_read_trylock() increments the per-CPU counter,
			does smp_mb(), picks up ->waiter:
				if non-NULL decrements the per-CPU
				counter, does a barrier(), does
				wake_up_process() on the task fetched
				from ->waiter.  Return failure.

				Otherwise, return success.

		rw_mutex_read_lock_slow() increments ->read_waiters,
			acquires ->read_mutex, increments the ->readers
			counter, and decrements the ->read_waiters
			counter.  It then fetches ->waiter, and, if
			non-NULL, wakes up the tasks.
			Either way, releases ->read_mutex.

	rw_mutex_write_lock_nested(): acquires ->write_mutex, which
		prevents any writer-writer races.  Acquires ->read_mutex,
		which does -not- prevent readers from continuing to
		acquire.  Sets ->waiter to current, which -does-
		(eventually) stop readers.  smp_mb(), then invokes
		rw_mutex_writer_wait() for the sum of the per-CPU
		counters to go to zero.

		!In principle, this could be indefinitely postponed,
		but in practice would require an infinite number of
		reading tasks, so probably OK to ignore.  ;-)
		This can occur because the readers unconditionally
		increment their per-CPU counters, and decrement it
		only later.

	The smp_mb()s currently in the reader and the writer code
	forms a Dekker-algorithm-like barrier, preventing both the
	reader and writer from entering their critical section, as
	required.

2.	Read acquire vs. write release (need to avoid reader sleeping
	forever, even in the case where no one ever uses the lock again).

	rw_mutex_read_lock() invokes __rw_mutex_read_trylock(), if
		this fails, invokes ...
From: Peter Zijlstra
Date: Tuesday, May 15, 2007 - 12:43 am

You could short circuit the no readers case for a full brlock like
affair. But yeah, the write side will be horribly heavy when you have to

It will actually block the task. schedule() will make it sleep when

Blame Oleg for this :-)
He suggested I not use waitqueues since I only ever have a single

It does really sleep. And Nick thinks the suggested wmb + rmb will

Yes, this is a real problem, I'll borrow a spinlock from one of the


Ah, yeah. Ouch!

I missed this detail when I got rid of ->state. I used to clear the

Yes, this seems like a very good suggestion, I'll give it a shot.

-

From: Paul E. McKenney
Date: Tuesday, May 15, 2007 - 8:29 am

Color me confused...  :-(

Sorry for the noise!!!

But there were a couple of places that I thought were correct only

I will defer to Oleg on this, although I could have sworn that the

My concern would be the possibility that someone else also woke the
sleeper up just as we were preparing to do so.  This might happen in
the case where multiple readers executed this code simultaneously.
Seems to me that this might be vulnerable to the following sequence
of events:

1.	Reader-wannabe #1 wakes up the writer.

2.	The writer wakes up and sees that the sum of the per-cpu
	counters is still non-zero, due to the presence of
	reader-wannabe #2.

3.	Reader-wannabe #2 decrements its counter.  Note that
	reader-wannabe #2 could do all the memory barriers it
	wanted -- this vulnerability would still exist.

4.	Reader-wannabe #2 wakes up the writer, which has no effect,
	since the writer is already awake (but is perhaps preempted,
	interrupted, or something).

5.	The writer goes to sleep, never to awaken again (unless some
	other reader comes along, which might or might not happen).


Not sure exactly what you are intending to do, so will await your


-

From: Peter Zijlstra
Date: Tuesday, May 15, 2007 - 9:17 am

Ugh, nasty. Will have to ponder this a bit. It looks like a spinlock is
needed somewhere.

In the mean time, here is the latest code I have:

----
Subject: scalable rw_mutex

Scalable reader/writer lock.

Its scalable in that the read count is a percpu counter and the reader fast
path does not write to a shared cache-line.

Its not FIFO fair, but starvation proof by alternating readers and writers.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/rwmutex.h |   82 ++++++++++++++++
 kernel/Makefile         |    3 
 kernel/rwmutex.c        |  240 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 324 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/rwmutex.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/rwmutex.h	2007-05-15 09:58:03.000000000 +0200
@@ -0,0 +1,82 @@
+/*
+ * Scalable reader/writer lock.
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+#ifndef _LINUX_RWMUTEX_H
+#define _LINUX_RWMUTEX_H
+
+#include <linux/preempt.h>
+#include <linux/wait.h>
+#include <linux/percpu_counter.h>
+#include <linux/lockdep.h>
+#include <linux/mutex.h>
+#include <asm/atomic.h>
+
+struct rw_mutex {
+	/* Read mostly global */
+	struct percpu_counter	readers;
+
+	/* The following variables are only for the slowpath */
+	struct task_struct	*waiter;	/* w -> r waiting */
+	struct mutex		read_mutex;	/* r -> w waiting */
+	struct mutex		write_mutex;	/* w -> w waiting */
+	atomic_t		reader_seq;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif
+};
+
+void __rw_mutex_init(struct rw_mutex *rw_mutex, const char * name,
+		struct lock_class_key *key);
+void rw_mutex_destroy(struct rw_mutex *rw_mutex);
+
+#define rw_mutex_init(rw_mutex)					\
+	do {							\
+		static struct lock_class_key ...
From: Paul E. McKenney
Date: Tuesday, May 15, 2007 - 11:52 am

Ah -- the usual approach is for the writer to hold a spinlock across
the initial check, the prepare_to_wait(), and last-minute re-checks to
see if blocking is still warranted. Then drop the lock, call schedule(),
and call finish_wait().

The reader would hold this same spinlock across both the change that
would prevent the writer from sleeping and the wake-up call.

Then if the reader shows up just after the writer drops the lock, the
reader's wakeup is guaranteed to do something useful.  If two readers show
up, they cannot run concurrently, so the above situation cannot occur.

There are a number of variations on this theme.  Easy to get wrong,
though -- I have proven that to myself more times than I care to admit...

-

Previous thread: [PATCH 0/2] convert mmap_sem to a scalable rw_mutex by Peter Zijlstra on Friday, May 11, 2007 - 6:15 am. (17 messages)

Next thread: [PATCH 2/2] mm: change mmap_sem over to the scalable rw_mutex by Peter Zijlstra on Friday, May 11, 2007 - 6:15 am. (5 messages)