Re: [RFC PATCH] Writer-biased low-latency rwlock v8

Previous thread: [PATCH] Move sysctl check into debugging section and don't make it default y by Andi Kleen on Saturday, August 16, 2008 - 1:53 am. (7 messages)

Next thread: [PATCH] kexec: fix up KEXEC_CONTROL_CODE_SIZE missed during conversion by Paul Collins on Saturday, August 16, 2008 - 4:55 am. (1 message)
To: H. Peter Anvin <hpa@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Saturday, August 16, 2008 - 3:39 am

x86_64 add/sub atomic ops does not seems to accept integer values bigger
than 32 bits as immediates. Intel's add/sub documentation specifies they
have to be passed as registers.

http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Machine-Constraints.html#Mac...

states :

e
32-bit signed integer constant, or a symbolic reference known to fit that range (for immediate operands in sign-extending x86-64 instructions).
Z
32-bit unsigned integer constant, or a symbolic reference known to fit that range (for immediate operands in zero-extending x86-64 instructions).

Since add/sub does sign extension, using the "e" constraint seems appropriate.

It applies to 2.6.27-rc, 2.6.26, 2.6.25...

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Joe Perches <joe@perches.com>
---
include/asm-x86/atomic_64.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/atomic_64.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/atomic_64.h 2008-08-16 03:19:35.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/atomic_64.h 2008-08-16 03:21:18.000000000 -0400
@@ -228,7 +228,7 @@ static inline void atomic64_add(long i,
{
asm volatile(LOCK_PREFIX "addq %1,%0"
: "=m" (v->counter)
- : "ir" (i), "m" (v->counter));
+ : "er" (i), "m" (v->counter));
}

/**
@@ -242,7 +242,7 @@ static inline void atomic64_sub(long i,
{
asm volatile(LOCK_PREFIX "subq %1,%0"
: "=m" (v->counter)
- : "ir" (i), "m" (v->counter));
+ : "er" (i), "m" (v->counter));
}

/**
@@ -260,7 +260,7 @@ static inline int atomic64_sub_and_test(

asm volatile(LOCK_PREFIX "s...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Saturday, August 16, 2008 - 11:04 am

This is correct; this is in fact true for all instructions except "mov".

Whether it's sign- or zero-extending is sometimes subtle, but not in
these cases.

Do you happen to know if this is a manifest bug in the current kernel
(i.e. if there is anywhere we're using more than ±2 GB as a constant to
these functions?)

Either way, I'll queue this up to tip:x86/urgent if Ingo hasn't already
since this is a pure bug fix.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Saturday, August 16, 2008 - 11:43 am

No, I did not hit this on current kernel code and the effect is quite
esasy to detect : the assembler spits an error.

I have hit this problem when tying to implement a better rwlock design
than is currently in the mainline kernel (I know the RT kernel has a
hard time with rwlocks), and had to play with add/sub of large values.
The idea is to bring down the interrupt latency caused by rwlocks shared
between fast read-side interrupt handlers and slow thread context
read-sides (tasklist_lock is the perfect example). In that case, the
worse case interrupt latency is caused by the irq-disabled writer lock
when contended by the slow readers. I will probably post a RFC about
this in a near future.

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Saturday, August 16, 2008 - 1:30 pm

Have you looked at my sleping rwlock trial thing?

It's very different from a spinning one, but I think the fast path should
be identical, and that's the one I tried to make fairly optimal.

See

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary

for a git tree. The sleeping version has two extra words for the sleep
events, but those would be irrelevant for the spinning version.

The fastpath is

movl $4,%eax
lock ; xaddl %eax,(%rdi)
testl $3,%eax
jne __my_rwlock_rdlock

for the read-lock (the two low bits are contention bits, so you can make
contention have any behaviour you want - including fairish, prefer-reads,
or prefer-writes).

The write fastpath is

xorl %eax,%eax
movl $1,%edx
lock ; cmpxchgl %edx,(%rdi)
jne __my_rwlock_wrlock

and the "unlock" case is actually unnecessarily complex in my
implementation, because it needs to

- wake things up in case of a conflict (not true of a spinning version,
of course)
- it's pthreads-compatible, so the same function needs to handle both a
read-unlock and a write-unlock.

but a spinning version should be much simpler.

Anyway, I haven't tried turning it into a spinning version, but it was
very much designed to

- work with both 32-bit and 64-bit x86 by making the fastpath only do
32-bit locked accesses
- have any number of pending readers/writers (which is not a big deal for
a spinning one, but at least there are no CPU count overflows).
- and because it is designed for sleeping, I'm pretty sure that you can
easily drop interrupts in the contention path, to make
write_lock_irq[save]() be reasonable.

In particular, the third bullet is the important one: because it's
designed to have a "contention" path that has _extra_ information for the
contended case, you could literally make the extra information have things
like a list of pending writers, so that you can drop interrupts on one
CPU, while you adding information to let the reader side ...

To: Linus Torvalds <torvalds@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Saturday, August 16, 2008 - 5:19 pm

No, I just had a look at it, thanks for the pointer!

Tweakable contention behavior seems interesting, but I don't think it
deals with the fact that on a mainline kernel, when an interrupt handler
comes in and asks for a read lock, it has to get it on the spot. (RT
kernels can get away with that using threaded threaded interrupts, but
that's a completely different scheme). Therefore, the impact is that
interrupts must be disabled around write lock usage, and we end up in
the situation where this interrupt disable section can last for a long
time, given it waits for every readers (including ones which does not
disable interrupts nor softirqs) to complete.

Actually, I just used LTTng traces and eventually made a small patch to
lockdep to detect whenever a spinlock or a rwlock is used both with
interrupts enabled and disabled. Those sites are likely to produce very
high latencies and should IMHO be considered as bogus. The basic bogus
scenario is to have a spinlock held on CPU A with interrupts enabled
being interrupted and then a softirq runs. On CPU B, the same lock is
acquired with interrupts off. We therefore disable interrupts on CPU B
for the duration of the softirq currently running on the CPU A, which is
really not something that helps keeping short latencies. My preliminary
results shows that there are a lot of inconsistent spinlock/rwlock irq
on/off uses in the kernel.

This kind of scenario is pretty easy to fix for spinlocks (either move
the interrupt disable within the spinlock section if the spinlock is
never used by an interrupt handler or make sure that every users has
interrupts disabled).

The problem comes with rwlocks : it is correct to have readers both with
and without irq disable, even when interrupt handlers use the read lock.
However, the write lock has to disable interrupt in that case, and we
suffer from the high latency I pointed out. The tasklist_lock is the
perfect example of this. In the following patch, I try to address this
issue.

The core idea...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Saturday, August 16, 2008 - 5:33 pm

Right. Which is exactly why I'd suggest using the extra space for saying
"this CPU is busy-looping waiting for a write lock", and then the
read-lock contention case can say

- ok, there's a pending write lock holder on _my_ CPU, so I need to just
succeed right away despite the fact that there is contention.

In other words, there are a few cases:

- you actually *got* the write lock

interrupts obviously have to be disabled here, because any reader can
simply not get the lock.

- you are waiting to get the write lock.

Mark this CPU as "pending" in the rwlock (by making the extended thing
be a queue, or by simply only ever allowing a single pendign CPU), and
re-enable interrupts while you wait for it. An interrupt that comes in
and wants a read-lock sees that it's pending, so it should then ignore
the contention bit, and only wait for the write bit to go away.

See? This way you only need to actually disable interrupts while holding
literally the lock, not while waiting for it. While still giving priority
to writers (except for the _one_ CPU, where new readers will have to get
through).

So this way you can be fair, and not allow readers to starve a writer. The
only reader that is allowed past a waiting writer is the reader on that
same CPU.

And notice how the fast-path needs no spinlock or anything else - it's
still just a single locked instruction. In comparison, if I read your
example code right, it is absolutely horrid and has an extra spinlock
access for the fair_write_lock case.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Sunday, August 17, 2008 - 3:53 am

Hi Linus,

Using a writer subscription to the rwlock to make sure the readers stop taking
the lock when a writer is waiting for it is indeed a good way to insure
fairness. I just fixed the issues you pointed in my patch (spinlock removed,
added a write fastpath with a single atomic op) and also used the subscription
idea to get fairness for writers. Contention delay tests shows that fairness is
achieved pretty well. For periodical writers, busy looping readers and
periodical interrupt readers :

6 thread readers (busy-looping)
3 thread writers (1ms period)
2 periodical interrupt readers on 7/8 cpus (IPIs).

- 21us max. contention for writer
- 154us max. contention for thread readers
- 16us max. contention for interrupt readers

(benchmark details below)

I still perceive two potential problems with your approach. It's not related to
fairness between readers and writers, but more on the effect on interrupt
latency on the system. This is actually what my patch try to address.

First there is a long interrupt handler scenario, as bad as disabling interrupts
for a long time period, which is not fixed by your solution :

CPU A
thread context, takes the read lock, gets interrupted, softirq runs.

CPU B
thread context, subscribes for the writer lock, busy loops waiting for CPU A.
(in your solution, interrupts are enabled here so interrupt handlers on CPU B
can come in)

CPU C
interrupt context, takes the read lock, contended because CPU B busy loops
waiting for the writer lock.

-> CPU C will have an interrupt handler running for the duration of the softirq
on CPU A, which will impact interrupt latency on CPU C.

Second, I tend to think this it might be difficult to atomically set the writer
state and disable interrupts, unless we disable interrupts, cmpxchg the writer
test, reenable interrupts if it fails in a loop, which does not seem very neat.

Fair low-latency rwlock v3

Changelog since v2 :
Add writer fairness in addition to fairness wrt interrupts and soft...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Sunday, August 17, 2008 - 12:17 pm

So first off, you should call this "trylock", since it doesn't necessarily
get the lock at all. It has nothing to do with fast.

This is actually potentially very slow.

Why? If the lock is uncontended, but is not in the current CPU's caches,
the read -> rmw operation generates multiple cache coherency protocol
events. First it gets the line in shared mode (for the read), and then
later it turns it into exclusive mode.

So if it's likely that the value is zero (or even if it's just the only
case we really care about), then you really should do the

atomic_long_cmpxchg(&rwlock->value, 0, newvalue);

thing as the _first_ access to the lock.

Yeah, yeah, that means that you need to do the local_bh_disable etc first
too, and undo it if it fails, but the failure path should be the unusual
one.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, Paul E. McKenney <paulmck@...>, <linux-kernel@...>
Date: Sunday, August 17, 2008 - 3:10 pm

Ah, you are right. This new version implements a "single cmpxchg"
uncontended code path, changes the _fast semantic for "trylock
uncontended" and also adds the trylock primitives.

Thanks for the input,

Mathieu

Fair low-latency rwlock v5

Fair low-latency rwlock provides fairness for writers against reader threads,
but lets softirq and irq readers in even when there are subscribed writers
waiting for the lock. Writers could starve reader threads.

Changelog since v4 :
rename _fast -> trylock uncontended
Remove the read from the uncontended read and write cases : just do a cmpxchg
expecting "sub_expect", 0 for "lock", 1 subscriber for a "trylock".
Use the value returned by the first cmpxchg as following value instead of doing
an unnecessary read.

Here is why read + rmw is wrong (quoting Linus) :

"This is actually potentially very slow.

Why? If the lock is uncontended, but is not in the current CPU's caches,
the read -> rmw operation generates multiple cache coherency protocol
events. First it gets the line in shared mode (for the read), and then
later it turns it into exclusive mode."

Changelog since v3 :
Added writer subscription and fair trylock. The writer subscription keeps the
reader threads out of the lock.

Changelog since v2 :
Add writer fairness in addition to fairness wrt interrupts and softirqs.
Added contention delays performance tests for thread and interrupt contexts to
changelog.

Changelog since v1 :
- No more spinlock to protect against concurrent writes, it is done within
the existing atomic variable.
- Write fastpath with a single atomic op.

No, I just had a look at it, thanks for the pointer!

Tweakable contention behavior seems interesting, but I don't think it
deals with the fact that on a mainline kernel, when an interrupt handler
comes in and asks for a read lock, it has to get it on the spot. (RT
kernels can get away with that using threaded threaded interrupts, but
that's a completely different scheme). Therefore, the ...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, Paul E. McKenney <paulmck@...>, <linux-kernel@...>
Date: Monday, August 25, 2008 - 3:20 pm

New lock types should come with lockdep support.

Also, if you're serious about -rt, you'll need to consider full PI.

--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Monday, August 18, 2008 - 7:25 pm

Interesting approach! Questions and comments interspersed below.

OK... 64 bits less three for the writer bits is 61 bits, divided by
four gives us 15 bits, which limits us to 32,768 CPUs (16,384 if we need
a guard bit separating the fields). This should be enough for the next
few years.

However, if critical sections are to be preemptable (as they need to be
in -rt, then this limit applies to the number of -tasks- rather than the
number of CPUs. 32,768 is IMHO an uncomfortably low ceiling for the
number of tasks in a system, given that my laptop supports several
thousand without breaking a sweat.

So, how about combining the softirq and hardirq bits, giving the range 0
.. 2*log_2(NR_CPUS)-1 over to tasks? This would give 30 bits, allowing
more than 10^9 tasks, which should be enough for the next few years.
Better yet, why not begin the bit allotment at the high end of the word,
allowing whatever is left over at the bottom for tasks?

Alternatively, "hardirq" and "softirq" runs in process context in -rt,
so the three read-side bitfields should be able to be combined into one

Suggest inlining this one, as it is just a wrapper. (Or is gcc smart

I don't understand the need for this smp_rmb() -- what other location
are we maintaining order with? The CPU must maintain order of accesses
to a single variable (cache coherence requires this).

I suggest replacing the smp_rmb() with a barrier(). Not that optimizing

Why not use fair_read_trylock_uncontended()? Or equivalently, why not

Why don't we use fair_write_subscribe() here? Ah, because we have
already disabled preemption...

OK, I'll bite... Why don't we also have to wait for the softirq and
hardirq readers? I would expect THREAD_RMASK|SOFTIRQ_RMASK|HARDIRQ_RMASK
rather than just THREAD_RMASK above.

Never mind, I finally see the following pair of statements.

So, the idea is to allow recursive readers in irq when we also have
readers in the interrupted process context, right? But isn't the
common...

To: Paul E. McKenney <paulmck@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>
Date: Tuesday, August 19, 2008 - 2:04 am

Agreed. I am careful about overdesigning for RT though, given that once
we get it right for the current thread vs softirq vs hardirq mess, the

Technical detail : x86_64 cannot pass immediate values > 2^32 to most
instructions; it must load an immediate value in a register, which would
add a few instructions. Therefore, I prefer to use the lowest bits

Actually, given that softirqs and hardirqs are really just tasks in -RT,
I guess threads could be given the whole 45 bits available for threads,
softirqs and irqs (0 .. 3*log_2(NR_CPUS)-1).

However, in that case, we would also like to have as much preempable
writer critical sections as we have reader critical sections. Therefore,
it would be best to split the 60 bits in two :

30 bits for reader threads
30 bits for writer threads

However, I wonder if the whole algorithm included in this rwlock
primitive to make sure that interrupt latency < softirq latency < thread
latency won't just vanish in -rt kernels. We might want to group those
threads by priority (e.g. grouping irq threads together and maybe
leaving softirq and normal threads in the same bunch), but we would then
impact softirq latency a little bit more.

The problem of this approach wrt RT kernels is that we cannot provide
enough "priority groups" (current irq, softirq and threads in mainline
kernel) for all the subtile priority levels of RT kernels. The more
groups we add, the less threads we allow on the system.

So basically, the relationship between the max number of threads (T) and
the number of reader priorities goes as follow on a 64 bits machine :

T writers subscribed count bits
1 bit for writer mutex

for first priority group :
T reader count bits
(no need of reader exclusion bit because the writer subscribed count
bits and the writer mutex act as exclusion)

for each other priority group :
T reader count bits
1 reader exclusion bit (set by the writer)

We have the inequality :

64 >= (T + 1) + T + (NR_PRIORITIES - 1) * (T + 1)

6...

To: Paul E. McKenney <paulmck@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Tuesday, August 19, 2008 - 3:33 am

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:

It strikes me that Intel has a nice (probably slow?) cmpxchg16b
instruction on x86_64. Therefore, we could atomically update 128 bits,
which gives the following table :

((127 - 2T) / (T + 1)) + 1 >= NR_PRIORITIES

Thread bits | Max number of threads | Number of priorities
63 | 2^63 | 1
42 | 2^42 | 2
31 | 2^31 | 3
24 | 2^24 | 4
20 | 2^20 | 5
17 | 131072 | 6
15 | 32768 | 7
13 | 8192 | 8
11 | 2048 | 9
10 | 1024 | 10
9 | 512 | 11
8 | 256 | 13
7 | 128 | 15
6 | 64 | 17
5 | 32 | 20
4 | 16 | 24

.. where we have more priorities than threads.

So I wonder if having in the surrounding of 10 priorities, which could
dynamically adapt the number of threads to the number of priorities
available, could be interesting for the RT kernel ?

That would however depend on the very architecture-specific cmpxchg16b.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Paul E. McKenney <paulmck@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Tuesday, August 19, 2008 - 12:48 pm

Stop this crapola.

Take a look at the rwsem thing I pointed you to. After you understand
that, come back.

The WHOLE POINT of that thing was to use only 32-bit atomics on the hot
path. Don't even start think9ing about cmpxchg16b. If you cannot do your
atomics in 32-bit, they're broken.

Please. I realize that the rwsem implementation I did is subtle. But
really. Spend the time to understand it.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Paul E. McKenney <paulmck@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Thursday, August 21, 2008 - 4:50 pm

You are right, before posting my new version, let's review your rwsem
implementation. Let's pseudo-code the fastpath in C (I don't like to
review the instruction set when thinking conceptually). Comments
inlined. My new patch follows.

* Read lock fast path

my_rwlock_rdlock :
int v = atomic_add_return(0x4, &lock);
if (v & 0x3)
__my_rwlock_rdlock(lock);
return

Ok, so the reader just say "hey, I'm there" by adding itself to the
reader could. Unlikely to overflow, but could. Then checks the writer
mask, if set, call the read slowpath. In my solution, I do basically the
same as the writer fast path you did : I expect all bits to be 0.
Therefore, if there are many readers at once, I use the slow path even
though there is not "real" contention on the lock. Given that I need to
limit the number of readers to a smaller amount of bits, I need to
detect overflow before-the-fact (this is my new v8 implementation, will
be posted shortly). BTW, you could accelerate the slow path by passing
"v" :

my_rwlock_rdlock :
int v = atomic_add_return(0x4, &lock);
if (v & 0x3)
__my_rwlock_rdlock(lock, v);
return;

* Write lock fast path

my_rwlock_wrlock :
if (cmpcxhg(&lock, 0, 0x1) != 0)
__my_rwlock_wrlock(lock);
return;

Writer expects all bits to be 0 (lock uncontended). If true, take the
lock by setting 0x1, else take the slow path. Yes, that's more or less
what I have too. Here too, you could accelerate the write slow path by
passing v.

my_rwlock_wrlock :
int v = cmpcxhg(&lock, 0, 0x1);
if (v)
__my_rwlock_wrlock(lock, v);
return;

* Read and write unlock fastpath

my_rwlock_unlock :
int v = atomic_read(&lock);
if (!(v & 0x1)) {
/* unlock reader */
v = atomic_sub_return(0x4, &lock);
if (v & 0x3)
__my_rwlock_rdunlock(lock);
return;
} else {
/* unlock writer */
if (!(v == 0x1))
__my_rwlock_wrunlock(lock);
if (cmpxchg(&lock, v, 0) != v)
__my_rwlock_wrunlock(lock);
return;
}

Hrm, wh...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Paul E. McKenney <paulmck@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Thursday, August 21, 2008 - 5:00 pm

I stopped reading right there.

Why?

Because that is already crap.

Go look at my code once more. Go look at how it has 128 bits of data,
exactly so that it DOES NOT HAVE TO LIMIT THE NUMBER OF READERS.

And then go look at it again.

Look at it five times, and until you can understand that it still uses
just a 32-bit word for the fast-path and no unnecessarily crap in it, but
it actually has 128 bits of data for all the slow paths, don't bother
emailing me any new versions.

Please. You -still- apparently haven't looked at it, at least not enough
to understand the _point_ of it. You still go on about trying to fit in
three or four different numbers in that one word. Even though the whole
point of my rwlock is that you need exactly _one_ count (active writers),
and _one_ bit (active reader) and _one_ extra bit ("contention, go to slow
path, look at the other bits ONLY IN THE SLOW PATH!")

That leaves 30 bits for readers. If you still think you need to "limit the
number of readers", then you aren't getting it.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Paul E. McKenney <paulmck@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Thursday, August 21, 2008 - 5:26 pm

First of all, let me say I don't pretend to understand formally how you
deal with overflow-after-the-fact, as unlikely as it is.

However, it seems to me to be an easy way to avoid it. Simply by
changing the read-test mask to $0x80000003, you will kick the code down
the slow path once the read counter reaches $0x80000004 (2^29+1
readers), where you can do any necessary fixup -- or BUG() -- at leisure.

This fastpath ends up being identical in size and performance to the one
you posted, although yours could be reduced by changing the test to a
testb instruction -- at the almost certainly unacceptable expense of
taking a partial-register stall on the CPUs that have those.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Paul E. McKenney <paulmck@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Thursday, August 21, 2008 - 5:41 pm

Just make sure it can't overflow. With spinlocks, you are guaranteed that
you won't have more than NR_CPU's thing, so 20 bits is pretty safe. 30

Sure, you could do things like that, but that sounds like a separate

Well, you could just change the "testl $3,%eax" into an "andl $3,%eax",
and it will be two bytes shorter with no partial register stall.

I forgot that "testl" doesn't have the byte immediate version.

Linus
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Paul E. McKenney <paulmck@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Thursday, August 21, 2008 - 5:15 pm

Side note: the actual main rwlock thing is designed for a 64-bit word and
the waiters separately as two 32-bit words, so it doesn't really do what I
describe, but that's actually because the whole sleeping thing is _harder_
than a spinning thing, and has races with wakeups etc.

A spinning thing, in contrast, is pretty trivial.

So here's what I think your code should be like:

rdlock:
movl $4,%eax
lock ; xaddl %eax,(%rdi)
testl $3,%eax
jne __rdlock_slowpath
ret

rwlock:
xorl %eax,%eax
movl $1,%edx
lock ; cmpxchgl %edx,(%rdi)
jne __rwlock_slowpath
ret

rdunlock:
lock ; subl $4,(%rdi)
ret

rwunlock:
lock ; andl $~1,(%rdi)
ret

and I'm pretty damn sure that that should be totally sufficient for a
spinning rwlock. The non-spinning one is more complex just because the
unlock paths need to guarantee that something gets woken up, that just
isn't an issue when you do spinlocks.

Now, in the slow-path:
- on the rwlock slowpath side, set bit#1 to make sure that readers get
caught in the slowpath
- then do a *separate* count of how many pending readers and writers
(ie the ones that got caught into the slowpath) you have (one word
each is probably fine), and then the slowpaths can just do the right
thing depending on whether there are pending readers/writers.

See? The main lock needs not worry about number of writers AT ALL, because
it's totally irrelevant. So don't worry about running out of bits. You
won't. Just put those counts somewhere else! The only thing that matters
for the main lock word is whether there are active readers (30 bits), and
whether there is an active writer (there can only ever be one: 1 bit), and
whether new readers should be trapped (1 bit).

If you worry about overflows, you're doing something wrong.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Paul E. McKenney <paulmck@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Saturday, August 23, 2008 - 1:09 am

Hi Linus,

After having had a look 15 times more at your rwlock code, and again at
your answer, in loops, I come to the conclusion that I made a terrible
job at explaining the core idea beneath my rwlock design. So let's try
to make it more obvious. I am going to keep this explanation short.

First, the goal : to kill latency induced by rwlocks.

Now, let me explain why I need at least not one, but _four different_
contention bits. Simply because there are four types of contention, one
for each execution context which may take the read lock. (irq, softirq,
non-preemptable, preemptable)

So let's suppose I share the rwlock between non-preemptable writers,
non-preemptable readers and interrupt context readers. The idea is this:
If I want to take the write lock, I'll have to protect against both
non-preemptable and interrupt readers (and also against other writers).

A way to do that without inducing high interrupt latency with current
rwlocks would be for the writer to take _two_ locks, not just one. The
first excludes preemptable readers and the second excludes irqs. So we
have :

For the writer :

preempt_disable();
write_lock(&preempt_rwlock);

local_irq_disable();
write_lock(&irq_rwlock);

/* Write access */

write_unlock(&irq_rwlock);
local_irq_enable();

write_unlock(&preempt_rwlock);
preempt_enable();

And the readers :

Interrupt :
read_lock(&irq_rwlock);
...
read_unlock(&irq_rwlock);

non-preemptable :
read_lock(&preempt_rwlock);
...
read_unlock(&preempt_rwlock);

If you still wonder why I need to take two locks, thus elevating the
context priority by exluding one context at a time, we can go to this
example showing why current rwlock use in the kernel produces such high
latencies. The current use is :

For the writer, we exclude all the readers in one go :

local_irq_disable();
write_lock(&big_lock_against_all_readers);

/* Write access */

write_unlock(&big_lock_against_all_readers);
local_irq_ena...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Paul E. McKenney <paulmck@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Saturday, August 23, 2008 - 2:02 pm

No. You need _one_ contention bit in the fast-path.

Then, as you get into the slow-path, you can decide on four different
behaviours.

Quite frankly, I don't think this discussion is going anywhere. I don't
think I'd take anything from you, since you seem to have a really hard
time separating out the issue of fast-path and slow-path. So I'm simply
not going to bother, and I'm not going to expect to merge your work.

Sorry, but it simply isn't worth my time or effort.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Paul E. McKenney <paulmck@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Saturday, August 23, 2008 - 4:30 pm

Hi,

OK, I now see how I can apply this contention bit idea to my algo.
Actually, the point I just fixed in my head is that this bit will be a
"MAY_CONTEND" bit, which could let higher priority readers access the
lock in the slow path. Only one fast path reader count will be required,
just as you said.

Sorry about taking that much time to get my head around this. Thanks for
your helpful explanations and your time.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Paul E. McKenney <paulmck@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Saturday, August 23, 2008 - 5:40 pm

EXACTLY.

It's not even necessarily a "contention" bit per se - it's literally a
"readers have to take the slow-path" bit (writers will obviously _always_
take the slowpath if there is any non-zero value at all, so they don't
need it).

Then, the slow-path might actually decide that "hey, there is no _actual_
writer there yet - just some _waiting_ writer, but since this read lock is
in an interrupt context, we have to let it go through _despite_ the fact
that the lock is contended in order to avoid deadlock".

So it allows a fast-path for the trivial cases that is literally just a
couple of instructions long, and that is nice not just because of
performance issues, but because it then means that you can entirely ignore
all those things in the slow path. It also means that everybody can look
at the fast-path and decide that "ok, the fast-path really is optimal".

That fast-path is what a lot of people care more about than just about
anything else.

The slow-path, in comparison, can be in C, and can do all those checks
like "are we in an (sw-)interrupt handler?" and basically prioritize
certain classes of people.

Linus
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Paul E. McKenney <paulmck@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Thursday, August 21, 2008 - 6:22 pm

Ok, so change that "testl" to "andl" to shave two bytes, and then make the
whole lock structure look something like this:

struct rwlock {
u32 fastpath;
u32 pending_readers;
u32 pending_writers;
};

and then the slowpath would look something like this:

void __rdlock_slowpath(struct rwlock *lock)
{
/* Move the read count to the pending path and undo our xadd */
atomic_add(1, &lock->pending_readers);
atomic_sub(4, &lock->fastpath);

for (;;) {
unsigned fastpath;
while (lock->pending_writers)
cpu_relax();
while ((fastpath = lock->fastpath) & 1)
cpu_relax();
/* This will also undo the contention bit */
if (atomic_cmpxchg(&lock->fastpath, fastpath, (fastpath & ~2)+4)) == fastpath);
break;
}
atomic_sub(1, &lock->pending_readers);
}

and yes, there are more atomic accesses in there than would be necessary,
in that if you can cram the whole thing into a single 64-bit word you can
do both the initial pending fixup and the final cmpxchg/pending_readers
thing as a single locked 64-bit op, but hey, the above is fairly simple.

You could also just use a spinlock to protect all the other data, of
course.

There are fairness issues here too - do you really want to wait for
pending writes every time, or just the first time through the loop? It
depends on just how much you want to prefer writers.

The slow-paths for writers is similar, and has similar issues for the
fairness issue. For example, instead of a simple "pending_writers" count,
maybe you want to use a ticket lock there, to make writers be fair among
themselves? Of course, the hope would be that there would never be that
kind of contention by pure writers, so that sounds like overdesign, but
the thing I'm trying to point out here is that this is all a separate
decision from the actual fast-path, and having fair writers doesn't
necessarily mean that the fastpath has to even know/care.

The placement of the c...

To: Paul E. McKenney <paulmck@...>
Cc: Linus Torvalds <torvalds@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, <linux-kernel@...>, Steven Rostedt <rostedt@...>
Date: Tuesday, August 19, 2008 - 5:06 am

Still thinking about RT :

The good news is : we don't really need all those bits to be updated
atomically. The bit groups which must be atomically updated are :

- Writer and first priority group
T writers subscribed count bits
1 bit for writer mutex
T reader count bits

- For each other priority group :
T reader count bits
1 reader exclusion bit (set by the writer)

I also noticed that the writer and the first reader priority group
happen to be at the same priority level. When the writer want to exclude
readers from higher priority groups, it must take their priority,
exclude them using the reader exclusion bit, wait for all readers of
that group to be out of their critical section and then proceed to the
next priority group until all the groups are locked.

The reader of the first priority group must check that both the writer
mutex and the writers subscribed count bits are not set.

The readers in the other groups only have to check the reader exclusion
bit associated to their own group.

So if we simplify the problem a bit for RT, let's fix a maximum of 32768
threads on the system (15 bits). Let's also assume we have 19
priorities.

- Writer and first priority group (fits in 31 bits)
15 bits for writers subscribed count
1 bit for writer mutex
15 bits reader count

Then we have an array of 18 : (each fit in 16 bits)
15 reader count bits
1 reader exclusion bit

Therefore, the whole data would fit in 40 bytes.

The only thing we would not have is the ability to do a single cmpxchg
on all the bits in the writer fast path, making the writer common case
much much slower.

However, the reader side would still be lightning-fast as it would only
have to do a cmpxchg on its own 16 or 32 bits bitfield.

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, Paul E. McKenney <paulmck@...>, <linux-kernel@...>
Date: Monday, August 18, 2008 - 2:59 pm

Can you also finally

- separate out the fastpaths more clearly. The slowpaths should not be
even visible in the I$ footprint. The fastpaths are probably better off
as a separate "fastpath.S" file to make sure that the compiler doesn't
inline the slowpath or do other insane things.

If the fastpath isn't just a few instructions, there's something wrong.
It should be written in assembly, so that it's very clear what the
fastpath is. Because most of the time, the fastpath is all that really
matters (at least as long as there is some reasonable slow-path and
contention behaviour - the slowpath matters int hat it shouldn't ever
be a _problem_).

- don't call these "fair". They may be fairER than the regular rwlocks,
but you'd really need to explain that, and true fairness you (a) can't
really get without explicit queueing and (b) probably don't even want,
because the only starvation people tend to worry about is reader vs
writer, and writer-writer fairness tends to be unimportant. So it's not
that people want the (expensive) "fairness" it's really that they want
something _reasonably_ fair considering the normal worries.

(ie if you have so much write activity that you get into write-write
fairness worries, you shouldn't be using a rwlock to begin with, so
that level of fairness is simply not very interesting).

People react emotionally and too strongly to to words like "fairness"
or "freedom". Different people have different ideas on what they mean,
and take them to pointless extremes. So just don't use the words, they
just detract from the real issue.

Also, the performance numbers on their own are pretty irrelevant, since
there's nothing to compare them to. It would be much more relevant if you
had a "this is what the standard rwlock" does as a comparison for each
number.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Joe Perches <joe@...>, Paul E. McKenney <paulmck@...>, <linux-kernel@...>
Date: Sunday, August 17, 2008 - 5:30 pm

Here are some updated benchmarks of the rwlock v5, showing [min,avg,max]
delays in non-contended and contended cases :

The test module is available at :

http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-fair-rwlock.c

** Performance tests

Dual quad-core Xeon 2.0GHz E5405

* Lock contention delays, per context, 60s test

** get_cycles calibration **
get_cycles takes [min,avg,max] 78,79,84 cycles, results calibrated on avg

** Single writer test, no contention **
writer_thread/0 iterations : 964054, lock delay [min,avg,max] 149,157,5933 cycles

** Single reader test, no contention **
reader_thread/0 iterations : 37591763, lock delay [min,avg,max] 59,65,21383 cycles

** Multiple readers test, no contention **
reader_thread/0 iterations : 9062919, lock delay [min,avg,max] 59,921,42593 cycles
reader_thread/1 iterations : 9014846, lock delay [min,avg,max] 59,928,42065 cycles
reader_thread/2 iterations : 9394114, lock delay [min,avg,max] 59,897,46877 cycles
reader_thread/3 iterations : 9459081, lock delay [min,avg,max] 59,896,38207 cycles

** High contention test **
4 thread readers (no delay loop)
2 thread trylock readers (no delay loop)
2 thread writers (10us period)
2 thread trylock writers (10us period)
1 periodical interrupt readers on 7/8 cpus (IPIs).
1 periodical interrupt trylock readers on 7/8 cpus (IPIs).

writer_thread/0 iterations : 2864146, lock delay [min,avg,max] 179,9493,102563 cycles
writer_thread/1 iterations : 2986492, lock delay [min,avg,max] 179,9332,66923 cycles
trylock_writer_thread/0 iterations : 12789892, successful iterations : 3257203
trylock_writer_thread/1 iterations : 12747023, successful iterations : 3268997
reader_thread/0 iterations : 13091562, lock delay [min,avg,max] 59,5806,141053 cycles
reader_thread/1 iterations : 12574027, lock delay [min,avg,max] 59,5706,141839 cycles
reader_thread/2 iterations : 12805706, lock delay [min,avg,max] 59,5738,138725 cycles
reader_thread/3 iterations : 13352731, lock delay [min,avg,max] 59,5606,137585 cycle...

Previous thread: [PATCH] Move sysctl check into debugging section and don't make it default y by Andi Kleen on Saturday, August 16, 2008 - 1:53 am. (7 messages)

Next thread: [PATCH] kexec: fix up KEXEC_CONTROL_CODE_SIZE missed during conversion by Paul Collins on Saturday, August 16, 2008 - 4:55 am. (1 message)