x86_64 add/sub atomic ops does not seems to accept integer values bigger than 32 bits as immediates. Intel's add/sub documentation specifies they have to be passed as registers. http://gcc.gnu.org/onlinedocs/gcc-4.3.0/gcc/Machine-Constraints.html#Machine-Constraints states : e 32-bit signed integer constant, or a symbolic reference known to fit that range (for immediate operands in sign-extending x86-64 instructions). Z 32-bit unsigned integer constant, or a symbolic reference known to fit that range (for immediate operands in zero-extending x86-64 instructions). Since add/sub does sign extension, using the "e" constraint seems appropriate. It applies to 2.6.27-rc, 2.6.26, 2.6.25... Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> CC: Jeremy Fitzhardinge <jeremy@goop.org> CC: "H. Peter Anvin" <hpa@zytor.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Ingo Molnar <mingo@elte.hu> CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Joe Perches <joe@perches.com> --- include/asm-x86/atomic_64.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) Index: linux-2.6-lttng/include/asm-x86/atomic_64.h =================================================================== --- linux-2.6-lttng.orig/include/asm-x86/atomic_64.h 2008-08-16 03:19:35.000000000 -0400 +++ linux-2.6-lttng/include/asm-x86/atomic_64.h 2008-08-16 03:21:18.000000000 -0400 @@ -228,7 +228,7 @@ static inline void atomic64_add(long i, { asm volatile(LOCK_PREFIX "addq %1,%0" : "=m" (v->counter) - : "ir" (i), "m" (v->counter)); + : "er" (i), "m" (v->counter)); } /** @@ -242,7 +242,7 @@ static inline void atomic64_sub(long i, { asm volatile(LOCK_PREFIX "subq %1,%0" : "=m" (v->counter) - : "ir" (i), "m" (v->counter)); + : "er" (i), "m" (v->counter)); } /** @@ -260,7 +260,7 @@ static inline int atomic64_sub_and_test( asm volatile(LOCK_PREFIX "subq %2,%0; sete %1" : "=m" (v->counter), "=qm" ...
This is correct; this is in fact true for all instructions except "mov". Whether it's sign- or zero-extending is sometimes subtle, but not in these cases. Do you happen to know if this is a manifest bug in the current kernel (i.e. if there is anywhere we're using more than ±2 GB as a constant to these functions?) Either way, I'll queue this up to tip:x86/urgent if Ingo hasn't already since this is a pure bug fix. -hpa --
No, I did not hit this on current kernel code and the effect is quite esasy to detect : the assembler spits an error. I have hit this problem when tying to implement a better rwlock design than is currently in the mainline kernel (I know the RT kernel has a hard time with rwlocks), and had to play with add/sub of large values. The idea is to bring down the interrupt latency caused by rwlocks shared between fast read-side interrupt handlers and slow thread context read-sides (tasklist_lock is the perfect example). In that case, the worse case interrupt latency is caused by the irq-disabled writer lock when contended by the slow readers. I will probably post a RFC about this in a near future. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Have you looked at my sleping rwlock trial thing? It's very different from a spinning one, but I think the fast path should be identical, and that's the one I tried to make fairly optimal. See http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary for a git tree. The sleeping version has two extra words for the sleep events, but those would be irrelevant for the spinning version. The fastpath is movl $4,%eax lock ; xaddl %eax,(%rdi) testl $3,%eax jne __my_rwlock_rdlock for the read-lock (the two low bits are contention bits, so you can make contention have any behaviour you want - including fairish, prefer-reads, or prefer-writes). The write fastpath is xorl %eax,%eax movl $1,%edx lock ; cmpxchgl %edx,(%rdi) jne __my_rwlock_wrlock and the "unlock" case is actually unnecessarily complex in my implementation, because it needs to - wake things up in case of a conflict (not true of a spinning version, of course) - it's pthreads-compatible, so the same function needs to handle both a read-unlock and a write-unlock. but a spinning version should be much simpler. Anyway, I haven't tried turning it into a spinning version, but it was very much designed to - work with both 32-bit and 64-bit x86 by making the fastpath only do 32-bit locked accesses - have any number of pending readers/writers (which is not a big deal for a spinning one, but at least there are no CPU count overflows). - and because it is designed for sleeping, I'm pretty sure that you can easily drop interrupts in the contention path, to make write_lock_irq[save]() be reasonable. In particular, the third bullet is the important one: because it's designed to have a "contention" path that has _extra_ information for the contended case, you could literally make the extra information have things like a list of pending writers, so that you can drop interrupts on one CPU, while you adding information to let the reader ...
No, I just had a look at it, thanks for the pointer! Tweakable contention behavior seems interesting, but I don't think it deals with the fact that on a mainline kernel, when an interrupt handler comes in and asks for a read lock, it has to get it on the spot. (RT kernels can get away with that using threaded threaded interrupts, but that's a completely different scheme). Therefore, the impact is that interrupts must be disabled around write lock usage, and we end up in the situation where this interrupt disable section can last for a long time, given it waits for every readers (including ones which does not disable interrupts nor softirqs) to complete. Actually, I just used LTTng traces and eventually made a small patch to lockdep to detect whenever a spinlock or a rwlock is used both with interrupts enabled and disabled. Those sites are likely to produce very high latencies and should IMHO be considered as bogus. The basic bogus scenario is to have a spinlock held on CPU A with interrupts enabled being interrupted and then a softirq runs. On CPU B, the same lock is acquired with interrupts off. We therefore disable interrupts on CPU B for the duration of the softirq currently running on the CPU A, which is really not something that helps keeping short latencies. My preliminary results shows that there are a lot of inconsistent spinlock/rwlock irq on/off uses in the kernel. This kind of scenario is pretty easy to fix for spinlocks (either move the interrupt disable within the spinlock section if the spinlock is never used by an interrupt handler or make sure that every users has interrupts disabled). The problem comes with rwlocks : it is correct to have readers both with and without irq disable, even when interrupt handlers use the read lock. However, the write lock has to disable interrupt in that case, and we suffer from the high latency I pointed out. The tasklist_lock is the perfect example of this. In the following patch, I try to address this issue. The core ...
Right. Which is exactly why I'd suggest using the extra space for saying "this CPU is busy-looping waiting for a write lock", and then the read-lock contention case can say - ok, there's a pending write lock holder on _my_ CPU, so I need to just succeed right away despite the fact that there is contention. In other words, there are a few cases: - you actually *got* the write lock interrupts obviously have to be disabled here, because any reader can simply not get the lock. - you are waiting to get the write lock. Mark this CPU as "pending" in the rwlock (by making the extended thing be a queue, or by simply only ever allowing a single pendign CPU), and re-enable interrupts while you wait for it. An interrupt that comes in and wants a read-lock sees that it's pending, so it should then ignore the contention bit, and only wait for the write bit to go away. See? This way you only need to actually disable interrupts while holding literally the lock, not while waiting for it. While still giving priority to writers (except for the _one_ CPU, where new readers will have to get through). So this way you can be fair, and not allow readers to starve a writer. The only reader that is allowed past a waiting writer is the reader on that same CPU. And notice how the fast-path needs no spinlock or anything else - it's still just a single locked instruction. In comparison, if I read your example code right, it is absolutely horrid and has an extra spinlock access for the fair_write_lock case. Linus --
Hi Linus, Using a writer subscription to the rwlock to make sure the readers stop taking the lock when a writer is waiting for it is indeed a good way to insure fairness. I just fixed the issues you pointed in my patch (spinlock removed, added a write fastpath with a single atomic op) and also used the subscription idea to get fairness for writers. Contention delay tests shows that fairness is achieved pretty well. For periodical writers, busy looping readers and periodical interrupt readers : 6 thread readers (busy-looping) 3 thread writers (1ms period) 2 periodical interrupt readers on 7/8 cpus (IPIs). - 21us max. contention for writer - 154us max. contention for thread readers - 16us max. contention for interrupt readers (benchmark details below) I still perceive two potential problems with your approach. It's not related to fairness between readers and writers, but more on the effect on interrupt latency on the system. This is actually what my patch try to address. First there is a long interrupt handler scenario, as bad as disabling interrupts for a long time period, which is not fixed by your solution : CPU A thread context, takes the read lock, gets interrupted, softirq runs. CPU B thread context, subscribes for the writer lock, busy loops waiting for CPU A. (in your solution, interrupts are enabled here so interrupt handlers on CPU B can come in) CPU C interrupt context, takes the read lock, contended because CPU B busy loops waiting for the writer lock. -> CPU C will have an interrupt handler running for the duration of the softirq on CPU A, which will impact interrupt latency on CPU C. Second, I tend to think this it might be difficult to atomically set the writer state and disable interrupts, unless we disable interrupts, cmpxchg the writer test, reenable interrupts if it fails in a loop, which does not seem very neat. Fair low-latency rwlock v3 Changelog since v2 : Add writer fairness in addition to fairness wrt interrupts and ...
So first off, you should call this "trylock", since it doesn't necessarily get the lock at all. It has nothing to do with fast. This is actually potentially very slow. Why? If the lock is uncontended, but is not in the current CPU's caches, the read -> rmw operation generates multiple cache coherency protocol events. First it gets the line in shared mode (for the read), and then later it turns it into exclusive mode. So if it's likely that the value is zero (or even if it's just the only case we really care about), then you really should do the atomic_long_cmpxchg(&rwlock->value, 0, newvalue); thing as the _first_ access to the lock. Yeah, yeah, that means that you need to do the local_bh_disable etc first too, and undo it if it fails, but the failure path should be the unusual one. Linus --
Ah, you are right. This new version implements a "single cmpxchg" uncontended code path, changes the _fast semantic for "trylock uncontended" and also adds the trylock primitives. Thanks for the input, Mathieu Fair low-latency rwlock v5 Fair low-latency rwlock provides fairness for writers against reader threads, but lets softirq and irq readers in even when there are subscribed writers waiting for the lock. Writers could starve reader threads. Changelog since v4 : rename _fast -> trylock uncontended Remove the read from the uncontended read and write cases : just do a cmpxchg expecting "sub_expect", 0 for "lock", 1 subscriber for a "trylock". Use the value returned by the first cmpxchg as following value instead of doing an unnecessary read. Here is why read + rmw is wrong (quoting Linus) : "This is actually potentially very slow. Why? If the lock is uncontended, but is not in the current CPU's caches, the read -> rmw operation generates multiple cache coherency protocol events. First it gets the line in shared mode (for the read), and then later it turns it into exclusive mode." Changelog since v3 : Added writer subscription and fair trylock. The writer subscription keeps the reader threads out of the lock. Changelog since v2 : Add writer fairness in addition to fairness wrt interrupts and softirqs. Added contention delays performance tests for thread and interrupt contexts to changelog. Changelog since v1 : - No more spinlock to protect against concurrent writes, it is done within the existing atomic variable. - Write fastpath with a single atomic op. No, I just had a look at it, thanks for the pointer! Tweakable contention behavior seems interesting, but I don't think it deals with the fact that on a mainline kernel, when an interrupt handler comes in and asks for a read lock, it has to get it on the spot. (RT kernels can get away with that using threaded threaded interrupts, but that's a completely different scheme). Therefore, the ...
Here are some updated benchmarks of the rwlock v5, showing [min,avg,max] delays in non-contended and contended cases : The test module is available at : http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-fair-rwlock.c ** Performance tests Dual quad-core Xeon 2.0GHz E5405 * Lock contention delays, per context, 60s test ** get_cycles calibration ** get_cycles takes [min,avg,max] 78,79,84 cycles, results calibrated on avg ** Single writer test, no contention ** writer_thread/0 iterations : 964054, lock delay [min,avg,max] 149,157,5933 cycles ** Single reader test, no contention ** reader_thread/0 iterations : 37591763, lock delay [min,avg,max] 59,65,21383 cycles ** Multiple readers test, no contention ** reader_thread/0 iterations : 9062919, lock delay [min,avg,max] 59,921,42593 cycles reader_thread/1 iterations : 9014846, lock delay [min,avg,max] 59,928,42065 cycles reader_thread/2 iterations : 9394114, lock delay [min,avg,max] 59,897,46877 cycles reader_thread/3 iterations : 9459081, lock delay [min,avg,max] 59,896,38207 cycles ** High contention test ** 4 thread readers (no delay loop) 2 thread trylock readers (no delay loop) 2 thread writers (10us period) 2 thread trylock writers (10us period) 1 periodical interrupt readers on 7/8 cpus (IPIs). 1 periodical interrupt trylock readers on 7/8 cpus (IPIs). writer_thread/0 iterations : 2864146, lock delay [min,avg,max] 179,9493,102563 cycles writer_thread/1 iterations : 2986492, lock delay [min,avg,max] 179,9332,66923 cycles trylock_writer_thread/0 iterations : 12789892, successful iterations : 3257203 trylock_writer_thread/1 iterations : 12747023, successful iterations : 3268997 reader_thread/0 iterations : 13091562, lock delay [min,avg,max] 59,5806,141053 cycles reader_thread/1 iterations : 12574027, lock delay [min,avg,max] 59,5706,141839 cycles reader_thread/2 iterations : 12805706, lock delay [min,avg,max] 59,5738,138725 cycles reader_thread/3 iterations : 13352731, lock delay [min,avg,max] 59,5606,137585 ...
Can you also finally - separate out the fastpaths more clearly. The slowpaths should not be even visible in the I$ footprint. The fastpaths are probably better off as a separate "fastpath.S" file to make sure that the compiler doesn't inline the slowpath or do other insane things. If the fastpath isn't just a few instructions, there's something wrong. It should be written in assembly, so that it's very clear what the fastpath is. Because most of the time, the fastpath is all that really matters (at least as long as there is some reasonable slow-path and contention behaviour - the slowpath matters int hat it shouldn't ever be a _problem_). - don't call these "fair". They may be fairER than the regular rwlocks, but you'd really need to explain that, and true fairness you (a) can't really get without explicit queueing and (b) probably don't even want, because the only starvation people tend to worry about is reader vs writer, and writer-writer fairness tends to be unimportant. So it's not that people want the (expensive) "fairness" it's really that they want something _reasonably_ fair considering the normal worries. (ie if you have so much write activity that you get into write-write fairness worries, you shouldn't be using a rwlock to begin with, so that level of fairness is simply not very interesting). People react emotionally and too strongly to to words like "fairness" or "freedom". Different people have different ideas on what they mean, and take them to pointless extremes. So just don't use the words, they just detract from the real issue. Also, the performance numbers on their own are pretty irrelevant, since there's nothing to compare them to. It would be much more relevant if you had a "this is what the standard rwlock" does as a comparison for each number. Linus --
Interesting approach! Questions and comments interspersed below. OK... 64 bits less three for the writer bits is 61 bits, divided by four gives us 15 bits, which limits us to 32,768 CPUs (16,384 if we need a guard bit separating the fields). This should be enough for the next few years. However, if critical sections are to be preemptable (as they need to be in -rt, then this limit applies to the number of -tasks- rather than the number of CPUs. 32,768 is IMHO an uncomfortably low ceiling for the number of tasks in a system, given that my laptop supports several thousand without breaking a sweat. So, how about combining the softirq and hardirq bits, giving the range 0 .. 2*log_2(NR_CPUS)-1 over to tasks? This would give 30 bits, allowing more than 10^9 tasks, which should be enough for the next few years. Better yet, why not begin the bit allotment at the high end of the word, allowing whatever is left over at the bottom for tasks? Alternatively, "hardirq" and "softirq" runs in process context in -rt, so the three read-side bitfields should be able to be combined into one Suggest inlining this one, as it is just a wrapper. (Or is gcc smart I don't understand the need for this smp_rmb() -- what other location are we maintaining order with? The CPU must maintain order of accesses to a single variable (cache coherence requires this). I suggest replacing the smp_rmb() with a barrier(). Not that optimizing Why not use fair_read_trylock_uncontended()? Or equivalently, why not Why don't we use fair_write_subscribe() here? Ah, because we have already disabled preemption... OK, I'll bite... Why don't we also have to wait for the softirq and hardirq readers? I would expect THREAD_RMASK|SOFTIRQ_RMASK|HARDIRQ_RMASK rather than just THREAD_RMASK above. Never mind, I finally see the following pair of statements. So, the idea is to allow recursive readers in irq when we also have readers in the interrupted process context, right? But isn't ...
Agreed. I am careful about overdesigning for RT though, given that once we get it right for the current thread vs softirq vs hardirq mess, the Technical detail : x86_64 cannot pass immediate values > 2^32 to most instructions; it must load an immediate value in a register, which would add a few instructions. Therefore, I prefer to use the lowest bits Actually, given that softirqs and hardirqs are really just tasks in -RT, I guess threads could be given the whole 45 bits available for threads, softirqs and irqs (0 .. 3*log_2(NR_CPUS)-1). However, in that case, we would also like to have as much preempable writer critical sections as we have reader critical sections. Therefore, it would be best to split the 60 bits in two : 30 bits for reader threads 30 bits for writer threads However, I wonder if the whole algorithm included in this rwlock primitive to make sure that interrupt latency < softirq latency < thread latency won't just vanish in -rt kernels. We might want to group those threads by priority (e.g. grouping irq threads together and maybe leaving softirq and normal threads in the same bunch), but we would then impact softirq latency a little bit more. The problem of this approach wrt RT kernels is that we cannot provide enough "priority groups" (current irq, softirq and threads in mainline kernel) for all the subtile priority levels of RT kernels. The more groups we add, the less threads we allow on the system. So basically, the relationship between the max number of threads (T) and the number of reader priorities goes as follow on a 64 bits machine : T writers subscribed count bits 1 bit for writer mutex for first priority group : T reader count bits (no need of reader exclusion bit because the writer subscribed count bits and the writer mutex act as exclusion) for each other priority group : T reader count bits 1 reader exclusion bit (set by the writer) We have the inequality : 64 >= (T + 1) + T + (NR_PRIORITIES - 1) * (T + 1) 64 >= (2T ...
* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote: It strikes me that Intel has a nice (probably slow?) cmpxchg16b instruction on x86_64. Therefore, we could atomically update 128 bits, which gives the following table : ((127 - 2T) / (T + 1)) + 1 >= NR_PRIORITIES Thread bits | Max number of threads | Number of priorities 63 | 2^63 | 1 42 | 2^42 | 2 31 | 2^31 | 3 24 | 2^24 | 4 20 | 2^20 | 5 17 | 131072 | 6 15 | 32768 | 7 13 | 8192 | 8 11 | 2048 | 9 10 | 1024 | 10 9 | 512 | 11 8 | 256 | 13 7 | 128 | 15 6 | 64 | 17 5 | 32 | 20 4 | 16 | 24 .. where we have more priorities than threads. So I wonder if having in the surrounding of 10 priorities, which could dynamically adapt the number of threads to the number of priorities available, could be interesting for the RT kernel ? That would however depend on the very architecture-specific cmpxchg16b. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Still thinking about RT : The good news is : we don't really need all those bits to be updated atomically. The bit groups which must be atomically updated are : - Writer and first priority group T writers subscribed count bits 1 bit for writer mutex T reader count bits - For each other priority group : T reader count bits 1 reader exclusion bit (set by the writer) I also noticed that the writer and the first reader priority group happen to be at the same priority level. When the writer want to exclude readers from higher priority groups, it must take their priority, exclude them using the reader exclusion bit, wait for all readers of that group to be out of their critical section and then proceed to the next priority group until all the groups are locked. The reader of the first priority group must check that both the writer mutex and the writers subscribed count bits are not set. The readers in the other groups only have to check the reader exclusion bit associated to their own group. So if we simplify the problem a bit for RT, let's fix a maximum of 32768 threads on the system (15 bits). Let's also assume we have 19 priorities. - Writer and first priority group (fits in 31 bits) 15 bits for writers subscribed count 1 bit for writer mutex 15 bits reader count Then we have an array of 18 : (each fit in 16 bits) 15 reader count bits 1 reader exclusion bit Therefore, the whole data would fit in 40 bytes. The only thing we would not have is the ability to do a single cmpxchg on all the bits in the writer fast path, making the writer common case much much slower. However, the reader side would still be lightning-fast as it would only have to do a cmpxchg on its own 16 or 32 bits bitfield. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Stop this crapola. Take a look at the rwsem thing I pointed you to. After you understand that, come back. The WHOLE POINT of that thing was to use only 32-bit atomics on the hot path. Don't even start think9ing about cmpxchg16b. If you cannot do your atomics in 32-bit, they're broken. Please. I realize that the rwsem implementation I did is subtle. But really. Spend the time to understand it. Linus --
You are right, before posting my new version, let's review your rwsem
implementation. Let's pseudo-code the fastpath in C (I don't like to
review the instruction set when thinking conceptually). Comments
inlined. My new patch follows.
* Read lock fast path
my_rwlock_rdlock :
int v = atomic_add_return(0x4, &lock);
if (v & 0x3)
__my_rwlock_rdlock(lock);
return
Ok, so the reader just say "hey, I'm there" by adding itself to the
reader could. Unlikely to overflow, but could. Then checks the writer
mask, if set, call the read slowpath. In my solution, I do basically the
same as the writer fast path you did : I expect all bits to be 0.
Therefore, if there are many readers at once, I use the slow path even
though there is not "real" contention on the lock. Given that I need to
limit the number of readers to a smaller amount of bits, I need to
detect overflow before-the-fact (this is my new v8 implementation, will
be posted shortly). BTW, you could accelerate the slow path by passing
"v" :
my_rwlock_rdlock :
int v = atomic_add_return(0x4, &lock);
if (v & 0x3)
__my_rwlock_rdlock(lock, v);
return;
* Write lock fast path
my_rwlock_wrlock :
if (cmpcxhg(&lock, 0, 0x1) != 0)
__my_rwlock_wrlock(lock);
return;
Writer expects all bits to be 0 (lock uncontended). If true, take the
lock by setting 0x1, else take the slow path. Yes, that's more or less
what I have too. Here too, you could accelerate the write slow path by
passing v.
my_rwlock_wrlock :
int v = cmpcxhg(&lock, 0, 0x1);
if (v)
__my_rwlock_wrlock(lock, v);
return;
* Read and write unlock fastpath
my_rwlock_unlock :
int v = atomic_read(&lock);
if (!(v & 0x1)) {
/* unlock reader */
v = atomic_sub_return(0x4, &lock);
if (v & 0x3)
__my_rwlock_rdunlock(lock);
return;
} else {
/* unlock writer */
if (!(v == 0x1))
__my_rwlock_wrunlock(lock);
if (cmpxchg(&lock, v, 0) != v)
__my_rwlock_wrunlock(lock);
return;
}
Hrm, why don't you split the unlocks ? You ...I stopped reading right there. Why? Because that is already crap. Go look at my code once more. Go look at how it has 128 bits of data, exactly so that it DOES NOT HAVE TO LIMIT THE NUMBER OF READERS. And then go look at it again. Look at it five times, and until you can understand that it still uses just a 32-bit word for the fast-path and no unnecessarily crap in it, but it actually has 128 bits of data for all the slow paths, don't bother emailing me any new versions. Please. You -still- apparently haven't looked at it, at least not enough to understand the _point_ of it. You still go on about trying to fit in three or four different numbers in that one word. Even though the whole point of my rwlock is that you need exactly _one_ count (active writers), and _one_ bit (active reader) and _one_ extra bit ("contention, go to slow path, look at the other bits ONLY IN THE SLOW PATH!") That leaves 30 bits for readers. If you still think you need to "limit the number of readers", then you aren't getting it. Linus --
Side note: the actual main rwlock thing is designed for a 64-bit word and the waiters separately as two 32-bit words, so it doesn't really do what I describe, but that's actually because the whole sleeping thing is _harder_ than a spinning thing, and has races with wakeups etc. A spinning thing, in contrast, is pretty trivial. So here's what I think your code should be like: rdlock: movl $4,%eax lock ; xaddl %eax,(%rdi) testl $3,%eax jne __rdlock_slowpath ret rwlock: xorl %eax,%eax movl $1,%edx lock ; cmpxchgl %edx,(%rdi) jne __rwlock_slowpath ret rdunlock: lock ; subl $4,(%rdi) ret rwunlock: lock ; andl $~1,(%rdi) ret and I'm pretty damn sure that that should be totally sufficient for a spinning rwlock. The non-spinning one is more complex just because the unlock paths need to guarantee that something gets woken up, that just isn't an issue when you do spinlocks. Now, in the slow-path: - on the rwlock slowpath side, set bit#1 to make sure that readers get caught in the slowpath - then do a *separate* count of how many pending readers and writers (ie the ones that got caught into the slowpath) you have (one word each is probably fine), and then the slowpaths can just do the right thing depending on whether there are pending readers/writers. See? The main lock needs not worry about number of writers AT ALL, because it's totally irrelevant. So don't worry about running out of bits. You won't. Just put those counts somewhere else! The only thing that matters for the main lock word is whether there are active readers (30 bits), and whether there is an active writer (there can only ever be one: 1 bit), and whether new readers should be trapped (1 bit). If you worry about overflows, you're doing something wrong. Linus --
Ok, so change that "testl" to "andl" to shave two bytes, and then make the
whole lock structure look something like this:
struct rwlock {
u32 fastpath;
u32 pending_readers;
u32 pending_writers;
};
and then the slowpath would look something like this:
void __rdlock_slowpath(struct rwlock *lock)
{
/* Move the read count to the pending path and undo our xadd */
atomic_add(1, &lock->pending_readers);
atomic_sub(4, &lock->fastpath);
for (;;) {
unsigned fastpath;
while (lock->pending_writers)
cpu_relax();
while ((fastpath = lock->fastpath) & 1)
cpu_relax();
/* This will also undo the contention bit */
if (atomic_cmpxchg(&lock->fastpath, fastpath, (fastpath & ~2)+4)) == fastpath);
break;
}
atomic_sub(1, &lock->pending_readers);
}
and yes, there are more atomic accesses in there than would be necessary,
in that if you can cram the whole thing into a single 64-bit word you can
do both the initial pending fixup and the final cmpxchg/pending_readers
thing as a single locked 64-bit op, but hey, the above is fairly simple.
You could also just use a spinlock to protect all the other data, of
course.
There are fairness issues here too - do you really want to wait for
pending writes every time, or just the first time through the loop? It
depends on just how much you want to prefer writers.
The slow-paths for writers is similar, and has similar issues for the
fairness issue. For example, instead of a simple "pending_writers" count,
maybe you want to use a ticket lock there, to make writers be fair among
themselves? Of course, the hope would be that there would never be that
kind of contention by pure writers, so that sounds like overdesign, but
the thing I'm trying to point out here is that this is all a separate
decision from the actual fast-path, and having fair writers doesn't
necessarily mean that the fastpath has to even know/care.
The placement of the counts is also something that can be ...Hi Linus, After having had a look 15 times more at your rwlock code, and again at your answer, in loops, I come to the conclusion that I made a terrible job at explaining the core idea beneath my rwlock design. So let's try to make it more obvious. I am going to keep this explanation short. First, the goal : to kill latency induced by rwlocks. Now, let me explain why I need at least not one, but _four different_ contention bits. Simply because there are four types of contention, one for each execution context which may take the read lock. (irq, softirq, non-preemptable, preemptable) So let's suppose I share the rwlock between non-preemptable writers, non-preemptable readers and interrupt context readers. The idea is this: If I want to take the write lock, I'll have to protect against both non-preemptable and interrupt readers (and also against other writers). A way to do that without inducing high interrupt latency with current rwlocks would be for the writer to take _two_ locks, not just one. The first excludes preemptable readers and the second excludes irqs. So we have : For the writer : preempt_disable(); write_lock(&preempt_rwlock); local_irq_disable(); write_lock(&irq_rwlock); /* Write access */ write_unlock(&irq_rwlock); local_irq_enable(); write_unlock(&preempt_rwlock); preempt_enable(); And the readers : Interrupt : read_lock(&irq_rwlock); ... read_unlock(&irq_rwlock); non-preemptable : read_lock(&preempt_rwlock); ... read_unlock(&preempt_rwlock); If you still wonder why I need to take two locks, thus elevating the context priority by exluding one context at a time, we can go to this example showing why current rwlock use in the kernel produces such high latencies. The current use is : For the writer, we exclude all the readers in one go : local_irq_disable(); write_lock(&big_lock_against_all_readers); /* Write access ...
No. You need _one_ contention bit in the fast-path. Then, as you get into the slow-path, you can decide on four different behaviours. Quite frankly, I don't think this discussion is going anywhere. I don't think I'd take anything from you, since you seem to have a really hard time separating out the issue of fast-path and slow-path. So I'm simply not going to bother, and I'm not going to expect to merge your work. Sorry, but it simply isn't worth my time or effort. Linus --
Hi, OK, I now see how I can apply this contention bit idea to my algo. Actually, the point I just fixed in my head is that this bit will be a "MAY_CONTEND" bit, which could let higher priority readers access the lock in the slow path. Only one fast path reader count will be required, just as you said. Sorry about taking that much time to get my head around this. Thanks for your helpful explanations and your time. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
EXACTLY. It's not even necessarily a "contention" bit per se - it's literally a "readers have to take the slow-path" bit (writers will obviously _always_ take the slowpath if there is any non-zero value at all, so they don't need it). Then, the slow-path might actually decide that "hey, there is no _actual_ writer there yet - just some _waiting_ writer, but since this read lock is in an interrupt context, we have to let it go through _despite_ the fact that the lock is contended in order to avoid deadlock". So it allows a fast-path for the trivial cases that is literally just a couple of instructions long, and that is nice not just because of performance issues, but because it then means that you can entirely ignore all those things in the slow path. It also means that everybody can look at the fast-path and decide that "ok, the fast-path really is optimal". That fast-path is what a lot of people care more about than just about anything else. The slow-path, in comparison, can be in C, and can do all those checks like "are we in an (sw-)interrupt handler?" and basically prioritize certain classes of people. Linus --
First of all, let me say I don't pretend to understand formally how you deal with overflow-after-the-fact, as unlikely as it is. However, it seems to me to be an easy way to avoid it. Simply by changing the read-test mask to $0x80000003, you will kick the code down the slow path once the read counter reaches $0x80000004 (2^29+1 readers), where you can do any necessary fixup -- or BUG() -- at leisure. This fastpath ends up being identical in size and performance to the one you posted, although yours could be reduced by changing the test to a testb instruction -- at the almost certainly unacceptable expense of taking a partial-register stall on the CPUs that have those. -hpa --
Just make sure it can't overflow. With spinlocks, you are guaranteed that you won't have more than NR_CPU's thing, so 20 bits is pretty safe. 30 Sure, you could do things like that, but that sounds like a separate Well, you could just change the "testl $3,%eax" into an "andl $3,%eax", and it will be two bytes shorter with no partial register stall. I forgot that "testl" doesn't have the byte immediate version. Linus --
New lock types should come with lockdep support. Also, if you're serious about -rt, you'll need to consider full PI. --
