Implement queued spinlocks for i386. This shouldn't increase the size of
the spinlock structure, while still able to handle 2^16 CPUs.
Not completely implemented with assembly yet, to make the algorithm a bit
clearer.
The queued spinlock has 2 fields, a head and a tail, which are indexes
into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an
"atomic_inc_return" on the head index, and keeps the returned value as
a ticket. The CPU then spins until the tail index is equal to that
ticket.
To unlock a spinlock, the tail index is incremented (this can be non
atomic, because only the lock owner will modify tail).
Implementation inefficiencies aside, this change should have little
effect on performance for uncontended locks, but will have quite a
large cost for highly contended locks [O(N) cacheline transfers vs
O(1) per lock aquisition, where N is the number of CPUs contending].
The benefit is is that contended locks will not cause any starvation.
Just an idea. Big NUMA hardware seems to have fairness logic that
prevents starvation for the regular spinlock logic. But it might be
interesting for -rt kernel or systems with starvation issues.
Index: linux-2.6/include/asm-i386/spinlock.h
===================================================================
--- linux-2.6.orig/include/asm-i386/spinlock.h
+++ linux-2.6/include/asm-i386/spinlock.h
@@ -29,69 +29,23 @@
static inline int __raw_spin_is_locked(raw_spinlock_t *x)
{
- return *(volatile signed char *)(&(x)->slock) <= 0;
+ return unlikely(x->qhead != x->qtail);
}
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
- asm volatile("\n1:\t"
- LOCK_PREFIX " ; decb %0\n\t"
- "jns 3f\n"
- "2:\t"
- "rep;nop\n\t"
- "cmpb $0,%0\n\t"
- "jle 2b\n\t"
- "jmp 1b\n"
- "3:\n\t"
- : "+m" (lock->slock) : : "memory");
-}
+ unsigned short pos = 1;
-/*
- * It is easier for the lock validator if interrupts are not re-enabled
- * in the middle ...On Fri, 23 Mar 2007 09:59:11 +0100
It's a very nice idea Nick.
You also have for free the number or cpus that are before you.
On big SMP/NUMA, we could use this information to call a special lock_cpu_relax() function to avoid cacheline transferts.
asm volatile(LOCK_PREFIX "xaddw %0, %1\n\t"
: "+r" (pos), "+m" (lock->qhead) : : "memory");
for (;;) {
unsigned short nwait = pos - lock->qtail;
if (likely(nwait == 0))
break;
lock_cpu_relax(lock, nwait);
}
lock_cpu_relax(raw_spinlock_t *lock, unsigned int nwait)
{
unsigned int cycles = nwait * lock->min_cycles_per_round;
busy_loop(cycles);
}
-
Yeah that's true, it would give you a lot more information for use in a lock backoff system, so that might counter the lesser efficiency of queued locks under contention. You could perhaps even wait using monitor/mwait on the lock cacheline if there are more than X CPUs in front of you. I'm not sure if the cost of mwait is entirely within the core that executes it, or whether it slows down cache coherency as well (which would make that idea a showstopper). -
Good Idea. Hopefully, this should reduce the number of cacheline transfers in the contended case. -
isnt this patented by MS? (which might not worry you SuSE/Novell guys, but it might be a worry for the rest of the world ;-) Ingo -
I never thought a FIFO queue would be patentable. Do you have a reference to the patent number? -
FWIW, MS patented even the preprocessor in computer languages ;) Holding a patent does not make it automatically enforceable. I remember thare was a discussion time ago about queued spinlocks on lkml. - Davide -
Hmm, it looks like they have implemented a system where the spinning cpu sleeps on a per-CPU variable rather than the lock itself, and the releasing cpu writes to that variable to wake it. They do this so that spinners don't continually perform exclusive->shared transitions on the lock cacheline. They call these things queued spinlocks. They don't seem to be very patent worthy either, but maybe it is what you're thinking of? I'm not as concerned about the contended performance of spinlocks for Linux as MS seems to be for windows (they seem to be very proud of this lock). Because if it is a big problem then IMO it is a bug. This was just something I had in mind when the hardware lock starvation issue came up, so I thought I should quickly code it up and RFC... actually it makes contended performance worse, but I'm not too worried about that because I'm happy I was able to implement it without increasing data size or number of locked operations. -
On Fri, 23 Mar 2007 11:32:44 +0100 Sure, but please note that you should rename your patch to : -
Those exclusive-to-shared transitions are among the cacheline transfers typically accounted to the algorithms in their complexity analyses. -- wli -
Nick Piggin writes: > On Fri, Mar 23, 2007 at 11:04:18AM +0100, Ingo Molnar wrote: > > > > * Nick Piggin <npiggin@suse.de> wrote: > > > > > Implement queued spinlocks for i386. [...] > > > > isnt this patented by MS? (which might not worry you SuSE/Novell guys, > > but it might be a worry for the rest of the world ;-) > > Hmm, it looks like they have implemented a system where the spinning > cpu sleeps on a per-CPU variable rather than the lock itself, and > the releasing cpu writes to that variable to wake it. They do this > so that spinners don't continually perform exclusive->shared > transitions on the lock cacheline. They call these things queued > spinlocks. They don't seem to be very patent worthy either, but Indeed, this technique is very well known. E.g., http://citeseer.ist.psu.edu/anderson01sharedmemory.html has a whole section (3. Local-spin Algorithms) on them, citing papers from the 1990 onward. Nikita. -
that is a cool reference! So i'd suggest to do (redo?) the patch based on those concepts and that terminology and not use 'queued spinlocks' that are commonly associated with MS's stuff. And as a result the contended case would be optimized some more via local-spin algorithms. (which is not a key thing for us, but which would be nice to have nevertheless) Ingo -
Ingo Molnar writes: > > * Nikita Danilov <nikita@clusterfs.com> wrote: > > > Indeed, this technique is very well known. E.g., > > http://citeseer.ist.psu.edu/anderson01sharedmemory.html has a whole > > section (3. Local-spin Algorithms) on them, citing papers from the > > 1990 onward. > > that is a cool reference! So i'd suggest to do (redo?) the patch based > on those concepts and that terminology and not use 'queued spinlocks' There is some old version: http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p06-locallock.patch http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p07-locallock-bkl.patch http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p08-locallock-zone.patch http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p0b-atomic_dec_and_lo... http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p0c-locallock-dcache.... This version retains original spin-lock interface (i.e., no additional "queue link" pointer is passed to the locking function). As a result, lock data structure contains an array of NR_CPU counters, so it's only suitable for global statically allocated locks. > that are commonly associated with MS's stuff. And as a result the > contended case would be optimized some more via local-spin algorithms. > (which is not a key thing for us, but which would be nice to have > nevertheless) > > Ingo Nikita. -
Firstly, the terminology in that paper _is_ "queue lock", which isn't really surprising. I don't really know or care about what MS calls their locks, but I'd suggest that their queued spinlock is probably named in reference to its queueing property rather than its local spin property. Mine is a queueing lock, so I'll continue to call it queued spinlocks (not that the terminology will make it into our API anyway, because I intend for them simply to be an implementation of spinlocks). Secondly, as you say, local spin isn't really a must have. If SGI hasn't made a big stink about them by now, then I think queued locks (which are addressing a real hardware starvation issue on opterons) is more important. That isn't to say that local spin might not help performance, or that my queued spinlocks would make it impossible to implement... it's just that it isn't my aim. -
The method you propose is otherwise called "Ticket Lock": http://en.wikipedia.org/wiki/Ticket_lock http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#ticket - Davide -
That this work prio-art dates to 1991: http://www.cs.rochester.edu/u/scott/papers/1991_TOCS_synch.pdf So I would not worry to much about patents here. At least W2K MS ones ;) What I would worry though, is to add another class of locks. There's no reason why Ticket Locks would perform worse than our spinlock, in both contended and not-contended case, AFAICS. And they have a nice FIFO behaviour. - Davide -
No, as you see from my patch I just change the spinlock implementation In most cases, no. For the uncontended case they should be about the same. They have the same spinning behaviour. However there is a little window where they might be a bit slower I think... actually perhaps I'm wrong! Currently if you have 4 CPUs spinning and the lock is released, all 4 CPU cachelines will be invalidated, then they will be loaded again, and found to be 0, so they all try to atomic_dec_return the counter, each one invalidating others' cachelines. 1 gets through. With my queued locks, all 4 cachelines are invalidated and loaded, but only one will be allowed to proceed, and there are 0 atomic operations or stores of any kind. So I take that back: our current spinlocks have a worse thundering herd behaviour under contention than my queued ones. So I'll definitely push the patch through. -
OK, it isn't a big difference, but a user-space test is showing slightly
(~2%) improvement in the contended case on a 16 core Opteron.
There is a case where the present spinlocks are almost twice as fast on
this machine (in terms of aggregate throughput), and that is when a lock
is taken right after it is released. This is because the same CPU will
often be able to retake the lock without transitioning the cache. This is
going to be a rare case for us, and would suggest suboptimal code anyway
(ie. the lock should just be kept rather than dropped and retaken).
Actually, one situation where it comes up is when we drop and retake a
lock that needs_lockbreak. Of course, the queued lock behaviour is
desired in that case anyway.
However single-thread performance is presently a bit down. OTOH, the
assembly generated by gcc looks like it could be improved upon (even by
me :P).
This is what I've got so far. Should work for i386 and x86_64. Any
enhancements or results from other CPUs would be interesting.
--
#define _GNU_SOURCE
#include <stdlib.h>
#include <assert.h>
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <sys/time.h>
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
static inline void cpu_relax(void)
{
asm volatile("rep ; nop" : : : "memory");
}
static unsigned long delta(struct timeval start, struct timeval end)
{
return (end.tv_sec - start.tv_sec) * 1000000
+ end.tv_usec - start.tv_usec;
}
static unsigned long lps;
static void calibrate_delay(void)
{
struct timeval start, now;
unsigned long i;
gettimeofday(&start, NULL);
i = 0;
gettimeofday(&start, NULL);
do {
gettimeofday(&now, NULL);
i++;
} while (delta(start, now) < 1000000 * 2);
lps = i / 2;
printf("%lu lps\n", lps);
}
static void delay(unsigned long usec)
{
struct timeval now;
unsigned long loops = (lps*usec + 999999) / 1000000;
unsigned long i;
for (i = 0; i < loops; i++)
gettimeofday(&now, ...I slightly modified it to use cycles: http://www.xmailserver.org/qspins.c Here (Dual Opteron 252) queued locks (ticklocks) are about 10% slower in both cases. This is really a microbench, and assembly matter a lot. I did not have time to look at the generated one yet, but optimizing branches can help in those cases. - Davide -
Slightly more than slightly ;) You want to have a delay _outside_ the critical section as well, for multi-thread tests, otherwise the releasing CPU often just retakes the lock (in the unqueued lock case). As I said, most kernel code -
Yeah. ATM it mostly does double-takes. - Davide -
Yes, a ticket based FIFO queue isn't new... I think we have a lot of xamples already in the kernel. Using them to implement queue locks obviously isn't new either. I don't think we'd have to worry about patents. -
The contended case matters. Back in 2.5.something I screwed up the debug version of one of the locks (rwlock, iirc) - it was simply missing a It looks like a good way to address the lru_lock starvation/capture problem. But I think I'd be more comfortable if we were to introduce it as a new lock type, rather than as a reimplementation of the existing spin_lock(). Initially, at least. -
Do you have a reference? rwlocks are a bit funny, because if they are found to be useful (that is, they get used somewhere), then it indicates there can be situations with a lot of contention and spinning. Wheras we usually prefer not to use spinlocks in situations like that. Not that I'm claiming the contended case doesn't matter, but I think problems there indicate a bug (and I think rwlocks are almost always questionable). I'd hate to have a proliferation of lock types though. I think my queued spinlock addresses a real hardware limitation of some systems. In situations where contention isn't a problem, then queued locks won't cause a slowdown. In situations where it is, starvation could also be a problem. -
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismat |
