Re: [rfc][patch] queued spinlocks (i386)

Previous thread: RSDL cpu scheduler v 0.33 by Con Kolivas on Friday, March 23, 2007 - 2:05 am. (3 messages)

Next thread: Re: [PATCH] clockevents: Fix suspend/resume to disk hangs by Marcus Better on Friday, March 23, 2007 - 2:14 am. (4 messages)
From: Nick Piggin
Date: Friday, March 23, 2007 - 1:59 am

Implement queued spinlocks for i386. This shouldn't increase the size of
the spinlock structure, while still able to handle 2^16 CPUs.

Not completely implemented with assembly yet, to make the algorithm a bit
clearer.

The queued spinlock has 2 fields, a head and a tail, which are indexes
into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an
"atomic_inc_return" on the head index, and keeps the returned value as
a ticket. The CPU then spins until the tail index is equal to that
ticket.

To unlock a spinlock, the tail index is incremented (this can be non
atomic, because only the lock owner will modify tail).

Implementation inefficiencies aside, this change should have little
effect on performance for uncontended locks, but will have quite a
large cost for highly contended locks [O(N) cacheline transfers vs
O(1) per lock aquisition, where N is the number of CPUs contending].
The benefit is is that contended locks will not cause any starvation.

Just an idea. Big NUMA hardware seems to have fairness logic that
prevents starvation for the regular spinlock logic. But it might be
interesting for -rt kernel or systems with starvation issues. 


Index: linux-2.6/include/asm-i386/spinlock.h
===================================================================
--- linux-2.6.orig/include/asm-i386/spinlock.h
+++ linux-2.6/include/asm-i386/spinlock.h
@@ -29,69 +29,23 @@
 
 static inline int __raw_spin_is_locked(raw_spinlock_t *x)
 {
-	return *(volatile signed char *)(&(x)->slock) <= 0;
+	return unlikely(x->qhead != x->qtail);
 }
 
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
-	asm volatile("\n1:\t"
-		     LOCK_PREFIX " ; decb %0\n\t"
-		     "jns 3f\n"
-		     "2:\t"
-		     "rep;nop\n\t"
-		     "cmpb $0,%0\n\t"
-		     "jle 2b\n\t"
-		     "jmp 1b\n"
-		     "3:\n\t"
-		     : "+m" (lock->slock) : : "memory");
-}
+	unsigned short pos = 1;
 
-/*
- * It is easier for the lock validator if interrupts are not re-enabled
- * in the middle ...
From: Eric Dumazet
Date: Friday, March 23, 2007 - 2:40 am

On Fri, 23 Mar 2007 09:59:11 +0100

It's a very nice idea Nick.

You also have for free the number or cpus that are before you.

On big SMP/NUMA, we could use this information to call a special lock_cpu_relax() function to avoid cacheline transferts.

	asm volatile(LOCK_PREFIX "xaddw %0, %1\n\t"
		     : "+r" (pos), "+m" (lock->qhead) : : "memory");
	for (;;) {
		unsigned short nwait = pos - lock->qtail;
		if (likely(nwait == 0))
			break;
		lock_cpu_relax(lock, nwait);
	}

lock_cpu_relax(raw_spinlock_t *lock, unsigned int nwait)
{
unsigned int cycles = nwait * lock->min_cycles_per_round;
busy_loop(cycles);
}
-

From: Nick Piggin
Date: Friday, March 23, 2007 - 2:59 am

Yeah that's true, it would give you a lot more information for use in a
lock backoff system, so that might counter the lesser efficiency of queued
locks under contention.

You could perhaps even wait using monitor/mwait on the lock cacheline
if there are more than X CPUs in front of you. I'm not sure if the cost of
mwait is entirely within the core that executes it, or whether it slows
down cache coherency as well (which would make that idea a showstopper).

-

From: Ravikiran G Thirumalai
Date: Friday, March 23, 2007 - 12:27 pm

Good Idea.  Hopefully, this should reduce the number of cacheline transfers
in the contended case.
-

From: Ingo Molnar
Date: Friday, March 23, 2007 - 3:04 am

isnt this patented by MS? (which might not worry you SuSE/Novell guys, 
but it might be a worry for the rest of the world ;-)

	Ingo
-

From: Nick Piggin
Date: Friday, March 23, 2007 - 3:10 am

I never thought a FIFO queue would be patentable.

Do you have a reference to the patent number?
-

From: Davide Libenzi
Date: Friday, March 23, 2007 - 11:15 am

FWIW, MS patented even the preprocessor in computer languages ;)
Holding a patent does not make it automatically enforceable.
I remember thare was a discussion time ago about queued spinlocks on lkml.



- Davide


-

From: Nick Piggin
Date: Friday, March 23, 2007 - 3:32 am

Hmm, it looks like they have implemented a system where the spinning
cpu sleeps on a per-CPU variable rather than the lock itself, and
the releasing cpu writes to that variable to wake it.  They do this
so that spinners don't continually perform exclusive->shared
transitions on the lock cacheline. They call these things queued
spinlocks.  They don't seem to be very patent worthy either, but
maybe it is what you're thinking of?

I'm not as concerned about the contended performance of spinlocks
for Linux as MS seems to be for windows (they seem to be very proud
of this lock). Because if it is a big problem then IMO it is a bug.

This was just something I had in mind when the hardware lock
starvation issue came up, so I thought I should quickly code it up
and RFC...  actually it makes contended performance worse, but I'm
not too worried about that because I'm happy I was able to implement
it without increasing data size or number of locked operations.
-

From: Eric Dumazet
Date: Friday, March 23, 2007 - 3:40 am

On Fri, 23 Mar 2007 11:32:44 +0100

Sure, but please note that you should rename your patch to :

-

From: William Lee Irwin III
Date: Friday, March 23, 2007 - 4:02 am

Those exclusive-to-shared transitions are among the cacheline transfers
typically accounted to the algorithms in their complexity analyses.


-- wli
-

From: Nikita Danilov
Date: Saturday, March 24, 2007 - 8:55 am

Nick Piggin writes:
 > On Fri, Mar 23, 2007 at 11:04:18AM +0100, Ingo Molnar wrote:
 > > 
 > > * Nick Piggin <npiggin@suse.de> wrote:
 > > 
 > > > Implement queued spinlocks for i386. [...]
 > > 
 > > isnt this patented by MS? (which might not worry you SuSE/Novell guys, 
 > > but it might be a worry for the rest of the world ;-)
 > 
 > Hmm, it looks like they have implemented a system where the spinning
 > cpu sleeps on a per-CPU variable rather than the lock itself, and
 > the releasing cpu writes to that variable to wake it.  They do this
 > so that spinners don't continually perform exclusive->shared
 > transitions on the lock cacheline. They call these things queued
 > spinlocks.  They don't seem to be very patent worthy either, but

Indeed, this technique is very well known. E.g.,
http://citeseer.ist.psu.edu/anderson01sharedmemory.html has a whole
section (3. Local-spin Algorithms) on them, citing papers from the 1990
onward.

Nikita.

-

From: Ingo Molnar
Date: Saturday, March 24, 2007 - 10:29 am

that is a cool reference! So i'd suggest to do (redo?) the patch based 
on those concepts and that terminology and not use 'queued spinlocks' 
that are commonly associated with MS's stuff. And as a result the 
contended case would be optimized some more via local-spin algorithms. 
(which is not a key thing for us, but which would be nice to have 
nevertheless)

	Ingo
-

From: Nikita Danilov
Date: Saturday, March 24, 2007 - 11:49 am

Ingo Molnar writes:
 > 
 > * Nikita Danilov <nikita@clusterfs.com> wrote:
 > 
 > > Indeed, this technique is very well known. E.g., 
 > > http://citeseer.ist.psu.edu/anderson01sharedmemory.html has a whole 
 > > section (3. Local-spin Algorithms) on them, citing papers from the 
 > > 1990 onward.
 > 
 > that is a cool reference! So i'd suggest to do (redo?) the patch based 
 > on those concepts and that terminology and not use 'queued spinlocks' 

There is some old version:

http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p06-locallock.patch
http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p07-locallock-bkl.patch
http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p08-locallock-zone.patch

http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p0b-atomic_dec_and_lo...
http://namesys.com/pub/misc-patches/unsupported/extra/2004.02.04/p0c-locallock-dcache....

This version retains original spin-lock interface (i.e., no additional
"queue link" pointer is passed to the locking function). As a result,
lock data structure contains an array of NR_CPU counters, so it's only
suitable for global statically allocated locks.

 > that are commonly associated with MS's stuff. And as a result the 
 > contended case would be optimized some more via local-spin algorithms. 
 > (which is not a key thing for us, but which would be nice to have 
 > nevertheless)
 > 
 > 	Ingo

Nikita.

-

From: Nick Piggin
Date: Tuesday, March 27, 2007 - 11:43 pm

Firstly, the terminology in that paper _is_ "queue lock", which isn't
really surprising. I don't really know or care about what MS calls their
locks, but I'd suggest that their queued spinlock is probably named in
reference to its queueing property rather than its local spin property.

Mine is a queueing lock, so I'll continue to call it queued spinlocks
(not that the terminology will make it into our API anyway, because I
intend for them simply to be an implementation of spinlocks).

Secondly, as you say, local spin isn't really a must have. If SGI hasn't
made a big stink about them by now, then I think queued locks (which are
addressing a real hardware starvation issue on opterons) is more important.

That isn't to say that local spin might not help performance, or that
my queued spinlocks would make it impossible to implement... it's just
that it isn't my aim.
-

From: Davide Libenzi
Date: Wednesday, March 28, 2007 - 12:26 pm

From: Davide Libenzi
Date: Wednesday, March 28, 2007 - 3:00 pm

That this work prio-art dates to 1991:

http://www.cs.rochester.edu/u/scott/papers/1991_TOCS_synch.pdf

So I would not worry to much about patents here. At least W2K MS ones ;)
What I would worry though, is to add another class of locks. There's no 
reason why Ticket Locks would perform worse than our spinlock, in both 
contended and not-contended case, AFAICS. And they have a nice FIFO 
behaviour.



- Davide


-

From: Nick Piggin
Date: Wednesday, March 28, 2007 - 6:36 pm

No, as you see from my patch I just change the spinlock implementation

In most cases, no. For the uncontended case they should be about the
same. They have the same spinning behaviour. However there is a little
window where they might be a bit slower I think... actually perhaps I'm
wrong!

Currently if you have 4 CPUs spinning and the lock is released, all 4
CPU cachelines will be invalidated, then they will be loaded again, and
found to be 0, so they all try to atomic_dec_return the counter, each
one invalidating others' cachelines. 1 gets through.

With my queued locks, all 4 cachelines are invalidated and loaded, but
only one will be allowed to proceed, and there are 0 atomic operations
or stores of any kind.

So I take that back: our current spinlocks have a worse thundering herd
behaviour under contention than my queued ones. So I'll definitely
push the patch through.
-

From: Nick Piggin
Date: Thursday, March 29, 2007 - 12:16 am

OK, it isn't a big difference, but a user-space test is showing slightly
(~2%) improvement in the contended case on a 16 core Opteron.

There is a case where the present spinlocks are almost twice as fast on
this machine (in terms of aggregate throughput), and that is when a lock
is taken right after it is released. This is because the same CPU will
often be able to retake the lock without transitioning the cache. This is
going to be a rare case for us, and would suggest suboptimal code anyway 
(ie. the lock should just be kept rather than dropped and retaken).

Actually, one situation where it comes up is when we drop and retake a
lock that needs_lockbreak. Of course, the queued lock behaviour is
desired in that case anyway.

However single-thread performance is presently a bit down. OTOH, the
assembly generated by gcc looks like it could be improved upon (even by
me :P).

This is what I've got so far. Should work for i386 and x86_64. Any
enhancements or results from other CPUs would be interesting.

--
#define _GNU_SOURCE
#include <stdlib.h>
#include <assert.h>
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <sys/time.h>

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)
static inline void cpu_relax(void)
{
	asm volatile("rep ; nop" : : : "memory");
}

static unsigned long delta(struct timeval start, struct timeval end)
{
	return (end.tv_sec - start.tv_sec) * 1000000
		+ end.tv_usec - start.tv_usec;
}

static unsigned long lps;
static void calibrate_delay(void)
{
	struct timeval start, now;
	unsigned long i;
	gettimeofday(&start, NULL);
	i = 0;
	gettimeofday(&start, NULL);
	do {
		gettimeofday(&now, NULL);
		i++;
	} while (delta(start, now) < 1000000 * 2);
	lps = i / 2;
	printf("%lu lps\n", lps);
}

static void delay(unsigned long usec)
{
	struct timeval now;
	unsigned long loops = (lps*usec + 999999) / 1000000;
	unsigned long i;

	for (i = 0; i < loops; i++)
		gettimeofday(&now, ...
From: Davide Libenzi
Date: Thursday, March 29, 2007 - 5:27 pm

I slightly modified it to use cycles:

http://www.xmailserver.org/qspins.c

Here (Dual Opteron 252) queued locks (ticklocks) are about 10% slower in 
both cases. This is really a microbench, and assembly matter a lot. I did 
not have time to look at the generated one yet, but optimizing branches 
can help in those cases.




- Davide


-

From: Nick Piggin
Date: Thursday, March 29, 2007 - 6:59 pm

Slightly more than slightly ;)

You want to have a delay _outside_ the critical section as well, for
multi-thread tests, otherwise the releasing CPU often just retakes
the lock (in the unqueued lock case). As I said, most kernel code

-

From: Davide Libenzi
Date: Thursday, March 29, 2007 - 7:43 pm

Yeah. ATM it mostly does double-takes.



- Davide


-

From: Nick Piggin
Date: Wednesday, March 28, 2007 - 6:24 pm

Yes, a ticket based FIFO queue isn't new... I think we have a lot of
xamples already in the kernel. Using them to implement queue locks
obviously isn't new either.

I don't think we'd have to worry about patents.
-

From: Andrew Morton
Date: Saturday, March 24, 2007 - 2:41 pm

The contended case matters.  Back in 2.5.something I screwed up the debug
version of one of the locks (rwlock, iirc) - it was simply missing a

It looks like a good way to address the lru_lock starvation/capture
problem.  But I think I'd be more comfortable if we were to introduce it as
a new lock type, rather than as a reimplementation of the existing
spin_lock().   Initially, at least.
-

From: Nick Piggin
Date: Tuesday, March 27, 2007 - 11:56 pm

Do you have a reference?

rwlocks are a bit funny, because if they are found to be useful (that
is, they get used somewhere), then it indicates there can be situations
with a lot of contention and spinning.

Wheras we usually prefer not to use spinlocks in situations like that.

Not that I'm claiming the contended case doesn't matter, but I think
problems there indicate a bug (and I think rwlocks are almost always
questionable).


I'd hate to have a proliferation of lock types though. I think my
queued spinlock addresses a real hardware limitation of some systems.

In situations where contention isn't a problem, then queued locks won't
cause a slowdown. In situations where it is, starvation could also be
a problem.
-

Previous thread: RSDL cpu scheduler v 0.33 by Con Kolivas on Friday, March 23, 2007 - 2:05 am. (3 messages)

Next thread: Re: [PATCH] clockevents: Fix suspend/resume to disk hangs by Marcus Better on Friday, March 23, 2007 - 2:14 am. (4 messages)