Re: [PATCH 0/3] 64-bit futexes: Intro

Previous thread: [PATCH 3/3] 64-bit futexes: x86 support by Ulrich Drepper on Friday, May 30, 2008 - 6:27 pm. (10 messages)

Next thread: [PATCH 1/3] 64-bit futexes: generic code by Ulrich Drepper on Friday, May 30, 2008 - 6:27 pm. (1 message)
From: Ulrich Drepper
Date: Friday, May 30, 2008 - 6:27 pm

This patch series adds support for 64-bit futexes.  The current futexes
only protect 32-bits.  I don't know why, ask the original authors.  It is
unnecessarily limiting, though, especially for 64-bit machines.

To understand the problem let me just say that futexes work by storing
a program-defined state in a variable.  Threads then can wait for the
state of the variable to change.  It all works dandy if the protocol for
using futexes is followed correctly; see

	http://people.redhat.com/drepper/futexes.pdf

For mutexes, semaphores etc this is quite easy.  For mutexes, for instance,
we have as state to protect a flag whether the mutex is locked and whether
there is any thread waiting for the release.  This can be done easily with
the available 32 bits in a futex.

The situation is different for more complex synchronization objects.  The
best example are reader/writer locks.  Here the state consists at least
of

- number of readers which locked the rwlock
- flag whether rwlock is locked for reader, writing, or not at all
- number of readers waiting
- number of writers waiting

I.e., rwlocks are significantly more complicated.  Much of the complication
stems from the requirement to have to types: those which prefer readers and
those which prefer writers.

Without imposing unreasonable limitations on the number of concurrent users
a rwlock can have the state cannot be represented in a single 32-bit variable.
This is why the current rwlock implementation uses an internal lock and
spreads the state out in several variables.

This works, but there is a significant performance penalty to be paid:

- at least two atomic operations to lock and unlock the internal lock
- multiple readers are not really able to run concurrently since they
  might block on the internal lock
- following from the last point, the behaviour of the code is less
  predictable due to scheduling artifacts


The logical solution is to extend the size of the state which can be
protected by allowing ...
From: Linus Torvalds
Date: Friday, May 30, 2008 - 7:13 pm

I call bull on that.

We've had a 32-bit rwsemaphore in the kernel for a *long* time.

I think it's totally stupid for glibc to even *try* to do something like 
this, since almost nobody will have a kernel with 64-bit futex support for 
a long time in the wild anyway. So you need to support a 32-bit semaphore 
in practice, and it's been done before.

The x86 kernel rwsem implementation may limit things to 64k readers (I'm 
not even sure that's true, I'm not going to look at the code again), but 
if I recall correctly that's just because we wanted to use a single "xadd" 
in the hotpath, instead of doign a load a cmpxchg.

I really object to adding another 32/64-bit difference just for something 
like an rwsemaphore. It needs a whole lot of stronger reasons than that.

I suspect the old rwlocks are simply just stupid. You can do those things 
with a single 32-bit locked op. Have you tried
 - 29 bits of reader counts
 - 1 bit of uncontended writer
 - 1 bit of "contention" (ie a mark for requiring fairness)
 - 1 bit for a spinlock so that you can do all the fairness without doing 
   any extra locked ops
or similar?

			Linus
--

From: Ulrich Drepper
Date: Friday, May 30, 2008 - 8:14 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


You misread what I wrote.

Semaphores are fine.  The problem are reader/writer locks where we need
three counters and some flags.  It is absolutely not acceptable to limit

Again, misread.  This is not functionality which is not available.
Semaphores are *of course* fine with 32 bits.  And there is a
reader/writer lock implementation.  It is just very slow compared to
what is possible.  The transition between systems miss the 64-bit
support and those which have it is completely transparent.  The same
glibc binary will work for both.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkhAwpsACgkQ2ijCOnn/RHS7PACdGvdgI9pv+IcsSPkXPVwFJ5ZF
YNQAn01fm1wEc4EI/fwOVuM57XtCy9Dq
=LcpJ
-----END PGP SIGNATURE-----
--

From: Linus Torvalds
Date: Friday, May 30, 2008 - 8:44 pm

No, you do.

The kernel calls reader-writer locks rwsemaphores.

		Linus
--

From: Ulrich Drepper
Date: Friday, May 30, 2008 - 9:04 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Well, then you don't understand the complexity of the required interface
together with the performance implications.  Take your proposed allocation:

 - 29 bits of reader counts
 - 1 bit of uncontended writer
 - 1 bit of "contention" (ie a mark for requiring fairness)
 - 1 bit for a spinlock so that you can do all the fairness without
   doing any extra locked ops


Ask yourself this:

- - How do you know when there is no more writer waiting?  You cannot
  reset a "writer waiting" bit after you wake up one writer without
  waking every single thread waiting for the rwlock and letting them
  fight for it

- - How do you handle the difference between reader-preferred rwlocks
  and writer-preferred rwlocks?  In the latter, if a rwlock is locked
  for reading, future readers must be delayed until all writers are
  gone

- - How do you do the accounting for the *timedlock variants?  In the
  case of those a functions, if the threads wake due to a timeout,
  you have the decrement the waiter count.  But you have only one bit...


I don't say you cannot implement rwlocks this way.  Sure, it definitely
is possible.  But this implementation would in the contended case like
10x as slow as the current code because you constantly have to wake up
every single thread.

If you want, I'll give you a libpthread.so with the new code.  Then you
can test your code.  I bet you whatever you want that you cannot achieve
the performance with your puny 32-bit futex without limiting the number
of threads to a ridiculously low number.  The implementation I have
allows for 2^23 (possibly recursively locked) readers, 2^18 readers or
writers waiting.

When I was writing "you need more than 32 bits" I didn't even imagine
that somebody would suggest using such a primitive scheme which cannot
possibly work efficiently in a userlevel implementation.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP ...
From: Linus Torvalds
Date: Friday, May 30, 2008 - 9:16 pm

How about you post your code. 


Sure you can. By looking at the *other*data* that isn't atomic.

So when doing a "write_unlock()" (or whatever you call it - for the kernel 
calls it "up_write()") you can look at the non-atomic write counts to 
decide whether there are others.

Also note how you can use 64-bit atomic ops to do that all in user space, 
without actually requiring 64-bit futex support - as long as the bits that 
matter for waking (like "was there more than one pending writer") fit in 

Again, that's not something that needs to be in the *atomic* part.

It's unquestionable that a rwlock is more than 32 bits, but what I 

Uli, you're not even trying any more.

NO, you don't have just one bit. You have as many bits as you want. It's 
just that only 32 of the bits will be relevant for FUTEX_WAIT_OP etc.

			Linus
--

From: Linus Torvalds
Date: Friday, May 30, 2008 - 9:23 pm

So here's an example of this:

 - make all the readers/writers actually update a 64-bit word (by using 
   cmpxchg8b in user space to actually get the locks)

 - but organize things so that a reader only needs to look at the high 32 
   bits to actually make it's wakeup-decision, and a sleeping writer only 
   needs to look at the low 32 bits. How? Make the low bits of the words 
   contain the "contention status".

Is it possible? I dunno. I personally suspect it is. I also suspect you 
didn't even try.

			Linus
--

From: Ulrich Drepper
Date: Friday, May 30, 2008 - 9:38 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Once again not without limitations which are too low.  The main problem
is that there is far more information needed on the reader side.  There
are two counters and a number of bits for state information.  Which
means the counters can be at most 14 bits in size.

A counter this low also means I cannot unconditionally increment them
(since there can be more than 2^14 threads sharing one page).  This
means you get in the cache line race I described in the last mail.  You
first have to load a word and then perform a the write operation while

I have to disappoint you.  I thought of that and had to reject it.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkhA1j8ACgkQ2ijCOnn/RHTUXQCgnkUy6PNSuUI4O6BD4abxqqKr
PzMAn2qcvrXGM4qrhOytk1Yie0BsMeVa
=zvxZ
-----END PGP SIGNATURE-----
--

From: Linus Torvalds
Date: Friday, May 30, 2008 - 9:58 pm

Uli, you're NOT EVEN READING!

Go back and read it, dammit! How many times have I said that you can even 
have more than 32 bits of data? 

There's nothing wrong with using 64-bit atomics to update the words. All 
I've *ever* questioned is whether you need more than 32 bits for the 
FUTEX part.

For example, let's say that you need 4 bits of "state" information (ie 
"next woken-up one is a reader"). And yes, you need counts for readers and 
writers. Maybe we can make those counts be 30 bits each. 64 bits total 
space. But does the FUTEX_OP things actually need to access all 64 bits?

So imagine that you bave a 64-bit atomic op, and bits 0-1 are the reader 
state bits, with bits 2-31 being the reader count. Bits 32-33 are the 
writer state bits, and bits 34-63 are the writer count.

Now you can use 64-bit atomic ops to update them, but perhaps just use a 
32-bit FUTEX op to actually wait as a reader on the low 32 bits that are 
the reader state bits. See?

THIS is what I've been trying to tell you. It's quite possible that the 
algorithm you already have could work with just moving the bits around so 
that each FUTEX op can be done on just one of the words.

And notice my use of the word "possible". 

The kernel uses a 32-bit count word, but has various other state too that 
it maintains under a separate locking setup. Same basic idea: the 32-bit 
word is "special" (because it's the one that has to have the right format 
for the 'xadd' op that we use for the fast path), but it's not the _only_ 
data in the thing. The other data is also accessed, it's just accessed 
according to different principles.

			Linus
--

From: Linus Torvalds
Date: Saturday, May 31, 2008 - 3:25 pm

Ok, so I obviously think you're simply wrong, and I've told you so both in 
private after seeing the 64-bit code you propose (and which you want to 
patent), and publicly.

So let's have code talk. I'm going to give real code for the only case 
that really matters, namely the uncontended case.

Do you agree that once you get contention, and you actually have to do a 
FUTEX system call to sort it out, the number of atomic instructions do not 
really matter? Because the system call is going to dominate. Agreed?

So I'm not going to write out the slow contention case, because you have 
already agreed that you can do a 32-bit rwlock _slowly_ with the existing 
32-bit FUTEX support.

IOW, I'm faking it, but I'm making a point. Namely that you can 
efficiently do read-write lock using *only* 32-bit ops, and without any 
real kind of limitations on the number of readers and writers.

So here goes the explanation and the pseudo-code.

 - have two levels of locking: the contended case, and the uncontended case

 - these are two *separate* locks.  I'm going to describe the only case
   that mattes for performance, namely the non-contention one.  We just
   make sure that the non-contention lock "gates" things to the slow case.

 - in other words, here's a pretty optimal fast-case that we write in
   assembly language, and all it does is to handle the non-contention 
   case, with any waiting happening on *another* lock entirely.

 - on that first-level lock, the setup is as follows:
	- the low bit (bit#0) is "writer holds"
	- the next bit (bit#1) is "contended", and all it does is that it 
	  guarantees that when set, everybody will go to the slow-path.
	- bits 2-31 are "reader count"

 - the second-level lock is another 32-bit lock that is *adjacent* in
   memory, and they are 64-bit aligned, so that you can do an atomic 
   initialization and more importantly a 64-bit "it is now no longer
   contended" assignment in one atomic op.  That's going to be relevant
   only for ...
From: Linus Torvalds
Date: Saturday, May 31, 2008 - 3:32 pm

Btw, that comment is bogus, ignore it. It comes from an earlier algorithm 
I tried that also explained the contention case, but got unnecessarily 
complex. Separating out the queuing behaviour from the actual 
non-contended behaviour made things much simpler to explain, so I stopped 
even bothering with a "both" case.

Too much cut-and-paste, IOW.

		Linus
--

From: Ingo Molnar
Date: Monday, June 2, 2008 - 11:54 am

i suspect _any_ abstract locking functionality around a data structure 
can be implemented via atomic control over just a single user-space bit.

That bit can be used as a lock and if all access to the state of that 
atomic variable uses it, arbitrary higher-order atomic state transitions 
can be derived from it. The cost would be a bit more instructions in the 
fastpath, but there would still only be a single atomic op (the acquire 
op), as the unlock would be a natural barrier (on x86 at least).

Concurrency (and scheduling) of that lock would still be exactly the 
same as with genuine 64-bit (or even larger) atomic ops, and the 
fastpath would be very close as well.

	Ingo
--

From: Linus Torvalds
Date: Monday, June 2, 2008 - 1:22 pm

[ I added Nick and DavidH to the Cc, since they have at least historically 
  shown interest in locking algorithms ]


Well, theory is theory, and practice is different.

That's *especially* true when it comes to locking, which is just so 
tightly coupled to the exact details of which atomics are fast on a 
particular architecture.

Also, different people will want to see different performance profiles. 

For example, it makes a *huge* difference whether you are strictly fair or 
not. I can almost guarantee that a 100% fair implementation may be really 
"nice" from a theoretical standpoint, but suck really badly in practice, 
because if you want best performance, then you want the lock to have a 
very strong CPU affinity - and the faster you can do your lock ops, the 
more of a unfairness and CPU affinity they get!

And rwlocks in particular are actually much more interesting in this 
respect, because they not only have that CPU affinity fairness, but also 
the reader-vs-writer fairness. You optimize for one load, and it may give 
you almost perfect performance, but at the cost of sucking at another 
load.

For example, some loads are almost entirely read-read locks, with only 
very occasional write locks for some uncommon config change thing. Do you 
want to optimize for that? Maybe. And yes, you can make those uncontended 
read-read locks go really quickly, but then (especially if you continue to 
let reads through even when writers want to contend), that can slow down 
writers a *lot*, to the point of starvation.

Different CPU's will also show different patterns. 

Anyway, I was busy most of the weekend, but I've now coded up a partial 
actual example implementation. Its' probably buggy. Uli is very correct in 
saying that it's easy to screw up futex'es, but even in the absense of 
futexes, it's just easy to screw up any threaded logic.

But if anybody wants to look at a rough draft that at least limps along 
_partially_, ...
From: Ingo Molnar
Date: Monday, June 2, 2008 - 4:03 pm

yeah, indeed. Compared to all the other costs that have to be dealt with 
here, having a second atomic op isnt all that much of an issue either, 
especially on latest hw. An atomic op will probably never be as cheap as 
a non-atomic op, but ~20 cycles is still plenty fast for most practical 
purposes.

	Ingo
--

From: Nick Piggin
Date: Monday, June 2, 2008 - 8:24 pm

I think optimised our unlock_page in a way that it can do a
non-atomic unlock in the fastpath (no waiters) using 2 bits. In
practice it was still atomic but only because other page flags
operations could operate on ->flags at the same time.

I still have to get around to submitting the damn thing upstream,
but maybe if I bring it up here, I get the idea more reviewed :)

Anyway, the algorithm goes like this:

lock_page()
{
  if (test_and_set_bit_lock(PG_locked))
    lock_page_slow();
}

lock_page_slow()
{
  /* slowpath */
again:
  prepare_to_wait();
  set_bit(PG_waiters);
  if (test_and_set_bit_lock(PG_locked)) {
    schedule();
    goto again;
  }
}

wake_page_waiters()
{
  /* slowpath */
  clear_bit(PG_waiters);
  smp_mb__after_clear_bit(); /* order clear_bit store with wake_up_page load */
  wake_up_page(PG_locked);
}

unlock_page()
{
  clear_bit_unlock(PG_locked);
  if (unlikely(test_bit(PG_waiters)))
    wake_page_waiters();
}

We don't require any load/store barrier in the unlock_page fastpath
because the bits are in the same word, so cache coherency gives us a
sequential ordering anyway.

Now you notice the lock page slowpath can set bits without holding the
lock, at first glance you'd think the clear_bit to unlock has to be
atomic too then. But actually if we're careful, we can put them in seperate
parts of the word and use the sub-word operations on x86 to avoid the atomic
requirement. I'm not aware of any architecture in which operations to the
same word could be out of order.

Not only does this avoid all barriers (except acquire/release) in the
fastpaths, but it also avoids the unconditional load of the random
hash page waitqueue cacheline on unlock.

Anything applicable to your problem at hand?

--

From: Linus Torvalds
Date: Wednesday, June 4, 2008 - 12:57 pm

Yes and no.

Yes, the bits are int he same word, so cache coherency guarantees a lot.

HOWEVER. If you do the sub-word write using a regular store, you are now 
invoking the _one_ non-coherent part of the x86 memory pipeline: the store 
buffer. Normal stores can (and will) be forwarded to subsequent loads from 
the store buffer, and they are not strongly ordered wrt cache coherency 
while they are buffered.

IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other 
stores, but loads are *not* ordered wrt other stores in the absense of a 
serializing instruction, and it's exactly because of the write buffer.


See above. The above is unsafe, because if you do a regular store to a 
partial word, with no serializing instructions between that and a 
subsequent load of the whole word, the value of the store can be bypassed 
from the store buffer, and the load from the other part of the word can be 
carried out _before_ the store has actually gotten that cacheline 
exclusively!

So when you do

	movb reg,(byteptr)
	movl (byteptr),reg

you may actually get old data in the upper 24 bits, along with new data in 
the lower 8.

I think.

Anyway, be careful. The cacheline itself will always be coherent, but the 
store buffer is not going to be part of the coherency rules, and without 
serialization (or locked ops), you _are_ going to invoke the store buffer!

		Linus
--

From: Linus Torvalds
Date: Wednesday, June 4, 2008 - 1:38 pm

Put another way: the CPU may internally effectively rewrite the two 
instructions as

	movb reg,tmpreg		(tmp = writebuffer)
	movl (byteptr),reg	(do the 32-bit read of the old cached contents)
	movb tmpreg,reg		(writebuffer snoop by reads)
	movb tmpreg,(byteptr)	(writebuffer actually drains into cacheline)

and *if* your algorithm is robust wrt these kinds of rewrites, you're ok. 
But notice how there are two accesses to the cacheline, and how they are 
actually in the "wrong" order, and how the cacheline could have been 
updated by another CPU in between.

Does this actually happen? Yeah, I do believe it does. Is it a deathknell 
for anybody trying to be clever with overlapping reads/writes? No, you can 
still have things like causality rules that guarantee that your algorithm 
is perfectly stable in the face of these kinds of reordering. But it *is* 
one of the few re-orderings that I think is theoretically archtiecturally 
visible.

For example, let's start out with a 32-bit word that contains zero, and 
three CPU's. One CPU does

	cmpxchgl 0->0x01010101,mem

another one does

	cmpxchlg 0x01010101->0x02020202,mem

and the third one does that

	movb $0x03,mem
	movl mem,reg

and after it all completed, you may have 0x02020203 in memory, but "reg" 
on the third CPU contains 0x01010103.

Note how NO OTHER CPU could _possibly_ have seen that value! That value 
never ever existed in any caches. If the final value was 0x02020203, then 
both the cmpxchgl's must have worked, so the cache coherency contents 
*must* have gone from 0 -> 0x01010101 -> 0x02020202 -> 0x02020203 (with 
the movb actually getting the exclusive cache access last).

So the third CPU saw a value for that load that actually *never* existed 
in any cache-line: 0x01010103.  Exactly because the x86 memory ordering 
allows the store buffer contents to be forwarded within a CPU core.

And this is why atomic locked instructions are special. They bypass the 
store buffer (or at least they _act_ as ...
From: Nick Piggin
Date: Wednesday, June 4, 2008 - 6:56 pm

I see what you're saying, but I hadn't really considered it before.
Not that I've attempted to do much of these sub-word operations
without consulting you first (which brings us here) ;).

I'd have thought that for a case like this, you'd simply hit the store
alias logic and store forwarding would stall because it doesn't have
the full data.

I'd like to know for sure.

The other thing that could be possible, and I'd imagine maybe more likely
to be implemented in a real CPU because it should give more imrpovement
(and which does break my algorithm) is just that the load to the cacheline
may get to execute first, then if the cacheline gets evicted and
modified by another CPU before our store completes, we effectively see
store/load reordering again.


--

From: Nick Piggin
Date: Wednesday, June 4, 2008 - 6:58 pm

Well, I should qualify that: it doesn't actually break my algorithm
because you can still implement the unlock without atomic RMWs.
This may require you to have an mfence there, but we can still get
away without atomics (whether that's much cheaper or not, I haven't
measured recently!)
--

From: Linus Torvalds
Date: Wednesday, June 4, 2008 - 8:08 pm

That's _one_ possible implementation. 

Quite frankly, I think it's the less likely one. It's much more likely 
that the cache read access and the store buffer probe happen in parallel 
(this is a really important hotpath for any CPU, but even more so x86 
where there are more of loads and stores that are spills). And then the 
store buffer logic would return the data and a bytemask mask (where the 
mask would be all zeroes for a miss), and the returned value is just the 

You'd have to ask somebody very knowledgeable inside Intel and AMD, and it 
is quite likely that different microarchitectures have different 

Oh, absolutely, the perfect algorithm would actually get the right answer 
and notice that the cacheline got evicted, and retried the whole sequence 
such that it is coherent. 

But we do know that Intel expressly documents loads and stores to pass 
each other and documents the fact that the store buffer is there. So I bet 
that this is visible in some micro-architecture, even if it's not 
necessarily visible in _all_ of them.

The recent Intel memory ordering whitepaper makes it very clear that loads 
can pass earlier stores and in particular that the store buffer allows 
intra-processor forwarding to subsequent loads (2.4 in their whitepaper). 
It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure 
it isn't. The store->load forwarding is such a critical performance issue 
that I can pretty much guarantee that it doesn't always hit the cacheline.

Of course, the partial store forwarding case is not nearly as important, 
and stalling is quite a reasonable implementation approach. I just 
personally suspect that doing the unconditional byte-masking is actually 
_simpler_ to implement than the stall, so..

			Linus
--

From: Nick Piggin
Date: Wednesday, June 4, 2008 - 9:29 pm

Well, it would just be nice to hear a "no we'll never do that", "we

Well I have a simple test case to show loads pass earlier non conflicting
stores in the case that loads do not come from the store buffer (ie.
*inter* processor forwarding).

And store forwarding, by definition means that the load can complete before
the store can possibly be visible to another CPU I'd say. So yes, I'm
sure this does happen too.

--

From: Nick Piggin
Date: Wednesday, June 4, 2008 - 6:45 pm

I'd be very surprised if that was the case. But the unlock code needn't
do that anyway. It could do

movb reg,(byteptr)   # clear PG_locked
movb (byteptr+1),reg # load PG_waiters

--

From: Nick Piggin
Date: Thursday, June 5, 2008 - 6:27 pm

Had a bit of a look though this, seems pretty OK to me. Obviously with
rwlocks it *is*  impossible to do non-atomic unlocks, so I can't see
a way to significantly improve your code there.

What you *could* maybe do, to slightly speed up the reader fastpath, at
the expense of the writer fastpath, is to also have the active writer add
4 to the count too, so your unlock can start with a lock xadd -4, count
in order to get the write-intent on the cacheline straight up.

Though that's a pretty tiny "optmisation", and not going to be an amazing
win, even when it does save a state transition on the cacheline...

I'd be more interested to know why this code can't be evolved into a full
rwlock implementation? This is a rather standard (though neat) looking rwlock
-- so my question is what can the patented 64-bit futex locks do that this
can't, or what can they do faster?
--

From: Linus Torvalds
Date: Thursday, June 5, 2008 - 8:37 pm

Yes, nice idea. It avoids the possible unnecessary S->M transition, but 
the downside is that it effectively slows down the write unlock by making 
it do two atomic ops even for the fastpath. So if I were to _only_ care 
about the reader path, I think it would be a great idea, but as it is, the 
current non-contended write case is actually pretty close to optimal, and 

Quite frankly - and this was my argument the whole time - I do not believe 
that a "full" 64-bit implementation can do _anything_ more than this 
32-bit one can do.  That's the reason I wrote the code. I'm pretty sure 
that you can do perfectly well with just 32 bit atomic futex ops (my 
rwlocks obviously do use 64-bit cmpxchg's in user space, but not in the 
fast-path, and it should work fine on x86-32 too using cmpxchg8b).

In fact, in the second version of my locks, I didn't do futex operations 
on the actual lock itself at *all*, and just do futex ops on separate 
"event" fields. That actually should have avoided a bug I think I had (but 
couldn't really trigger in practice) in the first version, and made 
everything look pretty straightforward.

I haven't really worked on them since I write the thing, but I did 
consider things like timeouts etc. Timeouts are "hard" to handle because 
they mean that you cannot use any kind of trivially incrementing "ticket 
locks" with sequence numbers (because we may have to just avoid a sequence 
if it times out), so the sequence number approach that we now use for 
kernel spinlocks was not an option. I didn't actually *write* the timeout 
versions, of course, but given the structure of the locks they really 
should be very straightforward.

[ Half-way subtle thing: a writer that times out needs to be very careful 
  that it doesn't lose a wakeup event, but futexes actually make that part 
  pretty easy - since FUTEX_WAIT returns whether you got woken up or not, 
  you can just decide to wake up the next write-waiter if you cannot get 
  the lock immediately and ...
From: Nick Piggin
Date: Friday, June 6, 2008 - 4:53 am

Yeah, it is a case of a large slowdown for write for a small speedup
for read (pity the API doesn't have explicit read and write unlocks
-- were they too lazy to type the last bit, or did they expect people
to lose track of whether they had a read or write lock? :P)

Anyway, it's obviously a tradeoff you'd just have to carefully

Well... a single lock is only going to be so scalable. I don't see how
it could be done really significantly better? Maybe a small factor of
improvement if you were to concentrate on the contended case (but you
wouldn't want to do that anyway)



--

From: Linus Torvalds
Date: Friday, June 6, 2008 - 8:01 am

My worry is that I did something wrong in the slowpath, and that there is 
some thundering-herd-wakeup kind of thing that makes that much slower than 
it should be.

The slow path doesn't much matter from the angle of counting individual 
cycles, but it still matters very much from a bigger picture. Does it have 
bad behavior where we wake up a thousand readers, but then a writer gets 
to come in first and all the readers have to go to sleep again?

That's one thing I tried to avoid in the second version (the "Another 
approch" commit) where a read-wakeup actually moves the readers from the 
pending count to the final count - both to get more fairness (which can be 
_bad_ for performance), but also because I think it can avoid pathological 
cases (reader starvation and unnecessary futex wakeup).

		Linus
--

Previous thread: [PATCH 3/3] 64-bit futexes: x86 support by Ulrich Drepper on Friday, May 30, 2008 - 6:27 pm. (10 messages)

Next thread: [PATCH 1/3] 64-bit futexes: generic code by Ulrich Drepper on Friday, May 30, 2008 - 6:27 pm. (1 message)