Re: [PATCH 0/3] 64-bit futexes: Intro

Previous thread: [PATCH 3/3] 64-bit futexes: x86 support by Ulrich Drepper on Friday, May 30, 2008 - 9:27 pm. (10 messages)

Next thread: [PATCH 1/3] 64-bit futexes: generic code by Ulrich Drepper on Friday, May 30, 2008 - 9:27 pm. (1 message)
To: <linux-kernel@...>
Cc: <akpm@...>, <mtk.manpages@...>, <torvalds@...>
Date: Friday, May 30, 2008 - 9:27 pm

This patch series adds support for 64-bit futexes. The current futexes
only protect 32-bits. I don't know why, ask the original authors. It is
unnecessarily limiting, though, especially for 64-bit machines.

To understand the problem let me just say that futexes work by storing
a program-defined state in a variable. Threads then can wait for the
state of the variable to change. It all works dandy if the protocol for
using futexes is followed correctly; see

http://people.redhat.com/drepper/futexes.pdf

For mutexes, semaphores etc this is quite easy. For mutexes, for instance,
we have as state to protect a flag whether the mutex is locked and whether
there is any thread waiting for the release. This can be done easily with
the available 32 bits in a futex.

The situation is different for more complex synchronization objects. The
best example are reader/writer locks. Here the state consists at least
of

- number of readers which locked the rwlock
- flag whether rwlock is locked for reader, writing, or not at all
- number of readers waiting
- number of writers waiting

I.e., rwlocks are significantly more complicated. Much of the complication
stems from the requirement to have to types: those which prefer readers and
those which prefer writers.

Without imposing unreasonable limitations on the number of concurrent users
a rwlock can have the state cannot be represented in a single 32-bit variable.
This is why the current rwlock implementation uses an internal lock and
spreads the state out in several variables.

This works, but there is a significant performance penalty to be paid:

- at least two atomic operations to lock and unlock the internal lock
- multiple readers are not really able to run concurrently since they
might block on the internal lock
- following from the last point, the behaviour of the code is less
predictable due to scheduling artifacts

The logical solution is to extend the size of the state which can be
protected by allowing 64-bit fu...

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Friday, May 30, 2008 - 10:13 pm

I call bull on that.

We've had a 32-bit rwsemaphore in the kernel for a *long* time.

I think it's totally stupid for glibc to even *try* to do something like
this, since almost nobody will have a kernel with 64-bit futex support for
a long time in the wild anyway. So you need to support a 32-bit semaphore
in practice, and it's been done before.

The x86 kernel rwsem implementation may limit things to 64k readers (I'm
not even sure that's true, I'm not going to look at the code again), but
if I recall correctly that's just because we wanted to use a single "xadd"
in the hotpath, instead of doign a load a cmpxchg.

I really object to adding another 32/64-bit difference just for something
like an rwsemaphore. It needs a whole lot of stronger reasons than that.

I suspect the old rwlocks are simply just stupid. You can do those things
with a single 32-bit locked op. Have you tried
- 29 bits of reader counts
- 1 bit of uncontended writer
- 1 bit of "contention" (ie a mark for requiring fairness)
- 1 bit for a spinlock so that you can do all the fairness without doing
any extra locked ops
or similar?

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Friday, May 30, 2008 - 11:14 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

You misread what I wrote.

Semaphores are fine. The problem are reader/writer locks where we need
three counters and some flags. It is absolutely not acceptable to limit

Again, misread. This is not functionality which is not available.
Semaphores are *of course* fine with 32 bits. And there is a
reader/writer lock implementation. It is just very slow compared to
what is possible. The transition between systems miss the 64-bit
support and those which have it is completely transparent. The same
glibc binary will work for both.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkhAwpsACgkQ2ijCOnn/RHS7PACdGvdgI9pv+IcsSPkXPVwFJ5ZF
YNQAn01fm1wEc4EI/fwOVuM57XtCy9Dq
=LcpJ
-----END PGP SIGNATURE-----
--

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Friday, May 30, 2008 - 11:44 pm

No, you do.

The kernel calls reader-writer locks rwsemaphores.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Saturday, May 31, 2008 - 12:04 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Well, then you don't understand the complexity of the required interface
together with the performance implications. Take your proposed allocation:

- 29 bits of reader counts
- 1 bit of uncontended writer
- 1 bit of "contention" (ie a mark for requiring fairness)
- 1 bit for a spinlock so that you can do all the fairness without
doing any extra locked ops

Ask yourself this:

- - How do you know when there is no more writer waiting? You cannot
reset a "writer waiting" bit after you wake up one writer without
waking every single thread waiting for the rwlock and letting them
fight for it

- - How do you handle the difference between reader-preferred rwlocks
and writer-preferred rwlocks? In the latter, if a rwlock is locked
for reading, future readers must be delayed until all writers are
gone

- - How do you do the accounting for the *timedlock variants? In the
case of those a functions, if the threads wake due to a timeout,
you have the decrement the waiter count. But you have only one bit...

I don't say you cannot implement rwlocks this way. Sure, it definitely
is possible. But this implementation would in the contended case like
10x as slow as the current code because you constantly have to wake up
every single thread.

If you want, I'll give you a libpthread.so with the new code. Then you
can test your code. I bet you whatever you want that you cannot achieve
the performance with your puny 32-bit futex without limiting the number
of threads to a ridiculously low number. The implementation I have
allows for 2^23 (possibly recursively locked) readers, 2^18 readers or
writers waiting.

When I was writing "you need more than 32 bits" I didn't even imagine
that somebody would suggest using such a primitive scheme which cannot
possibly work efficiently in a userlevel implementation.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP S...

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Saturday, May 31, 2008 - 12:16 am

How about you post your code.

Sure you can. By looking at the *other*data* that isn't atomic.

So when doing a "write_unlock()" (or whatever you call it - for the kernel
calls it "up_write()") you can look at the non-atomic write counts to
decide whether there are others.

Also note how you can use 64-bit atomic ops to do that all in user space,
without actually requiring 64-bit futex support - as long as the bits that
matter for waking (like "was there more than one pending writer") fit in

Again, that's not something that needs to be in the *atomic* part.

It's unquestionable that a rwlock is more than 32 bits, but what I

Uli, you're not even trying any more.

NO, you don't have just one bit. You have as many bits as you want. It's
just that only 32 of the bits will be relevant for FUTEX_WAIT_OP etc.

Linus
--

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Saturday, May 31, 2008 - 12:23 am

So here's an example of this:

- make all the readers/writers actually update a 64-bit word (by using
cmpxchg8b in user space to actually get the locks)

- but organize things so that a reader only needs to look at the high 32
bits to actually make it's wakeup-decision, and a sleeping writer only
needs to look at the low 32 bits. How? Make the low bits of the words
contain the "contention status".

Is it possible? I dunno. I personally suspect it is. I also suspect you
didn't even try.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Saturday, May 31, 2008 - 12:38 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Once again not without limitations which are too low. The main problem
is that there is far more information needed on the reader side. There
are two counters and a number of bits for state information. Which
means the counters can be at most 14 bits in size.

A counter this low also means I cannot unconditionally increment them
(since there can be more than 2^14 threads sharing one page). This
means you get in the cache line race I described in the last mail. You
first have to load a word and then perform a the write operation while

I have to disappoint you. I thought of that and had to reject it.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkhA1j8ACgkQ2ijCOnn/RHTUXQCgnkUy6PNSuUI4O6BD4abxqqKr
PzMAn2qcvrXGM4qrhOytk1Yie0BsMeVa
=zvxZ
-----END PGP SIGNATURE-----
--

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Saturday, May 31, 2008 - 6:25 pm

Ok, so I obviously think you're simply wrong, and I've told you so both in
private after seeing the 64-bit code you propose (and which you want to
patent), and publicly.

So let's have code talk. I'm going to give real code for the only case
that really matters, namely the uncontended case.

Do you agree that once you get contention, and you actually have to do a
FUTEX system call to sort it out, the number of atomic instructions do not
really matter? Because the system call is going to dominate. Agreed?

So I'm not going to write out the slow contention case, because you have
already agreed that you can do a 32-bit rwlock _slowly_ with the existing
32-bit FUTEX support.

IOW, I'm faking it, but I'm making a point. Namely that you can
efficiently do read-write lock using *only* 32-bit ops, and without any
real kind of limitations on the number of readers and writers.

So here goes the explanation and the pseudo-code.

- have two levels of locking: the contended case, and the uncontended case

- these are two *separate* locks. I'm going to describe the only case
that mattes for performance, namely the non-contention one. We just
make sure that the non-contention lock "gates" things to the slow case.

- in other words, here's a pretty optimal fast-case that we write in
assembly language, and all it does is to handle the non-contention
case, with any waiting happening on *another* lock entirely.

- on that first-level lock, the setup is as follows:
- the low bit (bit#0) is "writer holds"
- the next bit (bit#1) is "contended", and all it does is that it
guarantees that when set, everybody will go to the slow-path.
- bits 2-31 are "reader count"

- the second-level lock is another 32-bit lock that is *adjacent* in
memory, and they are 64-bit aligned, so that you can do an atomic
initialization and more importantly a 64-bit "it is now no longer
contended" assignment in one atomic op. That's going to be relevant
only for the...

To: Linus Torvalds <torvalds@...>
Cc: Ulrich Drepper <drepper@...>, <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Monday, June 2, 2008 - 2:54 pm

i suspect _any_ abstract locking functionality around a data structure
can be implemented via atomic control over just a single user-space bit.

That bit can be used as a lock and if all access to the state of that
atomic variable uses it, arbitrary higher-order atomic state transitions
can be derived from it. The cost would be a bit more instructions in the
fastpath, but there would still only be a single atomic op (the acquire
op), as the unlock would be a natural barrier (on x86 at least).

Concurrency (and scheduling) of that lock would still be exactly the
same as with genuine 64-bit (or even larger) atomic ops, and the
fastpath would be very close as well.

Ingo
--

To: Ingo Molnar <mingo@...>, Nick Piggin <npiggin@...>, David Howells <dhowells@...>
Cc: Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, June 2, 2008 - 4:22 pm

[ I added Nick and DavidH to the Cc, since they have at least historically
shown interest in locking algorithms ]

Well, theory is theory, and practice is different.

That's *especially* true when it comes to locking, which is just so
tightly coupled to the exact details of which atomics are fast on a
particular architecture.

Also, different people will want to see different performance profiles.

For example, it makes a *huge* difference whether you are strictly fair or
not. I can almost guarantee that a 100% fair implementation may be really
"nice" from a theoretical standpoint, but suck really badly in practice,
because if you want best performance, then you want the lock to have a
very strong CPU affinity - and the faster you can do your lock ops, the
more of a unfairness and CPU affinity they get!

And rwlocks in particular are actually much more interesting in this
respect, because they not only have that CPU affinity fairness, but also
the reader-vs-writer fairness. You optimize for one load, and it may give
you almost perfect performance, but at the cost of sucking at another
load.

For example, some loads are almost entirely read-read locks, with only
very occasional write locks for some uncommon config change thing. Do you
want to optimize for that? Maybe. And yes, you can make those uncontended
read-read locks go really quickly, but then (especially if you continue to
let reads through even when writers want to contend), that can slow down
writers a *lot*, to the point of starvation.

Different CPU's will also show different patterns.

Anyway, I was busy most of the weekend, but I've now coded up a partial
actual example implementation. Its' probably buggy. Uli is very correct in
saying that it's easy to screw up futex'es, but even in the absense of
futexes, it's just easy to screw up any threaded logic.

But if anybody wants to look at a rough draft that at least limps along
_partially_, there's

[ message continues ]

" title="http://git.kernel.org/?p=linux/ker...">http://git.kernel.org/?p=linux/ker...

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Thursday, June 5, 2008 - 9:27 pm

Had a bit of a look though this, seems pretty OK to me. Obviously with
rwlocks it *is* impossible to do non-atomic unlocks, so I can't see
a way to significantly improve your code there.

What you *could* maybe do, to slightly speed up the reader fastpath, at
the expense of the writer fastpath, is to also have the active writer add
4 to the count too, so your unlock can start with a lock xadd -4, count
in order to get the write-intent on the cacheline straight up.

Though that's a pretty tiny "optmisation", and not going to be an amazing
win, even when it does save a state transition on the cacheline...

I'd be more interested to know why this code can't be evolved into a full
rwlock implementation? This is a rather standard (though neat) looking rwlock
-- so my question is what can the patented 64-bit futex locks do that this
can't, or what can they do faster?
--

To: Nick Piggin <npiggin@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Thursday, June 5, 2008 - 11:37 pm

Yes, nice idea. It avoids the possible unnecessary S->M transition, but
the downside is that it effectively slows down the write unlock by making
it do two atomic ops even for the fastpath. So if I were to _only_ care
about the reader path, I think it would be a great idea, but as it is, the
current non-contended write case is actually pretty close to optimal, and

Quite frankly - and this was my argument the whole time - I do not believe
that a "full" 64-bit implementation can do _anything_ more than this
32-bit one can do. That's the reason I wrote the code. I'm pretty sure
that you can do perfectly well with just 32 bit atomic futex ops (my
rwlocks obviously do use 64-bit cmpxchg's in user space, but not in the
fast-path, and it should work fine on x86-32 too using cmpxchg8b).

In fact, in the second version of my locks, I didn't do futex operations
on the actual lock itself at *all*, and just do futex ops on separate
"event" fields. That actually should have avoided a bug I think I had (but
couldn't really trigger in practice) in the first version, and made
everything look pretty straightforward.

I haven't really worked on them since I write the thing, but I did
consider things like timeouts etc. Timeouts are "hard" to handle because
they mean that you cannot use any kind of trivially incrementing "ticket
locks" with sequence numbers (because we may have to just avoid a sequence
if it times out), so the sequence number approach that we now use for
kernel spinlocks was not an option. I didn't actually *write* the timeout
versions, of course, but given the structure of the locks they really
should be very straightforward.

[ Half-way subtle thing: a writer that times out needs to be very careful
that it doesn't lose a wakeup event, but futexes actually make that part
pretty easy - since FUTEX_WAIT returns whether you got woken up or not,
you can just decide to wake up the next write-waiter if you cannot get
the lock immediately and have...

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Friday, June 6, 2008 - 7:53 am

Yeah, it is a case of a large slowdown for write for a small speedup
for read (pity the API doesn't have explicit read and write unlocks
-- were they too lazy to type the last bit, or did they expect people
to lose track of whether they had a read or write lock? :P)

Anyway, it's obviously a tradeoff you'd just have to carefully

Well... a single lock is only going to be so scalable. I don't see how
it could be done really significantly better? Maybe a small factor of
improvement if you were to concentrate on the contended case (but you
wouldn't want to do that anyway)

--

To: Nick Piggin <npiggin@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Friday, June 6, 2008 - 11:01 am

My worry is that I did something wrong in the slowpath, and that there is
some thundering-herd-wakeup kind of thing that makes that much slower than
it should be.

The slow path doesn't much matter from the angle of counting individual
cycles, but it still matters very much from a bigger picture. Does it have
bad behavior where we wake up a thousand readers, but then a writer gets
to come in first and all the readers have to go to sleep again?

That's one thing I tried to avoid in the second version (the "Another
approch" commit) where a read-wakeup actually moves the readers from the
pending count to the final count - both to get more fairness (which can be
_bad_ for performance), but also because I think it can avoid pathological
cases (reader starvation and unnecessary futex wakeup).

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Nick Piggin <npiggin@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, June 2, 2008 - 7:03 pm

yeah, indeed. Compared to all the other costs that have to be dealt with
here, having a second atomic op isnt all that much of an issue either,
especially on latest hw. An atomic op will probably never be as cheap as
a non-atomic op, but ~20 cycles is still plenty fast for most practical
purposes.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, June 2, 2008 - 11:24 pm

I think optimised our unlock_page in a way that it can do a
non-atomic unlock in the fastpath (no waiters) using 2 bits. In
practice it was still atomic but only because other page flags
operations could operate on ->flags at the same time.

I still have to get around to submitting the damn thing upstream,
but maybe if I bring it up here, I get the idea more reviewed :)

Anyway, the algorithm goes like this:

lock_page()
{
if (test_and_set_bit_lock(PG_locked))
lock_page_slow();
}

lock_page_slow()
{
/* slowpath */
again:
prepare_to_wait();
set_bit(PG_waiters);
if (test_and_set_bit_lock(PG_locked)) {
schedule();
goto again;
}
}

wake_page_waiters()
{
/* slowpath */
clear_bit(PG_waiters);
smp_mb__after_clear_bit(); /* order clear_bit store with wake_up_page load */
wake_up_page(PG_locked);
}

unlock_page()
{
clear_bit_unlock(PG_locked);
if (unlikely(test_bit(PG_waiters)))
wake_page_waiters();
}

We don't require any load/store barrier in the unlock_page fastpath
because the bits are in the same word, so cache coherency gives us a
sequential ordering anyway.

Now you notice the lock page slowpath can set bits without holding the
lock, at first glance you'd think the clear_bit to unlock has to be
atomic too then. But actually if we're careful, we can put them in seperate
parts of the word and use the sub-word operations on x86 to avoid the atomic
requirement. I'm not aware of any architecture in which operations to the
same word could be out of order.

Not only does this avoid all barriers (except acquire/release) in the
fastpaths, but it also avoids the unconditional load of the random
hash page waitqueue cacheline on unlock.

Anything applicable to your problem at hand?

--

To: Nick Piggin <npiggin@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 4, 2008 - 3:57 pm

Yes and no.

Yes, the bits are int he same word, so cache coherency guarantees a lot.

HOWEVER. If you do the sub-word write using a regular store, you are now
invoking the _one_ non-coherent part of the x86 memory pipeline: the store
buffer. Normal stores can (and will) be forwarded to subsequent loads from
the store buffer, and they are not strongly ordered wrt cache coherency
while they are buffered.

IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other
stores, but loads are *not* ordered wrt other stores in the absense of a
serializing instruction, and it's exactly because of the write buffer.

See above. The above is unsafe, because if you do a regular store to a
partial word, with no serializing instructions between that and a
subsequent load of the whole word, the value of the store can be bypassed
from the store buffer, and the load from the other part of the word can be
carried out _before_ the store has actually gotten that cacheline
exclusively!

So when you do

movb reg,(byteptr)
movl (byteptr),reg

you may actually get old data in the upper 24 bits, along with new data in
the lower 8.

I think.

Anyway, be careful. The cacheline itself will always be coherent, but the
store buffer is not going to be part of the coherency rules, and without
serialization (or locked ops), you _are_ going to invoke the store buffer!

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 4, 2008 - 9:45 pm

I'd be very surprised if that was the case. But the unlock code needn't
do that anyway. It could do

movb reg,(byteptr) # clear PG_locked
movb (byteptr+1),reg # load PG_waiters

--

To: Nick Piggin <npiggin@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 4, 2008 - 4:38 pm

Put another way: the CPU may internally effectively rewrite the two
instructions as

movb reg,tmpreg (tmp = writebuffer)
movl (byteptr),reg (do the 32-bit read of the old cached contents)
movb tmpreg,reg (writebuffer snoop by reads)
movb tmpreg,(byteptr) (writebuffer actually drains into cacheline)

and *if* your algorithm is robust wrt these kinds of rewrites, you're ok.
But notice how there are two accesses to the cacheline, and how they are
actually in the "wrong" order, and how the cacheline could have been
updated by another CPU in between.

Does this actually happen? Yeah, I do believe it does. Is it a deathknell
for anybody trying to be clever with overlapping reads/writes? No, you can
still have things like causality rules that guarantee that your algorithm
is perfectly stable in the face of these kinds of reordering. But it *is*
one of the few re-orderings that I think is theoretically archtiecturally
visible.

For example, let's start out with a 32-bit word that contains zero, and
three CPU's. One CPU does

cmpxchgl 0->0x01010101,mem

another one does

cmpxchlg 0x01010101->0x02020202,mem

and the third one does that

movb $0x03,mem
movl mem,reg

and after it all completed, you may have 0x02020203 in memory, but "reg"
on the third CPU contains 0x01010103.

Note how NO OTHER CPU could _possibly_ have seen that value! That value
never ever existed in any caches. If the final value was 0x02020203, then
both the cmpxchgl's must have worked, so the cache coherency contents
*must* have gone from 0 -> 0x01010101 -> 0x02020202 -> 0x02020203 (with
the movb actually getting the exclusive cache access last).

So the third CPU saw a value for that load that actually *never* existed
in any cache-line: 0x01010103. Exactly because the x86 memory ordering
allows the store buffer contents to be forwarded within a CPU core.

And this is why atomic locked instructions are special. They bypass the
store buffer (or at least they...

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 4, 2008 - 9:56 pm

I see what you're saying, but I hadn't really considered it before.
Not that I've attempted to do much of these sub-word operations
without consulting you first (which brings us here) ;).

I'd have thought that for a case like this, you'd simply hit the store
alias logic and store forwarding would stall because it doesn't have
the full data.

I'd like to know for sure.

The other thing that could be possible, and I'd imagine maybe more likely
to be implemented in a real CPU because it should give more imrpovement
(and which does break my algorithm) is just that the load to the cacheline
may get to execute first, then if the cacheline gets evicted and
modified by another CPU before our store completes, we effectively see
store/load reordering again.

--

To: Nick Piggin <npiggin@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 4, 2008 - 11:08 pm

That's _one_ possible implementation.

Quite frankly, I think it's the less likely one. It's much more likely
that the cache read access and the store buffer probe happen in parallel
(this is a really important hotpath for any CPU, but even more so x86
where there are more of loads and stores that are spills). And then the
store buffer logic would return the data and a bytemask mask (where the
mask would be all zeroes for a miss), and the returned value is just the

You'd have to ask somebody very knowledgeable inside Intel and AMD, and it
is quite likely that different microarchitectures have different

Oh, absolutely, the perfect algorithm would actually get the right answer
and notice that the cacheline got evicted, and retried the whole sequence
such that it is coherent.

But we do know that Intel expressly documents loads and stores to pass
each other and documents the fact that the store buffer is there. So I bet
that this is visible in some micro-architecture, even if it's not
necessarily visible in _all_ of them.

The recent Intel memory ordering whitepaper makes it very clear that loads
can pass earlier stores and in particular that the store buffer allows
intra-processor forwarding to subsequent loads (2.4 in their whitepaper).
It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure
it isn't. The store->load forwarding is such a critical performance issue
that I can pretty much guarantee that it doesn't always hit the cacheline.

Of course, the partial store forwarding case is not nearly as important,
and stalling is quite a reasonable implementation approach. I just
personally suspect that doing the unconditional byte-masking is actually
_simpler_ to implement than the stall, so..

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Thursday, June 5, 2008 - 12:29 am

Well, it would just be nice to hear a "no we'll never do that", "we

Well I have a simple test case to show loads pass earlier non conflicting
stores in the case that loads do not come from the store buffer (ie.
*inter* processor forwarding).

And store forwarding, by definition means that the load can complete before
the store can possibly be visible to another CPU I'd say. So yes, I'm
sure this does happen too.

--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, David Howells <dhowells@...>, Ulrich Drepper <drepper@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 4, 2008 - 9:58 pm

Well, I should qualify that: it doesn't actually break my algorithm
because you can still implement the unlock without atomic RMWs.
This may require you to have an mfence there, but we can still get
away without atomics (whether that's much cheaper or not, I haven't
measured recently!)
--

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Saturday, May 31, 2008 - 6:32 pm

Btw, that comment is bogus, ignore it. It comes from an earlier algorithm
I tried that also explained the contention case, but got unnecessarily
complex. Separating out the queuing behaviour from the actual
non-contended behaviour made things much simpler to explain, so I stopped
even bothering with a "both" case.

Too much cut-and-paste, IOW.

Linus
--

To: Ulrich Drepper <drepper@...>
Cc: <linux-kernel@...>, <akpm@...>, <mtk.manpages@...>
Date: Saturday, May 31, 2008 - 12:58 am

Uli, you're NOT EVEN READING!

Go back and read it, dammit! How many times have I said that you can even
have more than 32 bits of data?

There's nothing wrong with using 64-bit atomics to update the words. All
I've *ever* questioned is whether you need more than 32 bits for the
FUTEX part.

For example, let's say that you need 4 bits of "state" information (ie
"next woken-up one is a reader"). And yes, you need counts for readers and
writers. Maybe we can make those counts be 30 bits each. 64 bits total
space. But does the FUTEX_OP things actually need to access all 64 bits?

So imagine that you bave a 64-bit atomic op, and bits 0-1 are the reader
state bits, with bits 2-31 being the reader count. Bits 32-33 are the
writer state bits, and bits 34-63 are the writer count.

Now you can use 64-bit atomic ops to update them, but perhaps just use a
32-bit FUTEX op to actually wait as a reader on the low 32 bits that are
the reader state bits. See?

THIS is what I've been trying to tell you. It's quite possible that the
algorithm you already have could work with just moving the bits around so
that each FUTEX op can be done on just one of the words.

And notice my use of the word "possible".

The kernel uses a 32-bit count word, but has various other state too that
it maintains under a separate locking setup. Same basic idea: the 32-bit
word is "special" (because it's the one that has to have the right format
for the 'xadd' op that we use for the fast path), but it's not the _only_
data in the thing. The other data is also accessed, it's just accessed
according to different principles.

Linus
--

Previous thread: [PATCH 3/3] 64-bit futexes: x86 support by Ulrich Drepper on Friday, May 30, 2008 - 9:27 pm. (10 messages)

Next thread: [PATCH 1/3] 64-bit futexes: generic code by Ulrich Drepper on Friday, May 30, 2008 - 9:27 pm. (1 message)