This patch series adds support for 64-bit futexes. The current futexes only protect 32-bits. I don't know why, ask the original authors. It is unnecessarily limiting, though, especially for 64-bit machines. To understand the problem let me just say that futexes work by storing a program-defined state in a variable. Threads then can wait for the state of the variable to change. It all works dandy if the protocol for using futexes is followed correctly; see http://people.redhat.com/drepper/futexes.pdf For mutexes, semaphores etc this is quite easy. For mutexes, for instance, we have as state to protect a flag whether the mutex is locked and whether there is any thread waiting for the release. This can be done easily with the available 32 bits in a futex. The situation is different for more complex synchronization objects. The best example are reader/writer locks. Here the state consists at least of - number of readers which locked the rwlock - flag whether rwlock is locked for reader, writing, or not at all - number of readers waiting - number of writers waiting I.e., rwlocks are significantly more complicated. Much of the complication stems from the requirement to have to types: those which prefer readers and those which prefer writers. Without imposing unreasonable limitations on the number of concurrent users a rwlock can have the state cannot be represented in a single 32-bit variable. This is why the current rwlock implementation uses an internal lock and spreads the state out in several variables. This works, but there is a significant performance penalty to be paid: - at least two atomic operations to lock and unlock the internal lock - multiple readers are not really able to run concurrently since they might block on the internal lock - following from the last point, the behaviour of the code is less predictable due to scheduling artifacts The logical solution is to extend the size of the state which can be protected by allowing ...
I call bull on that. We've had a 32-bit rwsemaphore in the kernel for a *long* time. I think it's totally stupid for glibc to even *try* to do something like this, since almost nobody will have a kernel with 64-bit futex support for a long time in the wild anyway. So you need to support a 32-bit semaphore in practice, and it's been done before. The x86 kernel rwsem implementation may limit things to 64k readers (I'm not even sure that's true, I'm not going to look at the code again), but if I recall correctly that's just because we wanted to use a single "xadd" in the hotpath, instead of doign a load a cmpxchg. I really object to adding another 32/64-bit difference just for something like an rwsemaphore. It needs a whole lot of stronger reasons than that. I suspect the old rwlocks are simply just stupid. You can do those things with a single 32-bit locked op. Have you tried - 29 bits of reader counts - 1 bit of uncontended writer - 1 bit of "contention" (ie a mark for requiring fairness) - 1 bit for a spinlock so that you can do all the fairness without doing any extra locked ops or similar? Linus --
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 You misread what I wrote. Semaphores are fine. The problem are reader/writer locks where we need three counters and some flags. It is absolutely not acceptable to limit Again, misread. This is not functionality which is not available. Semaphores are *of course* fine with 32 bits. And there is a reader/writer lock implementation. It is just very slow compared to what is possible. The transition between systems miss the 64-bit support and those which have it is completely transparent. The same glibc binary will work for both. - -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAkhAwpsACgkQ2ijCOnn/RHS7PACdGvdgI9pv+IcsSPkXPVwFJ5ZF YNQAn01fm1wEc4EI/fwOVuM57XtCy9Dq =LcpJ -----END PGP SIGNATURE----- --
No, you do. The kernel calls reader-writer locks rwsemaphores. Linus --
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Well, then you don't understand the complexity of the required interface together with the performance implications. Take your proposed allocation: - 29 bits of reader counts - 1 bit of uncontended writer - 1 bit of "contention" (ie a mark for requiring fairness) - 1 bit for a spinlock so that you can do all the fairness without doing any extra locked ops Ask yourself this: - - How do you know when there is no more writer waiting? You cannot reset a "writer waiting" bit after you wake up one writer without waking every single thread waiting for the rwlock and letting them fight for it - - How do you handle the difference between reader-preferred rwlocks and writer-preferred rwlocks? In the latter, if a rwlock is locked for reading, future readers must be delayed until all writers are gone - - How do you do the accounting for the *timedlock variants? In the case of those a functions, if the threads wake due to a timeout, you have the decrement the waiter count. But you have only one bit... I don't say you cannot implement rwlocks this way. Sure, it definitely is possible. But this implementation would in the contended case like 10x as slow as the current code because you constantly have to wake up every single thread. If you want, I'll give you a libpthread.so with the new code. Then you can test your code. I bet you whatever you want that you cannot achieve the performance with your puny 32-bit futex without limiting the number of threads to a ridiculously low number. The implementation I have allows for 2^23 (possibly recursively locked) readers, 2^18 readers or writers waiting. When I was writing "you need more than 32 bits" I didn't even imagine that somebody would suggest using such a primitive scheme which cannot possibly work efficiently in a userlevel implementation. - -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -----BEGIN PGP ...
How about you post your code. Sure you can. By looking at the *other*data* that isn't atomic. So when doing a "write_unlock()" (or whatever you call it - for the kernel calls it "up_write()") you can look at the non-atomic write counts to decide whether there are others. Also note how you can use 64-bit atomic ops to do that all in user space, without actually requiring 64-bit futex support - as long as the bits that matter for waking (like "was there more than one pending writer") fit in Again, that's not something that needs to be in the *atomic* part. It's unquestionable that a rwlock is more than 32 bits, but what I Uli, you're not even trying any more. NO, you don't have just one bit. You have as many bits as you want. It's just that only 32 of the bits will be relevant for FUTEX_WAIT_OP etc. Linus --
So here's an example of this: - make all the readers/writers actually update a 64-bit word (by using cmpxchg8b in user space to actually get the locks) - but organize things so that a reader only needs to look at the high 32 bits to actually make it's wakeup-decision, and a sleeping writer only needs to look at the low 32 bits. How? Make the low bits of the words contain the "contention status". Is it possible? I dunno. I personally suspect it is. I also suspect you didn't even try. Linus --
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Once again not without limitations which are too low. The main problem is that there is far more information needed on the reader side. There are two counters and a number of bits for state information. Which means the counters can be at most 14 bits in size. A counter this low also means I cannot unconditionally increment them (since there can be more than 2^14 threads sharing one page). This means you get in the cache line race I described in the last mail. You first have to load a word and then perform a the write operation while I have to disappoint you. I thought of that and had to reject it. - -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAkhA1j8ACgkQ2ijCOnn/RHTUXQCgnkUy6PNSuUI4O6BD4abxqqKr PzMAn2qcvrXGM4qrhOytk1Yie0BsMeVa =zvxZ -----END PGP SIGNATURE----- --
Uli, you're NOT EVEN READING! Go back and read it, dammit! How many times have I said that you can even have more than 32 bits of data? There's nothing wrong with using 64-bit atomics to update the words. All I've *ever* questioned is whether you need more than 32 bits for the FUTEX part. For example, let's say that you need 4 bits of "state" information (ie "next woken-up one is a reader"). And yes, you need counts for readers and writers. Maybe we can make those counts be 30 bits each. 64 bits total space. But does the FUTEX_OP things actually need to access all 64 bits? So imagine that you bave a 64-bit atomic op, and bits 0-1 are the reader state bits, with bits 2-31 being the reader count. Bits 32-33 are the writer state bits, and bits 34-63 are the writer count. Now you can use 64-bit atomic ops to update them, but perhaps just use a 32-bit FUTEX op to actually wait as a reader on the low 32 bits that are the reader state bits. See? THIS is what I've been trying to tell you. It's quite possible that the algorithm you already have could work with just moving the bits around so that each FUTEX op can be done on just one of the words. And notice my use of the word "possible". The kernel uses a 32-bit count word, but has various other state too that it maintains under a separate locking setup. Same basic idea: the 32-bit word is "special" (because it's the one that has to have the right format for the 'xadd' op that we use for the fast path), but it's not the _only_ data in the thing. The other data is also accessed, it's just accessed according to different principles. Linus --
Ok, so I obviously think you're simply wrong, and I've told you so both in private after seeing the 64-bit code you propose (and which you want to patent), and publicly. So let's have code talk. I'm going to give real code for the only case that really matters, namely the uncontended case. Do you agree that once you get contention, and you actually have to do a FUTEX system call to sort it out, the number of atomic instructions do not really matter? Because the system call is going to dominate. Agreed? So I'm not going to write out the slow contention case, because you have already agreed that you can do a 32-bit rwlock _slowly_ with the existing 32-bit FUTEX support. IOW, I'm faking it, but I'm making a point. Namely that you can efficiently do read-write lock using *only* 32-bit ops, and without any real kind of limitations on the number of readers and writers. So here goes the explanation and the pseudo-code. - have two levels of locking: the contended case, and the uncontended case - these are two *separate* locks. I'm going to describe the only case that mattes for performance, namely the non-contention one. We just make sure that the non-contention lock "gates" things to the slow case. - in other words, here's a pretty optimal fast-case that we write in assembly language, and all it does is to handle the non-contention case, with any waiting happening on *another* lock entirely. - on that first-level lock, the setup is as follows: - the low bit (bit#0) is "writer holds" - the next bit (bit#1) is "contended", and all it does is that it guarantees that when set, everybody will go to the slow-path. - bits 2-31 are "reader count" - the second-level lock is another 32-bit lock that is *adjacent* in memory, and they are 64-bit aligned, so that you can do an atomic initialization and more importantly a 64-bit "it is now no longer contended" assignment in one atomic op. That's going to be relevant only for ...
Btw, that comment is bogus, ignore it. It comes from an earlier algorithm I tried that also explained the contention case, but got unnecessarily complex. Separating out the queuing behaviour from the actual non-contended behaviour made things much simpler to explain, so I stopped even bothering with a "both" case. Too much cut-and-paste, IOW. Linus --
i suspect _any_ abstract locking functionality around a data structure can be implemented via atomic control over just a single user-space bit. That bit can be used as a lock and if all access to the state of that atomic variable uses it, arbitrary higher-order atomic state transitions can be derived from it. The cost would be a bit more instructions in the fastpath, but there would still only be a single atomic op (the acquire op), as the unlock would be a natural barrier (on x86 at least). Concurrency (and scheduling) of that lock would still be exactly the same as with genuine 64-bit (or even larger) atomic ops, and the fastpath would be very close as well. Ingo --
[ I added Nick and DavidH to the Cc, since they have at least historically shown interest in locking algorithms ] Well, theory is theory, and practice is different. That's *especially* true when it comes to locking, which is just so tightly coupled to the exact details of which atomics are fast on a particular architecture. Also, different people will want to see different performance profiles. For example, it makes a *huge* difference whether you are strictly fair or not. I can almost guarantee that a 100% fair implementation may be really "nice" from a theoretical standpoint, but suck really badly in practice, because if you want best performance, then you want the lock to have a very strong CPU affinity - and the faster you can do your lock ops, the more of a unfairness and CPU affinity they get! And rwlocks in particular are actually much more interesting in this respect, because they not only have that CPU affinity fairness, but also the reader-vs-writer fairness. You optimize for one load, and it may give you almost perfect performance, but at the cost of sucking at another load. For example, some loads are almost entirely read-read locks, with only very occasional write locks for some uncommon config change thing. Do you want to optimize for that? Maybe. And yes, you can make those uncontended read-read locks go really quickly, but then (especially if you continue to let reads through even when writers want to contend), that can slow down writers a *lot*, to the point of starvation. Different CPU's will also show different patterns. Anyway, I was busy most of the weekend, but I've now coded up a partial actual example implementation. Its' probably buggy. Uli is very correct in saying that it's easy to screw up futex'es, but even in the absense of futexes, it's just easy to screw up any threaded logic. But if anybody wants to look at a rough draft that at least limps along _partially_, ...
yeah, indeed. Compared to all the other costs that have to be dealt with here, having a second atomic op isnt all that much of an issue either, especially on latest hw. An atomic op will probably never be as cheap as a non-atomic op, but ~20 cycles is still plenty fast for most practical purposes. Ingo --
I think optimised our unlock_page in a way that it can do a
non-atomic unlock in the fastpath (no waiters) using 2 bits. In
practice it was still atomic but only because other page flags
operations could operate on ->flags at the same time.
I still have to get around to submitting the damn thing upstream,
but maybe if I bring it up here, I get the idea more reviewed :)
Anyway, the algorithm goes like this:
lock_page()
{
if (test_and_set_bit_lock(PG_locked))
lock_page_slow();
}
lock_page_slow()
{
/* slowpath */
again:
prepare_to_wait();
set_bit(PG_waiters);
if (test_and_set_bit_lock(PG_locked)) {
schedule();
goto again;
}
}
wake_page_waiters()
{
/* slowpath */
clear_bit(PG_waiters);
smp_mb__after_clear_bit(); /* order clear_bit store with wake_up_page load */
wake_up_page(PG_locked);
}
unlock_page()
{
clear_bit_unlock(PG_locked);
if (unlikely(test_bit(PG_waiters)))
wake_page_waiters();
}
We don't require any load/store barrier in the unlock_page fastpath
because the bits are in the same word, so cache coherency gives us a
sequential ordering anyway.
Now you notice the lock page slowpath can set bits without holding the
lock, at first glance you'd think the clear_bit to unlock has to be
atomic too then. But actually if we're careful, we can put them in seperate
parts of the word and use the sub-word operations on x86 to avoid the atomic
requirement. I'm not aware of any architecture in which operations to the
same word could be out of order.
Not only does this avoid all barriers (except acquire/release) in the
fastpaths, but it also avoids the unconditional load of the random
hash page waitqueue cacheline on unlock.
Anything applicable to your problem at hand?
--
Yes and no. Yes, the bits are int he same word, so cache coherency guarantees a lot. HOWEVER. If you do the sub-word write using a regular store, you are now invoking the _one_ non-coherent part of the x86 memory pipeline: the store buffer. Normal stores can (and will) be forwarded to subsequent loads from the store buffer, and they are not strongly ordered wrt cache coherency while they are buffered. IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other stores, but loads are *not* ordered wrt other stores in the absense of a serializing instruction, and it's exactly because of the write buffer. See above. The above is unsafe, because if you do a regular store to a partial word, with no serializing instructions between that and a subsequent load of the whole word, the value of the store can be bypassed from the store buffer, and the load from the other part of the word can be carried out _before_ the store has actually gotten that cacheline exclusively! So when you do movb reg,(byteptr) movl (byteptr),reg you may actually get old data in the upper 24 bits, along with new data in the lower 8. I think. Anyway, be careful. The cacheline itself will always be coherent, but the store buffer is not going to be part of the coherency rules, and without serialization (or locked ops), you _are_ going to invoke the store buffer! Linus --
Put another way: the CPU may internally effectively rewrite the two instructions as movb reg,tmpreg (tmp = writebuffer) movl (byteptr),reg (do the 32-bit read of the old cached contents) movb tmpreg,reg (writebuffer snoop by reads) movb tmpreg,(byteptr) (writebuffer actually drains into cacheline) and *if* your algorithm is robust wrt these kinds of rewrites, you're ok. But notice how there are two accesses to the cacheline, and how they are actually in the "wrong" order, and how the cacheline could have been updated by another CPU in between. Does this actually happen? Yeah, I do believe it does. Is it a deathknell for anybody trying to be clever with overlapping reads/writes? No, you can still have things like causality rules that guarantee that your algorithm is perfectly stable in the face of these kinds of reordering. But it *is* one of the few re-orderings that I think is theoretically archtiecturally visible. For example, let's start out with a 32-bit word that contains zero, and three CPU's. One CPU does cmpxchgl 0->0x01010101,mem another one does cmpxchlg 0x01010101->0x02020202,mem and the third one does that movb $0x03,mem movl mem,reg and after it all completed, you may have 0x02020203 in memory, but "reg" on the third CPU contains 0x01010103. Note how NO OTHER CPU could _possibly_ have seen that value! That value never ever existed in any caches. If the final value was 0x02020203, then both the cmpxchgl's must have worked, so the cache coherency contents *must* have gone from 0 -> 0x01010101 -> 0x02020202 -> 0x02020203 (with the movb actually getting the exclusive cache access last). So the third CPU saw a value for that load that actually *never* existed in any cache-line: 0x01010103. Exactly because the x86 memory ordering allows the store buffer contents to be forwarded within a CPU core. And this is why atomic locked instructions are special. They bypass the store buffer (or at least they _act_ as ...
I see what you're saying, but I hadn't really considered it before. Not that I've attempted to do much of these sub-word operations without consulting you first (which brings us here) ;). I'd have thought that for a case like this, you'd simply hit the store alias logic and store forwarding would stall because it doesn't have the full data. I'd like to know for sure. The other thing that could be possible, and I'd imagine maybe more likely to be implemented in a real CPU because it should give more imrpovement (and which does break my algorithm) is just that the load to the cacheline may get to execute first, then if the cacheline gets evicted and modified by another CPU before our store completes, we effectively see store/load reordering again. --
Well, I should qualify that: it doesn't actually break my algorithm because you can still implement the unlock without atomic RMWs. This may require you to have an mfence there, but we can still get away without atomics (whether that's much cheaper or not, I haven't measured recently!) --
That's _one_ possible implementation. Quite frankly, I think it's the less likely one. It's much more likely that the cache read access and the store buffer probe happen in parallel (this is a really important hotpath for any CPU, but even more so x86 where there are more of loads and stores that are spills). And then the store buffer logic would return the data and a bytemask mask (where the mask would be all zeroes for a miss), and the returned value is just the You'd have to ask somebody very knowledgeable inside Intel and AMD, and it is quite likely that different microarchitectures have different Oh, absolutely, the perfect algorithm would actually get the right answer and notice that the cacheline got evicted, and retried the whole sequence such that it is coherent. But we do know that Intel expressly documents loads and stores to pass each other and documents the fact that the store buffer is there. So I bet that this is visible in some micro-architecture, even if it's not necessarily visible in _all_ of them. The recent Intel memory ordering whitepaper makes it very clear that loads can pass earlier stores and in particular that the store buffer allows intra-processor forwarding to subsequent loads (2.4 in their whitepaper). It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure it isn't. The store->load forwarding is such a critical performance issue that I can pretty much guarantee that it doesn't always hit the cacheline. Of course, the partial store forwarding case is not nearly as important, and stalling is quite a reasonable implementation approach. I just personally suspect that doing the unconditional byte-masking is actually _simpler_ to implement than the stall, so.. Linus --
Well, it would just be nice to hear a "no we'll never do that", "we Well I have a simple test case to show loads pass earlier non conflicting stores in the case that loads do not come from the store buffer (ie. *inter* processor forwarding). And store forwarding, by definition means that the load can complete before the store can possibly be visible to another CPU I'd say. So yes, I'm sure this does happen too. --
I'd be very surprised if that was the case. But the unlock code needn't do that anyway. It could do movb reg,(byteptr) # clear PG_locked movb (byteptr+1),reg # load PG_waiters --
Had a bit of a look though this, seems pretty OK to me. Obviously with rwlocks it *is* impossible to do non-atomic unlocks, so I can't see a way to significantly improve your code there. What you *could* maybe do, to slightly speed up the reader fastpath, at the expense of the writer fastpath, is to also have the active writer add 4 to the count too, so your unlock can start with a lock xadd -4, count in order to get the write-intent on the cacheline straight up. Though that's a pretty tiny "optmisation", and not going to be an amazing win, even when it does save a state transition on the cacheline... I'd be more interested to know why this code can't be evolved into a full rwlock implementation? This is a rather standard (though neat) looking rwlock -- so my question is what can the patented 64-bit futex locks do that this can't, or what can they do faster? --
Yes, nice idea. It avoids the possible unnecessary S->M transition, but the downside is that it effectively slows down the write unlock by making it do two atomic ops even for the fastpath. So if I were to _only_ care about the reader path, I think it would be a great idea, but as it is, the current non-contended write case is actually pretty close to optimal, and Quite frankly - and this was my argument the whole time - I do not believe that a "full" 64-bit implementation can do _anything_ more than this 32-bit one can do. That's the reason I wrote the code. I'm pretty sure that you can do perfectly well with just 32 bit atomic futex ops (my rwlocks obviously do use 64-bit cmpxchg's in user space, but not in the fast-path, and it should work fine on x86-32 too using cmpxchg8b). In fact, in the second version of my locks, I didn't do futex operations on the actual lock itself at *all*, and just do futex ops on separate "event" fields. That actually should have avoided a bug I think I had (but couldn't really trigger in practice) in the first version, and made everything look pretty straightforward. I haven't really worked on them since I write the thing, but I did consider things like timeouts etc. Timeouts are "hard" to handle because they mean that you cannot use any kind of trivially incrementing "ticket locks" with sequence numbers (because we may have to just avoid a sequence if it times out), so the sequence number approach that we now use for kernel spinlocks was not an option. I didn't actually *write* the timeout versions, of course, but given the structure of the locks they really should be very straightforward. [ Half-way subtle thing: a writer that times out needs to be very careful that it doesn't lose a wakeup event, but futexes actually make that part pretty easy - since FUTEX_WAIT returns whether you got woken up or not, you can just decide to wake up the next write-waiter if you cannot get the lock immediately and ...
Yeah, it is a case of a large slowdown for write for a small speedup for read (pity the API doesn't have explicit read and write unlocks -- were they too lazy to type the last bit, or did they expect people to lose track of whether they had a read or write lock? :P) Anyway, it's obviously a tradeoff you'd just have to carefully Well... a single lock is only going to be so scalable. I don't see how it could be done really significantly better? Maybe a small factor of improvement if you were to concentrate on the contended case (but you wouldn't want to do that anyway) --
My worry is that I did something wrong in the slowpath, and that there is some thundering-herd-wakeup kind of thing that makes that much slower than it should be. The slow path doesn't much matter from the angle of counting individual cycles, but it still matters very much from a bigger picture. Does it have bad behavior where we wake up a thousand readers, but then a writer gets to come in first and all the readers have to go to sleep again? That's one thing I tried to avoid in the second version (the "Another approch" commit) where a read-wakeup actually moves the readers from the pending count to the final count - both to get more fairness (which can be _bad_ for performance), but also because I think it can avoid pathological cases (reader starvation and unnecessary futex wakeup). Linus --
