Hi,
When implementing the call_rcu() "worker thread" in userspace, I ran
into the problem that it had to be woken up periodically to check if
there are any callbacks to execute. However, I easily imagine that this
does not fit well with the "green computing" definition.
Therefore, I've looked at ways to have the call_rcu() callers waking up
this worker thread when callbacks are enqueued. However, I don't want to
take any lock and the fast path (when no wake up is required) should not
cause any cache-line exchange.
Here are the primitives I've created. I'd like to have feedback on my
futex use, just to make sure I did not do any incorrect assumptions.
This could also be eventually used in the QSBR Userspace RCU quiescent
state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
ensure synchronize_rcu() does not busy-wait for too long.
/*
* Wake-up any waiting defer thread. Called from many concurrent threads.
*/
static void wake_up_defer(void)
{
if (unlikely(atomic_read(&defer_thread_futex) == -1))
atomic_set(&defer_thread_futex, 0);
futex(&defer_thread_futex, FUTEX_WAKE,
0, NULL, NULL, 0);
}
/*
* Defer thread waiting. Single thread.
*/
static void wait_defer(void)
{
atomic_dec(&defer_thread_futex);
if (atomic_read(&defer_thread_futex) == -1)
futex(&defer_thread_futex, FUTEX_WAIT, -1,
NULL, NULL, 0);
}
Thanks,
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Is it a problem if multiple threads all hit the defer_thread_futex==-1 case simultaneously? If so, maybe this should use an atomic Is it a problem if the value of defer_thread_futex changes to zero after the dec but before the test? Chris --
It should not be, since what we'll have is, e.g.:
thread 1 calls wakeup
thread 2 calls wakeup
thread 3 calls wait
(thread 3 is waiting on the futex, defer_thread_futex = -1)
- thread 1 sees defer_thread_futex==-1
- thread 2 sees defer_thread_futex==-1
- thread 1 sets defer_thread_futex = 0
- thread 2 sets defer_thread_futex = 0
- thread 1 calls futex() to wake up the waiter, expect 0
- thread 2 calls futex() to wake up the waiter, expect 0
Basically, what happens in this scenario is that the first futex()
call will wake up any waiter, and the second will be a no-op.
Let's complicate this, if we have thread 3 running wait_defer()
concurrently:
- thread 3 decrements defer_thread_futex
- thread 1 sees defer_thread_futex==-1
- thread 2 sees defer_thread_futex==-1
- thread 1 sets defer_thread_futex = 0
- thread 2 sets defer_thread_futex = 0
- thread 1 calls futex() to wake up the waiter, expect 0
- thread 2 calls futex() to wake up the waiter, expect 0
- thread 3 calls futex() to wait, expect -1
Returns immediately because defer_thread_futex == 0
Other scenario, where thread decrements defer_thread_futex a bit later:
- thread 1 sees defer_thread_futex==0
- thread 2 sees defer_thread_futex==0
- thread 3 decrements defer_thread_futex
- thread 3 tests defer_thread_futex==-1
- thread 3 calls futex() to wait, expect -1
In this scenario, we have to notice that if threads 1/2 enqueued tasks
to do before checking defer_thread_futex, these tasks would not be seen
by the waiter thread.
So correct memory ordering of:
- wake_up_defer:
* queue callbacks to perform (1)
* wake up (2)
- wait_defer:
* for (;;)
* wait for futex (3)
* sleep 100ms (wait for more callbacks to be enqueued)
* dequeue callbacks, execute them (4)
actually matters. I'll have to be really careful about that (unless we
just accept that tasks to perform could be queued for a while, however,
I'd like to give an upper bound to the delay between batch ...I knew I needed to think about it a bit more. Here is the proposed
algorithm hopefully fixing the race identified in the 3rd scenario
above. The idea is to perform the "check for empty queue" between the
&defer_thread_futex decrement and the test in wait_defer. It skips the
futex call and proceed if the list is non-empty.
As I am drilling into the problem, it looks very much like an attempt to
implement efficient wait queues in userspace based on sys_futex().
/*
* Wake-up any waiting defer thread. Called from many concurrent
* threads.
*/
static void wake_up_defer(void)
{
if (unlikely(atomic_read(&defer_thread_futex) == -1)) {
atomic_set(&defer_thread_futex, 0);
futex(&defer_thread_futex, FUTEX_WAKE, 0,
NULL, NULL, 0);
}
}
/*
* Defer thread waiting. Single thread.
*/
static void wait_defer(void)
{
atomic_dec(&defer_thread_futex);
smp_mb(); /* Write futex before read queue */
if (rcu_defer_num_callbacks()) {
smp_mb(); /* Read queue before write futex */
/* Callbacks are queued, don't wait. */
atomic_set(&defer_thread_futex, 0);
} else {
smp_rmb(); /* Read queue before read futex */
if (atomic_read(&defer_thread_futex) == -1)
futex(&defer_thread_futex, FUTEX_WAIT, -1,
NULL, NULL, 0);
}
}
- call_rcu():
* queue callbacks to perform
* smp_mb()
* wake_up_defer()
- defer thread:
* for (;;)
* wait_defer()
* sleep 100ms (wait for more callbacks to be enqueued)
* dequeue callbacks, execute them
The goal here is that if call_rcu() enqueues a callback (even if it
races with defer thread going to sleep), there should not be a
potentially infinite delay before it gets executed. Therefore, being
blocked in sys_futex while there is a callback to execute, without any
hope to be ...It doesn't seem like the test for the number of callbacks should be necessary. I don't see anything like that in the glibc code, nor do I remember anything like that in the futex sample code. I'm still not totally convinced that you can avoid race conditions without using atomic test-and-set or compare-and-exchange. I haven't sat down and worked it out completely though. Chris --
The mutex code (and usual futex users) use futex to implement mutual exclusion. My goal is to send a wakeup signal to a thread waiting for work to perform when adding such work. But without any mutual exclusion. So it is understandable that glibc code or futex sample code does not Yes.. this is heavily dependent on the states and values which can be reached. I should probably take time to create a promela model and run that though the spin model checker to be sure. Thanks for the comments, -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Just created a Promela model for this. It assumes sequential memory ordering (so it's a fairly simplified model). Given I added memory barriers between each operation, it should well represent reality though. My algorithm seems to behave as expected: when a callback is added to the queue, it's not possible to have the waiter thread blocked until the end of days. Available at: http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=blob;f=formal-model/fute... Thanks, -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Why is that different ? if you use a normal FUTEX like pthread_mutex_...() and have the thread block right at the start (by blocking the *UTEX before creating the thread). Now you can wake the thread by unblocking the *UTEX. -Michael --
It might be that (different from a sema), a FUTEX can only be unlocked by the process that locked it. But then why should FUTEX offer advantages over sema, if in that application the FUTEX is supposed to block with a system call, anyway ? -Michael --
The standard approach would be to use pthread_cond_wait() and pthread_cond_broadcast(). Unfortunately, this would require holding a pthread_mutex_lock across both operations, which would not necessarily be so good for wake-up-side scalability. That said, without this sort of heavy-locking approach, wakeup races are quite difficult to avoid. Thanx, Paul --
The pthread_cond_broadcast() mutex is really a bugger when it comes to execute it at each rcu_read_unlock(). We could as well use a mutex to I did a formal model of my futex-based wait/wakeup. The main idea is that the waiter: - Set itself to "waiting" - Checks the "real condition" for which it will wait (e.g. queues empty when used for rcu callbacks, no more ongoing old reader thread C.S. when used in synchronize_rcu()) - Calls sys_futex if the variable have not changed. And the waker: - sets the "real condition" waking up the waiter (enqueuing, or rcu_read_unlock()) - check if the waiter must be woken up, if so, wake it up by setting the state to "running" and calling sys_futex. But as you say, wakeup races are difficult (but not impossible!) to avoid. This is why I resorted to a formal model of the wait/wakeup scheme to ensure that we cannot end up in a situation where a waker races with the waiter and does not wake it up when it should. This is nothing fancy (does not model memory and instruction reordering automatically), but I figure that memory barriers are required between almost every steps of this algorithm, so by adding smp_mb() I end up ensure sequential behavior. I added test cases in the model to ensure that incorrect memory reordering _would_ cause errors by doing the reordering by hand in error-injection runs. The model is available at: http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=tree;f=futex-wakeup;h=4d... (this is in the formal-model branch of the urcu tree, futex-wakeup subdir) This is modeling this snippet of code : static int defer_thread_futex; /* * Wake-up any waiting defer thread. Called from many concurrent threads. */ static void wake_up_defer(void) { if (unlikely(uatomic_read(&defer_thread_futex) == -1)) { uatomic_set(&defer_thread_futex, 0); futex(&defer_thread_futex, FUTEX_WAKE, 1, ...
My question is whether pthread_cond_wait() and pthread_cond_broadcast() can substitute for the raw call to futex. Unless I am missing something (which I quite possibly am), the kernel will serialize on the futex anyway, so serialization in user-mode code does not add much additional I will take a look after further recovery from jetlag. Not yet competent to review this kind of stuff. Give me a few days. ;-) Thanx, Paul --
The kernel sys_futex implementation only takes per-bucket spinlocks. So this is far from the cost of a global mutex in pthread_cond. Moreover, my scheme does not require to take any mutex in the fast path (when there is no waiter to wake up), which makes performances appropriate for use in rcu read-side. It's a simple memory barrier, variable read, test No problem, thanks for looking at this, -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
(added some CC's, given the length of the answer. Thanks for asking) ;) (Sorry for duplicate, messed up LKML email in original post) Hrm, your assumption about the common case does not seem to fit my scenarios. Typically, the wakeup will find the waiter not blocked, and thus skip the call to sys_futex. Here is why. I use this scheme in two different implementations: 1) in call_rcu(), to wake up the worker thread after adding work to the queue. This worker thread, when woken up, sleeps for a few milliseconds before starting to dequeue work. Therefore, if the system is relatively busy, call_rcu() will usually see the worker thread while it's sleeping (and therefore _not_ waiting on the futex). Also, if work is enqueued while the worker thread is executing past RCU callbacks, the worker thread will detect it and won't wait on the futex. Therefore, this is, by design, a very unlikely event to have call_rcu() calling sys_futex. 2) in rcu_read_unlock(), to wake up synchronize_rcu() waiting on past reader's grace periods. Here, synchronize_rcu() busy-waits for the reader's G.P. to end, and if this does not work, after a few attempts (like the pthread mutexes), it adapts and uses sys_futex. The waker only need to call sys_futex if there is a synchronize_rcu() currently running which had to call sys_futex after a few active attempts failed. As you see, in both cases, the common case, "fast path", is to find the futex unlocked and not having to take any lock. Now, about the slow path. I think it's worth discussing too. Indeed, sys_futex takes a per-address spinlock, which happens to serialize all sys_futex operations on the wakeup side. Therefore, for wakeup designs relying on calling sys_futex for wakeup very frequently, this is really bad. There might be ways to mitigate this problem though: changing the sys_futex implementation to use lock-free lists might help. One advantage of calling sys_futex without holding a userspace mutex is that the contention ...
Hrm, thinking about it, the pthread_cond_wait/pthread_cond_broadcast
scheme I proposed above is racy.
If a wait_gp executes concurrently with wake_up_gp, we can end up doing:
gp_wait = 0
* wait_gp: uatomic_dec(&gp_wait);
* wait_gp: smp_mb()
* wait_gp: if (!num_old_reader_gp()) { (false)
* wait_gp: pthread_mutex_lock(&rcumutex);
* wait_gp: if (uatomic_read(&gp_wait) == -1) { (true)
* wake_up_gp: unlikely(uatomic_read(&gp_wait) == -1) { (true)
* wake_up_gp: uatomic_set(&gp_wait, 0);
* wait_up_gp: pthread_cond_broadcast(&rcucond);
* wait_gp: pthread_cond_wait(&rcucond, &rcumutex);
might wait forever (or until the next wakeup).
We would need an extra mutex in the wakeup, e.g.:
static inline void wake_up_gp(void)
{
/* just released old reader gp */
smp_mb();
if (unlikely(uatomic_read(&gp_wait) == -1)) {
pthread_mutex_lock(&rcumutex);
uatomic_set(&gp_wait, 0);
pthread_cond_broadcast(&rcucond);
pthread_mutex_unlock(&rcumutex);
}
}
static void wait_gp(void)
{
uatomic_dec(&gp_wait);
smp_mb();
if (!num_old_reader_gp()) {
smp_mb();
atomic_set(&gp_wait, 0);
goto unlock;
}
/* Read reader_gp before read futex */
smp_rmb();
pthread_mutex_lock(&rcumutex);
if (uatomic_read(&gp_wait) == -1) {
pthread_cond_wait(&rcucond, &rcumutex);
pthread_mutex_unlock(&rcumutex);
}
}
This should work better, but I'll need to do a formal model to convince
myself.
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Yep, Obviously I was misled. So I now understand that you want to schedule the thread as a result to some event and the Thread might already be running at that time, so that it does not need to enter an OS-based wait state. But to me this looks as if a _counting_ semaphore is needed here (at least in a more general case), instead of a binary semaphore (such as FUTEX does manage). Only with a counting semaphore no events are missed (unless the user software design cares for this with other means, which might be difficult or impossible). Of course the fast path of a user space counting semaphore is easily doable with atomic_inc() and atomic_dec(). But I only know about two working variants of the user space code for the (binary) FUTEX. There seem to be some more implementations that do not work decently. I have no idea if it's possible to create a counting semaphore in user space that uses the "futex" syscall (or whatever) for the case the threads needs to wait. -Michael --
