Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy

Previous thread: Re: module: Speed up symbol resolution during module loading by Alan Jenkins on Wednesday, September 23, 2009 - 9:52 am. (2 messages)

Next thread: mmc_spi: command errors on 2.6.29.6 by Michael Borsuk on Wednesday, September 23, 2009 - 11:01 am. (1 message)
From: Mathieu Desnoyers
Date: Wednesday, September 23, 2009 - 10:48 am

Hi,

When implementing the call_rcu() "worker thread" in userspace, I ran
into the problem that it had to be woken up periodically to check if
there are any callbacks to execute. However, I easily imagine that this
does not fit well with the "green computing" definition.

Therefore, I've looked at ways to have the call_rcu() callers waking up
this worker thread when callbacks are enqueued. However, I don't want to
take any lock and the fast path (when no wake up is required) should not
cause any cache-line exchange.

Here are the primitives I've created. I'd like to have feedback on my
futex use, just to make sure I did not do any incorrect assumptions.

This could also be eventually used in the QSBR Userspace RCU quiescent
state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
ensure synchronize_rcu() does not busy-wait for too long.

/*
 * Wake-up any waiting defer thread. Called from many concurrent threads.
 */
static void wake_up_defer(void)
{
        if (unlikely(atomic_read(&defer_thread_futex) == -1))
                atomic_set(&defer_thread_futex, 0);
                futex(&defer_thread_futex, FUTEX_WAKE,
                      0, NULL, NULL, 0);
}

/*
 * Defer thread waiting. Single thread.
 */
static void wait_defer(void)
{
        atomic_dec(&defer_thread_futex);
        if (atomic_read(&defer_thread_futex) == -1)
                futex(&defer_thread_futex, FUTEX_WAIT, -1,
                      NULL, NULL, 0);
}

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Chris Friesen
Date: Wednesday, September 23, 2009 - 11:04 am

Is it a problem if multiple threads all hit the defer_thread_futex==-1
case simultaneously?  If so, maybe this should use an atomic

Is it a problem if the value of defer_thread_futex changes to zero after
the dec but before the test?

Chris
--

From: Mathieu Desnoyers
Date: Wednesday, September 23, 2009 - 12:03 pm

It should not be, since what we'll have is, e.g.:

thread 1 calls wakeup
thread 2 calls wakeup
thread 3 calls wait

(thread 3 is waiting on the futex, defer_thread_futex = -1)
- thread 1 sees defer_thread_futex==-1
- thread 2 sees defer_thread_futex==-1
- thread 1 sets defer_thread_futex = 0
- thread 2 sets defer_thread_futex = 0
- thread 1 calls futex() to wake up the waiter, expect 0
- thread 2 calls futex() to wake up the waiter, expect 0

Basically, what happens in this scenario is that the first futex()
call will wake up any waiter, and the second will be a no-op.

Let's complicate this, if we have thread 3 running wait_defer()
concurrently:

- thread 3 decrements defer_thread_futex
- thread 1 sees defer_thread_futex==-1
- thread 2 sees defer_thread_futex==-1
- thread 1 sets defer_thread_futex = 0
- thread 2 sets defer_thread_futex = 0
- thread 1 calls futex() to wake up the waiter, expect 0
- thread 2 calls futex() to wake up the waiter, expect 0
- thread 3 calls futex() to wait, expect -1
  Returns immediately because defer_thread_futex == 0

Other scenario, where thread decrements defer_thread_futex a bit later:

- thread 1 sees defer_thread_futex==0
- thread 2 sees defer_thread_futex==0
- thread 3 decrements defer_thread_futex
- thread 3 tests defer_thread_futex==-1
- thread 3 calls futex() to wait, expect -1

In this scenario, we have to notice that if threads 1/2 enqueued tasks
to do before checking defer_thread_futex, these tasks would not be seen
by the waiter thread.

So correct memory ordering of:

- wake_up_defer:
  * queue callbacks to perform (1)
  * wake up (2)

- wait_defer:
  * for (;;)
    * wait for futex (3)
    * sleep 100ms (wait for more callbacks to be enqueued)
    * dequeue callbacks, execute them (4)


actually matters. I'll have to be really careful about that (unless we
just accept that tasks to perform could be queued for a while, however,
I'd like to give an upper bound to the delay between batch ...
From: Mathieu Desnoyers
Date: Wednesday, September 23, 2009 - 3:32 pm

I knew I needed to think about it a bit more. Here is the proposed
algorithm hopefully fixing the race identified in the 3rd scenario
above. The idea is to perform the "check for empty queue" between the
&defer_thread_futex decrement and the test in wait_defer. It skips the
futex call and proceed if the list is non-empty.

As I am drilling into the problem, it looks very much like an attempt to
implement efficient wait queues in userspace based on sys_futex().

/*
 * Wake-up any waiting defer thread. Called from many concurrent
 * threads.
 */
static void wake_up_defer(void)
{
        if (unlikely(atomic_read(&defer_thread_futex) == -1)) {
                atomic_set(&defer_thread_futex, 0);
                futex(&defer_thread_futex, FUTEX_WAKE, 0,
                      NULL, NULL, 0);
        }
}

/*
 * Defer thread waiting. Single thread.
 */
static void wait_defer(void)
{
        atomic_dec(&defer_thread_futex);
        smp_mb();       /* Write futex before read queue */
        if (rcu_defer_num_callbacks()) {
                smp_mb();       /* Read queue before write futex */
                /* Callbacks are queued, don't wait. */
                atomic_set(&defer_thread_futex, 0);
        } else {
                smp_rmb();      /* Read queue before read futex */
                if (atomic_read(&defer_thread_futex) == -1)
                        futex(&defer_thread_futex, FUTEX_WAIT, -1,
                              NULL, NULL, 0);
        }
}

- call_rcu():
  * queue callbacks to perform
  * smp_mb()
  * wake_up_defer()

- defer thread:
  * for (;;)
    * wait_defer()
    * sleep 100ms (wait for more callbacks to be enqueued)
    * dequeue callbacks, execute them

The goal here is that if call_rcu() enqueues a callback (even if it
races with defer thread going to sleep), there should not be a
potentially infinite delay before it gets executed. Therefore, being
blocked in sys_futex while there is a callback to execute, without any
hope to be ...
From: Chris Friesen
Date: Wednesday, September 23, 2009 - 4:12 pm

It doesn't seem like the test for the number of callbacks should be
necessary.  I don't see anything like that in the glibc code, nor do I
remember anything like that in the futex sample code.

I'm still not totally convinced that you can avoid race conditions
without using atomic test-and-set or compare-and-exchange.  I haven't
sat down and worked it out completely though.

Chris
--

From: Mathieu Desnoyers
Date: Wednesday, September 23, 2009 - 4:28 pm

The mutex code (and usual futex users) use futex to implement mutual
exclusion.  My goal is to send a wakeup signal to a thread waiting for
work to perform when adding such work. But without any mutual exclusion.

So it is understandable that glibc code or futex sample code does not

Yes.. this is heavily dependent on the states and values which can be
reached. I should probably take time to create a promela model and run
that though the spin model checker to be sure.

Thanks for the comments,


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Mathieu Desnoyers
Date: Saturday, September 26, 2009 - 12:05 am

Just created a Promela model for this. It assumes sequential memory
ordering (so it's a fairly simplified model). Given I added memory
barriers between each operation, it should well represent reality
though.

My algorithm seems to behave as expected: when a callback is added to
the queue, it's not possible to have the waiter thread blocked until the
end of days.

Available at:
http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=blob;f=formal-model/fute...

Thanks,


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Michael Schnell
Date: Monday, September 28, 2009 - 12:11 am

Why is that different ?

if you use a normal FUTEX like pthread_mutex_...() and have the thread
block right at the start (by blocking the *UTEX before creating the
thread). Now you can wake the thread by unblocking the *UTEX.

-Michael

--

From: Michael Schnell
Date: Monday, September 28, 2009 - 3:58 am

From: Michael Schnell
Date: Monday, September 28, 2009 - 4:01 am

It might be that (different from a sema), a FUTEX can only be unlocked
by the process that locked it.

But then why should FUTEX offer advantages over sema, if in that
application the FUTEX is supposed to block with a system call, anyway ?

-Michael
--

From: Paul E. McKenney
Date: Thursday, October 1, 2009 - 7:40 am

The standard approach would be to use pthread_cond_wait() and
pthread_cond_broadcast().  Unfortunately, this would require holding a
pthread_mutex_lock across both operations, which would not necessarily
be so good for wake-up-side scalability.

That said, without this sort of heavy-locking approach, wakeup races
are quite difficult to avoid.

							Thanx, Paul
--

From: Mathieu Desnoyers
Date: Sunday, October 4, 2009 - 7:37 am

The pthread_cond_broadcast() mutex is really a bugger when it comes to
execute it at each rcu_read_unlock(). We could as well use a mutex to

I did a formal model of my futex-based wait/wakeup. The main idea is
that the waiter:

- Set itself to "waiting"
- Checks the "real condition" for which it will wait (e.g. queues empty
  when used for rcu callbacks, no more ongoing old reader thread C.S.
  when used in synchronize_rcu())
- Calls sys_futex if the variable have not changed.

And the waker:
- sets the "real condition" waking up the waiter (enqueuing, or
  rcu_read_unlock())
- check if the waiter must be woken up, if so, wake it up by setting the
  state to "running" and calling sys_futex.

But as you say, wakeup races are difficult (but not impossible!) to
avoid. This is why I resorted to a formal model of the wait/wakeup
scheme to ensure that we cannot end up in a situation where a waker
races with the waiter and does not wake it up when it should. This is
nothing fancy (does not model memory and instruction reordering
automatically), but I figure that memory barriers are required between
almost every steps of this algorithm, so by adding smp_mb() I end up
ensure sequential behavior. I added test cases in the model to ensure
that incorrect memory reordering _would_ cause errors by doing the
reordering by hand in error-injection runs.

The model is available at:
http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=tree;f=futex-wakeup;h=4d...

(this is in the formal-model branch of the urcu tree, futex-wakeup
subdir)

This is modeling this snippet of code :

static int defer_thread_futex;

/*
 * Wake-up any waiting defer thread. Called from many concurrent threads.
 */
static void wake_up_defer(void)
{
        if (unlikely(uatomic_read(&defer_thread_futex) == -1)) {
                uatomic_set(&defer_thread_futex, 0);
                futex(&defer_thread_futex, FUTEX_WAKE, 1,
               ...
From: Paul E. McKenney
Date: Sunday, October 4, 2009 - 1:36 pm

My question is whether pthread_cond_wait() and pthread_cond_broadcast()
can substitute for the raw call to futex.  Unless I am missing something
(which I quite possibly am), the kernel will serialize on the futex
anyway, so serialization in user-mode code does not add much additional

I will take a look after further recovery from jetlag.  Not yet competent
to review this kind of stuff.  Give me a few days.  ;-)

							Thanx, Paul
--

From: Mathieu Desnoyers
Date: Sunday, October 4, 2009 - 2:12 pm

The kernel sys_futex implementation only takes per-bucket spinlocks. So
this is far from the cost of a global mutex in pthread_cond. Moreover,
my scheme does not require to take any mutex in the fast path (when
there is no waiter to wake up), which makes performances appropriate for
use in rcu read-side. It's a simple memory barrier, variable read, test

No problem, thanks for looking at this,


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Mathieu Desnoyers
Date: Monday, October 5, 2009 - 6:22 am

(added some CC's, given the length of the answer. Thanks for asking)  ;)
(Sorry for duplicate, messed up LKML email in original post)


Hrm, your assumption about the common case does not seem to fit my
scenarios. Typically, the wakeup will find the waiter not blocked, and
thus skip the call to sys_futex. Here is why.

I use this scheme in two different implementations:

1) in call_rcu(), to wake up the worker thread after adding work to the
queue.

This worker thread, when woken up, sleeps for a few milliseconds before
starting to dequeue work. Therefore, if the system is relatively busy,
call_rcu() will usually see the worker thread while it's sleeping (and
therefore _not_ waiting on the futex). Also, if work is enqueued while
the worker thread is executing past RCU callbacks, the worker thread
will detect it and won't wait on the futex.

Therefore, this is, by design, a very unlikely event to have call_rcu()
calling sys_futex.

2) in rcu_read_unlock(), to wake up synchronize_rcu() waiting on past
   reader's grace periods.

Here, synchronize_rcu() busy-waits for the reader's G.P. to end, and if
this does not work, after a few attempts (like the pthread mutexes), it
adapts and uses sys_futex. The waker only need to call sys_futex if
there is a synchronize_rcu() currently running which had to call
sys_futex after a few active attempts failed.

As you see, in both cases, the common case, "fast path", is to find the
futex unlocked and not having to take any lock.


Now, about the slow path. I think it's worth discussing too. Indeed,
sys_futex takes a per-address spinlock, which happens to serialize all
sys_futex operations on the wakeup side. Therefore, for wakeup designs
relying on calling sys_futex for wakeup very frequently, this is really
bad.

There might be ways to mitigate this problem though: changing the
sys_futex implementation to use lock-free lists might help.

One advantage of calling sys_futex without holding a userspace mutex is
that the contention ...
From: Mathieu Desnoyers
Date: Monday, October 5, 2009 - 3:21 pm

Hrm, thinking about it, the pthread_cond_wait/pthread_cond_broadcast
scheme I proposed above is racy.

If a wait_gp executes concurrently with wake_up_gp, we can end up doing:

gp_wait = 0

*    wait_gp:       uatomic_dec(&gp_wait);
*    wait_gp:       smp_mb()
*    wait_gp:       if (!num_old_reader_gp()) { (false)
*    wait_gp:       pthread_mutex_lock(&rcumutex);
*    wait_gp:       if (uatomic_read(&gp_wait) == -1) { (true)
* wake_up_gp:  unlikely(uatomic_read(&gp_wait) == -1) { (true)
* wake_up_gp:  uatomic_set(&gp_wait, 0);
* wait_up_gp:  pthread_cond_broadcast(&rcucond);
*    wait_gp:       pthread_cond_wait(&rcucond, &rcumutex);

might wait forever (or until the next wakeup).

We would need an extra mutex in the wakeup, e.g.:

static inline void wake_up_gp(void)
{
	/* just released old reader gp */
	smp_mb();
        if (unlikely(uatomic_read(&gp_wait) == -1)) {
		pthread_mutex_lock(&rcumutex);
                uatomic_set(&gp_wait, 0);
                pthread_cond_broadcast(&rcucond);
		pthread_mutex_unlock(&rcumutex);
        }
}

static void wait_gp(void)
{
	uatomic_dec(&gp_wait);
        smp_mb();
	if (!num_old_reader_gp()) {
        	smp_mb();
		atomic_set(&gp_wait, 0);
		goto unlock;
	}
        /* Read reader_gp before read futex */
        smp_rmb();
	pthread_mutex_lock(&rcumutex);
        if (uatomic_read(&gp_wait) == -1) {
                pthread_cond_wait(&rcucond, &rcumutex);
		pthread_mutex_unlock(&rcumutex);
	}
}

This should work better, but I'll need to do a formal model to convince
myself.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Michael Schnell
Date: Wednesday, October 7, 2009 - 12:22 am

Yep,

Obviously I was misled.

So I now understand that you want to schedule the thread as a result to
some event and the Thread might already be running at that time, so that
it does not need to enter an OS-based wait state.

But to me this looks as if a _counting_ semaphore is needed here (at
least in a more general case), instead of a binary semaphore (such as
FUTEX does manage). Only with a counting semaphore no events are missed
(unless the user software design cares for this with other means, which
might be difficult or impossible).

Of course the fast path of a  user space counting semaphore is easily
doable with atomic_inc() and atomic_dec().

But I only know about two working variants of the user space code for
the (binary) FUTEX. There seem to be some more implementations that do
not work decently.

I have no idea if it's possible to create a counting semaphore in user
space that uses the "futex" syscall (or whatever) for the case the
threads needs to wait.

-Michael

--

Previous thread: Re: module: Speed up symbol resolution during module loading by Alan Jenkins on Wednesday, September 23, 2009 - 9:52 am. (2 messages)

Next thread: mmc_spi: command errors on 2.6.29.6 by Michael Borsuk on Wednesday, September 23, 2009 - 11:01 am. (1 message)