In-Reply-To: NOT FOR INCLUSION The following patch series implements a new experimental kernel side futex mutex via new FUTEX_LOCK and FUTEX_LOCK_ADAPTIVE futex op codes. The adaptive spin follows the kernel mutex model of allowing one spinner until the lock is released or the owner is descheduled. The patch currently allows the user to specify if they want no spinning, a single adaptive spinner, or multiple spinners (aggressive adaptive spinning, or aas... which I have mistyped as "ass" enough times to realize a better term is indeed required :-). I'm using the futex_lock branch of my futextest test suite to gather results. The test is performance/futex_lock.c and can be checked out here: git://git.kernel.org/pub/scm/linux/kernel/git/dvhart/futextest.git Per Avi's suggestion I added asm instruction based period and duty-cycle options to do some testing. A series of plots with 256 threads, one per period length, is shown at the URL below. Each plot compares raw futex lock (normal, no spinning) with a single spinner (adaptive) and with multiple spinners (aas) for each of several duty-cycles to determine performance at various levels of contention. The test measure the number of lock/unlock pairs performed per second (in thousands). http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v4/ Avi's suggested starting point was 1000 instructions with a 10% duty cycle. That plot "Futex Lock/Unlock (Period=1000, Threads=256)" does appear to be the most interesting of the lot. It's not so short that other overhead becomes the bottleneck, and not so long so as to make adaptive spinning benefits show up as noise in the numbers. The adaptive spin (with a single waiter) consistently beats the normal run, and outperforms aas for most duty-cycles. I rechecked a few points on this plot to confirm and the relative scores remained consistent. These plots were generated using 10,000,000 iterations per datapoint. Lastly, I should mention that these results all ...
The futex_q struct has grown considerably over the last year or so. I
believe it now merits a static initializer to avoid uninitialized data
errors (having just spent more time than I care to admit debugging
an uninitialized q.bitset in an experimental new op code).
I originally planned on following the __THING_INITIALIZER/DECLARE_THING
method, but since we already had FUTEX_KEY_INIT, and I personally prefer
that method, I went that route.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
kernel/futex.c | 19 +++++++++----------
1 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index ed08cfd..2ae18cd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -130,6 +130,12 @@ struct futex_q {
u32 bitset;
};
+#define FUTEX_Q_INIT \
+ { /* list gets initialized in queue_me()*/ \
+ .task = NULL, NULL, FUTEX_KEY_INIT \
+ , NULL, NULL, NULL, FUTEX_BITSET_MATCH_ANY }
+
+
/*
* Hash buckets are shared by all the futex_keys that hash to the same
* location. Each key may have multiple futex_q structures, one for each task
@@ -1799,16 +1805,13 @@ static int futex_wait(u32 __user *uaddr, int flags, u32 val, ktime_t *abs_time,
struct hrtimer_sleeper timeout, *to = NULL;
struct restart_block *restart;
struct futex_hash_bucket *hb;
- struct futex_q q;
+ struct futex_q q = FUTEX_Q_INIT;
int ret;
if (!bitset)
return -EINVAL;
- q.pi_state = NULL;
q.bitset = bitset;
- q.rt_waiter = NULL;
- q.requeue_pi_key = NULL;
if (abs_time) {
to = &timeout;
@@ -1899,7 +1902,7 @@ static int futex_lock_pi(u32 __user *uaddr, int flags, int detect,
{
struct hrtimer_sleeper timeout, *to = NULL;
struct futex_hash_bucket *hb;
- struct futex_q q;
+ struct futex_q q = FUTEX_Q_INIT;
int res, ret;
if (refill_pi_state_cache())
@@ -1913,9 +1916,6 @@ static int futex_lock_pi(u32 __user *uaddr, int flags, int detect,
hrtimer_set_expires(&to->timer, *time);
}
- q.pi_state = NULL;
- q.rt_waiter ...In the early days we passed the mmap sem around. That became the
"int fshared" with the fast gup improvements. Then we added
"int clockrt" in places. This patch unifies these options as "flags" and
cleans up various calls which had fshared in their signature but no
longer used it.
V4: Fix an operator precedence bug
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
kernel/futex.c | 215 +++++++++++++++++++++++++++-----------------------------
1 files changed, 104 insertions(+), 111 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index e7a35f1..ed08cfd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -69,6 +69,14 @@ int __read_mostly futex_cmpxchg_enabled;
#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
/*
+ * Futex flags used to encode options to functions and preserve them across
+ * restarts.
+ */
+#define FLAGS_SHARED 0x01
+#define FLAGS_CLOCKRT 0x02
+#define FLAGS_HAS_TIMEOUT 0x04
+
+/*
* Priority Inheritance state:
*/
struct futex_pi_state {
@@ -283,7 +291,7 @@ again:
}
static inline
-void put_futex_key(int fshared, union futex_key *key)
+void put_futex_key(union futex_key *key)
{
drop_futex_key_refs(key);
}
@@ -878,7 +886,7 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
/*
* Wake up waiters matching bitset queued on this futex (uaddr).
*/
-static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
+static int futex_wake(u32 __user *uaddr, int flags, int nr_wake, u32 bitset)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -889,7 +897,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
if (!bitset)
return -EINVAL;
- ret = get_futex_key(uaddr, fshared, &key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
if (unlikely(ret != 0))
goto out;
@@ -915,7 +923,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
}
...Add a non-pi TID value based futex locking mechanism. This enables the
use of adaptive spinning which was problematic with the basic FUTEX_WAIT
operation.
V4: Remove update_futex_waiters() and perform all waiter clearing in
userspace. Turns out to be generally better to do an O(1) operation in
userspace that forces one extra syscall than to do an O(N) operation during
every syscall. Go figure :).
Cleanup timeout when exiting adaptive loop.
Perform adaptive spinning after retry:
General cleanups.
V3: Fix some locking bugs
Use cmpxchg directly in the adaptive loop
Move the adaptive loop into its own function
Use FUTEX_WAIT instead of FUTEX_UNLOCK
Add update_futex_waiters()
V2: Fix preempt-count breakage
This patch is a forward port of the following patch from Peter Zijlstra
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
include/linux/futex.h | 6 +
include/linux/sched.h | 1 +
kernel/futex.c | 282 +++++++++++++++++++++++++++++++++++++++++++------
kernel/sched.c | 59 ++++++++++
4 files changed, 314 insertions(+), 34 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 1e5a26d..3ca93d1 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -20,6 +20,8 @@
#define FUTEX_WAKE_BITSET 10
#define FUTEX_WAIT_REQUEUE_PI 11
#define FUTEX_CMP_REQUEUE_PI 12
+#define FUTEX_LOCK 13
+#define FUTEX_LOCK_ADAPTIVE 14
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
@@ -39,6 +41,9 @@
FUTEX_PRIVATE_FLAG)
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PRIVATE (FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_ADAPTIVE_PRIVATE (FUTEX_LOCK_ADAPTIVE | \
+ FUTEX_PRIVATE_FLAG)
/*
* Support for robust futexes: the kernel cleans up held futexes at
@@ -177,6 +182,7 @@ union futex_key {
extern void ...Hmm. The order is weird. Why don't you do that simpler ?
Get the uval, the tid and the thread_info pointer outside of the
loop. Also task_pid_vnr(current) just needs a one time lookup.
change the loop to do:
for (;;) {
curval = cmpxchg_futex_value_locked(uaddr, 0, curtid);
if (!curval)
return 1;
if ((curval & FUTEX_TID_MASK) != ownertid) {
ownertid = curval & FUTEX_TID_MASK;
owner = update_owner(ownertid);
}
if (owner && futex_spin_on_owner(....))
.....
Please make that like the WAIT_BITSET one, which can select the
Uurg. It's enough to check whether the TID value changed. No need to
Do we really need all this code ? A simple owner->on_cpu (owner needs
to be the task_struct then) would be sufficient to figure that out,
wouldn't it?
Thanks,
tglx
--
Eeek. Having the owner in the loop is a good way to negate the benefits of adaptive spinning by spinning forever (unlikely, but it could certainly spin across multiple owners). Nice catch. As for the uval.... I'm not sure what you mean. You get curval below inside the loop, and there is no "uval" in the my version of the code. As for the order, I had put the initial spin prior to the cmpxchg to avoid doing too many cmpxchg's in a row as they are rather expensive. However, since this is (now) the first opportunity to do try and acquire the lock atomically after entering the futex syscall, I think you're Single return point makes instrumentation so much easier. Unless folks Hrm... at this point the owner has changed... so we should break and go to sleep, not update the owner and start spinning again. The futex_spin_on_owner() will detect this and abort, so I'm not seeing the purpose of the above if() block. I placed the retry: label above the adaptive spin loop. This way if we wake a task and the lock is "stolen" it doesn't just go right back to sleep. This should aid in fairness and also performance in less contended cases. I didn't think it was worth a "if (first_time_through We don't. I'm using it here simply to measure the impact of adaptive spinning. Eventually all that will matter is how it stacks up against a raw futex_wait/wake mutex implementation like pthread_mutex_lock/unlock(). But in the early stages it's nice to be able eliminate all other differences other than adaptive spinning. Cruft left over from the futex_lock_pi() roots. Removed. OK, we can skip the euid, uid credentials checking by doing a get_user() As Peter pointed out in IRC, p->oncpu isn't generic. I'll go trolling through the mutex_spin_on_owner() discussions to see if I can determine why that's the case. Thank you for the review. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
Well, you need a first time lookup of owner and ownertid for which you need the user space value (uval), but thinking more about it it's not even necessary. Just initialize ownertid to 0 so it will drop into the Why ? If the owner has changed and the new owner is running on another AFAICT p->oncpu is the correct thing to use when CONFIG_SMP=y. All it needs is a simple accessor function and you can keep all the futex cruft in futex.c where it belongs. Thanks, tglx --
No need for ownertid at all really. The cmpxchg always tries to go from 0 to curtid. I've pushed the futex_owner() call outside the loop for a That's an interesting question, and I'm not sure what the right answer is. The current approach of the adaptive spinning in the kernel is to spin until the owner changes or deschedules, then stop and block. The idea is that if you didn't get the lock before the owner changed, you aren't going to get it in a very short period of time (you have at least an entire critical section to wait through plus whatever time you've already spent spinning). However, blocking just so another task can spin doesn't really make sense either, and makes the lock less fair than it could otherwise be. My goal in starting this is to provide a more intelligent mechanism than sched_yield() for userspace to use to determine when to spin and when to sleep. The current implementation allows for spinning up until the owner changes, deschedules, or the timeslice expires. I believe these are much better than spinning for some fixed number of cycles and then yield for some unpredictable amount of time until CFS decides to schedule you back in. Still, the criteria for breaking the spin are something that needs more Noted. Thanks, -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
Not only less fair, but potentially could cause starvation, no? Perhaps you could see this if you changed your model to allow all contended tasks to spin instead of just one. If a spinning task blocks because of an owner change, and a new task enters and starts spinning directly after the owner change, at what point does the original task get woken up? Its likely that the new spinner will get the resource next, no? Rinse/repeat with another task and the original spinner is starved. (Or am I missing something? My understanding was that unfairness was built-in to this algo... If so, then the above is a possibility, right?) Best, --
At the time of unlock the owner will have to call FUTEX_WAKE. This task will wake and attempt to acquire the lock - it will lose races with aclready running contenders. Lock stealing, adaptive spinning, etc are Yes it is. These locks are typically used in situations where it is more important that some work gets completed than _which_ work gets completed. Thanks, -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
nod. My only point was that if only one can spin, and sleeps when the owner is not on CPU, then you virtually guarantee worst case performance under certain conditions (ie: multiple threads coming in at various times) If all spin, and they sleep only on owner block, then the 'unfairness' is potentially more evenly distributed via hardware (and stealing) and when the owner is blocked. Best, --
Prepare for FUTEX_LOCK by refactoring futex_lock_pi_atomic() into
lock_futex_atomic() and lock_pi_futex_atomic(). The name change is meant to
reflect the naming convention in futex.c that futex_*() functions map directly
to futex op codes and the others are internal helper functions.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
===================================================================
---
kernel/futex.c | 79 +++++++++++++++++++++++++++++++++++++------------------
1 files changed, 53 insertions(+), 26 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 2ae18cd..8c1bb16 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -625,30 +625,23 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
}
/**
- * futex_lock_pi_atomic() - Atomic work required to acquire a pi aware futex
- * @uaddr: the pi futex user address
- * @hb: the pi futex hash bucket
- * @key: the futex key associated with uaddr and hb
- * @ps: the pi_state pointer where we store the result of the
- * lookup
- * @task: the task to perform the atomic lock work for. This will
- * be "current" except in the case of requeue pi.
- * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
+ * lock_futex_atomic() - Try to acquire the futex lock atomically
+ * @uaddr: user address of the futex
+ * @task: the task to perform the atomic lock work for.
+ * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
+ *
+ * The hb->lock shall be held by the caller.
*
* Returns:
* 0 - ready to wait
* 1 - acquired the lock
* <0 - error
- *
- * The hb->lock and futex_key refs shall be held by the caller.
*/
-static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
- union futex_key *key,
- struct futex_pi_state **ps,
- struct task_struct *task, int set_waiters)
+static int lock_futex_atomic(u32 __user *uaddr, struct task_struct *task,
+ int set_waiters)
{
- int ...Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
kernel/futex.c | 24 +++++++++++++++++++++---
1 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index c33ac2a..af61dcd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2385,6 +2385,7 @@ out:
/**
* trylock_futex_adaptive() - Try to acquire the futex lock in a busy loop
* @uaddr: the futex user address
+ * @timeout: absolute timeout or NULL if none
*
* Try to acquire a futex lock in a loop until the owner changes or the owner
* is descheduled. To lock the futex, set the value to the current TID.
@@ -2394,10 +2395,11 @@ out:
* 1 - Futex lock acquired
* <0 - On error
*/
-static int trylock_futex_adaptive(u32 __user *uaddr)
+static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
{
int ret = 0;
u32 curval;
+ ktime_t now;
for (;;) {
struct thread_info *owner;
@@ -2433,6 +2435,22 @@ static int trylock_futex_adaptive(u32 __user *uaddr)
if (need_resched())
break;
+ if (timeout) {
+ now = ktime_get();
+/* FIXME: consider creating ktime_less_than(lhs, rhs) */
+#if (BITS_PER_LONG == 64) || defined(CONFIG_KTIME_SCALAR)
+ if (timeout->tv64 < now.tv64)
+ break;
+#else
+ if (timeout->sec < now.sec ||
+ (timeout->sec == now.sec &&
+ timeout->nsec < now.nsec)) {
+ ret = -ETIMEDOUT;
+ break;
+ }
+#endif
+ }
+
cpu_relax();
}
return ret;
@@ -2480,7 +2498,7 @@ retry:
#ifdef CONFIG_SMP
if (flags & FLAGS_ADAPTIVE) {
preempt_disable();
- ret = trylock_futex_adaptive(uaddr);
+ ret = trylock_futex_adaptive(uaddr, time);
preempt_enable();
/* We got the lock. */
@@ -2489,7 +2507,7 @@ retry:
goto out;
}
- /* We encountered an error, -EFAULT most likely. */
+ /* We encountered an error, -EFAULT or -ETIMEDOUT */
if (ret)
goto out;
}
--
1.6.3.3
--
Hmm. Calling that in every iteration might hurt especially on non No need. The .tv64 comparison works in both cases. :) Thanks, tglx --
I haven't come across a better alternative since arming the timer before Ah, for some reason I was thinking that was only the case if CONFIG_KTIME_SCALAR was set. Very nice, thanks. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
>>> On 4/7/2010 at 01:31 PM, in message <4BBCC174.7020409@us.ibm.com>, Darren Hart Hey Darren, I remember we tried something similar in early versions of the adaptive locks and this was definitely bad. :( It ended up putting so much contention on the xtime_lock (IIRC) that it resulted in adaptive locks hurting overall performance verses not using adaptive at all. Alternative mechanisms employed a hybrid where the inner loops used a pseudo calibrated counter loop, and the outer loop checks periodically against a real clock. It all plays into "you are burning CPU cycles anyway, so might as well put them to use" theory. Hacky, yes, but it did relieve the pressure on the time subsystem locks and freed up a _lot_ of performance. Without this, the concept of timeouts+adaptive was unworkable. I think Steven ultimately rejected the timeout related patches outright when he merged adaptive to -rt, but I think Sven pulled them into SLERT if you would like a potential code reference to a working solution. -Greg --
Gregory Haskins wrote:
>>>> On 4/7/2010 at 01:31 PM, in message <4BBCC174.7020409@us.ibm.com>,
Darren Hart
> <dvhltc@us.ibm.com> wrote:
>> Thomas Gleixner wrote:
>>> On Mon, 5 Apr 2010, Darren Hart wrote:
>>>
>>>> Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
>
>>>> + if (timeout) {
>>>> + now = ktime_get();
>>> Hmm. Calling that in every iteration might hurt especially on non
>>> TSC systems, but well...
>> I haven't come across a better alternative since arming the timer
before
>> setting TASK_INTERRUPTIBLE isn't appropriate.
>
> Hey Darren,
>
> I remember we tried something similar in early versions of the
> adaptive locks and this was definitely bad. :(
>
> It ended up putting so much contention on the xtime_lock (IIRC) that
> it resulted in adaptive locks hurting overall performance verses not
> using adaptive at all. Alternative mechanisms employed a hybrid where
> the inner loops used a pseudo calibrated counter loop, and the outer
> loop checks periodically against a real clock. It all plays into "you
> are burning CPU cycles anyway, so might as well put them to use"
> theory. Hacky, yes, but it did relieve the pressure on the time
> subsystem locks and freed up a _lot_ of performance. Without this,
> the concept of timeouts+adaptive was unworkable. I think Steven
> ultimately rejected the timeout related patches outright when he
> merged adaptive to -rt, but I think Sven pulled them into SLERT if you
> would like a potential code reference to a working solution.
>
Hi Greg,
Thanks for the info! I haven't tested with timeouts yet as I'm still
struggling to get decent performance out of just plain old adaptive
right now. I'll keep that in mind when I get to that, and yeah, if the
plan is to burn CPU cycles, might as well do something constructive :-)
I do feel the timeouts are a necessary feature. Interruptibility may be
as well, but I'm going to ignore it for the time being...
--
Darren
--
Aggresive adaptive spinning (aas) allows for more than spinner in the
adaptive loop at a given time. In some scenarios this yields better
results than the default single spinner mode.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
kernel/futex.c | 21 ++++++++++++++++++---
1 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index af61dcd..ddbd158 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2462,6 +2462,9 @@ static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
* @flags: futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, FLAGS_ADAPTIVE, etc.)
* @detect: detect deadlock (1) or not (0)
* @time: absolute timeout
+ * @aas: aggressive adaptive spinning (1) allow multiple spinners (0) allow
+ * only one spinner. Only valid in conjunction with the FLAGS_ADAPTIVE
+ * flag.
*
* futex_(un)lock() define a futex value policy and implement a full mutex. The
* futex value stores the owner's TID or'd with FUTEX_WAITERS and/or
@@ -2476,12 +2479,14 @@ static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
* 0 - On success
* <0 - On error
*/
-static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
+static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time,
+ int aas)
{
struct hrtimer_sleeper timeout, *to = NULL;
- struct futex_hash_bucket *hb;
struct futex_q q = FUTEX_Q_INIT;
+ struct futex_hash_bucket *hb;
int ret = 0;
+ u32 uval;
if (refill_pi_state_cache())
return -ENOMEM;
@@ -2497,6 +2502,14 @@ static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
retry:
#ifdef CONFIG_SMP
if (flags & FLAGS_ADAPTIVE) {
+ if (!aas) {
+ ret = get_user(uval, uaddr);
+ if (ret)
+ goto out;
+ if (uval & FUTEX_WAITERS)
+ goto skip_adaptive;
+ }
+
preempt_disable();
ret = trylock_futex_adaptive(uaddr, time);
preempt_enable();
@@ -2512,6 ...To eliminate syscall overhead from the equation, I modified the testcase to allow for forcing the syscall on lock(). Doing so cut the non-adaptive scores by more than half. The adaptive scores dropped accordingly. The relative difference between normal and adaptive remained in tact (with my adaptive implementation lagging by 10x). So while the syscall overhead does impact the scores, it is not the source of the performance issue with the adaptive futex implementation I posted. The following bits were being used to test for spinners and attempt to only allow one spinner. Obviously it failed miserably at that. I found Trouble is at this point is there are no more bits in the word to be able to have a FUTEX_SPINNER bit. The futex word is the only per-futex storage we have, the futex_q is per task. If we overload the FUTEX_WAITERS bit it will force more futex_wake() calls on the unlock() path. It also will effectively disable spinning under contention as there are bound to be FUTEX_WAITERS in that case. Another option I dislike is to forget about robust futexes in conjunction with adaptive futexes and overload the FUTEX_OWNER_DIED bit. Ulrich mentioned in another mail that "If we have 31 bit TID values there isn't enough room for another bit." Since we have two flag bits now, I figured TID values were 30 bits. Is there an option to run with 31 bits or something? Assuming we all agree that these options are "bad", that leaves us with looking for somewhere else to store the information we need, which in turn brings us back around to what Avi, Alan, and Ulrich were discussing regarding non swappable TLS data and a pointer in the futex value. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
And by V2 I meant V4.... -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
An interesting (but perhaps difficult to achieve) optimization would be How many cores (or hardware threads) does this machine have? At 10% duty cycle you have 25 waiters behind the lock on average. I don't think this is realistic, and it means that spinning is invoked only rarely. I'd be interested in seeing runs where the average number of waiters is 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention. 25 average waiters on compute bound code means the application needs to be rewritten, no amount of mutex tweaking will help it. Does the wakeup code select the spinning waiter, or just a random waiter? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I couldn't think of a lightweight way to determine when the owner has been scheduled out in userspace. Kernel assistance is required. You could do this on the schedule() side of things, but I figured I'd get some strong pushback if I tried to add a hook into descheduling that flipped a bit in the futex value stating the owner was about to Sorry, I meant to include that. I tested on an 8 CPU (no hardware threads) 2.6 GHz Opteron 2218 (2 QuadCore CPUs) system. Perhaps some instrumentation is in order, it seems to get invoked enough to achieve some 20% increase in lock/unlock iterations. Perhaps another Perhaps something NR_CPUS threads would be of more interest? At 10% that's about .8 and at 25% the 2 of your upper limit. I could add a few more duty-cycle points and make 25% the max. I'll kick that off and post the results... probably tomorrow, 10M iterations takes a while, but The wakeup code selects the highest priority task in fifo order to wake-up - however, under contention it is most likely going to go back to sleep as another waiter will steal the lock out from under it. This locking strategy is unashamedly about as "unfair" as it gets. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
In the futex value it's hopeless (since a thread can hold many locks), but I don't think it's unreasonable to set a bit in the thread local storage area. The futex format would then need to be extended to Best to avoid the wakeup if we notice the lock was stolen. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
How so? Several real world applications use one thread per CPU to Absolutely, I don't disagree that all the variables should vary in order to get a complete picture. I'm starting with 8 - it takes several hours You really can't do this precisely. You can read the futex value at various points along the wakeup path, but at some point you have to commit to waking a task, and you still have a race between the time you wake_up_task() and when it is scheduled and attempts the cmpxchg itself. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
Yes, but that's the best case for spinning. You could simply use a The race is not a problem as it's just an optimization. If you lose it you get a spurious wakeup, if you win you save some cpu cycles. -- error compiling committee.c: too many arguments to function --
Userspace spinlocks are evil.. they should _never_ be used. --
But in this case they're fastest. If we don't provide a non-evil alternative, people will use them. -- error compiling committee.c: too many arguments to function --
That works for the uncontended case. For the contended case, the waiter and the owner have to go into the kernel and back out to transfer ownership. In the non-adaptive case you have to switch to the idle task and back as well, and send an IPI. That's a lot of latency if the unlock happened just after the waiter started the descent into the kernel. -- error compiling committee.c: too many arguments to function --
If you insist on doing it that way yes, but knowing the lock owner is likely to be away for a while also lets you do things like punt work either by picking another work package in the meantime, or by queueing the work you can't do on a list the pre-empted lock owner will review before dropping the lock. --
On Tue, 06 Apr 2010 15:35:31 +0200 Thats a gross and inaccurate simplification. For the case Avi is talking about spinning in userspace makes sense in a lot of environments. Once you've got one thread pinned per cpu (or gang scheduling >-) ) there are various environments where it makes complete and utter sense. --
Hi Alan, Do you feel some of these situations would also benefit from some kernel assistance to stop spinning when the owner schedules out? Or are you saying that there are situations where pure userspace spinlocks will always be the best option? If the latter, I'd think that they would also be situations where sched_yield() is not used as part of the spin loop. If so, then these are not our target situations for FUTEX_LOCK_ADAPTIVE, which hopes to provide a better informed mechanism for making spin or sleep decisions. If sleeping isn't part of the locking construct implementation, then FUTEX_LOCK_ADAPTIVE doesn't have much to offer. Thanks, -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
IMO the best solution is to spin in userspace while the lock holder is running, fall into the kernel when it is scheduled out. -- error compiling committee.c: too many arguments to function --
That's just not realistic as user space has no idea whether the lock holder is running or not and when it's scheduled out without a syscall :) Thanks, tglx --
The kernel could easily expose this information by writing into the thread's TLS area. So: - the kernel maintains a current_cpu field in a thread's tls - lock() atomically writes a pointer to the current thread's current_cpu when acquiring - the kernel writes an invalid value to current_cpu when switching out - a contended lock() retrieves the current_cpu pointer, and spins as long as it is a valid cpu -- error compiling committee.c: too many arguments to function --
There are certainly details to sort through in the packaging of the mechanism but conceptually that should do the job. So here the application has chosen a blocking lock as being the optimal synchronization operation and we're detecting a scenario where we can factor out the aggregate overhead of two context switch operations. There is also the case where the application requires a polled lock with the rational being the assumed lock hold/wait time is substantially less than the above context switch overhead. But here we're otherwise completely open to indiscriminate scheduling preemption even though we may be holding a userland lock. The adaptive mutex above is an optimization beyond what is normally expected for the associated model. The preemption of a polled lock OTOH can easily inflict latency several orders of magnitude beyond what is expected in that model. Two use cases exist here which IMO aren't related except for the latter unintentionally degenerating into the former. -john -- john.cooper@third-harmonic.com --
I didn't intend to change the behavior of an existing blocking call with adaptive spinning if that is what you are getting at here. Initially there would be a new futex op, something like FUTEX_LOCK_ADAPTIVE or maybe just FUTEX_WAIT_ADAPTIVE. Applications can use this directly to implement adaptive spinlocks. Ideally glibc would make use of this via either the existing adaptive spinning NP API or via a new one. Before we Again, my intention is not to replace any existing functionality, so applications would have to explicitly request this behavior. If I'm missing your point, please elaborate. Thanks, -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
Agreed. However from your earlier mail it seemed addressing this scenario was within the scope of consideration. There are two ways to deal with this condition, either reactive in the sense we do so after the lock holder has been preempted and subsequently find we're spinning in sibling thread context attempting to acquire the lock. Or proactively where we provide a time bounded deferral of lock holder preemption with the assumption the lock hold path overhead has negligible effect upon deferring a potentially coincident scheduling operation. It is fairly straightforward to demonstrate the impact to performance with a focused micro benchmark, less so for a more "typical" application with the effect being diluted among other app activity. The two approaches are complimentary with differing system wide tradeoffs. Both require some insight into the scheduling disposition of the lock holder, the preemption deferral mechanism more so. If a scheme to expose scheduler state transitions to (or cooperate with) userland locking primitives is being considered, it seems opportune to consider support as well for this scenario. -john -- john.cooper@redhat.com --
Which is the real problem that wants addressing and can be addressed very cheaply. That would bring us up to par with 1970s RTOS environments ;) Alan --
Well, 1970's RTOSes had other features as well like preemption disable mechanisms and other interesting stuff. I hope you're not going to propose that next to bring us up to par with those :) Thanks, tglx --
There are cases its the best option - you are assuming for example that
the owner can get scheduled out. Eg nailing one thread per CPU in some
I am unsure about the approach. As Avi says knowing that the lock owner is
scheduled out allows for far better behaviour. It doesn't need complex
per lock stuff or per lock notifier entries on pre-empt either.
A given task is either pre-empted or not and in the normal case of things
you need this within a process so you've got shared pages anyway. So you
only need one instance of the 'is thread X pre-empted' bit somewhere in a
non swappable page.
That gives you something along the lines of
runaddr = find_run_flag(lock);
do {
while(*runaddr == RUNNING) {
if (trylock(lock))
return WHOOPEE;
cpu relax
}
yield (_on(thread));
} while(*runaddr != DEAD);
which unlike blindly spinning can avoid the worst of any hit on the CPU
power and would be a bit more guided ?
--
There still has to be an upper limit in the number of rounds of the wait loop )some locks are held for a long time) since otherwise CPUs are unnecessarily long tied up. And the DEAD case is only for robust mutex handling. But in theory I agree. We already have the set_tid_address syscall. This could be generalized with a new syscall which can provide the kernel with more than one pointer to store "stuff" in: TIDs, scheduling info, etc. The non-swappable part will be tricky. One doesn't know how many threads will be created in a process. This mechanism shouldn't put an arbitrary limit in place. So where to allocate the memory? Perhaps it's better to implicitly mark the memory page pointed to by the new syscall as non-swappable? This could mean one page per thread... --
You only need one page per 4096 threads or so if you make it create the page on the first request, tied to say the signals and the like in threaded tasks, and you then allocate 'slots' in the page for future calls until you've got 4096 live. ie you'd see something like addr = set_tid_notifier_addr(); [1st call map page at x to x + 4095, probably R/O] [returns x] next thread addr = set_tid_notifier_addr() [returns x + 1] Alan --
Very expensive. Each cache line would be fought over by 64 threads. Constant RFOs make context switches significantly slower. At most 4096/64 = 64 threads should share one page. One page would still cover almost all processes. --
I fear we might end up with a pinned page per thread to get this working properly and it restricts the mechanism to process private That would require a new yield_to_target() syscall, which either blocks the caller when the target thread is not runnable or returns an I doubt that the syscall overhead per se is large enough to justify all of the above horror. We need to figure out a more efficient way to do the spinning in the kernel where we have all the necessary information already. Darren's implementation is suboptimal AFAICT. Thanks, tglx --
Really? The owner information isn't in general available in the kernel. Futex operation doesn't require the value used to be the PID (or negative of the PID). That is a dramatic limitation of the usefulness of futexes. At userlevel there is access to other fields of the data structure which can contain the owner information. I would like to see the method using a per-thread pinned page and an update of a memory location on scheduling. For benchmarking at least. I agree that a sys_yield_to() syscall would be at the very least useful as well. But it's useful for other things already. --
I know that you can do any weird stuff with the futex value, but I The per thread pinned page would be unconditional, right ? I agree that benchmarking would be interesting, but OTOH I fear that we open up a huge can of worms with exposing scheduler details and the related necessary syscalls like sys_yield_to: User space thread management/scheduling comes to my mind and I hope we agree that we do Useful for what ? What are the exact semantics of such a syscall ? How does that fit into the various scheduling constraints ? Thanks, tglx --
I believe this comes back to the discussions of a directed yield. The idea being that a thread yields its remaining timeslice to a thread of it's choosing - usually because the target thread holds a resource the yielding thread needs access to. This makes the yield more explicit so the yielding thread is more likely to get some benefit out of yielding. I believe the arguments would be either a TID or a thread group - however that is specified. I believe the KVM guys would like to see something like this as well - which might be the "other things" referred to above. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
Note, directed yield is something that kvm wants for its own ends. As an internal kernel api, not a syscall, but it's apparently useful. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
While this might be of less interest after today's discussion, I promised to share the results of a run with 8 threads with a wider selection of lower duty-cycles. The results are very poor for adaptive and worse for aas (multiple spinners) compared to normal FUTEX_LOCK. As Thomas and Peter have pointed out, the implementation is sub-optimal. Before abandoning this approach I will see if I can find the bottlenecks and simplify the kernel side of things. My impression is that I am doing a lot more work in the kernel, especially in the adaptive loop, than is really necessary. Both the 8 and 256 Thread plots can be viewed here: http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v4/ -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
It can, but there is a futex value per lock. If the task_struct had a list of held futex locks (as it does for pi futex locks) the This appears to be 1 bit per task instead of 1 bit per lock. Also, the value is thread-specific... so how would a potential waiter be able to determine if the owner of a particular lock was running or not with this method? ... maybe I'm missing some core bit about TLS... are you talking about pthread_key_create() and pthread_getspecific() ? Thanks, -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
You also have a notification scheme (preempt_notifiers will fire on schedule out). However, you'd have to register the notifiers from a non-current context (i.e. on slow path acquire reaching out to lock owner and registering notifier on lock owner's behalf, which would kind of defeat the point AFAICT). thanks, -chris --
You don't want the context switch path to walk a list whose length is
The lock would need to contain a pointer to the owning task. Could be
set with cmpxchg{8,16}b on x86.
--
error compiling committee.c: too many arguments to function
--
System call overhead isn't that large, but if you want to amortize that you can do a very small spin in userspace and then take the syscall: try spin try syscall like. Complicating anything beyond that is just not worth it. --
This is available for a long time in the mutex implementation (PTHREAD_MUTEX_ADAPTIVE_NP mutex type). It hasn't show much improvement if any. There were some people demanding this support for as far as I know they are not using it now. This is adaptive spinning, learning from previous calls how long to wait. But it's still unguided. There is no way to get information like "the owner has been descheduled". --
That's where the FUTEX_LOCK thing comes in, it does all those, the above was a single spin loop to amortize the syscall overhead. I wouldn't make it any more complex than a single pause ins, syscalls are terribly cheap these days. --
And yet they still seem to have a real impact on the futex_lock benchmark. Perhaps I am just still looking at pathological cases, but there is a strong correlation between high syscall counts and really low iterations per second. Granted this also correlates with lock contention. However, when using the same period and duty-cycle I find that a locking mechanism that makes significantly fewer syscalls also significantly outperforms one that makes more. Kind of handwavy stilly, I'll have more numbers this afternoon. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team --
Sure, but I'm still not sure why FUTEX_LOCK ends up making more syscalls than FUTEX_WAIT based locking. Both should only do the syscall when the lock is contended, both should only ever do 1 syscall per acquire, right? --
This really should be able to out-perform a regular pthread_mutex_lock() we saw a significant performance gain when we added adaptive spins to the kernel mutex implementation, so I'd expect a gain on the futex one as well. --
