Re: [PATCH 5/6] futex: handle timeout inside adaptive lock spin

Previous thread: [BUG] x86info fails on 2.6.34 ? by Eric Dumazet on Monday, April 5, 2010 - 1:24 pm. (8 messages)

Next thread: [GIT PULL] 9p file system bug fixes for 2.6.34-rc3 by Eric Van Hensbergen on Monday, April 5, 2010 - 1:40 pm. (1 message)
From: Darren Hart
Date: Monday, April 5, 2010 - 1:23 pm

In-Reply-To: 

NOT FOR INCLUSION

The following patch series implements a new experimental kernel side futex mutex
via new FUTEX_LOCK and FUTEX_LOCK_ADAPTIVE futex op codes. The adaptive spin
follows the kernel mutex model of allowing one spinner until the lock is
released or the owner is descheduled. The patch currently allows the user to
specify if they want no spinning, a single adaptive spinner, or multiple
spinners (aggressive adaptive spinning, or aas... which I have mistyped as "ass"
enough times to realize a better term is indeed required :-).

I'm using the futex_lock branch of my futextest test suite to gather results.
The test is performance/futex_lock.c and can be checked out here:

git://git.kernel.org/pub/scm/linux/kernel/git/dvhart/futextest.git

Per Avi's suggestion I added asm instruction based period and duty-cycle options
to do some testing. A series of plots with 256 threads, one per period length,
is shown at the URL below. Each plot compares raw futex lock (normal, no
spinning) with a single spinner (adaptive) and with multiple spinners (aas) for
each of several duty-cycles to determine performance at various levels of
contention. The test measure the number of lock/unlock pairs performed per
second (in thousands).

http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v4/

Avi's suggested starting point was 1000 instructions with a 10% duty cycle.
That plot "Futex Lock/Unlock (Period=1000, Threads=256)" does appear to be the
most interesting of the lot. It's not so short that other overhead becomes the
bottleneck, and not so long so as to make adaptive spinning benefits show up
as noise in the numbers. The adaptive spin (with a single waiter) consistently
beats the normal run, and outperforms aas for most duty-cycles. I rechecked a
few points on this plot to confirm and the relative scores remained consistent.
These plots were generated using 10,000,000 iterations per datapoint.

Lastly, I should mention that these results all ...
From: Darren Hart
Date: Monday, April 5, 2010 - 1:23 pm

The futex_q struct has grown considerably over the last year or so. I
believe it now merits a static initializer to avoid uninitialized data
errors (having just spent more time than I care to admit debugging
an uninitialized q.bitset in an experimental new op code).

I originally planned on following the __THING_INITIALIZER/DECLARE_THING
method, but since we already had FUTEX_KEY_INIT, and I personally prefer
that method, I went that route.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
 kernel/futex.c |   19 +++++++++----------
 1 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ed08cfd..2ae18cd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -130,6 +130,12 @@ struct futex_q {
 	u32 bitset;
 };
 
+#define FUTEX_Q_INIT \
+	{ /* list gets initialized in queue_me()*/ \
+	  .task = NULL, NULL, FUTEX_KEY_INIT \
+	, NULL, NULL, NULL, FUTEX_BITSET_MATCH_ANY }
+
+
 /*
  * Hash buckets are shared by all the futex_keys that hash to the same
  * location.  Each key may have multiple futex_q structures, one for each task
@@ -1799,16 +1805,13 @@ static int futex_wait(u32 __user *uaddr, int flags, u32 val, ktime_t *abs_time,
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct restart_block *restart;
 	struct futex_hash_bucket *hb;
-	struct futex_q q;
+	struct futex_q q = FUTEX_Q_INIT;
 	int ret;
 
 	if (!bitset)
 		return -EINVAL;
 
-	q.pi_state = NULL;
 	q.bitset = bitset;
-	q.rt_waiter = NULL;
-	q.requeue_pi_key = NULL;
 
 	if (abs_time) {
 		to = &timeout;
@@ -1899,7 +1902,7 @@ static int futex_lock_pi(u32 __user *uaddr, int flags, int detect,
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct futex_hash_bucket *hb;
-	struct futex_q q;
+	struct futex_q q = FUTEX_Q_INIT;
 	int res, ret;
 
 	if (refill_pi_state_cache())
@@ -1913,9 +1916,6 @@ static int futex_lock_pi(u32 __user *uaddr, int flags, int detect,
 		hrtimer_set_expires(&to->timer, *time);
 	}
 
-	q.pi_state = NULL;
-	q.rt_waiter ...
From: Darren Hart
Date: Monday, April 5, 2010 - 1:23 pm

In the early days we passed the mmap sem around. That became the
"int fshared" with the fast gup improvements. Then we added
"int clockrt" in places. This patch unifies these options as "flags" and
cleans up various calls which had fshared in their signature but no
longer used it.

V4: Fix an operator precedence bug

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
 kernel/futex.c |  215 +++++++++++++++++++++++++++-----------------------------
 1 files changed, 104 insertions(+), 111 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e7a35f1..ed08cfd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -69,6 +69,14 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
 /*
+ * Futex flags used to encode options to functions and preserve them across
+ * restarts.
+ */
+#define FLAGS_SHARED		0x01
+#define FLAGS_CLOCKRT		0x02
+#define FLAGS_HAS_TIMEOUT	0x04
+
+/*
  * Priority Inheritance state:
  */
 struct futex_pi_state {
@@ -283,7 +291,7 @@ again:
 }
 
 static inline
-void put_futex_key(int fshared, union futex_key *key)
+void put_futex_key(union futex_key *key)
 {
 	drop_futex_key_refs(key);
 }
@@ -878,7 +886,7 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 /*
  * Wake up waiters matching bitset queued on this futex (uaddr).
  */
-static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
+static int futex_wake(u32 __user *uaddr, int flags, int nr_wake, u32 bitset)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -889,7 +897,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
 	if (!bitset)
 		return -EINVAL;
 
-	ret = get_futex_key(uaddr, fshared, &key);
+	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -915,7 +923,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
 	}
 
 ...
From: Darren Hart
Date: Monday, April 5, 2010 - 1:23 pm

Add a non-pi TID value based futex locking mechanism. This enables the
use of adaptive spinning which was problematic with the basic FUTEX_WAIT
operation.

V4: Remove update_futex_waiters() and perform all waiter clearing in
    userspace. Turns out to be generally better to do an O(1) operation in
    userspace that forces one extra syscall than to do an O(N) operation during
    every syscall. Go figure :).
    Cleanup timeout when exiting adaptive loop.
    Perform adaptive spinning after retry:
    General cleanups.
V3: Fix some locking bugs
    Use cmpxchg directly in the adaptive loop
    Move the adaptive loop into its own function
    Use FUTEX_WAIT instead of FUTEX_UNLOCK
    Add update_futex_waiters()
V2: Fix preempt-count breakage

This patch is a forward port of the following patch from Peter Zijlstra

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
 include/linux/futex.h |    6 +
 include/linux/sched.h |    1 +
 kernel/futex.c        |  282 +++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched.c        |   59 ++++++++++
 4 files changed, 314 insertions(+), 34 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 1e5a26d..3ca93d1 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -20,6 +20,8 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_LOCK		13
+#define FUTEX_LOCK_ADAPTIVE	14
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -39,6 +41,9 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PRIVATE		(FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_ADAPTIVE_PRIVATE	(FUTEX_LOCK_ADAPTIVE | \
+					 FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
@@ -177,6 +182,7 @@ union futex_key {
 extern void ...
From: Thomas Gleixner
Subject:
Date: Tuesday, April 6, 2010 - 9:55 am

Hmm. The order is weird. Why don't you do that simpler ?

Get the uval, the tid and the thread_info pointer outside of the
loop. Also task_pid_vnr(current) just needs a one time lookup.

change the loop to do:

       for (;;) {
       	   curval = cmpxchg_futex_value_locked(uaddr, 0, curtid);
  
	   if (!curval)
	      return 1;
	   if ((curval & FUTEX_TID_MASK) != ownertid) {
	      ownertid = curval & FUTEX_TID_MASK;
	      owner = update_owner(ownertid);
	   }

	   if (owner && futex_spin_on_owner(....))
	   .....


  Please make that like the WAIT_BITSET one, which can select the




  Uurg. It's enough to check whether the TID value changed. No need to

Do we really need all this code ? A simple owner->on_cpu (owner needs
to be the task_struct then) would be sufficient to figure that out,
wouldn't it?

Thanks,

	tglx
--

From: Darren Hart
Date: Wednesday, April 7, 2010 - 10:26 am

Eeek. Having the owner in the loop is a good way to negate the benefits
of adaptive spinning by spinning forever (unlikely, but it could
certainly spin across multiple owners). Nice catch.

As for the uval.... I'm not sure what you mean. You get curval below
inside the loop, and there is no "uval" in the my version of the code.

As for the order, I had put the initial spin prior to the cmpxchg to
avoid doing too many cmpxchg's in a row as they are rather expensive.
However, since this is (now) the first opportunity to do try and acquire
the lock atomically after entering the futex syscall, I think you're

Single return point makes instrumentation so much easier. Unless folks


Hrm... at this point the owner has changed... so we should break and go
to sleep, not update the owner and start spinning again. The
futex_spin_on_owner() will detect this and abort, so I'm not seeing the
purpose of the above if() block.





I placed the retry: label above the adaptive spin loop. This way if we 
wake a task and the lock is "stolen" it doesn't just go right back to 
sleep. This should aid in fairness and also performance in less 
contended cases. I didn't think it was worth a "if (first_time_through 


We don't. I'm using it here simply to measure the impact of adaptive 
spinning. Eventually all that will matter is how it stacks up against a 
raw futex_wait/wake mutex implementation like 
pthread_mutex_lock/unlock(). But in the early stages it's nice to be 
able eliminate all other differences other than adaptive spinning.



Cruft left over from the futex_lock_pi() roots. Removed.



OK, we can skip the euid, uid credentials checking by doing a get_user() 

As Peter pointed out in IRC, p->oncpu isn't generic. I'll go trolling 
through the mutex_spin_on_owner() discussions to see if I can determine 
why that's the case.

Thank you for the review.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

--

From: Thomas Gleixner
Date: Wednesday, April 7, 2010 - 12:59 pm

Well, you need a first time lookup of owner and ownertid for which you
need the user space value (uval), but thinking more about it it's not
even necessary. Just initialize ownertid to 0 so it will drop into the


Why ? If the owner has changed and the new owner is running on another


AFAICT p->oncpu is the correct thing to use when CONFIG_SMP=y. All it
needs is a simple accessor function and you can keep all the futex
cruft in futex.c where it belongs.

Thanks,

	tglx
--

From: Darren Hart
Date: Wednesday, April 7, 2010 - 8:25 pm

No need for ownertid at all really. The cmpxchg always tries to go from 
0 to curtid. I've pushed the futex_owner() call outside the loop for a 

That's an interesting question, and I'm not sure what the right answer 
is. The current approach of the adaptive spinning in the kernel is to 
spin until the owner changes or deschedules, then stop and block. The 
idea is that if you didn't get the lock before the owner changed, you 
aren't going to get it in a very short period of time (you have at least 
an entire critical section to wait through plus whatever time you've 
already spent spinning). However, blocking just so another task can spin 
doesn't really make sense either, and makes the lock less fair than it 
could otherwise be.

My goal in starting this is to provide a more intelligent mechanism than 
sched_yield() for userspace to use to determine when to spin and when to 
sleep. The current implementation allows for spinning up until the owner 
changes, deschedules, or the timeslice expires. I believe these are much 
better than spinning for some fixed number of cycles and then yield for 
some unpredictable amount of time until CFS decides to schedule you back in.

Still, the criteria for breaking the spin are something that needs more 

Noted.

Thanks,
-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Peter W. Morreale
Date: Thursday, April 8, 2010 - 4:10 pm

Not only less fair, but potentially could cause starvation, no?  Perhaps
you could see this if you changed your model to allow all contended
tasks to spin instead of just one.  

If a spinning task blocks because of an owner change, and a new task
enters and starts spinning directly after the owner change, at what
point does the original task get woken up?  Its likely that the new
spinner will get the resource next, no?  Rinse/repeat with another task
and the original spinner is starved.  

(Or am I missing something?  My understanding was that unfairness was
built-in to this algo... If so, then the above is a possibility, right?)

Best,


--

From: Darren Hart
Date: Thursday, April 8, 2010 - 10:41 pm

At the time of unlock the owner will have to call FUTEX_WAKE. This task 
will wake and attempt to acquire the lock - it will lose races with 
aclready running contenders. Lock stealing, adaptive spinning, etc are 

Yes it is. These locks are typically used in situations where it is more 
important that some work gets completed than _which_ work gets completed.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Peter W. Morreale
Date: Friday, April 9, 2010 - 6:13 am

nod.  My only point was that if only one can spin, and sleeps when the
owner is not on CPU, then you virtually guarantee worst case performance
under certain conditions (ie: multiple threads coming in at various
times)

If all spin, and they sleep only on owner block, then the 'unfairness'
is potentially more evenly distributed via hardware (and stealing) and
when the owner is blocked. 


Best,


--

From: Darren Hart
Date: Monday, April 5, 2010 - 1:23 pm

Prepare for FUTEX_LOCK by refactoring futex_lock_pi_atomic() into
lock_futex_atomic() and lock_pi_futex_atomic(). The name change is meant to
reflect the naming convention in futex.c that futex_*() functions map directly
to futex op codes and the others are internal helper functions.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>

===================================================================
---
 kernel/futex.c |   79 +++++++++++++++++++++++++++++++++++++------------------
 1 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2ae18cd..8c1bb16 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -625,30 +625,23 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 }
 
 /**
- * futex_lock_pi_atomic() - Atomic work required to acquire a pi aware futex
- * @uaddr:		the pi futex user address
- * @hb:			the pi futex hash bucket
- * @key:		the futex key associated with uaddr and hb
- * @ps:			the pi_state pointer where we store the result of the
- *			lookup
- * @task:		the task to perform the atomic lock work for.  This will
- *			be "current" except in the case of requeue pi.
- * @set_waiters:	force setting the FUTEX_WAITERS bit (1) or not (0)
+ * lock_futex_atomic() - Try to acquire the futex lock atomically
+ * @uaddr:       user address of the futex
+ * @task:        the task to perform the atomic lock work for.
+ * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
+ *
+ * The hb->lock shall be held by the caller.
  *
  * Returns:
  *  0 - ready to wait
  *  1 - acquired the lock
  * <0 - error
- *
- * The hb->lock and futex_key refs shall be held by the caller.
  */
-static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
-				union futex_key *key,
-				struct futex_pi_state **ps,
-				struct task_struct *task, int set_waiters)
+static int lock_futex_atomic(u32 __user *uaddr, struct task_struct *task,
+                             int set_waiters)
 {
-	int ...
From: Darren Hart
Date: Monday, April 5, 2010 - 1:23 pm

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
 kernel/futex.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index c33ac2a..af61dcd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2385,6 +2385,7 @@ out:
 /**
  * trylock_futex_adaptive() - Try to acquire the futex lock in a busy loop
  * @uaddr: the futex user address
+ * @timeout: absolute timeout or NULL if none
  *
  * Try to acquire a futex lock in a loop until the owner changes or the owner
  * is descheduled. To lock the futex, set the value to the current TID.
@@ -2394,10 +2395,11 @@ out:
  *  1 - Futex lock acquired
  * <0 - On error
  */
-static int trylock_futex_adaptive(u32 __user *uaddr)
+static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
 {
 	int ret = 0;
 	u32 curval;
+	ktime_t now;
 
 	for (;;) {
 		struct thread_info *owner;
@@ -2433,6 +2435,22 @@ static int trylock_futex_adaptive(u32 __user *uaddr)
 		if (need_resched())
 			break;
 
+		if (timeout) {
+			now = ktime_get();
+/* FIXME: consider creating ktime_less_than(lhs, rhs) */
+#if (BITS_PER_LONG == 64) || defined(CONFIG_KTIME_SCALAR)
+			if (timeout->tv64 < now.tv64)
+				break;
+#else
+			if (timeout->sec < now.sec ||
+			    (timeout->sec == now.sec &&
+			     timeout->nsec < now.nsec)) {
+				ret = -ETIMEDOUT;
+				break;
+			}
+#endif
+		}
+
 		cpu_relax();
 	}
 	return ret;
@@ -2480,7 +2498,7 @@ retry:
 #ifdef CONFIG_SMP
 	if (flags & FLAGS_ADAPTIVE) {
 		preempt_disable();
-		ret = trylock_futex_adaptive(uaddr);
+		ret = trylock_futex_adaptive(uaddr, time);
 		preempt_enable();
 
 		/* We got the lock. */
@@ -2489,7 +2507,7 @@ retry:
 			goto out;
 		}
 
-		/* We encountered an error, -EFAULT most likely. */
+		/* We encountered an error, -EFAULT or -ETIMEDOUT */
 		if (ret)
 			goto out;
 	}
-- 
1.6.3.3

--

From: Thomas Gleixner
Date: Tuesday, April 6, 2010 - 1:27 am

Hmm. Calling that in every iteration might hurt especially on non

  No need. The .tv64 comparison works in both cases. :)

Thanks,

	tglx
--

From: Darren Hart
Date: Wednesday, April 7, 2010 - 10:31 am

I haven't come across a better alternative since arming the timer before 

Ah, for some reason I was thinking that was only the case if 
CONFIG_KTIME_SCALAR was set. Very nice, thanks.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Gregory Haskins
Date: Wednesday, April 7, 2010 - 11:44 am

>>> On 4/7/2010 at 01:31 PM, in message <4BBCC174.7020409@us.ibm.com>, Darren Hart

Hey Darren,

I remember we tried something similar in early versions of the adaptive locks and this was definitely bad. :(

It ended up putting so much contention on the xtime_lock (IIRC) that it resulted in adaptive locks hurting overall performance verses not using adaptive at all.  Alternative mechanisms employed a hybrid where the inner loops used a pseudo calibrated counter loop, and the outer loop checks periodically against a real clock.  It all plays into "you are burning CPU cycles anyway, so might as well put them to use" theory.  Hacky, yes, but it did relieve the pressure on the time subsystem locks and freed up a _lot_ of performance.  Without this, the concept of timeouts+adaptive was unworkable.  I think Steven ultimately rejected the timeout related patches outright when he merged adaptive to -rt, but I think Sven pulled them into SLERT if you would like a potential code reference to a working solution.

-Greg



--

From: Darren Hart
Date: Wednesday, April 7, 2010 - 4:15 pm

Gregory Haskins wrote:
 >>>> On 4/7/2010 at 01:31 PM, in message <4BBCC174.7020409@us.ibm.com>, 
Darren Hart
 > <dvhltc@us.ibm.com> wrote:
 >> Thomas Gleixner wrote:
 >>> On Mon, 5 Apr 2010, Darren Hart wrote:
 >>>
 >>>> Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
 >
 >>>> +		if (timeout) {
 >>>> +			now = ktime_get();
 >>>   Hmm. Calling that in every iteration might hurt especially on non
 >>>   TSC systems, but well...
 >> I haven't come across a better alternative since arming the timer 
before
 >> setting TASK_INTERRUPTIBLE isn't appropriate.
 >
 > Hey Darren,
 >
 > I remember we tried something similar in early versions of the
 > adaptive locks and this was definitely bad. :(
 >
 > It ended up putting so much contention on the xtime_lock (IIRC) that
 > it resulted in adaptive locks hurting overall performance verses not
 > using adaptive at all.  Alternative mechanisms employed a hybrid where
 > the inner loops used a pseudo calibrated counter loop, and the outer
 > loop checks periodically against a real clock.  It all plays into "you
 > are burning CPU cycles anyway, so might as well put them to use"
 > theory.  Hacky, yes, but it did relieve the pressure on the time
 > subsystem locks and freed up a _lot_ of performance.  Without this,
 > the concept of timeouts+adaptive was unworkable.  I think Steven
 > ultimately rejected the timeout related patches outright when he
 > merged adaptive to -rt, but I think Sven pulled them into SLERT if you
 > would like a potential code reference to a working solution.
 >

Hi Greg,

Thanks for the info! I haven't tested with timeouts yet as I'm still 
struggling to get decent performance out of just plain old adaptive 
right now. I'll keep that in mind when I get to that, and yeah, if the 
plan is to burn CPU cycles, might as well do something constructive :-)

I do feel the timeouts are a necessary feature. Interruptibility may be 
as well, but I'm going to ignore it for the time being...

--
Darren
--

From: Darren Hart
Date: Monday, April 5, 2010 - 1:23 pm

Aggresive adaptive spinning (aas) allows for more than spinner in the
adaptive loop at a given time. In some scenarios this yields better
results than the default single spinner mode.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---
 kernel/futex.c |   21 ++++++++++++++++++---
 1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index af61dcd..ddbd158 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2462,6 +2462,9 @@ static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
  * @flags:  futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, FLAGS_ADAPTIVE, etc.)
  * @detect: detect deadlock (1) or not (0)
  * @time:   absolute timeout
+ * @aas:    aggressive adaptive spinning (1) allow multiple spinners (0) allow
+ *          only one spinner. Only valid in conjunction with the FLAGS_ADAPTIVE
+ *          flag.
  *
  * futex_(un)lock() define a futex value policy and implement a full mutex. The
  * futex value stores the owner's TID or'd with FUTEX_WAITERS and/or
@@ -2476,12 +2479,14 @@ static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
  *  0 - On success
  * <0 - On error
  */
-static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
+static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time,
+                      int aas)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
-	struct futex_hash_bucket *hb;
 	struct futex_q q = FUTEX_Q_INIT;
+	struct futex_hash_bucket *hb;
 	int ret = 0;
+	u32 uval;
 
 	if (refill_pi_state_cache())
 		return -ENOMEM;
@@ -2497,6 +2502,14 @@ static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
 retry:
 #ifdef CONFIG_SMP
 	if (flags & FLAGS_ADAPTIVE) {
+		if (!aas) {
+			ret = get_user(uval, uaddr);
+			if (ret)
+				goto out;
+			if (uval & FUTEX_WAITERS)
+				goto skip_adaptive;
+		}
+
 		preempt_disable();
 		ret = trylock_futex_adaptive(uaddr, time);
 		preempt_enable();
@@ -2512,6 ...
From: Darren Hart
Date: Wednesday, April 7, 2010 - 10:58 pm

To eliminate syscall overhead from the equation, I modified the testcase 
to allow for forcing the syscall on lock(). Doing so cut the 
non-adaptive scores by more than half. The adaptive scores dropped 
accordingly. The relative difference between normal and adaptive 
remained in tact (with my adaptive implementation lagging by 10x). So 
while the syscall overhead does impact the scores, it is not the source 
of the performance issue with the adaptive futex implementation I posted.

The following bits were being used to test for spinners and attempt to 
only allow one spinner. Obviously it failed miserably at that. I found 

Trouble is at this point is there are no more bits in the word to be 
able to have a FUTEX_SPINNER bit. The futex word is the only per-futex 
storage we have, the futex_q is per task.

If we overload the FUTEX_WAITERS bit it will force more futex_wake() 
calls on the unlock() path. It also will effectively disable spinning 
under contention as there are bound to be FUTEX_WAITERS in that case.

Another option I dislike is to forget about robust futexes in 
conjunction with adaptive futexes and overload the FUTEX_OWNER_DIED bit. 
Ulrich mentioned in another mail that "If we have 31 bit TID values 
there isn't enough room for another bit." Since we have two flag bits 
now, I figured TID values were 30 bits. Is there an option to run with 
31 bits or something?

Assuming we all agree that these options are "bad", that leaves us with 
looking for somewhere else to store the information we need, which in 
turn brings us back around to what Avi, Alan, and Ulrich were discussing 
regarding non swappable TLS data and a pointer in the futex value.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Darren Hart
Date: Monday, April 5, 2010 - 1:48 pm

And by V2 I meant V4....

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Avi Kivity
Date: Monday, April 5, 2010 - 2:15 pm

An interesting (but perhaps difficult to achieve) optimization would be 

How many cores (or hardware threads) does this machine have?  At 10% 
duty cycle you have 25 waiters behind the lock on average.  I don't 
think this is realistic, and it means that spinning is invoked only rarely.

I'd be interested in seeing runs where the average number of waiters is 
0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention.  25 
average waiters on compute bound code means the application needs to be 
rewritten, no amount of mutex tweaking will help it.

Does the wakeup code select the spinning waiter, or just a random waiter?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Darren Hart
Date: Monday, April 5, 2010 - 2:54 pm

I couldn't think of a lightweight way to determine when the owner has 
been scheduled out in userspace. Kernel assistance is required. You 
could do this on the schedule() side of things, but I figured I'd get 
some strong pushback if I tried to add a hook into descheduling that 
flipped a bit in the futex value stating the owner was about to 

Sorry, I meant to include that. I tested on an 8 CPU (no hardware 
threads) 2.6 GHz Opteron 2218 (2 QuadCore CPUs) system.


Perhaps some instrumentation is in order, it seems to get invoked enough 
to achieve some 20% increase in lock/unlock iterations. Perhaps another 

Perhaps something NR_CPUS threads would be of more interest? At 10% 
that's about .8 and at 25% the 2 of your upper limit. I could add a few 
more duty-cycle points and make 25% the max. I'll kick that off and post 
the results... probably tomorrow, 10M iterations takes a while, but 

The wakeup code selects the highest priority task in fifo order to 
wake-up - however, under contention it is most likely going to go back 
to sleep as another waiter will steal the lock out from under it. This 
locking strategy is unashamedly about as "unfair" as it gets.


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Avi Kivity
Date: Monday, April 5, 2010 - 3:21 pm

In the futex value it's hopeless (since a thread can hold many locks), 
but I don't think it's unreasonable to set a bit in the thread local 
storage area.  The futex format would then need to be extended to 




Best to avoid the wakeup if we notice the lock was stolen.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Darren Hart
Date: Monday, April 5, 2010 - 3:59 pm

How so? Several real world applications use one thread per CPU to 

Absolutely, I don't disagree that all the variables should vary in order 
to get a complete picture. I'm starting with 8 - it takes several hours 

You really can't do this precisely. You can read the futex value at 
various points along the wakeup path, but at some point you have to 
commit to waking a task, and you still have a race between the time you 
wake_up_task() and when it is scheduled and attempts the cmpxchg itself.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 6:28 am

Yes, but that's the best case for spinning.  You could simply use a 

The race is not a problem as it's just an optimization.  If you lose it 
you get a spurious wakeup, if you win you save some cpu cycles.


-- 
error compiling committee.c: too many arguments to function

--

From: Peter Zijlstra
Date: Tuesday, April 6, 2010 - 6:35 am

Userspace spinlocks are evil.. they should _never_ be used.

--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 6:41 am

But in this case they're fastest.  If we don't provide a non-evil 
alternative, people will use them.

-- 
error compiling committee.c: too many arguments to function

--

From: Peter Zijlstra
Date: Tuesday, April 6, 2010 - 7:09 am

That's what FUTEX_LOCK is about.

--

From: Avi Kivity
Subject:
Date: Tuesday, April 6, 2010 - 9:10 am

That works for the uncontended case.  For the contended case, the waiter 
and the owner have to go into the kernel and back out to transfer 
ownership.  In the non-adaptive case you have to switch to the idle task 
and back as well, and send an IPI.  That's a lot of latency if the 
unlock happened just after the waiter started the descent into the kernel.

-- 
error compiling committee.c: too many arguments to function

--

From: Alan Cox
Subject:
Date: Tuesday, April 6, 2010 - 9:53 am

If you insist on doing it that way yes, but knowing the lock owner is
likely to be away for a while also lets you do things like punt work
either by picking another work package in the meantime, or by queueing
the work you can't do on a list the pre-empted lock owner will review
before dropping the lock.

--

From: Alan Cox
Date: Tuesday, April 6, 2010 - 6:51 am

On Tue, 06 Apr 2010 15:35:31 +0200

Thats a gross and inaccurate simplification. For the case Avi is talking
about spinning in userspace makes sense in a lot of environments. Once
you've got one thread pinned per cpu (or gang scheduling >-) ) there are
various environments where it makes complete and utter sense.
--

From: Darren Hart
Subject:
Date: Tuesday, April 6, 2010 - 8:28 am

Hi Alan,

Do you feel some of these situations would also benefit from some kernel 
assistance to stop spinning when the owner schedules out? Or are you 
saying that there are situations where pure userspace spinlocks will 
always be the best option?

If the latter, I'd think that they would also be situations where 
sched_yield() is not used as part of the spin loop. If so, then these 
are not our target situations for FUTEX_LOCK_ADAPTIVE, which hopes to 
provide a better informed mechanism for making spin or sleep decisions. 
If sleeping isn't part of the locking construct implementation, then 
FUTEX_LOCK_ADAPTIVE doesn't have much to offer.

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Avi Kivity
Subject:
Date: Tuesday, April 6, 2010 - 9:06 am

IMO the best solution is to spin in userspace while the lock holder is 
running, fall into the kernel when it is scheduled out.

-- 
error compiling committee.c: too many arguments to function

--

From: Thomas Gleixner
Subject:
Date: Tuesday, April 6, 2010 - 9:14 am

That's just not realistic as user space has no idea whether the lock
holder is running or not and when it's scheduled out without a syscall :)

Thanks,

	tglx
--

From: Avi Kivity
Subject:
Date: Tuesday, April 6, 2010 - 9:20 am

The kernel could easily expose this information by writing into the 
thread's TLS area.

So:

- the kernel maintains a current_cpu field in a thread's tls
- lock() atomically writes a pointer to the current thread's current_cpu 
when acquiring
- the kernel writes an invalid value to current_cpu when switching out
- a contended lock() retrieves the current_cpu pointer, and spins as 
long as it is a valid cpu

-- 
error compiling committee.c: too many arguments to function

--

From: john cooper
Date: Tuesday, April 6, 2010 - 11:18 pm

There are certainly details to sort through in the packaging
of the mechanism but conceptually that should do the job.
So here the application has chosen a blocking lock as being
the optimal synchronization operation and we're detecting a
scenario where we can factor out the aggregate overhead of two
context switch operations.

There is also the case where the application requires a
polled lock with the rational being the assumed lock
hold/wait time is substantially less than the above context
switch overhead.  But here we're otherwise completely
open to indiscriminate scheduling preemption even though
we may be holding a userland lock.

The adaptive mutex above is an optimization beyond what
is normally expected for the associated model.  The preemption
of a polled lock OTOH can easily inflict latency several orders
of magnitude beyond what is expected in that model.  Two use
cases exist here which IMO aren't related except for the latter
unintentionally degenerating into the former.

-john

-- 
john.cooper@third-harmonic.com
--

From: Darren Hart
Date: Wednesday, April 7, 2010 - 8:33 pm

I didn't intend to change the behavior of an existing blocking call with 
adaptive spinning if that is what you are getting at here. Initially 
there would be a new futex op, something like FUTEX_LOCK_ADAPTIVE or 
maybe just FUTEX_WAIT_ADAPTIVE. Applications can use this directly to 
implement adaptive spinlocks. Ideally glibc would make use of this via 
either the existing adaptive spinning NP API or via a new one. Before we 



Again, my intention is not to replace any existing functionality, so 
applications would have to explicitly request this behavior.

If I'm missing your point, please elaborate.

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: john cooper
Date: Thursday, April 8, 2010 - 10:52 pm

Agreed.  However from your earlier mail it seemed
addressing this scenario was within the scope of
consideration.

There are two ways to deal with this condition, either
reactive in the sense we do so after the lock holder
has been preempted and subsequently find we're spinning
in sibling thread context attempting to acquire the
lock.  Or proactively where we provide a time bounded
deferral of lock holder preemption with the assumption
the lock hold path overhead has negligible effect upon
deferring a potentially coincident scheduling operation.

It is fairly straightforward to demonstrate the impact
to performance with a focused micro benchmark, less so
for a more "typical" application with the effect being
diluted among other app activity.
 
The two approaches are complimentary with differing
system wide tradeoffs.  Both require some insight into
the scheduling disposition of the lock holder, the
preemption deferral mechanism more so.  If a scheme to
expose scheduler state transitions to (or cooperate
with) userland locking primitives is being considered,
it seems opportune to consider support as well for
this scenario.

-john

-- 
john.cooper@redhat.com
--

From: Alan Cox
Subject:
Date: Tuesday, April 6, 2010 - 9:54 am

Which is the real problem that wants addressing and can be addressed very
cheaply. That would bring us up to par with 1970s RTOS environments ;)

Alan
--

From: Thomas Gleixner
Date: Tuesday, April 6, 2010 - 11:15 am

Well, 1970's RTOSes had other features as well like preemption disable
mechanisms and other interesting stuff. I hope you're not going to
propose that next to bring us up to par with those :)

Thanks,

	tglx
--

From: Alan Cox
Subject:
Date: Tuesday, April 6, 2010 - 9:44 am

There are cases its the best option - you are assuming for example that
the owner can get scheduled out. Eg nailing one thread per CPU in some

I am unsure about the approach. As Avi says knowing that the lock owner is
scheduled out allows for far better behaviour. It doesn't need complex
per lock stuff or per lock notifier entries on pre-empt either.

A given task is either pre-empted or not and in the normal case of things
you need this within a process so you've got shared pages anyway. So you
only need one instance of the 'is thread X pre-empted' bit somewhere in a
non swappable page.

That gives you something along the lines of

	runaddr = find_run_flag(lock);
	do {
		while(*runaddr == RUNNING) {
			if (trylock(lock))
				return WHOOPEE;
			cpu relax
		}
		yield (_on(thread));
	} while(*runaddr != DEAD);


which unlike blindly spinning can avoid the worst of any hit on the CPU
power and would be a bit more guided ?

--

From: Ulrich Drepper
Subject:
Date: Tuesday, April 6, 2010 - 10:34 am

There still has to be an upper limit in the number of rounds of the
wait loop )some locks are held for a long time) since otherwise CPUs
are unnecessarily long tied up.  And the DEAD case is only for robust
mutex handling.  But in theory I agree.

We already have the set_tid_address syscall.  This could be
generalized with a new syscall which can provide the kernel with more
than one pointer to store "stuff" in: TIDs, scheduling info, etc.

The non-swappable part will be tricky.  One doesn't know how many
threads will be created in a process.  This mechanism shouldn't put an
arbitrary limit in place.  So where to allocate the memory?  Perhaps
it's better to implicitly mark the memory page pointed to by the new
syscall as non-swappable?  This could mean one page per thread...
--

From: Alan Cox
Date: Saturday, April 10, 2010 - 4:35 pm

You only need one page per 4096 threads or so if you make it create the
page on the first request, tied to say the signals and the like in
threaded tasks, and you then allocate 'slots' in the page for future
calls until you've got 4096 live.

ie you'd see something like

	addr = set_tid_notifier_addr();

	[1st call
	 map page at x to x + 4095, probably R/O]
	[returns x]

next thread
	addr = set_tid_notifier_addr()

	[returns x + 1]

Alan
--

From: Ulrich Drepper
Date: Saturday, April 10, 2010 - 4:53 pm

Very expensive.  Each cache line would be fought over by 64 threads.
Constant RFOs make context switches significantly slower.

At most 4096/64 = 64 threads should share one page.  One page would
still cover almost all processes.
--

From: Thomas Gleixner
Date: Tuesday, April 6, 2010 - 12:31 pm

I fear we might end up with a pinned page per thread to get this
working properly and it restricts the mechanism to process private

That would require a new yield_to_target() syscall, which either
blocks the caller when the target thread is not runnable or returns an

I doubt that the syscall overhead per se is large enough to justify
all of the above horror. We need to figure out a more efficient way to
do the spinning in the kernel where we have all the necessary
information already. Darren's implementation is suboptimal AFAICT.

Thanks,

	tglx
--

From: Ulrich Drepper
Date: Tuesday, April 6, 2010 - 1:02 pm

Really?  The owner information isn't in general available in the
kernel.  Futex operation doesn't require the value used to be the PID
(or negative of the PID).  That is a dramatic limitation of the
usefulness of futexes.

At userlevel there is access to other fields of the data structure
which can contain the owner information.

I would like to see the method using a per-thread pinned page and an
update of a memory location on scheduling.  For benchmarking at least.
 I agree that a sys_yield_to() syscall would be at the very least
useful as well.  But it's useful for other things already.
--

From: Thomas Gleixner
Date: Tuesday, April 6, 2010 - 4:16 pm

I know that you can do any weird stuff with the futex value, but I

The per thread pinned page would be unconditional, right ?

I agree that benchmarking would be interesting, but OTOH I fear that
we open up a huge can of worms with exposing scheduler details and the
related necessary syscalls like sys_yield_to: User space thread
management/scheduling comes to my mind and I hope we agree that we do

Useful for what ?

What are the exact semantics of such a syscall ? 

How does that fit into the various scheduling constraints ?

Thanks,

	tglx
--

From: Darren Hart
Date: Tuesday, April 6, 2010 - 4:36 pm

I believe this comes back to the discussions of a directed yield. The 
idea being that a thread yields its remaining timeslice to a thread of 
it's choosing - usually because the target thread holds a resource the 
yielding thread needs access to. This makes the yield more explicit so 
the yielding thread is more likely to get some benefit out of yielding.

I believe the arguments would be either a TID or a thread group - 
however that is specified. I believe the KVM guys would like to see 
something like this as well - which might be the "other things" referred 
to above.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 10:33 pm

Note, directed yield is something that kvm wants for its own ends.  As 
an internal kernel api, not a syscall, but it's apparently useful.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Darren Hart
Date: Tuesday, April 6, 2010 - 2:22 pm

While this might be of less interest after today's discussion, I 
promised to share the results of a run with 8 threads with a wider 
selection of lower duty-cycles. The results are very poor for adaptive 
and worse for aas (multiple spinners) compared to normal FUTEX_LOCK. As 
Thomas and Peter have pointed out, the implementation is sub-optimal. 
Before abandoning this approach I will see if I can find the bottlenecks 
and simplify the kernel side of things. My impression is that I am doing 
a lot more work in the kernel, especially in the adaptive loop, than is 
really necessary.

Both the 8 and 256 Thread plots can be viewed here:

http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v4/

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Darren Hart
Date: Monday, April 5, 2010 - 4:15 pm

It can, but there is a futex value per lock. If the task_struct had a 
list of held futex locks (as it does for pi futex locks) the 

This appears to be 1 bit per task instead of 1 bit per lock. Also, the 
value is thread-specific... so how would a potential waiter be able to 
determine if the owner of a particular lock was running or not with this 
method?  ... maybe I'm missing some core bit about TLS... are you 
talking about pthread_key_create() and pthread_getspecific() ?

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Chris Wright
Date: Monday, April 5, 2010 - 4:29 pm

You also have a notification scheme (preempt_notifiers will fire on
schedule out).  However, you'd have to register the notifiers from a
non-current context (i.e. on slow path acquire reaching out to lock owner
and registering notifier on lock owner's behalf, which would kind of
defeat the point AFAICT).

thanks,
-chris
--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 6:30 am

You don't want the context switch path to walk a list whose length is 


The lock would need to contain a pointer to the owning task.  Could be 
set with cmpxchg{8,16}b on x86.

-- 
error compiling committee.c: too many arguments to function

--

From: Peter Zijlstra
Date: Tuesday, April 6, 2010 - 1:48 am

System call overhead isn't that large, but if you want to amortize that
you can do a very small spin in userspace and then take the syscall:

  try
  spin
  try
  syscall

like.

Complicating anything beyond that is just not worth it.

--

From: Ulrich Drepper
Subject:
Date: Tuesday, April 6, 2010 - 7:47 am

This is available for a long time in the mutex implementation
(PTHREAD_MUTEX_ADAPTIVE_NP mutex type).  It hasn't show much
improvement if any.  There were some people demanding this support for
as far as I know they are not using it now.  This is adaptive
spinning, learning from previous calls how long to wait.  But it's
still unguided.  There is no way to get information like "the owner
has been descheduled".
--

From: Peter Zijlstra
Subject:
Date: Tuesday, April 6, 2010 - 7:51 am

That's where the FUTEX_LOCK thing comes in, it does all those, the above
was a single spin loop to amortize the syscall overhead.

I wouldn't make it any more complex than a single pause ins, syscalls
are terribly cheap these days.

--

From: Darren Hart
Subject:
Date: Tuesday, April 6, 2010 - 8:33 am

And yet they still seem to have a real impact on the futex_lock 
benchmark. Perhaps I am just still looking at pathological cases, but 
there is a strong correlation between high syscall counts and really low 
iterations per second. Granted this also correlates with lock 
contention. However, when using the same period and duty-cycle I find 
that a locking mechanism that makes significantly fewer syscalls also 
significantly outperforms one that makes more. Kind of handwavy stilly, 
I'll have more numbers this afternoon.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--

From: Peter Zijlstra
Subject:
Date: Tuesday, April 6, 2010 - 8:37 am

Sure, but I'm still not sure why FUTEX_LOCK ends up making more syscalls
than FUTEX_WAIT based locking. Both should only do the syscall when the
lock is contended, both should only ever do 1 syscall per acquire,
right?



--

From: Peter Zijlstra
Subject:
Date: Tuesday, April 6, 2010 - 8:29 am

This really should be able to out-perform a regular pthread_mutex_lock()
we saw a significant performance gain when we added adaptive spins to
the kernel mutex implementation, so I'd expect a gain on the futex one
as well.


--

Previous thread: [BUG] x86info fails on 2.6.34 ? by Eric Dumazet on Monday, April 5, 2010 - 1:24 pm. (8 messages)

Next thread: [GIT PULL] 9p file system bug fixes for 2.6.34-rc3 by Eric Van Hensbergen on Monday, April 5, 2010 - 1:40 pm. (1 message)