>>> On 4/1/2010 at 01:15 AM, in message <4BB42C04.7090308@us.ibm.com>, Darren Hart
<dvhltc@us.ibm.com> wrote:
Hi Darren,
Part of the magic of adaptive locks is the avoidance of the sleep path, and part of it is using the SMP resources in what is effectively a distributed search (e.g. having N cpus actively looking for when they can acquire, versus going to sleep and having the scheduler wake them up later).
I haven't seen your algorithm, but if it's not simply trading the sleep path for searching for acquisition+lockowner changes, that may be a reason for an observed regression. For instance, if you have to do substantial work to get to the adaptive part of the algorithm, there could be a disproportionate amount of overhead in selecting this path.
The whole premise is predicated on the following sequence chart:
non-adaptive:
time | threadA             | threadB
----------------------------------------
     | lock(X) (granted)   |
 t0  |                     | lock(X) (contended)
     |                     | add_wait_list(X, B)
     | unlock(X)           |
     | grant(X, B)         |
     | wake_up(B)          |
     |                     | schedule()
     |                     |    |
     |                     |    |
     |                     |    V
 t1  |                     | schedule() returns
     |                     | (returns from lock())
adaptive:
time | threadA             | threadB
----------------------------------------
     | lock(X) (granted)   |
 t0  |                     | lock(X) (contended)
     | unlock(X)           |
     | grant(X, B)         |
     |                     | while(!is_granted(X, B) && is_running(lock_owner(X)))
 t1  |                     | (returns from lock())
The idea is that the t0-t1 interval is shorter in the adaptive case (at least most of the time), which is why the spinning doesn't end up hurting us (the cpu would otherwise be busy in the schedule() code anyway). This is the win-win scenario for adaptive, and it is generally the case with short-hold locks (as the spinlocks that were converted to mutexes in -rt tend to be).
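To make the waiter side of the adaptive chart concrete, here is a rough userspace model of that loop in C11. This is only a sketch under stated assumptions, not the -rt implementation: struct adaptive_lock, adaptive_grant(), adaptive_wait() and the owner_on_cpu flag are all names invented for illustration, with owner_on_cpu standing in for is_running(lock_owner(X)).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these names are illustrative, not a kernel API. */
struct adaptive_lock {
    atomic_int  owner;        /* id of the thread the lock is currently granted to */
    atomic_bool owner_on_cpu; /* stand-in for is_running(lock_owner(X)) */
};

enum wait_result { WAIT_GRANTED, WAIT_SHOULD_SLEEP };

static void adaptive_lock_init(struct adaptive_lock *x, int owner, bool on_cpu)
{
    assert(x != NULL);
    atomic_init(&x->owner, owner);
    atomic_init(&x->owner_on_cpu, on_cpu);
}

/* Owner side: hand the lock directly to waiter `to` (the "grant" in the chart). */
static void adaptive_grant(struct adaptive_lock *x, int to)
{
    atomic_store(&x->owner, to);
}

/*
 * Waiter side: spin while the lock has not been granted to `self` and the
 * current owner is still on a cpu. If the owner is off-cpu, spinning cannot
 * pay off, so tell the caller to take the sleep path instead.
 */
static enum wait_result adaptive_wait(struct adaptive_lock *x, int self)
{
    while (atomic_load(&x->owner) != self) {
        if (!atomic_load(&x->owner_on_cpu)) {
            /* Re-check once: the grant may have landed just before the owner left the cpu. */
            return atomic_load(&x->owner) == self ? WAIT_GRANTED
                                                  : WAIT_SHOULD_SLEEP;
        }
        /* busy-wait; a real implementation would add a cpu-relax hint here */
    }
    return WAIT_GRANTED;
}
```

A caller would fall back to the normal add_wait_list()/schedule() path whenever adaptive_wait() returns WAIT_SHOULD_SLEEP, so the spin only replaces the sleep in the case where the owner is actively running.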
For cases where the lock is held longer than the scheduling overhead, we will undoubtedly see more cpu utilization than in the non-adaptive case. However, we may _still_ see performance improvements in some scenarios because the "grant" operation still has less overhead (thread A can skip the "wakeup", and therefore thread B still comes out of the loop with less latency than it otherwise could). In this scenario you are trading cpu cycles for reduced latency, though the performance gains (if any) are not as profound as in the ideal case.
..and then of course there are cpu-bound workloads with long(er) held locks. They don't like adaptive-locks much at all ;)
Long story short, I would start by trying to quantify what your "t1-t0" delta actually is for the workload you are observing. That will give you a ballpark idea of whether you should expect any gains or not.
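One crude way to get at that delta from userspace is to timestamp either side of the lock call. The sketch below assumes a POSIX system and a plain pthread mutex as a stand-in for whatever lock you are actually testing; elapsed_ns() and timed_lock_ns() are hypothetical helper names, and the timing calls add some overhead of their own to the number you get back.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds from t0 to t1 (hypothetical helper). */
static int64_t elapsed_ns(struct timespec t0, struct timespec t1)
{
    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (int64_t)(t1.tv_nsec - t0.tv_nsec);
}

/*
 * Acquire `m` and return how long the acquisition took, i.e. the t1-t0
 * delta as seen by the waiter. The caller still owns the mutex on return
 * and must unlock it.
 */
static int64_t timed_lock_ns(pthread_mutex_t *m)
{
    struct timespec t0, t1;
    int rc;

    clock_gettime(CLOCK_MONOTONIC, &t0);   /* t0: lock() is entered */
    rc = pthread_mutex_lock(m);
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* t1: lock() has returned */
    assert(rc == 0);

    return elapsed_ns(t0, t1);
}
```

Collecting these deltas only for the contended acquisitions, and comparing their distribution against your typical lock hold time, should show fairly quickly which of the scenarios above your workload falls into.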
-Greg