Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Previous thread: [PATCH -rt] apply rcu_process_callbacks() changes from mainline by Paul E. McKenney on Friday, August 1, 2008 - 2:11 pm. (1 message)

Next thread: sh O= builds broken by Adrian Bunk on Friday, August 1, 2008 - 2:27 pm. (4 messages)
From: Gregory Haskins
Date: Friday, August 1, 2008 - 2:16 pm

** RFC for PREEMPT_RT branch, 26-rt1 **

Hi All,

The following series applies to 26-rt1 as a request-for-comment on a 
new approach to priority-inheritance (PI), as well as some performance
enhancements to take advantage of those new approaches.  This yields at
least a 10-15% improvement for diskio on my 4-way x86_64 system.  An
8-way system saw as much as 700% improvement during early testing, but
I have not recently reconfirmed this number.

Motivation for series:

I have several ideas on things we can do to enhance and improve kernel
performance with respect to PREEMPT_RT

1) For instance, it would be nice to support priority queuing and
   (at least positional) inheritance in the wait-queue infrastructure.

2) Reducing overhead in the real-time locks (sleepable replacements for
   spinlock_t in PREEMPT_RT) to try to approach the minimal overhead
   if their non-rt equivalent.  We have determined via instrumentation
   that one area of major overhead is the pi-boost logic.

However, today the PI code is entwined in the rtmutex infrastructure,
yet we require more flexibility if we want to address (1) and (2)
above.  Therefore the first step is to separate the PI code away from
rtmutex into its own library (libpi). This is covered in patches 1-6.

(I realize patch #6 is a little hard to review since I removed and added
a lot of code that the unified diff is all mashing together...I will try
to find a way to make this more readable).

Patch 7 is the first real consumer of the libpi logic to try to enhance
performance.  It accomplishes this by deferring pi-boosting a lock
owner unless it is absolutely necessary.  Since instrumentation
shows that the majority of locks are acquired either via the fast-path,
or via the adaptive-spin path, we can eliminate most of the pi-overhead
with this technique.  This yields a measurable performance gain (at least
10% for workloads with heavy lock contention was observed in our lab).

For your convenience, you can also find these ...
From: Gregory Haskins
Date: Friday, August 1, 2008 - 2:16 pm

The kernel currently addresses priority-inversion through priority-
inheritence.  However, all of the priority-inheritence logic is
integrated into the Real-Time Mutex infrastructure.  This causes a few
problems:

 1) This tightly coupled relationship makes it difficult to extend to
    other areas of the kernel (for instance, pi-aware wait-queues may
    be desirable).
 2) Enhancing the rtmutex infrastructure becomes challenging because
    there is no seperation between the locking code, and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version.  The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concept of a "pi_source" and a "pi_sink", where, as the
name suggests provides the basic relationship of a priority source, and
its boosted target.  A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source.  For more details, please read the library documentation.

There are currently no users of this inteface.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 Documentation/libpi.txt |   59 +++++++
 include/linux/pi.h      |  226 +++++++++++++++++++++++++++
 lib/Makefile            |    3 
 lib/pi.c                |  398 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 685 insertions(+), 1 deletions(-)

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------
+
+This library introduces the basic concept of a "pi_source" and a "pi_sink", where, as the name suggests provides the basic relationship of a ...
From: Gregory Haskins
Date: Friday, August 1, 2008 - 2:16 pm

This is a first pass at converting the system to use the new PI library.
We dont go for a wholesale replacement quite yet so that we can focus
on getting the basic plumbing in place.  Later in the series we will
begin replacing some of the proprietary logic with the generic
framework.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h     |   37 +++++++--
 include/linux/workqueue.h |    2 
 kernel/fork.c             |    1 
 kernel/rcupreempt-boost.c |   18 +---
 kernel/rtmutex.c          |    6 +
 kernel/sched.c            |  188 ++++++++++++++++++++++++++++++++-------------
 kernel/workqueue.c        |   39 ++++++++-
 7 files changed, 207 insertions(+), 84 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c885f78..63ddd1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
 #include <linux/task_io_accounting.h>
 #include <linux/kobject.h>
 #include <linux/latencytop.h>
+#include <linux/pi.h>
 
 #include <asm/processor.h>
 
@@ -1125,6 +1126,7 @@ struct task_struct {
 	int prio, static_prio, normal_prio;
 #ifdef CONFIG_PREEMPT_RCU_BOOST
 	int rcu_prio;
+	struct pi_source rcu_prio_src;
 #endif
 	const struct sched_class *sched_class;
 	struct sched_entity se;
@@ -1298,11 +1300,20 @@ struct task_struct {
 	/* Protection of the PI data structures: */
 	raw_spinlock_t pi_lock;
 
+	struct {
+		struct pi_source src;  /* represents normal_prio to 'this' */
+		struct pi_node node;
+		struct pi_sink snk;  /* registered to 'this' to get updates */
+		int prio;
+	} pi;
+
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
 	struct plist_head pi_waiters;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
+	int rtmutex_prio;
+	struct pi_source rtmutex_prio_src;
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
@@ -1440,6 +1451,26 @@ struct task_struct {
 #endif
 };
 
+static ...
From: Gregory Haskins
Date: Friday, August 1, 2008 - 2:17 pm

We will be adding more logic to rt_mutex_waiters and therefore lets
centralize the initialization to make this easier going forward.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/rtmutex.c |   26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 7d11380..12de859 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -805,6 +805,15 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
 }
 #endif
 
+static void init_waiter(struct rt_mutex_waiter *waiter)
+{
+	memset(waiter, 0, sizeof(*waiter));
+
+	debug_rt_mutex_init_waiter(waiter);
+	waiter->task = NULL;
+	waiter->write_lock = 0;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -823,9 +832,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	struct task_struct *orig_owner;
 	int missed = 0;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
-	waiter.write_lock = 0;
+	init_waiter(&waiter);
 
 	spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
@@ -1324,6 +1331,8 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
 	int saved_lock_depth = -1;
 	unsigned long saved_state = -1, state, flags;
 
+	init_waiter(&waiter);
+
 	spin_lock_irqsave(&mutex->wait_lock, flags);
 	init_rw_lists(rwm);
 
@@ -1335,10 +1344,6 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
 
 	/* Owner is a writer (or a blocked writer). Block on the lock */
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
-	waiter.write_lock = 0;
-
 	if (mtx) {
 		/*
 		 * We drop the BKL here before we go into the wait loop to avoid a
@@ -1538,8 +1543,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
 	int saved_lock_depth = -1;
 	unsigned long flags, saved_state = -1, state;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
+	init_waiter(&waiter);
 
 	/* we do PI different for writers that are blocked ...
From: Gregory Haskins
Date: Friday, August 1, 2008 - 2:17 pm

We will use this later in the series to add PI functions on "add".

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/rtmutex.c |   16 +++++++++++-----
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 12de859..62fdc3d 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1122,6 +1122,12 @@ static void rw_check_held(struct rw_mutex *rwm)
 # define rw_check_held(rwm)	do { } while (0)
 #endif
 
+static inline void
+rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
+{
+	list_add(&rls->list, &rwm->readers);
+}
+
 /*
  * The fast path does not add itself to the reader list to keep
  * from needing to grab the spinlock. We need to add the owner
@@ -1163,7 +1169,7 @@ rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own)
 	if (rls->list.prev && !list_empty(&rls->list))
 		return;
 
-	list_add(&rls->list, &rwm->readers);
+	rt_rwlock_add_reader(rls, rwm);
 
 	/* change to reader, so no one else updates too */
 	rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK);
@@ -1197,7 +1203,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
 			 * it hasn't been added to the link list yet.
 			 */
 			if (!rls->list.prev || list_empty(&rls->list))
-				list_add(&rls->list, &rwm->readers);
+				rt_rwlock_add_reader(rls, rwm);
 			rt_rwlock_set_owner(rwm, RT_RW_READER, 0);
 			rls->count++;
 			incr = 0;
@@ -1276,7 +1282,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
 			rls->lock = rwm;
 			rls->count = 1;
 			WARN_ON(rls->list.prev && !list_empty(&rls->list));
-			list_add(&rls->list, &rwm->readers);
+			rt_rwlock_add_reader(rls, rwm);
 		} else
 			WARN_ON_ONCE(1);
 		spin_unlock(&current->pi_lock);
@@ -1473,7 +1479,7 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
 			spin_lock(&mutex->wait_lock);
 			rls = &current->owned_read_locks[reader_count];
 			if (!rls->list.prev || ...
From: Gregory Haskins
Date: Friday, August 1, 2008 - 2:17 pm

The system already has facilities to perform late/run-time init for
rtmutexes.  We want to add more advanced initialization later in the
series so we force all rtmutexes through the init path in preparation
for the later patches.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/rtmutex.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index b263bac..14774ce 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -64,8 +64,6 @@ struct hrtimer_sleeper;
 
 #define __RT_MUTEX_INITIALIZER(mutexname) \
 	{ .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \
-	, .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \
-	, .owner = NULL \
 	__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}
 
 #define DEFINE_RT_MUTEX(mutexname) \

--

From: Gregory Haskins
Date: Friday, August 1, 2008 - 2:17 pm

We have previously only laid some of the groundwork to use the PI
library, but left the existing infrastructure in place in the
rtmutex code.  This patch converts the rtmutex PI code to officially
use the PI library.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/rt_lock.h   |    2 
 include/linux/rtmutex.h   |   15 -
 include/linux/sched.h     |   21 -
 kernel/fork.c             |    2 
 kernel/rcupreempt-boost.c |    2 
 kernel/rtmutex-debug.c    |    4 
 kernel/rtmutex.c          |  944 ++++++++++++++-------------------------------
 kernel/rtmutex_common.h   |   18 -
 kernel/rwlock_torture.c   |   32 --
 kernel/sched.c            |   12 -
 10 files changed, 319 insertions(+), 733 deletions(-)

diff --git a/include/linux/rt_lock.h b/include/linux/rt_lock.h
index c00cfb3..d0ef0f1 100644
--- a/include/linux/rt_lock.h
+++ b/include/linux/rt_lock.h
@@ -14,6 +14,7 @@
 #include <asm/atomic.h>
 #include <linux/spinlock_types.h>
 #include <linux/sched_prio.h>
+#include <linux/pi.h>
 
 #ifdef CONFIG_PREEMPT_RT
 /*
@@ -67,6 +68,7 @@ struct rw_mutex {
 	atomic_t		count;	/* number of times held for read */
 	atomic_t		owners; /* number of owners as readers */
 	struct list_head	readers;
+	struct pi_sink          pi_snk;
 	int prio;
 };
 
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 14774ce..d984244 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
 #include <linux/linkage.h>
 #include <linux/plist.h>
 #include <linux/spinlock_types.h>
+#include <linux/pi.h>
 
 /**
  * The rt_mutex structure
@@ -27,6 +28,12 @@ struct rt_mutex {
 	raw_spinlock_t		wait_lock;
 	struct plist_head	wait_list;
 	struct task_struct	*owner;
+	struct {
+		struct pi_source src;
+		struct pi_node   node;
+		struct pi_sink   snk;
+		int              prio;
+	} pi;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
 	const char 		*name, *file;
@@ -96,12 +103,4 @@ extern int rt_mutex_trylock(struct ...
From: Gregory Haskins
Date: Friday, August 1, 2008 - 2:17 pm

Adaptive-locking technology often times acquires the lock by
spinning on a running-owner instead of sleeping.  It is unecessary
to go through pi-boosting if the owner is of equal or (logically)
lower priority. Therefore, we can save some significant overhead
by deferring the boost until absolutely necessary.  This has shown
to improve overall performance in PREEMPT_RT

Special thanks to Peter Morreale for suggesting the optimization to
only consider skipping the boost if the owner is >= to current

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Peter Morreale <pmorreale@novell.com>
---

 include/linux/rtmutex.h |    1 
 kernel/rtmutex.c        |  195 ++++++++++++++++++++++++++++++++++++-----------
 kernel/rtmutex_common.h |    1 
 3 files changed, 153 insertions(+), 44 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index d984244..1d98107 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -33,6 +33,7 @@ struct rt_mutex {
 		struct pi_node   node;
 		struct pi_sink   snk;
 		int              prio;
+		int              boosters;
 	} pi;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 0f64298..de213ac 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -76,14 +76,15 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
 {
 	unsigned long val = (unsigned long)owner | mask;
 
-	if (rt_mutex_has_waiters(lock)) {
+	if (lock->pi.boosters) {
 		struct task_struct *prev_owner = rt_mutex_owner(lock);
 
 		rtmutex_pi_owner(lock, prev_owner, 0);
 		rtmutex_pi_owner(lock, owner, 1);
+	}
 
+	if (rt_mutex_has_waiters(lock))
 		val |= RT_MUTEX_HAS_WAITERS;
-	}
 
 	lock->owner = (struct task_struct *)val;
 }
@@ -177,7 +178,7 @@ static inline int rtmutex_pi_update(struct pi_sink *snk,
 
 	spin_lock_irqsave(&lock->wait_lock, iflags);
 
-	if (rt_mutex_has_waiters(lock)) {
+	if (lock->pi.boosters) {
 		owner = rt_mutex_owner(lock);
 
 ...
From: Gregory Haskins
Date: Monday, August 4, 2008 - 6:21 am

I received feedback that this prologue was too vague to accurately=20
describe what this patch does and why it is not broken to use this=20
optimization.  Therefore, here is the new prologue:

-----------------------
From: Gregory Haskins <ghaskins@novell.com>

rtmutex: pi-boost locks as late as possible

PREEMPT_RT replaces most spinlock_t instances with a preemptible
real-time lock that supports priority inheritance.  An uncontended
(fastpath) acquisition of this lock has no more overhead than
its non-rt spinlock_t counterpart.  However, the contended case
has considerably more overhead so that the lock can maintain
proper priority queue order and support pi-boosting of the lock
owner, yet remaining fully preemptible.

Instrumentation shows that the majority of acquisitions under most
workloads falls either into the fastpath category, or the adaptive
spin category within the slowpath.  The necessity to pi-boost a
lock-owner should be sufficiently rare, yet the slow-path path
blindly incurs this overhead in 100% of contentions.

Therefore, this patch intends to capitalize on this observation
in order to reduce overhead and improve acquisition throughput.
It is important to note that real-time latency is still treated
as a higher order constraint than throughput, so the full
pi-protocol is observed using new carefully constructed rules
around the old concepts.

1) We check the priority of the owner relative to the waiter on
   each spin of the lock (if we are not boosted already).  If the
   owner's effective priority is logically less than the waiters
   priority, we must boost them.

2) We check the priority of ourselves against our current queue
   position on the waiters-list (if we are not boosted already).
   If our priority was changed, we need to re-queue ourselves to
   update our position.

3) We break out of the adaptive-spin if either of the above
   conditions (1), (2) change so that we can re-evaluate the
   lock conditions.

4) We must enter ...
From: Gregory Haskins
Date: Monday, August 4, 2008 - 8:01 pm

David Holmes (CC'd) pointed out that this statement is a little vague=20
and confusing as well.  The question is: how could the owner be exposed=20
to preemption since it would presumably be running at or above the=20
waiters priority or we would have boosted it already?  The answer is=20
that this is in reference to the fact that the owner may have its=20
priority lowered after we have already made the decision to defer boostin=
g.

Therefore, my updated statement should read:

"The downside is that we technically leave the owner exposed to getting=20
preempted *should it get asynchronously deprioritized*, even if ...."  =20
As I go on to explain below, this deboosting could only happen as the=20
result of a setscheduler() call, which I assert should not be cause for=20
concern. However, I wanted to highlight this phenomenon in the interest=20
of full disclosure since it is technically a difference in behavior from =

the original algorithm.  I will update the header with this edit for=20
clarity.

Thanks for the review, David!



From: Gregory Haskins
Date: Friday, August 15, 2008 - 5:08 am

** RFC for PREEMPT_RT branch, 26-rt1 **

Synopsis: We gain a 13%+ IO improvement in the PREEMPT_RT kernel by
re-working some of the PI logic.

[
	pi-enhancements v2

	Changes since v1:

	*) Added proper reference counting to prevent tasks from
	   deleting while a node->update() is still in flight
	*) unified the RCU boost path
]


[
	fyi -> you can find this series at the following URLs in
	addition to this thread:

  		 http://git.kernel.org/?p=linux/kernel/git/ghaskins/linux-2.6-hacks.git;a=shortlog;h=pi...

  		 ftp://ftp.novell.com/dev/ghaskins/pi-rework-v2.tar.bz2

]

Hi All,

The following series applies to 26-rt1 as a request-for-comment on a 
new approach to priority-inheritance (PI), as well as some performance
enhancements to take advantage of those new approaches.  This yields at
least a 13-15% improvement for diskio on my 4-way x86_64 system.  An
8-way system saw as much as 700% improvement during early testing, but
I have not recently reconfirmed this number.

Motivation for series:

I have several ideas on things we can do to enhance and improve kernel
performance with respect to PREEMPT_RT

1) For instance, it would be nice to support priority queuing and
   (at least positional) inheritance in the wait-queue infrastructure.

2) Reducing overhead in the real-time locks (sleepable replacements for
   spinlock_t in PREEMPT_RT) to try to approach the minimal overhead
   if their non-rt equivalent.  We have determined via instrumentation
   that one area of major overhead is the pi-boost logic.

However, today the PI code is entwined in the rtmutex infrastructure,
yet we require more flexibility if we want to address (1) and (2)
above.  Therefore the first step is to separate the PI code away from
rtmutex into its own library (libpi). This is covered in patches 1-7.

(I realize patch #7 is a little hard to review since I removed and added
a lot of code that the unified diff is all mashing together...I will try
to find a way to make this ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 5:08 am

We will be adding more logic to rt_mutex_waiters and therefore lets
centralize the initialization to make this easier going forward.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/rtmutex.c |   26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 7d11380..12de859 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -805,6 +805,15 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
 }
 #endif
 
+static void init_waiter(struct rt_mutex_waiter *waiter)
+{
+	memset(waiter, 0, sizeof(*waiter));
+
+	debug_rt_mutex_init_waiter(waiter);
+	waiter->task = NULL;
+	waiter->write_lock = 0;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -823,9 +832,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	struct task_struct *orig_owner;
 	int missed = 0;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
-	waiter.write_lock = 0;
+	init_waiter(&waiter);
 
 	spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
@@ -1324,6 +1331,8 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
 	int saved_lock_depth = -1;
 	unsigned long saved_state = -1, state, flags;
 
+	init_waiter(&waiter);
+
 	spin_lock_irqsave(&mutex->wait_lock, flags);
 	init_rw_lists(rwm);
 
@@ -1335,10 +1344,6 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
 
 	/* Owner is a writer (or a blocked writer). Block on the lock */
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
-	waiter.write_lock = 0;
-
 	if (mtx) {
 		/*
 		 * We drop the BKL here before we go into the wait loop to avoid a
@@ -1538,8 +1543,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
 	int saved_lock_depth = -1;
 	unsigned long flags, saved_state = -1, state;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
+	init_waiter(&waiter);
 
 	/* we do PI different for writers that are blocked ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 5:08 am

The kernel currently addresses priority-inversion through priority-
inheritence.  However, all of the priority-inheritence logic is
integrated into the Real-Time Mutex infrastructure.  This causes a few
problems:

 1) This tightly coupled relationship makes it difficult to extend to
    other areas of the kernel (for instance, pi-aware wait-queues may
    be desirable).
 2) Enhancing the rtmutex infrastructure becomes challenging because
    there is no seperation between the locking code, and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version.  The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concept of a "pi_source" and a "pi_sink", where, as the
name suggests provides the basic relationship of a priority source, and
its boosted target.  A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source.  For more details, please read the library documentation.

There are currently no users of this inteface.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 Documentation/libpi.txt |   59 +++++
 include/linux/pi.h      |  278 +++++++++++++++++++++++++
 lib/Makefile            |    3 
 lib/pi.c                |  516 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 855 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/libpi.txt
 create mode 100644 include/linux/pi.h
 create mode 100644 lib/pi.c

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------
+
+This library introduces the basic ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 6:16 am

[
  of course, 2 seconds after I hit "send" on v2 I realized there was a
  race condition in libpi w.r.t. the sinkref->prio.  Rather than spam
  you guys with a full "v3" refresh of the series, here is a fixed
  version of patch 1/8 which constitutes "v3" when used with patches
  2-8 from v2.
] 

The kernel currently addresses priority-inversion through priority-
inheritence.  However, all of the priority-inheritence logic is
integrated into the Real-Time Mutex infrastructure.  This causes a few
problems:

 1) This tightly coupled relationship makes it difficult to extend to
    other areas of the kernel (for instance, pi-aware wait-queues may
    be desirable).
 2) Enhancing the rtmutex infrastructure becomes challenging because
    there is no seperation between the locking code, and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version.  The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concept of a "pi_source" and a "pi_sink", where, as the
name suggests provides the basic relationship of a priority source, and
its boosted target.  A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source.  For more details, please read the library documentation.

There are currently no users of this inteface.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 Documentation/libpi.txt |   59 +++++
 include/linux/pi.h      |  277 ++++++++++++++++++++++++++
 lib/Makefile            |    3 
 lib/pi.c                |  509 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 847 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/libpi.txt
 create mode 100644 include/linux/pi.h
 create mode ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 5:08 am

PREEMPT_RT replaces most spinlock_t instances with a preemptible
real-time lock that supports priority inheritance.  An uncontended
(fastpath) acquisition of this lock has no more overhead than
its non-rt spinlock_t counterpart.  However, the contended case
has considerably more overhead so that the lock can maintain
proper priority queue order and support pi-boosting of the lock
owner, yet remaining fully preemptible.

Instrumentation shows that the majority of acquisitions under most
workloads falls either into the fastpath category, or the adaptive
spin category within the slowpath.  The necessity to pi-boost a
lock-owner should be sufficiently rare, yet the slow-path path
blindly incurs this overhead in 100% of contentions.

Therefore, this patch intends to capitalize on this observation
in order to reduce overhead and improve acquisition throughput.
It is important to note that real-time latency is still treated
as a higher order constraint than throughput, so the full
pi-protocol is observed using new carefully constructed rules
around the old concepts.

1) We check the priority of the owner relative to the waiter on
   each spin of the lock (if we are not boosted already).  If the
   owner's effective priority is logically less than the waiters
   priority, we must boost them.

2) We check the priority of ourselves against our current queue
   position on the waiters-list (if we are not boosted already).
   If our priority was changed, we need to re-queue ourselves to
   update our position.

3) We break out of the adaptive-spin if either of the above
   conditions (1), (2) change so that we can re-evaluate the
   lock conditions.

4) We must enter pi-boost mode if, at any time, we decide to
   voluntarily preempt since we are losing our ability to
   dynamically process the conditions above.

Note: We still fully support priority inheritance with this
protocol, even if we defer the low-level calls to adjust priority.
The difference is really in terms of being a ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 5:08 am

This is a first pass at converting the system to use the new PI library.
We dont go for a wholesale replacement quite yet so that we can focus
on getting the basic plumbing in place.  Later in the series we will
begin replacing some of the proprietary logic with the generic
framework.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h     |   37 +++++++--
 include/linux/workqueue.h |    2 
 kernel/fork.c             |    1 
 kernel/rcupreempt-boost.c |   23 +-----
 kernel/rtmutex.c          |    6 +
 kernel/sched.c            |  188 ++++++++++++++++++++++++++++++++-------------
 kernel/workqueue.c        |   39 ++++++++-
 7 files changed, 206 insertions(+), 90 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c885f78..63ddd1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
 #include <linux/task_io_accounting.h>
 #include <linux/kobject.h>
 #include <linux/latencytop.h>
+#include <linux/pi.h>
 
 #include <asm/processor.h>
 
@@ -1125,6 +1126,7 @@ struct task_struct {
 	int prio, static_prio, normal_prio;
 #ifdef CONFIG_PREEMPT_RCU_BOOST
 	int rcu_prio;
+	struct pi_source rcu_prio_src;
 #endif
 	const struct sched_class *sched_class;
 	struct sched_entity se;
@@ -1298,11 +1300,20 @@ struct task_struct {
 	/* Protection of the PI data structures: */
 	raw_spinlock_t pi_lock;
 
+	struct {
+		struct pi_source src;  /* represents normal_prio to 'this' */
+		struct pi_node node;
+		struct pi_sink snk;  /* registered to 'this' to get updates */
+		int prio;
+	} pi;
+
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
 	struct plist_head pi_waiters;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
+	int rtmutex_prio;
+	struct pi_source rtmutex_prio_src;
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
@@ -1440,6 +1451,26 @@ struct task_struct {
 #endif
 };
 
+static ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 5:08 am

We have previously only laid some of the groundwork to use the PI
library, but left the existing infrastructure in place in the
rtmutex code.  This patch converts the rtmutex PI code to officially
use the PI library.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/rt_lock.h   |    2 
 include/linux/rtmutex.h   |   15 -
 include/linux/sched.h     |   21 -
 kernel/fork.c             |    2 
 kernel/rcupreempt-boost.c |    2 
 kernel/rtmutex-debug.c    |    4 
 kernel/rtmutex-tester.c   |    4 
 kernel/rtmutex.c          |  944 ++++++++++++++-------------------------------
 kernel/rtmutex_common.h   |   18 -
 kernel/rwlock_torture.c   |   32 --
 kernel/sched.c            |   12 -
 11 files changed, 321 insertions(+), 735 deletions(-)

diff --git a/include/linux/rt_lock.h b/include/linux/rt_lock.h
index c00cfb3..d0ef0f1 100644
--- a/include/linux/rt_lock.h
+++ b/include/linux/rt_lock.h
@@ -14,6 +14,7 @@
 #include <asm/atomic.h>
 #include <linux/spinlock_types.h>
 #include <linux/sched_prio.h>
+#include <linux/pi.h>
 
 #ifdef CONFIG_PREEMPT_RT
 /*
@@ -67,6 +68,7 @@ struct rw_mutex {
 	atomic_t		count;	/* number of times held for read */
 	atomic_t		owners; /* number of owners as readers */
 	struct list_head	readers;
+	struct pi_sink          pi_snk;
 	int prio;
 };
 
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 14774ce..d984244 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
 #include <linux/linkage.h>
 #include <linux/plist.h>
 #include <linux/spinlock_types.h>
+#include <linux/pi.h>
 
 /**
  * The rt_mutex structure
@@ -27,6 +28,12 @@ struct rt_mutex {
 	raw_spinlock_t		wait_lock;
 	struct plist_head	wait_list;
 	struct task_struct	*owner;
+	struct {
+		struct pi_source src;
+		struct pi_node   node;
+		struct pi_sink   snk;
+		int              prio;
+	} pi;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
 	const char 		*name, *file;
@@ -96,12 +103,4 @@ ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 5:08 am

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h |    5 +++--
 kernel/fork.c         |   32 +++++++++++++++-----------------
 kernel/sched.c        |   23 +++++++++++++++++++++++
 3 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63ddd1f..9132b42 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1305,6 +1305,8 @@ struct task_struct {
 		struct pi_node node;
 		struct pi_sink snk;  /* registered to 'this' to get updates */
 		int prio;
+		struct rcu_head rcu; /* for destruction cleanup */
+
 	} pi;
 
 #ifdef CONFIG_RT_MUTEXES
@@ -1633,12 +1635,11 @@ static inline void put_task_struct(struct task_struct *t)
 		call_rcu(&t->rcu, __put_task_struct_cb);
 }
 #else
-extern void __put_task_struct(struct task_struct *t);
 
 static inline void put_task_struct(struct task_struct *t)
 {
 	if (atomic_dec_and_test(&t->usage))
-		__put_task_struct(t);
+		pi_dropref(&t->pi.node, 0);
 }
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0d0..399a0a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -130,39 +130,37 @@ void free_task(struct task_struct *tsk)
 }
 EXPORT_SYMBOL(free_task);
 
-#ifdef CONFIG_PREEMPT_RT
-void __put_task_struct_cb(struct rcu_head *rhp)
+void prepare_free_task(struct task_struct *tsk)
 {
-	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
-
 	BUG_ON(atomic_read(&tsk->usage));
-	WARN_ON(!tsk->exit_state);
 	WARN_ON(tsk == current);
 
+#ifdef CONFIG_PREEMPT_RT
+	WARN_ON(!tsk->exit_state);
+#else
+	WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
+#endif
+
 	security_task_free(tsk);
 	free_uid(tsk->user);
 	put_group_info(tsk->group_info);
+
+#ifdef CONFIG_PREEMPT_RT
 	delayacct_tsk_free(tsk);
+#endif
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
 
-#else
-
-void __put_task_struct(struct task_struct *tsk)
+#ifdef CONFIG_PREEMPT_RT
+void __put_task_struct_cb(struct ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 5:08 am

We will use this later in the series to add PI functions on "add".

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/rtmutex.c |   16 +++++++++++-----
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 12de859..62fdc3d 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1122,6 +1122,12 @@ static void rw_check_held(struct rw_mutex *rwm)
 # define rw_check_held(rwm)	do { } while (0)
 #endif
 
+static inline void
+rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
+{
+	list_add(&rls->list, &rwm->readers);
+}
+
 /*
  * The fast path does not add itself to the reader list to keep
  * from needing to grab the spinlock. We need to add the owner
@@ -1163,7 +1169,7 @@ rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own)
 	if (rls->list.prev && !list_empty(&rls->list))
 		return;
 
-	list_add(&rls->list, &rwm->readers);
+	rt_rwlock_add_reader(rls, rwm);
 
 	/* change to reader, so no one else updates too */
 	rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK);
@@ -1197,7 +1203,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
 			 * it hasn't been added to the link list yet.
 			 */
 			if (!rls->list.prev || list_empty(&rls->list))
-				list_add(&rls->list, &rwm->readers);
+				rt_rwlock_add_reader(rls, rwm);
 			rt_rwlock_set_owner(rwm, RT_RW_READER, 0);
 			rls->count++;
 			incr = 0;
@@ -1276,7 +1282,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
 			rls->lock = rwm;
 			rls->count = 1;
 			WARN_ON(rls->list.prev && !list_empty(&rls->list));
-			list_add(&rls->list, &rwm->readers);
+			rt_rwlock_add_reader(rls, rwm);
 		} else
 			WARN_ON_ONCE(1);
 		spin_unlock(&current->pi_lock);
@@ -1473,7 +1479,7 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
 			spin_lock(&mutex->wait_lock);
 			rls = &current->owned_read_locks[reader_count];
 			if (!rls->list.prev || ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 5:08 am

The system already has facilities to perform late/run-time init for
rtmutexes.  We want to add more advanced initialization later in the
series so we force all rtmutexes through the init path in preparation
for the later patches.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/rtmutex.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index b263bac..14774ce 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -64,8 +64,6 @@ struct hrtimer_sleeper;
 
 #define __RT_MUTEX_INITIALIZER(mutexname) \
 	{ .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \
-	, .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \
-	, .owner = NULL \
 	__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}
 
 #define DEFINE_RT_MUTEX(mutexname) \

--

From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:28 pm

** RFC for PREEMPT_RT branch, 26-rt1 **

Synopsis: We gain a 13%+ IO improvement in the PREEMPT_RT kernel by
re-working some of the PI logic.

[
	Changelog:

	v4:

		1) Incorporated review comments

		   *) Fixed checkpatch warning about extern in .c
		   *) Renamed s/snk/sink
		   *) Renamed s/addref/get
		   *) Renamed s/dropref/put
		   *) Made pi_sink use static *ops

		2) Fixed a bug w.r.t. enabling interrupts too early  

	v3:

		*) fixed a race with sinkref->prio

	v2:

		*) Added proper reference counting to prevent tasks from
	   	   deleting while a node->update() is still in flight
		*) unified the RCU boost path

	v1:

		*) initial release
]


[
	fyi -> you can find this series at the following URLs in
	addition to this thread:

  		 http://git.kernel.org/?p=linux/kernel/git/ghaskins/linux-2.6-hacks.git;a=shortlog;h=pi...

  		 ftp://ftp.novell.com/dev/ghaskins/pi-rework.tar.bz2

]

Hi All,

The following series applies to 26-rt1 as a request-for-comment on a 
new approach to priority-inheritance (PI), as well as some performance
enhancements to take advantage of those new approaches.  This yields at
least a 13-15% improvement for diskio on my 4-way x86_64 system.  An
8-way system saw as much as 700% improvement during early testing, but
I have not recently reconfirmed this number.

Motivation for series:

I have several ideas on things we can do to enhance and improve kernel
performance with respect to PREEMPT_RT

1) For instance, it would be nice to support priority queuing and
   (at least positional) inheritance in the wait-queue infrastructure.

2) Reducing overhead in the real-time locks (sleepable replacements for
   spinlock_t in PREEMPT_RT) to try to approach the minimal overhead
   if their non-rt equivalent.  We have determined via instrumentation
   that one area of major overhead is the pi-boost logic.

However, today the PI code is entwined in the rtmutex infrastructure,
yet we require more flexibility if we ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:28 pm

The kernel currently addresses priority-inversion through priority-
inheritence.  However, all of the priority-inheritence logic is
integrated into the Real-Time Mutex infrastructure.  This causes a few
problems:

 1) This tightly coupled relationship makes it difficult to extend to
    other areas of the kernel (for instance, pi-aware wait-queues may
    be desirable).
 2) Enhancing the rtmutex infrastructure becomes challenging because
    there is no seperation between the locking code, and the pi-code.

This patch aims to rectify these shortcomings by designing a stand-alone
pi framework which can then be used to replace the rtmutex-specific
version.  The goal of this framework is to provide similar functionality
to the existing subsystem, but with sole focus on PI and the
relationships between objects that can boost priority, and the objects
that get boosted.

We introduce the concept of a "pi_source" and a "pi_sink", where, as the
name suggests provides the basic relationship of a priority source, and
its boosted target.  A pi_source acts as a reference to some arbitrary
source of priority, and a pi_sink can be boosted (or deboosted) by
a pi_source.  For more details, please read the library documentation.

There are currently no users of this inteface.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 Documentation/libpi.txt |   59 ++++++
 include/linux/pi.h      |  293 ++++++++++++++++++++++++++++
 lib/Makefile            |    3 
 lib/pi.c                |  489 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 843 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/libpi.txt
 create mode 100644 include/linux/pi.h
 create mode 100644 lib/pi.c

diff --git a/Documentation/libpi.txt b/Documentation/libpi.txt
new file mode 100644
index 0000000..197b21a
--- /dev/null
+++ b/Documentation/libpi.txt
@@ -0,0 +1,59 @@
+lib/pi.c - Priority Inheritance library
+
+Sources and sinks:
+------------
+
+This library introduces the ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:32 pm

^^^^^^


--

From: Matthias Behr
Date: Saturday, August 16, 2008 - 8:32 am

Hi Greg,

I got a few review comments/questions. Pls see below.

Best Regards,
Matthias



Shouldn't the atomic/locked part cover the ...->free(...) as well? A pi_get right after the atomic_dec_and_test but before the free() could lead to a free() with refs>0?



--

From: Gregory Haskins
Date: Tuesday, August 19, 2008 - 1:34 am

Hi Matthias,



Actually, it already does.  The free can only be called by the last=20

A pi_get() after the ref could have already dropped to zero is broken at =

a higher layer.  E.g. the caller of pi_get() has to ensure that there=20
are no races against the reference dropping to begin with.  This is the=20
same as any reference-counted object (for instance, see get_task_struct()=
).

Thanks for the review, Matthias!

Regards,
-Greg


From: Peter Zijlstra
Date: Saturday, August 16, 2008 - 12:56 pm

You should have started out by discussing your design - the document
just rambles a bit about some implementation details - it doesn't talk
about how it maps to the PI problem space.

Anyway - from what I can make of the code, you managed to convert the pi
graph walking code that used to be in rt_mutex_adjust_prio_chain() and
was iterative, into a recursive function call.

Not something you should do lightly..

--

From: Gregory Haskins
Date: Tuesday, August 19, 2008 - 1:40 am

Hi Peter,


The doc is still a work-in-progress, but point taken ;)  I will address=20

As we discussed on IRC yesterday, you are correct here.  I was thinking=20
that the graph couldn't get deeper than a few dozen entries, but I=20
forgot about userspace futex access.  But, this is precisely what the=20
"release early" policy is designed to catch ;)

I think I can make a slight adjustment to the model to return it to an=20
iterative design.  I will address this in v5.

Thanks for the review, Peter!

Regards,
-Greg




From: Esben Nielsen
Date: Friday, August 22, 2008 - 5:55 am

Disclaimer: I am no longer actively involved and I must admit I might
have lost out on much of
what have been going on since I contributed to the PI system 2 years
ago. But I allow myself to comment
anyway.


This is really a good idea. When I had time (2 years ago) to actively
work on these problem
I also came to the conclusion that PI should be more general than just
the rtmutex. Preemptive RCU
was the example which drove it.

But I do disagree that general objects should get boosted: The end
targets are always tasks. The objects might
be boosted as intermediate steps, but priority end the only applies to tasks.


This is bad from a RT point of view: You have a hard time determininig
the number of sinks per node. An rw-lock could have an arbitrary
number of readers (is supposed to really). Therefore
you have no chance of knowing how long the boost/deboost operation
will take. And you also know for how long the boosted tasks stay
boosted. If there can be an arbitrary number of


WHAT??? There is a finite lock depth defined. I know we did that
originally but it wasn't hardcoded (as far as I remember) and
it was certainly not as low as 5.



Yack! You keep interrupts off while doing the chain. I think my main
contribution to the PI system 2 years ago was to do this preemptively.
I.e. there was points in the loop where interrupts and preemption
where turned on.

Remember: It goes into user space again. An evil user could craft an
application with a very long lock depth and keep higher priority real
time tasks from running for an arbitrary long time (if
no limit on the lock depth is set, which is bad because it will be too
low in some cases.)

But as I said I have had no time to watch what has actually been going
on in the kernel for the last 2 years roughly. The said defects might
have creeped in by other contributers already :-(

Esben
--

From: Gregory Haskins
Date: Friday, August 22, 2008 - 6:15 am

Hi Esben,
  Thank you for the review.  Comments inline.

Actually I fully agree with you here.  Its probably just poor wording on
my part, but this is exactly what happens.  We may "boost" arbitrary
objects on the way to boosting a task...but the intermediate objects are
just there to help find our way to the proper tasks.  Ultimately
ht departure from the previous implementation which had the notion of onl=
y a single sink (i.e. "task->pi_blocked_on").  The reason why we added th=
e ability to add more than one sink was not to change the default chainin=
g model (I.e. multiple boost targets), but rather to add a flexible notif=
ication mechanism that is peripheral to the chain, which are informally c=
e.  Rather, they act as endpoints to a priority boosting.  Ultimately, ev=
ery chain ends with a leaf-sink, which presumably will act on the new pri=
ority information.  However, there may be any number of leaf-sinks along =
a chain as well.  Each one will act on its localized priority in its own =
implementation specific way.  For instance, a task_struct pi-leaf may cha=
nge the priority of the task and reschedule it if necessary.  Whereas an =

While you may have a valid concern about what rwlocks can do to
determinism, not that we already have PI enabled rwlocks before my
patch, so I am not increasing nor decreasing determinism in this
regard.  That being said, Steven Rostedt (author of the pi-rwlocks,
CC'd) has facilities to manage this (such as limiting the number of
readers to num_online_cpus) which this design would retain.  Long story
short, I do not believe I have made anything worse here, so this is a

Note that this is simply in reference to how many direct sinks you can
link to a node, not how long the resulting chain can grow.  The chain
depth is actually completely unconstrained by the design.  I chose "5"
here because typically we need 1 sink for the next link in the chain,
and 1 sink for local notifications.  The other 3 are there for head-room
(we often hit 3-4 ...
From: Gregory Haskins
Date: Friday, August 22, 2008 - 9:08 am

ght departure from the previous implementation which had the notion of on=
ly a single sink (i.e. "task->pi_blocked_on").  The reason why we added t=
he ability to add more than one sink was not to change the default chaini=
ng model (I.e. multiple boost targets), but rather to add a flexible noti=
fication mechanism that is peripheral to the chain, which are informally =
se.  Rather, they act as endpoints to a priority boosting.  Ultimately, e=
very chain ends with a leaf-sink, which presumably will act on the new pr=
iority information.  However, there may be any number of leaf-sinks along=
 a chain as well.  Each one will act on its localized priority in its own=
 implementation specific way.  For instance, a task_struct pi-leaf may ch=
ange the priority of the task and reschedule it if necessary.  Whereas an=

To clarify what I meant here:  If you think of a normal linked-list node
having a single "next" pointer, this implementation is like each node
having up to 5 "next" pointers.   However typically only 1-2 are used,
and all but one will usually point to a "leaf" node, meaning it does not
form a chain but terminates processing locally.  Typically there will be
only one link to something that forms a chain with other nodes.  I did
this because I realized the pattern (boost/deboost/update) was similar
whether the node was a leaf or a chain-link, so I unified both behind
the single pi_sink interface.

That being understood, note that as with any linked-list, the nodes can
still have an arbitrary chaining depth (and I will fix this to be


From: Steven Rostedt
Date: Friday, August 22, 2008 - 6:17 am

Esben, you are always welcomed. You are one of the copyright owners of

Yeah, I believe our number is 1024, and is not hardcoded, but is there

I haven't looked to hard at this code yet, but this may only be kernel 

The rtmutex.c has hardly changed since you last left it. The two big 
additions, were adaptive locks, which hardly touched the pi chain, and my 
rwlocks allowing multiple readers. It added a hook to allow going into the 
pi chain for all readers while holding a spinlock and yes irqs off. The 
difference is that it is a bug to hold an rwlock (internal kernel lock 
only) and take a futex. Thus, this rwlock code did have a recursive depth 
of 5. Perhaps that's what the PI depth is from above?

I still haven't had the time to analyze Gregory's code, so those points 
that you made, may be only related to kernel activities (like the new 
rwlock code). But by generalizing it, it decouples the PI from the 
locking, which in general is a good thing, but for the multiple reader 
locks, it is dangerous to decouple it, since there are a lot of 
assumptions between the multiple PI owner code and rwlocks. In a general 
approach, the assumptions will be harder to see.

-- Steve

--

From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:28 pm

This is a first pass at converting the system to use the new PI library.
We dont go for a wholesale replacement quite yet so that we can focus
on getting the basic plumbing in place.  Later in the series we will
begin replacing some of the proprietary logic with the generic
framework.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h     |   37 +++++++--
 include/linux/workqueue.h |    2 
 kernel/fork.c             |    1 
 kernel/rcupreempt-boost.c |   23 +-----
 kernel/rtmutex.c          |    6 +
 kernel/sched.c            |  188 ++++++++++++++++++++++++++++++++-------------
 kernel/workqueue.c        |   39 ++++++++-
 7 files changed, 206 insertions(+), 90 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c885f78..5521a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -87,6 +87,7 @@ struct sched_param {
 #include <linux/task_io_accounting.h>
 #include <linux/kobject.h>
 #include <linux/latencytop.h>
+#include <linux/pi.h>
 
 #include <asm/processor.h>
 
@@ -1125,6 +1126,7 @@ struct task_struct {
 	int prio, static_prio, normal_prio;
 #ifdef CONFIG_PREEMPT_RCU_BOOST
 	int rcu_prio;
+	struct pi_source rcu_prio_src;
 #endif
 	const struct sched_class *sched_class;
 	struct sched_entity se;
@@ -1298,11 +1300,20 @@ struct task_struct {
 	/* Protection of the PI data structures: */
 	raw_spinlock_t pi_lock;
 
+	struct {
+		struct pi_source src;  /* represents normal_prio to 'this' */
+		struct pi_node node;
+		struct pi_sink sink;  /* registered to 'this' to get updates */
+		int prio;
+	} pi;
+
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
 	struct plist_head pi_waiters;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
+	int rtmutex_prio;
+	struct pi_source rtmutex_prio_src;
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
@@ -1440,6 +1451,26 @@ struct task_struct {
 #endif
 };
 
+static ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:28 pm

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/sched.h |    7 +++++--
 kernel/fork.c         |   32 +++++++++++++++-----------------
 kernel/sched.c        |   21 +++++++++++++++++++++
 3 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5521a64..7ae8eca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1305,6 +1305,8 @@ struct task_struct {
 		struct pi_node node;
 		struct pi_sink sink;  /* registered to 'this' to get updates */
 		int prio;
+		struct rcu_head rcu; /* for destruction cleanup */
+
 	} pi;
 
 #ifdef CONFIG_RT_MUTEXES
@@ -1633,12 +1635,11 @@ static inline void put_task_struct(struct task_struct *t)
 		call_rcu(&t->rcu, __put_task_struct_cb);
 }
 #else
-extern void __put_task_struct(struct task_struct *t);
 
 static inline void put_task_struct(struct task_struct *t)
 {
 	if (atomic_dec_and_test(&t->usage))
-		__put_task_struct(t);
+		pi_put(&t->pi.node, 0);
 }
 #endif
 
@@ -2469,6 +2470,8 @@ static inline int task_is_current(struct task_struct *task)
 }
 #endif
 
+extern void prepare_free_task(struct task_struct *tsk);
+
 #define TASK_STATE_TO_CHAR_STR "RMSDTtZX"
 
 #endif /* __KERNEL__ */
diff --git a/kernel/fork.c b/kernel/fork.c
index 399a0d0..4d4fba3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -130,39 +130,37 @@ void free_task(struct task_struct *tsk)
 }
 EXPORT_SYMBOL(free_task);
 
-#ifdef CONFIG_PREEMPT_RT
-void __put_task_struct_cb(struct rcu_head *rhp)
+void prepare_free_task(struct task_struct *tsk)
 {
-	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
-
 	BUG_ON(atomic_read(&tsk->usage));
-	WARN_ON(!tsk->exit_state);
 	WARN_ON(tsk == current);
 
+#ifdef CONFIG_PREEMPT_RT
+	WARN_ON(!tsk->exit_state);
+#else
+	WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
+#endif
+
 	security_task_free(tsk);
 	free_uid(tsk->user);
 	put_group_info(tsk->group_info);
+
+#ifdef ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:28 pm

We will be adding more logic to rt_mutex_waiters and therefore lets
centralize the initialization to make this easier going forward.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/rtmutex.c |   26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 7d11380..12de859 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -805,6 +805,15 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
 }
 #endif
 
+static void init_waiter(struct rt_mutex_waiter *waiter)
+{
+	memset(waiter, 0, sizeof(*waiter));
+
+	debug_rt_mutex_init_waiter(waiter);
+	waiter->task = NULL;
+	waiter->write_lock = 0;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -823,9 +832,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	struct task_struct *orig_owner;
 	int missed = 0;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
-	waiter.write_lock = 0;
+	init_waiter(&waiter);
 
 	spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
@@ -1324,6 +1331,8 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
 	int saved_lock_depth = -1;
 	unsigned long saved_state = -1, state, flags;
 
+	init_waiter(&waiter);
+
 	spin_lock_irqsave(&mutex->wait_lock, flags);
 	init_rw_lists(rwm);
 
@@ -1335,10 +1344,6 @@ rt_read_slowlock(struct rw_mutex *rwm, int mtx)
 
 	/* Owner is a writer (or a blocked writer). Block on the lock */
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
-	waiter.write_lock = 0;
-
 	if (mtx) {
 		/*
 		 * We drop the BKL here before we go into the wait loop to avoid a
@@ -1538,8 +1543,7 @@ rt_write_slowlock(struct rw_mutex *rwm, int mtx)
 	int saved_lock_depth = -1;
 	unsigned long flags, saved_state = -1, state;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
+	init_waiter(&waiter);
 
 	/* we do PI different for writers that are blocked ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:28 pm

We will use this later in the series to add PI functions on "add".

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/rtmutex.c |   16 +++++++++++-----
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 12de859..62fdc3d 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1122,6 +1122,12 @@ static void rw_check_held(struct rw_mutex *rwm)
 # define rw_check_held(rwm)	do { } while (0)
 #endif
 
+static inline void
+rt_rwlock_add_reader(struct reader_lock_struct *rls, struct rw_mutex *rwm)
+{
+	list_add(&rls->list, &rwm->readers);
+}
+
 /*
  * The fast path does not add itself to the reader list to keep
  * from needing to grab the spinlock. We need to add the owner
@@ -1163,7 +1169,7 @@ rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own)
 	if (rls->list.prev && !list_empty(&rls->list))
 		return;
 
-	list_add(&rls->list, &rwm->readers);
+	rt_rwlock_add_reader(rls, rwm);
 
 	/* change to reader, so no one else updates too */
 	rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK);
@@ -1197,7 +1203,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
 			 * it hasn't been added to the link list yet.
 			 */
 			if (!rls->list.prev || list_empty(&rls->list))
-				list_add(&rls->list, &rwm->readers);
+				rt_rwlock_add_reader(rls, rwm);
 			rt_rwlock_set_owner(rwm, RT_RW_READER, 0);
 			rls->count++;
 			incr = 0;
@@ -1276,7 +1282,7 @@ static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx)
 			rls->lock = rwm;
 			rls->count = 1;
 			WARN_ON(rls->list.prev && !list_empty(&rls->list));
-			list_add(&rls->list, &rwm->readers);
+			rt_rwlock_add_reader(rls, rwm);
 		} else
 			WARN_ON_ONCE(1);
 		spin_unlock(&current->pi_lock);
@@ -1473,7 +1479,7 @@ __rt_read_fasttrylock(struct rw_mutex *rwm)
 			spin_lock(&mutex->wait_lock);
 			rls = &current->owned_read_locks[reader_count];
 			if (!rls->list.prev || ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:28 pm

The system already has facilities to perform late/run-time init for
rtmutexes.  We want to add more advanced initialization later in the
series so we force all rtmutexes through the init path in preparation
for the later patches.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/rtmutex.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index b263bac..14774ce 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -64,8 +64,6 @@ struct hrtimer_sleeper;
 
 #define __RT_MUTEX_INITIALIZER(mutexname) \
 	{ .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \
-	, .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \
-	, .owner = NULL \
 	__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}
 
 #define DEFINE_RT_MUTEX(mutexname) \

--

From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:28 pm

We have previously only laid some of the groundwork to use the PI
library, but left the existing infrastructure in place in the
rtmutex code.  This patch converts the rtmutex PI code to officially
use the PI library.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/rt_lock.h   |    2 
 include/linux/rtmutex.h   |   15 -
 include/linux/sched.h     |   21 -
 kernel/fork.c             |    2 
 kernel/rcupreempt-boost.c |    2 
 kernel/rtmutex-debug.c    |    4 
 kernel/rtmutex-tester.c   |    4 
 kernel/rtmutex.c          |  944 ++++++++++++++-------------------------------
 kernel/rtmutex_common.h   |   18 -
 kernel/rwlock_torture.c   |   32 --
 kernel/sched.c            |   12 -
 11 files changed, 321 insertions(+), 735 deletions(-)

diff --git a/include/linux/rt_lock.h b/include/linux/rt_lock.h
index c00cfb3..c5da71d 100644
--- a/include/linux/rt_lock.h
+++ b/include/linux/rt_lock.h
@@ -14,6 +14,7 @@
 #include <asm/atomic.h>
 #include <linux/spinlock_types.h>
 #include <linux/sched_prio.h>
+#include <linux/pi.h>
 
 #ifdef CONFIG_PREEMPT_RT
 /*
@@ -67,6 +68,7 @@ struct rw_mutex {
 	atomic_t		count;	/* number of times held for read */
 	atomic_t		owners; /* number of owners as readers */
 	struct list_head	readers;
+	struct pi_sink          pi_sink;
 	int prio;
 };
 
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 14774ce..e069182 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
 #include <linux/linkage.h>
 #include <linux/plist.h>
 #include <linux/spinlock_types.h>
+#include <linux/pi.h>
 
 /**
  * The rt_mutex structure
@@ -27,6 +28,12 @@ struct rt_mutex {
 	raw_spinlock_t		wait_lock;
 	struct plist_head	wait_list;
 	struct task_struct	*owner;
+	struct {
+		struct pi_source src;
+		struct pi_node   node;
+		struct pi_sink   sink;
+		int              prio;
+	} pi;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
 	const char 		*name, *file;
@@ -96,12 +103,4 @@ ...
From: Gregory Haskins
Date: Friday, August 15, 2008 - 1:29 pm

PREEMPT_RT replaces most spinlock_t instances with a preemptible
real-time lock that supports priority inheritance.  An uncontended
(fastpath) acquisition of this lock has no more overhead than
its non-rt spinlock_t counterpart.  However, the contended case
has considerably more overhead so that the lock can maintain
proper priority queue order and support pi-boosting of the lock
owner, yet remaining fully preemptible.

Instrumentation shows that the majority of acquisitions under most
workloads falls either into the fastpath category, or the adaptive
spin category within the slowpath.  The necessity to pi-boost a
lock-owner should be sufficiently rare, yet the slow-path path
blindly incurs this overhead in 100% of contentions.

Therefore, this patch intends to capitalize on this observation
in order to reduce overhead and improve acquisition throughput.
It is important to note that real-time latency is still treated
as a higher order constraint than throughput, so the full
pi-protocol is observed using new carefully constructed rules
around the old concepts.

1) We check the priority of the owner relative to the waiter on
   each spin of the lock (if we are not boosted already).  If the
   owner's effective priority is logically less than the waiters
   priority, we must boost them.

2) We check the priority of ourselves against our current queue
   position on the waiters-list (if we are not boosted already).
   If our priority was changed, we need to re-queue ourselves to
   update our position.

3) We break out of the adaptive-spin if either of the above
   conditions (1), (2) change so that we can re-evaluate the
   lock conditions.

4) We must enter pi-boost mode if, at any time, we decide to
   voluntarily preempt since we are losing our ability to
   dynamically process the conditions above.

Note: We still fully support priority inheritance with this
protocol, even if we defer the low-level calls to adjust priority.
The difference is really in terms of being a ...
Previous thread: [PATCH -rt] apply rcu_process_callbacks() changes from mainline by Paul E. McKenney on Friday, August 1, 2008 - 2:11 pm. (1 message)

Next thread: sh O= builds broken by Adrian Bunk on Friday, August 1, 2008 - 2:27 pm. (4 messages)