[PATCH 3/3] [PATCH] ipc/sem.c: cacheline align the ipc spinlock for semaphores

Previous thread: [PATCH] uml: Fix build breakage after slab.h changes by Jan Kiszka on Sunday, April 18, 2010 - 9:37 am. (10 messages)

Next thread: uml: pthreads instead of manual clone()? by Jan Kiszka on Sunday, April 18, 2010 - 10:50 am. (2 messages)
From: Manfred Spraul
Date: Sunday, April 18, 2010 - 10:07 am

Hi,

The following series of patches tries to fix the spinlock contention
reported by Chris Manson: His benchmark exposes problems of the current
code:

- In the worst case, the algorithm used by update_queue() is O(N^2).
  Bulk wake-up calls can enter this worst case.
  The patch series fix that.
  Note that the benchmark app doesn't expose the problem, it just should
  be fixed: Real world apps might do the wake-ups in another order
  than perfect FIFO.

- The part of the code that runs within the semaphore array spinlock
  is significantly larger than necessary.
  The patch series fixes that. This change is responsible for the
  main improvement.

- The cacheline with the spinlock is also used for a variable that
  is read in the hot path (sem_base) and for a variable that is
  unnecessarily written to multiple times (sem_otime).
  The last step of the series cacheline-aligns the spinlock.

With regards to the last patch:
I'm not sure if this should be performed:
Cacheline aligning a spinlock helps to avoid cacheline trashing
among multiple cpus, at the cost of wasting memory.
Testing on a large system might show if it's worth the memory.

I've tested the series with a modified version of Chris' benchmark
app [1]:

- before:
echo "512 32000 512 512" > /proc/sys/kernel/sem
./sembench -t 250 -w 64 -o 0 -r 30 -b:
<<<
250 threads, waking 64 at a time
using ipc sem operations
main thread burns: 2194274
worker burn count total 6352691 min 21505 max 29230 avg 25410
run time 30 seconds 211756 worker burns per second
lifetime 30 seconds 58573142769 ticks.
free 745344 contended 7801967
wait 110278262115 avg 14134 1/1000 cpu time 470
<<<

- after:
echo "512 32000 512 512" > /proc/sys/kernel/sem
./sembench -t 250 -w 64 -o 0 -r 30 -b:
<<<
250 threads, waking 64 at a time
using ipc sem operations
main thread burns: 112890
worker burn count total 7084335 min 27869 max 28769 avg 28337
run time 30 seconds 236144 worker burns per second
lifetime 30 seconds 59518717621 ...
From: Manfred Spraul
Date: Sunday, April 18, 2010 - 10:07 am

The SysV semaphore code allows to perform multiple operations on
all semaphores in the array as atomic operations.
After a modification, update_queue() checks which of the waiting
tasks can complete.
The algorithm that is used to identify the tasks is O(N^2) in the
worst case.
The patch fixes that for the case of an array where all sleeping
tasks are single sembuf operations and a multi-sembuf operation
is used to wake up multiple tasks.
---
 ipc/sem.c |  112 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 98 insertions(+), 14 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index dbef95b..efadb6d 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -434,6 +434,69 @@ static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
 		sma->complex_count--;
 }
 
+/** check_restart(sma, q)
+ * @sma: semaphore array
+ * @q: the operation that just completed
+ *
+ * update_queue is O(N^2) when it restarts scanning the whole queue of
+ * waiting operations. Therefore this function checks if the restart is
+ * really necessary. It is called after a previously waiting operation
+ * was completed.
+ */
+static int check_restart(struct sem_array *sma, struct sem_queue *q)
+{
+	struct sem * curr;
+	struct sem_queue *h;
+
+	/* if the operation didn't modify the array, then no restart */
+	if (q->alter == 0)
+		return 0;
+
+	/* pending complex operations are too difficult to analyse */
+	if (sma->complex_count)
+		return 1;
+
+	/* we were a sleeping complex operation. Too difficult */
+	if (q->nsops > 1)
+		return 1;
+
+	curr = sma->sem_base + q->sops[0].sem_num;
+
+	/* No-one waits on this queue */
+	if (list_empty(&curr->sem_pending))
+		return 0;
+
+	/* the new semaphore value */
+	if (curr->semval) {
+		/* It is impossible that someone waits for the new value:
+		 * - q is a previously sleeping simple operation that
+		 *   altered the array. It must be a decrement, because
+		 *   simple increments never sleep.
+		 * - The value is ...
From: Manfred Spraul
Date: Sunday, April 18, 2010 - 10:07 am

This patch cacheline aligns the spinlock for sysv semaphores:
Without the patch, the spinlock and sem_otime [written by
every semop that modified the array] and sem_base [read in the
hot path of try_atomic_semop()] can be in the same cacheline.
---
 include/linux/sem.h |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 8a4adbe..f2961af 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -79,6 +79,7 @@ struct  seminfo {
 #ifdef __KERNEL__
 #include <asm/atomic.h>
 #include <linux/rcupdate.h>
+#include <linux/cache.h>
 
 struct task_struct;
 
@@ -91,7 +92,8 @@ struct sem {
 
 /* One sem_array data structure for each set of semaphores in the system. */
 struct sem_array {
-	struct kern_ipc_perm	sem_perm;	/* permissions .. see ipc.h */
+	struct kern_ipc_perm	____cacheline_aligned_in_smp
+				sem_perm;	/* permissions .. see ipc.h */
 	time_t			sem_otime;	/* last semop time */
 	time_t			sem_ctime;	/* last change time */
 	struct sem		*sem_base;	/* ptr to first semaphore in array */
-- 
1.6.6.1

--

From: Manfred Spraul
Date: Sunday, April 18, 2010 - 10:07 am

For the sysv sem code, waking up a tasks consists out of two steps:
- the right tasks must be identified.
- they must be woken up.

Right now, both steps run while the array spinlock is held.
This patch reorders the code and moves the actual wake_up_process()
behind the point where the spinlock is dropped.
---
 ipc/sem.c |  111 +++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 79 insertions(+), 32 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index efadb6d..34ae151 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -381,7 +381,6 @@ static int try_atomic_semop (struct sem_array * sma, struct sembuf * sops,
 		sop--;
 	}
 	
-	sma->sem_otime = get_seconds();
 	return 0;
 
 out_of_range:
@@ -404,25 +403,41 @@ undo:
 	return result;
 }
 
-/*
- * Wake up a process waiting on the sem queue with a given error.
- * The queue is invalid (may not be accessed) after the function returns.
+/** wake_up_sem_queue_prepare(q, error): Prepare wake-up
+ * @q: queue entry that must be signaled
+ * @error: Error value for the signal
+ *
+ * Prepare the wake-up of the queue entry q.
  */
-static void wake_up_sem_queue(struct sem_queue *q, int error)
+static void wake_up_sem_queue_prepare(struct list_head *pt, struct sem_queue *q, int error)
 {
-	/*
-	 * Hold preempt off so that we don't get preempted and have the
-	 * wakee busy-wait until we're scheduled back on. We're holding
-	 * locks here so it may not strictly be needed, however if the
-	 * locks become preemptible then this prevents such a problem.
-	 */
-	preempt_disable();
+	if (list_empty(pt)) {
+		/*
+		 * Hold preempt off so that we don't get preempted and have the
+		 * wakee busy-wait until we're scheduled back on.
+		 */
+		preempt_disable();
+	}
 	q->status = IN_WAKEUP;
-	wake_up_process(q->sleeper);
-	/* hands-off: q can disappear immediately after writing q->status. */
-	smp_wmb();
-	q->status = error;
-	preempt_enable();
+	q->pid = error;
+
+	list_add_tail(&q->simple_list, ...
Previous thread: [PATCH] uml: Fix build breakage after slab.h changes by Jan Kiszka on Sunday, April 18, 2010 - 9:37 am. (10 messages)

Next thread: uml: pthreads instead of manual clone()? by Jan Kiszka on Sunday, April 18, 2010 - 10:50 am. (2 messages)