semaphore: fix

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: <git-commits-head@...>
Date: Thursday, May 8, 2008 - 5:59 pm

Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bf726e...
Commit:     bf726eab3711cf192405d21688a4b21e07b6188a
Parent:     3de2403e6659d71b36ec820dc9b942762ddfe6eb
Author:     Ingo Molnar <mingo@elte.hu>
AuthorDate: Thu May 8 11:53:48 2008 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu May 8 17:00:42 2008 +0200

    semaphore: fix
    
    Yanmin Zhang reported:
    
    | Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th
    | regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton,
    | and Itanium Montecito. Bisect located the patch below:
    |
    | 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
    | commit 64ac24e738823161693bf791f87adc802cf529ff
    | Author: Matthew Wilcox <matthew@wil.cx>
    | Date:   Fri Mar 7 21:55:58 2008 -0500
    |
    |     Generic semaphore implementation
    |
    | After I manually reverted the patch against 2.6.26-rc1 while fixing
    | lots of conflicts/errors, aim7 regression became less than 2%.
    
    i reproduced the AIM7 workload and can confirm Yanmin's findings that
    -.26-rc1 regresses over .25 - by over 67% here.
    
    Looking at the workload i found and fixed what i believe to be the real
    bug causing the AIM7 regression: it was inefficient wakeup / scheduling
    / locking behavior of the new generic semaphore code, causing suboptimal
    performance.
    
    The problem comes from the following code. The new semaphore code does
    this on down():
    
            spin_lock_irqsave(&sem->lock, flags);
            if (likely(sem->count > 0))
                    sem->count--;
            else
                    __down(sem);
            spin_unlock_irqrestore(&sem->lock, flags);
    
    and this on up():
    
            spin_lock_irqsave(&sem->lock, flags);
            if (likely(list_empty(&sem->wait_list)))
                    sem->count++;
            else
                    __up(sem);
            spin_unlock_irqrestore(&sem->lock, flags);
    
    where __up() does:
    
            list_del(&waiter->list);
            waiter->up = 1;
            wake_up_process(waiter->task);
    
    and where __down() does this in essence:
    
            list_add_tail(&waiter.list, &sem->wait_list);
            waiter.task = task;
            waiter.up = 0;
            for (;;) {
                    [...]
                    spin_unlock_irq(&sem->lock);
                    timeout = schedule_timeout(timeout);
                    spin_lock_irq(&sem->lock);
                    if (waiter.up)
                            return 0;
            }
    
    the fastpath looks good and obvious, but note the following property of
    the contended path: if there's a task on the ->wait_list, the up() of
    the current owner will "pass over" ownership to that waiting task, in a
    wake-one manner, via the waiter->up flag and by removing the waiter from
    the wait list.
    
    That is all and fine in principle, but as implemented in
    kernel/semaphore.c it also creates a nasty, hidden source of contention!
    
    The contention comes from the following property of the new semaphore
    code: the new owner owns the semaphore exclusively, even if it is not
    running yet.
    
    So if the old owner, even if just a few instructions later, does a
    down() [lock_kernel()] again, it will be blocked and will have to wait
    on the new owner to eventually be scheduled (possibly on another CPU)!
    Or if another task gets to lock_kernel() sooner than the "new owner"
    scheduled, it will be blocked unnecessarily and for a very long time
    when there are 2000 tasks running.
    
    I.e. the implementation of the new semaphores code does wake-one and
    lock ownership in a very restrictive way - it does not allow
    opportunistic re-locking of the lock at all and keeps the scheduler from
    picking task order intelligently.
    
    This kind of scheduling, with 2000 AIM7 processes running, creates awful
    cross-scheduling between those 2000 tasks, causes reduced parallelism, a
    throttled runqueue length and a lot of idle time. With increasing number
    of CPUs it causes an exponentially worse behavior in AIM7, as the chance
    for a newly woken new-owner task to actually run anytime soon is less
    and less likely.
    
    Note that it takes just a tiny bit of contention for the 'new-semaphore
    catastrophy' to happen: the wakeup latencies get added to whatever small
    contention there is, and quickly snowball out of control!
    
    I believe Yanmin's findings and numbers support this analysis too.
    
    The best fix for this problem is to use the same scheduling logic that
    the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and
    wanted because we do not want to over-schedule), but also allow
    opportunistic locking of the lock even if a wakee is already "in
    flight".
    
    The patch below implements this new logic. With this patch applied the
    AIM7 regression is largely fixed on my quad testbox:
    
      # v2.6.25 vanilla:
      ..................
      Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
      2000    56096.4         91      207.5   789.7   0.4675
      2000    55894.4         94      208.2   792.7   0.4658
    
      # v2.6.26-rc1-166-gc0a1811 vanilla:
      ...................................
      Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
      2000    33230.6         83      350.3   784.5   0.2769
      2000    31778.1         86      366.3   783.6   0.2648
    
      # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
      ...............................................
      Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
      2000    55707.1         92      209.0   795.6   0.4642
      2000    55704.4         96      209.0   796.0   0.4642
    
    i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
    performance levels and have zero idle time during the test, as expected.
    
    Btw., interactivity also improved dramatically with the fix - for
    example console-switching became almost instantaneous during this
    workload (which after all is running 2000 tasks at once!), without the
    patch it was stuck for a minute at times.
    
    There's another nice side-effect of this speedup patch, the new generic
    semaphore code got even smaller:
    
       text    data     bss     dec     hex filename
       1241       0       0    1241     4d9 semaphore.o.before
       1207       0       0    1207     4b7 semaphore.o.after
    
    (because the waiter.up complication got removed.)
    
    Longer-term we should look into using the mutex code for the generic
    semaphore code as well - but i's not easy due to legacies and it's
    outside of the scope of v2.6.26 and outside the scope of this patch as
    well.
    
    Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/semaphore.c |   64 ++++++++++++++++++++++++---------------------------
 1 files changed, 30 insertions(+), 34 deletions(-)

diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..5e41217 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		__down(sem);
+	sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
 EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		result = __down_interruptible(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		result = __down_killable(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, long jiffies)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(!sem->count))
 		result = __down_timeout(sem, jiffies);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(list_empty(&sem->wait_list)))
-		sem->count++;
-	else
+	sem->count++;
+	if (unlikely(!list_empty(&sem->wait_list)))
 		__up(sem);
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
 struct semaphore_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	int up;
 };
 
 /*
@@ -205,33 +202,34 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 {
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
+	int ret = 0;
 
-	list_add_tail(&waiter.list, &sem->wait_list);
 	waiter.task = task;
-	waiter.up = 0;
+	list_add_tail(&waiter.list, &sem->wait_list);
 
 	for (;;) {
-		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
-			goto interrupted;
-		if (state == TASK_KILLABLE && fatal_signal_pending(task))
-			goto interrupted;
-		if (timeout <= 0)
-			goto timed_out;
+		if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (timeout <= 0) {
+			ret = -ETIME;
+			break;
+		}
 		__set_task_state(task, state);
 		spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		spin_lock_irq(&sem->lock);
-		if (waiter.up)
-			return 0;
+		if (sem->count > 0)
+			break;
 	}
 
- timed_out:
-	list_del(&waiter.list);
-	return -ETIME;
-
- interrupted:
 	list_del(&waiter.list);
-	return -EINTR;
+	return ret;
 }
 
 static noinline void __sched __down(struct semaphore *sem)
@@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem)
 {
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
-	list_del(&waiter->list);
-	waiter->up = 1;
 	wake_up_process(waiter->task);
 }
--
To unsubscribe from this list: send the line "unsubscribe git-commits-head" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
semaphore: fix, Linux Kernel Mailing List..., (Thu May 8, 5:59 pm)