Re: [PATCH 1/2] ipc semaphores: reduce ipc_lock contention in semtimedop

Previous thread: [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU by Chris Mason on Monday, April 12, 2010 - 11:49 am. (2 messages)

Next thread: WARNING: at fs/sysfs/dir.c:451 sysfs_add_one: sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/slot' by Eric Paris on Monday, April 12, 2010 - 12:08 pm. (4 messages)
From: Chris Mason
Date: Monday, April 12, 2010 - 11:49 am

We've been poking around in semtimedop for a while now, mostly because
it is consistently showing up at the top of the CPU profiles for benchmarking
runs on big numa systems.  The biggest problem seems to be the IPC lock, and
the fact that we hold it for a long time while we loop over different lists and
try to do semaphore operations.

Zach Brown came up with a set of patches a while ago that switched away from
the global pending list, and semtimedop was recently optimized for the
single sop case by Nick and Manfred.

This patch series tries to build on ideas from all of these patches.  The
list of pending semaphore operations is pushed down to the individual
semaphore and the locking is also pushed down into the semaphore.  The
result is much faster with my micro benchmark:

http://oss.oracle.com/~mason/sembench.c

It more than doubles the total number of post/wait cycles the benchmark
is able to get in 30s.  Before this patch, semtimedop scored about the
same as futexes for the post/wait cycles, and so now it is 2x faster.

I did run this code through all of the ltp ipc tests, and later this week I
hope to get a full tpc database benchmark on it.

--

From: Chris Mason
Date: Monday, April 12, 2010 - 11:49 am

One feature of the ipc semaphores is they are defined to be
atomic for the full set of operations done per syscall.  So if you do a
semtimedop syscall changing 100 semaphores, the kernel needs to try all
100 changes and only apply them when all 100 are able to succeed without
putting the process to sleep.

Today we use a single lock per semaphore array (the ipc lock).  This lock is
held every time we try a set of operations requested by userland, and
when taken again when a process is woken up.

Whenever a given set of changes sent to semtimedop would sleep, that
set is queued up on a big list of pending changes for the entire
semaphore array.

Whenever a semtimedop call changes a single semaphore value, it
walks the entire list of pending operations to see if any of them
can now succeed.  The ipc lock is held for this entire loop.

This change makes two major changes, pushing both the list of pending
operations and a spinlock down to each individual semaphore.  Now:

Whenever a given semaphore modification is going to block, the set of
operations semtimedop wants to do is saved onto that semaphore's list.

Whenever a givem semtimedop call changes a single semaphore value, it
walks the list of pending operations on that single semaphore to see if
they can now succeed.  If any of the operations will block on a
different semaphore, they are moved to that semaphore's list.

The locking is now done per-semaphore.  In order to get the changes done
atomically, the lock of every semaphore being changed is taken while we
test the requested operations.  We sort the operations by semaphore id
to make sure we don't deadlock in the kernel.

I have a microbenchmark to test how quickly we can post and wait in
bulk.  With this change, semtimedop is able do to more than twice
as much work in the same run.  On a large numa machine, it brings
the IPC lock system time (reported by perf) down from 85% to 15%.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
---
 include/linux/sem.h |  ...
From: Manfred Spraul
Date: Tuesday, April 13, 2010 - 10:15 am

Hi Chris,


Have you thought about odd corner cases:
Nick noticed the last time that it is possible to wait for arbitrary values:
in one semop:
     - decrease semaphore 5 by 10
     - wait until semaphore 5 is 0
Does sorting preserve the behavior?
--

From: Chris Mason
Date: Tuesday, April 13, 2010 - 10:39 am

Do you mean within a single sop array doing all three of these?  I don't
know if the sort is going to leave the three operations on semaphore 5
in the same order (it probably won't).

But I could change that by having it include the slot in the original
sop array in the sorting.  That way if we have duplicate semnums in the
array, they will end up in the same position relative to each other in
the sorted result.

(ewwww ;)

-chris
--

From: Nick Piggin
Date: Tuesday, April 13, 2010 - 11:09 am

I had a bit of a hack at doing per-semaphore stuff when I was looking
at the first optimization, but it was tricky to make it work.

The other thing I don't know if your patch gets right is requeueing on
of the operations. When you requeue from one list to another, then you
seem to lose ordering with other pending operations, so that would
seem to break the API as well (can't remember if the API strictly
mandates FIFO, but anyway it can open up starvation cases).

I was looking at doing a sequence number to be able to sort these, but
it ended up getting over complex (and SAP was only using simple ops so
it didn't seem to need much better).

We want to be careful not to change semantics at all. And it gets
tricky quickly :( What about Zach's simpler wakeup API?


--

From: Chris Mason
Date: Tuesday, April 13, 2010 - 11:19 am

I don't see anything in the docs about the FIFO order.  I could add an
extra sort on sequence number pretty easily, but is the starvation case

Yeah, that's why my patches include code to handle userland sending
duplicate semids.  Zach's simpler API is cooking too, but if I can get
this done without insane complexity it helps with more than just the
post/wait oracle workload.

-chris

--

From: Nick Piggin
Date: Tuesday, April 13, 2010 - 11:57 am

Yes, because it's not just a theoretical livelock, it can be basically
a certainty, given the right pattern of semops.

You could have two mostly-independent groups of processes, each taking
and releasing a different sem, which are always contended (eg. if it is
being used for a producer-consumer type situation, or even just mutual
exclusion with high contention).

Then you could have some overall management process for example which


Iam worried about complexity and slowing other cases, given that Oracle
DB seems willing to adapt to the (better suited) new API. So I'd be
interested to know what it helps outside Oracle.

--

From: Chris Mason
Date: Tuesday, April 13, 2010 - 12:01 pm

Sure, I'd hope that your benchmark from last time around is faster now.

-chris

--

From: Nick Piggin
Date: Tuesday, April 13, 2010 - 12:25 pm

I didn't actually reproduce it here, I think it was a customer or
partner workload. But SAP only seemed to have one contended semnum in
its array, and it was being operated on with "simple" semops (so that's
about as far as the patches went).

I didn't notice anything that should make that go faster?

Yes, with such a workload, using semops is basically legacy and simple
mutexes should work better. So I'm not outright against improving sysv
sem performance for more complex cases where nothing else we have works
as well.

--

From: Chris Mason
Date: Tuesday, April 13, 2010 - 12:38 pm

Detecting the dups just keeps me from deadlocking.  I'm locking each
individual semaphore in sequence, so if userland does something strange
and sends two updates to the same semaphore, the code detects that and

Since I'm avoiding the ipc lock while operating on the array, it'll help
any workload that hits on two or more semaphores in the array at

I'm not in a hurry to overhaul a part of the kernel that has been stable
for a long time.  But it really needs some love I think.  I'll have more
numbers from a tpc run later this week.

-chris

--

From: Nick Piggin
Date: Tuesday, April 13, 2010 - 1:05 pm

Yeah, I don't think SAP did that, significantly to matter. Possibly

Yep, I'm not against it. "industry standard benchmark" numbers would
be great.

I do think we need to be really careful with semantics though. The
API's been around for long enough that it is going to have been
(ab)used in every way possible :)

--

From: Manfred Spraul
Date: Sunday, May 16, 2010 - 9:57 am

The management process won't get the sem on Linux either:
Linux implements FIFO, but there is no protection at all against starvation.

If I understand the benchmark numbers correctly, a 4-core, 2 GHz Phenom 
is able to do ~ 2 million semaphore operations per second in one 
semaphore array.
That's the limit - cache line trashing on the sma structure prevent 
higher numbers.

For a NUMA system, the limit is probably lower.

Chris:
Do you have an estimate how many semop() your app will perform in one array?

Perhaps we should really remove the per-array list, sma->sem_perm.lock 
and sma->sem_otime.

--
     Manfred
--

From: Chris Mason
Date: Sunday, May 16, 2010 - 3:40 pm

There are two different workloads at play.  The first is to just use
semaphores as a lock, which is a traditional mutex type operation (one
at a time).  This isn't a problem with the current code, aside from lock
contention created by the second case.

The second case is batched wakeup.  One process will wake hundreds or

So far for our uses the per-array list was the biggest trouble.  I tried
to benchmark your patches on Friday, but these are preproduction systems
and it appears to have woken up in a bad mood.

The hardware guys are giving it some TLC and I'll be able to run on
Monday.

-chris

--

From: Nick Piggin
Date: Monday, May 17, 2010 - 12:20 am

Yeah I did realise this after I posted. But anyway I think FIFO
is reasonable to have, although you *may* be able to justify
removing it after your research of other UNIXes, if there are
--

From: Manfred Spraul
Date: Wednesday, April 14, 2010 - 9:16 am

How do you want to determine the sequence number?
What is the oracle workload, which multi-sembuf operations does it use?
How many semaphores are in one array?

When the last optimizations were written, I've searched a bit:
- postgres uses per-process semaphores, with small semaphore arrays.
     [process sleeps on it's own semaphore and is woken up by someone 
else when it can make progress]
- with google, I couldn't find anything relevant that uses multi-sembuf 
semop() calls.

And I agree with Nick: We should be careful about changing the API.

--
     Manfred
--

From: Chris Mason
Date: Wednesday, April 14, 2010 - 10:33 am

I haven't tried yet, but hopefully it won't be a problem.  A later patch
does atomics on the reference count and it doesn't show up in the

This is similar to Oracle (and the sembench program).  Each process has
a semaphore and when it is waiting for a commit it goes to sleep on it.
They are woken up in bulk with semtimedop calls from a single process.

But oracle also uses semaphores for locking in a traditional sense.

Putting the waiters into a per-semaphore list is really only part of the
speedup.  The real boost comes from the patch to break up the locks into
a per semaphore lock.

We gain another 10-15% from a later patch that gets uses atomics on the
refcount, which lets us do sem_putref without a lock (meaning we're
lockless once we get woken up).


I think this should help any workload that has more than one semaphore

Definitely, thanks for reading through it.

-chris

--

From: Manfred Spraul
Date: Wednesday, April 14, 2010 - 12:11 pm

Hmm. Thus you have:
- single sembuf decrease operations that are waiting frequently.
- multi-sembuf  increase operations.

What about optimizing for that case?
Increase operations succeed immediately. Thus complex_count is 0.

If we have performed an update operation, then we can scan all 
simple_lists that have seen an increase instead of checking the global 
list - as long as there are no complex operations waiting.
Right now, we give up if the update operation was a complex operation - 
but that does not matter.
All that matters are the sleeping operations, not the operation that did 
the wakeup.
Ok. Then simple tricks won't help.
How many semaphores are in one array?

--
     Manfred
From: Chris Mason
Date: Wednesday, April 14, 2010 - 12:50 pm

I've been wondering about that.  I can optimize the patch to special
case the increase operations.  The only problem I saw was checking or
the range overflow.  Current behavior will abort the whole set if the

Zach Brown's original patch set tried just the list magic and not

On a big system I saw about 4000 semaphores total.  The database will
just allocate as many as it can into a single array and keep creating
arrays until it has all it needs.

-chris

--

From: Manfred Spraul
Date: Thursday, April 15, 2010 - 9:33 am

What happens if SEMMSL is reduced (first entry in /proc/sys/kernel/sem)?

--
     Manfred
--

From: Chris Mason
Date: Thursday, April 15, 2010 - 9:34 am

Performance improves slightly but the ipc lock is still at the top of the
profile ;)

-chris

--

From: Zach Brown
Date: Tuesday, April 13, 2010 - 11:24 am

> What about Zach's simpler wakeup API?

It's making slow progress in the background as a longer-term experiment.

http://oss.oracle.com/~zab/wake-many/

That URL still has an API description, patches, and little test  
utilities for the simple first draft.

- z
--

From: Manfred Spraul
Date: Friday, April 16, 2010 - 4:26 am

Looking at the current code:
- update_queue() can be O(N^2) if only some of the waiting tasks are 
woken up.
Actually: all non-woken up tasks are rescanned after a task that can be 
woken up is found.

- Your test app tests the best case for the current code:
You wake up the tasks in the same order as the called semop().
If you invert the order (i.e.: worklist_add() adds to head instead of 
tail), I would expect an even worse performance of the current code.

The O(N^2) is simple to fix, I've attached a patch.
For your micro-benchmark, the patch does not change much: you wake-up 
in-order, thus the current code does not misbehave.

Do you know how Oracle wakes up the tasks?
No: The code accesses a local variable. The loop above the comment 
Is it possible to move the sem_putref into wakeup_sem_queue()?
Right now, the exit path of semtimedop doesn't touch the spinlock.
You remove that optimization.

--
     Manfred
From: Chris Mason
Date: Friday, April 16, 2010 - 4:45 am

Ordering in terms of the sem array?  I had them try many variations ;) I


I'll look at this, we need to be able to go through the sma to remove
the process from the lists if it woke up on its own, but I don't see why
we can't putref in wakeup.

My current revision of the patch uses an atomic instead of the lock, so it
restores the lockless wakeup either way.  Still it is better to putref in
wakeup.

-chris
--

Previous thread: [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU by Chris Mason on Monday, April 12, 2010 - 11:49 am. (2 messages)

Next thread: WARNING: at fs/sysfs/dir.c:451 sysfs_add_one: sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/slot' by Eric Paris on Monday, April 12, 2010 - 12:08 pm. (4 messages)