We've been poking around in semtimedop for a while now, mostly because it is consistently showing up at the top of the CPU profiles for benchmarking runs on big numa systems. The biggest problem seems to be the IPC lock, and the fact that we hold it for a long time while we loop over different lists and try to do semaphore operations. Zach Brown came up with a set of patches a while ago that switched away from the global pending list, and semtimedop was recently optimized for the single sop case by Nick and Manfred. This patch series tries to build on ideas from all of these patches. The list of pending semaphore operations is pushed down to the individual semaphore and the locking is also pushed down into the semaphore. The result is much faster with my micro benchmark: http://oss.oracle.com/~mason/sembench.c It more than doubles the total number of post/wait cycles the benchmark is able to get in 30s. Before this patch, semtimedop scored about the same as futexes for the post/wait cycles, and so now it is 2x faster. I did run this code through all of the ltp ipc tests, and later this week I hope to get a full tpc database benchmark on it. --
One feature of the ipc semaphores is they are defined to be atomic for the full set of operations done per syscall. So if you do a semtimedop syscall changing 100 semaphores, the kernel needs to try all 100 changes and only apply them when all 100 are able to succeed without putting the process to sleep. Today we use a single lock per semaphore array (the ipc lock). This lock is held every time we try a set of operations requested by userland, and when taken again when a process is woken up. Whenever a given set of changes sent to semtimedop would sleep, that set is queued up on a big list of pending changes for the entire semaphore array. Whenever a semtimedop call changes a single semaphore value, it walks the entire list of pending operations to see if any of them can now succeed. The ipc lock is held for this entire loop. This change makes two major changes, pushing both the list of pending operations and a spinlock down to each individual semaphore. Now: Whenever a given semaphore modification is going to block, the set of operations semtimedop wants to do is saved onto that semaphore's list. Whenever a givem semtimedop call changes a single semaphore value, it walks the list of pending operations on that single semaphore to see if they can now succeed. If any of the operations will block on a different semaphore, they are moved to that semaphore's list. The locking is now done per-semaphore. In order to get the changes done atomically, the lock of every semaphore being changed is taken while we test the requested operations. We sort the operations by semaphore id to make sure we don't deadlock in the kernel. I have a microbenchmark to test how quickly we can post and wait in bulk. With this change, semtimedop is able do to more than twice as much work in the same run. On a large numa machine, it brings the IPC lock system time (reported by perf) down from 85% to 15%. Signed-off-by: Chris Mason <chris.mason@oracle.com> --- include/linux/sem.h | ...
Hi Chris,
Have you thought about odd corner cases:
Nick noticed the last time that it is possible to wait for arbitrary values:
in one semop:
- decrease semaphore 5 by 10
- wait until semaphore 5 is 0
Does sorting preserve the behavior?
--
Do you mean within a single sop array doing all three of these? I don't know if the sort is going to leave the three operations on semaphore 5 in the same order (it probably won't). But I could change that by having it include the slot in the original sop array in the sorting. That way if we have duplicate semnums in the array, they will end up in the same position relative to each other in the sorted result. (ewwww ;) -chris --
I had a bit of a hack at doing per-semaphore stuff when I was looking at the first optimization, but it was tricky to make it work. The other thing I don't know if your patch gets right is requeueing on of the operations. When you requeue from one list to another, then you seem to lose ordering with other pending operations, so that would seem to break the API as well (can't remember if the API strictly mandates FIFO, but anyway it can open up starvation cases). I was looking at doing a sequence number to be able to sort these, but it ended up getting over complex (and SAP was only using simple ops so it didn't seem to need much better). We want to be careful not to change semantics at all. And it gets tricky quickly :( What about Zach's simpler wakeup API? --
I don't see anything in the docs about the FIFO order. I could add an extra sort on sequence number pretty easily, but is the starvation case Yeah, that's why my patches include code to handle userland sending duplicate semids. Zach's simpler API is cooking too, but if I can get this done without insane complexity it helps with more than just the post/wait oracle workload. -chris --
Yes, because it's not just a theoretical livelock, it can be basically a certainty, given the right pattern of semops. You could have two mostly-independent groups of processes, each taking and releasing a different sem, which are always contended (eg. if it is being used for a producer-consumer type situation, or even just mutual exclusion with high contention). Then you could have some overall management process for example which Iam worried about complexity and slowing other cases, given that Oracle DB seems willing to adapt to the (better suited) new API. So I'd be interested to know what it helps outside Oracle. --
Sure, I'd hope that your benchmark from last time around is faster now. -chris --
I didn't actually reproduce it here, I think it was a customer or partner workload. But SAP only seemed to have one contended semnum in its array, and it was being operated on with "simple" semops (so that's about as far as the patches went). I didn't notice anything that should make that go faster? Yes, with such a workload, using semops is basically legacy and simple mutexes should work better. So I'm not outright against improving sysv sem performance for more complex cases where nothing else we have works as well. --
Detecting the dups just keeps me from deadlocking. I'm locking each individual semaphore in sequence, so if userland does something strange and sends two updates to the same semaphore, the code detects that and Since I'm avoiding the ipc lock while operating on the array, it'll help any workload that hits on two or more semaphores in the array at I'm not in a hurry to overhaul a part of the kernel that has been stable for a long time. But it really needs some love I think. I'll have more numbers from a tpc run later this week. -chris --
Yeah, I don't think SAP did that, significantly to matter. Possibly Yep, I'm not against it. "industry standard benchmark" numbers would be great. I do think we need to be really careful with semantics though. The API's been around for long enough that it is going to have been (ab)used in every way possible :) --
The management process won't get the sem on Linux either:
Linux implements FIFO, but there is no protection at all against starvation.
If I understand the benchmark numbers correctly, a 4-core, 2 GHz Phenom
is able to do ~ 2 million semaphore operations per second in one
semaphore array.
That's the limit - cache line trashing on the sma structure prevent
higher numbers.
For a NUMA system, the limit is probably lower.
Chris:
Do you have an estimate how many semop() your app will perform in one array?
Perhaps we should really remove the per-array list, sma->sem_perm.lock
and sma->sem_otime.
--
Manfred
--
There are two different workloads at play. The first is to just use semaphores as a lock, which is a traditional mutex type operation (one at a time). This isn't a problem with the current code, aside from lock contention created by the second case. The second case is batched wakeup. One process will wake hundreds or So far for our uses the per-array list was the biggest trouble. I tried to benchmark your patches on Friday, but these are preproduction systems and it appears to have woken up in a bad mood. The hardware guys are giving it some TLC and I'll be able to run on Monday. -chris --
Yeah I did realise this after I posted. But anyway I think FIFO is reasonable to have, although you *may* be able to justify removing it after your research of other UNIXes, if there are --
How do you want to determine the sequence number?
What is the oracle workload, which multi-sembuf operations does it use?
How many semaphores are in one array?
When the last optimizations were written, I've searched a bit:
- postgres uses per-process semaphores, with small semaphore arrays.
[process sleeps on it's own semaphore and is woken up by someone
else when it can make progress]
- with google, I couldn't find anything relevant that uses multi-sembuf
semop() calls.
And I agree with Nick: We should be careful about changing the API.
--
Manfred
--
I haven't tried yet, but hopefully it won't be a problem. A later patch does atomics on the reference count and it doesn't show up in the This is similar to Oracle (and the sembench program). Each process has a semaphore and when it is waiting for a commit it goes to sleep on it. They are woken up in bulk with semtimedop calls from a single process. But oracle also uses semaphores for locking in a traditional sense. Putting the waiters into a per-semaphore list is really only part of the speedup. The real boost comes from the patch to break up the locks into a per semaphore lock. We gain another 10-15% from a later patch that gets uses atomics on the refcount, which lets us do sem_putref without a lock (meaning we're lockless once we get woken up). I think this should help any workload that has more than one semaphore Definitely, thanks for reading through it. -chris --
Hmm. Thus you have:
- single sembuf decrease operations that are waiting frequently.
- multi-sembuf increase operations.
What about optimizing for that case?
Increase operations succeed immediately. Thus complex_count is 0.
If we have performed an update operation, then we can scan all
simple_lists that have seen an increase instead of checking the global
list - as long as there are no complex operations waiting.
Right now, we give up if the update operation was a complex operation -
but that does not matter.
All that matters are the sleeping operations, not the operation that did
the wakeup.
Ok. Then simple tricks won't help.
How many semaphores are in one array?
--
Manfred
I've been wondering about that. I can optimize the patch to special case the increase operations. The only problem I saw was checking or the range overflow. Current behavior will abort the whole set if the Zach Brown's original patch set tried just the list magic and not On a big system I saw about 4000 semaphores total. The database will just allocate as many as it can into a single array and keep creating arrays until it has all it needs. -chris --
What happens if SEMMSL is reduced (first entry in /proc/sys/kernel/sem)?
--
Manfred
--
Performance improves slightly but the ipc lock is still at the top of the profile ;) -chris --
> What about Zach's simpler wakeup API? It's making slow progress in the background as a longer-term experiment. http://oss.oracle.com/~zab/wake-many/ That URL still has an API description, patches, and little test utilities for the simple first draft. - z --
Looking at the current code:
- update_queue() can be O(N^2) if only some of the waiting tasks are
woken up.
Actually: all non-woken up tasks are rescanned after a task that can be
woken up is found.
- Your test app tests the best case for the current code:
You wake up the tasks in the same order as the called semop().
If you invert the order (i.e.: worklist_add() adds to head instead of
tail), I would expect an even worse performance of the current code.
The O(N^2) is simple to fix, I've attached a patch.
For your micro-benchmark, the patch does not change much: you wake-up
in-order, thus the current code does not misbehave.
Do you know how Oracle wakes up the tasks?
No: The code accesses a local variable. The loop above the comment
Is it possible to move the sem_putref into wakeup_sem_queue()?
Right now, the exit path of semtimedop doesn't touch the spinlock.
You remove that optimization.
--
Manfred
Ordering in terms of the sem array? I had them try many variations ;) I I'll look at this, we need to be able to go through the sma to remove the process from the lists if it woke up on its own, but I don't see why we can't putref in wakeup. My current revision of the patch uses an atomic instead of the lock, so it restores the lockless wakeup either way. Still it is better to putref in wakeup. -chris --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
