Hi all,
I noticed that sysv ipc now uses very special locking: first a global
rw-semaphore, then within that semaphore rcu:
ids->rw_mutex is a per-namespace (i.e.: usually global) semaphore. Thus
ipc_lock writes into a global cacheline. Everything else is based on
per-object locking, especially sysv sem doesn't contain a single global
lock/statistic counter/...
That can't be the Right Thing (tm): Either there are cases where we need
the scalability (then using IDRs is impossible), or the scalability is
never needed (then the remaining parts from RCU should be removed).
I don't have a suitable test setup, has anyone performed benchmarks
recently?
Is sysv semaphore still important, or have all apps moved to posix
semaphores/futexes?
Nadia: Do you have access to a suitable benchmark?

A microbenchmark on a single-cpu system doesn't help much (except that
2.6.25 is around a factor of 2 slower for sysv msg ping-pong between two
tasks compared to the numbers I remember from older kernels...)
--
Manfred
--
If I remember well, at that time I had used ctxbench and I wrote some
other small scripts.
And the results I had were around a 2 or 3% slowdown, but I have to
confirm that by checking my archives.

I'll also have a look at the remaining RCU critical sections in the code.

Regards,
Nadia
--
Do you have access to multi-core systems? The "best case" for the rcu
code would be
- 8 or 16 cores
- one instance of ctxbench running on each core, bound to that core.

I'd expect a significant slowdown. The big question is if it matters.
--
Manfred
--
Hi,
Here is what I could find on my side:
=============================================================
lkernel@akt$ cat tst3/res_new/output
[root@akt tests]# echo 32768 > /proc/sys/kernel/msgmni
[root@akt tests]# ./msgbench_std_dev_plot -n
32768000 msgget iterations in 21.469724 seconds = 1526294/sec
32768000 msgsnd iterations in 18.891328 seconds = 1734583/sec
32768000 msgctl(ipc_stat) iterations in 15.359802 seconds = 2133472/sec
32768000 msgctl(msg_stat) iterations in 15.296114 seconds = 2142260/sec
32768000 msgctl(ipc_rmid) iterations in 32.981277 seconds = 993542/sec
           AVERAGE        STD_DEV      MIN     MAX
GET:       21469.724000   566.024657   19880   23607
SEND:      18891.328000   515.542311   18433   21962
IPC_STAT:  15359.802000   274.918673   15147   17166
MSG_STAT:  15296.114000   155.775508   15138   16790
RM:        32981.277000   675.621060   32141   35433

lkernel@akt$ cat tst3/res_ref/output
[root@akt tests]# echo 32768 > /proc/sys/kernel/msgmni
[root@akt tests]# ./msgbench_std_dev_plot -r
32768000 msgget iterations in 665.842852 seconds = 49213/sec
32768000 msgsnd iterations in 18.363853 seconds = 1784458/sec
32768000 msgctl(ipc_stat) iterations in 14.609669 seconds = 2243001/sec
32768000 msgctl(msg_stat) iterations in 14.774829 seconds = 2217950/sec
32768000 msgctl(ipc_rmid) iterations in 31.134984 seconds = 1052483/sec
           AVERAGE         STD_DEV      MIN      MAX
GET:       665842.852000   946.697555   654049   672208
SEND:      18363.853000    107.514954   18295    19563
IPC_STAT:  14609.669000    43.100272    14529    14881
MSG_STAT:  14774.829000    97.174924    14516    15436
RM:        31134.984000    444.612055   30521    33523
=============================================================
Unfortunately, I haven't kept the exact kernel release numbers, but the
testing method was:
res_ref = unpatched kernel
res_new = same kernel release with my patches applied.

What I'll try to do i...
I could give it a spin -- though I would need to be pointed to the
patch and the test.

Thanx, Paul
--
I'd just compare a recent kernel with something older, pre Fri Oct 19
11:53:44 2007.

Then download ctxbench, run one instance on each core, bound with taskset.
http://www.tmr.com/%7Epublic/source/
(I don't use ctxbench myself, if it doesn't work then I could post my
own app. It would be i386 only with RDTSCs inside)

I'll try to run it on my PentiumIII/850, right now I'm still setting
everything up.
--
Manfred
--
(test gizmos are always welcome)
Results for Q6600 box don't look particularly wonderful.
taskset -c 3 ./ctx -s

2.6.24.3
3766962 itterations in 9.999845 seconds = 376734/sec

2.6.22.18-cfs-v24.1
4375920 itterations in 10.006199 seconds = 437330/sec

for i in 0 1 2 3; do taskset -c $i ./ctx -s& done

2.6.22.18-cfs-v24.1
4355784 itterations in 10.005670 seconds = 435361/sec
4396033 itterations in 10.005686 seconds = 439384/sec
4390027 itterations in 10.006511 seconds = 438739/sec
4383906 itterations in 10.006834 seconds = 438128/sec

2.6.24.3
1269937 itterations in 9.999757 seconds = 127006/sec
1266723 itterations in 9.999663 seconds = 126685/sec
1267293 itterations in 9.999348 seconds = 126742/sec
1265793 itterations in 9.999766 seconds = 126592/sec

-Mike
--
Attached is a patch that I wrote that adds cpu binding. Feel free to add
it to your sources. It's not that useful, recent linux distros include
a "taskset" command that can bind a task to a given cpu. I needed it for
an older distro.

With regards to the multi-core case: I've always ignored it, I
couldn't find a good/realistic test case.
Thundering herds (i.e.: one task wakes up lots of waiting tasks) are
lockless, at least for sysv msg and sysv sem: the woken up tasks do not
take any locks, they return immediately to user space.
Additionally, I don't know if the test case is realistic: at least
postgres uses one semaphore for each process/thread, thus waking up
multiple tasks never happens.

Another case would be to bind both tasks to different cpus. I'm not sure
if this happens in real life. Anyone around who knows how other
databases implement locking? Is sysv sem still used?
--
Manfred
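For readers without taskset, the binding discussed above can also be done from inside a test program. The following is only a minimal sketch in Python, not the attached patch (which is not reproduced in this thread): the helper name bind_to_cpu is made up for illustration, and it relies on os.sched_setaffinity, which is available on Linux only.

```python
import os


def bind_to_cpu(cpu):
    """Pin the calling process to a single CPU, similar to `taskset -c <cpu>`.

    Hypothetical helper for illustration. Returns the resulting affinity
    set, or None on platforms without sched_setaffinity (non-Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        # Pick a CPU we are currently allowed to run on.
        first_allowed = min(os.sched_getaffinity(0))
        print("now bound to:", bind_to_cpu(first_allowed))
```

A benchmark would call this once per worker before entering its measurement loop, so that each instance stays on its own core for the whole run.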
Ouch - 71% slowdown with just 4 cores. Wow.
Attached are my own testapps: one for sysv msg, one for sysv sem.
Could you run them? Taskset is done internally, just execute

$ for i in 1 2 3 4;do ./psem $i 5;./pmsg $i 5;done

Only tested on uniprocessor, I hope the pthread_setaffinity works as
expected...
--
Manfred
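The 71% figure can be checked against the four-instance ctxbench runs posted earlier in the thread. A small sketch, assuming the slowdown is taken over the summed per-instance rates (the function name is illustrative only):

```python
# Per-instance rates (iterations/sec) copied from the four-instance
# ctxbench runs posted earlier in the thread.
rates_2_6_22 = [435361, 439384, 438739, 438128]  # 2.6.22.18-cfs-v24.1
rates_2_6_24 = [127006, 126685, 126742, 126592]  # 2.6.24.3


def slowdown_percent(old, new):
    """Relative loss in aggregate throughput going from old to new, in percent."""
    return (1.0 - sum(new) / sum(old)) * 100.0


if __name__ == "__main__":
    pct = slowdown_percent(rates_2_6_22, rates_2_6_24)
    print(f"2.6.24.3 is about {pct:.0f}% slower with 4 instances")
```

The same function applied to the single-instance numbers (437330/sec vs. 376734/sec) gives a much smaller loss, which is why the regression only stands out once several cores run the test at once.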
2.6.22.18-cfs-v24-smp               2.6.24.3-smp
Result matrix: (psem)
Thread 0: 2394885 1: 2394885        Thread 0: 2004534 1: 2004535
Total: 4789770                      Total: 4009069
Result matrix: (pmsg)
Thread 0: 2345913 1: 2345914        Thread 0: 1971000 1: 1971000
Total: 4691827                      Total: 3942000

Result matrix:
Thread 0: 1613610 2: 1613611        Thread 0: 477112 2: 477111
Thread 1: 1613590 3: 1613590        Thread 1: 485607 3: 485607
Total: 6454401                      Total: 1925437
Result matrix:
Thread 0: 1409956 2: 1409956        Thread 0: 519398 2: 519398
Thread 1: 1409776 3: 1409776        Thread 1: 519169 3: 519170
Total: 5639464                      Total: 2077135

Result matrix:
Thread 0: 516309 3: 516309          Thread 0: 401157 3: 401157
Thread 1: 318546 4: 318546          Thread 1: 408252 4: 408252
Thread 2: 352940 5: 352940          Thread 2: 703600 5: 703600
Total: 2375590                      Total: 3026018
Result matrix:
Thread 0: 478356 3: 478356          Thread 0: 344738 3: 344739
Thread 1: 241655 4: 241655          Thread 1: 343614 4: 343615
Thread 2: 252444 5: 252445          Thread 2: 589298 5: 589299
Total: 1944911                      Total: 2555303

Result matrix:
Thread 0: 443392 4: 443392          Thread 0: 398491 4: 398491
Thread 1: 443338 5: 443339          Thread 1: 398473 5: 398473
Thread 2: 444069 6: 444070          Thread 2: 394647 6: 394648
Thread 3: 444078 7: 444078          Thread 3: 394784 7: 394785
Total: 3549756 ...
Thanks. Unfortunately the test was buggy, it bound the tasks to the
wrong cpu :-(
Could you run it again? Actually 1 cpu and 4 cpus are probably enough.
--
Manfred
Sure. (ran as before, hopefully no transcription errors)
2.6.22.18-cfs-v24-smp               2.6.24.3-smp
Result matrix: (psem)
Thread 0: 2395778 1: 2395779        Thread 0: 2054990 1: 2054992
Total: 4791557                      Total: 4009069
Result matrix: (pmsg)
Thread 0: 2317014 1: 2317015        Thread 0: 1959099 1: 1959099
Total: 4634029                      Total: 3918198

Result matrix:
Thread 0: 2340716 2: 2340716        Thread 0: 1890292 2: 1890293
Thread 1: 2361052 3: 2361052        Thread 1: 1899031 3: 1899032
Total: 9403536                      Total: 7578648
Result matrix:
Thread 0: 1429567 2: 1429567        Thread 0: 1295071 2: 1295071
Thread 1: 1429267 3: 1429268        Thread 1: 1289253 3: 1289254
Total: 5717669                      Total: 5168649

Result matrix:
Thread 0: 2263039 3: 2263039        Thread 0: 1351208 3: 1351209
Thread 1: 2265120 4: 2265121        Thread 1: 1351300 4: 1351300
Thread 2: 2263642 5: 2263642        Thread 2: 1319512 5: 1319512
Total: 13583603                     Total: 8044041
Result matrix:
Thread 0: 483934 3: 483934          Thread 0: 514766 3: 514767
Thread 1: 239714 4: 239715          Thread 1: 252764 4: 252765
Thread 2: 270216 5: 270216          Thread 2: 253216 5: 253217
Total: 1987729                      Total: 2041495

Result matrix:
Thread 0: 2260038 4: 2260039        Thread 0: 642235 4: 642236
Thread 1: 2262748 5: 2262749        Thread 1: 642742 5: 642743
Thread 2: 2271236 6: 2271237        Thread 2: 640281 6: 640282
Thread 3: 2257651 7: 2257652        Thread 3: 641931 ...
Looking at the output over morning java, I noticed that pmsg didn't get
recompiled due to a fat finger, so those numbers are bogus. Corrected
condensed version of output is below, charted data attached.

(hope evolution doesn't turn this into something other than plain text)

                           1         2         3          4
2.6.22.18-cfs-v24.1 psem   4791557   9403536   13583603   18103350
2.6.22.18-cfs-v24.1 pmsg   4906249   9171440   13264752   17774106
2.6.24.3 psem              4009069   7578648   8044041    5134381
2.6.24.3 pmsg              3917588   7290206   7644794    4824967
Pff, I'd rather have had the bounce. Good thing I attached the damn
--
Thanks:
sysv sem:
- 2.6.22 had almost linear scaling (up to 4 cores).
- 2.6.24.3 scales to 2 cpus, then it collapses. With 4 cores, it's 75%
slower than 2.6.22.

sysv msg:
- neither 2.6.22 nor 2.6.24 scale very well. That's more or less
expected, the message queue code contains a few global statistic
counters (msg_hdrs, msg_bytes).

The cleanup of sysv is nice, but IMHO sysv sem should remain scalable -
and a global semaphore with IDR can't be as scalable as the RCU protected
array that was used before.
--
Manfred
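The scaling described above can be read off the condensed psem totals by dividing each total by the single-cpu total. A minimal sketch, using the totals exactly as posted (the dictionary keys and the function name are illustrative only):

```python
# psem totals for 1, 2, 3 and 4 cpus, copied from the condensed table
# posted earlier in the thread.
psem_totals = {
    "2.6.22.18-cfs-v24.1": [4791557, 9403536, 13583603, 18103350],
    "2.6.24.3": [4009069, 7578648, 8044041, 5134381],
}


def speedups(series):
    """Throughput at each cpu count relative to the 1-cpu run."""
    base = series[0]
    return [total / base for total in series]


if __name__ == "__main__":
    for kernel, series in psem_totals.items():
        factors = " ".join(f"{s:.2f}x" for s in speedups(series))
        print(f"{kernel:22s} {factors}")
```

Linear scaling would show factors close to 1, 2, 3 and 4. A kernel whose factor stops growing, or drops, as cpus are added is serializing on something shared.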
--
Actually, 2.6.22 is fine, and 2.6.24.3 is not, just like sysv sem. I just
noticed that pmsg didn't get recompiled last night (fat finger), and
sent a correction.

-Mike
--
Hi all,
I've revived my Dual-CPU Pentium III/850:
I couldn't notice a scalability problem (two cpus are around 190%), but
just the normal performance of 2.6.25-rc3 is abysmal, 55 to 60% slower
than 2.6.18.8:

psem                 2.6.18       2.6.25    Diff [%]
1 cpu               948.005      398.435      -57,97
2 cpus            1.768.273      734.816      -58,44
Scalability [%]      193,26       192,21

pmsg                 2.6.18       2.6.25    Diff [%]
1 cpu               821.582      356.904      -56,56
2 cpus            1.488.058      661.754      -55,53
Scalability [%]      190,56       192,71

Attached are the .config files and the individual results.
Did I accidentally enable a scheduler debug option?
--
Manfred
After manually reverting 3e148c79938aa39035669c1cfa3ff60722134535,
2.6.25.git scaled linearly, but as you noted, markedly down from earlier
kernels with this benchmark. 2.6.24.4 with same revert, but all
2.6.25.git ipc changes piled on top still performed close to 2.6.22, so
I went looking. Bisection led me to..

8f4d37ec073c17e2d4aa8851df5837d798606d6f is first bad commit
commit 8f4d37ec073c17e2d4aa8851df5837d798606d6f
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Fri Jan 25 21:08:29 2008 +0100

    sched: high-res preemption tick

    Use HR-timers (when available) to deliver an accurate preemption tick.
    The regular scheduler tick that runs at 1/HZ can be too coarse when nice
    level are used. The fairness system will still keep the cpu utilisation 'fair'
    by then delaying the task that got an excessive amount of CPU time but try to
    minimize this by delivering preemption points spot-on.

    The average frequency of this extra interrupt is sched_latency / nr_latency.
    Which need not be higher than 1/HZ, its just that the distribution within the
    sched_latency period is important.

    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

..and I verified it via :-/ echo 7 > sched_features in latest. That
only bought me roughly half though, so there's a part three in there
somewhere.

-Mike
We can't just revert that patch: with IDR, a global lock is mandatory :-(
We must either revert the whole idea of using IDR or live with the
reduced scalability.

Actually, there are further bugs: the undo structures are not
namespace-aware. Thus the sequence: semop with SEM_UNDO, unshare, create
a new array with the same id but more semaphores, then another semop with
SEM_UNDO, will corrupt kernel memory :-(
I'll try to clean up the bugs first, then I'll look at the scalability
again.
--
Manfred
--
Yeah, I looked at the problem, but didn't know what the heck to do about
Great!
-Mike
--
I could get better results with the following solution:
I wrote an RCU-based idr api (layers allocation is managed similarly to
the radix-tree one).

Using it in the ipc code lets me get rid of the read lock taken in
ipc_lock() (the one introduced in 3e148c79938aa39035669c1cfa3ff60722134535).

You'll find the results in attachment (kernel is 2.6.25-rc3-mm1).
output.25_rc3_mm1.ref.8  --> pmsg output for the 2.6.25-rc3-mm1
plot.25_rc3_mm1.ref.8    --> previous file results for use by gnuplot
output.25_rc3_mm1.ridr.8 --> pmsg output for the 2.6.25-rc3-mm1
                             + rcu-based idrs
plot.25_rc3_mm1.ridr.8   --> previous file results for use by gnuplot

I think I should be able to send a patch next week. It is presently
ugly code: I copied idr.c and idr.h into ridr.c and ridr.h to go fast,
so I didn't do any code factorization.

Regards
Nadia
Sorry forgot the command:
for i in 1 2 3 4 5 6 7 8;do ./pmsg $i 5;done > output.25_rc3_mm1.ref.8
Regards,
Nadia
--
You should revert it all. The scalability problem isn't good, but from
what you're saying, the idea isn't ready yet. Revert it all, fix the
problems at your leisure, and submit new patches then.
--
Ouch, I guess hrtimers are just way expensive on some hardware...
--
It takes a large bite out of my P4 as well.
--
That would be about on par with my luck. I'll try to muster up the
gumption to go looking for part three, though my motivation for
searching long ago proved to be a dead end wrt sysv ipc.

-Mike
--
