(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). lolz. Catastrophic meltdown. Thanks for doing all that work - at a guess I'd say it's mmap_sem. Perhaps with some assist from the CPU scheduler. If you change the config to set CONFIG_RWSEM_GENERIC_SPINLOCK=n, CONFIG_RWSEM_XCHGADD_ALGORITHM=y does it help? Anyway, there's a testcase in bugzilla and it looks like we got us some --
Looks like we don't need to guess, just look at the call graph profile (a'ka It shows a very brutal amount of page fault invoked mmap_sem spinning.

Doesn't look like it, the perf stat numbers show that the scheduler is only very lightly involved:

> > 129875.554435 task-clock-msecs # 10.210 CPUs
> > 1883 context-switches # 0.000 M/sec

a context switch only every ~68 milliseconds.

Ingo
--
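As a rough check of the "~68 milliseconds" figure, the two perf stat counters quoted above can be divided directly. This is only a sketch: the variable names are made up for illustration, and the wall-clock estimate relies on the rounded "10.210 CPUs" utilisation that perf printed.

```python
# Figures copied from the perf stat output quoted in the message above.
task_clock_ms = 129875.554435   # CPU time summed over all threads
context_switches = 1883
cpus_utilized = 10.210          # rounded value printed by perf

# CPU time consumed per context switch (the "~68 ms" in the message).
ms_cpu_per_switch = task_clock_ms / context_switches

# Approximate wall-clock time, derived from the rounded utilisation figure.
elapsed_ms = task_clock_ms / cpus_utilized
ms_wall_per_switch = elapsed_ms / context_switches

print(f"{ms_cpu_per_switch:.1f} ms of CPU time per context switch")
print(f"{ms_wall_per_switch:.1f} ms of wall-clock time per context switch")
```

Measured either way, context switches are rare, which is consistent with the scheduler being only lightly involved.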
Isn't this already fixed? It's the same old "x86-64 rwsemaphores are using the shit-for-brains generic version" thing, and it's fixed by

  1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation
  5d0b723 x86: clean up rwsem type system
  59c33fa x86-32: clean up rwsem inline asm statements

NOTE! None of those are in 2.6.33 - they were merged afterwards. But they are in 2.6.34-rc1 (and obviously current -git). So Anton would have to compile his own kernel to test his load.

We could mark them as stable material if the load in question is a real load rather than just a test-case. On one of the random page-fault benchmarks the rwsem fix was something like a 400% performance improvement, and it was apparently visible in real life on some crazy SGI "initialize huge heap concurrently on lots of threads" load.

Side note: the reason the spinlock sucks is because of the fair ticket locks, it really does all the wrong things for the rwsem code. That's why old kernels don't show it - the old unfair locks didn't show the same kind of behavior.

Linus
--
It is not just a test-case, it is real-life code. With real-life problems on 2.6.32 and later :) Anton.--
another option is to run the rawhide kernel via something like: yum update --enablerepo=development kernel this will give kernel-2.6.34-0.13.rc1.git1.fc14.x86_64, which has those changes included. OTOH that kernel has debugging [lockdep] enabled so it might not be Yeah. Ingo --
I will apply these commits to 2.6.32; I'm afraid the current OFED (which I also need) will not work on 2.6.33+. Anton.--
On Tue, 23 Mar 2010 19:03:36 +0100 You should be able to simply set CONFIG_RWSEM_GENERIC_SPINLOCK=n, CONFIG_RWSEM_XCHGADD_ALGORITHM=y by hand, as I mentioned earlier? --
Hm. I tried, but when I do "make oldconfig" it gets rewritten, so I assume it conflicts with some other setting from the default Fedora kernel config. Trying to figure out which one exactly. Anton.--
Have you tracked this down yet? I just got the patches applied against an older kernel and am running into the same issue. Thanks, Robin --
I decided not to track down this issue and just applied the patches. I understood that with these patches there is no need to change these config options. Am I wrong? Anton--
We might need to also apply: bafaecd11df15ad5b1e598adc7736afcd38ee13d Robin --
For the record, these are the patches I have applied to a 2.6.32 kernel from a vendor:

59c33fa7791e9948ba467c2b83e307a0d087ab49
5d0b7235d83eefdafda300656e97d368afcafc9a
1838ef1d782f7527e6defe87e180598622d2d071
0d1622d7f526311d87d7da2ee7dd14b73e45d3fc
bafaecd11df15ad5b1e598adc7736afcd38ee13d

A quick look at the disassembly makes it look like we are using the rwsem_64, et al.

Robin
--
I think you can prevent these options from being overwritten if you set them in arch/x86/configs/x86_64_defconfig Anton --
No. Doesn't work. The XADD code simply never worked on x86-64, which is why those three commits I pointed at are required.

Oh, and you need one more commit (at least) in addition to the three I already mentioned - the one that actually adds the x86-64 wrappers and Kconfig option:

  bafaecd x86-64: support native xadd rwsem implementation

so the minimal list of commits (on top of 2.6.33) is at least

  59c33fa x86-32: clean up rwsem inline asm statements
  5d0b723 x86: clean up rwsem type system
  bafaecd x86-64: support native xadd rwsem implementation
  1838ef1 x86-64, rwsem: 64-bit xadd rwsem implementation

and I just verified that they at least cherry-pick cleanly (in that order).

I _think_ it would be good to also do

  0d1622d x86-64, rwsem: Avoid store forwarding hazard in __downgrade_write

but that one is a small detail, not anything fundamentally important.

Linus
--
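The cherry-pick order listed above could be scripted roughly as follows. This is a hedged sketch, not anything posted in the thread: the function names and the `repo_dir` parameter are hypothetical, and it assumes `repo_dir` is a kernel checkout in which those commit objects are already available (for example because mainline has been fetched into it).

```python
import subprocess

# Short commit ids in the order given in the message above.
REQUIRED = ["59c33fa", "5d0b723", "bafaecd", "1838ef1"]
OPTIONAL = ["0d1622d"]  # store-forwarding fix, described as a small detail


def cherry_pick_commands(include_optional=False):
    """Return one `git cherry-pick` argv list per commit, in apply order."""
    commits = REQUIRED + (OPTIONAL if include_optional else [])
    return [["git", "cherry-pick", sha] for sha in commits]


def apply_backport(repo_dir, include_optional=False):
    """Run the cherry-picks in repo_dir, stopping at the first failure."""
    for argv in cherry_pick_commands(include_optional):
        subprocess.run(argv, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Only print the commands; call apply_backport() to actually run them.
    for argv in cherry_pick_commands(include_optional=True):
        print(" ".join(argv))
```

Keeping the list as data makes it easy to add further commits, such as the UML build fix mentioned later in the thread, without touching the logic.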
> I will apply this commits to 2.6.32, I afraid current OFED (which I
> need also) will not work on 2.6.33+.

What do you need from OFED that is not in 2.6.34-rc1?

-- Roland Dreier <rolandd@cisco.com>
--
I didn't go to 2.6.34-rc1. I tried 2.6.33; the mlx4 driver which comes with the kernel produces a panic on my hardware. And OFED-1.5 doesn't support this kernel (probably it can still be compiled, didn't check). Anton. --
Applied mentioned patches. Things didn't improve too much.
before:
prog: Total exploration time 9.880 real 60.620 user 76.970 sys
after:
prog: Total exploration time 9.020 real 59.430 user 66.190 sys
perf report:
38.58% prog [kernel] [k] _spin_lock_irqsave
37.42% prog ./prog [.] DBSLLlookup_ret
6.22% prog ./prog [.] SuperFastHash
3.65% prog /lib64/libc-2.11.1.so [.] __GI_memcpy
2.09% prog ./anderson.6.dve2C [.] get_successors
1.75% prog [kernel] [k] clear_page_c
1.73% prog ./prog [.] index_next_dfs
0.71% prog [kernel] [k] handle_mm_fault
0.38% prog ./prog [.] cb_hook
0.33% prog ./prog [.] get_local
0.32% prog [kernel] [k] page_fault
Anton.
--
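To put the before/after timings from the message above on a common scale, the relative reductions can be computed as below. This is a small illustrative sketch; the dictionary and function names are invented here.

```python
# Timings in seconds copied from the "before" and "after" lines above.
before = {"real": 9.880, "user": 60.620, "sys": 76.970}
after = {"real": 9.020, "user": 59.430, "sys": 66.190}


def percent_reduction(old, new):
    """Relative reduction from old to new, in percent."""
    return (old - new) / old * 100.0


reductions = {k: percent_reduction(before[k], after[k]) for k in before}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower")
```

System time falls by roughly 14% and wall-clock time by under 9%, which fits the "didn't improve too much" assessment.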
Could you verify with a callgraph profile what that spin_lock_irqsave() is? If those rwsem patches were successful, mmap_sem should no longer have a spinlock to contend on, in which case it might be another lock. If not, something went wrong with backporting those patches. --
I attach the callgraph here. I also checked the kernel source; the code that was compiled is exactly what it should be after the patches. Am I missing something?
Yeah, I missed at least one commit, namely bafaecd x86-64: support native xadd rwsem implementation which is the one that actually makes x86-64 able to use the xadd version. Linus --
I think we got a winner! Problem seems to be fixed. Just for the record, I used the following patches:

59c33fa7791e9948ba467c2b83e307a0d087ab49
5d0b7235d83eefdafda300656e97d368afcafc9a
1838ef1d782f7527e6defe87e180598622d2d071
4126faf0ab7417fbc6eb99fb0fd407e01e9e9dfe
bafaecd11df15ad5b1e598adc7736afcd38ee13d
0d1622d7f526311d87d7da2ee7dd14b73e45d3fc

Thanks, Anton.
--
Ok. If you have performance numbers for before/after these patches for your actual workload, I'd suggest posting them to stable@kernel.org, and maybe those rwsem fixes will get back-ported. The patches are pretty small, and should be fairly safe. So they are certainly stable material. Linus --
Tomorrow I will try to patch and check 2.6.33 to see whether these patches are enough to restore performance, because on the 2.6.33 kernel the performance issue also used to somehow involve cgroup business (and performance was terrible even compared to the broken 2.6.32). If it does not fix 2.6.33, then I will ask to reopen the bug; otherwise I will post to stable@. Thanks again for the help, Anton. --
We haven't had any stability problems with them, except one trivial build bug, so -stable would be nice. Ingo --
Oh, you're right. There was that UML build bug. But I think that was included in the list of commits Anton had - commit 4126faf0ab ("x86: Fix breakage of UML from the changes in the rwsem system"). Linus --
Yes, it is included in my list. When I submit it to stable, I will include it as well. Anton --
It would be also nice to get that change into 2.6.32 stable. That is widely used on larger systems. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Looking at the changes to the files in question, it looks like it should all apply cleanly to 2.6.32, so I don't see any reason not to backport further back. Somebody should double-check, though. Linus --
I have queued them all up for .33 and .32-stable kernel releases now. thanks, greg k-h --
On Tue, 23 Mar 2010 18:34:09 +0100 Yes. Note that we fall off a cliff at nine threads on a 16-way. As soon as a core gets two threads scheduled onto it? Probably triggered by an MM change, possibly triggered by a sched change which tickled a preexisting MM shortcoming. Who knows. Anton, we have an executable binary in the bugzilla report but it would be nice to also have at least a description of what that code is actually doing. A quick strace shows quite a lot of mprotect activity. A pseudo-code walkthrough, perhaps? Thanks. --
Right now I can't say too much about the code (we just gave a neighboring group a chance to run their code on our cluster, so I'm totally unfamiliar with this code). I will forward your question to them. But right now you can probably get more information (including sources) here http://fmt.cs.utwente.nl/tools/ltsmin/ Anton--
it's AMD Opterons so no SMT. My (wild) guess would be that 8 cpus can still do cacheline ping-pong reasonably efficiently, but it starts breaking down very seriously with 9 or more cores bouncing the same single cache-line. Breakdowns in scalability are usually very non-linear, for hardware and software reasons. '8 threads' sounds like a hw limit to me. From the scheduler POV there's no big difference between 8 or 9 CPUs used [this is non-HT] - with 8 or 7 cores still idle. Ingo --
Although the case is solved, I will post a description of the testcase program, just in case someone wonders or would like to keep it for some later tests.

------------------------------------------------------------------------
It is a parallel model checker. The command line you used does reachability on the state space of model anderson.6, meaning that it searches through all possible states (int vectors). Each thread gets a vector from the queue, calculates its successor states and puts them in a lock-less static hash table (pseudo BFS exploration because the threads each have their own queue).

How did Ingo run the binary? Because the static table size should be chosen to fit into memory. "-s 27" allocates 2^27 * (|vector| + 1) * sizeof(int) bytes. |vector| is equal to 19 for anderson.6, ergo the table size is 10GB. This could explain the huge number of page faults Ingo gets. But anyway, you can imagine that the code is quite jumpy and has a big memory footprint, so the page faults may also be normal.
------------------------------------------------------------------------
--
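The table-size formula in the description above can be checked with a few lines of arithmetic. The one assumption, not stated in the description, is that sizeof(int) is 4 bytes; the variable names are illustrative only.

```python
# Parameters from the description above.
log2_slots = 27        # the "-s 27" option
vector_len = 19        # |vector| for anderson.6
sizeof_int = 4         # assumption: 4-byte int

table_bytes = (2 ** log2_slots) * (vector_len + 1) * sizeof_int
table_gib = table_bytes / 2 ** 30

print(f"{table_bytes} bytes = {table_gib:.0f} GiB")
```

Under that assumption the result is exactly 10 * 2^30 bytes, matching the "10GB" quoted in the description.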
<snip>

I had an "opportunity" to investigate page fault behavior on 2.6.18+ [RHEL5.4] on an 8-socket Istanbul system earlier this year. When I saw this mail, I collected up the data I had from that adventure and ran additional tests on 2.6.33 and 2.6.34-rc1. I have attached plots for "per node" and "system wide" page fault scalability.

The per node plot [#1] shows the page fault rate of 1 to 6 [nr_cores_per_socket] tasks [processes] and threads faulting in a fixed GB/task at the same time on a single socket. The system wide plot [#3] shows 1 to 48 [nr_sockets * nr_cores_per_socket] tasks and threads again faulting in a fixed GB/task... For the latter test, I load one core per socket at a time, then add the 2nd core per socket, ... In all cases, the individual tasks/threads are fork()ed/pthread_create()d by a parent bound to the cpu where they'll run to obtain node-local kernel data structures. The tests run with SCHED_FIFO.

I plot both "faults per wall clock second" (the aggregate rate) and "faults per cpu second", or normalized rate.

The per node scalability doesn't look all that different across the 3 releases, especially the faults per cpu second curves. However, in the system wide multi-threaded tests, 2.6.33 is an anomaly compared to both 2.6.18+ and 2.6.34-rc1. The 2.6.18+ and 2.6.34-rc1 multi-threaded tests show a lot of noise and, of course, a much lower fault rate relative to the multi-task tests. I aborted the 2.6.33 system wide multi-threaded test at 32 threads because it was just taking too long.

Unfortunately, with this many curves, the legends obscure much of the plot. So, rather than bloat this message any more, I've packaged up the raw data along with plots with and without legends and placed the tarball here:

http://free.linux.hp.com/~lts/Pft/

That directory also contains the source for the version of the pft test used, along with the scripts used to run the tests and plot the results. Note that some manual editing of the "plot ...
