So, I learned a few things since I put up the previous set of benchmarks:

- The erratic behaviour from Linux is due to the glibc memory allocator. Using Google's tcmalloc, the problem disappears.
- I missed a few things when porting jemalloc from FreeBSD. One of them was fairly major: due to my mistake, jemalloc on NetBSD was basically single threaded. That said, it did show a noticeable improvement over phkmalloc.
- There was a nasty performance bug in NetBSD's pthread mutexes, which is now fixed. libpthread has also had a couple more performance tweaks that have had a positive impact.
- The memory allocator used has a significant effect on sysbench itself: it needs to be multithreaded.
- Mindaugas has made more improvements to his scheduler and these are showing a really positive effect.

So after making some changes to NetBSD, and changes to how I'm benchmarking the systems, I have rerun them. In contrast to the previous runs, this one is done locally:

http://www.netbsd.org/~ad/sysbench2/4cpu.png

Kris Kennaway has kindly offered to try NetBSD on an 8-way system. I expect that NetBSD will hit a fairly clear ceiling due to poll, fcntl and socket I/O causing contention on kernel_lock. It will be interesting to see.

Thanks,
Andrew
Well you have to be careful there, tcmalloc apparently defers frees, and is not really a general purpose malloc. The linux performance problems

I am somewhat surprised by this, because on FreeBSD it is really not spending much time in the kernel (only ~20% system time), so there does not seem to be much scope for a 10% performance difference. Also it took quite a lot of work to optimize locking of various kernel subsystems that are used by this workload, and until that point there was significant kernel lock contention which reduced performance by tens of percent. I would have expected this to matter on NetBSD - even with the vmlocking work there is still more to go.

Here is the initial run with CVS HEAD sources (I took out the obvious things from GENERIC.MP like I386_CPU support, etc, and removed the default datasize and stack size limits). Same benchmark config that Andrew is using, etc.

http://people.freebsd.org/~kris/scaling/netbsd.png

There are a couple of things to note:

* The drop-off above 8 threads on FreeBSD is due to non-scalability of mysql itself, i.e. it comes from pthread mutex contention in userland. This is the only relevant lock contention point in the FreeBSD kernel on this workload. There are some things we can do in libpthread to mitigate the performance loss in the over-contended pthread situation, but we haven't done them yet.
* The tail end of the graph is somewhat noisy, which is the reason for the jump at 19 threads (I only graphed a single run). The distribution at 20 clients looks like:

+------------------------------------------------------------+
| x x |
|x x x xxx x x xx x x xxx x xx|
| |_______________A_M_____________| |
+------------------------------------------------------------+
N Min Max Median Avg Stddev
x 20 2326.01 2758.86 258...
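The point about noise in single runs can be illustrated with a small sketch that summarises repeated runs the way the table above does (N, Min, Max, Median, Avg, Stddev). This is only an illustration: the function name `summarize` and the sample values are hypothetical and are not the measurements from this thread.

```python
import statistics


def summarize(samples):
    """Return summary statistics for a list of throughput samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return {
        "n": len(samples),
        "min": min(samples),
        "max": max(samples),
        "median": statistics.median(samples),
        "avg": statistics.mean(samples),
        # Sample standard deviation (divides by n - 1).
        "stddev": statistics.stdev(samples),
    }


# Hypothetical transactions/sec figures, NOT data from this thread.
runs = [2400.0, 2500.0, 2600.0, 2700.0]
summary = summarize(runs)
for key, value in summary.items():
    print(f"{key:>6}: {value:.2f}")
```

Graphing the median of several runs per thread count, rather than a single run, would smooth out jumps like the one at 19 threads.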
OK, I have repeated the benchmarking in two additional cases:

1) NetBSD with 8 CPUs and some kind of experimental kernel that Andrew gave me (based on the vmlocking branch). This is using the new scheduler.
2) As above with experimental libc and libpthread also given to me by Andrew. I dunno what changes these contain either :)

I was only able to run in the 8 CPU configuration because when I tried to disable CPUs with cpuctl, processes would hang under load. This is probably a scheduler issue.

http://people.freebsd.org/~kris/scaling/netbsd.png

This shows some improvement but not much, relatively speaking. In particular performance at 4 threads is still significantly below FreeBSD performance, which (given what I measured previously) suggests that there is still a performance deficit with 4 CPUs on NetBSD. It would be nice to be able to test this directly though, maybe Andrew can give me a kernel that has MAXCPU=4 or whatever the NetBSD version is.

Kris
It's actually GENERIC.MP from current, with SCHED_M2. No vmlocking code involved - would you be able to update the labels? The libc has jemalloc.

Interesting. :-) Thanks for running this. I'm still optimistic about the 4 CPU case so I'm very interested in seeing what the results would be.

I'll have a look into the offline problem this evening.

Thanks,
Andrew
In this case, it is not a matter of testing - I just left this part for more thought, since this functionality is not used much. Anyway, I have made a preliminary patch for this, which I hope will be useful shortly. Switching a CPU offline with cpuctl keeps only the bound threads executing on it. For the moment, that should be OK for benchmarking.

--
Best regards,
Mindaugas
www.NetBSD.org
Except it didn't work, as above ;-) Kris
OK thanks. In the meantime I ran sysbench with postgresql 8.2. Same NetBSD configs as before (except I built my own kernel with the sched_m2 patches since I needed to tweak the sysv ipc parameters). http://people.freebsd.org/~kris/scaling/netbsd-pgsql.png postgresql is much more scalable than mysql on this workload and doesn't have silly scaling bottlenecks inside the application (cf the tail of the FreeBSD curve for mysql which is where pthread mutex contention kicked in). Kris
Here are some more graphs. This one is on the 4 CPU P3 500 MHz and shows postgresql 8.2. FreeBSD is about 15-20% higher throughput. http://people.freebsd.org/~kris/scaling/4cpu-pgsql.png This one shows mysql on the same system http://people.freebsd.org/~kris/scaling/4cpu-mysql.png In that test NetBSD does outperform FreeBSD but only by 3-4% (in particular I am not seeing the ~10% difference that Andrew observes on his 4*p3 700MHz). Given the age of the hardware and the fact that I am not seeing it on other workloads or on modern hardware it might just be due to a small scheduling difference on this configuration. Kris
Is this system running an amd64 kernel, or an i386 kernel? One thing I noticed when comparing NetBSD and FreeBSD performance for a very different workload recently was that NetBSD/amd64 is missing even the most recent optimized i386 versions of some fairly basic stuff like memset, memcpy, copyin/copyout, much less versions tuned for modern processors (e.g. for set and copy, SSE2 instructions can be used if you know they're present, as we do know on amd64). I suspect FreeBSD/amd64 is better about this.

Thor
It is i386. Kris
I don't want to get too into linux tuning for this, but horde is a good alternative to tcmalloc if there are concerns about using tcmalloc in this capacity. (horde also works on solaris, and could probably be ported further)
My opinion is that this is a problem for the Linux and glibc developers to solve :-) Kris
Did you mean hoard memory allocator?

Regards,
Adam Hamsik
Yes. :)
