Threading Benchmarks, NetBSD versus FreeBSD

Submitted by Jeremy
on October 7, 2007 - 8:24pm

Andrew Doran posted some threading benchmark results to NetBSD's tech-kern mailing list, following up on some benchmarks he'd posted earlier. The results compared NetBSD -current with FreeBSD -current and the Linux 2.6.21 kernel. Kris Kennaway was surprised by the results and ran his own benchmarks with minimal configuration changes, summarizing, "this measurement shows that FreeBSD is performing 70-80% better than NetBSD in this 4 CPU configuration. This is in contrast to Andrew's findings which seem to show NetBSD performing 10% better than FreeBSD on a 4 CPU system (a very old one though)." He added, "the drop-off above 8 threads on FreeBSD is due to non-scalability of mysql itself. i.e. it comes from pthread mutex contention in userland."

Kris ran additional benchmarks with PostgreSQL instead of MySQL, showing much improved scalability above 8 threads: "postgresql is much more scalable than mysql on this workload and doesn't have silly scaling bottlenecks inside the application (cf the tail of the FreeBSD curve for mysql which is where pthread mutex contention kicked in)." He continued his testing and found that on older 4-CPU P3 hardware NetBSD did outperform FreeBSD, "but only by 3-4% (in particular I am not seeing the ~10% difference that Andrew observes on his 4*p3 700MHz). Given the age of the hardware and the fact that I am not seeing it on other workloads or on modern hardware it might just be due to a small scheduling difference on this configuration."


From: Andrew Doran <ad@...>
Subject: Thread benchmarks, round 2
Date: Oct 4, 7:04 pm 2007

So, I learned a few things since I put up the previous set of benchmarks:

- The erratic behaviour from Linux is due to the glibc memory allocator.
Using Google's tcmalloc, the problem disappears.

- I missed a few things when porting jemalloc from FreeBSD. One of them
was fairly major. Due to my mistake jemalloc on NetBSD was, basically,
single threaded. That said it did show a noticeable improvement over
phkmalloc.

- There was a nasty performance bug in NetBSD's pthread mutexes, which
is now fixed. libpthread has also had a couple more tweaks for performance
that have had a positive impact.

- The memory allocator used has a significant effect on sysbench itself:
it needs to be multithreaded.

- Mindaugas has made more improvements to his scheduler and these are
showing a really positive effect.

So after making some changes to NetBSD, and changes to how I'm benchmarking
the systems, I have rerun them. In contrast to the previous runs, this one
is done locally:

http://www.netbsd.org/~ad/sysbench2/4cpu.png

Kris Kennaway has kindly offered to try NetBSD on an 8-way system. I expect
that NetBSD will hit a fairly clear ceiling due to poll, fcntl and socket
I/O causing contention on kernel_lock. It will be interesting to see.

Thanks,
Andrew


From: Kris Kennaway <kris@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 5, 5:18 am 2007

Andrew Doran wrote:
> So, I learned a few things since I put up the previous set of benchmarks:
>
> - The erratic behaviour from Linux is due to the glibc memory allocator.
> Using Google's tcmalloc, the problem disappears.

Well you have to be careful there, tcmalloc apparently defers frees, and
is not really a general purpose malloc. The linux performance problems
are (were? I haven't tried recent kernels) real though.

> - I missed a few things when porting jemalloc from FreeBSD. One of them
> was fairly major. Due to my mistake jemalloc on NetBSD was, basically,
> single threaded. That said it did show a noticeable improvement over
> phkmalloc.
>
> - There was a nasty performance bug in NetBSD's pthread mutexes, which
> is now fixed. libpthread has also had a couple more tweaks for performance
> that have had a positive impact.
>
> - The memory allocator used has a significant effect on sysbench itself:
> it needs to be multithreaded.
>
> - Mindaugas has made more improvements to his scheduler and these are
> showing a really positive effect.
>
> So after making some changes to NetBSD, and changes to how I'm benchmarking
> the systems, I have rerun them. In contrast to the previous runs, this one
> is done locally:
>
> http://www.netbsd.org/~ad/sysbench2/4cpu.png

I am somewhat surprised by this, because on FreeBSD it is really not
spending much time in the kernel (only ~20% system time), so there does
not seem to be much scope for a 10% performance difference. Also it
took quite a lot of work to optimize locking of various kernel
subsystems that are used by this workload, and until that point there
was significant kernel lock contention which reduced performance by tens
of percent. I would have expected this to matter on NetBSD - even with
the vmlocking work there is still more to go.

I will try to reproduce this on my own hardware (see below).

> Kris Kennaway has kindly offered to try NetBSD on an 8-way system. I expect
> that NetBSD will hit a fairly clear ceiling due to poll, fcntl and socket
> I/O causing contention on kernel_lock. It will be interesting to see.

Here is the initial run with CVS HEAD sources (I took out the obvious
things from GENERIC.MP like I386_CPU support, etc, and removed the
default datasize and stack size limits). Same benchmark config that
Andrew is using, etc.

http://people.freebsd.org/~kris/scaling/netbsd.png

There are a couple of things to note:

* the drop-off above 8 threads on FreeBSD is due to non-scalability of
mysql itself. i.e. it comes from pthread mutex contention in userland.
This is the only relevant lock contention point in the FreeBSD kernel
on this workload. There are some things we can do in libpthread to
mitigate the performance loss in the over-contended pthread situation,
but we haven't done them yet.

* The tail end of the graph is somewhat noisy, which is the reason for
the jump at 19 threads (I only graphed a single run). The distribution
at 20 clients looks like:

+------------------------------------------------------------+
| x x |
|x x x xxx x x xx x x xxx x xx|
| |_______________A_M_____________| |
+------------------------------------------------------------+
    N       Min       Max    Median        Avg     Stddev
x  20   2326.01   2758.86   2586.47   2572.856  116.69937

Next, to try and reproduce Andrew's result, I disabled 4 CPUs (using
cpuctl in NetBSD) and compared FreeBSD and NetBSD again. I didn't do a
full graph yet, but the results are consistent with what I saw on 8 CPUs.

NetBSD:

4 threads
1137.83
1135.49
1138.80
1138.06

20 threads

1101.84
1068.56
1075.32
998.49

Note that these are lower but not too different from the NetBSD values
when all 8 CPUs are in use.
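
The spread of the NetBSD runs listed above can be summarized in the same columns as the distribution summary earlier in this message (N, Min, Max, Median, Avg, Stddev). What follows is a minimal Python sketch, not the tool Kris used. The helper name summarize is made up for illustration, and the sketch assumes the sample standard deviation (n - 1 denominator), which the message does not specify.

```python
import statistics

# Result values for NetBSD with 4 CPUs enabled, copied from the message above.
netbsd_runs = {
    4: [1137.83, 1135.49, 1138.80, 1138.06],
    20: [1101.84, 1068.56, 1075.32, 998.49],
}


def summarize(samples):
    """Return N, min, max, median, mean and sample standard deviation."""
    return {
        "n": len(samples),
        "min": min(samples),
        "max": max(samples),
        "median": statistics.median(samples),
        "avg": statistics.mean(samples),
        # statistics.stdev uses the n - 1 denominator; which definition the
        # original summary used is an assumption here.
        "stddev": statistics.stdev(samples),
    }


for threads, samples in netbsd_runs.items():
    s = summarize(samples)
    print(
        f"{threads:>2} threads: N={s['n']} min={s['min']:.2f} "
        f"max={s['max']:.2f} median={s['median']:.2f} "
        f"avg={s['avg']:.2f} stddev={s['stddev']:.2f}"
    )
```

With only four runs per configuration these figures are indicative at best, but they make it easy to see that the 20-thread runs vary far more than the 4-thread runs.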

FreeBSD:

4 threads
1985.48
1997.13
1997.43

20 threads
1813.02
1817.73
1824.59

The 4 thread performance is basically identical to the 8 CPU case,
showing that the FreeBSD scaling graphed on 8 CPUs is the same as on 4
CPUs (but without the tail since mysql contention is now rate-limited),
i.e. FreeBSD is continuing to scale linearly.

This measurement shows that FreeBSD is performing 70-80% better than
NetBSD in this 4 CPU configuration. This is in contrast to Andrew's
findings which seem to show NetBSD performing 10% better than FreeBSD on
a 4 CPU system (a very old one though).
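
The 70-80% figure can be checked against the numbers quoted above. This is a minimal Python sketch that assumes the runs are simply averaged per configuration; the message does not say how the range was derived, and percent_faster is an illustrative helper name.

```python
from statistics import mean

# Results with 4 CPUs enabled, copied from the message above.
netbsd = {
    4: [1137.83, 1135.49, 1138.80, 1138.06],
    20: [1101.84, 1068.56, 1075.32, 998.49],
}
freebsd = {
    4: [1985.48, 1997.13, 1997.43],
    20: [1813.02, 1817.73, 1824.59],
}


def percent_faster(a, b):
    """Percentage by which the mean of `a` exceeds the mean of `b`."""
    return (mean(a) / mean(b) - 1.0) * 100.0


for threads in (4, 20):
    gain = percent_faster(freebsd[threads], netbsd[threads])
    print(f"{threads:>2} threads: FreeBSD ahead by {gain:.1f}%")
```

Averaged this way, FreeBSD comes out roughly 75% ahead at 4 threads and roughly 71% ahead at 20 threads, which falls inside the quoted range.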

I will try later with the experimental kernel Andrew sent me (which
includes the new scheduler). If it indeed gives a 100% performance
improvement that would be a significant result :-)

Kris


From: Kris Kennaway <kris@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 5, 3:08 pm 2007

Kris Kennaway wrote:

> The 4 thread performance is basically identical to the 8 CPU case,
> showing that the FreeBSD scaling graphed on 8 CPUs is the same as on 4
> CPUs (but without the tail since mysql contention is now rate-limited),
> i.e. FreeBSD is continuing to scale linearly.
>
> This measurement shows that FreeBSD is performing 70-80% better than
> NetBSD in this 4 CPU configuration. This is in contrast to Andrew's
> findings which seem to show NetBSD performing 10% better than FreeBSD on
> a 4 CPU system (a very old one though).
>
> I will try later with the experimental kernel Andrew sent me (which
> includes the new scheduler). If it indeed gives a 100% performance
> improvement that would be a significant result :-)

OK, I have repeated the benchmarking in two additional cases:

1) NetBSD with 8 CPUs and some kind of experimental kernel that Andrew
gave me (based on the vmlocking branch). This is using the new scheduler.

2) As above with experimental libc and libpthread also given to me by
Andrew. I dunno what changes these contain either :)

I was only able to run in the 8 CPU configuration because when I tried
to disable CPUs with cpuctl, processes would hang under load. This is
probably a scheduler issue.

http://people.freebsd.org/~kris/scaling/netbsd.png

This shows some improvement but not much, relatively speaking. In
particular performance at 4 threads is still significantly below FreeBSD
performance, which (given what I measured previously) suggests that
there is still a performance deficit with 4 CPUs on NetBSD. It would be
nice to be able to test this directly though, maybe Andrew can give me a
kernel that has MAXCPU=4 or whatever the NetBSD version is.

Kris


From: Andrew Doran <ad@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 5, 3:39 pm 2007

On Fri, Oct 05, 2007 at 09:08:07PM +0200, Kris Kennaway wrote:

> OK, I have repeated the benchmarking in two additional cases:
>
> 1) NetBSD with 8 CPUs and some kind of experimental kernel that Andrew
> gave me (based on the vmlocking branch). This is using the new scheduler.
>
> 2) As above with experimental libc and libpthread also given to me by
> Andrew. I dunno what changes these contain either :)

It's actually GENERIC.MP from current, with SCHED_M2. No vmlocking code
involved - would you be able to update the labels? The libc has jemalloc,
and libpthread is simply an up to date copy.

> I was only able to run in the 8 CPU configuration because when I tried
> to disable CPUs with cpuctl, processes would hang under load. This is
> probably a scheduler issue.

Right, I doubt that bit has been well tested since the scheduler is so new.

> http://people.freebsd.org/~kris/scaling/netbsd.png
>
> This shows some improvement but not much, relatively speaking. In
> particular performance at 4 threads is still significantly below FreeBSD
> performance, which (given what I measured previously) suggests that
> there is still a performance deficit with 4 CPUs on NetBSD. It would be
> nice to be able to test this directly though, maybe Andrew can give me a
> kernel that has MAXCPU=4 or whatever the NetBSD version is.

Interesting. :-). Thanks for running this. I'm still optimistic about the 4
CPU case so I'm very interested in seeing what the results would be. I'll
have a look into the offline problem this evening.

Thanks,
Andrew


From: Kris Kennaway <kris@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 5, 5:38 pm 2007

Andrew Doran wrote:
> On Fri, Oct 05, 2007 at 09:08:07PM +0200, Kris Kennaway wrote:
>
>> OK, I have repeated the benchmarking in two additional cases:
>>
>> 1) NetBSD with 8 CPUs and some kind of experimental kernel that Andrew
>> gave me (based on the vmlocking branch). This is using the new scheduler.
>>
>> 2) As above with experimental libc and libpthread also given to me by
>> Andrew. I dunno what changes these contain either :)
>
> It's actually GENERIC.MP from current, with SCHED_M2. No vmlocking code
> involved - would you be able to update the labels? The libc has jemalloc,
> and libpthread is simply an up to date copy.

Done.

>> I was only able to run in the 8 CPU configuration because when I tried
>> to disable CPUs with cpuctl, processes would hang under load. This is
>> probably a scheduler issue.
>
> Right, I doubt that bit has been well tested since the scheduler is so new.
>
>> http://people.freebsd.org/~kris/scaling/netbsd.png
>>
>> This shows some improvement but not much, relatively speaking. In
>> particular performance at 4 threads is still significantly below FreeBSD
>> performance, which (given what I measured previously) suggests that
>> there is still a performance deficit with 4 CPUs on NetBSD. It would be
>> nice to be able to test this directly though, maybe Andrew can give me a
>> kernel that has MAXCPU=4 or whatever the NetBSD version is.
>
> Interesting. :-). Thanks for running this. I'm still optimistic about the 4
> CPU case so I'm very interested in seeing what the results would be. I'll
> have a look into the offline problem this evening.

OK thanks.

In the meantime I ran sysbench with postgresql 8.2. Same NetBSD configs
as before (except I built my own kernel with the sched_m2 patches since
I needed to tweak the sysv ipc parameters).

http://people.freebsd.org/~kris/scaling/netbsd-pgsql.png

postgresql is much more scalable than mysql on this workload and doesn't
have silly scaling bottlenecks inside the application (cf the tail of
the FreeBSD curve for mysql which is where pthread mutex contention
kicked in).

Kris


From: Kris Kennaway <kris@...>
Subject: Re: Thread benchmarks, round 2
Date: Oct 6, 12:20 pm 2007

Kris Kennaway wrote:

> In the meantime I ran sysbench with postgresql 8.2. Same NetBSD configs
> as before (except I built my own kernel with the sched_m2 patches since
> I needed to tweak the sysv ipc parameters).
>
> http://people.freebsd.org/~kris/scaling/netbsd-pgsql.png
>
> postgresql is much more scalable than mysql on this workload and doesn't
> have silly scaling bottlenecks inside the application (cf the tail of
> the FreeBSD curve for mysql which is where pthread mutex contention
> kicked in).

Here are some more graphs.

This one is on the 4 CPU P3 500 MHz and shows postgresql 8.2. FreeBSD
is about 15-20% higher throughput.

http://people.freebsd.org/~kris/scaling/4cpu-pgsql.png

This one shows mysql on the same system

http://people.freebsd.org/~kris/scaling/4cpu-mysql.png

In that test NetBSD does outperform FreeBSD but only by 3-4% (in
particular I am not seeing the ~10% difference that Andrew observes on
his 4*p3 700MHz). Given the age of the hardware and the fact that I am
not seeing it on other workloads or on modern hardware it might just be
due to a small scheduling difference on this configuration.

Kris


I wonder why glibc guys

on
October 7, 2007 - 9:16pm

I wonder why glibc guys don't throw away the current slow memory allocator and replace it with something newer, like the one by Google, or jemalloc, or something like these?

Glibc? There is just one in

Anonymous (not verified)
on
October 7, 2007 - 10:28pm

Glibc? There is just one in Linux, there is no glibc in BSD.

Which doesn't invalidate the

Anonymous (not verified)
on
October 7, 2007 - 11:40pm

Which doesn't invalidate the question though. Why is the glibc memory allocation/deallocation so slow and yet not replaced by something faster?

indeed

on
October 8, 2007 - 12:36am

Exactly. Why are we still using glibc everywhere if it performs this poorly?

One reason would be that

Anonymous (not verified)
on
October 8, 2007 - 6:12am

One reason would be that tcmalloc does not ever give memory back to the system. I haven't looked at jemalloc so no idea about that one.

Because the glibc

Anonymous (not verified)
on
October 8, 2007 - 8:08am

Because the glibc maintainers don't care about you.

Just look at Drepper's various agendas with Joe Average.

They cater to the companies.

That's quite irrelevant: the

Anonymous (not verified)
on
October 8, 2007 - 9:02am

That's quite irrelevant: the average Joe doesn't run MySQL on a 4-CPU server.

Speed is not everything. For

Anonymous (not verified)
on
October 8, 2007 - 9:44am

Speed is not everything. For a library like glibc, robustness and portability are also essential.

Apart from that, I think glibc malloc performs pretty well. Where are the benchmarks that show that it is "so slow"?

http://goog-perftools.sourcef

Anonymous (not verified)
on
October 8, 2007 - 9:55am

http://goog-perftools.sourceforge.net/doc/tcmalloc.html

Caveat: these results may be quite different for libc 2.6.

I have just looked at glibc

Anonymous (not verified)
on
October 9, 2007 - 2:49am

I have just looked at glibc cvs tree and found that the malloc they use is still ptmalloc2 and not ptmalloc3. However, the glibc maintainers have certainly made some modifications to the original ptmalloc2 to improve speed or space usage.
Also, ptmalloc2 is based on Doug Lea's malloc (dlmalloc) 2.7, which is quite old. Ptmalloc3 is based on the more recent dlmalloc 2.8.3, which dates from 2005.
Other scalable allocators worth mentioning are Hoard and Nedmalloc. The latter is also based on dlmalloc.

Highly useful information.

Anonymous (not verified)
on
October 9, 2007 - 9:13am

Highly useful information. Thank you.

because..

Anonymous (not verified)
on
October 8, 2007 - 1:12am

http://goog-perftools.sourceforge.net/doc/tcmalloc.html
Caveats

For some systems, TCMalloc may not work correctly with applications that aren't linked against libpthread.so (or the equivalent on your OS). It should work on Linux using glibc 2.3, but other OS/libc combinations have not been tested.

TCMalloc may be somewhat more memory hungry than other mallocs, though it tends not to have the huge blowups that can happen with other mallocs. In particular, at startup TCMalloc allocates approximately 6 MB of memory. It would be easy to roll a specialized version that trades a little bit of speed for more space efficiency.

TCMalloc currently does not return any memory to the system.

Don't try to load TCMalloc into a running binary (e.g., using JNI in Java programs). The binary will have allocated some objects using the system malloc, and may try to pass them to TCMalloc for deallocation. TCMalloc will not be able to handle such objects.

License

Nony mouse (not verified)
on
October 8, 2007 - 9:29am

It's released under a BSD license (?)

glibc 2.6 has some type of

Anonymous (not verified)
on
October 8, 2007 - 9:53am

glibc 2.6 has some type of improved allocator compared to the earlier ones (ptmalloc3 according to some googling). Some of the worst-case behaviour of the current one should already be mitigated. Unfortunately the graphs do not say which malloc they tested against, maybe they did not have glibc 2.6...

On the other hand, if these results were derived with glibc 2.6, we should seriously look into improving it.

tcmalloc website seems to be comparing against ptmalloc2, so the behaviour reported there should no longer reflect the current version. They explain how they derived these results: someone should try replicating them with a libc 2.6.

I doubt the impact of the allocator is that great for most of us, but it does look like there's room for improvement.

Glibc version

BigChris (not verified)
on
October 9, 2007 - 12:08am

As noted in one of Andrew Doran's emails to the tech-kern mailing list, he is running Fedora Core 7 for the Linux tests. He is running with the stock kernel and glibc that comes with FC7, and using something like the LD_PRELOAD trick to run MySQL with the tcmalloc allocator.

Chris
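
For readers unfamiliar with the LD_PRELOAD trick mentioned above: the dynamic linker loads the listed shared object before any other library, so its malloc and free take precedence over glibc's. A minimal Python sketch of launching a program that way follows. The library path and the command are placeholders, not the ones Andrew used, and preload_env and run_with_allocator are hypothetical helpers.

```python
import os
import subprocess

# Placeholder path: where libtcmalloc is installed varies by system.
TCMALLOC_PATH = "/usr/lib/libtcmalloc.so"


def preload_env(library, base_env=None):
    """Return a copy of the environment with `library` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Entries in LD_PRELOAD may be separated by colons or spaces.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_with_allocator(command, library=TCMALLOC_PATH):
    """Run `command` (a list of arguments) with the allocator preloaded."""
    return subprocess.run(command, env=preload_env(library), check=False)


# Example, not executed here; the command name is a placeholder:
# run_with_allocator(["mysqld"])
```

Because the variable is set only in the child's environment, the rest of the system keeps using the stock glibc allocator.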

graph generation

Anonymous (not verified)
on
October 7, 2007 - 10:27pm

anyone gimme a clue as to what tool they're using to generate the graphs? rrdtool?

Looks like the venerable

moltonel (not verified)
on
October 7, 2007 - 10:56pm

Looks like the venerable gnuplot.

Gnuplot?

on
October 7, 2007 - 11:00pm

Hmm those are nice looking graphs for gnuplot.... Can anyone tell me what font that is? My gnuplot graphs are nowhere near as pretty as that :(

I think you get something

Erik Wikström (not verified)
on
October 7, 2007 - 11:46pm

I think you get something like that when you use "set terminal png small" or "set terminal png tiny". The small/tiny bit refers to the fonts to be used. Don't you just love an application where the fonts that can be used depend on the format you save the output in? Don't get me wrong, gnuplot is one of the best plotting tools out there; it is just not very user friendly or logical.
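
The terminal setting described above could be used in a script along these lines. This is a minimal Python sketch that only writes out the text of a gnuplot script. The data file name, output name, axis labels and the helper gnuplot_script are assumptions for illustration, not details from the thread.

```python
def gnuplot_script(data_file, output_png, font_size="small"):
    """Build a gnuplot script that plots column 2 against column 1."""
    # Only the two font sizes mentioned in the comment are accepted here.
    if font_size not in ("tiny", "small"):
        raise ValueError(f"unsupported png font size: {font_size}")
    lines = [
        f"set terminal png {font_size}",
        f'set output "{output_png}"',
        'set xlabel "Threads"',           # assumed axis label
        'set ylabel "Transactions/sec"',  # assumed axis label
        f'plot "{data_file}" using 1:2 with linespoints title "results"',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical file names; save the text to a file and pass it to gnuplot.
print(gnuplot_script("results.dat", "results.png"))
```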

wx

on
October 8, 2007 - 12:25am

Compile gnuplot with wxwidgets support and the fonts should be available. I used to have that problem as well until about a week ago.

postgresql and threads

Anonymous (not verified)
on
October 8, 2007 - 9:49am

How is postgresql relevant for a threading benchmark when it does not use threads?

At a guess...

Flewellyn (not verified)
on
October 8, 2007 - 10:28am

I'd say the idea was to compare the performance of multithreading (MySQL's approach) with plain old multiprocessing with shared memory (PostgreSQL). The idea, I imagine, is that ideally a multithreading application would be faster than a multiprocessing one, because of the reduced overhead of threads versus processes.

I'm not sure how valid this assumption is, mind you. I'm just guessing what their reasoning was.

threaded vs non-threaded

Kris Kennaway (not verified)
on
October 8, 2007 - 11:06am

The "threading benchmark" is something the author of this article made up :) My work is to do with measuring kernel performance on SMP systems, and trying to identify and fix performance bottlenecks on various common workloads. Threaded vs non-threaded applications aren't really the point, except that they tend to exercise different parts of the kernel.

made up

on
October 8, 2007 - 1:02pm

Hi Kris,

Thanks for your clarification. However, I'm still unclear on a two or three word summary that would make a descriptive title for what these tests are measuring. I chose "threading benchmark" because I saw a series of benchmarks in which the X and Y axes seemed to indicate we were graphing the performance of threads. Would "performance benchmarking" be better? Or just "measuring performance"?

postgresql and threads

Anonymous (not verified)
on
October 8, 2007 - 11:53am

This article is not only about POSIX threads (pthreads); it is about threading performance in general. The kernel scheduler and other subsystems often work only with threads, called light-weight processes (LWPs). It does not matter whether the application being tested is threaded (MySQL) or not (PostgreSQL); it is based on LWPs either way. Just keep in mind that the benchmarks measure the OS, not the databases.

There is talk about "old systems". It does not matter that the 4-CPU machine is so old; the general principle of SMP is the same. It shows that NetBSD has improved SMP performance a great deal on both the old and new machines. Of course there are still a lot of improvements to make.

FreeBSD is a slight bit

Anonymous (not verified)
on
October 8, 2007 - 11:20pm

FreeBSD is a slight bit slower than the others on PIII-based machines. I think it is because the kernel is larger and the PIII just cannot handle it effectively. I don't care about the PIII at all; I even threw mine away some days ago.
