Re: Thread benchmarks

Previous thread: Pulling a fix from wrstuden-fixsa into NetBSD 4.0.1 by Bill Stouder-Studenmund on Friday, September 28, 2007 - 10:31 am. (1 message)

Next thread: Thread benchmarks by Andrew Doran on Friday, September 28, 2007 - 10:51 am. (12 messages)
From: Andrew Doran
Date: Friday, September 28, 2007 - 10:50 am

Back in March I posted some MySQL benchmarks after we switched to a 1:1
threading model in -current *. I've spent a lot of time tuning the pthread
library so I thought I'd post a followup. The original benchmark that I used
(supersmack) now performs much better on -current than it did a few months
ago, so I picked something else this time: MySQL sysbench.

Most of the sysbench runs that I've seen to date have sysbench running on
the same machine as the database. That's a good test but with the exception
of small installations and out-of-band activity, production setups rarely
look like that. So I ran sysbench itself on a separate dual core system.

Here are the results, comparing NetBSD 3 with NetBSD-current:

        http://www.netbsd.org/~ad/sysbench/netbsd.png

And NetBSD-current compared to other systems:

        http://www.netbsd.org/~ad/sysbench/netbsd-and-others.png

Note this is stock NetBSD-current with FreeBSD's malloc() (jemalloc) in
libc. I'll be merging that some time soon.

With the vmlocking CVS branch and Mindaugas' new scheduler NetBSD peaks
around 500 TPS. There is a very gradual fall off in the number of TPS
achieved as the number of connections begins to ramp up. I suspect that
could be due to a weakness somewhere in the network stack, so I'm hopeful
that a bit of time spent profiling with large numbers of connections could
yield good results.

Thanks,
Andrew

* http://mail-index.netbsd.org/tech-kern/2007/03/02/0005.html
From: matthew sporleder
Date: Friday, September 28, 2007 - 10:57 am

Can you talk more about the malloc replacement?  Also- an interesting
thing about benchmarks in the past was the long-running stability of
netbsd.  Did you see anything like that?
From: Andrew Doran
Date: Friday, September 28, 2007 - 12:34 pm

There's a good bit of information at the URL below and the implementation is
in FreeBSD's CVS. The main advantage to jemalloc is that it works well with
large numbers of threads.

	http://people.freebsd.org/~jasone/jemalloc/

Joerg has suggested that we try a few other BSD licensed allocators and see

Well, what do you mean by stability? :-). The majority of the kinks have
been ironed out of the scheduler and thread library now, so the results on
NetBSD are constant given the same test setup and conditions. The one issue
that exists is that we are dropping a few TPS for every connection that's
added. NetBSD holds up like that until 900 simultaneous client threads. At
around 900 threads, some quite odd (and as yet unknown) behaviour is tickled
and the rate collapses to about 100tps.

Thanks,
Andrew
From: Thor Lancelot Simon
Date: Friday, September 28, 2007 - 11:27 am

Something interesting's happening in the Linux line on the graph right
at the right edge of the plotted region (20 threads).  Could you perhaps
run NetBSD-current against Linux again with the maximum number of threads
ramping up to 40, to see what the two curves look like as we head in
that direction?

Either we degrade a lot more gracefully than Linux under load, or there's
an artifact in the Linux graph.  The current plot makes it impossible to
tell which, though.

Thor

From: Andrew Doran
Date: Friday, September 28, 2007 - 12:39 pm

I have also tried 10-100 and 100-1000 client connections. I don't have the
numbers at hand, but Linux peaks around 550 tps somewhere around 100 client
connections. The numbers I was getting from Linux were quite erratic and I
had to throw out a few sets of results where the downward spikes were so bad

In the long run Linux will beat NetBSD. That said, the behaviour I saw on
this test cannot be called graceful!

Thanks,
Andrew
From: Christoph Egger
Date: Monday, October 1, 2007 - 1:32 am

I am interested in seeing NetBSD numbers with a NUMA-aware memory
allocator.
From: Mike Cheponis
Date: Monday, October 1, 2007 - 10:33 pm

Why is this?

I don't know any fundamental reason this should be so.

Thanks,

-Mike
From: Alistair Crooks
Date: Tuesday, October 2, 2007 - 3:13 am

I think that this is because the Linux graph is so unpredictable - it
is all over the place in the graphs collected by Andy, and which
someone else agreed was the case under Linux - it has more spikes than
my son's hair. Anyway, because it's so unpredictable, it can be used
to prove that Linux performs better than any other operating system at
any point in the graph (whilst at the same time handwaving away that
it's performing worse).

QED.

Regards,
Al the statistician
From: jonathan
Date: Tuesday, October 2, 2007 - 10:21 am

In message <Pine.NEB.4.64.0710012230280.900@S.culver.net>,

Hi,

There are at least two ways to take Andrew's quoted text.  One way is
that over time, Linux will do better than NetBSD (in the long haul,
which is how I first read the quoted line).  The other way is that at
this point in time, if we look at datapoints beyond the right edge of
Andrew's graphs from last week, Linux does better than NetBSD.  From
off-list discussion with Andrew, I am sure he means the second, not
the first.

Andrew's comments in another message, referring to a gradual drop-off
with increasing number of connections and suggesting kernel profiling
in that regime, to find the source of the gradual drop-off, also
support the second reading.

Andrew can say more on this score if he chooses.
From: Andrew Doran
Date: Wednesday, October 3, 2007 - 6:23 am

Right, it was badly worded. With the later peaks that Linux shows, and with
NetBSD's gradual fall off, continuing on to (say) 500 threads will show
Linux achieving a higher transaction rate.

Andrew
From: Adam Hamsik
Date: Friday, September 28, 2007 - 12:25 pm




Regards
- -----------------------------------------
Adam Hamsik
jabber: haad@jabber.org
icq: 249727910

Proud NetBSD user.

We program to have fun.
Even when we program for money, we want to have fun as well.
~ Yukihiro Matsumoto




From: Warner Losh
Date: Friday, September 28, 2007 - 1:04 pm

Which kernel config did you use for the FreeBSD results?  In tests
that have been run on p4 hardware, the FreeBSD system's graph looks
more like NetBSD's than the one presented here.  FreeBSD's kernel has
a lot of debugging options that hurt performance on by default.  Also,
FreeBSD's malloc defaults to 'AJ' in head, which would result in
reduced performance.

Warner
From: Andrew Doran
Date: Monday, October 1, 2007 - 6:33 am

I took the generic config, removed the debugging options (INVARIANTS,

I can try turning off debugging in the allocator. What else would you like
me to try? I would like to provide remote access to the two systems but
unfortunately my Internet link is unreliable and I'm not in a position to
leave them on 24x7. Some details on the test. I grabbed my.cnf from Jeff
Roberson's weblog:

	http://people.freebsd.org/~jeff/bsd.cnf

Relevant bits of dmesg from the MySQL host:

total memory = 2047 MB
avail memory = 2008 MB
cpu0: Intel Pentium III Xeon (686-class), 701.64 MHz, id 0x6a1
cpu0: features 383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 383fbff<PGE,MCA,CMOV,PAT,PSE36,MMX>
cpu0: features 383fbff<FXSR,SSE>
cpu0: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
cpu0: L2 cache 1 MB 32B/line 8-way
cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
fxp0 at pci1 dev 6 function 0: i82559 Ethernet, rev 8
fxp0: interrupting at ioapic0 pin 3 (irq 3)
fxp0: Ethernet address 00:02:a5:45:a6:48
inphy0 at fxp0 phy 1: i82555 10/100 media interface, rev. 4
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

The disk subsystem doesn't matter since I was running the read-only test,
and with 10000 rows everything fits in core. I compiled MySQL by hand on
each system:

./configure --prefix=/local/mysql --with-pthread --with-innodb

All but the necessary processes were killed on the two systems, so they
were running at most sshd, screen, sysbench and the minimum to be able to
log in. I did a warm-up run and then started testing:

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
        echo "=> ${i} THREADS"
        sysbench --test=oltp --db-driver=mysql --mysql-host=${HOST} \
            --mysql-user=root --mysql-table-engine=innodb --num-threads=${i} \
            --max-time=60 --max-requests=0 --oltp-read-only=on run | \
	    tee -a ...
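
[Editorial sketch, not part of the original message.] The loop above steps
through thread counts 1 to 20. A minimal sketch of how the same sweep could be
generated for any range (for example up to the 40 threads asked about earlier
in the thread) is shown below. It only prints the command lines and does not
run sysbench. The function name `sysbench_cmds` and the host name `dbhost` are
hypothetical; the sysbench flags are copied from the loop above.

```shell
#!/bin/sh
# sysbench_cmds FIRST LAST HOST
# Print one sysbench command line per thread count in FIRST..LAST.
# Nothing is executed here. Plain sh has no local variables, so
# first/last/host/i are ordinary shell variables.
sysbench_cmds() {
    first=$1; last=$2; host=$3
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "sysbench --test=oltp --db-driver=mysql --mysql-host=${host}" \
             "--mysql-user=root --mysql-table-engine=innodb --num-threads=${i}" \
             "--max-time=60 --max-requests=0 --oltp-read-only=on run"
        i=$((i + 1))
    done
}

# Print (do not run) the commands for 1 to 3 threads against a
# hypothetical host called "dbhost".
sysbench_cmds 1 3 dbhost
```

Piping the printed lines to `sh` would run them in order. Unlike the original
loop, this sketch does not print a per-run banner or append to a log file.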
From: Kris Kennaway
Date: Monday, October 1, 2007 - 3:26 pm

You should rebuild malloc with MALLOC_PRODUCTION defined (edit 
lib/libc/stdlib/malloc.c) as well as making sure that either 
/etc/malloc.conf is removed or symlinked to 'aj'.  This is pretty important.
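
[Editorial sketch, not part of the original message.] Before rerunning the
benchmark it may help to confirm which allocator flags are in effect. The
sketch below assumes, as the paragraph above implies, that the flags are
carried in the target of the /etc/malloc.conf symbolic link. The function name
`show_malloc_conf` is hypothetical.

```shell
#!/bin/sh
# show_malloc_conf [PATH]
# Print the option string that a malloc.conf symlink points at, or
# "(not set)" if PATH is not a symbolic link. PATH defaults to
# /etc/malloc.conf.
show_malloc_conf() {
    conf=${1:-/etc/malloc.conf}
    if [ -L "$conf" ]; then
        readlink "$conf"
    else
        echo "(not set)"
    fi
}

show_malloc_conf
```

If this prints an upper-case string such as `AJ`, the debugging options
discussed above are presumably still enabled. Replacing the link with one that
points at `aj` (for example with `ln -fs aj /etc/malloc.conf`, run as root) is
the change being suggested.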

Could you also provide a copy of your FreeBSD kernel configuration file 

OK, the only difference to my config is that I have

innodb_log_file_size=900M


OK.  The FreeBSD port also defines

                 --enable-thread-safe-client
                 --without-debug
		--enable-assembler

(and some other options that don't look relevant).  --with-pthread might 
  enable the first option but if not it could cause performance 
anomalies (i.e. this is relevant for the client, of course).  For 
example I accidentally built postgresql without threaded client support 
recently and spent a while trying to work out why sysbench suddenly ran 

I use

sysbench --test=oltp --num-threads=$1 --mysql-user=root --max-time=120 
--max-requests=0 --oltp-read-only=on --db-driver=mysql 
--mysql-host=192.168.5.120 run

which seems to be equivalent (the default table engine is innodb in our 
config).

Can you run 'vmstat -w 1' for e.g. 30 seconds on your FreeBSD system 
when the test is running?  I see total CPU usage at 100%, with system at 

I tested on a quad 500 MHz p3 (i.e. 30% slower clock speed than your 
system), via 100Mbps em0.  Performance was already at the level of the 
FreeBSD curve on your graph (about 320 tps across a range of loads), and 
if I scale up by 700/500 then it's about the same as your NetBSD curve. 
  I suspect that this will actually underestimate performance a bit 
because the CPU is an older generation than yours, so the difference is 
not just clock speed.  One thing that is kind of interesting is that 
some of the locking optimizations that we have not yet committed don't 
make a difference on this machine and workload; apparently they are only 
important at 8 CPUs and above.
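
[Editorial sketch, not part of the original message.] The clock-speed scaling
in the paragraph above is plain proportional arithmetic. A minimal sketch,
assuming throughput scales linearly with clock speed (which the message itself
treats as only a rough estimate); the function name `scale_tps` is
hypothetical.

```shell
#!/bin/sh
# scale_tps TPS FROM_MHZ TO_MHZ
# Linearly rescale a transactions-per-second figure from one clock
# speed to another. This assumes throughput is proportional to clock
# speed and ignores CPU generation, cache and memory differences.
scale_tps() {
    awk -v tps="$1" -v from="$2" -v to="$3" \
        'BEGIN { printf "%.0f\n", tps * to / from }'
}

# About 320 tps measured at 500 MHz, rescaled to a 700 MHz part.
scale_tps 320 500 700    # prints 448
```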

Anyway, this all suggests to me that something is going wrong on your ...
From: Darren Reed
Date: Tuesday, October 2, 2007 - 2:16 am

When does this get turned on for normal FreeBSD builds?
Just those that are "releases" (vs current)?

Darren
From: M. Warner Losh
Date: Tuesday, October 2, 2007 - 2:46 am

In message: <47020C5F.3060703@netbsd.org>

Yes.  -HEAD has that turned off so that we get maximum sanity testing
during development cycles.  After we branch, one of the things done on
the branch before a release is to turn off all the performance
degrading debugging/sanity code.  Once off on a branch, it stays off
for the life of the branch.

Warner
From: Andrew Doran
Date: Wednesday, October 3, 2007 - 7:08 am

It turns out that this was due to debugging in malloc(). As suggested I
recompiled FreeBSD's libc without the debugging, and FreeBSD's performance
is much better: as of right now, NetBSD and FreeBSD are fairly closely
matched on my 4 way system. From two single runs with both NetBSD and
FreeBSD using SCHED_4BSD:

	http://www.netbsd.org/~ad/sysbench/sysbench-4bsd.png

Here with SCHED_ULE and with NetBSD using Mindaugas' experimental scheduler.
Like ULE, it uses per-CPU run queues. Among other things that means threads
tend to migrate less.

	http://www.netbsd.org/~ad/sysbench/sysbench-pcpu.png

Thanks,
Andrew
From: Brett Lymn
Date: Monday, October 1, 2007 - 7:56 pm

That certainly does look impressive.  Good work.  Do you have any
indication of performance scaling vs number of processors?

-- 
Brett Lymn
