Malloc scaling/speedup work, request for testers/reviews

From: Venkatesh Srinivas
Date: Sunday, April 18, 2010 - 3:11 pm

Hiya,

I've been working for some time on improving the libc malloc in
DragonFly, specifically improving its thread scaling and reducing the
number of mmap/munmap system calls it issues.

To address the issue of scaling, I added per-thread magazines for
allocations < 8K (ones that would hit the current slab zones); the
magazine logic is based straight on Bonwick and Adams's 2001
'Magazines and Vmem' paper, about the Solaris kernel allocator and
libumem. To reduce the number of mmaps the allocator makes, I
added a single magazine between the slab layer and mmap; it caches up
to 64 zones and reuses them rather than requesting them from and
releasing them to the system. In addition, the first request for a
zone now allocates not one but 8 zones, the second allocates 6, and so
on, until allocation settles at one zone at a time. This is meant to
help programs that request objects of many different sizes early in
their life.
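
For readers unfamiliar with the magazine idea, here is a minimal sketch of
a per-thread magazine in C. It is an illustration under stated assumptions,
not the nmalloc code: the names (struct magazine, mag_pop, mag_push,
cached_alloc, cached_free), the capacity of 16 rounds, and the single
64-byte size class are all hypothetical, and plain malloc/free stand in for
the slab layer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAG_ROUNDS 16   /* hypothetical magazine capacity */
#define OBJ_SIZE   64   /* hypothetical single size class, in bytes */

/* A magazine is a small fixed-size stack of free objects ("rounds"). */
struct magazine {
    size_t nrounds;              /* number of cached objects */
    void  *rounds[MAG_ROUNDS];   /* the cached objects themselves */
};

/* One magazine per thread, so the fast path needs no locking. */
static _Thread_local struct magazine tls_mag;

/* Pop a cached object, or return NULL if the magazine is empty. */
static void *mag_pop(struct magazine *m)
{
    assert(m != NULL);
    return (m->nrounds > 0) ? m->rounds[--m->nrounds] : NULL;
}

/* Push an object; returns 0 on success, -1 if the magazine is full. */
static int mag_push(struct magazine *m, void *p)
{
    assert(m != NULL);
    if (m->nrounds == MAG_ROUNDS)
        return -1;
    m->rounds[m->nrounds++] = p;
    return 0;
}

/* Allocate one OBJ_SIZE object: try the thread's magazine first and
 * fall back to the backing allocator (malloc here, as a stand-in for
 * the slab layer) only when the magazine is empty. */
void *cached_alloc(void)
{
    void *p = mag_pop(&tls_mag);
    return (p != NULL) ? p : malloc(OBJ_SIZE);
}

/* Free one object: cache it in the magazine if there is room,
 * otherwise hand it back to the backing allocator. */
void cached_free(void *p)
{
    if (p != NULL && mag_push(&tls_mag, p) != 0)
        free(p);
}
```

A free followed by an allocation on the same thread hands back the same
object without touching any shared state. The same stack-of-pointers shape,
with a capacity of 64 and a lock around it, is one way to picture the zone
cache between the slab layer and mmap.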

Some benchmark results so far:

sh6bench =============================
To quote Jason Evans, sh6bench 'is a quirky malloc benchmark that has been used
in some of the published malloc literature, so I include it here.
sh6bench does cycles of allocating groups of objects, where the size
of objects in each group is random. Sometimes the objects are held for
a while before being freed.' The test is available at:
http://m-net.arbornet.org/~sv5679/sh6bench.c

When run on DragonFly with 50000 calls for objects between 1 and 1512
bytes, nmalloc (the current libc allocator) takes 85 sec to finish the
test; nmalloc-1.33 takes 58 sec, spending nearly 20 sec less in system
time.

When tested with 2500 calls for 1...1512 byte objects on FreeBSD 8,
nmalloc, nmalloc-1.33, and jemalloc (the FreeBSD libc allocator) turn
in times very close to one another.
Here are the total memory uses and mmap call counts:
(nmalloc 'g' is nmalloc with the per-thread caches disabled).

                         mmaps / munmaps		total space requested/release     ...
From: Dylan Reinhold
Date: Wednesday, April 21, 2010 - 6:56 pm

Venkatesh,
  I ran a quick test on one of my machines, using LD_PRELOAD.
With 10,000 calls and block sizes of 1-1000, it shaved ~21 seconds off
the time.
The system is an AMD Athlon(tm) XP 1800+ (1493.66-MHz) with 512 MB of
RAM.
I tried your 50,000 calls with block sizes of 1-1,512, but the system
ran out of memory/swap and the utility died.


Pre -
dylan@backup_a:~/malloc$ ./sh6bench
call count [1000]: 10000
min block size [1]: 1
max block size [1000]:
Total elapsed time: 52.00 (51.5938 CPU)

Post -
dylan@backup_a:~/malloc$ ./sh6bench
call count [1000]: 10000
min block size [1]:
max block size [1000]:
Total elapsed time: 31.00 (30.9297 CPU)


Regards,
Dylan