Re: Quad core CPUs loaded at only 50% when running a CPU and mmap intensive multi-threaded task

Previous thread: [PATCH] x86: let 64 bit to use 32 bit calibrate_apic_clock by Yinghai Lu on Sunday, August 24, 2008 - 10:41 pm. (5 messages)

Next thread: [BUG] cpufreq: constant cpu_khz by Peter Zijlstra on Sunday, August 24, 2008 - 11:33 pm. (8 messages)
From: Peter Zijlstra
Date: Sunday, August 24, 2008 - 10:51 pm

Can you share your .config, and prehaps tell what kernel version did
work for you?

--

From: edwin
Date: Monday, August 25, 2008 - 12:04 am

Sorry, I forgot to include the .config, its at the end of this mail (the 
cfs debug info output included the .config though)

Well, I just bought this new box, so there isn't a kernel version that I 
know that worked on this hardware (but I am trying to boot some older 
versions now).
However on my previous box (Athlon64, non-SMP) I have never seen such a 
problem (that the CPU is loaded only 50% with clamd) and I've been
running 2.6.26 and 2.6.27-rc4 there too.

Details below, short summary here:
2.6.24: WORKS, clamd 400% CPU, testprogram runs in 27.4 seconds, 67% CPU 
load; and 28.5 seconds w/o setting affinity
2.6.25+: DOES NOT WORK, clamd 200%-300% CPU, testprogram runs in 38-40 
seconds, 48-48% CPU load, and 47-56 seconds w/o setting affinity

Debian has 2.6.18, 2.6.22, 2.6.24, 2.6.25, 2.6.26.
2.6.22 won't work with my lvm, so I can't boot that, so I tried 2.6.24:

2.6.24 doesn't have sched_debug enabled in the stock kernel 
unfortunately, but the output of cfs-debug-info.sh is available here, 
maybe it contains some useful info:
http://edwintorok.googlepages.com/testrun-1219645937.tar.gz

Is this enough info for you to reproduce the problem, or do you want me 
to try and bisect?

1. Linux debian 2.6.24-etchnhalf.1-amd64 #1 SMP Mon JUl 21 10:36:02 UTC 
2008 x86_64 GNU/Linux
WORKS! Clamd loads CPUs at 400%
My testprogram loads it at ~60% (vs 38% - 48% with 2.6.26)
Results for build: cpu worker default affinity
Real time: 6.523395 s (since program startup)
clock: 13.046436
All threads clock sum: 22.147781
Resource usage: 24.353521 s
CPU usage: 84.94 %
OK
Results for build: cpu worker affinity fix hack
Real time: 6.522737 s (since program startup)
clock: 13.045105
All threads clock sum: 22.339515
Resource usage: 24.445527 s
CPU usage: 84.70 %
OK
Results for build: mmap worker default affinity
Real time: 14.285333 s (since program startup)
clock: 28.570301
All threads clock sum: 54.037440
Resource usage: 44.674792 s
CPU usage: 65.39 %
CPU usage is too ...
From: Peter Zijlstra
Date: Monday, August 25, 2008 - 2:23 am

No, I think I know what's going on..

mmap() and munmap() need to take the mmap_sem for writing (since they
modify the memory map) and you let each thread (one for each cpu) take
that process wide lock, twice, for a million times.

Guess what happens ;-)



--

From: Török Edwin
Date: Monday, August 25, 2008 - 2:49 am

Are you referring to the mmap_sem lock, or my mutex lock around 

So the problem is that doing mmap() doesn't scale well with multiple 
threads, because there is contention on mmap_sem?
Why did 2.6.24 seem to work better?

Best regards,
--Edwin
--

From: Peter Zijlstra
Date: Monday, August 25, 2008 - 3:02 am

mmap_sem, its process wide, and your test prog bangs on it like there's


Perhaps the scheduler overhead did increase, can you try:

echo NO_HRTICK > /debug/sched_features

(after mounting debugfs on /debug, or adjusting the path to where you do
have it mounted)

That might cause some overhead on very high context switch rates.

--

From: Török Edwin
Date: Monday, August 25, 2008 - 3:22 am

Well, the real program (clamd) that this testprogram tries to simulate 
does an mmap for almost every file, and I have lots of small files.
6.5G, 114122 files, average size 57k.

I'll run latencytop again, last time it has showed 100ms - 500ms latency 

No difference, and turning off the other features from sched_features 
doesn't seem to help either.

Best regards,
--Edwin
--

From: Peter Zijlstra
Date: Monday, August 25, 2008 - 3:36 am

Right - does it make sense to teach clamav about pread() ?

--


Peter Zijlstra wrote:
> On Mon, 2008-08-25 at 13:22 +0300, T

MaxThreads 8, full DB (13 % slower than 2.6.24)
4m42s

MaxThreads 4, full DB (8% faster than 2.6.24)
2m35s

MaxThreads 8, full DB, 2.6.24:
4m3s

MaxThreads 4, full DB, 2.6.24:
2m50s

I have run an echo 3>/proc/sys/vm/drop_caches before each, I hope that 
clears all caches,
I have xfs on top of lvm, on top of raid10, and iostat shows only 0 - 
20% activity (%util).
That could also mean of course that the disks can provide data fast 


Still, if I have more threads, performance *decreases* almost linearly 
with 2.6.27 (and probably with 2.6.25+ if clamd behaves the same as my 
test proggie),  however with 2.6.24

With Debian etch having 2.6.24 (etchnhalf actually), and lenny shipping 
with 2.6.25 or 2.6.26, users upgrading from etch to lenny could see a 
performance decrease.

Best regards,
--Edwin
--

From: Peter Zijlstra
Date: Monday, August 25, 2008 - 4:41 am

I would certainly consider this for small (< 1M?) files. With mmap the
faults and pte overhead aren't free either, and the extra memcpy from
pread() isn't that much.

--

From: Theodore Tso
Date: Monday, August 25, 2008 - 6:48 am

Even for very big files, if you're only doing a single sequential pass
over a very large file (for example when converting a Canon raw image
file to TIFF format --- I know because I was trying to optimize dcraw
a while aback), you take the page fault for each 4k page, and so
simply using read/pread is faster.  And that's on a single-threded
program.  With a multithreaded program, the locking issues come on top
of that.

Maybe if I had used hugepages it would have been a win, I suppose, but
I never tried the experiment.  And this was several years ago, on much
older hardware, so maybe the relative times of doing the memory copy
versus the page fault, but I wouldn't be surprised if it's even more
expensive to do the mmap, relatively speaking.

							- Ted
--

From: Andi Kleen
Date: Tuesday, August 26, 2008 - 1:12 am

Theodore Tso <tytso@mit.edu> writes:

As a general comment it still sounds there is a regression here?
If the workload was faster in an earlier kernel and is now slow
clearly something got slower? And that might be fixable. 

hugepages do not support arbitary file backing.

-Andi

-- 
ak@linux.intel.com
--

From: Török Edwin
Date: Monday, September 8, 2008 - 12:10 pm

After some more careful testing with the real program (clamd) I can say
that there is no regression.
If I scan the exact same files as the box running 2.6.18 I get similar
results, the difference is within 10%  [1].

There is however a problem with mmap [mmap with N threads is as slow as
mmap with 1 thread, i.e. it is sequential :(], pagefaults and disk I/O,
I think I am hitting the problem described in this thread (2 years ago!)
http://lwn.net/Articles/200215/
http://lkml.org/lkml/2006/9/19/260

It looks like such a patch is still not part of 2.6.27, what happened to it?
I will see if that patch applies to 2.6.27, and will rerun my test with
that patch applied too.

While running clamd I noticed in latencytop, that besides mmap/munmap
latencies (around 20 ms), I also get page fault latencies (again around
20 ms).

So I wrote another test program [2] that walks a directory tree, and
reads each file once using read for each, and once using mmap for each.
It clears the cache (using echo 3 >/proc/sys/vm/drop_caches) before each
test (and I run the read test first, so if the cache wouldn't be cleared,
then mmap would be faster not slower).
The results show that reading files using mmap() takes about the same
time, regardless of how many threads I use (1,2,4,8,16), but
using read has a near linear speedup with the number of threads.


First lets see some numbers [3], time to run the program on /usr/bin in
seconds [4]
Number of CPUs, 4
Number of threads ->, 1,, 2,, 4,, 8,, 16
Kernel version, read, mmap, read, mmap, read, mmap, read, mmap, read, mmap
2.6.27-rc5, 16.70, 17.01, 12.86, 16.26, 7.31, 15.16, 4.01, 14.93, 3.79,
15.40
2.6.26-1-amd64, 17.90, 16.95, 13.30, 16.18, 7.31, 15.34, 3.87, 14.96,
3.86, 15.89
2.6.22-3-amd64, 15.12, 15.41, 11.98, 15.17, 6.36, 14.29, 3.15, 14.61,
3.08, 15.44

The kernels are standard Debian kernels, except for 2.6.27-rc5 which
I've built myself (posted .config earlier in this thread).
mmap and read are about the same speed with nthreads=1, so ...
From: Török Edwin
Date: Friday, September 12, 2008 - 12:16 pm

The patch doesn't apply to 2.6.27-rc6, I tried manually applying the patch.
There have been many changes since 2.6.18 (like replacing find_get_page
with find_lock_page, filemap returning VM_FAULT codes, etc.).
I have probably done something wrong, because the resulting kernel won't
boot: I  get abnormal exits and random sigbus during boot.

Can you please help porting the patch to 2.6.27-rc6? I have attached my
2 attempts at the end of this mail.

Also it looks like the original patch just releases the mmap_sem if
there is lock contention on the page, but keeps mmap_sem during read?
I would like mmap_sem be released during disk I/O too.

I also tried changing i_mmap_lock into a semaphore, however I that won't
work since some users of i_mmap_lock can't sleep.
Taking the i_mmap_lock spinlock in filemap fault is also not possible,
since we would sleep while holding a spinlock.

Just to confirm that the problem is with pagefaults and mmap, I dropped
the mmap_sem in filemap_fault, and then
I got same performance in my testprogram for mmap and read. Of course

I've written a latency tracer (using ptrace), and I identified the
mutex/mmap related latencies (total runtime 23m):
- mmap-ed files (created by libclamav) ~6680 ms  total
- creating/removing anonymous mappings, created by glibc, when I use
functions like fopen/fclose:

With 8 threads:
=====> Total: 3227.732 ms, average: 3.590 ms, times: 899
=== /lib/libc.so.6  (mmap)
=== /lib/libc.so.6  (_IO_file_doallocate)
=== /lib/libc.so.6  (_IO_doallocbuf)
=== /lib/libc.so.6  (_IO_file_seekoff)
=== /lib/libc.so.6  (_IO_file_attach)
=== /lib/libc.so.6  (_IO_fdopen)
=== /usr/lib/libz.so.1  (gzflush)
=== /usr/lib/libz.so.1  (gzdopen)
=== /usr/local/lib/libclamav.so.5 libclamav/scanners.c:470 (cli_scangzip)

=====> Total: 2069.519 ms, average: 3.624 ms, times: 571
=== /lib/libc.so.6  (munmap)
=== /lib/libc.so.6  (_IO_setb)
=== /lib/libc.so.6  (_IO_file_close_it)
=== /lib/libc.so.6  (_IO_fclose)
=== /usr/lib/libz.so.1  ...
From: Mike Waychison
Date: Friday, September 12, 2008 - 12:35 pm

I actually have to forward port this and a bunch of other mm speed-ups 
in the coming two weeks, though they'll be ports from 2.6.18 to 2.6.26. 
  I'll be sending them out to linux-mm once I've done so and completed 

The 'lock'ing of the page is the part that waits for the read to finish, 

You shouldn't be seeing much contention on the i_mmap_lock, at least not 
in the fault path (it's mostly just painful when you have a lot of 

--

From: Török Edwin
Date: Friday, September 12, 2008 - 1:10 pm

I was thinking of using i_mmap_lock as an alternative to holding
mmap_sem, but it didn't seem like a good idea after all.

Best regards,
--Edwin
--

From: Peter Zijlstra
Date: Monday, August 25, 2008 - 3:37 am

OK, I'll poke a little more at is later today to see if I can spot
something

--

Previous thread: [PATCH] x86: let 64 bit to use 32 bit calibrate_apic_clock by Yinghai Lu on Sunday, August 24, 2008 - 10:41 pm. (5 messages)

Next thread: [BUG] cpufreq: constant cpu_khz by Peter Zijlstra on Sunday, August 24, 2008 - 11:33 pm. (8 messages)