Can you share your .config, and prehaps tell what kernel version did work for you? --
Sorry, I forgot to include the .config, its at the end of this mail (the cfs debug info output included the .config though) Well, I just bought this new box, so there isn't a kernel version that I know that worked on this hardware (but I am trying to boot some older versions now). However on my previous box (Athlon64, non-SMP) I have never seen such a problem (that the CPU is loaded only 50% with clamd) and I've been running 2.6.26 and 2.6.27-rc4 there too. Details below, short summary here: 2.6.24: WORKS, clamd 400% CPU, testprogram runs in 27.4 seconds, 67% CPU load; and 28.5 seconds w/o setting affinity 2.6.25+: DOES NOT WORK, clamd 200%-300% CPU, testprogram runs in 38-40 seconds, 48-48% CPU load, and 47-56 seconds w/o setting affinity Debian has 2.6.18, 2.6.22, 2.6.24, 2.6.25, 2.6.26. 2.6.22 won't work with my lvm, so I can't boot that, so I tried 2.6.24: 2.6.24 doesn't have sched_debug enabled in the stock kernel unfortunately, but the output of cfs-debug-info.sh is available here, maybe it contains some useful info: http://edwintorok.googlepages.com/testrun-1219645937.tar.gz Is this enough info for you to reproduce the problem, or do you want me to try and bisect? 1. Linux debian 2.6.24-etchnhalf.1-amd64 #1 SMP Mon JUl 21 10:36:02 UTC 2008 x86_64 GNU/Linux WORKS! Clamd loads CPUs at 400% My testprogram loads it at ~60% (vs 38% - 48% with 2.6.26) Results for build: cpu worker default affinity Real time: 6.523395 s (since program startup) clock: 13.046436 All threads clock sum: 22.147781 Resource usage: 24.353521 s CPU usage: 84.94 % OK Results for build: cpu worker affinity fix hack Real time: 6.522737 s (since program startup) clock: 13.045105 All threads clock sum: 22.339515 Resource usage: 24.445527 s CPU usage: 84.70 % OK Results for build: mmap worker default affinity Real time: 14.285333 s (since program startup) clock: 28.570301 All threads clock sum: 54.037440 Resource usage: 44.674792 s CPU usage: 65.39 % CPU usage is too ...
No, I think I know what's going on.. mmap() and munmap() need to take the mmap_sem for writing (since they modify the memory map) and you let each thread (one for each cpu) take that process wide lock, twice, for a million times. Guess what happens ;-) --
Are you referring to the mmap_sem lock, or my mutex lock around So the problem is that doing mmap() doesn't scale well with multiple threads, because there is contention on mmap_sem? Why did 2.6.24 seem to work better? Best regards, --Edwin --
mmap_sem, its process wide, and your test prog bangs on it like there's Perhaps the scheduler overhead did increase, can you try: echo NO_HRTICK > /debug/sched_features (after mounting debugfs on /debug, or adjusting the path to where you do have it mounted) That might cause some overhead on very high context switch rates. --
Well, the real program (clamd) that this testprogram tries to simulate does an mmap for almost every file, and I have lots of small files. 6.5G, 114122 files, average size 57k. I'll run latencytop again, last time it has showed 100ms - 500ms latency No difference, and turning off the other features from sched_features doesn't seem to help either. Best regards, --Edwin --
Right - does it make sense to teach clamav about pread() ? --
Peter Zijlstra wrote: > On Mon, 2008-08-25 at 13:22 +0300, T
MaxThreads 8, full DB (13 % slower than 2.6.24) 4m42s MaxThreads 4, full DB (8% faster than 2.6.24) 2m35s MaxThreads 8, full DB, 2.6.24: 4m3s MaxThreads 4, full DB, 2.6.24: 2m50s I have run an echo 3>/proc/sys/vm/drop_caches before each, I hope that clears all caches, I have xfs on top of lvm, on top of raid10, and iostat shows only 0 - 20% activity (%util). That could also mean of course that the disks can provide data fast Still, if I have more threads, performance *decreases* almost linearly with 2.6.27 (and probably with 2.6.25+ if clamd behaves the same as my test proggie), however with 2.6.24 With Debian etch having 2.6.24 (etchnhalf actually), and lenny shipping with 2.6.25 or 2.6.26, users upgrading from etch to lenny could see a performance decrease. Best regards, --Edwin --
I would certainly consider this for small (< 1M?) files. With mmap the faults and pte overhead aren't free either, and the extra memcpy from pread() isn't that much. --
Even for very big files, if you're only doing a single sequential pass over a very large file (for example when converting a Canon raw image file to TIFF format --- I know because I was trying to optimize dcraw a while aback), you take the page fault for each 4k page, and so simply using read/pread is faster. And that's on a single-threded program. With a multithreaded program, the locking issues come on top of that. Maybe if I had used hugepages it would have been a win, I suppose, but I never tried the experiment. And this was several years ago, on much older hardware, so maybe the relative times of doing the memory copy versus the page fault, but I wouldn't be surprised if it's even more expensive to do the mmap, relatively speaking. - Ted --
Theodore Tso <tytso@mit.edu> writes: As a general comment it still sounds there is a regression here? If the workload was faster in an earlier kernel and is now slow clearly something got slower? And that might be fixable. hugepages do not support arbitary file backing. -Andi -- ak@linux.intel.com --
After some more careful testing with the real program (clamd) I can say that there is no regression. If I scan the exact same files as the box running 2.6.18 I get similar results, the difference is within 10% [1]. There is however a problem with mmap [mmap with N threads is as slow as mmap with 1 thread, i.e. it is sequential :(], pagefaults and disk I/O, I think I am hitting the problem described in this thread (2 years ago!) http://lwn.net/Articles/200215/ http://lkml.org/lkml/2006/9/19/260 It looks like such a patch is still not part of 2.6.27, what happened to it? I will see if that patch applies to 2.6.27, and will rerun my test with that patch applied too. While running clamd I noticed in latencytop, that besides mmap/munmap latencies (around 20 ms), I also get page fault latencies (again around 20 ms). So I wrote another test program [2] that walks a directory tree, and reads each file once using read for each, and once using mmap for each. It clears the cache (using echo 3 >/proc/sys/vm/drop_caches) before each test (and I run the read test first, so if the cache wouldn't be cleared, then mmap would be faster not slower). The results show that reading files using mmap() takes about the same time, regardless of how many threads I use (1,2,4,8,16), but using read has a near linear speedup with the number of threads. First lets see some numbers [3], time to run the program on /usr/bin in seconds [4] Number of CPUs, 4 Number of threads ->, 1,, 2,, 4,, 8,, 16 Kernel version, read, mmap, read, mmap, read, mmap, read, mmap, read, mmap 2.6.27-rc5, 16.70, 17.01, 12.86, 16.26, 7.31, 15.16, 4.01, 14.93, 3.79, 15.40 2.6.26-1-amd64, 17.90, 16.95, 13.30, 16.18, 7.31, 15.34, 3.87, 14.96, 3.86, 15.89 2.6.22-3-amd64, 15.12, 15.41, 11.98, 15.17, 6.36, 14.29, 3.15, 14.61, 3.08, 15.44 The kernels are standard Debian kernels, except for 2.6.27-rc5 which I've built myself (posted .config earlier in this thread). mmap and read are about the same speed with nthreads=1, so ...
The patch doesn't apply to 2.6.27-rc6, I tried manually applying the patch. There have been many changes since 2.6.18 (like replacing find_get_page with find_lock_page, filemap returning VM_FAULT codes, etc.). I have probably done something wrong, because the resulting kernel won't boot: I get abnormal exits and random sigbus during boot. Can you please help porting the patch to 2.6.27-rc6? I have attached my 2 attempts at the end of this mail. Also it looks like the original patch just releases the mmap_sem if there is lock contention on the page, but keeps mmap_sem during read? I would like mmap_sem be released during disk I/O too. I also tried changing i_mmap_lock into a semaphore, however I that won't work since some users of i_mmap_lock can't sleep. Taking the i_mmap_lock spinlock in filemap fault is also not possible, since we would sleep while holding a spinlock. Just to confirm that the problem is with pagefaults and mmap, I dropped the mmap_sem in filemap_fault, and then I got same performance in my testprogram for mmap and read. Of course I've written a latency tracer (using ptrace), and I identified the mutex/mmap related latencies (total runtime 23m): - mmap-ed files (created by libclamav) ~6680 ms total - creating/removing anonymous mappings, created by glibc, when I use functions like fopen/fclose: With 8 threads: =====> Total: 3227.732 ms, average: 3.590 ms, times: 899 === /lib/libc.so.6 (mmap) === /lib/libc.so.6 (_IO_file_doallocate) === /lib/libc.so.6 (_IO_doallocbuf) === /lib/libc.so.6 (_IO_file_seekoff) === /lib/libc.so.6 (_IO_file_attach) === /lib/libc.so.6 (_IO_fdopen) === /usr/lib/libz.so.1 (gzflush) === /usr/lib/libz.so.1 (gzdopen) === /usr/local/lib/libclamav.so.5 libclamav/scanners.c:470 (cli_scangzip) =====> Total: 2069.519 ms, average: 3.624 ms, times: 571 === /lib/libc.so.6 (munmap) === /lib/libc.so.6 (_IO_setb) === /lib/libc.so.6 (_IO_file_close_it) === /lib/libc.so.6 (_IO_fclose) === /usr/lib/libz.so.1 ...
I actually have to forward port this and a bunch of other mm speed-ups in the coming two weeks, though they'll be ports from 2.6.18 to 2.6.26. I'll be sending them out to linux-mm once I've done so and completed The 'lock'ing of the page is the part that waits for the read to finish, You shouldn't be seeing much contention on the i_mmap_lock, at least not in the fault path (it's mostly just painful when you have a lot of --
I was thinking of using i_mmap_lock as an alternative to holding mmap_sem, but it didn't seem like a good idea after all. Best regards, --Edwin --
OK, I'll poke a little more at is later today to see if I can spot something --
