Hello,

I am working on improving Mozilla startup times. It turns out that page faults (caused by a lack of cooperation between userspace and the kernel) are the main cause of slow startup. I need some insight from someone who understands Linux VM behavior.

Current Situation:
The dynamic linker mmap()s the executable and data sections of our executable, but it doesn't call madvise(). By default, page faults trigger 131072-byte reads. To make matters worse, the compile-time linker and gcc lay out code in a manner that does not correspond to how the resulting executable will be executed (i.e. the layout is basically random). This means that during startup, 15-40MB binaries are read in basically random fashion. Even if one orders the binary optimally, throughput is still suboptimal due to the puny readahead.

IO Hints:
Fortunately, when one specifies madvise(WILLNEED), page faults trigger 2MB reads, and a binary that tends to take 110 page faults (i.e. the program stops execution and waits for disk) can be reduced to 6. This has the potential to double the startup speed of large apps without any clear downsides.

Suse ships their glibc with a dynamic linker patch to fadvise() dynamic libraries (not sure why they switched from doing madvise before).

I filed a glibc bug about this at http://sourceware.org/bugzilla/show_bug.cgi?id=11431 . Uli commented with his concern about wasting memory resources. What is the impact of madvise(WILLNEED) or the fadvise equivalent on systems under memory pressure? Does the kernel simply start ignoring these hints?

Also, once an application is started, is it reasonable to keep it madvise(WILLNEED)ed, or should the madvise flags be reset?

Perhaps the kernel could monitor the page-in patterns to increase the readahead sizes? This may already happen; I've noticed that a handful of page faults trigger more than 131072 bytes of IO, so perhaps this just needs tweaking.

Thanks,
Taras Glek

PS. For more details on this issue see my blog at ...
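As a rough illustration of the hint discussed in the message above, the sketch below maps a file read-only and calls madvise(MADV_WILLNEED) over the whole mapping. It is not what ld.so does today; the helper name map_with_willneed is made up for this example, and error handling is kept minimal.

```c
#define _GNU_SOURCE             /* make sure madvise() is declared */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `path` read-only and hint that the whole
 * mapping will be needed soon. Returns MAP_FAILED on error; on success
 * the mapping length is stored in *len_out. */
void *map_with_willneed(const char *path, size_t *len_out)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return MAP_FAILED;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return MAP_FAILED;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return MAP_FAILED;

    /* Advisory only: a failure here is not fatal, so it is ignored. */
    (void)madvise(p, (size_t)st.st_size, MADV_WILLNEED);
    *len_out = (size_t)st.st_size;
    return p;
}
```

A caller would munmap() the returned pointer with the stored length when done. Whether hinting the whole file is a good idea for a given binary is exactly the open question in this thread.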
Try tuning /sys/block/<dev>/queue/read_ahead_kb and see if that makes any difference - that's the default maximum readahead for the given block device and defaults to 128k. There has been some recent work to increase the default readahead size, so if changing the default improves performance then perhaps a fix for your problem is already in the works? Cheers, Dave. -- Dave Chinner david@fromorbit.com --
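To check the current setting before tuning it, the value can be read like any other small text file. The sketch below is a minimal example; the helper name read_ahead_kb is made up here, and the caller supplies the full sysfs path (the device name in it depends on the system).

```c
#include <stdio.h>

/* Hypothetical helper: read a single integer (in KB) from a sysfs-style
 * file such as /sys/block/<dev>/queue/read_ahead_kb.
 * Returns -1 if the file cannot be opened or parsed. */
long read_ahead_kb(const char *path)
{
    long kb = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &kb) != 1)
        kb = -1;
    fclose(f);
    return kb;
}
```

Changing the value means writing a new number to the same file, which normally requires root.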
Almost certainly teaching my grandmother to suck eggs, but are you aware of the work Michael Meeks has done on improving openoffice.org startup time? -- Roland Dreier <rolandd@cisco.com> --
Yes. There were some stones left unturned in the cold startup area. Turns out that every single large application suffers from low io throughput likely due to lack of cooperation between the dynamic linker and the kernel. There is a glibc bug filed on that. http://sourceware.org/bugzilla/show_bug.cgi?id=11431 Unfortunately, few userspace people seem to know exactly how madvise() hints behave, so I was hoping someone on LKML would clue me in. Taras --
It will throttle based on memory pressure. In idle situations it will eat your file cache, however, to satisfy the request.

Now, the file cache should be much bigger than the amount of unneeded pages you prefault with the hint over the whole library, so I guess the benefit of prefaulting the right pages outweighs the downside of evicting some cache for unused library pages.

Still, it's a workaround for deficits in the demand-paging/readahead

It's a one-time operation that starts immediate readahead, no permanent
--
Define idle situations. Do you mean that madv(willneed) will aggressively readahead, but only while the cpu (or disk?) is idle? I am trying to optimize application startup which means that the cpu is

I may be measuring this wrong, but in my experience the only change madvise(willneed) makes is to increase the length parameter to __do_page_cache_readahead(). My script is at http://hg.mozilla.org/users/tglek_mozilla.com/startup/file/6453ad2a7906/kernelio.stp .

Taras
--
Sorry. I meant without memory pressure.

It will trigger readahead for the whole page range immediately, unless the sum of free pages and file cache pages is less than that.

So yes, it will be aggressive against the cache but should not touch things

Whether the page is read on a major fault or by means of WILLNEED, they both end up calling this function. It's just that faulting does all the heuristics and WILLNEED will just force reading the pages in the specified range.

But your question whether it would be reasonable to keep the region WILLNEED madvised makes no sense. It's just a request to prepopulate the page cache from disk data immediately instead of waiting for faults to trigger the reads.
--
Ok. Thanks for clarifying that. I was misinterpreting my io log.

Is there a way to force page faults from a particular memory mapping to do more readahead, i.e. if WILLNEED is not used?

Have heuristics that read backwards been considered? Currently, if one faults in the page at offset 4096, that page and a few pages following it will be preread. It would be interesting to try to preread pages both before and after the page being faulted in. For a graph of "backwards" io see the "Post-linker Fail" section in http://blog.mozilla.com/tglek/2010/03/24/linux-why-loading-binaries-from-disk-sucks/

Taras
--
Hi Taras,

How about improving Fedora (and other distros) to preload Mozilla (and other apps the user ran during the previous boot) with fadvise() at boot time? This sounds like the most reasonable option.

As for the kernel readahead, I have a patchset to increase the default mmap read-around size from 128kb to 512kb (except for small memory

This is interesting. I wonder how SuSE implements the policy. Do you have the patch or some strace output that demonstrates the

Program page faults are inherently random, so the straightforward solution would be to increase the mmap read-around size (for desktops with reasonably large memory), rather than to improve program layout

Thank you :)

Cheers,
Fengguang
--
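The boot-time preload idea above can be sketched with posix_fadvise(). This is only an illustration of the system call, not an existing distro tool; the helper name prefetch_file is made up, and choosing which files to preload is left out entirely.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to start reading the whole file
 * into the page cache. Returns 0 on success, an errno value otherwise. */
int prefetch_file(const char *path)
{
    int err;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return errno;
    /* Offset 0 with length 0 means "from the start to the end of file". */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    close(fd);
    return err;     /* posix_fadvise() returns the error number directly */
}
```

A preloader would call this once per file on its list; the call only queues the readahead and does not wait for the data to arrive.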
That's a slightly different usecase. I'd rather have all large apps startup as efficiently as possible without any hacks. Though until we

Yes. Is the current readahead really doing read-around (i.e. does it read pages before the one being faulted)? From what I've seen, having the dynamic linker read binary sections backwards causes faults.

glibc-2.3.90-ld.so-madvise.diff in http://www.rpmseek.com/rpm/glibc-2.4-31.12.3.src.html?hl=com&cba=0:G:0:3732595:0:1...

Program page faults may exhibit random behavior once they've started. During startup, the page-in pattern of over-engineered OO applications is very predictable. Programs are laid out based on compilation units, which have no relation to how they are executed. Another problem is that any large old application will have lots of code that is either rarely executed or completely dead. Random sprinkling of live code among mostly unneeded code is a problem.

I'm able to reduce startup pagefaults by 2.5x and mem usage by a few MB with proper binary layout. Even if one lays out a program wrongly, the worst-case pagein pattern will be pretty similar to what it is by default.

But yes, I completely agree that it would be awesome to increase the readahead size proportionally to available memory. It's a little silly to be reading tens of megabytes in 128kb increments :) You rock for

Cheers,
Taras
--
Hi, Wu and Taras.

I have been watching this thread because I have experience reducing application startup latency on embedded systems.

I think increasing the readahead size is sometimes not good for embedded systems. Many embedded systems use NAND as storage together with a compressed file system. With NAND, as you know, the random-read penalty is not as large as on an HDD. With a compressed file system, a large compression block delays startup (big block reads plus decompression). We had to disable readahead of code pages by hacking the kernel. That makes the application slower as time goes by, but at the time we decided that latency was more important than performance for our application.

Of course, this varies with the file system and the compression ratio in use. So I think increasing the readahead size might not always be good.

Please consider embedded systems too when you plan to tweak readahead. :)
--
Kind regards,
Minchan Kim
--
Minchan, glad to know that you have experience with embedded Linux. While increasing the general readahead size from 128kb to 512kb, I also added a limit for mmap read-around: if system memory size is less than X MB, then limit read-around size to X KB. For example, do only 128KB read-around for a 128MB embedded box, and 32KB ra for a 32MB box. Do you think it is a reasonable safety guard? Patch attached. Thanks, Fengguang
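One reading of the limit described above is that the read-around size in KB is simply capped at the memory size in MB. The sketch below encodes only that arithmetic; it is not the attached patch, and the function name is made up for illustration.

```c
/* Hypothetical illustration of the rule as described in the message:
 * read-around (KB) = min(default read-around in KB, memory size in MB).
 * This is not the code from the attached patch. */
unsigned long mmap_readaround_kb(unsigned long mem_mb,
                                 unsigned long default_kb)
{
    return mem_mb < default_kb ? mem_mb : default_kb;
}
```

With a 512KB default this gives 128KB for a 128MB box and 32KB for a 32MB box, matching the two examples in the message, while anything with 512MB or more keeps the full 512KB.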
Thanks for the reply, Wu.

I haven't looked at your attachment, because memory size was not the issue in my case. It was the only application on the system, and the system's first main application, which means we had enough memory. I guess there are many embedded systems like that.

At that time, although I could have disabled readahead entirely with read_ahead_kb, I didn't want to, because I didn't want to disable readahead for file I/O and for the program's data section. So, at a loss, I hacked the kernel to disable readahead for the code section only.
--
Kind regards,
Minchan Kim
--
In general, the larger the memory, the less we care about the possible

I would like to auto tune readahead size based on the device's IO throughput and latency estimation, however that's not easy..

Other than that, if we can assert "this class of devices won't benefit from large readahead", then we can do some static assignment.

Thanks,
Fengguang
--
A few months ago, I saw your patch about enhancing readahead. At that time, many people tested several sizes with USB and SSD devices, which consist of NAND. The results were good as long as readahead was increased only up to some crossover point. So I think we need readahead for file I/O on non-rotational devices, too.

But startup latency is more important than file I/O performance on some machines. According to our analysis at the time, code readahead was what slowed application startup. In addition, during bootup, the cache hit ratio was very small.

So I hoped we could disable readahead for the code section only (i.e., roughly the exec vma's filemap fault). :)

I don't need you to solve this problem right now. I just want you to understand some of the problems embedded systems have.
--
Kind regards,
Minchan Kim
--
Yeah, I've never heard of such a demand, definitely good to know it! Thanks, Fengguang --
Boot time user space readahead can do better than kernel heuristic readahead in several ways:

- it can collect better knowledge on which files/pages will be used, which leads to a high readahead hit ratio and less cache consumption
- it can submit readahead requests for many files in parallel, which enables queuing (elevator, NCQ etc.) optimizations

There is too much data in http://people.mozilla.com/~tglek/startup/systemtap_graphs/ld_bug/report.txt

550 Can't open /pub/linux/distributions/suse/pub/suse/update/10.1/rpm/src/glibc-2.4-31.12.3.src.rpm: No such file or directory

Thank you. I guess the 128kb is more than ten years old..

Cheers,
--
The first part of the file lists the sections in a file with their hex offset and size. Lines like

    0 512 offset(#1)

mean a read at position 0 of 512 bytes. Incidentally, this first read comes from vfs_read, so the log doesn't take readahead into account (unlike the other reads, which are caused by mmap page faults).

So

    15310848 131072 offset(#2)=====================
    eaa73c 1523c .bss
    eaa73c 19d1e .comment
    15142912 131072 offset(#3)=====================
    e810d4 200 .dynamic
    e812d4 470 .got
    e81744 3b50 .got.plt
    e852a0 2549c .data

shows 2 reads where the dynamic linker first seeks to the end of the file (to zero out .bss, causing IO via COW) and then backtracks to read in .dynamic.

However you are right, all of the backtracking reads are over 64K.

Thanks for explaining that. I am guessing your change to boost

Released it yesterday. Hopefully other bloated binaries will benefit from this too. http://blog.mozilla.com/tglek/2010/04/07/icegrind-valgrind-plugin-for-optimizing-cold-...

Thanks a lot Wu, I feel I understand the kernel side of what's happening now.

Taras
--
Yes, every binary/library starts with this 512b read. It is requested by ld.so/ld-linux.so, and will trigger a 4-page readahead. This is not good readahead. I wonder if ld.so can switch to mmap read for the first read, in order to trigger a larger 128kb readahead. However this

It sounds painful to produce the valgrind log, fortunately the end user won't suffer. Is it viable to turn on the "-ffunction-sections -fdata-sections" options distribution wide? If so, you may sell it to Fedora :)

Thanks,
Fengguang
--
Hi, Wu.
AFAIK, in the case of the executable itself, the kernel reads the
first sector (ELF header and so on) of the binary.
in fs/exec.c,
prepare_binprm()
{
...
return kernel_read(bprm->file, 0, bprm->buf, BINPRM_BUF_SIZE);
}
But the dynamic loader uses libc_read to read the header of each shared library.
So you may have a chance to increase the readahead size for the binary, but it is
hard to do for shared libraries. Many apps have lots of shared libraries, so a
solution that covers only the binary won't help performance much. :(
--
Kind regards,
Minchan Kim
--
Correction with data: on my system, ld.so does one 832-byte initial read for every library:
$ strace true
execve("/bin/true", ["true"], [/* 44 vars */]) = 0
brk(0) = 0x608000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3ea0000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3e9e000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=140899, ...}) = 0
mmap(NULL, 140899, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb3b3e7b000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY) = 3
==> read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\353\1\0\0\0\0\0@"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1379752, ...}) = 0
mmap(NULL, 3487784, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fb3b3931000
mprotect(0x7fb3b3a7b000, 2097152, PROT_NONE) = 0
mmap(0x7fb3b3c7b000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14a000) = 0x7fb3b3c7b000
mmap(0x7fb3b3c80000, 18472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3c80000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3e7a000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb3b3e79000
arch_prctl(ARCH_SET_FS, 0x7fb3b3e796f0) = 0
mprotect(0x7fb3b3c7b000, 16384, PROT_READ) = 0
mprotect(0x7fb3b3ea1000, 4096, PROT_READ) = 0
munmap(0x7fb3b3e7b000, 140899) = ...
--
I have an older patch to create dynamic bitmaps based on the last run and only prefetch those pages. It wasn't entirely a win for everything and didn't work for shared libraries, but with some additional tuning the approach still has potential I think, by combining memory saving with prefetching.

ftp://firstfloor.org/pub/ak/pbitmap/INTRO
http://halobates.de/dp2.pdf

For your use case the algorithm would likely need some glibc support.

-Andi
-- ak@linux.intel.com -- Speaking for myself only. --
On Mon, 05 Apr 2010 15:43:02 -0700

Yes, the linker scrambles the executable's block ordering. This just isn't an interesting case. World-wide, the number of people who compile their own web browser and execute it from the file which ld produced is, umm, seven.

So I'd suggest that you always copy the executable to a temp file and mv it back before running any timing tests.
--
Gentoo users? Linux From Scratch? There are many more than 7 of us. Unless you are talking about the build environments always running some tool after ld which I am not aware of. -- Zan Lynx zlynx@acm.org "Knowledge is Power. Power Corrupts. Study Hard. Be Evil." --
OK, eight then. But I still don't think it's the case we should optimise for. Not if it impacts the common case even the slightest. It'd be far far better to change those distros to perform the very cheap, once-off step of straightening out their executables (including shared libraries). --
"make install" tends to copy. I am not aware of any Makefiles that link directly to /usr/bin, and usually that wouldn't work anyways because of permissions. copy fixes the problem. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
... and those people who are executing the binary out of the build directory are probably running the regression test (i.e., "make; make check") and on most developer machines, if they're lucky they have enough memory that the executable will still be in their page cache. :-) This being said, on modern file systems (e.g., btrfs, ext4, xfs, et al.), delayed allocation should mostly hide this problem; and if not, and the linker can estimate in advance how big the resulting binary will be, it could be modified to use the fallocate(2) system call. -- Ted --
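The fallocate(2) suggestion above could look roughly like the sketch below: create the output file and reserve its full size before writing anything, so the file system has a chance to allocate contiguous blocks. This is not taken from any linker; the helper name create_preallocated is made up, and the size estimate is assumed to come from the caller.

```c
#define _GNU_SOURCE             /* for fallocate(2) */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` and reserve `size` bytes for it up
 * front. Returns an open file descriptor, or -1 on error. */
int create_preallocated(const char *path, off_t size)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0755);

    if (fd < 0)
        return -1;
    /* Try the Linux-specific call first; fall back to the portable one,
     * which may emulate the reservation by writing to the file. */
    if (fallocate(fd, 0, 0, size) != 0 &&
        posix_fallocate(fd, 0, size) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}
```

The caller would then write the output into the returned descriptor and, if the estimate was too large, ftruncate() the file to its final size.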
I'm sorry that you don't find this interesting. I did not suggest that people compile their own browser to get a perfect layout. This is something that Mozilla can do when preparing builds and it's also something distributions can do. It just so happens that large parts of startup will be very similar for every single firefox install, might as

You mean to get it into a cache or to hope to avoid fragmentation? If you are suggesting this to avoid measuring the startup overhead of paging the binary in, I strongly disagree. It is the slowest part of firefox startup and needs to be addressed.

Taras
--
It's not a case we should optimise for.

It's perfectly reasonable for the kernel to assume that the executable is reasonably well-laid-out on disk. And if it _isn't_ well-laid-out then that should be fixed in userspace, because for simple locality-of-reference reasons, that's always going to produce the fastest result. Plus it's the common case as well - the executable was copied from DVD or over the network or whatever.

Plus it's so utterly trivial for people who compile-their-own to straighten the file out - just run cp! These people have gone and screwed up their file layout - they should fix that, rather than trying

No, nothing like that at all.

What I'm saying is that you shouldn't be testing or attempting to optimise for files which were laid out by ld. Because those files are an utter mess - the block ordering is simply all over the place. And the great majority of people aren't using executables which were laid out on disk by ld!

Instead, straighten out the block layout with `cp', then go and do the testing and the optimisation. Because if you're not taking this first step then you're just not serious about performance at all!

Here's a small executable, as laid out by ld:

File offset     disk blocks
0-0:            18383385-18383385 (1)
1-1:            18383389-18383389 (1)
2-3:            18383392-18383393 (2)
4-4:            18383400-18383400 (1)
5-7:            18383430-18383432 (3)
8-11:           18383450-18383453 (4)
12-12:          18383423-18383423 (1)
13-14:          18383447-18383448 (2)
15-16:          18383474-18383475 (2)
17-17:          18383390-18383390 (1)
18-18:          18383398-18383398 (1)
19-20:          18383418-18383419 (2)
21-21:          18383421-18383421 (1)
22-22:          18383397-18383397 (1)
23-23:          18383399-18383399 (1)
24-24:          18383407-18383407 (1)
25-25:          18383391-18383391 (1)
26-26:          18383396-18383396 (1)
27-28:          18383394-18383395 (2)
29-34:          18383401-18383406 (6)
35-38:          18383425-18383428 (4)
39-39:          18383433-18383433 (1)
40-40:          18383463-18383463 (1)
41-44:          18383490-18383493 (4)
45-45:          ...
Yeah ok. We are talking about different things. I meant the linker lays out the program badly, ie within the executable itself. Turns out that naively concatenating various compilation units makes for binaries that load slowly due to excessive seeking within the file. I wasn't talking about filesystem fragmentation. I agree that filesystem bustage caused by executing the linker isn't interesting. Taras --
My understanding was that this is usually gone when you use a delayed allocation fs (xfs, ext4), unless your link sequence takes much longer than the flush window. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
