A conversation on the LKML [1] discussed the notion of recording disk I/O during the boot process of a server the first time it boots, then using this information to precache data the system is reasonably sure to require during subsequent boots. In response to the idea, 2.6 kernel maintainer Andrew Morton [interview [2]] commented, "I wrote a similar thing in September of 2001." He went on to explain how his method worked and included a link [3] to the old code, but pointed out:
"So it's all an attempt to optimise the boot-time I/O patterns. It was pretty much a waste of time, gaining only 10% or so, from memory. You could get just as much or more speedup from simply launching all the initscripts in parallel, although this did tend to break stuff."
While Andrew's solution involved a kernel module, the conversation went on to discuss how a similar effect might be achieved from user space. Existing implementations in other operating systems were also discussed, with links to relevant documentation.
From: Felix von Leitner [email blocked]
To: linux-kernel
Subject: Request: I/O request recording
Date: Sat, 24 Jan 2004 19:10:27 +0100

I would like to have a user space program that I could run while I cold start KDE. The program would then record which I/O pages were read in which order. The output of that program could then be used to pre-cache all those pages, but in an order that reduces disk head movement. Demand Loading unfortunately produces lots of random page I/O scattered all over the disk.

Having a way to know which pages are accessed in which order at a typical cold start would be very benefitial, not only for the purpose described above but it could also be used as input for a linker code reordering optimization.

What do you think?

Felix

From: Arjan van de Ven [email blocked]
To: Felix von Leitner [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 19:26:17 +0100

On Sat, 2004-01-24 at 19:10, Felix von Leitner wrote:
> I would like to have a user space program that I could run while I cold
> start KDE. The program would then record which I/O pages were read in
> which order. The output of that program could then be used to pre-cache
> all those pages, but in an order that reduces disk head movement.
> Demand Loading unfortunately produces lots of random page I/O scattered
> all over the disk.

I recently did something like this (and it scared me, it seems a typical Fedora boot into gnome opens like 11.000 files ;) but via a printk in the kernel....

I experimented with readahead'ing all that stuff while the initscripts ran in the hope it would save time... but it doesn't somehow.

Some other things kinda help; if you feel adventurous you could play with the kernel-utils RPM in rawhide which does a readahead of the files the desktop opens while GDM login window is displayed; if the user isn't typing his name really fast that decreases the desktop startup time...

From: Ville Herva [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 21:25:45 +0200

On Sat, Jan 24, 2004 at 07:26:17PM +0100, you [Arjan van de Ven] wrote:
>
> I recently did something like this (and it scared me, it seems a typical
> Fedora boot into gnome opens like 11.000 files ;) but via a printk in
> the kernel....
>
> I experimented with readahead'ing all that stuff while the initscripts
> ran in the hope it would save time... but it doesn't somehow.

Did you sort the sectors to be read, or just read the files into page cache in randomish order ?

Or do you mean that even after all the files were read into cache, the X startup time didn't get any better (not counting the cache priming)?

-- v --

From: Arjan van de Ven [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 23:43:49 +0100

On Sat, Jan 24, 2004 at 09:25:45PM +0200, Ville Herva wrote:
>
> Did you sort the sectors to be read, or just read the files into page cache
> in randomish order ?

semi random order but mostly submitted in parallel so the kernel has lots of freedom to reorder

> Or do you mean that even after all the files were read into cache, the X
> startup time didn't get any better (not counting the cache priming)?

I mean that the time it takes to prime is just about exactly the time you then win... eg net gain of about zero

From: Andrew Morton [4] [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 15:35:51 -0800

Felix von Leitner [email blocked] wrote:
>
> I would like to have a user space program that I could run while I cold
> start KDE. The program would then record which I/O pages were read in
> which order. The output of that program could then be used to pre-cache
> all those pages, but in an order that reduces disk head movement.
> Demand Loading unfortunately produces lots of random page I/O scattered
> all over the disk.

I wrote a similar thing in September of 2001. What you do is:

- Reboot the system, wait until everything is steady-state (eg: X has started, applications are loaded).

- Load a kernel module which dumps the current contents of the pagecache (filename/offset-into-file) into a file.

  (The kernel module writes to modprobe's stdout, so you just do

      modprobe fboot-dump > /tmp/fboot-dump.out

  I'm very proud of this.)

- Post-process the resulting output into a database which is used on the next reboot.

- reboot

- This time a userspace application cuts in real early and reads the database and preloads all the pagecache using "optimal" I/O patterns so that everything which you will need in the subsequent boot is already in memory.

So it's all an attempt to optimise the boot-time I/O patterns. It was pretty much a waste of time, gaining only 10% or so, from memory. You could get just as much or more speedup from simply launching all the initscripts in parallel, although this did tend to break stuff.

Anyway, the code's ancient but might provide some ideas:

http://www.zip.com.au/~akpm/linux/fboot.tar.gz [5]

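As an illustration of the final step in Andrew's list, here is a minimal Python sketch of a userspace preloader. It is not the fboot code. It assumes the post-processed database has already been loaded as (path, page index) pairs, it assumes a 4096-byte page, and it approximates "optimal" I/O only by coalescing adjacent pages of each file into a single request; ordering by on-disk position is left out. The names `runs` and `preload` are hypothetical.

```python
import os

PAGE = 4096  # ASSUMPTION: page size; the real value is architecture dependent


def runs(pages):
    """Coalesce a sorted list of page indexes into (start, count) runs."""
    out = []
    for p in pages:
        if out and out[-1][0] + out[-1][1] == p:
            out[-1] = (out[-1][0], out[-1][1] + 1)
        else:
            out.append((p, 1))
    return out


def preload(entries):
    """Ask the kernel to read the given pages into the pagecache.

    `entries` is an iterable of (path, page_index) pairs, a hypothetical
    stand-in for the database built from the pagecache dump.
    Returns the number of pages for which readahead was requested.
    """
    by_file = {}
    for path, page in entries:
        by_file.setdefault(path, set()).add(page)

    requested = 0
    for path in sorted(by_file):
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # the file may have vanished since the recording boot
        try:
            for start, count in runs(sorted(by_file[path])):
                # POSIX_FADV_WILLNEED queues readahead without blocking on it.
                os.posix_fadvise(fd, start * PAGE, count * PAGE,
                                 os.POSIX_FADV_WILLNEED)
                requested += count
        finally:
            os.close(fd)
    return requested
```

Because each file is opened by pathname, the pages land in that file's own pagecache, which is the property the later part of the thread turns out to depend on.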
From: Bart Samwel [email blocked]
Subject: Re: Request: I/O request recording
Date: Sun, 25 Jan 2004 23:59:02 +0100

Andrew Morton wrote:
> Felix von Leitner [email blocked] wrote:
>
>>I would like to have a user space program that I could run while I cold
>>start KDE. The program would then record which I/O pages were read in
>>which order. The output of that program could then be used to pre-cache
>>all those pages, but in an order that reduces disk head movement.
>>Demand Loading unfortunately produces lots of random page I/O scattered
>>all over the disk.
>
> I wrote a similar thing in September of 2001. What you do is:
[...]

When I saw this thread I've fiddled for a bit with the block_dump functionality that's in the laptop_mode patch. I wanted to see if it could support a similar thing completely from user space (except for the block_dump code, of course). I've written a small tool to generate a complete file that lists tuples (sector, size, device) from the kernel output in syslog; it parses all "READ block xxx" messages since the last reboot. Putting this through sort -n -u delivers a nicely sorted file, ready for optimized reading.

Unfortunately I'm now stuck within the other part, which is reading the pages back in memory at the next boot. It's not working, and I was hoping someone here could take a look and tell me what I'm doing wrong.

Here's what I've tried so far. I've written a program that simply reads the ranges by opening the device and reading from sector*512 to sector*512+size. It uses async io for efficiency, and to allow the kernel to merge read requests. It seems to read all the data, but after that the other programs seem to read most of it *again*! I only go from 8500 down to 7000 reads or so, while most of the 7000 reads that remain are also in the range that is being prefetched. :(

I was wondering if the pages could have been removed so soon, so, to make sure, I mmaped the whole shebang with MAP_LOCKED and PROT_READ, and kept the mapping process in memory during the whole boot process. This had exactly the same effect.

So, I thought that I might be reading the wrong blocks. However, when I feed it something like (160000, 4096, hdb1) I get a block_dump log that says exactly that (plus some extra, because mmap seems to read in a bit more than needed). So, that's not it.

I'm out of clues. If someone would be so kind to take a look at what I'm doing wrong, I'd very much appreciate it. I've put the code up at http://www.xs4all.nl/~bsamwel/block_read_replay.tar.gz [6].

How to use it:

1. Patch your kernel with the patch that's included in the tarball. This patch modifies the block_dump output slightly, and enables a block_dump value of 2 which only reports READ actions. It's against 2.6.1-mm2, but it should apply fine to any kernel that has laptop_mode in it.

2. Record the bootup info. Somewhere at the very beginning, include "echo 2 > /proc/sys/vm/block_dump" in an init script. Reboot, and after the bootup sequence is complete, do echo 0 > /proc/sys/vm/block_dump.

3. "make" and put brexec (one of the two versions) somewhere your init scripts can access it.

4. Run slbrp (SysLog Block Read Parser) to generate a block list file: slbrp /var/log/syslog | sort -n -u > /etc/bootup_blocks.

5. Precede the echo 2 > /proc/sys/vm/block_dump at startup with a brexec ("block read executor") call, e.g. "brexec /etc/bootup_blocks". The mmap version takes an extra parameter <N> = the number of seconds to keep the pages mapped and must be put in the background because it will simply wait for N seconds before exiting. So, it should be something like "brexec /etc/bootup_blocks 60" and then "sleep 30" to give it time to read everything before bootup continues. Yes, it's not pretty. It's just used for experimenting, so it doesn't matter.

6. Reboot, and disable block_dump after booting, like in step (2). Now the logging of reads only starts _after_ brexec has attempted to load all pages, and this gives info on what is still loaded. You'll probably see that it loads many things that are also listed in the bootup_blocks file.

Now my question is: what am I doing wrong that it needs to read those again?

-- Bart

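The log-parsing half of Bart's experiment (step 4 above) can be sketched in a few lines of Python. This is not slbrp itself. The message layout matched by `READ_RE` is an assumption made up for this example, since the thread only says the messages contain "READ block xxx" and that the patch alters the stock block_dump output; the regular expression would need adjusting to the real format. The function name `parse_reads` is likewise hypothetical.

```python
import re

# ASSUMPTION: hypothetical message layout. The patched kernel's real
# block_dump format is not shown in the thread and may differ.
READ_RE = re.compile(r"READ block (\d+) size (\d+) on (\S+)")


def parse_reads(lines):
    """Return sorted, de-duplicated (sector, size, device) tuples.

    Roughly what piping the parser's output through `sort -n -u` achieves:
    each distinct read appears once, ordered by ascending sector number.
    """
    seen = set()
    for line in lines:
        m = READ_RE.search(line)
        if m:
            seen.add((int(m.group(1)), int(m.group(2)), m.group(3)))
    return sorted(seen)
```

For example, feeding it three log lines where one read is repeated yields two tuples in sector order:

```python
log = [
    "kernel: init(1): READ block 160000 size 4096 on hdb1",
    "kernel: bash(42): READ block 512 size 4096 on hdb1",
    "kernel: bash(42): READ block 160000 size 4096 on hdb1",
]
print(parse_reads(log))
```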
From: Andrew Morton [7] [email blocked]
Subject: Re: Request: I/O request recording
Date: Sun, 25 Jan 2004 15:09:14 -0800

Bart Samwel [email blocked] wrote:
>
> When I saw this thread I've fiddled for a bit with the block_dump
> functionality that's in the laptop_mode patch. I wanted to see if it
> could support a similar thing completely from user space (except for the
> block_dump code, of course). I've written a small tool to generate a
> complete file that lists tuples (sector, size, device) from the kernel
> output in syslog; it parses all "READ block xxx" messages since the
> last reboot. Putting this through sort -n -u delivers a nicely sorted
> file, ready for optimized reading.
>
> Unfortunately I'm now stuck within the other part, which is reading the
> pages back in memory at the next boot. It's not working, and I was
> hoping someone here could take a look and tell me what I'm doing wrong.

Linux caches disk data on a per-file basis. So if you preload pagecache via the /dev/hda1 "file", that is of no benefit to the /etc/passwd file. Each one has its own unique pagecache. When reading pages for /etc/passwd we don't go looking for the same disk blocks in the cache of /dev/hda1.

Which is why the userspace cache preloading needs to know the pathnames of all the relevant files - it needs to open and read each one, applying knowledge of disk layout while doing it.

From: Bart Samwel [email blocked]
Subject: Re: Request: I/O request recording
Date: Mon, 26 Jan 2004 00:29:49 +0100

Andrew Morton wrote:
>
> Linux caches disk data on a per-file basis. So if you preload pagecache
> via the /dev/hda1 "file", that is of no benefit to the /etc/passwd file.
> Each one has its own unique pagecache. When reading pages for /etc/passwd
> we don't go looking for the same disk blocks in the cache of /dev/hda1.
>
> Which is why the userspace cache preloading needs to know the pathnames of
> all the relevant files - it needs to open and read each one, applying
> knowledge of disk layout while doing it.

Hmmm, that explains why this didn't work. :( So if I wanted to do this completely from user space using only block_dump data I'd probably have to go through all files and find out if they had any blocks in common with my preload set -- presuming there is a way to find that out, which there probably isn't. That makes this idea pretty much useless, I'm sorry to have bothered you with it.

-- Bart

From: Andrew Morton [8] [email blocked]
Subject: Re: Request: I/O request recording
Date: Sun, 25 Jan 2004 15:38:03 -0800

Bart Samwel [email blocked] wrote:
>
> Hmmm, that explains why this didn't work. :( So if I wanted to do this
> completely from user space using only block_dump data I'd probably have
> to go through all files and find out if they had any blocks in common
> with my preload set -- presuming there is a way to find that out, which
> there probably isn't. That makes this idea pretty much useless, I'm
> sorry to have bothered you with it.
>

You could certainly do that. Given disk block #N you need to search all files on the disk asking "who owns this block". The FIBMAP ioctl can be used on most filesystems (ext2, ext3, others..) to find out which blocks a file is using. See bmap.c in http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz [9]

Unfortunately you cannot determine a directory's blocks in this way. Ext3's directories live in the /dev/hda1 pagecache anyway. ext2's directories each have their own pagecache.

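The "who owns this block" search Andrew describes could look roughly like the Python sketch below. It is not bmap.c. The ioctl numbers are the FIBMAP and FIGETBSZ values from <linux/fs.h>; FIBMAP normally requires root. The helper names `file_blocks` and `owners` are hypothetical, and the sketch assumes the wanted block numbers are already expressed in filesystem blocks, whereas a block_dump log may use different units and partition-relative offsets that would need converting first.

```python
import fcntl
import os
import struct

FIBMAP = 1    # _IO(0x00, 1) in <linux/fs.h>
FIGETBSZ = 2  # _IO(0x00, 2) in <linux/fs.h>


def file_blocks(path):
    """Return the filesystem block numbers backing `path` (usually needs root)."""
    with open(path, "rb") as f:
        fd = f.fileno()
        bsz = struct.unpack("i", fcntl.ioctl(fd, FIGETBSZ, struct.pack("i", 0)))[0]
        nblocks = (os.fstat(fd).st_size + bsz - 1) // bsz
        blocks = []
        for logical in range(nblocks):
            # FIBMAP takes a logical block index and returns the physical one.
            raw = fcntl.ioctl(fd, FIBMAP, struct.pack("i", logical))
            physical = struct.unpack("i", raw)[0]
            if physical:  # 0 means a hole in a sparse file
                blocks.append(physical)
        return blocks


def owners(paths, wanted, mapper=file_blocks):
    """Map each wanted block number to the path of the file that owns it.

    `mapper` is injectable so the lookup logic can be exercised without
    root privileges; it defaults to the FIBMAP-based `file_blocks`.
    """
    wanted = set(wanted)
    index = {}
    for path in paths:
        try:
            blocks = mapper(path)
        except OSError:
            continue  # unreadable file, or filesystem without FIBMAP support
        for block in blocks:
            if block in wanted:
                index[block] = path
    return index
```

As the message notes, directory blocks cannot be resolved this way, so any blocks left unmatched after the search would have to be handled separately or ignored.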
From: Diego Calleja García [email blocked]
Subject: Re: Request: I/O request recording
Date: Mon, 26 Jan 2004 01:23:56 +0100

El Sun, 25 Jan 2004 15:38:03 -0800 Andrew Morton [email blocked] escribió:
> Unfortunately you cannot determine a directory's blocks in this way.
> Ext3's directories live in the /dev/hda1 pagecache anyway. ext2's
> directories each have their own pagecache.

It would be possible to "hijack" the syscalls at libc level and look at what the program is doing?

From: Andrew Morton [10] [email blocked]
Subject: Re: Request: I/O request recording
Date: Sun, 25 Jan 2004 16:32:05 -0800

Diego Calleja García [email blocked] wrote:
>
> It would be possible to "hijack" the syscalls at libc level and look at what
> the program is doing?

That would work. It misses out on pagefaults, which are kind of syscalls in disguise. So for any files which were mmapped you'd have to either assume that all of the file's pages are required, or use mincore() to poke around and find out which pages were really faulted in.

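To make the mincore() suggestion concrete, the following Python sketch maps a file and asks the kernel which of its pages are currently resident. It is an illustration under several assumptions: it calls the C library through ctypes (Python has no mincore wrapper in its standard library), it assumes a Linux system where off_t fits in a C long, and the function name `resident_pages` is hypothetical. Note that it reports residency in the pagecache at the moment of the call, which is only an approximation of "pages this particular program faulted in".

```python
import ctypes
import mmap
import os


def resident_pages(path):
    """Return one bool per page of `path`: True if the page is resident."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    # ASSUMPTION: off_t is passed as a C long (true on 64-bit Linux).
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                             ctypes.POINTER(ctypes.c_ubyte)]

    size = os.path.getsize(path)
    if size == 0:
        return []
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE

    fd = os.open(path, os.O_RDONLY)
    try:
        addr = libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
        if addr is None or addr == ctypes.c_void_p(-1).value:  # MAP_FAILED
            raise OSError(ctypes.get_errno(), "mmap failed")
        try:
            vec = (ctypes.c_ubyte * npages)()
            if libc.mincore(addr, size, vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            # Only the least significant bit of each byte is meaningful.
            return [bool(b & 1) for b in vec]
        finally:
            libc.munmap(addr, size)
    finally:
        os.close(fd)
```

Mapping the file does not itself fault any pages in, so the call observes the cache without disturbing it.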
From: Bart Samwel [email blocked]
Subject: Re: Request: I/O request recording
Date: Mon, 26 Jan 2004 12:50:24 +0100

Andrew Morton wrote:
> Unfortunately you cannot determine a directory's blocks in this way.
> Ext3's directories live in the /dev/hda1 pagecache anyway. ext2's
> directories each have their own pagecache.

I found out two things while trying to do this:

1. Many filesystems in linux set f_fsid to zero for statfs. I was trying to use this to skip over mount points, but that doesn't work. Had to use the st_dev field from stat instead. :(

2. Swapfiles apparently don't like to be touched. I did an ioctl(FIGETBSZ) on a swapfile, and it would simply block until I did a swapoff on the file. I didn't even get to the FIBMAP part. :( Is this correct behaviour? And is there any way to detect this so that I can work around it?

-- Bart

From: Andrew Morton [11] [email blocked]
Subject: Re: Request: I/O request recording
Date: Mon, 26 Jan 2004 03:57:58 -0800

Bart Samwel [email blocked] wrote:
>
> 2. Swapfiles apparently don't like to be touched. I did an
> ioctl(FIGETBSZ) on a swapfile, and it would simply block until I did a
> swapoff on the file. I didn't even get to the FIBMAP part. :( Is this
> correct behaviour?

yup.

> And is there any way to detect this so that I can work around it?

swapoff -a beforehand, I guess.

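The st_dev workaround from Bart's first point can be sketched as a directory walk that refuses to cross into other filesystems. This is an illustrative Python version, not Bart's code, and the function name `files_on_same_device` is hypothetical. It does nothing about swapfiles; per the reply above, those would still need a swapoff before being probed.

```python
import os
import stat


def files_on_same_device(root):
    """Yield regular files under `root` without crossing mount points.

    A subdirectory whose st_dev differs from the root's st_dev lives on
    another filesystem, so it is pruned from the walk.
    """
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to dirnames[:] tells os.walk which subdirectories to visit.
        dirnames[:] = [d for d in dirnames
                       if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev]
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_dev == root_dev:
                yield path
```

The resulting paths are the candidates one would then hand to a FIBMAP-based block lookup.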
From: Davide Libenzi [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 15:53:44 -0800 (PST)

On Sat, 24 Jan 2004, Andrew Morton wrote:
>
> So it's all an attempt to optimise the boot-time I/O patterns. It was
> pretty much a waste of time, gaining only 10% or so, from memory. You
> could get just as much or more speedup from simply launching all the
> initscripts in parallel, although this did tend to break stuff.
>
> Anyway, the code's ancient but might provide some ideas:
>
> http://www.zip.com.au/~akpm/linux/fboot.tar.gz [12]

Warning. I don't know if they do have a patent for this, but MS does this starting from XP (look inside %WINDIR%\PreFetch). It is both boot and app based.

- Davide

From: Andrew Morton [13] [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 16:03:42 -0800

Did they do it in August 2001?

From: Davide Libenzi [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 16:09:12 -0800 (PST)

On Sat, 24 Jan 2004, Andrew Morton wrote:
>
> Did they do it in August 2001?

Ouch, I don't know. I know for sure that it came with XP, but I'm not really into MS things ;) This is one of the links that talk about that:

http://msdn.microsoft.com/msdnmag/issues/01/12/XPKernel/default.aspx [14]

- Davide

From: Valdis Kletnieks [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 19:04:47 -0500

On Sat, 24 Jan 2004 15:53:44 PST, Davide Libenzi said:
>
> Warning. I don't know if they do have a patent for this, but MS does this
> starting from XP (look inside %WINDIR%\PreFetch). It is both boot and app
> based.

Hmm.. prior art time. ;)

IBM's OS/VS1 and MVS operating systems had the 'link pack area', where frequently loaded modules were loaded at system startup. And there were numerous 3rd party optimizers that would analyze the LOAD SVC patterns on your system and produce a list of which modules should be pre-loaded in order to get the most bang for the buck (even a *large* 370/168 or 303x processor might be able to spare a megabyte tops, so optimizing it was important, and sites would spend $5K on software that would optimize the memory usage and save them a memory upgrade at $40K a meg...

This was mid-70s, so definitely pre-XP.

From: Davide Libenzi [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 16:10:56 -0800 (PST)

On Sat, 24 Jan 2004 Valdis.Kletnieks@vt.edu [15] wrote:
>
> This was mid-70s, so definitely pre-XP.

They (MS) do work of a page fault basis though. It is quite different.

- Davide

From: Diego Calleja [email blocked]
Subject: Re: Request: I/O request recording
Date: Sat, 24 Jan 2004 21:11:56 +0100

El Sat, 24 Jan 2004 19:10:27 +0100 Felix von Leitner [email blocked] escribió:
> Having a way to know which pages are accessed in which order at a
> typical cold start would be very benefitial, not only for the purpose
> described above but it could also be used as input for a linker code
> reordering optimization.
>
> What do you think?

That's exactly what XP does (and Mac OS X, for that matter). And it really works (ie: you can notice it)

XP records what the OS does in the first 2 minutes (or so). The next time it boots, it tries to load the files that he knows that are going to be used. The same for an app that is frecuently used: it records what the app does, and it optimizes the startup of that app.

Take a look at: (search prefetch)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/appendix/hh/appendix/enhancements5_0qhx.asp [16]
http://msdn.microsoft.com/msdnmag/issues/01/12/xpkernel/default.aspx [17]

Andrew Morton wrote a patch some time ago for 2.5.64-mm6 which achieves a similar effect, I think:

" To test the nonlinear mapping code more thoroughly I have arranged for all executable file-backed mmaps to be treated as nonlinear. This means that when an executable is first mapped in, the kernel will slurp the whole thing off disk in one hit. Some IO changes were made to speed this up.

This means that large cache-cold executables start significantly faster. Launching X11+KDE+mozilla goes from 23 seconds to 16. Starting OpenOffice seems to be 2x to 3x faster, and starting Konqueror maybe 3x faster too. Interesting."

(see: http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/1296.html [18])

The patches are still available. IIRC, they were dropped "because it should be done in userspace".

It'd very interesting to write userspace program that does what XP does (it looks like a good idea for desktops)

http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/1296.html [19]

Related Links:
- Archive of above thread [20]
- KernelTrap interview with Andrew Morton [21]