Robert Love's [interview] O_STREAMING patch (a patch to optimize read-once, streaming I/O) has made it into Andrew Morton's [interview] -mm tree.
O_STREAMING is a flag you can pass to open(). It explicitly tells the kernel not to keep the file's data cached: pages in the pagecache that lie before the current index are dropped. O_STREAMING is suited to reading large files sequentially when you know you will read them only once.
Other recent additions to -mm are kgdb, shpte [story] and John Levon's [interview] latest 2.5 oprofile.
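To make the drop-behind idea concrete, here is a minimal userspace sketch. It does not use O_STREAMING at all. It swaps in the standard posix_fadvise() call with POSIX_FADV_DONTNEED to ask the kernel to discard cached pages behind the read position, which only approximates what the patch does inside the kernel. The function name stream_read_drop_behind and the 64KB buffer size are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for posix_fadvise() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read `path` sequentially in `chunk`-byte pieces. After each piece, advise
 * the kernel that everything before the current offset is no longer needed.
 * This is a userspace approximation of drop-behind, NOT the O_STREAMING
 * patch itself. Returns the number of bytes read, or -1 on error.
 */
long long stream_read_drop_behind(const char *path, size_t chunk)
{
    static char buf[1 << 16];          /* hypothetical 64KB bounce buffer */
    assert(chunk > 0 && chunk <= sizeof buf);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0) {
        total += n;
        /* Advice only: drop cached pages in [0, total), behind our position. */
        (void)posix_fadvise(fd, 0, (off_t)total, POSIX_FADV_DONTNEED);
    }
    close(fd);
    return n < 0 ? -1 : total;
}
```

A caller would invoke something like stream_read_drop_behind("/path/to/big.file", 65536). The advice is a hint, so the sketch ignores its return value, and the read itself behaves like any ordinary read().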
From: Robert Love
Subject: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-07 19:50:04 PST
Attached patch implements an O_STREAMING file I/O flag which enables
manual drop-behind of pages.
If the file has O_STREAMING set then the user has explicitly said "this
is streaming data, I know I will not revisit this, do not cache
anything". So we drop pages from the pagecache before our current
index. We have to fiddle a bit to get writes working since we do
write-behind but the logic is there and it works.
Some numbers. A simple streaming read to verify the pagecache effects:

	Streaming 1GB Read (avg of many runs, mem=2GB):

	O_STREAMING    Wall time    Change in Page Cache
	Yes            25.58s       0
	No             25.55s       +835MB

Another read with much more VM pressure:

	Streaming 1GB Read (avg of many runs, mem=8M):

	O_STREAMING    Wall time    Change in Page Cache
	Yes            25.76s       0
	No             29.01s       +1MB

And now the kicker:

	Kernel compile (make -j2) and concurrent streaming I/O
	(avg of two runs, mem=128M):

	O_STREAMING    Time to complete Kernel Compile
	Yes            3m27.863s
	No             4m15.818s

This is c/o Andrew Morton.
Patch is against 2.4.20-pre9. Why not 2.5? Because Andrew says we can
do better, perhaps with a real drop-behind heuristic. As 20 Oct looms
quite close, we shall see.
	Robert Love

[patch here]
From: J.A. Magallon
Subject: Re: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-08 03:50:06 PST

Hi,

On 2002.10.08 Robert Love wrote:
>Attached patch implements an O_STREAMING file I/O flag which enables
>manual drop-behind of pages.
>
>If the file has O_STREAMING set then the user has explicitly said "this
>is streaming data, I know I will not revisit this, do not cache
>anything". So we drop pages from the pagecache before our current
>index. We have to fiddle a bit to get writes working since we do
>write-behind but the logic is there and it works.
>

Sorry if this is a newbie question, but, does glibc pass flags blindly
to the syscall ?? Ie, I do not need to rebuild glibc to use this in
open(), fcntl() and so on, just I can make sure that bit 04000000
is set in the flags.

TIA
From: Robert Love
Subject: Re: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-08 11:30:10 PST

On Tue, 2002-10-08 at 06:42, J.A. Magallon wrote:
> Sorry if this is a newbie question, but, does glibc pass flags blindly
> to the syscall ?? Ie, I do not need to rebuild glibc to use this in
> open(), fcntl() and so on, just I can make sure that bit 04000000
> is set in the flags.

Right. Do something like:

	#define O_STREAMING 04000000
	fd = open(file, ... | O_STREAMING);

or open it via fopen() and use fcntl() to set O_STREAMING.

	Robert Love
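The fopen()-plus-fcntl() route mentioned in the reply above might look like the following sketch. The value 04000000 comes from the thread and is only meaningful on a kernel carrying the patch. On any other kernel the bit may be ignored or carry a different meaning, so treat this as an illustration of the calling pattern only. The helper name set_streaming is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fileno() */
#endif
#include <fcntl.h>
#include <stdio.h>

/* Value quoted in the thread; an assumption on any kernel without the patch. */
#ifndef O_STREAMING
#define O_STREAMING 04000000
#endif

/*
 * Set O_STREAMING on an already-open stdio stream by reading the current
 * file status flags and writing them back with the extra bit OR-ed in.
 * Returns 0 on success, -1 on error.
 */
int set_streaming(FILE *fp)
{
    int fd = fileno(fp);
    if (fd < 0)
        return -1;

    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;

    return fcntl(fd, F_SETFL, flags | O_STREAMING) < 0 ? -1 : 0;
}
```

Typical use would be to call set_streaming(fp) right after a successful fopen() and then carry on with ordinary fread() or fwrite() calls, which matches the point made in the thread that the rest of the I/O code stays unchanged.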
From: Robert Love
Subject: Re: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-08 12:00:09 PST

On Tue, 2002-10-08 at 14:38, Chris Wedgwood wrote:
> > Attached patch implements an O_STREAMING file I/O flag which enables
> > manual drop-behind of pages.

I answered this in a previous email to this list:

In a lot of ways.

This flag changes no semantics except to not let pages from the mapping
populate the page cache for very long. In other words, this flag pretty
much disables the pagecache for this mapping, although we happily keep
it around for write-behind and read-ahead. But once the data is behind
us and safe to kill, we do. It is manual drop-behind.

O_DIRECT has a lot of semantics, one of which is to attempt to minimize
cache effects. It is also synchronous, requires properly aligned
buffers, and pretty much minimizes interaction with as much of the
kernel as possible. I am not overly familiar with its uses, but I always
assumed the big user is applications that implement their own caching
layer.

O_STREAMING would be for your TiVo or network audio streamer. Any file
I/O that is inherently sequential and access-once. No point trashing the
pagecache with its data - but otherwise the behavior is normal.

Basically, with O_STREAMING you want normal semantics except drop-behind
of the pages. You even still want the pagecache caching your data - just
the not-yet-written write-behind data and the not-yet-read read-ahead
data.

With O_DIRECT you get a whole different can-of-worms. Basically you cut
out a lot of the kernel.

You can do normal libc file I/O on an O_STREAMING file with no semantic
changes; except the drop-behind of the pages.

	Robert Love
From: Robert Love
Subject: Re: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-08 13:40:05 PST

On Tue, 2002-10-08 at 15:05, Chris Wedgwood wrote:
> > In other words, this flag pretty much disables the pagecache for
> > this mapping, although we happily keep it around for write-behind
> > and read-ahead. But once the data is behind us and safe to kill, we
> > do. It is manual drop-behind.
>
> OK. What might use this though? What applications might want to
> disable the page-cache but still use write-behind?

Streaming I/O wants read-ahead. Filesystems themselves implement the
write-behind and we do not want to circumvent so much of the kernel.

The point of O_STREAMING is one change: drop pages in the pagecache
behind our current position, that are free-able, because we know we
will never want them. It's a hint from the application saying "I will
never revisit this so dump it".

O_DIRECT is a much bigger can of worms. You lose a lot of what the
kernel provides. You have to do things in block-sized chunks. Etc. etc.

> > O_DIRECT has a lot of semantics, one of which is to attempt to
> > minimize cache effects.
>
> It depends on the OS. Some OS are broken and treat O_DIRECT as a
> hint, Linux and IRIX know it's a *requirement*.

Yep. Linux treats most "hints" (e.g. madvise) as a requirement - it
fails if it cannot do it. That is against the spec most of the time,
but oh well...

> > O_STREAMING would be for your TiVo or network audio streamer. Any
> > file I/O that is inherently sequential and access-once. No point
> > trashing the pagecache with its data - but otherwise the behavior is
> > normal.
>
> Actually, this sounds perfect for O_DIRECT. But I don't know much
> about streaming video.
>
> Since you only want the data once, why use the page-cache at all and
> needlessly copy? Certainly, the requirements for O_DIRECT are not
> that hard to meet or implement.
>
> Don't get me wrong, I'm not saying this is a bad thing at all. The
> patch is small and elegant so it's hard to object; I'm just trying to
> understand where in practice I would use this over O_DIRECT.

Shrug. I do not have much experience with O_DIRECT. I suspect the
synchronous nature and the requirement of aligned buffers is not ideal.

With O_STREAMING you can simply set the flag and use your normal I/O and
normal interfaces and have a field day.

Andrew, any experience on one vs. the other?

	Robert Love
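For contrast, the aligned-buffer and block-sized-chunk requirements of O_DIRECT mentioned in this exchange can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: the 4096-byte DIO_ALIGN is a guess that happens to suit many setups, the real requirement depends on the device and filesystem, and the names dio_alloc and dio_read_first_block are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed alignment; the actual requirement depends on device and filesystem. */
#define DIO_ALIGN 4096

/* Allocate `blocks` blocks whose start address is a multiple of DIO_ALIGN. */
void *dio_alloc(size_t blocks)
{
    void *p = NULL;
    if (blocks == 0 || posix_memalign(&p, DIO_ALIGN, blocks * DIO_ALIGN) != 0)
        return NULL;
    return p;
}

/*
 * Read one aligned block from the start of `path`, bypassing the pagecache.
 * `buf` must come from dio_alloc(). Returns bytes read, or -1 on error
 * (for example when the filesystem does not support O_DIRECT).
 */
ssize_t dio_read_first_block(const char *path, void *buf)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, DIO_ALIGN);
    close(fd);
    return n;
}
```

Compared with the one-flag change that O_STREAMING asks for, the caller here has to own buffer allocation, pick an alignment, and handle filesystems that refuse O_DIRECT, which is the redesign cost the thread is weighing.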
From: Andrew Morton
Subject: 2.5.42-mm1
To: linux-kernel
Date: 2002-10-11 22:50:02 PST

Robert Love wrote:
>
> ...
>
> Andrew, any experience on one vs. the other?

I'd say that if you were designing a new application which streams
large amounts of data then yes, you would design it to use O_DIRECT.
You would instantiate a separate IO worker thread and a message passing
mechanism so that thread would pump your data for you, and would
perform your readahead, etc.

If your filesystem supports O_DIRECT, of course. Not all do.

The strength of O_STREAMING is that you can take an existing, working,
megahuge application and make it play better with the VM by changing
a single line of code. No big redesign needed.
From: Andrew Morton
To: linux-kernel
Subject: 2.5.41-mm3
Date: 2002-10-11 02:38:05 PST

url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.41/2.5.41-mm3/

. Merged up John's latest 2.5 oprofile, so folks will have to wean
  themselves off the crufty old one. You'll have to grab the userspace
  tools from http://oprofile.sourceforge.net/oprofile-2.5.html

  Use:

	mkdir /dev/oprofile		# John forgot this
	./configure --with-kernel-support
	make install

  Or just do what I didn't do and read the web page.

  Quite a few things seem to have changed with oprofile. A typical
  profiling cycle would now be:

	rm -rf /var/lib/oprofile
	op_start --ctr0-count=50000 --ctr0-event=CPU_CLK_UNHALTED --vmlinux=/path/to/vmlinux
	op_stop
	sleep 3
	oprofpp -l -i /boot/vmlinux
	kill $(cat /var/lib/oprofile/lock)

  You must kill the daemon by hand before you can run op_start again.

. I've dropped the 512-byte O_DIRECT alignment patch for now. It's
  over in the experimental directory. I'd like to get a decent round
  of testing with the bio_add_page fix so we can get that into Linus
  and get direct-io generally stabilised again before moving on.

. We've had some encouraging performance test results on the shared
  pagetable code, but also a couple of crashes. The people who are
  monitoring performance may want to try that out. It is selectable
  in config.

. Turns out that the idea of unmapping mapped pagecache a little
  earlier than swapping out anon memory was a poor one. Changed the
  VM so that we treat these types of pages the same.

  It would be really appreciated if people who are interested in
  "the desktop experience" could give this patchset a try. It's
  working well for me; but that's not a large sample...

-guruhugh.patch
-pte-highmem-warning.patch
-raw-use-o_direct.patch
-remove-radix_tree_reserve.patch
-ext3-yield.patch
-readv-writev-check-fix.patch

  Merged

+kgdb.patch

  Make things simpler for myself

+oprofile-25.patch

  Latest version

+hugetlb-meminfo.patch

  Change the layout of the hugetlbpage info in /proc/meminfo

+dio-bio-add-fix-1.patch

  Direct-io fixes

+net-loopback.patch

  Davem's patch to make the loopback device save a copy. Doesn't
  seem to affect anything really.

-dio-fine-alignment.patch

  Moved to ../experimental for now

+blkdev-o_direct-short-read.patch

  Fix O_DIRECT-read-past-EOF for blockdevs

+msync-correctness.patch

  msync() standards fix

+page_reserved-accounting.patch

  Global accounting for PageReserved pages

+use-page_reserved_accounting.patch

  Use the above in a couple of VM decision-making places

+shpte-ifdef.patch

  Reduce shpte ifdeffery a little

+shpte-mprotect-fix.patch

  Shared pagetable mprotect fix.
From: Andrew Morton
To: linux-kernel
Subject: 2.5.42-mm1
Date: 2002-10-11 22:50:02 PST

url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.42/2.5.42-mm1/

Just a resync. Added O_STREAMING support.

+blk-queue-bounce.patch

  Inline blk_queue_bounce().

+o_streaming.patch

  O_STREAMING for 2.5. 100% untested.

+shpte-unmap-fix.patch

  A shared-pte bugfix

+shmmap.patch

  Proactive pagetable sharing for mmap()
From: Andrew Morton
To: linux-kernel
Subject: 2.5.42-mm2
Date: 2002-10-11 23:56:02 PST

url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.42/2.5.42-mm2/

mm1 had a little problem in the compilation department - missing
chunk from fs/fcntl.c.

+fix-pgpgout.patch

  Fix /proc/vmstat:pgpgin/pgpgout accounting for 512-byte IOs

+dio-fine-alignment.patch

  Bring back the 512-byte alignment patch

+sard.patch

  Keep sard ticking over

+remove-kiobufs.patch

  Remove the kiobuf infrastructure.

-mm -> mainline?
This is very interesting, just like most of the things that go into Andrew's -mm tree. Speaking of which, what are the chances that the patches he's got in there (or some of them, at least) will make it into mainline 2.5 before the freeze?
--
schnee
re: -mm -> mainline?
If after testing it Andrew decides it's a Good Thing and he passes it along to Linus, I'd say the odds are very high Linus will merge it into the mainline. This has happened more than once with other pieces Andrew first tested in -mm...