Linux: O_STREAMING - optimal streaming I/O

Submitted by nimrod
on October 12, 2002 - 8:49am

Robert Love's [interview] O_STREAMING patch (a patch to optimize read-once, streaming I/O) has made it into Andrew Morton's [interview] -mm tree.

O_STREAMING is a flag for open() that explicitly tells the kernel not to keep the file's data cached: pages behind the current file index are dropped from the pagecache. It suits large files that you access sequentially and know you will touch only once.

Other recent additions to -mm are kgdb, shpte [story] and John Levon's [interview] latest 2.5 oprofile.


From: Robert Love
Subject: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-07 19:50:04 PST

Attached patch implements an O_STREAMING file I/O flag which enables
manual drop-behind of pages.

If the file has O_STREAMING set then the user has explicitly said "this
is streaming data, I know I will not revisit this, do not cache
anything".  So we drop pages from the pagecache before our current
index.  We have to fiddle a bit to get writes working since we do
write-behind but the logic is there and it works.

Some numbers.  A simple streaming read to verify the pagecache effects:

        Streaming 1GB Read (avg of many runs, mem=2GB):
        O_STREAMING     Wall time       Change in Page Cache
        Yes             25.58s          0
        No              25.55s          +835MB

Another read with much more VM pressure:

        Streaming 1GB Read (avg of many runs, mem=8M)
        O_STREAMING     Wall time       Change in Page Cache
        Yes             25.76s          0
        No              29.01s          +1MB

And now the kicker:

        Kernel compile (make -j2) and concurrent streaming I/O
        (avg of two runs, mem=128M):
        O_STREAMING     Time to complete Kernel Compile
        Yes             3m27.863s
        No              4m15.818s

This is c/o Andrew Morton.

Patch is against 2.4.20-pre9.  Why not 2.5?  Because Andrew says we can
do better, perhaps with a real drop-behind heuristic.  As 20 Oct looms
quite close, we shall see.

        Robert Love

[patch here]
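
The drop-behind idea can be approximated from userspace on systems that provide posix_fadvise(). The sketch below is not Robert's kernel code: it swaps in POSIX_FADV_DONTNEED, a separate POSIX interface, to ask the kernel to discard the pages behind the current read position. The helper name read_with_drop_behind is hypothetical, and because the call is only advice the kernel may still keep some pages.

```c
#define _XOPEN_SOURCE 600   /* exposes posix_fadvise() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of the O_STREAMING patch: read `fd`
 * sequentially to EOF and, after each chunk, advise the kernel that the
 * data already consumed will not be needed again.
 * Returns the total number of bytes read, or -1 on a read error. */
long long read_with_drop_behind(int fd, char *buf, size_t bufsize)
{
    off_t pos = 0;
    ssize_t n;

    assert(buf != NULL && bufsize > 0);
    while ((n = read(fd, buf, bufsize)) > 0) {
        pos += n;
        /* Advice only, so the return value is deliberately ignored.
         * The range starts at 0 so that pages only partly covered by an
         * earlier call are covered in full by a later one. */
        (void)posix_fadvise(fd, 0, pos, POSIX_FADV_DONTNEED);
    }
    return n < 0 ? -1 : (long long)pos;
}
```

A caller opens the file normally and passes the descriptor, for example read_with_drop_behind(fd, buf, sizeof buf). Unlike O_STREAMING, this approach has to be wired into every read loop rather than set once at open() time.
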


From: J.A. Magallon
Subject: Re: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-08 03:50:06 PST

Hi,

On 2002.10.08 Robert Love wrote:
>Attached patch implements an O_STREAMING file I/O flag which enables
>manual drop-behind of pages.
>
>If the file has O_STREAMING set then the user has explicitly said "this
>is streaming data, I know I will not revisit this, do not cache
>anything".  So we drop pages from the pagecache before our current
>index.  We have to fiddle a bit to get writes working since we do
>write-behind but the logic is there and it works.
>

Sorry if this is a newbie question, but, does glibc pass flags blindly
to the syscall ?? Ie, I do not need to rebuild glibc to use this in
open(), fcntl() and so on, just I can make sure that bit 04000000
is set in the flags.

TIA

From: Robert Love
Subject: Re: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-08 11:30:10 PST

On Tue, 2002-10-08 at 06:42, J.A. Magallon wrote:

> Sorry if this is a newbie question, but, does glibc pass flags blindly
> to the syscall ?? Ie, I do not need to rebuild glibc to use this in
> open(), fcntl() and so on, just I can make sure that bit 04000000
> is set in the flags.

Right.  Do something like:

        #define O_STREAMING    04000000

        fd = open(file, ... | O_STREAMING);

or open it via fopen() and use fcntl() to set O_STREAMING.

        Robert Love
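
Written out in full, Robert's two suggestions might look like the following. This is a sketch under assumptions: the flag value 04000000 is the one quoted in the thread and only means "streaming" on a kernel carrying the patch (elsewhere the same bit may be unused or mean something else), and the helper names open_streaming and set_streaming are hypothetical. An unpatched Linux kernel ignores flag bits that F_SETFL cannot change, so set_streaming() can report success there without having any effect.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes fileno() */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/* Value quoted in the thread.  Only meaningful on a kernel that carries the
 * O_STREAMING patch; on any other kernel this is an assumption that does
 * not hold. */
#ifndef O_STREAMING
#define O_STREAMING 04000000
#endif

/* Hypothetical helper: open a file read-only with the streaming flag set.
 * Returns a file descriptor, or -1 on failure (see errno). */
int open_streaming(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_STREAMING);
}

/* Hypothetical helper: set the streaming flag on a stream that was opened
 * with fopen().  Returns 0 on success, -1 on failure. */
int set_streaming(FILE *fp)
{
    int fd, flags;

    assert(fp != NULL);
    fd = fileno(fp);
    if (fd < 0)
        return -1;
    flags = fcntl(fd, F_GETFL);          /* fetch the current status flags */
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_STREAMING) < 0 ? -1 : 0;
}
```

After either call the program keeps using read(), fread() and friends exactly as before, which is the point Robert makes about unchanged semantics.
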

From: Robert Love
Subject: Re: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-08 12:00:09 PST

On Tue, 2002-10-08 at 14:38, Chris Wedgwood wrote:

> > Attached patch implements an O_STREAMING file I/O flag which enables
> > manual drop-behind of pages.

I answered this in a previous email to this list:

In a lot of ways.  This flag changes no semantics except to not let
pages from the mapping populate the page cache for very long.

In other words, this flag pretty much disables the pagecache for this
mapping, although we happily keep it around for write-behind and
read-ahead.  But once the data is behind us and safe to kill, we do.
It is manual drop-behind.

O_DIRECT has a lot of semantics, one of which is to attempt to
minimize cache effects.  It is also synchronous, requires properly
aligned buffers, and pretty much minimizes interaction with as much of
the kernel as possible.  I am not overly familiar with its uses, but I
always assumed the big user is applications that implement their own
caching layer.

O_STREAMING would be for your TiVo or network audio streamer.  Any
file I/O that is inherently sequential and access-once.  No point
trashing the pagecache with its data - but otherwise the behavior is
normal.

Basically, with O_STREAMING you want normal semantics except
drop-behind of the pages.  You even still want the pagecache caching
your data - just the not-yet-written write-behind data and the
not-yet-read read-ahead data.

With O_DIRECT you get a whole different can-of-worms.  Basically you
cut out a lot of the kernel.

You can do normal libc file I/O on an O_STREAMING file with no
semantic changes; except the drop-behind of the pages.

        Robert Love

From: Robert Love
Subject: Re: [PATCH] O_STREAMING - flag for optimal streaming I/O
To: linux-kernel
Date: 2002-10-08 13:40:05 PST

On Tue, 2002-10-08 at 15:05, Chris Wedgwood wrote:

> > In other words, this flag pretty much disables the pagecache for
> > this mapping, although we happily keep it around for write-behind
> > and read-ahead.  But once the data is behind us and safe to kill, we
> > do.  It is manual drop-behind.
>
> OK.  What might use this though?  What applications might want to
> disable the page-cache but still use write-behind?

Streaming I/O wants read-ahead.  Filesystems themselves implement the
write-behind and we do not want to circumvent so much of the kernel.

The point of O_STREAMING is one change: drop pages in the pagecache
behind our current position, that are free-able, because we know we
will never want them.  It's a hint from the application saying "I will
never revisit this so dump it".

O_DIRECT is a much bigger can of worms.  You lose a lot of what the
kernel provides.  You have to do things in block-sized chunks.  Etc.
etc.

> > O_DIRECT has a lot of semantics, one of which is to attempt to
> > minimize cache effects.
>
> It depends on the OS.  Some OS are broken and treat O_DIRECT as a
> hint, Linux and IRIX know it's a *requirement*.

Yep.  Linux treats most "hints" (e.g. madvise) as a requirement - it
fails if it cannot do it.  That is against the spec most of the time,
but oh well...

> > O_STREAMING would be for your TiVo or network audio streamer.  Any
> > file I/O that is inherently sequential and access-once.  No point
> > trashing the pagecache with its data - but otherwise the behavior is
> > normal.
>
> Actually, this sounds perfect for O_DIRECT.  But I don't know much
> about streaming video.
>
> Since you only want the data once, why use the page-cache at all and
> needlessly copy?  Certainly, the requirements for O_DIRECT are not
> that hard to meet or implement.
>
> Don't get me wrong, I'm not saying this is a bad thing at all.  The
> patch is small and elegant so it's hard to object; I'm just trying to
> understand where in practice I would use this over O_DIRECT.

Shrug.  I do not have much experience with O_DIRECT.  I suspect the
synchronous nature and the requirement of aligned buffers is not
ideal.  With O_STREAMING you can simply set the flag and use your
normal I/O and normal interfaces and have a field day.

Andrew, any experience on one vs. the other?

        Robert Love
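
For contrast, here is roughly what the O_DIRECT requirements discussed in this exchange mean for application code. It is a sketch, assuming Linux with glibc and a 4096-byte alignment; DIO_ALIGN, alloc_dio_buffer and read_direct are hypothetical names, and the real alignment depends on the device and filesystem. The buffer address and the request length both have to be aligned, which plain malloc() does not guarantee.

```c
#define _GNU_SOURCE               /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0                /* not exposed here: falls back to buffered I/O */
#endif

/* Assumed alignment, for illustration only. */
#define DIO_ALIGN 4096

/* Hypothetical helper: allocate a buffer whose address and length are both
 * multiples of DIO_ALIGN.  The rounded-up length is stored in *len_out.
 * Returns NULL on failure; release the buffer with free(). */
void *alloc_dio_buffer(size_t min_len, size_t *len_out)
{
    void *buf = NULL;
    size_t len = (min_len + DIO_ALIGN - 1) / DIO_ALIGN * DIO_ALIGN;

    assert(len_out != NULL);
    if (len == 0)
        len = DIO_ALIGN;
    if (posix_memalign(&buf, DIO_ALIGN, len) != 0)
        return NULL;
    *len_out = len;
    return buf;
}

/* Hypothetical helper: read a whole file through O_DIRECT in aligned chunks.
 * Returns the number of bytes read, or -1 if the open or a read fails, for
 * example because the filesystem does not support O_DIRECT. */
long long read_direct(const char *path, void *buf, size_t len)
{
    long long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY | O_DIRECT);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, len)) > 0)
        total += n;               /* a real program would consume buf here */
    close(fd);
    return n < 0 ? -1 : total;
}
```

None of this setup is needed in the O_STREAMING case, where an ordinary buffer and an ordinary read loop keep working.
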

From: Andrew Morton
Subject: 2.5.42-mm1
To: linux-kernel
Date: 2002-10-11 22:50:02 PST

Robert Love wrote:
>
> ...
>
> Andrew, any experience on one vs. the other?

I'd say that if you were designing a new application which streams
large amounts of data then yes, you would design it to use O_DIRECT.
You would instantiate a separate IO worker thread and a message
passing mechanism so that thread would pump your data for you, and
would perform your readahead, etc.

If your filesystem supports O_DIRECT, of course.  Not all do.

The strength of O_STREAMING is that you can take an existing, working,
megahuge application and make it play better with the VM by changing a
single line of code.  No big redesign needed.

From: Andrew Morton
To: linux-kernel
Subject: 2.5.41-mm3
Date: 2002-10-11 02:38:05 PST

url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.41/2.5.41-mm3/

. Merged up John's latest 2.5 oprofile, so folks will have to wean
  themselves off the crufty old one.
  
  You'll have to grab the userspace tools from
  http://oprofile.sourceforge.net/oprofile-2.5.html
  
  Use:
  
        mkdir /dev/oprofile     # John forgot this
        ./configure --with-kernel-support
        make install

  Or just do what I didn't do and read the web page.

  Quite a few things seem to have changed with oprofile.  A typical
  profiling cycle would now be:

        rm -rf /var/lib/oprofile
        op_start --ctr0-count=50000 --ctr0-event=CPU_CLK_UNHALTED \
                --vmlinux=/path/to/vmlinux
        
        op_stop
        sleep 3
        oprofpp -l -i /boot/vmlinux
        kill $(cat /var/lib/oprofile/lock)

  You must kill the daemon by hand before you can run op_start
  again.

. I've dropped the 512-byte O_DIRECT alignment patch for now. It's
  over in the experimental directory.  I'd like to get a decent round
  of testing with the bio_add_page fix so we can get that into Linus
  and get direct-io generally stabilised again before moving on.

. We've had some encouraging performance test results on the
  shared pagetable code, but also a couple of crashes.  The people
  who are monitoring performance may want to try that out.  It is
  selectable in config.

. Turns out that the idea of unmapping mapped pagecache a little earlier
  than swapping out anon memory was a poor one.  Changed the VM so that
  we treat these types of pages the same.

  It would be really appreciated if people who are interested in "the
  desktop experience" could give this patchset a try.  It's working
  well for me; but that's not a large sample...

-guruhugh.patch
-pte-highmem-warning.patch
-raw-use-o_direct.patch
-remove-radix_tree_reserve.patch
-ext3-yield.patch
-readv-writev-check-fix.patch

 Merged

+kgdb.patch

 Make things simpler for myself

+oprofile-25.patch

 Latest version

+hugetlb-meminfo.patch

 Change the layout of the hugetlbpage info in /proc/meminfo

+dio-bio-add-fix-1.patch

 Direct-io fixes

+net-loopback.patch

 Davem's patch to make the loopback device save a copy.  Doesn't seem
 to affect anything really.

-dio-fine-alignment.patch

 Moved to ../experimental for now

+blkdev-o_direct-short-read.patch

 Fix O_DIRECT-read-past-EOF for blockdevs

+msync-correctness.patch 
  
 msync() standards fix
  
+page_reserved-accounting.patch
  
 Global accounting for PageReserved pages

+use-page_reserved_accounting.patch

 Use the above in a couple of VM decision-making places

+shpte-ifdef.patch

 Reduce shpte ifdeffery a little
 
+shpte-mprotect-fix.patch

 Shared pagetable mprotect fix.

From: Andrew Morton
To: linux-kernel
Subject: 2.5.42-mm1
Date: 2002-10-11 22:50:02 PST

url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.42/2.5.42-mm1/

Just a resync.  Added O_STREAMING support.

+blk-queue-bounce.patch

 Inline blk_queue_bounce().

+o_streaming.patch

 O_STREAMING for 2.5.  100% untested.

+shpte-unmap-fix.patch

 A shared-pte bugfix

+shmmap.patch

 Proactive pagetable sharing for mmap()

From: Andrew Morton
To: linux-kernel
Subject: 2.5.42-mm2
Date: 2002-10-11 23:56:02 PST

url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.42/2.5.42-mm2/

mm1 had a little problem in the compilation department - missing chunk
from fs/fcntl.c.

+fix-pgpgout.patch

 Fix /proc/vmstat:pgpgin/pgpgout accounting for 512-byte IOs

+dio-fine-alignment.patch

 Bring back the 512-byte alignment patch

+sard.patch

 Keep sard ticking over

+remove-kiobufs.patch

 Remove the kiobuf infrastructure.

-mm -> mainline?

schneelocke
on October 12, 2002 - 12:11pm

This is very interesting, just like most of the things that go into Andrew's -mm tree. Speaking of which, what are the chances that the patches he's got in there (or some of them, at least) will make it into mainline 2.5 before the freeze?

--
schnee

re: -mm -> mainline?

Jeremy
on October 12, 2002 - 12:59pm

If after testing it Andrew decides it's a Good Thing and he passes it along to Linus, I'd say the odds are very high Linus will merge it into the mainline. This has happened more than once with other pieces Andrew first tested in -mm...
