Hi,
I got a problem report with fio where larger block size random reads
where markedly slower with buffered IO than with O_DIRECT, and the
initial thought was that perhaps this was some fio oddity. The reporter
eventually discovered that turning off the fadvise hint made it work
fine. So I took a look, and it seems we never do readahead for
FMODE_RANDOM even if the request size is larger than 1 page. That seems
like a bug, if an application is doing eg 16kb random reads, you want to
readahead the 12kb remaining data. On devices where smaller transfer
sizes are slower than larger ones, this can make a large difference.
This patch makes us readahead even for FMODE_RANDOM, iff we'll be
reading more pages in that single read. I ran a quick test here, and it
appears to fix the problem (no difference with fadvise POSIX_FADV_RANDOM
being passed in or not).
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
diff --git a/mm/readahead.c b/mm/readahead.c
index 337b20e..d4b201c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -501,8 +501,11 @@ void page_cache_sync_readahead(struct address_space *mapping,
if (!ra->ra_pages)
return;
- /* be dumb */
- if (filp->f_mode & FMODE_RANDOM) {
+ /*
+ * Be dumb for files marked as randomly accessed, but do readahead
+ * inside the original request (req_size > 1).
+ */
+ if ((filp->f_mode & FMODE_RANDOM) && req_size == 1) {
force_page_cache_readahead(mapping, filp, offset, req_size);
return;
}
--
Jens Axboe
--
Hi Jens,
I guess the application is doing (at least partial) sequential reads,
while at the same time tell kernel with POSIX_FADV_RANDOM that it's
doing random reads.
If so, it's mainly the application's fault.
However the kernel can behave more smart and less "dumb" :)
It can inherit the current good behavior of "submit one single 16kb io
request for a 16kb random read() syscall", while still be able to
The patch only fixes the (req_size != 1) case that exposed by your
application. A complete fix would be
@@ -820,12 +825,6 @@ void page_cache_sync_readahead(struct ad
if (!ra->ra_pages)
return;
- /* be dumb */
- if (filp->f_mode & FMODE_RANDOM) {
- force_page_cache_readahead(mapping, filp, offset, req_size);
- return;
- }
-
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}
And a more optimized patch would look like this. Note that only the
last chunk is necessary for bug fixing, and only this chunk can be
applied to vanilla 2.6.34-rc3.
If no problem, I'll submit a patch with only the last chunk for
2.6.34, and submit the remaining chunks for 2.6.35.
Thanks,
Fengguang
---
Subject: readahead: more smart readahead on POSIX_FADV_RANDOM
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Fri Apr 02 08:52:42 CST 2010
Some times user space applications will tell POSIX_FADV_RANDOM
while still doing some sequential reads.
The kernel can behave a bit smarter in this case, by letting the
readahead heuristics handle the POSIX_FADV_RANDOM case, but with
less aggressive assumption on sequential reads.
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/readahead.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
--- linux.orig/mm/readahead.c 2010-04-02 08:43:53.000000000 +0800
+++ linux/mm/readahead.c 2010-04-02 09:00:51.000000000 +0800
@@ -664,6 +664,7 @@ ondemand_readahead(struct address_space
unsigned long max = ...The application is doing large random reads. It's purely random, so the POSIX_FADV_RANDOM hint is correct. However, thinking about it, it may be that we later hit a random "block" that has now been cached due to this read-ahead. Let me try and rule that out completely and see if there's still the difference. The original reporter observed 4kb reads Hmm, are we talking about the same thing? I want to hit read-ahead for the remaining pages inside that random read, eg ensure that read-ahead I'll try and give this a spin. On the laptop, I cannot reproduce the -- Jens Axboe --
How large is it? For random reads > read_ahead_kb, ondemand_readahead() will break it into read_ahead_kb sized IOs, while 4kb reads hit the disk (on POSIX_FADV_RANDOM)? That sounds like behavior in pre .34 kernels that is fixed by commit 0141450f66c: I think Yes. When the above block is gone, ondemand_readahead() will be invoked, and the readahead heuristic will find that it's an oversize read (whose size is > 128k) and start 128kb readahead for it. Thanks, --
The test case was 128kb random reads. So should still be within the normal read_ahead_kb. I suspect the reporter would not have noticed if the issue size was as large as read_ahead_kb even if the request size Could explain why I'm not reproducing when doing a quick test on the laptop. It is an older kernel. So it could be that I'm just imaging the issue on the current kernel, I don't have hard data to back it up on that version. So disregard the patch for now, part-sequential behaviour on POSIX_FADV_RANDOM isn't the issue here. -- Jens Axboe --
Yeah. 128kb random reads won't trigger readahead. However each 129kb random read will trigger 2*128kb readahead IOs, OK. Thanks, Fengguang --
