Re: [PATCH] readahead even for FMODE_RANDOM

Previous thread: Re: [ANNOUNCE] 2.6.33.1-rt11 - BUG? by Thomas Gleixner on Thursday, April 1, 2010 - 11:23 am. (5 messages)

Next thread: none
From: Jens Axboe
Date: Thursday, April 1, 2010 - 11:31 am

Hi,

I got a problem report with fio where larger block size random reads
where markedly slower with buffered IO than with O_DIRECT, and the
initial thought was that perhaps this was some fio oddity. The reporter
eventually discovered that turning off the fadvise hint made it work
fine. So I took a look, and it seems we never do readahead for
FMODE_RANDOM even if the request size is larger than 1 page. That seems
like a bug, if an application is doing eg 16kb random reads, you want to
readahead the 12kb remaining data. On devices where smaller transfer
sizes are slower than larger ones, this can make a large difference.

This patch makes us readahead even for FMODE_RANDOM, iff we'll be
reading more pages in that single read. I ran a quick test here, and it
appears to fix the problem (no difference with fadvise POSIX_FADV_RANDOM
being passed in or not).

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

diff --git a/mm/readahead.c b/mm/readahead.c
index 337b20e..d4b201c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -501,8 +501,11 @@ void page_cache_sync_readahead(struct address_space *mapping,
 	if (!ra->ra_pages)
 		return;
 
-	/* be dumb */
-	if (filp->f_mode & FMODE_RANDOM) {
+	/*
+	 * Be dumb for files marked as randomly accessed, but do readahead
+	 * inside the original request (req_size > 1).
+	 */
+	if ((filp->f_mode & FMODE_RANDOM) && req_size == 1) {
 		force_page_cache_readahead(mapping, filp, offset, req_size);
 		return;
 	}

-- 
Jens Axboe

--

From: Wu Fengguang
Date: Thursday, April 1, 2010 - 6:23 pm

Hi Jens,

 
I guess the application is doing (at least partial) sequential reads,
while at the same time tell kernel with POSIX_FADV_RANDOM that it's
doing random reads.

If so, it's mainly the application's fault.

However the kernel can behave more smart and less "dumb" :)
It can inherit the current good behavior of "submit one single 16kb io
request for a 16kb random read() syscall", while still be able to

The patch only fixes the (req_size != 1) case that exposed by your
application. A complete fix would be 

@@ -820,12 +825,6 @@ void page_cache_sync_readahead(struct ad
 	if (!ra->ra_pages)
 		return;
 
-	/* be dumb */
-	if (filp->f_mode & FMODE_RANDOM) {
-		force_page_cache_readahead(mapping, filp, offset, req_size);
-		return;
-	}
-
 	/* do read-ahead */
 	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
 }

And a more optimized patch would look like this.  Note that only the
last chunk is necessary for bug fixing, and only this chunk can be
applied to vanilla 2.6.34-rc3.

If no problem, I'll submit a patch with only the last chunk for
2.6.34, and submit the remaining chunks for 2.6.35.

Thanks,
Fengguang
---
Subject: readahead: more smart readahead on POSIX_FADV_RANDOM
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Fri Apr 02 08:52:42 CST 2010

Some times user space applications will tell POSIX_FADV_RANDOM
while still doing some sequential reads.

The kernel can behave a bit smarter in this case, by letting the
readahead heuristics handle the POSIX_FADV_RANDOM case, but with
less aggressive assumption on sequential reads.

CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |   17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

--- linux.orig/mm/readahead.c	2010-04-02 08:43:53.000000000 +0800
+++ linux/mm/readahead.c	2010-04-02 09:00:51.000000000 +0800
@@ -664,6 +664,7 @@ ondemand_readahead(struct address_space 
 	unsigned long max = ...
From: Jens Axboe
Date: Thursday, April 1, 2010 - 11:38 pm

The application is doing large random reads. It's purely random, so
the POSIX_FADV_RANDOM hint is correct. However, thinking about it, it
may be that we later hit a random "block" that has now been cached due
to this read-ahead. Let me try and rule that out completely and see if
there's still the difference. The original reporter observed 4kb reads


Hmm, are we talking about the same thing? I want to hit read-ahead for
the remaining pages inside that random read, eg ensure that read-ahead

I'll try and give this a spin. On the laptop, I cannot reproduce the

-- 
Jens Axboe

--

From: Wu Fengguang
Date: Thursday, April 1, 2010 - 11:52 pm

How large is it? For random reads > read_ahead_kb,
ondemand_readahead() will break it into read_ahead_kb sized IOs, while

4kb reads hit the disk (on POSIX_FADV_RANDOM)? That sounds like
behavior in pre .34 kernels that is fixed by commit 0141450f66c:


I think Yes. When the above block is gone, ondemand_readahead() will
be invoked, and the readahead heuristic will find that it's an
oversize read (whose size is > 128k) and start 128kb readahead for it.

Thanks,
--

From: Jens Axboe
Date: Thursday, April 1, 2010 - 11:59 pm

The test case was 128kb random reads. So should still be within the
normal read_ahead_kb. I suspect the reporter would not have noticed if
the issue size was as large as read_ahead_kb even if the request size

Could explain why I'm not reproducing when doing a quick test on the
laptop. It is an older kernel. So it could be that I'm just imaging the
issue on the current kernel, I don't have hard data to back it up on
that version.

So disregard the patch for now, part-sequential behaviour on
POSIX_FADV_RANDOM isn't the issue here.

-- 
Jens Axboe

--

From: Wu Fengguang
Date: Friday, April 2, 2010 - 12:21 am

Yeah. 128kb random reads won't trigger readahead.

However each 129kb random read will trigger 2*128kb readahead IOs,

OK.

Thanks,
Fengguang
--

Previous thread: Re: [ANNOUNCE] 2.6.33.1-rt11 - BUG? by Thomas Gleixner on Thursday, April 1, 2010 - 11:23 am. (5 messages)

Next thread: none