Linux: Random File I/O Regressions In 2.6

Submitted by Jeremy
on May 4, 2004 - 9:30am

A recent discussion on the lkml looked into a reproducable random file I/O regression in 2.6 compared to the 2.4 kernel [forum]. Alexey Kopytov posted the benchmark results, attempting to simulate the workload of a database under intensive load. The tests were tried with all I/O schedulers, including the anticipatory [story], deadline [story] and CFQ [story], and in all cases 2.4 outperformed 2.6.

The resulting thread was primarily between anticipatory scheduler author Nick Piggin [interview] and 2.6 kernel [forum] maintainer Andrew Morton [interview], examining the state of the readahead cache. At one point, Andrew suggested, "readahead has got too complex and is getting band-aidy. I'd prefer to tear it down and rethink things." The regression was ultimately tracked down by Ram Pai, caused by multiple threads accessing the same file descriptor. Andrew agreed, "for some reason this mucks up the 2.6 readahead state more than 2.4's." He provided a small patch to properly maintain the read ahead state, included within.


From: Alexey Kopytov [email blocked]
To:  linux-kernel
Subject: Random file I/O regressions in 2.6
Date: 	Sun, 2 May 2004 23:57:59 +0400

Hello!

I tried to compare random file I/O performance in 2.4 and 2.6 kernels and 
found some regressions that I failed to explain. I tested 2.4.25, 2.6.5-bk2 
and 2.6.6-rc3 with my own utility SysBench which was written to generate 
workloads similar to a database under intensive load. 

For 2.6.x kernels anticipatory, deadline, CFQ and noop I/O schedulers were
tested with AS giving the best results for this workload, but it's still about 
1.5 times worse than the results for 2.4.25 kernel.

The SysBench 'fileio' test was configured to generate the following workload:
16 worker threads are created, each running random read/write file requests in
blocks of 16 KB with a read/write ratio of 1.5. All I/O operations are evenly
distributed over 128 files with a total size of 3 GB. Each 100 requests, an
fsync() operations is performed sequentially on each file. The total number of
requests is limited by 10000.

The FS used for the test was ext3 with data=ordered.

Here are the results (values are number of seconds to complete the test):

2.4.25: 77.5377

2.6.5-bk2(noop): 165.3393
2.6.5-bk2(anticipatory): 118.7450
2.6.5-bk2(deadline): 130.3254
2.6.5-bk2(CFQ): 146.4286

2.6.6-rc3(noop): 164.9486
2.6.6-rc3(anticipatory): 125.1776
2.6.6-rc3(deadline): 131.8903
2.6.6-rc3(CFQ): 152.9280

I have published the results as well as the hardware and kernel setups at the
SysBench home page: http://sysbench.sourceforge.net/results/fileio/

Any comments or suggestions would be highly appreciated.

-- 
Alexey Kopytov, Software Developer
MySQL AB, www.mysql.com

Are you MySQL certified?  www.mysql.com/certification


From: Nick Piggin [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: Mon, 03 May 2004 21:14:45 +1000 I am able to reproduce this here. 2.6 isn't improved by increasing nr_requests, relaxing IO scheduler deadlines, or turning off readahead. It looks like 2.6 is submitting a lot of the IO in 4KB sized requests... Hmm, oh dear. It looks like the readahead logic shat itself and/or do_generic_mapping_read doesn't know how to handle multipage reads properly. What ends up happening is that readahead gets turned off, then the 16K read ends up being done in 4 synchronous 4K chunks. Because they are synchronous, they have no chance of being merged with one another either. I have attached a proof of concept hack... I think what should really happen is that page_cache_readahead should be taught about the size of the requested read, and ensures that a decent amount of reading is done while within the read request window, even if beyond-request-window-readahead has been previously unsuccessful. Numbers with an IDE disk, 256MB ram 2.4.24: 81s 2.6.6-rc3-mm1: 126s rc3-mm1+patch: 87s The small remaining regression might be explained by 2.6's smaller nr_requests, IDE driver, io scheduler tuning, etc. > Here are the results (values are number of seconds to complete the test): > > 2.4.25: 77.5377 > > 2.6.5-bk2(noop): 165.3393 > 2.6.5-bk2(anticipatory): 118.7450 > 2.6.5-bk2(deadline): 130.3254 > 2.6.5-bk2(CFQ): 146.4286 > > 2.6.6-rc3(noop): 164.9486 > 2.6.6-rc3(anticipatory): 125.1776 > 2.6.6-rc3(deadline): 131.8903 > 2.6.6-rc3(CFQ): 152.9280 > > I have published the results as well as the hardware and kernel setups at the > SysBench home page: http://sysbench.sourceforge.net/results/fileio/ > > Any comments or suggestions would be highly appreciated. > From your website: "Another interesting fact is that AS gives the best results for this workload, though it's believed to give worse results for this kind of workloads as compared to other I/O schedulers available in 2.6.x kernels." The anticipatory scheduler is actually in a fairly good state of tune, and can often beat deadline even for random read/write/fsync tests. The infamous database regression problem is when this sort of workload is combined with TCQ disk drives. Nick [read-populate.patch text/x-patch (1161 bytes)] include/linux/mm.h | 0 linux-2.6-npiggin/mm/filemap.c | 5 ++++- 2 files changed, 4 insertions(+), 1 deletion(-) diff -puN mm/readahead.c~read-populate mm/readahead.c diff -puN mm/filemap.c~read-populate mm/filemap.c --- linux-2.6/mm/filemap.c~read-populate 2004-05-03 19:56:00.000000000 +1000 +++ linux-2.6-npiggin/mm/filemap.c 2004-05-03 20:51:37.000000000 +1000 @@ -627,6 +627,9 @@ void do_generic_mapping_read(struct addr index = *ppos >> PAGE_CACHE_SHIFT; offset = *ppos & ~PAGE_CACHE_MASK; + force_page_cache_readahead(mapping, filp, index, + max_sane_readahead(desc->count >> PAGE_CACHE_SHIFT)); + for (;;) { struct page *page; unsigned long end_index, nr, ret; @@ -644,7 +647,7 @@ void do_generic_mapping_read(struct addr } cond_resched(); - page_cache_readahead(mapping, ra, filp, index); + page_cache_readahead(mapping, ra, filp, index + desc->count); nr = nr - offset; find_page: diff -puN include/linux/mm.h~read-populate include/linux/mm.h _
From: Andrew Morton [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: Mon, 3 May 2004 11:08:54 -0700 Nick Piggin [email blocked] wrote: > > What ends up happening is that readahead gets turned off, then the > 16K read ends up being done in 4 synchronous 4K chunks. Because they > are synchronous, they have no chance of being merged with one another > either. yup. > I have attached a proof of concept hack... I think what should really > happen is that page_cache_readahead should be taught about the size > of the requested read, and ensures that a decent amount of reading is > done while within the read request window, even if > beyond-request-window-readahead has been previously unsuccessful. The "readahead turned itself off" thing is there to avoid doing lots of pagecache lookups in the very common case where the file is fully cached. The place which needs attention is handle_ra_miss(). But first I'd like to reacquaint myself with the intent behind the lazy-readahead patch. Was never happy with the complexity and special-cases which that introduced. > cond_resched(); > - page_cache_readahead(mapping, ra, filp, index); > + page_cache_readahead(mapping, ra, filp, index + desc->count); > `index' is a pagecache index and desc->count is a byte counter.
From: Ram Pai [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: 03 May 2004 13:22:08 -0700 On Mon, 2004-05-03 at 11:08, Andrew Morton wrote: > > The "readahead turned itself off" thing is there to avoid doing lots of > pagecache lookups in the very common case where the file is fully cached. > > The place which needs attention is handle_ra_miss(). But first I'd like to > reacquaint myself with the intent behind the lazy-readahead patch. Was > never happy with the complexity and special-cases which that introduced. lazy-readahead has no role to play here. The readahead window got closed because the i/o pattern was totally random. My guess is multiple threads are generating 16k i/o on the same fd. In such a case the i/os can get interleaved and the readahead window size goes for a toss(which is expected behavior) Well if this is infact the case: the question is 1. does the i/o pattern really has some sequentiality to deserve a readahead? 2. or should we ensure that the interleaved case be somehow handled, by including the size parameter? I know Nick has implied option (2) but I think from the readahead's point of view it is (1), RP
From: Andrew Morton [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: Mon, 3 May 2004 13:57:19 -0700 Ram Pai [email blocked] wrote: > > > The place which needs attention is handle_ra_miss(). But first I'd like to > > reacquaint myself with the intent behind the lazy-readahead patch. Was > > never happy with the complexity and special-cases which that introduced. > > lazy-readahead has no role to play here. Sure. But lazy-readahead is bolted on the side and is generally not to my liking. I'd like to find a solution to the sysbench problem which also solves the thing which lazy-readahead addressed. > The readahead window got closed > because the i/o pattern was totally random. My guess is multiple threads > are generating 16k i/o on the same fd. In such a case the i/os can get > interleaved and the readahead window size goes for a toss(which is > expected behavior) I don't think it's that. The app is doing well-aligned 16k reads and writes. If we get enough pagecache hits on the reads, readahead turns itself off (fair enough) but fails to turn itself on again. The readahead logic _should_ be able to adapt to the fixed-sized I/Os and issue correct-sized reads immediately after each seek. I _think_ this will fix the problem which lazy-readahead addressed, but as usual we don't have a rigorous description of that problem :( > Well if this is infact the case: the question is > 1. does the i/o pattern really has some sequentiality to > deserve a readahead? > 2. or should we ensure that the interleaved case be somehow > handled, by including the size parameter? > > I know Nick has implied option (2) but I think from the readahead's > point of view it is (1), Readahead has got too complex and is getting band-aidy. I'd prefer to tear it down and rethink things.
From: Peter Zaitsev [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: Mon, 03 May 2004 14:37:26 -0700 On Mon, 2004-05-03 at 13:57, Andrew Morton wrote: > Ram Pai [email blocked] wrote: > > > > > The place which needs attention is handle_ra_miss(). But first I'd like to > > > reacquaint myself with the intent behind the lazy-readahead patch. Was > > > never happy with the complexity and special-cases which that introduced. > > > > lazy-readahead has no role to play here. > Andrew, Could you please clarify how this things become to be dependent on read-ahead at all. At my understanding read-ahead it to catch sequential (or other) access pattern and do some advance reading, so instead of 16K request we do 128K request, or something similar. But how could read-ahead disabled end up in 16K request converted to several sequential synchronous 4K requests ? It all looks pretty strange. -- Peter Zaitsev, Senior Support Engineer MySQL AB, www.mysql.com
From: Andrew Morton [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: Mon, 3 May 2004 14:59:22 -0700 Peter Zaitsev [email blocked] wrote: > > On Mon, 2004-05-03 at 13:57, Andrew Morton wrote: > > Ram Pai [email blocked] wrote: > > > > > > > The place which needs attention is handle_ra_miss(). But first I'd like to > > > > reacquaint myself with the intent behind the lazy-readahead patch. Was > > > > never happy with the complexity and special-cases which that introduced. > > > > > > lazy-readahead has no role to play here. > > > > Andrew, > > Could you please clarify how this things become to be dependent on > read-ahead at all. readahead is currently the only means by which we build up nice large multi-page BIOs. > At my understanding read-ahead it to catch sequential (or other) access > pattern and do some advance reading, so instead of 16K request we do > 128K request, or something similar. That's one of its usage patterns. It's also supposed to detect the fixed-sized-reads-seeking-all-over-the-place situation. In which case it's supposed to submit correctly-sized multi-page BIOs. But it's not working right for this workload. A naive solution would be to add special-case code which always does the fixed-size readahead after a seek. Basically that's if (ra->next_size == -1UL) force_page_cache_readahead(...) in filemap.c. But this means that the kernel does lots of pointless pagecache lookups when everything is in pagecache. We should detect this situation and stop doing readahead completely, until we start getting pagecache lookup misses again. > But how could read-ahead disabled end up in 16K request converted to > several sequential synchronous 4K requests ? Readahead got itself turned off because of pagecache hits and didn't turn itself on again.
From: Nick Piggin [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: Tue, 04 May 2004 09:58:01 +1000 Andrew Morton wrote: > > That's one of its usage patterns. It's also supposed to detect the > fixed-sized-reads-seeking-all-over-the-place situation. In which case it's > supposed to submit correctly-sized multi-page BIOs. But it's not working > right for this workload. > > A naive solution would be to add special-case code which always does the > fixed-size readahead after a seek. Basically that's > > if (ra->next_size == -1UL) > force_page_cache_readahead(...) > I think a better solution to this case would be to ensure the readahead window is always min(size of read, some large number); The size of the read is basically a free and accurate "hint" to the minimum size of the required readahead. Either that or do a simple "preread" while you're still in the read request window, and run readahead when that completes.
From: Andrew Morton [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: Mon, 3 May 2004 17:10:05 -0700 Nick Piggin [email blocked] wrote: > > I think a better solution to this case would be to ensure the > readahead window is always min(size of read, some large number); > That would cause the kernel to perform lots of pointless pagecache lookups when the file is already 100% cached.
From: Nick Piggin [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: Tue, 04 May 2004 10:19:50 +1000 Andrew Morton wrote: > > That would cause the kernel to perform lots of pointless pagecache lookups > when the file is already 100% cached. > That's pretty sad. You need a "preread" or something which sends the pages back... or uses the actor itself. readahead would then have to be reworked to only run off the end of the read window, but that is what it should be doing anyway.
From: Ram Pai [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: 03 May 2004 17:50:05 -0700 On Mon, 2004-05-03 at 17:19, Nick Piggin wrote: > > > That's pretty sad. You need a "preread" or something which > sends the pages back... or uses the actor itself. readahead > would then have to be reworked to only run off the end of > the read window, but that is what it should be doing anyway. Sorry, If I am saying this again. I have checked the behaviour of the readahead code using my user level simulator as well as running some DSS benchmark and iozone benchmark. It generates a steady stream of large i/o for large-random-reads and should not exhibit the bad behavior that we are seeing. I feel this bad behavior is because of interleaved access by multiple thread. To illustrate with an example: t1 request reads from page 100 to 104 simultaneously t2 requests reads on the same fd from 200 to 204 So do_page_cache_readahead() can be called in the following pattern. 100,200,101,201,102,202,103,203,104,204. Because of this pattern the readahaed code assumes that the read pattern is absolutely random and hence closes the readahead window. I think I should generate a patch to validate this behavior, I will. How about having some /proc counters that keep track of number of window-closes because of cache-hits and because of cache-misses? RP
From: Andrew Morton [email blocked] Subject: Re: Random file I/O regressions in 2.6 Date: Mon, 3 May 2004 23:29:28 -0700 Ram Pai [email blocked] wrote: > > Sorry, If I am saying this again. I have checked the behaviour of the > readahead code using my user level simulator as well as running some > DSS benchmark and iozone benchmark. It generates a steady stream of > large i/o for large-random-reads and should not exhibit the bad behavior > that we are seeing. I feel this bad behavior is because of interleaved > access by multiple thread. you're right - the benchmark has multiple threads issuing concurrent pread()s against the same fd. For some reason this mucks up the 2.6 readahead state more than 2.4's. Putting a semaphore around do_generic_file_read() or maintaining the state as below fixes it up. I wonder if we should bother fixing this? I guess as long as the app is using pread() it is a legitimate thing to be doing, so I guess we should... --- 25/mm/filemap.c~readahead-seralisation 2004-05-03 23:14:43.399947720 -0700 +++ 25-akpm/mm/filemap.c 2004-05-03 23:14:43.404946960 -0700 @@ -612,7 +612,7 @@ EXPORT_SYMBOL(grab_cache_page_nowait); * - note the struct file * is only passed for the use of readpage */ void do_generic_mapping_read(struct address_space *mapping, - struct file_ra_state *ra, + struct file_ra_state *_ra, struct file * filp, loff_t *ppos, read_descriptor_t * desc, @@ -622,6 +622,7 @@ void do_generic_mapping_read(struct addr unsigned long index, offset; struct page *cached_page; int error; + struct file_ra_state ra = *_ra; cached_page = NULL; index = *ppos >> PAGE_CACHE_SHIFT; @@ -644,13 +645,13 @@ void do_generic_mapping_read(struct addr } cond_resched(); - page_cache_readahead(mapping, ra, filp, index); + page_cache_readahead(mapping, &ra, filp, index); nr = nr - offset; find_page: page = find_get_page(mapping, index); if (unlikely(page == NULL)) { - handle_ra_miss(mapping, ra, index); + handle_ra_miss(mapping, &ra, index); goto no_cached_page; } if (!PageUptodate(page)) @@ -752,6 +753,8 @@ no_cached_page: goto readpage; } + *_ra = ra; + *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset; if (cached_page) page_cache_release(cached_page); _

Related Links:

For Shame!

Anonymous
on
May 4, 2004 - 9:37am

Band-aid code is the way of the MS codebase. I expect better from the kernel devs. Down with the Quick n dirty fix! Makes me wonder how much else in the kernel is quick fix city rather than proper working code.

blah

Anonymous
on
May 4, 2004 - 10:18am

Band-aid code is the way of the MS codebase? Well you said it...

I expect better from the kernel devs. Down with the Quick n dirty fix! Makes me wonder how much else in the kernel is quick fix city rather than proper working code.

Who said it was a quick and dirty fix? Maybe you'd care to share your infinite wisdom with that Andrew idiot and the rest of lkml. Enlighten them as to why it is dirty and what the "proper working code" is please.

Lastly, if you stay up at night wondering about how much else in the kernel is quick fix city then just do yourself a favour and stop using it.

>Readahead has got too comple

Anonymous
on
May 4, 2004 - 11:36am

>Readahead has got too complex and is getting band-aidy. I'd prefer to
>tear it down and rethink things.

If that's not saying that we need to find a better way than the quick fixes we've been using then I'm capable of causing rifts in time and space with my flatulence.

Oh noes, he criticised the kernel and the devs!
OMG NO YUO!!!!!11111oneoneoneone /aol

>Readahead has got too comple

Anonymous
on
May 4, 2004 - 5:26pm

>Readahead has got too complex and is getting band-aidy. I'd prefer to
>tear it down and rethink things.
It is exactly this comment that shows that they are not doing quick fixes. They want to rethink things! That's exactly how development is supposed to be.

Ivory tower mentality

Anonymous
on
May 5, 2004 - 5:02am

you read what you want to read. The statement in the context of the article, *clearly* means that they have used quick fixes and its bitten them in the ass. Andrew then suggested that they tear it down and do it *right*. Meh, these semantics bore me.

That really depends on the po

Anonymous
on
May 6, 2004 - 2:18am

That really depends on the point of view. It would have been worth, if nobody has brought up that these are just quick fixes and should be done right. But than we would not have this discussion, because nobody would have known that it is done in a bad.

huh?

Anonymous
on
May 7, 2004 - 6:19pm

Is it your connection that's dropping packets, or your brain?

review

Anonymous
on
May 5, 2004 - 11:28am

We as a community need to keep closer watch of the kernel and make sure sloppy code does not persist!

We're in the stable series

on
May 4, 2004 - 10:55am

Keep in mind we're in the stable series. If a particular subsystem is deemed to be crufty, it'll can bandaided in the stable series and get reworked in the development series.

This is pretty typical software engineering as far as I can see.

A similar model is the trunk/branch model, where all the development occurs on the mainline (or trunk), and releases are branched off of there. Typically, only bugfixes are allowed on branches, and those fixes may even be short-sighted but low-risk fixes intended to be good enough to get the thing released. Longer term fixes (those that carry more short term risk) can occur on the trunk.

The Linux model differs from the trunk/branch model in that we don't retain a development trunk at all times.

but they can add some patch i

Anonymous
on
May 7, 2004 - 2:39am

but they can add some patch in the branch kernel,such as mm kernel tree or others future 2.6 stable branch for test

What kernel version is AM's patch against?

on
May 5, 2004 - 4:35am

I tried to apply the patch that AM put up, but it failed both on 2.6.5 and 2.6.6-rc3. Does anyone know which kernel version this patch will work against?

Interesting

Anonymous
on
May 5, 2004 - 4:41am

How does 2.2 compare? I'd assume 2.4 was an improvement (in this area, let alone others) over 2.2 but I'd be curious how it measures up. Or maybe I'm just silly....

I bet 2.0 was the fastest....

Anonymous
on
May 5, 2004 - 10:28am

I bet 2.0 was the fastest..... but maybe 2.2.
It would be cool to pinpoint the exact release where Linus started lossing all his common sense.

Detections of concurrent seq. reads wouldn't be too hard

Anonymous
on
May 5, 2004 - 12:12pm

For example a small table of structs could be used. Each struct would contain an address (of the last read address for this sequence), a time-stamp, a buffer/pointer to a buffer of previously read memory, the size of the buffer, the base address of the buffer (the first byte in the buffer matches a read address).

Then, when going to do a read, the table could be searched (linear time) for an address one less than the one being requested. If it is found then the buffer can be read from, and the entry updated. If not, then the LRU entry's data is overwritten with the data for the start of this sequence (the oldest entry can be found during the linear search).

In the example cited 100, 200, 101, 201, ... the table would only need to have 2 entries. For a desktop machine very few entries would likely be needed (8 would probably be enough for most users (gut feeling)). For servers that do a lot of file reading this should be set higher (e.g. 256, maybe even 1024). This would probably be a compile time option.

I must note that my knowledge of the kernel is practically nothing, and that I don't understand all of the issues involved. Still, I think that the proposal is a worthwhile one.

- Moti

It might be more complicated that that... dunno.

on
May 5, 2004 - 12:40pm

As Andrew pointed out, the prefetch mechanism is the mechanism by which larger block I/O requests are generated. I'm not exactly sure what that translates to--is the prefetch mechanism merging actual requests in addition to generating prefetches, or is it just a predictor that generates additional requests?

If the prefetch mechanism is actually merging requests, then to handle the interleaved case, you need to have a window of requests over which you can sort and merge. That can add latency.

If the prefetch mechanism is only trying to identify linear fetch sequences and is just generating additional large fetches ahead of the program, then a simple hashed lookup on block number could be enough to sort out the interleaved case. Basically, a small hash table (8 bins or similar), hashed on the upper and middle bits of the block number could make it a O(1) lookup.

Complexity

Anonymous
on
May 11, 2004 - 3:01am

When one continues to make "improvements" to an existing design, the resulting complexity increases eventually to a point of diminishing returns - a.k.a, one ends up with a pile of crap. I hope this scenario is not happening now with all the relatively new kernel developers working on 2.6.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.