This patch fixes a race between direct IO writes and non-direct IO reads on the same file. The symptom is a stale file page seen by any non-direct-IO reader, which persists until the page is invalidated somehow (e.g. page rewritten again, or memory pressure, or reboot). An improper return test caused direct-IO's after-write page invalidations to be skipped. If we're writing page N, and the reader is reading page N-x for small x, and the read code decides to readahead, it's not too hard to cause a race that leaves an old, stale copy of the page in the page cache. Retval is usually +nonzero after the mapping->a_ops->direct_IO call! Signed-off-by: Karl Schendel <kschendel@datallegro.com> --- By the way, I agree that the userland situation is stupid, and I'm addressing that in the application (happens to be the Ingres DBMS). However, the kernel shouldn't compound the stupidity. I'll try to watch for replies, but it would be very useful to cc me at kschendel@datallegro.com if any discussion is needed; I'm not subscribed to lkml. --- linux-2.6.23.1-base/mm/filemap.c 2007-10-12 12:43:44.000000000 -0400 +++ linux-2.6.23.1/mm/filemap.c 2007-10-26 16:12:08.000000000 -0400 @@ -2194,7 +2194,7 @@ generic_file_direct_IO(int rw, struct ki } retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs); - if (retval) + if (retval < 0) goto out; /* -
Hmm. If I read this right, this bug seems to have been introduced by
commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate clean
pages before dio write") back in March.
Before that, we'd call invalidate_inode_pages2_range() unconditionally
after the call mapping->a_ops->direct_IO() if it was a write and there
were cached pages in the mapping (well, "unconditionally" in the sense
that it didn't depend on the return value of the ->direct_IO() call).
However, with both the old and the new code _and_ with your patch, the
return code - in case the invalidate failed - was corrupted. So we may
actually end up doing some IO, but then returning the "wrong" error code
from the invalidate. Hmm?
Somebody who cares about direct-IO and who - unlike me - doesn't think
it's a total and idiotic crock should think hard about this. I'm including
Karl's email, but also an alternate patch for consideration.
And maybe some day we can all agree that direct_IO is crap and should not
be done.
Linus
--
diff --git a/mm/filemap.c b/mm/filemap.c
index 5209e47..032371a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2510,22 +2510,17 @@ generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
}
retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
- if (retval)
+ if (retval < 0)
goto out;
/*
* Finally, try again to invalidate clean pages which might have been
* faulted in by get_user_pages() if the source of the write was an
* mmap()ed region of the file we're writing. That's a pretty crazy
- * thing to do, so we don't support it 100%. If this invalidation
- * fails and we have -EIOCBQUEUED we ignore the failure.
+ * thing to do, so we don't support it 100%.
*/
- if (rw == WRITE && mapping->nrpages) {
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
- if (err && retval >= 0)
- retval = err;
- }
+ if (rw == WRITE && ...A point. In an all-seeing, all-caring universe, it would be the read
hitting the cached page that couldn't be invalidated that would get
the error, not the write. I can't get too worked up over that, though.
In any case, as you say, if the write worked it should report as such.
Perhaps this equivalent but slightly cleaned-up patch instead?
Karl
--- linux-2.6.23.1-base/mm/filemap.c 2007-10-12 12:43:44.000000000 -0400
+++ linux-2.6.23.1/mm/filemap.c 2007-10-26 18:00:00.000000000 -0400
@@ -2194,21 +2194,17 @@ generic_file_direct_IO(int rw, struct ki
}
retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
- if (retval)
- goto out;
/*
* Finally, try again to invalidate clean pages which might have been
- * faulted in by get_user_pages() if the source of the write was an
- * mmap()ed region of the file we're writing. That's a pretty crazy
- * thing to do, so we don't support it 100%. If this invalidation
- * fails and we have -EIOCBQUEUED we ignore the failure.
+ * cached by non-direct readahead, or faulted in by get_user_pages()
+ * if the source of the write was an mmap'ed region of the file
+ * we're writing. Either one is a pretty crazy thing to do,
+ * so we don't support it 100%. If this invalidation
+ * fails, tough, the write still worked...
*/
- if (rw == WRITE && mapping->nrpages) {
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
- if (err && retval >= 0)
- retval = err;
+ if (retval >= 0 && rw == WRITE && mapping->nrpages) {
+ invalidate_inode_pages2_range(mapping, offset >> PAGE_CACHE_SHIFT, end);
}
out:
return retval;
-
Agreed. And it's a really dumb bug. ->direct_io will almost always return -EIOCBQUEUED for aio dio so it won't be invalidating for aio dio writes. (Notice that the testing in that commit mentions two racing processes, I bet U$1M that I only tested sync dio :/) I think that test should be changed to if (retval < 0 && retval != -EIOCBQUEUED) If the invalidation fails then the app is left with stale data in the page cache and current data on disk. The return code corruption you're referring to was intended to communicate this scary situation to the app with EIO. It sucks. Does it suck more than returning success for the dio write when later buffered reads will return stale data? I dunno. What does Chris (Mason) and I certainly love the idea of getting rid of fs/direct-io.c. Getting O_DIRECT working with the page-granular buffered locking rules while doing large IOs (and, as far as I know, potentially sector-granular) without noticeable performance regressions is a mess. - z -
Well, actually, in this case both processes are doing sync IO. It's just that the writer is direct and the reader isn't, with the reader usually behind the writer but close enough that readahead crosses the writer reasonably often. With the if (retval) test, we only invalidate if nothing got written at all! I hadn't even thought of aio directio. Yeah, the invalidate should happen for -EIOCBQUEUED as well, I guess. The disparity in direct-io-ness on the part of the reader vs writer is a userland dum-dum, no question. (That's getting fixed, too.) Karl -
How about not testing at all? Which was what the old code did. Just do the invalidate unconditionally for any writes, and screw the end result of the invalidate, since we cannot afford to overwrite the previous return value anyway in any realistic scenario? Linus -
I'm reasonably comfortable with that, sure. This second invalidation only catches reads which userspace raced with the write, and that's already racy by definition. I can throw together a patch if you haven't already committed one by the time you read this ;). - z -
I'm not touching that code except to send out possible patches for others to test and comment on. I have no test-cases, nor any real interest in it. So yeah, please send me a tested and thought-through patch. It sounded like Karl actually had a test-case to trigger this, but maybe I'm confused (and it sounds a bit unlikely, since it should be hard to trigger.. Although maybe you can trigger it by doing a direct_IO write with the *source* being an mmap() of the file you're writing to, and depending on the write itself re-populating the destination in the page cache?) Karl? Linus -
Yes, I do - I'd been tripping over it once every couple weeks,
and I finally figured out how to hold my mouth right so that it
fails (almost) every time.
The below 3rd try takes on your suggestion of just invalidating
no matter what the retval from the direct_IO call. I ran it
thru the test-case several times and it has worked every time.
The post-invalidate is probably still too early for async-directio,
but I don't have a testcase for that; just sync. And, this
won't be any worse in the async case.
Karl
---
--- linux-2.6.23.1-base/mm/filemap.c 2007-10-12 12:43:44.000000000 -0400
+++ linux-2.6.23.1/mm/filemap.c 2007-10-26 19:21:20.000000000 -0400
@@ -2194,21 +2194,17 @@ generic_file_direct_IO(int rw, struct ki
}
retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
- if (retval)
- goto out;
/*
* Finally, try again to invalidate clean pages which might have been
- * faulted in by get_user_pages() if the source of the write was an
- * mmap()ed region of the file we're writing. That's a pretty crazy
- * thing to do, so we don't support it 100%. If this invalidation
- * fails and we have -EIOCBQUEUED we ignore the failure.
+ * cached by non-direct readahead, or faulted in by get_user_pages()
+ * if the source of the write was an mmap'ed region of the file
+ * we're writing. Either one is a pretty crazy thing to do,
+ * so we don't support it 100%. If this invalidation
+ * fails, tough, the write still worked...
*/
if (rw == WRITE && mapping->nrpages) {
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
- if (err && retval >= 0)
- retval = err;
+ invalidate_inode_pages2_range(mapping, offset >> PAGE_CACHE_SHIFT, end);
}
out:
return retval;
-
OK, I tested and verified Karl's fix and wrote some commentary around it. (Would a aio-dio git repo on kernel.org for these kind of fixes be well received?) ---- dio: fix cache invalidation after sync writes Commit commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate clean pages before dio write") introduced a bug which stopped dio from ever invalidating the page cache after writes. It still invalidated it before writes so most users were fine. Karl Schendel reported hitting this bug ( http://lkml.org/lkml/2007/10/26/481 ) when he had a buffered reader immediately reading file data after an O_DIRECT wirter had written the data. The kernel issued read-ahead beyond the position of the reader which overlapped with the O_DIRECT writer. The failure to invalidate after writes caused the reader to see stale data from the read-ahead. The following patch is originally from Karl. The following commentary is his: The below 3rd try takes on your suggestion of just invalidating no matter what the retval from the direct_IO call. I ran it thru the test-case several times and it has worked every time. The post-invalidate is probably still too early for async-directio, but I don't have a testcase for that; just sync. And, this won't be any worse in the async case. I added a test to the aio-dio-regress repository which mimics Karl's IO pattern. It verifed the bad behaviour and that the patch fixed it. I agree with Karl, this still doesn't help the case where a buffered reader follows an AIO O_DIRECT writer. That will require a bit more work. This gives up on the idea of returning EIO to indicate to userspace that stale data remains if the invalidation failed. Signed-off-by: Zach Brown <zach.brown@oracle.com> --- linux-2.6.23.1-base/mm/filemap.c 2007-10-12 12:43:44.000000000 -0400 +++ linux-2.6.23.1/mm/filemap.c 2007-10-26 19:21:20.000000000 -0400 @@ -2194,21 +2194,17 @@ generic_file_direct_IO(int rw, struct ki } retval = ...
If it's one of these "in a blue moon" things I don't think it matters. Whatever is easier. In general, patches have lots of advantages: (a) you can apply it regardless of version, and (b) pushing the email containing them back and forth with commentary etc is very powerful. So I would generally see git as a "maintainer handling issue", not a "these kinds of fixes" issue. Git doesn't obviate the need for having people look at patches (and that implies sending them out). But git _is_ a good way to push the finished product out, if it's a recurring thing. Linus -
Sure thing. It looks like Karl just sent out the patch I had in mind, so I'll run it through the tests on Monday. I assume everyone can It's disappointingly easy to trigger with ext3 and ordered extending writes. The invalidation can race with jbd pinning the bhs while it commits the transaction. It happens that ext3 returns failure from ->releasepage if jbd is committing the transaction. dio will definitely bring the page in with get_user_pages() but I'd like to think that that pinned page could be safely invalidated out of the page cache while dio holds a page reference. I've never tried it, though. - z -
