According to vfs.txt, ->writepage() may return AOP_WRITEPAGE_ACTIVATE back
to the VFS/VM. Indeed some filesystems such as tmpfs can return
AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also
return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it.
Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes
returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland.
Therefore, some user programs fail, esp. if they're written such as this:
err = msync(...);
if (err != 0)
// fail
They temporarily fixed the specific program in question (apt-get) to check
if (err < 0)
// fail
Is this a bug indeed, or are user programs supposed to handle
AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, what
should the kernel return: a zero, or an -errno (and which one)?
Thanks,
Erez.
-
On Sun, 7 Oct 2007 15:20:19 -0400
shit. That's a nasty bug. Really userspace should be testing for -1, but
the msync() library function should only ever return 0 or -1.
Does this fix it?
--- a/mm/page-writeback.c~a
+++ a/mm/page-writeback.c
@@ -850,8 +850,10 @@ retry:
ret = (*writepage)(page, wbc, data);
- if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
+ if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
unlock_page(page);
+ ret = 0;
+ }
if (ret || (--(wbc->nr_to_write) <= 0))
done = 1;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
_
-
Pekka Enberg replied with an identical patch a few days ago, but for some reason the same condition flows up to msync as -1 EIO instead of AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the thread is below. Thanks. Ryan -
Each time I sit down to follow what's going on with writepage and
unionfs and msync, I get distracted: I really haven't researched
this properly.
But I keep suspecting that the answer might be the patch below (which
rather follows what drivers/block/rd.c is doing). I'm especially
worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned
to userspace, bad enough in itself, you might be liable to hit that
BUG_ON(page_mapped(page)). shmem_writepage does not expect to be
called by anyone outside mm/vmscan.c, but unionfs can now get to it?
Please let us know if this patch does fix it:
then I'll try harder to work out what goes on.
Thanks,
Hugh
--- 2.6.23/mm/shmem.c 2007-10-09 21:31:38.000000000 +0100
+++ linux/mm/shmem.c 2007-10-12 01:25:46.000000000 +0100
@@ -916,6 +916,11 @@ static int shmem_writepage(struct page *
struct inode *inode;
BUG_ON(!PageLocked(page));
+ if (!wbc->for_reclaim) {
+ set_page_dirty(page);
+ unlock_page(page);
+ return 0;
+ }
BUG_ON(page_mapped(page));
mapping = page->mapping;
-
Hi Hugh,
Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages()
without unionfs even?
Pekka
-
I believe not. Please do double-check my assertions, I've always found the _writepages paths rather twisty; but my belief (supported by the fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) in five years) is that tmpfs/shmem opts out of all of that with its .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, in shmem_backing_dev_info, which avoids all those _writepages avenues (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is just a subfunction of the _writepages. So, while I don't disagree with your patch to write_cache_pages (though it wasn't clear to me whether it should break from or continue the loop if it ever does meet an AOP_WRITEPAGE_ACTIVATE), I don't think that's really the root of the problem. Hugh -
Hi Hugh,
Thanks for the explanation, you're obviously correct.
However, I don't think the mapping_cap_writeback_dirty() check in
__filemap_fdatawrite_range() works as expected when tmpfs is a lower
mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability
for unionfs mappings so do_fsync() will call write_cache_pages() that
unconditionally invokes shmem_writepage() via unionfs_writepage().
Unless, of course, there's some other unionfs magic I am missing.
Pekka
-
In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's ->writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. Should I be doing something different when unionfs stacks on top of tmpfs? (BTW, this is probably also relevant to ecryptfs.) Thanks, Erez. -
Hi Erez,
Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be
calling unionfs_writepage() _at all_ if the lower mapping has
BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally
untested patch below?
Pekka
---
fs/unionfs/mmap.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
Index: linux-2.6.23-rc8/fs/unionfs/mmap.c
===================================================================
--- linux-2.6.23-rc8.orig/fs/unionfs/mmap.c
+++ linux-2.6.23-rc8/fs/unionfs/mmap.c
@@ -17,6 +17,7 @@
* published by the Free Software Foundation.
*/
+#include <linux/backing-dev.h>
#include "union.h"
/*
@@ -144,6 +145,21 @@ out:
return err;
}
+static int unionfs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct inode *lower_inode;
+ struct inode *inode;
+
+ inode = mapping->host;
+ lower_inode = unionfs_lower_inode(inode);
+
+ if (!mapping_cap_writeback_dirty(lower_inode->i_mapping))
+ return 0;
+
+ return generic_writepages(mapping, wbc);
+}
+
/*
* readpage is called from generic_page_read and the fault handler.
* If your file system uses generic_page_read for the read op, it
@@ -371,6 +387,7 @@ out:
struct address_space_operations unionfs_aops = {
.writepage = unionfs_writepage,
+ .writepages = unionfs_writepages,
.readpage = unionfs_readpage,
.prepare_write = unionfs_prepare_write,
.commit_write = unionfs_commit_write,
-
[...] Pekka, with a small change to your patch (to handle time-based cache coherency), your patch worked well and passed all my tests. Thanks. So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE from being returned to userland. I guess we still need it, b/c even with your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to the VFS and we need to ensure that doesn't "leak" outside the kernel. Erez. -
Hi,
I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that
->writepage() will never return AOP_WRITEPAGE_ACTIVATE for
!wbc->for_reclaim case which would explain why we haven't hit this bug
before. Hugh, Andrew?
And btw, I think we need to fix ecryptfs too.
Pekka
-
Yes, ecryptfs needs this fix too (and probably a couple of other mmap fixes I've made to unionfs recently -- Mike Halcrow already knows :-) Of course, running ecryptfs on top of tmpfs is somewhat odd and uncommon; Erez. -
Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc->for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. I think this is where people use the phrase "go figure" ;) Hugh -
Hi Hugh,
Heh. As far as I can tell, the implication of "wider use" was added by
Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
VFS documentation", so perhaps he might know? Neil?
Pekka
-
I take as gospel this extract from Andrew's original 2.5.52 comment: So the locking rules for writepage() are unchanged. They are: - Called with the page locked - Returns with the page unlocked - Must redirty the page itself if it wasn't all written. But there is a new, special, hidden, undocumented, secret hack for tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move the page to the active list. The page must be kept locked in this one case. Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew stole his own secret and used it when concocting ramdisk_writepage. Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil revealed the secret to the uninitiated in 2.6.17: now, what's the appropriate punishment for that? In the full 2.5.52 comment, Andrew explains how prior to this secret code, we used fail_writepage, which in the memory pressure case did an activate_page, with the intention of moving the page to the active list - but that didn't actually work, because the page is off the LRUs at this point, being passed around between pagevecs. I've always preferred the way it was originally trying to do it, which seems clearer and less error-prone than having a special code which people then have to worry about. Here's the patch I'd like to present in due course (against 2.6.24-rc1, so unionfs absent): tmpfs and ramdisk simply SetPageActive for this case (and go back to obeying the usual unlocking rule for writepage), vmscan.c observe and act accordingly. But I've not tested it at all (well, I've run with it in, but not actually going down the paths in question): it may suffer from something silly like the original fail_writepage. Plus I might be persuaded into making inc_zone_page_state(page, NR_VMSCAN_WRITE) conditional on !PageActive(page), just to produce the same stats as before (though they don't make a lot of sense, counting other non-writes as writes). And would it need a deprecation ...
That's a nice historical review, Huge, of how got into these mess we're in now -- it all starts with good intentions. :-) On a related note, I would just love to get rid of calling the lower ->writepage in unionfs b/c I can't even tell if I have a lower page to use all the time. I'd prefer to call vfs_write() if I can, but I'll need a struct file, or at least a dentry. What ecryptfs does is store a struct file inside it's inode, so it can use it later in ->writepage to call vfs_write on the lower f/s. And Unionfs may have to go in that direction too, but this trick is not terribly clean -- storing a file inside an inode. I realize that the calling path to ->writepage doesn't have a file/dentry any more, but if we're considering larger changes to the writepage related code, can we perhaps consider passing a file or dentry to >writepage (same as commit_write, perhaps). Thanks, Erez. -
Why do you want to do that? You gave a good reason why it's easier for ecryptfs, but I doubt it's robust. The higher the level you choose to use, the harder to guarantee it won't deadlock. Or that's my gut feeling anyway. It's several years since I've thought about such issues: just because I came into this knowing about shmem_writepage, is perhaps not a good reason to choose me as advisor! Hugh -
Hi Hugh,
I don't see ClearPageReclaim added anywhere. Is that done on purpose?
Other than that, the patch looks good to me and I think we should
stick it into -mm to punish Andrew for his secret hack ;-).
Pekka
-
Yes, based on vfs.txt I figured unionfs should return AOP_WRITEPAGE_ACTIVATE. But, now that unionfs has ->writepages which won't even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I no longer need unionfs_writepage to bother checking for AOP_WRITEPAGE_ACTIVATE, or even return it up? But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to Erez. -
unionfs_writepage returns it in two different cases: when it can't find the underlying page; and when the underlying writepage returns it. I'd say it's wrong to return it in both cases. In the first case, you don't really want your page put back to the head of the active list, you want to come back to try it again quite soon (I think): so you should just redirty and unlock and pretend success. ramdisk uses A_W_A because none of its pages will ever become freeable (and comment points out it'd be better if they weren't even on the LRUs - I think several people have recently been putting forward patches to keep such timewasters off the LRUs). shmem uses A_W_A when there's no swap (left), or when the underlying shm is marked as locked in memory: in each case, best to move on to look for other pages to swap out. (But I'm not quite convincing myself that the temporarily out-of-swap case is different from yours above.) It's about fixing some horrid busy loops where vmscan kept going over the same hopeless pages repeatedly, instead of moving on to better candidates. Oh, there's a third case, when move_to_swap_cache fails: that's rare, and I think I was just too lazy to separate them. In your second case, I fail to see why the unionfs level should mimic the lower level: you've successfully copied data and marked the lower level pages as dirty, vmscan will come back to those in due course, but it's just a waste of time for it to come back to unionfs_writepages handles the sync/msync/fsync leaking of A_W_A to userspace issue, as does Pekka & Andrew's patch to write_cache_pages, as does my patch to shmem_writepage. And I'm contending that unionfs_writepage should in no case return A_W_A up. But so long as A_W_A is still defined, unionfs_writepage does still need to check for it after calling the lower level ->writepage (because it needs to do the missing unlock_page): unionfs_writepages prevents unionfs_writepage being called on the normal writeout path, but it's still ...
It's possible to provoke unionfs (not yet in mainline, though in mm and some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)). I expect it's possible to provoke the 2.6.23 ecryptfs in the same way (but the 2.6.24 ecryptfs no longer calls lower level's ->writepage). This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could leak from tmpfs via write_cache_pages and unionfs to userspace. There's already a fix (e423003028183df54f039dfda8b58c49e78c89d7 - writeback: don't propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so far as it goes; but insufficient because it doesn't address the underlying issue, that shmem_writepage expects to be called only by vmscan (relying on backing_dev_info capabilities to prevent the normal writeback path from ever approaching it). That's an increasingly fragile expectation, and ramdisk_writepage (the other source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check wbc->for_reclaim before returning it. Make the same check in shmem_writepage, thereby sidestepping the page_mapped BUG also. Signed-off-by: Hugh Dickins <hugh@veritas.com> --- Unionfs intends its own, third fix to these issues, checking backing_dev_info capabilities as the normal writeback path does. And I intend a fourth fix, getting rid of AOP_WRITEPAGE_ACTIVATE entirely (mainly to put a stop to everybody asking what it means and when it happens and how to handle it) - but that's a slightly bigger patch, needing a little more testing, probably for 2.6.25. I've CC'ed this to stable as you did for the write_cache_pages fix: it's probably required for ecryptfs (but unionfs was much easier to set up and test), and helpful to distros using unionfs and checking stable for fixes. Does this make the write_cache_pages fix redundant? Probably, but let's have both in for safety. mm/shmem.c | 5 +++++ 1 file changed, 5 insertions(+) --- 2.6.24-rc1/mm/shmem.c 2007-10-24 07:16:04.000000000 +0100 +++ linux/mm/shmem.c 2007-10-24 ...
On Wed, 24 Oct 2007 22:02:15 +0100 (BST) Needs a comment, methinks. -
It's possible to provoke unionfs (not yet in mainline, though in mm and some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)). I expect it's possible to provoke the 2.6.23 ecryptfs in the same way (but the 2.6.24 ecryptfs no longer calls lower level's ->writepage). This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could leak from tmpfs via write_cache_pages and unionfs to userspace. There's already a fix (e423003028183df54f039dfda8b58c49e78c89d7 - writeback: don't propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so far as it goes; but insufficient because it doesn't address the underlying issue, that shmem_writepage expects to be called only by vmscan (relying on backing_dev_info capabilities to prevent the normal writeback path from ever approaching it). That's an increasingly fragile assumption, and ramdisk_writepage (the other source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check wbc->for_reclaim before returning it. Make the same check in shmem_writepage, thereby sidestepping the page_mapped BUG also. Signed-off-by: Hugh Dickins <hugh@veritas.com> --- Unionfs intends its own, third fix to these issues, checking backing_dev_info capabilities as the normal writeback path does. And I intend a fourth fix, getting rid of AOP_WRITEPAGE_ACTIVATE entirely (mainly to put a stop to everybody asking what it means and when it happens and how to handle it) - but that's a slightly bigger patch, needing a little more testing, probably for 2.6.25. I've CC'ed this to stable as you did for the write_cache_pages fix: it's probably required for ecryptfs (but unionfs was much easier to set up and test), and helpful to distros using unionfs and checking stable for fixes. Does this make the write_cache_pages fix redundant? Probably, but let's have both in for safety. mm/shmem.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) --- 2.6.24-rc1/mm/shmem.c 2007-10-24 07:16:04.000000000 +0100 +++ ...
Hi Hugh,
I find the above bit somewhat misleading as it implies that the
!wbc->for_reclaim case can be removed after ecryptfs has similar fix
as unionfs. Can we just say that while BDI_CAP_NO_WRITEBACK does
prevent some callers from entering ->writepage(), it's just an
optimization and ->writepage() must deal with !wbc->for_reclaim case
properly?
Pekka
-
Sorry for being obtuse, but I don't see how that's misleading at all. ecryptfs already has a (dissimilar) fix in 2.6.24-rc1, not using the writepage route at all. But it remains the case that some stacking filesystem may (would you prefer "might" to "may"? "may" has a nice double meaning of "might" and "we'll allow it", but this patch does indeed allow it) use the ->writepage of its underlying filesystem. With unionfs also fixed, we don't know of an absolute need for this patch (and so, on that basis, the !wbc->for_reclaim case could indeed be removed very soon); but as I see it, the unionfs case has shown that it's time to future-proof this code against whatever stacking filesystems come along. Hence I didn't mention the names of such filesystems in the source comment. The !page_mapped assumption has been built in there since earliest 2.4, but it took a while for us to get a way to express it in a BUG. Hugh -
Hi Hugh,
Heh, what can I say, after several readings, I still find your above
explanation (which I totally agree with) more to the point than the
actual comment :-).
In any case, the patch looks good to me.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Pekka
-
I think "future proof" for other stackable f/s is a good idea, esp. since many of the stackable f/s we've developed and distributed over the past 10 years are in some use in various places: gzipfs, avfs, tracefs, replayfs, ncryptfs, versionfs, wrapfs, i3fs, and more (see www.filesystems.org). Cheers, Erez. -
A number of filesystems want partial or full stackability, so getting rid of lack-of-stackability whereever it may be is highly valuable. -hpa -
... I don't disagree with your unionfs_writepages patch, Pekka, but I think it should be viewed as an optimization (don't waste time trying to write a group of pages when we know that nothing will be done) rather than as essential. Prior to unionfs's own use of AOP_WRITEPAGE_ACTIVATE, there have only been ramdisk and shmem generating it. ramdisk is careful only to return it in the wbc->for_reclaim case: I think (as in the patch I sent out before) shmem now ought to do so too for safety. Back in 2.4 days it was reasonable to assume that ->writepage would only get called from certain places, but things move faster nowadays, and the unionfs example shows others are liable to start ab/using it. I'll send Andrew that patch tomorrow (it's simple enough, but I'd Can it now? Current git has a patch from Andrew which bears a striking resemblance to that from Pekka, stopping the leak from write_cache_pages. Hugh -
Hi Hugh,
Ok, so tmpfs needs your fix still.
I don't think it can, it looks ok now.
Pekka
-
Sorry for my delay, here are a few replies. I think that's inappropriate. Why should unionfs_writepage re-mark its page as dirty when the lower level does so? Unionfs has successfully done its write to the lower level, what the lower level then gets up to (writing then or not) is its own business: needn't be propagated upwards. The fewer places that supply AOP_WRITEPAGE_ACTIVATE the better. What I'd like most of all is to eliminate it, in favour of vmscan.c working out the condition for itself: but I've given that no thought, it may not be reasonable. unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot find_lock_page: that case may be appropriate. Though I don't really understand it: seems dangerous to be relying upon the lower level page just happening to be there already. Isn't memory pressure then likely to clog up with lots of upper level dirty pages which cannot get You're both agreed on that, but I don't see how: ecryptfs writes the lower level via vfs_write, it's not using the lower level's writepage, is it? Hugh -
What's the precise semantics of AOP_WRITEPAGE_ACTIVATE? Is it considered an error or not? If it's an error, then I usually feel that it's important for a stacked f/s to return that error indication upwards. The unionfs page and the lower page are somewhat tied together, at least logically. For unionfs's page to be considered to have been written successfully, the lower page has to be written successfully. So again, if the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs page to have been written successfully or not? If I don't return AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may never get flushed out? Anyway, now that unionfs has ->writepages that won't bother calling ->write for file systems with BDI_CAP_NO_WRITEBACK, the issue of Based on vfs.txt (which perhaps should be revised :-), I was trying to do the best I can to ensure that no data is lost if the current page cannot be written out to the lower f/s. I used to do grab_cache_page() before, but that caused problems: writepage is not the right place to _increase_ memory pressure by allocating a new page... One solution I thought of is do what ecryptfs does: keep an open struct file in my inode and call vfs_write(), but I don't see that as a significantly cleaner/better solution. (BTW, ecrypfts kinda had to go for vfs_write b/c it changes the data size and content of what it writes below; unionfs is simpler in that manner b/c it needs to write the same data to the lower file at the same offset.) Another idea we've experimented with before is "page pointer flipping." In writepage, we temporarily set the page->mapping->host to the lower_inode; then we call the lower writepage with OUR page; then fix back the page->mapping->host to the upper inode. This had two benefits: first we can guarantee that we always have a page to write below, and second we don't need to keep both upper and lower pages (reduces memory pressure). Before we did this page pointer ...
Sigh - not at you, at it! It's a secret that couldn't be kept secret, No, it's definitely not an error. It'a a private note from tmpfs (or ramdisk) to vmscan, saying "don't waste your time coming back to me with this page until you have to, please move on to another Indeed, but this is not an error. Remember, neither ramdisk nor tmpfs is stable storage: okay, tmpfs can go out to disk by using swap, but that's not stable storage - it's not reconstituted after reboot. (If there's an error in writing to swap, well, that's a different issue; and there's few filesystems where such an I/O Consider it written successfully. (What does written mean with tmpfs? Things should work better if you don't return AOP_WRITEPAGE_ACTIVATE. If you mark your page as clean and successfully written, vmscan will be able to free it. If needed, we can get the data back from the lower page on demand, but meanwhile a page has been freed, which is what vmscan reclaim is all about. (But of course, in the case where you couldn't get hold of a page for the lower, you must redirty Yes, but just hoping the lower page will be there, and doing nothing to encourage it to become there, sounds an even poorer strategy to me. It's not easy, I know. Your position reminds me of the loop driver (drivers/block/loop.c), which has long handled this situation (with great success, though I doubt an absolute guarantee) by taking __GFP_IO|__GFP_FS off the mapping_gfp_mask of the underlying file: look for gfp_mask in loop_set_fd() (and I think ignore do_loop_switch(), that's new to me and seems to be for a very special case). I grepped for gfp in unionfs, and there seems to be nothing: I doubt you can be robust under memory pressure without doing something about that. If you can take __GFP_IO|__GFP_FS off the lower mapping (just while in unionfs_writepage, or longer term? what locking needed?), I wouldn't call it ugly, but it is exceptional and dangerous and cannot be sanctioned without a great deal of ...
Hugh, thanks for the great explanations and suggestions (in multiple emails). I'm going to test all of those soon. Erez. -
Huge,
I took your advise regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
and such. I revised my unionfs_writepage and unionfs_sync_page, and tested
it under memory pressure: I have a couple of live CDs that use tmpfs and can
deterministically reproduce the conditions resulting in A_W_A. I also went
back to using grab_cache_page but with the gfp_mask suggestions you made.
I'm happy to report that it all works great now! Below is the entirety of
the new unionfs_mmap and unionfs_sync_page code. I'd appreciate if you and
others can look it over and see if you find any problems.
Thanks,
Erez.
static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
{
int err = -EIO;
struct inode *inode;
struct inode *lower_inode;
struct page *lower_page;
char *kaddr, *lower_kaddr;
struct address_space *mapping; /* lower inode mapping */
gfp_t old_gfp_mask;
inode = page->mapping->host;
lower_inode = unionfs_lower_inode(inode);
mapping = lower_inode->i_mapping;
/*
* find lower page (returns a locked page)
*
* We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under
* memory pressure conditions. This is similar to how the loop
* driver behaves (see loop_set_fd in drivers/block/loop.c).
* If we can't find the lower page, we redirty our page and return
* "success" so that the VM will call us again in the (hopefully
* near) future.
*/
old_gfp_mask = mapping_gfp_mask(mapping);
mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));
lower_page = grab_cache_page(mapping, page->index);
mapping_set_gfp_mask(mapping, old_gfp_mask);
if (!lower_page) {
err = 0;
set_page_dirty(page);
goto out;
}
/* get page address, and encode it */
kaddr = kmap(page);
lower_kaddr = kmap(lower_page);
memcpy(lower_kaddr, kaddr, PAGE_CACHE_SIZE);
kunmap(page);
kunmap(lower_page);
BUG_ON(!mapping->a_ops->writepage);
/* call lower writepage (expects locked page) ...... but still a few problems, I'm afraid. The greatest problem is a tmpfs one, that would be for me to solve. On reflection, I think I went too far in asking you to mask off __GFP_IO. Loop has to do so because it's a block device, down towards the IO layer; but unionfs is a filesystem, so masking off __GFP_FS is enough to prevent recursion into the FS layer with danger of deadlock, and leaving __GFP_IO Hmm, several points on that. When suggesting something like this, I did remark "what locking needed?". You've got none: which is problematic if two stacked mounts are playing with the same underlying file concurrently (yes, userspace would have a data coherency problem in such a case, but the kernel still needs to worry about its own internal integrity) - you'd be in danger of masking (__GFP_IO|__GFP_FS) permanently off the underlying file; and furthermore, losing error flags (AS_EIO, AS_ENOSPC) which share the same unsigned long. Neither likely but both wrong. See the comment on mapping_set_gfp_mask() in include/pagemap.h: * This is non-atomic. Only to be used before the mapping is activated. Strictly speaking, I guess loop was a little bit guilty even when just loop_set_fd() did it: the underlying mapping might already be active. It appears to be just as guilty as you in its do_loop_switch() case (done at BIO completion time), but that's for a LOOP_CHANGE_FD ioctl which would only be expected to be called once, during installation; whereas you're using mapping_set_gfp_mask here with great frequency. Another point on this is: loop masks __GFP_IO|__GFP_FS off the file for the whole duration while it is looped, whereas you're flipping it just in this preliminary section of unionfs_writepage. I think you're probably okay to be doing it only here within ->writepage: I think loop covered every operation because it's at the block device level, perhaps both reads and writes needed to serve reclaim at the higher FS level; and also easier to do it once for all. Are you ...
Hi Hugh, I've addressed all of your concerns and am happy to report that the
newly revised unionfs_writepage works even better, including under my
memory-pressure conditions. To summarize my changes since the last time:
- I'm only masking __GFP_FS, not __GFP_IO
- using find_or_create_page to avoid locking issues around mapping mask
- handle for_reclaim case more efficiently
- using copy_highpage so we handle KM_USER*
- un/locking upper/lower page as/when needed
- updated comments to clarify what/why
- unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used
to have it)
Below is the newest version of unionfs_writepage. Let me know what you
think.
I have to say that with these changes, unionfs appears visibly faster under
memory pressure. I suspect the for_reclaim handling is probably the largest
contributor to this speedup.
Many thanks,
Erez.
//////////////////////////////////////////////////////////////////////////////
static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
{
int err = -EIO;
struct inode *inode;
struct inode *lower_inode;
struct page *lower_page;
struct address_space *lower_mapping; /* lower inode mapping */
gfp_t mask;
inode = page->mapping->host;
lower_inode = unionfs_lower_inode(inode);
lower_mapping = lower_inode->i_mapping;
/*
* find lower page (returns a locked page)
*
* We turn off __GFP_FS while we look for or create a new lower
* page. This prevents a recursion into the file system code, which
* under memory pressure conditions could lead to a deadlock. This
* is similar to how the loop driver behaves (see loop_set_fd in
* drivers/block/loop.c). If we can't find the lower page, we
* redirty our page and return "success" so that the VM will call us
* again in the (hopefully near) future.
*/
mask = mapping_gfp_mask(lower_mapping) & ~(__GFP_FS);
lower_page = find_or_create_page(lower_mapping, page->index, mask);
if (!lower_page) {
err = ...[Dave, I've Cc'ed you re handle_write_count_underflow, see below.] That's good news, and that unionfs_writepage looks good to me - with three reservations I've not observed before. One, I think you would be safer to do a set_page_dirty(lower_page) before your clear_page_dirty_for_io(lower_page). I know that sounds silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io: there's a lot of subtlety hereabouts, and I think you'd be mimicing the usual path closer if you set_page_dirty first - there's nothing else doing it on that lower_page, is there? I'm not certain that you need to, but I think you'd do well to look into it and make up your own mind. Two, I'm unsure of the way you're clearing or setting PageUptodate on the upper page there. The rules for PageUptodate are fairly obvious when reading, but when a write fails, it's not so obvious. Again, I'm not saying what you've got is wrong (it may be unavoidable, to keep synch between lower and upper), but it deserves a second thought. Three, I believe you need to add a flush_dcache_page(lower_page) after the copy_highpage(lower_page): some architectures will need that to see the new data if they have lower_page mapped (though I expect it's anyway shaky ground to be accessing through the lower mount at the same time as modifying through the upper). I've been trying this out on 2.6.23-mm1 with your 21 Oct 1-9/9 and your 2 Nov 1-8/8 patches applied (rejects being patches which were already in 2.6.23-mm1). I was hoping to reproduce the BUG_ON(entry->val) that I fear from shmem_writepage(), before fixing it; but not seen that at all yet - that might be good news, but it's more likely I just haven't tried hard enough yet. For now I'm doing repeated make -j20 kernel builds, pushing into swap, in a unionfs mount of just a single dir on tmpfs. This has shown up several problems, two of which I've had to hack around to get further. The first: I very quickly hit "BUG: atomic counter underflow" from ...
I've never actually seen this happen in practice, but I do know exactly
Actually, I think your s/while/if/ change is probably a decent fix.
Barring any other races, that loop should always have made progress on
and don't have a positive mnt->__mnt_writers, we know something is going
badly. We WARN_ON() there, which should at least give an earlier
warning that the system is not doing well. But it doesn't fix the
inevitable. Could you try the attached patch and see if it at least
warns you earlier?
The r/o bind mount code will mnt_drop_write() on that file's f_vfsmnt at
__fput() time. Since that code never got a write on the mount, we'll
see an imbalance if the file was opened for a write. I don't see this
file's mnt set anywhere, so I'm not completely sure that this is it. In
any case, rolling your own 'struct file' without using alloc_file() and
friends is a no-no.
BTW, I have some "debugging" code in my latest set of patches that I
think should fix this kind of imbalance with the mnt->__mnt_writers().
It ensures that before we do that mnt_drop_write() at __fput() that we
absolutely did a mnt_want_write() at some point in the 'struct file's
life.
-- Dave
linux-2.6.git-dave/fs/namespace.c | 31 ++++++++++++++++++++++---------
linux-2.6.git-dave/include/linux/mount.h | 1 +
2 files changed, 23 insertions(+), 9 deletions(-)
diff -puN fs/namei.c~fix-naughty-loop fs/namei.c
diff -puN fs/namespace.c~fix-naughty-loop fs/namespace.c
--- linux-2.6.git/fs/namespace.c~fix-naughty-loop 2007-11-05 08:03:59.000000000 -0800
+++ linux-2.6.git-dave/fs/namespace.c 2007-11-05 08:35:06.000000000 -0800
@@ -225,16 +225,29 @@ static void lock_and_coalesce_cpu_mnt_wr
*/
static void handle_write_count_underflow(struct vfsmount *mnt)
{
- while (atomic_read(&mnt->__mnt_writers) <
- MNT_WRITER_UNDERFLOW_LIMIT) {
- /*
- * It isn't necessary to hold all of the locks
- * at the same time, but doing it this way makes
- * us share a lot more code.
- ...Thanks, Dave, yes, that gives me a nice warning: leak detected on mount(c25ebd80) writers count: -65537 WARNING: at fs/namespace.c:249 handle_write_count_underflow() [<c0103486>] show_trace_log_lvl+0x1b/0x2e [<c01034b6>] show_trace+0x16/0x1b [<c0103589>] dump_stack+0x19/0x1e [<c0171906>] handle_write_count_underflow+0x4c/0x60 [<c0171983>] mnt_drop_write+0x69/0x8e [<c0160211>] __fput+0xff/0x162 [<c016010d>] fput+0x2e/0x33 [<c01b8f63>] unionfs_file_release+0xc2/0x1c5 [<c01601a1>] __fput+0x8f/0x162 [<c016010d>] fput+0x2e/0x33 [<c015ec9d>] filp_close+0x50/0x5d [<c015ed1e>] sys_close+0x74/0xb4 [<c01026ce>] sysenter_past_esp+0x5f/0x85 and the test then goes quietly on its way instead of hanging. Though I imagine, with your patch or mine, that it's then making an unfortunate frequency of calls to lock_and_coalesce_longer_name_than_I_care_to_type thereafter. But it's hardly your responsibility to optimize for bugs elsewhere. The 2.6.23-mm1 tree has MNT_USER at 0x200, so I adjusted your flag to I'll let Erez take it from there... Hugh -
[...] This #ifdef'd code in unionfs is actually not enabled. I left it there as a reminder of possible future things to come (esp. if nameidata gets split). There's a related comment earlier in fs/unionfs/lookup.c:init_lower_nd() that says: #ifdef ALLOC_LOWER_ND_FILE /* * XXX: one day we may need to have the lower return an open file * for us. It is not needed in 2.6.23-rc1 for nfs2/nfs3, but may * very well be needed for nfs4. */ struct file *file; #endif /* ALLOC_LOWER_ND_FILE */ In the interest of keeping unionfs as simple as I can, when I implemented the whole "pass a lower nd" stuff, I left thos bits of semi-experimental #ifdef code for this lower file upon open-intent. It's not enabled and up until now, it didn't seem to be needed. Do you think unionfs has to start using this nd->intent.open.file stuff? Thanks, Erez. -
Hugh, my code looks like:
if (wbc->for_reclaim) {
set_page_dirty(lower_page);
unlock_page(lower_page);
goto out_release;
}
BUG_ON(!lower_mapping->a_ops->writepage);
clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
err = lower_mapping->a_ops->writepage(lower_page, wbc);
Do you mean I should set_page_dirty(lower_page) unconditionally before
I looked at all mainline filesystems's ->writepage to see what, if any, they
do with their page's uptodate flag. Most f/s don't touch the flag one way
or another.
cifs_writepage sets the uptodate flag unconditionally: why?
ecryptfs_writepage has a legit reason: if encrypting the page failed, it doesn't want
anyone to use it, so it clears its page's uptodate flag (else it sets it as
uptodate).
hostfs_writepage clears pageuptodate if it failed to write_file(), which I'm
not sure if it makes sense or not.
ntfs_writepage goes as far as doing BUG_ON(!PageUptodate(page)) which
indicates to me that the page passed to ->writepage should always be
uptodate. Is that a fair statement?
smb_writepage pretty much unconditionally calls SetPageUptodate(page). Why?
Is there a reason smbfs and cifs both do this unconditionally? If so, then
why is ntfs calling BUG_ON if the page isn't uptodate? Either that BUG_ON
in ntfs is redundant, or cifs/smbfs's SetPageUptodate is redundant, but they
can't both be right.
And finally, unionfs clears the uptodate flag on error from the lower
->writepage, and otherwise sets the flag on success from the lower
->writepage. My gut feeling is that unionfs shouldn't change the page
uptodate flag at all: if the VFS passes unionfs_writepage a page which isn't
uptodate, then the VFS has a serious problem b/c it'd be asking a f/s to
write out a page which isn't up-to-date, right? Otherwise, whether
unionfs_writepage manages to write the lower page or not, why should that
invalidate the state of the unionfs page itself? Come to think of it, I
think clearing pageuptodate on error ...Yes. Whether you're wrong not to be doing that already, I've not checked; but I think doing so will make unionfs safer against any future changes in the relationship between set_page_dirty and clear_page_dirty_for_io. For example, look at clear_page_dirty_for_io: it's decrementing some statistics which __set_page_dirty_nobuffers increments. Does use of unionfs (over some filesystems) make those numbers wrap increasingly negative? Does adding this set_page_dirty(lower_page) correct that? I'm not going to try and guess what assorted filesystems are up to, and not all of them will be bugfree. The crucial point of PageUptodate is that we insert a filesystem page into the page cache before it's had any data read in: it needs to be !PageUptodate until the data is there, and then marked PageUptodate to say the data is good and others can That was my point, and I don't really have more to add. It's unusual to do anything with PageUptodate when writing. By clearing it when the lower level has an error, you're throwing away the changes already made at the upper level. You might have some good reason for that, but it's worth questioning. If you don't know why you're Set/Clear'ing it there, better to just take that out. Hugh -
While looking into something else entirely, I realize that _here_ you are missing a SetPageUptodate(lower_page): should go in after the flush_dcache_page(lower_page) I'm suggesting. (Nick would argue for some kind of barrier there too, but I don't think unionfs has a special need to be ahead of the pack on that issue.) Think about it: when find_or_create_page has created a fresh page in the cache, and you've just done copy_highpage to put the data into it, you now need to mark it as Uptodate: otherwise a subsequent vfs_read or whatever on the lower level will find that page !Uptodate and read stale data back from disk instead of what you just copied in, unless its dirtiness has got it written back to disk meanwhile. Odd that that hasn't been noticed at all: I guess it may be hard to get testing to reclaim lower/upper pages in such a way as to test out such paths thoroughly. Hugh -
Hehe. Funny, you mention this... A few days ago, while I was doing your other recommended pageuptodate cleanups, I also added the same call to SetPageUptodate(lower_page) as you suggested. I tested that change along w/ the other changes you suggested, and they all seem to work great all the way from my 2.6.9 backport to 2.6.24-rc2 and -mm (modulo the fact that I had to work around or fix more non-unionfs bugs in -mm than unionfs ones to get it to work :-) I posted all of these patches just now. You're CC'ed. Hopefully Andrew can pull from my unionfs.git branch soon. You also reported in your previous emails some hangs/oopses while doing make -j 20 in unionfs on top of a single tmpfs, using -mm. After several days, I've not been able to reproduce this w/ my latest set of patches. If you can send me your .config and the specs on the h/w you're using (cpus, mem, etc.), I'll see if I can find something similar to it on my end and run the same tests. Cheers, Erez. -
I'm glad to report that this unionfs, not the one in 2.6.24-rc2-mm1
but the one including those 9 patches you posted, now gets through
my testing with tmpfs without a problem. I do still get occasional
"unionfs: new lower inode mtime (bindex=0, name=<directory>)"
messages, but nothing worse seen yet: a big improvement.
I deceived myself for a while that the danger of shmem_writepage
hitting its BUG_ON(entry->val) was dealt with too; but that's wrong,
I must go back to working out an escape from that one (despite never
seeing it).
I did think you could clean up the doubled set_page_dirtys,
but it's of no consequence.
Hugh
--- 2.6.24-rc2-mm1+9/fs/unionfs/mmap.c 2007-11-17 12:23:30.000000000 +0000
+++ linux/fs/unionfs/mmap.c 2007-11-17 20:22:29.000000000 +0000
@@ -56,6 +56,7 @@ static int unionfs_writepage(struct page
copy_highpage(lower_page, page);
flush_dcache_page(lower_page);
SetPageUptodate(lower_page);
+ set_page_dirty(lower_page);
/*
* Call lower writepage (expects locked page). However, if we are
@@ -66,12 +67,11 @@ static int unionfs_writepage(struct page
* success.
*/
if (wbc->for_reclaim) {
- set_page_dirty(lower_page);
unlock_page(lower_page);
goto out_release;
}
+
BUG_ON(!lower_mapping->a_ops->writepage);
- set_page_dirty(lower_page);
clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
err = lower_mapping->a_ops->writepage(lower_page, wbc);
if (err < 0)
-
