Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: Erez Zadok <ezk@...>
Cc: Pekka Enberg <penberg@...>, Ryan Finnie <ryan@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, <cjwatson@...>, <linux-mm@...>
Date: Monday, October 29, 2007 - 4:33 pm

On Sun, 28 Oct 2007, Erez Zadok wrote:

That's very encouraging...


... but still a few problems, I'm afraid.

The greatest problem is a tmpfs one, that would be for me to solve.
But first...


On reflection, I think I went too far in asking you to mask off __GFP_IO.
Loop has to do so because it's a block device, down towards the IO layer;
but unionfs is a filesystem, so masking off __GFP_FS is enough to prevent
recursion into the FS layer with danger of deadlock, and leaving __GFP_IO
on gives a better chance of success.


Hmm, several points on that.

When suggesting something like this, I did remark "what locking needed?".
You've got none: which is problematic if two stacked mounts are playing
with the same underlying file concurrently (yes, userspace would have
a data coherency problem in such a case, but the kernel still needs to
worry about its own internal integrity) - you'd be in danger of masking
(__GFP_IO|__GFP_FS) permanently off the underlying file; and furthermore,
losing error flags (AS_EIO, AS_ENOSPC) which share the same unsigned long.
Neither likely but both wrong.

See the comment on mapping_set_gfp_mask() in include/pagemap.h:
 * This is non-atomic.  Only to be used before the mapping is activated.
Strictly speaking, I guess loop was a little bit guilty even when just
loop_set_fd() did it: the underlying mapping might already be active.
It appears to be just as guilty as you in its do_loop_switch() case
(done at BIO completion time), but that's for a LOOP_CHANGE_FD ioctl
which would only be expected to be called once, during installation;
whereas you're using mapping_set_gfp_mask here with great frequency.

Another point on this is: loop masks __GFP_IO|__GFP_FS off the file
for the whole duration while it is looped, whereas you're flipping it
just in this preliminary section of unionfs_writepage.  I think you're
probably okay to be doing it only here within ->writepage: I think
loop covered every operation because it's at the block device level,
perhaps both reads and writes needed to serve reclaim at the higher
FS level; and also easier to do it once for all.

Are you wrong to be doing it only around the grab_cache_page,
leaving the lower level ->writepage further down unprotected?
Certainly doing it around the grab_cache_page is likely to be way
more important than around the ->writepage (but rather depends on
filesystem).  And on reflection, I think that the lower filesystem's
writepage should already be using GFP_NOFS to avoid deadlocks in
any of its allocations when wbc->for_reclaim, so you should be
okay just masking off around the grab_cache_page.

(Actually, in the wbc->for_reclaim case, I think you don't really
need to call the lower level writepage at all.  Just set_page_dirty
on the lower page, unlock it and return.  In due course that memory
pressure which has called unionfs_writepage, will come around to the
lower level page and do writepage upon it.  Whether that's a better
strategy or not, I'm do not know.)

There's an attractively simple answer to the mapping_set_gfp_mask
locking problem, if we're confident that it's only needed around
the grab_cache_page.  Look at the declaration of grab_cache_page
in linux/pagemap.h: it immediately extracts the gfp_mask from the
mapping and passes that down to find_or_create_page, which doesn't
use the mapping's gfp_mask at all.

So, stop flipping and use find_or_create_page directly yourself.


You need to unlock_page, don't you?  Or move the "out" label up
before the unlock_page.  There seems to have been confusion about
this even in the current 2.6.23-mm1 unionfs_writepage: the only
case in which a writepage returns with its page still locked is
that AOP_WRITEPAGE_ACTIVATE case we're going to get rid of.


Better to use kmap_atomic.  unionfs_writepage cannot get called
at interrupt time, I see no reason to avoid KM_USER0 and KM_USER1:
therefore simply use copy_highpage(lower_page, page) and let it do
all the kmapping and copying.

If PAGE_CACHE_SIZE ever diverges from PAGE_SIZE (e.g. Christoph
Lameter's variable page_cache_size patches), then yes, this
would need updating to a loop over several pages (or better,
linux/highmem.h should then provide a function to do it).


Another instance of that confusion: lower_page is already unlocked,
on success or failure; it's only the anomalous AOP_WRITEPAGE_ACTIVATE
case that leaves it locked.


Page needs to be unlocked, whether here or at out.


Right (once you've got the locking right).


I'm not going to comment much on your unionfs_sync_page: it looks
like a total misunderstanding of what sync_page does, assuming from
the name that it syncs the page in a fsync/msync/sync manner.

No, it would much better be named "unplug_page_io": please take a
look at sync_page() in mm/filemap.c, observe how it gets called
(via wait_on_page_bit) and what it ends up doing.  (Don't pay much
attention to what Documentation/filesystems says about it, either!)

It's an odd business; I think Nick did have a patch to get rid of
it completely, which would be nice; but changes to unplugging I/O
(kicking off the I/O after saving up several requests to do all
together) can be a hang-prone business.

Do you need a unionfs_sync_page at all?  I think not, since the
I/O, plugged or unplugged, is below your lower level filesystem.

But I started by mentioning a serious tmpfs problem.  Now I've
persuaded you to go back to grab_cache_page/find_or_create_page,
I realize a nasty problem for tmpfs.  Under memory pressure, you're
liable to be putting tmpfs file pages into the page cache at the
same time as they're already present but in disguise as swap cache
pages.  Perhaps the solution will be quite simple (since you're
overwriting the whole page), but I do need to think about it.

Hugh
-
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userl..., Hugh Dickins, (Mon Oct 29, 4:33 pm)
[PATCH] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE, Hugh Dickins, (Wed Oct 24, 5:02 pm)
Re: [PATCH] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE, Andrew Morton, (Wed Oct 24, 5:08 pm)