I happened to be going through the source code for write_cache_pages(),
and I came across a reference to AOP_WRITEPAGE_ACTIVATE. I was curious
what the heck that was, so I searched for it, and found this in
Documentation/filesystems/vfs.txt:
    If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
    try too hard if there are problems, and may choose to write out
    other pages from the mapping if that is easier (e.g. due to
    internal dependencies). If it chooses not to start writeout, it
    should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
    calling ->writepage on that page.
    See the file "Locking" for more details.
Apart from shmem (see below), no filesystem currently returns
AOP_WRITEPAGE_ACTIVATE when it chooses not to write out a page; they
call redirty_page_for_writeback() instead.
Is this a change we should make, for example when btrfs refuses a
writepage() when PF_MEMALLOC is set, or when ext4 refuses a writepage()
if the page involved hasn't been allocated an on-disk block yet (i.e.,
delayed allocation)? The change seems to be that we should call
redirty_page_for_writeback() as before, but then _not_ unlock the page,
and return AOP_WRITEPAGE_ACTIVATE. Is this a good and useful thing for
us to do?
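
To make the two conventions being compared concrete, here is a minimal
userspace sketch in C. It is a toy model, not kernel code: struct
model_page, the MODEL_* return values and all function names are
hypothetical, and the caller's reaction to the ACTIVATE return (unlock
the page and stop retrying it) is an assumption based on the vfs.txt
text quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; NOT the kernel's return codes or struct page. */
enum model_ret { MODEL_OK = 0, MODEL_WRITEPAGE_ACTIVATE = 1 };

struct model_page {
    bool locked;
    bool dirty;
    bool active; /* set once the caller decides to stop retrying the page */
};

typedef int (*model_writepage_t)(struct model_page *);

/* Convention A (what the mail says filesystems do today):
 * refuse the write, redirty the page, unlock it, report success. */
static int model_writepage_redirty(struct model_page *p)
{
    p->dirty = true;
    p->locked = false;
    return MODEL_OK;
}

/* Convention B (the change being asked about):
 * refuse the write, redirty the page, leave it locked, return ACTIVATE. */
static int model_writepage_activate(struct model_page *p)
{
    p->dirty = true;
    return MODEL_WRITEPAGE_ACTIVATE;
}

/* Hypothetical reclaim-side caller. Model assumption: the caller locks
 * the page and clears its dirty bit before calling writepage, which is
 * why a refusing callee has to redirty it. */
static int model_reclaim_one(struct model_page *p, model_writepage_t writepage)
{
    int ret;

    p->locked = true;
    p->dirty = false;
    ret = writepage(p);
    if (ret == MODEL_WRITEPAGE_ACTIVATE) {
        assert(p->locked);  /* callee must not have unlocked it */
        p->locked = false;  /* caller unlocks ...               */
        p->active = true;   /* ... and stops retrying this page */
    }
    return ret;
}
```

In this model both conventions leave the page dirty and unlocked; the
only difference the caller can see is that convention B tells it to stop
picking the same page again.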
Right now, the only writepage() function which is returning
AOP_WRITEPAGE_ACTIVATE is shmem_writepage(), and very curiously it's not
using redirty_page_for_writeback(). Should it, for consistency's sake
if not to keep the various zone accounting straight?
There are some longer-term issues, including the fact that ext4 and
btrfs are violating some of the rules laid out in
Documentation/filesystems/Locking regarding what writepage() is supposed to do
under direct reclaim -- something which isn't going to be practical for
us to change on the file-system side, at least not without doing some
pretty nasty and serious rework, for both ext4 and I suspect btrfs. But
if returning AOP_WRITEPAGE_ACTIVATE will help the VM deal more
gracefully with the ...

No, not exactly. Btrfs refuses the writepage() in the direct reclaim
cases (i.e., if PF_MEMALLOC is set), but will do writepage() in the case
of zone scanning. I don't want to speak for Chris, but I assume it's
due to stack depth concerns --- if it was just due to worrying about fs
recursion issues, I assume all of the btrfs allocations could be done
GFP_NOFS.

So I'm sorry I mentioned the fake numa bit, since I think this is a bit
of a red herring. That code is in production here, and we've made all
sorts of changes so it can be used for more than just debugging. So
please ignore it, it's our local hack, and if it breaks that's our
problem.

More importantly, just two weeks ago I talked to someone in the
financial sector, who was testing out ext4 on an upstream kernel, and
not using our hacks that force 128MB zones, and he ran into the ext4/OOM
problem anyway. It involved Oracle pinning down 3G worth of pages, and
him trying to do a huge streaming backup (which of course wasn't using
fallocate or direct I/O) under ext4, and he had the same issue --- an
OOM, that I'm pretty sure was caused by the fact that ext4_writepage()
was refusing the writepage() and most of the pages that weren't nailed
down by Oracle were delalloc. The same test scenario using ext3 worked
just fine, of course.

In normal cases it's not a problem since statistically there should be
enough other pages in the system compared to the number of pages that
are subject to delalloc, such that pages can usually get pushed out
until the writeback code can get around to writing out the pages. But
in cases where the zones have been made artificially small, or you have
a big program like Oracle pinning down a large number of pages, then of
course we have problems.

I'm trying to fix things from the file system side, which means trying
to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is
described in Documentation/filesystems/Locking as something which MUST
be used if writepage() is going ...

Btrfs refuses all PF_MEMALLOC writepage. It will go ahead and process a
regular writepage but in practice that never happens...everyone else
except a few internal btrfs callers uses writepages.

I wish I had thought of stack depth back then, but really this was to
keep kswapd out of the heavy work done by delalloc. From a locking point
of view we're properly GFP_NOFS, so it's safe, but it just isn't a ...

PG_writeback will protect you from vmtruncate, but may also want to ...

block_write_full_page takes a locked page and if all goes well produces
a writeback page without the page locked. Basically it needs the page
locked until after it has the writeback bit set to protect against
truncate and make sure the page buffers don't go away while it is
looping over them.

My understanding of the current scheme is that truncate will wait on
both locked and writeback pages. The page lock is used while setting up
the page for writeback, which is true both for writepages and writepage.
I don't think we need a new lock on top of the page lock and the
writeback bit, but maybe I don't see exactly which problem you're
solving.

A given range of pages is either:

1) Allocated but not under IO. ext4 must write these pages to disk
   before truncate can finish for data=ordered reasons, unless it
   manages to log the orphan item. Figuring out the dependency between
   the orphan item, which i_size is on disk right now, and holes is
   pretty tricky, so I'd go with the less complex: just wait for all the
   allocated delalloc pages to hit the disk.

2) Allocated and under IO. These pages go to disk.

3) Delalloc and not under IO. Truncate (or notify_change if you lean
   toward the xfs crowd) should be able to clean these up without
   waiting for the IO.

Of the three, #3 is probably the most common, with #1 a close second.
Is this a case that we really need to optimize for?

-chris
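
The three cases listed in the message above can be written down as a
small decision function. This is only a userspace sketch in C of that
classification: the enum, the struct and the function name are all
hypothetical, and the one combination the message does not mention
(not allocated but under IO) is reported as "not covered" rather than
guessed at.

```c
#include <assert.h>
#include <stdbool.h>

/* What truncate would do with a page range; names are made up here. */
enum truncate_plan {
    PLAN_WRITE_THEN_WAIT, /* case 1: allocated, not under IO      */
    PLAN_LET_IO_FINISH,   /* case 2: allocated, under IO          */
    PLAN_DROP_NO_IO,      /* case 3: delalloc, not under IO       */
    PLAN_NOT_COVERED      /* combination not listed in the thread */
};

/* Hypothetical summary of a page range's state. */
struct range_state {
    bool allocated; /* blocks already allocated on disk (false = delalloc) */
    bool under_io;  /* writeback already in flight */
};

static enum truncate_plan plan_for_range(struct range_state s)
{
    if (s.allocated && !s.under_io)
        return PLAN_WRITE_THEN_WAIT;
    if (s.allocated && s.under_io)
        return PLAN_LET_IO_FINISH;
    if (!s.allocated && !s.under_io)
        return PLAN_DROP_NO_IO;
    return PLAN_NOT_COVERED;
}

/* Small self-check covering the most common case according to the mail. */
static void plan_selfcheck(void)
{
    struct range_state delalloc_idle = { false, false };

    assert(plan_for_range(delalloc_idle) == PLAN_DROP_NO_IO);
}
```

Only PLAN_WRITE_THEN_WAIT forces truncate to start new IO, which is the
cost the closing question is weighing.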
