Defragmentation support for buffer heads. We convert the references to
buffers to struct page references and try to remove the buffers from
those pages. If the pages are dirty then trigger writeout so that the
buffer heads can be removed later.
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
fs/buffer.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2008-05-07 20:27:15.182659486 -0700
+++ linux-2.6/fs/buffer.c 2008-05-07 20:29:13.052102980 -0700
@@ -3255,6 +3255,104 @@ int bh_submit_read(struct buffer_head *b
}
EXPORT_SYMBOL(bh_submit_read);
+/*
+ * Writeback a page to clean the dirty state
+ */
+static void trigger_write(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+ int rc;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = 1,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .nonblocking = 1,
+ .for_reclaim = 0
+ };
+
+ if (!mapping->a_ops->writepage)
+ /* No write method for the address space */
+ return;
+
+ if (!clear_page_dirty_for_io(page))
+ /* Someone else already triggered a write */
+ return;
+
+ rc = mapping->a_ops->writepage(page, &wbc);
+ if (rc < 0)
+ /* I/O Error writing */
+ return;
+
+ if (rc == AOP_WRITEPAGE_ACTIVATE)
+ unlock_page(page);
+}
+
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffes will later take appropriate locks to remove the
+ * buffers if they are still there.
+ */
+static void *get_buffers(struct kmem_cache *s, int ...Oh, no, please don't trigger more random single page writeback from memory reclaim. We shoul dbe killing the VM's use of ->writepage, not encouraging it. If you are going to clean bufferheads (or pages), please clean entire mappings via ->writepages as it leads to far superior I/O patterns and a far higher aggregate rate of page cleaning..... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
That brings up another issue: Lets say I use writepages on a large file (couple of gig). How much do you want to write back? --
We're out of memory. I'd suggest write backing as much as you can without blocking. e.g. treat it like pdflush and say 1024 pages, or like balance_dirty_pages() and write a 'write_chunk' back from the mapping (i.e. sync_writeback_pages()). Any of these are better from an I/O perspective than single page writeback.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
But then filesystem can do tricks like writing out the surrounding areas as needed. The filesystem likely can estimate better how much writeout makes sense. --
Defragmentation is triggered as part of the usual memory reclaim Pushing write-around into a method that is only supposed to write the single page that is passed to it is a pretty bad abuse of the API. Especially as we have many simple, ranged writeback methods you could call. filemap_fdatawrite_range(), do_writepages(), ->writepages, etc. FWIW, look at the mess of layering violations that write clustering causes in XFS because we have to do this to keep allocation overhead and fragmentation down to a minimum. It's a nasty hack to mitigate the impact of the awful I/O patterns we see from the VM - suggesting that all filesystems do this just so you don't have to call a slightly smarter writeback primitive is insane.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
Could you provide me such a patch? I would not know how much to writeout. If we had such a method then we could also use that for the swap case where we also write out single pages? --
How hard is it? I don't have time right now to do this, but it's essentially: mapping = page->mapping; ...... - mapping->aops->writepage(); + filemap_fdatawrite_range(mapping, start, end); Where [start,end] span page->index and are is large enough to get a substantial sized I/O to disk (say at least SWAP_CLUSTER_MAX pages, preferrably larger for 4k page size machines). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
Or just sync_inode(). -- Evgeniy Polyakov --
Oh, god no. Let's not put the inode_lock right at the top of the VM page cleaning path. We don't need to modify inode state, the superblock dirty lists, etc - all we need to do is write dirty pages on a given mapping in a more efficient manner. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
I'm not advocating that, but having swap on reclaim does not hurt anyone, this is essentially the same, but with different underlying storage. System will do that anyway sooner or later during usual writeback, which in turn can be a result of the same reclaim... -- Evgeniy Polyakov --
Sure. But my point is simply that sync_inode() is far too heavy-weight to be used in a reclaim context. The fact that it holds the inode_lock will interfere with normal writeback via pdflush and that could potentially slow down writeback even more. e.g. think of kswapd threads running on 20 nodes of a NUMA machine all at once writing back dirty memory (yes, it happens). If we use sync_inode() to write back dirty mappings we would then have at least 20 CPUs serialising on the inode_lock trying to write back pages. If we instead use a thin wrapper around ->writepages() then they can all run in parallel through the filesystem(s), block devices, etc rather than being serialised at the highest possible Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
It's more than efficiency. There are lots and lots of things we cannot do in direct-reclaim context. a) Can't lock pages (well we kinda sorta could, but generally code will just trylock) b) Cannot rely on the inode or the address_space being present in memory after we have unlocked the page. c) Cannot run iput(). Or at least, we couldn't five or six years ago. afaik nobody has investigated whether the situation is now better or worse. d) lots of deadlock scenarios - need to test __GFP_FS basically everywhere in which you share code with normal writeback paths. Plus e), f), g) and h). Direct-reclaim is a hostile environment. Things like b) are a real killer - nasty, subtle, rare, memory-pressure-dependent crashes. --
Which basically means we can not do direct writeback at reclaim time?.. -- Evgeniy Polyakov --
Well, we _can_, but doing so within the present constraints is delicate. An implementation which locked all the to-be-written pages up front and then wrote them out and which was careful not to touch the inode or address_space after the last page is unlocked could work. Or perhaps add a new lock to the inode and then in reclaim a) lock a page on the LRU, thus pinning the address_space and inode. b) take some new sleeping lock in the inode c) unlock that page and now proceed to do writeback. But still honouring !GFP_FS. and teach the unmount code to take the per-inode locks too, to ensure that reclaim has got out of there before zapping the inodes. Perhaps a per-superblock lock rather than per-inode, dunno. But we won't be able to just dive in there and call the existing writeback functions from within reclaim. Because a) callers can hold all sorts of locks, including implicit ones such as journal_start() and b) reclaim doesn't have a reference on the page's inode, and the inode and address_space can vanish if reclaim isn't holding a lock on one of the address_space's pages. --
I happened to notice your remark in the buffer heads defrag thread. Do you remember what that limitation was about? Because just a few months ago I discovered a shmem race which I fixed by doing igrab+iput in shmem_writepage, in the reclaim context. Feeling guilty now: I'd better investigate, but would welcome a starting pointer. (If I'm lucky, it'll be that the generic code in vmscan.c cannot use iput, but particular filesystems might themselves be safe to.) Thanks, Hugh --
Hi Hugh. If we are talking about the same things, its waiting for pages to be synced (wither written back or truncated) when inode is about to be destroyed. Thus reclaim can sleep wating for pages to be synced, which it is about to move somewhere itself. Deadlock. The same for writepage - if we drop inode there it can wait for pages to be synced, which inturn requires writeback, where we are sleeping already... -- Evgeniy Polyakov --
Ages and ages ago. I expect it was a deadlock thing. iput_final() can end up calling things like write_inode() which can want to do things like opening a transaction against filesystem A while already having one open against filesystem B. Which is both deadlockable and BUGable. Yes, it was specific to the direct-reclaim calling context. --
And actually having tiny operations under inode_lock is the last thing
to worry about when we are about to start writing pages to disk because
memory is so fragmented that we need to move things around.
That is the simplest from the typing viewpoint, one can also do
something like that:
struct address_space *mapping = page->mapping;
struct backing_dev_info *bdi = mapping->backing_dev_info;
struct writeback_control wbc = {
.bdi = bdi,
.sync_mode = WB_SYNC_ALL, /* likly we want to wait... */
.older_than_this = NULL,
.nr_to_write = 13,
.range_cyclic = 0,
.range_start = start_index,
.range_end = end_index
};
do_writepages(mapping, &wbc);
Cristoph, is this example you wnated to check out? It will only try to
write .nr_to_write pages between .range_start and .range_end without
syncing inode info itself.
--
Evgeniy Polyakov
--
Which is the exact implementation of filemap_fdatawrite_range(mapping, start, end); Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
Cool, I did not know that, probably because it is not exported :) -- Evgeniy Polyakov --
Well that is what Dave wants. I'd rather go the safe route for now and defer this until later. I think you are much more an expert on the filesystems and I/O paths than I am. So I'd rather take my hands of as soon as possible. --
I don't think that's true on no-MMU. Defragmentation can be needed often on no-MMU when there's lots of free memory, just in the wrong places. -- Jamie --
