login
Header Space

 
 

[patch 10/21] buffer heads: Support slab defrag

Previous thread: [patch 05/21] slub: Slab defrag core by Christoph Lameter on Friday, May 9, 2008 - 11:08 pm. (1 message)

Next thread: [patch 18/21] Filesystem: Socket inode defragmentation by Christoph Lameter on Friday, May 9, 2008 - 11:08 pm. (4 messages)
To: <akpm@...>
Cc: <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Friday, May 9, 2008 - 11:08 pm

Defragmentation support for buffer heads. We convert the references to
buffers to struct page references and try to remove the buffers from
those pages. If the pages are dirty then trigger writeout so that the
buffer heads can be removed later.

Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
---
 fs/buffer.c |   99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2008-05-07 20:27:15.182659486 -0700
+++ linux-2.6/fs/buffer.c	2008-05-07 20:29:13.052102980 -0700
@@ -3255,6 +3255,104 @@ int bh_submit_read(struct buffer_head *b
 }
 EXPORT_SYMBOL(bh_submit_read);
 
+/*
+ * Writeback a page to clean the dirty state
+ */
+static void trigger_write(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+	int rc;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = 1,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 0
+	};
+
+	if (!mapping-&gt;a_ops-&gt;writepage)
+		/* No write method for the address space */
+		return;
+
+	if (!clear_page_dirty_for_io(page))
+		/* Someone else already triggered a write */
+		return;
+
+	rc = mapping-&gt;a_ops-&gt;writepage(page, &amp;wbc);
+	if (rc &lt; 0)
+		/* I/O Error writing */
+		return;
+
+	if (rc == AOP_WRITEPAGE_ACTIVATE)
+		unlock_page(page);
+}
+
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffes will later take appropriate locks to remove the
+ * buffers if they are still there.
+ */
+static void *get_buffer...
To: Christoph Lameter <clameter@...>
Cc: <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Sunday, May 11, 2008 - 8:24 pm

Oh, no, please don't trigger more random single page writeback from
memory reclaim.  We shoul dbe killing the VM's use of -&gt;writepage,
not encouraging it.

If you are going to clean bufferheads (or pages), please clean entire
mappings via -&gt;writepages as it leads to far superior I/O patterns
and a far higher aggregate rate of page cleaning.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To: David Chinner <dgc@...>
Cc: <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 1:42 pm

That brings up another issue: Lets say I use writepages on a large file 
(couple of gig). How much do you want to write back?

--
To: Christoph Lameter <clameter@...>
Cc: David Chinner <dgc@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 7:10 pm

We're out of memory. I'd suggest write backing as much as you can
without blocking.  e.g. treat it like pdflush and say 1024 pages, or
like balance_dirty_pages() and write a 'write_chunk' back from the
mapping (i.e.  sync_writeback_pages()).

Any of these are better from an I/O perspective than single page
writeback....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To: David Chinner <dgc@...>
Cc: <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Friday, May 16, 2008 - 1:01 pm

But then filesystem can do tricks like writing out the surrounding areas 
as needed. The filesystem likely can estimate better how much writeout 
makes sense.

--
To: Christoph Lameter <clameter@...>
Cc: David Chinner <dgc@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Monday, May 19, 2008 - 1:45 am

Defragmentation is triggered as part of the usual memory reclaim


Pushing write-around into a method that is only supposed to write
the single page that is passed to it is a pretty bad abuse of the
API. Especially as we have many simple, ranged writeback methods
you could call. filemap_fdatawrite_range(), do_writepages(),
-&gt;writepages, etc.

FWIW, look at the mess of layering violations that write clustering
causes in XFS because we have to do this to keep allocation overhead
and fragmentation down to a minimum. It's a nasty hack to mitigate
the impact of the awful I/O patterns we see from the VM - suggesting
that all filesystems do this just so you don't have to call a
slightly smarter writeback primitive is insane....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To: David Chinner <dgc@...>
Cc: Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Tuesday, May 20, 2008 - 6:53 pm

I don't think that's true on no-MMU.  Defragmentation can be needed
often on no-MMU when there's lots of free memory, just in the wrong
places.

-- Jamie
--
To: David Chinner <dgc@...>
Cc: <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Monday, May 19, 2008 - 12:44 pm

Could you provide me such a patch? I would not know how much to writeout. 
If we had such a method then we could also use that for the swap case 
where we also write out single pages?
--
To: Christoph Lameter <clameter@...>
Cc: David Chinner <dgc@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Monday, May 19, 2008 - 8:25 pm

How hard is it? I don't have time right now to do this, but it's essentially:

	mapping = page-&gt;mapping;
	......
-	mapping-&gt;aops-&gt;writepage();
+	filemap_fdatawrite_range(mapping, start, end);

Where [start,end] span page-&gt;index and are is large enough
to get a substantial sized I/O to disk (say at least SWAP_CLUSTER_MAX
pages, preferrably larger for 4k page size machines).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To: David Chinner <dgc@...>
Cc: Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Tuesday, May 20, 2008 - 2:56 am

Or just sync_inode().

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: David Chinner <dgc@...>, Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Tuesday, May 20, 2008 - 5:46 pm

Oh, god no. Let's not put the inode_lock right at the top of
the VM page cleaning path. We don't need to modify inode state,
the superblock dirty lists, etc - all we need to do is write
dirty pages on a given mapping in a more efficient manner.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To: David Chinner <dgc@...>
Cc: Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Tuesday, May 20, 2008 - 6:25 pm

I'm not advocating that, but having swap on reclaim does not hurt
anyone, this is essentially the same, but with different underlying
storage. System will do that anyway sooner or later during usual
writeback, which in turn can be a result of the same reclaim...

-- 
	Evgeniy Polyakov
--
To: David Chinner <dgc@...>
Cc: Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Tuesday, May 20, 2008 - 7:22 pm

And actually having tiny operations under inode_lock is the last thing
to worry about when we are about to start writing pages to disk because
memory is so fragmented that we need to move things around.

That is the simplest from the typing viewpoint, one can also do
something like that:

struct address_space *mapping = page-&gt;mapping;
struct backing_dev_info *bdi = mapping-&gt;backing_dev_info;
struct writeback_control wbc = {
	.bdi = bdi,
	.sync_mode = WB_SYNC_ALL, /* likly we want to wait... */
	.older_than_this = NULL,
	.nr_to_write = 13,
	.range_cyclic = 0,
	.range_start = start_index,
	.range_end = end_index
};

do_writepages(mapping, &amp;wbc);

Cristoph, is this example you wnated to check out? It will only try to
write .nr_to_write pages between .range_start and .range_end without
syncing inode info itself.

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: David Chinner <dgc@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Tuesday, May 20, 2008 - 9:56 pm

Well that is what Dave wants. I'd rather go the safe route for now and 
defer this until later. I think you are much more an expert on the 
filesystems and I/O paths than I am. So I'd rather take my hands of as 
soon as possible.

--
To: Evgeniy Polyakov <johnpol@...>
Cc: David Chinner <dgc@...>, Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Tuesday, May 20, 2008 - 7:30 pm

Which is the exact implementation of

	filemap_fdatawrite_range(mapping, start, end);

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To: David Chinner <dgc@...>
Cc: Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Wednesday, May 21, 2008 - 2:20 am

Cool, I did not know that, probably because it is not exported :)

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: David Chinner <dgc@...>, Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Tuesday, May 20, 2008 - 7:19 pm

Sure. But my point is simply that sync_inode() is far too
heavy-weight to be used in a reclaim context. The fact that it holds
the inode_lock will interfere with normal writeback via pdflush and
that could potentially slow down writeback even more.

e.g. think of kswapd threads running on 20 nodes of a NUMA machine
all at once writing back dirty memory (yes, it happens). If we use
sync_inode() to write back dirty mappings we would then have at
least 20 CPUs serialising on the inode_lock trying to write back
pages. If we instead use a thin wrapper around -&gt;writepages() then
they can all run in parallel through the filesystem(s), block
devices, etc rather than being serialised at the highest possible

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To: David Chinner <dgc@...>
Cc: Evgeniy Polyakov <johnpol@...>, Christoph Lameter <clameter@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Tuesday, May 20, 2008 - 7:28 pm

It's more than efficiency.  There are lots and lots of things we cannot
do in direct-reclaim context.

a) Can't lock pages (well we kinda sorta could, but generally code
   will just trylock)

b) Cannot rely on the inode or the address_space being present in
   memory after we have unlocked the page.

c) Cannot run iput().  Or at least, we couldn't five or six years
   ago.  afaik nobody has investigated whether the situation is now
   better or worse.

d) lots of deadlock scenarios - need to test __GFP_FS basically everywhere
   in which you share code with normal writeback paths.

Plus e), f), g) and h).  Direct-reclaim is a hostile environment. 
Things like b) are a real killer - nasty, subtle, rare,
memory-pressure-dependent crashes.

--
To: Andrew Morton <akpm@...>
Cc: David Chinner <dgc@...>, Christoph Lameter <clameter@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Wednesday, May 21, 2008 - 2:15 am

Which basically means we can not do direct writeback at reclaim time?..

-- 
	Evgeniy Polyakov
--
To: Evgeniy Polyakov <johnpol@...>
Cc: David Chinner <dgc@...>, Christoph Lameter <clameter@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Wednesday, May 21, 2008 - 2:24 am

Well, we _can_, but doing so within the present constraints is delicate.

An implementation which locked all the to-be-written pages up front and
then wrote them out and which was careful not to touch the inode or
address_space after the last page is unlocked could work.

Or perhaps add a new lock to the inode and then in reclaim

a) lock a page on the LRU, thus pinning the address_space and inode.

b) take some new sleeping lock in the inode

c) unlock that page and now proceed to do writeback.  But still
   honouring !GFP_FS.

and teach the unmount code to take the per-inode locks too, to ensure
that reclaim has got out of there before zapping the inodes.  Perhaps a
per-superblock lock rather than per-inode, dunno.

But we won't be able to just dive in there and call the existing
writeback functions from within reclaim.  Because

a) callers can hold all sorts of locks, including implicit ones such
   as journal_start() and

b) reclaim doesn't have a reference on the page's inode, and the
   inode and address_space can vanish if reclaim isn't holding a lock
   on one of the address_space's pages.

--
To: Andrew Morton <akpm@...>
Cc: Evgeniy Polyakov <johnpol@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Wednesday, May 21, 2008 - 1:52 pm

I happened to notice your remark in the buffer heads defrag thread.
Do you remember what that limitation was about?

Because just a few months ago I discovered a shmem race which I fixed
by doing igrab+iput in shmem_writepage, in the reclaim context.  Feeling
guilty now: I'd better investigate, but would welcome a starting pointer.

(If I'm lucky, it'll be that the generic code in vmscan.c cannot
use iput, but particular filesystems might themselves be safe to.)

Thanks,
Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Evgeniy Polyakov <johnpol@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Wednesday, May 21, 2008 - 2:12 pm

Ages and ages ago.  I expect it was a deadlock thing.  iput_final() can
end up calling things like write_inode() which can want to do things
like opening a transaction against filesystem A while already having
one open against filesystem B.  Which is both deadlockable and BUGable.

Yes, it was specific to the direct-reclaim calling context.
--
To: Hugh Dickins <hugh@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Wednesday, May 21, 2008 - 1:58 pm

Hi Hugh.


If we are talking about the same things, its waiting for pages to be
synced (wither written back or truncated) when inode is about to be
destroyed. Thus reclaim can sleep wating for pages to be synced, which
it is about to move somewhere itself. Deadlock. The same for writepage -
if we drop inode there it can wait for pages to be synced, which inturn
requires writeback, where we are sleeping already...

-- 
	Evgeniy Polyakov
--
Previous thread: [patch 05/21] slub: Slab defrag core by Christoph Lameter on Friday, May 9, 2008 - 11:08 pm. (1 message)

Next thread: [patch 18/21] Filesystem: Socket inode defragmentation by Christoph Lameter on Friday, May 9, 2008 - 11:08 pm. (4 messages)
speck-geostationary