Re: [PATCH (RESEND)] don't scan/accumulate more pages than mballoc will allocate

Previous thread: none

Next thread: Re: [PATCH] ext4_freeze: don't return to userspace with a mutex held by tytso on Sunday, April 4, 2010 - 1:52 pm. (6 messages)
From: Eric Sandeen
Date: Monday, March 29, 2010 - 8:29 am

(resend, email sent Friday seems lost)

There was a bug reported on RHEL5 that a 10G dd on a 12G box
had a very, very slow sync after that.

At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amt; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) later.

Upstream it's not such a big issue for sys_sync because we get
to the loop with a much smaller nr_to_write, which limits the loop.

However, talking with Aneesh he realized that fsync upstream still
gets here with a very large nr_to_write and we face the same problem.

This patch makes mpage_add_bh_to_extent stop the loop after we've
accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
causes the write_cache_pages loop to break.

Repeating the test with a dirty_ratio of 80 (to leave something for
fsync to do), I don't see huge IO performance gains, but the reduction
in cpu usage is striking: 80% usage with stock, and 2% with the
below patch.  Instrumenting the loop in write_cache_pages clearly
shows that we are wasting time here.

It'd be better to not have a magic number of 2048 in here, so I'll
look for a cleaner way to get this info out of mballoc; I still need
to look at what Aneesh has in the patch queue, that might help.
This is something we could probably put in for now, though; the 2048
is already enshrined in a comment in inode.c, at least.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -2318,6 +2318,10 @@ static void mpage_add_bh_to_extent(struc
 	sector_t next;
 	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
 
+	/* Don't go larger than mballoc is willing to allocate ...
From: tytso
Date: Monday, April 5, 2010 - 6:11 am

I wonder if a better way of fixing this is to changing
mpage_da_map_pages() to call ext4_get_blocks() multiple times.  This
should be a lot easier after we integrate mpage_da_submit_io() into
mpage_da_map_pages().  That way we can way more efficient; in a loop,
we accumulate the pages, call ext4_get_blocks(), then submit the IO
(as a single block I/O submission, instead of 4k at a time through
ext4_writepages()), and then call ext4_get_blocks() again, etc.

I'm willing to include this patch as an interim stopgap, but
eventually, I think we need to refactor and reorganize
mpage_da_map_pages() and and mpage_da_submit_IO(), and let them call
mballoc (via ext4_get_blocks) multiple times in a loop.

Thoughts, suggestions?

					- Ted
--

From: Eric Sandeen
Date: Monday, April 5, 2010 - 7:42 am

That sounds reasonable, I'll look into writing something up and testing
it a bit.

Up to you whether the initial patch goes in, I know it's kind of
stopgap/hacky.

thanks,

--

From: Theodore Ts'o
Date: Wednesday, April 7, 2010 - 7:10 pm

From: From: Eric Sandeen <sandeen@redhat.com>

There was a bug reported on RHEL5 that a 10G dd on a 12G box
had a very, very slow sync after that.

At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amt; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) later.

Upstream it's not such a big issue for sys_sync because we get
to the loop with a much smaller nr_to_write, which limits the loop.

However, talking with Aneesh he realized that fsync upstream still
gets here with a very large nr_to_write and we face the same problem.

This patch makes mpage_add_bh_to_extent stop the loop after we've
accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
causes the write_cache_pages loop to break.

Repeating the test with a dirty_ratio of 80 (to leave something for
fsync to do), I don't see huge IO performance gains, but the reduction
in cpu usage is striking: 80% usage with stock, and 2% with the
below patch.  Instrumenting the loop in write_cache_pages clearly
shows that we are wasting time here.

Eventually we need to change mpage_da_map_pages() also submit its I/O
to the block layer, subsuming mpage_da_submit_io(), and then change it
call ext4_get_blocks() multiple times.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---

This is the slightly revised version of Eric's patch that I've added to
the ext4 patch queue. -- Ted

 fs/ext4/inode.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c6ca10..2c12926 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2349,6 +2349,15 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
 	sector_t next;
 	int nrblocks = mpd->b_size >> ...
From: Eric Sandeen
Date: Wednesday, April 7, 2010 - 7:31 pm

Seems fine, thanks.


--

Previous thread: none

Next thread: Re: [PATCH] ext4_freeze: don't return to userspace with a mutex held by tytso on Sunday, April 4, 2010 - 1:52 pm. (6 messages)