[PATCH 0/2] Fix buffer dirtying in data=journal mode

Previous thread: Re: fio test triggering bad data on ext4 by Frank Mehnert on Monday, June 21, 2010 - 2:37 am. (1 message)

Next thread: Updated ext4 quota design document by Theodore Ts'o on Monday, June 21, 2010 - 5:29 am. (7 messages)
From: Jan Kara
Date: Monday, June 21, 2010 - 3:42 am

Hi,

  two patches below fix warnings and possible BUGs in ext3/ext4 when running
in data=journal mode (see https://bugzilla.kernel.org/show_bug.cgi?id=14156).
Admittedly, the workaround I've chosen is kind of ugly but reimplementing
block_prepare_write() just for the data=journal mode seems even uglier to me.
If anybody has better solution, I'd gladly hear it.

								Honza
--

From: Jan Kara
Date: Monday, June 21, 2010 - 3:42 am

block_prepare_write() can dirty freshly created buffer. This is a problem
for data=journal mode because data buffers shouldn't be dirty unless they
are undergoing checkpoint. So we have to tweak get_block function for
data=journal mode to catch the case when block_prepare_write would dirty
the buffer, do the work instead of block_prepare_write, and properly handle
dirty buffer as data=journal mode requires it.

It might be cleaner to avoid using block_prepare_write() for data=journal
mode writes but that would require us to duplicate most of the function
which isn't nice either...

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c |   56 +++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index ea33bdf..2b61cc4 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -993,6 +993,43 @@ out:
 	return ret;
 }
 
+static int ext3_journalled_get_block(struct inode *inode, sector_t iblock,
+				     struct buffer_head *bh, int create)
+{
+	handle_t *handle = ext3_journal_current_handle();
+	int ret;
+
+	/* This function should ever be used only for real buffers */
+	BUG_ON(!bh->b_page);
+
+	ret = ext3_get_blocks_handle(handle, inode, iblock, 1, bh, create);
+	if (ret > 0) {
+		if (buffer_new(bh)) {
+			struct page *page = bh->b_page;
+
+			/*
+			 * This is a terrible hack to avoid block_prepare_write
+			 * marking our buffer as dirty
+			 */
+			if (PageUptodate(page)) {
+				ret = ext3_journal_get_write_access(handle, bh);
+				if (ret < 0)
+					goto out;
+				unmap_underlying_metadata(bh->b_bdev,
+							  bh->b_blocknr);
+				clear_buffer_new(bh);
+				set_buffer_uptodate(bh);
+				ret = ext3_journal_dirty_metadata(handle, bh);
+				if (ret < 0)
+					goto out;
+			}
+		}
+		ret = 0;
+	}
+out:
+	return ret;
+}
+
 int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		u64 start, u64 len)
 {
@@ -1196,15 +1233,18 @@ ...
From: Josef Bacik
Date: Monday, July 19, 2010 - 11:02 am

Hey Jan,

It looks like in __block_prepare_write we zero out the end of the page if we're
not writing to the entire block, but you short-circuit this behavior with this
get_block.  So it's possible that if we only write to half of the block, the
last half is going to have whatever stale data was in there before, right?
Thanks,

Josef
--

From: Josef Bacik
Date: Monday, July 19, 2010 - 11:41 am

Oops, ignore me, nothing changes for the !PageUptodate() case, which is where
the page zero'ing part happens.  Carry on :),

Josef
--

From: Jan Kara
Date: Monday, June 21, 2010 - 3:42 am

block_prepare_write() can dirty freshly created buffer. This is a problem
for data=journal mode because data buffers shouldn't be dirty unless they
are undergoing checkpoint. So we have to tweak get_block function for
data=journal mode to catch the case when block_prepare_write would dirty
the buffer, do the work instead of block_prepare_write, and properly handle
dirty buffer as data=journal mode requires it.

It might be cleaner to avoid using block_prepare_write() for data=journal
mode writes but that would require us to duplicate most of the function
which isn't nice either...

CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   47 +++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 81d6054..526404b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1388,6 +1388,43 @@ out:
 	return ret;
 }
 
+static int ext4_journalled_get_block(struct inode *inode, sector_t iblock,
+				     struct buffer_head *bh, int create)
+{
+	handle_t *handle = ext4_journal_current_handle();
+	int ret;
+
+	/* This function should ever be used only for real buffers */
+	BUG_ON(!bh->b_page);
+
+	ret = ext4_get_blocks(handle, inode, iblock, 1, bh,
+			      create ? EXT4_GET_BLOCKS_CREATE : 0);
+	if (ret > 0) {
+		if (buffer_new(bh)) {
+			struct page *page = bh->b_page;
+
+			/*
+			 * This is a terrible hack to avoid block_prepare_write
+			 * marking our buffer as dirty
+			 */
+			if (PageUptodate(page)) {
+				ret = ext4_journal_get_write_access(handle, bh);
+				if (ret < 0)
+					return ret;
+				unmap_underlying_metadata(bh->b_bdev,
+							  bh->b_blocknr);
+				clear_buffer_new(bh);
+				set_buffer_uptodate(bh);
+				ret = jbd2_journal_dirty_metadata(handle, bh);
+				if (ret < 0)
+					return ret;
+			}
+		}
+		ret = 0;
+	}
+	return ret;
+}
+
 /*
  * `handle' can be NULL if create is zero
  */
@@ -1599,12 +1636,14 @@ retry:
 ...
From: Jan Kara
Date: Wednesday, July 21, 2010 - 7:32 am

Ted, how about this patch? Do you agree with it? Thanks.

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
--

From: Jan Kara
Date: Monday, July 12, 2010 - 10:38 am

Ted, could you have a look at the ext4 patch and merge it if you're fine
with it? Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
--

Previous thread: Re: fio test triggering bad data on ext4 by Frank Mehnert on Monday, June 21, 2010 - 2:37 am. (1 message)

Next thread: Updated ext4 quota design document by Theodore Ts'o on Monday, June 21, 2010 - 5:29 am. (7 messages)