Re: [PATCH 2/2] ext4: fix inode bitmaps manipulation in free_inode

Previous thread: Re: [PATCH 4/5] ext2: Move ext2_write_super() out of ext2_setup_super() by Jan Blunck on Wednesday, April 14, 2010 - 12:55 am. (1 message)

Next thread: [PATCH 6/7] ext2: Add ext2_sb_info s_lock spinlock by Jan Blunck on Wednesday, April 14, 2010 - 5:38 am. (1 message)
From: Dmitry Monakhov
Date: Wednesday, April 14, 2010 - 4:19 am

I've finally automated my favorite testcase (see attachment);
before this I ran it by hand.
And sometimes I saw the following complaint from fsck:
fsck.ext4 -f -n /dev/sdb2
...
Pass 5: Checking group summary information
Inode bitmap differences:  -93582
Fix? no

Free inodes count wrong for group #12 (4634, counted=4633).
Fix? no

Free inodes count wrong (35610, counted=35609).
Fix? no
...

I've started to look at the inode bitmap manipulation code paths
and found strange logic in the ext{3,4}_free_inode functions:

1) The group lock is acquired twice, once for the bitmap and once
   for the group_desc. There is no advantage to this double locking;
   only the error path (where the bit is already cleared) benefits
   from this locking scheme.
   It is reasonable to batch both updates into one locked block.
2) If we fail to read gdp then bh2 is undefined, which
   may result in an oops due to an undefined pointer dereference.
3) If we fail to get write_access to gdp we skip
   handle_dirty_metadata for inode_bitmap, which is also a bug.

I've redesigned the free_inode logic (see the next two emails) and
currently I'm not able to reproduce the bug, but I cannot
guarantee that it has gone away.
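
As an illustration of point 1, here is a minimal userspace sketch of
batching the bitmap update and the counter update under a single lock.
It is not the kernel code: struct group, group_init(), group_alloc_inode(),
group_free_inode() and INODES_PER_GROUP are hypothetical names, and a C11
atomic_flag spinlock stands in for the per-group lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INODES_PER_GROUP 64u	/* hypothetical; tiny for illustration */

/* Hypothetical userspace stand-in for one block group (not the on-disk layout). */
struct group {
	atomic_flag lock;				/* stands in for the group lock */
	uint8_t inode_bitmap[INODES_PER_GROUP / 8];	/* bit set = inode in use */
	unsigned free_inodes;				/* stands in for the free inodes count */
};

/* Start with every inode free and the lock released. */
static void group_init(struct group *g)
{
	atomic_flag_clear(&g->lock);
	memset(g->inode_bitmap, 0, sizeof(g->inode_bitmap));
	g->free_inodes = INODES_PER_GROUP;
}

static void group_lock(struct group *g)
{
	while (atomic_flag_test_and_set_explicit(&g->lock, memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void group_unlock(struct group *g)
{
	atomic_flag_clear_explicit(&g->lock, memory_order_release);
}

/* Mark an inode as in use; returns false if it was already in use. */
static bool group_alloc_inode(struct group *g, unsigned bit)
{
	uint8_t mask = (uint8_t)(1u << (bit % 8));
	bool was_free;

	assert(bit < INODES_PER_GROUP);
	group_lock(g);
	was_free = (g->inode_bitmap[bit / 8] & mask) == 0;
	if (was_free) {
		g->inode_bitmap[bit / 8] |= mask;
		g->free_inodes--;
	}
	group_unlock(g);
	return was_free;
}

/*
 * Clear the bit and bump the free count in ONE critical section, so no
 * other lock holder can observe the bitmap and the counter disagreeing.
 * Returns false if the bit was already clear (nothing is changed then).
 */
static bool group_free_inode(struct group *g, unsigned bit)
{
	uint8_t mask = (uint8_t)(1u << (bit % 8));
	bool was_set;

	assert(bit < INODES_PER_GROUP);
	group_lock(g);
	was_set = (g->inode_bitmap[bit / 8] & mask) != 0;
	if (was_set) {
		g->inode_bitmap[bit / 8] &= (uint8_t)~mask;
		g->free_inodes++;
	}
	group_unlock(g);
	return was_set;
}
```

Provided the group starts out consistent, every thread that takes the
lock in this sketch sees a free count equal to the number of clear bits.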

From: Dmitry Monakhov
Date: Wednesday, April 14, 2010 - 4:23 am

- Reorganize locking scheme to batch two atomic operations into one.
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
  inode bitmaps.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext3/ialloc.c |   62 +++++++++++++++++++++++++++--------------------------
 1 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index ef9008b..8352a68 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -98,7 +98,7 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
 	struct ext3_group_desc * gdp;
 	struct ext3_super_block * es;
 	struct ext3_sb_info *sbi;
-	int fatal = 0, err;
+	int fatal = 0, err, cleared = 0;
 
 	if (atomic_read(&inode->i_count) > 1) {
 		printk ("ext3_free_inode: inode has count=%d\n",
@@ -150,38 +150,40 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
 	if (fatal)
 		goto error_return;
 
-	/* Ok, now we can actually update the inode bitmaps.. */
-	if (!ext3_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
-					bit, bitmap_bh->b_data))
-		ext3_error (sb, "ext3_free_inode",
-			      "bit already cleared for inode %lu", ino);
-	else {
-		gdp = ext3_get_group_desc (sb, block_group, &bh2);
-
+	fatal = -ESRCH;
+	gdp = ext3_get_group_desc (sb, block_group, &bh2);
+	if (gdp) {
 		BUFFER_TRACE(bh2, "get_write_access");
 		fatal = ext3_journal_get_write_access(handle, bh2);
-		if (fatal) goto error_return;
-
-		if (gdp) {
-			spin_lock(sb_bgl_lock(sbi, block_group));
-			le16_add_cpu(&gdp->bg_free_inodes_count, 1);
-			if (is_directory)
-				le16_add_cpu(&gdp->bg_used_dirs_count, -1);
-			spin_unlock(sb_bgl_lock(sbi, block_group));
-			percpu_counter_inc(&sbi->s_freeinodes_counter);
-			if (is_directory)
-				percpu_counter_dec(&sbi->s_dirs_counter);
-
-		}
-		BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
-		err = ext3_journal_dirty_metadata(handle, bh2);
-		if (!fatal) fatal = err;
 ...
From: Dmitry Monakhov
Date: Wednesday, April 14, 2010 - 4:23 am

- Reorganize locking scheme to batch two atomic operations into one.
  This also allows us to state that a healthy group must obey the
  following rule:
  ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
  inode bitmaps.
- Move non-group members update out of group_lock.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/ialloc.c |   91 +++++++++++++++++++++++++++--------------------------
 1 files changed, 46 insertions(+), 45 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 57f6eef..78ceab5 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -240,59 +240,60 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	if (fatal)
 		goto error_return;
 
-	/* Ok, now we can actually update the inode bitmaps.. */
-	cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
-					bit, bitmap_bh->b_data);
-	if (!cleared)
-		ext4_error(sb, "bit already cleared for inode %lu", ino);
-	else {
-		gdp = ext4_get_group_desc(sb, block_group, &bh2);
-
+	fatal = -ESRCH;
+	gdp = ext4_get_group_desc(sb, block_group, &bh2);
+	if (gdp) {
 		BUFFER_TRACE(bh2, "get_write_access");
 		fatal = ext4_journal_get_write_access(handle, bh2);
-		if (fatal) goto error_return;
-
-		if (gdp) {
-			ext4_lock_group(sb, block_group);
-			count = ext4_free_inodes_count(sb, gdp) + 1;
-			ext4_free_inodes_set(sb, gdp, count);
-			if (is_directory) {
-				count = ext4_used_dirs_count(sb, gdp) - 1;
-				ext4_used_dirs_set(sb, gdp, count);
-				if (sbi->s_log_groups_per_flex) {
-					ext4_group_t f;
-
-					f = ext4_flex_group(sbi, block_group);
-					atomic_dec(&sbi->s_flex_groups[f].used_dirs);
-				}
+	}
+	ext4_lock_group(sb, block_group);
+	if (fatal) {
+		/* Skip group descriptor update, update only inode bitmaps */
+		cleared = ext4_clear_bit(bit, bitmap_bh->b_data);
+		ext4_unlock_group(sb, ...
From: tytso
Date: Wednesday, April 14, 2010 - 5:12 pm

This is what I dropped into the ext4 patch queue.  It fixes up some
spelling errors, and a few other minor changes.

					- Ted

ext4: clean up inode bitmaps manipulation in ext4_free_inode

From: Dmitry Monakhov <dmonakhov@openvz.org>

- Reorganize locking scheme to batch two atomic operations into one.
  This also allows us to state that a healthy group must obey the
  following rule:
  ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
  inode bitmaps.
- Move non-group members update out of group_lock.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/ialloc.c |   88 +++++++++++++++++++++++++++---------------------------
 1 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 57f6eef..25fe42f 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -240,56 +240,56 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	if (fatal)
 		goto error_return;
 
-	/* Ok, now we can actually update the inode bitmaps.. */
-	cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
-					bit, bitmap_bh->b_data);
-	if (!cleared)
-		ext4_error(sb, "bit already cleared for inode %lu", ino);
-	else {
-		gdp = ext4_get_group_desc(sb, block_group, &bh2);
-
+	fatal = -ESRCH;
+	gdp = ext4_get_group_desc(sb, block_group, &bh2);
+	if (gdp) {
 		BUFFER_TRACE(bh2, "get_write_access");
 		fatal = ext4_journal_get_write_access(handle, bh2);
-		if (fatal) goto error_return;
-
-		if (gdp) {
-			ext4_lock_group(sb, block_group);
-			count = ext4_free_inodes_count(sb, gdp) + 1;
-			ext4_free_inodes_set(sb, gdp, count);
-			if (is_directory) {
-				count = ext4_used_dirs_count(sb, gdp) - 1;
-				ext4_used_dirs_set(sb, gdp, count);
-				if (sbi->s_log_groups_per_flex) {
-					ext4_group_t f;
-
-					f = ext4_flex_group(sbi, ...
From: tytso
Date: Thursday, April 15, 2010 - 6:06 pm

Here's my -V3 respin of this patch, which further cleans up the code
and removes some duplicated code by only calling ext4_clear_bit() from
one call site.

I think I'm about done for this, so if you agree with my improvements
as improvements, it might be useful to port this back to ext3 version
of this patch.

						- Ted

ext4: clean up inode bitmaps manipulation in ext4_free_inode

From: Dmitry Monakhov <dmonakhov@openvz.org>

- Reorganize locking scheme to batch two atomic operations into one.
  This also allows us to state that a healthy group must obey the
  following rule:
  ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
  inode bitmaps.
- Move non-group members update out of group_lock.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/ialloc.c |   81 ++++++++++++++++++++++++-----------------------------
 1 files changed, 37 insertions(+), 44 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 57f6eef..52618d5 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -240,56 +240,49 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	if (fatal)
 		goto error_return;
 
-	/* Ok, now we can actually update the inode bitmaps.. */
-	cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
-					bit, bitmap_bh->b_data);
-	if (!cleared)
-		ext4_error(sb, "bit already cleared for inode %lu", ino);
-	else {
-		gdp = ext4_get_group_desc(sb, block_group, &bh2);
-
+	fatal = -ESRCH;
+	gdp = ext4_get_group_desc(sb, block_group, &bh2);
+	if (gdp) {
 		BUFFER_TRACE(bh2, "get_write_access");
 		fatal = ext4_journal_get_write_access(handle, bh2);
-		if (fatal) goto error_return;
-
-		if (gdp) {
-			ext4_lock_group(sb, block_group);
-			count = ext4_free_inodes_count(sb, gdp) + 1;
-			ext4_free_inodes_set(sb, gdp, count);
-			if ...
From: Dmitry Monakhov
Date: Saturday, April 17, 2010 - 3:57 am

From: Dmitry Monakhov
Date: Wednesday, April 14, 2010 - 4:35 am

BTW, sometimes I've seen another kind of corruption:
e2fsck -fn /dev/sdb2
e2fsck 1.41.9 (22-Aug-2009)
Pass 1: Checking inodes, blocks, and sizes
Inode 69, i_blocks is 439472, should be 439480.  Fix? no
...

For some unknown reason an extent node's block wasn't accounted
for in i_blocks. I'm now digging into that issue.
--

From: Jan Kara
Date: Wednesday, April 14, 2010 - 6:34 am

Interesting. So some inode is marked as free although it is in
use, right? That sounds like a nasty bug - if you reproduce this
again, could you use debugfs to find out what file type is that
  I guess you think that this happens because we pass the lock parameter
to ext3_clear_bit_atomic. But if you would actually look at the definition
of the function, you would see that it's hard to find an architecture that
uses the lock. Most architectures just use atomic bitop to clear the bit.
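
To illustrate the point about atomic bitops: a test-and-clear can be done
as one atomic read-modify-write, without taking a separate lock. The
following is a userspace sketch using C11 atomics, not the kernel's
ext3_clear_bit_atomic; the function name and the 32-bit word size are
assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Atomically clear bit `nr` in `word` and report whether it was set
 * beforehand. No lock argument is needed: the fetch-and-AND is a single
 * atomic operation, so two racing callers cannot both see "was set".
 */
static bool test_and_clear_bit_atomic(_Atomic uint32_t *word, unsigned nr)
{
	uint32_t mask;
	uint32_t old;

	assert(nr < 32);
	mask = (uint32_t)1 << nr;
	old = atomic_fetch_and(word, ~mask);
	return (old & mask) != 0;
}
```

The return value plays the same role as the "bit already cleared" check
in the patch: a false result means someone cleared the bit earlier.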
  No, because during mount time we check that all gdp pointers exist so
  It doesn't matter. At the moment ext3_journal_get_write_access fails we
abort the journal so no writes are allowed to the filesystem anyway. So
modified bitmap has hardly any chance to get to disk and you have to
run fsck to clean up the mess anyway...

								Honza
--

From: Dmitry Monakhov
Date: Wednesday, April 14, 2010 - 7:33 am

No problems, 
wget http://download.openvz.org/~dmonakhov/junk/sdb2-2.bz2
In fact I've had an even better image (with only 1 free inode in a
--

From: Jan Kara
Date: Thursday, April 15, 2010 - 2:39 pm

I've looked at it: So the problem is the other way around (I always
confuse this). The inode is properly deleted but the bit remains set
in the bitmap. What is strange is that group descriptor counts are
correct so it's really only the bitmap bit that is wrong. I've looked
through the inode allocation and freeing code back and forth but I could
not find a place where this could realistically happen.
  So just for record:
Inode has mtime = ctime = atime = dtime (so it was really deleted), i_nlink
= 0, it is a directory, i_disksize = 4096, i_blocks = 0. So indeed it looks
that we were in ext4_mkdir, we failed to allocate the block for directory
and went to out_clear_inode (thus i_disksize remained to be set to 4096,
otherwise it would be set to 0)... But how it happened that the bit in the
bitmap didn't get cleared while the group descriptors were updated is
beyond me.
  Alternatively the inode could have been deleted just fine and later we
just set the bit in the inode bitmap and didn't update anything else. But
even this does not seem to be possible to me...
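
For reference, the kind of consistency check being discussed (recorded
free count versus the number of clear bits in the bitmap) can be sketched
in userspace as follows. The helper names are hypothetical and this is
not how e2fsck or the kernel implements it; it only shows that one stale
set bit makes the bitmap-derived count one lower than the recorded count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the clear (free) bits among the first nbits bits of a bitmap. */
static unsigned count_free_bits(const uint8_t *bitmap, size_t nbits)
{
	unsigned free_bits = 0;

	assert(bitmap != NULL);
	for (size_t i = 0; i < nbits; i++)
		if (!(bitmap[i / 8] & (1u << (i % 8))))
			free_bits++;
	return free_bits;
}

/*
 * Returns recorded minus counted: 0 when the recorded free count agrees
 * with the bitmap, positive when the bitmap has more bits set than the
 * recorded count implies.
 */
static long free_count_mismatch(const uint8_t *bitmap, size_t nbits,
				unsigned recorded_free)
{
	return (long)recorded_free - (long)count_free_bits(bitmap, nbits);
}
```

With a 16-bit bitmap that has nine bits set, a recorded free count of 8
gives a mismatch of 1, the same shape as "4634, counted=4633" above.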
  Hmm, I've looked at the code again and I think the check is there mainly
to avoid Oops in case filesystem got corrupted and we computed some bogus
group number. Not that I would see how that could happen in this particular
case but in some other uses of ext3_get_group_desc it could happen. So
moving the gdp check before we use bh2 probably makes some sense (although
it's probably just a style cleanup in this case).
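
The ordering issue can be shown with a small userspace sketch: when a
lookup returns NULL, its output buffer pointer was never filled in, so the
result has to be checked before the buffer is touched. All types and
functions here (struct desc, struct buf, get_group_desc(),
credit_free_inode()) are hypothetical stand-ins, not the ext3/ext4 API.

```c
#include <assert.h>
#include <stddef.h>

struct desc { unsigned free_inodes; };	/* stand-in for a group descriptor */
struct buf  { int dirty; };		/* stand-in for a buffer head */

/* Hypothetical lookup: on failure returns NULL and leaves *bh_out untouched. */
static struct desc *get_group_desc(struct desc *table, struct buf *bufs,
				   size_t ngroups, size_t group,
				   struct buf **bh_out)
{
	if (group >= ngroups)
		return NULL;
	*bh_out = &bufs[group];
	return &table[group];
}

/* Returns 0 on success, -1 if the descriptor could not be found. */
static int credit_free_inode(struct desc *table, struct buf *bufs,
			     size_t ngroups, size_t group)
{
	struct buf *bh = NULL;
	struct desc *gdp = get_group_desc(table, bufs, ngroups, group, &bh);

	if (!gdp)
		return -1;	/* bh was never set: do not touch it */
	assert(bh != NULL);
	gdp->free_inodes++;
	bh->dirty = 1;		/* only now is it safe to use the buffer */
	return 0;
}
```

Initializing bh to NULL and testing gdp first means a bogus group number
yields an error return instead of a dereference of an unset pointer.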

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Dmitry Monakhov
Date: Thursday, April 15, 2010 - 3:01 pm

I will, but for now I'm working on a fix for an OOPS
from fs/ext4/extents.c:3479 due to ex == NULL.
OK, if we know that any error results in EIO or panic then let's just
call it a style cleanup (simplification); IMHO the new code is more readable.
--

From: tytso
Date: Friday, April 16, 2010 - 6:33 am

Agreed.  The reason you're seeing me respin this patch a few times is
because we recently added some additional qualification testing for
ext4 in $DAYJOB, and we've found that running dbench followed by fsck
-fy also seems to be a good way of tickling this bug --- and applying
the patch which you wrote does seem to make it go away.  

Like you, I can't reproduce the problem once the patch has been
applied; and like you and Jan, I can't see how this patch would
actually fix a race or some other bug.  But given that (a) it
definitely is a code cleanup, and (b) it empirically seems to make the
bug go away, and (c) we've seen this problem in our production
servers, I'm inclined to take it.

I hope to spend a bit more time in the next few days trying to figure
out what the actual root cause is, so we can figure out whether this
is really fixing a problem, or just making it harder to hit.

Dmitry, I need to thank you for all of the ext4 testing and bug fixing
you've been doing.  I really appreciate it!!!  I'm pretty sure BTW
that BZ #15792 is also one that we've seen on our production servers,
and so you're finding issues that aren't just showing up in
regression/stress test suites, but can and actually do happen in
real-world settings.

					- Ted
--

From: Eric Sandeen
Date: Wednesday, April 14, 2010 - 9:03 am

running fsstress in verbose mode, and disabling link/unlink/symlink,
you can sometimes narrow it down to a sequence of operations on that file, too.
(keep track of the seed nr...)

Of course if it's a random-ish race that probably won't be of much use.  :)

-Eric
--

From: Eric Sandeen
Date: Wednesday, April 14, 2010 - 9:01 am

Thanks!  Feel free to cc: the xfs list since the patch hits

just little editor nitpicks: 

+# Perform fsstress test with parallel dd
+# This is proven to be a good stress test
+# * Continuous dd results in ENOSPC condition but only for a limited period
+#   of time.


What is all this for?

FWIW other fsstress tests use an $FSSTRESS_AVOID variable,

Is this prealloc just because fsstress may run resvsp?
FWIW, other fsstress tests aren't in that group, so this is
a little inconsistent.

Thanks for writing an xfstests patch! :)

-Eric
--

From: Dave Chinner
Date: Wednesday, April 14, 2010 - 4:47 pm

This is close to the same as test 083:

# Exercise filesystem full behaviour - run numerous fsstress
# processes in write mode on a small filesystem.  NB: delayed
# allocate flushing is quite deadlock prone at the filesystem
# full boundary due to the fact that we will retry allocation
# several times after flushing, before giving back ENOSPC.

That test is not really doing anything XFS specific,

OK, so on a 10GB scratch device, this is going to write 50GB of
data, which at 100MB/s is going to take roughly 10 minutes.
The test should use a limited-size filesystem (mkfs_scratch_sized)
to limit the runtime...

FWIW, test 083 spends most of its runtime at or near ENOSPC, so

You don't need to check the scratch fs in the test - that is done by
the test harness after the test completes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--
