I've finally automated my favorite testcase (see attachment);
before that, I ran it by hand.
Sometimes I have seen the following complaint from fsck:
fsck.ext4 -f -n /dev/sdb2
...
Pass 5: Checking group summary information
Inode bitmap differences: -93582
Fix? no
Free inodes count wrong for group #12 (4634, counted=4633).
Fix? no
Free inodes count wrong (35610, counted=35609).
Fix? no
...
I've started to look at the inode bitmap manipulation code paths
and found strange logic in the ext{3,4}_free_inode functions:
1) The group lock is acquired twice: once for the bitmap and once for group_desc.
There is no advantage to this double locking; only the
error path (where the bit is already cleared) benefits
from this locking scheme.
It is reasonable to batch both updates into one locking block.
2) If we fail to read gdp, then bh2 is undefined, which
may result in an oops due to an undefined pointer dereference.
3) If we fail to get write_access to gdp, we skip
handle_dirty_metadata for inode_bitmap, which is also a bug.
I've redesigned the free_inode logic (see the next two emails) and
currently I'm not able to reproduce the bug, but I cannot
guarantee that it has gone away.
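To make the proposed locking change concrete, here is a minimal userspace sketch of the idea: clear the bitmap bit and adjust the group counters inside a single lock section, and still clear the bit when the descriptor cannot be updated. It is a model only, not the kernel code. The names `group_model`, `group_model_init`, `group_lock`, `group_unlock` and `model_free_inode` are hypothetical, and a C11 `atomic_flag` spinlock stands in for the per-group lock.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_WORDS 4
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical userspace stand-in for one block group; not a kernel type. */
struct group_model {
	atomic_flag lock;                  /* plays the role of the group spinlock */
	unsigned long bitmap[MODEL_WORDS]; /* inode bitmap, one bit per inode */
	unsigned int free_inodes;          /* models the free inodes count */
	unsigned int used_dirs;            /* models the used directories count */
};

static void group_model_init(struct group_model *g, unsigned int free_inodes,
			     unsigned int used_dirs)
{
	atomic_flag_clear(&g->lock);
	memset(g->bitmap, 0, sizeof(g->bitmap));
	g->free_inodes = free_inodes;
	g->used_dirs = used_dirs;
}

static void group_lock(struct group_model *g)
{
	while (atomic_flag_test_and_set_explicit(&g->lock, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void group_unlock(struct group_model *g)
{
	atomic_flag_clear_explicit(&g->lock, memory_order_release);
}

/*
 * Free one inode bit. have_desc == false models the case where the group
 * descriptor cannot be updated: the bit is still cleared, but the counters
 * are left alone. Returns true if the bit was set before the call.
 */
static bool model_free_inode(struct group_model *g, unsigned int bit,
			     bool is_directory, bool have_desc)
{
	unsigned long mask = 1UL << (bit % BITS_PER_WORD);
	unsigned long *word;
	bool cleared;

	assert(bit < MODEL_WORDS * BITS_PER_WORD);
	word = &g->bitmap[bit / BITS_PER_WORD];

	group_lock(g);
	/* Plain (non-atomic) test-and-clear is enough: the lock is held. */
	cleared = (*word & mask) != 0;
	*word &= ~mask;
	if (cleared && have_desc) {
		g->free_inodes++;
		if (is_directory)
			g->used_dirs--;
	}
	group_unlock(g);

	return cleared;
}
```

Because the bit and the counters change under the same lock, another thread taking that lock never observes one updated without the other, except in the deliberately modelled `have_desc == false` case.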
- Reorganize locking scheme to batch two atomic operation in to one.
- Fix possible undefined pointer deference.
- Even if group descriptor stats aren't assessable we have to update
inode bitmaps.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext3/ialloc.c | 62 +++++++++++++++++++++++++++--------------------------
1 files changed, 32 insertions(+), 30 deletions(-)
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index ef9008b..8352a68 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -98,7 +98,7 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
struct ext3_group_desc * gdp;
struct ext3_super_block * es;
struct ext3_sb_info *sbi;
- int fatal = 0, err;
+ int fatal = 0, err, cleared = 0;
if (atomic_read(&inode->i_count) > 1) {
printk ("ext3_free_inode: inode has count=%d\n",
@@ -150,38 +150,40 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
if (fatal)
goto error_return;
- /* Ok, now we can actually update the inode bitmaps.. */
- if (!ext3_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
- bit, bitmap_bh->b_data))
- ext3_error (sb, "ext3_free_inode",
- "bit already cleared for inode %lu", ino);
- else {
- gdp = ext3_get_group_desc (sb, block_group, &bh2);
-
+ fatal = -ESRCH;
+ gdp = ext3_get_group_desc (sb, block_group, &bh2);
+ if (gdp) {
BUFFER_TRACE(bh2, "get_write_access");
fatal = ext3_journal_get_write_access(handle, bh2);
- if (fatal) goto error_return;
-
- if (gdp) {
- spin_lock(sb_bgl_lock(sbi, block_group));
- le16_add_cpu(&gdp->bg_free_inodes_count, 1);
- if (is_directory)
- le16_add_cpu(&gdp->bg_used_dirs_count, -1);
- spin_unlock(sb_bgl_lock(sbi, block_group));
- percpu_counter_inc(&sbi->s_freeinodes_counter);
- if (is_directory)
- percpu_counter_dec(&sbi->s_dirs_counter);
-
- }
- BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
- err = ext3_journal_dirty_metadata(handle, bh2);
- if (!fatal) fatal = err;
...
- Reorganize locking scheme to batch two atomic operation in to one.
This also allow us to state what healthy group must obey following rule
ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer deference.
- Even if group descriptor stats aren't assessable we have to update
inode bitmaps.
- Move non group members update out of group_lock.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ialloc.c | 91 +++++++++++++++++++++++++++--------------------------
1 files changed, 46 insertions(+), 45 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 57f6eef..78ceab5 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -240,59 +240,60 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
if (fatal)
goto error_return;
- /* Ok, now we can actually update the inode bitmaps.. */
- cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
- bit, bitmap_bh->b_data);
- if (!cleared)
- ext4_error(sb, "bit already cleared for inode %lu", ino);
- else {
- gdp = ext4_get_group_desc(sb, block_group, &bh2);
-
+ fatal = -ESRCH;
+ gdp = ext4_get_group_desc(sb, block_group, &bh2);
+ if (gdp) {
BUFFER_TRACE(bh2, "get_write_access");
fatal = ext4_journal_get_write_access(handle, bh2);
- if (fatal) goto error_return;
-
- if (gdp) {
- ext4_lock_group(sb, block_group);
- count = ext4_free_inodes_count(sb, gdp) + 1;
- ext4_free_inodes_set(sb, gdp, count);
- if (is_directory) {
- count = ext4_used_dirs_count(sb, gdp) - 1;
- ext4_used_dirs_set(sb, gdp, count);
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t f;
-
- f = ext4_flex_group(sbi, block_group);
- atomic_dec(&sbi->s_flex_groups[f].used_dirs);
- }
+ }
+ ext4_lock_group(sb, block_group);
+ if (fatal) {
+ /* Skip group descriptor update, update only inode bitmaps */
+ cleared = ext4_clear_bit(bit, bitmap_bh->b_data);
+ ext4_unlock_group(sb, ...

This is what I dropped into the ext4 patch queue. It fixes up some
spelling errors, and a few other minor changes.
- Ted
ext4: clean up inode bitmaps manipulation in ext4_free_inode
From: Dmitry Monakhov <dmonakhov@openvz.org>
- Reorganize locking scheme to batch two atomic operation in to one.
This also allow us to state what healthy group must obey following rule
ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
inode bitmaps.
- Move non-group members update out of group_lock.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ialloc.c | 88 +++++++++++++++++++++++++++---------------------------
1 files changed, 44 insertions(+), 44 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 57f6eef..25fe42f 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -240,56 +240,56 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
if (fatal)
goto error_return;
- /* Ok, now we can actually update the inode bitmaps.. */
- cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
- bit, bitmap_bh->b_data);
- if (!cleared)
- ext4_error(sb, "bit already cleared for inode %lu", ino);
- else {
- gdp = ext4_get_group_desc(sb, block_group, &bh2);
-
+ fatal = -ESRCH;
+ gdp = ext4_get_group_desc(sb, block_group, &bh2);
+ if (gdp) {
BUFFER_TRACE(bh2, "get_write_access");
fatal = ext4_journal_get_write_access(handle, bh2);
- if (fatal) goto error_return;
-
- if (gdp) {
- ext4_lock_group(sb, block_group);
- count = ext4_free_inodes_count(sb, gdp) + 1;
- ext4_free_inodes_set(sb, gdp, count);
- if (is_directory) {
- count = ext4_used_dirs_count(sb, gdp) - 1;
- ext4_used_dirs_set(sb, gdp, count);
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t f;
-
- f = ext4_flex_group(sbi, ...

Here's my -V3 respin of this patch, which further cleans up the code
and removes some duplicated code by only calling ext4_clear_bit() from
one call site.
I think I'm about done with this, so if you agree that my changes
are improvements, it might be useful to port them back to the ext3 version
of this patch.
- Ted
ext4: clean up inode bitmaps manipulation in ext4_free_inode
From: Dmitry Monakhov <dmonakhov@openvz.org>
- Reorganize locking scheme to batch two atomic operation in to one.
This also allow us to state what healthy group must obey following rule
ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
inode bitmaps.
- Move non-group members update out of group_lock.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ialloc.c | 81 ++++++++++++++++++++++++-----------------------------
1 files changed, 37 insertions(+), 44 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 57f6eef..52618d5 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -240,56 +240,49 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
if (fatal)
goto error_return;
- /* Ok, now we can actually update the inode bitmaps.. */
- cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
- bit, bitmap_bh->b_data);
- if (!cleared)
- ext4_error(sb, "bit already cleared for inode %lu", ino);
- else {
- gdp = ext4_get_group_desc(sb, block_group, &bh2);
-
+ fatal = -ESRCH;
+ gdp = ext4_get_group_desc(sb, block_group, &bh2);
+ if (gdp) {
BUFFER_TRACE(bh2, "get_write_access");
fatal = ext4_journal_get_write_access(handle, bh2);
- if (fatal) goto error_return;
-
- if (gdp) {
- ext4_lock_group(sb, block_group);
- count = ext4_free_inodes_count(sb, gdp) + 1;
- ext4_free_inodes_set(sb, gdp, count);
- if ...

BTW, sometimes I have seen another kind of corruption:
e2fsck -fn /dev/sdb2
e2fsck 1.41.9 (22-Aug-2009)
Pass 1: Checking inodes, blocks, and sizes
Inode 69, i_blocks is 439472, should be 439480. Fix? no
...
For some unknown reason, an extent node's block wasn't accounted for in i_blocks.
Now I'm digging into that issue.
--
Interesting. So some inode is marked as free although it is in use, right? That sounds like a nasty bug - if you reproduce this again, could you use debugfs to find out what file type is that

I guess you think that this happens because we pass the lock parameter to ext3_clear_bit_atomic. But if you would actually look at the definition of the function, you would see that it's hard to find an architecture that uses the lock. Most architectures just use atomic bitop to clear the bit.

No, because during mount time we check that all gdp pointers exist so it doesn't matter.

At the moment ext3_journal_get_write_access fails we abort the journal so no writes are allowed to the filesystem anyway. So modified bitmap has hardly any chance to get to disk and you have to run fsck to clean up the mess anyway...

Honza
--
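The point about the lock parameter can be illustrated with a small userspace sketch. It assumes C11 atomics as a stand-in for an architecture's atomic bit operation. The function name `clear_bit_atomic_model` is hypothetical and this is not the kernel's definition; it only shows how a test-and-clear can be fully atomic while the lock argument goes unused.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Atomically clear bit nr in map and report whether it was set before.
 * The lock argument is accepted only to mirror the shape of the call
 * discussed above; this model never touches it.
 */
static bool clear_bit_atomic_model(void *unused_lock, unsigned int nr,
				   atomic_ulong *map)
{
	unsigned long mask = 1UL << (nr % WORD_BITS);
	unsigned long old;

	(void)unused_lock;
	assert(map != NULL);

	/* fetch_and returns the previous word, so the old bit state is known. */
	old = atomic_fetch_and(&map[nr / WORD_BITS], ~mask);
	return (old & mask) != 0;
}
```

A second call on the same bit returns false, which corresponds to the "bit already cleared" case that the original code reports as an error.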
No problem:
wget http://download.openvz.org/~dmonakhov/junk/sdb2-2.bz2
In fact I've had an even better image (with only 1 free inode in a
--
I've looked at it:

So the problem is the other way around (I always confuse this). The inode is properly deleted but the bit remains set in the bitmap. What is strange is that group descriptor counts are correct so it's really only the bitmap bit that is wrong. I've looked through the inode allocation and freeing code back and forth but I could not find a place where this could realistically happen.

So just for the record: the inode has mtime = ctime = atime = dtime (so it was really deleted), i_nlink = 0, it is a directory, i_disksize = 4096, i_blocks = 0. So indeed it looks like we were in ext4_mkdir, failed to allocate the block for the directory, and went to out_clear_inode (thus i_disksize remained set to 4096; otherwise it would be set to 0)... But how it happened that the bit in the bitmap didn't get cleared while the group descriptors were updated is beyond me. Alternatively, the inode could have been deleted just fine and later we just set the bit in the inode bitmap and didn't update anything else. But even this does not seem possible to me...

Hmm, I've looked at the code again and I think the check is there mainly to avoid an oops in case the filesystem got corrupted and we computed some bogus group number. Not that I would see how that could happen in this particular case, but in some other uses of ext3_get_group_desc it could happen. So moving the gdp check before we use bh2 probably makes some sense (although it's probably just a style cleanup in this case).

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
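The rule stated in the patch description (a healthy group's free inode count equals the number of clear bits in its inode bitmap) is easy to check offline. Below is a minimal sketch under stated assumptions: the bitmap is a plain byte array with bit i stored in byte i / 8 at position i % 8, and the helper names `count_free_bits` and `group_is_consistent` are hypothetical, not kernel or e2fsprogs functions.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Count clear (free) bits among the first nbits bits of the bitmap. */
static unsigned int count_free_bits(const unsigned char *bitmap, size_t nbits)
{
	unsigned int free_bits = 0;

	assert(bitmap != NULL);
	for (size_t i = 0; i < nbits; i++) {
		if (!(bitmap[i / CHAR_BIT] & (1u << (i % CHAR_BIT))))
			free_bits++;
	}
	return free_bits;
}

/*
 * True when the free count recorded in the (modelled) group descriptor
 * agrees with what the bitmap says.
 */
static bool group_is_consistent(const unsigned char *bitmap, size_t nbits,
				unsigned int desc_free_count)
{
	return count_free_bits(bitmap, nbits) == desc_free_count;
}
```

A stale set bit like the one described above makes the bitmap report one free inode fewer than the descriptor, which is the same off-by-one shape as fsck's "4634, counted=4633" message.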
I will, but for now I'm working on a fix for an OOPS from fs/ext4/extents.c:3479 due to ex == NULL

OK, if we know that any error results in EIO or a panic, then let's just call it a style cleanup (simplification); IMHO the new code is more readable.
--
Agreed. The reason you're seeing me respin this patch a few times is because we recently added some additional qualification testing for ext4 in $DAYJOB, and we've found that running dbench followed by fsck -fy also seems to be a good way of tickling this bug --- and applying the patch which you wrote does seem to make it go away.

Like you, I can't reproduce the problem once the patch has been applied; and like you and Jan, I can't see how this patch would actually fix a race or some other bug. But given that (a) it definitely is a code cleanup, and (b) it empirically seems to make the bug go away, and (c) we've seen this problem in our production servers, I'm inclined to take it. I hope to spend a bit more time in the next few days trying to figure out what the actual root cause is, so we can figure out whether this is really fixing a problem, or just making it harder to hit.

Dmitry, I need to thank you for all of the ext4 testing and bug fixing you've been doing. I really appreciate it!!! I'm pretty sure BTW that BZ #15792 is also one that we've seen on our production servers, and so you're finding issues that aren't just showing up in regression/stress test suites, but can and actually do happen in real-world settings.

- Ted
--
running fsstress in verbose mode, and disabling link/unlink/symlink, you can sometimes narrow it down to a sequence of operations on that file, too. (keep track of the seed nr...)

Of course if it's a random-ish race that probably won't be of much use. :)

-Eric
--
Thanks! Feel free to cc: the xfs list since the patch hits

just little editor nitpicks:

+# Perform fsstress test with parallel dd
+# This is proven to be a good stress test
+# * Continuous dd results in ENOSPC condition but only for a limited period
+# of time.

What is all this for?

FWIW other fsstress tests use an $FSSTRESS_AVOID variable,

Is this prealloc just because fsstress may run resvsp?

FWIW, other fsstress tests aren't in that group, so this is a little inconsistent.

Thanks for writing an xfstests patch! :)

-Eric
--
This is close to the same as test 083:

# Exercise filesystem full behaviour - run numerous fsstress
# processes in write mode on a small filesystem. NB: delayed
# allocate flushing is quite deadlock prone at the filesystem
# full boundary due to the fact that we will retry allocation
# several times after flushing, before giving back ENOSPC.

That test is not really doing anything XFS specific,

OK, so on a 10GB scratch device, this is going to write 50GB of data, which at 100MB/s is going to take roughly 10 minutes. The test should use a limited-size filesystem (mkfs_scratch_sized) to limit the runtime...

FWIW, test 083 spends most of its runtime at or near ENOSPC, so

You don't need to check the scratch fs in the test - that is done by the test harness after the test completes.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
