[Bug 25352] resizing ext4 will corrupt filesystem

From: bugzilla-daemon
Date: Monday, December 20, 2010 - 1:56 pm

https://bugzilla.kernel.org/show_bug.cgi?id=25352

           Summary: resizing ext4 will corrupt filesystem
           Product: File System
           Version: 2.5
    Kernel Version: 2.6.37-rc6
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: ext4
        AssignedTo: fs_ext4@kernel-bugs.osdl.org
        ReportedBy: kees@outflux.net
        Regression: Yes


Using resize2fs on an ext4 will result in a corrupted filesystem. This is a
regression (obviously).

I would expect "fsck" to be clean on a recently resized filesystem, but it is
not:

Pass 5: Checking group summary information
Block bitmap differences:  +(2621440--2621951) +(2654210--2655360)
+(2686976--2687487) +(2719744--2720255) +(2752512--2753023) +(2785280--2785791)
+(2818048--2818559) +(2850816--2851327) +(2883584--2884095) +(2916352--2916863)
+(2949120--2949631) +(2981888--2982399) +(3014656--3015167) +(3047424--3047935)
+(3080192--3080703) +(3112960--3113471) +(3145728--3146239) +(3178496--3179007)
+(3211264--3211775) +(3244032--3244543) +(3276800--3277311) +(3309568--3310079)
+(3342336--3342847) +(3375104--3375615) +(3407872--3408383) +(3440640--3441151)
+(3473408--3473919) +(3506176--3506687) +(3538944--3539455) +(3571712--3572223)
+(3604480--3604991) +(3637248--3637759) +(3670016--3670527) +(3702784--3703295)
+(3735552--3736063) +(3768320--3768831) +(3801088--3801599) +(3833856--3834367)
+(3866624--3867135) +(3899392--3899903)

etc
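
A rough way to read the ranges above is to map each block number to its
block group. The sketch below assumes 4 KiB blocks and therefore 32768
blocks per group; the report does not state the block size, so both the
constant and the helper names are illustrative assumptions. Under that
assumption, the 512-block ranges start exactly on group boundaries
(2621440 is the first block of group 80, 3899392 the first block of
group 119).

```c
#include <stdint.h>

/* Assumption: 4 KiB blocks, so one bitmap block covers 8 * 4096 blocks. */
#define BLOCKS_PER_GROUP 32768u

/* Block group containing a block (first data block assumed to be 0). */
static uint32_t block_to_group(uint64_t block)
{
	return (uint32_t)(block / BLOCKS_PER_GROUP);
}

/* Offset of a block from the first block of its group. */
static uint32_t block_offset_in_group(uint64_t block)
{
	return (uint32_t)(block % BLOCKS_PER_GROUP);
}

/* Number of blocks in an inclusive "+(first--last)" range from e2fsck. */
static uint64_t range_length(uint64_t first, uint64_t last)
{
	return last - first + 1;
}
```

For example, block_to_group(2621440) gives 80 with offset 0, and
range_length(2621440, 2621951) gives 512.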

Reproducer script attached...

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
--

From: bugzilla-daemon
Date: Monday, December 20, 2010 - 1:57 pm

https://bugzilla.kernel.org/show_bug.cgi?id=25352





--- Comment #1 from Kees Cook <kees@outflux.net>  2010-12-20 20:57:53 ---
Created an attachment (id=41062)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=41062)
script that will demo a corrupted ext4 after resize

This has already been reported to Ubuntu, but was reproduced with an upstream
kernel, so I've opened this report as well.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/692704


From: bugzilla-daemon
Date: Monday, December 20, 2010 - 8:33 pm

https://bugzilla.kernel.org/show_bug.cgi?id=25352





--- Comment #2 from Theodore Tso <tytso@mit.edu>  2010-12-21 03:33:48 ---
Created an attachment (id=41142)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=41142)
Proposed patch

Yes, this is a regression new to 2.6.37-rc1, which was introduced by
commit a31437b85: ext4: use sb_issue_zeroout in setup_new_group_blocks.

When we replaced the loop zero'ing the inode table blocks with
sb_issue_zeroout, we accidentally also removed this little tidbit:

-               ext4_set_bit(bit, bh->b_data);

... which was responsible for setting the block allocation bitmap to
reserve the block descriptor blocks and inode table blocks.  Oops...

I believe this patch should fix things.
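
The mechanism described above can be illustrated with a small userspace
model. This is only a sketch, not the kernel code or the attached patch:
the toy_* names, the 64-block group, and the "zeroed" array are all
hypothetical stand-ins. It shows how replacing a loop that both zeroes and
marks each metadata block with a bulk zeroing call leaves those blocks
unmarked in the allocation bitmap unless the marking is done separately.

```c
#include <stdint.h>
#include <string.h>

#define TOY_GROUP_BLOCKS 64u	/* hypothetical tiny group, for illustration */

struct toy_group {
	uint8_t bitmap[TOY_GROUP_BLOCKS / 8];	/* 1 bit per block, 1 = in use */
	uint8_t zeroed[TOY_GROUP_BLOCKS];	/* 1 = block was zeroed for metadata */
};

/* Hypothetical stand-in for a set-bit helper such as ext4_set_bit(). */
static void toy_set_bit(unsigned bit, uint8_t *map)
{
	map[bit / 8] |= (uint8_t)(1u << (bit % 8));
}

static int toy_test_bit(unsigned bit, const uint8_t *map)
{
	return (map[bit / 8] >> (bit % 8)) & 1;
}

/* Bulk zeroing stand-in; note that it never touches the bitmap. */
static void toy_zeroout(struct toy_group *g, unsigned start, unsigned count)
{
	memset(&g->zeroed[start], 1, count);
}

/* Original shape: one loop zeroes each block and marks it in use. */
static void setup_with_loop(struct toy_group *g, unsigned start, unsigned count)
{
	for (unsigned i = 0; i < count; i++) {
		g->zeroed[start + i] = 1;
		toy_set_bit(start + i, g->bitmap);
	}
}

/* Regressed shape: the loop is replaced by bulk zeroing, marking is lost. */
static void setup_regressed(struct toy_group *g, unsigned start, unsigned count)
{
	toy_zeroout(g, start, count);
}

/* One plausible repaired shape: bulk zeroing plus explicit marking. */
static void setup_fixed(struct toy_group *g, unsigned start, unsigned count)
{
	toy_zeroout(g, start, count);
	for (unsigned i = 0; i < count; i++)
		toy_set_bit(start + i, g->bitmap);
}

/* Blocks that hold metadata but still look free in the bitmap. */
static unsigned count_unmarked(const struct toy_group *g)
{
	unsigned n = 0;

	for (unsigned b = 0; b < TOY_GROUP_BLOCKS; b++)
		if (g->zeroed[b] && !toy_test_bit(b, g->bitmap))
			n++;
	return n;
}
```

In this model, count_unmarked() is 0 after setup_with_loop() or
setup_fixed(), and equals the number of metadata blocks after
setup_regressed(), which is the kind of mismatch Pass 5 of fsck reports as
block bitmap differences.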


From: bugzilla-daemon
Date: Monday, December 20, 2010 - 9:26 pm

https://bugzilla.kernel.org/show_bug.cgi?id=25352





--- Comment #4 from Kees Cook <kees@outflux.net>  2010-12-21 04:26:12 ---
Thanks for tracking it down! After a fsck, I'm still seeing fs corruption,
unfortunately:

[177266.375628] EXT4-fs error (device dm-1): htree_dirblock_to_tree:586: inode
#12255304: block 88074025: comm rm: bad entry in directory: rec_len is smaller
than minimal - offset=0(4096), inode=0, rec_len=0, name_len=0
[177266.375872] EXT4-fs error (device dm-1): htree_dirblock_to_tree:586: inode
#12255304: block 88074026: comm rm: bad entry in directory: rec_len is smaller
than minimal - offset=0(8192), inode=0, rec_len=0, name_len=0
[177266.376135] EXT4-fs error (device dm-1): empty_dir:1922: inode #12255304:
block 88074025: comm rm: bad entry in directory: rec_len is smaller than
minimal - offset=0(4096), inode=0, rec_len=0, name_len=0
[177266.376360] EXT4-fs error (device dm-1): empty_dir:1922: inode #12255304:
block 88074026: comm rm: bad entry in directory: rec_len is smaller than
minimal - offset=0(8192), inode=0, rec_len=0, name_len=0

fsck didn't notice this problem, but walking the tree seems to trigger it. I've
been trying to clean it up by just removing the offending directory, but I
figured I'd mention it since it seems to be a problem that fsck -f didn't see.


From: bugzilla-daemon
Date: Tuesday, December 21, 2010 - 5:31 am

https://bugzilla.kernel.org/show_bug.cgi?id=25352


Lukas Czerner <lczerner@redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lczerner@redhat.com




--- Comment #5 from Lukas Czerner <lczerner@redhat.com>  2010-12-21 12:31:33 ---
Oops indeed. Ted, thanks for the patch, it seems to fix the problem
completely.

-Lukas


From: bugzilla-daemon
Date: Tuesday, December 21, 2010 - 7:19 am

https://bugzilla.kernel.org/show_bug.cgi?id=25352


Theodore Tso <tytso@mit.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tytso@mit.edu




--- Comment #7 from Theodore Tso <tytso@mit.edu>  2010-12-21 14:19:17 ---
Kees, was this (comment #4) using your resize-corruption.sh patch?  After
applying the patch I've enclosed, I've rerun your script, and it showed no
problems.  I then mounted the testfs file system, and ran ls -lR on /mnt/test,
and still no problems...


From: bugzilla-daemon
Date: Tuesday, December 21, 2010 - 11:03 am

https://bugzilla.kernel.org/show_bug.cgi?id=25352





--- Comment #8 from Kees Cook <kees@outflux.net>  2010-12-21 18:03:21 ---
Ted, no, sorry; I didn't mean to confuse. Those are just left-over corruption
from my initial fs hit. I just thought I'd report the fact that fsck didn't
notice this when cleaning up from the original corruption.

I.e. here's my timeline for this corruption:

resize
get errors in dmesg
umount
fsck -f (for half a day, cleans up tons)
mount
delete all of lost+found
continue using fs
more dmesg errors
umount
fsck -f (returns without error)
mount
continue using fs
still dmesg errors
rm offending directory completely
no more errors

So, it seemed like a flaw in fsck that it didn't find the bad directory, but
since it was related to the corruption introduced by this kernel bug, I thought
I'd bring it up in this thread.


From: bugzilla-daemon
Date: Tuesday, December 21, 2010 - 12:19 pm

https://bugzilla.kernel.org/show_bug.cgi?id=25352





--- Comment #9 from Theodore Tso <tytso@mit.edu>  2010-12-21 19:19:46 ---
Ah, thanks for the clarification.

Ok, I think I see what's going on.  e2fsck and the kernel differ in how they
treat rec_len == 0 for block sizes less than 64k.  It's an edge case, but it's
one we should definitely fix.  Thanks for pointing it out.
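
To illustrate the kind of disagreement meant here, consider two
hypothetical decoders for the 16-bit on-disk rec_len field. Because 16
bits cannot hold the value 65536, a stored 0 can stand for "the whole
block" when the block size is 64k. The sketch below is not the actual
e2fsck or kernel code; the function names and the TOY_MIN_REC_LEN constant
are assumptions made for illustration. It shows how a decoder that applies
that special case at every block size would accept a zeroed directory
block that a stricter decoder rejects, which matches the symptom in this
thread (kernel errors with rec_len=0 while fsck -f reported nothing).

```c
#include <stdint.h>

/* Assumed smallest valid directory entry: 8-byte header plus 4 name bytes. */
#define TOY_MIN_REC_LEN 12u

/* Lenient decoder: a stored 0 means "whole block" at any block size. */
static unsigned decode_lenient(uint16_t disk_len, unsigned blocksize)
{
	if (disk_len == 0)
		return blocksize;
	return disk_len;
}

/* Strict decoder: the special meaning of 0 applies only to 64k blocks. */
static unsigned decode_strict(uint16_t disk_len, unsigned blocksize)
{
	if (blocksize >= 65536u && disk_len == 0)
		return blocksize;
	return disk_len;
}

/* Basic sanity check on a decoded record length. */
static int rec_len_ok(unsigned rec_len)
{
	return rec_len >= TOY_MIN_REC_LEN && rec_len % 4 == 0;
}
```

With a 4096-byte block, decode_lenient(0, 4096) yields 4096 and passes
rec_len_ok(), whereas decode_strict(0, 4096) yields 0 and fails it. Both
decoders agree for a 65536-byte block.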


From: bugzilla-daemon
Date: Tuesday, December 21, 2010 - 3:32 pm

https://bugzilla.kernel.org/show_bug.cgi?id=25352


Rafael J. Wysocki <rjw@sisk.pl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |florian@mickler.org,
                   |                            |maciej.rutecki@gmail.com,
                   |                            |rjw@sisk.pl
             Blocks|                            |21782




--- Comment #10 from Rafael J. Wysocki <rjw@sisk.pl>  2010-12-21 22:32:29 ---
Handled-By : Theodore Tso <tytso@mit.edu>


From: bugzilla-daemon
Date: Friday, December 24, 2010 - 6:38 am

https://bugzilla.kernel.org/show_bug.cgi?id=25352


Rafael J. Wysocki <rjw@sisk.pl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |CODE_FIX




--- Comment #11 from Rafael J. Wysocki <rjw@sisk.pl>  2010-12-24 13:38:17 ---
Fixed by
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8a7411a243...
.


From: bugzilla-daemon
Date: Friday, December 24, 2010 - 6:40 am

https://bugzilla.kernel.org/show_bug.cgi?id=25352


Rafael J. Wysocki <rjw@sisk.pl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED





From: bugzilla-daemon
Date: Thursday, December 30, 2010 - 6:47 am

https://bugzilla.kernel.org/show_bug.cgi?id=25352


Martin Steigerwald <Martin@Lichtvoll.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Martin@Lichtvoll.de




--- Comment #12 from Martin Steigerwald <Martin@Lichtvoll.de>  2010-12-30 13:47:11 ---
I had a corrupted ext4 yesterday after I did a ThinkPad T42 BIOS update while
the kernel was merely hibernated. The kernel oopsed after resuming from the
BIOS update (whether as a consequence of it or not, I cannot say, but it did;
I made a screenshot of it, some ACPI-related stuff AFAIR). Now I wonder whether
the issue was caused by my wanting to save boot time and keep my uptime, or by
the online resize a few days before, which I just didn't notice because I had
not rebooted since then.

Can you have a short look at the following to see whether that might have been
the same online resizing issue? I'd just like to know what might have been the
cause of that filesystem issue, because I doubt that my risk-based approach of
doing the BIOS update could have caused such a corruption. I will use the
shutdown and reboot method for any subsequent BIOS updates anyway - that much I
learned.

I already recovered by rsync'ing changed files to my backup as far as possible
and then redoing Ext4 from scratch with mkfs.ext4 and then restoring from
backup. I do not have the old state available anymore as I do not have a spare
220 GB to dd the filesystem to.

Thus I'd just like to know whether the following hints at this online resizing
issue or not. I have full output logs available on request. This is with:

martin@shambhala:~> cat /proc/version 
Linux version 2.6.37-rc7-tp42-ata-eh-dbg-dirty (martin@shambhala) (gcc version
4.4.5 (Debian 4.4.5-8) ) #1 PREEMPT Wed Dec 22 11:41:20 CET 2010

Which is a plain 2.6.37-rc7 + a libata debug patch in order to get to the cause
of bug ...
From: bugzilla-daemon
Date: Thursday, December 30, 2010 - 7:12 am

https://bugzilla.kernel.org/show_bug.cgi?id=25352





--- Comment #13 from Martin Steigerwald <Martin@Lichtvoll.de>  2010-12-30 14:12:08 ---
Hmmm, the test script produces different fsck.ext4 output. But then my Ext4
filesystem had about two days to grow the initial corruption. And the syslog
shows first problems on the 27th of December while I did the BIOS update
yesterday evening.

