"I've just released the 2.6.23-rc9-ext4-1. It collapses some patches in preparation for pushing them to Linus, and adds some of the cleanup patches that had been incorporated into Andrew's broken-out-2007-10-01-04-09 series," announced Theodore Ts'o. He also noted of the current ext4 git tree, "it also has some new development patches in the unstable (not yet ready to push to mainline) portion of the patch series." In an earlier thread Theodore posted a series of patches specifically intended for inclusion in the upcoming 2.6.24 kernel. Included in the patch series was a patch for improving fsck performance, "in performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesytem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count." The patch included an explanation of how the feature works, enabled through a mkfs option:
"With this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation."
From: Theodore Ts'o <tytso@...>
Subject: 2.6.23-rc8-ext4-1 patchset released
Date: Oct 4, 1:59 am 2007

I've just released the 2.6.23-rc9-ext4-1. It collapses some patches in preparation for pushing them to Linus, and adds some of cleanup patches that had been incorporated into Andrew's broken-out-2007-10-01-04-09 series. It also has some new development patches in the unstable (not yet ready to push to mainline) portion of the patch series.

It's available in the standard place:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 2.6.23-rc9-ext4-1
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=summary

and

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/2.6.23-rc9-ext4-1

- Ted

+# pulled from rc8-mm2
+jbd2-ext4-cleanups-convert-to-kzalloc.patch
+# also in akpm's broken-out-2007-10-01-04-09
+jbd2-fix-commit-code-to-properly-abort-journal.patch
+
+# pulled from akpm's broken-out-2007-10-01-04-09
+jbd2-debug-code-cleanup.patch
+# pulled from akpm's broken-out-2007-10-01-04-09
+remove-ifdef-config_ext4_index.patch
+# Large blocksize support for ext4
+ext4_large_blocksize_support.patch
+ext4_rec_len_overflow_with_64kblk_fix-v2.patch
+# Large blocksize support for ext2/3
+# Will drop these patches once they are in akmp's mm tree
+ext2_large_blocksize_support.patch
+ext3_large_blocksize_support.patch
+ext2_rec_len_overflow_with_64kblk_fix-v2.patch
+ext3_rec_len_overflow_with_64kblk_fix-v2.patch
+# New patchset
+ext4-convert_bg_block_bitmap_to_bg_block_bitmap_lo.patch
+ext4-convert_bg_inode_bitmap_and_bg_inode_table.patch
+ext4-convert_s_blocks_count_to_s_blocks_count_lo.patch
+ext4-convert_s_r_blocks_count_and_s_free_blocks_count.patch
+ext4-convert_ext4_extent.ee_start_to_ext4_extent.ee_start_lo.patch
+ext4-convert_ext4_extent_idx.ei_leaf_to_ext4_extent_idx.ei_leaf_lo.patch
+ext4-sparse-fix.patch
+# Large block support for blocksize > pagesize
+# Needed for Christoph Lameter's largeblock patchset
+# to support large block on system that
+# blocksize > pagesize
+jbd-blocks-reservation-fix-for-large-blk.patch
+jbd2-blocks-reservation-fix-for-large-blk.patch
+ext4_fix_setup_new_group_blocks_locking.patch
+ext4_lighten_up_resize_transaction_requirements.patch
From: Theodore Ts'o <tytso@...>
Subject: [PATCH, RFC] Ext4 patches planned for submission upstream
Date: Oct 4, 1:50 am 2007

The following ext4 patches are planned for submission to Linus once the merge window for 2.6.24-rc1 is opened.

- Ted
From: Theodore Ts'o <tytso@...>
Subject: [PATCH] Ext4: Uninitialized Block Groups
Date: Oct 4, 1:50 am 2007

From: Andreas Dilger <adilger@clusterfs.com>

In pass1 of e2fsck, every inode table in the filesystem is scanned and checked, regardless of whether it is in use. This is the most time consuming part of the filesystem check. The uninitialized block group feature can greatly reduce e2fsck time by eliminating checking of uninitialized inodes.

With this feature, there is a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation.

The feature is enabled through a mkfs option

	mke2fs /dev/ -O uninit_groups

A patch adding support for uninitialized block groups to e2fsprogs tools has been posted to the linux-ext4 mailing list.

The patches have been stress tested with fsstress and fsx. In performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesystem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count. Since typical ext4 filesystems only use 1-10% of their inodes, this feature can greatly reduce e2fsck time for users, with a performance improvement of 2-20 times, depending on how full the filesystem is.

The attached graph shows the major improvements in e2fsck times in filesystems with a large total inode count, but few inodes in use.

In each group descriptor if we have

EXT4_BG_INODE_UNINIT set in bg_flags:
	Inode table is not initialized/used in this group. So we can skip the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
	No block in the group is used. So we can skip the block bitmap verification for this group.

We also add two new fields to group descriptor as a part of uninitialized group patch.

	__le16 bg_itable_unused; /* Unused inodes count */
	__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */

bg_itable_unused:

If we have EXT4_BG_INODE_UNINIT not set in bg_flags then bg_itable_unused will give the offset within the inode table till the inodes are used. This can be used by fsck to skip list of inodes that are marked unused.

bg_checksum:

Now that we depend on bg_flags and bg_itable_unused to determine the block and inode usage, we need to make sure group descriptor is not corrupt. We add checksum to group descriptor to detect corruption. If the descriptor is found to be corrupt, we mark all the blocks and inodes in the group used.

Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/Kconfig              |    1 +
 fs/ext4/balloc.c        |   92 ++++++++++++++++++++++++++++-
 fs/ext4/group.h         |   29 +++++++++
 fs/ext4/ialloc.c        |  146 ++++++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/resize.c        |    2 +
 fs/ext4/super.c         |   47 +++++++++++++++
 include/linux/ext4_fs.h |   16 ++++-
 7 files changed, 317 insertions(+), 16 deletions(-)
 create mode 100644 fs/ext4/group.h

diff --git a/fs/Kconfig b/fs/Kconfig
index f9eed6d..97eef97 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -140,6 +140,7 @@ config EXT4DEV_FS
 	tristate "Ext4dev/ext4 extended fs support development (EXPERIMENTAL)"
 	depends on EXPERIMENTAL
 	select JBD2
+	select CRC16
 	help
 	  Ext4dev is a predecessor filesystem of the next generation
 	  extended fs ext4, based on ext3 filesystem code.
 	  It will be
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index e53b4af..d1a8882 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -20,6 +20,7 @@
 #include <linux/quotaops.h>
 #include <linux/buffer_head.h>
 
+#include "group.h"
 /*
  * balloc.c contains the blocks allocation and deallocation routines
  */
@@ -42,6 +43,74 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
 
 }
 
+/* Initializes an uninitialized block bitmap if given, and returns the
+ * number of blocks free in the group. */
+unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
+				int block_group, struct ext4_group_desc *gdp)
+{
+	unsigned long start;
+	int bit, bit_max;
+	unsigned free_blocks;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (bh) {
+		J_ASSERT_BH(bh, buffer_locked(bh));
+
+		/* If checksum is bad mark all blocks used to prevent allocation
+		 * essentially implementing a per-group read-only flag. */
+		if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
+			ext4_error(sb, __FUNCTION__,
+				"Checksum bad for group %u\n", block_group);
+			gdp->bg_free_blocks_count = 0;
+			gdp->bg_free_inodes_count = 0;
+			gdp->bg_itable_unused = 0;
+			memset(bh->b_data, 0xff, sb->s_blocksize);
+			return 0;
+		}
+		memset(bh->b_data, 0, sb->s_blocksize);
+	}
+
+	/* Check for superblock and gdt backups in this group */
+	bit_max = ext4_bg_has_super(sb, block_group);
+
+	if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_META_BG) ||
+	    block_group < le32_to_cpu(sbi->s_es->s_first_meta_bg) *
+			  sbi->s_desc_per_block) {
+		if (bit_max) {
+			bit_max += ext4_bg_num_gdb(sb, block_group);
+			bit_max +=
+				le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks);
+		}
+	} else { /* For META_BG_BLOCK_GROUPS */
+		int group_rel = (block_group -
+				 le32_to_cpu(sbi->s_es->s_first_meta_bg)) %
+				EXT4_DESC_PER_BLOCK(sb);
+		if (group_rel == 0 || group_rel == 1 ||
+		    (group_rel == EXT4_DESC_PER_BLOCK(sb) - 1))
+			bit_max += 1;
+	}
+
+	/* Last and first groups are always initialized */
+	free_blocks = EXT4_BLOCKS_PER_GROUP(sb) - bit_max;
+
+	if (bh) {
+		for (bit = 0; bit < bit_max; bit++)
+			ext4_set_bit(bit, bh->b_data);
+
+		start = block_group * EXT4_BLOCKS_PER_GROUP(sb) +
+			le32_to_cpu(sbi->s_es->s_first_data_block);
+
+		/* Set bits for block and inode bitmaps, and inode table */
+		ext4_set_bit(ext4_block_bitmap(sb, gdp) - start, bh->b_data);
+		ext4_set_bit(ext4_inode_bitmap(sb, gdp) - start, bh->b_data);
+		for (bit = le32_to_cpu(gdp->bg_inode_table) - start,
+		     bit_max = bit + sbi->s_itb_per_group; bit < bit_max; bit++)
+			ext4_set_bit(bit, bh->b_data);
+	}
+
+	return free_blocks - sbi->s_itb_per_group - 2;
+}
+
 /*
  * The free blocks are managed by bitmaps. A file system contains several
  * blocks groups. Each group contains 1 bitmap block for blocks, 1 bitmap
@@ -110,16 +179,29 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
  *
  * Return buffer_head on success or NULL in case of failure.
  */
-static struct buffer_head *
+struct buffer_head *
 read_block_bitmap(struct super_block *sb, unsigned int block_group)
 {
 	struct ext4_group_desc * desc;
 	struct buffer_head * bh = NULL;
 
-	desc = ext4_get_group_desc (sb, block_group, NULL);
+	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		goto error_out;
-	bh = sb_bread(sb, ext4_block_bitmap(sb, desc));
+	if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+		bh = sb_getblk(sb, ext4_block_bitmap(sb, desc));
+		if (!buffer_uptodate(bh)) {
+			lock_buffer(bh);
+			if (!buffer_uptodate(bh)) {
+				ext4_init_block_bitmap(sb, bh, block_group,
+						       desc);
+				set_buffer_uptodate(bh);
+			}
+			unlock_buffer(bh);
+		}
+	} else {
+		bh = sb_bread(sb, ext4_block_bitmap(sb,desc));
+	}
 	if (!bh)
 		ext4_error (sb, "read_block_bitmap",
 			    "Cannot read block bitmap - "
@@ -586,6 +668,7 @@ do_more:
 	desc->bg_free_blocks_count =
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
+	desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
 	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
 
@@ -1644,8 +1727,11 @@ allocated:
 			ret_block, goal_hits, goal_attempts);
 
 	spin_lock(sb_bgl_lock(sbi, group_no));
+	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
+		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group_no, gdp);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
 	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
 
diff --git a/fs/ext4/group.h b/fs/ext4/group.h
new file mode 100644
index 0000000..9310979
--- /dev/null
+++ b/fs/ext4/group.h
@@ -0,0 +1,29 @@
+/*
+ *  linux/fs/ext4/group.h
+ *
+ * Copyright (C) 2007 Cluster File Systems, Inc
+ *
+ * Author: Andreas Dilger <adilger@clusterfs.com>
+ */
+
+#ifndef _LINUX_EXT4_GROUP_H
+#define _LINUX_EXT4_GROUP_H
+#if defined(CONFIG_CRC16)
+#include <linux/crc16.h>
+#endif
+
+extern __le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 group,
+				   struct ext4_group_desc *gdp);
+extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
+				       struct ext4_group_desc *gdp);
+struct buffer_head *read_block_bitmap(struct super_block *sb,
+				      unsigned int block_group);
+extern unsigned ext4_init_block_bitmap(struct super_block *sb,
+				       struct buffer_head *bh, int group,
+				       struct ext4_group_desc *desc);
+#define ext4_free_blocks_after_init(sb, group, desc)			\
+		ext4_init_block_bitmap(sb, NULL, group, desc)
+extern unsigned ext4_init_inode_bitmap(struct super_block *sb,
+				       struct buffer_head *bh, int group,
+				       struct ext4_group_desc *desc);
+#endif /* _LINUX_EXT4_GROUP_H */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index b8b538d..1fa418c 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -28,6 +28,7 @@
 
 #include "xattr.h"
 #include "acl.h"
+#include "group.h"
 
 /*
  * ialloc.c contains the inodes allocation and deallocation routines
@@ -43,6 +44,52 @@
  * the free blocks count in the block.
  */
 
+/*
+ * To avoid calling the atomic setbit hundreds or thousands of times, we only
+ * need to use it within a single byte (to ensure we get endianness right).
+ * We can use memset for the rest of the bitmap as there are no other users.
+ */
+static void mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
+{
+	int i;
+
+	if (start_bit >= end_bit)
+		return;
+
+	ext4_debug("mark end bits +%d through +%d used\n", start_bit, end_bit);
+	for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++)
+		ext4_set_bit(i, bitmap);
+	if (i < end_bit)
+		memset(bitmap + (i >> 3), 0xff, (end_bit - i) >> 3);
+}
+
+/* Initializes an uninitialized inode bitmap */
+unsigned ext4_init_inode_bitmap(struct super_block *sb,
+				struct buffer_head *bh, int block_group,
+				struct ext4_group_desc *gdp)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	J_ASSERT_BH(bh, buffer_locked(bh));
+
+	/* If checksum is bad mark all blocks and inodes use to prevent
+	 * allocation, essentially implementing a per-group read-only flag. */
+	if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
+		ext4_error(sb, __FUNCTION__, "Checksum bad for group %u\n",
+			   block_group);
+		gdp->bg_free_blocks_count = 0;
+		gdp->bg_free_inodes_count = 0;
+		gdp->bg_itable_unused = 0;
+		memset(bh->b_data, 0xff, sb->s_blocksize);
+		return 0;
+	}
+
+	memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
+	mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), EXT4_BLOCKS_PER_GROUP(sb),
+			bh->b_data);
+
+	return EXT4_INODES_PER_GROUP(sb);
+}
 
 /*
  * Read the inode allocation bitmap for a given block_group, reading
@@ -59,8 +106,20 @@ read_inode_bitmap(struct super_block * sb, unsigned long block_group)
 	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		goto error_out;
-
-	bh = sb_bread(sb, ext4_inode_bitmap(sb, desc));
+	if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+		bh = sb_getblk(sb, ext4_inode_bitmap(sb, desc));
+		if (!buffer_uptodate(bh)) {
+			lock_buffer(bh);
+			if (!buffer_uptodate(bh)) {
+				ext4_init_inode_bitmap(sb, bh, block_group,
+						       desc);
+				set_buffer_uptodate(bh);
+			}
+			unlock_buffer(bh);
+		}
+	} else {
+		bh = sb_bread(sb, ext4_inode_bitmap(sb, desc));
+	}
 	if (!bh)
 		ext4_error(sb, "read_inode_bitmap",
 			    "Cannot read inode bitmap - "
@@ -169,6 +228,8 @@ void ext4_free_inode (handle_t *handle, struct inode * inode)
 			if (is_directory)
 				gdp->bg_used_dirs_count = cpu_to_le16(
 				  le16_to_cpu(gdp->bg_used_dirs_count) - 1);
+			gdp->bg_checksum = ext4_group_desc_csum(sbi,
+							block_group, gdp);
 			spin_unlock(sb_bgl_lock(sbi, block_group));
 			percpu_counter_inc(&sbi->s_freeinodes_counter);
 			if (is_directory)
@@ -438,7 +499,7 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)
 	struct ext4_sb_info *sbi;
 	int err = 0;
 	struct inode *ret;
-	int i;
+	int i, free = 0;
 
 	/* Cannot create files in a deleted directory */
 	if (!dir || !dir->i_nlink)
@@ -520,11 +581,13 @@ repeat_in_this_group:
 	goto out;
 
 got:
-	ino += group * EXT4_INODES_PER_GROUP(sb) + 1;
-	if (ino < EXT4_FIRST_INO(sb) || ino > le32_to_cpu(es->s_inodes_count)) {
-		ext4_error (sb, "ext4_new_inode",
-			    "reserved inode or inode > inodes count - "
-			    "block_group = %d, inode=%lu", group, ino);
+	ino++;
+	if ((group == 0 && ino < EXT4_FIRST_INO(sb)) ||
+	    ino > EXT4_INODES_PER_GROUP(sb)) {
+		ext4_error(sb, __FUNCTION__,
+			   "reserved inode or inode > inodes count - "
+			   "block_group = %d, inode=%lu", group,
+			   ino + group * EXT4_INODES_PER_GROUP(sb));
 		err = -EIO;
 		goto fail;
 	}
@@ -532,13 +595,78 @@ got:
 	BUFFER_TRACE(bh2, "get_write_access");
 	err = ext4_journal_get_write_access(handle, bh2);
 	if (err) goto fail;
+
+	/* We may have to initialize the block bitmap if it isn't already */
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&
+	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+		struct buffer_head *block_bh = read_block_bitmap(sb, group);
+
+		BUFFER_TRACE(block_bh, "get block bitmap access");
+		err = ext4_journal_get_write_access(handle, block_bh);
+		if (err) {
+			brelse(block_bh);
+			goto fail;
+		}
+
+		free = 0;
+		spin_lock(sb_bgl_lock(sbi, group));
+		/* recheck and clear flag under lock if we still need to */
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+			free = ext4_free_blocks_after_init(sb, group, gdp);
+			gdp->bg_free_blocks_count = cpu_to_le16(free);
+		}
+		spin_unlock(sb_bgl_lock(sbi, group));
+
+		/* Don't need to dirty bitmap block if we didn't change it */
+		if (free) {
+			BUFFER_TRACE(block_bh, "dirty block bitmap");
+			err = ext4_journal_dirty_metadata(handle, block_bh);
+		}
+
+		brelse(block_bh);
+		if (err)
+			goto fail;
+	}
+
 	spin_lock(sb_bgl_lock(sbi, group));
+	/* If we didn't allocate from within the initialized part of the inode
+	 * table then we need to initialize up to this inode. */
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
+
+			/* When marking the block group with
+			 * ~EXT4_BG_INODE_UNINIT we don't want to depend
+			 * on the value of bg_itable_unsed even though
+			 * mke2fs could have initialized the same for us.
+			 * Instead we calculated the value below
+			 */
+
+			free = 0;
+		} else {
+			free = EXT4_INODES_PER_GROUP(sb) -
+				le16_to_cpu(gdp->bg_itable_unused);
+		}
+
+		/*
+		 * Check the relative inode number against the last used
+		 * relative inode number in this group. if it is greater
+		 * we need to update the bg_itable_unused count
+		 *
+		 */
+		if (ino > free)
+			gdp->bg_itable_unused =
+				cpu_to_le16(EXT4_INODES_PER_GROUP(sb) - ino);
+	}
+
 	gdp->bg_free_inodes_count =
 		cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
 	if (S_ISDIR(mode)) {
 		gdp->bg_used_dirs_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);
 	}
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
 	spin_unlock(sb_bgl_lock(sbi, group));
 	BUFFER_TRACE(bh2, "call ext4_journal_dirty_metadata");
 	err = ext4_journal_dirty_metadata(handle, bh2);
@@ -560,7 +688,7 @@ got:
 		inode->i_gid = current->fsgid;
 	inode->i_mode = mode;
 
-	inode->i_ino = ino;
+	inode->i_ino = ino + group * EXT4_INODES_PER_GROUP(sb);
 	/* This is the optimal IO size (for stat), not the fs block size */
 	inode->i_blocks = 0;
 	inode->i_mtime = inode->i_atime = inode->i_ctime = ei->i_crtime =
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index aa11d7d..3359450 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -16,6 +16,7 @@
 #include <linux/errno.h>
 #include <linux/slab.h>
 
+#include "group.h"
 
 #define outside(b, first, last)	((b) < (first) || (b) >= (last))
 #define inside(b, first, last)	((b) >= (first) && (b) < (last))
@@ -842,6 +843,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
 	ext4_inode_table_set(sb, gdp, input->inode_table); /* LV FIXME */
 	gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count);
 	gdp->bg_free_inodes_count = cpu_to_le16(EXT4_INODES_PER_GROUP(sb));
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);
 
 	/*
 	 * Make the new blocks and inodes valid next. We do this before
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 420d39d..b59610d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -37,12 +37,14 @@
 #include <linux/quotaops.h>
 #include <linux/seq_file.h>
 #include <linux/log2.h>
+#include <linux/crc16.h>
 
 #include <asm/uaccess.h>
 
 #include "xattr.h"
 #include "acl.h"
 #include "namei.h"
+#include "group.h"
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -1237,6 +1239,43 @@ static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es,
 	return res;
 }
 
+__le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 block_group,
+			    struct ext4_group_desc *gdp)
+{
+	__u16 crc = 0;
+
+	if (sbi->s_es->s_feature_ro_compat &
+	    cpu_to_le32(EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
+		int offset = offsetof(struct ext4_group_desc, bg_checksum);
+		__le32 le_group = cpu_to_le32(block_group);
+
+		crc = crc16(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+		crc = crc16(crc, (__u8 *)&le_group, sizeof(le_group));
+		crc = crc16(crc, (__u8 *)gdp, offset);
+		offset += sizeof(gdp->bg_checksum); /* skip checksum */
+		/* for checksum of struct ext4_group_desc do the rest...*/
+		if ((sbi->s_es->s_feature_incompat &
+		     cpu_to_le32(EXT4_FEATURE_INCOMPAT_64BIT)) &&
+		    offset < le16_to_cpu(sbi->s_es->s_desc_size))
+			crc = crc16(crc, (__u8 *)gdp + offset,
+				    le16_to_cpu(sbi->s_es->s_desc_size) -
+					offset);
+	}
+
+	return cpu_to_le16(crc);
+}
+
+int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 block_group,
+				struct ext4_group_desc *gdp)
+{
+	if ((sbi->s_es->s_feature_ro_compat &
+	     cpu_to_le32(EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) &&
+	    (gdp->bg_checksum != ext4_group_desc_csum(sbi, block_group, gdp)))
+		return 0;
+
+	return 1;
+}
+
 /* Called at mount-time, super-block is locked */
 static int ext4_check_descriptors (struct super_block * sb)
 {
@@ -1291,6 +1330,14 @@ static int ext4_check_descriptors (struct super_block * sb)
 					i, inode_table);
 			return 0;
 		}
+		if (!ext4_group_desc_csum_verify(sbi, i, gdp)) {
+			ext4_error(sb, __FUNCTION__,
+				   "Checksum for group %d failed (%u!=%u)\n", i,
+				   le16_to_cpu(ext4_group_desc_csum(sbi, i,
+								    gdp)),
+				   le16_to_cpu(gdp->bg_checksum));
+			return 0;
+		}
 		first_block += EXT4_BLOCKS_PER_GROUP(sb);
 		gdp = (struct ext4_group_desc *)
 			((__u8 *)gdp + EXT4_DESC_SIZE(sb));
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 151738a..b77b59f 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -105,19 +105,25 @@
  */
 struct ext4_group_desc
 {
-	__le32	bg_block_bitmap;	/* Blocks bitmap block */
-	__le32	bg_inode_bitmap;	/* Inodes bitmap block */
+	__le32	bg_block_bitmap;	/* Blocks bitmap block */
+	__le32	bg_inode_bitmap;	/* Inodes bitmap block */
 	__le32	bg_inode_table;		/* Inodes table block */
 	__le16	bg_free_blocks_count;	/* Free blocks count */
 	__le16	bg_free_inodes_count;	/* Free inodes count */
 	__le16	bg_used_dirs_count;	/* Directories count */
-	__u16	bg_flags;
-	__u32	bg_reserved[3];
+	__le16	bg_flags;		/* EXT4_BG_flags (INODE_UNINIT, etc) */
+	__u32	bg_reserved[2];		/* Likely block/inode bitmap checksum */
+	__le16	bg_itable_unused;	/* Unused inodes count */
+	__le16	bg_checksum;		/* crc16(sb_uuid+group+desc) */
 	__le32	bg_block_bitmap_hi;	/* Blocks bitmap block MSB */
 	__le32	bg_inode_bitmap_hi;	/* Inodes bitmap block MSB */
 	__le32	bg_inode_table_hi;	/* Inodes table block MSB */
 };
 
+#define EXT4_BG_INODE_UNINIT	0x0001 /* Inode table/bitmap not in use */
+#define EXT4_BG_BLOCK_UNINIT	0x0002 /* Block bitmap not in use */
+#define EXT4_BG_INODE_ZEROED	0x0004 /* On-disk itable initialized to zero */
+
 #ifdef __KERNEL__
 #include <linux/ext4_fs_i.h>
 #include <linux/ext4_fs_sb.h>
@@ -665,6 +671,7 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 #define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
 #define EXT4_FEATURE_RO_COMPAT_LARGE_FILE	0x0002
 #define EXT4_FEATURE_RO_COMPAT_BTREE_DIR	0x0004
+#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM		0x0010
 #define EXT4_FEATURE_RO_COMPAT_DIR_NLINK	0x0020
 #define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE	0x0040
 
@@ -684,6 +691,7 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 					 EXT4_FEATURE_INCOMPAT_64BIT)
 #define EXT4_FEATURE_RO_COMPAT_SUPP	(EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
+					 EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
 					 EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
 					 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
 					 EXT4_FEATURE_RO_COMPAT_BTREE_DIR)
-- 
1.5.3.2.81.g17ed
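As a rough user-space illustration of the descriptor checksum in the patch above, the sketch below chains a CRC-16 over the filesystem UUID, the little-endian group number, and the descriptor bytes that precede bg_checksum. The names crc16_update, group_desc_csum, DESC_SIZE and CSUM_OFFSET are made up for this example, the descriptor is treated as a raw 32-byte array, and the bitwise CRC (polynomial 0x8005, bit-reflected) is assumed to correspond to the kernel's crc16(). The extra bytes that the patch also covers when the 64BIT feature is enabled are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes assumed from the struct layout shown in the patch (non-64BIT case). */
#define DESC_SIZE   32  /* bytes in the descriptor up to and including bg_checksum */
#define CSUM_OFFSET 30  /* bg_checksum is the last 16-bit field of those 32 bytes */

/* Bitwise CRC-16, polynomial 0x8005 in reflected form (0xA001), LSB first.
 * Hypothetical helper; assumed to compute the same function as crc16(). */
uint16_t crc16_update(uint16_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001)
                            : (uint16_t)(crc >> 1);
    }
    return crc;
}

/* Checksum over uuid + group number + descriptor, skipping bg_checksum itself.
 * Mirrors the order of the crc16() calls in the patch. */
uint16_t group_desc_csum(const uint8_t uuid[16], uint32_t group,
                         const uint8_t desc[DESC_SIZE])
{
    assert(uuid != NULL && desc != NULL);

    /* Serialize the group number as little-endian, like cpu_to_le32(). */
    uint8_t le_group[4] = {
        (uint8_t)group, (uint8_t)(group >> 8),
        (uint8_t)(group >> 16), (uint8_t)(group >> 24)
    };

    uint16_t crc = crc16_update(0xFFFF, uuid, 16);     /* seed is ~0 */
    crc = crc16_update(crc, le_group, sizeof(le_group));
    return crc16_update(crc, desc, CSUM_OFFSET);
}
```

Because a CRC of this kind detects any single-bit change, flipping a flag bit such as EXT4_BG_INODE_UNINIT in a stored descriptor should yield a mismatching checksum, which is the property the patch relies on to avoid trusting corrupted flags.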

ReiserFS and ext3
I switched from ReiserFS4 to ext3 on my latest machine, because it looks like Hans Reiser is going to the big house for a long time and the development of ReiserFS will be massively disrupted. I must say I am unhappy with ext3. ext3 forces an fsck every 30 mounts, which adds minutes to a 20-second boot. What the heck is that? Reiser4 never resorted to that? Will ext4 change this?
Reiser4 never resorted to that?
That's because they believe in silently corrupting your data. ;)
I've never seen a FS which should not be periodically checked for errors. Remember, many categories of bugs or operating conditions can cause FS corruption. How long do you want to run before your FS is totally F'd? Would you rather know about a potential problem shortly after it occurred or after a huge chunk of your critical data is completely fubar'd?
Asking for a couple of minutes once every couple of months to years is laughable. Frankly, you should be checking more frequently than that. And the fact that once every 30 mounts is an issue suggests you don't have critical data to begin with. So disable it if you like.
You can always tweak the forced check using the readily available and provided tools. It's easy to do. I always lower the forced-check counts to saner values so the check can actually catch errors, rather than running long after a serious problem has occurred. And guess what, I do catch errors a couple of times per year, and I'm always glad I'm smart enough to know that no FS is completely immune to errors. Anyone telling you otherwise is either an idiot or a moron. That's like declaring software 100% bug free, which is impossible.
How is its poorest liability for high corrupted performance?
* Wannabe!!!
* How is its poorest liability for high corrupted performance? Its theory is good but I don't understand its liability in practice :(
* Is it true that "e2fsck time is constant"? I don't believe any time is constant; the time is asymptotically proportional to some parameters.
* "there is a high water mark of used inodes for each block group". Sure??? It's an assumption, but it's not a true premise.
* Does it work for the "worst-case" (for highly corrupted filesystems using loop FUSE with random fucker)? I don't believe it works for the worst case.
* What happens if a wrong reference fills with zeros the pages where the used inodes are? Hahahaha.
* "based solely on the number of used inodes rather than the total inode count" Does it repair hashed unused/corrupt inodes?
* Can it detect cycles of inodes?
* Can it detect orphaned inodes?
* Can it detect orphaned cycles of inodes?
One word to say: "/lost+found"!!!
Is /lost+found expected to reliably recover the lost files?
We need unit-test-style test suites, run on distributed Tinderbox-like machines, to detect bugs in the fsck tool.
Do you remember FAT's own "C:>scandisk C:", and not being allowed to move the mouse!?
Yes. e2fsck is in fact the only fsck for a Linux file system that has a regression test suite, and it has had one for an eternity.
> * Is it true that "e2fsck time is constant"? I don't believe any time is constant; the
> time is asymptotically proportional to some parameters.
When they say "constant" they usually mean "there exists an upper bound that can be shown to hold true". In this case they meant that the filesystem size does not affect the fsck time, if it holds the same amount of data (inodes).
> * "there is a high water mark of used inodes for each block group". Sure??? It's
> an assumption, but it's not a true premise.
Sure it is true. The filesystem maintains the knowledge of highest used inode within each group, so it just knows what it is.
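To make the high-water-mark idea concrete, here is a minimal sketch of how a checker could decide how many inodes to scan in each group, based only on what the patch description says about bg_flags and bg_itable_unused. It is not e2fsck code: struct group_summary, inodes_to_scan and total_inodes_to_scan are hypothetical names, the fields are assumed to be already converted to host byte order, and falling back to a full scan on a bad checksum is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BG_INODE_UNINIT 0x0001 /* same value as EXT4_BG_INODE_UNINIT in the patch */

/* Hypothetical host-endian copy of the two descriptor fields a checker needs. */
struct group_summary {
    uint16_t flags;          /* bg_flags */
    uint16_t itable_unused;  /* bg_itable_unused: unused inodes at the table's end */
};

/* Number of inode table entries worth scanning in one group. */
uint32_t inodes_to_scan(const struct group_summary *g,
                        uint32_t inodes_per_group, int checksum_ok)
{
    assert(g != NULL);

    if (!checksum_ok)
        return inodes_per_group;   /* descriptor not trusted: scan everything */
    if (g->flags & BG_INODE_UNINIT)
        return 0;                  /* inode table never used: skip the group */
    if (g->itable_unused > inodes_per_group)
        return inodes_per_group;   /* implausible count: be conservative */
    return inodes_per_group - g->itable_unused;
}

/* Total work across all groups, assuming every checksum verified. */
uint64_t total_inodes_to_scan(const struct group_summary *groups,
                              uint32_t ngroups, uint32_t inodes_per_group)
{
    uint64_t total = 0;

    for (uint32_t i = 0; i < ngroups; i++)
        total += inodes_to_scan(&groups[i], inodes_per_group, 1);
    return total;
}
```

With this shape, the total depends on how many inodes have been used in each group rather than on how many groups the filesystem has, which is the sense in which the scan cost stops growing with filesystem size.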
Long fsck + fsck-every-n-boots considered annoying
It would be wonderful if someone came up with a compromise. fsck'ing a large filesystem means that a reboot will take between 30 seconds and 30 minutes. Moreover, it takes 30 seconds N times, and 30 minutes once. One does not generally know a priori which it will be.
Possible solutions:
1) Schedule a periodic reboot or drop-to-single-user-mode that includes a forced fsck.
2) When # of reboots approaches maximum mount count, warn the user (similar to software update).
3) Build a real incremental or on-line fsck.
-- plenty of academic work on this, but AFAICT no production-quality patches for production 2.6 filesystems.
fsck & lvm snapshots
i have fsck disabled (tune2fs -i 0 -c 0) on all ext3 filesystems on lvm.
why? snapshots. i have a cron script that snapshots, mounts, and fscks each filesystem.
on my server this takes a little while, but it happens in the early morning so i don't notice.
i do the same on my workstation, but the script is run by anacron (as it is apm suspended most of the time). yeah, there's a little slow down while the fsck runs against the snapshot, but no more than the daily backup, aide, integrit, updatedb, etc cron jobs that also run after apm wake-up while i'm reading email and rss feeds.
on the wife's desktop i don't care as the fsck is fast enough (small drive) and her data is safely stored in a network share on the server (where it is RAIDed, backed up daily, and fsck'd weekly).
Check at startup is poor
>>Reiser4 never resorted to that?
>That's because they believe in silently corrupting your data. ;)
>I've never seen a FS which should not be periodically checked for errors.
But this should be done online, not at startup.
But this should be done online, not at startup.
Normally not realistic. Sure, it can be done, but the code gets a lot more complex and normally has a negative performance impact associated with such complexity.
So "should" we? No. "Would be nice"? Absolutely. Realistic? Depends on the performance impact. For many categories of users, this would not be satisfactory in the least. Not to mention, the added complexity is more likely to BE the cause of the FS error which you've now coded up to catch at runtime. Tail. Chase. ;)
Not really sure how it can become slower than "Wait 1 hour for this to complete with your not being able to do anything"
"Slower", as in, the total
"Slower", as in, the total throughput the FS would otherwise be capable of is reduced. For many, higher throughput is much more important than waiting a couple of minutes during their next and infrequent boot cycle. And with the optimizations made, this wait will likely be dramatically reduced.
As an example, if you typically require 5 minutes to check a FS and you have actively touched 50% of your inodes, the check will only take 2.5 minutes now. For many other file systems which are largely read-only (e.g. /usr), no inode scan will be required at all.
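The arithmetic in that example can be written down as a tiny model. This is only a sketch of the commenter's assumption that check time scales linearly with the fraction of inodes in use; estimated_check_minutes is a hypothetical name, and real e2fsck runs include other passes that this ignores.

```c
#include <assert.h>

/* Linear model: time for a full inode scan, scaled by the fraction of
 * inodes that actually have to be visited (0.0 to 1.0). */
double estimated_check_minutes(double full_scan_minutes, double used_fraction)
{
    assert(used_fraction >= 0.0 && used_fraction <= 1.0);
    return full_scan_minutes * used_fraction;
}
```

For the numbers above, estimated_check_minutes(5.0, 0.5) gives 2.5, and a group set that has never had inodes allocated contributes a fraction of 0.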
I still maintain that there are good cases for "servers" and "desktops" for both approaches (online, offline FSCK).
It strikes me that checking
It strikes me that checking every 30 mounts isn't what you actually want -- you just want to check regularly. For example, I boot my desktop machine daily, so my filesystem gets checked about once every 30 days. I might reboot a server once a month, so its filesystem gets automatically checked once every three years.
Instead filesystem checks should probably be scheduled as either every 30 mounts or every 30 days, whichever comes first (both values tunable, of course). This ensures that the filesystem actually gets checked regularly.
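The "whichever comes first" rule can be sketched in a few lines. This is only a model of the policy proposed above, not of what e2fsck actually does; the function name and parameters are made up for illustration:

```python
def days_between_checks(days_per_mount, max_mounts=30, max_days=None):
    """Days until a forced check under a mounts-or-days policy.

    days_per_mount: average number of days between two mounts (boots).
    max_days=None models a policy that only counts mounts.
    """
    by_mounts = max_mounts * days_per_mount
    if max_days is None:
        return by_mounts
    # Whichever limit is reached first triggers the check.
    return min(by_mounts, max_days)
```

For a desktop booted daily, `days_between_checks(1)` gives 30 days. For a server rebooted every 30 days, the mount-only policy stretches to 900 days, while adding `max_days=30` brings it back to 30.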
Actually ideal would be a situation like Microsoft's NTFS CHKDSK: you can check the filesystem online, but to fix anything requires unmounting the filesystem. I'm not sure what the difficulties are with online checking of ext[234], so I'm not sure if this would actually be any easier than full online checking, but it's a start. And it would allow regular checking via cron jobs of any filesystem containing Important Information, or with a high probability of filesystem errors (though if that's the case, you'd better be backing up...).
tune2fs -i interval-between-checks[d|m|w]
see "man tune2fs". search for "-i" and "-c".
you want "-c 30" and "-i 30d" (though "-i 30" is equivalent but less explicit). that checks every 30 boots AND 30 days. (at least i presume it's a boolean "and", not an "or".)
if it's a boolean "or" and you want "and", then you can create a boot script that checks the time-last-checked value (assuming dumpe2fs displays such) and if it's within X minutes of the current time, then "tune2fs -C 0" (reset the number of mounts).
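A sketch of that boot-script idea, written as a pure function so it can be tried without touching a real filesystem. It assumes the header text comes from something like `dumpe2fs -h`, that it contains a "Last checked" line, and that the date is in the ctime-like format shown in the test data; all three are assumptions to verify on your own system. The function only builds the `tune2fs -C 0` command line, it does not run it:

```python
from datetime import datetime, timedelta


def parse_header(text):
    """Turn 'Key:   value' lines into a dict, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def reset_command(header_text, device, now, window_minutes=10):
    """Return the tune2fs argv to run if a check just completed, else None."""
    fields = parse_header(header_text)
    # Assumed label and date format; adjust to what your tools print.
    last = datetime.strptime(fields["Last checked"], "%a %b %d %H:%M:%S %Y")
    if timedelta(0) <= now - last <= timedelta(minutes=window_minutes):
        return ["tune2fs", "-C", "0", device]
    return None
```

The returned list could be handed to `subprocess.run` by a boot script; keeping the decision separate from the execution makes the window logic easy to test.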
> For example, I boot my
> For example, I boot my desktop machine daily, so my filesystem gets checked about once every 30 days. I might reboot a server once a month, so its filesystem gets automatically checked once every three years.
Yikes, that's an awful lot of damn reboots. I can tell you come from a microsoft background. I'm probably a little more representative of the typical unix user. My servers get booted every 12-18 months or so (kernel upgrades) and my desktops get booted every few weeks to every few months, if I'm messing around with development kernels.
So, at every 30 mounts, it would take me 30-50 years for a forced fsck on the servers.
Or maybe he's someone that's
Or maybe he's someone that's energy conscious, and has HW or kernel not quite up to standby and/or hibernate?
We're not all pigs with energy, you know...
I normally change both the
I normally change both the check count and maximum interval. This is an effective strategy. To boot, I make sure the count/intervals reflect the significance of data stored on the partition. If the data is generally useless or easy to replace, push it out (e.g. /tmp). If the data is important, make it sooner. Generally, I have one or two FS's checked every couple of boots to no more than 30 days out or so. This means rather than checking every single FS at one time, my machines spend an extra couple of seconds to minutes per boot cycle to protect my data. The checking is staged over several boots and I never pay the FS cost all at one time, unless something really bad happened and then I may manually force it. Everyone is happy. For 98%+ of users, the scheme described here is more than enough.
Silently corrupting data?
"That's because they believe in silently corrupting your data. ;)"
Help me understand. I ran Reiser for 3.5 years without incident. I don't find a several minute check acceptable at all for a desktop system, ever.
Help me understand. I ran
Help me understand. I ran Reiser for 3.5 years without incident. I don't find a several minute check acceptable at all for a desktop system, ever.
My comment was tongue in cheek. You are obviously seriously humor impaired. The smiley face was the ultimate clue. Sheesh. Dense.
But, frankly, I couldn't care less if you lose your data or not. It is your data and not mine. I have lost data via Reiser before but it was just a cache so I couldn't care less. But, it doesn't change the fact you are actively begging to lose data. Obviously your data isn't important and/or is easily replaced. You don't care about your data so why should anyone else. Period. Nuff said.
anti-reiser hysteria?
Wow, the anti-reiser hysteria is rolling into high gear now. The anti-reiser troll is showing his fangs eh? ;)
Seriously, I've run nothing except reiserfs on production servers (which are all SLES and opensuse) since 2004 and have never ever lost a bit, not so much as a hiccup, ever. Some of these boxes are pounded hard 24/7/365 (db servers, mail servers for 13,000 users) and most have been up for many hundreds of days.
Think about it: Would suse make reiser the default fs for the enterprise server if it wasn't rock solid? I don't think so - those stuffy swiss banks do so hate data corruption, you know.
Oh, but wait. some anonymous poster on some website said reiserfs is bad, so I guess we have been wrong all this time ;)
SUSE abandoned reiserfs
Would suse abandon making reiser the default fs for the enterprise server if it was rock solid? Because that happened a year ago: http://lwn.net/Articles/202780/
--
:wq
Novell did not abandon
Novell did not abandon ReiserFS because of its stability record. They abandoned it once Hans Reiser was detained because it lost its primary maintainers.
They don't want to support it forever due to the low number of people working on it. Now that Namesys is defunct, Red Hat developer Jeff Layton and SUSE developer Jeff Mahoney are maintaining it pretty much alone and don't appear to spend much time on it. When one of them bails out, the code is likely to start bitrotting sooner or later.
LOL, reiserfs not abandoned
> zdzichu
> Would suse abandon making reiser the default fs for the enterprise
> server if it was rock solid? Because that happened a year ago:
> http://lwn.net/Articles/202780/
You seem to misunderstand the announcement. Novell/SuSE have not abandoned reiserfs, it is still fully supported as always. In fact, reiserfs is still the default filesystem in Novell enterprise linux (SLES10SP1) which was released fairly recently. Finally reiserfs is still available and fully supported in opensuse. What's changed is that ext3 is the default, but I just installed suse 10.3 on reiserfs and I'm happy to report that it works beautifully as always.
To be honest, I heard there was trouble with the reiserfs code in some old (early 2.4, especially redhat) kernels, but that is ancient history.
To be clear, I am not
To be clear, I am not pounding on ReiserFS. Having said that, I have only ever lost data on ReiserFS, FAT-based file systems, and NTFS.
The NTFS data loss was the worst of them all. The server crashed. On boot it forced a FS check. In order to restore integrity, it decided a couple thousand directories (and all files within) needed to be removed because of corruption. When the check completed, file system integrity had been restored but 90% of all my critical data had been deleted. You just gotta love MS. Thankfully we had backups and ultimately only lost 24 hours' worth of data.
Turns out the cause of the corruption was a loose SCSI cable.
Maybe I should switch back
I am the original troll. I ran Reiser for 3 years and never had to sit through an fsck. I never lost data, never had problems of any kind. I was a happy customer. Why do the extN filesystems resort to a crude check that Reiser skips? This morning an ext3 check took 10 minutes. That is a joke. I have no reason to dislike ReiserFS other than that they have a *severe* marketing problem. Perhaps it should be renamed.
tune2fs -c 0 -i 0 , and
tune2fs -c 0 -i 0, and forget about this extra fsck insurance, no problem. Though I usually set the check interval to 4-6 months.
Reiser4
Please include Reiser4 in Linux
i don't like this political game, and thought linux was much cooler than that
Sure
>Think about it: Would suse make reiser the default fs for the enterprise server if it wasn't rock solid? I don't
>think so - those stuffy swiss banks do so hate data corruption, you know.
Obviously they would because they did. Just a quick glance over this past year I see a discussion about locking issues, 2 null pointer dereferences, xattr ref counting problems and filesystems over 8TiB getting block bitmaps created with wrapped values. And on top of those there's the issue that reiser3 is the last, IIRC, filesystem to use the BKL making it scale poorly when using more than one reiser3 filesystem.
Yes, there will be bugs in every filesystem so this is no surprise but going so far as to call reiser3 "rock solid" is a stretch.
Completely disagree, not modern!
I disagree. Yes, file systems can be corrupted. But to guard against that you have to check *data* integrity. Checking just file system integrity is pretty useless: the odds that a controller or driver bug corrupts your data are much larger than the odds that it corrupts your metadata.
Adding fscks to boot just does not scale. The checks have to be better and they have to be able to run in the background. Repairing file systems should still be done in single user mode, but recurring integrity checks should be automatic and not disturb normal operations.
But to guard against that
But to guard against that you have to check *data* integrity.
You cannot have data integrity if the FS is in question. FS corruption often immediately precedes data corruption. FS corruption left unchecked often leads to dramatic data loss.
You can have data corruption without FS corruption. FS corruption is a common cause of data corruption.
And you have never worked as a
And you have never worked as a consultant out in the field with a laptop. This is something that hits me 1 - 2 times a week. And it's annoying! (Yes, I've fixed it by disabling the check...) Computers are way more than just servers...
tune2fs
Have a look at the man page for tune2fs. It lets you change the interval between checks.
switch it off
this is a pretty conservative measure to prevent a corrupted file system. use "tune2fs -c0" to switch this off.
but be aware (taken from the manpage):
You should strongly consider the consequences of disabling mount-count-dependent checking entirely. Bad disk drives, cables, memory, and kernel bugs could all corrupt a filesystem without marking the filesystem dirty or in error. If you are using journaling on your filesystem, your filesystem will never be marked dirty, so it will not normally be checked. A filesystem error detected by the kernel will still force an fsck on the next reboot, but it may already be too late to prevent data loss at that point.
How do I e2fsck if I never reboot my machine?
How do I run e2fsck if I never want to reboot my only machine?
I have a cheap 450 VA UPS (uninterruptible power supply), and I never, ever reboot my junk PC.
I think it's quite likely that some part of my filesystem has become corrupted after many days of downloading (pssshhh, porn and free/open software), but I don't know how to e2fsck it without rebooting.
Any idea?
It's pretty easy to change
It's pretty easy to change that on your own. Read the man page for tune2fs (in the e2fsprogs package if you don't have it). There's an option to change how often it forces an fsck. It's not uncommon for it to be set to never force a check with ext3 on non-critical systems (i.e. desktops), but you should manually do it from time to time anyway (there's another option to tune2fs to force a check only on next mount) to catch improperly unlinked inodes and the like.
it depends
Some distributions (Fedora, CentOS, Red Hat) disable the 30x fsck by default, and for good reason. Ext3 is and has been for a long time very reliable, there's no need for that.
This is my main complaint regarding Ubuntu - that they don't disable the 30x fsck. There's a bug report filed with the developers, hopefully they'll do something before the next version is released.
Silent data corruption
Even if ext3 operates flawlessly, underlying hardware (disk, bus, controller, cable, chipset, cpu) can cause failures. Kernel Trap covered this recently:
http://kerneltrap.org/Linux/Data_Errors_During_Drive_Communication
underlying hardware (disk,
underlying hardware (disk, bus, controller, cable, chipset, cpu)
And memory. ;)
Some distributions (Fedora,
Some distributions (Fedora, CentOS, Red Hat) disable the 30x fsck by default, and for good reason. Ext3 is and has been for a long time very reliable, there's no need for that.
This is my main complaint regarding Ubuntu - that they don't disable the 30x fsck. There's a bug report filed with the developers, hopefully they'll do something before the next version is released.
I'm the number two post in the thread. I'm running RH. Mount checking is absolutely not disabled. At least not in the enterprise series of releases. Any distro you find which is disabling it by default should be stayed away from as it proves it is being packaged by morons. If it is disabled, re-enable it if your data is important to you.
I run ext3 on almost all my FS and as I said, I normally find FS corruption a couple times per year. Again, whoever is giving you your information is an idiot or a moron. Don't listen to them. Someone in the thread down below even provided a nice link which details the various causes which can create FS corruption.
If you, like the Reiser zealot above, want to ignorantly believe your FS is magically immune, then by all means, continue rolling the dice.
I love...
I just love how a piece of news about EXT4 diverges to a long discussion about EXT3 vs ReiserFS... and all that because somebody does not know how to tune his OS!
It should be called the
It should be called the RicerFS if you ask me. They are arguing over the rear spoiler while completely missing the point of either file system.
Amazingly, nobody has yet
Amazingly, nobody has yet complained about the initial consistency check imposed for every reiserfs mount...
I had some 400 GB reiser3 volume spread on RAID5 once, and seem to recall it took 10-20 seconds at every bootup to mount it. Compared to that time loss of mounting the volume, I'd be happy to wait a few minutes every month.
I eventually lost patience and formatted with another filesystem. Probably XFS. It's been a while.
I have 2 questions about
I have 2 questions about file system check during boot/init:
1. Can it be interrupted and postponed? I hate it when I have to start my OS quickly only to realize that it wants to check my file systems, which takes several minutes.
2. If not, what is the reason?
Why CRC-16 and not CRC-32?
I posted this here before, but it never appeared.
Why CRC-16 and not CRC-32?
I don't remember everything I wrote the first time, and I don't know how this site decides when a held post finally gets published.
I always use CRC-32 like TCP/IP, .zip, .gz, .bz2, ...
Birthday paradox ...
p_crc16(x) = 1.5258e-05 vs p_crc32(x) = 2.3283e-10 ...
... I don't remember the rest ...
Three (and a half) reasons I can think of:
The data tables for CRC-16 are smaller, for one thing, meaning the algorithm is much faster when contending for cache space.
Second, the likelihood of a collision is still small with only 16 bits of checksum. You have a 1/65536 chance that a random block overwriting a metadata block will have a valid checksum, unless some joker is filling their files with 512 byte checksummed blocks, AND one of those blocks gets misdirected over this one. The types of errors being guarded against -- random bit flips mostly -- are more than adequately guarded against with a CRC-16 on a block this size. CRC-16 should be able to catch up to 4 bit flips always on a 4096-bit block, if I remember my math correctly.
Then there's the space issue. The new data structure has to fit in the same space as the old data structure. There are only so many reserved bytes that can get pressed into service for this checksum without it being seen as extravagant. Since the checksum currently only guards the "check / no-check" decision for fsck, and that decision presumably can be overridden anyway, there isn't nearly as much riding on this as there could be.
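To make the bit-flip claim concrete, here is a small bitwise CRC-16 and a brute-force check that no single-bit flip goes unnoticed. It assumes the common CRC-16 polynomial 0x8005 in its reflected form 0xA001 with a zero initial value; whether that matches the polynomial and seed used by the patch is an assumption, and the helper names are made up for illustration:

```python
def crc16(data, crc=0):
    """Bitwise CRC-16 using the reflected polynomial 0xA001."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def undetected_single_bit_flips(block):
    """Count single-bit flips in `block` that leave the CRC unchanged."""
    good = crc16(block)
    missed = 0
    for i in range(len(block) * 8):
        damaged = bytearray(block)
        damaged[i // 8] ^= 1 << (i % 8)  # flip exactly one bit
        if crc16(damaged) == good:
            missed += 1
    return missed
```

Running `undetected_single_bit_flips` over a descriptor-sized buffer reports zero misses, which is the cheap case the checksum is mainly there for. Multi-bit guarantees depend on the polynomial and block length and are not tested here.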
--
Program Intellivision and play Space Patrol!
Bad thing, bad thing and bad thing.
1) "The data tables for CRC-16 are smaller, for one thing, meaning the algorithm is much faster when contending for cache space."
That's not always true on a 32-bit architecture.
The data table for CRC-16 needs only 256 entries of 2 bytes, but on a 32-bit architecture it may be stored as 256 entries of 4 bytes (= 1024 bytes).
The data table for CRC-32 is 256 entries of 4 bytes (= 1024 bytes).
So a 1024-byte table, for CRC-16 or CRC-32, is not a problem even for the tiniest 4 KiB L1 cache.
2) With a million machines using CRC-16, the expected number of failed machines is 1'000'000 x 1/65536 = 15.2587, so probably 15 failed machines per million machines on the Internet.
So CRC-32 is better than CRC-16: you have a 1/4294967296 chance instead of a 1/65536 chance of taking the bullet in this game of Russian roulette.
3) "Then there's the space issue".
That's a silly concern that doesn't matter on machines with 2 GiB of RAM.
"Doubling the size from CRC-16 to CRC-32 doesn't imply doubling the size of the data structures containing this CRC field!!!"
Good bye :)
Uhm, go read the code
1) Those tables look like 2-bytes per entry to me. An array of u16 takes 2 bytes per entry on most architectures. Try it and surprise yourself.
2) Your math is suspect. You first have to have a corrupt block. Of the corrupt block, the likelihood it won't be caught by this CRC is small. For the types of errors that tend to occur (small number of bit flips and replace-with-garbage), the odds of missing the error are exceedingly small. 1/65536 is for purely random blocks. A purely random block will be noticed by other inconsistencies. The types of errors being protected against are NOT purely random, so the odds are somewhat better than 1/65536. The checksum is only protecting a few bytes, for goodness sakes, not the whole disk, and it only controls the decision whether to skip a particular group of inodes during fsck--it's not integral to the general functioning of the partition. For a correct checksum to also cause a failure, the block it protects would also have to say "skip me." The odds of that for a random block are more like 1 in 2^32, because the count itself is 16 bits.
3) Also, the space issue I'm referring to is the constrained space in the on-disk structure. The structure is a fixed size, with only so many reserved bytes. Allocating uses to them needs to be done judiciously. Wasting bytes for trivial features makes future expansion difficult. If the structure gets too full, you'll have to include a pointer to a completely new structure allocated elsewhere that holds additional fields, which complicates code. Remember: Backward compatible on-disk structures are fixed in size.
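The arithmetic behind point 2 can be sketched as follows. The function name and the `p_corrupt` parameter (the fraction of machines that ever get a corrupt descriptor) are hypothetical, and the model treats the damage as purely random, which is the pessimistic case described above:

```python
def expected_undetected(machines, p_corrupt, checksum_bits=16):
    """Expected machines whose corrupt descriptor still passes its checksum.

    A random block passes an n-bit checksum with probability 1 / 2**n,
    so the miss only matters for machines that are corrupt to begin with.
    """
    return machines * p_corrupt / 2 ** checksum_bits
```

With `p_corrupt=1.0` this reproduces the 15-per-million figure from the parent post; any more realistic corruption rate scales that number down proportionally.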
Take a look at the structure:
    struct ext4_group_desc
    {
        __le32 bg_block_bitmap;       /* Blocks bitmap block */
        __le32 bg_inode_bitmap;       /* Inodes bitmap block */
        __le32 bg_inode_table;        /* Inodes table block */
        __le16 bg_free_blocks_count;  /* Free blocks count */
        __le16 bg_free_inodes_count;  /* Free inodes count */
        __le16 bg_used_dirs_count;    /* Directories count */
        __u16  bg_flags;              /* EXT4_BG_flags (INODE_UNINIT, etc) */
        __u32  bg_reserved[2];        /* Likely block/inode bitmap checksum */
        __le16 bg_itable_unused;      /* Unused inodes count */
        __le16 bg_checksum;           /* crc16(sb_uuid+group+desc) */
        __le32 bg_block_bitmap_hi;    /* Blocks bitmap block MSB */
        __le32 bg_inode_bitmap_hi;    /* Inodes bitmap block MSB */
        __le32 bg_inode_table_hi;     /* Inodes table block MSB */
    };

Notice there's only 8 bytes of reserved space left after this change. Previously, there were 12. Using a 32-bit checksum would have dropped the reserved space to 6 bytes. Prior to this feature, this structure had "bg_reserved[3]" and lacked the new fields "bg_itable_unused" and "bg_checksum."
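The byte counts can be double-checked by laying the fields out with Python's `struct` module. This sketch assumes a packed little-endian layout with no padding (the quoted fields are naturally aligned, so a C compiler should agree, but that is an assumption); the `FIELDS` table is transcribed from the structure quoted in this post and `layout` is a made-up helper:

```python
import struct

# (name, struct format code) pairs transcribed from the quoted structure.
FIELDS = [
    ("bg_block_bitmap", "I"),
    ("bg_inode_bitmap", "I"),
    ("bg_inode_table", "I"),
    ("bg_free_blocks_count", "H"),
    ("bg_free_inodes_count", "H"),
    ("bg_used_dirs_count", "H"),
    ("bg_flags", "H"),
    ("bg_reserved", "2I"),
    ("bg_itable_unused", "H"),
    ("bg_checksum", "H"),
    ("bg_block_bitmap_hi", "I"),
    ("bg_inode_bitmap_hi", "I"),
    ("bg_inode_table_hi", "I"),
]


def layout(fields):
    """Return ({name: (offset, size)}, total_size) for a packed layout."""
    offsets = {}
    pos = 0
    for name, code in fields:
        size = struct.calcsize("<" + code)  # "<" means no padding
        offsets[name] = (pos, size)
        pos += size
    return offsets, pos
```

Under these assumptions the reserved area is 8 bytes, and the two new 16-bit fields end exactly at byte 32, just before the `_hi` fields.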
--
Program Intellivision and play Space Patrol!
Hasn't anyone here heard of chunkfs?
Yes, the forced occasional fsck is necessary and annoying.
A proposed solution is to run smaller checks more often,
so there isn't ever a huge delay. See
http://infohost.nmt.edu/~val/chunkfs/
The ext4 developers are aware of chunkfs, and are at least
asking themselves whether they could make ext4 do something similar; see
http://lists.openwall.net/linux-ext4/2007/07/02/4