Provide an alternate definition for the page_cache_xxx(mapping, ...) functions that can determine the current page size from the mapping and generate the appropriate shifts, sizes and mask for the page cache operations. Change the basic functions that allocate pages for the page cache to be able to handle higher order allocations. Provide a new function mapping_setup(stdruct address_space *, gfp_t mask, int order) that allows the setup of a mapping of any compound page order. mapping_set_gfp_mask() is still provided but it sets mappings to order 0. Calls to mapping_set_gfp_mask() must be converted to mapping_setup() in order for the filesystem to be able to use larger pages. For some key block devices and filesystems the conversion is done here. mapping_setup() for higher order is only allowed if the mapping does not use DMA mappings or HIGHMEM since we do not support bouncing at the moment. Thus BUG() on DMA mappings and clear the highmem bit of higher order mappings. Modify the set_blocksize() function so that an arbitrary blocksize can be set. Blocksizes up to MAX_ORDER - 1 can be set. This is typically 8MB on many platforms (order 11). Typically file systems are not only limited by the core VM but also by the structure of their internal data structures. The core VM limitations fall away with this patch. The functionality provided here can do nothing about the internal limitations of filesystems. Known internal limitations: Ext2 64k XFS 64k Reiserfs 8k Ext3 4k (rumor has it that changing a constant can remove the limit) Ext4 4k Signed-off-by: Christoph Lameter <clameter@sgi.com> --- block/Kconfig | 17 ++++++ drivers/block/rd.c | 6 ++- fs/block_dev.c | 29 +++++++--- fs/buffer.c | 4 +- fs/inode.c | 7 ++- fs/xfs/linux-2.6/xfs_buf.c | 3 +- include/linux/buffer_head.h | 12 ++++- include/linux/fs.h | 5 ...
There are patches original worked by Takashi Sato to support large block size (up to 64k) in ext2/3/4, which addressed the directory issue as well. I just forward ported. Will posted it in a separate thread. Haven't get a chance to integrated with your patch yet (next step). thanks, -
The next 4 patches support large block size (up to PAGESIZE, max 64KB) for ext2/3/4, originally from Takashi Sato. http://marc.info/?l=linux-ext4&m=115768873518400&w=2 It's quite simple to support large block size in ext2/3/4, mostly just enlarge the block size limit. But it is NOT possible to have 64kB blocksize on ext2/3/4 without some changes to the directory handling code. The reason is that an empty 64kB directory block would have a rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in the filesystem. The proposed solution is to put 2 empty records in such a directory, or to special-case an impossible value like rec_len = 0xffff to handle this. The Patch-set consists of the following 4 patches. [1/4] ext2/3/4: enlarge blocksize - Allow blocksize up to pagesize [2/4] ext2: fix rec_len overflow - prevent rec_len from overflow with 64KB blocksize [3/4] ext3: fix rec_len overflow - prevent rec_len from overflow with 64KB blocksize [4/4] ext4: fix rec_len overflow - prevent rec_len from overflow with 64KB blocksize Just rebase to 2.6.23-rc4 and against the ext4 patch queue. Compile tested only. Next steps: Need a e2fsprogs changes to able test this feature. As mkfs needs to be educated not assuming rec_len to be blocksize all the time. Will try it with Christoph Lameter's large block patch next. Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp> Signed-off-by: Mingming Cao <cmm@us.ibm.com> --- --- fs/ext2/super.c | 2 +- fs/ext3/super.c | 5 ++++- fs/ext4/super.c | 5 +++++ include/linux/ext2_fs.h | 4 ++-- include/linux/ext3_fs.h | 4 ++-- include/linux/ext4_fs.h | 4 ++-- 6 files changed, 16 insertions(+), 8 deletions(-) Index: linux-2.6.23-rc3/fs/ext2/super.c =================================================================== --- linux-2.6.23-rc3.orig/fs/ext2/super.c 2007-08-12 21:25:24.000000000 -0700 +++ ...
Two problems were found when testing largeblock on ext3. Patches to follow. Good news is, with your changes, plus all these extN changes, I am able to run ext2/3/4 with 64k block size, tested on x86 and ppc64 with 4k page size. fsx test runs fine for an hour on ext3 with 16k blocksize on x86:-) Mingming -
>From clameter:
Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c 2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd/journal.c 2007-08-31 11:01:18.000000000 -0700
@@ -1627,16 +1627,17 @@ void * __jbd_kmalloc (const char *where,
* jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
* and allocate frozen and commit buffers from these slabs.
*
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
*/
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size) get_order((size) << (PAGE_SHIFT - 10))
static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
+ "jbd_1k", "jbd_2k", "jbd_4k", "jbd_8k",
+ "jbd_16k", "jbd_32k", "jbd_64k"
};
static void journal_destroy_jbd_slabs(void)
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c 2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd2/journal.c 2007-08-31 11:04:37.000000000 -0700
@@ -1639,16 +1639,18 @@ void * __jbd2_kmalloc (const char *where
* jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
* and allocate frozen and commit buffers from these slabs.
*
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the ...But the real fix is to kill this code. We can't send down slab pages down the block layer without breaking iscsi or aoe. And this code is only used in so rare cases that all the normal testing won't hit it. Very bad combination. -
We are doing what you describe right now. So the current code is broken? -
How about getting rid of the slabs there and use kmalloc? Kmalloc in mm (and therfore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page allocator calls. Not sure what to do about the 1k and 2k requests though. -
The problem is that we must never use kmalloc pages, so we always need to request a page or more for these. Better to use get_free_page directly, that's how I fixed it in XFS a while ago. -
So you'd be fine with replacing the allocs with get_free_pages(GFP_xxx, get_order(size)) ? -
Yes. And rip out all that code related to setting up the slabs. I plan to add WARN_ONs to bio_add_page and friends to detect further usage of slab pages if there is any. -
jbd/jbd2: Replace slab allocations with page cache allocations
From: Christoph Lameter <clameter@sgi.com>
JBD should not pass slab pages down to the block layer.
Use page allocator pages instead. This will also prepare
JBD for the large blocksize patchset.
Tested on 2.6.23-rc6 with fsx runs fine.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/checkpoint.c | 2
fs/jbd/commit.c | 6 +-
fs/jbd/journal.c | 107 ++++---------------------------------------------
fs/jbd/transaction.c | 10 ++--
fs/jbd2/checkpoint.c | 2
fs/jbd2/commit.c | 6 +-
fs/jbd2/journal.c | 109 ++++----------------------------------------------
fs/jbd2/transaction.c | 18 ++++----
include/linux/jbd.h | 23 +++++++++-
include/linux/jbd2.h | 28 ++++++++++--
10 files changed, 83 insertions(+), 228 deletions(-)
Index: linux-2.6.23-rc5/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/journal.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/journal.c 2007-09-13 13:45:39.000000000 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
/*
* Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
char *tmp;
jbd_unlock_bh_state(bh_in);
- tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+ tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- jbd_slab_free(tmp, bh_in->b_size);
+ jbd_free(tmp, bh_in->b_size);
goto repeat;
}
@@ -679,7 +678,7 @@ static journal_t * journal_init_common (
/* Set up a default-sized revoke table for the new mount. */
err = journal_init_revoke(journal, ...Thanks Mingming. -
Currently memory allocation for committed_data(and frozen_buffer) for bufferhead is done through jbd slab management, as Christoph Hellwig pointed out that this is broken as jbd should not pass slab pages down to IO layer. and suggested to use get_free_pages() directly. The problem with this patch, as Andreas Dilger pointed today in ext4 interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste 1/3-1/2 page space. What was the originally intention to set up slabs for committed_data(and frozen_buffer) in JBD? Why not using kmalloc? -
kmalloc is using slabs :) The intent was to avoid the wasted memory, but as we've repeated a gazillion times wasted memory on a rather rare codepath doesn't really matter when you just crash random storage drivers otherwise. -
Looks good. Small suggestion is to get rid of all kmalloc() usages and consistently use jbd_kmalloc() or jbd2_kmalloc(). Thanks, Badari -
Here is the incremental small cleanup patch.
Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/journal.c | 8 +++++---
fs/jbd/revoke.c | 12 ++++++------
fs/jbd2/journal.c | 8 +++++---
fs/jbd2/revoke.c | 12 ++++++------
4 files changed, 22 insertions(+), 18 deletions(-)
Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-17 14:32:16.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-17 14:33:59.000000000 -0700
@@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal)
iput(journal->j_inode);
if (journal->j_revoke)
journal_destroy_revoke(journal);
- kfree(journal->j_wbuf);
+ jbd_kfree(journal->j_wbuf);
jbd_kfree(journal);
}
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-17 14:32:22.000000000 -0700
+++ ...Shouldn't we kill jbd_kmalloc instead? -
It seems useful to me to keep jbd_kmalloc/jbd_free. They are central places to handle memory (de)allocation(<page size) via kmalloc/kfree, so in the future if we need to change memory allocation in jbd(e.g. not using kmalloc or using different flag), we don't need to touch every place in the jbd code calling jbd_kmalloc. Regards, Mingming -
I disagree. Why would jbd need to globally change the way it allocates memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a variety of structures. Having to change one particular instance won't necessarily mean we want to change all of them. Adding unnecessary wrappers only obfuscates the code making it harder to understand. You wouldn't want every subsystem to have it's own *_kmalloc() that took different arguments. Besides, there aren't that many calls to kmalloc and kfree in the jbd code, so there wouldn't be much pain in changing GFP flags or whatever, if it ever needed to be done. Shaggy -- David Kleikamp IBM Linux Technology Center -
Okay, Points taken, Here is the updated patch to get rid of slab management and jbd_kmalloc from jbd totally. This patch is intend to replace the patch in mm tree, Andrew, could you pick up this one instead? Thanks, Mingming jbd/jbd2: JBD memory allocation cleanups From: Christoph Lameter <clameter@sgi.com> JBD: Replace slab allocations with page cache allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> --- fs/jbd/commit.c | 6 +-- fs/jbd/journal.c | 99 ++------------------------------------------------ fs/jbd/transaction.c | 12 +++--- fs/jbd2/commit.c | 6 +-- fs/jbd2/journal.c | 99 ++------------------------------------------------ fs/jbd2/transaction.c | 18 ++++----- include/linux/jbd.h | 18 +++++---- include/linux/jbd2.h | 21 +++++----- 8 files changed, 52 insertions(+), 227 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c =================================================================== --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-18 17:19:01.000000000 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-18 17:51:21.000000000 -0700 @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit); static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *); static void __journal_abort_soft (journal_t *journal, int errno); -static int journal_create_jbd_slab(size_t slab_size); /* * Helper function used to manage commit timeouts @@ -334,10 +333,10 @@ repeat: char *tmp; jbd_unlock_bh_state(bh_in); - tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS); + tmp = jbd_alloc(bh_in->b_size, GFP_NOFS); ...
__GFP_NOFAIL should only be used when we have no way of recovering from failure. The allocation in journal_init_common() (at least) _can_ recover and hence really shouldn't be using __GFP_NOFAIL. (Actually, nothing in the kernel should be using __GFP_NOFAIL. It is there as a marker which says "we really shouldn't be doing this but we don't know how to fix it"). So sometime it'd be good if you could review all the __GFP_NOFAILs in there and see if we can remove some, thanks. -
Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
cases except one handles memory allocation failure so I get rid of those
GFP_NOFAIL flags.
Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
in jbd/jbd2? I will send a separate patch to cleanup that.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/journal.c | 2 +-
fs/jbd/transaction.c | 3 +--
fs/jbd2/journal.c | 2 +-
fs/jbd2/transaction.c | 3 +--
4 files changed, 4 insertions(+), 6 deletions(-)
Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:47:58.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:48:40.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-19 11:48:05.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-19 11:49:10.000000000 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = kmalloc(sizeof(*new_transaction),
- GFP_NOFS|__GFP_NOFAIL);
+ new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:49:46.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * ...Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
with the rest of kmalloc flag used in the JBD/JBD2 layer.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/jbd/journal.c | 6 +++---
fs/jbd/revoke.c | 8 ++++----
fs/jbd2/journal.c | 6 +++---
fs/jbd2/revoke.c | 8 ++++----
4 files changed, 14 insertions(+), 14 deletions(-)
Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = kmalloc(sizeof(*journal), GFP_NOFS);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.000000000 -0700
+++ ...On Wed, 19 Sep 2007 12:22:09 -0700 These were all OK using GFP_KERNEL. GFP_NOFS should only be used when the caller is holding some fs locks which might cause a deadlock if that caller reentered the fs in ->writepage (and maybe put_inode and such). That isn't the case in any of the above code, which is all mount time stuff (I think). ext3/4 should be using GFP_NOFS when the caller has a transaction open, has a page locked, is holding i_mutex, etc. -
Thanks for your feedback. Mingming -
Is there a reason for this change except "it's in a filesystem, so it should be GFP_NOFS"? We are only doing journal setup during mount so there shouldn't be any problem using GFP_KERNEL. I don't think it will inject any defect into the code, but I don't think it is needed either. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
No. GFP_NOFS avoids deadlock. It prevents the allocation from making recursive calls back into the file system that could end up blocking on jbd code. Shaggy -- David Kleikamp IBM Linux Technology Center -
Oh, I see your patch now. You mean use GFP_NOFS instead of -- David Kleikamp IBM Linux Technology Center -
-
This should probably be a __GFP_NOFAIL if we are trying to start a new handle in truncate, as there is no way to propagate an error to the caller. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
Thanks, updated version. Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2, most cases they are not needed. Signed-off-by: Mingming Cao <cmm@us.ibm.com> --- fs/jbd/journal.c | 2 +- fs/jbd2/journal.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.23-rc6/fs/jbd/journal.c =================================================================== --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:47:58.000000000 -0700 +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 14:23:45.000000000 -0700 @@ -653,7 +653,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); + journal = kmalloc(sizeof(*journal), GFP_KERNEL); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); Index: linux-2.6.23-rc6/fs/jbd2/journal.c =================================================================== --- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.000000000 -0700 +++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 14:23:45.000000000 -0700 @@ -654,7 +654,7 @@ static journal_t * journal_init_common ( journal_t *journal; int err; - journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL); + journal = kmalloc(sizeof(*journal), GFP_KERNEL); if (!journal) goto fail; memset(journal, 0, sizeof(*journal)); -
Convert kmalloc to kzalloc() and get rid of the memset().
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext3/xattr.c | 3 +--
fs/ext4/xattr.c | 3 +--
fs/jbd/journal.c | 3 +--
fs/jbd/transaction.c | 2 +-
fs/jbd2/journal.c | 3 +--
fs/jbd2/transaction.c | 2 +-
6 files changed, 6 insertions(+), 10 deletions(-)
Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-21 09:08:02.000000000
-0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-21 09:10:37.000000000
-0700
@@ -653,10 +653,9 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
- memset(journal, 0, sizeof(*journal));
init_waitqueue_head(&journal->j_wait_transaction_locked);
init_waitqueue_head(&journal->j_wait_logspace);
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-21
09:13:11.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-21 09:13:24.000000000
-0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = kmalloc(sizeof(*new_transaction),
+ new_transaction = kzalloc(sizeof(*new_transaction),
GFP_NOFS|__GFP_NOFAIL);
if (!new_transaction) {
ret = -ENOMEM;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-21
09:10:53.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-21 09:11:13.000000000
-0700
@@ -654,10 +654,9 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
- journal = ...JBD2 naming cleanup From: Mingming Cao <cmm@us.ibm.com> change micros name from JBD_XXX to JBD2_XXX in JBD2/Ext4 Signed-off-by: Mingming Cao <cmm@us.ibm.com> --- fs/ext4/extents.c | 2 +- fs/ext4/super.c | 2 +- fs/jbd2/commit.c | 2 +- fs/jbd2/journal.c | 8 ++++---- fs/jbd2/recovery.c | 2 +- fs/jbd2/revoke.c | 4 ++-- include/linux/ext4_jbd2.h | 6 +++--- include/linux/jbd2.h | 30 +++++++++++++++--------------- 8 files changed, 28 insertions(+), 28 deletions(-) Index: linux-2.6.23-rc6/fs/ext4/super.c =================================================================== --- linux-2.6.23-rc6.orig/fs/ext4/super.c 2007-09-21 16:27:31.000000000 -0700 +++ linux-2.6.23-rc6/fs/ext4/super.c 2007-09-21 16:27:46.000000000 -0700 @@ -966,7 +966,7 @@ static int parse_options (char *options, if (option < 0) return 0; if (option == 0) - option = JBD_DEFAULT_MAX_COMMIT_AGE; + option = JBD2_DEFAULT_MAX_COMMIT_AGE; sbi->s_commit_interval = HZ * option; break; case Opt_data_journal: Index: linux-2.6.23-rc6/include/linux/ext4_jbd2.h =================================================================== --- linux-2.6.23-rc6.orig/include/linux/ext4_jbd2.h 2007-09-10 19:50:29.000000000 -0700 +++ linux-2.6.23-rc6/include/linux/ext4_jbd2.h 2007-09-21 16:27:46.000000000 -0700 @@ -12,8 +12,8 @@ * Ext4-specific journaling extensions. */ -#ifndef _LINUX_EXT4_JBD_H -#define _LINUX_EXT4_JBD_H +#ifndef _LINUX_EXT4_JBD2_H +#define _LINUX_EXT4_JBD2_H #include <linux/fs.h> #include <linux/jbd2.h> @@ -228,4 +228,4 @@ static inline int ext4_should_writeback_ return 0; } -#endif /* _LINUX_EXT4_JBD_H */ +#endif /* _LINUX_EXT4_JBD2_H */ Index: linux-2.6.23-rc6/include/linux/jbd2.h =================================================================== --- linux-2.6.23-rc6.orig/include/linux/jbd2.h 2007-09-21 09:07:09.000000000 -0700 +++ ...
On Fri, 21 Sep 2007 16:13:56 -0700 I split this into separate ext3/jbd and ext4/jbd2 patches. It's generally better to raise separate patches, please - the ext3 patches I'll merge directly but the ext4 patches should go through (and be against) the ext4 devel tree. I fixed lots of rejects against the already-pending changes to these filesystems. You forgot to remove the memsets in both start_this_handle()s. -
Sure. The patches(including ext3/jbd and ext4/jbd2) were merged into ext4 devel tree already, I will remove the ext3/jbd part out of the ext4 Thanks for catching this. Mingming -
The blocks per page could be less or quals to 1 with the large block support in VM.
The patch fixed the way to calculate the number of blocks to reserve in journal in the
case blocksize > pagesize.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c 2007-08-31 13:27:16.000000000 -0700
+++ my2.6/fs/jbd/journal.c 2007-08-31 13:28:18.000000000 -0700
@@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal)
int journal_blocks_per_page(struct inode *inode)
{
- return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+ int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+ if (bits > 0)
+ return 1 << bits;
+ else
+ return 1;
}
/*
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c 2007-08-31 13:32:21.000000000 -0700
+++ my2.6/fs/jbd2/journal.c 2007-08-31 13:32:30.000000000 -0700
@@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou
int jbd2_journal_blocks_per_page(struct inode *inode)
{
- return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+ int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+ if (bits > 0)
+ return 1 << bits;
+ else
+ return 1;
}
/*
-
Ahh. Good. I could add the path to the large blocksize patchset? -
Support large blocksize up to PAGESIZE (max 64KB) for ext4.
From: Takashi Sato <sho@tnes.nec.co.jp>
This patch set supports large block size(>4k, <=64k) in ext4,
just enlarging the block size limit. But it is NOT possible to have 64kB
blocksize on ext4 without some changes to the directory handling
code. The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem. The proposed solution is treat 64k rec_len
with a an impossible value like rec_len = 0xffff to handle this.
The Patch-set consists of the following 2 patches.
[1/2] ext4: enlarge blocksize
- Allow blocksize up to pagesize
[2/2] ext4: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
Now on 64k page ppc64 box runs with this patch set we could create a 64k
block size ext4dev, and able to handle empty directory block.
Patch consider to be merge to 2.6.24-rc1.
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext4/super.c | 5 +++++
include/linux/ext4_fs.h | 4 ++--
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 619db84..d8bb279 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1548,6 +1548,11 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
goto out_fail;
}
+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT4-fs: bad blocksize %d.\n", blocksize);
+ goto out_fail;
+ }
+
/*
* The ext4 superblock will not be buffer aligned for other than 1kB
* block sizes. We need to calculate the offset from buffer start.
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index f9881b6..d15a15e 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -77,8 +77,8 @@
* Macro-instructions used to manage several block sizes
*/
#define ...ext4: Avoid rec_len overflow with 64KB block size
From: Jan Kara <jack@suse.cz>
With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0xffff instead and convert
value when read from / written to disk. The patch also converts some places
to use ext4_next_entry() when we are changing them anyway.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext4/dir.c | 12 ++++---
fs/ext4/namei.c | 76 ++++++++++++++++++++++-------------------------
include/linux/ext4_fs.h | 20 ++++++++++++
3 files changed, 62 insertions(+), 46 deletions(-)
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 3ab01c0..20b1e28 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * function, struct inode * dir,
unsigned long offset)
{
const char * error_msg = NULL;
- const int rlen = le16_to_cpu(de->rec_len);
+ const int rlen = ext4_rec_len_from_disk(de->rec_len);
if (rlen < EXT4_DIR_REC_LEN(1))
error_msg = "rec_len is smaller than minimal";
@@ -176,10 +176,10 @@ revalidate:
* least that it is non-zero. A
* failure will be detected in the
* dirent test below. */
- if (le16_to_cpu(de->rec_len) <
- EXT4_DIR_REC_LEN(1))
+ if (ext4_rec_len_from_disk(de->rec_len)
+ < EXT4_DIR_REC_LEN(1))
break;
- i += le16_to_cpu(de->rec_len);
+ i += ext4_rec_len_from_disk(de->rec_len);
}
offset = i;
filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
ret = stored;
goto out;
}
- offset += le16_to_cpu(de->rec_len);
+ offset += ext4_rec_len_from_disk(de->rec_len);
if (le32_to_cpu(de->inode)) {
/* We might block in the next section
* if the data destination is
@@ -223,7 +223,7 @@ revalidate:
goto revalidate;
stored ++;
}
- filp->f_pos += ...Support large blocksize up to PAGESIZE (max 64KB) for ext2
From: Takashi Sato <sho@tnes.nec.co.jp>
This patch set supports large block size(>4k, <=64k) in ext2,
just enlarging the block size limit. But it is NOT possible to have 64kB
blocksize on ext2 without some changes to the directory handling
code. The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem. The proposed solution is treat 64k rec_len
with a an impossible value like rec_len = 0xffff to handle this.
The Patch-set consists of the following 2 patches.
[1/2] ext2: enlarge blocksize
- Allow blocksize up to pagesize
[2/2] ext2: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
Now on 64k page ppc64 box runs with this patch set we could create a 64k
block size ext2, and able to handle empty directory block.
Please consider to include to mm tree.
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext2/super.c | 3 ++-
include/linux/ext2_fs.h | 4 ++--
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 639a32c..765c805 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -775,7 +775,8 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
brelse(bh);
if (!sb_set_blocksize(sb, blocksize)) {
- printk(KERN_ERR "EXT2-fs: blocksize too small for device.\n");
+ printk(KERN_ERR "EXT2-fs: bad blocksize %d.\n",
+ blocksize);
goto failed_sbi;
}
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 153d755..910a705 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -86,8 +86,8 @@ static inline struct ext2_sb_info *EXT2_SB(struct super_block *sb)
* Macro-instructions used to manage several block sizes
*/
#define ...ext2: Avoid rec_len overflow with 64KB block size
From: Jan Kara <jack@suse.cz>
With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0xffff instead and convert
value when read from / written to disk.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext2/dir.c | 43 +++++++++++++++++++++++++++++++------------
include/linux/ext2_fs.h | 1 +
2 files changed, 32 insertions(+), 12 deletions(-)
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2bf49d7..1329bdb 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -26,6 +26,24 @@
typedef struct ext2_dir_entry_2 ext2_dirent;
+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+ unsigned len = le16_to_cpu(dlen);
+
+ if (len == EXT2_MAX_REC_LEN)
+ return 1 << 16;
+ return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+ if (len == (1 << 16))
+ return cpu_to_le16(EXT2_MAX_REC_LEN);
+ else if (len > (1 << 16))
+ BUG();
+ return cpu_to_le16(len);
+}
+
/*
* ext2 uses block-sized chunks. Arguably, sector-sized ones would be
* more robust, but we have what we have
@@ -95,7 +113,7 @@ static void ext2_check_page(struct page *page)
}
for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
p = (ext2_dirent *)(kaddr + offs);
- rec_len = le16_to_cpu(p->rec_len);
+ rec_len = ext2_rec_len_from_disk(p->rec_len);
if (rec_len < EXT2_DIR_REC_LEN(1))
goto Eshort;
@@ -193,7 +211,8 @@ static inline int ext2_match (int len, const char * const name,
*/
static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
{
- return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len));
+ return (ext2_dirent *)((char*)p +
+ ext2_rec_len_from_disk(p->rec_len));
}
static inline unsigned
@@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
return 0;
}
}
- filp->f_pos ...On Mon, 01 Oct 2007 17:35:46 -0700 This patch clashes in non-trivial ways with ext2-convert-to-new-aops-fix.patch and perhaps other things which are already queued for 2.6.24 inclusion, so I'll need to ask for an updated patch, please. Also, I'm planing on merging the ext2 reservations code into 2.6.24, so if we're aiming for complete support of 64k blocksize in 2.6.24's ext2, additional testing and checking will be needed. -
If the rel_len overflow patch isn't going to make it, then we also need to revert the EXT*_MAX_BLOCK_SIZE change to 65536. It would be possible to allow this to be up to 32768 w/o the rec_len overflow fix however. Yes, this does imply that those patches were in the wrong order in the patch series, and I apologize for that, even if it isn't my fault. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
On Thu, 4 Oct 2007 16:40:44 -0600 Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and ext2-support-large-blocksize-up-to-pagesize.patch. -
Sorry, for the delayed answer but I had some urgent bugs to fix... Why did you drom ext3-support-large-blocksize-up-to-pagesize.patch? As far as I understand your previous email (and also as I've checked against 2.6.23-rc8-mm2), the patch fixing rec_len overflow clashes only for ext2... I'll send you an updated patch for ext2 in a moment. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -
ok.. I'm basically not applying anything any more - the whole thing is a teetering wreck. I need to go through the input queue delicately adding things which look important or relatively non-injurious. -
OK, I'll fixup those rejects and send a new patch. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -
OK, attached is a patch diffed against 2.6.23-rc9-mm2 - does that work
fine for you?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
------
With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0xffff instead and convert
value when read from / written to disk.
Signed-off-by: Jan Kara <jack@suse.cz>
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-mm/fs/ext2/dir.c linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c
--- linux-2.6.23-mm/fs/ext2/dir.c 2007-10-11 12:08:16.000000000 +0200
+++ linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c 2007-10-11 12:14:24.000000000 +0200
@@ -28,6 +28,24 @@
typedef struct ext2_dir_entry_2 ext2_dirent;
+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+ unsigned len = le16_to_cpu(dlen);
+
+ if (len == EXT2_MAX_REC_LEN)
+ return 1 << 16;
+ return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+ if (len == (1 << 16))
+ return cpu_to_le16(EXT2_MAX_REC_LEN);
+ else if (len > (1 << 16))
+ BUG();
+ return cpu_to_le16(len);
+}
+
/*
* ext2 uses block-sized chunks. Arguably, sector-sized ones would be
* more robust, but we have what we have
@@ -106,7 +124,7 @@ static void ext2_check_page(struct page
}
for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
p = (ext2_dirent *)(kaddr + offs);
- rec_len = le16_to_cpu(p->rec_len);
+ rec_len = ext2_rec_len_from_disk(p->rec_len);
if (rec_len < EXT2_DIR_REC_LEN(1))
goto Eshort;
@@ -204,7 +222,8 @@ static inline int ext2_match (int len, c
*/
static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
{
- return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len));
+ return (ext2_dirent *)((char*)p +
+ ext2_rec_len_from_disk(p->rec_len));
}
static inline unsigned
@@ -316,7 +335,7 @@ ext2_readdir (struct file * filp, void *
return 0;
}
}
- filp->f_pos += ...Of course, ext2 shouldn't be trying to write a bad record length into a directory entry. But are we sure that there is no way in which this situation could occur is the on-disk data was _already_ bad? Because it is very bad for a fileysstem to go BUG in response to unexpected data on the disk. -
btw, this changes ext2's on-disk format. a) is the ext2 format documented anywhere? If so, that document will need updating. b) what happens when an old ext2 driver tries to read and/or write this directory entry? Do we need a compat flag for it? c) what happens when old and new ext3 or ext4 try to read/write this directory entry? -
Old ext2 only supports up to 4k include/linux/ext2_fs.h: #define EXT2_MIN_BLOCK_SIZE 1024 #define EXT2_MAX_BLOCK_SIZE 4096 #define EXT2_MIN_BLOCK_LOG_SIZE 10 Should fail to mount the volume since the block size is too large. -
should, but does it? box:/usr/src/25> grep MAX_BLOCK_SIZE fs/ext2/*.[ch] include/linux/ext2* include/linux/ext2_fs.h:#define EXT2_MAX_BLOCK_SIZE 4096 box:/usr/src/25> -
Just to clarify this is only changes the directory entries format on ext2/3/4 fs with 64k block size. But currently without kernel changes The e2fsprogs needs to be changed to sync up with this change. Ted has a paper a while back to show ext2 disk format http://web.mit.edu/tytso/www/linux/ext2intro.html The Documentation/filesystems/ext2.txt doesn't have the ext2 format documented. That document is out-dated need to be reviewed and cleaned Without the first patch in this series: ext2 large blocksize support patches, it fails to mount a ext2 filesystem with 64k block size. [PATCH 1/2] ext2: Support large blocksize up to PAGESIZE http://lkml.org/lkml/2007/10/1/361 So the old ext2/3/4 driver will not get access the directory entry with 64k block size format changes. Regards, -
Support large blocksize up to PAGESIZE (max 64KB) for ext3
From: Takashi Sato <sho@tnes.nec.co.jp>
This patch set supports large block size(>4k, <=64k) in ext3
just enlarging the block size limit. But it is NOT possible to have 64kB
blocksize on ext3 without some changes to the directory handling
code. The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem. The proposed solution is treat 64k rec_len
with a an impossible value like rec_len = 0xffff to handle this.
The Patch-set consists of the following 2 patches.
[1/2] ext3: enlarge blocksize
- Allow blocksize up to pagesize
[2/2] ext3: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
Now on 64k page ppc64 box runs with this patch set we could create a 64k
block size ext3, and able to handle empty directory block.
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext3/super.c | 6 +++++-
include/linux/ext3_fs.h | 4 ++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 9537316..b4bfd36 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1549,7 +1549,11 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
}
brelse (bh);
- sb_set_blocksize(sb, blocksize);
+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT3-fs: bad blocksize %d.\n",
+ blocksize);
+ goto out_fail;
+ }
logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
bh = sb_bread(sb, logic_sb_block);
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index ece49a8..7aa5556 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -76,8 +76,8 @@
* Macro-instructions used to manage several block sizes
*/
#define ...ext3: Avoid rec_len overflow with 64KB block size
From: Jan Kara <jack@suse.cz>
With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry lenght. So we store 0xffff instead and convert
value when read from / written to disk. The patch also converts some places
to use ext3_next_entry() when we are changing them anyway.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext3/dir.c | 10 +++--
fs/ext3/namei.c | 90 ++++++++++++++++++++++-------------------------
include/linux/ext3_fs.h | 20 ++++++++++
3 files changed, 68 insertions(+), 52 deletions(-)
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index c00723a..3c4c43a 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -69,7 +69,7 @@ int ext3_check_dir_entry (const char * function, struct inode * dir,
unsigned long offset)
{
const char * error_msg = NULL;
- const int rlen = le16_to_cpu(de->rec_len);
+ const int rlen = ext3_rec_len_from_disk(de->rec_len);
if (rlen < EXT3_DIR_REC_LEN(1))
error_msg = "rec_len is smaller than minimal";
@@ -177,10 +177,10 @@ revalidate:
* least that it is non-zero. A
* failure will be detected in the
* dirent test below. */
- if (le16_to_cpu(de->rec_len) <
+ if (ext3_rec_len_from_disk(de->rec_len) <
EXT3_DIR_REC_LEN(1))
break;
- i += le16_to_cpu(de->rec_len);
+ i += ext3_rec_len_from_disk(de->rec_len);
}
offset = i;
filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
ret = stored;
goto out;
}
- offset += le16_to_cpu(de->rec_len);
+ offset += ext3_rec_len_from_disk(de->rec_len);
if (le32_to_cpu(de->inode)) {
/* We might block in the next section
* if the data destination is
@@ -223,7 +223,7 @@ revalidate:
goto revalidate;
stored ++;
}
- filp->f_pos += le16_to_cpu(de->rec_len);
+ filp->f_pos += ...[2/4] ext2: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext2/dir.c | 46 ++++++++++++++++++++++++++++++++++++----------
include/linux/ext2_fs.h | 13 +++++++++++++
2 files changed, 49 insertions(+), 10 deletions(-)
Index: linux-2.6.23-rc3/fs/ext2/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext2/dir.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext2/dir.c 2007-08-29 15:29:51.000000000 -0700
@@ -94,9 +94,9 @@ static void ext2_check_page(struct page
goto out;
}
for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
+ offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
p = (ext2_dirent *)(kaddr + offs);
rec_len = le16_to_cpu(p->rec_len);
-
if (rec_len < EXT2_DIR_REC_LEN(1))
goto Eshort;
if (rec_len & 3)
@@ -108,6 +108,7 @@ static void ext2_check_page(struct page
if (le32_to_cpu(p->inode) > max_inumber)
goto Einumber;
}
+ offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
if (offs != limit)
goto Eend;
out:
@@ -283,6 +284,7 @@ ext2_readdir (struct file * filp, void *
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
for ( ;(char*)de <= limit; de = ext2_next_entry(de)) {
+ de = EXT2_DIR_ADJUST_TAIL_ADDR(kaddr, de, sb->s_blocksize);
if (de->rec_len == 0) {
ext2_error(sb, __FUNCTION__,
"zero-length directory entry");
@@ -305,8 +307,10 @@ ext2_readdir (struct file * filp, void *
return 0;
}
}
+ filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
filp->f_pos += le16_to_cpu(de->rec_len);
}
+ filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
ext2_put_page(page);
}
return 0;
@@ -343,13 +347,14 @@ struct ext2_dir_entry_2 * ...[3/4] ext3: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext3/dir.c | 13 ++++---
fs/ext3/namei.c | 88 +++++++++++++++++++++++++++++++++++++++---------
include/linux/ext3_fs.h | 9 ++++
3 files changed, 91 insertions(+), 19 deletions(-)
Index: linux-2.6.23-rc3/fs/ext3/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext3/dir.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext3/dir.c 2007-08-29 15:40:06.000000000 -0700
@@ -100,10 +100,11 @@ static int ext3_readdir(struct file * fi
unsigned long offset;
int i, stored;
struct ext3_dir_entry_2 *de;
- struct super_block *sb;
int err;
struct inode *inode = filp->f_path.dentry->d_inode;
int ret = 0;
+ struct super_block *sb = inode->i_sb;
+ unsigned tail = sb->s_blocksize;
sb = inode->i_sb;
@@ -167,8 +168,11 @@ revalidate:
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
- if (filp->f_version != inode->i_version) {
- for (i = 0; i < sb->s_blocksize && i < offset; ) {
+ if (tail > EXT3_DIR_MAX_REC_LEN) {
+ tail = EXT3_DIR_MAX_REC_LEN;
+ }
+ if (filp->f_version != inode->i_version) {
+ for (i = 0; i < tail && i < offset; ) {
de = (struct ext3_dir_entry_2 *)
(bh->b_data + i);
/* It's too expensive to do a full
@@ -189,7 +193,7 @@ revalidate:
}
while (!error && filp->f_pos < inode->i_size
- && offset < sb->s_blocksize) {
+ && offset < tail) {
de = (struct ext3_dir_entry_2 *) (bh->b_data + offset);
if (!ext3_check_dir_entry ("ext3_readdir", inode, de,
bh, offset)) {
@@ -225,6 +229,7 @@ revalidate:
}
filp->f_pos += le16_to_cpu(de->rec_len);
}
+ filp->f_pos = ...[4/4] ext4: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
fs/ext4/dir.c | 11 ++++--
fs/ext4/namei.c | 88 +++++++++++++++++++++++++++++++++++++++---------
include/linux/ext4_fs.h | 9 ++++
3 files changed, 90 insertions(+), 18 deletions(-)
Index: linux-2.6.23-rc3/fs/ext4/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext4/dir.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext4/dir.c 2007-08-29 15:33:19.000000000 -0700
@@ -100,10 +100,11 @@ static int ext4_readdir(struct file * fi
unsigned long offset;
int i, stored;
struct ext4_dir_entry_2 *de;
- struct super_block *sb;
int err;
struct inode *inode = filp->f_path.dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
int ret = 0;
+ unsigned tail = sb->s_blocksize;
sb = inode->i_sb;
@@ -166,8 +167,11 @@ revalidate:
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
+ if (tail > EXT4_DIR_MAX_REC_LEN) {
+ tail = EXT4_DIR_MAX_REC_LEN;
+ }
if (filp->f_version != inode->i_version) {
- for (i = 0; i < sb->s_blocksize && i < offset; ) {
+ for (i = 0; i < tail && i < offset; ) {
de = (struct ext4_dir_entry_2 *)
(bh->b_data + i);
/* It's too expensive to do a full
@@ -188,7 +192,7 @@ revalidate:
}
while (!error && filp->f_pos < inode->i_size
- && offset < sb->s_blocksize) {
+ && offset < tail) {
de = (struct ext4_dir_entry_2 *) (bh->b_data + offset);
if (!ext4_check_dir_entry ("ext4_readdir", inode, de,
bh, offset)) {
@@ -225,6 +229,7 @@ revalidate:
}
filp->f_pos += le16_to_cpu(de->rec_len);
}
+ filp->f_pos = EXT4_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
offset = 0;
brelse ...