Re: [PATCH] JBD slab cleanups

Previous thread: Re: [NFS] [3/4] 2.6.23-rc4: known regressions by J. Bruce Fields on Wednesday, August 29, 2007 - 2:54 pm. (1 message)

Next thread: [PATCH] fs/jfs: use DIV_ROUND_UP where appropriate by Shaun Zinck on Wednesday, August 29, 2007 - 9:17 pm. (2 messages)
From: Mingming Cao
Date: Wednesday, August 29, 2007 - 5:11 pm

There are patches original worked by Takashi Sato to support large block
size (up to 64k) in ext2/3/4, which addressed the directory issue as
well. I just forward ported. Will posted it in a separate thread.
Haven't get a chance to integrated with your patch yet (next step).

thanks,

-



From: Mingming Cao
Date: Wednesday, August 29, 2007 - 5:47 pm

The next 4 patches support large block size (up to PAGESIZE, max 64KB)
for ext2/3/4, originally from Takashi Sato.
http://marc.info/?l=linux-ext4&m=115768873518400&w=2


It's quite simple to support large block size in ext2/3/4, mostly just
enlarge the block size limit.  But it is NOT possible to have 64kB
blocksize on ext2/3/4 without some changes to the directory handling
code.  The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem.  The proposed solution is to put 2 empty records in such
a directory, or to special-case an impossible value like rec_len =
0xffff to handle this. 


The Patch-set consists of the following 4 patches.
  [1/4]  ext2/3/4: enlarge blocksize
         - Allow blocksize up to pagesize

  [2/4]  ext2: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

  [3/4]  ext3: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

  [4/4]  ext4: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

Just rebase to 2.6.23-rc4 and against the ext4 patch queue. Compile tested only. 

Next steps:
Need a e2fsprogs changes to able test this feature. As mkfs needs to be
educated not assuming rec_len to be blocksize all the time.
Will try it with Christoph Lameter's large block patch next.


Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
---
 fs/ext2/super.c         |    2 +-
 fs/ext3/super.c         |    5 ++++-
 fs/ext4/super.c         |    5 +++++
 include/linux/ext2_fs.h |    4 ++--
 include/linux/ext3_fs.h |    4 ++--
 include/linux/ext4_fs.h |    4 ++--
 6 files changed, 16 insertions(+), 8 deletions(-)

Index: linux-2.6.23-rc3/fs/ext2/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext2/super.c	2007-08-12 21:25:24.000000000 -0700
+++ ...
From: Mingming Cao
Date: Friday, August 31, 2007 - 5:01 pm

Two problems were found when testing largeblock on ext3.  Patches to
follow. 

Good news is, with your changes, plus all these extN changes, I am able
to run ext2/3/4 with 64k block size, tested on x86 and ppc64 with 4k
page size. fsx test runs fine for an hour on ext3 with 16k blocksize on
x86:-)

Mingming


From: Mingming Cao
Date: Friday, August 31, 2007 - 5:12 pm

From clameter:
Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>

Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c	2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd/journal.c	2007-08-31 11:01:18.000000000 -0700
@@ -1627,16 +1627,17 @@ void * __jbd_kmalloc (const char *where,
  * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
  * and allocate frozen and commit buffers from these slabs.
  *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
  */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size)  get_order((size) << (PAGE_SHIFT - 10))
 
 static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
 static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
+	"jbd_1k", "jbd_2k", "jbd_4k", "jbd_8k",
+	"jbd_16k", "jbd_32k", "jbd_64k"
 };
 
 static void journal_destroy_jbd_slabs(void)
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c	2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd2/journal.c	2007-08-31 11:04:37.000000000 -0700
@@ -1639,16 +1639,18 @@ void * __jbd2_kmalloc (const char *where
  * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
  * and allocate frozen and commit buffers from these slabs.
  *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the ...
From: Christoph Hellwig
Date: Saturday, September 1, 2007 - 11:39 am

But the real fix is to kill this code.  We can't send down slab pages
down the block layer without breaking iscsi or aoe.  And this code is
only used in so rare cases that all the normal testing won't hit it.
Very bad combination.
-



From: Christoph Lameter
Date: Sunday, September 2, 2007 - 4:40 am

We are doing what you describe right now. So the current code is broken?


-



From: Christoph Hellwig
Date: Sunday, September 2, 2007 - 8:28 am

From: Christoph Lameter
Date: Monday, September 3, 2007 - 12:55 am

How about getting rid of the slabs there and use kmalloc? Kmalloc in mm 
(and therfore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page 
allocator calls. Not sure what to do about the 1k and 2k requests though.


From: Christoph Hellwig
Date: Monday, September 3, 2007 - 6:40 am

The problem is that we must never use kmalloc pages, so we always need
to request a page or more for these.  Better to use get_free_page directly,
that's how I fixed it in XFS a while ago.

From: Christoph Lameter
Date: Monday, September 3, 2007 - 12:31 pm

So you'd be fine with replacing the allocs with

get_free_pages(GFP_xxx, get_order(size)) ?


From: Christoph Hellwig
Date: Monday, September 3, 2007 - 12:33 pm

Yes.  And rip out all that code related to setting up the slabs.  I plan
to add WARN_ONs to bio_add_page and friends to detect further usage of
slab pages if there is any.


From: Mingming Cao
Date: Friday, September 14, 2007 - 11:53 am

jbd/jbd2: Replace slab allocations with page cache allocations

From: Christoph Lameter <clameter@sgi.com>

JBD should not pass slab pages down to the block layer.
Use page allocator pages instead. This will also prepare
JBD for the large blocksize patchset.

Tested on 2.6.23-rc6 with fsx runs fine.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/jbd/checkpoint.c   |    2 
 fs/jbd/commit.c       |    6 +-
 fs/jbd/journal.c      |  107 ++++---------------------------------------------
 fs/jbd/transaction.c  |   10 ++--
 fs/jbd2/checkpoint.c  |    2 
 fs/jbd2/commit.c      |    6 +-
 fs/jbd2/journal.c     |  109 ++++----------------------------------------------
 fs/jbd2/transaction.c |   18 ++++----
 include/linux/jbd.h   |   23 +++++++++-
 include/linux/jbd2.h  |   28 ++++++++++--
 10 files changed, 83 insertions(+), 228 deletions(-)

Index: linux-2.6.23-rc5/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/journal.c	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/journal.c	2007-09-13 13:45:39.000000000 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
 		char *tmp;
 
 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd_slab_free(tmp, bh_in->b_size);
+			jbd_free(tmp, bh_in->b_size);
 			goto repeat;
 		}
 
@@ -679,7 +678,7 @@ static journal_t * journal_init_common (
 	/* Set up a default-sized revoke table for the new mount. */
 	err = journal_init_revoke(journal, ...
From: Mingming Cao
Date: Monday, September 17, 2007 - 12:29 pm

Currently memory allocation for committed_data(and frozen_buffer) for
bufferhead is done through jbd slab management, as Christoph Hellwig
pointed out that this is broken as jbd should not pass slab pages down
to IO layer. and suggested to use get_free_pages() directly.

The problem with this patch, as Andreas Dilger pointed today in ext4
interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
1/3-1/2 page space. 

What was the originally intention to set up slabs for committed_data(and
frozen_buffer) in JBD? Why not using kmalloc?


-



From: Badari Pulavarty
Date: Monday, September 17, 2007 - 3:01 pm

Looks good. Small suggestion is to get rid of all kmalloc() usages and
consistently use jbd_kmalloc() or jbd2_kmalloc().

Thanks,
Badari


From: Mingming Cao
Date: Monday, September 17, 2007 - 3:57 pm

Here is the incremental small cleanup patch. 

Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc.


Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/jbd/journal.c  |    8 +++++---
 fs/jbd/revoke.c   |   12 ++++++------
 fs/jbd2/journal.c |    8 +++++---
 fs/jbd2/revoke.c  |   12 ++++++------
 4 files changed, 22 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-17 14:32:16.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-17 14:33:59.000000000 -0700
@@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal)
 		iput(journal->j_inode);
 	if (journal->j_revoke)
 		journal_destroy_revoke(journal);
-	kfree(journal->j_wbuf);
+	jbd_kfree(journal->j_wbuf);
 	jbd_kfree(journal);
 }
 
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-17 14:32:22.000000000 -0700
+++ ...
From: Christoph Hellwig
Date: Tuesday, September 18, 2007 - 2:04 am

Shouldn't we kill jbd_kmalloc instead?

-



From: Mingming Cao
Date: Tuesday, September 18, 2007 - 9:35 am

It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
places to handle memory (de)allocation(<page size) via kmalloc/kfree, so
in the future if we need to change memory allocation in jbd(e.g. not
using kmalloc or using different flag), we don't need to touch every
place in the jbd code calling jbd_kmalloc.

Regards,
Mingming

-



From: Dave Kleikamp
Date: Tuesday, September 18, 2007 - 11:04 am

I disagree.  Why would jbd need to globally change the way it allocates
memory?  It currently uses kmalloc (and jbd_kmalloc) for allocating a
variety of structures.  Having to change one particular instance won't
necessarily mean we want to change all of them.  Adding unnecessary
wrappers only obfuscates the code making it harder to understand.  You
wouldn't want every subsystem to have it's own *_kmalloc() that took
different arguments.  Besides, there aren't that many calls to kmalloc
and kfree in the jbd code, so there wouldn't be much pain in changing
GFP flags or whatever, if it ever needed to be done.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


From: Mingming Cao
Date: Tuesday, September 18, 2007 - 6:00 pm

Okay, Points taken, Here is the updated patch to get rid of slab
management and jbd_kmalloc from jbd totally. This patch is intend to
replace the patch in mm tree, Andrew, could you pick up this one
instead?

Thanks,

Mingming


jbd/jbd2: JBD memory allocation cleanups

From: Christoph Lameter <clameter@sgi.com>

JBD: Replace slab allocations with page cache allocations

JBD allocate memory for committed_data and frozen_data from slab. However
JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset.


Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

---
 fs/jbd/commit.c       |    6 +--
 fs/jbd/journal.c      |   99 ++------------------------------------------------
 fs/jbd/transaction.c  |   12 +++---
 fs/jbd2/commit.c      |    6 +--
 fs/jbd2/journal.c     |   99 ++------------------------------------------------
 fs/jbd2/transaction.c |   18 ++++-----
 include/linux/jbd.h   |   18 +++++----
 include/linux/jbd2.h  |   21 +++++-----
 8 files changed, 52 insertions(+), 227 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-18 17:51:21.000000000 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
 		char *tmp;
 
 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
 ...
From: Andrew Morton
Date: Tuesday, September 18, 2007 - 7:19 pm

__GFP_NOFAIL should only be used when we have no way of recovering
from failure.  The allocation in journal_init_common() (at least)
_can_ recover and hence really shouldn't be using __GFP_NOFAIL.

(Actually, nothing in the kernel should be using __GFP_NOFAIL.  It is 
there as a marker which says "we really shouldn't be doing this but
we don't know how to fix it").

So sometime it'd be good if you could review all the __GFP_NOFAILs in
there and see if we can remove some, thanks.
-



From: Mingming Cao
Date: Wednesday, September 19, 2007 - 12:15 pm

Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
cases except one handles memory allocation failure so I get rid of those
GFP_NOFAIL flags.

Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
in jbd/jbd2? I will send a separate patch to cleanup that.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/jbd/journal.c      |    2 +-
 fs/jbd/transaction.c  |    3 +--
 fs/jbd2/journal.c     |    2 +-
 fs/jbd2/transaction.c |    3 +--
 4 files changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:47:58.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 11:48:40.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c	2007-09-19 11:48:05.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c	2007-09-19 11:49:10.000000000 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = kmalloc(sizeof(*new_transaction),
-						GFP_NOFS|__GFP_NOFAIL);
+		new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-19 11:48:14.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-19 11:49:46.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * ...
From: Mingming Cao
Date: Wednesday, September 19, 2007 - 12:22 pm

Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
with the rest of kmalloc flag used in the JBD/JBD2 layer.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>

---
 fs/jbd/journal.c  |    6 +++---
 fs/jbd/revoke.c   |    8 ++++----
 fs/jbd2/journal.c |    6 +++---
 fs/jbd2/revoke.c  |    8 ++++----
 4 files changed, 14 insertions(+), 14 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:51:10.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 11:51:57.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_NOFS);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-19 11:51:30.000000000 -0700
+++ ...
From: Andrew Morton
Date: Wednesday, September 19, 2007 - 2:34 pm

On Wed, 19 Sep 2007 12:22:09 -0700

These were all OK using GFP_KERNEL.

GFP_NOFS should only be used when the caller is holding some fs locks which
might cause a deadlock if that caller reentered the fs in ->writepage (and
maybe put_inode and such).  That isn't the case in any of the above code,
which is all mount time stuff (I think).

ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
a page locked, is holding i_mutex, etc.


From: Mingming Cao
Date: Wednesday, September 19, 2007 - 2:55 pm

Thanks for your feedback.

Mingming

-



From: Andreas Dilger
Date: Wednesday, September 19, 2007 - 9:25 pm

Is there a reason for this change except "it's in a filesystem, so it
should be GFP_NOFS"?  We are only doing journal setup during mount so
there shouldn't be any problem using GFP_KERNEL.  I don't think it will
inject any defect into the code, but I don't think it is needed either.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From: Dave Kleikamp
Date: Wednesday, September 19, 2007 - 12:26 pm

No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
recursive calls back into the file system that could end up blocking on
jbd code.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-



From: Dave Kleikamp
Date: Wednesday, September 19, 2007 - 12:28 pm

Oh, I see your patch now.  You mean use GFP_NOFS instead of
-- 
David Kleikamp
IBM Linux Technology Center

-



From: Mingming Cao
Date: Wednesday, September 19, 2007 - 1:47 pm

From: Andreas Dilger
Date: Wednesday, September 19, 2007 - 12:48 pm

This should probably be a __GFP_NOFAIL if we are trying to start a new
handle in truncate, as there is no way to propagate an error to the caller.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


From: Mingming Cao
Date: Wednesday, September 19, 2007 - 3:03 pm

Thanks, updated version.

Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2, most cases
they are not needed.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/jbd/journal.c  |    2 +-
 fs/jbd2/journal.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:47:58.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 14:23:45.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-19 11:48:14.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-19 14:23:45.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));



From: Christoph Hellwig
Date: Monday, September 17, 2007 - 12:34 pm

kmalloc is using slabs :)

The intent was to avoid the wasted memory, but as we've repeated a gazillion
times wasted memory on a rather rare codepath doesn't really matter when
you just crash random storage drivers otherwise.

From: Christoph Lameter
Date: Friday, September 14, 2007 - 11:58 am

Thanks Mingming.



From: Mingming Cao
Date: Friday, August 31, 2007 - 5:12 pm

The blocks per page could be less or quals to 1 with the large block support in VM.
The patch fixed the way to calculate the number of blocks to reserve in journal in the
case blocksize > pagesize.



Signed-off-by: Mingming Cao <cmm@us.ibm.com>

Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c	2007-08-31 13:27:16.000000000 -0700
+++ my2.6/fs/jbd/journal.c	2007-08-31 13:28:18.000000000 -0700
@@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal)
 
 int journal_blocks_per_page(struct inode *inode)
 {
-	return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+	if (bits > 0)
+		return 1 << bits;
+	else
+		return 1;
 }
 
 /*
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c	2007-08-31 13:32:21.000000000 -0700
+++ my2.6/fs/jbd2/journal.c	2007-08-31 13:32:30.000000000 -0700
@@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou
 
 int jbd2_journal_blocks_per_page(struct inode *inode)
 {
-	return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+	if (bits > 0)
+		return 1 << bits;
+	else
+		return 1;
 }
 
 /*



From: Christoph Lameter
Date: Wednesday, August 29, 2007 - 5:59 pm

Ahh. Good.

I could add the path to the large blocksize patchset?


From: Mingming Cao
Date: Wednesday, August 29, 2007 - 5:47 pm

[2/4]  ext2: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize


Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

---
 fs/ext2/dir.c           |   46 ++++++++++++++++++++++++++++++++++++----------
 include/linux/ext2_fs.h |   13 +++++++++++++
 2 files changed, 49 insertions(+), 10 deletions(-)

Index: linux-2.6.23-rc3/fs/ext2/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext2/dir.c	2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext2/dir.c	2007-08-29 15:29:51.000000000 -0700
@@ -94,9 +94,9 @@ static void ext2_check_page(struct page 
 			goto out;
 	}
 	for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
+		offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
 		p = (ext2_dirent *)(kaddr + offs);
 		rec_len = le16_to_cpu(p->rec_len);
-
 		if (rec_len < EXT2_DIR_REC_LEN(1))
 			goto Eshort;
 		if (rec_len & 3)
@@ -108,6 +108,7 @@ static void ext2_check_page(struct page 
 		if (le32_to_cpu(p->inode) > max_inumber)
 			goto Einumber;
 	}
+	offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
 	if (offs != limit)
 		goto Eend;
 out:
@@ -283,6 +284,7 @@ ext2_readdir (struct file * filp, void *
 		de = (ext2_dirent *)(kaddr+offset);
 		limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
 		for ( ;(char*)de <= limit; de = ext2_next_entry(de)) {
+			de = EXT2_DIR_ADJUST_TAIL_ADDR(kaddr, de, sb->s_blocksize);
 			if (de->rec_len == 0) {
 				ext2_error(sb, __FUNCTION__,
 					"zero-length directory entry");
@@ -305,8 +307,10 @@ ext2_readdir (struct file * filp, void *
 					return 0;
 				}
 			}
+			filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
 			filp->f_pos += le16_to_cpu(de->rec_len);
 		}
+		filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
 		ext2_put_page(page);
 	}
 	return 0;
@@ -343,13 +347,14 @@ struct ext2_dir_entry_2 * ...
From: Christoph Lameter
Date: Wednesday, August 29, 2007 - 5:12 pm

Ahh. Great. Keep me posted.

-



Previous thread: Re: [NFS] [3/4] 2.6.23-rc4: known regressions by J. Bruce Fields on Wednesday, August 29, 2007 - 2:54 pm. (1 message)

Next thread: [PATCH] fs/jfs: use DIV_ROUND_UP where appropriate by Shaun Zinck on Wednesday, August 29, 2007 - 9:17 pm. (2 messages)