[PATCH] ext4: add rb_tree cache to struct ext4_group_info

Previous thread: Re: bug in ext3 code causing OOM error on systems with small memory by Frans van de Wiel on Saturday, March 27, 2010 - 3:27 pm. (1 message)

Next thread: Psychologists - 272,188 records and 9,874 emails by woodcarver Hurley on Sunday, March 28, 2010 - 9:55 am. (1 message)
From: jing zhang
Date: Sunday, March 28, 2010 - 2:11 am

From: Jing Zhang <zj.barak@gmail.com>

Date: Sun Mar 28 17:09:33     2010

rb_tree cache is added to struct ext4_group_info at minor cost.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: "Aneesh Kumar K. V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jing Zhang <zj.barak@gmail.com>

---

--- linux-2.6.32/fs/ext4/ext4.h	2009-12-03 11:51:22.000000000 +0800
+++ ext4_mm_leak/ext4-12.h	2010-03-28 16:47:56.000000000 +0800
@@ -1622,6 +1622,7 @@ static inline void ext4_update_i_disksiz
 struct ext4_group_info {
 	unsigned long   bb_state;
 	struct rb_root  bb_free_root;
+	struct rb_node *bb_free_cache;
 	ext4_grpblk_t	bb_first_free;	/* first free block */
 	ext4_grpblk_t	bb_free;	/* total free blocks */
 	ext4_grpblk_t	bb_fragments;	/* nr of freespace fragments */
--- linux-2.6.32/fs/ext4/mballoc.c	2009-12-03 11:51:22.000000000 +0800
+++ ext4_mm_leak/mballoc-12.c	2010-03-28 16:43:08.000000000 +0800
@@ -2548,6 +2548,8 @@ static void release_blocks_on_commit(jou
 		count2++;
 		ext4_lock_group(sb, entry->group);
 		/* Take it out of per group rb tree */
+		if (db->bb_free_cache == &entry->node)
+			db->bb_free_cache = rb_next(&entry->node);
 		rb_erase(&entry->node, &(db->bb_free_root));
 		mb_free_blocks(NULL, &e4b, entry->start_blk, entry->count);

@@ -3188,7 +3190,10 @@ static void ext4_mb_generate_from_freeli
 	struct ext4_free_data *entry;

 	grp = ext4_get_group_info(sb, group);
-	n = rb_first(&(grp->bb_free_root));
+	if (grp->bb_free_cache)
+		n = grp->bb_free_cache;
+	else
+		n = rb_first(&(grp->bb_free_root));

 	while (n) {
 		entry = rb_entry(n, struct ext4_free_data, node);
@@ -4353,6 +4358,7 @@ ext4_mb_free_metadata(handle_t *handle,
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct rb_node **n = &db->bb_free_root.rb_node, *node;
 	struct rb_node *parent = NULL, *new_node;
+	int left = 1;

 	BUG_ON(!ext4_handle_valid(handle));
 	BUG_ON(e4b->bd_bitmap_page == NULL);
@@ ...
From: Eric Sandeen
Date: Sunday, March 28, 2010 - 8:03 am

Please include a reason for the change in the commit message,
so that in the future people don't have to figure out for
themselves why a change was made.  It also helps reviewers.  :)


I think it's better if you don't use that style, but instead:

 		else if (block >= (entry->start_blk + entry->count)) {
			n = &(*n)->rb_right;
			left = 0;
		} else {
			...
--

From: jing zhang
Date: Monday, March 29, 2010 - 6:40 am

fine

Cache of rb_tree is added to struct ext4_group_info, and is maintained
at minor cost.
Then rb_tree traversing in the function,
ext4_mb_generate_from_freelist() is accelerated.

---

--- linux-2.6.32/fs/ext4/mballoc.c	2009-12-03 11:51:22.000000000 +0800
+++ ext4_mm_leak/mballoc-12.c	2010-03-29 21:31:18.000000000 +0800
@@ -2548,6 +2548,8 @@ static void release_blocks_on_commit(jou
 		count2++;
 		ext4_lock_group(sb, entry->group);
 		/* Take it out of per group rb tree */
+		if (db->bb_free_cache == &entry->node)
+			db->bb_free_cache = rb_next(&entry->node);
 		rb_erase(&entry->node, &(db->bb_free_root));
 		mb_free_blocks(NULL, &e4b, entry->start_blk, entry->count);

@@ -3188,7 +3190,10 @@ static void ext4_mb_generate_from_freeli
 	struct ext4_free_data *entry;

 	grp = ext4_get_group_info(sb, group);
-	n = rb_first(&(grp->bb_free_root));
+	if (grp->bb_free_cache)
+		n = grp->bb_free_cache;
+	else
+		n = rb_first(&(grp->bb_free_root));

 	while (n) {
 		entry = rb_entry(n, struct ext4_free_data, node);
@@ -4353,6 +4358,7 @@ ext4_mb_free_metadata(handle_t *handle,
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct rb_node **n = &db->bb_free_root.rb_node, *node;
 	struct rb_node *parent = NULL, *new_node;
+	int left = 1;

 	BUG_ON(!ext4_handle_valid(handle));
 	BUG_ON(e4b->bd_bitmap_page == NULL);
@@ -4369,15 +4375,18 @@ ext4_mb_free_metadata(handle_t *handle,
 		 * blocks */
 		page_cache_get(e4b->bd_buddy_page);
 		page_cache_get(e4b->bd_bitmap_page);
+		/* just for sure */
+		db->bb_free_cache = 0;
 	}
 	while (*n) {
 		parent = *n;
 		entry = rb_entry(parent, struct ext4_free_data, node);
 		if (block < entry->start_blk)
 			n = &(*n)->rb_left;
-		else if (block >= (entry->start_blk + entry->count))
+		else if (block >= (entry->start_blk + entry->count)) {
 			n = &(*n)->rb_right;
-		else {
+			left = 0;
+		} else {
 			ext4_grp_locked_error(sb, e4b->bd_group, __func__,
 					"Double free of blocks %d (%d %d)",
 					block, ...
From: Aneesh Kumar K. V
Date: Tuesday, March 30, 2010 - 11:29 am

Did we measure the cost of not doing this ? Is there a profile data that
we can look at ?

-aneesh
--

From: jing zhang
Date: Wednesday, March 31, 2010 - 7:26 am

With the added cache, there is over 50% probability that the operation,
       rb_first(&(grp->bb_free_root));
can be saved, when there are multiple nodes in tree.

It seems what is added is following what is called O(1), one of the
works by Mr. Ingo Molnar, but I am not sure, and let's ask Mr. Ingo
Molnar.

          - zj
--

From: tytso
Date: Saturday, April 3, 2010 - 8:34 am

Sure, but does it matter?  The red-black tree is per-block group, and
rb_first() is O(ln n), and it's cleared after every transaction
commit.  Have you measured how deep it gets?  Have you measured how
much CPU time this would actually save?

I'm almost certiain the code complexity isn't worth it.  For example,
your patch is buggy.  There are places where the red black tree is
manipulated, and where the node pointed at by bb_free_cache could get
freed.  For example, see release_blocks_on_commit() and
ext4_mb_free_metadata().

That being said, I'm not convinced ext4_mb_generate_from_freelist() is
(a) necessary, or (b) bug-free, either.  The whole point of having
extents in bb_free_root tree is that those extents aren't safe to be
placed in the buddy bitmap.  And ext4_mb_generate_from_freelist()
isn't freeing the nodes from the rbtree.  Fortunately it looks like
ext4_mb_generate_from_freelist is only getting called when the buddy
bitmap is being set up, so the rbtree should be empty during those
times.

I need to do some more investigation, but I think the function can be
removed entirely.

						- Ted
--

From: jing zhang
Date: Saturday, April 3, 2010 - 6:26 pm

I will check release_blocks_on_commit() and ext4_mb_free_metadata()

Do you mean that ext4_mb_generate_from_freelist() can be removed entirely?

--

From: tytso
Date: Saturday, April 3, 2010 - 7:06 pm

Maybe.  I need to do more investigation to be sure.  The code in
mballoc is subtle, and in some places there is stuff which is vestigal
or misnamed, but it means that I'm going to be very careful before
changing things.

It also means that if you submit patches, you need to be very clear
what you think the surrounding code is doing, why you think it's
wrong, and how your patch make things better.  For example, this:

    The function, ext4_mb_discard_group_preallocations(), works alone as
    hard as possible without correct understanding its caller's good
    thinking.

    And now try to relieve it in simple way.

is almost useless as a comment.  It doesn't help me understand the
code.  "hard as possible"?  Huh?  "without correct understanding"?
How can code, unless it's artificially intelligent, have
understanding?  And if you meant the original author had no
understanding, how do you know that?  "caller's good thinking"?  Same
question; the calling code doesn't think.

This sort of explanation isn't helpful in understanding the patch.

							- Ted
--

From: jing zhang
Date: Saturday, April 3, 2010 - 11:26 pm

Another good lesson, I got it.

Thanks
              - zj
--

Previous thread: Re: bug in ext3 code causing OOM error on systems with small memory by Frans van de Wiel on Saturday, March 27, 2010 - 3:27 pm. (1 message)

Next thread: Psychologists - 272,188 records and 9,874 emails by woodcarver Hurley on Sunday, March 28, 2010 - 9:55 am. (1 message)