From: Jing Zhang <zj.barak@gmail.com>
Date: Sun Mar 28 17:09:33 2010
rb_tree cache is added to struct ext4_group_info at minor cost.
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: "Aneesh Kumar K. V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jing Zhang <zj.barak@gmail.com>
---
--- linux-2.6.32/fs/ext4/ext4.h 2009-12-03 11:51:22.000000000 +0800
+++ ext4_mm_leak/ext4-12.h 2010-03-28 16:47:56.000000000 +0800
@@ -1622,6 +1622,7 @@ static inline void ext4_update_i_disksiz
struct ext4_group_info {
unsigned long bb_state;
struct rb_root bb_free_root;
+ struct rb_node *bb_free_cache;
ext4_grpblk_t bb_first_free; /* first free block */
ext4_grpblk_t bb_free; /* total free blocks */
ext4_grpblk_t bb_fragments; /* nr of freespace fragments */
--- linux-2.6.32/fs/ext4/mballoc.c 2009-12-03 11:51:22.000000000 +0800
+++ ext4_mm_leak/mballoc-12.c 2010-03-28 16:43:08.000000000 +0800
@@ -2548,6 +2548,8 @@ static void release_blocks_on_commit(jou
count2++;
ext4_lock_group(sb, entry->group);
/* Take it out of per group rb tree */
+ if (db->bb_free_cache == &entry->node)
+ db->bb_free_cache = rb_next(&entry->node);
rb_erase(&entry->node, &(db->bb_free_root));
mb_free_blocks(NULL, &e4b, entry->start_blk, entry->count);
@@ -3188,7 +3190,10 @@ static void ext4_mb_generate_from_freeli
struct ext4_free_data *entry;
grp = ext4_get_group_info(sb, group);
- n = rb_first(&(grp->bb_free_root));
+ if (grp->bb_free_cache)
+ n = grp->bb_free_cache;
+ else
+ n = rb_first(&(grp->bb_free_root));
while (n) {
entry = rb_entry(n, struct ext4_free_data, node);
@@ -4353,6 +4358,7 @@ ext4_mb_free_metadata(handle_t *handle,
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct rb_node **n = &db->bb_free_root.rb_node, *node;
struct rb_node *parent = NULL, *new_node;
+ int left = 1;
BUG_ON(!ext4_handle_valid(handle));
BUG_ON(e4b->bd_bitmap_page == NULL);
@@ ...Please include a reason for the change in the commit message,
so that in the future people don't have to figure out for
themselves why a change was made. It also helps reviewers. :)
I think it's better if you don't use that style, but instead:
else if (block >= (entry->start_blk + entry->count)) {
n = &(*n)->rb_right;
left = 0;
} else {
...
--
fine
Cache of rb_tree is added to struct ext4_group_info, and is maintained
at minor cost.
Then rb_tree traversing in the function,
ext4_mb_generate_from_freelist() is accelerated.
---
--- linux-2.6.32/fs/ext4/mballoc.c 2009-12-03 11:51:22.000000000 +0800
+++ ext4_mm_leak/mballoc-12.c 2010-03-29 21:31:18.000000000 +0800
@@ -2548,6 +2548,8 @@ static void release_blocks_on_commit(jou
count2++;
ext4_lock_group(sb, entry->group);
/* Take it out of per group rb tree */
+ if (db->bb_free_cache == &entry->node)
+ db->bb_free_cache = rb_next(&entry->node);
rb_erase(&entry->node, &(db->bb_free_root));
mb_free_blocks(NULL, &e4b, entry->start_blk, entry->count);
@@ -3188,7 +3190,10 @@ static void ext4_mb_generate_from_freeli
struct ext4_free_data *entry;
grp = ext4_get_group_info(sb, group);
- n = rb_first(&(grp->bb_free_root));
+ if (grp->bb_free_cache)
+ n = grp->bb_free_cache;
+ else
+ n = rb_first(&(grp->bb_free_root));
while (n) {
entry = rb_entry(n, struct ext4_free_data, node);
@@ -4353,6 +4358,7 @@ ext4_mb_free_metadata(handle_t *handle,
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct rb_node **n = &db->bb_free_root.rb_node, *node;
struct rb_node *parent = NULL, *new_node;
+ int left = 1;
BUG_ON(!ext4_handle_valid(handle));
BUG_ON(e4b->bd_bitmap_page == NULL);
@@ -4369,15 +4375,18 @@ ext4_mb_free_metadata(handle_t *handle,
* blocks */
page_cache_get(e4b->bd_buddy_page);
page_cache_get(e4b->bd_bitmap_page);
+ /* just for sure */
+ db->bb_free_cache = 0;
}
while (*n) {
parent = *n;
entry = rb_entry(parent, struct ext4_free_data, node);
if (block < entry->start_blk)
n = &(*n)->rb_left;
- else if (block >= (entry->start_blk + entry->count))
+ else if (block >= (entry->start_blk + entry->count)) {
n = &(*n)->rb_right;
- else {
+ left = 0;
+ } else {
ext4_grp_locked_error(sb, e4b->bd_group, __func__,
"Double free of blocks %d (%d %d)",
block, ...Did we measure the cost of not doing this ? Is there a profile data that we can look at ? -aneesh --
With the added cache, there is over 50% probability that the operation,
rb_first(&(grp->bb_free_root));
can be saved, when there are multiple nodes in tree.
It seems what is added is following what is called O(1), one of the
works by Mr. Ingo Molnar, but I am not sure, and let's ask Mr. Ingo
Molnar.
- zj
--
Sure, but does it matter? The red-black tree is per-block group, and rb_first() is O(ln n), and it's cleared after every transaction commit. Have you measured how deep it gets? Have you measured how much CPU time this would actually save? I'm almost certiain the code complexity isn't worth it. For example, your patch is buggy. There are places where the red black tree is manipulated, and where the node pointed at by bb_free_cache could get freed. For example, see release_blocks_on_commit() and ext4_mb_free_metadata(). That being said, I'm not convinced ext4_mb_generate_from_freelist() is (a) necessary, or (b) bug-free, either. The whole point of having extents in bb_free_root tree is that those extents aren't safe to be placed in the buddy bitmap. And ext4_mb_generate_from_freelist() isn't freeing the nodes from the rbtree. Fortunately it looks like ext4_mb_generate_from_freelist is only getting called when the buddy bitmap is being set up, so the rbtree should be empty during those times. I need to do some more investigation, but I think the function can be removed entirely. - Ted --
I will check release_blocks_on_commit() and ext4_mb_free_metadata() Do you mean that ext4_mb_generate_from_freelist() can be removed entirely? --
Maybe. I need to do more investigation to be sure. The code in
mballoc is subtle, and in some places there is stuff which is vestigal
or misnamed, but it means that I'm going to be very careful before
changing things.
It also means that if you submit patches, you need to be very clear
what you think the surrounding code is doing, why you think it's
wrong, and how your patch make things better. For example, this:
The function, ext4_mb_discard_group_preallocations(), works alone as
hard as possible without correct understanding its caller's good
thinking.
And now try to relieve it in simple way.
is almost useless as a comment. It doesn't help me understand the
code. "hard as possible"? Huh? "without correct understanding"?
How can code, unless it's artificially intelligent, have
understanding? And if you meant the original author had no
understanding, how do you know that? "caller's good thinking"? Same
question; the calling code doesn't think.
This sort of explanation isn't helpful in understanding the patch.
- Ted
--
Another good lesson, I got it.
Thanks
- zj
--
