Re: [PATCH] avoid scanning bitmaps for group preallocation

From: Andreas Dilger
Date: Monday, March 22, 2010 - 3:03 pm

Here is the patch I mentioned today on the call.  It avoids (or at  
least reduces) serious latency (10 minutes or more) on a large  
filesystem (8TB+) on the first write, if the filesystem is nearly  
full.  The latency is entirely due to seeking to read the block  
bitmaps, so is considerably less serious on flex_bg formatted  
filesystems.

A better long-term approach would be to store in the superblock the  
last group that had space to allocate a stripe-sized chunk and/or flag  
in the group descriptor if there is not a large amount of contiguous  
free space therein (cleared on freeing blocks in the group).
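
To make the group-descriptor flag idea concrete, here is a minimal userspace sketch. It is not kernel code and not part of the patch: the struct, the flag name no_large_chunk, and all three helpers are hypothetical, and the superblock hint is modelled simply as the start index passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group record; a real group descriptor has other fields. */
struct group_desc_hint {
    unsigned int free;      /* free blocks in the group */
    bool no_large_chunk;    /* assumed flag: last scan found no stripe-sized run */
};

/* Record that a scan of this group found no stripe-sized chunk. */
static void note_scan_failed(struct group_desc_hint *g)
{
    assert(g != NULL);
    g->no_large_chunk = true;
}

/* Freeing blocks may have created a large run, so clear the flag. */
static void note_blocks_freed(struct group_desc_hint *g, unsigned int count)
{
    assert(g != NULL);
    g->free += count;
    g->no_large_chunk = false;
}

/* Starting at 'start' (standing in for the superblock hint), return the
 * first group that is not flagged and has free blocks, wrapping around.
 * Returns ngroups if no group qualifies. */
static size_t next_candidate(const struct group_desc_hint *groups,
                             size_t ngroups, size_t start)
{
    for (size_t i = 0; i < ngroups; i++) {
        size_t grp = (start + i) % ngroups;

        if (!groups[grp].no_large_chunk && groups[grp].free > 0)
            return grp;
    }
    return ngroups;
}
```

In this sketch the flag is only a hint: clearing it on every free is conservative, and a flagged group costs no bitmap read until blocks are freed in it again.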

Having the mount-time buddy-bitmap (and checksum verifying) scanning  
thread start at mount would only help if the first write to the  
filesystem is not immediately after mount (which it is in Lustre at  
least).  Having a filesystem-wide (r)btree for the free space (a la XFS)
would also only help if the btree could be (at least partially) built
from bitmaps before the first write, unless we cache the bitmap on
disk, which caused Lustre plenty of problems in the past, so I'm leery
of doing it.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From: Aneesh Kumar K. V
Date: Friday, March 26, 2010 - 3:28 am

@@ -125,8 +125,7 @@
  * list. In case of inode preallocation we follow a list of heuristics
  * based on file size. This can be found in ext4_mb_normalize_request. If
  * we are doing a group prealloc we try to normalize the request to
- * sbi->s_mb_group_prealloc. Default value of s_mb_group_prealloc is
- * 512 blocks. This can be tuned via
+ * sbi->s_mb_group_prealloc.  This can be tuned via
  * /sys/fs/ext4/<partition/mb_group_prealloc. The value is represented in
  * terms of number of blocks. If we have mounted the file system with -O
  * stripe=<value> option the group prealloc request is normalized to the
@@ -2029,9 +2028,12 @@ repeat:
			if (group == ngroups)
				group = 0;
 
-			/* quick check to skip empty groups */
+			/* If there's no chance that this group has a better
+			 * extent, just skip it instead of seeking to read
+			 * block bitmap from disk. Initially ac_b_ex.fe_len = 0,
+			 * so this always skips groups with no free space. */
			grp = ext4_get_group_info(sb, group);
-			if (grp->bb_free == 0)
+			if (grp->bb_free <= ac->ac_b_ex.fe_len)
				continue;
 
			err = ext4_mb_load_buddy(sb, group, &e4b);

I was wondering whether we also need to take the criteria value into
account when checking bb_free. If we are really low on space we may want
to return whatever is left, right? Or does ac_b_ex take care of that?

-aneesh
--

From: Andreas Dilger
Date: Friday, March 26, 2010 - 10:58 am

ac_b_ex is the best currently ALLOCATED extent, so mballoc wouldn't  
ever select an extent that is smaller than ac_b_ex.fe_len.  That means  
it is pointless to even look at a group which has fewer free blocks  
than ac_b_ex.fe_len.
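
The effect of that comparison can be illustrated outside the kernel with a small model. This is only a sketch: struct group_info and groups_to_scan() are hypothetical stand-ins, and best_len plays the role of ac_b_ex.fe_len.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the in-memory per-group summary; only the
 * free-block count is modelled. */
struct group_info {
    unsigned int bb_free;   /* free blocks in the group */
};

/* Count the groups whose block bitmap would still have to be read,
 * given the length of the best extent found so far. */
static size_t groups_to_scan(const struct group_info *groups,
                             size_t ngroups, unsigned int best_len)
{
    size_t n = 0;

    assert(groups != NULL || ngroups == 0);
    for (size_t i = 0; i < ngroups; i++) {
        /* Same test as the patch: a group with no more free blocks
         * than the current best extent cannot improve on it. */
        if (groups[i].bb_free <= best_len)
            continue;
        n++;
    }
    return n;
}
```

With best_len == 0 this skips only completely full groups, which matches the old "bb_free == 0" check; as best_len grows, more groups are skipped without touching the disk.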

Later, after the group information is loaded, ext4_mb_good_group()
(ldiskfs_mb_good_group() in Lustre's ldiskfs) will skip the group if
the average fragment size is smaller than the GOAL extent, but only
for certain criterion levels.  At the highest criterion, any group
with free blocks will be scanned.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
