(This patch was previously posted on linux-ext4 where Andreas Dilger offered some valuable comments on it). This patch modifies the block allocation strategy in ext3 in order to improve fsck performance. This was initially sent out as a patch for ext2, but given the lack of ongoing development on ext2, I have crossported it to ext3 instead. Slow fsck is not a serious problem on ext3 due to journaling, but once in a while users do need to run full fsck on their ext3 file systems. This can be due to several reasons: (1) bad disk, bad crash, etc, (2) bug in jbd/ext3, and (3) every few reboots, it's good to run fsck anyway. This patch will help reduce full fsck time for ext3. I've seen 50-65% reduction in fsck time when using this patch on a near-full file system. With some fsck optimizations, this figure becomes 80%. Most of Ext3 metadata is clustered on disk. For example, Ext3 partitions the block space into block groups and stores the metadata for each block group (inode table, block bitmap, inode bitmap) at the beginning of the block group. Clustering related metadata together not only helps ext3 I/O performance by keeping data and related metadata close together, but also helps fsck since it is able to find all the metadata in one place. However, indirect blocks are an exception. Indirect blocks are allocated on-demand and are spread out along with the data. This layout enables good I/O performance due to the close proximity between an indirect block and its data blocks but it makes things difficult for fsck which must now rotate almost the entire disk in order to read all indirect blocks. In fact, our measurements have indirect block read accesses in fsck can significantly improve fsck times. One solution to this problem implemented in this patch is to cluster indirect blocks together on a per group basis, similar to how inodes and bitmaps are clustered. Indirect block clusters (metaclusters) help fsck performance by enabling fsck to fetch all indirect blocks by reading from...
So we have a section of blocks around the middle of the blockgroup which are used for indirect blocks. Presmably it starts around 50% of the way into the blockgroup? How do you decide its size? What happens when it fills up but we still have room for more data blocks in that blockgroup? Can this reserved area cause disk space wastage (all data blocks used, metacluster area not yet full). The file data block allocator now needs to avoid allocating blocks from inside this reserved area. How is this implemented? It is awfully similar hm, system time variance is weird. You might have found an ext3 bug (or a cpu time accounting bug). They're large files. It would be interesting to see what the numbers are Less speedup, for more-and-smaller files, it appears. An important question is: how does it stand up over time? Simply laying files out a single time on a fresh fs is the easy case. But what happens if that disk has been in continuous create/delete/truncate/append usage for We can get onto that later ;) -
There are couple of factors to consider when choosing a size: 1. The size cannot be too small, or the metacluster will fill up too quickly and then we'll have to fall back to regular indirect block allocation. E.g., if average file size of files in a block group is 512KB, a default block group having 32K blocks of size 4KB will need ~256 indirect blocks, one for each file. 2. If number of indirect blocks is too high, there will be less space for data block allocation and so it'll make it more likely that we run out of data blocks and start using blocks from the metacluster which makes metaclustering useless. Considering these factors, I think we should have <1% of blocks reserved for the metacluster. The current patch uses (blocks_per_group Metaclustering is honored only as long as we have free data blocks and free metacluster blocks. If we run out of either, we start using the other. Of course, once that happens indirect blocks may not be It is actually much simpler than the reservation code, so I haven't used it. The logic is implemented in <20 lines in kernbench below shows the behavior with small files. I'll also post results from running Not necessarily. If a lot of files use indirect blocks which happens when file length >48KB on a 4KB blocksize file system, then we have a lot of indirect blocks to read during fsck and hence this patch will be useful. But -
It does fall back, but it does so starting from the beginning of the block group by using the old-style allocation routines if it can't find any space in the metacluster region. What I'd suggest that it do instead is to start searching from the end of metacluster region, and then wrap around to the beginning of the block group, and then if it can't find any blocks when it reaches the beginning of the metacluster region, then go to the next block group that would be used by ext3_new_blocks(), and start searching in the metacluster region --- that way a smart e2fsck that is doing clustering could just arrange to pre-read the metacluster region for each block group, and if it finds an indirect block that is another block group's metacluster region, it could try reading in those blocks too. In order to do this, I'd suggest considering to fold ext3_new_blocks and ext3_new_indirect_blocks() into the same function, with just a passed-in flag to indicate whether for each block group the metacluster region or the non-metacluster region should be searched Another question is how does it stand up if the average size of files is different from what you anticipate? If the files are bigger than you expect, or smaller than you expect, then the ratio of indirect blocks to data blocks will be different, at which point allocations won't be perfectly split up between metacluster region. For this reason, the exact size of the metacluster region should probably be a superblock tunable --- and once we have the superblock tunable, I'd use the non-zero metacluster size to determine whether or not to enable this feature, and not to use a mount option. Mount options really should be avoided whenever possible, in favor of settings in the superblock. - Ted -
Thanks for the great feedback. When we fallback to old-style allocation, new blocks get allocated next Ideally, this is how things should be done, but I feel in practice, it will make little difference. To summarize, the difference between my approach and above approach is that when out of free blocks in a block group while allocating indirect block, the above approach repeats the same allocation algorithm in the next block group, while I fully fall back to old-style allocation meaning the indirect block gets co-located with the data block in the next block group with a free block. In practice, this will make a difference only for one indirect block as from next request onwards the goal will be updated to the new group making the behavior like what you propose. Still, I think your We initially avoided making metaclustering a superblock tunable as we didn't want to make any changes to the on-disk format as then ext4 extents are also a good option. If metaclustering gains acceptance it might make sense to make it a superblock tunable. However, I would avoid putting metacluster size into the superblock for the following reason. Ideally, we should not have to bother about finding the sweet spot of metacluster size as (1) a given file system can be used for storing different kinds of files at different times and it would be a pain to tune it every now and then, and (2) it opens the possibility of doubting metcluster size for unrelated ext3/fsck performance anomalies. The user should be able to just enable metaclustering and ext3 should take care of the rest as best as it can. That said, some type of coarse metaclustering advice can definitely be stored in the superblock. Allow me to propose a solution that will most likely address the above issue and please ignore its complexity for a moment. Instead of a two level partitioning in the block space between data blocks and metacluster blocks, have a 3 or 4 level partitioning. E.g., a block group with 'd' blocks can have d/32 blo...
Well, also I suggested that if the metacluster region is full, that it
attempt to find a block starting at end of the metacluster region and
then wrap around, instead of starting at the beginning of the block
group. That way it's more likely that subsequent metadata block is
The practice of starting search in the next block block in the
metadata area only makes a difference for one indirect block, yes, but
it's the right thing to do. And if you fold the ext3_new_blocks and
ext3_new_indirect_blocks(), it's really not that hard. You can
basically do something like this:
if (alloc_for_metadata)
strategy = 0x132;
else
strategy = 0x231;
for (; strategy; strategy = strategy >> 8) {
switch (strategy & 0xF) {
case 1:
start = block_group_start;
end = mc_start - 1;
break;
case 2:
start = mc_start;
end = mc_end;
break;
case 3:
start = mc_end + 1;
end = block_group_end;
break;
}
<search region between start.. end>
Allocating a superblock field is no big deal. I'll note further that
metaclustering is not necessarily mutually exclusive with ext4
extents. Allocating the extent tree data blocks out of the
metacluster blocks can be a good idea, depending on the average size
of the blocks and how fragmented the filesystem gets (and hence how
many contiguous extents can be expected). If the filesystem is
storing lots of really big files where being contiguous across
multiple blockgroups are productive, then the metacluster area would
actually be counterproductive. And if files are all small so the
extents fit the inode, the metadata cluster area wouldn't be necessary
at all. But if there are multiple external extent blocks in a block
Yes, it doesn't make sense to retune the filesystem. I was assuming
I'm not sure I understand your concern. The reality is that 99% of
the time users will never change it from the defaults, but making it
tunable makes it much, much easier for...I have implemented a revised patch that addresses the concerns raised with the previous patch. To summarize, here were the three main concerns: 1. Metacluster size is sensitive to average file size in the block-group/file-system so how do we find a good metacluster size ? The last discussion we had on this topic was to have multiple metaclusters of different sizes per group and have direct blocks overflow towards smaller metacluster sizes and indirect blocks overflow towards bigger metaclusters. The overflowing from one metacluster level to the next would happen only when none of the block groups have any free blocks left at that metacluster level, IOW overflowing was a file system wide event. 2. When indirect block allocation in metacluster fails, don't fully fall back to old-style allocation scheme, instead fall back to old-style scheme only for that block group and repeat this for each block group: Now all this happens in ext3_new_blocks(). To control the size of that function, created a few helper functions. 3. Don't have separate functions for allocating indirect and direct blocks as there is considerable overlap especially in the journaling code: Rolled the two functions into one ext3_new_blocks() function which is directly called from ext3_alloc_branch() instead of via ext3_alloc_blocks() (which is nuked now). The current approach I've implemented is similar in principle but it fixes a problem with the above scheme. The above scheme of overflowing into the next "metacluster level" upon exhaustion of current metacluster level across all block groups results in increased fragmentation. E.g., say a block group BG1 ran out of blocks at metacluster (mc) level X and now wants to use the next level. It checks and find that a different block group BG2 still has free blocks at mc level X so it starts using level X in BG2 which results in the file getting fragmented. In the new patch, we'd continue using level X+1 in BG1 to reduce overall fragmentation and so the new patch res...
Hi, Resending this email as I haven't received any feedback on this patch probably due to the holiday season. Any feedback will be greatly appreciated! Thanks I have implemented a revised patch that addresses the concerns raised with the previous patch. To summarize, here were the three main concerns: 1. Metacluster size is sensitive to average file size in the block-group/file-system so how do we find a good metacluster size ? The last discussion we had on this topic was to have multiple metaclusters of different sizes per group and have direct blocks overflow towards smaller metacluster sizes and indirect blocks overflow towards bigger metaclusters. The overflowing from one metacluster level to the next would happen only when none of the block groups have any free blocks left at that metacluster level, IOW overflowing was a file system wide event. 2. When indirect block allocation in metacluster fails, don't fully fall back to old-style allocation scheme, instead fall back to old-style scheme only for that block group and repeat this for each block group: Now all this happens in ext3_new_blocks(). To control the size of that function, created a few helper functions. 3. Don't have separate functions for allocating indirect and direct blocks as there is considerable overlap especially in the journaling code: Rolled the two functions into one ext3_new_blocks() function which is directly called from ext3_alloc_branch() instead of via ext3_alloc_blocks() (which is nuked now). The current approach I've implemented is similar in principle but it fixes a problem with the above scheme. The above scheme of overflowing into the next "metacluster level" upon exhaustion of current metacluster level across all block groups results in increased fragmentation. E.g., say a block group BG1 ran out of blocks at metacluster (mc) level X and now wants to use the next level. It checks and find that a different block group BG2 still has free blocks at mc level X so it starts using lev...
Your results are very impressive. In my opinion, the sooner this goes in, the better, since everybody hates waiting for fsck. The only issue that jumps out at me is, the patch is big and changes a significant amount of Ext3 code outside of the metacluster path, which is not a bad thing except that these changes are going to need to be tested fairly heavily. The way to do that is, put a big [CALL FOR TESTING] in your subject line the next time you post, and use an attention-getting subject line like "Make Ext3 fsck way faster". Diff the patch against the latest stable kernel to make things as easy as possible for the people who are hopefully going to download your patch, try it, and report their results. The other way is just to ask Andrew to put it in -mm when you feel ready, but your chances are much better if you already have people sending in mails saying how great your patch is. Another thing you might consider is a port to Ext4. After all, the world has waited this long for your patch, so it can likely survive waiting a little longer. You somehow seem to have missed attracting the attention of Jon Corbet, a rare occurrence for a patch of this significance. With the subject line modified as above, you are more likely to get the attention you deserve. Good luck! Regards, Daniel --
On Fri, 11 Jan 2008 09:05:17 -0800 It needs to be reviewed. In exhaustive detail. Few people can do that and I went to merge it so it could get some testing while we await review but the patch has all its tabs replaced with spaces, is seriously wordwrapped and has random newlines added to it. Please fix email client and resend (offlist is OK if it is unaltered). We should have a think about which workloads are most likely to be adversely affected by this change. --
Agreed, there just have to be a few bugs in this many lines of code. I spent a couple of hours going through it, not really looking at the algorithms but just the superficial details. I only found minor nits, and not many of those. For example, I do not like to see "if (free_blocks == 0)" written as"if (free_blocks <= 0)" in an attempt to increase robustness. What it actually does is make the effect of an error more subtle, or even "corrects" it. Firmly in the niggle category. I checked the locking of sbi->bginfo and didn't see a flaw, good. I see a missing KERN_INFO added to a printk, it technically counts as an unrelated change but oh well. Stylistically this new code is hard to tell apart from the incumbent code, except for being more heavily commented. I wish all kernel code was written this clearly. At this point I will run away in favor of for-real Ext3 hackers (you Odd, the original post has tabs and the updated one does not, though the I was just rolling up my sleeves to construct the nasty sequential case where the head keeps seeking back to the center of the group after picking up each 4 MB of doubly indexed data when I realized that even the most simple minded disk cache makes this case a non-issue. The drive will most likely suck a full track (roughly .5 MB) or big chunk thereof into cache the first time it seeks to the index cluster, thus having a whole group of double index blocks in cache and then will proceed to chew happily and linearly through the data blocks. It seems like placing those second level index blocks all together really helps this case. Hmm, how to break it. How about having a disk full of 100 MB files and skipping all over the disk randomly reading one block each time. That will fill the disk cache, and each random read then requires seeking to two places that were hopefully close together without index node clustering, and now will be an average of 32 MB apart. Each of these "extra" seeks costs a ...
Thanks for the great feedback Daniel! Following this email, I'll be sending out two separate emails with the actual patches, one against the latest stable kernel and one against the latest mm patch, using the format suggested by you. Sorry about the tabs and spaces thing, I've fixed my email client now. Thanks for pointing out problems with the (free_blocks <= 0) expression, I totally agree with you and have fixed it in the new patch. Regarding some of the other points that you raised: 1. Worse performance with random read on large files using metaclustering: This is a genuine drawback of this kind of metaclustering. In short, the choice is between a slightly slower random read or a much faster fsck. However, I believe that going forward, especially if and when we port metaclustering to Ext4, slower random read will probably be less of an issue, since there'll be fewer indirect blocks (due to the use of extents) and so we'll be able to do more aggressive prefetching of indirect blocks to help random reads. That said, it would be great to see what random read performance other users report since in my own experiments the degradation has been somewhat smaller than I'd expect (I've also tried more complex random read non-standard benchmarks that I haven't reported numbers for and they did "reasonably ok" with metaclustering, but of course standard reproducible results are always better). 2. Porting to Ext4: It seems that popular opinion is that some form of metaclustering could be useful for Ext4 as some other Ext4 hackers have also suggested the same on LKML and I'd be glad to work on it. However, I think metaclustering provides genuine value to current users of Ext3 and Ext2 and most people will agree that these two file systems are very likely to remain popular for quite some time now (the backport of metaclustering to Ext2 is quite trivial, so if metaclustering gets accepted in Ext3, I'll probably release a "use-at-your-own-risk" patch for Ext2 users). Thanks! Abhi...
Thanks for the comments. Ah ok. I think a generalization of this idea is that when we must mix indirect blocks and data blocks, at least start the search for a free block for them from different parts of the block group. Of course, an added benefit of your suggestion is that the non-metaclustered indirect blocks will likely be close to the metacluster. I think this approach will help fsck but from a design point of view it may not be good for IO performance. E.g., the reason sequential read performance with metaclustering is same as regular is because we prefetch co-located indirect blocks (I have verified this by turning off prefetching). When allocating indirect blocks outside the metacluster using the above scheme, co-location of indirect blocks is still likely but less so. OTOH, by falling back to the old-style allocation routines, we at least make sure that the indirect block is close to its data block which is good for performance. I think both approaches are pretty close. One thumb rule we've followed during metaclustering design is: "when in doubt, favor IO performance over Thanks for this nice visualization :-) I agree with your and Andreas' concern about fragmentation due to the current scheme of putting metacluster in the middle of the block group. Here are some stats concerning different metacluster locations: - Placing metacluster at the end of the block group results in 2% degradation in sequential reads from large files. Putting it at the beginning improves sequential read performance by 0.5%. - For random reads the beginning and ending configurations have idential performance which is almost the same as regular ext3 performance but 1% worse than the middle configuration. - I haven't compared the different metacluster locations for sequential reads from small files, but in general I've found the behavior to be very similar to random reads from a large file. So I think putting metacluster levels at the beginning of the block group Thanks, Abhishekj - ...
In the ext4-devel discussion, I asked about placement of the reserved blocks. Placement at the beginning of the group showed at worst marginally less performance and in some cases better performance. I suspect putting the reserved blocks at the beginning of the group would have a better long-term effect on performance because they are not in the middle of large contiguous allocations in the middle of the group. Cheers, Andreas -- Andreas Dilger Sr. Software Engineer, Lustre Group Sun Microsystems of Canada, Inc. -
Try Chris Mason's compilebench, which is a decent aging simulation. -- Mathematics is the supreme nostalgia of our time. -
Thanks for the suggestion Matt. It took me some time to get compilebench working due to the known issue with drop_caches due to circular lock dependency between j_list_lock and inode_lock (compilebench triggers drop_caches quite frequently). Here are the results for compilebench run with options "-i 30 -r 30". I repeated the test 5 times on each of vanilla and mc configurations. Setup: 4 cpu, 8GB RAM, 400GB disk. Average vanilla results ========================================================================== intial create total runs 30 avg 46.49 MB/s (user 1.12s sys 2.25s) create total runs 5 avg 12.90 MB/s (user 1.08s sys 1.97s) patch total runs 4 avg 8.70 MB/s (user 0.60s sys 2.31s) compile total runs 7 avg 21.44 MB/s (user 0.32s sys 2.95s) clean total runs 4 avg 59.91 MB/s (user 0.05s sys 0.26s) read tree total runs 2 avg 21.85 MB/s (user 1.12s sys 2.89s) read compiled tree total runs 1 avg 23.47 MB/s (user 1.45s sys 4.89s) delete tree total runs 2 avg 13.18 seconds (user 0.64s sys 1.02s) no runs for delete compiled tree stat tree total runs 4 avg 4.76 seconds (user 0.70s sys 0.50s) stat compiled tree total runs 1 avg 7.84 seconds (user 0.74s sys 0.54s) Average metaclustering results ========================================================================== intial create total runs 30 avg 45.04 MB/s (user 1.13s sys 2.42s) create total runs 5 avg 15.64 MB/s (user 1.08s sys 1.98s) patch total runs 4 avg 10.50 MB/s (user 0.61s sys 3.11s) compile total runs 7 avg 28.07 MB/s (user 0.33s sys 4.06s) clean total runs 4 avg 83.27 MB/s (user 0.04s sys 0.27s) read tree total runs 2 avg 21.17 MB/s (user 1.15s sys 2.91s) read compiled tree total runs 1 avg 22.79 MB/s (user 1.38s sys 4.89s) delete tree total runs 2 avg 9.23 seconds (user 0.62s sys 1.01s) no runs for delete compiled tree stat tree total runs 4 avg 4.72 seconds (user 0.71s sys 0.50s) stat compiled tree total runs 1 avg 6.50 seconds (user 0.79s sys 0.53s) Overall, metaclustering does better than vanilla except in a...
Abhishek> It took me some time to get compilebench working due to the Abhishek> known issue with drop_caches due to circular lock dependency Abhishek> between j_list_lock and inode_lock (compilebench triggers Abhishek> drop_caches quite frequently). Here are the results for Abhishek> compilebench run with options "-i 30 -r 30". I repeated the Abhishek> test 5 times on each of vanilla and mc configurations. Abhishek> Setup: 4 cpu, 8GB RAM, 400GB disk. How about running these tests on a more pedestrian system which people will actually have? Like 1gb, 1cpu and 400gb of a single disk? -
Well it strikes me as about half up and half down, but the ups are indeed much more substantial. Looks quite promising. -- Mathematics is the supreme nostalgia of our time. -
I think above testcases are just run with normal I/O case. Did you test the direct I/O test with this patch? With metaclustering patch, I can't run 'fsstress' since it oops when direct I/O. or is it dependent on other ext3 patches? Since I tested it with latest kernel 2.6.24-rc3 instead of mm kernel. Thank you, Kyungmin Park -
