(This patch was previously posted on linux-ext4 where Andreas Dilger
offered some valuable comments on it).This patch modifies the block allocation strategy in ext3 in order to
improve fsck performance. This was initially sent out as a patch for
ext2, but given the lack of ongoing development on ext2, I have
crossported it to ext3 instead. Slow fsck is not a serious problem on
ext3 due to journaling, but once in a while users do need to run full
fsck on their ext3 file systems. This can be due to several reasons:
(1) bad disk, bad crash, etc, (2) bug in jbd/ext3, and (3) every few
reboots, it's good to run fsck anyway. This patch will help reduce
full fsck time for ext3. I've seen 50-65% reduction in fsck time when
using this patch on a near-full file system. With some fsck
optimizations, this figure becomes 80%.Most of Ext3 metadata is clustered on disk. For example, Ext3
partitions the block space into block groups and stores the metadata
for each block group (inode table, block bitmap, inode bitmap) at the
beginning of the block group. Clustering related metadata together not
only helps ext3 I/O performance by keeping data and related metadata
close together, but also helps fsck since it is able to find all the
metadata in one place. However, indirect blocks are an exception.
Indirect blocks are allocated on-demand and are spread out along with
the data. This layout enables good I/O performance due to the close
proximity between an indirect block and its data blocks but it makes
things difficult for fsck which must now rotate almost the entire disk
in order to read all indirect blocks. In fact, our measurements have
indirect block read accesses in fsck can significantly improve fsck
times.One solution to this problem implemented in this patch is to cluster
indirect blocks together on a per group basis, similar to how inodes
and bitmaps are clustered. Indirect block clusters (metaclusters) help
fsck performance by enabling fsck to fetch all indirect blocks by
reading from...
So we have a section of blocks around the middle of the blockgroup which
are used for indirect blocks.Presmably it starts around 50% of the way into the blockgroup?
How do you decide its size?
What happens when it fills up but we still have room for more data blocks
in that blockgroup?Can this reserved area cause disk space wastage (all data blocks used,
metacluster area not yet full).The file data block allocator now needs to avoid allocating blocks from
inside this reserved area. How is this implemented? It is awfully similarhm, system time variance is weird. You might have found an ext3 bug (or a
cpu time accounting bug).They're large files. It would be interesting to see what the numbers are
Less speedup, for more-and-smaller files, it appears.
An important question is: how does it stand up over time? Simply laying
files out a single time on a fresh fs is the easy case. But what happens
if that disk has been in continuous create/delete/truncate/append usage forWe can get onto that later ;)
-
There are couple of factors to consider when choosing a size:
1. The size cannot be too small, or the metacluster will fill up too
quickly and then we'll have to fall back to regular indirect block
allocation. E.g., if average file size of files in a block group is
512KB, a default block group having 32K blocks of size 4KB will need
~256 indirect blocks, one for each file.
2. If number of indirect blocks is too high, there will be less space
for data block allocation and so it'll make it more likely that we run
out of data blocks and start using blocks from the metacluster which
makes metaclustering useless.Considering these factors, I think we should have <1% of blocks
reserved for the metacluster. The current patch uses (blocks_per_groupMetaclustering is honored only as long as we have free data blocks and
free metacluster blocks. If we run out of either, we start using the
other. Of course, once that happens indirect blocks may not beIt is actually much simpler than the reservation code, so I haven't
used it. The logic is implemented in <20 lines inkernbench below shows the behavior with small files. I'll also post
results from runningNot necessarily. If a lot of files use indirect blocks which happens when file
length >48KB on a 4KB blocksize file system, then we have a lot of
indirect blocks to read during fsck and hence this patch will be useful. But-
It does fall back, but it does so starting from the beginning of the
block group by using the old-style allocation routines if it can't
find any space in the metacluster region. What I'd suggest that it do
instead is to start searching from the end of metacluster region, and
then wrap around to the beginning of the block group, and then if it
can't find any blocks when it reaches the beginning of the metacluster
region, then go to the next block group that would be used by
ext3_new_blocks(), and start searching in the metacluster region ---
that way a smart e2fsck that is doing clustering could just arrange to
pre-read the metacluster region for each block group, and if it finds
an indirect block that is another block group's metacluster region, it
could try reading in those blocks too.In order to do this, I'd suggest considering to fold ext3_new_blocks
and ext3_new_indirect_blocks() into the same function, with just a
passed-in flag to indicate whether for each block group the
metacluster region or the non-metacluster region should be searchedAnother question is how does it stand up if the average size of files
is different from what you anticipate? If the files are bigger than
you expect, or smaller than you expect, then the ratio of indirect
blocks to data blocks will be different, at which point allocations
won't be perfectly split up between metacluster region.For this reason, the exact size of the metacluster region should
probably be a superblock tunable --- and once we have the superblock
tunable, I'd use the non-zero metacluster size to determine whether or
not to enable this feature, and not to use a mount option. Mount
options really should be avoided whenever possible, in favor of
settings in the superblock.- Ted
-
Thanks for the great feedback.
When we fallback to old-style allocation, new blocks get allocated next
Ideally, this is how things should be done, but I feel in practice, it
will make little difference. To summarize, the difference between
my approach and above approach is that when out of free blocks in a
block group while allocating indirect block, the above approach repeats
the same allocation algorithm in the next block group, while I fully
fall back to old-style allocation meaning the indirect block gets
co-located with the data block in the next block group with a free
block. In practice, this will make a difference only for one indirect
block as from next request onwards the goal will be updated to the new
group making the behavior like what you propose. Still, I think yourWe initially avoided making metaclustering a superblock tunable as we
didn't want to make any changes to the on-disk format as then ext4
extents are also a good option. If metaclustering gains acceptance
it might make sense to make it a superblock tunable. However, I would
avoid putting metacluster size into the superblock for the following
reason. Ideally, we should not have to bother about finding the sweet
spot of metacluster size as
(1) a given file system can be used for storing different kinds
of files at different times and it would be a pain to tune it every now
and then, and
(2) it opens the possibility of doubting metcluster size for unrelated
ext3/fsck performance anomalies.
The user should be able to just enable metaclustering and ext3 should
take care of the rest as best as it can. That said, some type of coarse
metaclustering advice can definitely be stored in the superblock.Allow me to propose a solution that will most likely address the above
issue and please ignore its complexity for a moment. Instead of a two
level partitioning in the block space between data blocks and
metacluster blocks, have a 3 or 4 level partitioning. E.g., a block
group with 'd' blocks can have d/32 blo...
Well, also I suggested that if the metacluster region is full, that it
attempt to find a block starting at end of the metacluster region and
then wrap around, instead of starting at the beginning of the block
group. That way it's more likely that subsequent metadata block isThe practice of starting search in the next block block in the
metadata area only makes a difference for one indirect block, yes, but
it's the right thing to do. And if you fold the ext3_new_blocks and
ext3_new_indirect_blocks(), it's really not that hard. You can
basically do something like this:if (alloc_for_metadata)
strategy = 0x132;
else
strategy = 0x231;
for (; strategy; strategy = strategy >> 8) {
switch (strategy & 0xF) {
case 1:
start = block_group_start;
end = mc_start - 1;
break;
case 2:
start = mc_start;
end = mc_end;
break;
case 3:
start = mc_end + 1;
end = block_group_end;
break;
}
<search region between start.. end>Allocating a superblock field is no big deal. I'll note further that
metaclustering is not necessarily mutually exclusive with ext4
extents. Allocating the extent tree data blocks out of the
metacluster blocks can be a good idea, depending on the average size
of the blocks and how fragmented the filesystem gets (and hence how
many contiguous extents can be expected). If the filesystem is
storing lots of really big files where being contiguous across
multiple blockgroups are productive, then the metacluster area would
actually be counterproductive. And if files are all small so the
extents fit the inode, the metadata cluster area wouldn't be necessary
at all. But if there are multiple external extent blocks in a blockYes, it doesn't make sense to retune the filesystem. I was assuming
I'm not sure I understand your concern. The reality is that 99% of
the time users will never change it from the defaults, but making it
tunable makes it much, much easier for...
I have implemented a revised patch that addresses the concerns raised
with the previous patch. To summarize, here were the three main
concerns:1. Metacluster size is sensitive to average file size in the
block-group/file-system so how do we find a good metacluster size ?
The last discussion we had on this topic was to have multiple
metaclusters of different sizes per group and have direct blocks
overflow towards smaller metacluster sizes and indirect blocks
overflow towards bigger metaclusters. The overflowing from one
metacluster level to the next would happen only when none of the block
groups have any free blocks left at that metacluster level, IOW
overflowing was a file system wide event.2. When indirect block allocation in metacluster fails, don't fully
fall back to old-style allocation scheme, instead fall back to
old-style scheme only for that block group and repeat this for each
block group: Now all this happens in ext3_new_blocks(). To control the
size of that function, created a few helper functions.3. Don't have separate functions for allocating indirect and direct
blocks as there is considerable overlap especially in the journaling
code: Rolled the two functions into one ext3_new_blocks() function
which is directly called from ext3_alloc_branch() instead of via
ext3_alloc_blocks() (which is nuked now).The current approach I've implemented is similar in principle but it
fixes a problem with the above scheme. The above scheme of overflowing
into the next "metacluster level" upon exhaustion of current
metacluster level across all block groups results in increased
fragmentation. E.g., say a block group BG1 ran out of blocks at
metacluster (mc) level X and now wants to use the next level. It
checks and find that a different block group BG2 still has free blocks
at mc level X so it starts using level X in BG2 which results in the
file getting fragmented. In the new patch, we'd continue using level
X+1 in BG1 to reduce overall fragmentation and so the new patch
res...
Hi,
Resending this email as I haven't received any feedback on this patch
probably due to the holiday season. Any feedback will be greatly
appreciated!Thanks
I have implemented a revised patch that addresses the concerns raised
with the previous patch. To summarize, here were the three main
concerns:1. Metacluster size is sensitive to average file size in the
block-group/file-system so how do we find a good metacluster size ?
The last discussion we had on this topic was to have multiple
metaclusters of different sizes per group and have direct blocks
overflow towards smaller metacluster sizes and indirect blocks
overflow towards bigger metaclusters. The overflowing from one
metacluster level to the next would happen only when none of the block
groups have any free blocks left at that metacluster level, IOW
overflowing was a file system wide event.2. When indirect block allocation in metacluster fails, don't fully
fall back to old-style allocation scheme, instead fall back to
old-style scheme only for that block group and repeat this for each
block group: Now all this happens in ext3_new_blocks(). To control the
size of that function, created a few helper functions.3. Don't have separate functions for allocating indirect and direct
blocks as there is considerable overlap especially in the journaling
code: Rolled the two functions into one ext3_new_blocks() function
which is directly called from ext3_alloc_branch() instead of via
ext3_alloc_blocks() (which is nuked now).The current approach I've implemented is similar in principle but it
fixes a problem with the above scheme. The above scheme of overflowing
into the next "metacluster level" upon exhaustion of current
metacluster level across all block groups results in increased
fragmentation. E.g., say a block group BG1 ran out of blocks at
metacluster (mc) level X and now wants to use the next level. It
checks and find that a different block group BG2 still has free blocks
at mc level X so it starts using lev...
Your results are very impressive. In my opinion, the sooner this goes
in, the better, since everybody hates waiting for fsck. The only issue
that jumps out at me is, the patch is big and changes a significant
amount of Ext3 code outside of the metacluster path, which is not a bad
thing except that these changes are going to need to be tested fairly
heavily.The way to do that is, put a big [CALL FOR TESTING] in your subject line
the next time you post, and use an attention-getting subject line
like "Make Ext3 fsck way faster". Diff the patch against the latest
stable kernel to make things as easy as possible for the people who are
hopefully going to download your patch, try it, and report their
results.The other way is just to ask Andrew to put it in -mm when you feel
ready, but your chances are much better if you already have people
sending in mails saying how great your patch is.Another thing you might consider is a port to Ext4. After all, the
world has waited this long for your patch, so it can likely survive
waiting a little longer.You somehow seem to have missed attracting the attention of Jon Corbet,
a rare occurrence for a patch of this significance. With the subject
line modified as above, you are more likely to get the attention you
deserve. Good luck!Regards,
Daniel
--
On Fri, 11 Jan 2008 09:05:17 -0800
It needs to be reviewed. In exhaustive detail. Few people can do that and
I went to merge it so it could get some testing while we await review but
the patch has all its tabs replaced with spaces, is seriously wordwrapped
and has random newlines added to it. Please fix email client and resend
(offlist is OK if it is unaltered).We should have a think about which workloads are most likely to be
adversely affected by this change.--
Agreed, there just have to be a few bugs in this many lines of code.
I spent a couple of hours going through it, not really looking at the
algorithms but just the superficial details. I only found minor nits,
and not many of those.For example, I do not like to see "if (free_blocks == 0)" written as"if
(free_blocks <= 0)" in an attempt to increase robustness. What it
actually does is make the effect of an error more subtle, or
even "corrects" it. Firmly in the niggle category.I checked the locking of sbi->bginfo and didn't see a flaw, good.
I see a missing KERN_INFO added to a printk, it technically counts as an
unrelated change but oh well.Stylistically this new code is hard to tell apart from the incumbent
code, except for being more heavily commented. I wish all kernel code
was written this clearly.At this point I will run away in favor of for-real Ext3 hackers (you
Odd, the original post has tabs and the updated one does not, though the
I was just rolling up my sleeves to construct the nasty sequential case
where the head keeps seeking back to the center of the group after
picking up each 4 MB of doubly indexed data when I realized that even
the most simple minded disk cache makes this case a non-issue. The
drive will most likely suck a full track (roughly .5 MB) or big chunk
thereof into cache the first time it seeks to the index cluster, thus
having a whole group of double index blocks in cache and then will
proceed to chew happily and linearly through the data blocks.
It seems like placing those second level index blocks all together
really helps this case. Hmm, how to break it.How about having a disk full of 100 MB files and skipping all over the
disk randomly reading one block each time. That will fill the disk
cache, and each random read then requires seeking to two places that
were hopefully close together without index node clustering, and now
will be an average of 32 MB apart. Each of these "extra" seeks costs a
...
Thanks for the great feedback Daniel!
Following this email, I'll be sending out two separate emails with the
actual patches, one against the latest stable kernel and one against
the latest mm patch, using the format suggested by you. Sorry about
the tabs and spaces thing, I've fixed my email client now.Thanks for pointing out problems with the (free_blocks <= 0)
expression, I totally agree with you and have fixed it in the new
patch. Regarding some of the other points that you raised:1. Worse performance with random read on large files using metaclustering:
This is a genuine drawback of this kind of metaclustering. In short,
the choice is between a slightly slower random read or a much faster
fsck. However, I believe that going forward, especially if and when we
port metaclustering to Ext4, slower random read will probably be less
of an issue, since there'll be fewer indirect blocks (due to the use
of extents) and so we'll be able to do more aggressive prefetching of
indirect blocks to help random reads.That said, it would be great to see what random read performance other
users report since in my own experiments the degradation has been
somewhat smaller than I'd expect (I've also tried more complex random
read non-standard benchmarks that I haven't reported numbers for and
they did "reasonably ok" with metaclustering, but of course standard
reproducible results are always better).2. Porting to Ext4:
It seems that popular opinion is that some form of metaclustering
could be useful for Ext4 as some other Ext4 hackers have also
suggested the same on LKML and I'd be glad to work on it. However, I
think metaclustering provides genuine value to current users of Ext3
and Ext2 and most people will agree that these two file systems are
very likely to remain popular for quite some time now (the backport of
metaclustering to Ext2 is quite trivial, so if metaclustering gets
accepted in Ext3, I'll probably release a "use-at-your-own-risk" patch
for Ext2 users).Thanks!
Abhi...
Thanks for the comments.
Ah ok. I think a generalization of this idea is that when we must mix
indirect blocks and data blocks, at least start the search for a free
block for them from different parts of the block group. Of course, an
added benefit of your suggestion is that the non-metaclustered
indirect blocks will likely be close to the metacluster.I think this approach will help fsck but from a design point of view it
may not be good for IO performance. E.g., the reason sequential
read performance with metaclustering is same as regular is
because we prefetch co-located indirect blocks (I have verified this by
turning off prefetching). When allocating indirect
blocks outside the metacluster using the above scheme, co-location of
indirect blocks is still likely but less so. OTOH, by falling back to the
old-style allocation routines, we at least make sure that the indirect
block is close to its data block which is good for performance. I think
both approaches are pretty close. One thumb rule we've followed during
metaclustering design is: "when in doubt, favor IO performance overThanks for this nice visualization :-)
I agree with your and Andreas' concern about fragmentation due to the
current scheme of putting metacluster in the middle of the block group.
Here are some stats concerning different metacluster locations:
- Placing metacluster at the end of the block group results in 2%
degradation in sequential reads from large files. Putting it at the
beginning improves sequential read performance by 0.5%.
- For random reads the beginning and ending configurations have
idential performance which is almost the same as regular ext3 performance
but 1% worse than the middle configuration.
- I haven't compared the different metacluster locations for sequential
reads from small files, but in general I've found the behavior to be
very similar to random reads from a large file.So I think putting metacluster levels at the beginning of the block group
Thanks,
Abhishekj
-...
In the ext4-devel discussion, I asked about placement of the reserved
blocks. Placement at the beginning of the group showed at worst
marginally less performance and in some cases better performance.
I suspect putting the reserved blocks at the beginning of the group
would have a better long-term effect on performance because they are
not in the middle of large contiguous allocations in the middle of
the group.Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.-
Try Chris Mason's compilebench, which is a decent aging simulation.
--
Mathematics is the supreme nostalgia of our time.
-
Thanks for the suggestion Matt.
It took me some time to get compilebench working due to the known
issue with drop_caches due to circular lock dependency between
j_list_lock and inode_lock (compilebench triggers drop_caches quite
frequently). Here are the results for compilebench run with options
"-i 30 -r 30". I repeated the test 5 times on each of vanilla and mc
configurations.Setup: 4 cpu, 8GB RAM, 400GB disk.
Average vanilla results
==========================================================================
intial create total runs 30 avg 46.49 MB/s (user 1.12s sys 2.25s)
create total runs 5 avg 12.90 MB/s (user 1.08s sys 1.97s)
patch total runs 4 avg 8.70 MB/s (user 0.60s sys 2.31s)
compile total runs 7 avg 21.44 MB/s (user 0.32s sys 2.95s)
clean total runs 4 avg 59.91 MB/s (user 0.05s sys 0.26s)
read tree total runs 2 avg 21.85 MB/s (user 1.12s sys 2.89s)
read compiled tree total runs 1 avg 23.47 MB/s (user 1.45s sys 4.89s)
delete tree total runs 2 avg 13.18 seconds (user 0.64s sys 1.02s)
no runs for delete compiled tree
stat tree total runs 4 avg 4.76 seconds (user 0.70s sys 0.50s)
stat compiled tree total runs 1 avg 7.84 seconds (user 0.74s sys 0.54s)Average metaclustering results
==========================================================================
intial create total runs 30 avg 45.04 MB/s (user 1.13s sys 2.42s)
create total runs 5 avg 15.64 MB/s (user 1.08s sys 1.98s)
patch total runs 4 avg 10.50 MB/s (user 0.61s sys 3.11s)
compile total runs 7 avg 28.07 MB/s (user 0.33s sys 4.06s)
clean total runs 4 avg 83.27 MB/s (user 0.04s sys 0.27s)
read tree total runs 2 avg 21.17 MB/s (user 1.15s sys 2.91s)
read compiled tree total runs 1 avg 22.79 MB/s (user 1.38s sys 4.89s)
delete tree total runs 2 avg 9.23 seconds (user 0.62s sys 1.01s)
no runs for delete compiled tree
stat tree total runs 4 avg 4.72 seconds (user 0.71s sys 0.50s)
stat compiled tree total runs 1 avg 6.50 seconds (user 0.79s sys 0.53s)Overall, metaclustering does better than vanilla except in a...
Abhishek> It took me some time to get compilebench working due to the
Abhishek> known issue with drop_caches due to circular lock dependency
Abhishek> between j_list_lock and inode_lock (compilebench triggers
Abhishek> drop_caches quite frequently). Here are the results for
Abhishek> compilebench run with options "-i 30 -r 30". I repeated the
Abhishek> test 5 times on each of vanilla and mc configurations.Abhishek> Setup: 4 cpu, 8GB RAM, 400GB disk.
How about running these tests on a more pedestrian system which people
will actually have? Like 1gb, 1cpu and 400gb of a single disk?-
Well it strikes me as about half up and half down, but the ups are
indeed much more substantial. Looks quite promising.--
Mathematics is the supreme nostalgia of our time.
-
I think above testcases are just run with normal I/O case.
Did you test the direct I/O test with this patch?
With metaclustering patch, I can't run 'fsstress' since it oops when direct I/O.
or is it dependent on other ext3 patches?
Since I tested it with latest kernel 2.6.24-rc3 instead of mm kernel.Thank you,
Kyungmin Park
-
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Justin Piszcz | exception Emask 0x0 SAct 0x1 / SErr 0x0 action 0x2 frozen |
| Heiko Carstens | Re: -mm merge plans for 2.6.23 -- sys_fallocate |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | [GIT]: Networking |
| Frans Pop | svc: failed to register lockdv1 RPC service (errno 97). |
| Radu Rendec | htb parallelism on multi-core platforms |
