Linux: Readahead and Fragmentation

Submitted by Jeremy
on October 9, 2002 - 10:58pm

An earlier thread on the lkml debating whether to call the upcoming stable kernel 2.6 or 3.0 [story] evolved into several other sub-discussions. One such discussion was looking at the state of the Linux desktop, and the issue of the kernel readahead implementation was brought in. Andrew Morton [interview] pointed out that by going into the function ext2_new_inode() and replacing the call to find_group_dir() with find_group_other(), full disk bandwidth can be achieved.

Why then has this not been done? To find out, read on for the full discussion between Andrew Morton, Danial Phillips, Linus Torvalds, Alan Cox and Theodore Ts'o. Towards the end of the thread, Andrew sheds some light on the specific questions asked by Chris Friesen, "What's the difference between find_group_dir and find_group_other, and why aren't we using find_group_other already if its so much faster?"


From: Daniel Phillips
Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA))
Date: 	Mon, 7 Oct 2002 19:43:17 +0200

On Sunday 06 October 2002 17:19, Martin J. Bligh wrote:
> > Then there's the issue of application startup. There's not enough
> > read ahead. This is especially sad, as the order of page faults is 
> > at least partially predictable.
> 
> Is the problem really, fundamentally a lack of readahead in the
> kernel? Or is it that your application is huge bloated pig? 

Readahead isn't the only problem, but it is a huge problem.  The current 
readahead model is per-inode, which is very little help with lots of small 
files, especially if they are fragmented or out of order.  There are various 
ways to fix this; they are all difficult[1].  Fortunately, we can call this 
"tuning work" so it can be done during the stable series.

[1] We could teach each filesystem how to read ahead across directories, or 
we could teach the vfs how to do physical readahead.  Choose your poison.

-- 
Daniel


From: Andrew Morton Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 07 Oct 2002 11:31:42 -0700 Daniel Phillips wrote: > > On Sunday 06 October 2002 17:19, Martin J. Bligh wrote: > > > Then there's the issue of application startup. There's not enough > > > read ahead. This is especially sad, as the order of page faults is > > > at least partially predictable. > > > > Is the problem really, fundamentally a lack of readahead in the > > kernel? Or is it that your application is huge bloated pig? > > Readahead isn't the only problem, but it is a huge problem. The current > readahead model is per-inode, which is very little help with lots of small > files, especially if they are fragmented or out of order. There are various > ways to fix this; they are all difficult[1]. Fortunately, we can call this > "tuning work" so it can be done during the stable series. > > [1] We could teach each filesystem how to read ahead across directories, or > we could teach the vfs how to do physical readahead. Choose your poison. Devices do physical readahead, and it works nicely. Go into ext2_new_inode, replace the call to find_group_dir with find_group_other. Then untar a kernel tree, unmount the fs, remount it and see how long it takes to do a `find . -type f xargs cat > /dev/null' on that tree. If your disk is like my disk, you will achieve full disk bandwidth.
From: Linus Torvalds Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 11:51:46 -0700 (PDT) On Mon, 7 Oct 2002, Andrew Morton wrote: > > Devices do physical readahead, and it works nicely. Indeed. There isn't any reasonable device where this isn't the case: the _device_ (and sometimes the driver - floppy.c) does a lot better at readahead than higher layers can do anyway. > Go into ext2_new_inode, replace the call to find_group_dir with > find_group_other. I hate that thing. Hate hate hate. Maybe we should just do this, and hope that somebody will do a proper off-line cleanup tool. In the meantime, it might just be possible to take a look at the uid, and if the uid matches use find_group_other, but for non-matching uids use find_group_dir. That gives a "compact for same users, distribute for different users" heuristic, which might be acceptable for normal use (and the theoretical cleanup tool could fix it up). Add some other heuristics ("if the difference between free group sizes is bigger than a factor of two"), and maybe it would be useful. The current approach sucks for everybody, and makes it impossible to get good throughput on a disk on many very common loads. Linus
From: Alan Cox Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: 07 Oct 2002 21:14:29 +0100 On Mon, 2002-10-07 at 19:51, Linus Torvalds wrote: > In the meantime, it might just be possible to take a look at the uid, and > if the uid matches use find_group_other, but for non-matching uids use > find_group_dir. That gives a "compact for same users, distribute for > different users" heuristic, which might be acceptable for normal use (and > the theoretical cleanup tool could fix it up). Factoring the uid/gid/pid in actually may help in other ways. If we are doing it by pid or by uid we will reduce the interleave of multiple files thing you sometimes get
From: Andrew Morton Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not3.0 - (NUMA)) Date: Mon, 07 Oct 2002 13:31:56 -0700 Alan Cox wrote: > > On Mon, 2002-10-07 at 19:51, Linus Torvalds wrote: > > In the meantime, it might just be possible to take a look at the uid, and > > if the uid matches use find_group_other, but for non-matching uids use > > find_group_dir. That gives a "compact for same users, distribute for > > different users" heuristic, which might be acceptable for normal use (and > > the theoretical cleanup tool could fix it up). > > Factoring the uid/gid/pid in actually may help in other ways. If we are > doing it by pid or by uid we will reduce the interleave of multiple > files thing you sometimes get Yes, that would help on multiuser setups. Delayed allocation is a great fix for that problem though. The other obvious heuristic is "if the parent directory was created in the last five seconds, use find_group_other()". But that made Linus go "ewwww".
From: Linus Torvalds Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not3.0 - (NUMA)) Date: Mon, 7 Oct 2002 13:46:49 -0700 (PDT) On Mon, 7 Oct 2002, Andrew Morton wrote: > > The other obvious heuristic is "if the parent directory was > created in the last five seconds, use find_group_other()". But > that made Linus go "ewwww". Well, it makes me go "less ewww" than the current scheme, so if that turns out to be acceptable to others, I won't mind _too_ much. The reason I don't like time too much persoanlly is that it's not very reproducible. Especially if the times are in the second range. I'd rather have a heuristic that is deterministic. Linus
From: Linus Torvalds Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 13:44:05 -0700 (PDT) On 7 Oct 2002, Alan Cox wrote: > > Factoring the uid/gid/pid in actually may help in other ways. If we are > doing it by pid or by uid we will reduce the interleave of multiple > files thing you sometimes get 'pid' would probably work better than what we have now, even though I bet it would get confused by a large number of installers (ie "make install" in just about any project will use multiple different processes to copy over separate subdirectories. In the X11R6 tree it uses individual "cp" processes for each file!) The session ID would avoid some of that, but they both have a fundamental problem: neither pid nor session ID is actually saved in any directory structure, so it's quite hard to use that as a heuristic for whether a new file should go into the same directory group as the directory it is created in. That's why "uid" would work better. The uid has a different issue, though, namely the fact that when user directories are created, they are basically always created as uid 0 first, and then a "chown" - which means that the user heuristic wouldn't actually trigger at the right time. So the heuristic couldn't be just "newfile->uid == directory->uid", it would have to be something better. I think last time we had the discussion, time-based things were also felt were good heuristics in many cases.. It could also be good to have an additional static hint on whether directories should be spread out or not. Administrators could set the "spread out" bit on the /, /home and /var/spool/(news|mail) directories, for example, causing those to spread out their subdirectories. but not causing normal user activity to do so. Yeah, yeah, I know there are papers on this. I don't care. I think something has to be done, and last time the discussion petered out at about this point. Linus
From: Andrew Morton Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not3.0 - (NUMA)) Date: Mon, 07 Oct 2002 14:16:29 -0700 Linus Torvalds wrote: > > On 7 Oct 2002, Alan Cox wrote: > > > > Factoring the uid/gid/pid in actually may help in other ways. If we are > > doing it by pid or by uid we will reduce the interleave of multiple > > files thing you sometimes get > > 'pid' would probably work better than what we have now, even though I bet > it would get confused by a large number of installers (ie "make install" > in just about any project will use multiple different processes to copy > over separate subdirectories. In the X11R6 tree it uses individual "cp" > processes for each file!) > > The session ID would avoid some of that, but they both have a fundamental > problem: neither pid nor session ID is actually saved in any directory > structure, so it's quite hard to use that as a heuristic for whether a new > file should go into the same directory group as the directory it is > created in. > > That's why "uid" would work better. Sound good to me. At leat this puts a veneer of respectability over decapitating find_group_other(), which is really what we all want to do anyway ;) > The uid has a different issue, though, > namely the fact that when user directories are created, they are basically > always created as uid 0 first, and then a "chown" - which means that the > user heuristic wouldn't actually trigger at the right time. So the > heuristic couldn't be just "newfile->uid == directory->uid", it would have > to be something better. Last time, Al suggested that we always use the find_group_other() approach if the directory is being made at the top-level of the filesystem. So if /home is a mountpoint, the user directories get spread out. I think this, and the UID comparison will be good enough.
From: Daniel Phillips Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 21:05:16 +0200 On Monday 07 October 2002 20:31, Andrew Morton wrote: > Daniel Phillips wrote: > > [1] We could teach each filesystem how to read ahead across directories, or > > we could teach the vfs how to do physical readahead. Choose your poison. > > Devices do physical readahead, and it works nicely. Devices have a few MB of readahead cache, the kernel can have thousands of times as much. -- Daniel
From: Linus Torvalds Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 12:24:36 -0700 (PDT) On Mon, 7 Oct 2002, Daniel Phillips wrote: > > Devices have a few MB of readahead cache, the kernel can have thousands of > times as much. I don't think that is in the least realistic. There's _no_ way that the krenel could do physical readahead for more than a few tens or hundreds of kB - the latency impact would just be too much to handle, and the VM impact is not likely insignificant either. So the device readahead is _not_ noticeably smaller than what the kernel can reasonably do, and it does a better job of it (ie disks can fill track buffers optimally, depending on where the head hits etc). Linus
From: Daniel Phillips Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 22:02:43 +0200 On Monday 07 October 2002 21:24, Linus Torvalds wrote: > On Mon, 7 Oct 2002, Daniel Phillips wrote: > > > > Devices have a few MB of readahead cache, the kernel can have thousands of > > times as much. > > I don't think that is in the least realistic. > > There's _no_ way that the krenel could do physical readahead for more than > a few tens or hundreds of kB If that's a bet, I'll take you up on it. > - the latency impact would just be too much > to handle, and the VM impact is not likely insignificant either. I did say difficult. It really is, but there are big gains to be had. This is easy to verify: say you have 100 MB of kernel source stored in, say, 50 different clumps on disk. Complete with seeks, a perfectly prescient readahead algorithm can read that into memory in about 5 seconds, even with my lame scsi raid controller[1]. So two of those needs 10 seconds, and I can diff those two trees in 2 seconds, in cache. In practice it takes 90 seconds, so there is obviously a lot of room for improvement. Note that if the disks really were capable of handling the readahead themselves they would already give me the 12 second result, not the 90 seconds. They simply can't, because they haven't got enough cache. [1] If the controller wasn't lame it would read the 100 MB in less than a second, with its (peak) total of 200 MB/s media bandwith, less 20% worth of parity blocks. -- Daniel
From: Linus Torvalds Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 13:28:16 -0700 (PDT) On Mon, 7 Oct 2002, Daniel Phillips wrote: > > If that's a bet, I'll take you up on it. Sure. The mey is: - we can more easily fix the f*cking filesystems to be sane - then trying to add prescient read-ahead to the kernel In other words, trying to do an impossibly good job on read-ahead is _stupid_, when the real problem is that ext2 lays out files in total crap ways. > I did say difficult. It really is, but there are big gains to be had. But why do the horribly stupid thing, when Andrew has already shown that a one-liner change to ext2/3 gives you platter speeds (and better speeds than your approach _can_ get, since you still are going to end up seeking a lot, even if you can make your read-ahead prescient). In other words, you're overcompensating. Linus
From: Daniel Phillips Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 23:16:58 +0200 On Monday 07 October 2002 22:28, Linus Torvalds wrote: > On Mon, 7 Oct 2002, Daniel Phillips wrote: > > > > If that's a bet, I'll take you up on it. > > Sure. The mey is: ^^^ - we can more easily fix the f*cking filesystems to be sane > - then trying to add prescient read-ahead to the kernel -- Daniel
From: Charles Cazabon Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 16:14:27 -0600 Daniel Phillips wrote: > > > > Sure. The mey is: > ^^^ From: Linus Torvalds Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 14:55:42 -0700 (PDT) On Mon, 7 Oct 2002, Daniel Phillips wrote: > > > > Sure. The mey is: > ^^^ From: Daniel Phillips Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Tue, 8 Oct 2002 00:02:57 +0200 On Monday 07 October 2002 23:55, Linus Torvalds wrote: > On Mon, 7 Oct 2002, Daniel Phillips wrote: > > > > > > Sure. The mey is: > > ^^^ > Yeah. What the heck happened to my fingers? Apparently, one of them missed the key it was aiming for and the other one changed hands. -- Daniel
From: Andrew Morton Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 07 Oct 2002 15:12:19 -0700 Daniel Phillips wrote: > > On Monday 07 October 2002 23:55, Linus Torvalds wrote: > > On Mon, 7 Oct 2002, Daniel Phillips wrote: > > > > > > > > Sure. The mey is: > > > ^^^ > > > Yeah. What the heck happened to my fingers? > > Apparently, one of them missed the key it was aiming for and the other one > changed hands. > They don't call him Kubys for nothing. I dug out and dusted off Al's Orlov allocator patch. And found a comment which rather helps explain how it works. I performance tested this back in November. See http://www.uwsg.iu.edu/hypermail/linux/kernel/0111.1/0281.html Bottom line: it's as good as the use-first-fit-everywhere approach, and appears to have better long-term antifragmentation characteristics. I shall test it. fs/ext2/ext2.h | 1 fs/ext2/ialloc.c | 164 ++++++++++++++++++++++++++++++++++++++++++++- fs/ext2/super.c | 8 ++ include/linux/ext2_fs_sb.h | 2 4 files changed, 172 insertions(+), 3 deletions(-) --- 2.5.41/fs/ext2/ialloc.c~orlov-allocator Mon Oct 7 14:31:50 2002 +++ 2.5.41-akpm/fs/ext2/ialloc.c Mon Oct 7 15:04:09 2002 @@ -18,6 +18,7 @@ #include #include #include +#include /* * ialloc.c contains the inodes allocation and deallocation routines @@ -205,6 +206,7 @@ static void ext2_preread_inode(struct in * For other inodes, search forward from the parent directory's block * group to find a free inode. */ +#if 0 static int find_group_dir(struct super_block *sb, int parent_group) { @@ -238,9 +240,141 @@ static int find_group_dir(struct super_b mark_buffer_dirty(best_bh); return best_group; } +#endif + +/* + * Orlov's allocator for directories. + * + * We always try to spread first-level directories. + * + * If there are blockgroups with both free inodes and free blocks counts + * not worse than average we return one with smallest directory count. + * Otherwise we simply return a random group. + * + * For the rest rules look so: + * + * It's OK to put directory into a group unless + * it has too many directories already (max_dirs) or + * it has too few free inodes left (min_inodes) or + * it has too few free blocks left (min_blocks) or + * it's already running too large debt (max_debt). + * Parent's group is prefered, if it doesn't satisfy these + * conditions we search cyclically through the rest. If none + * of the groups look good we just look for a group with more + * free inodes than average (starting at parent's group). + * + * Debt is incremented each time we allocate a directory and decremented + * when we allocate an inode, within 0--255. + */ + +#define INODE_COST 64 +#define BLOCK_COST 256 + +static int find_group_orlov(struct super_block *sb, struct inode *parent) +{ + int parent_group = EXT2_I(parent)->i_block_group; + struct ext2_sb_info *sbi = EXT2_SB(sb); + struct ext2_super_block *es = sbi->s_es; + int ngroups = sbi->s_groups_count; + int inodes_per_group = EXT2_INODES_PER_GROUP(sb); + int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups; + int avefreeb = le32_to_cpu(es->s_free_blocks_count) / ngroups; + int blocks_per_dir; + int ndirs = sbi->s_dir_count; + int max_debt, max_dirs, min_blocks, min_inodes; + int group = -1, i; + struct ext2_group_desc *desc; + struct buffer_head *bh; + + if (parent == sb->s_root->d_inode) { + struct ext2_group_desc *best_desc = NULL; + struct buffer_head *best_bh = NULL; + int best_ndir = inodes_per_group; + int best_group = -1; + + get_random_bytes(&group, sizeof(group)); + parent_group = (unsigned)group % ngroups; + for (i = 0; i bg_free_inodes_count) + continue; + if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir) + continue; + if (le16_to_cpu(desc->bg_free_inodes_count) bg_free_blocks_count) bg_used_dirs_count); + best_desc = desc; + best_bh = bh; + } + if (best_group >= 0) { + desc = best_desc; + bh = best_bh; + group = best_group; + goto found; + } + goto fallback; + } + + blocks_per_dir = (le32_to_cpu(es->s_blocks_count) - + le32_to_cpu(es->s_free_blocks_count)) / ndirs; + + max_dirs = ndirs / ngroups + inodes_per_group / 16; + min_inodes = avefreei - inodes_per_group / 4; + min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4; + + max_debt = EXT2_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, BLOCK_COST); + if (max_debt * INODE_COST > inodes_per_group) + max_debt = inodes_per_group / INODE_COST; + if (max_debt > 255) + max_debt = 255; + if (max_debt == 0) + max_debt = 1; + + for (i = 0; i bg_free_inodes_count) + continue; + if (sbi->debts[group] >= max_debt) + continue; + if (le16_to_cpu(desc->bg_used_dirs_count) >= max_dirs) + continue; + if (le16_to_cpu(desc->bg_free_inodes_count) bg_free_blocks_count) bg_free_inodes_count) + continue; + if (le16_to_cpu(desc->bg_free_inodes_count) >= avefreei) + goto found; + } + + return -1; + +found: + desc->bg_free_inodes_count = + cpu_to_le16(le16_to_cpu(desc->bg_free_inodes_count) - 1); + desc->bg_used_dirs_count = + cpu_to_le16(le16_to_cpu(desc->bg_used_dirs_count) + 1); + sbi->s_dir_count++; + mark_buffer_dirty(bh); + return group; +} -static int find_group_other(struct super_block *sb, int parent_group) +static int find_group_other(struct super_block *sb, struct inode *parent) { + int parent_group = EXT2_I(parent)->i_block_group; int ngroups = EXT2_SB(sb)->s_groups_count; struct ext2_group_desc *desc; struct buffer_head *bh; @@ -312,9 +446,9 @@ struct inode * ext2_new_inode(struct ino es = EXT2_SB(sb)->s_es; repeat: if (S_ISDIR(mode)) - group = find_group_dir(sb, EXT2_I(dir)->i_block_group); + group = find_group_orlov(sb, dir); else - group = find_group_other(sb, EXT2_I(dir)->i_block_group); + group = find_group_other(sb, dir); err = -ENOSPC; if (group == -1) @@ -349,6 +483,15 @@ repeat: es->s_free_inodes_count = cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1); + + if (S_ISDIR(mode)) { + if (EXT2_SB(sb)->debts[i] debts[i]++; + } else { + if (EXT2_SB(sb)->debts[i]) + EXT2_SB(sb)->debts[i]--; + } + mark_buffer_dirty(EXT2_SB(sb)->s_sbh); sb->s_dirt = 1; inode->i_uid = current->fsuid; @@ -478,6 +621,21 @@ unsigned long ext2_count_free_inodes (st #endif } +/* Called at mount-time, super-block is locked */ +unsigned long ext2_count_dirs (struct super_block * sb) +{ + unsigned long count = 0; + int i; + + for (i = 0; i s_groups_count; i++) { + struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL); + if (!gdp) + continue; + count += le16_to_cpu(gdp->bg_used_dirs_count); + } + return count; +} + #ifdef CONFIG_EXT2_CHECK /* Called at mount-time, super-block is locked */ void ext2_check_inodes_bitmap (struct super_block * sb) --- 2.5.41/fs/ext2/super.c~orlov-allocator Mon Oct 7 14:31:58 2002 +++ 2.5.41-akpm/fs/ext2/super.c Mon Oct 7 14:52:38 2002 @@ -665,6 +665,12 @@ static int ext2_fill_super(struct super_ printk ("EXT2-fs: not enough memoryn"); goto failed_mount; } + sbi->debts = kmalloc(sbi->s_groups_count, GFP_KERNEL); + if (!sbi->debts) { + printk ("EXT2-fs: not enough memoryn"); + goto failed_mount_group_desc; + } + memset(sbi->debts, 0, sbi->s_groups_count); for (i = 0; i s_group_desc[i] = sb_bread(sb, logic_sb_block + i + 1); if (!sbi->s_group_desc[i]) { @@ -681,6 +687,7 @@ static int ext2_fill_super(struct super_ goto failed_mount2; } sbi->s_gdb_count = db_count; + sbi->s_dir_count = ext2_count_dirs(sb); get_random_bytes(&sbi->s_next_generation, sizeof(u32)); /* * set up enough so that it can read an inode @@ -706,6 +713,7 @@ static int ext2_fill_super(struct super_ failed_mount2: for (i = 0; i s_group_desc[i]); +failed_mount_group_desc: kfree(sbi->s_group_desc); failed_mount: brelse(bh); --- 2.5.41/include/linux/ext2_fs_sb.h~orlov-allocator Mon Oct 7 14:32:07 2002 +++ 2.5.41-akpm/include/linux/ext2_fs_sb.h Mon Oct 7 14:38:23 2002 @@ -43,6 +43,8 @@ struct ext2_sb_info { int s_inode_size; int s_first_ino; u32 s_next_generation; + unsigned long s_dir_count; + u8 *debts; }; #endif /* _LINUX_EXT2_FS_SB */ --- 2.5.41/fs/ext2/ext2.h~orlov-allocator Mon Oct 7 14:37:36 2002 +++ 2.5.41-akpm/fs/ext2/ext2.h Mon Oct 7 14:37:51 2002 @@ -45,6 +45,7 @@ extern int ext2_new_block (struct inode extern void ext2_free_blocks (struct inode *, unsigned long, unsigned long); extern unsigned long ext2_count_free_blocks (struct super_block *); +extern unsigned long ext2_count_dirs (struct super_block *); extern void ext2_check_blocks_bitmap (struct super_block *); extern struct ext2_group_desc * ext2_get_group_desc(struct super_block * sb, unsigned int block_group, .
From: Andrew Morton Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 07 Oct 2002 13:14:23 -0700 Daniel Phillips wrote: > > This is easy to verify: say you have 100 MB of kernel source stored in, say, > 50 different clumps on disk. Disks use segmentation on their readahead buffers. Typically four-way. So they will only buffer four different chunks of disk at a time. If you're reading from 50 different places on disk, the disk keeps invalidating readahead at the segment level.
From: Daniel Phillips Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 22:22:51 +0200 On Monday 07 October 2002 22:14, Andrew Morton wrote: > Daniel Phillips wrote: > > > > This is easy to verify: say you have 100 MB of kernel source stored in, say, > > 50 different clumps on disk. > > Disks use segmentation on their readahead buffers. Typically four-way. > So they will only buffer four different chunks of disk at a time. > > If you're reading from 50 different places on disk, the disk keeps > invalidating readahead at the segment level. Sure, and kernel-based physical readahead would not have that problem. (Kernel-based physical readahead has its own problems, for example: how do you determine that a given physical block is already cached in an inode and so should be ignored as a readahead candidate?) -- Daniel
From: Chris Friesen Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 07 Oct 2002 14:58:49 -0400 Andrew Morton wrote: > Go into ext2_new_inode, replace the call to find_group_dir with > find_group_other. Then untar a kernel tree, unmount the fs, > remount it and see how long it takes to do a > > `find . -type f xargs cat > /dev/null' > > on that tree. If your disk is like my disk, you will achieve > full disk bandwidth. Pardon my ignorance, but what's the difference between find_group_dir and find_group_other, and why aren't we using find_group_other already if its so much faster? Chris
From: Andrew Morton Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 07 Oct 2002 12:36:48 -0700 Chris Friesen wrote: > > Andrew Morton wrote: > > > Go into ext2_new_inode, replace the call to find_group_dir with > > find_group_other. Then untar a kernel tree, unmount the fs, > > remount it and see how long it takes to do a > > > > `find . -type f xargs cat > /dev/null' > > > > on that tree. If your disk is like my disk, you will achieve > > full disk bandwidth. > > Pardon my ignorance, but what's the difference between find_group_dir > and find_group_other, and why aren't we using find_group_other already > if its so much faster? > ext2 and ext3 filesystems are carved up into "block groups", aka "cylinder groups". Each one is 4096*8 blocks - typically 128 MB. So you can easily have hundreds of blockgroups on a single partition. The inode allocator is designed to arrange that files which are within the same directory fall in the same blockgroup, for locality of reference. But new directories are placed "far away", in block groups which have plenty of free space. (find_group_dir -> find a blockgroup for a directory). The thinking here is that files in a separate directory are related, and files in different directories are unrelated. So we can take advantage of that heuristic - go and use a new blockgroup each time a new directory is created. This is a levelling algorithm which tries to keep all blockgroups at a similar occupancy level. That's a good thing, because high occupancy levels lead to fragmentation. find_group_other() is basically first-fit-from-start-of-disk, and if we use that for directories as well as files, your untar-onto-a-clean-disk simply lays everything out in a contiguous chunk. Part of the problem here is that it has got worse over time. The size of a blockgroup is hardwired to blocksize*bits-in-a-byte*blocksize. But disks keep on getting bigger. Five years ago (when, presumably, this algorithm was designed), a typical partition had, what? Maybe four blockgroups? Now it has hundreds, and so the "levelling" is levelling across hundreds of blockgroups and not just a handful. I did a lot of work on this back in November 2001, mainly testing with a trace-based workload from Keith Smith. See http://www.eecs.harvard.edu/~keith/usenix.1995.html Al Viro wrote a modified allocator (which nobody understood ;)) based on Orlov's algorithm. I ended up concluding that the current (sucky) code is indeed best for minimising long-term fragmentation under slow-growth scenarios. And worst for fast-growth. Orlov was in between on both. Simply nuking find_group_dir() was best for fast-growth, worst for slow-growth. Block allocators are fertile grounds for academic papers. It's complex. There is a risk that you can do something which is cool in testing, but ends up exploding horridly after a year's use. By which time we have ten million deployed systems running like dogs, damn all we can do about it. The best solution is to use first-fit and online defrag to fix the long-term fragmentation. It really is. There has been no appreciable progress on this. A *practical* solution is to keep a spare partition empty and do a `cp -a' from one partition onto another once per week and swizzle the mountpoints. Because the big copy will unfragment everything. ho-hum. I shall forward-port Orlov, and again attempt to understand it ;)
From: Daniel Phillips Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 21:21:28 +0200 On Monday 07 October 2002 20:58, Chris Friesen wrote: > Andrew Morton wrote: > > > Go into ext2_new_inode, replace the call to find_group_dir with > > find_group_other. Then untar a kernel tree, unmount the fs, > > remount it and see how long it takes to do a > > > > `find . -type f xargs cat > /dev/null' > > > > on that tree. If your disk is like my disk, you will achieve > > full disk bandwidth. > > Pardon my ignorance, but what's the difference between find_group_dir > and find_group_other, and why aren't we using find_group_other already > if its so much faster? These are the heuristics that determine where in the volume directory inodes are allocated: http://lxr.linux.no/source/fs/ext2/ialloc.c#L221 Ext2 likes to spread directory inodes around the volume so that there is room to keep the associated file blocks nearby. This interacts rather poorly with readahead. -- Daniel
From: Linus Torvalds Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 12:35:26 -0700 (PDT) On Mon, 7 Oct 2002, Daniel Phillips wrote: > > Ext2 likes to spread directory inodes around the volume so that there is > room to keep the associated file blocks nearby. This interacts rather > poorly with readahead. Not a read-ahead problem. It interacts rather poory _full_stop_. It means that the inode tables are spread all out, the bitmaps are fragmented etc, so the disk head has to move all over the disk even when only working with one directory tree like the kernel sources. Kernel-level read-ahead doens't much help, because the FS tries to keep the data blocks for individual files together - which is the case the kernel _can_ try to optimize a bit. Physical read-ahead doesn't work either, since the parts that can be physically read ahead are the ones that the regular in-file read-ahead already mostly takes care of it. So the problem with spreading stuff out doesn't have anything to do with read-ahead, and has everything to do with the basic issue of BAD LOCALITY. Locality is _good_, independently of read-ahead and independently of medium. Locality helps regardless of any read-ahead, although it is clearly true that bad locality makes readahead more futile. Linus
From: Theodore Ts'o Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 7 Oct 2002 20:39:04 -0400 On Mon, Oct 07, 2002 at 12:35:26PM -0700, Linus Torvalds wrote: > > On Mon, 7 Oct 2002, Daniel Phillips wrote: > > > > Ext2 likes to spread directory inodes around the volume so that there is > > room to keep the associated file blocks nearby. This interacts rather > > poorly with readahead. > > Not a read-ahead problem. It interacts rather poory _full_stop_. > > It means that the inode tables are spread all out, the bitmaps are > fragmented etc, so the disk head has to move all over the disk even when > only working with one directory tree like the kernel sources. It depends on what you are doing. BSD, and even XFS, uses the concept of using cylinder groups or block groups as one of many tools to avoid file fragmentation and to concetrate locality for files within a directory. The reason why FAT filesystems have file fragmentation problems in far more worse way is because they attempt don't have the concept of a block group, and simply always allocate from the beginning of the filesystem. This is effectively what would happen if you had a single block/cylinder group in the filesystem. > So the problem with spreading stuff out doesn't have anything to do with > read-ahead, and has everything to do with the basic issue of BAD LOCALITY. > Locality is _good_, independently of read-ahead and independently of > medium. Ironically, as I mentioned, one of the reasons behind the block group scheme is to *increase* locality for files within a particular directory. As you point out quite correctly, though, it tends to destroy locality across an entire directory tree. Maybe the answer is that we need some way of declaring that some directory is the root of "a directory tree". That way, the filesystem can keep directories underneath the directory tree close together, and the filesystem can try to keep directory trees far apart from each other. In order to do something like this, we would just need a filesystem API extension to allow programs like tar and bitkeeper to give a hint that a new directory tree is being established --- and ideally, it needs to be done at mkdir time, so that the filesystem can perform appropriate do a better job of deciding where to place the initial root of the "directory tree". Things would also work if you declared some directory tree to be the root of a "directory tree" after the directory was initially created, but the allocation hueristics wouldn't be nearly as effective. Linus, what do you think about defining a new flag which could be passed as part of the mode bits to mkdir()? If we allow the filesystem to get some additional hints from userspace about what the difference between /usr/src/linux (where directory spreading is a bad idea) and /usr/home (where directory spreading is a very good idea), it would make life for the filesystem much easier. - Ted
From: Andrew Morton Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Mon, 07 Oct 2002 19:59:26 -0700 Theodore Ts'o wrote: > > ... > It depends on what you are doing. BSD, and even XFS, uses the concept > of using cylinder groups or block groups as one of many tools to avoid > file fragmentation and to concetrate locality for files within a > directory. The reason why FAT filesystems have file fragmentation > problems in far more worse way is because they attempt don't have the > concept of a block group, and simply always allocate from the > beginning of the filesystem. This is effectively what would happen if > you had a single block/cylinder group in the filesystem. > In the testing which I did, based on Keith Smith's traces, the current code really isn't very effective. What I did was to run his aging workload an increasing number of times. Then measured the fragmentation of the files which it left behind. I measured the fragmentation simply by timing how long it took to read all the files, and compared that to how long it took to read the same files when they had been laid down on a fresh fs. After ten aging rounds, with the current block allocator, we're running 4x to 5x times slower. With the Orlov allocator, we're running 5x to 6x slower. Either way, that's a big performance slowdown. Orlov turns a 400% slowdown into a 500% slowdown. So it is a 25% regression for slow growth. But it is a 300% to 500% improvement for fast-growth. (Well, it used to be. But I just fixed a memory-corrupting bug in it which I think has slowed it down. It's now only double the speed on scsi, triple on IDE). What we need, *regardless* of which allocator is used is effective defrag tools. I just retested. 2.5.41, scsi: time find linux-2.4.19 -type f | xargs cat > /dev/null find linux-2.4.19 -type f 0.06s user 0.24s system 1% cpu 19.274 total xargs cat > /dev/null 0.23s user 1.42s system 8% cpu 19.954 total 2.5.41, IDE: time find linux-2.4.19 -type f | xargs cat > /dev/null find linux-2.4.19 -type f 0.06s user 0.23s system 0% cpu 29.274 total xargs cat > /dev/null 0.23s user 1.58s system 5% cpu 30.199 total 2.5.41+Orlov, SCSI: time find linux-2.4.19 -type f | xargs cat > /dev/null find linux-2.4.19 -type f 0.06s user 0.24s system 2% cpu 11.579 total xargs cat > /dev/null 0.23s user 1.46s system 14% cpu 11.951 total 2.5.41+Orlov, IDE: time find linux-2.4.19 -type f | xargs cat > /dev/null find linux-2.4.19 -type f 0.06s user 0.24s system 2% cpu 12.225 total xargs cat > /dev/null 0.22s user 1.59s system 14% cpu 12.500 total We need some of that goodness. > > [ administrator hints ] > Alas, nobody uses them :( Maybe a mount option? But I think the current algorithm should default to "off".
From: Theodore Ts'o Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Tue, 8 Oct 2002 12:15:55 -0400 On Mon, Oct 07, 2002 at 07:59:26PM -0700, Andrew Morton wrote: > In the testing which I did, based on Keith Smith's traces, the > current code really isn't very effective. > > What I did was to run his aging workload an increasing number of > times. Then measured the fragmentation of the files which it > left behind. I measured the fragmentation simply by timing > how long it took to read all the files, and compared that to > how long it took to read the same files when they had been laid > down on a fresh fs. What access pattern did you use when you read the files? Did you sweep through filesystem directory by directory, or did you use some other pattern (perhaps random)? It would also be interesting to get a measure of fragmentation of the filesystems as measured by e2fsck. This only measures file fragmentation, and not file locality on a per-directory (or more ideally per-directory tree, but establishing where the directory trees are is difficult). > > > > [ administrator hints ] > > > > Alas, nobody uses them :( No one will use them if they are need to do so manually. But if we can convert a few programs to use them, then it might work. And people didn't much use madvise() when it was first introduced either, but it doesn't mean that the existence of the interface was a bad idea.... If the current algorithm is so bad, then maybe the trick is to use the fast-growth optimized allocator as the default, *unless* given a hint to do so via some magic mkdir flag. Then if certain programs, such as adduser (when creating a home directory), "cp -r", "bk clone", tar, etc. where modified to give hints that the a particular directory was at the top of a directory tree, then slow-growth optimized allocator could be used to spread apart directory trees. No, it's not perfect, but it should be better not using any hints at all. (And yes, it will take a while before the userpsace tools that provide said hints are widely deployed.) And if we don't have any user-space hints, then we default to the fast-growth algorithm, which should make Linus happy. :-) > Maybe a mount option? But I think the current algorithm should > default to "off". How about a mount option with the possible values: "fast", "slow", "hinted", and "auto", with the default being "auto" or "hinted"? (Where hinted utilizes user-space hints, and "auto" utilizes user-space hints if present, plus some of the so-called ugly hueristics which you had discussed?) - Ted
From: Andrew Morton Subject: Re: The reason to call it 3.0 is the desktop (was Re: [OT] 2.6 not 3.0 - (NUMA)) Date: Tue, 08 Oct 2002 12:39:01 -0700 Theodore Ts'o wrote: > > On Mon, Oct 07, 2002 at 07:59:26PM -0700, Andrew Morton wrote: > > In the testing which I did, based on Keith Smith's traces, the > > current code really isn't very effective. > > > > What I did was to run his aging workload an increasing number of > > times. Then measured the fragmentation of the files which it > > left behind. I measured the fragmentation simply by timing > > how long it took to read all the files, and compared that to > > how long it took to read the same files when they had been laid > > down on a fresh fs. > > What access pattern did you use when you read the files? Did you > sweep through filesystem directory by directory, or did you use some > other pattern (perhaps random)? Well this is all rather dim in my memory, so the confidence level is drooping. I am sure that the timing was a single find | xargs cat thing. I also know that I investigated whether the increased time was due to the metadata access or the data access. I _think_ it was mainly metadata. But it all needs to be redone, really. > ... > > Maybe a mount option? But I think the current algorithm should > > default to "off". > > How about a mount option with the possible values: "fast", "slow", > "hinted", and "auto", with the default being "auto" or "hinted"? > (Where hinted utilizes user-space hints, and "auto" utilizes > user-space hints if present, plus some of the so-called ugly > hueristics which you had discussed?) > Well the current Orlov patch will spread top-level directories, so as long as /home is a mountpoint, we're fine. For more generalality, yes, I think a new chattr flag on the parent directory which says "spread my subdirectories out" would be a good solution.

Existing vs New filesystems

on
October 10, 2002 - 12:21pm

The above discussion seems to focus more on improvements to *new* filesystems (API/Flags to denote a new directory being created, for example). Would these changes improve performance on existing filesystems? For example, I have several directory trees that probably fit the slow-gowing scenario. Each tree has several nested directories, each containing multi-megabyte files. Total for each tree is ~500-1000 MB. Writes rarely occur, thus would find_group_other actually improve read performance or only write performance?
Guess I'll have to test....

It'd benefit new files only

Anonymous
on
October 11, 2002 - 5:58am

Basically, changing the allocation algorithm would benefit new files only, and then only in a limited manner. For best results, you would need to move everything off the filesystem and back on it.

The chattr idea (marking a dir as 'spread' or 'clump') seems interesting. It would require the administrator to have his head on straight, though, or if you wanted it fully auto, a daemon that would periodically sanitize the spread/clump settings. (eg. make sure /home is set to spread, and perhaps each homedir set to clump unless it has a large branch factor. Indeed, expected branch factor (along with the expected weight of each branch) seems like it should be a good predictor of whether the directory works best spread or clumped.)

Batching open/read/write system calls

Anonymous
on
October 11, 2002 - 2:10pm

It seems that a fundamental issue has been missed here: the kernel does not have enough information to do directory-level read-ahead.

Heuristics and larger buffers are likely to be less effective here: whereas files and block devices are one-dimensional arrays with mostly sequential access patterns, directory trees tend to be sprawling hierarchies with much less structure. Predicting which file the process is going to access next may end up being an unacceptable gamble of I/O bandwidth and buffers.

It might be more helpful to create an interface which allow calls like open/read/write/stat to be dispatched in batch. (Perhaps they could be strung together into scatter-gather buffers, and have the results returned in the same way.)

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.