From: Dave Chinner <dchinner@redhat.com> The inode unused list is currently a global LRU. This does not match the other global filesystem cache - the dentry cache - which uses per-superblock LRU lists. Hence we have related filesystem object types using different LRU reclaimatin schemes. To enable a per-superblock filesystem cache shrinker, both of these caches need to have per-sb unused object LRU lists. Hence this patch converts the global inode LRU to per-sb LRUs. The patch only does rudimentary per-sb propotioning in the shrinker infrastructure, as this gets removed when the per-sb shrinker callouts are introduced later on. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/fs-writeback.c | 2 +- fs/inode.c | 87 +++++++++++++++++++++++++++++++++++++++----- fs/super.c | 1 + include/linux/fs.h | 4 ++ include/linux/writeback.h | 1 - 5 files changed, 83 insertions(+), 12 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 5c4161f..b1e76ef 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -565,7 +565,7 @@ select_queue: /* * The inode is clean, unused */ - list_move(&inode->i_list, &inode_unused); + list_move(&inode->i_list, &inode->i_sb->s_inode_lru); } } inode_sync_complete(inode); diff --git a/fs/inode.c b/fs/inode.c index 2bee20a..3caa758 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -25,6 +25,7 @@ #include <linux/mount.h> #include <linux/async.h> #include <linux/posix_acl.h> +#include "internal.h" /* * This is needed for the following functions: @@ -74,7 +75,6 @@ static unsigned int i_hash_shift __read_mostly; */ LIST_HEAD(inode_in_use); -LIST_HEAD(inode_unused); static struct hlist_head *inode_hashtable __read_mostly; /* @@ -292,6 +292,7 @@ void __iget(struct inode *inode) if (!(inode->i_state & (I_DIRTY|I_SYNC))) list_move(&inode->i_list, &inode_in_use); ...
Is this an improvement I wonder? The dcache is using per sb lists because it specifically requires sb traversal. What allocation/reclaim really wants (for good scalability and NUMA characteristics) is per-zone lists for these things. It's easy to convert a single list into per-zone lists. It is much harder to convert per-sb lists into per-sb x per-zone lists. --
Right - I originally implemented the per-sb dentry lists for
scalability purposes. i.e. to avoid monopolising the dentry_lock
during unmount looking for dentries on a specific sb and hanging the
system for several minutes.
However, the reason for doing this to the inode cache is not for
scalability, it's because we have a tight relationship between the
dentry and inode cacheѕ. That is, reclaim from the dentry LRU grows
the inode LRU. Like the registration of the shrinkers, this is kind
of an implicit, undocumented behavour of the current shrinker
implemenation.
What this patch series does is take that implicit relationship and
make it explicit. It also allows other filesystem caches to tie
into the relationship if they need to (e.g. the XFS inode cache).
What it _doesn't do_ is change the macro level behaviour of the
No it's not. Just convert the s_{dentry,inode}_lru lists on each
superblock and call the shrinker with a new zone mask field to pick
the correct LRU. That's no harder than converting a global LRU.
Anyway, you'd still have to do per-sb x per-zone lists for the dentry LRUs,
so changing the inode cache to per-sb makes no difference.
However, this is a moot point because we don't have per-zone shrinker
interfaces. That's an entirely separate discussion because of the
macro-level behavioural changes it implies....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
I've done some testing showing parity. They've been along the lines of: - populate cache with 1m dentries + inodes - run 'time echo 2 > /proc/sys/vm/drop_caches' I've used different methods of populating the caches to have them non-sequential in the LRU (i.e. trigger fragmentation), have dirty backing inodes (e.g. the VFS inode clean, the xfs inode dirty because transactions haven't completed), etc. The variation on the test is around +-10%, with the per-sb shrinkers averaging about 5% lower time to reclaim. This is within the error margin of the test, so it's not really a conclusive win, but it is certainly shows that it does not slow anything down. If you've got a You've still got to allocate that extra memory on the per-sb dentry LRUs so it's not really a valid argument. IOWs, if it's too much memory for per-sb inode LRUs, then it's too much memory for the The same vague questions wondering about the benefit of per-sb dentry LRUs were raised when I first proposed them years ago, and look where we are now. Besides, focussing on whether this one patch is a benefit or not is really missing the point because it's the benefits of this patchset as a whole that need to be considered.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I guess the problem is that inode LRU cache isn't very useful as long as there are dentries in the way (which is most of the time, isn't it?). I think nfsd will exercise them better? Dont know of Well it would be per-zone, per-sb list, but I don't think that Not about how much is too much, it's about more cost or memory To be fair that is because there were specific needs to do per-sb I would indeed like to focus on the benefits of the patchset as a whole. Leaving aside the xfs changes, it would be interesting to have at least a few numbers for dcache/icache heavy workloads. --
On Tue, 25 May 2010 18:53:04 +1000 It's a shape that s_inode_lru is still protected by inode_lock. One day we're going to get in trouble over that lock. Migrating to a per-sb lock would be logical and might help. Did you look into this? I expect we'd end up taking both inode_lock and the new sb->lru_lock in several places, which wouldn't be of any help, at least in the interim. Long-term, the locking for fs-writeback.c should move to the per-superblock one also, at which time this problem largely goes away I think. Unfortunately the writeback inode lists got moved into the backing_dev_info, whcih messes It's regrettable to be counting the same thing twice. Did you look May as well fix the typo while we're there. Please review all these comments to ensure that they are still accurate and complete. --
Yes, I have. Yes, it's possible. It's solving a different problem, Sort of. The complexity is the stats are userspace visible, so they can't just be removed. Replacing the current stats means that when they are read from /proc we would need to walk all the superblocks to aggregate them. The bit I haven't looked at yet is whether walking superblocks is allowed in a proc handler. So in the mean time, I just copied what was done for the dentry_stats. If it's ok to do this walk, then we can change both Will do. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
It almost all goes away in my inode lock splitup patches. Inode lru and dirty lists were the last things protected by the global lock there. I am actually going to do per-zone lrus for these guys and per-zone locks (which is actually better than per-sb because it gives NUMA scalability within a single sb). The dirty/writeback lists should probably be per-bdi locked. --
