From: Dave Chinner <dchinner@redhat.com>
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Add a
context argument to the shrinker callback so that it can easily be
used in such situations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
arch/x86/kvm/mmu.c | 2 +-
drivers/gpu/drm/i915/i915_gem.c | 2 +-
fs/dcache.c | 2 +-
fs/gfs2/glock.c | 2 +-
fs/gfs2/quota.c | 2 +-
fs/gfs2/quota.h | 2 +-
fs/inode.c | 2 +-
fs/mbcache.c | 5 +++--
fs/nfs/dir.c | 2 +-
fs/nfs/internal.h | 2 +-
fs/quota/dquot.c | 2 +-
fs/ubifs/shrinker.c | 2 +-
fs/ubifs/ubifs.h | 2 +-
fs/xfs/linux-2.6/xfs_buf.c | 5 +++--
fs/xfs/quota/xfs_qm.c | 7 +++++--
include/linux/mm.h | 13 +++++++------
mm/vmscan.c | 9 ++++++---
17 files changed, 36 insertions(+), 27 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 48aeee8..d8ecb5b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2915,7 +2915,7 @@ static void kvm_mmu_remove_one_alloc_mmu_page(struct kvm *kvm)
kvm_mmu_zap_page(kvm, page);
}
-static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
+static int mmu_shrink(void *ctx, int nr_to_scan, gfp_t gfp_mask)
{
struct kvm *kvm;
struct kvm *kvm_freed = NULL;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 368d726..ed94bd6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5058,7 +5058,7 @@ void i915_gem_release(struct drm_device * dev, struct drm_file *file_priv)
}
static int
-i915_gem_shrink(int nr_to_scan, gfp_t ...Looks good about this mm part.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
off-topic: shrink_slab() was introduced for page/[id]-cache scan balancing
at first. now it still have hardcorded shrinker->nr calculation for slab
although now lots another subsystem using it. shrinker->seeks seems no
intuitive knob. probably we should try generalization it in future. but
it is another story. I think this patch provide good first step.
delta = (4 * scanned) / shrinker->seeks;
delta *= max_pass;
do_div(delta, lru_pages + 1);
shrinker->nr += delta;
Thanks.
--
Any chance we can still get this into 2.6.34? It's really needed to fix a regression in XFS that would be hard to impossible to work around inside the fs. While it touches quite a few places the changes are trivial and well understood. --
Why do you even need this context argument? Reclaim is not doing anything smart about this, it would just call each call shrinker in turn. Do you not have an easily traversable list of mountpoints? Can you just make a list of them? It would be cheaper than putting a whole shrinker structure into them anyway. The main reason I would be against proliferation of dynamic shrinker registration would be that it could change reclaim behaviour depending on how they get ordered (in the cache the caches are semi-dependent, like inode cache and dentry cache). Unless there is a reason I missed, I would much prefer not to do this like this. --
It's not being smart, but it is detemining how many objects to reclaim in each shrinker call based on memory pressure and the number of reclimable objects in the cache the shrinker works on. That's exactly the semantics I want for per-filesystem inode cache No, XFS does not have one, and I'm actively trying to remove any global state that crosses mounts that does exist (e.g. the global It's not cheaper or simpler. To make it work properly, I'd need to aggregate counters over all the filesystems in the list, work out how much to reclaim from each, etc. It is quite messy compared to deferecing the context to check one variable and either return or invoke the existing inode reclaim code. I also don't want to have a situation where i have to implement fairness heuristics to avoid reclaiming one filesystem too much or only end up reclaiming one or two inodes per filesystem per shrinker call because of the number of filesytems is similar to the shrinker batch size. The high level shrinker code already does this reclaim proportioning and does it far better than can be done in the scope of a shrinker callback. IOWs, adding a context allows XFS to do inode reclaim far more efficiently than if it was implemented through global state and a single shrinker. FWIW, we have this problem in the inode and dentry cache - we've got all sorts of complexity for being fair about reclaiming across all superblocks. I don't want to duplicate that complexity - instead I Adding a context does not change that implicit ordering based on registration order. Any filesystem based shrinker is going to be registered after the core infrastructure shrnikers, so they are not going to perturb the current ordering. And if this is enough of a problem to disallow context based cache shrinkers, then lets fix the interface so that we encode the dependencies explicitly in the registration interface rather than doing it implicitly. IOWs, I don't think this is a valid reason for not allowing a context to ...
And you can easily do that in the filesystem code because either
The shrinker list is global state too, so it's not a big deal. A
simple list and an rwsem will work fine and be smaller than adding
the full shrinker structure in the mount point.
And that way you get a mountpoint list to potentially use for other
It most definitely is cheaper, in terms of memory footprint. For
cost of doing the traversals, letting the shrinker code do it is
doing the *exact* same thing -- it's just traversing all your mount
You would just do it proportionately with the size of each fliesystem,
The dcache pruning is not complex.
if (prune_ratio != 1)
w_count = (sb->s_nr_dentry_unused / prune_ratio)
+ 1;
else
w_count = sb->s_nr_dentry_unused;
By definition the ordering changes based on the order of registration.
Well yeah you could do all that maybe. I think it would definitely be
required if we were to do context shrinkers like this. But AFAIKS there
is simply no need at all. Definitely it is not preventing XFS from
following more like the existing shrinker implementations.
--
So you're basically saying that we shouldn't improve the shrinker interface because you don't think that anyone should be doing anything different to what is already there. If a change of interface means that we end up with shorter call chains, less global state, more flexibilty, better batching and IO patterns, less duplication of code and algorithms and it doesn't cause any regressions, then where's the problem? Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I'm saying that dynamic registration is no good, if we don't have a Yep that would all be great but I don't see how the interface change enables any of that at all. It seems to me that the advantage goes the other way because it doesn't put as much crap into your mount structure and you end up with an useful traversable list of mounts as a side-effect. --
We can happily throw in a priority field into the shrinker structure, but at this stage in the release process I'd rather have an as simple as possible fix for the regression. And just adding the context pointer which is a no-op for all existing shrinkers fits that scheme very well. If it makes you happier I can queue up a patch to add the priorities for 2.6.35. I think figuring out any meaningful priorities will be There really is not need for that. The Linux VFS is designed to have superblocks independent, which actually is a good thing as global state gets in the way of scalability or just clean code. Note that a mounts list would be even worse as filesystems basically are not concerned with mount points themselves. --
I don't understand, it should be implemented like just all the other shrinkers AFAIKS. Like the dcache one that has to shrink multiple superblocks. There is absolutely no requirement for this API change to implement it in XFS. If you then add a patch to change the API and can show how it improves But the shrinker list *is* a global list. The downside of it in the way it was done in the XFS patch is that 1) it is much larger than a simple list head, and 2) not usable by anything other then the shrinker. OK, maybe there will never be other uses for it other than shrinker, but that's not a disadvantage (just one less advantage). --
It is an existing global list just made more useful. Whenever a driver has muliple instances of pool that need shrinking this comes in useful, it's not related to filesystems at all. --
I don't know. It's not really caused by not registering multiple shrinkers. It seems to be caused more by the locking, which is not going away when you have multiple shrinkers. The XFS patch seems to be pinning the mount structure when it is registered, so it would have no such locking/refcounting problems I would say less useful, because shrinker structure cannot be used by anything but the shrinker, wheras a private list can be used by anything, including the applicable shrinker. --
Yes, it is, and one that I think we can clean up significantly by the use of context based shrinkers. IMO, a better approach to the VFS shrinkers is to leverage the fact we already have per-sb dentry LRUs and convert the inode cache to a per-sb LRU as well. We can then remove the current dependency problems by moving to a single context based shrinker (i.e. per-sb) to reclaim from the per-sb dentry LRU, followed by the per-sb inode LRU via a single shrinker. That is, remove the global scope from them because that is the cause of the shrinker call-order dependency problems. Further, if we then add a filesystem callout to the new superblock shrinker callout, we can handle all the needs of XFS (and other filesystems) without requiring them to have global filesystem lists and without introducing new dependencies between registered shrinkers. And given that the shrinker currently handles proportioning reclaim based on the number of objects reported by the cache, it also allows us to further simplify the dentry cache reclaim by removing all the proportioning stuff it does right now... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Well, I've gone and done this global shrinker because I need a fix for the problem before .34 releases, not because I like it. Now my problem is that the accepted method of using global shrinkers (i.e. split nr_to-scan into portions based on per-fs usage) is causing a regression compared to not having a shrinker at all. The context based shrinker did not cause this regression, either. The regression is oom-killer panics with "no killable tasks" - it kills my 1GB RAM VM dead. Without a shrinker or with the context based shrinkers I will see one or two dd processes getting OOM-killed maybe once every 10 or so runs on this VM, but the machine continues to stay up. The global shrinker is turning this into a panic, and it is happening about twice as often. To fix this I've had to remove all the code that proportions the reclaim across all the XFS filesystems in the system. Basically it now walks from the first filesystem in the list to the last every time and effectively it only reclaims from the first filesystem it finds with reclaimable inodes. This is exactly the behaviour the context based shrinkers give me, without the need for adding global lists, additional locking and traverses. Also, context based shrinkers won't re-traverse all the filesystems, avoiding the potential for starving some filesystems of shrinker based reclaim if filesystems earlier in the list are putting more inodes into reclaim concurrently. Given that this behaviour matches pretty closely to the reasons I've already given for preferring context based per-fs shrinkers than a global shrinker and list, can we please move forward with this API change, Nick? As it is, I'm going to cross my fingers and ship this global shrinker because of time limitations, but I certainly hoping that for .35 we can move to context based shrinking.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
It's nicer (and slightly cheaper) to have int (*shrink)(struct shrinker *shrinker, int nr_to_scan, gfp_t gfp_mask); /* no void *ctx; */ Clients can use container_of() to reach their context from the shrinker argument. -- error compiling committee.c: too many arguments to function --
