Re: [PATCH 1/2] mm: add context argument to shrinker callback

Previous thread: [PATCH] mm: disallow direct reclaim page writeback by Dave Chinner on Monday, April 12, 2010 - 5:17 pm. (110 messages)

Next thread: [PATCH 2/2] xfs: add a shrinker to background inode reclaim by Dave Chinner on Monday, April 12, 2010 - 5:24 pm. (1 message)
From: Dave Chinner
Date: Monday, April 12, 2010 - 5:24 pm

From: Dave Chinner <dchinner@redhat.com>

The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Add a
context argument to the shrinker callback so that it can easily be
used in such situations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 arch/x86/kvm/mmu.c              |    2 +-
 drivers/gpu/drm/i915/i915_gem.c |    2 +-
 fs/dcache.c                     |    2 +-
 fs/gfs2/glock.c                 |    2 +-
 fs/gfs2/quota.c                 |    2 +-
 fs/gfs2/quota.h                 |    2 +-
 fs/inode.c                      |    2 +-
 fs/mbcache.c                    |    5 +++--
 fs/nfs/dir.c                    |    2 +-
 fs/nfs/internal.h               |    2 +-
 fs/quota/dquot.c                |    2 +-
 fs/ubifs/shrinker.c             |    2 +-
 fs/ubifs/ubifs.h                |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c      |    5 +++--
 fs/xfs/quota/xfs_qm.c           |    7 +++++--
 include/linux/mm.h              |   13 +++++++------
 mm/vmscan.c                     |    9 ++++++---
 17 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 48aeee8..d8ecb5b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2915,7 +2915,7 @@ static void kvm_mmu_remove_one_alloc_mmu_page(struct kvm *kvm)
 	kvm_mmu_zap_page(kvm, page);
 }
 
-static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
+static int mmu_shrink(void *ctx, int nr_to_scan, gfp_t gfp_mask)
 {
 	struct kvm *kvm;
 	struct kvm *kvm_freed = NULL;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 368d726..ed94bd6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5058,7 +5058,7 @@ void i915_gem_release(struct drm_device * dev, struct drm_file *file_priv)
 }
 
 static int
-i915_gem_shrink(int nr_to_scan, gfp_t ...
From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 1:17 am

Looks good about this mm part.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


off-topic: shrink_slab() was introduced for page/[id]-cache scan balancing
at first. now it still have hardcorded shrinker->nr calculation for slab
although now lots another subsystem using it. shrinker->seeks seems no
intuitive knob. probably we should try generalization it in future. but
it is another story. I think this patch provide good first step.

                delta = (4 * scanned) / shrinker->seeks;
                delta *= max_pass;
                do_div(delta, lru_pages + 1);
                shrinker->nr += delta;


Thanks.



--

From: Christoph Hellwig
Date: Saturday, April 17, 2010 - 5:15 pm

Any chance we can still get this into 2.6.34?  It's really needed to fix
a regression in XFS that would be hard to impossible to work around
inside the fs.  While it touches quite a few places the changes are
trivial and well understood.

--

From: Nick Piggin
Date: Monday, April 19, 2010 - 7:00 am

Why do you even need this context argument?  Reclaim is not doing anything
smart about this, it would just call each call shrinker in turn.

Do you not have an easily traversable list of mountpoints? Can you just
make a list of them? It would be cheaper than putting a whole shrinker
structure into them anyway.

The main reason I would be against proliferation of dynamic shrinker
registration would be that it could change reclaim behaviour depending
on how they get ordered (in the cache the caches are semi-dependent,
like inode cache and dentry cache).

Unless there is a reason I missed, I would much prefer not to do this
like this.

--

From: Dave Chinner
Date: Monday, April 19, 2010 - 5:41 pm

It's not being smart, but it is detemining how many objects to
reclaim in each shrinker call based on memory pressure and the
number of reclimable objects in the cache the shrinker works on.
That's exactly the semantics I want for per-filesystem inode cache

No, XFS does not have one, and I'm actively trying to remove any
global state that crosses mounts that does exist (e.g. the global

It's not cheaper or simpler. To make it work properly, I'd
need to aggregate counters over all the filesystems in the list,
work out how much to reclaim from each, etc. It is quite messy
compared to deferecing the context to check one variable and either
return or invoke the existing inode reclaim code.

I also don't want to have a situation where i have to implement
fairness heuristics to avoid reclaiming one filesystem too much or
only end up reclaiming one or two inodes per filesystem per shrinker
call because of the number of filesytems is similar to the shrinker
batch size.  The high level shrinker code already does this reclaim
proportioning and does it far better than can be done in the scope
of a shrinker callback. IOWs, adding a context allows XFS to do
inode reclaim far more efficiently than if it was implemented
through global state and a single shrinker.

FWIW, we have this problem in the inode and dentry cache - we've got
all sorts of complexity for being fair about reclaiming across all
superblocks. I don't want to duplicate that complexity - instead I

Adding a context does not change that implicit ordering based on
registration order. Any filesystem based shrinker is going to be
registered after the core infrastructure shrnikers, so they are not
going to perturb the current ordering.

And if this is enough of a problem to disallow context based cache
shrinkers, then lets fix the interface so that we encode the
dependencies explicitly in the registration interface rather than
doing it implicitly.

IOWs, I don't think this is a valid reason for not allowing a
context to ...
From: Nick Piggin
Date: Tuesday, April 20, 2010 - 1:38 am

And you can easily do that in the filesystem code because either

The shrinker list is global state too, so it's not a big deal. A
simple list and an rwsem will work fine and be smaller than adding
the full shrinker structure in the mount point.

And that way you get a mountpoint list to potentially use for other

It most definitely is cheaper, in terms of memory footprint. For
cost of doing the traversals, letting the shrinker code do it is
doing the *exact* same thing -- it's just traversing all your mount

You would just do it proportionately with the size of each fliesystem,

The dcache pruning is not complex.

                if (prune_ratio != 1)
                        w_count = (sb->s_nr_dentry_unused / prune_ratio)
+ 1;
                else
                        w_count = sb->s_nr_dentry_unused;

 
By definition the ordering changes based on the order of registration.

Well yeah you could do all that maybe. I think it would definitely be
required if we were to do context shrinkers like this. But AFAIKS there
is simply no need at all. Definitely it is not preventing XFS from
following more like the existing shrinker implementations.


--

From: Dave Chinner
Date: Tuesday, April 20, 2010 - 3:32 am

So you're basically saying that we shouldn't improve the shrinker
interface because you don't think that anyone should be doing
anything different to what is already there.

If a change of interface means that we end up with shorter call
chains, less global state, more flexibilty, better batching and IO
patterns, less duplication of code and algorithms and it doesn't
cause any regressions, then where's the problem?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Wednesday, April 21, 2010 - 1:40 am

I'm saying that dynamic registration is no good, if we don't have a

Yep that would all be great but I don't see how the interface change
enables any of that at all. It seems to me that the advantage goes
the other way because it doesn't put as much crap into your mount
structure and you end up with an useful traversable list of mounts as
a side-effect.
 
--

From: Christoph Hellwig
Date: Thursday, April 22, 2010 - 9:32 am

We can happily throw in a priority field into the shrinker structure,
but at this stage in the release process I'd rather have an as simple
as possible fix for the regression.  And just adding the context pointer
which is a no-op for all existing shrinkers fits that scheme very well.

If it makes you happier I can queue up a patch to add the priorities
for 2.6.35.  I think figuring out any meaningful priorities will be

There really is not need for that.  The Linux VFS is designed to have
superblocks independent, which actually is a good thing as global
state gets in the way of scalability or just clean code.  Note that
a mounts list would be even worse as filesystems basically are not
concerned with mount points themselves.

--

From: Nick Piggin
Date: Thursday, April 22, 2010 - 9:38 am

I don't understand, it should be implemented like just all the other
shrinkers AFAIKS. Like the dcache one that has to shrink multiple
superblocks. There is absolutely no requirement for this API change
to implement it in XFS.

If you then add a patch to change the API and can show how it improves

But the shrinker list *is* a global list. The downside of it in the way
it was done in the XFS patch is that 1) it is much larger than a simple
list head, and 2) not usable by anything other then the shrinker.

OK, maybe there will never be other uses for it other than shrinker, but
that's not a disadvantage (just one less advantage).

--

From: Christoph Hellwig
Date: Thursday, April 22, 2010 - 9:42 am

It is an existing global list just made more useful.  Whenever a driver
has muliple instances of pool that need shrinking this comes in useful,
it's not related to filesystems at all. 

--

From: Nick Piggin
Date: Thursday, April 22, 2010 - 9:57 am

I don't know. It's not really caused by not registering multiple
shrinkers. It seems to be caused more by the locking, which is not
going away when you have multiple shrinkers.

The XFS patch seems to be pinning the mount structure when it is
registered, so it would have no such locking/refcounting problems

I would say less useful, because shrinker structure cannot be used
by anything but the shrinker, wheras a private list can be used by
anything, including the applicable shrinker.

--

From: Dave Chinner
Date: Thursday, April 22, 2010 - 6:58 pm

Yes, it is, and one that I think we can clean up significantly by
the use of context based shrinkers.

IMO, a better approach to the VFS shrinkers is to leverage the fact
we already have per-sb dentry LRUs and convert the inode cache to a
per-sb LRU as well.

We can then remove the current dependency problems by moving to
a single context based shrinker (i.e. per-sb) to reclaim from the
per-sb dentry LRU, followed by the per-sb inode LRU via a single
shrinker. That is, remove the global scope from them because that is
the cause of the shrinker call-order dependency problems.

Further, if we then add a filesystem callout to the new superblock
shrinker callout, we can handle all the needs of XFS (and other
filesystems) without requiring them to have global filesystem lists
and without introducing new dependencies between registered
shrinkers.

And given that the shrinker currently handles proportioning reclaim
based on the number of objects reported by the cache, it also allows
us to further simplify the dentry cache reclaim by removing all the
proportioning stuff it does right now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Dave Chinner
Date: Tuesday, April 27, 2010 - 8:38 pm

Well, I've gone and done this global shrinker because I need a fix
for the problem before .34 releases, not because I like it.

Now my problem is that the accepted method of using global shrinkers
(i.e. split nr_to-scan into portions based on per-fs usage) is
causing a regression compared to not having a shrinker at all. The
context based shrinker did not cause this regression, either.

The regression is oom-killer panics with "no killable tasks" - it
kills my 1GB RAM VM dead.  Without a shrinker or with the context
based shrinkers I will see one or two dd processes getting
OOM-killed maybe once every 10 or so runs on this VM, but the machine
continues to stay up. The global shrinker is turning this into a
panic, and it is happening about twice as often.

To fix this I've had to remove all the code that proportions the
reclaim across all the XFS filesystems in the system. Basically it
now walks from the first filesystem in the list to the last every
time and effectively it only reclaims from the first filesystem it
finds with reclaimable inodes.

This is exactly the behaviour the context based shrinkers give me,
without the need for adding global lists, additional locking and
traverses. Also, context based shrinkers won't re-traverse all the
filesystems, avoiding the potential for starving some filesystems of
shrinker based reclaim if filesystems earlier in the list are
putting more inodes into reclaim concurrently.

Given that this behaviour matches pretty closely to the reasons I've
already given for preferring context based per-fs shrinkers than a
global shrinker and list, can we please move forward with this API
change, Nick?

As it is, I'm going to cross my fingers and ship this global
shrinker because of time limitations, but I certainly hoping that
for .35 we can move to context based shrinking....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Avi Kivity
Date: Wednesday, April 28, 2010 - 2:39 am

It's nicer (and slightly cheaper) to have

   int (*shrink)(struct shrinker *shrinker, int nr_to_scan, gfp_t gfp_mask);
   /* no void *ctx; */

Clients can use container_of() to reach their context from the shrinker 
argument.

-- 
error compiling committee.c: too many arguments to function

--

Previous thread: [PATCH] mm: disallow direct reclaim page writeback by Dave Chinner on Monday, April 12, 2010 - 5:17 pm. (110 messages)

Next thread: [PATCH 2/2] xfs: add a shrinker to background inode reclaim by Dave Chinner on Monday, April 12, 2010 - 5:24 pm. (1 message)