Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure

Previous thread: [PATCH 1/5] inode: Make unused inode LRU per superblock by Dave Chinner on Tuesday, May 25, 2010 - 1:53 am. (8 messages)

Next thread: [PATCH 4/5] superblock: add filesystem shrinker operations by Dave Chinner on Tuesday, May 25, 2010 - 1:53 am. (2 messages)
From: Dave Chinner
Date: Tuesday, May 25, 2010 - 1:53 am

[Empty message]
From: Nick Piggin
Date: Wednesday, May 26, 2010 - 9:41 am

Nitpick but I prefer just the restart label wher it is previously. This

Would you just elaborate on the lock order problem somewhere? (the
comment makes it look like we *could* take the mutex if we wanted

Do you think truncating in the divisions is at all a problem? It
--

From: Dave Chinner
Date: Wednesday, May 26, 2010 - 4:12 pm

The shrinker is unregistered in deactivate_locked_super() which is
just before ->kill_sb is called. The sb->s_umount lock is held at
this point. hence is the shrinker is operating, we will deadlock if
we try to lock it like this:

	unmount:			shrinker:
					down_read(&shrinker_lock);
	down_write(&sb->s_umount)
	unregister_shrinker()
	down_write(&shrinker_lock)
					prune_super()
					  down_read(&sb->s_umount);
					  (deadlock)

hence if we can't get the sb->s_umount lock in prune_super(), then
the superblock must be being unmounted and the shrinker should abort
as the ->kill_sb method will clean up everything after the shrinker

Same code as currently exists. IIRC, the reasoning is that if we've
got less that 100 objects to reclaim, then we're unlikely to be able
to free up any memory from the caches, anyway.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Dave Chinner
Date: Wednesday, May 26, 2010 - 6:53 pm

Updated patch below with these issues fixed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

superblock: introduce per-sb cache shrinker infrastructure

From: Dave Chinner <dchinner@redhat.com>

With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well.  The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.

With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code for the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do less
superblock list traversals and locking and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 2:
- change loop restart in __shrink_dcache_sb() to match previous
  restart semantics
- add a better comment in prune_super() to explain the deadlock we
  are avoiding by using down_read_trylock(&sb->s_umount) before
  starting any shrinking.

 fs/dcache.c        |  127 +++++++---------------------------------------------
 fs/inode.c         |  109 +++-----------------------------------------
 fs/super.c         |   58 ++++++++++++++++++++++++
 include/linux/fs.h |    7 +++
 4 files changed, 89 insertions(+), 212 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index dba6b6d..a7cd335 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ ...
From: Al Viro
Date: Wednesday, May 26, 2010 - 9:01 pm

Um...  Maybe I'm dumb, but what's wrong with doing unregistration from
deactivate_locked_super(), right after the call of ->kill_sb()?  At that
point ->s_umount is already dropped, so we won't deadlock at all.
Shrinker rwsem will make sure that all shrinkers-in-progress will run
to completion, so we won't get a superblock freed under prune_super().
I don't particulary mind down_try_read() in prune_super(), but why not
make life obviously safer?

Am I missing something here?
--

From: Dave Chinner
Date: Wednesday, May 26, 2010 - 11:17 pm

I was worried about memory allocation in the ->kill_sb path
deadlocking on the s_umount lock if it enters reclaim. e.g.  XFS
inodes can still be dirty even after the VFS has disposed of them,
and writing them back can require page cache allocation for the
backing buffers. If allocation recurses back into the shrinker, we
can deadlock on the s_umount lock.  This doesn't seem like an XFS
specific problem, so I used a trylock to avoid that whole class of
problems (same way the current shrinkers do).

From there, we can unregister the shrinker before calling ->kill_sb
as per above. That, in turn, means that the unmount
invalidate_inodes() vs shrinker race goes away and the iprune_sem is
not needed in the new prune_icache_sb() function.  I'm pretty sure
that I can now remove the iprune_sem, but I haven't written the
patch to do that yet.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Wednesday, May 26, 2010 - 11:46 pm

If GFP_FS is set, we wouldn't touch the locks. It is a concern
though, if __GFP_FS allocations were previously permitted under

I do really like that aspect of your patch. It's nice to have the
shrinker always only operating against active supers. So I would
be in favour of your current scheme.


--

From: Nick Piggin
Date: Wednesday, May 26, 2010 - 7:19 pm

You added it to the comment in your updated patch, that was the main

Yeah, which is why I stop short of saying you should change it in
this patch.

But I think we should ensure things can get reclaimed eventually.
100 objects could be 100 slabs, which could be anything from
half a meg to half a dozen. Multiplied by each of the caches.
Could be significant in small systems.
--

From: Dave Chinner
Date: Wednesday, May 26, 2010 - 9:07 pm

True, but usually there are busy objects in the dentry and inode
slabs, so it shouldn't be a significant issue. We can probably
address such problems if they can be demonstrated to be an issue in
a separate patch set....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Wednesday, May 26, 2010 - 11:35 pm

Hmm, an interesting dynamic that you've changed is that previously
we'd scan dcache LRU proportionately to pagecache, and then scan
inode LRU in proportion to the current number of unused inodes.

But we can think of inodes that are only in use by unused (and aged)
dentries as effectively unused themselves. So this sequence under
estimates how many inodes to scan. This could bias pressure against
dcache I'd think, especially considering inodes are far larger than
dentries. Maybe require 2 passes to get the inodes unused inthe
first pass.

Part of the problem is the funny shrinker API.

The right way to do it is to change the shrinker API so that it passes
down the lru_pages and scanned into the callback. From there, the
shrinkers can calculate the appropriate ratio of objects to scan.
No need for 2-call scheme, no need for shrinker->seeks, and the
ability to calculate an appropriate ratio first for dcache, and *then*
for icache.

A helper of course can do the calculation (considering that every
driver and their dog will do the wrong thing if we let them :)).

unsigned long shrinker_scan(unsigned long lru_pages,
			unsigned long lru_scanned,
			unsigned long nr_objects,
			unsigned long scan_ratio)
{
	unsigned long long tmp = nr_objects;

	tmp *= lru_scanned * 100;
	do_div(tmp, (lru_pages * scan_ratio) + 1);

	return (unsigned long)tmp;
}

Then the shrinker callback will go:
	sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
				sb->s_nr_dentry_unused,
				vfs_cache_pressure * SEEKS_PER_DENTRY);
	if (sb->s_nr_dentry_scan > SHRINK_BATCH)
		prune_dcache()

	sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
				sb->s_nr_inodes_unused,
				vfs_cache_pressure * SEEKS_PER_INODE);
	...

What do you think of that? Seeing as we're changing the shrinker API
anyway, I'd think it is high time to do somthing like this.


--

From: Dave Chinner
Date: Thursday, May 27, 2010 - 3:40 pm

It's self-balancing - it trends towards an equal number of unused
dentries and inodes in the caches. Yes, it will tear down more
dentries at first, but we need to do that to be able to reclaim
inodes. Ås reclaim progresses the propotion of inodes increases, so
the amount of inodes reclaimed increases. 

Basically this is a recognition that the important cache for
avoiding IO is the inode cache, not he dentry cache. Once the inode
cache is freed that we need to do IO to repopulate it, but
rebuilding dentries fromteh inode cache only costs CPU time. Hence
under light reclaim, inodes are mostly left in cache but we free up
memory that only costs CPU to rebuild. Under heavy, sustained
reclaim, we trend towards freeing equal amounts of objects from both
caches.

This is pretty much what the current code attempts to do - free a
lot of dentries, then free a smaller amount of the inodes that were
used by the freed dentries. Once again it is a direct encoding of
what is currently an implicit design feature - it makes it *obvious*
how we are trying to balance the caches.

Another reason for this is that the calculation changes again to
allow filesystem caches to modiy this proportioning in the next
patch....

FWIW, this also makes workloads that generate hundreds of thousands
of never-to-be-used again negative dentries free dcache memory really

My only concern about this is that exposes the inner workings of the
shrinker and mm subsystem to code that simply doesn't need to know

Ignoring the dcache/icache reclaim ratio issues, I'd prefer a two
call API that matches the current behaviour, leaving the caclulation
of how much to reclaim in shrink_slab(). Encoding it this way makes
it more difficult to change the high level behaviour e.g. if we want
to modify the amount of slab reclaim based on reclaim priority, we'd
have to cahnge every shrinker instead of just shrink_slab().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Thursday, May 27, 2010 - 10:19 pm

I don't know if you've got numbers or patterns to justify that.
My point is that things should stay as close to the old code as

With your patches, if there are no inodes free you would need to take
2 passes at freeing the dentry cache. My suggestion is closer to the

That would still be the case because used inodes aren't getting their

It's just providing a ratio. The shrinkers allready know they are

Well if it is an issue, it should be changed in a different patch

We can modifiy the ratios before calling if needed, or have a default
ratio define to multiply with as well.

But shrinkers are very subsystem specific.

--

From: Dave Chinner
Date: Sunday, May 30, 2010 - 11:39 pm

We don't get a single shrinker call - we get batches of them fo each
shrinker.  The shrinker determines how many objects to scan in the
cache, and that only changes for each shrinker to call based on the
the shrinker->seeks and the number of objects in the cache.  Once
the number is decided, then the shrinker gets batches of reclaim to
operate on.

Fundamentally, the current shrinker will ask for the same percentage
of each cache to be scanned - if it decides to scan 20% of the
dentry cache, it will also decide to scan 20% of the inode cache
Hence what the inode shrinker is doing is scanning 20% of the inodes
freed by the dcache shrinker.

In rough numbers, say we have 100k dentries, and the shrinker
calculates it needs to scan 20% of the caches to reclaim them, the
current code will end up with:

		unused dentries		unused inodes
before		 100k			    0
after dentry	  80k			  20k
after inode	  80k			  16k

So we get 20k dentries freed and 4k inodes freed on that shrink_slab
pass.

To contrast this against the code I proposed, I'll make a couple of
simplicfications to avoid hurting my brain. That is, I'll assume
SHRINK_BATCH=100 (rather than 128) and forgetting about rounding
errors. With this, the algorithm I encoded gives roughly the
following for a 20% object reclaim:

number of batches = 20k / 100 = 200

				Unused
		dentries+inodes		dentries	inodes
before		  100k			100k		    0
batch 1		  100k			99900		  100
batch 2		  100k			99800		  200
....
batch 10	  100k			99000		 1000
batch 20	  99990			98010		 1990
batch 30	  99980			97030		 2950
batch 50	  99910			95100		 4810
batch 60	  99860			94150		 5710
.....
batch 200	  98100			81900		16200

And so (roughly) we see that the number of inodes being reclaim per
set of 10 batches roughly equals the (batch number - 10). Hence over
200 batches, we can expect to see roughly 190 + 180 + ... + 10
inodes reclaimed. That is 1900 inodes.  Similarly for dentries, we
get roughly 1000 + 990 + 980 + ... 810 dentries ...
From: Nick Piggin
Date: Monday, May 31, 2010 - 12:28 am

OK fair point. However


I prefer just to keep changes to a minimum and split into seperate
patches (each with at least basic test or two showing no regression).

As-is you're already changing global inode/dentry passes into per
sb inode and dentry passes. I think it can only be a good thing
for that changeset if other changes are minimised.

Then if it is so obviously good behaviour to reduce dcache pressure,

Well you can also bias against the dcache with any other means,
including the change you've made here. My main point I guess is
that it should not be in the same as this patchset (or at least

Not really. The VM doesn't know about any of those. They are just
told to provide a ratio and some scanning based on some abstract cost.

The VM doesn't know anything about usage patterns, inuse vs unused
objects, exactly how their LRU algorithms are supposed to work, etc.


Clearly they wouldn't with what I was proposing. And the result would
be much more flexible and also gives the shrinkers more information.

--

From: Andrew Morton
Date: Thursday, May 27, 2010 - 1:32 pm

On Tue, 25 May 2010 18:53:06 +1000

I go all tingly when a changelog contains the word "should".


Forgot to update the kerneldoc description of `count'.


--

From: Dave Chinner
Date: Thursday, May 27, 2010 - 4:01 pm

As i said to Nick - the tests I ran showed an average improvement of
5% but the accuracy of the benchmark was +/-10%. Hence it's hard to
draw any conclusive results from that. It appears to be slightly
faster on an otherwise idle system, but...

As it is, the XFS shrinker that gets integrated into this structure
in a later patch peaks at a higher rate - 150k inodes/s vs 90k
inodes/s with the current shrinker - but still it's hard to quantify
qualitatively. I'm going to run more benchmarks to try to get better

Will fix.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

Previous thread: [PATCH 1/5] inode: Make unused inode LRU per superblock by Dave Chinner on Tuesday, May 25, 2010 - 1:53 am. (8 messages)

Next thread: [PATCH 4/5] superblock: add filesystem shrinker operations by Dave Chinner on Tuesday, May 25, 2010 - 1:53 am. (2 messages)