Re: [patch] mm: vmscan implement per-zone shrinkers

Previous thread: Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings by Tomoya MORINAGA on Tuesday, November 9, 2010 - 5:26 am. (4 messages)

Next thread: SEASON LOAN! by PRIVATE HOME LENDER INC on Tuesday, November 9, 2010 - 5:33 am. (1 message)
From: Nick Piggin
Date: Tuesday, November 9, 2010 - 5:32 am

Hi,

I'm doing some works that require per-zone shrinkers, I'd like to get
the vmscan part signed off and merged by interested mm people, please.

[And before anybody else kindly suggests per-node shrinkers, please go
back and read all the discussion about this first.]

Patches against Linus's current tree.

--
Allow the shrinker to do per-zone shrinking. This requires adding a zone
argument to the shrinker callback and calling shrinkers for each zone
scanned. The logic somewhat in vmscan code gets simpler: the shrinkers are
invoked for each zone, around the same time as the pagecache scanner.
Zone reclaim needed a bit of surgery to cope with the change, but the
idea is the same.

But all shrinkers are currently global-based, so they need a way to
convert per-zone ratios into global scan ratios. So seeing as we are
changing the shrinker API anyway, let's reorganise it to make it saner.

So the shrinker callback is passed:
- the number of pagecache pages scanned in this zone
- the number of pagecache pages in this zone
- the total number of pagecache pages in all zones to be scanned

The shrinker is now completely responsible for calculating and batching
(given helpers), which provides better flexibility. vmscan helper functions
are provided to accumulate these ratios, and help with batching.

Finally, add some fixed-point scaling to the ratio, which helps rounding.

The old shrinker API remains for unconverted code. There is no urgency
to convert them at once.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/drop_caches.c    |    6 
 include/linux/mm.h  |   47 ++++++-
 mm/memory-failure.c |   10 -
 mm/vmscan.c         |  341 +++++++++++++++++++++++++++++++++++++---------------
 4 files changed, 297 insertions(+), 107 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-11-09 22:11:03.000000000 +1100
+++ linux-2.6/include/linux/mm.h	2010-11-09 ...
From: Dave Chinner
Date: Tuesday, November 9, 2010 - 10:18 pm

There are still plenty of unresolved issues with this general
approach to scaling object caches that I'd like to see sorted out
before we merge any significant shrinker API changes. Some things of
the top of my head:

	- how to solve impendence mismatches between VM scalability
	  techniques and subsystem scalabilty techniques that result
	  in shrinker cross-muliplication explosions. e.g. XFS
	  tracks reclaimable inodes in per-allocation group trees,
	  so we'd get AG x per-zone LRU trees using this shrinker
	  method.  Think of the overhead on a 1000AG filesystem on a
	  1000 node machine with 3-5 zones per node....

	- changes from global LRU behaviour to something that is not
	  at all global - effect on workloads that depend on large
	  scale caches that span multiple nodes is largely unknown.
	  It will change IO patterns and affect system balance and
	  performance of the system. How do we
	  test/categorise/understand these problems and address such
	  balance issues?

	- your use of this shrinker architecture for VFS
	  inode/dentry cache scalability requires adding lists and
	  locks to the MM struct zone for each object cache type
	  (inode, dentry, etc). As such, it is not a generic
	  solution because it cannot be used for per-instance caches
	  like the per-mount inode caches XFS uses.

	  i.e. nothing can actually use this infrastructure change
	  without tying itself directly into the VM implementation,
	  and even then not every existing shrinker can use this
	  method of scaling. i.e. some level of abstraction from the
	  VM implementation is needed in the shrinker API.

	- it has been pointed out that slab caches are generally
	  allocated out of a single zone per node, so per-zone
	  shrinker granularity seems unnecessary.

	- doesn't solve the unbound direct reclaim shrinker
	  parallelism that is already causing excessive LRU lock
	  contention on 8p single node systems. While
	  per-LRU/per-node solves the larger scalability issue, it
	  ...
From: Nick Piggin
Date: Tuesday, November 9, 2010 - 11:32 pm

That's an interesting question, but no reason to hold anything up.
It's up to the subsystem to handle it, or if they can't handle it
sanely, you can elect to get no different behaviour than today.

I would say that per-zone is important for the above case, because it
shrinks properly, based on memory pressure and pressure settings (eg.
zone reclaim), and also obviously so that the reclaim threads are
scanning node-local-memory so it doesn't have to pull objects from
all over the interconnect.

With 1000 nodes and 1000AG filesystem, probably NxM is overkill, but
there is no reason you couldn't have one LRU per X AGs.

All this is obviously implementation details -- pagecache reclaim does

You've brought this up before several times and I've answered it.

The default configuration basically doesn't change much. Caches are
allowed to fill up all zones and nodes, exactly the same as today.

On the reclaim side, when you have a global shortage, it will evenly
shrink objects from all zones, so it still approximates LRU behaviour
because number of objects far exceeds number of zones. This is exactly


"zones" is the abstraction. The thing which all allocation and
management of memory is based on, everywhere you do any allocations
in the kernel. A *shrinker* implementation, a thing which is called
in response to memory shortage in a given zone, has to know about
zones. End of story. "Tied to the memory management implementation"
is true to the extent that it is part of the memory management

No they are not, that's just total FUD. Where was that "pointed out"?

Slabs are generally allocated from every zone except for highmem and
movable, and moreover there is nothing to prevent a shrinker
implementation from needing to shrink highmem and movable pages as

It doesn't aim to solve that. It doesn't prevent that from being
solved either (and improving parallelism in a subsystem generally,
you know, helps with lock contention due to lots of threads in there,

Well if you ...
From: Nick Piggin
Date: Tuesday, November 9, 2010 - 11:39 pm

I might add that you've previously brought up and I answered every
single issue below and you've not followed up on my answers. So just
going back to the start and bringing them all up again is ridiculous.

Perhaps you meant here that many nodes have only one populated zone.
If that is _really_ a huge problem to add a couple of list heads per
zone per node (which it isn't), then the subsystem can do per-node
shrinking and just do zone_to_nid(zone) in the shrinker callback (but
that would be retarded so it shouldn't).

I answered Christoph's thread here, and I showed how a node based
callback can result in significant imbalances between zones in a node,
so it won't fly.

If you think that nodes are sufficient, then you can get patches through
mm to make pagecache reclaim per-node based, and changes to the
shirnker API will naturally propogate through.

--

From: Dave Chinner
Date: Wednesday, November 10, 2010 - 4:05 am

See, this is what I think is the problem - you answer any concerns
by saying that the problem is entirely the responsibility of the
subsystem to handle. You've decided how you want shrinkers to work,

And when it's only 1 node or 50 nodes or 50 AGs or 5000 AGs? I don't
like encoding _guesses_ at what might be a good configuration for an
arbitrary NxM configuration because they are rarely right. That's
exactly what you are suggesting here - that I solve this problem by
making guesses at how it might scale and coming up with some


The whole point of the zone-based reclaim is that shrinkers run
when there are _local_ shortages on the _local_ LRUs and this will
happen much more often than global shortages, especially on large
machines. It will result in a change of behaviour, no question about
it.

However, this is not reason for not moving to this model - what I'm
asking is what the plan for categorising problems that arise as a
result of such a change? How do we go about translating random
reports like "I do this and it goes X% slower on 2.6.xx" to "that's
a result of the new LRU reclaim model, and you should do <this> to
try to resolve it". How do we triage such reports? What are our
options to resolve such problems? Are there any knobs we should add
at the beginning to give users ways of changing the behaviour to be
more like the curent code? We've got lots of different knobs to
control page cache reclaim behaviour - won't some of them be

Right - another SEP solution.

The reason people are using shrinkers is that it is simple to
implement a basic one. But implementing a scalable shrinker right
now is fucking hard - look at all the problems we've had with XFS
mostly because of the current API and the unbound parallelism of
direct reclaim.

This new API is a whole new adventure that not many people are going
to have machines capable of executing behavioural testing or
stressing. If everyone is forced to implement their own "scalable
shrinker" we'll end up with a rat's ...
From: Nick Piggin
Date: Wednesday, November 10, 2010 - 5:23 pm

Um, wrong. I provide an API to drive shrinkers, giving sufficient
information (and nothing more). You are the one deciding that it is
the problem of the API to somehow fix up all the implementations.

If you can actually suggest how such a magical API would look like,
please do. Otherwise, please try to get a better understanding of
my API and understand it does nothing more than provide required

Still, it's off topic for shrinker API discussion, I'd be happy to
help with ideas or questions about your particular implementation.

Note you can a: ignore the new information the API gives you, or b:
understand that your current case of 1000 LRUs spread globally over
1000 nodes sucks much much worse than even a poor per-zone distribution,

Uh, yes in fact it is the problem of the shrinker implementation to

But if you have a global workload, then you tend to get relatively
even pressure and it tends to approximate global LRU (especially if
you do memory spreading of your slabs).

Yes there will be some differences, but it's not some giant scary
change, it just can be tested with some benchmarks and people can
complain if something hurts. Bringing reclaim closer to how pagecache

Really? Same as any other heuristic or performance change. We merge
thousands of them every release cycle, and at least a couple of major

What??? Stop saying this. In the above paragraph you were going crazy
about your wrong and obviously unfounded solution that other subsystems
can't use it. I tell you that they can, ie. with kmalloc.

I could spell it out if it is not obvious. You would allocate an array
of LRU structures to fit all zones in the system. And then you would
index into that array with zone_to_nid and zone_idx. OK?

It is not "SEP solution". If a shrinker implementation wants to use per
zone LRUs, then it is that shrinker implementation's problem to allocate
space for them. OK so far?

And if a shrinker implementation doesn't want to do that, and instead
wants a global LRU, then ...
From: Nick Piggin
Date: Wednesday, November 10, 2010 - 10:21 pm

Here, this is the second time I wrote _exactly_ the same reply to your
_exact_ same question (the first time I wrote _exactly_ the same reply
to your exact same question was IIRC a few months before but I couldn't
find the archive.

http://archives.free.net.ph/message/20101015.033017.65c46426.ca.html

So at this point, shut up. You had ample time to put up, and didn't.
Now you're just obstructing and being unreasonable. To quote Christoph,
you don't have carte blance power to delay progress. So stop being an
arsehole, and don't fucking accuse me of scoffing and mocking you.

Because I've been bending over backwards to answer _exactly_ the same
questions multiple times from you and trying to show how either your
assumptions are wrong, or reasonable ways we can mitigate potential
regressions, and trying to be polite and civil the whole time. (which I
might add is much more courtesy than you showed me when you tried to
railroad your inode changes through without bothering to answer or even
read half my comments about them). And what you have done is just ignore
totally my replies to your concerns, stay quiet about them for a month,
and then come back with exactly the same thing again. Like 3 or 4 times.

Really. At this point, if you have already posted a particular comment
more than 50 times, do everybody a favour and redirect the next verbatim
copy /dev/null, OK? I'm sick of it. *Constructive* criticism only, from
now on. Thanks.

I need a zone input to the shinker API, I have demonstrated what for
and why, with good justification, and so I am going to get that API
change merged. It is a simple superset of the current API's
functionality and not some big scary monster (except by somebody who
fundmantally doesn't understand it). 

--

From: KOSAKI Motohiro
Date: Sunday, November 14, 2010 - 3:07 am

vmscan part looks good to me. however I hope fs folks review too even though
I'm not sure who is best.


shrink_zone is slightly grep unfriendly. Can you consider shrink_slab_zone() 


objects * SHRINK_FACTOR appear twice in this function.
calculate "objects = obj * SHRINK_FACTOR" at first improve

Seems misleading name a bit. shrinker_do_scan() does NOT scan. 



{
	unsigned long nr = ACCESS_ONCE(*dst);
	batch *= SHRINK_FACTOR;

	if (nr < batch)
		return 0;
	*dst = nr - batch;
	return batch;
}

is slighly cleaner. however It's unclear why dst and batch argument


Doubtful calculation. What mean "sc->nr_scanned - nr_scanned"?

Do we really need this doubtful cpuset hardwall filtering? Why do we
need to change slab reclaim pressure if cpuset is used. In old days,
we didn't have per-zone slab shrinker, then we need artificial slab
pressure boost for preventing false positive oom-killer. but now we have.

However, If you strongly keep old behavior at this time, I don't oppose.





--

From: KOSAKI Motohiro
Date: Sunday, November 14, 2010 - 5:50 pm

And per-zone reclaim can lead to new issue. On 32bit highmem system,
theorically the system has following memory usage.

ZONE_HIGHMEM: 100% used for page cache
ZONE_NORMAL:  100% used for slab

So, traditional page-cache/slab balancing may not work. I think following
new calculation or somethinhg else is necessary.

	if (zone_reclaimable_pages() > NR_SLAB_RECLAIMABLE) {
		using current calculation
	} else {
		shrink number of "objects >> reclaim-priority" objects
		(as page cache scanning calculation)
	}



--

From: Nick Piggin
Date: Tuesday, November 16, 2010 - 12:47 am

Yes, in theory you are right. I guess in theory the same hole exists
if we have 0% page cache reclaimable globally, but this may be slightly

I agree. In fact, perhaps the new calculation would work well in all
cases anyway, so maybe we should move away from making slab reclaim a
slave to pagecache reclaim.

Can we approach that in subsequent patches?

Thanks,
Nick

--

From: Anca Emanuel
Date: Tuesday, November 16, 2010 - 12:53 am

Nick, I want to test your tree.
This is taking too long.
Make something available now. And test it in real configs.
--

From: Figo.zhang
Date: Tuesday, November 16, 2010 - 1:05 am

hi Anca,

would you like to give your test method?
--

From: Anca Emanuel
Date: Tuesday, November 16, 2010 - 1:20 am

On Tue, Nov 16, 2010 at 10:05 AM, Figo.zhang

Nothig special, for now I will test on my PC.
--

From: Figo.zhang
Date: Tuesday, November 16, 2010 - 1:22 am

hi KOSAKI Motohiro,

is it any test suite or test scripts for test page-reclaim performance?

Best,
Figo.zhang
--

From: Anca Emanuel
Date: Tuesday, November 16, 2010 - 1:26 am

On Tue, Nov 16, 2010 at 10:22 AM, Figo.zhang

There is http://www.phoronix.com
--

From: Figo.zhang
Date: Tuesday, November 16, 2010 - 7:41 pm

From: Anca Emanuel
Date: Tuesday, November 16, 2010 - 9:29 pm

If you want some special test, you have to ask Michael Larabel for that.
http://www.phoronix-test-suite.com/
--

From: Figo.zhang
Date: Tuesday, November 16, 2010 - 10:21 pm

yes, i see, the phoronix-test-suite is test such as ffmpeg, games. not

--

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 12:19 am

Hi Figo,

I agree with Anca. Nick's patch will change almost VM behavior. then,
generic performance test suit is better than supecific one.

Thanks.



--

From: Nick Piggin
Date: Tuesday, November 16, 2010 - 1:26 am

I'm working on it, I'll have another batch of patches out to look at
in a couple of hours, if stress testing holds up.
--

From: KOSAKI Motohiro
Date: Tuesday, November 23, 2010 - 12:21 am

I'm not worry about so much "0% page cache reclaimable globally" case

OK!



--

From: Nick Piggin
Date: Tuesday, November 16, 2010 - 12:43 am

Yes that's true. I want to move away from the term "slab" shrinker
however. It seems to confuse people (of course, the shrinker can shrink
memory from any allocator, not just slab).

 
I wasn't quite sure what you meant with this comment and the above one.


It should have a comment: *dst can be accessed without a lock.
However if nr is reloaded from memory between the two expressions
and *dst changes during that time, we could end up with a negative

OK I'll take it into consideration. I guess I didn't want the caller

Yes it is, at the moment. I actually have a flag that I would like
to use (close to OOM flag), so I've just added the placeholder for
now.



Yeah I'm not completely sure. But we should be mindful that until the
major caches are converted to LRU, we still have to care for the global


Well it may not be over the pagecache limit.

I agree the situation is pretty ugly here with all these magic
constants, but I didn't want to change too much in this patch.

Thanks,
Nick
--

Previous thread: Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings by Tomoya MORINAGA on Tuesday, November 9, 2010 - 5:26 am. (4 messages)

Next thread: SEASON LOAN! by PRIVATE HOME LENDER INC on Tuesday, November 9, 2010 - 5:33 am. (1 message)