Hi, I'm doing some works that require per-zone shrinkers, I'd like to get the vmscan part signed off and merged by interested mm people, please. [And before anybody else kindly suggests per-node shrinkers, please go back and read all the discussion about this first.] Patches against Linus's current tree. -- Allow the shrinker to do per-zone shrinking. This requires adding a zone argument to the shrinker callback and calling shrinkers for each zone scanned. The logic somewhat in vmscan code gets simpler: the shrinkers are invoked for each zone, around the same time as the pagecache scanner. Zone reclaim needed a bit of surgery to cope with the change, but the idea is the same. But all shrinkers are currently global-based, so they need a way to convert per-zone ratios into global scan ratios. So seeing as we are changing the shrinker API anyway, let's reorganise it to make it saner. So the shrinker callback is passed: - the number of pagecache pages scanned in this zone - the number of pagecache pages in this zone - the total number of pagecache pages in all zones to be scanned The shrinker is now completely responsible for calculating and batching (given helpers), which provides better flexibility. vmscan helper functions are provided to accumulate these ratios, and help with batching. Finally, add some fixed-point scaling to the ratio, which helps rounding. The old shrinker API remains for unconverted code. There is no urgency to convert them at once. Signed-off-by: Nick Piggin <npiggin@kernel.dk> --- fs/drop_caches.c | 6 include/linux/mm.h | 47 ++++++- mm/memory-failure.c | 10 - mm/vmscan.c | 341 +++++++++++++++++++++++++++++++++++++--------------- 4 files changed, 297 insertions(+), 107 deletions(-) Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2010-11-09 22:11:03.000000000 +1100 +++ linux-2.6/include/linux/mm.h 2010-11-09 ...
There are still plenty of unresolved issues with this general approach to scaling object caches that I'd like to see sorted out before we merge any significant shrinker API changes. Some things of the top of my head: - how to solve impendence mismatches between VM scalability techniques and subsystem scalabilty techniques that result in shrinker cross-muliplication explosions. e.g. XFS tracks reclaimable inodes in per-allocation group trees, so we'd get AG x per-zone LRU trees using this shrinker method. Think of the overhead on a 1000AG filesystem on a 1000 node machine with 3-5 zones per node.... - changes from global LRU behaviour to something that is not at all global - effect on workloads that depend on large scale caches that span multiple nodes is largely unknown. It will change IO patterns and affect system balance and performance of the system. How do we test/categorise/understand these problems and address such balance issues? - your use of this shrinker architecture for VFS inode/dentry cache scalability requires adding lists and locks to the MM struct zone for each object cache type (inode, dentry, etc). As such, it is not a generic solution because it cannot be used for per-instance caches like the per-mount inode caches XFS uses. i.e. nothing can actually use this infrastructure change without tying itself directly into the VM implementation, and even then not every existing shrinker can use this method of scaling. i.e. some level of abstraction from the VM implementation is needed in the shrinker API. - it has been pointed out that slab caches are generally allocated out of a single zone per node, so per-zone shrinker granularity seems unnecessary. - doesn't solve the unbound direct reclaim shrinker parallelism that is already causing excessive LRU lock contention on 8p single node systems. While per-LRU/per-node solves the larger scalability issue, it ...
That's an interesting question, but no reason to hold anything up. It's up to the subsystem to handle it, or if they can't handle it sanely, you can elect to get no different behaviour than today. I would say that per-zone is important for the above case, because it shrinks properly, based on memory pressure and pressure settings (eg. zone reclaim), and also obviously so that the reclaim threads are scanning node-local-memory so it doesn't have to pull objects from all over the interconnect. With 1000 nodes and 1000AG filesystem, probably NxM is overkill, but there is no reason you couldn't have one LRU per X AGs. All this is obviously implementation details -- pagecache reclaim does You've brought this up before several times and I've answered it. The default configuration basically doesn't change much. Caches are allowed to fill up all zones and nodes, exactly the same as today. On the reclaim side, when you have a global shortage, it will evenly shrink objects from all zones, so it still approximates LRU behaviour because number of objects far exceeds number of zones. This is exactly "zones" is the abstraction. The thing which all allocation and management of memory is based on, everywhere you do any allocations in the kernel. A *shrinker* implementation, a thing which is called in response to memory shortage in a given zone, has to know about zones. End of story. "Tied to the memory management implementation" is true to the extent that it is part of the memory management No they are not, that's just total FUD. Where was that "pointed out"? Slabs are generally allocated from every zone except for highmem and movable, and moreover there is nothing to prevent a shrinker implementation from needing to shrink highmem and movable pages as It doesn't aim to solve that. It doesn't prevent that from being solved either (and improving parallelism in a subsystem generally, you know, helps with lock contention due to lots of threads in there, Well if you ...
I might add that you've previously brought up and I answered every single issue below and you've not followed up on my answers. So just going back to the start and bringing them all up again is ridiculous. Perhaps you meant here that many nodes have only one populated zone. If that is _really_ a huge problem to add a couple of list heads per zone per node (which it isn't), then the subsystem can do per-node shrinking and just do zone_to_nid(zone) in the shrinker callback (but that would be retarded so it shouldn't). I answered Christoph's thread here, and I showed how a node based callback can result in significant imbalances between zones in a node, so it won't fly. If you think that nodes are sufficient, then you can get patches through mm to make pagecache reclaim per-node based, and changes to the shirnker API will naturally propogate through. --
See, this is what I think is the problem - you answer any concerns by saying that the problem is entirely the responsibility of the subsystem to handle. You've decided how you want shrinkers to work, And when it's only 1 node or 50 nodes or 50 AGs or 5000 AGs? I don't like encoding _guesses_ at what might be a good configuration for an arbitrary NxM configuration because they are rarely right. That's exactly what you are suggesting here - that I solve this problem by making guesses at how it might scale and coming up with some The whole point of the zone-based reclaim is that shrinkers run when there are _local_ shortages on the _local_ LRUs and this will happen much more often than global shortages, especially on large machines. It will result in a change of behaviour, no question about it. However, this is not reason for not moving to this model - what I'm asking is what the plan for categorising problems that arise as a result of such a change? How do we go about translating random reports like "I do this and it goes X% slower on 2.6.xx" to "that's a result of the new LRU reclaim model, and you should do <this> to try to resolve it". How do we triage such reports? What are our options to resolve such problems? Are there any knobs we should add at the beginning to give users ways of changing the behaviour to be more like the curent code? We've got lots of different knobs to control page cache reclaim behaviour - won't some of them be Right - another SEP solution. The reason people are using shrinkers is that it is simple to implement a basic one. But implementing a scalable shrinker right now is fucking hard - look at all the problems we've had with XFS mostly because of the current API and the unbound parallelism of direct reclaim. This new API is a whole new adventure that not many people are going to have machines capable of executing behavioural testing or stressing. If everyone is forced to implement their own "scalable shrinker" we'll end up with a rat's ...
Um, wrong. I provide an API to drive shrinkers, giving sufficient information (and nothing more). You are the one deciding that it is the problem of the API to somehow fix up all the implementations. If you can actually suggest how such a magical API would look like, please do. Otherwise, please try to get a better understanding of my API and understand it does nothing more than provide required Still, it's off topic for shrinker API discussion, I'd be happy to help with ideas or questions about your particular implementation. Note you can a: ignore the new information the API gives you, or b: understand that your current case of 1000 LRUs spread globally over 1000 nodes sucks much much worse than even a poor per-zone distribution, Uh, yes in fact it is the problem of the shrinker implementation to But if you have a global workload, then you tend to get relatively even pressure and it tends to approximate global LRU (especially if you do memory spreading of your slabs). Yes there will be some differences, but it's not some giant scary change, it just can be tested with some benchmarks and people can complain if something hurts. Bringing reclaim closer to how pagecache Really? Same as any other heuristic or performance change. We merge thousands of them every release cycle, and at least a couple of major What??? Stop saying this. In the above paragraph you were going crazy about your wrong and obviously unfounded solution that other subsystems can't use it. I tell you that they can, ie. with kmalloc. I could spell it out if it is not obvious. You would allocate an array of LRU structures to fit all zones in the system. And then you would index into that array with zone_to_nid and zone_idx. OK? It is not "SEP solution". If a shrinker implementation wants to use per zone LRUs, then it is that shrinker implementation's problem to allocate space for them. OK so far? And if a shrinker implementation doesn't want to do that, and instead wants a global LRU, then ...
Here, this is the second time I wrote _exactly_ the same reply to your _exact_ same question (the first time I wrote _exactly_ the same reply to your exact same question was IIRC a few months before but I couldn't find the archive. http://archives.free.net.ph/message/20101015.033017.65c46426.ca.html So at this point, shut up. You had ample time to put up, and didn't. Now you're just obstructing and being unreasonable. To quote Christoph, you don't have carte blance power to delay progress. So stop being an arsehole, and don't fucking accuse me of scoffing and mocking you. Because I've been bending over backwards to answer _exactly_ the same questions multiple times from you and trying to show how either your assumptions are wrong, or reasonable ways we can mitigate potential regressions, and trying to be polite and civil the whole time. (which I might add is much more courtesy than you showed me when you tried to railroad your inode changes through without bothering to answer or even read half my comments about them). And what you have done is just ignore totally my replies to your concerns, stay quiet about them for a month, and then come back with exactly the same thing again. Like 3 or 4 times. Really. At this point, if you have already posted a particular comment more than 50 times, do everybody a favour and redirect the next verbatim copy /dev/null, OK? I'm sick of it. *Constructive* criticism only, from now on. Thanks. I need a zone input to the shinker API, I have demonstrated what for and why, with good justification, and so I am going to get that API change merged. It is a simple superset of the current API's functionality and not some big scary monster (except by somebody who fundmantally doesn't understand it). --
vmscan part looks good to me. however I hope fs folks review too even though
I'm not sure who is best.
shrink_zone is slightly grep unfriendly. Can you consider shrink_slab_zone()
objects * SHRINK_FACTOR appear twice in this function.
calculate "objects = obj * SHRINK_FACTOR" at first improve
Seems misleading name a bit. shrinker_do_scan() does NOT scan.
{
unsigned long nr = ACCESS_ONCE(*dst);
batch *= SHRINK_FACTOR;
if (nr < batch)
return 0;
*dst = nr - batch;
return batch;
}
is slighly cleaner. however It's unclear why dst and batch argument
Doubtful calculation. What mean "sc->nr_scanned - nr_scanned"?
Do we really need this doubtful cpuset hardwall filtering? Why do we
need to change slab reclaim pressure if cpuset is used. In old days,
we didn't have per-zone slab shrinker, then we need artificial slab
pressure boost for preventing false positive oom-killer. but now we have.
However, If you strongly keep old behavior at this time, I don't oppose.
--
And per-zone reclaim can lead to new issue. On 32bit highmem system,
theorically the system has following memory usage.
ZONE_HIGHMEM: 100% used for page cache
ZONE_NORMAL: 100% used for slab
So, traditional page-cache/slab balancing may not work. I think following
new calculation or somethinhg else is necessary.
if (zone_reclaimable_pages() > NR_SLAB_RECLAIMABLE) {
using current calculation
} else {
shrink number of "objects >> reclaim-priority" objects
(as page cache scanning calculation)
}
--
Yes, in theory you are right. I guess in theory the same hole exists if we have 0% page cache reclaimable globally, but this may be slightly I agree. In fact, perhaps the new calculation would work well in all cases anyway, so maybe we should move away from making slab reclaim a slave to pagecache reclaim. Can we approach that in subsequent patches? Thanks, Nick --
Nick, I want to test your tree. This is taking too long. Make something available now. And test it in real configs. --
hi Anca, would you like to give your test method? --
On Tue, Nov 16, 2010 at 10:05 AM, Figo.zhang Nothig special, for now I will test on my PC. --
hi KOSAKI Motohiro, is it any test suite or test scripts for test page-reclaim performance? Best, Figo.zhang --
On Tue, Nov 16, 2010 at 10:22 AM, Figo.zhang There is http://www.phoronix.com --
If you want some special test, you have to ask Michael Larabel for that. http://www.phoronix-test-suite.com/ --
yes, i see, the phoronix-test-suite is test such as ffmpeg, games. not --
Hi Figo, I agree with Anca. Nick's patch will change almost VM behavior. then, generic performance test suit is better than supecific one. Thanks. --
I'm working on it, I'll have another batch of patches out to look at in a couple of hours, if stress testing holds up. --
I'm not worry about so much "0% page cache reclaimable globally" case OK! --
Yes that's true. I want to move away from the term "slab" shrinker however. It seems to confuse people (of course, the shrinker can shrink memory from any allocator, not just slab). I wasn't quite sure what you meant with this comment and the above one. It should have a comment: *dst can be accessed without a lock. However if nr is reloaded from memory between the two expressions and *dst changes during that time, we could end up with a negative OK I'll take it into consideration. I guess I didn't want the caller Yes it is, at the moment. I actually have a flag that I would like to use (close to OOM flag), so I've just added the placeholder for now. Yeah I'm not completely sure. But we should be mindful that until the major caches are converted to LRU, we still have to care for the global Well it may not be over the pagecache limit. I agree the situation is pretty ugly here with all these magic constants, but I didn't want to change too much in this patch. Thanks, Nick --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds |
