Per-zone LRUs and shrinkers for inode cache.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/inode.c | 84 ++++++++++++++++++++++++++++---------------------
include/linux/mmzone.h | 7 ++++
2 files changed, 56 insertions(+), 35 deletions(-)
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2010-10-19 14:38:31.000000000 +1100
+++ linux-2.6/fs/inode.c 2010-10-19 14:39:04.000000000 +1100
@@ -34,7 +34,7 @@
* s_inodes, i_sb_list
* inode_hash_bucket lock protects:
* inode hash table, i_hash
- * inode_lru_lock protects:
+ * zone->inode_lru_lock protects:
* inode_lru, i_lru
* wb->b_lock protects:
* b_io, b_more_io, b_dirty, i_io, i_lru
@@ -49,7 +49,7 @@
* Ordering:
* inode->i_lock
* inode_list_lglock
- * inode_lru_lock
+ * zone->inode_lru_lock
* wb->b_lock
* inode_hash_bucket lock
*/
@@ -100,8 +100,6 @@
* allowing for low-overhead inode sync() operations.
*/
-static LIST_HEAD(inode_lru);
-
struct inode_hash_bucket {
struct hlist_bl_head head;
};
@@ -127,8 +125,6 @@
DECLARE_LGLOCK(inode_list_lglock);
DEFINE_LGLOCK(inode_list_lglock);
-static DEFINE_SPINLOCK(inode_lru_lock);
-
/*
* iprune_sem provides exclusion between the kswapd or try_to_free_pages
* icache shrinking path, and the umount path. Without this exclusion,
@@ -166,7 +162,12 @@
int get_nr_inodes_unused(void)
{
- return inodes_stat.nr_unused;
+ int nr = 0;
+ struct zone *z;
+
+ for_each_populated_zone(z)
+ nr += z->inode_nr_lru;
+ return nr;
}
/*
@@ -177,6 +178,7 @@
{
#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
inodes_stat.nr_inodes = get_nr_inodes();
+ inodes_stat.nr_unused = get_nr_inodes_unused();
return proc_dointvec(table, write, buffer, lenp, ppos);
#else
return -ENOSYS;
@@ -440,10 +442,12 @@
*/
void __inode_lru_list_add(struct inode *inode)
...

Regardless of whether this is the right way to scale or not, I don't like the fact that this moves the cache LRUs into the memory management structures, and expands the use of MM specific structures throughout the code. It ties the cache implementation to the current VM implementation. That, IMO, goes against all the principles of modularisation at the source code level, and it means we have to tie all shrinker implementations to the current internal implementation of the VM. I don't think that is a wise thing to do because of the dependencies and impedance mismatches it introduces.

As an example: XFS inodes to be reclaimed are simply tagged in a radix tree so the shrinker can reclaim inodes in optimal IO order rather than strict LRU order. It simply does not match a zone-based shrinker implementation in any way, shape or form, nor does its inherent parallelism match that of the way shrinkers are called.

Any change in shrinker infrastructure needs to be able to handle these sorts of impedance mismatches between the VM and the cache subsystem. The current API doesn't handle this very well, either, so it's something that we need to fix so that scalability is easy for everyone.

Anyway, my main point is that tying the LRU and shrinker scaling to the implementation of the VM is a one-off solution that doesn't work for generic infrastructure. Other subsystems need the same large-machine scaling treatment, and there's no way we should be tying them all into the struct zone. It needs further abstraction.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
[I should have cc'ed this one to linux-mm as well, so I quote your reply in full here]

The zone structure really is the basic unit of memory abstraction in the whole zoned VM concept (which covers different properties of both physical address and NUMA cost).

The zone contains structures for memory management that aren't otherwise directly related to one another: generic page waitqueues, page allocator structures, pagecache reclaim structures, memory model data, and various statistics.

Structures to reclaim inodes from a particular zone belong in the zone struct as much as those to reclaim pagecache or anonymous memory from that zone too. It actually fits far better in here than globally, because all our allocation/reclaiming/watermarks etc. are driven per-zone.

It's very fundamental. We allocate memory from, and have to reclaim memory from -- zones. Memory reclaim is driven based on how the VM wants to reclaim memory: there is nothing you can do to avoid some linkage between the two.

Look at it this way. The dumb global shrinker is also tied to an MM implementation detail, but that detail in fact does *not* match the reality of the MM, and so it has all these problems interacting with real reclaim.

What problems? OK, on an N zone system (assuming equal zones and even distribution of objects around memory), if there is a shortage on a particular zone, slabs from _all_ zones are reclaimed. We reclaim a factor of N too many objects. In a NUMA situation, we also touch remote memory with a chance (N-1)/N. As the number of nodes grows beyond 2, this quickly goes downhill.

In summary, there needs to be some knowledge of how MM reclaims memory in memory reclaim shrinkers -- you simply can't do a good implementation without that. If the zone concept changes, the MM gets turned upside

This is another problem, similar to what we have in pagecache. In the pagecache, we need to clean pages in optimal IO order, but we still reclaim them according to some LRU order. If you reclaim ...
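The factor-of-N argument above can be illustrated with a small userspace model. This is only a sketch under the same assumptions stated in the message (equal zones, objects spread evenly); `NR_ZONES`, `OBJS_PER_ZONE`, `model_global_shrink()` and `model_zone_shrink()` are hypothetical names for this illustration and are not part of the patch or of the kernel.

```c
#include <stdio.h>

#define NR_ZONES      4     /* assumed: N equal-sized zones */
#define OBJS_PER_ZONE 1000  /* assumed: objects spread evenly */

/*
 * Model of a zone-unaware shrinker. The global LRU interleaves
 * objects from all zones (object i lives in zone i % NR_ZONES).
 * Everything scanned is freed until `want` objects belonging to
 * the `target` zone have been freed. Returns the total freed.
 */
static long model_global_shrink(int target, long want)
{
	long total = (long)NR_ZONES * OBJS_PER_ZONE;
	long freed_total = 0, freed_target = 0;
	long i;

	for (i = 0; i < total && freed_target < want; i++) {
		freed_total++;
		if (i % NR_ZONES == target)
			freed_target++;
	}
	return freed_total;
}

/*
 * Model of a per-zone shrinker. Only the short zone's LRU is
 * walked, so the total freed equals what was asked for (capped
 * by the number of objects that zone holds).
 */
static long model_zone_shrink(long want)
{
	return want < OBJS_PER_ZONE ? want : OBJS_PER_ZONE;
}
```

With four zones and a request to free 100 objects from the last zone, the global model frees 400 objects in total, 300 of which (the (N-1)/N share) came from zones that had no shortage; the per-zone model frees exactly 100.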
[ snip lecture on NUMA VM 101 - I got that at SGI w.r.t. Irix more than

The allocation API exposes per-node allocation, not zones. The zones are the internal implementation of the API, not what people use

I suspect you didn't read what I wrote, so I'll repeat it. XFS has reclaimed inodes in optimal IO order for several releases and so

Sounds to me like a per-node LRU/shrinker arrangement is an abstraction that the VM could work with. Indeed, make it run only from the *per-node kswapd* instead of from direct reclaim, and we'd also solve the unbound reclaim parallelism problem at the same time...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
Of course it exposes zones (with GFP flags). In fact they were exposed

You were talking about XFS's own inode reclaim code? My patches of course don't change that. I would like to see them usable by XFS as well of course, but I'm not forcing anything to be shoehorned in where it doesn't fit properly yet.

The Linux inode reclaimer is pretty well "random" from the POV of disk order, as you know.

I don't have the complete answer about how to write back required inode information in IO optimal order, and at the same time make optimal reclaiming choices. It could be that a 2 stage reclaim process is enough (have the Linux inode reclaim make the thing and make it eligible for IO and real reclaiming, then have an inode writeout pass that does IO optimal reclaiming from those). That is really quite speculative and out of scope of this patch set. But the point is that this patch set doesn't prohibit anything like

The zone really is the right place. If you do it per node, then you can still have shortages in one node in a zone but not

That's also out of scope, but it is among the things being considered, as far as I know (along with capping the number of threads in reclaim etc). But doing zone LRUs doesn't change this either -- kswapd pagecache reclaim also works per node, by simply processing all the zones that belong to the node.
--
I'm not sure which data structure is best. I can only say that the current
zone-unaware slab shrinker can produce the following sad scenarios:
o A DMA zone shortage invokes the shrinker, and plenty of icache in the NORMAL zone is dropped
o A NUMA-aware system enables zone_reclaim_mode, but shrink_slab() still
drops icache from unrelated zones
Both cause performance regressions. In other words, Linux does not have
a flat memory model, so I don't think Nick's basic concept is wrong.
It's a straightforward enhancement. But if it doesn't fit the current shrinkers,
I'd like to discuss how to make a better data structure.
And I have a dumb question (sorry, I don't know xfs at all). The current
xfs_mount is below.
typedef struct xfs_mount {
...
struct shrinker m_inode_shrink; /* inode reclaim shrinker */
} xfs_mount_t;
Do you mean xfs can't convert shrinker to shrinker[ZONES]? If so, why?
Thanks.
--
Well if XFS were to use per-ZONE shrinkers, it would remain with a single shrinker context per-sb like it has now, but it would divide its object management into per-zone structures. Subsystems that aren't important, or that don't take much memory or have much reclaim throughput, are free to ignore the zone argument and keep using the global input to the shrinker.
--
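A minimal userspace sketch of that idea follows: one shrinker callback shape that receives a zone, with one cache that honours the argument and one that ignores it. All names here (`zone_model`, `zoned_cache`, `global_cache`, `zoned_cache_shrink()`, `global_cache_shrink()`, `MAX_ZONES`) are hypothetical stand-ins for illustration; this is not the shrinker API from the patch set or from any kernel release.

```c
#include <stddef.h>

#define MAX_ZONES 4  /* hypothetical zone count for this sketch */

/* Hypothetical stand-in for struct zone; not the kernel definition. */
struct zone_model {
	int id;
};

/* A cache that keeps its reclaimable-object counts per zone. */
struct zoned_cache {
	long nr_lru[MAX_ZONES];
};

/* A small cache that keeps a single global count. */
struct global_cache {
	long nr_lru;
};

/* Honours the zone argument: only objects in zone z are freed. */
static long zoned_cache_shrink(struct zoned_cache *c,
			       const struct zone_model *z, long nr_to_scan)
{
	long *nr = &c->nr_lru[z->id];
	long freed = nr_to_scan < *nr ? nr_to_scan : *nr;

	*nr -= freed;
	return freed;
}

/* Ignores the zone argument and shrinks from the global pool. */
static long global_cache_shrink(struct global_cache *c,
				const struct zone_model *z, long nr_to_scan)
{
	long freed = nr_to_scan < c->nr_lru ? nr_to_scan : c->nr_lru;

	(void)z;  /* unimportant caches may disregard the zone */
	c->nr_lru -= freed;
	return freed;
}
```

Both callbacks share one signature, so a caller can pass the zone under pressure to every registered shrinker and leave it to each subsystem to decide whether the argument matters.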
Oops, my fault ;) Yes, my intention is converting mp->m_perag_tree to per-zone. --
<sigh>

I don't think anyone wants per-ag X per-zone reclaim lists on a 1024 node machine with a 1,000 AG (1PB) filesystem.

As I have already said, the XFS inode caches are optimised in structure to minimise IO and maximise internal filesystem parallelism. They are not optimised for per-cpu or NUMA scalability because if you don't have filesystem level parallelism, you can't scale to large numbers of concurrent operations across large numbers of CPUs in the first place.

In the case of XFS, per-allocation group is the way we scale internal parallelism and as long as you have more AGs than you have CPUs, there is very good per-CPU scalability through the filesystem because most operations are isolated to a single AG. That is how we scale parallelism in XFS, and it has proven to scale pretty well for even the largest of NUMA machines.

This is what I mean about there being an impedance mismatch between the way the VM and the VFS/filesystem caches scale. Fundamentally, the way filesystems want their caches to operate for optimal performance can be vastly different to the way you want shrinkers to operate for VM scalability. Forcing the MM way of doing stuff down into the LRUs and shrinkers is not a good way of solving this

Having a global lock in a shrinker is already a major point of contention because shrinkers have unbound parallelism. Hence all shrinkers need to be converted to use scalable structures. What we need _first_ is the infrastructure to do this in a sane manner, not tie a couple of shrinkers tightly into the mm structures and then walk away.

And FWIW, most subsystems that use shrinkers can be compiled in as modules or not compiled in at all. That'll probably leave #ifdef CONFIG_ crap all through the struct zone definition as they are converted to use your current method....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
Maybe not, but a 1024 node machine will *definitely* need to minimise interconnect traffic and remote memory access. So if each node can't spare enough memory for a couple of thousand LRU list heads, then XFS's per-ag LRUs may need rethinking (they may provide reasonable scalability on well partitioned workloads, but they can not help the reclaimers to do the right thing -- remote memory accesses will

And as I have already said, nothing in my patches changes that. What it provides is *opportunity* for shrinkers to take advantage of per-zone scalability and improved reclaim patterns. Nothing

It isn't forcing anything. Maybe you didn't understand the patch

Per zone is the way to do it. Shrinkers and reclaim concept is already tightly coupled with the mm. Memory pressure and the need to reclaim occurs solely and only as a function of a zone (or zones). Adding the zone argument to the shrinker does nothing more than adding that previously missing input to the shrinker.

"I have a memory shortage in this zone, so I need to free reclaimable objects from this zone"

This is a pretty core memory managementy idea. If you "decouple" shrinkers from mm any further, then you end up with something that

I haven't thought about how random drivers will do per-zone things. Obviously not an all out dumping ground in struct zone, but it does fit for critical central caches like page, inode, and dentry.

Even if they aren't compiled out, we don't want their size bloating things too much if they aren't loaded or in use. Probably dynamic allocation would be the best way to go for them. Pretty simple really.
--
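One way to picture the dynamic allocation suggestion is a subsystem that allocates its own per-zone LRU state when it registers, instead of adding fields to struct zone. The following is a userspace sketch under that assumption only; `lru_model`, `zoned_lru`, `zoned_lru_init()`, `zoned_lru_destroy()` and `zoned_lru_total()` are hypothetical names, and `calloc()` stands in for whatever allocator a real implementation would use.

```c
#include <stdlib.h>

/* Hypothetical per-zone LRU state owned by the subsystem. */
struct lru_model {
	long nr_lru;
};

/* One array entry per zone, allocated at registration time. */
struct zoned_lru {
	int nr_zones;
	struct lru_model *per_zone;
};

/* Allocate zeroed per-zone state; returns 0 on success, -1 on failure. */
static int zoned_lru_init(struct zoned_lru *l, int nr_zones)
{
	if (nr_zones <= 0)
		return -1;
	l->per_zone = calloc((size_t)nr_zones, sizeof(*l->per_zone));
	if (!l->per_zone)
		return -1;
	l->nr_zones = nr_zones;
	return 0;
}

/* Release the per-zone state when the subsystem is unloaded. */
static void zoned_lru_destroy(struct zoned_lru *l)
{
	free(l->per_zone);
	l->per_zone = NULL;
	l->nr_zones = 0;
}

/* Sum the per-zone counts, in the spirit of get_nr_inodes_unused(). */
static long zoned_lru_total(const struct zoned_lru *l)
{
	long nr = 0;
	int i;

	for (i = 0; i < l->nr_zones; i++)
		nr += l->per_zone[i].nr_lru;
	return nr;
}
```

In this arrangement a subsystem that is not loaded costs no memory per zone, and struct zone needs no per-subsystem #ifdef blocks; the trade-off is an extra pointer dereference on each per-zone access.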
