Re: [patch 31/35] fs: icache per-zone inode LRU

Previous thread: [patch 17/35] fs: icache RCU free inodes by npiggin on Monday, October 18, 2010 - 8:42 pm. (1 message)

Next thread: [patch 28/35] fs: icache split writeback and lru locks by npiggin on Monday, October 18, 2010 - 8:42 pm. (1 message)
From: npiggin
Date: Monday, October 18, 2010 - 8:42 pm

Per-zone LRUs and shrinkers for inode cache.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/inode.c             |   84 ++++++++++++++++++++++++++++---------------------
 include/linux/mmzone.h |    7 ++++
 2 files changed, 56 insertions(+), 35 deletions(-)

Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2010-10-19 14:38:31.000000000 +1100
+++ linux-2.6/fs/inode.c	2010-10-19 14:39:04.000000000 +1100
@@ -34,7 +34,7 @@
  *   s_inodes, i_sb_list
  * inode_hash_bucket lock protects:
  *   inode hash table, i_hash
- * inode_lru_lock protects:
+ * zone->inode_lru_lock protects:
  *   inode_lru, i_lru
  * wb->b_lock protects:
  *   b_io, b_more_io, b_dirty, i_io, i_lru
@@ -49,7 +49,7 @@
  * Ordering:
  * inode->i_lock
  *   inode_list_lglock
- *   inode_lru_lock
+ *   zone->inode_lru_lock
  *   wb->b_lock
  *   inode_hash_bucket lock
  */
@@ -100,8 +100,6 @@
  * allowing for low-overhead inode sync() operations.
  */
 
-static LIST_HEAD(inode_lru);
-
 struct inode_hash_bucket {
 	struct hlist_bl_head head;
 };
@@ -127,8 +125,6 @@
 DECLARE_LGLOCK(inode_list_lglock);
 DEFINE_LGLOCK(inode_list_lglock);
 
-static DEFINE_SPINLOCK(inode_lru_lock);
-
 /*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
  * icache shrinking path, and the umount path.  Without this exclusion,
@@ -166,7 +162,12 @@
 
 int get_nr_inodes_unused(void)
 {
-	return inodes_stat.nr_unused;
+	int nr = 0;
+	struct zone *z;
+
+	for_each_populated_zone(z)
+		nr += z->inode_nr_lru;
+	return nr;
 }
 
 /*
@@ -177,6 +178,7 @@
 {
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 	inodes_stat.nr_inodes = get_nr_inodes();
+	inodes_stat.nr_unused = get_nr_inodes_unused();
 	return proc_dointvec(table, write, buffer, lenp, ppos);
 #else
 	return -ENOSYS;
@@ -440,10 +442,12 @@
  */
 void __inode_lru_list_add(struct inode *inode)
 ...
From: Dave Chinner
Date: Tuesday, October 19, 2010 - 5:38 am

Regardless of whether this is the right way to scale or not, I don't
like the fact that this moves the cache LRUs into the memory
management structures, and expands the use of MM specific structures
throughout the code. It ties the cache implementation to the current
VM implementation. That, IMO, goes against all the principles of
modularisation at the source code level, and it means we have to tie
all shrinker implementations to the current internal implementation
of the VM. I don't think that is a wise thing to do because of the
dependencies and impedance mismatches it introduces.

As an example: XFS inodes to be reclaimed are simply tagged in a
radix tree so the shrinker can reclaim inodes in optimal IO order
rather than strict LRU order. It simply does not match a zone-based
shrinker implementation in any way, shape or form, nor does its
inherent parallelism match that of the way shrinkers are called.

Any change in shrinker infrastructure needs to be able to handle
these sorts of impedance mismatches between the VM and the cache
subsystem. The current API doesn't handle this very well, either,
so it's something that we need to fix so that scalability is easy
for everyone.

Anyway, my main point is that tying the LRU and shrinker scaling to
the implementation of the VM is a one-off solution that doesn't work
for generic infrastructure. Other subsystems need the same
large-machine scaling treatment, and there's no way we should be
tying them all into the struct zone. It needs further abstraction.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Tuesday, October 19, 2010 - 7:35 pm

[I should have cc'ed this one to linux-mm as well, so I quote your
reply in full here]


The zone structure really is the basic unit of memory abstraction
in the whole zoned VM concept (which covers different properties
of both physical address and NUMA cost).

The zone contains structures for memory management that aren't
otherwise directly related to one another. Generic page waitqueues,
page allocator structures, pagecache reclaim structures, memory model
data, and various statistics.

Structures to reclaim inodes from a particular zone belong in the
zone struct as much as those to reclaim pagecache or anonymous
memory from that zone too. It actually fits far better in here than
globally, because all our allocation/reclaiming/watermarks etc is
driven per-zone.


It's very fundamental. We allocate memory from, and have to reclaim
memory from -- zones. Memory reclaim is driven based on how the VM
wants to reclaim memory: nothing you can do to avoid some linkage
between the two.

Look at it this way. The dumb global shrinker is also tied to an
MM implementation detail, but that detail in fact does *not* match
the reality of the MM, and so it has all these problems interacting
with real reclaim.

What problems? OK, on an N zone system (assuming equal zones and
even distribution of objects around memory), then if there is a shortage
on a particular zone, slabs from _all_ zones are reclaimed. We reclaim
a factor of N too many objects. In a NUMA situation, we also touch
remote memory with a chance (N-1)/N.

As the number of nodes grows beyond 2, this quickly goes downhill.
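
As a rough illustration of the arithmetic above, here is a toy userspace model (not kernel code). The function names are hypothetical, and it assumes equal-sized zones with objects spread evenly across them, as stated in the message.

```c
#include <assert.h>

/*
 * Toy model of the over-reclaim argument; not kernel code.
 * Assumption: nr_zones equal zones, objects evenly distributed.
 */

/*
 * Objects a global (zone-unaware) shrinker ends up freeing so that
 * 'wanted' of them come from the one zone that is short: N times as many.
 */
unsigned long global_reclaim_cost(unsigned long wanted, unsigned int nr_zones)
{
	assert(nr_zones > 0);
	return wanted * nr_zones;
}

/*
 * Chance that an object taken from a global LRU lives in a zone other
 * than the local one: (N-1)/N.
 */
double remote_touch_chance(unsigned int nr_zones)
{
	assert(nr_zones > 0);
	return (double)(nr_zones - 1) / nr_zones;
}
```

With 4 zones, freeing 100 objects from the short zone costs about 400 objects overall, and three out of four objects touched are remote.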

In summary, there needs to be some knowledge of how MM reclaims memory
in memory reclaim shrinkers -- simply can't do a good implementation
without that. If the zone concept changes, the MM gets turned upside

This is another problem, similar to what we have in pagecache. In
the pagecache, we need to clean pages in optimal IO order, but we
still reclaim them according to some LRU order.

If you reclaim ...
From: Nick Piggin
Date: Tuesday, October 19, 2010 - 8:12 pm

Gah. Try again.

--

From: Dave Chinner
Date: Wednesday, October 20, 2010 - 2:43 am

[ snip lecture on NUMA VM 101 - I got that at SGI w.r.t. Irix more than

The allocation API exposes per-node allocation, not zones. The zones
are the internal implementation of the API, not what people use

I suspect you didn't read what I wrote, so I'll repeat it. XFS has
reclaimed inodes in optimal IO order for several releases and so


Sounds to me like a per-node LRU/shrinker arrangement is an
abstraction that the VM could work with. Indeed, make it run only
from the *per-node kswapd* instead of from direct reclaim, and we'd
also solve the unbound reclaim parallelism problem at the same
time...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Wednesday, October 20, 2010 - 3:02 am

Of course it exposes zones (with GFP flags). In fact they were exposed

You were talking about XFS's own inode reclaim code? My patches
of course don't change that. I would like to see them usable by
XFS as well of course, but I'm not forcing anything to be
shoehorned in where it doesn't fit properly yet.

The Linux inode reclaimer is pretty well "random" from POV of
disk order, as you know.

I don't have the complete answer about how to write back required
inode information in IO optimal order, and at the same time make
reclaim optimal reclaiming choices.

It could be that a 2 stage reclaim process is enough (have the
Linux inode reclaim make the thing and make it eligible for IO
and real reclaiming, then have an inode writeout pass that does
IO optimal reclaiming from those).

That is really quite speculative and out of scope of this patch set.
But the point is that this patch set doesn't prohibit anything like

The zone really is the right place. If you do it per node, then
you can still have shortages in one node in a zone but not

That's also out of scope, but it is among things being
considered, as far as I know (along with capping number of
threads in reclaim etc). But doing zone LRUs doesn't change
this either -- kswapd pagecache reclaim also works per node,
by simply processing all the zones that belong to the node.

--

From: KOSAKI Motohiro
Date: Tuesday, October 19, 2010 - 8:14 pm

I'm not sure what data structure is best. I can only say that the current
zone-unaware slab shrinker can lead to the following sad scenarios.

 o A DMA zone shortage triggers reclaim, and plenty of icache in the NORMAL
   zone is dropped
 o A NUMA-aware system enables zone_reclaim_mode, but shrink_slab() still
   drops unrelated zones' icache

Both cause performance regressions. In other words, Linux does not have
a flat memory model, so I don't think Nick's basic concept is wrong.
It's a straightforward enhancement. But if it doesn't fit the current
shrinkers, I'd like to discuss how to make a better data structure.



And I have a dumb question (sorry, I don't know XFS at all). The current
xfs_mount is below.

typedef struct xfs_mount {
 ...
        struct shrinker         m_inode_shrink; /* inode reclaim shrinker */
} xfs_mount_t;


Do you mean xfs can't convert shrinker to shrinker[ZONES]? If so, why?


Thanks.



--

From: Nick Piggin
Date: Tuesday, October 19, 2010 - 8:20 pm

Well if XFS were to use per-ZONE shrinkers, it would remain with a
single shrinker context per-sb like it has now, but it would divide
its object management into per-zone structures.

For subsystems that aren't important, don't take much memory or have
much reclaim throughput, they are free to ignore the zone argument
and keep using the global input to the shrinker.
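
A minimal userspace sketch of that idea follows. It is not the interface from the patch set: `model_cache`, `model_shrink`, `MODEL_NR_ZONES` and `MODEL_ANY_ZONE` are all hypothetical names. It only shows one shrinker entry point that either reclaims from the zone it is given or, for a caller that ignores zones, scans across all of them.

```c
#include <assert.h>

#define MODEL_NR_ZONES 4	/* hypothetical zone count for the model */
#define MODEL_ANY_ZONE (-1)	/* hypothetical "no zone given" value */

/* Per-zone count of reclaimable objects held by one cache. */
struct model_cache {
	unsigned long nr_lru[MODEL_NR_ZONES];
};

/* Free up to nr objects from one zone's list; returns how many were freed. */
static unsigned long shrink_one(struct model_cache *c, int zone,
				unsigned long nr)
{
	unsigned long freed = nr < c->nr_lru[zone] ? nr : c->nr_lru[zone];

	c->nr_lru[zone] -= freed;
	return freed;
}

/*
 * Zone-aware shrink: touch only the zone that is short. A caller that
 * passes MODEL_ANY_ZONE gets the old global behaviour, with objects
 * taken from whichever zones come first.
 */
unsigned long model_shrink(struct model_cache *c, int zone, unsigned long nr)
{
	unsigned long freed = 0;
	int z;

	assert(zone >= MODEL_ANY_ZONE && zone < MODEL_NR_ZONES);
	if (zone != MODEL_ANY_ZONE)
		return shrink_one(c, zone, nr);

	for (z = 0; z < MODEL_NR_ZONES && freed < nr; z++)
		freed += shrink_one(c, z, nr - freed);
	return freed;
}
```

The single entry point mirrors the "single shrinker context" described above; only the object bookkeeping is split per zone.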

--

From: KOSAKI Motohiro
Date: Tuesday, October 19, 2010 - 8:29 pm

Oops, my fault ;)
Yes, my intention is converting mp->m_perag_tree to per-zone.




--

From: Dave Chinner
Date: Wednesday, October 20, 2010 - 3:19 am

<sigh>

I don't think anyone wants per-ag X per-zone reclaim lists on a 1024
node machine with a 1,000 AG (1PB) filesystem.
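
To put a rough number on that, the sketch below counts one list head per (AG, zone) pair. It is a toy calculation, not XFS code: the helper names are hypothetical, `model_list_head` is a stand-in for the kernel's two-pointer list head, and the example figure assumes a single populated zone per node.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for the kernel's struct list_head: two pointers. */
struct model_list_head {
	struct model_list_head *next, *prev;
};

/* One reclaim list per (allocation group, zone) pair. */
unsigned long nr_reclaim_lists(unsigned long nr_ags, unsigned long nr_zones)
{
	/* Guard against overflow in the multiplication. */
	assert(nr_zones == 0 || nr_ags <= ULONG_MAX / nr_zones);
	return nr_ags * nr_zones;
}

/* Memory taken by the list heads alone, in bytes. */
unsigned long reclaim_lists_bytes(unsigned long nr_ags, unsigned long nr_zones)
{
	return nr_reclaim_lists(nr_ags, nr_zones) *
	       sizeof(struct model_list_head);
}
```

Under those assumptions, 1,000 AGs times 1,024 zones gives 1,024,000 lists per filesystem, before counting any locks or counters that go with each list.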

As I have already said, the XFS inode caches are optimised in
structure to minimise IO and maximise internal filesystem
parallelism. They are not optimised for per-cpu or NUMA scalability
because if you don't have filesystem level parallelism, you can't
scale to large numbers of concurrent operations across large numbers
of CPUs in the first place.

In the case of XFS, per-allocation group is the way we scale
internal parallelism and as long as you have more AGs than you have
CPUs, there is very good per-CPU scalability through the filesystem
because most operations are isolated to a single AG.  That is how we
scale parallelism in XFS, and it has proven to scale pretty well for
even the largest of NUMA machines. 

This is what I mean about there being an impedance mismatch between
the way the VM and the VFS/filesystem caches scale. Fundamentally,
the way filesystems want their caches to operate for optimal
performance can be vastly different to the way you want shrinkers to
operate for VM scalability. Forcing the MM way of doing stuff down
into the LRUs and shrinkers is not a good way of solving this

Having a global lock in a shrinker is already a major point of
contention because shrinkers have unbound parallelism.  Hence all
shrinkers need to be converted to use scalable structures. What we
need _first_ is the infrastructure to do this in a sane manner, not
tie a couple of shrinkers tightly into the mm structures and then
walk away.

And FWIW, most subsystems that use shrinkers can be compiled in as
modules or not compiled in at all. That'll probably leave #ifdef
CONFIG_ crap all through the struct zone definition as they are
converted to use your current method....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Wednesday, October 20, 2010 - 3:41 am

Maybe not, but a 1024 node machine will *definitely* need to minimise
interconnect traffic and remote memory access. So if each node can't
spare enough memory for a couple of thousand LRU list heads, then
XFS's per-ag LRUs may need rethinking (they may provide reasonable
scalability on well partitioned workloads, but they can not help
the reclaimers to do the right thing -- remote memory accesses will

And as I have already said, nothing in my patches changes that.
What it provides is *opportunity* for shrinkers to take advantage
of per-zone scalability and improved reclaim patterns. Nothing

It isn't forcing anything. Maybe you didn't understand the patch

Per zone is the way to do it. The shrinker and reclaim concepts are
already tightly coupled with the mm. Memory pressure and the need
to reclaim occur solely and only as a function of a zone (or zones).
Adding the zone argument to the shrinker does nothing more than adding
that previously missing input to the shrinker.

"I have a memory shortage in this zone, so I need to free reclaimable
objects from this zone"

This is a pretty core memory managementy idea. If you "decouple"
shrinkers from mm any further, then you end up with something that

I haven't thought about how random drivers will do per-zone things.
Obviously not an all out dumping ground in struct zone, but it does
fit for critical central caches like page, inode, and dentry.

Even if they aren't compiled out, we don't want their size bloating
things too much if they aren't loaded or in use. Probably dynamic
allocation would be the best way to go for them. Pretty simple really.

--
