Re: [PATCH 2/3] Refactor zone_reclaim

Previous thread: Re: [apparmor] [PATCH v2] APPARMOR: add sid to profile mapping and sidrecycling by Seth Arnold on Tuesday, November 30, 2010 - 3:12 am. (1 message)

Next thread: [PATCH] DOCUMENTATION: Correct inline docs to match parm name. by Robert P. J. Day on Tuesday, November 30, 2010 - 4:31 am. (3 messages)
From: Balbir Singh
Date: Tuesday, November 30, 2010 - 3:14 am

The following series implements page cache control,
this is a split out version of patch 1 of version 3 of the
page cache optimization patches posted earlier at
http://www.mail-archive.com/kvm@vger.kernel.org/msg43654.html

Christoph Lamater recommended splitting out patch 1, which
is what this series does

Detailed Description
====================
This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario

- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that the with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VM machines and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
and the ...
From: Balbir Singh
Date: Tuesday, November 30, 2010 - 3:15 am

This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |    4 ++--
 mm/vmscan.c            |    2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 	 */
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
 	unsigned long		min_unmapped_pages;
+#ifdef CONFIG_NUMA
+	int node;
 	unsigned long		min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8cc90d5..325443a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2644,7 +2644,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2854,7 +2853,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 
 	return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable

--

From: Christoph Lameter
Date: Tuesday, November 30, 2010 - 12:18 pm

Reviewed-by: Christoph Lameter <cl@linux.com>


--

From: Andrew Morton
Date: Tuesday, November 30, 2010 - 3:23 pm

On Tue, 30 Nov 2010 15:45:12 +0530

Thereby adding a nice dollop of bloat to everyone's kernel.  I don't
think that is justifiable given that the audience for this feature is
about eight people :(

How's about CONFIG_UNMAPPED_PAGECACHE_CONTROL?

Also this patch instantiates sysctl_min_unmapped_ratio and
sysctl_min_slab_ratio on non-NUMA builds but fails to make those
tunables actually tunable in procfs.  Changes to sysctl.c are

More careful reviewers, please.
--

From: Balbir Singh
Date: Tuesday, November 30, 2010 - 10:21 pm

My local MTA failed to deliver the message, trying again.

-- 
	Three Cheers,
	Balbir
--

From: Balbir Singh
Date: Tuesday, November 30, 2010 - 3:15 am

Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
 mm/vmscan.c |   35 +++++++++++++++++++++++------------
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 325443a..0ac444f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2719,6 +2719,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
+				unsigned long nr_pages)
+{
+	int priority;
+	/*
+	 * Free memory by calling shrink zone with increasing
+	 * priorities until we have enough memory freed.
+	 */
+	priority = ZONE_RECLAIM_PRIORITY;
+	do {
+		shrink_zone(priority, zone, sc);
+		priority--;
+	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2727,7 +2748,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2751,17 +2771,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-		/*
-		 * Free memory by calling shrink zone with increasing
-		 * priorities until we have enough memory freed.
-		 ...
From: Christoph Lameter
Date: Tuesday, November 30, 2010 - 12:19 pm

Reviewed-by: Christoph Lameter <cl@linux.com>

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, November 30, 2010 - 6:23 pm

On Tue, 30 Nov 2010 15:45:55 +0530

Why is this min_mapped_pages based on zone (IOW, per-zone) ?


Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, November 30, 2010 - 10:22 pm

My local MTA failed to deliver the message, trying again. 

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, December 1, 2010 - 1:59 am

On Wed, 1 Dec 2010 10:52:18 +0530

Sorry, what I wanted to here was:

Why min_mapped_pages per zone ?
Why you don't add "limit_for_unmapped_page_cache_size" for the whole system ?

I guess what you really want is "limit_for_unmapped_page_cache_size".

Then, you have to use this kind of mysterious code.
==
(zone_unmapped_file_pages(zone) >
+			UNMAPPED_PAGE_RATIO * zone->min_unmapped_pages))

Thanks,
-Kame

--

From: Minchan Kim
Date: Wednesday, December 1, 2010 - 12:54 am

Hi Balbir,


I don't see any specific logic about naming
"zone_reclaim_unmapped_pages" in your function.
Maybe, caller of this function have to handle sc->may_unmap. So, this
function's naming
is not good.
As I see your logic, the function name would be just "zone_reclaim_pages"
If you want to name it with zone_reclaim_unmapped_pages, it could be
better with setting sc->may_unmap in this function.


-- 
Kind regards,
Minchan Kim
--

From: Balbir Singh
Date: Tuesday, November 30, 2010 - 3:16 am

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
 include/linux/swap.h |    5 ++-
 mm/page_alloc.c      |    7 +++--
 mm/vmscan.c          |   72 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 79 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index eba53e7..78b0830 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,11 +252,12 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern bool should_balance_unmapped_pages(struct zone *zone);
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62b7280..4228da3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1662,6 +1662,9 @@ zonelist_scan:
 			unsigned long mark;
 			int ret;
 
+			if (should_balance_unmapped_pages(zone))
+				wakeup_kswapd(zone, order);
+
 			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 			if (zone_watermark_ok(zone, order, mark,
 				    classzone_idx, alloc_flags))
@@ -4136,10 +4139,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
-		zone->node = nid;
 		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
 						/ 100;
+#ifdef CONFIG_NUMA
+		zone->node = nid;
 		zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
 ...
From: Christoph Lameter
Date: Tuesday, November 30, 2010 - 12:21 pm

Looks good.

Reviewed-by: Christoph Lameter <cl@linux.com>


--

From: Andrew Morton
Date: Tuesday, November 30, 2010 - 3:25 pm

On Tue, 30 Nov 2010 15:46:31 +0530


gack, this is on the page allocator fastpath, isn't it?  So
99.99999999% of the world's machines end up doing a pointless call to a
pointless function which pointlessly tests a pointless global and
pointlessly returns?  All because of some whacky KSM thing?

The speed and space overhead of this code should be *zero* if
!CONFIG_UNMAPPED_PAGECACHE_CONTROL and should be minimal if
CONFIG_UNMAPPED_PAGECACHE_CONTROL=y.  The way to do the latter is to
inline the test of unmapped_page_control into callers and only if it is

aw c'mon guys, everybody knows that when you add a kernel parameter you


The problem I have with this comment is that it uses the term "balance"
without ever defining it.  Plus "balance" is already a term which is used
in memory reclaim.

So if you can think up a unique noun then that's good but whether or
not that is done, please describe with great care what that term


Well.  Giving 16 a name didn't really clarify anything.  Attentive
readers will want to know what this does, why 16 was chosen and what

So you're OK with shoving all this flotsam into 100,000,000 cellphones? 
This was a pretty outrageous patchset!


--

From: Balbir Singh
Date: Tuesday, November 30, 2010 - 10:22 pm

My local MTA failed to deliver the message, trying again. 

-- 
	Three Cheers,
	Balbir
--

From: Minchan Kim
Date: Wednesday, December 1, 2010 - 1:18 am

Yes. Embedded people(at least, me) want it. That's because they don't
have any swap device so they could reclaim only page cache page.
And many page cache pages are mapped at address space of
application(ex, android uses java model so many pages are mapped by
application's address space). It means it's hard to reclaim them
without lagging.
So I have a interest in this patch.

-- 
Kind regards,
Minchan Kim
--

From: Christoph Lameter
Date: Wednesday, December 1, 2010 - 8:01 am

The meaning is analoguous to the other zone reclaim ratio. But yes it

This is a feature that has been requested over and over for years. Using
/proc/vm/drop_caches for fixing situations where one simply has too many
page cache pages is not so much fun in the long run.

--

From: KOSAKI Motohiro
Date: Wednesday, December 1, 2010 - 6:22 pm

I'm not against page cache limitation feature at all. But, this is
too ugly and too destructive fast path. I hope this patch reduce negative
impact more.


--

From: KAMEZAWA Hiroyuki
Date: Wednesday, December 1, 2010 - 7:50 pm

On Thu,  2 Dec 2010 10:22:16 +0900 (JST)

And I think min_mapped_unmapped_pages is ugly. It should be
"unmapped_pagecache_limit" or some because it's for limitation feature.

Thanks,
-Kame

--

From: Balbir Singh
Date: Thursday, December 2, 2010 - 12:01 am

The feature will now be enabled with a CONFIG and boot parameter, I
find changing the naming convention now - it is already in use and
well known is not a good idea. THe name of the boot parameter can be
changed of-course. 

-- 
	Three Cheers,
	Balbir
--

From: KOSAKI Motohiro
Date: Tuesday, November 30, 2010 - 5:14 pm

You can't invoke any reclaim from here. It is in zone balancing detection


This code break page-cache/slab balancing logic. And this is conflict
against Nick's per-zone slab effort.

Plus, high-order + priority=5 reclaim Simon's case. (see "Free memory never 





Hmm....

As far as I reviewed, I can't find any reason why this patch works as expected.
So, I think cleancache looks promising more than this idea. Have you seen Dan's
patch? I would suggested discuss him.

Thanks.


--

From: Balbir Singh
Date: Tuesday, November 30, 2010 - 10:22 pm

My local MTA failed to deliver the message, trying again. 

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, November 30, 2010 - 6:32 pm

On Tue, 30 Nov 2010 15:46:31 +0530

Hm, I'm not sure the final vision of this feature. Does this reclaiming feature
can't be called directly via balloon driver just before alloc_page() ?

Do you need to keep page caches small even when there are free memory on host ?

Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, November 30, 2010 - 10:22 pm

My local MTA failed to deliver the message, trying again. 

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, November 30, 2010 - 10:35 pm

On Wed, 1 Dec 2010 10:52:59 +0530

That's a point. Then, why the guest has to do _extra_ work for host even when
the host says nothing ? I think trigger this by guests themselves is not very good.

Thanks,
-Kame



--

From: Balbir Singh
Date: Tuesday, November 30, 2010 - 11:40 pm

I've mentioned it before, the guest keeping free memory without a
large performance hit, helps, the balloon driver is able to quickly
retrieve this memory if required or the guest can use this memory for
some other application/task. The cached data is mostly already present
in the host page cache.

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, December 1, 2010 - 12:24 am

On Wed, 1 Dec 2010 12:10:43 +0530

Why ? Are there parameters/stats which shows this is _true_ ? How we can
guarantee/show it to users ?
Please add an interface to show "shared rate between guest/host" If not,
any admin will not turn this on because "file cache status on host" is a
black box for guest admins. I think this patch skips something important steps.

2nd point is maybe for reducing total host memory usage and for increasing
the number of guests on a host. For that, this feature is useful only when all guests
on a host are friendly and devoted to the health of host memory management because
all setting must be done in the guest. This can be passed as even by qemu's command line
argument. And _no_ benefit for the guests who reduce it's resource to help
host management because there is no guarantee dropped caches are on host memory.


So, for both claim, I want to see an interface to show the number of shared pages
between hosts and guests rather than imagine it.

BTW, I don't like this kind of "please give us your victim, please please please"
logic. The host should be able to "steal" what it wants in force.
Then, I think there should be no On/Off visible interfaces. The vm firmware
should tell to turn on this if administrator of the host wants.

BTW2, please test with some other benchmarks (which read file caches.)
I don't think kernel make is good test for this.

Thanks,
-Kame

--

Previous thread: Re: [apparmor] [PATCH v2] APPARMOR: add sid to profile mapping and sidrecycling by Seth Arnold on Tuesday, November 30, 2010 - 3:12 am. (1 message)

Next thread: [PATCH] DOCUMENTATION: Correct inline docs to match parm name. by Robert P. J. Day on Tuesday, November 30, 2010 - 4:31 am. (3 messages)