The following series implements page cache control, this is a split out version of patch 1 of version 3 of the page cache optimization patches posted earlier at http://www.mail-archive.com/kvm@vger.kernel.org/msg43654.html Christoph Lamater recommended splitting out patch 1, which is what this series does Detailed Description ==================== This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested for unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need to use basis. A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that the with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the ...
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
include/linux/mmzone.h | 4 ++--
mm/vmscan.c | 2 --
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
-#ifdef CONFIG_NUMA
- int node;
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
+#ifdef CONFIG_NUMA
+ int node;
unsigned long min_slab_pages;
#endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8cc90d5..325443a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2644,7 +2644,6 @@ static int __init kswapd_init(void)
module_init(kswapd_init)
-#ifdef CONFIG_NUMA
/*
* Zone reclaim mode
*
@@ -2854,7 +2853,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
return ret;
}
-#endif
/*
* page_evictable - test whether a page is evictable
--
Reviewed-by: Christoph Lameter <cl@linux.com> --
On Tue, 30 Nov 2010 15:45:12 +0530 Thereby adding a nice dollop of bloat to everyone's kernel. I don't think that is justifiable given that the audience for this feature is about eight people :( How's about CONFIG_UNMAPPED_PAGECACHE_CONTROL? Also this patch instantiates sysctl_min_unmapped_ratio and sysctl_min_slab_ratio on non-NUMA builds but fails to make those tunables actually tunable in procfs. Changes to sysctl.c are More careful reviewers, please. --
My local MTA failed to deliver the message, trying again. -- Three Cheers, Balbir --
Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
mm/vmscan.c | 35 +++++++++++++++++++++++------------
1 files changed, 23 insertions(+), 12 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 325443a..0ac444f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2719,6 +2719,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
}
/*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
+ unsigned long nr_pages)
+{
+ int priority;
+ /*
+ * Free memory by calling shrink zone with increasing
+ * priorities until we have enough memory freed.
+ */
+ priority = ZONE_RECLAIM_PRIORITY;
+ do {
+ shrink_zone(priority, zone, sc);
+ priority--;
+ } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
* Try to free up some pages from this zone through reclaim.
*/
static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2727,7 +2748,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
- int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2751,17 +2771,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
- /*
- * Free memory by calling shrink zone with increasing
- * priorities until we have enough memory freed.
- ...Reviewed-by: Christoph Lameter <cl@linux.com> --
On Tue, 30 Nov 2010 15:45:55 +0530 Why is this min_mapped_pages based on zone (IOW, per-zone) ? Thanks, -Kame --
My local MTA failed to deliver the message, trying again. -- Three Cheers, Balbir --
On Wed, 1 Dec 2010 10:52:18 +0530 Sorry, what I wanted to here was: Why min_mapped_pages per zone ? Why you don't add "limit_for_unmapped_page_cache_size" for the whole system ? I guess what you really want is "limit_for_unmapped_page_cache_size". Then, you have to use this kind of mysterious code. == (zone_unmapped_file_pages(zone) > + UNMAPPED_PAGE_RATIO * zone->min_unmapped_pages)) Thanks, -Kame --
Hi Balbir, I don't see any specific logic about naming "zone_reclaim_unmapped_pages" in your function. Maybe, caller of this function have to handle sc->may_unmap. So, this function's naming is not good. As I see your logic, the function name would be just "zone_reclaim_pages" If you want to name it with zone_reclaim_unmapped_pages, it could be better with setting sc->may_unmap in this function. -- Kind regards, Minchan Kim --
Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/swap.h | 5 ++- mm/page_alloc.c | 7 +++-- mm/vmscan.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 79 insertions(+), 5 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index eba53e7..78b0830 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -252,11 +252,12 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); +extern bool should_balance_unmapped_pages(struct zone *zone); +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 62b7280..4228da3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1662,6 +1662,9 @@ zonelist_scan: unsigned long mark; int ret; + if (should_balance_unmapped_pages(zone)) + wakeup_kswapd(zone, order); + mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; if (zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags)) @@ -4136,10 +4139,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, zone->spanned_pages = size; zone->present_pages = realsize; -#ifdef CONFIG_NUMA - zone->node = nid; zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) / 100; +#ifdef CONFIG_NUMA + zone->node = nid; zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; ...
Looks good. Reviewed-by: Christoph Lameter <cl@linux.com> --
On Tue, 30 Nov 2010 15:46:31 +0530 gack, this is on the page allocator fastpath, isn't it? So 99.99999999% of the world's machines end up doing a pointless call to a pointless function which pointlessly tests a pointless global and pointlessly returns? All because of some whacky KSM thing? The speed and space overhead of this code should be *zero* if !CONFIG_UNMAPPED_PAGECACHE_CONTROL and should be minimal if CONFIG_UNMAPPED_PAGECACHE_CONTROL=y. The way to do the latter is to inline the test of unmapped_page_control into callers and only if it is aw c'mon guys, everybody knows that when you add a kernel parameter you The problem I have with this comment is that it uses the term "balance" without ever defining it. Plus "balance" is already a term which is used in memory reclaim. So if you can think up a unique noun then that's good but whether or not that is done, please describe with great care what that term Well. Giving 16 a name didn't really clarify anything. Attentive readers will want to know what this does, why 16 was chosen and what So you're OK with shoving all this flotsam into 100,000,000 cellphones? This was a pretty outrageous patchset! --
My local MTA failed to deliver the message, trying again. -- Three Cheers, Balbir --
Yes. Embedded people(at least, me) want it. That's because they don't have any swap device so they could reclaim only page cache page. And many page cache pages are mapped at address space of application(ex, android uses java model so many pages are mapped by application's address space). It means it's hard to reclaim them without lagging. So I have a interest in this patch. -- Kind regards, Minchan Kim --
The meaning is analoguous to the other zone reclaim ratio. But yes it This is a feature that has been requested over and over for years. Using /proc/vm/drop_caches for fixing situations where one simply has too many page cache pages is not so much fun in the long run. --
I'm not against page cache limitation feature at all. But, this is too ugly and too destructive fast path. I hope this patch reduce negative impact more. --
On Thu, 2 Dec 2010 10:22:16 +0900 (JST) And I think min_mapped_unmapped_pages is ugly. It should be "unmapped_pagecache_limit" or some because it's for limitation feature. Thanks, -Kame --
The feature will now be enabled with a CONFIG and boot parameter, I find changing the naming convention now - it is already in use and well known is not a good idea. THe name of the boot parameter can be changed of-course. -- Three Cheers, Balbir --
You can't invoke any reclaim from here. It is in zone balancing detection This code break page-cache/slab balancing logic. And this is conflict against Nick's per-zone slab effort. Plus, high-order + priority=5 reclaim Simon's case. (see "Free memory never Hmm.... As far as I reviewed, I can't find any reason why this patch works as expected. So, I think cleancache looks promising more than this idea. Have you seen Dan's patch? I would suggested discuss him. Thanks. --
My local MTA failed to deliver the message, trying again. -- Three Cheers, Balbir --
On Tue, 30 Nov 2010 15:46:31 +0530 Hm, I'm not sure the final vision of this feature. Does this reclaiming feature can't be called directly via balloon driver just before alloc_page() ? Do you need to keep page caches small even when there are free memory on host ? Thanks, -Kame --
My local MTA failed to deliver the message, trying again. -- Three Cheers, Balbir --
On Wed, 1 Dec 2010 10:52:59 +0530 That's a point. Then, why the guest has to do _extra_ work for host even when the host says nothing ? I think trigger this by guests themselves is not very good. Thanks, -Kame --
I've mentioned it before, the guest keeping free memory without a large performance hit, helps, the balloon driver is able to quickly retrieve this memory if required or the guest can use this memory for some other application/task. The cached data is mostly already present in the host page cache. -- Three Cheers, Balbir --
On Wed, 1 Dec 2010 12:10:43 +0530 Why ? Are there parameters/stats which shows this is _true_ ? How we can guarantee/show it to users ? Please add an interface to show "shared rate between guest/host" If not, any admin will not turn this on because "file cache status on host" is a black box for guest admins. I think this patch skips something important steps. 2nd point is maybe for reducing total host memory usage and for increasing the number of guests on a host. For that, this feature is useful only when all guests on a host are friendly and devoted to the health of host memory management because all setting must be done in the guest. This can be passed as even by qemu's command line argument. And _no_ benefit for the guests who reduce it's resource to help host management because there is no guarantee dropped caches are on host memory. So, for both claim, I want to see an interface to show the number of shared pages between hosts and guests rather than imagine it. BTW, I don't like this kind of "please give us your victim, please please please" logic. The host should be able to "steal" what it wants in force. Then, I think there should be no On/Off visible interfaces. The vm firmware should tell to turn on this if administrator of the host wants. BTW2, please test with some other benchmarks (which read file caches.) I don't think kernel make is good test for this. Thanks, -Kame --
