Re: [PATCH 0/4] Reclaim page capture v4

Previous thread: none

Next thread: w35und wifi driver for linux-staging by Pavel Machek on Wednesday, October 1, 2008 - 5:36 am. (9 messages)
From: Andy Whitcroft
Date: Wednesday, October 1, 2008 - 5:30 am

For sometime we have been looking at mechanisms for improving the availability
of larger allocations under load.  One of the options we have explored is
the capturing of pages freed under direct reclaim in order to increase the
chances of free pages coelescing before they are subject to reallocation
by racing allocators.

Following this email is a patch stack implementing page capture during
direct reclaim.  It consits of four patches.  The first two simply pull
out existing code into helpers for reuse.  The third makes buddy's use
of struct page explicit.  The fourth contains the meat of the changes,
and its leader contains a much fuller description of the feature.

This update represents a rebase to -mm and incorporates feedback from
KOSAKI Motohiro.  It also incorporates an accounting fix which was
preventing some captures.

I have done a lot of comparitive testing with and without this patch
set and in broad brush I am seeing improvements in hugepage allocations
(worst case size) success on all of my test systems.  These tests consist
of placing a constant stream of high order allocations on the system,
at varying rates.  The results for these various runs are then averaged
to give an overall improvement.

		Absolute	Effective
x86-64		2.48%		 4.58%
powerpc		5.55%		25.22%

x86-64 has a relatively small huge page size and so is always much more
effective at allocating huge pages.  Even there we get a measurable
improvement.  On powerpc the huge pages are much larger and much harder
to recover.  Here we see a full 25% increase in page recovery.

It should be noted that these are worst case testing, and very agressive
taking every possible page in the system.  It would be helpful to get
wider testing in -mm.

Against: 2.6.27-rc1-mm1

Andrew, please consider for -mm.

-apw

Changes since V3:
 - Incorporates an anon vma fix pointed out by MinChan Kim
 - switch to using a pagevec for page capture collection

Changes since V2:
 - Incorporates review feedback from ...
From: Andy Whitcroft
Date: Wednesday, October 1, 2008 - 5:31 am

Explicitly define the struct page fields which buddy uses when it owns
pages.  Defines a new anonymous struct to allow additional fields to
be defined in a later patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/mm_types.h |    3 +++
 mm/internal.h            |    2 +-
 mm/page_alloc.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 995c588..906d8e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -70,6 +70,9 @@ struct page {
 #endif
 	    struct kmem_cache *slab;	/* SLUB: Pointer to slab */
 	    struct page *first_page;	/* Compound tail pages */
+	    struct {
+		unsigned long buddy_order;     /* buddy: free page order */
+	    };
 	};
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
diff --git a/mm/internal.h b/mm/internal.h
index c0e4859..fcedcd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,7 +58,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order);
 static inline unsigned long page_order(struct page *page)
 {
 	VM_BUG_ON(!PageBuddy(page));
-	return page_private(page);
+	return page->buddy_order;
 }
 
 extern int mlock_vma_pages_range(struct vm_area_struct *vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 921c435..3a646e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,7 +331,7 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
 
 static inline void set_page_order(struct page *page, int order)
 {
-	set_page_private(page, order);
+	page->buddy_order = order;
 	__SetPageBuddy(page);
 #ifdef CONFIG_PAGE_OWNER
 		page->order = -1;
@@ -341,7 +341,7 @@ static inline void set_page_order(struct page *page, int ...
From: KAMEZAWA Hiroyuki
Date: Thursday, October 2, 2008 - 12:06 am

On Wed,  1 Oct 2008 13:31:00 +0100

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--

From: Andy Whitcroft
Date: Wednesday, October 1, 2008 - 5:30 am

When allocating we need to confirm that the zone we are about to allocate
from is acceptable to the CPUSET we are in, and that it does not violate
the zone watermarks.  Pull these checks out so we can reuse them in a
later patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c |   62 ++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 55d8d9b..921c435 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1271,6 +1271,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 	return 1;
 }
 
+/*
+ * Return 1 if this zone is an acceptable source given the cpuset
+ * constraints.
+ */
+static inline int zone_cpuset_permits(struct zone *zone,
+					int alloc_flags, gfp_t gfp_mask)
+{
+	if ((alloc_flags & ALLOC_CPUSET) &&
+	    !cpuset_zone_allowed_softwall(zone, gfp_mask))
+		return 0;
+	return 1;
+}
+
+/*
+ * Return 1 if this zone is within the watermarks specified by the
+ * allocation flags.
+ */
+static inline int zone_watermark_permits(struct zone *zone, int order,
+			int classzone_idx, int alloc_flags, gfp_t gfp_mask)
+{
+	if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
+		unsigned long mark;
+		if (alloc_flags & ALLOC_WMARK_MIN)
+			mark = zone->pages_min;
+		else if (alloc_flags & ALLOC_WMARK_LOW)
+			mark = zone->pages_low;
+		else
+			mark = zone->pages_high;
+		if (!zone_watermark_ok(zone, order, mark,
+			    classzone_idx, alloc_flags)) {
+			if (!zone_reclaim_mode ||
+					!zone_reclaim(zone, gfp_mask, order))
+				return 0;
+		}
+	}
+	return 1;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
@@ -1424,25 +1462,11 @@ zonelist_scan:
 		if (NUMA_BUILD && zlc_active &&
 ...
From: KAMEZAWA Hiroyuki
Date: Thursday, October 2, 2008 - 12:05 am

On Wed,  1 Oct 2008 13:30:59 +0100

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--

From: Andy Whitcroft
Date: Wednesday, October 1, 2008 - 5:30 am

When we are about to release a page we perform a number of actions
on that page.  We clear down any anonymous mappings, confirm that
the page is safe to release, check for freeing locks, before mapping
the page should that be required.  Pull this processing out into a
helper function for reuse in a later patch.

Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during the inline.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c |   40 +++++++++++++++++++++++++++-------------
 1 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..55d8d9b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,32 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
+ * Prepare this page for release to the buddy.  Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+	int i;
+	int reserved = 0;
+
+	for (i = 0 ; i < (1 << order) ; ++i)
+		reserved += free_pages_check(page + i);
+	if (reserved)
+		return 0;
+
+	if (!PageHighMem(page)) {
+		debug_check_no_locks_freed(page_address(page),
+							PAGE_SIZE << order);
+		debug_check_no_obj_freed(page_address(page),
+					   PAGE_SIZE << order);
+	}
+	arch_free_page(page, order);
+	kernel_map_pages(page, 1 << order, 0);
+
+	return 1;
+}
+
+/*
  * Frees a list of pages. 
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
@@ -529,22 +555,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
-	int i;
-	int reserved = 0;
 
-	for (i = 0 ; i < (1 ...
From: KAMEZAWA Hiroyuki
Date: Thursday, October 2, 2008 - 12:05 am

On Wed,  1 Oct 2008 13:30:58 +0100

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiruyo@jp.fujitsu.com>

--

From: Andy Whitcroft
Date: Wednesday, October 1, 2008 - 5:31 am

When a process enters direct reclaim it will expend effort identifying
and releasing pages in the hope of obtaining a page.  However as these
pages are released asynchronously there is every possibility that the
pages will have been consumed by other allocators before the reclaimer
gets a look in.  This is particularly problematic where the reclaimer is
attempting to allocate a higher order page.  It is highly likely that
a parallel allocation will consume lower order constituent pages as we
release them preventing them coelescing into the higher order page the
reclaimer desires.

This patch set attempts to address this for allocations above
ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing
onto a local free list.  Instead of freeing them to the main buddy lists,
pages are collected and coelesced on this per direct reclaimer free list.
Pages which are freed by other processes are also considered, where they
coelesce with a page already under capture they will be moved to the
capture list.  When pressure has been applied to a zone we then consult
the capture list and if there is an appropriatly sized page available
it is taken immediatly and the remainder returned to the free pool.
Capture is only enabled when the reclaimer's allocation order exceeds
ALLOC_COSTLY_ORDER as free pages below this order should naturally occur
in large numbers following regular reclaim.

Thanks go to Mel Gorman for numerous discussions during the development
of this patch and for his repeated reviews.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mm_types.h   |    1 +
 include/linux/page-flags.h |    4 +
 include/linux/pagevec.h    |    1 +
 mm/internal.h              |    5 ++
 mm/page_alloc.c            |  159 +++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |  118 +++++++++++++++++++++++++++------
 6 files changed, 267 ...
From: Christoph Lameter
Date: Wednesday, October 1, 2008 - 8:01 am

The reclaim problem is due to the pcp queueing right? Could we disable pcp
queueing during reclaim? pcp processing is not necessarily a gain, so
temporarily disabling it should not be a problem.

At the beginning of reclaim just flush all pcp pages and then do not allow pcp
refills again until reclaim is finished?
--

From: Andy Whitcroft
Date: Thursday, October 2, 2008 - 7:35 am

Not entirely, some pages could get trapped there for sure.  But it is
parallel allocations we are trying to guard against.  Plus we already flush
the pcp during reclaim for higher orders.

-apw

--

From: Christoph Lameter
Date: Thursday, October 2, 2008 - 9:29 am

So we just would need to forbid refilling the pcp.

Parallel allocations are less a problem if the freed order 0 pages get merged
immediately into the order 1 freelist. Of course that will only work 50% of
the time but it will have a similar effect to this patch.

--

From: KOSAKI Motohiro
Date: Thursday, October 2, 2008 - 8:41 pm

Ah, Right.
Could we hear why you like pcp disabling than Andy's patch?

Honestly, I think pcp has some problem.
But I avoid to change pcp because I don't understand its design.

Maybe, we should discuss currect pcp behavior?



--

From: Christoph Lameter
Date: Friday, October 3, 2008 - 5:37 am

pcps are a particular problem on NUMA because the lists are replicated per

In the worst case we see that pcps cause a 5% performance drop (sequential
alloc w/o free followed by sequential free w/o allocs). See my page allocator

pcps need improvement. The performance issues with the page allocator fastpath
are likely due to bloating of the fastpaths (antifrag did not do much good on
that level). Plus current crops of processors are sensitive to cache footprint
issues (seems that the tbench regression in the network stack also are due to
the same effect). Doubly linked lists are not good  today because they touch
multiple cachelines.




--

From: KAMEZAWA Hiroyuki
Date: Thursday, October 2, 2008 - 12:24 am

On Wed,  1 Oct 2008 13:31:01 +0100

Hmm.. is this routine better than
  mm/memory_hotplug.c::do_migrate_range(start_pfn, end_pfn) ?

Thanks,
-Kame

--

From: Andy Whitcroft
Date: Thursday, October 2, 2008 - 8:02 am

Are you suggesting that it might be more adventageous to try and migrate
things out of this area as part of reclaim?  If so then I tend to agree,
though that would be a good idea generally with or without capture.

/me adds it to his todo list to test that out.

-apw
--

From: kamezawa.hiroyu
Date: Thursday, October 2, 2008 - 8:25 am

I just remember I did the same kind of work to offline pages.
Sorry for noise.

I just have an idea to support following kind of interface via memory hotplug
This makes all pages in the section to be hugepage.

 #echo huge > /sys/device/system/memory/memoryXXX/state
 (memory hotplug interface supports online/offline here.)

But no patches yet...

Thanks,
-Kame


--

From: MinChan Kim
Date: Wednesday, October 1, 2008 - 7:46 pm

Hi, Andy.

I tested your patch in my desktop.
The test is just kernel compile with single thread.
My system environment is as follows.

model name	: Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
MemTotal:        2065856 kB

When I tested vanilla, compile time is as follows.

2433.53user 187.96system 42:05.99elapsed 103%CPU (0avgtext+0avgdata
0maxresident)k
588752inputs+4503408outputs (127major+55456246minor)pagefaults 0swaps

When I tested your patch, as follows.

2489.63user 202.41system 44:47.71elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
538608inputs+4503928outputs (130major+55531561minor)pagefaults 0swaps

Regresstion almost is above 2 minutes.
Do you think It is a trivial?

I know your patch is good to allocate hugepage.
But, I think many users don't need it, including embedded system and
desktop users yet.

So I suggest you made it enable optionally.




-- 
Kinds regards,
MinChan Kim
--

From: Andy Whitcroft
Date: Thursday, October 2, 2008 - 8:04 am

Hmmm.  I would not expect to see any significant impact for this kind of
workload as we should not be triggering capture for the lower order
allocations at all.  Something screwey must be occuring.  I will go and
reproduce this here and see if I can figure out just what causes this.

-apw
--

From: KOSAKI Motohiro
Date: Thursday, October 2, 2008 - 8:25 pm

yup.
I also think this is significant regression.

if this is reproduced, that patch shouldn't be merged to -mm, IMHO.




--

From: KOSAKI Motohiro
Date: Thursday, October 2, 2008 - 11:48 pm

Ooops.

No.
if the patch has this significant regression,
nobody turn on its option.

We should fix that.


--

From: MinChan Kim
Date: Monday, October 6, 2008 - 9:29 pm

On Fri, Oct 3, 2008 at 3:48 PM, KOSAKI Motohiro

I have been tested it.
But I can't reproduce such as regression.
I don't know why such regression happed at that time.

Sorry for confusing.
Please ignore my test result at that time.

This is new test result.

before

2346.24user 191.44system 42:07.28elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
458624inputs+4262728outputs (183major+52299730minor)pagefaults 0swaps


after

2349.75user 195.72system 42:16.36elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
475632inputs+4265208outputs (183major+52308969minor)pagefaults 0swaps

I think we can ignore some time gap.
Sometime, after is faster than before.

I could conclude it doesn't have any regressions in my desktop machine.

Tested-by: MinChan Kim <minchan.kim@gmail.com>

-- 
Kinds regards,
MinChan Kim
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, October 1, 2008 - 11:44 pm

On Wed,  1 Oct 2008 13:30:57 +0100

Hmm, can't we use "MIGRATE_ISOLATE" pageblock type for this purpose ?
The page allocater skips pageblock marked as MIGRATE_ISOLATE at allocation.
(pageblock-size is equal to HUGEPAGE size in general.)

Of course, "where should be isolated" is a problem.

Thanks,
-Kame

--

Previous thread: none

Next thread: w35und wifi driver for linux-staging by Pavel Machek on Wednesday, October 1, 2008 - 5:36 am. (9 messages)