Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

Previous thread: none

Next thread: Linux 2.6.26.3: parallel port trouble by Chris Rankin on Wednesday, September 3, 2008 - 12:02 pm. (1 message)
From: Andy Whitcroft
Date: Wednesday, September 3, 2008 - 11:44 am

For sometime we have been looking at mechanisms for improving the availability
of larger allocations under load.  One of the options we have explored is
the capturing of pages freed under direct reclaim in order to increase the
chances of free pages coelescing before they are subject to reallocation
by racing allocators.

Following this email is a patch stack implementing page capture during
direct reclaim.  It consits of four patches.  The first two simply pull
out existing code into helpers for reuse.  The third makes buddy's use
of struct page explicit.  The fourth contains the meat of the changes,
and its leader contains a much fuller description of the feature.

This update represents a rebase to -mm and incorporates feedback from
KOSAKI Motohiro.  It also incorporates an accounting fix which was
preventing some captures.

I have done a lot of comparitive testing with and without this patch
set and in broad brush I am seeing improvements in hugepage allocations
(worst case size) success on all of my test systems.  These tests consist
of placing a constant stream of high order allocations on the system,
at varying rates.  The results for these various runs are then averaged
to give an overall improvement.

		Absolute	Effective
x86-64		2.48%		 4.58%
powerpc		5.55%		25.22%

x86-64 has a relatively small huge page size and so is always much more
effective at allocating huge pages.  Even there we get a measurable
improvement.  On powerpc the huge pages are much larger and much harder
to recover.  Here we see a full 25% increase in page recovery.

It should be noted that these are worst case testing, and very agressive
taking every possible page in the system.

Against: 2.6.27-rc1-mm1

Comments?

-apw

Changes since V1:
 - Incorporates review feedback from KOSAKI Motohiro,
 - fixes up accounting when checking watermarks for captured pages,
 - rebase 2.6.27-rc1-mm1,
 - Incorporates review feedback from Mel.

Andy Whitcroft (4):
  pull out the page pre-release and ...
From: Andy Whitcroft
Date: Wednesday, September 3, 2008 - 11:44 am

When we are about to release a page we perform a number of actions
on that page.  We clear down any anonymous mappings, confirm that
the page is safe to release, check for freeing locks, before mapping
the page should that be required.  Pull this processing out into a
helper function for reuse in a later patch.

Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during the inline.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
 mm/page_alloc.c |   43 ++++++++++++++++++++++++++++++-------------
 1 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..b2a2c2b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
+ * Prepare this page for release to the buddy.  Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+	int i;
+	int reserved = 0;
+
+	if (PageAnon(page))
+		page->mapping = NULL;
+
+	for (i = 0 ; i < (1 << order) ; ++i)
+		reserved += free_pages_check(page + i);
+	if (reserved)
+		return 0;
+
+	if (!PageHighMem(page)) {
+		debug_check_no_locks_freed(page_address(page),
+							PAGE_SIZE << order);
+		debug_check_no_obj_freed(page_address(page),
+					   PAGE_SIZE << order);
+	}
+	arch_free_page(page, order);
+	kernel_map_pages(page, 1 << order, 0);
+
+	return 1;
+}
+
+/*
  * Frees a list of pages. 
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
@@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
-	int i;
-	int reserved = 0;
 
-	for (i = 0 ; i < (1 << order) ; ++i)
-		reserved += free_pages_check(page + i);
-	if (reserved)
+	if ...
From: Rik van Riel
Date: Wednesday, September 3, 2008 - 6:24 pm

On Wed,  3 Sep 2008 19:44:09 +0100

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.
--

From: KOSAKI Motohiro
Date: Thursday, September 4, 2008 - 6:52 pm

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





--

From: Andy Whitcroft
Date: Wednesday, September 3, 2008 - 11:44 am

When a process enters direct reclaim it will expend effort identifying
and releasing pages in the hope of obtaining a page.  However as these
pages are released asynchronously there is every possibility that the
pages will have been consumed by other allocators before the reclaimer
gets a look in.  This is particularly problematic where the reclaimer is
attempting to allocate a higher order page.  It is highly likely that
a parallel allocation will consume lower order constituent pages as we
release them preventing them coelescing into the higher order page the
reclaimer desires.

This patch set attempts to address this for allocations above
ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing
onto a local free list.  Instead of freeing them to the main buddy lists,
pages are collected and coelesced on this per direct reclaimer free list.
Pages which are freed by other processes are also considered, where they
coelesce with a page already under capture they will be moved to the
capture list.  When pressure has been applied to a zone we then consult
the capture list and if there is an appropriatly sized page available
it is taken immediatly and the remainder returned to the free pool.
Capture is only enabled when the reclaimer's allocation order exceeds
ALLOC_COSTLY_ORDER as free pages below this order should naturally occur
in large numbers following regular reclaim.

Thanks go to Mel Gorman for numerous discussions during the development
of this patch and for his repeated reviews.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
--

From: Christoph Lameter
Date: Wednesday, September 3, 2008 - 1:35 pm

You forgot to include the patch.
--

From: Andy Whitcroft
Date: Wednesday, September 3, 2008 - 1:53 pm

[Doh, as pointed out by Christoph the patch was missing from this one...]

When a process enters direct reclaim it will expend effort identifying
and releasing pages in the hope of obtaining a page.  However as these
pages are released asynchronously there is every possibility that the
pages will have been consumed by other allocators before the reclaimer
gets a look in.  This is particularly problematic where the reclaimer is
attempting to allocate a higher order page.  It is highly likely that
a parallel allocation will consume lower order constituent pages as we
release them preventing them coelescing into the higher order page the
reclaimer desires.

This patch set attempts to address this for allocations above
ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing
onto a local free list.  Instead of freeing them to the main buddy lists,
pages are collected and coelesced on this per direct reclaimer free list.
Pages which are freed by other processes are also considered, where they
coelesce with a page already under capture they will be moved to the
capture list.  When pressure has been applied to a zone we then consult
the capture list and if there is an appropriatly sized page available
it is taken immediatly and the remainder returned to the free pool.
Capture is only enabled when the reclaimer's allocation order exceeds
ALLOC_COSTLY_ORDER as free pages below this order should naturally occur
in large numbers following regular reclaim.

Thanks go to Mel Gorman for numerous discussions during the development
of this patch and for his repeated reviews.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
 include/linux/mm_types.h   |    1 +
 include/linux/page-flags.h |    6 ++
 mm/internal.h              |    6 ++
 mm/page_alloc.c            |  154 +++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |  115 +++++++++++++++++++++++++++------
 5 files changed, 261 insertions(+), 21 deletions(-)

diff --git ...
From: Christoph Lameter
Date: Wednesday, September 3, 2008 - 2:00 pm

Doesnt __PAGEFLAG do what you want without having to explicitly specify
__SET/__CLEAR?


How does page allocator fastpath behavior fare with this pathch?
--

From: Peter Zijlstra
Date: Wednesday, September 3, 2008 - 11:38 pm

PAGEFLAG() __PAGEFLAG()

does TESTPAGEFLAG() double.

--

From: Christoph Lameter
Date: Thursday, September 4, 2008 - 7:18 am

Usually one either wants the atomic versions or the non atomic versions. This
usage seems to be mainly non atomic plus one use of ClearPageBuddy() in
capture_or_return() (Which raises some questions about how the bit
modifications are serialized. Is there concurrency during free?)

So

__PAGEFLAG(BuddyCapture, buddy_capture)
	CLEARPAGEFLAG(BuddyCapture, buddy_capture)

--

From: KOSAKI Motohiro
Date: Thursday, September 4, 2008 - 1:11 am

Don't worry it because

1. shrink_zone() isn't fastpath because any reclaim isn't fastpath.
2. buddy combining on __free_one_page() isn't fastpath because
   any buddy combining isn't fastpath. (*)

(*)
all modern allocator have delayed buddy combining mecanism
because buddy combining increase cache miss.
(please imazine address X+1 is freed when address X is cold.
 combining cause next alloc get address X, then caller see cold page)

at least, allocator's fastpath should avoid its combining IMHO.

Unfortunately the linux buddy's one is limited because
zone->pcp only cache order-0 page.

Then, higher order pages's free always use slow path now.
but it isn't his patch failure.



--

From: Andy Whitcroft
Date: Thursday, September 4, 2008 - 1:58 am

I think I end up with one extra test that I don't need, but its

The fastpath should be unaffected on the allocation side.  On the free
side there is an additional check for merging with a buddy under capture
as we merge buddies in __free_one_page.

-apw
--

From: Peter Zijlstra
Date: Thursday, September 4, 2008 - 12:20 am

Whole series looks good, a few comments below.


That sentence looks incomplete, did you intend to write something along
the lines of:

Run through the accumulated list of captures pages and /take/ the first
which is big enough to satisfy the original allocation. Free the
remaining pages.


This makes me a little sad - we got a high order page and give it away
again...

Can we start another round of direct reclaim with a lower order to try


--

From: Andy Whitcroft
Date: Thursday, September 4, 2008 - 4:35 am

Well in theory we have already pushed a load of other pages back, the
ones we discarded during the capture selection.  This actually triggers
very rarely in real use, without it we would occasionally OOM but it was
rare.  Looking at some stats collected when running our tests I have yet
to see it trigger.  So its probabally not worth any additional effort

-apw
--

From: KOSAKI Motohiro
Date: Thursday, September 4, 2008 - 12:59 am

Hi Andy,

I like almost part of your patch.
(at least, I can ack patch 1/4 - 3/4)

So, I worry about OOM risk.
Can you remember desired page size to capture list (or any other location)?
if possible, __capture_on_page can avoid to capture unnecessary pages.

So, if __capture_on_page() can make desired size page by buddy merging, 
it can free other pages on capture_list.

In worst case, shrink_zone() is called by very much process at the same time.
Then, if each process doesn't back few pages, very many pages doesn't be backed.



--

From: Andy Whitcroft
Date: Thursday, September 4, 2008 - 7:44 am

The testing we have done pushes the system pretty damn hard, about as
hard as you can.  Without the zone watermark checks in capture we would
periodically lose a test to an OOM.  Since adding that I have never seen
an OOM, so I am confident we are safe.  That said, clearly some wider
testing in -mm would be very desirable to confirm that this does not
tickle OOM for some unexpected workload.

I think the idea of trying to short-circuit capture once it has a page
of the requisit order or greater is eminently sensible.  I suspect we
are going to have trouble getting the information to the right place,
but it is clearly worth investigating.  It feels like a logical step on
top of this, so I would propose to do it as a patch on top of this set.

Thanks for your feedback.

-apw
--

From: KOSAKI Motohiro
Date: Thursday, September 4, 2008 - 6:52 pm

Ok. makes sense.
Thanks for good patch.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



--

From: Andy Whitcroft
Date: Wednesday, September 3, 2008 - 11:44 am

When allocating we need to confirm that the zone we are about to allocate
from is acceptable to the CPUSET we are in, and that it does not violate
the zone watermarks.  Pull these checks out so we can reuse them in a
later patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
 mm/page_alloc.c |   62 ++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2a2c2b..2c3874e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1274,6 +1274,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 	return 1;
 }
 
+/*
+ * Return 1 if this zone is an acceptable source given the cpuset
+ * constraints.
+ */
+static inline int zone_cpuset_permits(struct zone *zone,
+					int alloc_flags, gfp_t gfp_mask)
+{
+	if ((alloc_flags & ALLOC_CPUSET) &&
+	    !cpuset_zone_allowed_softwall(zone, gfp_mask))
+		return 0;
+	return 1;
+}
+
+/*
+ * Return 1 if this zone is within the watermarks specified by the
+ * allocation flags.
+ */
+static inline int zone_watermark_permits(struct zone *zone, int order,
+			int classzone_idx, int alloc_flags, gfp_t gfp_mask)
+{
+	if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
+		unsigned long mark;
+		if (alloc_flags & ALLOC_WMARK_MIN)
+			mark = zone->pages_min;
+		else if (alloc_flags & ALLOC_WMARK_LOW)
+			mark = zone->pages_low;
+		else
+			mark = zone->pages_high;
+		if (!zone_watermark_ok(zone, order, mark,
+			    classzone_idx, alloc_flags)) {
+			if (!zone_reclaim_mode ||
+					!zone_reclaim(zone, gfp_mask, order))
+				return 0;
+		}
+	}
+	return 1;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
@@ -1427,25 +1465,11 @@ zonelist_scan:
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed_softwall(zone, gfp_mask))
-				goto ...
From: Rik van Riel
Date: Wednesday, September 3, 2008 - 6:24 pm

On Wed,  3 Sep 2008 19:44:10 +0100

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.
--

From: KOSAKI Motohiro
Date: Thursday, September 4, 2008 - 6:52 pm

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


--

From: Andy Whitcroft
Date: Wednesday, September 3, 2008 - 11:44 am

Explicitly define the struct page fields which buddy uses when it owns
pages.  Defines a new anonymous struct to allow additional fields to
be defined in a later patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
 include/linux/mm_types.h |    3 +++
 mm/internal.h            |    2 +-
 mm/page_alloc.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 995c588..906d8e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -70,6 +70,9 @@ struct page {
 #endif
 	    struct kmem_cache *slab;	/* SLUB: Pointer to slab */
 	    struct page *first_page;	/* Compound tail pages */
+	    struct {
+		unsigned long buddy_order;     /* buddy: free page order */
+	    };
 	};
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
diff --git a/mm/internal.h b/mm/internal.h
index c0e4859..fcedcd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,7 +58,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order);
 static inline unsigned long page_order(struct page *page)
 {
 	VM_BUG_ON(!PageBuddy(page));
-	return page_private(page);
+	return page->buddy_order;
 }
 
 extern int mlock_vma_pages_range(struct vm_area_struct *vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c3874e..db0dbd6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,7 +331,7 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
 
 static inline void set_page_order(struct page *page, int order)
 {
-	set_page_private(page, order);
+	page->buddy_order = order;
 	__SetPageBuddy(page);
 #ifdef CONFIG_PAGE_OWNER
 		page->order = -1;
@@ -341,7 +341,7 @@ static inline void set_page_order(struct page *page, int order)
 static inline void rmv_page_order(struct page *page)
 {
 	__ClearPageBuddy(page);
-	set_page_private(page, 0);
+	page->buddy_order = 0;
 }
 
 /*
-- 
1.6.0.rc1.258.g80295

--

From: Christoph Lameter
Date: Wednesday, September 3, 2008 - 1:36 pm

Good. I have a similar patch floating around.

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
--

From: Rik van Riel
Date: Wednesday, September 3, 2008 - 6:25 pm

On Wed,  3 Sep 2008 19:44:11 +0100

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.
--

From: KOSAKI Motohiro
Date: Thursday, September 4, 2008 - 6:52 pm

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



--

Previous thread: none

Next thread: Linux 2.6.26.3: parallel port trouble by Chris Rankin on Wednesday, September 3, 2008 - 12:02 pm. (1 message)