For sometime we have been looking at mechanisms for improving the availability of larger allocations under load. One of the options we have explored is the capturing of pages freed under direct reclaim in order to increase the chances of free pages coelescing before they are subject to reallocation by racing allocators. Following this email is a patch stack implementing page capture during direct reclaim. It consits of four patches. The first two simply pull out existing code into helpers for reuse. The third makes buddy's use of struct page explicit. The fourth contains the meat of the changes, and its leader contains a much fuller description of the feature. This update represents a rebase to -mm and incorporates feedback from KOSAKI Motohiro. It also incorporates an accounting fix which was preventing some captures. I have done a lot of comparitive testing with and without this patch set and in broad brush I am seeing improvements in hugepage allocations (worst case size) success on all of my test systems. These tests consist of placing a constant stream of high order allocations on the system, at varying rates. The results for these various runs are then averaged to give an overall improvement. Absolute Effective x86-64 2.48% 4.58% powerpc 5.55% 25.22% x86-64 has a relatively small huge page size and so is always much more effective at allocating huge pages. Even there we get a measurable improvement. On powerpc the huge pages are much larger and much harder to recover. Here we see a full 25% increase in page recovery. It should be noted that these are worst case testing, and very agressive taking every possible page in the system. Against: 2.6.27-rc1-mm1 Comments? -apw Changes since V1: - Incorporates review feedback from KOSAKI Motohiro, - fixes up accounting when checking watermarks for captured pages, - rebase 2.6.27-rc1-mm1, - Incorporates review feedback from Mel. Andy Whitcroft (4): pull out the page pre-release and ...
When we are about to release a page we perform a number of actions
on that page. We clear down any anonymous mappings, confirm that
the page is safe to release, check for freeing locks, before mapping
the page should that be required. Pull this processing out into a
helper function for reuse in a later patch.
Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during the inline.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
mm/page_alloc.c | 43 ++++++++++++++++++++++++++++++-------------
1 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..b2a2c2b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page)
}
/*
+ * Prepare this page for release to the buddy. Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+ int i;
+ int reserved = 0;
+
+ if (PageAnon(page))
+ page->mapping = NULL;
+
+ for (i = 0 ; i < (1 << order) ; ++i)
+ reserved += free_pages_check(page + i);
+ if (reserved)
+ return 0;
+
+ if (!PageHighMem(page)) {
+ debug_check_no_locks_freed(page_address(page),
+ PAGE_SIZE << order);
+ debug_check_no_obj_freed(page_address(page),
+ PAGE_SIZE << order);
+ }
+ arch_free_page(page, order);
+ kernel_map_pages(page, 1 << order, 0);
+
+ return 1;
+}
+
+/*
* Frees a list of pages.
* Assumes all pages on list are in same zone, and of same order.
* count is the number of pages to free.
@@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
static void __free_pages_ok(struct page *page, unsigned int order)
{
unsigned long flags;
- int i;
- int reserved = 0;
- for (i = 0 ; i < (1 << order) ; ++i)
- reserved += free_pages_check(page + i);
- if (reserved)
+ if ...On Wed, 3 Sep 2008 19:44:09 +0100 Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed. --
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --
When a process enters direct reclaim it will expend effort identifying and releasing pages in the hope of obtaining a page. However as these pages are released asynchronously there is every possibility that the pages will have been consumed by other allocators before the reclaimer gets a look in. This is particularly problematic where the reclaimer is attempting to allocate a higher order page. It is highly likely that a parallel allocation will consume lower order constituent pages as we release them preventing them coelescing into the higher order page the reclaimer desires. This patch set attempts to address this for allocations above ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing onto a local free list. Instead of freeing them to the main buddy lists, pages are collected and coelesced on this per direct reclaimer free list. Pages which are freed by other processes are also considered, where they coelesce with a page already under capture they will be moved to the capture list. When pressure has been applied to a zone we then consult the capture list and if there is an appropriatly sized page available it is taken immediatly and the remainder returned to the free pool. Capture is only enabled when the reclaimer's allocation order exceeds ALLOC_COSTLY_ORDER as free pages below this order should naturally occur in large numbers following regular reclaim. Thanks go to Mel Gorman for numerous discussions during the development of this patch and for his repeated reviews. Signed-off-by: Andy Whitcroft <apw@shadowen.org> --
You forgot to include the patch. --
[Doh, as pointed out by Christoph the patch was missing from this one...] When a process enters direct reclaim it will expend effort identifying and releasing pages in the hope of obtaining a page. However as these pages are released asynchronously there is every possibility that the pages will have been consumed by other allocators before the reclaimer gets a look in. This is particularly problematic where the reclaimer is attempting to allocate a higher order page. It is highly likely that a parallel allocation will consume lower order constituent pages as we release them preventing them coelescing into the higher order page the reclaimer desires. This patch set attempts to address this for allocations above ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing onto a local free list. Instead of freeing them to the main buddy lists, pages are collected and coelesced on this per direct reclaimer free list. Pages which are freed by other processes are also considered, where they coelesce with a page already under capture they will be moved to the capture list. When pressure has been applied to a zone we then consult the capture list and if there is an appropriatly sized page available it is taken immediatly and the remainder returned to the free pool. Capture is only enabled when the reclaimer's allocation order exceeds ALLOC_COSTLY_ORDER as free pages below this order should naturally occur in large numbers following regular reclaim. Thanks go to Mel Gorman for numerous discussions during the development of this patch and for his repeated reviews. Signed-off-by: Andy Whitcroft <apw@shadowen.org> --- include/linux/mm_types.h | 1 + include/linux/page-flags.h | 6 ++ mm/internal.h | 6 ++ mm/page_alloc.c | 154 +++++++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 115 +++++++++++++++++++++++++++------ 5 files changed, 261 insertions(+), 21 deletions(-) diff --git ...
Doesnt __PAGEFLAG do what you want without having to explicitly specify __SET/__CLEAR? How does page allocator fastpath behavior fare with this pathch? --
PAGEFLAG() __PAGEFLAG() does TESTPAGEFLAG() double. --
Usually one either wants the atomic versions or the non atomic versions. This usage seems to be mainly non atomic plus one use of ClearPageBuddy() in capture_or_return() (Which raises some questions about how the bit modifications are serialized. Is there concurrency during free?) So __PAGEFLAG(BuddyCapture, buddy_capture) CLEARPAGEFLAG(BuddyCapture, buddy_capture) --
Don't worry it because 1. shrink_zone() isn't fastpath because any reclaim isn't fastpath. 2. buddy combining on __free_one_page() isn't fastpath because any buddy combining isn't fastpath. (*) (*) all modern allocator have delayed buddy combining mecanism because buddy combining increase cache miss. (please imazine address X+1 is freed when address X is cold. combining cause next alloc get address X, then caller see cold page) at least, allocator's fastpath should avoid its combining IMHO. Unfortunately the linux buddy's one is limited because zone->pcp only cache order-0 page. Then, higher order pages's free always use slow path now. but it isn't his patch failure. --
I think I end up with one extra test that I don't need, but its The fastpath should be unaffected on the allocation side. On the free side there is an additional check for merging with a buddy under capture as we merge buddies in __free_one_page. -apw --
Whole series looks good, a few comments below. That sentence looks incomplete, did you intend to write something along the lines of: Run through the accumulated list of captures pages and /take/ the first which is big enough to satisfy the original allocation. Free the remaining pages. This makes me a little sad - we got a high order page and give it away again... Can we start another round of direct reclaim with a lower order to try --
Well in theory we have already pushed a load of other pages back, the ones we discarded during the capture selection. This actually triggers very rarely in real use, without it we would occasionally OOM but it was rare. Looking at some stats collected when running our tests I have yet to see it trigger. So its probabally not worth any additional effort -apw --
Hi Andy, I like almost part of your patch. (at least, I can ack patch 1/4 - 3/4) So, I worry about OOM risk. Can you remember desired page size to capture list (or any other location)? if possible, __capture_on_page can avoid to capture unnecessary pages. So, if __capture_on_page() can make desired size page by buddy merging, it can free other pages on capture_list. In worst case, shrink_zone() is called by very much process at the same time. Then, if each process doesn't back few pages, very many pages doesn't be backed. --
The testing we have done pushes the system pretty damn hard, about as hard as you can. Without the zone watermark checks in capture we would periodically lose a test to an OOM. Since adding that I have never seen an OOM, so I am confident we are safe. That said, clearly some wider testing in -mm would be very desirable to confirm that this does not tickle OOM for some unexpected workload. I think the idea of trying to short-circuit capture once it has a page of the requisit order or greater is eminently sensible. I suspect we are going to have trouble getting the information to the right place, but it is clearly worth investigating. It feels like a logical step on top of this, so I would propose to do it as a patch on top of this set. Thanks for your feedback. -apw --
Ok. makes sense. Thanks for good patch. Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --
When allocating we need to confirm that the zone we are about to allocate
from is acceptable to the CPUSET we are in, and that it does not violate
the zone watermarks. Pull these checks out so we can reuse them in a
later patch.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
mm/page_alloc.c | 62 ++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 43 insertions(+), 19 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2a2c2b..2c3874e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1274,6 +1274,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
return 1;
}
+/*
+ * Return 1 if this zone is an acceptable source given the cpuset
+ * constraints.
+ */
+static inline int zone_cpuset_permits(struct zone *zone,
+ int alloc_flags, gfp_t gfp_mask)
+{
+ if ((alloc_flags & ALLOC_CPUSET) &&
+ !cpuset_zone_allowed_softwall(zone, gfp_mask))
+ return 0;
+ return 1;
+}
+
+/*
+ * Return 1 if this zone is within the watermarks specified by the
+ * allocation flags.
+ */
+static inline int zone_watermark_permits(struct zone *zone, int order,
+ int classzone_idx, int alloc_flags, gfp_t gfp_mask)
+{
+ if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
+ unsigned long mark;
+ if (alloc_flags & ALLOC_WMARK_MIN)
+ mark = zone->pages_min;
+ else if (alloc_flags & ALLOC_WMARK_LOW)
+ mark = zone->pages_low;
+ else
+ mark = zone->pages_high;
+ if (!zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags)) {
+ if (!zone_reclaim_mode ||
+ !zone_reclaim(zone, gfp_mask, order))
+ return 0;
+ }
+ }
+ return 1;
+}
+
#ifdef CONFIG_NUMA
/*
* zlc_setup - Setup for "zonelist cache". Uses cached zone data to
@@ -1427,25 +1465,11 @@ zonelist_scan:
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
- if ((alloc_flags & ALLOC_CPUSET) &&
- !cpuset_zone_allowed_softwall(zone, gfp_mask))
- goto ...On Wed, 3 Sep 2008 19:44:10 +0100 Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed. --
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --
Explicitly define the struct page fields which buddy uses when it owns
pages. Defines a new anonymous struct to allow additional fields to
be defined in a later patch.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
include/linux/mm_types.h | 3 +++
mm/internal.h | 2 +-
mm/page_alloc.c | 4 ++--
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 995c588..906d8e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -70,6 +70,9 @@ struct page {
#endif
struct kmem_cache *slab; /* SLUB: Pointer to slab */
struct page *first_page; /* Compound tail pages */
+ struct {
+ unsigned long buddy_order; /* buddy: free page order */
+ };
};
union {
pgoff_t index; /* Our offset within mapping. */
diff --git a/mm/internal.h b/mm/internal.h
index c0e4859..fcedcd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,7 +58,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order);
static inline unsigned long page_order(struct page *page)
{
VM_BUG_ON(!PageBuddy(page));
- return page_private(page);
+ return page->buddy_order;
}
extern int mlock_vma_pages_range(struct vm_area_struct *vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c3874e..db0dbd6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,7 +331,7 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
static inline void set_page_order(struct page *page, int order)
{
- set_page_private(page, order);
+ page->buddy_order = order;
__SetPageBuddy(page);
#ifdef CONFIG_PAGE_OWNER
page->order = -1;
@@ -341,7 +341,7 @@ static inline void set_page_order(struct page *page, int order)
static inline void rmv_page_order(struct page *page)
{
__ClearPageBuddy(page);
- set_page_private(page, 0);
+ page->buddy_order = 0;
}
/*
--
1.6.0.rc1.258.g80295
--
Good. I have a similar patch floating around. Reviewed-by: Christoph Lameter <cl@linux-foundation.org> --
On Wed, 3 Sep 2008 19:44:11 +0100 Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed. --
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --
