For sometime we have been looking at mechanisms for improving the availability of larger allocations under load. One of the options we have explored is the capturing of pages freed under direct reclaim in order to increase the chances of free pages coelescing before they are subject to reallocation by racing allocators. Following this email is a patch stack implementing page capture during direct reclaim. It consits of four patches. The first two simply pull out existing code into helpers for reuse. The third makes buddy's use of struct page explicit. The fourth contains the meat of the changes, and its leader contains a much fuller description of the feature. This update represents a rebase to -mm and incorporates feedback from KOSAKI Motohiro. It also incorporates an accounting fix which was preventing some captures. I have done a lot of comparitive testing with and without this patch set and in broad brush I am seeing improvements in hugepage allocations (worst case size) success on all of my test systems. These tests consist of placing a constant stream of high order allocations on the system, at varying rates. The results for these various runs are then averaged to give an overall improvement. Absolute Effective x86-64 2.48% 4.58% powerpc 5.55% 25.22% x86-64 has a relatively small huge page size and so is always much more effective at allocating huge pages. Even there we get a measurable improvement. On powerpc the huge pages are much larger and much harder to recover. Here we see a full 25% increase in page recovery. It should be noted that these are worst case testing, and very agressive taking every possible page in the system. It would be helpful to get wider testing in -mm. Against: 2.6.27-rc1-mm1 Andrew, please consider for -mm. -apw Changes since V3: - Incorporates an anon vma fix pointed out by MinChan Kim - switch to using a pagevec for page capture collection Changes since V2: - Incorporates review feedback from ...
Explicitly define the struct page fields which buddy uses when it owns
pages. Defines a new anonymous struct to allow additional fields to
be defined in a later patch.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/mm_types.h | 3 +++
mm/internal.h | 2 +-
mm/page_alloc.c | 4 ++--
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 995c588..906d8e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -70,6 +70,9 @@ struct page {
#endif
struct kmem_cache *slab; /* SLUB: Pointer to slab */
struct page *first_page; /* Compound tail pages */
+ struct {
+ unsigned long buddy_order; /* buddy: free page order */
+ };
};
union {
pgoff_t index; /* Our offset within mapping. */
diff --git a/mm/internal.h b/mm/internal.h
index c0e4859..fcedcd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,7 +58,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order);
static inline unsigned long page_order(struct page *page)
{
VM_BUG_ON(!PageBuddy(page));
- return page_private(page);
+ return page->buddy_order;
}
extern int mlock_vma_pages_range(struct vm_area_struct *vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 921c435..3a646e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,7 +331,7 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
static inline void set_page_order(struct page *page, int order)
{
- set_page_private(page, order);
+ page->buddy_order = order;
__SetPageBuddy(page);
#ifdef CONFIG_PAGE_OWNER
page->order = -1;
@@ -341,7 +341,7 @@ static inline void set_page_order(struct page *page, int ...On Wed, 1 Oct 2008 13:31:00 +0100 Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --
When allocating we need to confirm that the zone we are about to allocate
from is acceptable to the CPUSET we are in, and that it does not violate
the zone watermarks. Pull these checks out so we can reuse them in a
later patch.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
mm/page_alloc.c | 62 ++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 43 insertions(+), 19 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 55d8d9b..921c435 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1271,6 +1271,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
return 1;
}
+/*
+ * Return 1 if this zone is an acceptable source given the cpuset
+ * constraints.
+ */
+static inline int zone_cpuset_permits(struct zone *zone,
+ int alloc_flags, gfp_t gfp_mask)
+{
+ if ((alloc_flags & ALLOC_CPUSET) &&
+ !cpuset_zone_allowed_softwall(zone, gfp_mask))
+ return 0;
+ return 1;
+}
+
+/*
+ * Return 1 if this zone is within the watermarks specified by the
+ * allocation flags.
+ */
+static inline int zone_watermark_permits(struct zone *zone, int order,
+ int classzone_idx, int alloc_flags, gfp_t gfp_mask)
+{
+ if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
+ unsigned long mark;
+ if (alloc_flags & ALLOC_WMARK_MIN)
+ mark = zone->pages_min;
+ else if (alloc_flags & ALLOC_WMARK_LOW)
+ mark = zone->pages_low;
+ else
+ mark = zone->pages_high;
+ if (!zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags)) {
+ if (!zone_reclaim_mode ||
+ !zone_reclaim(zone, gfp_mask, order))
+ return 0;
+ }
+ }
+ return 1;
+}
+
#ifdef CONFIG_NUMA
/*
* zlc_setup - Setup for "zonelist cache". Uses cached zone data to
@@ -1424,25 +1462,11 @@ zonelist_scan:
if (NUMA_BUILD && zlc_active &&
...On Wed, 1 Oct 2008 13:30:59 +0100 Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --
When we are about to release a page we perform a number of actions
on that page. We clear down any anonymous mappings, confirm that
the page is safe to release, check for freeing locks, before mapping
the page should that be required. Pull this processing out into a
helper function for reuse in a later patch.
Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during the inline.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
mm/page_alloc.c | 40 +++++++++++++++++++++++++++-------------
1 files changed, 27 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..55d8d9b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,32 @@ static inline int free_pages_check(struct page *page)
}
/*
+ * Prepare this page for release to the buddy. Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+ int i;
+ int reserved = 0;
+
+ for (i = 0 ; i < (1 << order) ; ++i)
+ reserved += free_pages_check(page + i);
+ if (reserved)
+ return 0;
+
+ if (!PageHighMem(page)) {
+ debug_check_no_locks_freed(page_address(page),
+ PAGE_SIZE << order);
+ debug_check_no_obj_freed(page_address(page),
+ PAGE_SIZE << order);
+ }
+ arch_free_page(page, order);
+ kernel_map_pages(page, 1 << order, 0);
+
+ return 1;
+}
+
+/*
* Frees a list of pages.
* Assumes all pages on list are in same zone, and of same order.
* count is the number of pages to free.
@@ -529,22 +555,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
static void __free_pages_ok(struct page *page, unsigned int order)
{
unsigned long flags;
- int i;
- int reserved = 0;
- for (i = 0 ; i < (1 ...On Wed, 1 Oct 2008 13:30:58 +0100 Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiruyo@jp.fujitsu.com> --
When a process enters direct reclaim it will expend effort identifying and releasing pages in the hope of obtaining a page. However as these pages are released asynchronously there is every possibility that the pages will have been consumed by other allocators before the reclaimer gets a look in. This is particularly problematic where the reclaimer is attempting to allocate a higher order page. It is highly likely that a parallel allocation will consume lower order constituent pages as we release them preventing them coelescing into the higher order page the reclaimer desires. This patch set attempts to address this for allocations above ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing onto a local free list. Instead of freeing them to the main buddy lists, pages are collected and coelesced on this per direct reclaimer free list. Pages which are freed by other processes are also considered, where they coelesce with a page already under capture they will be moved to the capture list. When pressure has been applied to a zone we then consult the capture list and if there is an appropriatly sized page available it is taken immediatly and the remainder returned to the free pool. Capture is only enabled when the reclaimer's allocation order exceeds ALLOC_COSTLY_ORDER as free pages below this order should naturally occur in large numbers following regular reclaim. Thanks go to Mel Gorman for numerous discussions during the development of this patch and for his repeated reviews. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- include/linux/mm_types.h | 1 + include/linux/page-flags.h | 4 + include/linux/pagevec.h | 1 + mm/internal.h | 5 ++ mm/page_alloc.c | 159 +++++++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 118 +++++++++++++++++++++++++++------ 6 files changed, 267 ...
The reclaim problem is due to the pcp queueing right? Could we disable pcp queueing during reclaim? pcp processing is not necessarily a gain, so temporarily disabling it should not be a problem. At the beginning of reclaim just flush all pcp pages and then do not allow pcp refills again until reclaim is finished? --
Not entirely, some pages could get trapped there for sure. But it is parallel allocations we are trying to guard against. Plus we already flush the pcp during reclaim for higher orders. -apw --
So we just would need to forbid refilling the pcp. Parallel allocations are less a problem if the freed order 0 pages get merged immediately into the order 1 freelist. Of course that will only work 50% of the time but it will have a similar effect to this patch. --
Ah, Right. Could we hear why you like pcp disabling than Andy's patch? Honestly, I think pcp has some problem. But I avoid to change pcp because I don't understand its design. Maybe, we should discuss currect pcp behavior? --
pcps are a particular problem on NUMA because the lists are replicated per In the worst case we see that pcps cause a 5% performance drop (sequential alloc w/o free followed by sequential free w/o allocs). See my page allocator pcps need improvement. The performance issues with the page allocator fastpath are likely due to bloating of the fastpaths (antifrag did not do much good on that level). Plus current crops of processors are sensitive to cache footprint issues (seems that the tbench regression in the network stack also are due to the same effect). Doubly linked lists are not good today because they touch multiple cachelines. --
On Wed, 1 Oct 2008 13:31:01 +0100 Hmm.. is this routine better than mm/memory_hotplug.c::do_migrate_range(start_pfn, end_pfn) ? Thanks, -Kame --
Are you suggesting that it might be more adventageous to try and migrate things out of this area as part of reclaim? If so then I tend to agree, though that would be a good idea generally with or without capture. /me adds it to his todo list to test that out. -apw --
I just remember I did the same kind of work to offline pages. Sorry for noise. I just have an idea to support following kind of interface via memory hotplug This makes all pages in the section to be hugepage. #echo huge > /sys/device/system/memory/memoryXXX/state (memory hotplug interface supports online/offline here.) But no patches yet... Thanks, -Kame --
Hi, Andy. I tested your patch in my desktop. The test is just kernel compile with single thread. My system environment is as follows. model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz MemTotal: 2065856 kB When I tested vanilla, compile time is as follows. 2433.53user 187.96system 42:05.99elapsed 103%CPU (0avgtext+0avgdata 0maxresident)k 588752inputs+4503408outputs (127major+55456246minor)pagefaults 0swaps When I tested your patch, as follows. 2489.63user 202.41system 44:47.71elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k 538608inputs+4503928outputs (130major+55531561minor)pagefaults 0swaps Regresstion almost is above 2 minutes. Do you think It is a trivial? I know your patch is good to allocate hugepage. But, I think many users don't need it, including embedded system and desktop users yet. So I suggest you made it enable optionally. -- Kinds regards, MinChan Kim --
Hmmm. I would not expect to see any significant impact for this kind of workload as we should not be triggering capture for the lower order allocations at all. Something screwey must be occuring. I will go and reproduce this here and see if I can figure out just what causes this. -apw --
yup. I also think this is significant regression. if this is reproduced, that patch shouldn't be merged to -mm, IMHO. --
Ooops. No. if the patch has this significant regression, nobody turn on its option. We should fix that. --
On Fri, Oct 3, 2008 at 3:48 PM, KOSAKI Motohiro I have been tested it. But I can't reproduce such as regression. I don't know why such regression happed at that time. Sorry for confusing. Please ignore my test result at that time. This is new test result. before 2346.24user 191.44system 42:07.28elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k 458624inputs+4262728outputs (183major+52299730minor)pagefaults 0swaps after 2349.75user 195.72system 42:16.36elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k 475632inputs+4265208outputs (183major+52308969minor)pagefaults 0swaps I think we can ignore some time gap. Sometime, after is faster than before. I could conclude it doesn't have any regressions in my desktop machine. Tested-by: MinChan Kim <minchan.kim@gmail.com> -- Kinds regards, MinChan Kim --
On Wed, 1 Oct 2008 13:30:57 +0100 Hmm, can't we use "MIGRATE_ISOLATE" pageblock type for this purpose ? The page allocater skips pageblock marked as MIGRATE_ISOLATE at allocation. (pageblock-size is equal to HUGEPAGE size in general.) Of course, "where should be isolated" is a problem. Thanks, -Kame --
