For sometime we have been looking at mechanisms for improving the availability of larger allocations under load. One of the options we have explored is the capturing of pages freed under direct reclaim in order to increase the chances of free pages coelescing before they are subject to reallocation by racing allocators. Following this email is a patch stack implementing page capture during direct reclaim. It consits of four patches. The first two simply pull out existing code into helpers for reuse. The third makes buddy's use of struct page explicit. The fourth contains the meat of the changes, and its leader contains a much fuller description of the feature. This update represents a rebase to -mm and incorporates feedback from KOSAKI Motohiro. It also incorporates an accounting fix which was preventing some captures. I have done a lot of comparitive testing with and without this patch set and in broad brush I am seeing improvements in hugepage allocations (worst case size) success on all of my test systems. These tests consist of placing a constant stream of high order allocations on the system, at varying rates. The results for these various runs are then averaged to give an overall improvement. Absolute Effective x86-64 2.48% 4.58% powerpc 5.55% 25.22% x86-64 has a relatively small huge page size and so is always much more effective at allocating huge pages. Even there we get a measurable improvement. On powerpc the huge pages are much larger and much harder to recover. Here we see a full 25% increase in page recovery. It should be noted that these are worst case testing, and very agressive taking every possible page in the system. It would be helpful to get wider testing in -mm. Against: 2.6.27-rc1-mm1 Andrew, please consider for -mm. -apw Changes since V2: - Incorporates review feedback from Christoph Lameter, - Incorporates review feedback from Peter Zijlstra, and - Checkpatch fixes. Changes since V1: - Incorporates ...
Explicitly define the struct page fields which buddy uses when it owns
pages. Defines a new anonymous struct to allow additional fields to
be defined in a later patch.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
include/linux/mm_types.h | 3 +++
mm/internal.h | 2 +-
mm/page_alloc.c | 4 ++--
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 995c588..906d8e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -70,6 +70,9 @@ struct page {
#endif
struct kmem_cache *slab; /* SLUB: Pointer to slab */
struct page *first_page; /* Compound tail pages */
+ struct {
+ unsigned long buddy_order; /* buddy: free page order */
+ };
};
union {
pgoff_t index; /* Our offset within mapping. */
diff --git a/mm/internal.h b/mm/internal.h
index c0e4859..fcedcd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,7 +58,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order);
static inline unsigned long page_order(struct page *page)
{
VM_BUG_ON(!PageBuddy(page));
- return page_private(page);
+ return page->buddy_order;
}
extern int mlock_vma_pages_range(struct vm_area_struct *vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c3874e..db0dbd6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,7 +331,7 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
static inline void set_page_order(struct page *page, int order)
{
- set_page_private(page, order);
+ page->buddy_order = order;
__SetPageBuddy(page);
#ifdef CONFIG_PAGE_OWNER
page->order = -1;
@@ -341,7 +341,7 @@ static inline void set_page_order(struct page *page, int ...When we are about to release a page we perform a number of actions
on that page. We clear down any anonymous mappings, confirm that
the page is safe to release, check for freeing locks, before mapping
the page should that be required. Pull this processing out into a
helper function for reuse in a later patch.
Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during the inline.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
mm/page_alloc.c | 43 ++++++++++++++++++++++++++++++-------------
1 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..b2a2c2b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page)
}
/*
+ * Prepare this page for release to the buddy. Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+ int i;
+ int reserved = 0;
+
+ if (PageAnon(page))
+ page->mapping = NULL;
+
+ for (i = 0 ; i < (1 << order) ; ++i)
+ reserved += free_pages_check(page + i);
+ if (reserved)
+ return 0;
+
+ if (!PageHighMem(page)) {
+ debug_check_no_locks_freed(page_address(page),
+ PAGE_SIZE << order);
+ debug_check_no_obj_freed(page_address(page),
+ PAGE_SIZE << order);
+ }
+ arch_free_page(page, order);
+ kernel_map_pages(page, 1 << order, 0);
+
+ return 1;
+}
+
+/*
* Frees a list of pages.
* Assumes all pages on list are in same zone, and of same order.
* count is the number of pages to free.
@@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
static void __free_pages_ok(struct page *page, unsigned int order)
{
unsigned long ...Why do you need to clear down anonymous mapping ? I think if you don't convert this cleardown in free_hot_cold_page(), you don't need it. -- Thanks, MinChan Kim --
Yeah that has slipped through from where originally this patch used to merge two different instances of this code. Good spot. Will sort that -apw --
When allocating we need to confirm that the zone we are about to allocate
from is acceptable to the CPUSET we are in, and that it does not violate
the zone watermarks. Pull these checks out so we can reuse them in a
later patch.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
mm/page_alloc.c | 62 ++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 43 insertions(+), 19 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2a2c2b..2c3874e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1274,6 +1274,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
return 1;
}
+/*
+ * Return 1 if this zone is an acceptable source given the cpuset
+ * constraints.
+ */
+static inline int zone_cpuset_permits(struct zone *zone,
+ int alloc_flags, gfp_t gfp_mask)
+{
+ if ((alloc_flags & ALLOC_CPUSET) &&
+ !cpuset_zone_allowed_softwall(zone, gfp_mask))
+ return 0;
+ return 1;
+}
+
+/*
+ * Return 1 if this zone is within the watermarks specified by the
+ * allocation flags.
+ */
+static inline int zone_watermark_permits(struct zone *zone, int order,
+ int classzone_idx, int alloc_flags, gfp_t gfp_mask)
+{
+ if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
+ unsigned long mark;
+ if (alloc_flags & ALLOC_WMARK_MIN)
+ mark = zone->pages_min;
+ else if (alloc_flags & ALLOC_WMARK_LOW)
+ mark = zone->pages_low;
+ else
+ mark = zone->pages_high;
+ if (!zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags)) {
+ if (!zone_reclaim_mode ||
+ !zone_reclaim(zone, gfp_mask, order))
+ return 0;
+ }
+ }
+ return 1;
+}
+
#ifdef CONFIG_NUMA
/*
* zlc_setup - Setup for "zonelist cache". Uses cached zone data to
@@ -1427,25 +1465,11 @@ zonelist_scan:
if (NUMA_BUILD && zlc_active &&
...When a process enters direct reclaim it will expend effort identifying and releasing pages in the hope of obtaining a page. However as these pages are released asynchronously there is every possibility that the pages will have been consumed by other allocators before the reclaimer gets a look in. This is particularly problematic where the reclaimer is attempting to allocate a higher order page. It is highly likely that a parallel allocation will consume lower order constituent pages as we release them preventing them coelescing into the higher order page the reclaimer desires. This patch set attempts to address this for allocations above ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing onto a local free list. Instead of freeing them to the main buddy lists, pages are collected and coelesced on this per direct reclaimer free list. Pages which are freed by other processes are also considered, where they coelesce with a page already under capture they will be moved to the capture list. When pressure has been applied to a zone we then consult the capture list and if there is an appropriatly sized page available it is taken immediatly and the remainder returned to the free pool. Capture is only enabled when the reclaimer's allocation order exceeds ALLOC_COSTLY_ORDER as free pages below this order should naturally occur in large numbers following regular reclaim. Thanks go to Mel Gorman for numerous discussions during the development of this patch and for his repeated reviews. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- include/linux/mm_types.h | 1 + include/linux/page-flags.h | 4 + mm/internal.h | 6 ++ mm/page_alloc.c | 154 +++++++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 115 +++++++++++++++++++++++++++------ 5 files changed, 259 insertions(+), 21 ...
These are the numbers for the improvement of hugepage allocation success? Then what do you mean by absolute and effective? What sort of constant stream of high order allocations are you talking about? In what "real" situations are you seeing higher order page allocation failures, and in those cases, how much do these patches help? I must say I don't really like this approach. IMO it might be better to put some sort of queue in the page allocator, so if memory becomes low, then processes will start queueing up and not be allowed to jump the queue and steal memory that has been freed by hard work of a direct reclaimer. That would improve a lot of fairness problems as well as improve coalescing for --
The absolute improvement is the literal change in success percentage, the literal percentage of all memory which may be allocated as huge pages. The effective improvement is percentage of the baseline success rates that this change represents; for the powerpc results we get some 20% of memory allocatable without the patches, and 25% with, 25% more The test case simulates a constant demand for huge pages, at various rates. This is intended to replicate the scenario where a system is used with mixed small page and huge page applications, with a dynamic huge page pool. Particularly we are examining the effect of starting a very large huge page job on a busy system. Obviously starting hugepage applications depends on hugepage availability as they are not swappable. This test was chosen because it both tests the initial large page demand The problem with queuing all allocators is two fold. Firstly allocations are obviously required for reclaim and those would have to be exempt from the queue, and these are as likely to prevent coelesce of pages as any other. Secondly where a very large allocation is requested all allocators would be held while reclaim at that size is performed, majorly increasing latency for those allocations. Reclaim for an order 0 page may target of the order of 32 pages, whereas reclaim for x86_64 hugepages is 1024 pages minimum. It would be grosly unfair for a single -apw --
*Much* less likely, actually. Because there should be very little allocation
required for reclaim (only dirty pages, and only when backed by filesystems
that do silly things like not ensuring their own reserves before allowing
the page to be dirtied).
Also, your scheme still doesn't avoid allocation for reclaim so I don't see
I don't see why it should be unfair to allow a process to allocate 1024
order-0 pages ahead of one order-10 page (actually, yes the order 10 page
is I guess somewhat more valueable than the same number of fragmented
pages, but you see where I'm coming from).
At least with queueing you are basing the idea on *some* reasonable policy,
rather than purely random "whoever happens to take this lock at the right
time will win" strategy, which might I add, can even be much more unfair
than you might say queueing is.
However, queueing would still be able to allow some flexibility in priority.
For example:
if (must_queue) {
if (queue_head_prio == 0)
join_queue(1<<order);
else {
queue_head_prio -= 1<<order;
skip_queue();
}
}
--
In the common use model, use of huge pages is all or nothing. Either there are sufficient pages allocatable at application start time or there are not. As huge pages are not swappable once allocated they stay there. Applications either start using huge pages or they fallback to small pages and continue. This makes the real metric, how often does the customer get annoyed becuase their application has fallen back to small pages and is slow, or how often does their database fail to start. It is very hard to directly measure that and thus to get a comparitive figure. Any attempt Well yes and no. A lot of filesystems do do such stupid things. Allocating things like journal pages which have relativly long lives during reclaim. We have seen these getting placed into the memory we Obviously we cannot prevent allocations during reclaim. But we can avoid those allocations falling within the captured range. All pages under capture are marked. Any page returning to the allocator that merges with a buddy under capture, or that is a buddy under capture are kept separatly. Such that any allocations from within reclaim will necessarily take pages from elsewhere in the pool. The key thing about capture is that it effectivly marks ranges of pages out of use for allocations for the period of the reclaim, so we have a number of small ranges blocked out, not the whole pool. This allows parallel allocations (reclaim and otherwise) to succeed against the reserves (refilled by kswapd etc), whilst marking I think we have our wires crossed here. I was saying it would seem unfair to block the allocator from giving out order-0 pages while we are struggling to get an order-10 page for one process. Having a queue would seem to generate such behaviour. What I am trying to achieve with capture is to push areas likely to return us a page of the requested size out of use, while we try and reclaim it without blocking the rest Sure you would be able to make some kind of more flexible ...
But you have customers telling you they're getting annoyed because of this? Or you have your own "realistic" workloads that allocate hugepages on demand (OK, when I say realistic, something like specjbb or whatever is obviously a reasonable macrobenchmark even if it isn't strictly They shouldn't, because that's sad and deadlocky. But yes I agree it Yeah, but blocking the whole pool gives a *much* bigger chance to coalesce freed pages. And I'm not just talking about massive order-10 allocations or something where you have the targetted reclaim which improves chances of getting a free page within that range, but also for order-1/2/3 pages that might be more commonly used in normal kernel workloads but can still have a fairly high latency to succeed if there is a lot of other parallel We don't have our wires crossed. I just don't agree that it is unfair. It might be unfair to allow order-10 allocations at the same *rate* at order-0 allocations, which is why you could allow some priority in the queue. But when you decide you want to satisfy an order-10 allocation, do you want the other guys potentially mopping up your coalescing candidates? (and right, for targetted reclaim, maybe this is less of a problem, but I'm also thinking about a general solution for all orders I don't understand the thought process that leads you to these assertions. Where do you get your idea of fairness or importance? I would say that allowing 2^N order-0 allocations for every order-N allocations if both allocators are in a tight loop (and reserves are low, ie. reclaim is required) is a completely reasonable starting point for fairness. Do you disagree with that? How is it less fair than your But with my queueing you get effectively the same thing without having an oracle. Because you will wait for *any* order-N region to become free. The "tradeoff" is blocking other allocators. But IMO that is actually a good thing because that equates to a general fairness model in our allocator for all ...
We have customers complaining about usability of hugepages. They are keen to obtain the performance benefits from their use, but put off by the difficulty in managing systems with them enabled. That has led to series of changes to improve availability and managability of hugepages, including targetted reclaim, anti-fragmentation, and the dynamic pool. We obviously test hugepages with a series of benchmarks but generally in controlled environments where allocation of hugepages is guarenteed, to understand the performance benefits of their use. This is an acceptable configuration for hugepage diehards, but limits their utility to the wider community. The focus of this work is availability of hugepages and how that applies to managability. A common use model for big systems is batch oriented running a mix of jobs. For this to work reliably we need to move pages into and out of the hugepage pool on a regular basis. Yes in isolation blocking the whole pool would give us a higher chance of coalescing higher order pages. The problem here is that we cannot block the whole pool, we have to allow allocations for reclaim. If we have those occuring we risk loss of pages from the areas we are trying When I am attempting to satisfy an order-10 allocation I clearly want to protect that order-10 area from use for allocation to other allocators. And that is the key thrust of the capture stuff, that is what it does. Capture puts the pages in areas under targetted reclaim out of the allocator during the reclaim pass. Protecting those areas from any parallel parallel allocators, either regular or PF_MEMALLOC; but only protecting those areas. I think we have differing views on what a queuing alone can deliver. Given that the allocator will have to allow allocations made during reclaim in parallel with reclaim for higher orders it does not seem sufficient to have a 'fair queue' alone. That will not stop the smaller necessary allocations from stealing our partial free buddies, ...
Allocations from reclaim should be pretty rare, I think you're too worried about them. Actually, if you have queueing, you can do a simple form of capture just in the targetted reclaim path without all the stuff added to page alloc etc. Because you can hold off from freeing the targetted pages into the page allocator until you have built up a large number of them. That way, they won't be subject to recursive allocations and they will coalesce in the allocator; the queueing will ensure they don't get stolen or broken into smaller pages under us. Yes I really think queueing And I've shown that by a principle of fairness, you don't need to protect just one area, but you can hold off reclaim on all areas, which should greatly increase your chances of getting a hugepage, and will allow you Queueing is a general and natural step which (hopefully) benefits everyone, so I think it should be tried before capture (which introduces new complexity Yes they are orthogonal, but I would like to see if we can do without For order-2,3 etc allocations that can be commonly used by the kernel but are not always in large supply, that fairness model doesn't really work. You can work to free things, then have another process allocate them, then exit direct reclaim to find none left etc. Even for order-0 pages there is a problem in theory (although statistically Yes. --
Am flying today, so I won't get to respond to this one in detail immediatly. -apw --
They are certainly there, particularly with journalled filesystems in the mix. But although I do worry about them, my solution has been to allow all allocations in parallel with reclaim so they fall into the It almost feels as if you think these patches do much more than they actually do. The primary source of captured pages is through collection of pages as we complete synchronous targetted reclaim on them. However we do also collect and merge pages which are already free, and those which are freed from outside the reclaim during the period of the capture. From your description I am getting the impression that your proposal is that there be a queue on the allocator plus capture only for pages we release during targetted reclaim. We would then release our list and try to allocate from the buddy. Relying on buddy to coelesce them, and then allocate from there. The queue protects us from any stealing. In the development and testing of capture, we actually started out with simple version of capture, without the additional ability to collect the already free pages, and those which free during capture. (It is these parts which bring the apparent complexity of change to page_alloc.c and I am assuming is the bit you dislike particularly.) Testing on this simple version showed the chances of actually finding areas of the size requested were close to zero and capture statistically worthless. That seems to indicate that for queue+simple capture to have any statistical chance of improving things all allocations would need to be stopped for the entire period of the capture. Even if we assume reclaim allocations never occur (which we cannot queue), that still leaves us with a singly threaded allocator once reclaim starts. Bearing in mind that for very high order pages that would be periods equivalent to the time it takes To be completely honest here, I do not think I have yet figured out what form queueing you really envisage here. Its name implies an ordering ...
