Re: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse

Previous thread: none

Next thread: [PATCH 0/9] ASoC Blackfin supporting (v2) by Bryan Wu on Friday, September 5, 2008 - 3:21 am. (15 messages)
From: Andy Whitcroft
Date: Friday, September 5, 2008 - 3:19 am

For sometime we have been looking at mechanisms for improving the availability
of larger allocations under load.  One of the options we have explored is
the capturing of pages freed under direct reclaim in order to increase the
chances of free pages coelescing before they are subject to reallocation
by racing allocators.

Following this email is a patch stack implementing page capture during
direct reclaim.  It consits of four patches.  The first two simply pull
out existing code into helpers for reuse.  The third makes buddy's use
of struct page explicit.  The fourth contains the meat of the changes,
and its leader contains a much fuller description of the feature.

This update represents a rebase to -mm and incorporates feedback from
KOSAKI Motohiro.  It also incorporates an accounting fix which was
preventing some captures.

I have done a lot of comparitive testing with and without this patch
set and in broad brush I am seeing improvements in hugepage allocations
(worst case size) success on all of my test systems.  These tests consist
of placing a constant stream of high order allocations on the system,
at varying rates.  The results for these various runs are then averaged
to give an overall improvement.

		Absolute	Effective
x86-64		2.48%		 4.58%
powerpc		5.55%		25.22%

x86-64 has a relatively small huge page size and so is always much more
effective at allocating huge pages.  Even there we get a measurable
improvement.  On powerpc the huge pages are much larger and much harder
to recover.  Here we see a full 25% increase in page recovery.

It should be noted that these are worst case testing, and very agressive
taking every possible page in the system.  It would be helpful to get
wider testing in -mm.

Against: 2.6.27-rc1-mm1

Andrew, please consider for -mm.

-apw

Changes since V2:
 - Incorporates review feedback from Christoph Lameter,
 - Incorporates review feedback from Peter Zijlstra, and
 - Checkpatch fixes.

Changes since V1:
 - Incorporates ...
From: Andy Whitcroft
Date: Friday, September 5, 2008 - 3:20 am

Explicitly define the struct page fields which buddy uses when it owns
pages.  Defines a new anonymous struct to allow additional fields to
be defined in a later patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/mm_types.h |    3 +++
 mm/internal.h            |    2 +-
 mm/page_alloc.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 995c588..906d8e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -70,6 +70,9 @@ struct page {
 #endif
 	    struct kmem_cache *slab;	/* SLUB: Pointer to slab */
 	    struct page *first_page;	/* Compound tail pages */
+	    struct {
+		unsigned long buddy_order;     /* buddy: free page order */
+	    };
 	};
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
diff --git a/mm/internal.h b/mm/internal.h
index c0e4859..fcedcd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,7 +58,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order);
 static inline unsigned long page_order(struct page *page)
 {
 	VM_BUG_ON(!PageBuddy(page));
-	return page_private(page);
+	return page->buddy_order;
 }
 
 extern int mlock_vma_pages_range(struct vm_area_struct *vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c3874e..db0dbd6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,7 +331,7 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
 
 static inline void set_page_order(struct page *page, int order)
 {
-	set_page_private(page, order);
+	page->buddy_order = order;
 	__SetPageBuddy(page);
 #ifdef CONFIG_PAGE_OWNER
 		page->order = -1;
@@ -341,7 +341,7 @@ static inline void set_page_order(struct page *page, int ...
From: Andy Whitcroft
Date: Friday, September 5, 2008 - 3:19 am

When we are about to release a page we perform a number of actions
on that page.  We clear down any anonymous mappings, confirm that
the page is safe to release, check for freeing locks, before mapping
the page should that be required.  Pull this processing out into a
helper function for reuse in a later patch.

Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during the inline.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c |   43 ++++++++++++++++++++++++++++++-------------
 1 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..b2a2c2b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
+ * Prepare this page for release to the buddy.  Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+	int i;
+	int reserved = 0;
+
+	if (PageAnon(page))
+		page->mapping = NULL;
+
+	for (i = 0 ; i < (1 << order) ; ++i)
+		reserved += free_pages_check(page + i);
+	if (reserved)
+		return 0;
+
+	if (!PageHighMem(page)) {
+		debug_check_no_locks_freed(page_address(page),
+							PAGE_SIZE << order);
+		debug_check_no_obj_freed(page_address(page),
+					   PAGE_SIZE << order);
+	}
+	arch_free_page(page, order);
+	kernel_map_pages(page, 1 << order, 0);
+
+	return 1;
+}
+
+/*
  * Frees a list of pages. 
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
@@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long ...
From: MinChan Kim
Date: Monday, September 8, 2008 - 7:11 am

Why do you need to clear down anonymous mapping ?
I think if you don't convert this cleardown in free_hot_cold_page(),
you don't need it.




-- 
Thanks,
MinChan Kim
--

From: Andy Whitcroft
Date: Monday, September 8, 2008 - 8:14 am

Yeah that has slipped through from where originally this patch used to
merge two different instances of this code.  Good spot.  Will sort that

-apw
--

From: Andy Whitcroft
Date: Friday, September 5, 2008 - 3:20 am

When allocating we need to confirm that the zone we are about to allocate
from is acceptable to the CPUSET we are in, and that it does not violate
the zone watermarks.  Pull these checks out so we can reuse them in a
later patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c |   62 ++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2a2c2b..2c3874e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1274,6 +1274,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 	return 1;
 }
 
+/*
+ * Return 1 if this zone is an acceptable source given the cpuset
+ * constraints.
+ */
+static inline int zone_cpuset_permits(struct zone *zone,
+					int alloc_flags, gfp_t gfp_mask)
+{
+	if ((alloc_flags & ALLOC_CPUSET) &&
+	    !cpuset_zone_allowed_softwall(zone, gfp_mask))
+		return 0;
+	return 1;
+}
+
+/*
+ * Return 1 if this zone is within the watermarks specified by the
+ * allocation flags.
+ */
+static inline int zone_watermark_permits(struct zone *zone, int order,
+			int classzone_idx, int alloc_flags, gfp_t gfp_mask)
+{
+	if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
+		unsigned long mark;
+		if (alloc_flags & ALLOC_WMARK_MIN)
+			mark = zone->pages_min;
+		else if (alloc_flags & ALLOC_WMARK_LOW)
+			mark = zone->pages_low;
+		else
+			mark = zone->pages_high;
+		if (!zone_watermark_ok(zone, order, mark,
+			    classzone_idx, alloc_flags)) {
+			if (!zone_reclaim_mode ||
+					!zone_reclaim(zone, gfp_mask, order))
+				return 0;
+		}
+	}
+	return 1;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
@@ -1427,25 +1465,11 @@ zonelist_scan:
 		if (NUMA_BUILD && zlc_active &&
 ...
From: Andy Whitcroft
Date: Friday, September 5, 2008 - 3:20 am

When a process enters direct reclaim it will expend effort identifying
and releasing pages in the hope of obtaining a page.  However as these
pages are released asynchronously there is every possibility that the
pages will have been consumed by other allocators before the reclaimer
gets a look in.  This is particularly problematic where the reclaimer is
attempting to allocate a higher order page.  It is highly likely that
a parallel allocation will consume lower order constituent pages as we
release them preventing them coelescing into the higher order page the
reclaimer desires.

This patch set attempts to address this for allocations above
ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing
onto a local free list.  Instead of freeing them to the main buddy lists,
pages are collected and coelesced on this per direct reclaimer free list.
Pages which are freed by other processes are also considered, where they
coelesce with a page already under capture they will be moved to the
capture list.  When pressure has been applied to a zone we then consult
the capture list and if there is an appropriatly sized page available
it is taken immediatly and the remainder returned to the free pool.
Capture is only enabled when the reclaimer's allocation order exceeds
ALLOC_COSTLY_ORDER as free pages below this order should naturally occur
in large numbers following regular reclaim.

Thanks go to Mel Gorman for numerous discussions during the development
of this patch and for his repeated reviews.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mm_types.h   |    1 +
 include/linux/page-flags.h |    4 +
 mm/internal.h              |    6 ++
 mm/page_alloc.c            |  154 +++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |  115 +++++++++++++++++++++++++++------
 5 files changed, 259 insertions(+), 21 ...
From: Nick Piggin
Date: Sunday, September 7, 2008 - 11:08 pm

These are the numbers for the improvement of hugepage allocation success?
Then what do you mean by absolute and effective?

What sort of constant stream of high order allocations are you talking
about? In what "real" situations are you seeing higher order page allocation
failures, and in those cases, how much do these patches help?

I must say I don't really like this approach. IMO it might be better to put
some sort of queue in the page allocator, so if memory becomes low, then
processes will start queueing up and not be allowed to jump the queue and
steal memory that has been freed by hard work of a direct reclaimer. That
would improve a lot of fairness problems as well as improve coalescing for
--

From: Andy Whitcroft
Date: Monday, September 8, 2008 - 4:44 am

The absolute improvement is the literal change in success percentage,
the literal percentage of all memory which may be allocated as huge
pages.  The effective improvement is percentage of the baseline success
rates that this change represents; for the powerpc results we get some
20% of memory allocatable without the patches, and 25% with, 25% more

The test case simulates a constant demand for huge pages, at various
rates.  This is intended to replicate the scenario where a system is
used with mixed small page and huge page applications, with a dynamic
huge page pool.  Particularly we are examining the effect of starting a
very large huge page job on a busy system.  Obviously starting hugepage
applications depends on hugepage availability as they are not swappable.
This test was chosen because it both tests the initial large page demand

The problem with queuing all allocators is two fold.  Firstly allocations
are obviously required for reclaim and those would have to be exempt
from the queue, and these are as likely to prevent coelesce of pages
as any other.  Secondly where a very large allocation is requested
all allocators would be held while reclaim at that size is performed,
majorly increasing latency for those allocations.  Reclaim for an order
0 page may target of the order of 32 pages, whereas reclaim for x86_64
hugepages is 1024 pages minimum.  It would be grosly unfair for a single

-apw
--

From: Nick Piggin
Date: Monday, September 8, 2008 - 6:59 am

*Much* less likely, actually. Because there should be very little allocation
required for reclaim (only dirty pages, and only when backed by filesystems
that do silly things like not ensuring their own reserves before allowing
the page to be dirtied).

Also, your scheme still doesn't avoid allocation for reclaim so I don't see

I don't see why it should be unfair to allow a process to allocate 1024
order-0 pages ahead of one order-10 page (actually, yes the order 10 page
is I guess somewhat more valueable than the same number of fragmented
pages, but you see where I'm coming from).

At least with queueing you are basing the idea on *some* reasonable policy,
rather than purely random "whoever happens to take this lock at the right
time will win" strategy, which might I add, can even be much more unfair
than you might say queueing is.

However, queueing would still be able to allow some flexibility in priority.
For example:

if (must_queue) {
  if (queue_head_prio == 0)
    join_queue(1<<order);
  else {
    queue_head_prio -= 1<<order;
    skip_queue();
  }
}

--

From: Andy Whitcroft
Date: Monday, September 8, 2008 - 9:41 am

In the common use model, use of huge pages is all or nothing.  Either
there are sufficient pages allocatable at application start time or there
are not.  As huge pages are not swappable once allocated they stay there.
Applications either start using huge pages or they fallback to small pages
and continue.  This makes the real metric, how often does the customer get
annoyed becuase their application has fallen back to small pages and is
slow, or how often does their database fail to start.  It is very hard to
directly measure that and thus to get a comparitive figure.  Any attempt

Well yes and no.  A lot of filesystems do do such stupid things.
Allocating things like journal pages which have relativly long lives
during reclaim.  We have seen these getting placed into the memory we

Obviously we cannot prevent allocations during reclaim.  But we can avoid
those allocations falling within the captured range.  All pages under
capture are marked.  Any page returning to the allocator that merges with a
buddy under capture, or that is a buddy under capture are kept separatly.
Such that any allocations from within reclaim will necessarily take
pages from elsewhere in the pool.  The key thing about capture is that it
effectivly marks ranges of pages out of use for allocations for the period
of the reclaim, so we have a number of small ranges blocked out, not the
whole pool.  This allows parallel allocations (reclaim and otherwise)
to succeed against the reserves (refilled by kswapd etc), whilst marking

I think we have our wires crossed here.  I was saying it would seem
unfair to block the allocator from giving out order-0 pages while we are
struggling to get an order-10 page for one process.  Having a queue
would seem to generate such behaviour.  What I am trying to achieve with
capture is to push areas likely to return us a page of the requested
size out of use, while we try and reclaim it without blocking the rest

Sure you would be able to make some kind of more flexible ...
From: Nick Piggin
Date: Monday, September 8, 2008 - 8:31 pm

But you have customers telling you they're getting annoyed because of
this? Or you have your own "realistic" workloads that allocate hugepages
on demand (OK, when I say realistic, something like specjbb or whatever
is obviously a reasonable macrobenchmark even if it isn't strictly

They shouldn't, because that's sad and deadlocky. But yes I agree it

Yeah, but blocking the whole pool gives a *much* bigger chance to coalesce
freed pages. And I'm not just talking about massive order-10 allocations
or something where you have the targetted reclaim which improves chances of
getting a free page within that range, but also for order-1/2/3 pages
that might be more commonly used in normal kernel workloads but can still
have a fairly high latency to succeed if there is a lot of other parallel

We don't have our wires crossed. I just don't agree that it is unfair.
It might be unfair to allow order-10 allocations at the same *rate* at
order-0 allocations, which is why you could allow some priority in the
queue. But when you decide you want to satisfy an order-10 allocation,
do you want the other guys potentially mopping up your coalescing
candidates? (and right, for targetted reclaim, maybe this is less of a
problem, but I'm also thinking about a general solution for all orders

I don't understand the thought process that leads you to these assertions. 
Where do you get your idea of fairness or importance?

I would say that allowing 2^N order-0 allocations for every order-N
allocations if both allocators are in a tight loop (and reserves are low,
ie. reclaim is required) is a completely reasonable starting point for
fairness. Do you disagree with that? How is it less fair than your

But with my queueing you get effectively the same thing without having an
oracle. Because you will wait for *any* order-N region to become free.

The "tradeoff" is blocking other allocators. But IMO that is actually a
good thing because that equates to a general fairness model in our allocator
for all ...
From: Andy Whitcroft
Date: Tuesday, September 9, 2008 - 9:35 am

We have customers complaining about usability of hugepages.  They are
keen to obtain the performance benefits from their use, but put off by
the difficulty in managing systems with them enabled.  That has led to
series of changes to improve availability and managability of hugepages,
including targetted reclaim, anti-fragmentation, and the dynamic pool.
We obviously test hugepages with a series of benchmarks but generally in
controlled environments where allocation of hugepages is guarenteed, to
understand the performance benefits of their use.  This is an acceptable
configuration for hugepage diehards, but limits their utility to the
wider community.  The focus of this work is availability of hugepages and
how that applies to managability.  A common use model for big systems
is batch oriented running a mix of jobs.  For this to work reliably we
need to move pages into and out of the hugepage pool on a regular basis.

Yes in isolation blocking the whole pool would give us a higher chance
of coalescing higher order pages.  The problem here is that we cannot
block the whole pool, we have to allow allocations for reclaim.  If we
have those occuring we risk loss of pages from the areas we are trying

When I am attempting to satisfy an order-10 allocation I clearly want to
protect that order-10 area from use for allocation to other allocators.
And that is the key thrust of the capture stuff, that is what it does.
Capture puts the pages in areas under targetted reclaim out of the
allocator during the reclaim pass.  Protecting those areas from any
parallel parallel allocators, either regular or PF_MEMALLOC; but only
protecting those areas.

I think we have differing views on what a queuing alone can deliver.
Given that the allocator will have to allow allocations made during reclaim
in parallel with reclaim for higher orders it does not seem sufficient
to have a 'fair queue' alone.  That will not stop the smaller necessary
allocations from stealing our partial free buddies, ...
From: Nick Piggin
Date: Tuesday, September 9, 2008 - 8:19 pm

Allocations from reclaim should be pretty rare, I think you're too
worried about them.

Actually, if you have queueing, you can do a simple form of capture
just in the targetted reclaim path without all the stuff added to page
alloc etc. Because you can hold off from freeing the targetted pages
into the page allocator until you have built up a large number of them.
That way, they won't be subject to recursive allocations and they will
coalesce in the allocator; the queueing will ensure they don't get
stolen or broken into smaller pages under us. Yes I really think queueing

And I've shown that by a principle of fairness, you don't need to protect
just one area, but you can hold off reclaim on all areas, which should
greatly increase your chances of getting a hugepage, and will allow you

Queueing is a general and natural step which (hopefully) benefits everyone,
so I think it should be tried before capture (which introduces new complexity

Yes they are orthogonal, but I would like to see if we can do without

For order-2,3 etc allocations that can be commonly used by the kernel but
are not always in large supply, that fairness model doesn't really work.
You can work to free things, then have another process allocate them, then
exit direct reclaim to find none left etc.

Even for order-0 pages there is a problem in theory (although statistically

Yes.
--

From: Andy Whitcroft
Date: Tuesday, September 9, 2008 - 11:56 pm

Am flying today, so I won't get to respond to this one in detail
immediatly.

-apw
--

From: Andy Whitcroft
Date: Friday, September 12, 2008 - 2:30 pm

They are certainly there, particularly with journalled filesystems in
the mix.  But although I do worry about them, my solution has been to
allow all allocations in parallel with reclaim so they fall into the

It almost feels as if you think these patches do much more than they
actually do.  The primary source of captured pages is through collection
of pages as we complete synchronous targetted reclaim on them.  However we
do also collect and merge pages which are already free, and those which
are freed from outside the reclaim during the period of the capture.

From your description I am getting the impression that your proposal is
that there be a queue on the allocator plus capture only for pages we
release during targetted reclaim.  We would then release our list and
try to allocate from the buddy.  Relying on buddy to coelesce them, and
then allocate from there.  The queue protects us from any stealing.

In the development and testing of capture, we actually started out with
simple version of capture, without the additional ability to collect the
already free pages, and those which free during capture.  (It is these
parts which bring the apparent complexity of change to page_alloc.c and I
am assuming is the bit you dislike particularly.)  Testing on this simple
version showed the chances of actually finding areas of the size requested
were close to zero and capture statistically worthless.  That seems to
indicate that for queue+simple capture to have any statistical chance
of improving things all allocations would need to be stopped for the
entire period of the capture.  Even if we assume reclaim allocations
never occur (which we cannot queue), that still leaves us with a singly
threaded allocator once reclaim starts.  Bearing in mind that for very
high order pages that would be periods equivalent to the time it takes

To be completely honest here, I do not think I have yet figured out what
form queueing you really envisage here.  Its name implies an ordering ...
Previous thread: none

Next thread: [PATCH 0/9] ASoC Blackfin supporting (v2) by Bryan Wu on Friday, September 5, 2008 - 3:21 am. (15 messages)