[PATCH 9/13] Do not group pages by mobility type on low memory systems

Previous thread: [PATCH] add page->mapping handling interface [0/35] intro by KAMEZAWA Hiroyuki on Monday, September 10, 2007 - 2:40 am. (40 messages)

Next thread: [Sparc64 BUG] hangup under booting 2.6.22.X whith Sparc64 by Kövedi Krisztián on Monday, September 10, 2007 - 4:42 am. (10 messages)
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:20 am

Hi Andrew,

Here is a restacked version of the grouping pages by mobility patches
based on the patches currently in your tree. It should be  a drop-in
replacement for what is in 2.6.23-rc4-mm1 and is what I propose for merging
to mainline. The change from what you have already is that the redundant
patches are removed. For example, the patches that made grouping pages by
mobility configurable and later removed that ability do not exist in this
set. Simiarly, the patches for grouping high-order atomic allocations together
does not exist. Also note that the first patch related to IA-64 in this set
appears unrelated but it's required by patches and having the change at the
start makes the patchset more comprehensible in terms of dependencies. This
rebasing work is largely the work of Andy Whitcroft. Thanks Andy.

The patches replaced in -mm are as ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:20 am

Subject: ia64: parse kernel parameter hugepagesz= in early boot

Parse hugepagesz with early_param() instead of __setup().  __setup()
is called after the memory allocator has been initialised and the
pageblock bitmaps already setup.  In tests on one IA64 there did not
seem to be any problem with using early_param() and in fact may be
more correct as it guarantees the parameter is handled before the
parsing of hugepages=.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/ia64/Kconfig          |    5 +++++
 arch/ia64/mm/hugetlbpage.c |    4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-clean/arch/ia64/Kconfig linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/arch/ia64/Kconfig
--- linux-2.6.23-rc5-clean/arch/ia64/Kconfig	2007-09-01 07:08:24.000000000 +0100
+++ linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/arch/ia64/Kconfig	2007-09-02 16:18:48.000000000 +0100
@@ -54,6 +54,11 @@ config ARCH_HAS_ILOG2_U64
 	bool
 	default n
 
+config HUGETLB_PAGE_SIZE_VARIABLE
+	bool
+	depends on HUGETLB_PAGE
+	default y
+
 config GENERIC_FIND_NEXT_BIT
 	bool
 	default y
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-clean/arch/ia64/mm/hugetlbpage.c linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.23-rc5-clean/arch/ia64/mm/hugetlbpage.c	2007-09-01 07:08:24.000000000 +0100
+++ linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/arch/ia64/mm/hugetlbpage.c	2007-09-02 16:18:48.000000000 +0100
@@ -194,6 +194,6 @@ static int __init hugetlb_setup_sz(char 
 	 * override here with new page shift.
 	 */
 	ia64_set_rr(HPAGE_REGION_BASE, hpage_shift << 2);
-	return 1;
+	return 0;
 }
-__setup("hugepagesz=", ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:20 am

Subject: Add a bitmap that is used to track flags affecting a block of pages

The grouping pages by mobility patchset needs to track if pages within a block
can be moved or reclaimed so that pages are freed to the appropriate list.
This patch adds a bitmap for flags affecting a whole a pageblock_nr_pages
block of pages.

In non-SPARSEMEM configurations, the bitmap is stored in the struct zone
and allocated during initialisation.  SPARSEMEM dynamically allocates the
bitmap in a struct mem_section as required.

Additional credit to Andy Whitcroft who reviewed up an earlier implementation
of the mechanism an suggested how to make it a *lot* cleaner.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h          |   13 +++
 include/linux/pageblock-flags.h |   74 ++++++++++++++++++
 mm/page_alloc.c                 |  137 +++++++++++++++++++++++++++++++++++
 3 files changed, 224 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/include/linux/mmzone.h linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/include/linux/mmzone.h
--- linux-2.6.23-rc5-001-ia64-parse-kernel-parameter-hugepagesz=-in-early-boot/include/linux/mmzone.h	2007-09-02 16:18:27.000000000 +0100
+++ linux-2.6.23-rc5-002-add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages/include/linux/mmzone.h	2007-09-02 16:19:05.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/seqlock.h>
 #include <linux/nodemask.h>
+#include <linux/pageblock-flags.h>
 #include <asm/atomic.h>
 #include <asm/page.h>
 
@@ -222,6 +223,14 @@ struct zone {
 #endif
 	struct free_area	free_area[MAX_ORDER];
 
+#ifndef CONFIG_SPARSEMEM
+	/*
+	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
+	 * In SPARSEMEM, this map is stored in struct mem_section
+	 ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:21 am

Subject: Fix corruption of memmap on ia64-sparsemem when mem_section is not a power of 2

There are problems in the use of SPARSEMEM and pageblock flags that causes
problems on ia64.

The first part of the problem is that units are incorrect in
SECTION_BLOCKFLAGS_BITS computation.  This results in a map_section's
section_mem_map being treated as part of a bitmap which isn't good.  This
was evident with an invalid virtual address when mem_init attempted to free
bootmem pages while relinquishing control from the bootmem allocator.

The second part of the problem occurs because the pageblock flags bitmap is
be located with the mem_section.  The SECTIONS_PER_ROOT computation using
sizeof (mem_section) may not be a power of 2 depending on the size of the
bitmap.  This renders masks and other such things not power of 2 base.
This issue was seen with SPARSEMEM_EXTREME on ia64.  This patch moves the
bitmap outside of mem_section and uses a pointer instead in the
mem_section.  The bitmaps are allocated when the section is being
initialised.

Note that sparse_early_usemap_alloc() does not use alloc_remap() like
sparse_early_mem_map_alloc().  The allocation required for the bitmap on
x86, the only architecture that uses alloc_remap is typically smaller than
a cache line.  alloc_remap() pads out allocations to the cache size which
would be a needless waste.

Credit to Bob Picco for identifying the original problem and effecting a
fix for the SECTION_BLOCKFLAGS_BITS calculation.  Credit to Andy Whitcroft
for devising the best way of allocating the bitmaps only when required for
the section.

From: Bob Picco <bob.picco@hp.com>
[wli@holomorphy.com: warning fix]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: William Irwin <bill.irwin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |    4 ++-
 mm/sparse.c            |   54 ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:21 am

Subject: Split the free lists for movable and unmovable allocations

This patch adds the core of the fragmentation reduction strategy.  It works by
grouping pages together based on their ability to move.  Basically, it works by
breaking the list in zone->free_area list into MIGRATE_TYPES number of lists.

Mobility grouping works at an abitrary order less than or equal to MAX_ORDER.
Generally this is a fixed sized defined at compile time. However, on
platforms like ia64 where the huge page size is runtime configurable it
is desirable to group at a this order.  On x86_64 and occasionally on x86,
the hugepage size may not always be MAX_ORDER_NR_PAGES.

This patch groups pages together based on the value of HUGETLB_PAGE_ORDER. It
uses a compile-time constant if possible and a variable where the huge page
size is runtime configurable.

It is assumed that grouping should be done at the lowest sensible order and
that the user would not want to override this.  If this is not true,
page_block order could be forced to a variable initialised via a boot-time
kernel parameter.

Note that many allocations are already flagged as __GFP_MOVABLE which is
re-used by this patch to determine how pages should be grouped.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h          |   10 ++
 include/linux/pageblock-flags.h |    1 
 mm/page_alloc.c                 |  143 +++++++++++++++++++++++++++++------
 3 files changed, 129 insertions(+), 25 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-003-fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2/include/linux/mmzone.h linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/include/linux/mmzone.h
--- ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:21 am

Subject: Choose pages from the per cpu list-based on migration type

The freelists for each migrate type can slowly become polluted due to the
per-cpu list.  Consider what happens when the following happens

1. A 2^pageblock_order list is reserved for __GFP_MOVABLE pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation

This results in a kernel page is in the middle of a migratable region. This
patch prevents this leak occuring by storing the MIGRATE_ type of the page in
page->private. On allocate, a page will only be returned of the desired type,
else more pages will be allocated. This may temporarily allow a per-cpu list
to go over the pcp->high limit but it'll be corrected on the next free. Care
is taken to preserve the hotness of pages recently freed.

The additional code is not measurably slower for the workloads we've tested.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c linux-2.6.23-rc5-005-choose-pages-from-the-per-cpu-list-based-on-migration-type/mm/page_alloc.c
--- linux-2.6.23-rc5-004-split-the-free-lists-for-movable-and-unmovable-allocations/mm/page_alloc.c	2007-09-02 16:19:34.000000000 +0100
+++ linux-2.6.23-rc5-005-choose-pages-from-the-per-cpu-list-based-on-migration-type/mm/page_alloc.c	2007-09-02 16:20:09.000000000 +0100
@@ -757,7 +757,8 @@ static int rmqueue_bulk(struct zone *zon
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
-		list_add_tail(&page->lru, list);
+		list_add(&page->lru, list);
+		set_page_private(page, migratetype);
 ...
From: Andrew Morton
Date: Monday, July 13, 2009 - 12:16 pm

On Mon, 10 Sep 2007 12:21:51 +0100 (IST)



We're doing a linear search through the per-cpu magaznines right there
in the page allocator hot path.  Even if the search matches the first
element, the setup costs will matter.

Surely we can make this search go away with a better choice of data

--

From: Mel Gorman
Date: Tuesday, July 14, 2009 - 2:14 am

I have a patch that expands the per-cpu structure and eliminates the search
and I made various attempts at reducing the setup cost (e.g.  checking if
the first element suited before starting the search). However, I wasn't been
able to show for definite it made anything faster but it did increase

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Monday, September 10, 2007 - 4:22 am

Subject: Group short-lived and reclaimable kernel allocations

This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations.  When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
them and re-reading the information from elsewhere.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/buffer.c                     |    3 ++-
 fs/jbd/journal.c                |    4 ++--
 fs/jbd/revoke.c                 |    6 ++++--
 fs/proc/base.c                  |   13 +++++++------
 fs/proc/generic.c               |    2 +-
 include/linux/gfp.h             |   15 ++++++++++++---
 include/linux/mmzone.h          |    5 +++--
 include/linux/pageblock-flags.h |    2 +-
 include/linux/slab.h            |    4 +++-
 kernel/cpuset.c                 |    2 +-
 lib/radix-tree.c                |    6 ++++--
 mm/page_alloc.c                 |   10 +++++++---
 mm/shmem.c                      |    4 ++--
 mm/slab.c                       |    2 ++
 mm/slub.c                       |    3 +++
 15 files changed, 54 insertions(+), 27 deletions(-)

Index: linux-2.6.23-rc4-mm1-redropped/fs/buffer.c
===================================================================
--- linux-2.6.23-rc4-mm1-redropped.orig/fs/buffer.c	2007-09-09 18:23:34.000000000 +0100
+++ linux-2.6.23-rc4-mm1-redropped/fs/buffer.c	2007-09-09 18:26:16.000000000 +0100
@@ -3100,7 +3100,8 @@
 	
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 {
-	struct buffer_head *ret = ...
From: Paul Jackson
Date: Monday, September 10, 2007 - 12:44 pm

Minor nit, Mel.

It's easier to read patches if you use the diff -p option:

       -p  --show-c-function
              Show which C function each change is in.

Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Mel Gorman
Date: Monday, September 10, 2007 - 2:15 pm

That's a fair comment. I normally make sure it's there but it got missed
in a few patches in this set which is awkward. Sorry about that.

-- 
Mel Gorman

-

From: Mel Gorman
Date: Monday, September 10, 2007 - 4:22 am

Subject: Drain per-cpu lists when high-order allocations fail

Per-cpu pages can accidentally cause fragmentation because they are free, but
pinned pages in an otherwise contiguous block.  When this patch is applied,
the per-cpu caches are drained after the direct-reclaim is entered if the
requested order is greater than 0.  It simply reuses the code used by suspend
and hotplug.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-006-group-short-lived-and-reclaimable-kernel-allocations/mm/page_alloc.c linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c
--- linux-2.6.23-rc5-006-group-short-lived-and-reclaimable-kernel-allocations/mm/page_alloc.c	2007-09-02 16:20:31.000000000 +0100
+++ linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c	2007-09-02 16:20:48.000000000 +0100
@@ -852,6 +852,7 @@ void mark_free_pages(struct zone *zone)
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
+#endif /* CONFIG_PM */
 
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
@@ -864,7 +865,25 @@ void drain_local_pages(void)
 	__drain_pages(smp_processor_id());
 	local_irq_restore(flags);	
 }
-#endif /* CONFIG_HIBERNATION */
+
+void smp_drain_local_pages(void *arg)
+{
+	drain_local_pages();
+}
+
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator
+ */
+void drain_all_local_pages(void)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__drain_pages(smp_processor_id());
+	local_irq_restore(flags);
+
+	smp_call_function(smp_drain_local_pages, NULL, 0, 1);
+}
 
 /*
  * Free a 0-order page
@@ -1452,6 +1471,9 @@ nofail_alloc:
 
 	cond_resched();
 
+	if (order != 0)
+		drain_all_local_pages();
+
 	if (likely(did_some_progress)) ...
From: Nick Piggin
Date: Monday, September 10, 2007 - 8:05 am

Does this help? I have a more general version which could go in
-

From: Mel Gorman
Date: Tuesday, September 11, 2007 - 2:34 am

Yes, it does help. It's noticable when one is trying to get as much
memory in hugepages as possible. It reaches a certain point where
hugepages are free but pinned due to per-cpu pages. This "certain point"
depends on the number of CPUs as a ratio to the size of physical memory
as well as a certain degree of randomness as the location of per-cpu
pages is not predictable. Worst case is not being able to allocate
something like (NR_CPUS * pcp->high * 2) hugepages even if they are
otherwise free.

By all means if you have a general version, send it and I'll take a
look. If it's more general and nicer but still can be used to drain the
per-cpu lists when high-order allocations fail, I'm all for it.



-

From: Mel Gorman
Date: Monday, September 10, 2007 - 4:22 am

Subject: Move free pages between lists on steal

When a fallback is forced to steal a page from a block of a different
type and more than half of the block is free reassign that block to the
new type and move the free pages over to the new type's free lists.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
[y-goto@jp.fujitsu.com: fix BUG_ON check at move_freepages()]
[apw@shadowen.org: Move to using pfn_valid_within()]
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andy Whitcroft <andyw@uk.ibm.com>
Cc: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   72 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 70 insertions(+), 2 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c linux-2.6.23-rc5-008-move-free-pages-between-lists-on-steal/mm/page_alloc.c
--- linux-2.6.23-rc5-007-drain-per-cpu-lists-when-high-order-allocations-fail/mm/page_alloc.c	2007-09-02 16:20:48.000000000 +0100
+++ linux-2.6.23-rc5-008-move-free-pages-between-lists-on-steal/mm/page_alloc.c	2007-09-02 16:21:09.000000000 +0100
@@ -662,6 +662,72 @@ static int fallbacks[MIGRATE_TYPES][MIGR
 	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE },
 };
 
+/*
+ * Move the free pages in a range to the free lists of the requested type.
+ * Note that start_page and end_pages are not aligned on a pageblock
+ * boundary. If alignment is required, use move_freepages_block()
+ */
+int move_freepages(struct zone *zone,
+			struct page *start_page, struct page *end_page,
+			int migratetype)
+{
+	struct page *page;
+	unsigned long order;
+	int blocks_moved = 0;
+
+#ifndef CONFIG_HOLES_IN_ZONE
+	/*
+	 * page_zone is not safe to call in this context when
+	 * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
+	 * ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:23 am

Subject: Do not group pages by mobility type on low memory systems

Where there are fewer than one pageblock in the system per mobility
type mixing is inevitable and any attempt to prevent it will fail
in a costly manner. This patch checks the size of vm_total_pages in
build_all_zonelists(). If there are not enough areas,  mobility is effectivly
disabled by considering all allocations as the same type (UNMOVABLE).
This is achived via a __read_mostly flag.

This patch removes any need to disable grouping pages by mobility at
compile time.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-008-move-free-pages-between-lists-on-steal/mm/page_alloc.c linux-2.6.23-rc5-009-do-not-group-pages-by-mobility-type-on-low-memory-systems/mm/page_alloc.c
--- linux-2.6.23-rc5-008-move-free-pages-between-lists-on-steal/mm/page_alloc.c	2007-09-02 16:21:09.000000000 +0100
+++ linux-2.6.23-rc5-009-do-not-group-pages-by-mobility-type-on-low-memory-systems/mm/page_alloc.c	2007-09-02 16:21:30.000000000 +0100
@@ -154,8 +154,13 @@ int nr_node_ids __read_mostly = MAX_NUMN
 EXPORT_SYMBOL(nr_node_ids);
 #endif
 
+int page_group_by_mobility_disabled __read_mostly;
+
 static inline int get_pageblock_migratetype(struct page *page)
 {
+	if (unlikely(page_group_by_mobility_disabled))
+		return MIGRATE_UNMOVABLE;
+
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
@@ -169,6 +174,10 @@ static inline int allocflags_to_migratet
 {
 	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
 
+	if (unlikely(page_group_by_mobility_disabled))
+		return MIGRATE_UNMOVABLE;
+
+	/* Cluster based on mobility */
 	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
 		((gfp_flags & __GFP_RECLAIMABLE) != 0);
 }
@@ -2294,9 ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:23 am

Subject: Bias the location of pages freed for min_free_kbytes in the same pageblock_nr_pages areas

The standard buddy allocator always favours splitting the smallest block of
pages. The effect of this is that the pages free to satisfy min_free_kbytes
tends to be preserved since boot time at the same location of memory for a
very long time, remaining contiguous.  When an administrator sets the
reserve at 16384 at boot time, it tends to be the same MAX_ORDER blocks
that remain free.  This allows the occasional high atomic allocation
to succeed up until the point the blocks are split.  In practice, it is
difficult to split these blocks but when they do split, the benefit of
having min_free_kbytes for contiguous blocks disappears.  Additionally,
increasing min_free_kbytes once the system has been running for some time
has no guarantee of creating contiguous blocks.

On the other hand, grouping pages by mobility favours the splitting of large
blocks when there are no free pages of the appropriate type available.  A
side-effect of this is that all blocks in memory tends to be used up and
the contiguous free blocks from boot time are not preserved like in the
vanilla allocator.  This can cause a problem if a new caller is unwilling
to reclaim or does not reclaim for long enough.

A failure scenario was found for a wireless network device allocating
order-1 atomic allocations but the allocations were not intense or frequent
enough for a whole block of pages to be preserved for MIGRATE_HIGHALLOC.
This was reproduced on a desktop by booting with mem=256mb, forcing the
driver to allocate at order-1, running a bittorrent client (downloading a
debian ISO) and building a kernel with -j2.

This patch addresses the problem on the desktop machine booted with mem=256mb.
It works by setting aside a reserve of pageblock_nr_pages blocks, the
number of which depends on the value of min_free_kbytes.  These blocks are
only fallen back to when there is no other free pages.  Then the smallest
possible ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:23 am

Subject: Bias the placement of kernel pages at lower pfns

This patch chooses blocks with lower PFNs when placing kernel allocations.
This is particularly important during fallback in low memory situations to
stop unmovable pages being placed throughout the entire address space.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-010-bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks/mm/page_alloc.c linux-2.6.23-rc5-011-bias-the-placement-of-kernel-pages-at-lower-pfns/mm/page_alloc.c
--- linux-2.6.23-rc5-010-bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks/mm/page_alloc.c	2007-09-02 16:22:04.000000000 +0100
+++ linux-2.6.23-rc5-011-bias-the-placement-of-kernel-pages-at-lower-pfns/mm/page_alloc.c	2007-09-02 16:22:27.000000000 +0100
@@ -768,6 +768,23 @@ int move_freepages_block(struct zone *zo
 	return move_freepages(zone, start_page, end_page, migratetype);
 }
 
+/* Return the page with the lowest PFN in the list */
+static struct page *min_page(struct list_head *list)
+{
+	unsigned long min_pfn = -1UL;
+	struct page *min_page = NULL, *page;;
+
+	list_for_each_entry(page, list, lru) {
+		unsigned long pfn = page_to_pfn(page);
+		if (pfn < min_pfn) {
+			min_pfn = pfn;
+			min_page = page;
+		}
+	}
+
+	return min_page;
+}
+
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 						int start_migratetype)
@@ -791,8 +808,11 @@ static struct page *__rmqueue_fallback(s
 			if (list_empty(&area->free_list[migratetype]))
 				continue;
 
+			/* Bias kernel allocations towards low pfns */
 			page = list_entry(area->free_list[migratetype].next,
 					struct page, lru);
+			if (unlikely(start_migratetype != ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:24 am

Subject: Be more agressive about stealing when MIGRATE_RECLAIMABLE allocations fallback

MIGRATE_RECLAIMABLE allocations tend to be very bursty in nature like when
updatedb starts.  It is likely this will occur in situations where MAX_ORDER
blocks of pages are not free.  This means that updatedb can scatter
MIGRATE_RECLAIMABLE pages throughout the address space.  This patch is more
agressive about stealing blocks of pages for MIGRATE_RECLAIMABLE.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc5-011-bias-the-placement-of-kernel-pages-at-lower-pfns/mm/page_alloc.c linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/mm/page_alloc.c
--- linux-2.6.23-rc5-011-bias-the-placement-of-kernel-pages-at-lower-pfns/mm/page_alloc.c	2007-09-02 16:22:27.000000000 +0100
+++ linux-2.6.23-rc5-012-be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback/mm/page_alloc.c	2007-09-02 16:22:47.000000000 +0100
@@ -713,7 +713,7 @@ int move_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long order;
-	int blocks_moved = 0;
+	int pages_moved = 0;
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -742,10 +742,10 @@ int move_freepages(struct zone *zone,
 		list_add(&page->lru,
 			&zone->free_area[order].free_list[migratetype]);
 		page += 1 << order;
-		blocks_moved++;
+		pages_moved += 1 << order;
 	}
 
-	return blocks_moved;
+	return pages_moved;
 }
 
 int move_freepages_block(struct zone *zone, struct page *page, int migratetype)
@@ -817,11 +817,22 @@ static struct page *__rmqueue_fallback(s
 
 			/*
 			 * If breaking a large block of pages, move all free
-			 * pages to the preferred allocation list
+			 * pages to the preferred allocation list. If falling
+			 * back for a reclaimable kernel ...
From: Mel Gorman
Date: Monday, September 10, 2007 - 4:24 am

Subject: Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo

This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
The information is collected only on request so there is no runtime overhead.
The statistics are in three parts:

The first part prints information on the size of blocks that pages are
being grouped on and looks like

Page block order: 10
Pages per block:  1024

The second part is a more detailed version of /proc/buddyinfo and looks like

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type  Reclaimable      1      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Reserve      0      4      4      0      0      0      0      1      0      1      0
Node    0, zone   Normal, type    Unmovable    111      8      4      4      2      3      1      0      0      0      0
Node    0, zone   Normal, type  Reclaimable    293     89      8      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Movable      1      6     13      9      7      6      3      0      0      0      0
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      4

The third part looks like

Number of blocks type     Unmovable  Reclaimable      Movable      Reserve
Node 0, zone      DMA            0            1            2            1
Node 0, zone   Normal            3           17           94            4

To walk the zones within a node with interrupts disabled, walk_zones_in_node()
is introduced and shared between /proc/buddyinfo, /proc/zoneinfo ...
From: Andrew Morton
Date: Thursday, September 13, 2007 - 6:01 pm

It really gives me the creeps to throw away a large set of large patches
and to then introduce a new set.

What would go wrong if we just merged the patches I already have?
-

From: Mel Gorman
Date: Friday, September 14, 2007 - 7:33 am

Nothing, the end result is more or less the same. There are three style
cleanups in the restack and for some reason, one of the functions moved
but otherwise they are identical.

The restacked version was provided to illustrate what the final stack really
looks like and because I thought you would prefer it over a stack that had
one patch introducing a change and a later patch removing it (like making
it configurable for example). It also allowed us to test against mainline
to make sure everything was ok prior to the merge.

Go ahead with the patches you already
have if you prefer. Just make sure not to include
breakout-page_order-to-internalh-to-avoid-special-knowledge-of-the-buddy-allocator.patch
as it's only required for page-owner-tracking.

Thanks Andrew.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Andrew Morton
Date: Sunday, September 16, 2007 - 3:34 am

memory-unplug-v7-page-isolation.patch uses page_order() also, so I brought this patch back.
-

Previous thread: [PATCH] add page->mapping handling interface [0/35] intro by KAMEZAWA Hiroyuki on Monday, September 10, 2007 - 2:40 am. (40 messages)

Next thread: [Sparc64 BUG] hangup under booting 2.6.22.X whith Sparc64 by Kövedi Krisztián on Monday, September 10, 2007 - 4:42 am. (10 messages)