From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

The only change is relatively minor and is around the migration of unmapped
PageSwapCache pages. Specifically, it's not safe to access anon_vma for
these pages when remapping after migration completes so the last patch
makes sure we don't.

Are there any further obstacles to merging?

Changelog since V6
  o Avoid accessing anon_vma when migrating unmapped PageSwapCache pages

Changelog since V5
  o Rebase to mmotm-2010-03-24-14-48
  o Add more reviewed-by's
  o Correct one spelling in vmstat.c and some leader clarifications
  o Split the LRU isolation modes into a separate path
  o Correct a NID change
  o Call migrate_prep less frequently
  o Remove unnecessary inlining
  o Do not interfere with memory hot-remove
  o Do not compact for orders <= PAGE_ALLOC_COSTLY_ORDER
  o page_mapped instead of page_mapcount and allow swapcache to migrate
  o Avoid too many pages being isolated for migration
  o Handle PageSwapCache pages during migration

Changelog since V4
  o Remove unnecessary check for PageLRU and PageUnevictable
  o Fix isolated accounting
  o Close race window between page_mapcount and rcu_read_lock
  o Added a lot more Reviewed-by tags

Changelog since V3
  o Document sysfs entries (subsequently merged independently)
  o COMPACTION should depend on MMU
  o Comment updates
  o Ensure proc/sysfs triggering of compaction fully completes
  o Rename anon_vma refcount to external_refcount
  o Rebase to mmotm on top of 2.6.34-rc1

Changelog since V2
  o Move unusable and fragmentation indices to separate proc files
  o Express indices as being between 0 and 1
  o Update copyright notice for compaction.c
  o Avoid infinite loop when split free page fails
  o Init compact_resume at least once (impacted x86 testing)
  o Fewer pages are isolated during compaction.
  o LRU lists are no longer rotated when page is busy
  o NR_ISOLATED_* is updated to avoid isolating too many pages
  o Update zone LRU stats correctly when isolating pages
  o Reference count anon_vma ...
From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas and consumes the free pages within, making them available for the
migration scanner. The pages isolated for migration are then migrated to
the newly isolated free pages.
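
As a rough userspace illustration of the two-scanner scheme described above
(not the code in this patch), the sketch below treats a zone as a character
array and works page by page rather than in pageblock-sized areas. The names
toy_compact_zone, PAGE_FREE, PAGE_MOVABLE and PAGE_PINNED are made up for the
example.

```c
#include <stddef.h>

/* Toy page states; illustrative only, not kernel definitions. */
enum { PAGE_FREE = '.', PAGE_MOVABLE = 'M', PAGE_PINNED = 'U' };

/*
 * Compact a toy "zone" in place and return the number of pages moved.
 * The migration scanner walks up from the bottom looking for movable
 * pages; the free scanner walks down from the top looking for free
 * slots. The run ends when the two scanners meet.
 */
static size_t toy_compact_zone(char *zone, size_t nr_pages)
{
	size_t migrate_pfn = 0;		/* next page the migration scanner inspects */
	size_t free_pfn = nr_pages;	/* one past the next page the free scanner inspects */
	size_t moved = 0;

	while (migrate_pfn < free_pfn) {
		/* Migration scanner: skip free and pinned pages. */
		if (zone[migrate_pfn] != PAGE_MOVABLE) {
			migrate_pfn++;
			continue;
		}

		/* Free scanner: find the highest free page above the migration scanner. */
		while (free_pfn > migrate_pfn + 1 && zone[free_pfn - 1] != PAGE_FREE)
			free_pfn--;
		if (free_pfn <= migrate_pfn + 1)
			break;		/* scanners met, nothing left to do */

		/* "Migrate" the page to the free slot near the end of the zone. */
		zone[free_pfn - 1] = PAGE_MOVABLE;
		zone[migrate_pfn] = PAGE_FREE;
		free_pfn--;
		migrate_pfn++;
		moved++;
	}
	return moved;
}
```

Called on the eight-page zone "M.MU..M.", this moves two pages and leaves
"...U.MMM": the pinned page stays put, and the free pages that can be
gathered end up at the low end of the zone.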

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |    9 +
 include/linux/mm.h         |    1 +
 include/linux/swap.h       |    1 +
 include/linux/vmstat.h     |    1 +
 mm/Makefile                |    1 +
 mm/compaction.c            |  379 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   39 +++++
 mm/vmstat.c                |    5 +
 8 files changed, 436 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..dbebe58
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,9 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE	0
+#define COMPACT_PARTIAL		1
+#define COMPACT_COMPLETE	2
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3b473a..f920815 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -335,6 +335,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
+int ...
From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:05 pm

On Fri,  2 Apr 2010 17:02:42 +0100

Confused.  "incomplete" and "partial" are synonyms.  Please fully

Why?  What are the implications of this decision?  How was it arrived

I'm kinda surprised that we don't already have a function to do this.

An `unsigned' return value would make more sense.  Perhaps even
`unsigned long', unless there's something else here which would prevent

	end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);


What does "This assumes the block is valid" mean?  The code checks

hm.  pfn_to_page() isn't exactly cheap in some memory models.  I wonder
if there was some partial result we could have locally cached across

Strange.  Having just busted a pageblock_order-sized higher-order page
into order-0 pages, the loop goes on and inspects the remaining
(1-2^pageblock_order) pages, presumably to no effect.  Perhaps

	for (; blockpfn < end_pfn; blockpfn++) {

should be

	for (; blockpfn < end_pfn; blockpfn += pageblock_nr_pages) {

or somesuch.

btw, is the whole pageblock_order thing as sucky as it seems?  If I
want my VM to be oriented to making order-4-skb-allocations work, I
need to tune it that way, to coopt something the hugepage fetishists



Well.  This code checks each pfn it touches, but
isolate_freepages_block() doesn't do this - isolate_freepages_block()
happily blunders across a contiguous span of pageframes, assuming that



Can this happen?





This test could/should be moved inside the preceding `if' block.  Or,
better, simply do

		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)



If zone->spanned_pages is much much larger than zone->present_pages,
this code will suck rather a bit.  Is there a reason why that can never

<stares at that for a while>

Perhaps

	while ((ret = compact_finished(zone, cc)) == COMPACT_INCOMPLETE) {

would be clearer.  That would make the definition-site initialisation


Boy, this looks like an infinite loop waiting to happen.  Are you sure?
Suppose we hit a ...
From: Mel Gorman
Date: Wednesday, April 7, 2010 - 8:21 am

I have a difficulty in that it's hard to give you fixes as it would
span two patches. It might be easiest on you overall if you do a

s/COMPACT_INCOMPLETE/COMPACT_CONTINUE/

on both this patch and the direct compaction patch. I'll then send a follow-on
patch documenting the four defines (later patch adds a fourth) as

/* Return values for compact_zone() and try_to_compact_pages() */

/* compaction didn't start as it was not possible or direct reclaim was more suitable */
#define COMPACT_SKIPPED         0

/* compaction should continue to another pageblock */
#define COMPACT_CONTINUE        1

/* direct compaction partially compacted a zone and there are suitable pages */
#define COMPACT_PARTIAL         2

/* The full zone was compacted */


Pro: Latencies are lower, fewer pages are isolated at any given time
Con: There is a wider window during which a parallel allocator can use a

It's somewhat arbitrary, only that reclaim works on similar units and
they share logic on what the correct number of pages to have isolated

The higher the value, the longer the latency is that the lock is held
during isolation but under very heavy memory pressure, there might be
higher success rates for allocation as the window during which parallel
allocators can allocate pages being compacted is reduced.

The lower the value, the lower the time the lock is held. Fewer pages
will be isolated at any given time.

The only advantage of either choice is increasing the value makes it
less likely a parallel allocator will interfere but it had to be
balanced against the lock hold latency time. As we appear to be ok with
the hold time for reclaim, it was reasonable to assume we'd also be ok


Included in the patch below. The corner-case is impossible. We're
isolating only COMPACT_CLUSTER_MAX and this must be less than
MAX_ORDER_NR_PAGES. However, the return value of the function is used with
an unsigned long.  Technically, it could be unsigned int but page counts


Typically, a ...
From: Mel Gorman
Date: Thursday, April 8, 2010 - 9:59 am

When merging compaction and transparent huge pages, Andrea spotted and
fixed this problem in his tree but it should go to mmotm as well.

Thanks Andrea.

==== CUT HERE ====
mm,compaction: page buddy can go away before reading page_order while isolating pages for migration

From: Andrea Arcangeli <aarcange@redhat.com>

zone->lock isn't held so the optimisation is unsafe. The page could be
allocated between when PageBuddy is checked and page-order is called. The
scanner will harmlessly walk the other free pages so let's just skip this
optimization.

This is a fix to the patch "Memory compaction core".

[mel@csn.ul.ie: Expanded the changelog]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>

---
 mm/compaction.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index dadad52..4fb33f6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -262,10 +262,8 @@ static unsigned long isolate_migratepages(struct zone *zone,
 
 		/* Get the page and skip if free */
 		page = pfn_to_page(low_pfn);
-		if (PageBuddy(page)) {
-			low_pfn += (1 << page_order(page)) - 1;
+		if (PageBuddy(page))
 			continue;
-		}
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
--

From: Andrea Arcangeli
Date: Thursday, April 8, 2010 - 10:06 am

Thanks Mel for submitting this fix!
--

From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

Currently, vmscan.c defines the isolation modes for
__isolate_lru_page(). Memory compaction needs access to these modes for
isolating pages for migration.  This patch exports them.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/swap.h |    5 +++++
 mm/vmscan.c          |    5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..986b12d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
 	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
+#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..ef89600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -839,11 +839,6 @@ keep:
 	return nr_reclaimed;
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /*
  * Attempt to remove the specified page from its LRU.  Only take this page
  * if it is of the appropriate PageActive status.  Pages which are being
-- 
1.6.5

--

From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

Fragmentation index is a value that makes sense when an allocation of a
given size would fail. The index indicates whether an allocation failure is
due to a lack of memory (values towards 0) or due to external fragmentation
(value towards 1).  For the most part, the huge page size will be the size
of interest but not necessarily so it is exported on a per-order and per-zone
basis via /proc/extfrag_index
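
The calculation itself is not shown in this part of the mail, so the
following is only a floating-point sketch of one formula with the behaviour
described above; the formula, the struct and the function name are
assumptions for illustration. It returns a value near 0 when free memory is
simply short, a value near 1 when plenty of memory is free but scattered
across small blocks, and -1 when a suitable block exists and the index is
not meaningful.

```c
/* Free-list summary for one zone; field names are illustrative. */
struct toy_free_info {
	unsigned long free_pages;		/* total free pages */
	unsigned long free_blocks_total;	/* free blocks of any order */
	unsigned long free_blocks_suitable;	/* free blocks of at least the requested order */
};

/* Assumed formula: 1 - (1 + free_pages / requested) / free_blocks_total */
static double toy_fragmentation_index(unsigned int order,
				      const struct toy_free_info *info)
{
	double requested = (double)(1UL << order);

	if (info->free_blocks_total == 0)
		return 0.0;	/* nothing free at all: lack of memory */
	if (info->free_blocks_suitable > 0)
		return -1.0;	/* allocation would succeed, index not meaningful */

	return 1.0 - (1.0 + info->free_pages / requested) / info->free_blocks_total;
}
```

With 1024 free pages held entirely in order-0 blocks, an order-9 request
gives a value just under 1 (fragmentation); with a single free page it gives
a value very close to 0 (lack of memory).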

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/filesystems/proc.txt |   14 ++++++-
 mm/vmstat.c                        |   82 ++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e87775a..c041638 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -422,6 +422,7 @@ Table 1-5: Kernel info in /proc
  filesystems Supported filesystems                             
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
+ extfrag_index Additional page allocator information (see text) (2.5)
  fb	     Frame Buffer devices				(2.4)
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
@@ -611,7 +612,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo and unusable_index
+pagetypeinfo, unusable_index and extfrag_index.
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -662,6 +663,17 @@ value between 0 and 1. The higher the value, the more of free memory is
 unusable and by implication, the worse the external fragmentation is. This
 can be expressed as a percentage by ...
From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:05 pm

On Fri,  2 Apr 2010 17:02:40 +0100

(/proc/sys/vm?)

Like unusable_index, this seems awfully specialised.  Perhaps we could
hide it under CONFIG_MEL, or even put it in debugfs with the intention
of removing it in 6 or 12 months time.  Either way, it's hard to
justify permanently adding this stuff to every kernel in the world?


I have a suspicion that all the info in unusable_index and
extfrag_index could be computed from userspace using /proc/kpageflags
(and perhaps a bit of dmesg-diddling to find the zones).  If that can't
be done today, I bet it'd be pretty easy to arrange for it.


--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 3:46 am

Except in this case, the fragmentation index is used by the kernel when
deciding in advance whether compaction will do the job or if lumpy
reclaim is required.

I could avoid exposing this to userspace but it would make it harder to
decide what needs to happen with extfrag_threshold later. i.e. does the
threshold need a different value (proc would help gather the data) or

Moving it to debugfs would satisfy the requirement of tuning extfrag_threshold

It can be computed from buddyinfo. I used a perl script to calculate it
in the past. I exposed the information from in-kernel in these patches so


It is. Will I just remove the proc files, keep the internal calculation
for fragmentation_index and kick that perl script into shape to produce
the same information from buddyinfo?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Tuesday, April 13, 2010 - 5:43 am

==== CUT HERE ====
mm,compaction: Move extfrag_index to debugfs

extfrag_index can be worked out from userspace but for debugging and
tuning compaction, it'd be best for all users to have the same
information. This patch moves extfrag_index to debugfs where it is both
easier to configure out and remove at some future date.

This is a fix to the patch "Export fragmentation index via
/proc/extfrag_index". When merged, it'll collide with the patch "Direct
compact when a high-order allocation fails" but the resolution is
relatively straightforward - preserve the fragmentation_index functions
and delete the proc-related functions as they are now at the bottom of
the file under ifdef CONFIG_DEBUG_FS.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/filesystems/proc.txt |   14 +----
 mm/vmstat.c                        |  110 ++++++++++++++++++------------------
 2 files changed, 57 insertions(+), 67 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66ebc11..74d2605 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -422,7 +422,6 @@ Table 1-5: Kernel info in /proc
  filesystems Supported filesystems                             
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
- extfrag_index Additional page allocator information (see text) (2.5)
  fb	     Frame Buffer devices				(2.4)
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
@@ -611,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo and extfrag_index.
+pagetypeinfo.
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -652,17 +651,6 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 ...
From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
value is written to the file, all zones are compacted. The expected user
of such a trigger is a job scheduler that prepares the system before the
target application runs.
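
A job scheduler could trigger this along the following lines. This is a
minimal sketch: toy_write_trigger is a made-up helper, and it writes the
value 1, which is an arbitrary choice as far as this patch is concerned.

```c
#include <stdio.h>

/*
 * Write "1" to the given trigger file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int toy_write_trigger(const char *path)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs("1\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer, so a failed write can surface here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

On a kernel built with CONFIG_COMPACTION, calling
toy_write_trigger("/proc/sys/vm/compact_memory") as root would request
compaction of all zones; on other systems the open fails and the helper
returns -1.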

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/sysctl/vm.txt |   11 +++++++
 include/linux/compaction.h  |    6 ++++
 kernel/sysctl.c             |   10 +++++++
 mm/compaction.c             |   62 ++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 88 insertions(+), 1 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 56366a5..803c018 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - block_dump
+- compact_memory
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
 
 ==============================================================
 
+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When an arbitrary value
+is written to the file, all zones are compacted such that free memory
+is available in contiguous blocks where possible. This can be important
+for example in the allocation of huge pages although processes will also
+directly compact memory as required.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index dbebe58..fef591b 100644
--- ...
From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:05 pm

On Fri,  2 Apr 2010 17:02:43 +0100

Might be better if "when the number 1 is written...".  That permits you


It would be better to do

	struct compact_control cc = {
		.nr_freepages = 0,
		etc

because if you later add more fields to compact_control, everything
else works by magick.  That's served us pretty well with
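
The point about designated initialisers can be seen in a small standalone
sketch. The struct below is a made-up stand-in, not the real
compact_control: C guarantees that any member left out of the initialiser
is zeroed, so a field added later needs no change at the initialisation
site.

```c
/* Hypothetical stand-in for compact_control; fields are illustrative. */
struct toy_compact_control {
	unsigned long nr_freepages;
	unsigned long nr_migratepages;
	unsigned long free_pfn;
	int order;		/* imagine this field was added by a later patch */
};

static struct toy_compact_control toy_make_control(unsigned long free_pfn)
{
	struct toy_compact_control cc = {
		.nr_freepages = 0,
		.nr_migratepages = 0,
		.free_pfn = free_pfn,
		/* .order is not named, so it is implicitly initialised to 0 */
	};

	return cc;
}
```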

--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 8:39 am

Functionally, they shouldn't even need it. Direct compaction should work
just fine but it's the type of thing a job scheduler might want so it could
easily work out how many huge pages it potentially has in advance for example.
The same information could be figured out if your kpagemap-foo was strong
enough.

It would also be useful for debugging direct compaction in the same way
drop_caches can be useful. i.e. it's rarely the right thing to use but
it can be handy to illustrate a point. I didn't want to write that into

Done. This is done in the patch below. It'll then collide with a later
patch where order is introduced but it's a trivial fixup to move the


==== CUT HERE ====

mm,compaction: Tighten up the allowed values for compact_memory and initialisation

This patch updates the documentation on compact_memory to only define 1
as an allowed value in case it needs to be expanded later. It also
changes how a compact_control structure is initialised to avoid
potential trouble in the future.

This is a fix to the patch "Add /proc trigger for memory compaction".

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/sysctl/vm.txt |    9 ++++-----
 mm/compaction.c             |    9 +++++----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 803c018..3b3fa1b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -67,11 +67,10 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
 
 compact_memory
 
-Available only when CONFIG_COMPACTION is set. When an arbitrary value
-is written to the file, all zones are compacted such that free memory
-is available in contiguous blocks where possible. This can be important
-for example in the allocation of huge pages although processes will also
-directly compact memory as required.
+Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
+all zones are compacted such ...
From: Mel Gorman
Date: Wednesday, April 7, 2010 - 11:27 am

Minor mistake in the initialisation part of the patch

==== CUT HERE ====
mm,compaction: Initialise cc->zone at the correct time

Init cc->zone after we know what zone we are looking for. This is a fix
to the fix patch "mm,compaction: Tighten up the allowed values for
compact_memory and initialisation"

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/compaction.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index d9c5733..effe57d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -396,13 +396,13 @@ static int compact_node(int nid)
 		struct compact_control cc = {
 			.nr_freepages = 0,
 			.nr_migratepages = 0,
-			.zone = zone,
 		};
 
 		zone = &pgdat->node_zones[zoneid];
 		if (!populated_zone(zone))
 			continue;
 
+		cc.zone = zone;
 		INIT_LIST_HEAD(&cc.freepages);
 		INIT_LIST_HEAD(&cc.migratepages);
 
--

From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

Unusable free space index is a measure of external fragmentation that
takes the allocation size into account. For the most part, the huge page
size will be the size of interest but not necessarily so it is exported
on a per-order and per-zone basis via /proc/unusable_index.

The index is a value between 0 and 1. It can be expressed as a
percentage by multiplying by 100 as documented in
Documentation/filesystems/proc.txt.
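
As a sketch of how such an index can be derived from per-order free block
counts (the same counts /proc/buddyinfo reports), the function below
computes the fraction of free memory held in blocks too small for the
requested order. The names, the limit of 11 orders and the handling of a
zone with no free memory are assumptions made for the example, not taken
from the patch.

```c
#define TOY_MAX_ORDER 11

/*
 * nr_free[o] is the number of free blocks of order o.
 * Returns the fraction of free pages that cannot satisfy an allocation
 * of the given order, as a value between 0 and 1.
 */
static double toy_unusable_index(unsigned int order,
				 const unsigned long nr_free[TOY_MAX_ORDER])
{
	unsigned long free_pages = 0, usable_pages = 0;
	unsigned int o;

	for (o = 0; o < TOY_MAX_ORDER; o++) {
		unsigned long pages = nr_free[o] << o;

		free_pages += pages;
		if (o >= order)
			usable_pages += pages;
	}

	/* Assumption: with no free memory, treat all of it as unusable. */
	if (free_pages == 0)
		return 1.0;

	return (double)(free_pages - usable_pages) / free_pages;
}
```

For a zone with 8 order-0, 4 order-1, 2 order-2 and 1 order-3 free blocks
(32 free pages in total), the index is 0 for order 0, 0.5 for order 2, 0.75
for order 3 and 1 for order 4 and above.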

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/filesystems/proc.txt |   13 ++++-
 mm/vmstat.c                        |  120 ++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 74d2605..e87775a 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -453,6 +453,7 @@ Table 1-5: Kernel info in /proc
  sys         See chapter 2                                     
  sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
  tty	     Info of tty drivers
+ unusable_index Additional page allocator information (see text)(2.5)
  uptime      System uptime                                     
  version     Kernel version                                    
  video	     bttv info of video resources			(2.4)
@@ -610,7 +611,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo.
+pagetypeinfo and unusable_index
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -651,6 +652,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to ...
From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:05 pm

On Fri,  2 Apr 2010 17:02:39 +0100

I'd suggest /proc/sys/vm/unusable_index.  I don't know how pagetypeinfo

That's going to hurt my brain.  Why didn't it report usable free blocks?

Also, the index is scaled by the actual amount of free memory in the
zones, yes?  So to work out how many order-N pages are available you
first need to know how many free pages there are?


All this code will be bloat for most people, I suspect.  Can we find a
suitable #ifdef wrapper to keep my cellphone happy?

--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 3:35 am

For the same reason buddyinfo did - no one complained. It keeps the

Let's say you are graphing the index on a given order over time. If there
are a large number of frees, there can be a large change in that value
but it does not necessarily tell you how much better or worse the system

It depends on what your question is. As I'm interested in fragmentation,
this value gives me information on that. Your question is about how many
pages of a given order can be allocated right now and that can be worked

It could. However, this information can also be created from buddyinfo and
I have a perl script that can be adapted to duplicate the output of this
proc file. As there isn't an in-kernel user of this information, it can
also be dropped.

Will I roll a patch that moves the proc entry and makes it a CONFIG option
or will I just remove the file altogether? If I remove it, I can adapt
the perl script and add to the other hugepage-related utilities in
libhugetlbfs.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Tuesday, April 13, 2010 - 5:42 am

==== CUT HERE ====
mm,compaction: Move unusable_index to debugfs

unusable_index can be worked out from userspace but for debugging and tuning
compaction, it'd be best for all users to have the same information. This
patch moves unusable_index to debugfs where it is both easier to configure
out and remove at some future date.

This is a fix to the patch "Export unusable free space index via
/proc/unusable_index"

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/filesystems/proc.txt |   13 +---
 mm/vmstat.c                        |  183 ++++++++++++++++++++----------------
 2 files changed, 105 insertions(+), 91 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e87775a..74d2605 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -453,7 +453,6 @@ Table 1-5: Kernel info in /proc
  sys         See chapter 2                                     
  sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
  tty	     Info of tty drivers
- unusable_index Additional page allocator information (see text)(2.5)
  uptime      System uptime                                     
  version     Kernel version                                    
  video	     bttv info of video resources			(2.4)
@@ -611,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo and unusable_index
+pagetypeinfo.
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -652,16 +651,6 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to achieve this.
 
-> cat /proc/unusable_index
-Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
-Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 ...
From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory. The main users of page migration such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration are
only beneficial on NUMA so it makes sense.

As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/Kconfig |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 9c61158..4fd75a0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -172,6 +172,16 @@ config SPLIT_PTLOCK_CPUS
 	default "4"
 
 #
+# support for memory compaction
+config COMPACTION
+	bool "Allow for memory compaction"
+	def_bool y
+	select MIGRATION
+	depends on EXPERIMENTAL && HUGETLBFS && MMU
+	help
+	  Allows the compaction of memory for the allocation of huge pages.
+
+#
 # support for page migration
 #
 config MIGRATION
@@ -180,9 +190,11 @@ config MIGRATION
 	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
 	help
 	  Allows the migration of the physical location of pages of processes
-	  while the virtual addresses are not changed. This is useful for
-	  example on NUMA systems to put pages nearer to the processors accessing
-	  the page.
+	  while the virtual addresses are not changed. This is useful in
+	  two situations. The first is on NUMA systems to put pages nearer
+	  to the processors accessing. The second is when allocating huge
+	  pages as migration can relocate pages to satisfy a huge page
+	  allocation instead of reclaiming.
 
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
-- 
1.6.5

--

From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:05 pm

On Fri,  2 Apr 2010 17:02:38 +0100

Seems strange to depend on hugetlbfs.  Perhaps depending on
HUGETLB_PAGE would be more logical.

But hang on.  I wanna use compaction to make my order-4 wireless skb
allocations work better!  Why do you hate me?


--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 3:22 am

Because I'm a bad person and I hate your hardware. However, because I'm
told being a bad person for the sake of it just isn't the right thing to
do, I'll expand the reasoning :).

For your specific example, the allocation is also depending on GFP_ATOMIC
which migration cannot handle today. Significant plumbing would be needed
there to make it work and I believe at the moment that atomic-safe compaction
would be a subset of full compaction. This is a "future" thing but I'd also
expect you and others to resist it on the grounds that depending on such
high-order atomics for the correct working of the hardware is just a bad plan.

That does not cover other high-order allocs though such as those required for
stacks or the ARM allocation of PGDs. These are below PAGE_ALLOC_COSTLY_ORDER
so compaction will not currently trigger.  Reviews commented that it would
be preferable to limit the orders compaction handles to start with. The
direction I'd like to continue with this in the future is to have something
like __zone_reclaim to handle clean page cache first and moving more towards
integrating lumpy reclaim and compaction. When this is done, the HUGETLB_PAGE
dependency would be removed and the smaller orders will also be compacted.

In the meantime, we continue to discourage high-order allocations and
compaction gets its initial trial run against huge pages.

==== CUT HERE ====
mm,compaction: Have CONFIG_COMPACTION depend on HUGETLB_PAGE instead of HUGETLBFS

There is a strong coupling between HUGETLB_PAGE and HUGETLBFS but in theory
there can be alternative interfaces to huge pages other than HUGETLBFS. This
patch makes CONFIG_COMPACTION depend on the right thing.

This is a fix to the patch "Allow CONFIG_MIGRATION to be set without
CONFIG_NUMA or memory hot-remove" and should be merged together.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 4fd75a0..a275a7d ...
--

From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

For clarity of review, KSM and page migration have separate refcounts on
the anon_vma. While clear, this is a waste of memory. This patch gets
KSM and page migration to share their toys in a spirit of harmony.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/rmap.h |   50 ++++++++++++++++++--------------------------------
 mm/ksm.c             |    4 ++--
 mm/migrate.c         |    4 ++--
 mm/rmap.c            |    6 ++----
 4 files changed, 24 insertions(+), 40 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 567d43f..7721674 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -26,11 +26,17 @@
  */
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
-#ifdef CONFIG_KSM
-	atomic_t ksm_refcount;
-#endif
-#ifdef CONFIG_MIGRATION
-	atomic_t migrate_refcount;
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+
+	/*
+	 * The external_refcount is taken by either KSM or page migration
+	 * to take a reference to an anon_vma when there is no
+	 * guarantee that the vma of page tables will exist for
+	 * the duration of the operation. A caller that takes
+	 * the reference is responsible for clearing up the
+	 * anon_vma if they are the last user on release
+	 */
+	atomic_t external_refcount;
 #endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
@@ -64,46 +70,26 @@ struct anon_vma_chain {
 };
 
 #ifdef CONFIG_MMU
-#ifdef CONFIG_KSM
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
-	atomic_set(&anon_vma->ksm_refcount, 0);
+	atomic_set(&anon_vma->external_refcount, 0);
 }
 
-static inline int ...
--

From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:05 pm

On Fri,  2 Apr 2010 17:02:37 +0100



What a mouthful.  Can we do s/external_//g?
--

From: Rik van Riel
Date: Tuesday, April 6, 2010 - 5:10 pm

For the function, sure.

However, I believe it would be good to keep the variable
inside the anon_vma as "external_refcount", because the
VMAs attached to the anon_vma take a reference by being
on the list (and leave the refcount alone).
--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 3:01 am

hah indeed. There is a very strong case for merging patch 1 and 3 into
the same patch. They were kept separate because the combined patch was
going to be tricky to review. The expansion of the comment in patch 3

Would you like to make patch 3 patch 2 instead and then merge them when
going upstream?

As it is you are right in that there could be a bug if just 1 was merged
but not 3 because both refcounts are not taken. I could fix up patch 1

We could, but it would be misleading.

anon_vma has an explicit and implicit refcount. The implicit reference
is a VMA being on the anon_vma list. The explicit count is
external_refcount. Just "refcount" implies that it is properly reference
counted which is not the case. Someone looking at memory.c might
conclude that there is a refcounting bug because just the list is
checked.

Now, the right thing to do here is to get rid of implicit reference
counting. Peter Zijlstra has posted an RFC patch series on mm preempt
and the first two patches of that cover using proper reference counting.
When/if that gets merged, a rename from external_refcount to refcount
would be appropriate.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
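
The distinction between the implicit reference (a VMA being on the
anon_vma list) and the explicit external_refcount can be illustrated with
a small userspace model. This is only a sketch under stated assumptions:
struct toy_anon_vma and the toy_* helpers are hypothetical names, the VMA
list is reduced to a counter, and plain ints stand in for atomic_t, so
none of this is the kernel's actual code.

```c
#include <stdbool.h>

/* Toy stand-in for struct anon_vma; NOT the kernel definition. */
struct toy_anon_vma {
	int nr_vmas;           /* implicit references: VMAs on the list */
	int external_refcount; /* explicit references: KSM or migration */
};

/* An external user (KSM or migration in the thread) pins the structure. */
void toy_get_external(struct toy_anon_vma *av)
{
	av->external_refcount++;
}

/*
 * Drop an external reference. Returns true when no reference of either
 * kind remains, i.e. the caller is the last user and must clean up.
 */
bool toy_put_external(struct toy_anon_vma *av)
{
	av->external_refcount--;
	return av->external_refcount == 0 && av->nr_vmas == 0;
}

/*
 * A VMA leaves the list. Checking only the list here would free the
 * structure under an external user, so both counts are consulted.
 */
bool toy_unlink_vma(struct toy_anon_vma *av)
{
	av->nr_vmas--;
	return av->nr_vmas == 0 && av->external_refcount == 0;
}
```

In this model a structure with one VMA and one external user is not
freeable when the VMA goes away; it only becomes freeable once the
external user drops its reference as well.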

From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

This patch adds a per-node sysfs file called compact. When the file is
written to, each zone in that node is compacted. The intention is that this
would be used by something like a job scheduler in a batch system before
a job starts so that the job can allocate the maximum number of
hugepages without significant start-up cost.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/ABI/testing/sysfs-devices-node |    7 +++++++
 drivers/base/node.c                          |    3 +++
 include/linux/compaction.h                   |   16 ++++++++++++++++
 mm/compaction.c                              |   23 +++++++++++++++++++++++
 4 files changed, 49 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node

diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
new file mode 100644
index 0000000..453a210
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-node
@@ -0,0 +1,7 @@
+What:		/sys/devices/system/node/nodeX/compact
+Date:		February 2010
+Contact:	Mel Gorman <mel@csn.ul.ie>
+Description:
+		When this file is written to, all memory within that node
+		will be compacted. When it completes, memory will be freed
+		into blocks which have as many contiguous pages as possible
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 93b3ac6..07cdcc6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -15,6 +15,7 @@
 #include <linux/cpu.h>
 #include <linux/device.h>
 #include <linux/swap.h>
+#include <linux/compaction.h>
 
 static struct sysdev_class_attribute *node_state_attrs[];
 
@@ -245,6 +246,8 @@ int register_node(struct node *node, int num, struct node *parent)
 ...
--

From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:05 pm

On Fri,  2 Apr 2010 17:02:44 +0100

Would it make more sense if this was a per-memcg thing rather than a
per-node thing?

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 6, 2010 - 5:31 pm

On Tue, 6 Apr 2010 17:05:59 -0700

memcg doesn't have any relationship with placement of memory (now).
It just controls the amount of memory.
So, memcg has no relationship with compaction.

A cgroup which controls placement of memory is cpuset.
One idea is per cpuset. But per-node seems ok.

Thanks,
-Kame

--

From: Andrew Morton
Date: Tuesday, April 6, 2010 - 2:56 pm

Which is superior?

Which maps best onto the way systems are used (and onto ways in which
we _intend_ that systems be used)?

Is the physical node really the best unit-of-administration?  And is
direct access to physical nodes the best means by which admins will
manage things?
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 6, 2010 - 6:19 pm

On Tue, 6 Apr 2010 17:56:01 -0400

Nodes have a hugepage interface now.

[root@bluextal qemu-kvm-0.12.3]# ls /sys/devices/system/node/node0/hugepages/
hugepages-2048kB


These days, we tend to use a "setup tool" (such as libcgroup) for using
cpuset, etc.

Considering control by userland support software, I think per-node is not bad.
And per-cpuset requires users to mount cpuset.
(Currently, most of my customers don't use cpuset.)


Thanks,
-Kame


--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 8:42 am

Kamezawa Hiroyuki covered this perfectly. memcg doesn't care and while
cpuset might, there are a lot more people working with nodes than there
are with cpuset.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
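
For the job-scheduler use case described in the patch, a userspace helper
might look roughly like the following. This is a sketch only: the helper
names are hypothetical, and writing "1" is an assumption of convention,
since the ABI text only says that the file is written to. The path is the
one given in the sysfs-devices-node document above.

```c
#include <stdio.h>
#include <stddef.h>

/* Build the per-node path from the ABI document; 0 on success, -1 if
 * the buffer is too small. */
int node_compact_path(char *buf, size_t len, int node)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/node/node%d/compact", node);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Write to the given file to request compaction; 0 on success, -1 on
 * any failure (file missing, no permission, write error). */
int trigger_compaction(const char *path)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fputs("1\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A scheduler prologue would call node_compact_path() for each node the job
is going to run on and pass the result to trigger_compaction() before the
job starts allocating huge pages.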

From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

Ordinarily when a high-order allocation fails, direct reclaim is entered to
free pages to satisfy the allocation.  With this patch, it is determined if
an allocation failed due to external fragmentation instead of low memory
and if so, the calling process will compact until a suitable page is
freed. Compaction by moving pages in memory is considerably cheaper than
paging out to disk and works where there are locked pages or no swap. If
compaction fails to free a page of a suitable size, then reclaim will
still occur.

Direct compaction returns as soon as possible. As each block is compacted,
it is checked if a suitable page has been freed and if so, it returns.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |   20 ++++++--
 include/linux/vmstat.h     |    1 +
 mm/compaction.c            |  117 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   31 ++++++++++++
 mm/vmstat.c                |   15 +++++-
 5 files changed, 178 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index c4ab05f..faa3faf 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,15 +1,27 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
-/* Return values for compact_zone() */
-#define COMPACT_INCOMPLETE	0
-#define COMPACT_PARTIAL		1
-#define COMPACT_COMPLETE	2
+/* Return values for compact_zone() and try_to_compact_pages() */
+#define COMPACT_SKIPPED		0
+#define COMPACT_INCOMPLETE	1
+#define COMPACT_PARTIAL		2
+#define COMPACT_COMPLETE	3
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist ...
--

From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:06 pm

On Fri,  2 Apr 2010 17:02:45 +0100


So someone else can get in and steal it.  How is that resolved?

Please expound upon the relationship between the icky pageblock_order
and the caller's desired allocation order here.  The compaction design
seems fairly fixated upon pageblock_order - what happens if the caller
wanted something larger than pageblock_order?  The
less-than-pageblock_order case seems pretty obvious, although perhaps


Was that a correct decision?  If we perform compaction when smaller
allocation attempts fail, will the kernel get better, or worse?

And how do we save my order-4-allocating wireless driver?  That would

Would be nice to add some comments explaining this a bit more. 

ooh, so that starts to explain split_free_page().  But
split_free_page() didn't do the 2UL thing.


Why are we doing all this handwavy stuff?  Why not just try a
compaction run and see if it worked?  That would be more


--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 9:06 am

Well, yes or there wouldn't be a marked reduction in the latency to allocate
a huge page as linked to in the leader and the difference in allocation

It isn't, lumpy reclaim has a similar problem. They could be captured
of course but so far stealing has only been a problem when under very

Compaction works on the same units as anti-fragmentation does - the
pageblock_order. It could work on units smaller than that when selecting
pages to migrate from and to, but there would be little advantage for
some additional complexity.

The caller's desired allocation order determines if compaction has

Then it would get tricky. Selecting for migration stays simple but there would
be additional complexity in finding 2 or more adjacent naturally-aligned
MIGRATE_MOVABLE blocks to migrate to. As pageblock_order is related to the
default huge page size, I'd wonder what caller would be routinely allocating

compact_finished() could be called more regularly but the waste is minimal. At
worst, a few more pages get migrated that weren't necessary for the caller
to successfully allocate. This is not massively dissimilar to how direct

I think better but there are concerns about LRU churn and it might encourage
increased use of high-order allocations. The desire is to try compaction out

Ultimately, it could perform a subset of compaction that doesn't go to

Compaction doesn't, but migration can and you don't know in advance if
it will need to or not. Migration would itself need to take a GFP mask
of what was and wasn't allowed during the course of migration but these
checks to be moved.


No, but split_free_page() knows exactly how much it is removing at that
time. At this point, there is a worst-case expectation that the pages being
migrated from and to are both isolated. At no point should they be all

It won't deadlock, this is a heuristic only that guesses whether compaction
is likely to succeed or not. The watermarks are rechecked every time pages

Because if that index is not ...
--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 11:29 am

This patch goes on top of the series. It looks big but it's mainly
moving code.

==== CUT HERE ====
mm,compaction: Do not display compaction-related stats when !CONFIG_COMPACTION

Although compaction can be disabled from .config, the vmstat entries
still exist. This patch removes the vmstat entries. As page_alloc.c
refers directly to the counters, the patch introduces
__alloc_pages_direct_compact() to isolate use of the counters.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/vmstat.h |    2 +
 mm/page_alloc.c        |   92 ++++++++++++++++++++++++++++++++---------------
 mm/vmstat.c            |    2 +
 3 files changed, 66 insertions(+), 30 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index b4b4d34..7f43ccd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,8 +43,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_COMPACTION
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46f6be4..514cc96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1756,6 +1756,59 @@ out:
 	return page;
 }
 
+#ifdef CONFIG_COMPACTION
+/* Try memory compaction for high-order allocations before reclaim */
+static struct page *
+__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	int migratetype, unsigned long *did_some_progress)
+{
+	struct page *page;
+
+	if (!order)
+		return NULL;
+
+	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
+								nodemask);
+	if (*did_some_progress != COMPACT_SKIPPED) ...
--

From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

PageAnon pages that are unmapped may or may not have an anon_vma so are
not currently migrated. However, a swap cache page can be migrated and
fits this description. This patch identifies swap cache pages and allows
them to be migrated but ensures that no attempt is made to remap the pages,
as that would potentially try to access an already freed anon_vma.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c |   47 ++++++++++++++++++++++++++++++-----------------
 1 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 35aad2a..0356e64 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
  *   < 0 - error code
  *  == 0 - success
  */
-static int move_to_new_page(struct page *newpage, struct page *page)
+static int move_to_new_page(struct page *newpage, struct page *page,
+						int remap_swapcache)
 {
 	struct address_space *mapping;
 	int rc;
@@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
 	else
 		rc = fallback_migrate_page(mapping, newpage, page);
 
-	if (!rc)
-		remove_migration_ptes(page, newpage);
-	else
+	if (rc) {
 		newpage->mapping = NULL;
+	} else {
+		if (remap_swapcache) 
+			remove_migration_ptes(page, newpage);
+	}
 
 	unlock_page(newpage);
 
@@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
+	int remap_swapcache = 1;
 	int rcu_locked = 0;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
@@ -600,18 +604,27 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		rcu_read_lock();
 		rcu_locked = 1;
 
-		/*
-		 * If the page has no mappings any more, just bail. An
-		 * unmapped anon page is likely to be freed soon but worse,
-		 * it's possible its anon_vma disappeared between when
-		 * the page was isolated ...
--

From: KAMEZAWA Hiroyuki
Date: Monday, April 5, 2010 - 11:54 pm

On Fri,  2 Apr 2010 17:02:48 +0100

Seems nice to me.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--

From: Minchan Kim
Date: Tuesday, April 6, 2010 - 8:37 am

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Thanks for your effort, Mel.

-- 
Kind regards,
Minchan Kim
--

From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:06 pm

On Fri,  2 Apr 2010 17:02:48 +0100



--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 9:49 am

This function existed before compaction and returns an error code rather


A patch that updates the comment, if you prefer it, is as follows

==== CUT HERE ====
mm,compaction: Expand comment on unmapped page swap cache

The comment on the handling of anon_vma for unmapped pages is a bit
sparse. Expand it.

This is a fix to the patch "mm,migration: Allow the migration of
PageSwapCache pages"

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0356e64..281a239 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -611,9 +611,15 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 
 			/*
 			 * We cannot be sure that the anon_vma of an unmapped
-			 * swapcache page is safe to use. In this case, the
-			 * swapcache page gets migrated but the pages are not
-			 * remapped
+			 * swapcache page is safe to use because we don't
+			 * know in advance if the VMA that this page belonged
+			 * to still exists. If the VMA and others sharing the
+			 * data have been freed, then the anon_vma could
+			 * already be invalid.
+			 *
+			 * To avoid this possibility, swapcache pages get
+			 * migrated but are not remapped when migration
+			 * completes
 			 */
 			remap_swapcache = 0;
 		} else { 
--
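
The handling of anonymous pages described in this patch and its expanded
comment boils down to a three-way decision. The following is a minimal
sketch of that decision as a pure function, assuming only what the patch
description states; the enum and function names are hypothetical and the
real unmap_and_move() does considerably more (locking, RCU, reference
counting).

```c
#include <stdbool.h>

enum toy_anon_action {
	TOY_SKIP,              /* unmapped, not swap cache: do not migrate */
	TOY_MIGRATE_NO_REMAP,  /* unmapped swap cache: migrate, do not remap */
	TOY_MIGRATE_AND_REMAP, /* mapped: migrate and restore the mappings */
};

/*
 * Decide what to do with an anonymous page. An unmapped page gives no
 * guarantee that its anon_vma is still valid, so the anon_vma is only
 * relied upon (for remapping) when the page is mapped.
 */
enum toy_anon_action toy_classify_anon(bool mapped, bool swapcache)
{
	if (mapped)
		return TOY_MIGRATE_AND_REMAP;

	return swapcache ? TOY_MIGRATE_NO_REMAP : TOY_SKIP;
}
```

The TOY_MIGRATE_NO_REMAP case corresponds to remap_swapcache being set to
0 in the diff above, which makes move_to_new_page() skip
remove_migration_ptes().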

From: Mel Gorman
Date: Friday, April 2, 2010 - 9:02 am

The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail. There are two obvious reasons why:

  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone for a period of time. The zone that
is deferred is the first zone in the zonelist - i.e. the preferred zone.
To defer compaction in the other zones, the information would need to be
stored in the zonelist or implemented similar to the zonelist_cache.
This would impact the fast-paths and is not justified at this time.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |   35 +++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h     |    7 +++++++
 mm/page_alloc.c            |    5 ++++-
 3 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index ae98afc..2a02719 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -18,6 +18,32 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* defer_compaction - Do not compact within a zone until a given time */
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+	/*
+	 * This function is called when compaction fails to result in a page
+	 * allocation success. This is somewhat unsatisfactory as the failure
+	 * to compact has nothing to do with time and everything to do with
+	 * the requested order, the number of free pages and watermarks. How
+	 * to wait on that is more unclear, but the answer ...
--

From: Andrew Morton
Date: Tuesday, April 6, 2010 - 5:06 pm

On Fri,  2 Apr 2010 17:02:47 +0100


c'mon, let's not make this rod for our backs.

The "A suitable page may exist but watermarks are not met" case can be
addressed by testing the watermarks up-front, surely?

I bet the "Page migration cannot move all pages so fragmentation
remains" case can be addressed by setting some metric in the zone, and
suitably modifying that as a result of ongoing activity.  To tell the
zone "hey, compaction might be worth trying now".  That sucks too, but not
so much.

Or something.  Putting a wallclock-based throttle on it like this
really does reduce the usefulness of the whole feature.

Internet: "My application works OK on a hard disk but fails when I use an SSD!". 

akpm: "Tell Mel!"

--

From: Andrea Arcangeli
Date: Tuesday, April 6, 2010 - 5:55 pm

Actually I skipped this one in the unified tree (I'm running both
patchsets at the same time as I write this and I should have tweaked
it so that the defrag sysfs control in transparent hugepage turns
memory compaction on and off, plus I embedded the
set_recommended_min_free_kbytes() code inside huge_memory.c
initialization). I merged the whole V7 except the above. It also
didn't pass my threshold, also because this only checks 1 jiffy that
is random and too short to matter.
--

From: Mel Gorman
Date: Wednesday, April 7, 2010 - 9:32 am

Nope, because the number of pages free at each order changes before and
after compaction and you don't know by how much in advance. It wouldn't
be appropriate to assume perfect compaction because unmovable and

When it gets down to it, this patch was about paranoia. If the
heuristics on compaction-avoidance didn't work out, I didn't want
compaction to keep pounding.

That said, this patch would also hide the bug report telling us this happened
and was a mistake. A bug report detailing high oprofile usage in compaction
will be much easier to come across than a report on defer_compaction()
being called too often.


Mel is in and he is listening.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
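
The time-based deferral being debated here can be modelled in a few
lines. This is a hedged sketch, not the patch's code: struct toy_zone, its
compact_resume field and the toy_* functions are hypothetical names, and
the current time is passed in explicitly instead of reading jiffies. The
comparison is written to survive counter wrap-around, in the style of the
kernel's time_before().

```c
#include <stdbool.h>

/* Toy stand-in for the per-zone state; NOT struct zone. */
struct toy_zone {
	unsigned long compact_resume; /* tick at which compaction may resume */
};

/* Wrap-safe "a is earlier than b" for free-running tick counters. */
bool toy_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Record that compaction should not be retried before `resume`. */
void toy_defer_compaction(struct toy_zone *zone, unsigned long resume)
{
	zone->compact_resume = resume;
}

/* True while the zone is still inside its back-off window. */
bool toy_compaction_deferred(const struct toy_zone *zone, unsigned long now)
{
	return toy_time_before(now, zone->compact_resume);
}
```

With a deferral of a single tick, as mentioned in the review above, the
window closes almost immediately, which is one reason the wallclock
approach was questioned.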


The kernel applies some heuristics when deciding if memory should be
compacted or reclaimed to satisfy a high-order allocation. One of these
is based on the fragmentation index. If the index is below 500, memory will
not be compacted. This choice is arbitrary and not based on data. To
help optimise the system and set a sensible default for this value, this
patch adds a sysctl extfrag_threshold. The kernel will only compact
memory if the fragmentation index is above the extfrag_threshold.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/sysctl/vm.txt |   18 ++++++++++++++++--
 include/linux/compaction.h  |    3 +++
 kernel/sysctl.c             |   15 +++++++++++++++
 mm/compaction.c             |   12 +++++++++++-
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 803c018..878b1b4 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_ratio
 - dirty_writeback_centisecs
 - drop_caches
+- extfrag_threshold
 - hugepages_treat_as_movable
 - hugetlb_shm_group
 - laptop_mode
@@ -131,8 +132,7 @@ out to disk.  This tunable expresses the interval between those wakeups, in
 
 Setting this to zero disables periodic writeback altogether.
 
-==============================================================
-
+============================================================== 
 drop_caches
 
 Writing to this will cause the kernel to drop clean caches, dentries and
@@ -150,6 +150,20 @@ user should run `sync' first.
 
 ==============================================================
 
+extfrag_threshold
+
+This parameter affects whether the kernel will compact memory or direct
+reclaim to satisfy a high-order allocation. /proc/extfrag_index shows what
+the fragmentation index for each order is in each zone in the system. Values
+tending towards 0 imply allocations would fail due to lack of memory,
+values towards ...

On Fri,  2 Apr 2010 17:02:46 +0100

Was this the most robust, reliable, no-2am-phone-calls thing we could
have done?

What about, say, just doing a bit of both until something worked?  For
extra smarts we could remember what worked best last time, and make
ourselves more likely to try that next time.

Or whatever, but extfrag_threshold must die!  And replacing it with a
hardwired constant doesn't count ;)

--


I guess you could but that is not a million miles away from what
currently happens.

This heuristic is basically "based on free memory layout, how likely is
compaction to succeed?". It makes a decision based on that. A later
patch then checks if the guess was right. If not, just try direct

With the later patch, this is essentially what we do. Granted we
remember the opposite "If the kernel guesses wrong, then don't compact

I think what you have in mind is "just try compaction every time" but my
concern about that is we'll hit a corner case where a lot of CPU time is
taken scanning zones uselessly. That is what this heuristic and the
back-off logic in a later patch was meant to avoid. I haven't thought of
a better alternative :/

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
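
The heuristic under discussion can be sketched as a userspace
calculation. Everything here is an assumption made for illustration: the
struct and function names are hypothetical, and the formula is one way to
produce an index scaled to 0..1000 where low values suggest a lack of
memory and high values suggest fragmentation; it is not taken from the
patch text quoted in this thread. The threshold comparison follows the
patch description (compact only when the index is above
extfrag_threshold).

```c
#include <stdbool.h>

/* Hypothetical summary of a zone's free lists for one allocation order. */
struct toy_free_info {
	unsigned long free_pages;           /* total free pages */
	unsigned long free_blocks_total;    /* free blocks of any order */
	unsigned long free_blocks_suitable; /* free blocks large enough */
};

/*
 * Returns -1000 when a suitable block already exists (the index is only
 * meaningful when the request would fail), 0 when nothing is free, and
 * otherwise a value that grows as free memory is split into more and
 * smaller blocks.
 */
int toy_fragmentation_index(unsigned int order,
			    const struct toy_free_info *info)
{
	long long requested = 1LL << order;
	long long scaled;

	if (info->free_blocks_total == 0)
		return 0;
	if (info->free_blocks_suitable != 0)
		return -1000;

	scaled = (long long)info->free_pages * 1000 / requested;
	return (int)(1000 - (1000 + scaled) /
			    (long long)info->free_blocks_total);
}

/* Compact only if the index is above the tunable threshold. */
bool toy_should_compact(int index, int extfrag_threshold)
{
	return index > extfrag_threshold;
}
```

For an order-4 request, 64 free pages held as 64 single-page blocks give
an index of 922 in this model, so compaction is tried with a threshold of
500; 12 free pages held in two blocks give 125, so the model falls back
to reclaim.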

From: Tarkan Erimer
Date: Tuesday, April 6, 2010 - 7:47 am

Hi Mel,


Which kernel version or versions do these patches apply to?
I tried 2.6.33.2 and 2.6.34-rc3 without success.

Tarkan
--

From: Mel Gorman
Date: Tuesday, April 6, 2010 - 8:00 am

It's based on Andrew's tree mmotm-2010-03-24-14-48.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Tarkan Erimer
Date: Tuesday, April 6, 2010 - 8:03 am

OK. Thanks for the reply. 

Tarkan
--
