Re: [PATCH] fix count_vm_event preempt in memory compaction direct reclaim

Previous thread: [PATCH 02/14] mm,migration: Share the anon_vma ref counts between KSM and page migration by Mel Gorman on Tuesday, April 20, 2010 - 2:01 pm. (1 message)

Next thread: Re: Should calculation of vm.overcommit_ratio be changed? by Dirk Geschke on Tuesday, April 20, 2010 - 1:51 pm. (1 message)
From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

This series is based on the patches in mmotm-2010-04-15-14-42.patch but
rebased against 2.6.34-rc4 and collapsed. The cumulative fixes were getting
tricky to follow and this series should give a clearer view of what is
going on. It should be possible to drop all the compaction-related patches
in mmotm and drop this series directly in its place.

Changelog since mmotm
  o Have defer_compaction use an exponential backoff instead of time
  o Rename COMPACT_INCOMPLETE and document the return values

Changelog since V7 (these are the changes made while in mmotm)
  o Have compaction depend on CONFIG_HUGETLB_PAGE instead of CONFIG_HUGETLBFS
  o Move extfrag_index and unusable_index to debugfs
  o Many minor cleanups in the core patch
  o Drop unsafe optimisation on PageBuddy as zone->lock is not held
  o Call kernel_map_pages after split_free_page
  o Update documentation on compact_memory proc entry
  o Initialise cc->zone properly when compacting all of memory
  o Do not display vmstats related to compaction when disabled

Changelog since V6
  o Avoid accessing anon_vma when migrating unmapped PageSwapCache pages

Changelog since V5
  o Rebase to mmotm-2010-03-24-14-48
  o Add more reviewed-by's
  o Correct one spelling in vmstat.c and some leader clarifications
  o Split the LRU isolation modes into a separate path
  o Correct a NID change
  o Call migrate_prep less frequently
  o Remove unnecessary inlining
  o Do not interfere with memory hot-remove
  o Do not compact for orders <= PAGE_ALLOC_COSTLY_ORDER
  o page_mapped instead of page_mapcount and allow swapcache to migrate
  o Avoid too many pages being isolated for migration
  o Handle PageSwapCache pages during migration

Changelog since V4
  o Remove unnecessary check for PageLRU and PageUnevictable
  o Fix isolated accounting
  o Close race window between page_mapcount and rcu_read_lock
  o Added a lot more Reviewed-by tags

Changelog since V3
  o Document sysfs entries (subseqently, merged independently)
  o ...
From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

Currently, vmscan.c defines the isolation modes for __isolate_lru_page().
Memory compaction needs access to these modes for isolating pages for
migration.  This patch exports them.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/swap.h |    5 +++++
 mm/vmscan.c          |    5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 94ec325..32af03c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
 	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
+#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1070f83..8bdd85c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -839,11 +839,6 @@ keep:
 	return nr_reclaimed;
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /*
  * Attempt to remove the specified page from its LRU.  Only take this page
  * if it is of the appropriate PageActive status.  Pages which are being
-- 
1.6.5

--

From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail. There are two obvious reasons as to why

  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone (1 << compact_defer_shift) times.
If the next compaction attempt also fails, compact_defer_shift is
increased up to a maximum of 6. If compaction succeeds, the defer
counters are reset again.

The zone that is deferred is the first zone in the zonelist - i.e. the
preferred zone.  To defer compaction in the other zones, the information
would need to be stored in the zonelist or implemented similar to the
zonelist_cache.  This would impact the fast-paths and is not justified at
this time.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/compaction.h |   39 +++++++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h     |    9 +++++++++
 mm/page_alloc.c            |    5 ++++-
 3 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 3719325..5ac5155 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,6 +22,36 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* Do not skip compaction more than 64 times */
+#define COMPACT_MAX_DEFER_SHIFT 6
+
+/*
+ * Compaction is deferred when compaction fails to result in a page
+ * allocation success. 1 << compact_defer_limit compactions are skipped up
+ * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
+ */
+static inline void defer_compaction(struct zone ...

The kernel applies some heuristics when deciding if memory should be
compacted or reclaimed to satisfy a high-order allocation.  One of these
is based on the fragmentation.  If the index is below 500, memory will not
be compacted.  This choice is arbitrary and not based on data.  To help
optimise the system and set a sensible default for this value, this patch
adds a sysctl extfrag_threshold.  The kernel will only compact memory if
the fragmentation index is above the extfrag_threshold.

[randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/sysctl/vm.txt |   16 +++++++++++++++-
 include/linux/compaction.h  |    3 +++
 kernel/sysctl.c             |   15 +++++++++++++++
 mm/compaction.c             |   12 +++++++++++-
 4 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 3b3fa1b..6274970 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_ratio
 - dirty_writeback_centisecs
 - drop_caches
+- extfrag_threshold
 - hugepages_treat_as_movable
 - hugetlb_shm_group
 - laptop_mode
@@ -149,6 +149,20 @@ user should run `sync' first.
 
 ==============================================================
 
+extfrag_threshold
+
+This parameter affects whether the kernel will compact memory or direct
+reclaim to satisfy a high-order allocation. /proc/extfrag_index shows what
+the fragmentation index for each order is in each zone in the system. Values
+tending towards 0 imply allocations would fail due to lack of memory,
+values towards 1000 imply failures are due to fragmentation and -1 implies
+that the allocation will succeed as long as watermarks are met.
+
+The kernel will not compact memory in a zone if the
+fragmentation index is <= extfrag_threshold. The default value is ...
From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.

This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated.  This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/rmap.h |   23 +++++++++++++++++++++++
 mm/migrate.c         |   12 ++++++++++++
 mm/rmap.c            |   10 +++++-----
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..567d43f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -29,6 +29,9 @@ struct anon_vma {
 #ifdef CONFIG_KSM
 	atomic_t ksm_refcount;
 #endif
+#ifdef CONFIG_MIGRATION
+	atomic_t migrate_refcount;
+#endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -81,6 +84,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
 	return 0;
 }
 #endif /* CONFIG_KSM */
+#ifdef CONFIG_MIGRATION
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+	atomic_set(&anon_vma->migrate_refcount, 0);
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return atomic_read(&anon_vma->migrate_refcount);
+}
+#else
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return 0;
+}
+#endif /* CONFIG_MIGRATE */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 5938db5..b768a1d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -543,6 +543,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rcu_locked = 0;
 	int charge = ...
From: KAMEZAWA Hiroyuki
Date: Tuesday, April 20, 2010 - 7:49 pm

On Tue, 20 Apr 2010 22:01:03 +0100

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


Hmm. Can't we move this part to mm/rmap.c ?

	as
		anon_vma_drop_external_reference()
or some.

Thanks,
-Kame

--

From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

Ordinarily when a high-order allocation fails, direct reclaim is entered
to free pages to satisfy the allocation.  With this patch, it is
determined if an allocation failed due to external fragmentation instead
of low memory and if so, the calling process will compact until a suitable
page is freed.  Compaction by moving pages in memory is considerably
cheaper than paging out to disk and works where there are locked pages or
no swap.  If compaction fails to free a page of a suitable size, then
reclaim will still occur.

Direct compaction returns as soon as possible.  As each block is
compacted, it is checked if a suitable page has been freed and if so, it
returns.

[akpm@linux-foundation.org: Fix build errors]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |   24 ++++++++--
 include/linux/vmstat.h     |    1 +
 mm/compaction.c            |  117 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   63 +++++++++++++++++++++++
 mm/vmstat.c                |   16 +++++-
 5 files changed, 215 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index ba98cfe..eed40ec 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,15 +1,31 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
-/* Return values for compact_zone() */
-#define COMPACT_CONTINUE	0
-#define COMPACT_PARTIAL		1
-#define COMPACT_COMPLETE	2
+/* Return values for compact_zone() and try_to_compact_pages() */
+/* compaction didn't start as it was not possible or direct reclaim was more suitable */
+#define COMPACT_SKIPPED		0
+/* compaction should continue to another pageblock */
+#define COMPACT_CONTINUE	1
+/* direct compaction partially compacted a zone and there are suitable pages */
+#define COMPACT_PARTIAL		2
+/* The full zone was compacted */
+#define COMPACT_COMPLETE	3
 
 #ifdef ...
From: Andrea Arcangeli
Date: Wednesday, May 5, 2010 - 5:19 am

==
From: Andrea Arcangeli <aarcange@redhat.com>

Preempt is enabled so it must use count_vm_event.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1768,7 +1768,7 @@ __alloc_pages_direct_compact(gfp_t gfp_m
 				alloc_flags, preferred_zone,
 				migratetype);
 		if (page) {
-			__count_vm_event(COMPACTSUCCESS);
+			count_vm_event(COMPACTSUCCESS);
 			return page;
 		}
 
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 5:51 am

Reviewed-by: Mel Gorman <mel@csn.ul.ie>

Andrew, this is a fix to the patch
mmcompaction-direct-compact-when-a-high-order-allocation-fails.patch


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Wednesday, May 5, 2010 - 6:11 am

You're welcome.

I updated current aa.git origin/master and origin/anon_vma_chain
branches (post THP-23*).

There's also another patch I've in my tree that you didn't picked up
and I wonder what's the issue here. This less a bugfix because it
seems to only affect lockdep, I don't know why lockdep forbids to call
migrate_prep with any lock held (in this case the mmap_sem). migrate.c
is careful to comply with it, compaction.c isn't. It's not mandatory
to succeed for compaction, so in doubt I just commented it out. It'll
also decrease the IPI load so I wasn't very concerned to re-enable it.

-----
Subject: disable migrate_prep()

From: Andrea Arcangeli <aarcange@redhat.com>

I get trouble from lockdep if I leave it enabled:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.34-rc3 #50
-------------------------------------------------------
largepages/4965 is trying to acquire lock:
 (events){+.+.+.}, at: [<ffffffff8105b788>] flush_work+0x38/0x130

 but task is already holding lock:
  (&mm->mmap_sem){++++++}, at: [<ffffffff8141b022>] do_page_fault+0xd2/0x430


flush_work apparently wants to run free from lock and it bugs in:

	lock_map_acquire(&cwq->wq->lockdep_map);

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -383,7 +383,9 @@ static int compact_zone(struct zone *zon
 	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
 	cc->free_pfn &= ~(pageblock_nr_pages-1);
 
+#if 0
 	migrate_prep();
+#endif
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
 		unsigned long nr_migrate, nr_remaining;
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 6:55 am

Simple, I didn't spot it. If you pointed it out to me, I didn't take

I haven't seen this problem. The testing I'd have been doing with compaction

It's not mandatory but the LRU lists should be drained so they can be properly
isolated. It'd make a slight difference to success rates as there will be

While true, is compaction density that high under normal workloads? I guess
it would be if a scanner was constantly trying to promote pages.  If the
IPI load is out of hand, I'm ok with disabling in some cases. For example,
I'd be ok with it being skipped if it was part of a daemon doing speculative
promotion but I'd prefer it to still be used if the static hugetlbfs pool

Hmm, I'm not seeing where in the fault path flush_work is getting called
from. Can you point it out to me please?

We already do some IPI work in the page allocator although it happens after
direct reclaim and only for high-order pages. What happens there and what
happens in migrate_prep are very similar so if there was a problem with IPI

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Wednesday, May 5, 2010 - 7:48 am

That explains it! But anything can call alloc_pages(order>0) with some

Yes success rate will be slightly worse but this also applies to all
regular vmscan paths that don't send IPI but they only flush the local
queue with lru_add_drain, simply pages won't be freed until there will
be some other cpu holding the refcount on them, it is not specific to
compaction.c but it applies to vmscan.c and vmscan likely not wanting
to send an IPI flood because it could too if it wanted.

But I guess I should at least use lru_add_drain() in replacement of



Where? I never triggered other issues in the page allocator with
lockdep, just this one pops up.
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 8:14 am

Or a migrate_prep_local? migrate_prep() was there in case there was extensive
work that needed to be done. At least if the two versions were beside each
other, it would be a bit clearer if migrate_prep was ever modified. It'd

Actually, it's not an IPI in this case is it?  As it's schedule_on_each_cpu,
it should be added to a workqueue and executed by keventd at some point in the


Ah, it's the difference between schedule_on_each_cpu that migrate_prep does
and on_each_cpu that the page allocator uses. That's why I haven't seen
it before.

How about the following as an alternative to dropp migrate_prep?

==== CUT HERE ====
mm,compaction: Do not schedule work on other CPUs for compaction

Migration normally requires a call to migrate_prep() as a preparation
step. This schedules work on all CPUs for pagevecs to be drained. This
makes sense for move_pages and memory hot-remove but is unnecessary
for memory compaction.

To avoid queueing work on multiple CPUs, this patch introduces
migrate_prep_local() which drains just local pagevecs.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/migrate.h |    2 ++
 mm/compaction.c         |    2 +-
 mm/migrate.c            |   11 ++++++++++-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 05d2292..6dec3ef 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -19,6 +19,7 @@ extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
 
 extern int migrate_prep(void);
+extern int migrate_prep_local(void);
 extern int migrate_vmas(struct mm_struct *mm,
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
@@ -32,6 +33,7 @@ static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, int offlining) { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
+static inline int migrate_prep_local(void) { return -ENOSYS; }
 
 ...
From: Andrea Arcangeli
Date: Wednesday, May 5, 2010 - 8:25 am

Yep this is what I'd like too... btw in the comments you also mention
IPI but I guess that's ok. About the cost I'm not sure but I would
expect the cost of this to be even higher because it also has to run
the scheduler unlike a real IPI.
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 8:32 am

Fair point. I wouldn't be too sure what the relative costs are but either
way it's bad and unnecessary in the case of compaction.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

Add a per-node sysfs file called compact.  When the file is written to,
each zone in that node is compacted.  The intention that this would be
used by something like a job scheduler in a batch system before a job
starts so that the job can allocate the maximum number of hugepages
without significant start-up cost.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/ABI/testing/sysfs-devices-node |    7 +++++++
 drivers/base/node.c                          |    3 +++
 include/linux/compaction.h                   |   16 ++++++++++++++++
 mm/compaction.c                              |   23 +++++++++++++++++++++++
 4 files changed, 49 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node

diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
new file mode 100644
index 0000000..453a210
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-node
@@ -0,0 +1,7 @@
+What:		/sys/devices/system/node/nodeX/compact
+Date:		February 2010
+Contact:	Mel Gorman <mel@csn.ul.ie>
+Description:
+		When this file is written to, all memory within that node
+		will be compacted. When it completes, memory will be freed
+		into blocks which have as many contiguous pages as possible
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 057979a..2bdd8a9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -9,6 +9,7 @@
 #include <linux/memory.h>
 #include <linux/node.h>
 #include <linux/hugetlb.h>
+#include <linux/compaction.h>
 #include <linux/cpumask.h>
 #include <linux/topology.h>
 #include <linux/nodemask.h>
@@ -246,6 +247,8 @@ int register_node(struct node *node, int num, struct node *parent)
 ...
From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

This patch adds a proc file /proc/sys/vm/compact_memory.  When an
arbitrary value is written to the file, all zones are compacted.  The
expected user of such a trigger is a job scheduler that prepares the
system before the target application runs.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/sysctl/vm.txt |   10 +++++++
 include/linux/compaction.h  |    6 ++++
 kernel/sysctl.c             |   10 +++++++
 mm/compaction.c             |   61 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 56366a5..3b3fa1b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - block_dump
+- compact_memory
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -64,6 +65,15 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
 
 ==============================================================
 
+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
+all zones are compacted such that free memory is available in contiguous
+blocks where possible. This can be important for example in the allocation of
+huge pages although processes will also directly compact memory as required.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 465ca88..5723888 100644
--- ...
From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone.  The migration
scanner starts at the bottom of the zone and searches for all movable
pages within each area, isolating them onto a private list called
migratelist.  The free scanner starts at the top of the zone and searches
for suitable areas and consumes the free pages within making them
available for the migration scanner.  The pages isolated for migration are
then migrated to the newly isolated free pages.

[aarcange@redhat.com: Fix unsafe optimisation]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |    9 +
 include/linux/mm.h         |    1 +
 include/linux/swap.h       |    1 +
 include/linux/vmstat.h     |    3 +
 mm/Makefile                |    1 +
 mm/compaction.c            |  394 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   45 +++++
 mm/vmstat.c                |    7 +
 8 files changed, 461 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..465ca88
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,9 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_CONTINUE	0
+#define COMPACT_PARTIAL		1
+#define COMPACT_COMPLETE	2
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3058bbc..eb21256 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -334,6 +334,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct ...
From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

The fragmentation fragmentation index, is only meaningful if an allocation
would fail and indicates what the failure is due to. A value of -1 such as
in many of the examples above states that the allocation would succeed.
If it would fail, the value is between 0 and 1. A value tending towards
0 implies the allocation failed due to a lack of memory. A value tending
towards 1 implies it failed due to external fragmentation.

For the most part, the huge page size will be the size
of interest but not necessarily so it is exported on a per-order and per-zo
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/vmstat.c |   84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index d3e0fa1..23a5899 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -16,6 +16,7 @@
 #include <linux/cpu.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
+#include <linux/math64.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -420,6 +421,33 @@ static void fill_contig_page_info(struct zone *zone,
 						(order - suitable_order);
 	}
 }
+
+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(unsigned int order, struct contig_page_info *info)
+{
+	unsigned long requested = 1UL << order;
+
+	if (!info->free_blocks_total)
+		return 0;
+
+	/* Fragmentation index ...
From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

The unusable free space index measures how much of the available free
memory cannot be used to satisfy an allocation of a given size and is
a value between 0 and 1. The higher the value, the more of free memory
is unusable and by implication, the worse the external fragmentation
is. For the most part, the huge page size will be the size of interest
but not necessarily so it is exported on a per-order and per-zone basis
Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054

[akpm@linux-foundation.org: Fix allnoconfig]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 mm/vmstat.c |  150 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 149 insertions(+), 1 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index fa12ea3..d3e0fa1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -379,7 +379,50 @@ void zone_statistics(struct zone *preferred_zone, struct zone *z)
 }
 #endif
 
-#ifdef CONFIG_PROC_FS
+#ifdef CONFIG_COMPACTION
+struct contig_page_info {
+	unsigned long free_pages;
+	unsigned long free_blocks_total;
+	unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+				unsigned int suitable_order,
+				struct contig_page_info *info)
+{
+	unsigned int order;
+
+	info->free_pages = 0;
+	info->free_blocks_total = ...
From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory.  The main users of page migration such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration are
only beneficial on NUMA so it makes sense.

As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.

[akpm@linux-foundation.org: Depend on CONFIG_HUGETLB_PAGE]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/Kconfig |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 9c61158..a275a7d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -172,6 +172,16 @@ config SPLIT_PTLOCK_CPUS
 	default "4"
 
 #
+# support for memory compaction
+config COMPACTION
+	bool "Allow for memory compaction"
+	def_bool y
+	select MIGRATION
+	depends on EXPERIMENTAL && HUGETLB_PAGE && MMU
+	help
+	  Allows the compaction of memory for the allocation of huge pages.
+
+#
 # support for page migration
 #
 config MIGRATION
@@ -180,9 +190,11 @@ config MIGRATION
 	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
 	help
 	  Allows the migration of the physical location of pages of processes
-	  while the virtual addresses are not changed. This is useful for
-	  example on NUMA systems to put pages nearer to the processors accessing
-	  the page.
+	  while the virtual addresses are not changed. This is useful in
+	  two situations. The first is on NUMA systems to put pages nearer
+	  to the processors accessing. The second is when allocating huge
+	  pages as migration can relocate pages to satisfy a huge page
+	  allocation instead of reclaiming.
 
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ...
From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

PageAnon pages that are unmapped may or may not have an anon_vma so are
not currently migrated.  However, a swap cache page can be migrated and
fits this description.  This patch identifies page swap caches and allows
them to be migrated but ensures that no attempt to made to remap the pages
would would potentially try to access an already freed anon_vma.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 mm/migrate.c |   53 ++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index b114635..4afd6fe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -485,7 +485,8 @@ static int fallback_migrate_page(struct address_space *mapping,
  *   < 0 - error code
  *  == 0 - success
  */
-static int move_to_new_page(struct page *newpage, struct page *page)
+static int move_to_new_page(struct page *newpage, struct page *page,
+						int remap_swapcache)
 {
 	struct address_space *mapping;
 	int rc;
@@ -520,10 +521,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
 	else
 		rc = fallback_migrate_page(mapping, newpage, page);
 
-	if (!rc)
-		remove_migration_ptes(page, newpage);
-	else
+	if (rc) {
 		newpage->mapping = NULL;
+	} else {
+		if (remap_swapcache)
+			remove_migration_ptes(page, newpage);
+	}
 
 	unlock_page(newpage);
 
@@ -540,6 +543,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
+	int remap_swapcache = 1;
 	int rcu_locked = 0;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
@@ -601,18 +605,33 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		rcu_read_lock();
 		rcu_locked = 1;
 
-		/*
-		 * If the page has no mappings any more, just bail. An
-		 * unmapped anon page is ...
From: Christoph Lameter
Date: Wednesday, April 21, 2010 - 7:30 am

You are going to keep the migration ptes after the page has been unlocked?
Or is remap_swapcache true if its not a swapcache page?

Maybe you meant

if (!remap_swapcache)


Looks like you meant !remap_swapcache

--

From: Mel Gorman
Date: Wednesday, April 21, 2010 - 8:00 am

Yes, because it's not known if the anon_vma for the unmapped swapcache page
still exists or not.  Now, a bug has been reported where a migration PTE is
found where the page is not locked. I'm trying to determine if it's the same

No, remap_swapcache could just be called "remap". If it's 0, it's

If remap_swapcache is 1, the anon_vma is valid (or irrelevant because
it's a file) and it's safe to remap the page by removing the migration
PTEs.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Wednesday, April 21, 2010 - 8:05 am

Call this "can_remap"?
--

From: Mel Gorman
Date: Wednesday, April 21, 2010 - 8:14 am

can_do - ba dum tisch.

While you are looking though, maybe you can confirm something for me.

1. Is leaving a migration PTE like this behind reasonable? (I think yes
   particularly as the page was already unmapped so it's not a new fault
   incurred)
2. Is the BUG_ON check in
   include/linux/swapops.h#migration_entry_to_page() now wrong? (I
   think yes, but I'm not sure and I'm having trouble verifying it)

Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Wednesday, April 21, 2010 - 8:31 am

The design of page migration only allows for the existence of these as
long as the page is locked. Not sure what would happen if you leave this
hanging around. Paths that are not prepared for a migration_pte may

The bug check ensures that migration entries only occur when the page
is locked. This patch changes that behavior. This is going too oops
therefore in unmap_and_move() when you try to remove the migration_ptes
from an unlocked page.


--

From: Mel Gorman
Date: Wednesday, April 21, 2010 - 8:34 am

If there are other paths, then migration of unmapped PageSwapCache is
plain unsafe and this patch would have to go. It'd limit compaction
somewhat but without the series, it would appear that memory hot-remove

It's not unmap_and_move() that the problem is occurring on but during a
page fault - presumably in do_swap_page but I'm not 100% certain.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Wednesday, April 21, 2010 - 8:46 am

remove_migration_pte() calls migration_entry_to_page(). So it must do that
only if the page is still locked.

You need to ensure that the page is not unlocked in move_to_new_page() if
the migration ptes are kept.

move_to_new_page() only unlocks the new page not the original page. So that is safe.

And it seems that the old page is also unlocked in unmap_and_move() only
after the migration_ptes have been removed? So we are fine after all...?



--

From: Mel Gorman
Date: Thursday, April 22, 2010 - 2:28 am

Correct, but the other call path is

do_swap_page
  -> migration_entry_wait
    -> migration_entry_to_page

with migration_entry_wait expecting the page to be locked. There is a dangling
migration PTEs coming from somewhere. I thought it was from unmapped swapcache

You'd think but migration PTEs are being left behind in some circumstance. I
thought it was due to this series, but it's unlikely. It's more a case that
compaction heavily exercises migration.

We can clean up the old migration PTEs though when they are encountered
like in the following patch for example? I'll continue investigating why
this dangling migration pte exists as closing that race would be a
better fix.

==== CUT HERE ====
mm,migration: Remove dangling migration ptes pointing to unlocked pages

Due to some yet-to-be-identified race, it is possible for migration PTEs
to be left behind, When later paged-in, a BUG is triggered that assumes
that all migration PTEs are point to a page currently being migrated and
so must be locked.

Rather than calling BUG, this patch notes the existance of dangling migration
PTEs in migration_entry_wait() and cleans them up.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/swapops.h |   11 ++++++-----
 mm/memory.c             |    2 +-
 mm/migrate.c            |   24 ++++++++++++++++++++----
 3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index cd42e30..01fba71 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -95,14 +95,15 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline struct page *migration_entry_to_page(swp_entry_t entry)
+static inline struct page *migration_entry_to_page(swp_entry_t entry,
+				bool lock_required)
 {
 	struct page *p = pfn_to_page(swp_offset(entry));
 	/*
 	 * Any use of migration entries may only occur while the
 	 * corresponding page is ...
From: KAMEZAWA Hiroyuki
Date: Thursday, April 22, 2010 - 2:46 am

On Thu, 22 Apr 2010 10:28:20 +0100

I use similar patch for debugging. In my patch, this when this function founds
dangling migration entry, return error code and do_swap_page() returns
VM_FAULT_SIGBUS.


Hmm..in my test, the case was.

Before try_to_unmap:
	mapcount=1, SwapCache, remap_swapcache=1
After remap
	mapcount=0, SwapCache, rc=0.

So, I think there may be some race in rmap_walk() and vma handling or
anon_vma handling. migration_entry isn't found by rmap_walk.

Hmm..it seems this kind patch will be required for debug.

-Kame



--

From: Minchan Kim
Date: Thursday, April 22, 2010 - 3:13 am

On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki

I looked do_swap_page, again.
lock_page is called long after migration_entry_wait.
It means lock_page can't close the race.

So I think this BUG is possible.



-- 
Kind regards,
Minchan Kim
--

From: KAMEZAWA Hiroyuki
Date: Thursday, April 22, 2010 - 3:31 am

On Thu, 22 Apr 2010 19:13:12 +0900

I think it's not a problem.
When migration-entry-wait is called, enry_wait() does

	pte_lock();
	check migration_pte
	check it's locked.

And after wait_on_page_locked(), it just returns to user and cause
page fault again. 

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Thursday, April 22, 2010 - 3:51 am

On Thu, 22 Apr 2010 19:31:06 +0900

Ok, here is my patch for _fix_. But still testing...
Running well at least for 30 minutes, where I can see bug in 10minutes.
But this patch is too naive. please think about something better fix.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

At adjust_vma(), vma's start address and pgoff is updated under
write lock of mmap_sem. This means the vma's rmap information
update is atoimic only under read lock of mmap_sem.


Even if it's not atomic, in usual case, try_to_ummap() etc...
just fails to decrease mapcount to be 0. no problem.

But at page migration's rmap_walk(), it requires to know all
migration_entry in page tables and recover mapcount.

So, this race in vma's address is critical. When rmap_walk meet
the race, rmap_walk will mistakenly get -EFAULT and don't call
rmap_one(). This patch adds a lock for vma's rmap information. 
But, this is _very slow_.
We need something sophisitcated, light-weight update for this..


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/mm_types.h |    1 +
 kernel/fork.c            |    1 +
 mm/mmap.c                |   11 ++++++++++-
 mm/rmap.c                |    3 +++
 4 files changed, 15 insertions(+), 1 deletion(-)

Index: linux-2.6.34-rc4-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.34-rc4-mm1.orig/include/linux/mm_types.h
+++ linux-2.6.34-rc4-mm1/include/linux/mm_types.h
@@ -183,6 +183,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	spinlock_t adjust_lock;
 };
 
 struct core_thread {
Index: linux-2.6.34-rc4-mm1/mm/mmap.c
===================================================================
--- linux-2.6.34-rc4-mm1.orig/mm/mmap.c
+++ linux-2.6.34-rc4-mm1/mm/mmap.c
@@ -584,13 +584,20 @@ again:			remove_next = 1 + (end > next->
 		if (adjust_next)
 			vma_prio_tree_remove(next, root);
 ...
From: Mel Gorman
Date: Thursday, April 22, 2010 - 7:14 am

Ok wow. That is exceptionally well-spotted. This looks like a proper bug

In the event the VMA is backed by a file, the mapping i_mmap_lock is taken for
the duration of the update and is  taken elsewhere where the VMA information
is read such as rmap_walk_file()

In the event the VMA is anon, vma_adjust currently talks no locks and your
patch introduces a new one but why not use the anon_vma lock here? Am I
missing something that requires the new lock?

For example;

==== CUT HERE ====
mm: Take the anon_vma lock in vma_adjust()

vma_adjust() is updating anon VMA information without any locks taken.
In constract, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.

This patch takes the anon_vma->lock during vma_adjust to avoid such
races.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/mmap.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	if (vma->anon_vma)
+		spin_lock(&vma->anon_vma->lock);
+
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	if (vma->anon_vma)
+		spin_unlock(&vma->anon_vma->lock);
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
--

From: Minchan Kim
Date: Thursday, April 22, 2010 - 7:18 am

rmap_walk_anon doesn't hold vma's anon_vma->lock.
It holds page->anon_vma->lock.

-- 
Kind regards,
Minchan Kim
--

From: Mel Gorman
Date: Thursday, April 22, 2010 - 8:40 am

Of course, thank you for pointing out my error. With multiple
anon_vma's, the locking is a bit of a mess. We cannot hold spinlocks on
two vma's in the same list at the same time without potentially causing
a livelock. The problem becomes how we can safely drop one anon_vma and
acquire the other without them disappearing from under us.

See the XXX mark in the following incomplete patch for example. It's
incomplete because the list traversal is also not safe once the lock has
been dropped and -EFAULT is returned by vma_address.

==== CUT HERE ====
mm: Take the vma anon_vma lock in vma_adjust and during rmap_walk

vma_adjust() is updating anon VMA information without any locks taken.
In constract, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.

This patch takes the anon_vma->lock during vma_adjust to avoid such
races. During rmap_walk, the page anon_vma is locked but as it walks the
VMA list, it'll lock the VMA->anon_vma if they differ as well.

---
 mm/mmap.c |    6 ++++++
 mm/rmap.c |   48 ++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	if (vma->anon_vma)
+		spin_lock(&vma->anon_vma->lock);
+
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	if (vma->anon_vma)
+		spin_unlock(&vma->anon_vma->lock);
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..1ea0cae ...
From: Mel Gorman
Date: Thursday, April 22, 2010 - 9:13 am

Incidentally, I now belatedly see why Kamezawa introduced a new lock. I
assume it was to get around this mess.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Thursday, April 22, 2010 - 12:29 pm

There is a simplier alternative I guess. When the vma->anon_vma is difference,
try and lock it. If the lock is uncontended, continue. If not, release the
pages anon_vma lock and start from the beginning - repeat until the list is
walked uncontended. This should avoid livelocking against other walkers.
I have the test running now for 30 minutes with no problems but will
leave it overnight and see what happens.

I tried the approach of having vma_adjust and rmap_walk always seeing
the same anon_vma but I couldn't devise a method. It doesn't seem
possible but I'm still getting to grips with the anon_vma_chain stuff.
Maybe Rik can spot a better way of doing this.

==== CUT HERE ====
mm: Take the vma anon_vma lock in vma_adjust and during rmap_walk

vma_adjust() is updating anon VMA information without any locks taken.
In constract, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.

This patch has vma_adjust() take the vma->anon_vma lock. When
rmap_walk_anon() is walking the anon_vma_chain list, it can encounter a
VMA a different anon_vma. It cannot just take this lock because
depending on the order of traversal of anon_vma_chains in other
processes, it could cause a livelock. Instead, it releases the pages
anon_vma lock it has and starts again from scratch until it can traverse
the full list uncontended.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/mmap.c |    6 ++++++
 mm/rmap.c |   21 ++++++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	if (vma->anon_vma)
+		spin_lock(&vma->anon_vma->lock);
+
 	if (root) ...
From: Christoph Lameter
Date: Thursday, April 22, 2010 - 12:40 pm

Isnt this also a race with reclaim /  swap?
--

From: KAMEZAWA Hiroyuki
Date: Thursday, April 22, 2010 - 4:52 pm

On Thu, 22 Apr 2010 14:40:46 -0500 (CDT)
Yes, it's also race in reclaim/swap ...
  page_referenced()
  try_to_unmap().
  rmap_walk()  <==== we hit this case.

But above 2 are not considered to be critical.

I'm not sure how this race affect KSM.

Thanks,
-Kame

--

From: Mel Gorman
Date: Friday, April 23, 2010 - 2:03 am

I'm not that familiar with KSM but took a look through. Mostly,
accessing the VMA is protected by the mmap_sem with the exception of
rmap_walk_ksm. It needs similar protection for accessing the VMA than
rmap_walk_anon does.

Specifically, this part

                list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
                        vma = vmac->vma;
                        if (rmap_item->address < vma->vm_start ||
                            rmap_item->address >= vma->vm_end)
                                continue;

needs to acquire the vma->anon_vma lock if it differs or in your case
call something similar to vma_address_safe.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Thursday, April 22, 2010 - 7:23 am

Nice Catch, Kame. :)

For further optimization, we can hold vma->adjust_lock if vma_address
returns -EFAULT. But I hope we redesigns it without new locking. 
But I don't have good idea, now. :(

-- 
Kind regards,
Minchan Kim


--

From: Minchan Kim
Date: Thursday, April 22, 2010 - 7:40 am

How about this?
I just merged ideas of Mel and Kame.:)

It just shows the concept, not formal patch. 


diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61ea742 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	if (vma->anon_vma)
+		spin_lock(&vma->anon_vma->lock);
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ -619,7 +621,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
-
+	if (vma->anon_vma) 
+		spin_unlock(&vma->anon_vma->lock);
 	if (remove_next) {
 		if (file) {
 			fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 3a53d9f..8075057 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1359,9 +1359,22 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	spin_lock(&anon_vma->lock);
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
-		unsigned long address = vma_address(page, vma);
-		if (address == -EFAULT)
+		struct anon_vma *tmp_anon_vma = vma->anon_vma;
+		unsigned long address;
+		int tmp_vma_lock = 0;
+		
+		if (tmp_anon_vma != anon_vma) {
+			spin_lock(&tmp_anon_vma->lock);	
+			tmp_vma_lock = 1;
+		}
+		address = vma_address(page, vma);
+		if (address == -EFAULT) {
+			if (tmp_vma_lock)
+				spin_unlock(&tmp_anon_vma->lock);
 			continue;
+		}
+		if (tmp_vma_lock)
+			spin_unlock(&tmp_anon_vma->lock);
 		ret = rmap_one(page, vma, address, arg);
 		if (ret != SWAP_AGAIN)
 			break;
-- 
1.7.0.5



-- 
Kind regards,
Minchan Kim


--

From: Mel Gorman
Date: Thursday, April 22, 2010 - 8:44 am

heh, I thought of a similar approach at the same time as you but missed
this mail until later. However, with this approach I suspect there is a
possibility that two walkers of the same anon_vma list could livelock if
two locks on the list are held at the same time. Am still thinking of

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Friday, April 23, 2010 - 11:31 am

Hi Mel,


Trying to understand this issue and I've some questions. This
vma_adjust and lock inversion troubles with the anon-vma lock in
rmap_walk are a new issue introduced by the recent anon-vma changes,
right?

About swapcache, try_to_unmap just nuke the mappings, establish the
swap entry in the pte (not migration entry), and then there's no need
to call remove_migration_ptes. So it just need to skip it for
swapcache. page_mapped must return zero after try_to_unmap returns
before we're allowed to migrate (plus the page count must be just 1
and not 2 or more for gup users!).

I don't get what's the problem about swapcache and the races connected
to it, the moment I hear migration PTE in context of swapcache
migration I'm confused because there's no migration PTE for swapcache.

The new page will have no mappings either, it just needs to be part of
the swapcache with the same page->index = swapentry, indexed in the
radix tree with that page->index, and paga->mapping pointing to
swapcache. Then new page faults will bring it in the pageatables. The
lookup_swap_cache has to be serialized against some lock, it should be
the radix tree lock? So the migration has to happen with that lock
hold no? We can't just migrate swapcache without stopping swapcache
radix tree lookups no? I didn't digest the full migrate.c yet and I
don't see where it happens. Freeing the swapcache while simpler and
safer, would be quite bad as it'd create I/O for potentially hot-ram.

About the refcounting of anon-vma in migrate.c I think it'd be much
simpler if zap_page_range and folks would just wait (like they do if
they find a pmd_trans_huge && pmd_trans_splitting pmd), there would be
no need of refcounting the anon-vma that way.

I assume whatever is added to rmap_walk I also have to add to
split_huge_page later when switching to mainline anon-vma code (for
now I stick to 2.6.32 anon-vma code to avoid debugging anon-vma-chain,
memory compaction, swapcache migration and transparent hugepage at ...
From: Mel Gorman
Date: Friday, April 23, 2010 - 12:23 pm

In a manner of speaking. There was no locking going on but prior to the
anon_vma changes, there would have been only one anon_vma lock and the
fix would be easier - just take the lock on anon_vma->lock while the

That would be an alternative for swapcache but it's not necessarily

That was a mistake on my part. The race appears to be between vma_adjust
changing the details of the VMA while rmap_walk is going on. It mistakenly
believes the vma no longer spans the address, gets -EFAULT from vma_address
and doesn't clean up the migration PTE. This is later encountered but the
page lock is no longer held and it bugs. An alternative would be to clean
up the migration PTE of unlocked pages on the assumption it was due to this

Think migrate_page_move_mapping() is what you're looking for? It takes

I'm not getting what you're suggesting here. The refcount is to make
sure the anon_vma doesn't go away after the page mapcount reaches zero.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Friday, April 23, 2010 - 12:39 pm

So it was very much a bug before too and we could miss to find some
pte mapping the page if vm_start was adjusted?

Also note, expand_downwards also moves vm_start with only the
anon_vma->lock as it has to serialize against other expand_downwards
and the rmap_walk. But expand_downwards takes the lock and it was safe
before.

Also for swapping even if things screwup it's no big deal, because it
will just skip, but migration has to find all ptes in
remove_migration_ptes and try_to_unmap also has to unmap everything.

In the split_huge_page case even the ordering at which newly allocated
vmas for the child are queued is critical, they've to be put at the
end of the list to be safe (otherwise do_wp_huge_page may not wait and

Hmm try_to_unmap already nukes all swap entries without creating any

Agreed, it's sure better to close the race... the other may have even
more implications. It's good to retain the invariant that when a
migration PTE exists the page also still exists and it's locked


Causing zap_page-range to Wait the end of migration when it encounters
migration ptes instead of skipping them all together by only releasing
the rss and doing nothing about them. If the pte can't go away, so the
mm so the vma and so the anon-vma. I'm not suggesting to change that,
but I guess that's the way I would have done if I would have
implemented it, it'd avoid refcounting. Just like I did in
split_huge_page I count the number of pmd marked splitting, and I
compare it to the number of pmds that are converted from huge to
not-huge and I compared that again with the page_mapcount. If the
three numbers aren't equal I bug. It simply can't go wrong unnoticed
that way. I only can do that because I stop the zapping.
--

From: Andrea Arcangeli
Date: Friday, April 23, 2010 - 2:35 pm

Well I looked deeper into it myself as I wanted to have this bit (and
other bits) sorted out in aa.git, and definitely this is a bug
introduced by the newest anon-vma changes in 2.6.34-rc so aa.git
cannot be affected as it's using the 2.6.33 anon-vma (and prev) code.

vma_adjust already takes the anon_vma->lock and of course I also
further verified that trying to apply your snippet to vma_adjust
results in immediately deadlock as the very same lock is already taken
in my tree as it's the same anon-vma (simpler). So aa.git will be
immune from these bugs for now.
--

From: Mel Gorman
Date: Saturday, April 24, 2010 - 3:52 am

I think you're right. This is a new bug introduced by the anon_vma changes. On

Yes, I expected that. Previously, there was only one anon_vma so if you

It should be. I expect that's why you have never seen the bugon in
swapops.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Saturday, April 24, 2010 - 4:13 am

Correct, I never seen it, and I keep it under very great stress with
swap storms of hugepages, lots of I/O and khugepaged at 100% cpu.

Also keep in mind expand_downwards which also adjusts
vm_start/vm_pgoff the same way (and without mmap_sem write mode).
--

From: Mel Gorman
Date: Saturday, April 24, 2010 - 4:59 am

Well, to me this is also good because it shows it's not an existing bug in
migration or a new bug introduced by compaction either. Previously I hadn't
seen this bug either but until relatively recently, the bulk of the testing

Will keep it in mind. It's taking the anon_vma lock but once again,
there might be more than one anon_vma to worry about and the proper
locking still isn't massively clear to me.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Saturday, April 24, 2010 - 7:30 am

I suggest to test again with aa.git as soon as I make a new release
with your v8 code (probably today). I didn't merge the swapcache
locked debug patch that tries to recover the thing after the fact. No
problem here with swapcache apparently and your v8 and latest numa-aware
khugepaged code is under stress for the last 12 hours.

Other than full numa awareness in all hugepage allocations and your v8
code, I added a generic document that needs review and I plan to add a
config tweak to select the default to be madvise or always for
transparent hugepage (never makes no sense, other than for debugging
purposes, madvise already provides the guarantee of zero risk of

Yes, that's my point, same issue there.
--

From: Rik van Riel
Date: Monday, April 26, 2010 - 2:54 pm

The locking for the anon_vma_chain->same_vma list is
essentially the same as what was used before in mmap
and anon_vma_prepare.

Either the mmap_sem is held for write, or the mmap_sem
is held for reading and the page_table_lock is held.

What exactly is the problem that migration is seeing?

-- 
All rights reversed
--

From: Mel Gorman
Date: Monday, April 26, 2010 - 3:11 pm

There are two problems.

Migration isn't holding the mmap_sem for write, for read or the pagetable
lock. It locks the page, unmaps it, puts a migration PTE in place that looks
like a swap entry, copies it and remaps it under the pagetable lock. At no
point does it hold the mmap_sem, but it needs to be sure it finds all the
migration pte it created. Because there are multiple anon_vma's, the locking
is tricky and unclear. I have one patch that locks the anon_vmas as it finds
them but is prepared to start over in the event of contention.

The second appears to be migration ptes that get copied during fork().
This is easier to handle.

I'm testing two patches at the moment and after 8 hours have seen no problem
even though the races are being detected (and handled). If it survives the
night, I'll post them.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Monday, April 26, 2010 - 3:26 pm

split_huge_page has the exact same requirements, except it is more
strict and it will stop zap_page_range and count that the same number
of pmds it marked as splitting are found again later.


Also note migration has the same "ordering" requirements for
anon_vma_link during fork, new vmas have to be appended at the end or
migration will choke (not going into the details of why, but I can if
you want). This should be safe in new anon-vma code as I already
pointed out this requirement to Rik for split_huge_page to be safe too.

I never tested split_huge_page on the fixed new anon-vma code (before
the latest fixes so with rc4 or so, I only know before the latest
fixes it was triggering BUG_ON in split_huge_page as I've enough
bug-on in there to be sure if split_huge_page doesn't BUG_ON, it's
safe). I need to retry with the new anon-vma code... split_huge_page
never showed anything wrong with the 2.6.33 code that I'm running on

And this is also where the requirement that new vmas are added to the

I run again the same kernel as before and I reproduced the crash in
migration_entry_wait swapops.h (page not locked) just once when I
posted the stack trace and never again. I wanted to compare stack
traces and see if it happens again. But that bug in
migration_entry_wait can't be related to the new anon-vma code because
I've backed it out from aa.git. Still you've to figure out if your
patch is fixing a real bug.

I'm just pointing out if there's a bug in anon-vma
vma_adjust/expand_downards is unrelated to the crash in swapops.h
migration_entry_wait. And obviously it's not either a bug in
transparent hugepage code, as you also reproduced the same crash
without using aa.git only with v8.

We need to fix the swapops.h bug with maximum priority... (and of
course the anon-vma bug too if it exists).

Other than that swapops.h in migrate that you can also reproduce with
only mainline + memory compaction v8, I had zero other problems with
current aa.git.
--

From: Andrea Arcangeli
Date: Sunday, April 25, 2010 - 7:41 am

Oh I just got the very crash you're talking about with aa.git with
your v8 code. Weird that I never reproduced it before! I think it's
because I fixed gcc to be fully backed by hugepages always (without
khugepaged) and I was rebuilding a couple of packages, and that now
triggers memory compaction much more, but mixed with heavy
fork/execve. This is the only instability I managed to reproduce over
24 hours of stress testing and it's clearly not related to transparent
hugepage support but it's either a bug in migrate.c (more likely) or
memory compaction.

Note that I'm running with the 2.6.33 anon-vma code, so it will
relieve you to know it's not the anon-vma recent changes causing this
(well I can't rule out anon-vma bugs, but if it's anon-vma, it's a
longstanding one).

kernel BUG at include/linux/swapops.h:105!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0/block/sr0/size
CPU 0 
Modules linked in: nls_iso8859_1 loop twofish twofish_common tun bridge stp llc bnep sco rfcomm l2cap bluetooth snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss usbhid gspca_pac207 gspca_main videodev v4l1_compat v4l2_compat_ioctl32 snd_hda_codec_realtek ohci_hcd snd_hda_intel ehci_hcd usbcore snd_hda_codec snd_pcm snd_timer snd snd_page_alloc sg psmouse sr_mod pcspkr

Pid: 13351, comm: basename Not tainted 2.6.34-rc5 #23 M2A-VM/System Product Name
RIP: 0010:[<ffffffff810e66b0>]  [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180
RSP: 0000:ffff88009ab6fa58  EFLAGS: 00010246
RAX: ffffea0000000000 RBX: ffffea000234eed8 RCX: ffff8800aaa95298
RDX: 00000000000a168d RSI: ffff88000411ae28 RDI: ffffea00025550a8
RBP: ffffea0002555098 R08: ffff88000411ae28 R09: 0000000000000000
R10: 0000000000000008 R11: 0000000000000009 R12: 00000000aaa95298
R13: 00007ffff8a53000 R14: ffff88000411ae28 R15: ffff88011108a7c0
FS:  00002adf29469b90(0000) GS:ffff880001a00000(0000) knlGS:0000000055700d50
CS:  0010 DS: 0000 ...
From: Mel Gorman
Date: Tuesday, April 27, 2010 - 2:40 am

The oopses I am getting look very similar. The page is encountered in
the stack while copying the arguements in. I don't think it's a
coincidence.

[17831.496941] ------------[ cut here ]------------
[17831.532517] kernel BUG at include/linux/swapops.h:105!
[17831.532517] invalid opcode: 0000 [#1] PREEMPT SMP 
[17831.532517] last sysfs file: /sys/block/sde/size
[17831.532517] CPU 0 
[17831.532517] Modules linked in: kvm_amd kvm dm_crypt loop evdev tpm_tis i2c_piix4 shpchp tpm wmi tpm_bios serio_raw i2c_core pci_hotplug processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod sd_mod cdrom ata_generic ahci libahci ide_pci_generic libata r8169 mii ide_core ehci_hcd scsi_mod ohci_hcd floppy thermal fan thermal_sys
[17831.532517] 
[17831.532517] Pid: 31028, comm: make Not tainted 2.6.34-rc4-mm1-fix-swapops #2 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[17831.532517] RIP: 0010:[<ffffffff811094fb>]  [<ffffffff811094fb>] migration_entry_wait+0xc1/0x129
[17831.532517] RSP: 0018:ffff88007ebfd9d8  EFLAGS: 00010246
[17831.532517] RAX: ffffea0000000000 RBX: ffffea0000199368 RCX: 00000000000389f0
[17831.532517] RDX: 0000000000199368 RSI: ffffffff81826558 RDI: 0000000000e9d63e
[17831.532517] RBP: ffff88007ebfda08 R08: ffff88007e334ec0 R09: ffff88007ebfd9e8
[17831.532517] R10: 00000000b411a2ca R11: 0000000000000246 R12: 0000000036f44000
[17831.532517] R13: ffff88007d164088 R14: f8000000000074eb R15: 0000000000e9d63e
[17831.532517] FS:  00002abc5c083f90(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[17831.532517] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[17831.532517] CR2: 00007fff3fb8cd1c CR3: 000000007d12d000 CR4: 00000000000006f0
[17831.532517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17831.532517] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17831.532517] Process make (pid: 31028, ...
From: KAMEZAWA Hiroyuki
Date: Tuesday, April 27, 2010 - 3:41 am

On Tue, 27 Apr 2010 10:40:02 +0100

Hmm. booby trap aronude here ?
==
static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
{
....
       /*
         * cover the whole range: [new_start, old_end)
         */
        if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
                return -ENOMEM;

        /*
         * move the page tables downwards, on failure we rely on
         * process cleanup to remove whatever mess we made.
         */
        if (length != move_page_tables(vma, old_start,
                                       vma, new_start, length))
                return -ENOMEM;
...
        /*
         * Shrink the vma to just the new range.  Always succeeds.
         */
        vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);

	
==

I think we have wrong vma_address() -> "pte"
==
	=== (A) ===
	vma_adjust().  ---- (*)
	=== (B) ===
	move_pte().
==

	vma_address(page, vma)
	=> address = vma->vm_start + ((page->index << shift) - vma->vm_pgoff) << PAGE_SHIFT);

So, vma_address() in zone (A) and vma_address in (B) will return different address.

When pte inludes migration_pte, this seems critical. Because an address pointed
by vma_address() in zone (B) will not contain migration_pte until
move_ptes() ends.

Thanks,
-Kame






--

From: Mel Gorman
Date: Tuesday, April 27, 2010 - 4:12 am

I think so. I have a debugging patch running at the moment that is

Specfically, I have it in move_ptes. If migration entries are being found
there, it would be reasonable for exec() to wait on migration to complete

Yes. I was expecting that the anon_vma lock in vma_adjust would delay exec

This is plausible considering that, like vma_adjust(), move_ptes does
not appear to take the anon_vma->lock in the same fashion as i_mmap_lock
is taken for files.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Tuesday, April 27, 2010 - 8:42 am

Awesome this is happening in the same page and in the same place!
Thanks a lot for sharing your oops (I'm running the same kernel again
and I never reproduced it again but I didn't apply the "reproducer"
stress to it but only plenty of gcc hugepage loads, I think rebuilding
gcc with hugepage-gcc is what triggered it the first time, as gcc
takes more memory to build and has a lot more of pathologic cases like
translate.o taking lots of memory than the kernel or glibc which I
rebuilt a lot recently).

Reducing the race to this will help tremendously. If I understand
correctly patch 1 didn't fix it (and patch 2 can't be the issue for me
as I'm running 2.6.33 anon-vma code). I'll check ASAP if patch 1 is
needed even if it's not fixing this bug, or if it's unnecessary, and
what else could be wrong...

Thanks a lot for the help!
Andrea
--

From: Mel Gorman
Date: Saturday, April 24, 2010 - 3:50 am

I thought it was but I was looking at an rc kernel instead of 2.6.33.
This looked as if it was safe but it's not any more with the new

It either has to find them all or it has to be capable of a
lazy-cleanup. I had lazy cleanup patch but it was dropped because we
felt it should have been possible to properly lock this. I'm beginning
to think it can't because there appears to be a few cases where the VM

There might also be a locking snarl there as well then. I confess that
the details of transparent hugepage support have fallen back out of my

When a mapped swapcache is unmapped, a migration PTE is put in place. If
that was not the case, we wouldn't be hitting the bug in the first



That would be another way of doing it all right. Count how many
migration ptes we created, pass that to rmap_walk. If they don't match,
assume a race and do it again before the page is unlocked.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Thursday, April 22, 2010 - 8:14 am

You could make it atomic through the use of RCU.

Create a new vma entry with the changed parameters and then atomically
switch to the new vma.

Problem is that you have some list_heads in there.

--

From: Paul E. McKenney
Date: Thursday, April 22, 2010 - 8:39 pm

Indeed, it would be necessary to update -all- pointers to the old
vma entry to point to the new vma entry.  The question at that point
will be "is it OK to switch the pointers over one at a time?"

In many situations, the answer is "yes", but it is necessary to check
carefully.

						Thanx, Paul
--

From: Minchan Kim
Date: Thursday, April 22, 2010 - 9:55 pm

Hi, Christoph.


That's a good idea if we can do _simply_.
That's because there are many confusion anon_vma and vma handling nowadays.
(http://thread.gmane.org/gmane.linux.kernel/969907)
So I hope we solve the problem without rather complicated rcu locking
if it isn't critical path.

-- 
Kind regards,
Minchan Kim
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, April 21, 2010 - 4:59 pm

On Wed, 21 Apr 2010 09:30:20 -0500 (CDT)

Ah....Can I confirm my understanding ?

remap_swapcache == true only when
  The old page was ANON && it is not mapped. && it is SwapCache.

We do above check under lock_page(). So, this SwapCache is never mapped until
we release lock_page() on the old page. So, we don't use migration_pte in
this case because try_to_unmap() do nothing and don't need to call
remove_migration_pte().

If migration_pte is used somewhere...I think it's bug.


--

From: Minchan Kim
Date: Wednesday, April 21, 2010 - 5:11 pm

On Thu, Apr 22, 2010 at 8:59 AM, KAMEZAWA Hiroyuki

Yes. so I thought what kinds of race happened.
Firstly I doubt fork and migration. but It isn't.
I can't understand how this bug happens.
Apparently, We have been missed something.
I will look into this further.


-- 
Kind regards,
Minchan Kim
--

From: Mel Gorman
Date: Tuesday, April 20, 2010 - 2:01 pm

rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors.  The problem is that between the page being
isolated from the LRU and rcu_read_lock() being taken, the mapcount of the
page dropped to 0 and the anon_vma gets freed.  This can happen during
memory compaction if pages being migrated belong to a process that exits
before migration completes.  Hence, the use-after-free race looks like

 1. Page isolated for migration
 2. Process exits
 3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
 4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
 4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
    is garbage.

This patch checks the mapcount after the rcu lock is taken.  If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/migrate.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 42a3d24..b114635 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -600,6 +600,17 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	if (PageAnon(page)) {
 		rcu_read_lock();
 		rcu_locked = 1;
+
+		/*
+		 * If the page has no mappings any more, just bail. An
+		 * unmapped anon page is likely to be freed soon but worse,
+		 * it's possible its anon_vma disappeared between when
+		 * the page was isolated and when we reached here while
+		 * the RCU lock was not held
+		 */
+		if (!page_mapped(page))
+			goto rcu_unlock;
+
 		anon_vma = page_anon_vma(page);
 		atomic_inc(&anon_vma->external_refcount);
 	}
-- 
1.6.5

--

Previous thread: [PATCH 02/14] mm,migration: Share the anon_vma ref counts between KSM and page migration by Mel Gorman on Tuesday, April 20, 2010 - 2:01 pm. (1 message)

Next thread: Re: Should calculation of vm.overcommit_ratio be changed? by Dirk Geschke on Tuesday, April 20, 2010 - 1:51 pm. (1 message)