Re: [RFC PATCH -v3] take all anon_vma locks in anon_vma_lock

Previous thread: [PATCH] Staging: tm6000: fix coding style issues of tm6000-cards.c by Roel Van Nyen on Tuesday, April 27, 2010 - 2:25 pm. (1 message)

Next thread: Re: [PATCH] STAGING: hv: Update blkvsc_drive.c to remove spaces before newlines by Greg KH on Tuesday, April 27, 2010 - 2:39 pm. (1 message)
From: Mel Gorman
Date: Tuesday, April 27, 2010 - 2:30 pm

After V1, it was clear that execve was still racing but eventually died
in an exec-related race. An additional part of the test was created that
hammers exec() to reproduce typically within 10 minutes rather than several
hours.  The problem was that the VMA is moved under lock but not the page
tables. Migration fails to remove the migration PTE from its new location and
a BUG is later triggered. The third patch in this series is a candidate fix.

Changelog since V1
  o Handle the execve race
  o Be sure that rmap_walk() releases the correct VMA lock
  o Hold the anon_vma lock for the address lookup and the page remap
  o Add reviewed-bys

There are a number of races between migration and other operations that mean a
migration PTE can be left behind. Broadly speaking, migration works by locking
a page, unmapping it, putting a migration PTE in place that looks like a swap
entry, copying the page and remapping the page removing the old migration PTE.
If a fault occurs, the faulting process waits until migration completes.

The problem is that there are some races that either allow migration PTEs to
be copied or a migration PTE to be left behind. Migration still completes and
the page is unlocked but later a fault will call migration_entry_to_page()
and BUG() because the page is not locked. This series aims to close some
of these races.

Patch 1 alters fork() to restart page table copying when a migration PTE is
	encountered.

Patch 2 has vma_adjust() acquire the anon_vma lock and makes rmap_walk()
	aware that VMAs on the chain may have different anon_vma locks that
	also need to be acquired.

Patch 3 notes that while a VMA is moved under the anon_vma lock, the page
	tables are not similarly protected. Where migration PTEs are
	encountered, they are cleaned up.

The reproduction case was as follows;

1. Run kernel compilation in a loop
2. Start three processes, each of which creates one mapping. The three stress
   different aspects of the problem. The operations they ...

During exec(), a temporary stack is setup and moved later to its final
location. There is a race between migration and exec whereby a migration
PTE can be placed in the temporary stack. When this VMA is moved under the
lock, migration no longer knows where the PTE is, fails to remove the PTE
and the migration PTE gets copied to the new location.  This later causes
a bug when the migration PTE is discovered but the page is not locked.

This patch handles the situation by removing the migration PTE when page
tables are being moved in case migration fails to find them. The alternative
would require significant modification to vma_adjust() and the locks taken
to ensure a VMA move and page table copy is atomic with respect to migration.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/migrate.h |    7 +++++++
 mm/migrate.c            |    2 +-
 mm/mremap.c             |   29 +++++++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 7a07b17..05d2292 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -22,6 +22,8 @@ extern int migrate_prep(void);
 extern int migrate_vmas(struct mm_struct *mm,
 		const nodemask_t *from, const nodemask_t *to,
 		unsigned long flags);
+extern int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+			unsigned long addr, void *old);
 #else
 #define PAGE_MIGRATION 0
 
@@ -42,5 +44,10 @@ static inline int migrate_vmas(struct mm_struct *mm,
 #define migrate_page NULL
 #define fail_migrate_page NULL
 
+static inline int remove_migration_pte(struct page *new,
+		struct vm_area_struct *vma, unsigned long addr, void *old)
+{
+}
+
 #endif /* CONFIG_MIGRATION */
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 4afd6fe..053fd39 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -75,7 +75,7 @@ void putback_lru_pages(struct list_head *l)
 /*
  * Restore a potential migration pte to a ...

On Tue, 27 Apr 2010 22:30:52 +0100

Mel, I don't like this fix. Consider following,

 1. try_to_unmap(oldpage)
 2. copy and replace
 3. remove_migration_ptes(oldpage, newpage)

What this patch handles is "3: remove_migration_ptes fails to remap it and
migration_pte will remain there case....The fact "new page is not mapped" means
"get_page() is not called against the new page".
So, the new page have been able to be freed until we restart move_ptes.

I bet calling __get_user_pages_fast() before vma_adjust() is the way to go. 
When page_count(page) != page_mapcount(page) +1, migration skip it.

Thanks,
-Kame

--


My proposed fix avoids to walk the pagetables once more time and to
mangle over the page counts. Can you check it? It works but it needs
more review.
--


On Wed, 28 Apr 2010 02:08:33 +0200

Sure...but can we avoid temporal objrmap breakage(inconsistency) by it ?

THanks,
-Kame

--


This is the real bug, the patch 1 should be rejected and the
expanation-trace has the ordering wrong. The ordering is subtle but
fundamental to prevent that race, split_huge_page also requires the
same anon-vma list_add_tail to avoid the same race between fork and
rmap_walk. It should work fine already with old and new anon-vma code
as they both add new vmas always to the tail of the list.

So the bug in very short, is that "move_page_tables runs out of sync

I'll now evaluate the fix and see if I can find any other
way to handle this.

Great, I'm quite sure with patch 3 we'll move the needle and fix the
bug, it perfectly explains why we only get the oops inside execve in
the stack page.

Patch 2 I didn't check it yet but it's only relevant for the new
anon-vma code, I suggest to handle it separately from the rest.
--


I think a better fix for bug mentioned in patch 3, is like below. This
seems to work fine on aa.git with the old (stable) 2.6.33 anon-vma
code. Not sure if this also works with the new anon-vma code in
mainline but at first glance I think it should. At that point we
should be single threaded so it shouldn't matter if anon_vma is
temporary null.

Then you've to re-evaluate the vma_adjust fixes for mainline-only in
patch 2 at the light of the below (I didn't check patch 2 in detail).

Please try to reproduce with the below applied.

----
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

So shift_arg_pages must run atomically with respect of rmap_walk, and
it's enough to run it under the anon_vma lock to make it atomic.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -503,6 +504,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
 	struct mmu_gather *tlb;
+	struct anon_vma *anon_vma;
 
 	BUG_ON(new_start > new_end);
 
@@ -513,6 +515,12 @@ static int shift_arg_pages(struct vm_are
 	if (vma != find_vma(mm, new_start))
 		return -EFAULT;
 
+	anon_vma = vma->anon_vma;
+	/* stop rmap_walk or it won't find the stack pages */
+	spin_lock(&anon_vma->lock);
+	/* avoid vma_adjust to take any further anon_vma lock */
+	vma->anon_vma = NULL;
+
 	/*
 	 * cover the whole range: ...

On Wed, 28 Apr 2010 00:58:52 +0200

Seems nice.

I'll test this but I think we need to take care of do_mremap(), too.
And it's more complicated....

Thanks,
-Kame

--


What did you mean with objrmap inconsistency? I think this is single
threaded there, userland didn't run yet and I don't think page faults
could run. Maybe it's safer to add a VM_BUG_ON(vma->anon_vma) just

do_mremap has to be safe by:

1) adjusting page->index atomically with the pte updates inside pt
   lock (while it moves from one pte to another)

2) having both vmas src and dst (not overlapping) indexed in the
   proper anon_vmas before move_page_table runs

As long as it's not overlapping it shouldn't be difficult to enforce
the above two invariants, exec.c is magic as it works on overlapping
areas and src and dst are the same vma and it's indexed into just one
anon-vma. So we've to stop the rmap_walks before we mangle over the
vma with vma_adjust and move_page_tables and truncate the end of the
vma with vma_adjust again, and finally we resume the rmap_walks.

I'm not entirely sure of the above so review greatly appreciated ;)
--


actually no need of this at all! of course the dst vma will have
vma->vm_pgoff adjusted instead... never mind. So I don't see a problem
there.

I think this is very special of how exec.c abuses move_page_tables by
passing vma as src and dst, when it obviously cannot be indexed in two
anon-vmas because there's a single vma and a single vma->anon_vma, and
src and dst obviously cannot have two different vm_pgoff again because
there's a single vma and there can't be two different vma->vm_pgoff.

So I'm very hopeful do_mremap is already fully safe...
--


On Wed, 28 Apr 2010 03:05:43 +0200
I mean following relationship.

	vma_address(vma, page1) <-> address <-> pte <-> page2
	
	If page1 == page2, objrmap is consistent.
I reviewed do_mremap again and am thinking it's safe.


	new_vma = copy_vma();
	....                   ----------(*)
	move_ptes().
	munmap unnecessary range.

At (*), if page1 != page2, rmap_walk will not run correctly.
But it seems copy_vma() keeps page1==page2 

As I reported, when there is a problem, vma_address() returns an address but
that address doesn't contain migration_pte.

BTW, page->index is not updated, we just keep [start_address, pgoff] to be
sane value.

Thanks,
-Kame


--


yes I corrected myself too, the end result is the same and adjusting
the new vma vm_pgoff available in the anon-vma list is obviously
faster and simpler. When the src and dst ranges don't overlap and
src_vma != dst_vma (vma, new_vma in code) things are a lot simpler
there...
--


On Wed, 28 Apr 2010 00:58:52 +0200

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Hmm..Mel's patch 2/3 takes vma->anon_vma->lock in vma_adjust(),
so this patch clears vma->anon_vma...


	/*
	 * We adjust vma and move page tables in sequence. While update, 
	 * (vma, page) <-> address <-> pte relationship is unstable.
	 * We lock anon_vma->lock for keeping rmap_walk() safe. (see mm/rmap.c)

I think we can unlock this just after move_page_tables().


Thanks,
-kame

--


yep, it should be safe with patch 2 applied too. And I'm unsure why Mel's
patch locks the anon_vma also when vm_start != start. See the other

Checking this, I can't see where exactly is vma->vm_pgoff adjusted
during the atomic section I protected with the anon_vma->lock?
For a moment it looks like these pages become unmovable.

I guess this is why I thought initially that it was move_page_tables
to adjust the page->index. If it doesn't then the vma->vm_pgoff has to
be moved down of shift >>PAGE_SHIFT and it doesn't seem to be
happening which is an unrelated bug.
--


On Wed, 28 Apr 2010 03:44:34 +0200
The page can be replaced with migration_pte before the 1st vma_adjust.

The key is 
	(vma, page) <-> address <-> pte <-> page
relationship.

	vma_adjust() 
	(*)
	move_pagetables();
	(**)
	vma_adjust();

At (*), vma_address(vma, page) retruns a _new_ address. But pte is not

Anyway, I have no strong opinion about the placement of unlock(anon_vma->lock).

I wonder why we don't see this at testing memory-hotplug is because memory-hotplug
disables a new page allocation in the migration range. So, this exec() is hard to get
a page which can be migration target.

Thanks,
-Kame

--


Yes I agree we can move the unlock at (**) because the last vma_adjust
is only there to truncate the vm_end. In fact it looks super
heavyweight to call vma_adjust for that instead of just using
vma->vm_end = new_end considering we're under mmap_sem, full anonymous
etc... In fact I think even the first vma_adjust looks too
heavyweight and it doesn't bring any simplicity or added safety
considering this works in place and there's nothing to wonder about
vm_next or vma_merge or vm_file or anything that vma_adjust is good at.

So the confusion I had about vm_pgoff is because all things that moves
vm_start down, also move vm_pgoff down like stack growsdown but of
course those don't move the pages down too, so we must not alter
vm_pgoff here just vm_start along with the pagetables inside the
anon_vma lock to be fully safe. Also I forgot to unlock in case of
-ENOMEM ;)

this is a new try, next is for a later time... hope this helps!

Thanks!

----
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

So shift_arg_pages must run atomically with respect of rmap_walk, and
it's enough to run it under the anon_vma lock to make it atomic.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -502,6 +503,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - ...

On Wed, 28 Apr 2010 04:42:27 +0200

Seems good.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I'll test this and report if I see trouble again.

Unfortunately, I'll have a week of holidays (in Japan) in 4/29-5/05,
my office is nearly closed. So, please consider no-mail-from-me is
good information.


Thanks,

--


On Wed, 28 Apr 2010 11:49:44 +0900
Here is bad news. When move_page_tables() fails, "some ptes" are moved
but others are not and....there is no rollback routine.

I bet the best way to fix this mess up is 
 - disable overlap moving of arg pages
 - use do_mremap().

But maybe you guys want to fix this directly.
Here is a temporal fix from me. But don't trust me..
==
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

So shift_arg_pages must run atomically with respect of rmap_walk, and
it's enough to run it under the anon_vma lock to make it atomic.

And split_huge_page() will have the same requirements as migrate.c
already has.

And, when moving overlapping ptes by move_page_tables(), it's cannot
be roll-backed as mremap does. This patch changes move_page_tables()'s
behavior and if it failes, no ptes are moved.

Changelog:
 - modified move_page_tables() to do atomic pte moving because
   "some ptes are moved but others are not" is critical for rmap_walk().
 - free pgtables at failure rather than give it all to do_exit().
   If not, objrmap will be inconsitent until exit() frees all.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

---
 fs/exec.c   |   67 +++++++++++++++++++++++++++++++++++++-----------------------
 mm/mremap.c |   28 ++++++++++++++++++++++---
 2 files changed, 67 insertions(+), 28 deletions(-)

Index: mel-test/fs/exec.c
===================================================================
--- mel-test.orig/fs/exec.c
+++ mel-test/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include ...

Thanks to you both for looking into this. I far prefer this general approach
than cleaning up the migration PTEs as the page tables get copied. While it
might "work", it's sloppy in the same way as having migration_entry_wait()
do the cleanup was sloppy. It's far preferable to make the VMA move and
page table copy atomic with anon_vma->lock.


The biggest problem is that the reverse mapping is temporarily out of
sync until do_exit gets rid of the mess, but how serious is that really?

If there is a migration entry in there, the mapcount should already be zero and
migration holds a reference to the page to prevent it going away. rmap_walk()
may then miss the migration_pte so it gets left behind. Ordinarily this
would be bad but in exec(), we cannot be faulting this page so we won't
trigger the bug in swapops. Instead, do_exit ultimately will skip over the
migration PTE doing nothing with the page but as the mapcount is still zero,

I see the point of your patch but I'm not yet seeing why it is
necessary to back out if move_page_tables fails.

That said, both patches have a greater problem. Both of them hold a spinlock
(anon_vma->lock) while calling into the page allocator with GFP_KERNEL (to
allocate the page tables). We don't want to change that to GFP_ATOMIC so
either we need to allocate the pages in advance or special case rmap_walk()
to not walk processes that are in exec.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--


On Tue, 27 Apr 2010 22:30:52 +0100

Here is my final proposal (before going vacation.)

I think this is very simple. The biggest problem is when move_page_range
fails, setup_arg_pages pass it all to exit() ;)

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This is an band-aid patch for avoiding unmap->remap of stack pages
while it's udner exec(). At exec, pages for stack is moved by
setup_arg_pages(). Under this, (vma,page)<->address relationship
can be in broken state.
Moreover, if moving ptes fails, pages with not-valid-rmap remains
in the page table and objrmap for the page is completely broken
until exit() frees all up.

This patch adds vma->broken_rmap. If broken_rmap != 0, vma_address()
returns -EFAULT always and try_to_unmap() fails.
(IOW, the pages for stack are pinned until setup_arg_pages() ends.)

And this prevents page migration because the page's mapcount never
goes to 0 until exec() fixes it up.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/exec.c                |    4 +++-
 include/linux/mm_types.h |    5 +++++
 mm/rmap.c                |    5 +++++
 3 files changed, 13 insertions(+), 1 deletion(-)

Index: mel-test/fs/exec.c
===================================================================
--- mel-test.orig/fs/exec.c
+++ mel-test/fs/exec.c
@@ -250,7 +250,8 @@ static int __bprm_mm_init(struct linux_b
 	err = insert_vm_struct(mm, vma);
 	if (err)
 		goto err;
-
+	/* prevent rmap_walk, try_to_unmap() etc..until we get fixed rmap */
+	vma->unstable_rmap = 1;
 	mm->stack_vm = mm->total_vm = 1;
 	up_write(&mm->mmap_sem);
 	bprm->p = vma->vm_end - sizeof(void *);
@@ -653,6 +654,7 @@ int setup_arg_pages(struct linux_binprm 
 		ret = -EFAULT;
 
 out_unlock:
+	vma->unstable_rmap = 0;
 	up_write(&mm->mmap_sem);
 	return ret;
 }
Index: mel-test/include/linux/mm_types.h
===================================================================
--- mel-test.orig/include/linux/mm_types.h
+++ ...

I don't get it, I don't see the pinning and returning -EFAULT is not
solution for things that cannot fail (i.e. remove_migration_ptes and
split_huge_page). Plus there's no point to return failure to rmap_walk
when we can just stop the rmap_walk with the proper lock.
--

From: Mel Gorman
Date: Tuesday, April 27, 2010 - 2:30 pm

vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a BUG_ON
to trigger when the page is faulted in.

With the recent anon_vma changes, there can be more than one anon_vma->lock
that can be taken in a anon_vma_chain but a second lock cannot be spinned
upon in case of deadlock. Instead, the rmap walker tries to take locks of
different anon_vma's. If the attempt fails, the operation is restarted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/ksm.c  |   22 ++++++++++++++++++++--
 mm/mmap.c |    6 ++++++
 mm/rmap.c |   28 +++++++++++++++++++++++-----
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..0c09927 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1668,15 +1668,28 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
 again:
 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
+		struct anon_vma *locked_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
 		spin_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
+
+			/* See comment in mm/rmap.c#rmap_walk_anon on locking */
+			locked_vma = NULL;
+			if (anon_vma != vma->anon_vma) {
+				locked_vma = vma->anon_vma;
+				if (!spin_trylock(&locked_vma->lock)) {
+					spin_unlock(&anon_vma->lock);
+					goto again;
+				}
+			}
+
 			if (rmap_item->address < vma->vm_start ||
 			    rmap_item->address >= vma->vm_end)
-				continue;
+				goto next_vma;
+
 			/*
 			 * Initially we examine only the vma which covers this
 			 * rmap_item; but later, if there is still work to do,
@@ -1684,9 +1697,14 @@ again:
 ...
From: Andrea Arcangeli
Date: Tuesday, April 27, 2010 - 4:10 pm

The old code did:

    /*
     * When changing only vma->vm_end, we don't really need
     * anon_vma lock.
     */
    if (vma->anon_vma && (insert || importer || start !=  vma->vm_start))
	anon_vma = vma->anon_vma;
    if (anon_vma) {
        spin_lock(&anon_vma->lock);

why did it become unconditional? (and no idea why it was removed)

But I'm not sure about this part.... this is really only a question, I
may well be wrong, I just don't get it.
--

From: Mel Gorman
Date: Wednesday, April 28, 2010 - 2:15 am

It became unconditional because I wasn't sure of the optimisation versus the
new anon_vma changes (doesn't matter, should have been safe). At the time
the patch was introduced, the bug looked like a race in VMA's in the list
having their details modified. I thought vma_address was returning -EFAULT
when it shouldn't and while this may still be possible, it wasn't the prime
cause of the bug.

The more important race was in execve between when a VMA got moved and the
page tables copied. The anon_vma locks are fine for the VMA move but the
page table copy happens later. What the patch did was alter the timing of
the race. rmap_walk() was finding the VMA of the new stack being set up by
exec, failing to lock it and backing off. By the time it would restart and
get back to that VMA, it was already moved making the bug simply harder to
reproduce because the race window was so small.

So, the VMA list does not appear to be messed up but there still needs
to be protection against modification of VMA details that are already on
the list. For that, the seq counter would have been enough and
lighter-weight than acquiring the anon_vma->lock every time in
vma_adjust().


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Wednesday, April 28, 2010 - 8:35 am

Changeset 287d97ac032136724143cde8d5964b414d562ee3 is meant to explain
the removal of the lock but I don't get it from the comments. Or at
least I don't get from that comment why we can't resurrect the plain
old deleted code that looked fine to me. Like there is no reason to

You mean you're dropping patch 2 too? I agree dropping patch 1 but
to me the having to take all the anon_vma locks for every
vma->anon_vma->lock that we walk seems a must, otherwise
expand_downwards and vma_adjust won't be ok, plus we need to re-add
the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
and vm_start outside of the anon_vma->lock. Or I am mistaken?

Patch 2 wouldn't help the swapops crash we reproduced because at that
point the anon_vma of the stack is the local one, it's just after
execve.

vma_adjust and expand_downards would alter vm_pgoff and vm_start while
taking only the vma->anon_vma->lock where the vma->anon_vma is the
_local_ one of the vma.  But a vma in mainline can be indexed in
infinite anon_vmas, so to prevent breaking migration
vma_adjust/expand_downards the rmap_walk would need to take _all_
anon_vma->locks for every anon_vma that the vma is indexed into. Or
alternatively like you implemented rmap_walk would need to check if
the vma we found in the rmap_walk is different from the original
anon_vma and to take the vma->anon_vma->lock (so taking the
anon_vma->lock of the local anon_vma of every vma) before it can
actually read the vma->vm_pgoff/vm_start inside vma_address.

If the above is right it also means the new anon-vma changes also break
the whole locking of transparent hugepage, see wait_split_huge_page,
it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the
"local" anon-vma is enough, when in fact the hugepage may be shared
and belonging to the parent parent_vma->anon_vma and not to the local
one of the last child that is waiting on the wrong lock. So I may have
to rewrite this part of the thp locking to solve this. And for me it's
not ...
From: Andrea Arcangeli
Date: Wednesday, April 28, 2010 - 8:39 am

Another way (not sure if it's good or bad, but it'd clearly avoid the
restarting locks in rmap_walk) would be to allocate a shared lock, and
still share the lock like the anon_vma was shared before. So we have
shorter chains to walk, but still a larger lock, so we've to take just
one and be safe.

spin_lock(&vma->anon_vma->shared_anon_vma_lock->lock)

something like that...
--

From: Mel Gorman
Date: Wednesday, April 28, 2010 - 8:55 am

Frankly, I don't understand why it was safe to drop the lock either.
Maybe it was a mistake but I still haven't convinced myself I fully

Temporarily at least until I figured out if execve was the only problem. The
locking in vma_adjust didn't look like the prime reason for the crash

No, you're not. If nothing else, vma_address can return the wrong value because
the VMAs vm_start and vm_pgoff were in the process of being updated but not
fully updated. It's hard to see how vma_address would return the wrong value
and miss a migration PTE as a result but it's a possibility.  It's probably

True, although in the case of expand_downwards, it's highly unlikely that
there is also a migration PTE to be cleaned up. It's hard to see how a
migration PTE would be left behind in that case but it still looks wrong to

I felt this would be too heavy in the common case which is why I made
rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically

To be absolutly sure, yes this is required. I don't think we've been hitting
this exact problem in these tests but it still is a better plan than adjusting

No fun. That potentially could be a lot of locks to take to split the
page.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Wednesday, April 28, 2010 - 9:23 am

I understand the design but I'm also unsure about the details. it's
just this lock that gets splitted and when you update the
vm_start/pgoff with only the vma->anon_vma->lock, the vma may be
queued in multiple other anon_vmas, and you're only serializing
and safe for the pages that have page_mapcount 1, and point to the
local anon_vma == vma->anon_vma, not any other shared page.

The removal of the vma->anon_vma->lock from vma_adjust just seems an
unrelated mistake to me too, but I don't know for sure why yet.

Basically vma_adjust needs the anon_vma lock like expand_downards has.

After you fix vma_adjust to be as safe as expand_downards you've also
to take care of the rmap_walk that may run on a page->mapping =
anon_vma that isn't the vma->anon_vma and you're not taking that
anon_vma->lock of the shared page, when you change the vma
vm_pgoff/vm_start. If rmap_walk finds to find a pte, becauase of that,

For aa.git it is sure enough. And as long as you only see the migrate


For the rmap_walk itself, migrate and split_huge_page are identical,
the main problem of transparent hugepage support is that I used the
anon_vma->lock in a wider way and taken well before the rmap_walk, so
I'm screwed in a worse way than migrate.

So I may have to change from anon_vma->lock to the compound_lock in
wait_split_huge_page(). But I'll still need the restarting loop of
anon_vma locks then inside the two rmap_walk run by split_huge_page.
Problem is, I would have preferred to do this locking change later as
a pure optimization than as a requirement for merging and running
stable, as it'll make things slightly more complex.

BTW, if we were to share the lock across all anon_vmas as I mentioned
in prev email, and just reduce the chain length, then it'd solve all

Every time we fail to find the PTE, it can also mean try_to_unmap just
failed to instantiate the migration pte leading to random memory
corruption in migrate. If a task fork and the some of the stack pages
at the bottom of the ...
From: Mel Gorman
Date: Wednesday, April 28, 2010 - 10:34 am

Well, in the easiest case, the details of the VMA (particularly vm_start
and vm_pgoff) can confuse callers of vma_address during rmap_walk. In the
case of migration, it will return other false positives or negatives.


Is this not what the try-lock-different-vmas-or-backoff-and-retry logic

I can only remember seeing the execve-related crash but I'd rather the
locking was correct too.

Problem is, I've seen at least one crash due to execve() even with the
check made in try_to_unmap_anon to not migrate within the temporary


It might be where we end up eventually. I'm simply loathe to introduce

How so? The old PTE should have been left in place, the page count of

Because the list could be very large, it would make more sense to


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Wednesday, April 28, 2010 - 10:58 am

yes exactly. This is why patch 2 can't be dropped, both for the

Right only problem is for remove_migration_ptes (and for both
split_huge_page rmap_walks). For migrate the only issue is the second

Kind of agree, we'll see...
--

From: Rik van Riel
Date: Wednesday, April 28, 2010 - 10:47 am

Take all the locks for all the anon_vmas in anon_vma_lock, this properly
excludes migration and the transparent hugepage code from VMA changes done
by mmap/munmap/mprotect/expand_stack/etc...

Also document the locking rules for the same_vma list in the anon_vma_chain
and remove the anon_vma_lock call from expand_upwards, which does not need it.

Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
Posted quickly as an RFC patch, only compile tested so far.
Andrea, Mel, does this look like a reasonable approach?

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..1eef42c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
  * all the anon_vmas associated with this VMA.
  * The "same_anon_vma" list contains the anon_vma_chains
  * which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->page_table_lock.
  */
 struct anon_vma_chain {
 	struct vm_area_struct *vma;
 	struct anon_vma *anon_vma;
-	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
+	struct list_head same_vma;	/* see above */
 	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
 };
 
@@ -90,11 +94,14 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
 	return page_rmapping(page);
 }
 
-static inline void anon_vma_lock(struct vm_area_struct *vma)
+static inline void anon_vma_lock(struct vm_area_struct *vma, void *nest_lock)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
-	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+	if (anon_vma) {
+		struct anon_vma_chain *avc;
+		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+			spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock);
+	}
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..2c13bbb 100644
--- ...
From: Andrea Arcangeli
Date: Wednesday, April 28, 2010 - 11:03 am

This will cause a lock inversion (page_table_lock can only be taken
after the anon_vma lock). I don't immediately see why the
page_table_lock here though?
--

From: Rik van Riel
Date: Wednesday, April 28, 2010 - 11:09 am

We need to safely walk the vma->anon_vma_chain /
anon_vma_chain->same_vma list.

So much for using the mmap_sem for read + the
page_table_lock to lock the anon_vma_chain list.

We'll need a new lock somewhere, probably in the
mm_struct since one per process seems plenty.

I'll add that in the next version of the patch.

-- 
All rights reversed
--

From: Rik van Riel
Date: Wednesday, April 28, 2010 - 11:25 am

Take all the locks for all the anon_vmas in anon_vma_lock, this properly
excludes migration and the transparent hugepage code from VMA changes done
by mmap/munmap/mprotect/expand_stack/etc...

Also document the locking rules for the same_vma list in the anon_vma_chain
and remove the anon_vma_lock call from expand_upwards, which does not need it.

Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
Posted quickly as an RFC patch, only compile tested so far.
Andrea, Mel, does this look like a reasonable approach?

v2:
 - also change anon_vma_unlock to walk the loop
 - add calls to anon_vma_lock & anon_vma_unlock to vma_adjust
 - introduce a new lock for the vma->anon_vma_chain list, to prevent
   the lock inversion that Andrea pointed out

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
+	spinlock_t anon_vma_chain_lock;		/* Protects vma->anon_vma_chain, with mmap_sem */
 
 	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
 						 * together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..492e7ca 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
  * all the anon_vmas associated with this VMA.
  * The "same_anon_vma" list contains the anon_vma_chains
  * which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->anon_vma_chain_lock.
  */
 struct anon_vma_chain {
 	struct vm_area_struct *vma;
 	struct anon_vma *anon_vma;
-	struct list_head same_vma;   /* locked by mmap_sem & ...
From: Mel Gorman
Date: Wednesday, April 28, 2010 - 12:07 pm

Well, it looks nice but I am wary of making much in the way of comment
on its correctness. Every time I think I understand the new anon_vma

This doesn't build with LOCKDEP enabled. As we discussed on IRC, I
changed nest_lock to spinlock_t * and always made it the
anon_vma_chain_lock. That vaguely makes more sense than sometimes
depending on the mmap_sem semaphore but I'm not 100% solid on it.

Walking along like this and taking the necessary locks seems reasonable
though and should be usable by transparent hugepage support where as my

so I changed this and others like it to the anon_vma_chain_lock. I

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Rik van Riel
Date: Wednesday, April 28, 2010 - 1:17 pm

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
+	spinlock_t anon_vma_chain_lock;		/* Protects vma->anon_vma_chain, with mmap_sem */
 
 	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
 						 * together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..703c472 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
  * all the anon_vmas associated with this VMA.
  * The "same_anon_vma" list contains the anon_vma_chains
  * which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->anon_vma_chain_lock.
  */
 struct anon_vma_chain {
 	struct vm_area_struct *vma;
 	struct anon_vma *anon_vma;
-	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
+	struct list_head same_vma;	/* see above */
 	struct list_head same_anon_vma;	/* locked by anon_vma->lock */
 };
 
@@ -90,18 +94,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
 	return page_rmapping(page);
 }
 
-static inline void anon_vma_lock(struct vm_area_struct *vma)
-{
-	struct anon_vma *anon_vma = vma->anon_vma;
-	if (anon_vma)
-		spin_lock(&anon_vma->lock);
-}
+#define anon_vma_lock(vma, nest_lock)					\
+({									\
+	struct anon_vma *anon_vma = vma->anon_vma;			\
+	if (anon_vma) {							\
+		struct anon_vma_chain *avc;				\
+		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) \
+			spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock); \
+	}								\
+})
 
 static inline void ...
From: Rik van Riel
Date: Wednesday, April 28, 2010 - 1:57 pm

Take all the locks for all the anon_vmas in anon_vma_lock, this properly
excludes migration and the transparent hugepage code from VMA changes done
by mmap/munmap/mprotect/expand_stack/etc...

Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
otherwise we have an unavoidable lock ordering conflict.  This changes the
locking rules for the "same_vma" list to be either mm->mmap_sem for write, 
or mm->mmap_sem for read plus the new mm->anon_vma_chain lock.  This limits
the place where the new lock is taken to 2 locations - anon_vma_prepare and
expand_downwards.

Document the locking rules for the same_vma list in the anon_vma_chain and
remove the anon_vma_lock call from expand_upwards, which does not need it.

Signed-off-by: Rik van Riel <riel@redhat.com>

--- 
Posted quickly as an RFC patch, only compile tested so far.
Andrea, Mel, does this look like a reasonable approach?

v3:
 - change anon_vma_unlock into a macro so lockdep works right
 - fix lock ordering in anon_vma_prepare
v2:
 - also change anon_vma_unlock to walk the loop
 - add calls to anon_vma_lock & anon_vma_unlock to vma_adjust
 - introduce a new lock for the vma->anon_vma_chain list, to prevent
   the lock inversion that Andrea pointed out

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
 	int map_count;				/* number of VMAs */
 	struct rw_semaphore mmap_sem;
 	spinlock_t page_table_lock;		/* Protects page tables and some counters */
+	spinlock_t anon_vma_chain_lock;		/* Protects vma->anon_vma_chain, with mmap_sem */
 
 	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
 						 * together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..703c472 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma ...
From: Minchan Kim
Date: Wednesday, April 28, 2010 - 5:28 pm

This patch makes things simple. So I like this.
Actually, I wanted this all-at-once locks approach.
But I was worried about that how the patch affects AIM 7 workload
which is cause of anon_vma_chain about scalability by Rik.
But now Rik himself is sending the patch. So I assume the patch
couldn't decrease scalability of the workload heavily.

Let's wait result of test if Rik doesn't have a problem of AIM7.

-- 
Kind regards,
Minchan Kim
--

From: Rik van Riel
Date: Wednesday, April 28, 2010 - 7:10 pm

The thing is, the number of anon_vmas attached to a VMA is
small (depth of the tree, so for apache or aim the typical
depth is 2). This N is between 1 and 3.

The problem we had originally is the _width_ of the tree,
where every sibling process was attached to the same anon_vma
and the rmap code had to walk the page tables of all the
processes, for every privately owned page in each child process.
For large server workloads, this N is between a few hundred and
a few thousand.

What matters most at this point is correctness - we need to be
able to exclude rmap walks when messing with a VMA in any way
that breaks lookups, because rmap walks for page migration and
hugepage conversion have to be 100% reliable.

That is not a constraint I had in mind with the original
anon_vma changes, so the code needs to be fixed up now...

I suspect that taking one or two extra spinlocks in the code
paths changed by this patch (mmap/munmap/...) is going to make
a difference at all, since all of those paths are pretty
infrequently taken.

-- 
All rights reversed
--

From: Minchan Kim
Date: Wednesday, April 28, 2010 - 7:55 pm

Yes. I understand it.

When you tried anon_vma_chain patches as I pointed out, what I have a
concern is parent's vma not child's one.
The vma of parent still has N anon_vma.
AFAIR, you said it's trade-off and would be good than old at least.
I agreed.  But I just want to remind you because this makes worse.  :)
The corner case is that we have to hold locks of N.

Do I miss something?
Really, Can't we ignore that case latency although this happen infrequently?
I am not against this patch. I just want to listen your opinion.

-- 
Kind regards,
Minchan Kim
--

From: Rik van Riel
Date: Thursday, April 29, 2010 - 8:39 am

No, it is the other way around.

The anon_vma of the parent is also present in all of the
children, so the parent anon_vma is attached to N vmas.

However, the parent vma only has 1 anon_vma attached to
it, and each of the children will have 2 anon_vmas.

That is what should keep any locking overhead with this
patch minimal.

Yes, a deep fork bomb can slow itself down.  Too bad,
don't do that :)

-- 
All rights reversed
--

From: Minchan Kim
Date: Wednesday, April 28, 2010 - 11:42 pm

me/ slaps self.

It's about height of tree and I can't imagine high height of
scenario.(fork->fork->fork->...->fork)
So as Rik pointed out, It's a not big overhead about latency latency,
at least. I think.

I supports this approach.

-- 
Kind regards,
Minchan Kim
--

From: Mel Gorman
Date: Thursday, April 29, 2010 - 12:37 am

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Thursday, April 29, 2010 - 1:15 am

In vma_adjust(), what prevents something like rmap_map seeing partial
updates while the following lines execute?

        vma->vm_start = start;
        vma->vm_end = end;
        vma->vm_pgoff = pgoff;
        if (adjust_next) {
                next->vm_start += adjust_next << PAGE_SHIFT;
                next->vm_pgoff += adjust_next;
        }

They would appear to happen outside the lock, even with this patch. The


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Thursday, April 29, 2010 - 1:44 am

Assuming you are referring to migration, it's easiest to just not migrate
pages within the stack until after shift_arg_pages runs. The locks
cannot be held during move_page_tables() because the page allocator is
called. It could be done in two stages where pages are allocated outside
the lock and then passed to move_page_tables() but I don't think
increasing the cost of exec() is justified just so a page can be
migrated during exec.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Thursday, April 29, 2010 - 1:32 am

This part does it. :)

----
@@ -578,6 +578,7 @@ again:                      remove_next = 1 + (end >
next->vm_end);
                }   
        }   
 
+       anon_vma_lock(vma, &mm->mmap_sem);
        if (root) {
                flush_dcache_mmap_lock(mapping);
                vma_prio_tree_remove(vma, root);
@@ -599,6 +600,7 @@ again:                      remove_next = 1 + (end >
next->vm_end);
                vma_prio_tree_insert(vma, root);
                flush_dcache_mmap_unlock(mapping);
        }   
+       anon_vma_unlock(vma);
---


But we still need patch about shift_arg_pages.



-- 
Kind regards,
Minchan Kim


--

From: Mel Gorman
Date: Tuesday, April 27, 2010 - 2:30 pm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

At page migration, we replace pte with migration_entry, which has
similar format as swap_entry and replace it with real pfn at the
end of migration. But there is a race with fork()'s copy_page_range().

Assume page migraion on CPU A and fork in CPU B. On CPU A, a page of
a process is under migration. On CPU B, a page's pte is under copy.

	CPUA			CPU B
				do_fork()
				copy_mm() (from process 1 to process2)
				insert new vma to mmap_list (if inode/anon_vma)
	pte_lock(process1)
	unmap a page
	insert migration_entry
	pte_unlock(process1)

	migrate page copy
				copy_page_range
	remap new page by rmap_walk()
	pte_lock(process2)
	found no pte.
	pte_unlock(process2)
				pte lock(process2)
				pte lock(process1)
				copy migration entry to process2
				pte unlock(process1)
				pte unlokc(process2)
	pte_lock(process1)
	replace migration entry
	to new page's pte.
	pte_unlock(process1)

Then, some serialization is necessary. IIUC, this is very rare event but
it is reproducible if a lot of migration is happening a lot with the
following program running in parallel.

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SIZE (24*1048576UL)
    #define CHILDREN 100
    int main()
    {
	    int i = 0;
	    pid_t pids[CHILDREN];
	    char *buf = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
			    MAP_PRIVATE|MAP_ANONYMOUS,
			    0, 0);
	    if (buf == MAP_FAILED) {
		    perror("mmap");
		    exit(-1);
	    }

	    while (++i) {
		    int j = i % CHILDREN;

		    if (j == 0) {
			    printf("Waiting on children\n");
			    for (j = 0; j < CHILDREN; j++) {
				    memset(buf, i, SIZE);
				    if (pids[j] != -1)
					    waitpid(pids[j], NULL, 0);
			    }
			    j = 0;
		    }

		    if ((pids[j] = fork()) == 0) {
			    memset(buf, i, SIZE);
			    exit(EXIT_SUCCESS);
		    }
	    }

	    munmap(buf, SIZE);
    }

copy_page_range() ...
From: Andrea Arcangeli
Date: Tuesday, April 27, 2010 - 3:22 pm

Ok I had a first look:



rmap_walk will walk process1 first! It's at the head, the vmas with

rmap_walk has to lock down process1 before process2, this is the
ordering issue I already mentioned in earlier email. So it cannot
happen and this patch is unnecessary.

The ordering is fundamental and as said anon_vma_link already adds new
vmas to the _tail_ of the anon-vma. And this is why it has to add to
the tail. If anon_vma_link would add new vmas to the head of the list,
the above bug could materialize, but it doesn't so it cannot happen.

In mainline anon_vma_link is called anon_vma_chain_link, see the
list_add_tail there to provide this guarantee.

Because process1 is walked first by CPU A, the migration entry is
replaced by the final pte before copy-migration-entry
runs. Alternatively if copy-migration-entry runs before before
process1 is walked, the migration entry will be copied and found in
process 2.

Comments welcome.
Andrea
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 27, 2010 - 4:52 pm

On Wed, 28 Apr 2010 00:22:45 +0200

I already explained this doesn't happend and said "I'm sorry".

But considering maintainance, it's not necessary to copy migration ptes
and we don't have to keep a fundamental risks of migration circus.

So, I don't say "we don't need this patch."

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 27, 2010 - 5:28 pm

On Wed, 28 Apr 2010 02:19:11 +0200

why we have to remove migration_pte by rmap_walk() which doesnt' exist ?

Anyway, I agree there are no oops. But there are risks because migration is
a feature which people don't tend to take care of (as memcg ;)
I like conservative approach for this kind of features.

Thanks,
-Kame










 

--

From: Andrea Arcangeli
Date: Tuesday, April 27, 2010 - 5:59 pm

I already thought this for split_huge_page and because split_huge_page
that already waits inside copy_huge_pmd if there's any
pmd_trans_splitting bit set, isn't removing the requirement for
list_add_tail in anon_vma_[chain_]link this patch also isn't removing
it.

But it's more complex than my prev explanation now that I think about
it more so no wonder it's not clear why yet (my fault).

So the thing is, if you patch it this way, the second rmap_walk run by
remove_migration_ptes will be safe even if the ordering is inverted,
but then the first rmap_walk that establishes the migration entry, will
still break if the rmap walk scans the child before the parent.

The only objective of the patch is to remove the list_add_tail
requirement but I'll show that it's still required by the first
rmap_walk in try_to_unmap below:

CPU A	    		  	      	CPU B
				      	fork
try_to_unmap
try to set migration entry in child (null pte still, nothing done)
					copy pte and map page in child pte (no
				      	migration entry set on child)
set migration entry in parent

parent ends up with migration entry but child not, breaking migration
in another way. so even with the patch, the only way to be safe is
that rmap_walk always scans the parent _first_.

So this patch is a noop from every angle, and it just moves the
ordering requirement from the remove_migration_ptes to try_to_unmap,
but it doesn't remove it as far as I can tell. So it only slowdown
fork a bit for no good.

split_huge_page is about the same. If the ordering is inverted,
split_huge_page could set a splitting pmd in the parent only, like the

Don't worry, migrate will run 24/7 the moment THP is in ;).

What migration really changed (and it's beneficial so split_huge_page
isn't introducing new requirement) is that rmap has to be exact at all
times, if it was only swap using it, nothing would happen because of
the vma_adjust/move_page_tables/vma_adjust not being in an atomic
section against the rmap_walk (the anon_vma ...
From: Andrea Arcangeli
Date: Tuesday, April 27, 2010 - 5:18 pm

Oops I must have overlooked it sorry! I just seen the trace quoted in
the comment of the patch and that at least would need correction
before it can be pushed in mainline, or it creates huge confusion to

split_huge_page also has the same requirement and there is no bug to
fix, so I don't see why to make special changes for just migrate.c
when we still have to list_add_tail for split_huge_page.

Furthermore this patch isn't fixing anything in any case and it looks
a noop to me. If the order ever gets inverted, and process2 ptes are
scanned before process1 ptes in the rmap_walk, sure the
copy-page-tables will break and stop until the process1 rmap_walk will
complete, but that is not enough! You have to repeat the rmap_walk of
process1 if the order ever gets inverted and this isn't happening in
the patch so I don't see how it could make any difference even just
for migrate.c (obviously not for split_huge_page).
--

From: Mel Gorman
Date: Wednesday, April 28, 2010 - 1:24 am

And after going through it again, I'm happy that this was a red herring.

Even if it's not strictly necessary, migration should (and does in this case)
cope with being able to find all its migration ptes. An extra one being copied
doesn't matter as long as it can be found on the chain. It's not like the

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Tuesday, April 27, 2010 - 3:27 pm

In general migration ptes were designed to cause the code encountering
them to go to sleep until migration is finished and a regular pte is
available. Looks like we are tolerating the handling of migration entries.

Never imagined copying page table with this. There could be recursion
issues of various kinds because page migration requires operations on the
page tables while performing migration. A simple fix would be to *not*


This means they are copied / moved etc and "cleaned" up in a state when
the page was unlocked. Migration entries are not supposed to exist when
a page is not locked.
--

From: Andrea Arcangeli
Date: Tuesday, April 27, 2010 - 3:32 pm

There is no bug there, no need to wait either. I already audited it
before, and I didn't see any bug. Unless you can show a bug with CPU A
running the rmap_walk on process1 before process2, there is no bug to

patch 3 is real, and the first thought I had was to lock down the page
before running vma_adjust and unlock after move_page_tables. But these
are virtual addresses. Maybe there's a simpler way to keep migration
away while we run those two operations.
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 27, 2010 - 5:13 pm

On Wed, 28 Apr 2010 00:32:42 +0200

Doing some check in move_ptes() after vma_adjust() is not safe.
IOW, when vma's information and information in page-table is incosistent...objrmap
is broken and migartion will cause panic.

Then...I think there are 2 ways.
  1. use seqcounter in "mm_struct" as previous patch and lock it at mremap.
or
  2. get_user_pages_fast() when do remap.

Thanks,
-Kame







--

From: Andrea Arcangeli
Date: Tuesday, April 27, 2010 - 5:20 pm

3 take the anon_vma->lock
--

From: Mel Gorman
Date: Wednesday, April 28, 2010 - 7:23 am

I've been looking at ways during the day that the anon_vma lock can be held
while the page tables are being allocated. The schemes were way too hairy
just to cover a migration corner case.

As this is particular to exec, I'm wondering if Kamezawa's additional proposal
of just skipping migration of pages within the temporary stack might be the
best solution overall in terms of effectiveness and simplicity. His patch
introduced a new variable to the VMA but it shouldn't be necessary and it
altered vma_address which is unnecessary.

Here is a different version of the same basic idea to skip temporary VMAs
during migration. Maybe go with this?

(As a heads-up, I'll also be going offline in about 24 hours until Tuesday
morning. The area I'm in has zero internet access)

==== CUT HERE ====
mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks

Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page.  If this page is later faulted, migration_entry_to_page() will
call BUG because the page is locked indicating the page was migrated by
the migration PTE not cleaned up.

There is a race between shift_arg_pages and migration that allows this
bug to trigger. A temporary stack is setup during exec and later moved. If
migration moves a page in the temporary stack and the VMA is then removed,
the migration PTE may not be found leading to a BUG when the stack is faulted.

Ideally, shift_arg_pages must run atomically with respect of rmap_walk by
holding the anon_vma lock but this is problematic as pages must be allocated
for page tables. Instead, this patch identifies when it is about to migrate
pages from a temporary stack and leaves them alone.  Memory hot-remove will
try again, sys_move_pages() wouldn't be operating during exec() time and
memory compaction ...
From: Mel Gorman
Date: Wednesday, April 28, 2010 - 7:57 am

The assumptions on the vm flags is of course totally wrong. VM_EXEC might
be applied as well as default flags from the mm.  The following is the same
basic idea, skip VMAs belonging to processes in exec rather than trying
to hold anon_vma->lock across move_page_tables(). Not tested yet.

==== CUT HERE ====
mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks

Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page.  If this page is later faulted, migration_entry_to_page() will
call BUG because the page is locked indicating the page was migrated by
the migration PTE not cleaned up.

There is a race between shift_arg_pages and migration that allows this
bug to trigger. A temporary stack is setup during exec and later moved. If
migration moves a page in the temporary stack and the VMA is then removed,
the migration PTE may not be found leading to a BUG when the stack is faulted.

Ideally, shift_arg_pages must run atomically with respect of rmap_walk by
holding the anon_vma lock but this is problematic as pages must be allocated
for page tables. Instead, this patch skips processes in exec by making an
assumption that a mm with stack_vm == 1 and total_vm == 1 is a process in
exec() that hasn't finalised the temporary stack yet.  Memory hot-remove
will try again, sys_move_pages() wouldn't be operating during exec() time
and memory compaction will just continue to another page without concern.

[kamezawa.hiroyu@jp.fujitsu.com: Idea for having migration skip temporary stacks]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/rmap.c |   28 +++++++++++++++++++++++++++-
 1 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..9e20188 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1141,6 +1141,18 @@ static int ...
From: Andrea Arcangeli
Date: Wednesday, April 28, 2010 - 8:16 am

This is better than the other, that made it look like people could set
broken rmap at arbitrary times, at least this shows it's ok only
during execve before anything run, but if we can't take the anon-vma
lock really better than having to make these special checks inside
every rmap_walk that has to be accurate, we should just delay the
linkage of the stack vma into its anon-vma, until after the move_pages.
--

From: Mel Gorman
Date: Wednesday, April 28, 2010 - 8:23 am

Is it possible to delay the linkage like that? As arguments get copied into
the temporary stack before it gets moved, I'd have expected the normal fault
path to prepare and attach the anon_vma. We could special case it but
that isn't very palatable either.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Wednesday, April 28, 2010 - 8:45 am

I'm not sure what is more palatable, but I feel it should be fixed in
execve.
--

From: Andrea Arcangeli
Date: Wednesday, April 28, 2010 - 1:40 pm

Ok the best idea so far I had is to add a fake temporary fake vma to
the anon_vma list with the old vm_start and same vm_pgoff before
shifting down vma->vm_start and calling move_page_tables. Then after
the move is complete we remove the fake vma. So all the fast paths
will remain unmodified and no magic is required. I'll try to fix this
for the old stable anon-vma code and test in aa.git first as the code
will differ. If it works ok anybody can port it to new anon-vma code.
--

From: Andrea Arcangeli
Date: Wednesday, April 28, 2010 - 2:05 pm

here the new try for aa.git or 2.6.33 anon-vma code.

----
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -502,7 +503,9 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	unsigned long moved_length;
 	struct mmu_gather *tlb;
+	struct vm_area_struct *tmp_vma;
 
 	BUG_ON(new_start > new_end);
 
@@ -514,16 +517,36 @@ static int shift_arg_pages(struct vm_are
 		return -EFAULT;
 
 	/*
+	 * We need to create a fake temporary vma and index it in the
+	 * anon_vma list in order to allow the pages to be reachable
+	 * at all times by the rmap walk for migrate, while
+	 * move_page_tables() is running.
+	 */
+	tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!tmp_vma)
+		return -ENOMEM;
+	*tmp_vma = *vma;
+
+	spin_lock(&vma->anon_vma->lock);
+	/*
 	 * cover the whole range: [new_start, old_end)
 	 */
-	vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
+	vma->vm_start = new_start;
+	__anon_vma_link(tmp_vma);
+	spin_unlock(&vma->anon_vma->lock);
 
 	/*
 	 * move the page tables downwards, on failure we rely on
 	 * process cleanup to remove whatever mess we made.
 	 */
-	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, ...
From: Mel Gorman
Date: Wednesday, April 28, 2010 - 2:17 am

I see there is a large discussion on that patch so I'll read that rather
than commenting here.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

Previous thread: [PATCH] Staging: tm6000: fix coding style issues of tm6000-cards.c by Roel Van Nyen on Tuesday, April 27, 2010 - 2:25 pm. (1 message)

Next thread: Re: [PATCH] STAGING: hv: Update blkvsc_drive.c to remove spaces before newlines by Greg KH on Tuesday, April 27, 2010 - 2:39 pm. (1 message)