Re: [PATCH 2/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information

Previous thread: [PATCH 1/2] mm,migration: During fork(), wait for migration to end if migration PTE is encountered by Mel Gorman on Monday, April 26, 2010 - 3:37 pm. (3 messages)

Next thread: Re: [PATCH] x86/PCI: never allocate PCI MMIO resources below BIOS_END by H. Peter Anvin on Monday, April 26, 2010 - 4:02 pm. (14 messages)
From: Mel Gorman
Date: Monday, April 26, 2010 - 3:37 pm

After digging around a lot, I believe the following two patches are the
best way to close the race that allows a migration PTE to be left behind
triggering a BUG check in migration_entry_to_page().

Patch one alters has fork() wait for migration to complete. Patch two has
vma_adjust() acquire the anon_vma lock it is aware of and makes rmap_walk()
aware that different VMAs can be encountered during the walk.

I dropped the use of the seq counter because there were still races in
place. For example, while the seq counter would catch when vma_adjust()
and rmap_walk() were looking at the same VMA, there was still insufficient
protection on the VMA list being modified.

The reproduction case was as follows;

1. Run kernel compilation in a loop
2. Start two processes that repeatedly fork()ed and manipulated mappings
3. Constantly compact memory using /proc/sys/vm/compact_memory
4. Optionally add/remove swap

With these two patches applied, I was unable to trigger the bug check
in migration_entry_to_page() but it would be really helpful if Rik could
comment on the anon_vma locking requirements and whether patch 2 is 100%
safe or not.  The tests have only been running 8 hours but I'm posting now
anyway and will see how it survives running for a few days.

The other issues raised about expand_downwards will need to be re-examined to
see if they still exist and transparent hugepage support will need further
thinking to see if split_huge_page() can deal with these situations.

 mm/ksm.c    |   13 +++++++++++++
 mm/memory.c |   25 ++++++++++++++++---------
 mm/mmap.c   |    6 ++++++
 mm/rmap.c   |   23 ++++++++++++++++++++---
 4 files changed, 55 insertions(+), 12 deletions(-)

--

From: Mel Gorman
Date: Monday, April 26, 2010 - 3:37 pm

vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a BUG_ON
to trigger when the page is faulted in.

With the recent anon_vma changes, there can be more than one anon_vma->lock
that can be taken in a anon_vma_chain but a second lock cannot be spinned
upon in case of deadlock. Instead, the rmap walker tries to take locks of
different anon_vma's. If the attempt fails, the operation is restarted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/ksm.c  |   13 +++++++++++++
 mm/mmap.c |    6 ++++++
 mm/rmap.c |   22 +++++++++++++++++++---
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..baa5b4d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1674,9 +1674,22 @@ again:
 		spin_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
+
+			/* See comment in mm/rmap.c#rmap_walk_anon on locking */
+			if (anon_vma != vma->anon_vma) {
+				if (!spin_trylock(&vma->anon_vma->lock)) {
+					spin_unlock(&anon_vma->lock);
+					goto again;
+				}
+			}
+
 			if (rmap_item->address < vma->vm_start ||
 			    rmap_item->address >= vma->vm_end)
 				continue;
+
+			if (anon_vma != vma->anon_vma)
+				spin_unlock(&vma->anon_vma->lock);
+
 			/*
 			 * Initially we examine only the vma which covers this
 			 * rmap_item; but later, if there is still work to do,
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	if (vma->anon_vma)
+		spin_lock(&vma->anon_vma->lock);
+
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
@@ ...
From: Minchan Kim
Date: Monday, April 26, 2010 - 4:05 pm

if (address == -EFAULT)



-- 
Kind regards,
Minchan Kim
From: KAMEZAWA Hiroyuki
Date: Monday, April 26, 2010 - 5:30 pm

On Tue, 27 Apr 2010 08:05:26 +0900
yes. thank you for pointing out.

-Kame

--

From: Mel Gorman
Date: Tuesday, April 27, 2010 - 2:17 am

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Monday, April 26, 2010 - 4:15 pm

Actually, I am worry about rollback approach like this.
If we don't often need anon_vmas serializing, that's enough.
But I am not sure how often we need locking of anon_vmas like this.
Whenever we need it, we have to use rollback approach like this in future.
In my opinion, it's not good.

Rik, can't we make anon_vma locks more simple?
Anyway, Mel's patch is best now.

I hope improving locks of anon_vmas without rollback approach in near future.

Thanks, Mel.


-- 
Kind regards,
Minchan Kim
--

From: KAMEZAWA Hiroyuki
Date: Monday, April 26, 2010 - 5:07 pm

On Mon, 26 Apr 2010 23:37:58 +0100

Ok, acquiring vma->anon_vma->spin_lock always sounds very safe.
(but slow.)

I'll test this, too.


--

From: KAMEZAWA Hiroyuki
Date: Monday, April 26, 2010 - 8:50 pm

On Tue, 27 Apr 2010 09:07:06 +0900

Sorry. reproduced. It seems the same bug before patch. 
mapcount 1 -> unmap -> remap -> mapcount 0. And it was SwapCache.


Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Monday, April 26, 2010 - 9:03 pm

On Tue, 27 Apr 2010 12:50:40 +0900
Sorry again. It was _not_ SwapCache. Just SwapBacked.

Thanks,
-Kame

--

From: Mel Gorman
Date: Tuesday, April 27, 2010 - 1:59 am

Same here, reproduced after 18 hours.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 27, 2010 - 2:09 am

On Tue, 27 Apr 2010 09:59:51 +0100
Hmm. It seems rmap_one() is called and the race is not in vma_address()
but in remap_migration_pte().
So, I added more hooks for debug..but not reproduced yet.
(But I doubt my debug code, too ;)

But it seems strange to have a race in remap_migration_pte(), so, I doubt
my debug code, too.

Thanks,
-Kame

--

From: Mel Gorman
Date: Tuesday, April 27, 2010 - 3:29 am

It could have been in both but the vma lock should have been held across
the rmap_one. It still reproduces but it's still the right thing to do.
This is the current version of patch 2/2.

==== CUT HERE ====

[PATCH] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information

vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a BUG_ON
to trigger when the page is faulted in.

With the recent anon_vma changes, there can be more than one anon_vma->lock
that can be taken in a anon_vma_chain but a second lock cannot be spinned
upon in case of deadlock. Instead, the rmap walker tries to take locks of
different anon_vma's. If the attempt fails, the operation is restarted.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/ksm.c  |   19 +++++++++++++++++--
 mm/mmap.c |    6 ++++++
 mm/rmap.c |   27 +++++++++++++++++++++++----
 3 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..87c7531 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1674,9 +1674,19 @@ again:
 		spin_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = vmac->vma;
+
+			/* See comment in mm/rmap.c#rmap_walk_anon on locking */
+			if (anon_vma != vma->anon_vma) {
+				if (!spin_trylock(&vma->anon_vma->lock)) {
+					spin_unlock(&anon_vma->lock);
+					goto again;
+				}
+			}
+
 			if (rmap_item->address < vma->vm_start ||
 			    rmap_item->address >= vma->vm_end)
-				continue;
+				goto next_vma;
+
 			/*
 			 * Initially we examine only the vma which covers this
 			 * rmap_item; but later, if there is still work to do,
@@ -1684,9 +1694,14 @@ again:
 			 * were forked from the original since ksmd ...
From: Andrea Arcangeli
Date: Tuesday, April 27, 2010 - 8:37 am

Well, keep in mind I reproduced the swapops bug with 2.6.33 anon-vma
code, it's unlikely that focusing on patch 2 you'll fix bug in
swapops.h. If this is a bug in the new anon-vma code, it needs fixing
of course! But I doubt this bug is related to swapops in execve on the
bprm->p args.

I've yet to check in detail patch 1 sorry, I'll let you know my
opinion about it as soon as I checked it in detail.
--

From: Mel Gorman
Date: Tuesday, April 27, 2010 - 9:35 am

Why do you doubt it's unrelated to execve? From what I've seen during the day,
there is a race in execve where the VMA gets moved (under the anon_vma lock)
before the page-tables are copied with move_ptes (without a lock). In that
case, execve can encounter and copy migration ptes before migration removes
them. This also applies to mainline because it is only taking the RCU lock
and not the anon_vma->lock.

I have a prototype that "handles" the situation with the new anon_vma
code by removing the migration ptes it finds while moving page tables
but it needs more work before releasing.

An alternative would be to split vma_adjust() into locked and unlocked
versions. shift_arg_pages() could then take the anon_vma lock to lock
both the VMA move and the pagetable copy here.

        /*
         * cover the whole range: [new_start, old_end)
         */
        if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
                return -ENOMEM;

        /*
         * move the page tables downwards, on failure we rely on
         * process cleanup to remove whatever mess we made.
         */
        if (length != move_page_tables(vma, old_start,
                                       vma, new_start, length))
                return -ENOMEM;

It'd be messy to split up the locking of vma_adjust like this though and
exec() will hold the anon_vma locks for longer just to guard against
migration. It's not clear this is better than having move_ptes handle

No problem. I still need to revisit all of these patches once I am
confident the swapops bug cannot be triggered any more.

--

From: Rik van Riel
Date: Monday, April 26, 2010 - 5:30 pm

If you're part way down the list, surely you'll need to
unlock multiple anon_vmas here before going to retry?

Otherwise you could forget to unlock one that you already
locked, and live-lock here.

-- 
All rights reversed
--

From: KAMEZAWA Hiroyuki
Date: Monday, April 26, 2010 - 5:31 pm

On Mon, 26 Apr 2010 20:30:41 -0400
vma->anon_vma->lock is released after vma_address().

-Kame

--

From: Rik van Riel
Date: Monday, April 26, 2010 - 7:13 pm

Doh, never mind.  Too much code in my mind at once...

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Andrea Arcangeli
Date: Monday, April 26, 2010 - 4:04 pm

So patch 1 is for aa.git too, and patch 2 is only for mainline with
the new anon-vma changes (patch 2 not needed in current aa.git, and if
I apply it, it'll deadlock so...) right?

split_huge_page is somewhat simpler and more strict in its checking
than migrate.c in this respect, and yes patch 2 will also need to be
extended to cover split_huge_page the moment I stop backing out the
new anon-vma code (but it won't be any different, whatever works for
migrate will also work for split_huge_page later).

For now I'm much more interested in patch 1 and I'll leave patch 2 to
mainline digestion and check it later hope to find all issues fixed by
the time transparent hugepage gets merged.

About patch 1 it's very interesting because I looked at the race
against fork and migrate yesterday and I didn't see issues but I'm
going to read your patch 1 in detail now to understand what is the
problem you're fixing.

Good you posted this fast, so I can try to help ;)
Andrea
--

Previous thread: [PATCH 1/2] mm,migration: During fork(), wait for migration to end if migration PTE is encountered by Mel Gorman on Monday, April 26, 2010 - 3:37 pm. (3 messages)

Next thread: Re: [PATCH] x86/PCI: never allocate PCI MMIO resources below BIOS_END by H. Peter Anvin on Monday, April 26, 2010 - 4:02 pm. (14 messages)