After digging around a lot, I believe the following two patches are the best way to close the race that allows a migration PTE to be left behind triggering a BUG check in migration_entry_to_page(). Patch one alters has fork() wait for migration to complete. Patch two has vma_adjust() acquire the anon_vma lock it is aware of and makes rmap_walk() aware that different VMAs can be encountered during the walk. I dropped the use of the seq counter because there were still races in place. For example, while the seq counter would catch when vma_adjust() and rmap_walk() were looking at the same VMA, there was still insufficient protection on the VMA list being modified. The reproduction case was as follows; 1. Run kernel compilation in a loop 2. Start two processes that repeatedly fork()ed and manipulated mappings 3. Constantly compact memory using /proc/sys/vm/compact_memory 4. Optionally add/remove swap With these two patches applied, I was unable to trigger the bug check in migration_entry_to_page() but it would be really helpful if Rik could comment on the anon_vma locking requirements and whether patch 2 is 100% safe or not. The tests have only been running 8 hours but I'm posting now anyway and will see how it survives running for a few days. The other issues raised about expand_downwards will need to be re-examined to see if they still exist and transparent hugepage support will need further thinking to see if split_huge_page() can deal with these situations. mm/ksm.c | 13 +++++++++++++ mm/memory.c | 25 ++++++++++++++++--------- mm/mmap.c | 6 ++++++ mm/rmap.c | 23 ++++++++++++++++++++--- 4 files changed, 55 insertions(+), 12 deletions(-) --
vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a BUG_ON
to trigger when the page is faulted in.
With the recent anon_vma changes, there can be more than one anon_vma->lock
that can be taken in a anon_vma_chain but a second lock cannot be spinned
upon in case of deadlock. Instead, the rmap walker tries to take locks of
different anon_vma's. If the attempt fails, the operation is restarted.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/ksm.c | 13 +++++++++++++
mm/mmap.c | 6 ++++++
mm/rmap.c | 22 +++++++++++++++++++---
3 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..baa5b4d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1674,9 +1674,22 @@ again:
spin_lock(&anon_vma->lock);
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
+
+ /* See comment in mm/rmap.c#rmap_walk_anon on locking */
+ if (anon_vma != vma->anon_vma) {
+ if (!spin_trylock(&vma->anon_vma->lock)) {
+ spin_unlock(&anon_vma->lock);
+ goto again;
+ }
+ }
+
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
continue;
+
+ if (anon_vma != vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
/*
* Initially we examine only the vma which covers this
* rmap_item; but later, if there is still work to do,
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ ...if (address == -EFAULT) -- Kind regards, Minchan Kim
On Tue, 27 Apr 2010 08:05:26 +0900 yes. thank you for pointing out. -Kame --
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Actually, I am worry about rollback approach like this. If we don't often need anon_vmas serializing, that's enough. But I am not sure how often we need locking of anon_vmas like this. Whenever we need it, we have to use rollback approach like this in future. In my opinion, it's not good. Rik, can't we make anon_vma locks more simple? Anyway, Mel's patch is best now. I hope improving locks of anon_vmas without rollback approach in near future. Thanks, Mel. -- Kind regards, Minchan Kim --
On Mon, 26 Apr 2010 23:37:58 +0100 Ok, acquiring vma->anon_vma->spin_lock always sounds very safe. (but slow.) I'll test this, too. --
On Tue, 27 Apr 2010 09:07:06 +0900 Sorry. reproduced. It seems the same bug before patch. mapcount 1 -> unmap -> remap -> mapcount 0. And it was SwapCache. Thanks, -Kame --
On Tue, 27 Apr 2010 12:50:40 +0900 Sorry again. It was _not_ SwapCache. Just SwapBacked. Thanks, -Kame --
Same here, reproduced after 18 hours. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
On Tue, 27 Apr 2010 09:59:51 +0100 Hmm. It seems rmap_one() is called and the race is not in vma_address() but in remap_migration_pte(). So, I added more hooks for debug..but not reproduced yet. (But I doubt my debug code, too ;) But it seems strange to have a race in remap_migration_pte(), so, I doubt my debug code, too. Thanks, -Kame --
It could have been in both but the vma lock should have been held across
the rmap_one. It still reproduces but it's still the right thing to do.
This is the current version of patch 2/2.
==== CUT HERE ====
[PATCH] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information
vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a BUG_ON
to trigger when the page is faulted in.
With the recent anon_vma changes, there can be more than one anon_vma->lock
that can be taken in a anon_vma_chain but a second lock cannot be spinned
upon in case of deadlock. Instead, the rmap walker tries to take locks of
different anon_vma's. If the attempt fails, the operation is restarted.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/ksm.c | 19 +++++++++++++++++--
mm/mmap.c | 6 ++++++
mm/rmap.c | 27 +++++++++++++++++++++++----
3 files changed, 46 insertions(+), 6 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..87c7531 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1674,9 +1674,19 @@ again:
spin_lock(&anon_vma->lock);
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
+
+ /* See comment in mm/rmap.c#rmap_walk_anon on locking */
+ if (anon_vma != vma->anon_vma) {
+ if (!spin_trylock(&vma->anon_vma->lock)) {
+ spin_unlock(&anon_vma->lock);
+ goto again;
+ }
+ }
+
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
- continue;
+ goto next_vma;
+
/*
* Initially we examine only the vma which covers this
* rmap_item; but later, if there is still work to do,
@@ -1684,9 +1694,14 @@ again:
* were forked from the original since ksmd ...Well, keep in mind I reproduced the swapops bug with 2.6.33 anon-vma code, it's unlikely that focusing on patch 2 you'll fix bug in swapops.h. If this is a bug in the new anon-vma code, it needs fixing of course! But I doubt this bug is related to swapops in execve on the bprm->p args. I've yet to check in detail patch 1 sorry, I'll let you know my opinion about it as soon as I checked it in detail. --
Why do you doubt it's unrelated to execve? From what I've seen during the day,
there is a race in execve where the VMA gets moved (under the anon_vma lock)
before the page-tables are copied with move_ptes (without a lock). In that
case, execve can encounter and copy migration ptes before migration removes
them. This also applies to mainline because it is only taking the RCU lock
and not the anon_vma->lock.
I have a prototype that "handles" the situation with the new anon_vma
code by removing the migration ptes it finds while moving page tables
but it needs more work before releasing.
An alternative would be to split vma_adjust() into locked and unlocked
versions. shift_arg_pages() could then take the anon_vma lock to lock
both the VMA move and the pagetable copy here.
/*
* cover the whole range: [new_start, old_end)
*/
if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
return -ENOMEM;
/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
if (length != move_page_tables(vma, old_start,
vma, new_start, length))
return -ENOMEM;
It'd be messy to split up the locking of vma_adjust like this though and
exec() will hold the anon_vma locks for longer just to guard against
migration. It's not clear this is better than having move_ptes handle
No problem. I still need to revisit all of these patches once I am
confident the swapops bug cannot be triggered any more.
--
If you're part way down the list, surely you'll need to unlock multiple anon_vmas here before going to retry? Otherwise you could forget to unlock one that you already locked, and live-lock here. -- All rights reversed --
On Mon, 26 Apr 2010 20:30:41 -0400 vma->anon_vma->lock is released after vma_address(). -Kame --
Doh, never mind. Too much code in my mind at once... Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
So patch 1 is for aa.git too, and patch 2 is only for mainline with the new anon-vma changes (patch 2 not needed in current aa.git, and if I apply it, it'll deadlock so...) right? split_huge_page is somewhat simpler and more strict in its checking than migrate.c in this respect, and yes patch 2 will also need to be extended to cover split_huge_page the moment I stop backing out the new anon-vma code (but it won't be any different, whatever works for migrate will also work for split_huge_page later). For now I'm much more interested in patch 1 and I'll leave patch 2 to mainline digestion and check it later hope to find all issues fixed by the time transparent hugepage gets merged. About patch 1 it's very interesting because I looked at the race against fork and migrate yesterday and I didn't see issues but I'm going to read your patch 1 in detail now to understand what is the problem you're fixing. Good you posted this fast, so I can try to help ;) Andrea --
