Re: [PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks

Previous thread: [git pull] spi & OF fixes and documentation cleanup. by Grant Likely on Friday, April 30, 2010 - 8:07 am. (1 message)

Next thread: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock by Mel Gorman on Thursday, April 29, 2010 - 1:32 am. (2 messages)

Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page.  If this page is later faulted, migration_entry_to_page() will
call BUG because the page is locked indicating the page was migrated by
the migration PTE not cleaned up. For example

[  511.201534] kernel BUG at include/linux/swapops.h:105!
[  511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[  511.201534] last sysfs file: /sys/block/sde/size
[  511.201534] CPU 0
[  511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[  511.888526]
[  511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[  511.888526] RIP: 0010:[<ffffffff811094ff>]  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[  512.173545] RSP: 0018:ffff880037b979d8  EFLAGS: 00010246
[  512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[  512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[  512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[  512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[  512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[  512.380001] FS:  00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[  512.380001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[  ...

Hi Mel,

did you see my proposed fix? I'm running with it applied, I'd be
interested if you can test it. Surely it will also work for new
anon-vma code in upstream, because at that point there's just 1
anon-vma and nothing else attached to the vma.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=6efa1dfa5152ef8d7f2...

I think it's wrong to try to handle the race in rmap walk by making
magic checks on vm_flags VM_GROWSDOWN|GROWSUP and
vma->vm_mm->map_count == 1, when we can fix it fully and simply in
exec.c by indexing two vmas in the same anon-vma with a different
vm_start so the pages will be found at all times by the rmap_walk.
--


I'm building a mergeable THP+memory compaction tree ready for mainline
merging based on new anon-vma code, so I'm integrating your patch1 and
this below should be the port of my alternate fix to your patch2 to
fix the longstanding crash in migrate (not a bug in new anon-vma code
but longstanding). patch1 is instead about the bugs introduced by the
new anon-vma code that might crash migrate (even without memory
compaction and/or THP) the same way as the bug fixed by the below.

==
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	unsigned long moved_length;
 	struct mmu_gather *tlb;
+	struct vm_area_struct *tmp_vma;
 
 	BUG_ON(new_start > new_end);
 
@@ -515,17 +518,46 @@ static int shift_arg_pages(struct vm_are
 		return -EFAULT;
 
 	/*
+	 * We need to create a fake temporary vma and index it in the
+	 * anon_vma list in order to allow the pages to be reachable
+	 * at all times by the rmap walk for migrate, while
+	 * move_page_tables() is running.
+	 */
+	tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!tmp_vma)
+		return -ENOMEM;
+	*tmp_vma = *vma;
+
+	if (unlikely(anon_vma_clone(tmp_vma, vma))) ...

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--


I noticed I didn't qrefresh to sync the patch to the working dir and I
had some change in working dir not picked up in the patch. A
INIT_LIST_HEAD was missing and this was the patch I actually tested.

===
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	unsigned long moved_length;
 	struct mmu_gather *tlb;
+	struct vm_area_struct *tmp_vma;
 
 	BUG_ON(new_start > new_end);
 
@@ -515,17 +518,46 @@ static int shift_arg_pages(struct vm_are
 		return -EFAULT;
 
 	/*
+	 * We need to create a fake temporary vma and index it in the
+	 * anon_vma list in order to allow the pages to be reachable
+	 * at all times by the rmap walk for migrate, while
+	 * move_page_tables() is running.
+	 */
+	tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!tmp_vma)
+		return -ENOMEM;
+	*tmp_vma = *vma;
+	INIT_LIST_HEAD(&tmp_vma->anon_vma_chain);
+	if (unlikely(anon_vma_clone(tmp_vma, vma))) {
+		kmem_cache_free(vm_area_cachep, tmp_vma);
+		return -ENOMEM;
+	}
+
+	/*
 	 * cover the whole range: [new_start, old_end)
+	 *
+	 * The vma is attached only to vma->anon_vma so one lock is
+	 * enough. Even this lock might be ...

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--


I like this approach than exclude temporal stack while migration.

If we look it through viewpoint of performance, Mel and Kame's one
look good and simple. But If I look it through viewpoint of
correctness, Andrea's one looks good.
I mean Mel's approach is that problem is here but let us solve it with
there. it makes dependency between here and there. And In future, if
temporal stack and rmap code might be problem, we also should solve it
in there. :)

So I support this one.

-- 
Kind regards,
Minchan Kim
--


That explains exactly why I wanted to solve it locally to exec.c and
by using the same method of mremap. And it fixes all users not just
migrate (split_huge_page may also need it in the future if we ever
allow the user stack to be born huge instead of growing huge with
khugepaged if needed).
--


I did when I got back - sorry for the delay. The patchset I sent out was what
I had fully tested and was confident worked. I picked up the version of the

Unfortunately, the same bug triggers after about 18 minutes. The objective of
your fix is very simple - have a VMA covering the new range so that rmap can
find it. However, no lock is held during move_page_tables() because of the
need to call the page allocator. Due to the lack of locking, is it possible
that something like the following is happening?

Exec Process				Migration Process
begin move_page_tables
					begin rmap walk
					take anon_vma locks
					find new location of pte (do nothing)
copy migration pte to new location
#### Bad PTE now in place
					find old location of pte
					remove old migration pte
					release anon_vma locks
remove temporary VMA
some time later, bug on migration pte

Even with the care taken, a migration PTE got copied and then left behind. What
I haven't confirmed at this point is if the ordering of the walk in "migration
process" is correct in the above scenario. The order is important for
the race as described to happen.


How bad is that magic check really? Is there a scenario when it's
the wrong thing to do?

I agree that migration skipping specific pages of the temporary stack is
unfortunate and having exec-aware informtion in migration is an odd dependency
at best. On the other hand, it's not as bad as skipping other regions as exec
will finish and allow the pages to be moved again. The impact to compaction

If it can be simply fixed in exec, then I'll agree. Your patch looked simple
but unfortunately it doesn't fix the problem and it does introduce another
call to kmalloc() in the exec path. It's probably something that would only
be noticed by microbenchmarks though so I'm less concerned about that aspect.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--


Ok so this seems the ordering dependency on the anon_vma list that
strikes again, I didn't realize the ordering would matter here, but it
does as shown above, great catch! The destination vma of the
move_page_tables has to be at the tail of the anon_vma list like the
child vma have to be at the end to avoid the equivalent race in
fork. This has to be a requirement for mremap too. We just want to
enforce the same invariants that mremap already enforces, to avoid
adding new special cases to the VM.

== for new anon-vma code ==
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@redhat.com>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

And split_huge_page() will have the same requirements as migrate.c
already has.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fsnotify.h>
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	unsigned long moved_length;
 	struct mmu_gather *tlb;
+	struct vm_area_struct *tmp_vma;
 
 	BUG_ON(new_start > new_end);
 
@@ -515,17 +518,43 @@ static int shift_arg_pages(struct vm_are
 		return -EFAULT;
 
 	/*
+	 * We need to create a fake temporary vma and index it in the
+	 * anon_vma list in order to allow the pages to be reachable
+	 * at all times by the rmap walk for migrate, while
+	 * move_page_tables() is running.
+	 */
+	tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!tmp_vma)
+		return -ENOMEM;
+	*tmp_vma = ...

Agreed. To be honest, I found the problems ordering of the anon_vma a little
confusing but as long as it's consistent everywhere, it's manageable. If
this ever burns us in the future though, we might want DEBUG_VM option that

Reviewed-by: Mel Gorman <mel@csn.ul.ie>

I'm currently testing this and have seen no problems after an hour which
is typically good. To be absolutly sure, it needs 24 hours but so far so
good. The changelog is a tad on the light side so maybe you'd like to take
this one instead and edit it to your liking?

==== CUT HERE ===

mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack

Page migration requires rmap to be able to find all migration ptes
created by migration. If the second rmap_walk clearing migration PTEs
misses an entry, it is left dangling causing a BUG_ON to trigger during
fault. For example;

[  511.201534] kernel BUG at include/linux/swapops.h:105!
[  511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[  511.201534] last sysfs file: /sys/block/sde/size
[  511.201534] CPU 0
[  511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[  511.888526]
[  511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[  511.888526] RIP: 0010:[<ffffffff811094ff>]  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[  512.173545] RSP: 0018:ffff880037b979d8  EFLAGS: 00010246
[  512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[  512.329617] RDX: 0000000001a2ba10 RSI: ...

I'll take your changelog for aa.git thanks! And the non trivial stuff
was documented in the code too.

So now in aa.git I've two branches, master -> old-anon_vma,
anon_vma_chain -> new-anon_vma.

anon_vma_chain starts with Rik's patch 1/2 and then this
patch. old-anon_vma starts with backout-anon-vma and then this patch 2
backported to old anon-vma code. After the removal of all
vma->anon_vma->lock usages from THP code, and switching to a slower
get_page() spin_unlock(page_table_lock) page_lock_anon_vma(page)
model, the anon_vma_chain branch has a chance to be as solid as the
master branch. anon_vma_chain branch can be pulled from mainline
branches too. The master branch is also not using anymore any
vma->anon_vma->lock even if it still could and it'd be a bit faster,
to give more testing to the anon_vma_chain code.

You can see the difference with "git diff master anon_vma_chain".

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=summary
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=refs/heads/anon_v...

This should be THP-23 and THP-23-anon_vma_chain tags, I'll do proper
release soon.
--


A simpler solution would be to not allow migration of the temporary stack?

--


The patch's intention is to not migrate pages within the temporary
stack. What are you suggesting that is different?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--


A simple way to disallow migration of pages is to increment the refcount
of a page.

--


I guess it could be done by walking the page-tables in advance of the move
and elevating the page count of any pages faulted and then finding those
pages afterwards.  The fail path would be a bit of a pain though if the page
tables are partially moved though. It's unnecessarily complicated when the
temporary stack can be easily avoided.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--


Faulting during exec? Dont we hold mmap_sem for write? A get_user_pages()
or so on the range will increment the refcount.
--


Or just identify the temporary stack from the migration side instead of
adding to the cost of exec?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--


Adding one off checks to a generic mechanism isnt really clean
programming. Using the provided means of disabling a generic mechanism is.

--


Andrea's solution is likely lighter than yours as it is one kmalloc and
an insertion into the VM as opposed to a page table walk with reference
counting. Better yet, it exists as a patch that has been tested and it
fits in with the generic mechanism by guaranteeing that rmap_walk finds
all the migration PTEs during the second walk.

The problem remains the same - that class of solution increases the cost of
a common operation (exec) to keep a much less operation (migration) happy.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--


page table walk adding reference counting is still a one off check,
the generic rmap_walk mechanism won't care about the reference
counting, still only migrate checks the page count... so it doesn't
move the needle in clean programming terms.
--


Ok for migrate but it won't prevent to crash in split_huge_page rmap
walk, nor the PG_lock. Why for a rmap bug have a migrate specific fix?
The fix that makes execve the only special place to handle in every
rmap walk, is at least more maintainable than a fix that makes one of
the rmap walk users special and won't fix the others, as there will be
more than just 1 user that requires this. My fix didn't make execve
special and it didn't require execve knowledge into the every rmap
walk like migrate (split_huge_page etc...) but as long as the kernel
doesn't crash I'm fine ;).
--


On Mon, 10 May 2010 21:05:59 +0200

At first, I like step-by-step approach even if it makes our cost double
because it's easy to understand and makes chasing change-log easy.

Ok, your split_huge_page() has some problems with current rmap+migration.
But I don't like a patch for never-happen-now bug in change-log.

I believe it can be fixed by the same approach for execs.
Renaming 
#define VM_STACK_INCOMPLETE_SETUP
to be
#define VM_TEMPORARY_INCONSITENT_RMAP
in _your_ patch series and add some check in rmap_walk() seems enough.

Of course, I may misunderstand your problem. Could you show your patch
which meets the problem with rmap+migration ?

Thanks,
-Kame




--

Previous thread: [git pull] spi & OF fixes and documentation cleanup. by Grant Likely on Friday, April 30, 2010 - 8:07 am. (1 message)

Next thread: [PATCH 1/2] mm: Take all anon_vma locks in anon_vma_lock by Mel Gorman on Thursday, April 29, 2010 - 1:32 am. (2 messages)