Page migration requires rmap to be able to find all ptes mapping a page at all times. Otherwise the migration entry can be instantiated, but one can be left behind if the second rmap_walk fails to find the page. If this page is later faulted, migration_entry_to_page() will call BUG because the page is locked, indicating the page was migrated but the migration PTE was not cleaned up. For example
[ 511.201534] kernel BUG at include/linux/swapops.h:105!
[ 511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[ 511.201534] last sysfs file: /sys/block/sde/size
[ 511.201534] CPU 0
[ 511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[ 511.888526]
[ 511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[ 511.888526] RIP: 0010:[<ffffffff811094ff>] [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 512.173545] RSP: 0018:ffff880037b979d8 EFLAGS: 00010246
[ 512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[ 512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[ 512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[ 512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[ 512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[ 512.380001] FS: 00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[ 512.380001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[ ...
Hi Mel, did you see my proposed fix? I'm running with it applied, I'd be interested if you can test it. Surely it will also work for new anon-vma code in upstream, because at that point there's just 1 anon-vma and nothing else attached to the vma. http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=6efa1dfa5152ef8d7f2... I think it's wrong to try to handle the race in rmap walk by making magic checks on vm_flags VM_GROWSDOWN|GROWSUP and vma->vm_mm->map_count == 1, when we can fix it fully and simply in exec.c by indexing two vmas in the same anon-vma with a different vm_start so the pages will be found at all times by the rmap_walk. --
I'm building a mergeable THP+memory compaction tree ready for mainline merging based on new anon-vma code, so I'm integrating your patch1 and this below should be the port of my alternate fix to your patch2 to fix the longstanding crash in migrate (not a bug in new anon-vma code but longstanding). patch1 is instead about the bugs introduced by the new anon-vma code that might crash migrate (even without memory compaction and/or THP) the same way as the bug fixed by the below.
==
Subject: fix race between shift_arg_pages and rmap_walk
From: Andrea Arcangeli <aarcange@redhat.com>
migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.
And split_huge_page() will have the same requirements as migrate.c
already has.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;
BUG_ON(new_start > new_end);
@@ -515,17 +518,46 @@ static int shift_arg_pages(struct vm_are
return -EFAULT;
/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = *vma;
+
+ if (unlikely(anon_vma_clone(tmp_vma, vma))) ...
Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
I noticed I didn't qrefresh to sync the patch to the working dir and I
had some change in working dir not picked up in the patch. A
INIT_LIST_HEAD was missing and this was the patch I actually tested.
===
Subject: fix race between shift_arg_pages and rmap_walk
From: Andrea Arcangeli <aarcange@redhat.com>
migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.
And split_huge_page() will have the same requirements as migrate.c
already has.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;
BUG_ON(new_start > new_end);
@@ -515,17 +518,46 @@ static int shift_arg_pages(struct vm_are
return -EFAULT;
/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = *vma;
+ INIT_LIST_HEAD(&tmp_vma->anon_vma_chain);
+ if (unlikely(anon_vma_clone(tmp_vma, vma))) {
+ kmem_cache_free(vm_area_cachep, tmp_vma);
+ return -ENOMEM;
+ }
+
+ /*
* cover the whole range: [new_start, old_end)
+ *
+ * The vma is attached only to vma->anon_vma so one lock is
* enough. Even this lock might be ...
Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
I prefer this approach to excluding the temporary stack during migration. From a performance viewpoint, Mel and Kame's approach looks good and simple. But from a correctness viewpoint, Andrea's looks good. I mean, Mel's approach is that the problem is here but we solve it over there, which creates a dependency between here and there. And in the future, if the temporary stack and rmap code become a problem again, we would also have to solve it there. :) So I support this one. -- Kind regards, Minchan Kim --
That explains exactly why I wanted to solve it locally to exec.c and by using the same method of mremap. And it fixes all users not just migrate (split_huge_page may also need it in the future if we ever allow the user stack to be born huge instead of growing huge with khugepaged if needed). --
I did when I got back - sorry for the delay. The patchset I sent out was what I had fully tested and was confident worked. I picked up the version of the

Unfortunately, the same bug triggers after about 18 minutes. The objective of your fix is very simple - have a VMA covering the new range so that rmap can find it. However, no lock is held during move_page_tables() because of the need to call the page allocator. Due to the lack of locking, is it possible that something like the following is happening?

Exec Process                            Migration Process
begin move_page_tables
                                        begin rmap walk
                                        take anon_vma locks
                                        find new location of pte (do nothing)
copy migration pte to new location
#### Bad PTE now in place
                                        find old location of pte
                                        remove old migration pte
                                        release anon_vma locks
remove temporary VMA
some time later, bug on migration pte

Even with the care taken, a migration PTE got copied and then left behind. What I haven't confirmed at this point is if the ordering of the walk in "migration process" is correct in the above scenario. The order is important for the race as described to happen.

How bad is that magic check really? Is there a scenario when it's the wrong thing to do? I agree that migration skipping specific pages of the temporary stack is unfortunate and having exec-aware information in migration is an odd dependency at best. On the other hand, it's not as bad as skipping other regions as exec will finish and allow the pages to be moved again. The impact to compaction

If it can be simply fixed in exec, then I'll agree. Your patch looked simple but unfortunately it doesn't fix the problem and it does introduce another call to kmalloc() in the exec path. It's probably something that would only be noticed by microbenchmarks though so I'm less concerned about that aspect.
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
OK, so this seems to be the ordering dependency on the anon_vma list striking again. I didn't realize the ordering would matter here, but it does as shown above, great catch! The destination vma of move_page_tables has to be at the tail of the anon_vma list, just as the child vmas have to be at the end to avoid the equivalent race in fork. This has to be a requirement for mremap too. We just want to enforce the same invariants that mremap already enforces, to avoid adding new special cases to the VM.
== for new anon-vma code ==
Subject: fix race between shift_arg_pages and rmap_walk
From: Andrea Arcangeli <aarcange@redhat.com>
migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.
And split_huge_page() will have the same requirements as migrate.c
already has.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -503,7 +504,9 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ struct vm_area_struct *tmp_vma;
BUG_ON(new_start > new_end);
@@ -515,17 +518,43 @@ static int shift_arg_pages(struct vm_are
return -EFAULT;
/*
+ * We need to create a fake temporary vma and index it in the
+ * anon_vma list in order to allow the pages to be reachable
+ * at all times by the rmap walk for migrate, while
+ * move_page_tables() is running.
+ */
+ tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!tmp_vma)
+ return -ENOMEM;
+ *tmp_vma = ...
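To make the ordering argument concrete, here is a minimal user-space sketch in C. It is a toy model and not kernel code: the names `walk_finds_migration_pte`, `vma_kind` and `pte_location` are invented for illustration, and the only thing modelled is a single migration PTE that exec moves from the old range to the new range at some point relative to a two-step rmap walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: the single migration PTE lives in the old or the new range. */
enum pte_location { PTE_AT_OLD, PTE_AT_NEW };

/* The two vmas an rmap walk can visit, named by the range they cover. */
enum vma_kind { VMA_OLD_RANGE, VMA_NEW_RANGE };

/*
 * Walk the two vmas in the given order. exec moves the PTE from the old
 * range to the new range just before visit number move_before_visit
 * (0 = before the walk, 1 = between the two visits, 2 = after the walk).
 * Returns true if the walk saw the migration PTE and could remove it.
 */
bool walk_finds_migration_pte(const enum vma_kind order[2],
                              int move_before_visit)
{
    enum pte_location pte = PTE_AT_OLD;
    bool found = false;

    assert(move_before_visit >= 0 && move_before_visit <= 2);

    for (size_t i = 0; i < 2; i++) {
        if ((int)i == move_before_visit)
            pte = PTE_AT_NEW;   /* stands in for move_page_tables() */
        if (order[i] == VMA_OLD_RANGE && pte == PTE_AT_OLD)
            found = true;
        if (order[i] == VMA_NEW_RANGE && pte == PTE_AT_NEW)
            found = true;
    }
    return found;
}
```

In this model the only losing combination is visiting the destination range first and having the move land between the two visits, which is the interleaving Mel describes. Visiting the source range first finds the PTE whenever the move happens, which is consistent with requiring the destination vma to sit at the tail of the list.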
Agreed. To be honest, I found the ordering problems of the anon_vma list a little confusing but as long as it's consistent everywhere, it's manageable. If this ever burns us in the future though, we might want DEBUG_VM option that

Reviewed-by: Mel Gorman <mel@csn.ul.ie>

I'm currently testing this and have seen no problems after an hour which is typically good. To be absolutely sure, it needs 24 hours but so far so good. The changelog is a tad on the light side so maybe you'd like to take this one instead and edit it to your liking?

==== CUT HERE ===
mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack

Page migration requires rmap to be able to find all migration ptes created by migration. If the second rmap_walk clearing migration PTEs misses an entry, it is left dangling causing a BUG_ON to trigger during fault. For example;
[ 511.201534] kernel BUG at include/linux/swapops.h:105!
[ 511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[ 511.201534] last sysfs file: /sys/block/sde/size
[ 511.201534] CPU 0
[ 511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[ 511.888526]
[ 511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[ 511.888526] RIP: 0010:[<ffffffff811094ff>] [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 512.173545] RSP: 0018:ffff880037b979d8 EFLAGS: 00010246
[ 512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[ 512.329617] RDX: 0000000001a2ba10 RSI: ...
I'll take your changelog for aa.git thanks! And the non trivial stuff was documented in the code too. So now in aa.git I've two branches, master -> old-anon_vma, anon_vma_chain -> new-anon_vma. anon_vma_chain starts with Rik's patch 1/2 and then this patch. old-anon_vma starts with backout-anon-vma and then this patch 2 backported to old anon-vma code. After the removal of all vma->anon_vma->lock usages from THP code, and switching to a slower get_page() spin_unlock(page_table_lock) page_lock_anon_vma(page) model, the anon_vma_chain branch has a chance to be as solid as the master branch. anon_vma_chain branch can be pulled from mainline branches too. The master branch is also not using anymore any vma->anon_vma->lock even if it still could and it'd be a bit faster, to give more testing to the anon_vma_chain code. You can see the difference with "git diff master anon_vma_chain". http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=summary http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog;h=refs/heads/anon_v... This should be THP-23 and THP-23-anon_vma_chain tags, I'll do proper release soon. --
A simpler solution would be to not allow migration of the temporary stack? --
The patch's intention is to not migrate pages within the temporary stack. What are you suggesting that is different? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
A simple way to disallow migration of pages is to increment the refcount of a page. --
I guess it could be done by walking the page tables in advance of the move, elevating the page count of any pages faulted and then finding those pages afterwards. The fail path would be a bit of a pain though if the page tables are partially moved. It's unnecessarily complicated when the temporary stack can be easily avoided. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Faulting during exec? Don't we hold mmap_sem for write? A get_user_pages() or so on the range will increment the refcount. --
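As a rough illustration of the refcount idea being discussed, the following C sketch models a page whose migration is refused while someone else holds a reference. It is a toy under stated assumptions: `toy_page`, `toy_pin`, `toy_unpin` and `toy_try_migrate` are hypothetical names, and the single expected reference is a simplification of how migration compares the page count against what it expects. It only shows the comparison itself and says nothing about whether pinning would close the race in this thread.

```c
#include <errno.h>

/* Toy stand-in for struct page: only the reference count matters here. */
struct toy_page {
    int refcount;
};

/* Take an extra reference, as a get_user_pages()-style pin would. */
void toy_pin(struct toy_page *page)
{
    page->refcount++;
}

/* Drop the extra reference again once the caller is done. */
void toy_unpin(struct toy_page *page)
{
    page->refcount--;
}

/*
 * In this model the migration path holds exactly one reference of its own.
 * Any other holder makes the count differ from that, and migration backs
 * off with -EAGAIN instead of moving the page.
 */
int toy_try_migrate(const struct toy_page *page)
{
    const int expected = 1; /* the reference held by the migration path */

    if (page->refcount != expected)
        return -EAGAIN;
    return 0;
}
```

A caller would pin every page in the range before the move and unpin afterwards, which is the extra work on the exec path that Mel objects to.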
Or just identify the temporary stack from the migration side instead of adding to the cost of exec? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Adding one-off checks to a generic mechanism isn't really clean programming. Using the provided means of disabling a generic mechanism is. --
Andrea's solution is likely lighter than yours as it is one kmalloc and an insertion into the VM as opposed to a page table walk with reference counting. Better yet, it exists as a patch that has been tested and it fits in with the generic mechanism by guaranteeing that rmap_walk finds all the migration PTEs during the second walk. The problem remains the same - that class of solution increases the cost of a common operation (exec) to keep a much less common operation (migration) happy. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
page table walk adding reference counting is still a one off check, the generic rmap_walk mechanism won't care about the reference counting, still only migrate checks the page count... so it doesn't move the needle in clean programming terms. --
OK for migrate, but it won't prevent the crash in the split_huge_page rmap walk, nor will the PG_lock. Why have a migrate-specific fix for an rmap bug? The fix that makes execve the only special place to handle in every rmap walk is at least more maintainable than a fix that makes one of the rmap walk users special and won't fix the others, as there will be more than just 1 user that requires this. My fix didn't make execve special and it didn't require execve knowledge in every rmap walk user like migrate (split_huge_page etc...) but as long as the kernel doesn't crash I'm fine ;). --
On Mon, 10 May 2010 21:05:59 +0200
At first, I like a step-by-step approach even if it doubles our cost, because it's easy to understand and makes the change-log easy to follow. OK, your split_huge_page() has some problems with the current rmap+migration. But I don't like having a patch in the change-log for a bug that can never happen now. I believe it can be fixed by the same approach as for exec. Renaming #define VM_STACK_INCOMPLETE_SETUP to #define VM_TEMPORARY_INCONSITENT_RMAP in _your_ patch series and adding some check in rmap_walk() seems enough. Of course, I may misunderstand your problem. Could you show your patch which hits the problem with rmap+migration? Thanks, -Kame --
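The flag-based check suggested here could look roughly like the following C sketch. It is a user-space toy, not a proposed patch: `toy_vma`, `toy_rmap_walk` and `TOY_VM_INCONSISTENT_RMAP` (including its bit value) are hypothetical stand-ins, and the walk just counts PTEs rather than touching page tables.

```c
#include <stddef.h>

/* Hypothetical flag bit; a real name and value would come from the patch. */
#define TOY_VM_INCONSISTENT_RMAP 0x1UL

/* Toy stand-in for a vma on an anon_vma list. */
struct toy_vma {
    unsigned long vm_flags;
    int nr_ptes;    /* PTEs this vma maps for the page being walked */
};

/*
 * Visit every vma on a toy anon_vma list and count the PTEs the walk may
 * touch, skipping any vma flagged as having inconsistent rmap state (for
 * example a stack that exec is still moving).
 */
int toy_rmap_walk(const struct toy_vma *vmas, size_t n)
{
    int visited = 0;

    for (size_t i = 0; i < n; i++) {
        if (vmas[i].vm_flags & TOY_VM_INCONSISTENT_RMAP)
            continue;   /* leave this vma alone until exec is done */
        visited += vmas[i].nr_ptes;
    }
    return visited;
}
```

The trade-off argued over in this thread is visible in the sketch: the check is cheap for exec, but every rmap walk user that cares has to know about the flag.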
