After V1, it was clear that execve was still racing but eventually died in an exec-related race. An additional part of the test was created that hammers exec() to reproduce typically within 10 minutes rather than several hours. The problem was that the VMA is moved under lock but not the page tables. Migration fails to remove the migration PTE from its new location and a BUG is later triggered. The third patch in this series is a candidate fix. Changelog since V1 o Handle the execve race o Be sure that rmap_walk() releases the correct VMA lock o Hold the anon_vma lock for the address lookup and the page remap o Add reviewed-bys There are a number of races between migration and other operations that mean a migration PTE can be left behind. Broadly speaking, migration works by locking a page, unmapping it, putting a migration PTE in place that looks like a swap entry, copying the page and remapping the page removing the old migration PTE. If a fault occurs, the faulting process waits until migration completes. The problem is that there are some races that either allow migration PTEs to be copied or a migration PTE to be left behind. Migration still completes and the page is unlocked but later a fault will call migration_entry_to_page() and BUG() because the page is not locked. This series aims to close some of these races. Patch 1 alters fork() to restart page table copying when a migration PTE is encountered. Patch 2 has vma_adjust() acquire the anon_vma lock and makes rmap_walk() aware that VMAs on the chain may have different anon_vma locks that also need to be acquired. Patch 3 notes that while a VMA is moved under the anon_vma lock, the page tables are not similarly protected. Where migration PTEs are encountered, they are cleaned up. The reproduction case was as follows; 1. Run kernel compilation in a loop 2. Start three processes, each of which creates one mapping. The three stress different aspects of the problem. The operations they ...
During exec(), a temporary stack is setup and moved later to its final
location. There is a race between migration and exec whereby a migration
PTE can be placed in the temporary stack. When this VMA is moved under the
lock, migration no longer knows where the PTE is, fails to remove the PTE
and the migration PTE gets copied to the new location. This later causes
a bug when the migration PTE is discovered but the page is not locked.
This patch handles the situation by removing the migration PTE when page
tables are being moved in case migration fails to find them. The alternative
would require significant modification to vma_adjust() and the locks taken
to ensure a VMA move and page table copy is atomic with respect to migration.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/migrate.h | 7 +++++++
mm/migrate.c | 2 +-
mm/mremap.c | 29 +++++++++++++++++++++++++++++
3 files changed, 37 insertions(+), 1 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 7a07b17..05d2292 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -22,6 +22,8 @@ extern int migrate_prep(void);
extern int migrate_vmas(struct mm_struct *mm,
const nodemask_t *from, const nodemask_t *to,
unsigned long flags);
+extern int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+ unsigned long addr, void *old);
#else
#define PAGE_MIGRATION 0
@@ -42,5 +44,10 @@ static inline int migrate_vmas(struct mm_struct *mm,
#define migrate_page NULL
#define fail_migrate_page NULL
+static inline int remove_migration_pte(struct page *new,
+ struct vm_area_struct *vma, unsigned long addr, void *old)
+{
+}
+
#endif /* CONFIG_MIGRATION */
#endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 4afd6fe..053fd39 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -75,7 +75,7 @@ void putback_lru_pages(struct list_head *l)
/*
* Restore a potential migration pte to a ...On Tue, 27 Apr 2010 22:30:52 +0100 Mel, I don't like this fix. Consider following, 1. try_to_unmap(oldpage) 2. copy and replace 3. remove_migration_ptes(oldpage, newpage) What this patch handles is "3: remove_migration_ptes fails to remap it and migration_pte will remain there case....The fact "new page is not mapped" means "get_page() is not called against the new page". So, the new page have been able to be freed until we restart move_ptes. I bet calling __get_user_pages_fast() before vma_adjust() is the way to go. When page_count(page) != page_mapcount(page) +1, migration skip it. Thanks, -Kame --
My proposed fix avoids to walk the pagetables once more time and to mangle over the page counts. Can you check it? It works but it needs more review. --
On Wed, 28 Apr 2010 02:08:33 +0200 Sure...but can we avoid temporal objrmap breakage(inconsistency) by it ? THanks, -Kame --
This is the real bug, the patch 1 should be rejected and the expanation-trace has the ordering wrong. The ordering is subtle but fundamental to prevent that race, split_huge_page also requires the same anon-vma list_add_tail to avoid the same race between fork and rmap_walk. It should work fine already with old and new anon-vma code as they both add new vmas always to the tail of the list. So the bug in very short, is that "move_page_tables runs out of sync I'll now evaluate the fix and see if I can find any other way to handle this. Great, I'm quite sure with patch 3 we'll move the needle and fix the bug, it perfectly explains why we only get the oops inside execve in the stack page. Patch 2 I didn't check it yet but it's only relevant for the new anon-vma code, I suggest to handle it separately from the rest. --
I think a better fix for bug mentioned in patch 3, is like below. This seems to work fine on aa.git with the old (stable) 2.6.33 anon-vma code. Not sure if this also works with the new anon-vma code in mainline but at first glance I think it should. At that point we should be single threaded so it shouldn't matter if anon_vma is temporary null. Then you've to re-evaluate the vma_adjust fixes for mainline-only in patch 2 at the light of the below (I didn't check patch 2 in detail). Please try to reproduce with the below applied. ---- Subject: fix race between shift_arg_pages and rmap_walk From: Andrea Arcangeli <aarcange@redhat.com> migrate.c requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it can't be removed if the second rmap_walk fails to find the page. So shift_arg_pages must run atomically with respect of rmap_walk, and it's enough to run it under the anon_vma lock to make it atomic. And split_huge_page() will have the same requirements as migrate.c already has. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- diff --git a/fs/exec.c b/fs/exec.c --- a/fs/exec.c +++ b/fs/exec.c @@ -55,6 +55,7 @@ #include <linux/fsnotify.h> #include <linux/fs_struct.h> #include <linux/pipe_fs_i.h> +#include <linux/rmap.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -503,6 +504,7 @@ static int shift_arg_pages(struct vm_are unsigned long new_start = old_start - shift; unsigned long new_end = old_end - shift; struct mmu_gather *tlb; + struct anon_vma *anon_vma; BUG_ON(new_start > new_end); @@ -513,6 +515,12 @@ static int shift_arg_pages(struct vm_are if (vma != find_vma(mm, new_start)) return -EFAULT; + anon_vma = vma->anon_vma; + /* stop rmap_walk or it won't find the stack pages */ + spin_lock(&anon_vma->lock); + /* avoid vma_adjust to take any further anon_vma lock */ + vma->anon_vma = NULL; + /* * cover the whole range: ...
On Wed, 28 Apr 2010 00:58:52 +0200 Seems nice. I'll test this but I think we need to take care of do_mremap(), too. And it's more complicated.... Thanks, -Kame --
What did you mean with objrmap inconsistency? I think this is single threaded there, userland didn't run yet and I don't think page faults could run. Maybe it's safer to add a VM_BUG_ON(vma->anon_vma) just do_mremap has to be safe by: 1) adjusting page->index atomically with the pte updates inside pt lock (while it moves from one pte to another) 2) having both vmas src and dst (not overlapping) indexed in the proper anon_vmas before move_page_table runs As long as it's not overlapping it shouldn't be difficult to enforce the above two invariants, exec.c is magic as it works on overlapping areas and src and dst are the same vma and it's indexed into just one anon-vma. So we've to stop the rmap_walks before we mangle over the vma with vma_adjust and move_page_tables and truncate the end of the vma with vma_adjust again, and finally we resume the rmap_walks. I'm not entirely sure of the above so review greatly appreciated ;) --
actually no need of this at all! of course the dst vma will have vma->vm_pgoff adjusted instead... never mind. So I don't see a problem there. I think this is very special of how exec.c abuses move_page_tables by passing vma as src and dst, when it obviously cannot be indexed in two anon-vmas because there's a single vma and a single vma->anon_vma, and src and dst obviously cannot have two different vm_pgoff again because there's a single vma and there can't be two different vma->vm_pgoff. So I'm very hopeful do_mremap is already fully safe... --
On Wed, 28 Apr 2010 03:05:43 +0200 I mean following relationship. vma_address(vma, page1) <-> address <-> pte <-> page2 If page1 == page2, objrmap is consistent. I reviewed do_mremap again and am thinking it's safe. new_vma = copy_vma(); .... ----------(*) move_ptes(). munmap unnecessary range. At (*), if page1 != page2, rmap_walk will not run correctly. But it seems copy_vma() keeps page1==page2 As I reported, when there is a problem, vma_address() returns an address but that address doesn't contain migration_pte. BTW, page->index is not updated, we just keep [start_address, pgoff] to be sane value. Thanks, -Kame --
yes I corrected myself too, the end result is the same and adjusting the new vma vm_pgoff available in the anon-vma list is obviously faster and simpler. When the src and dst ranges don't overlap and src_vma != dst_vma (vma, new_vma in code) things are a lot simpler there... --
On Wed, 28 Apr 2010 00:58:52 +0200 Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Hmm..Mel's patch 2/3 takes vma->anon_vma->lock in vma_adjust(), so this patch clears vma->anon_vma... /* * We adjust vma and move page tables in sequence. While update, * (vma, page) <-> address <-> pte relationship is unstable. * We lock anon_vma->lock for keeping rmap_walk() safe. (see mm/rmap.c) I think we can unlock this just after move_page_tables(). Thanks, -kame --
yep, it should be safe with patch 2 applied too. And I'm unsure why Mel's patch locks the anon_vma also when vm_start != start. See the other Checking this, I can't see where exactly is vma->vm_pgoff adjusted during the atomic section I protected with the anon_vma->lock? For a moment it looks like these pages become unmovable. I guess this is why I thought initially that it was move_page_tables to adjust the page->index. If it doesn't then the vma->vm_pgoff has to be moved down of shift >>PAGE_SHIFT and it doesn't seem to be happening which is an unrelated bug. --
On Wed, 28 Apr 2010 03:44:34 +0200 The page can be replaced with migration_pte before the 1st vma_adjust. The key is (vma, page) <-> address <-> pte <-> page relationship. vma_adjust() (*) move_pagetables(); (**) vma_adjust(); At (*), vma_address(vma, page) retruns a _new_ address. But pte is not Anyway, I have no strong opinion about the placement of unlock(anon_vma->lock). I wonder why we don't see this at testing memory-hotplug is because memory-hotplug disables a new page allocation in the migration range. So, this exec() is hard to get a page which can be migration target. Thanks, -Kame --
Yes I agree we can move the unlock at (**) because the last vma_adjust is only there to truncate the vm_end. In fact it looks super heavyweight to call vma_adjust for that instead of just using vma->vm_end = new_end considering we're under mmap_sem, full anonymous etc... In fact I think even the first vma_adjust looks too heavyweight and it doesn't bring any simplicity or added safety considering this works in place and there's nothing to wonder about vm_next or vma_merge or vm_file or anything that vma_adjust is good at. So the confusion I had about vm_pgoff is because all things that moves vm_start down, also move vm_pgoff down like stack growsdown but of course those don't move the pages down too, so we must not alter vm_pgoff here just vm_start along with the pagetables inside the anon_vma lock to be fully safe. Also I forgot to unlock in case of -ENOMEM ;) this is a new try, next is for a later time... hope this helps! Thanks! ---- Subject: fix race between shift_arg_pages and rmap_walk From: Andrea Arcangeli <aarcange@redhat.com> migrate.c requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it can't be removed if the second rmap_walk fails to find the page. So shift_arg_pages must run atomically with respect of rmap_walk, and it's enough to run it under the anon_vma lock to make it atomic. And split_huge_page() will have the same requirements as migrate.c already has. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- diff --git a/fs/exec.c b/fs/exec.c --- a/fs/exec.c +++ b/fs/exec.c @@ -55,6 +55,7 @@ #include <linux/fsnotify.h> #include <linux/fs_struct.h> #include <linux/pipe_fs_i.h> +#include <linux/rmap.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -502,6 +503,7 @@ static int shift_arg_pages(struct vm_are unsigned long length = old_end - old_start; unsigned long new_start = old_start - shift; unsigned long new_end = old_end - ...
On Wed, 28 Apr 2010 04:42:27 +0200 Seems good. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> I'll test this and report if I see trouble again. Unfortunately, I'll have a week of holidays (in Japan) in 4/29-5/05, my office is nearly closed. So, please consider no-mail-from-me is good information. Thanks, --
On Wed, 28 Apr 2010 11:49:44 +0900 Here is bad news. When move_page_tables() fails, "some ptes" are moved but others are not and....there is no rollback routine. I bet the best way to fix this mess up is - disable overlap moving of arg pages - use do_mremap(). But maybe you guys want to fix this directly. Here is a temporal fix from me. But don't trust me.. == Subject: fix race between shift_arg_pages and rmap_walk From: Andrea Arcangeli <aarcange@redhat.com> migrate.c requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it can't be removed if the second rmap_walk fails to find the page. So shift_arg_pages must run atomically with respect of rmap_walk, and it's enough to run it under the anon_vma lock to make it atomic. And split_huge_page() will have the same requirements as migrate.c already has. And, when moving overlapping ptes by move_page_tables(), it's cannot be roll-backed as mremap does. This patch changes move_page_tables()'s behavior and if it failes, no ptes are moved. Changelog: - modified move_page_tables() to do atomic pte moving because "some ptes are moved but others are not" is critical for rmap_walk(). - free pgtables at failure rather than give it all to do_exit(). If not, objrmap will be inconsitent until exit() frees all. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- --- fs/exec.c | 67 +++++++++++++++++++++++++++++++++++++----------------------- mm/mremap.c | 28 ++++++++++++++++++++++--- 2 files changed, 67 insertions(+), 28 deletions(-) Index: mel-test/fs/exec.c =================================================================== --- mel-test.orig/fs/exec.c +++ mel-test/fs/exec.c @@ -55,6 +55,7 @@ #include <linux/fsnotify.h> #include <linux/fs_struct.h> #include <linux/pipe_fs_i.h> +#include <linux/rmap.h> #include <asm/uaccess.h> #include ...
Thanks to you both for looking into this. I far prefer this general approach than cleaning up the migration PTEs as the page tables get copied. While it might "work", it's sloppy in the same way as having migration_entry_wait() do the cleanup was sloppy. It's far preferable to make the VMA move and page table copy atomic with anon_vma->lock. The biggest problem is that the reverse mapping is temporarily out of sync until do_exit gets rid of the mess, but how serious is that really? If there is a migration entry in there, the mapcount should already be zero and migration holds a reference to the page to prevent it going away. rmap_walk() may then miss the migration_pte so it gets left behind. Ordinarily this would be bad but in exec(), we cannot be faulting this page so we won't trigger the bug in swapops. Instead, do_exit ultimately will skip over the migration PTE doing nothing with the page but as the mapcount is still zero, I see the point of your patch but I'm not yet seeing why it is necessary to back out if move_page_tables fails. That said, both patches have a greater problem. Both of them hold a spinlock (anon_vma->lock) while calling into the page allocator with GFP_KERNEL (to allocate the page tables). We don't want to change that to GFP_ATOMIC so either we need to allocate the pages in advance or special case rmap_walk() to not walk processes that are in exec. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
On Tue, 27 Apr 2010 22:30:52 +0100 Here is my final proposal (before going vacation.) I think this is very simple. The biggest problem is when move_page_range fails, setup_arg_pages pass it all to exit() ;) == From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> This is an band-aid patch for avoiding unmap->remap of stack pages while it's udner exec(). At exec, pages for stack is moved by setup_arg_pages(). Under this, (vma,page)<->address relationship can be in broken state. Moreover, if moving ptes fails, pages with not-valid-rmap remains in the page table and objrmap for the page is completely broken until exit() frees all up. This patch adds vma->broken_rmap. If broken_rmap != 0, vma_address() returns -EFAULT always and try_to_unmap() fails. (IOW, the pages for stack are pinned until setup_arg_pages() ends.) And this prevents page migration because the page's mapcount never goes to 0 until exec() fixes it up. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- fs/exec.c | 4 +++- include/linux/mm_types.h | 5 +++++ mm/rmap.c | 5 +++++ 3 files changed, 13 insertions(+), 1 deletion(-) Index: mel-test/fs/exec.c =================================================================== --- mel-test.orig/fs/exec.c +++ mel-test/fs/exec.c @@ -250,7 +250,8 @@ static int __bprm_mm_init(struct linux_b err = insert_vm_struct(mm, vma); if (err) goto err; - + /* prevent rmap_walk, try_to_unmap() etc..until we get fixed rmap */ + vma->unstable_rmap = 1; mm->stack_vm = mm->total_vm = 1; up_write(&mm->mmap_sem); bprm->p = vma->vm_end - sizeof(void *); @@ -653,6 +654,7 @@ int setup_arg_pages(struct linux_binprm ret = -EFAULT; out_unlock: + vma->unstable_rmap = 0; up_write(&mm->mmap_sem); return ret; } Index: mel-test/include/linux/mm_types.h =================================================================== --- mel-test.orig/include/linux/mm_types.h +++ ...
I don't get it, I don't see the pinning and returning -EFAULT is not solution for things that cannot fail (i.e. remove_migration_ptes and split_huge_page). Plus there's no point to return failure to rmap_walk when we can just stop the rmap_walk with the proper lock. --
vma_adjust() is updating anon VMA information without any locks taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a BUG_ON
to trigger when the page is faulted in.
With the recent anon_vma changes, there can be more than one anon_vma->lock
that can be taken in a anon_vma_chain but a second lock cannot be spinned
upon in case of deadlock. Instead, the rmap walker tries to take locks of
different anon_vma's. If the attempt fails, the operation is restarted.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/ksm.c | 22 ++++++++++++++++++++--
mm/mmap.c | 6 ++++++
mm/rmap.c | 28 +++++++++++++++++++++++-----
3 files changed, 49 insertions(+), 7 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..0c09927 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1668,15 +1668,28 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
again:
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct anon_vma *locked_vma;
struct anon_vma_chain *vmac;
struct vm_area_struct *vma;
spin_lock(&anon_vma->lock);
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
+
+ /* See comment in mm/rmap.c#rmap_walk_anon on locking */
+ locked_vma = NULL;
+ if (anon_vma != vma->anon_vma) {
+ locked_vma = vma->anon_vma;
+ if (!spin_trylock(&locked_vma->lock)) {
+ spin_unlock(&anon_vma->lock);
+ goto again;
+ }
+ }
+
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
- continue;
+ goto next_vma;
+
/*
* Initially we examine only the vma which covers this
* rmap_item; but later, if there is still work to do,
@@ -1684,9 +1697,14 @@ again:
...The old code did:
/*
* When changing only vma->vm_end, we don't really need
* anon_vma lock.
*/
if (vma->anon_vma && (insert || importer || start != vma->vm_start))
anon_vma = vma->anon_vma;
if (anon_vma) {
spin_lock(&anon_vma->lock);
why did it become unconditional? (and no idea why it was removed)
But I'm not sure about this part.... this is really only a question, I
may well be wrong, I just don't get it.
--
It became unconditional because I wasn't sure of the optimisation versus the new anon_vma changes (doesn't matter, should have been safe). At the time the patch was introduced, the bug looked like a race in VMA's in the list having their details modified. I thought vma_address was returning -EFAULT when it shouldn't and while this may still be possible, it wasn't the prime cause of the bug. The more important race was in execve between when a VMA got moved and the page tables copied. The anon_vma locks are fine for the VMA move but the page table copy happens later. What the patch did was alter the timing of the race. rmap_walk() was finding the VMA of the new stack being set up by exec, failing to lock it and backing off. By the time it would restart and get back to that VMA, it was already moved making the bug simply harder to reproduce because the race window was so small. So, the VMA list does not appear to be messed up but there still needs to be protection against modification of VMA details that are already on the list. For that, the seq counter would have been enough and lighter-weight than acquiring the anon_vma->lock every time in vma_adjust(). -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Changeset 287d97ac032136724143cde8d5964b414d562ee3 is meant to explain the removal of the lock but I don't get it from the comments. Or at least I don't get from that comment why we can't resurrect the plain old deleted code that looked fine to me. Like there is no reason to You mean you're dropping patch 2 too? I agree dropping patch 1 but to me the having to take all the anon_vma locks for every vma->anon_vma->lock that we walk seems a must, otherwise expand_downwards and vma_adjust won't be ok, plus we need to re-add the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff and vm_start outside of the anon_vma->lock. Or I am mistaken? Patch 2 wouldn't help the swapops crash we reproduced because at that point the anon_vma of the stack is the local one, it's just after execve. vma_adjust and expand_downards would alter vm_pgoff and vm_start while taking only the vma->anon_vma->lock where the vma->anon_vma is the _local_ one of the vma. But a vma in mainline can be indexed in infinite anon_vmas, so to prevent breaking migration vma_adjust/expand_downards the rmap_walk would need to take _all_ anon_vma->locks for every anon_vma that the vma is indexed into. Or alternatively like you implemented rmap_walk would need to check if the vma we found in the rmap_walk is different from the original anon_vma and to take the vma->anon_vma->lock (so taking the anon_vma->lock of the local anon_vma of every vma) before it can actually read the vma->vm_pgoff/vm_start inside vma_address. If the above is right it also means the new anon-vma changes also break the whole locking of transparent hugepage, see wait_split_huge_page, it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the "local" anon-vma is enough, when in fact the hugepage may be shared and belonging to the parent parent_vma->anon_vma and not to the local one of the last child that is waiting on the wrong lock. So I may have to rewrite this part of the thp locking to solve this. And for me it's not ...
Another way (not sure if it's good or bad, but it'd clearly avoid the restarting locks in rmap_walk) would be to allocate a shared lock, and still share the lock like the anon_vma was shared before. So we have shorter chains to walk, but still a larger lock, so we've to take just one and be safe. spin_lock(&vma->anon_vma->shared_anon_vma_lock->lock) something like that... --
Frankly, I don't understand why it was safe to drop the lock either. Maybe it was a mistake but I still haven't convinced myself I fully Temporarily at least until I figured out if execve was the only problem. The locking in vma_adjust didn't look like the prime reason for the crash No, you're not. If nothing else, vma_address can return the wrong value because the VMAs vm_start and vm_pgoff were in the process of being updated but not fully updated. It's hard to see how vma_address would return the wrong value and miss a migration PTE as a result but it's a possibility. It's probably True, although in the case of expand_downwards, it's highly unlikely that there is also a migration PTE to be cleaned up. It's hard to see how a migration PTE would be left behind in that case but it still looks wrong to I felt this would be too heavy in the common case which is why I made rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically To be absolutly sure, yes this is required. I don't think we've been hitting this exact problem in these tests but it still is a better plan than adjusting No fun. That potentially could be a lot of locks to take to split the page. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
I understand the design but I'm also unsure about the details. it's just this lock that gets splitted and when you update the vm_start/pgoff with only the vma->anon_vma->lock, the vma may be queued in multiple other anon_vmas, and you're only serializing and safe for the pages that have page_mapcount 1, and point to the local anon_vma == vma->anon_vma, not any other shared page. The removal of the vma->anon_vma->lock from vma_adjust just seems an unrelated mistake to me too, but I don't know for sure why yet. Basically vma_adjust needs the anon_vma lock like expand_downards has. After you fix vma_adjust to be as safe as expand_downards you've also to take care of the rmap_walk that may run on a page->mapping = anon_vma that isn't the vma->anon_vma and you're not taking that anon_vma->lock of the shared page, when you change the vma vm_pgoff/vm_start. If rmap_walk finds to find a pte, becauase of that, For aa.git it is sure enough. And as long as you only see the migrate For the rmap_walk itself, migrate and split_huge_page are identical, the main problem of transparent hugepage support is that I used the anon_vma->lock in a wider way and taken well before the rmap_walk, so I'm screwed in a worse way than migrate. So I may have to change from anon_vma->lock to the compound_lock in wait_split_huge_page(). But I'll still need the restarting loop of anon_vma locks then inside the two rmap_walk run by split_huge_page. Problem is, I would have preferred to do this locking change later as a pure optimization than as a requirement for merging and running stable, as it'll make things slightly more complex. BTW, if we were to share the lock across all anon_vmas as I mentioned in prev email, and just reduce the chain length, then it'd solve all Every time we fail to find the PTE, it can also mean try_to_unmap just failed to instantiate the migration pte leading to random memory corruption in migrate. If a task fork and the some of the stack pages at the bottom of the ...
Well, in the easiest case, the details of the VMA (particularly vm_start and vm_pgoff) can confuse callers of vma_address during rmap_walk. In the case of migration, it will return other false positives or negatives. Is this not what the try-lock-different-vmas-or-backoff-and-retry logic I can only remember seeing the execve-related crash but I'd rather the locking was correct too. Problem is, I've seen at least one crash due to execve() even with the check made in try_to_unmap_anon to not migrate within the temporary It might be where we end up eventually. I'm simply loathe to introduce How so? The old PTE should have been left in place, the page count of Because the list could be very large, it would make more sense to -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
yes exactly. This is why patch 2 can't be dropped, both for the Right only problem is for remove_migration_ptes (and for both split_huge_page rmap_walks). For migrate the only issue is the second Kind of agree, we'll see... --
Take all the locks for all the anon_vmas in anon_vma_lock, this properly
excludes migration and the transparent hugepage code from VMA changes done
by mmap/munmap/mprotect/expand_stack/etc...
Also document the locking rules for the same_vma list in the anon_vma_chain
and remove the anon_vma_lock call from expand_upwards, which does not need it.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
Posted quickly as an RFC patch, only compile tested so far.
Andrea, Mel, does this look like a reasonable approach?
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..1eef42c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
* all the anon_vmas associated with this VMA.
* The "same_anon_vma" list contains the anon_vma_chains
* which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->page_table_lock.
*/
struct anon_vma_chain {
struct vm_area_struct *vma;
struct anon_vma *anon_vma;
- struct list_head same_vma; /* locked by mmap_sem & page_table_lock */
+ struct list_head same_vma; /* see above */
struct list_head same_anon_vma; /* locked by anon_vma->lock */
};
@@ -90,11 +94,14 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
return page_rmapping(page);
}
-static inline void anon_vma_lock(struct vm_area_struct *vma)
+static inline void anon_vma_lock(struct vm_area_struct *vma, void *nest_lock)
{
struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
- spin_lock(&anon_vma->lock);
+ if (anon_vma) {
+ struct anon_vma_chain *avc;
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock);
+ }
}
static inline void anon_vma_unlock(struct vm_area_struct *vma)
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..2c13bbb 100644
--- ...This will cause a lock inversion (page_table_lock can only be taken after the anon_vma lock). I don't immediately see why the page_table_lock here though? --
We need to safely walk the vma->anon_vma_chain / anon_vma_chain->same_vma list. So much for using the mmap_sem for read + the page_table_lock to lock the anon_vma_chain list. We'll need a new lock somewhere, probably in the mm_struct since one per process seems plenty. I'll add that in the next version of the patch. -- All rights reversed --
Take all the locks for all the anon_vmas in anon_vma_lock, this properly
excludes migration and the transparent hugepage code from VMA changes done
by mmap/munmap/mprotect/expand_stack/etc...
Also document the locking rules for the same_vma list in the anon_vma_chain
and remove the anon_vma_lock call from expand_upwards, which does not need it.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
Posted quickly as an RFC patch, only compile tested so far.
Andrea, Mel, does this look like a reasonable approach?
v2:
- also change anon_vma_unlock to walk the loop
- add calls to anon_vma_lock & anon_vma_unlock to vma_adjust
- introduce a new lock for the vma->anon_vma_chain list, to prevent
the lock inversion that Andrea pointed out
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables and some counters */
+ spinlock_t anon_vma_chain_lock; /* Protects vma->anon_vma_chain, with mmap_sem */
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..492e7ca 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
* all the anon_vmas associated with this VMA.
* The "same_anon_vma" list contains the anon_vma_chains
* which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->anon_vma_chain_lock.
*/
struct anon_vma_chain {
struct vm_area_struct *vma;
struct anon_vma *anon_vma;
- struct list_head same_vma; /* locked by mmap_sem & ...Well, it looks nice but I am wary of making much in the way of comment on its correctness. Every time I think I understand the new anon_vma This doesn't build with LOCKDEP enabled. As we discussed on IRC, I changed nest_lock to spinlock_t * and always made it the anon_vma_chain_lock. That vaguely makes more sense than sometimes depending on the mmap_sem semaphore but I'm not 100% solid on it. Walking along like this and taking the necessary locks seems reasonable though and should be usable by transparent hugepage support where as my so I changed this and others like it to the anon_vma_chain_lock. I -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables and some counters */
+ spinlock_t anon_vma_chain_lock; /* Protects vma->anon_vma_chain, with mmap_sem */
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..703c472 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma {
* all the anon_vmas associated with this VMA.
* The "same_anon_vma" list contains the anon_vma_chains
* which link all the VMAs associated with this anon_vma.
+ *
+ * The "same_vma" list is locked by either having mm->mmap_sem
+ * locked for writing, or having mm->mmap_sem locked for reading
+ * AND holding the mm->anon_vma_chain_lock.
*/
struct anon_vma_chain {
struct vm_area_struct *vma;
struct anon_vma *anon_vma;
- struct list_head same_vma; /* locked by mmap_sem & page_table_lock */
+ struct list_head same_vma; /* see above */
struct list_head same_anon_vma; /* locked by anon_vma->lock */
};
@@ -90,18 +94,24 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
return page_rmapping(page);
}
-static inline void anon_vma_lock(struct vm_area_struct *vma)
-{
- struct anon_vma *anon_vma = vma->anon_vma;
- if (anon_vma)
- spin_lock(&anon_vma->lock);
-}
+#define anon_vma_lock(vma, nest_lock) \
+({ \
+ struct anon_vma *anon_vma = vma->anon_vma; \
+ if (anon_vma) { \
+ struct anon_vma_chain *avc; \
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) \
+ spin_lock_nest_lock(&avc->anon_vma->lock, nest_lock); \
+ } \
+})
static inline void ...Take all the locks for all the anon_vmas in anon_vma_lock, this properly
excludes migration and the transparent hugepage code from VMA changes done
by mmap/munmap/mprotect/expand_stack/etc...
Unfortunately, this requires adding a new lock (mm->anon_vma_chain_lock),
otherwise we have an unavoidable lock ordering conflict. This changes the
locking rules for the "same_vma" list to be either mm->mmap_sem for write,
or mm->mmap_sem for read plus the new mm->anon_vma_chain lock. This limits
the place where the new lock is taken to 2 locations - anon_vma_prepare and
expand_downwards.
Document the locking rules for the same_vma list in the anon_vma_chain and
remove the anon_vma_lock call from expand_upwards, which does not need it.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
Posted quickly as an RFC patch, only compile tested so far.
Andrea, Mel, does this look like a reasonable approach?
v3:
- change anon_vma_unlock into a macro so lockdep works right
- fix lock ordering in anon_vma_prepare
v2:
- also change anon_vma_unlock to walk the loop
- add calls to anon_vma_lock & anon_vma_unlock to vma_adjust
- introduce a new lock for the vma->anon_vma_chain list, to prevent
the lock inversion that Andrea pointed out
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..a0679c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -239,6 +239,7 @@ struct mm_struct {
int map_count; /* number of VMAs */
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock; /* Protects page tables and some counters */
+ spinlock_t anon_vma_chain_lock; /* Protects vma->anon_vma_chain, with mmap_sem */
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..703c472 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -52,11 +52,15 @@ struct anon_vma ...This patch makes things simple. So I like this. Actually, I wanted this all-at-once locks approach. But I was worried about that how the patch affects AIM 7 workload which is cause of anon_vma_chain about scalability by Rik. But now Rik himself is sending the patch. So I assume the patch couldn't decrease scalability of the workload heavily. Let's wait result of test if Rik doesn't have a problem of AIM7. -- Kind regards, Minchan Kim --
The thing is, the number of anon_vmas attached to a VMA is small (depth of the tree, so for apache or aim the typical depth is 2). This N is between 1 and 3. The problem we had originally is the _width_ of the tree, where every sibling process was attached to the same anon_vma and the rmap code had to walk the page tables of all the processes, for every privately owned page in each child process. For large server workloads, this N is between a few hundred and a few thousand. What matters most at this point is correctness - we need to be able to exclude rmap walks when messing with a VMA in any way that breaks lookups, because rmap walks for page migration and hugepage conversion have to be 100% reliable. That is not a constraint I had in mind with the original anon_vma changes, so the code needs to be fixed up now... I suspect that taking one or two extra spinlocks in the code paths changed by this patch (mmap/munmap/...) is going to make a difference at all, since all of those paths are pretty infrequently taken. -- All rights reversed --
Yes. I understand it. When you tried anon_vma_chain patches as I pointed out, what I have a concern is parent's vma not child's one. The vma of parent still has N anon_vma. AFAIR, you said it's trade-off and would be good than old at least. I agreed. But I just want to remind you because this makes worse. :) The corner case is that we have to hold locks of N. Do I miss something? Really, Can't we ignore that case latency although this happen infrequently? I am not against this patch. I just want to listen your opinion. -- Kind regards, Minchan Kim --
No, it is the other way around. The anon_vma of the parent is also present in all of the children, so the parent anon_vma is attached to N vmas. However, the parent vma only has 1 anon_vma attached to it, and each of the children will have 2 anon_vmas. That is what should keep any locking overhead with this patch minimal. Yes, a deep fork bomb can slow itself down. Too bad, don't do that :) -- All rights reversed --
me/ slaps self. It's about height of tree and I can't imagine high height of scenario.(fork->fork->fork->...->fork) So as Rik pointed out, It's a not big overhead about latency latency, at least. I think. I supports this approach. -- Kind regards, Minchan Kim --
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
In vma_adjust(), what prevents something like rmap_map seeing partial
updates while the following lines execute?
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
if (adjust_next) {
next->vm_start += adjust_next << PAGE_SHIFT;
next->vm_pgoff += adjust_next;
}
They would appear to happen outside the lock, even with this patch. The
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
Assuming you are referring to migration, it's easiest to just not migrate pages within the stack until after shift_arg_pages runs. The locks cannot be held during move_page_tables() because the page allocator is called. It could be done in two stages where pages are allocated outside the lock and then passed to move_page_tables() but I don't think increasing the cost of exec() is justified just so a page can be migrated during exec. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
This part does it. :)
----
@@ -578,6 +578,7 @@ again: remove_next = 1 + (end >
next->vm_end);
}
}
+ anon_vma_lock(vma, &mm->mmap_sem);
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -599,6 +600,7 @@ again: remove_next = 1 + (end >
next->vm_end);
vma_prio_tree_insert(vma, root);
flush_dcache_mmap_unlock(mapping);
}
+ anon_vma_unlock(vma);
---
But we still need patch about shift_arg_pages.
--
Kind regards,
Minchan Kim
--
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
At page migration, we replace pte with migration_entry, which has
similar format as swap_entry and replace it with real pfn at the
end of migration. But there is a race with fork()'s copy_page_range().
Assume page migraion on CPU A and fork in CPU B. On CPU A, a page of
a process is under migration. On CPU B, a page's pte is under copy.
CPUA CPU B
do_fork()
copy_mm() (from process 1 to process2)
insert new vma to mmap_list (if inode/anon_vma)
pte_lock(process1)
unmap a page
insert migration_entry
pte_unlock(process1)
migrate page copy
copy_page_range
remap new page by rmap_walk()
pte_lock(process2)
found no pte.
pte_unlock(process2)
pte lock(process2)
pte lock(process1)
copy migration entry to process2
pte unlock(process1)
pte unlokc(process2)
pte_lock(process1)
replace migration entry
to new page's pte.
pte_unlock(process1)
Then, some serialization is necessary. IIUC, this is very rare event but
it is reproducible if a lot of migration is happening a lot with the
following program running in parallel.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#define SIZE (24*1048576UL)
#define CHILDREN 100
int main()
{
int i = 0;
pid_t pids[CHILDREN];
char *buf = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS,
0, 0);
if (buf == MAP_FAILED) {
perror("mmap");
exit(-1);
}
while (++i) {
int j = i % CHILDREN;
if (j == 0) {
printf("Waiting on children\n");
for (j = 0; j < CHILDREN; j++) {
memset(buf, i, SIZE);
if (pids[j] != -1)
waitpid(pids[j], NULL, 0);
}
j = 0;
}
if ((pids[j] = fork()) == 0) {
memset(buf, i, SIZE);
exit(EXIT_SUCCESS);
}
}
munmap(buf, SIZE);
}
copy_page_range() ...Ok I had a first look: rmap_walk will walk process1 first! It's at the head, the vmas with rmap_walk has to lock down process1 before process2, this is the ordering issue I already mentioned in earlier email. So it cannot happen and this patch is unnecessary. The ordering is fundamental and as said anon_vma_link already adds new vmas to the _tail_ of the anon-vma. And this is why it has to add to the tail. If anon_vma_link would add new vmas to the head of the list, the above bug could materialize, but it doesn't so it cannot happen. In mainline anon_vma_link is called anon_vma_chain_link, see the list_add_tail there to provide this guarantee. Because process1 is walked first by CPU A, the migration entry is replaced by the final pte before copy-migration-entry runs. Alternatively if copy-migration-entry runs before before process1 is walked, the migration entry will be copied and found in process 2. Comments welcome. Andrea --
On Wed, 28 Apr 2010 00:22:45 +0200 I already explained this doesn't happend and said "I'm sorry". But considering maintainance, it's not necessary to copy migration ptes and we don't have to keep a fundamental risks of migration circus. So, I don't say "we don't need this patch." Thanks, -Kame --
On Wed, 28 Apr 2010 02:19:11 +0200 why we have to remove migration_pte by rmap_walk() which doesnt' exist ? Anyway, I agree there are no oops. But there are risks because migration is a feature which people don't tend to take care of (as memcg ;) I like conservative approach for this kind of features. Thanks, -Kame --
I already thought this for split_huge_page and because split_huge_page that already waits inside copy_huge_pmd if there's any pmd_trans_splitting bit set, isn't removing the requirement for list_add_tail in anon_vma_[chain_]link this patch also isn't removing it. But it's more complex than my prev explanation now that I think about it more so no wonder it's not clear why yet (my fault). So the thing is, if you patch it this way, the second rmap_walk run by remove_migration_ptes will be safe even if the ordering is inverted, but then the first rmap_walk that establishes the migration entry, will still break if the rmap walk scans the child before the parent. The only objective of the patch is to remove the list_add_tail requirement but I'll show that it's still required by the first rmap_walk in try_to_unmap below: CPU A CPU B fork try_to_unmap try to set migration entry in child (null pte still, nothing done) copy pte and map page in child pte (no migration entry set on child) set migration entry in parent parent ends up with migration entry but child not, breaking migration in another way. so even with the patch, the only way to be safe is that rmap_walk always scans the parent _first_. So this patch is a noop from every angle, and it just moves the ordering requirement from the remove_migration_ptes to try_to_unmap, but it doesn't remove it as far as I can tell. So it only slowdown fork a bit for no good. split_huge_page is about the same. If the ordering is inverted, split_huge_page could set a splitting pmd in the parent only, like the Don't worry, migrate will run 24/7 the moment THP is in ;). What migration really changed (and it's beneficial so split_huge_page isn't introducing new requirement) is that rmap has to be exact at all times, if it was only swap using it, nothing would happen because of the vma_adjust/move_page_tables/vma_adjust not being in an atomic section against the rmap_walk (the anon_vma ...
Oops I must have overlooked it sorry! I just seen the trace quoted in the comment of the patch and that at least would need correction before it can be pushed in mainline, or it creates huge confusion to split_huge_page also has the same requirement and there is no bug to fix, so I don't see why to make special changes for just migrate.c when we still have to list_add_tail for split_huge_page. Furthermore this patch isn't fixing anything in any case and it looks a noop to me. If the order ever gets inverted, and process2 ptes are scanned before process1 ptes in the rmap_walk, sure the copy-page-tables will break and stop until the process1 rmap_walk will complete, but that is not enough! You have to repeat the rmap_walk of process1 if the order ever gets inverted and this isn't happening in the patch so I don't see how it could make any difference even just for migrate.c (obviously not for split_huge_page). --
And after going through it again, I'm happy that this was a red herring. Even if it's not strictly necessary, migration should (and does in this case) cope with being able to find all its migration ptes. An extra one being copied doesn't matter as long as it can be found on the chain. It's not like the -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
In general migration ptes were designed to cause the code encountering them to go to sleep until migration is finished and a regular pte is available. Looks like we are tolerating the handling of migration entries. Never imagined copying page table with this. There could be recursion issues of various kinds because page migration requires operations on the page tables while performing migration. A simple fix would be to *not* This means they are copied / moved etc and "cleaned" up in a state when the page was unlocked. Migration entries are not supposed to exist when a page is not locked. --
There is no bug there, no need to wait either. I already audited it before, and I didn't see any bug. Unless you can show a bug with CPU A running the rmap_walk on process1 before process2, there is no bug to patch 3 is real, and the first thought I had was to lock down the page before running vma_adjust and unlock after move_page_tables. But these are virtual addresses. Maybe there's a simpler way to keep migration away while we run those two operations. --
On Wed, 28 Apr 2010 00:32:42 +0200 Doing some check in move_ptes() after vma_adjust() is not safe. IOW, when vma's information and information in page-table is incosistent...objrmap is broken and migartion will cause panic. Then...I think there are 2 ways. 1. use seqcounter in "mm_struct" as previous patch and lock it at mremap. or 2. get_user_pages_fast() when do remap. Thanks, -Kame --
I've been looking at ways during the day that the anon_vma lock can be held while the page tables are being allocated. The schemes were way too hairy just to cover a migration corner case. As this is particular to exec, I'm wondering if Kamezawa's additional proposal of just skipping migration of pages within the temporary stack might be the best solution overall in terms of effectiveness and simplicity. His patch introduced a new variable to the VMA but it shouldn't be necessary and it altered vma_address which is unnecessary. Here is a different version of the same basic idea to skip temporary VMAs during migration. Maybe go with this? (As a heads-up, I'll also be going offline in about 24 hours until Tuesday morning. The area I'm in has zero internet access) ==== CUT HERE ==== mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Page migration requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it is possible to leave one behind if the second rmap_walk fails to find the page. If this page is later faulted, migration_entry_to_page() will call BUG because the page is locked indicating the page was migrated by the migration PTE not cleaned up. There is a race between shift_arg_pages and migration that allows this bug to trigger. A temporary stack is setup during exec and later moved. If migration moves a page in the temporary stack and the VMA is then removed, the migration PTE may not be found leading to a BUG when the stack is faulted. Ideally, shift_arg_pages must run atomically with respect of rmap_walk by holding the anon_vma lock but this is problematic as pages must be allocated for page tables. Instead, this patch identifies when it is about to migrate pages from a temporary stack and leaves them alone. Memory hot-remove will try again, sys_move_pages() wouldn't be operating during exec() time and memory compaction ...
The assumptions on the vm flags is of course totally wrong. VM_EXEC might be applied as well as default flags from the mm. The following is the same basic idea, skip VMAs belonging to processes in exec rather than trying to hold anon_vma->lock across move_page_tables(). Not tested yet. ==== CUT HERE ==== mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Page migration requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it is possible to leave one behind if the second rmap_walk fails to find the page. If this page is later faulted, migration_entry_to_page() will call BUG because the page is locked indicating the page was migrated by the migration PTE not cleaned up. There is a race between shift_arg_pages and migration that allows this bug to trigger. A temporary stack is setup during exec and later moved. If migration moves a page in the temporary stack and the VMA is then removed, the migration PTE may not be found leading to a BUG when the stack is faulted. Ideally, shift_arg_pages must run atomically with respect of rmap_walk by holding the anon_vma lock but this is problematic as pages must be allocated for page tables. Instead, this patch skips processes in exec by making an assumption that a mm with stack_vm == 1 and total_vm == 1 is a process in exec() that hasn't finalised the temporary stack yet. Memory hot-remove will try again, sys_move_pages() wouldn't be operating during exec() time and memory compaction will just continue to another page without concern. [kamezawa.hiroyu@jp.fujitsu.com: Idea for having migration skip temporary stacks] Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/rmap.c | 28 +++++++++++++++++++++++++++- 1 files changed, 27 insertions(+), 1 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 85f203e..9e20188 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1141,6 +1141,18 @@ static int ...
This is better than the other, that made it look like people could set broken rmap at arbitrary times, at least this shows it's ok only during execve before anything run, but if we can't take the anon-vma lock really better than having to make these special checks inside every rmap_walk that has to be accurate, we should just delay the linkage of the stack vma into its anon-vma, until after the move_pages. --
Is it possible to delay the linkage like that? As arguments get copied into the temporary stack before it gets moved, I'd have expected the normal fault path to prepare and attach the anon_vma. We could special case it but that isn't very palatable either. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
I'm not sure what is more palatable, but I feel it should be fixed in execve. --
Ok the best idea so far I had is to add a fake temporary fake vma to the anon_vma list with the old vm_start and same vm_pgoff before shifting down vma->vm_start and calling move_page_tables. Then after the move is complete we remove the fake vma. So all the fast paths will remain unmodified and no magic is required. I'll try to fix this for the old stable anon-vma code and test in aa.git first as the code will differ. If it works ok anybody can port it to new anon-vma code. --
here the new try for aa.git or 2.6.33 anon-vma code. ---- Subject: fix race between shift_arg_pages and rmap_walk From: Andrea Arcangeli <aarcange@redhat.com> migrate.c requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it can't be removed if the second rmap_walk fails to find the page. And split_huge_page() will have the same requirements as migrate.c already has. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- diff --git a/fs/exec.c b/fs/exec.c --- a/fs/exec.c +++ b/fs/exec.c @@ -55,6 +55,7 @@ #include <linux/fsnotify.h> #include <linux/fs_struct.h> #include <linux/pipe_fs_i.h> +#include <linux/rmap.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -502,7 +503,9 @@ static int shift_arg_pages(struct vm_are unsigned long length = old_end - old_start; unsigned long new_start = old_start - shift; unsigned long new_end = old_end - shift; + unsigned long moved_length; struct mmu_gather *tlb; + struct vm_area_struct *tmp_vma; BUG_ON(new_start > new_end); @@ -514,16 +517,36 @@ static int shift_arg_pages(struct vm_are return -EFAULT; /* + * We need to create a fake temporary vma and index it in the + * anon_vma list in order to allow the pages to be reachable + * at all times by the rmap walk for migrate, while + * move_page_tables() is running. + */ + tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); + if (!tmp_vma) + return -ENOMEM; + *tmp_vma = *vma; + + spin_lock(&vma->anon_vma->lock); + /* * cover the whole range: [new_start, old_end) */ - vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL); + vma->vm_start = new_start; + __anon_vma_link(tmp_vma); + spin_unlock(&vma->anon_vma->lock); /* * move the page tables downwards, on failure we rely on * process cleanup to remove whatever mess we made. */ - if (length != move_page_tables(vma, old_start, - vma, new_start, ...
I see there is a large discussion on that patch so I'll read that rather than commenting here. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
