Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information

Previous thread: Re: [PATCH 4/4] KVM MMU: do not intercept invlpg if 'oos_shadow' is disabled by Xiao Guangrong on Wednesday, May 5, 2010 - 5:54 am. (2 messages)

Next thread: [PATCH 0/6] BKL ioctl pushdown by John Kacur on Wednesday, May 5, 2010 - 6:15 am. (21 messages)
From: Mel Gorman
Date: Wednesday, May 5, 2010 - 6:14 am

So, V4 wasn't to everyone liking so here is another variation of the fix
needed for migration-related races on VMA adjustments. From V4, patch 1 is
a different approach where as patch 2 is the same except a minor bug on
anon_vma ordering is fixed. If these can be agreed upon, it'd be nice to
get a fix in for 2.6.34.

Changelog since V4
  o Switch back anon_vma locking to put bulk of locking in rmap_walk
  o Fix anon_vma lock ordering in exec vs migration race

Changelog since V3
  o Rediff against the latest upstream tree
  o Improve the patch changelog a little (thanks Peterz)

Changelog since V2
  o Drop fork changes
  o Avoid pages in temporary stacks during exec instead of migration pte
    lazy cleanup
  o Drop locking-related patch and replace with Rik's

Changelog since V1
  o Handle the execve race
  o Be sure that rmap_walk() releases the correct VMA lock
  o Hold the anon_vma lock for the address lookup and the page remap
  o Add reviewed-bys

Broadly speaking, migration works by locking a page, unmapping it, putting
a migration PTE in place that looks like a swap entry, copying the page and
remapping the page removing the old migration PTE before unlocking the page.
If a fault occurs, the faulting process waits until migration completes.

The problem is that there are some races that either allow migration PTEs
to be left left behind. Migration still completes and the page is unlocked
but later a fault will call migration_entry_to_page() and BUG() because the
page is not locked. It's not possible to just clean up the migration PTE
because the page it points to has been potentially freed and reused. This
series aims to close the races.

Patch 1 of this series is about the of locking of anon_vma in migration
versus vma_adjust. While I am not aware of any reproduction cases, it is
potentially racy. This patch is an alternative to Rik's more comprehensive
locking approach posted at http://lkml.org/lkml/2010/5/3/155 and uses
trylock-and-retry logic in rmap_walk ...
From: Mel Gorman
Date: Wednesday, May 5, 2010 - 6:14 am

vma_adjust() is updating anon VMA information without locks being taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with users of rmap_walk such as page migration.
vma_address() can return -EFAULT for an address that will soon be valid.
For migration, this potentially leaves a dangling migration PTE behind
which can later cause a BUG_ON to trigger when the page is faulted in.

With the recent anon_vma changes, there can be more than one anon_vma->lock
to take in a anon_vma_chain but a second lock cannot be spinned upon in case
of deadlock. The rmap walker tries to take locks of different anon_vma's
but if the attempt fails, locks are released and the operation is restarted.

For vma_adjust(), the locking behaviour prior to the anon_vma is restored
so that rmap_walk() can be sure of the integrity of the VMA information and
lists when the anon_vma lock is held. With this patch, the vma->anon_vma->lock
is taken if

	a) If there is any overlap with the next VMA due to the adjustment
	b) If there is a new VMA is being inserted into the address space
	c) If the start of the VMA is being changed so that the
	   relationship between vm_start and vm_pgoff is preserved
	   for vma_address()

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/ksm.c  |   22 ++++++++++++++++++++--
 mm/mmap.c |    9 +++++++++
 mm/rmap.c |   28 +++++++++++++++++++++++-----
 3 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 3666d43..0c09927 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1668,15 +1668,28 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
 again:
 	hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
+		struct anon_vma *locked_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
 		spin_lock(&anon_vma->lock);
 		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
 			vma = ...
From: Linus Torvalds
Date: Wednesday, May 5, 2010 - 7:34 am

Btw, is this really needed?

Nobody else takes two anon_vma locks at the same time, so in order to 
avoid ABBA deadlocks all we need to guarantee is that rmap_walk_ksm() and 
rmap_walk_anon() always lock the anon_vma's in the same order.

And they do, as far as I can tell. How could we ever get a deadlock when 
we have both cases doing the locking by walking the same_anon_vma list?

	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {

So I think the "retry" logic looks unnecessary, and actually opens us up 
to a possible livelock bug (imagine a long chain, and heavy page fault 
activity elsewhere that ends up locking some anon_vma in the chain, and 
just the right behavior that gets us into a lockstep situation), rather 
than fixing an ABBA deadlock.

Now, if it's true that somebody else _does_ do nested anon_vma locking, 
I'm obviously wrong. But I don't see such usage.

Comments?

				Linus
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 7:56 am

I could not convince myself that it wasn't. lockdep throws a fit if you try

rmap_walk() appears to be the only one that takes multiple locks but it itself
is not serialised. If there are more than one process calling rmap_walk()
on different processes sharing the same VMAs, is there a guarantee they walk
it in the same order? I didn't think so at the time the patch because the
anon_vma the walk starts from is based on the page being migrated rather

If we always started the list walk in the same place then it'd be fine but

I imagined it and I'm not super-happy about it. It's one of the reasons Rik

Just what I have above. I couldn't convince myself that two callers to
rmap_walk from pages based on different VMAs on the same_anon_vma list would
always started in the same place.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Linus Torvalds
Date: Wednesday, May 5, 2010 - 8:31 am

So I had this notion of the list always getting deeper and us guaranteeing 
the order in it, but you're right - that's not the 'same_anon_vma' list, 
it's the 'same_vma' one.

Damn. So yeah, I don't see us guaranteeing any ordering guarantees. My 
bad.

That said, I do wonder if we could _make_ the ordering reliable. I did 
that for the 'same_vma' one, because I wanted to be able to verify that 
chains were consistent (and we also needed to be able to find the "oldest 
anon_vma" for the case of re-instantiating pages that migth exist in 
multiple different anon_vma's).

Any ideas?

		Linus
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 8:54 am

I'm still thinking of the ordering but one possibility would be to use a mutex
similar to mm_all_locks_mutex to force the serialisation of rmap_walk instead
of the trylock-and-retry. That way, the ordering wouldn't matter. It would
slow migration if multiple processes are migrating pages by some unknowable

Not yet.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrea Arcangeli
Date: Wednesday, May 5, 2010 - 9:13 am

Rik's patch that takes the locks in the faster path is preferable to
me, it's just simpler, you know the really "strong" long is the
page->mapping/anon_vma->lock and nothing else. You've a page, you take
that lock, you're done for that very page.

Sure that means updating vm_start/vm_pgoff then requires locking all
anon_vmas that the vma registered into, but that's conceptually
simpler and it doesn't alter the page_lock_anon_vma semantics. Now I
wonder if you said the same_anon_vma is in order, but the same_vma is
not, if it's safe to lock the same_vma in list order in anon_vma_lock,
I didn't experience problems on the anon_vma_chain branch but
anon_vma_lock disables all lockdep lock inversion checking.
--

From: Peter Zijlstra
Date: Wednesday, May 5, 2010 - 12:11 pm

So how's that going to work out for my make anon_vma->lock a mutex
patches?

--

From: Andrea Arcangeli
Date: Wednesday, May 5, 2010 - 12:57 pm

I'm not seeing much problem after all, even if you only switch the
anon_vma->lock (you switch both so it's quite different), unmap_vmas
may end up calling split_huge_page_pmd in zap_pmd_range only if the
vma is full anonymous (I don't allow hugepages in MAP_PRIVATE yet) so
there would be no i_mmap_lock held. But clearly if you switch _both_
it's even safer. In any case when we make that change, it'll require
to call split_huge_page_pmd and split_huge_page only in preemptive
points, and there is no such requirement today, and clearly when all
vm locking goes preemptive it'll be much natural and lower risk to
remove that requirement from split_huge_page too.

Also I think if we start taking mutex in anon_vma the i_mmap_lock
should switch too at the same time. I suspect it's an arbitrary choice
that we've to take always the i_mmap_lock before the anon_vma locks in
mmap.c so it makes sense they move in tandem.
--

From: Andrea Arcangeli
Date: Thursday, May 20, 2010 - 5:27 pm

If you're interested I can include your patchset after memory
compaction in aa.git, far from the ideal path for merging but ideal if
you want to test together with the full thing (memory compaction,
split_huge_page as you wondered just above etc..) and hopefully give
it more testing.

Note: I'm not sure if it's the right way to go, in fact I'm quite
skeptical, not because it won't work, but ironically the main reason
I'm interested is to close the XPMEM requirements the right way (not
with page pins and deferred async invalidates), as long as we've users
asking for rescheduling in mmu notifier methods this is the only way
to go. Initially I thought it had to be a build time option, but
seeing you doing it by default and for totally different reasons, I'm
slightly more optimistic it can be the default and surely XPMEM will
love it... the fact these locks are smarter helps a lot too.
--

From: Mel Gorman
Date: Thursday, May 6, 2010 - 3:37 am

The hatchet-job mutex is off the table so it's down to

start-with-root-anon_vma-and-lock-in-order-when-walking-list (what I last posted)
take-all-anon_vma-locks-when-changing-vmas (Rik's)
use-seq-counter-to-spot-changes-to-VMAs-when-walking-list (Kamezawa-san's approach)

Any strong preference?

I still haven't read the other comments Linus made so I don't have a strong
preference yet. Either Rik's or the patch I posted should be enough for

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Linus Torvalds
Date: Wednesday, May 5, 2010 - 10:34 am

Hmm.. An idea is starting to take form..

How about something like this?

 - the lock is per-anon_vma

BUT

 - you always lock the _deepest_ anon_vma you can find.

That means just a single lock. And the "deepest" anon_vma is well-defined 
for all anon_vma's, because each same_anon_vma chain is always rooted in 
the original anon_vma that caused it.

From the vma, it's simply
	avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
	anon_vma = avc->anon_vma;

and once you take that lock, you know you've gotten the lock for all 
chains related to that page. We _know_ that every single vma that is 
associated with that anon_vma must have a chain that eventually ends in 
that entry.

So I wonder if the locking can't be just something like this:

   struct anon_vma *lock_anon_vma_root(struct page *page)
   {
	struct anon_vma *anon_vma, *root;

	rcu_read_lock();
	anon_vma = page_anon_vma(page);
	if (!anon_vma)
		return ret;
	/* Make sure the anon_vma 'same_anon_vma' list is stable! */
	spin_lock(&anon_vma->lock);
	root = NULL;
	if (!list_empty(&anon_vma->head)) {
		struct anon_vma_chain *avc;
		struct vm_area_struct *vma;
		struct anon_vma *root;
		avc = list_first_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
		vma = avc->vma;
		avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
		root = avc->anon_vma;
	}
	/* We already locked it - anon_vma _was_ the root */
	if (root == anon_vma)
		return root;
	spin_unlock(&anon_vma->lock);
	if (root) {
		spin_lock(&root->lock);
		return root;
	}
	rcu_read_unlock();
	return NULL;
   }

and

   void unlock_anon_vma_root(struct anon_vma *root)
   {
	spin_unlock(&root->lock);
	rcu_read_unlock();
   }

or something. I agree that the above is not _beautiful_, and it's not 
exactly simple, but it does seem to have the absolutely huge advantage 
that it is a nice O(1) thing that only ever takes a single lock and has no 
nesting. And while the code ...
From: Linus Torvalds
Date: Wednesday, May 5, 2010 - 10:57 am

To clarify: here "is associated with that anon_vma" is basically about the 
whole forest of anon_vma/vma links. Different vma's can be associated with 
different anon_vma's, and pages can be associated with anon_vma's that can 
in turn reach other anon_vma's and many other vma's.

But regardless of _how_ you walk the chains between anon_vma's and vma's 
(including walking "back-pointers" that don't even exist except 
conceptually for the pointer going the other way), any relationship will 
have started at _some_ root vma.

IOW, the root anon_vma is directly 1:1 related to "do these vma/anon_vma's 
relate in _any_ way". If it's the same root anon_vma, then there is a 
historical relationship. And if the root anon_vma's are different, then 
there cannot be any relationship at all.

So locking the root anon_vma is both minimal _and_ sufficient. Any locking 
scheme (traversing the lists) would eventually end up hitting that root 
entry (minimal locking), and at the same time that root entry is also 
guaranteed to be the same for all related entities (ie it's sufficient to 
lock the root entry if everybody else also looks up their root and locks 
it).

I think. Tell me if I'm wrong.

			Linus
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 11:14 am

In the direction I was taking, only rmap_walk took the deepest lock (I called
it oldest but hey) and would take other anon_vma locks as well. The objective
was to make sure the order the locks were taken in was correct.

I think you are suggesting that any anon_vma lock that is taken should always
take the deepest lock. Am I right and is that necessary? The downsides is that
there is a single lock that is hotter. The upside is that rmap_walk no longer


That was my understanding but I'm still not sure of my footing on the

Not sure if this is necessary. In the case of rmap_walk(), it's already
held. In the other cases, a semaphore is held which should prevent the first

Can it be empty? I didn't think it was possible as the anon_vma must

Other than terminology and minor implementation details, this is more or

This is the only significant point we differ on. Do we;

1. Always take the deepest anon_vma lock using the lock anon_vma lock
   just to protect the list long enough to find it?

or

2. Only have rmap_walk use the deepest anon_vma lock just so it takes
   multiple locks in the correct order?

I am not seeing a killer advantage of one over the order. While (2)
would mean the same lock is always taken - it's a double lock to get it
"lock local, find deepest, lock deepest, release local" which adds a
small overhead to the common path. With 1, the searching around is
confined to migration.

Andrea, is there an advantage of one over the other for you or is Rik's
approach just better overall?


Well, for what it works, the basic principal appears to work. At least,

Considering that we came up with more or less the same idea, the basic
idea of "lock the deepest anon_vma" must be sound :/

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Linus Torvalds
Date: Wednesday, May 5, 2010 - 11:34 am

I could personally go either way, I don't really care that deeply.

I think you could easily just take the root lock in the rmap_walk_anon/ksm 
paths, and _also_ take the individual locks as you walk it (safe, since 
now the root lock avoids the ABBA issue - you only need to compare the 
individual lock against the root lock to not take it twice, of course).

Or you could take the "heavy lock" approach that Andrea was arguing for, 
but rather than iterating you'd just take the root lock.

I absolutely _hated_ the "iterate over all locks in the normal case" idea, 
but with the root lock it's much more targeted and no longer is about 
nested locks of the same type.

So the things I care about are just:

 - I hate that "retry" logic that made things more complex and had the 
   livelock problem.

   The "root lock" helper function certainly wouldn't be any fewer lines 
   than your retry version, but it's a clearly separate locking function, 
   rather than mixed in with the walking code. And it doesn't do livelock.

 - I detest "take all locks" in normal paths. I'm ok with it for special 
   case code (and I think the migrate code counts as special case), but I 
   think it was really horribly and fundamentally wrong in that "mm: Take 
   all anon_vma locks in anon_vma_lock" patch I saw.

but whether we want to take the root lock in "anon_vma_lock()" or not is 
just a "detail" as far as I'm concerned. It's no longer "horribly wrong". 
It might have scalability issues etc, of course, but likely only under 
insane loads.


Ok, so that was answered in the other email - I think it's necessary in 
the general case, although depending on exactly _how_ the page was looked 
up, that may not be true.

If you have guarantees that the page is still mapped (thanks for page 
table lock or something) and the anon_vma can't go away (just a read lock 
on a mm_sem that was used to look up the page would also be sufficient), 
that list_empty() check is unnecessary.

So it's a bit ...
From: Mel Gorman
Date: Thursday, May 6, 2010 - 4:03 am

Initially, I thought the problem with this was making the root anon_vma
lock hotter. It didn't seem that big of a deal but it was there. The greater
problem was that the RCU lock is needed to exchange the local anon_vma lock
with the root anon_vma lock. So the "heavy lock" approach is actually quite
heavy because it involves two spinlocks, the RCU lock and the root lock
being hotter.

Right now, I'm thinking that only rmap_walk taking the root anon_vma
lock and taking multiple locks as it walks is nicer. I believe it's
sufficient for migration but it also needs to be sufficient for



I think this approach of always taking the root lock would be neater in a
number of respects because from a page, there would be the "one true lock".
If RCU was not involved, it would be particularly nice.

Part of PeterZ's "replace anon_vma lock with mutex" involves proper
reference counting of anon_vma. If even the reference count part was
polished, it would allow us to always take the root anon_vma lock
without RCU because it would be

lock local_anon_vma
find root_anon_vma
get root_anon_vma
unlock local_anon_vma
lock root anon_vma
put root_anon_vma

So maybe when anon_vma is reference counted, it'd be best to switch to
always locking the root anon_vma.

For the moment though, I reckon it's best to only have rmap_walk

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Rik van Riel
Date: Thursday, May 6, 2010 - 6:40 am

It should work, but only if we always take the deepest
anon_vma lock.

Not just in the migration code, but also in mmap, munmap,
mprotect (for split_vma), expand_stack, etc...

Otherwise we will still not provide exclusion of migrate
vs. those events.

I'm guessing that means changing both anon_vma_lock and
page_lock_anon_vma to always take the deepest anon_vma
lock - not introducing a new function that is only called
by the migration code.

-- 
All rights reversed
--

From: Mel Gorman
Date: Thursday, May 6, 2010 - 6:45 am

Are you sure?

I thought this as well but considered a situation something like

root anon_vma          <--- rmap_walk starts here
     anon_vma a
     anon_vma b
     anon_vma c        <--- an munmap/mmap/mprotect/etc here
     anon_vma d
     anon_vma e

The rmap_walk takes the root lock and then locks a, b, c, d and e as it
walks along.

The mSomething event happens on c and takes the lock

if rmap_walk gets there first, it takes the lock and the mSomething
event waits until the full rmap_walk is complete (delayed slightly but
no biggie).

if mSomething gets there first, rmap_walk will wait on taking the lock.
Again, there could be some delays but no biggie.


That would be the case all right but I'd prefer to have PeterZ's patches
that do full reference counting of anon_vma first instead of introducing
RCU to those paths.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 10:53 am

If the same_vma list is properly ordered then maybe something like the
following is allowed?

(This patch is on top of mm,migration: Prevent rmap_walk_[anon|ksm]
seeing the wrong VMA information)

At least it booted and did not immediately kill itself during migration.
It's less clear what to do for KSM but I'm ignoring it for the moment.

==== CUT HERE ====
mm,migration: In rmap_walk, always take the locks in the same order

---
 include/linux/rmap.h |   32 ++++++++++++++++++++++++++++++++
 mm/ksm.c             |    5 +----
 mm/rmap.c            |   13 ++-----------
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7721674..749aaca 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -99,6 +99,38 @@ static inline struct anon_vma *page_anon_vma(struct page *page)
 	return page_rmapping(page);
 }
 
+static inline struct anon_vma *page_anon_vma_lock_oldest(struct page *page)
+{
+	struct anon_vma *anon_vma, *oldest_anon_vma;
+	struct anon_vma_chain *avc, *oldest_avc;
+
+	/* Get the pages anon_vma */
+	if (((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) !=
+					    PAGE_MAPPING_ANON)
+		return NULL;
+	anon_vma = page_rmapping(page);
+	if (!anon_vma)
+		return NULL;
+
+	spin_lock(&anon_vma->lock);
+
+	/*
+	 * Get the oldest anon_vma on the list by depending on the ordering
+	 * of the same_vma list setup by __page_set_anon_rmap
+	 */
+	avc = list_entry(&anon_vma->head, struct anon_vma_chain, same_anon_vma);
+	oldest_avc = list_entry(avc->vma->anon_vma_chain.prev,
+				struct anon_vma_chain, same_vma);
+	oldest_anon_vma = oldest_avc->anon_vma;
+
+	if (anon_vma != oldest_anon_vma) {
+		spin_lock(&oldest_anon_vma->lock);
+		spin_unlock(&anon_vma->lock);
+	}
+
+	return oldest_anon_vma;
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
diff --git a/mm/ksm.c b/mm/ksm.c
index 0c09927..113f972 100644
--- ...
From: Linus Torvalds
Date: Wednesday, May 5, 2010 - 11:02 am

We're not guaranteed that the 'anon_vma->head' list is non-empty.

Somebody could have freed the list and the anon_vma and we have a stale 
'page->anon_vma' (that has just not been _released_ yet). 

And shouldn't that be 'list_first_entry'? Or &anon_vma->head.next?

How did that line actually work for you? Or was it just a "it boots", but 
no actual testing of the rmap walk?

			Linus
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 11:17 am

Ahh, fair point. I asked in the other mail was the empty list check


It's currently running a migration-related stress test without any
deadlocks or lockdep wobbly so far.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Wednesday, May 5, 2010 - 5:22 pm

This is what I just started testing on a 4-core machine. Lockdep didn't
complain but there are two potential sources of badness in anon_vma_lock_root
marked with XXX. The second is the most important because I can't see how the
local and root anon_vma locks can be safely swapped - i.e. release local and
get the root without the root disappearing. I haven't considered the other
possibilities yet such as always locking the root anon_vma. Going to
sleep on it.

Any comments?

==== CUT HERE ====
mm,migration: Prevent rmap_walk_[anon|ksm] seeing the wrong VMA information

vma_adjust() is updating anon VMA information without locks being taken.
In contrast, file-backed mappings use the i_mmap_lock and this lack of
locking can result in races with users of rmap_walk such as page migration.
vma_address() can return -EFAULT for an address that will soon be valid.
For migration, this potentially leaves a dangling migration PTE behind
which can later cause a BUG_ON to trigger when the page is faulted in.

With the recent anon_vma changes, there can be more than one anon_vma->lock to
take in a anon_vma_chain but as the order of anon_vmas cannot be guaranteed,
rmap_walk cannot take multiple locks. This patch has rmap_walk start
by locking the anon_vma lock associated with a page. It then finds the
"root" anon_vma using the anon_vma_chains same_vma list as it is strictly
ordered. The root anon_vma lock is taken and rmap_walk traverses the
list. This allows multiple locks to be taken as the list is always traversed
in the same direction.

For vma_adjust(), the locking behaviour prior to the anon_vma is restored
so that rmap_walk() can be sure of the integrity of the VMA information and
lists when the anon_vma lock is held. With this patch, the vma->anon_vma->lock
is taken if

	a) If there is any overlap with the next VMA due to the adjustment
	b) If there is a new VMA is being inserted into the address space
	c) If the start of the VMA is being changed so that the
	   relationship ...
From: Linus Torvalds
Date: Wednesday, May 5, 2010 - 5:42 pm

No, that can't happen. If you find an avc struct, it _will_ have a 
anon_vma pointer. So there's no point in testing for NULL. If some bug 

What makes this ok is the fact that it must be running under the RCU read 
lock, and anon_vma's thus cannot be released. My version of the code made 
that explicit. Yours does not, and doesn't even have comments about the 
fact that it needs to be called RCU read-locked. Tssk, tssk.

Please don't just assume locking. Either lock it, or say "this must be 
called with so-and-so held". Not just a silent "this would be buggy if 
anybody ever called it without the RCU lock".

		Linus
--

From: Mel Gorman
Date: Thursday, May 6, 2010 - 3:02 am

Good. If this returns NULL, it should oops when spin_lock(NULL->lock)

This is very subtle in itself. RCU guarantees that the anon_vma exists
but does it guarantee that it's the same one we expect and that it


Sure. It was an oversight when merging what I had with what you posted
up.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Linus Torvalds
Date: Thursday, May 6, 2010 - 7:15 am

Nothing. And we shouldn't care.

If it's been freed and re-used, then all the anon_vma's (and vma's) 
associated with the original anon_vma (and page) have been free'd.

And that, in turn, means that we don't really need to lock anything at 
all. The fact that we end up locking an anon_vma that _used_ to be the 
root anon_vma is immaterial - the lock won't _help_, but it shouldn't hurt 
either, since it's still a valid spinlock.

Now, the above is only true as far as the anon_vma itself is concerned. 
It's entirely possible that any _other_ data structures would need to be 
double-checked after getting the lock. For example, is the _page_ still 
associated with that anon_vma? But that's an external issue as far as the 
anon_vma locking is concerned - presumably the 'rmap_walk()' caller will 
have made sure that the page itself is stable somehow.

				Linus
--

From: Mel Gorman
Date: Thursday, May 6, 2010 - 7:25 am

It does, by having the page locked as it performs the walk.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Thursday, May 6, 2010 - 2:47 am

Dumb question.

I can't understand why we should use list_first_entry.

I looked over the code.
anon_vma_chain_link uses list_add_tail so I think that's right.
But anon_vma_prepare uses list_add. So it's not consistent.
How do we make sure list_first_entry returns deepest vma?

Sorry if I am missing.


-- 
Kind regards,
Minchan Kim
--

From: Mel Gorman
Date: Thursday, May 6, 2010 - 2:54 am

list_first_entry is not getting the root (what you called deepest but lets
pick a name and stick with it or this will be worse than it already is). That
list_first entry is what gets us from

local anon_vma -> avc for the local anon_vma -> local vma

It's the ordering of the same_vma list hanging off the local_vma that is
important according to this comment in __page_set_anon_rmap

                /*
                 * The page may be shared between multiple processes.
                 * We must use the _oldest_ possible anon_vma for the
                 * page mapping!  That anon_vma is guaranteed to be
                 * present in all processes that could share this page.
                 *
                 * So take the last AVC chain entry in the vma, which is the
                 * deepest ancestor, and use the anon_vma from that.
                 */
                avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma);
                anon_vma = avc->anon_vma;

Not at all. The more people that look at this the better.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Thursday, May 6, 2010 - 3:01 am

Yes. Sorry for confusing word. :)
Let's have a question again. What I have a question is that why we




-- 
Kind regards,
Minchan Kim
--

From: Mel Gorman
Date: Thursday, May 6, 2010 - 3:10 am

Nothing other than it's easier to read and a bit more self-documenting
than;


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Linus Torvalds
Date: Thursday, May 6, 2010 - 7:06 am

It's not that we "should" use list_entry_first. It's that we want to find 
_any_ entry on the list, and the most natural one is the first one.

So we could take absolutely any 'avc' entry that is reachable from the 
anon_vma, and use that to look up _any_ 'vma' that is associated with that 
anon_vma. And then, from _any_ of those vma's, we know how to get to the 
"root anon_vma" - the one that they are all associated with.

So no, there's absolutely nothing special about the first entry. It's 
just a random easily found one.

		Linus
--

From: Minchan Kim
Date: Thursday, May 6, 2010 - 8:59 am

On Thu, May 6, 2010 at 11:06 PM, Linus Torvalds

Thanks, Linus and Mel.
You understood my question correctly. :)

My concern was following case.

Child process does mmap new VMA but anon_vma is reused nearer child's
VMA which is linked parent's VMA by fork.
In that case, anon_vma_prepare calls list_add not list_add_tail.
ex) list_add(&avc->same_anon_vma, &anon_vma->head);

It means list_first_entry is the new VMA not old VMA and new VMA's
root_avc isn't linked at parent's one. It means we are locking each
other locks. That's why I have a question.

But I carefully looked at the reusable_anon_vma and found
list_is_singular. I remember Linus changed it to make problem simple.
So in my scenario, new VMA can't share old VMA's anon_vma.

So my story is broken.
If I miss something, please, correct me. :)

-- 
Kind regards,
Minchan Kim
--

From: KAMEZAWA Hiroyuki
Date: Thursday, May 6, 2010 - 12:38 am

On Wed,  5 May 2010 14:14:40 +0100

I'm sorry I couldn't catch all details but can I make a question ?
Why seq_counter is bad finally ? I can't understand why we have
to lock anon_vma with risks of costs, which is mysterious struct now.

Adding a new to mm_struct is too bad ?

Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

At treating rmap, there is no guarantee that "rmap is always correct"
because vma->vm_start, vma->vm_pgoff are modified without any lock.

In usual, it's not a problem that we see incosistent rmap at 
try_to_unmap() etc...But, at migration, this temporal inconsistency
makes rmap_walk() to do wrong decision and leaks migration_pte.
This causes BUG later.

This patch adds seq_counter to mm-struct(not vma because inconsistency
information should cover multiple vmas.). By this, rmap_walk()
can always see consistent [start, end. pgoff] information at checking
page's pte in a vma.

In exec()'s failure case, rmap is left as broken but we don't have to
take care of it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/exec.c                |   20 +++++++++++++++-----
 include/linux/mm_types.h |    2 ++
 mm/mmap.c                |    3 +++
 mm/rmap.c                |   13 ++++++++++++-
 4 files changed, 32 insertions(+), 6 deletions(-)

Index: linux-2.6.34-rc5-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.34-rc5-mm1.orig/include/linux/mm_types.h
+++ linux-2.6.34-rc5-mm1/include/linux/mm_types.h
@@ -14,6 +14,7 @@
 #include <linux/page-debug-flags.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
+#include <linux/seqlock.h>
 
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
@@ -310,6 +311,7 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+	seqcount_t	rmap_consistent;
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
Index: ...
From: Mel Gorman
Date: Thursday, May 6, 2010 - 2:46 am

It's not the biggest problem. I'm not totally against this approach but
some of the problems I had were;

1. It introduced new locking. anon_vmas would be covered by RCU,
   spinlocks and seqlock - each of which is used in different
   circumstances. The last patch I posted doesn't drastically
   alter the locking. It just says that if you are taking multiple
   locks, you must start from the "root" anon_vma.

2. I wasn't sure if it was usable by transparent hugepage support.
   Andrea?

3. I had similar concerns about it livelocking like the
   trylock-and-retry although it's not terrible.

4. I couldn't convince myself at the time that it wasn't possible for
   someone to manipulate the list while it was being walked and a VMA would be
   missed. For example, if fork() was called while rmap_walk was happening,
   were we guaranteed to find the VMAs added to the list?  I admit I didn't
   fully investigate this question at the time as I was still getting to
   grips with anon_vma. I can reinvestigate if you think the "lock the root
   anon_vma first when taking multiple locks" has a bad cost that is
   potentially resolved with seqcounter

5. It added a field to mm_struct. It's the smallest of concerns though.


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KAMEZAWA Hiroyuki
Date: Thursday, May 6, 2010 - 4:52 pm

On Thu, 6 May 2010 10:46:21 +0100
ok. I just thought a lock-system which we have to find "which lock should I


If everyone think seqlock is simple, I think it should be. But it seems you all are
going ahead with anon_vma->lock approach. 
(Basically, it's ok to me if it works. We may be able to make it better in later.)

I'll check your V7.

Thank you for answering.

Regards,
-Kame



--

From: KAMEZAWA Hiroyuki
Date: Thursday, May 6, 2010 - 10:49 pm

On Fri, 7 May 2010 08:52:19 +0900

plz forget about seq_counter. we may have to add "retry" path for avoiding
dead lock. If so, using anon_vma->lock in proper manner seems sane.

Thanks,
-Kame

--


From: Andrea Arcangeli <aarcange@redhat.com>

Page migration requires rmap to be able to find all migration ptes
created by migration. If the second rmap_walk clearing migration PTEs
misses an entry, it is left dangling causing a BUG_ON to trigger during
fault. For example;

[  511.201534] kernel BUG at include/linux/swapops.h:105!
[  511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[  511.201534] last sysfs file: /sys/block/sde/size
[  511.201534] CPU 0
[  511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[  511.888526]
[  511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[  511.888526] RIP: 0010:[<ffffffff811094ff>]  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[  512.173545] RSP: 0018:ffff880037b979d8  EFLAGS: 00010246
[  512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[  512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[  512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[  512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[  512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[  512.380001] FS:  00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[  512.380001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[  512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  513.004775] DR3: 0000000000000000 DR6: ...
Previous thread: Re: [PATCH 4/4] KVM MMU: do not intercept invlpg if 'oos_shadow' is disabled by Xiao Guangrong on Wednesday, May 5, 2010 - 5:54 am. (2 messages)

Next thread: [PATCH 0/6] BKL ioctl pushdown by John Kacur on Wednesday, May 5, 2010 - 6:15 am. (21 messages)