Re: [PATCH] mm: fix hang on anon_vma->root->lock

Previous thread: [PATCH] pm_qos: Fix inline documentation. by Saravana Kannan on Wednesday, August 25, 2010 - 10:52 pm. (3 messages)

Next thread: [PATCH 3/3] leds: ab8500-led: led driver based on ab8500 pwm by Arun Murthy on Wednesday, August 25, 2010 - 11:50 pm. (1 message)
From: Hugh Dickins
Date: Wednesday, August 25, 2010 - 11:12 pm

After several hours, kbuild tests hang with anon_vma_prepare() spinning on
a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
(which makes this very much more likely, but it could happen without).

The ever-subtle page_lock_anon_vma() now needs a further twist: since
anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
of a reused anon_vma structure at any moment, page_lock_anon_vma()
needs to check page_mapped() again before succeeding, otherwise
page_unlock_anon_vma() might address a different root->lock.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/rmap.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

--- 2.6.36-rc2/mm/rmap.c	2010-08-16 00:18:01.000000000 -0700
+++ linux/mm/rmap.c	2010-08-24 03:22:17.000000000 -0700
@@ -316,7 +316,7 @@ void __init anon_vma_init(void)
  */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {
-	struct anon_vma *anon_vma;
+	struct anon_vma *anon_vma, *root_anon_vma;
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
@@ -327,8 +327,21 @@ struct anon_vma *page_lock_anon_vma(stru
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	anon_vma_lock(anon_vma);
-	return anon_vma;
+	root_anon_vma = ACCESS_ONCE(anon_vma->root);
+	spin_lock(&root_anon_vma->lock);
+
+	/*
+	 * If this page is still mapped, then its anon_vma cannot have been
+	 * freed.  But if it has been unmapped, we have no security against
+	 * the anon_vma structure being freed and reused (for another anon_vma:
+	 * SLAB_DESTROY_BY_RCU guarantees that - so the spin_lock above cannot
+	 * corrupt): with anon_vma_prepare() or anon_vma_fork() redirecting
+	 * anon_vma->root before page_unlock_anon_vma() is called to unlock.
+	 */
+	if (page_mapped(page))
+		return anon_vma;
+
+	spin_unlock(&root_anon_vma->lock);
 out:
 	rcu_read_unlock();
 	return NULL;
--

From: David Miller
Date: Wednesday, August 25, 2010 - 11:41 pm

From: Hugh Dickins <hughd@google.com>

Interesting, is the condition which allows this to trigger specific
to this merge window or was it always possible?
--

From: Hugh Dickins
Date: Thursday, August 26, 2010 - 3:54 am

Just specific to this merge window, which started using
anon_vma->root->lock in place of anon_vma->lock (anon_vma->root is
often anon_vma itself, but not always).  I _think_ that change was
itself a simplification of the locking in 2.6.35, rather than plugging
a particular hole (it's not been backported to -stable), but I may be
wrong on that - Rik?

Hugh
--

From: David Miller
Date: Thursday, August 26, 2010 - 12:00 pm

From: Hugh Dickins <hughd@google.com>

Ok, because I've been seeing mysterious hangs recently on one of my
sparc64 boxes and I was wondering if this might explain them :-)

--

From: Andrea Arcangeli
Date: Thursday, August 26, 2010 - 5:19 pm

rmap_walk_anon isn't stable without the anon_vma->root->lock. This is
because it only locks the local anon_vma and but the "vma" can be
still modified under it in vma_adjust, so vma_address can fail and
migration will crash. swapping is not issue as it's ok to miss a pte
once in a while if the vma is under vma_adjust by the time the VM
tries to unmap the page.

See the anon_vma_lock added as well to vma_adjust (without it, it'd be
just an useless exercise, but instead it's really plugging a real hole
thanks to the anon_vma_lock addition to vma_adjust).

Again not an issue unless you use migration (incidentally what memory
compaction uses, and you know who the primary user of memory
compaction is :).

Secondly ksm_does_need_to_copy can't be fixed to cow only pages that
are nonlinear as the page->mapping != anon_vma can now generate false
positives. It's now trivial to fix with page->mapping->root !=
anon_vma->root.

About your patch, it's a noop in my view... A single page_mapping
check after rcu_read_lock is enough. And "anon_vma->root" can't change
if page->mapping points to "anon_vma" at any time after rcu_read_lock
returns. rcu_read_lock works for all anon_vma including
anon_vma->root.

If you were running the page_mapped() check on a different "page" then
it could make a difference, but repeating it on the same page in the
same rcu_read_lock protected critical section won't make a difference
as far as the anon_vma freeing is concerned.
--

From: Rik van Riel
Date: Thursday, August 26, 2010 - 6:32 am

Reviewed-by: Rik van Riel <riel@redhat.com>

And yes, AFAIK this code lived just in -mm up to now.

-- 
All rights reversed
--

From: Andrea Arcangeli
Date: Thursday, August 26, 2010 - 4:50 pm

Hi Hugh,


I don't get it, the anon_vma can be freed and reused only after we run
rcu_read_unlock(). And the anon_vma->root can't change unless the
anon_vma is freed and reused. Last but not the least by the time
page->mapping points to "anon_vma" the "anon_vma->root" is already
initialized and stable.

The page_mapped test is only relevant against the rcu_read_lock, not
the spin_lock, so how it can make a difference to run it twice inside
the same rcu_read_lock protected critical section? The first one still
is valid also after the anon_vma_lock() returns, it's not like that
anon_vma_lock drops the rcu_read_lock internally.

Furthermore no need of ACCESS_ONCE on the anon_vma->root because it
can't change from under us as the anon_vma can't be freed from under
us until rcu_read_unlock returns (after we verified the first time
that page_mapped is true under the rcu_read_lock, which we already do
before trying to take the anon_vma_lock).
--

From: Hugh Dickins
Date: Thursday, August 26, 2010 - 6:43 pm

No.  Between rcu_read_lock and rcu_read_unlock, once we've done the
first (original) page_mapped test to make sure that this isn't just a
long-stale page->mapping left over in there, SLAB_DESTROY_BY_RCU
ensures that the slab page on which this "struct anon_vma" resides
cannot be freed and reused for some other purpose e.g. a page of user
data.  But that piece of slab holding this "struct anon_vma" is liable
to be freed and reused for another struct anon_vma at any point, until

Yes, but RCU is not protecting against that: all it's doing is
guaranteeing that when we "speculatively" spin_lock the anon_vma, we

Yes, but two things to be careful of there: one, we leave
page->mapping pointing to the anon_vma maybe long after that address
has got reused for something else, it's only when the page is finally
freed that it's cleared (and there certainly was a good racing reason
for that, but I'd have to think long and hard to reconstruct the
sequence - OTOH it was a race between page_remove_rmap and
page_add_anon_rmap); and two, notice how anon_vma_prepare() sets
anon_vma->root = anon_vma on a newly allocated anon_vma, before it
gets anon_vma_lock - so anon_vma->root can change underneath us at any
point there, until we've got anon_vma_lock _and_ checked stability

I must rush off for a few hours: hopefully my assertions above shed
some light., I think you're mistaking the role that RCU plays here.
You remark in other mail "About your patch, it's a noop in my view..."
- but seems quite an effective noop ;)  Without the patch my kbuilds
would hang usually within 6 hours - though one time they did manage 20
hours, I admit. They ran with the patch for 52 hours before I switched
the machine over to something else.

Hugh
--

From: Andrea Arcangeli
Date: Friday, August 27, 2010 - 2:55 am

That's exactly correct, I thought it prevented reuse of the slab
entry, not only of the whole slab... SLAB_DESTROY_BY_RCU is a lot more
tricky to use than I though...

However at the light of this, I think page_lock_anon_vma could have
returned a freed and reused anon_vma well before the anon-vma changes.

The anon_vma could have been freed after the first page_mapped check
succeed but before taking the spinlock. I think, it worked fine
because the rmap walks are robust enough just not to fall apart on a
reused anon_vma while the lock is hold. It become a visible problem
now because we were unlocking the wrong lock leading to a
deadlock. But I guess it wasn't too intentional to return a reused
anon_vma out of page_lock_anon_vma.
--

From: Hugh Dickins
Date: Friday, August 27, 2010 - 9:43 am

What you say there is all exactly right, except for "I guess it wasn't
too intentional": it was intentional, and known that it all worked out
okay in the rare case when a reused anon_vma got fed into the loops -
the anon_vma, after all, is nothing more than a list of places where
you may find the page mapped, it has never asserted that a page will
be found everywhere that the anon_vma lists.

I would have liked to say "well known" above, but perhaps well known
only to me: you're certainly not the first to be surprised by this.
IIRC both Christoph and Peter have at different times proposed patches
to tighten up page_lock_anon_vma() to avoid returning a stale/reused
anon_vma, probably both were dropped because neither was actually
necessary, until now: I guess it's a good thing for understandability
that anon_vma->root->lock now requires that we weed out that case.

Hugh
--

From: Christoph Lameter
Date: Friday, August 27, 2010 - 10:13 am

Most people dealing with this for the first time get through a discovery
period. The network guys had similar problems when they first tried to use

Right. We need to verify that the object we have reached is the correct
one.

The basic problem with SLAB_DESTROY_BY_RCU is that you get a reference to
an object that is guaranteed only to have the same type (the instance may
fluctuate and be replaced from under you unless other measures are taken).

Typically one must take a lock within the memory structure to pin down
the object (or take a refcount). Only then can you follow pointers and
such. It is only possible to verify that the right object has been
reached *after* locking. Following a pointer without having determined
that we hit the right object should not occur.

A solution here would be to take the anon_vma->lock (prevents the object
switching under us) and then verify that the mapping is the one we are
looking for and that the pointer points to the right root. Then take the
root lock.

Hughs solution takes a global spinlock which will limit scalability.
--

From: Hugh Dickins
Date: Friday, August 27, 2010 - 10:55 am

(I wouldn't describe that as a "problem with SLAB_DESTROY_BY_RCU":
it's precisely the nature of SLAB_DESTROY_BY_RCU, what makes it useful

Eh?  My solution was a second page_mapped(page) test i.e. testing an atomic.

Hugh
--

From: Christoph Lameter
Date: Friday, August 27, 2010 - 12:29 pm

Argh. Right. Looked like a global to me. Did not see the earlier local
def.

If you still use a pointer then what does insure that the root
pointer was not changed after the ACCESS_ONCE? The free semantics
of an anon_vma?

Since there is no lock taken before the mapped check none of the
earlier reads from the anon vma structure nor the page mapped check
necessarily reflect a single state of the anon_vma.




--

From: Hugh Dickins
Date: Friday, August 27, 2010 - 1:14 pm

Nothing ensures that the root pointer was not changed after the
ACCESS_ONCE, that's exactly why we use ACCESS_ONCE there: once we've
got the lock and realize that what we've locked may not be what we
wanted (or may change from what we were wanting at any moment, the
page no longer being mapped there - but in that case we no longer want
it), we have to be sure to unlock the one we locked, rather than the
one which anon_vma->root might subsequently point to.

(Umm, maybe I'm not the clearest of explainers, sorry!  If you get my

There's no lock (other than RCU's read "lock")  taken before the
original mapped check, and that's important, otherwise our attempt to
lock might actually spinon or corrupt something that was long ago an
anon_vma.  But we do take the anon_vma->root->lock before the second
mapped check which I added.  If the page is still mapped at the point
of that second check, then we know that we got the right anon_vma,
that the page might be mapped in it, and anon_vma->root is not going
to change underneath us before the page_unlock_anon_vma().  (The page
may get unmapped at any time, the lock does not protect against that;
but if it's still mapped once we hold the lock, free_pgtables() cannot
free the anon_vma until we're done.)

Hugh
--

From: Christoph Lameter
Date: Friday, August 27, 2010 - 1:56 pm

I do not see any check after we have taken the lock to verify that we

You then are using an object from the anon_vma (the pointer) without a
lock! This is unstable therefore unless there are other constraints. The
anon_vma->lock must be taken before derefencing that pointer. The page may
have been unmapped and mapped again between the two checks. Unlikely but

I do not see a second check (*after* taking the lock) in the patch and the
way the lock is taken can be a problem in itself.

--

From: Hugh Dickins
Date: Friday, August 27, 2010 - 2:28 pm

No second version of the patch, no.  As I said already, it's that
second page_mapped check which gives the guarantee that the anon_vma



Yes, unlikely but possible.  (Well, actually, is it possible?  It can
be unmapped on exit without any lock, but unmapping for pageout would
require the page lock, would insert a swp_entry_t, and mapping again
would go to do_swap_page which would again require the page lock.  But
never mind that, let's assume there is a way it can be unmapped and
mapped again.)  The thing is, page->mapping will point to the same
anon_vma throughout, that only gets reset when the page is freed, and
there should be nowhere else that modifies page->mapping once it's
been set to anon_vma - if you know of somewhere, please point to it,
--

From: Hugh Dickins
Date: Friday, August 27, 2010 - 2:33 pm

Sorry, I seem to have hit some key which sent the mail off too soon, a
tab perhaps: finishing up...


        if (page_mapped(page))

No, that's what we rely upon SLAB_DESTROY_BY_RCU for.

Hugh
--

From: Christoph Lameter
Date: Friday, August 27, 2010 - 4:06 pm

As far as I can tell you would have to recheck the mapping pointer and the
pointer to the root too after taking the lock because only taking the lock
stabilitzes the object. Any other data you may have obtained before

SLAB_DESTROY_BY_RCU does not guarantee that the object stays the same nor
does it prevent any fields from changing. Going through a pointer with
only SLAB_DESTROY_BY_RCU means that you can only rely on the atomicity
guarantee for pointer updates. You get a valid pointer but pointer changes
are not prevented by SLAB_DESTROY_BY_RCU.

The only guarantee of that would be through other synchronization
techniques. If you believe that the page lock provides sufficient
synchronization that then this approach may be ok.
--

From: Hugh Dickins
Date: Friday, August 27, 2010 - 6:07 pm

That's a more interesting question than I'd realized.  When
page_lock_anon_vma() first came in (2.6.9) there was nothing which
updated page->mapping of an anon page after it was set, until the page
was freed.

Since then we've gathered a few places which update it while holding
the page lock (migrate.c, ksm.c) - no problem since the callers of
page_lock_anon_vma() hold and must hold page lock.  Well, there is the
fairly recent call to page_lock_anon_vma() from memory-failure.c, and
its even more recent use on hugepages: there's switching back and
forth between p and hpage and page, but I think it does end up
applying page_lock_anon_vma() to the very page that it locked earlier.

Then there's the recently added page_move_anon_rmap(): fine in
memory.c, the page lock is held; but apparently broken in hugetlb.c,
where it's called only when the pagelock has not been taken!
Horiguchi-san Cc'ed.

__page_set_anon_rmap() looks like it might have changed anon
page->mapping in 2.6.35, but Andrea has fixed that with PageAnon tests
in 2.6.36-rc.  Ah, but what if "exclusive" and non-exclusive calls to
__page_set_anon_rmap() are racing?  Not clear, it may be that Andrea
has only narrowed a window not closed it (and I've not yet looked up
the commit to see his intent); or it may be okay, that there cannot be
a conflict of anon_vma in that case.  Need to dig deeper.

__hugepage_set_anon_rmap() appears to copy the 2.6.35
__page_set_anon_rmap(), and probably needs to add in Andrea's fix, or
whatever else is needed there.

This is a different problem (or it may turn out to be a non-existent
problem, aside from the hugetlb.c case to be fixed there) than I was
fixing with my patch, and can be patched separately; but It certainly
looks as if it's worth adding a BUG_ON or VM_BUG_ON to check for a
switch of anon_vma beneath us there.   Plus a

A change in the pointer to the root is covered by the ACCESS_ONCE:
yes, it can change beneath us there, but only through the anon_vma
being freed and ...
From: Christoph Lameter
Date: Friday, August 27, 2010 - 7:47 pm

I had the impression that you rely on SLAB_DESTROY_BY_RCU for more than
what it gives you. If the lock taken is not directly in the structure that
is managed by slab but only reachable by a pointer then potential pointer
changes are also danger to consider.

I'd be much more comfortable if the following would be done

A. Pin the anon_vma by either
	I. Take a refcount on the anon vma
	II. Take a lock in the anon vma (something that is not pointed to)

B. Either
	I. All values that have been used before the pinning are
	   verified after the pinning (and the lock is reacquired
           if verification fails).

	II. Or all functions using page_lock_anon_vma() must securely
	    work in the case that the anon_vma was reused for
	    something else before the vma lock was acquired.

--

From: Peter Zijlstra
Date: Saturday, August 28, 2010 - 3:17 am

Last time I looked they all work like that, they all use something akin
to vma_address() which validates that the page we're interested in is
indeed part of the vma we obtained from the rmap chain.

Anyway, I'll try and refresh my preemptible mmu patch-set now that the
merge window dust settled and post if again, hopefully we can stick it
in -next.
--

From: Andrea Arcangeli
Date: Saturday, August 28, 2010 - 8:54 am

__page_set_anon_rmap doesn't require the PG_lock only if it's a newly
allocated page and it is called from page_add_new_anon_rmap. So there
cannot be concurrent __page_set_anon_rmap running on the same page. If
there could be concurrent __page_set_anon_rmap running on the same
page, the page_lock_anon_vma running on a changing page->mapping would
be the last worry, as page_add_new_anon_rmap would overwrite _mapcount
with 0 while do_page_add_anon_rmap runs, so corrupting the mapcount
information leading to immediate crash during page freeing....

So definitely it must not happen and not only because of
page_lock_anon_vma... and it's unlikely to go unnoticed.


That change is to avoid altering the page->mapping for anon
pages. It's only an optimization. No need to set the page->mapping
back to the anon_vma->root for AnonPages (that in turn have already
their page->mapping set) if we've already more finegrained information
into the page->mapping. If we've already information in page->mapping
(page is Anon) then setting to anon_vma->root can only be a coarser
setting losing anon_vma child granularity. We must set to the root
anon vma however when the page is swapcache but not anon yet... and if
it's not exclusive we've to use the anon_vma->root, otherwise if it's
being taken over by the local process we can use the local

I think it's actually safe in anon_vma terms, setting the
page->mapping to the anon_vma->root _always_ safe, but it should use
anon_vma->root instead of list_entry (should still lead to the same
result) and it can probably also optimize it if it's already an
AnonPage like I did for the not-hugetlbfs case (which also includes
transparent hugepages as they share the core VM paths).

The lack of BUG_ON(!PageLocked(page)) in the hugetlb_add_anon_rmap is
worth fixing...

hugepage_add_new_anon_rmap runs lock_page before so most certainly is
ok (maybe lock_page not needed if it's a new page?).

hugepage_add_anon_rmap seems to run on a local new page too ...
Previous thread: [PATCH] pm_qos: Fix inline documentation. by Saravana Kannan on Wednesday, August 25, 2010 - 10:52 pm. (3 messages)

Next thread: [PATCH 3/3] leds: ab8500-led: led driver based on ab8500 pwm by Arun Murthy on Wednesday, August 25, 2010 - 11:50 pm. (1 message)