After several hours, kbuild tests hang with anon_vma_prepare() spinning on
a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
(which makes this very much more likely, but it could happen without).
The ever-subtle page_lock_anon_vma() now needs a further twist: since
anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
of a reused anon_vma structure at any moment, page_lock_anon_vma()
needs to check page_mapped() again before succeeding, otherwise
page_unlock_anon_vma() might address a different root->lock.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/rmap.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
--- 2.6.36-rc2/mm/rmap.c 2010-08-16 00:18:01.000000000 -0700
+++ linux/mm/rmap.c 2010-08-24 03:22:17.000000000 -0700
@@ -316,7 +316,7 @@ void __init anon_vma_init(void)
*/
struct anon_vma *page_lock_anon_vma(struct page *page)
{
- struct anon_vma *anon_vma;
+ struct anon_vma *anon_vma, *root_anon_vma;
unsigned long anon_mapping;
rcu_read_lock();
@@ -327,8 +327,21 @@ struct anon_vma *page_lock_anon_vma(stru
goto out;
anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
- anon_vma_lock(anon_vma);
- return anon_vma;
+ root_anon_vma = ACCESS_ONCE(anon_vma->root);
+ spin_lock(&root_anon_vma->lock);
+
+ /*
+ * If this page is still mapped, then its anon_vma cannot have been
+ * freed. But if it has been unmapped, we have no security against
+ * the anon_vma structure being freed and reused (for another anon_vma:
+ * SLAB_DESTROY_BY_RCU guarantees that - so the spin_lock above cannot
+ * corrupt): with anon_vma_prepare() or anon_vma_fork() redirecting
+ * anon_vma->root before page_unlock_anon_vma() is called to unlock.
+ */
+ if (page_mapped(page))
+ return anon_vma;
+
+ spin_unlock(&root_anon_vma->lock);
out:
rcu_read_unlock();
return NULL;
--
From: Hugh Dickins <hughd@google.com> Interesting, is the condition which allows this to trigger specific to this merge window or was it always possible? --
Just specific to this merge window, which started using anon_vma->root->lock in place of anon_vma->lock (anon_vma->root is often anon_vma itself, but not always). I _think_ that change was itself a simplification of the locking in 2.6.35, rather than plugging a particular hole (it's not been backported to -stable), but I may be wrong on that - Rik? Hugh --
From: Hugh Dickins <hughd@google.com> Ok, because I've been seeing mysterious hangs recently on one of my sparc64 boxes and I was wondering if this might explain them :-) --
rmap_walk_anon isn't stable without the anon_vma->root->lock. This is because it only locks the local anon_vma and but the "vma" can be still modified under it in vma_adjust, so vma_address can fail and migration will crash. swapping is not issue as it's ok to miss a pte once in a while if the vma is under vma_adjust by the time the VM tries to unmap the page. See the anon_vma_lock added as well to vma_adjust (without it, it'd be just an useless exercise, but instead it's really plugging a real hole thanks to the anon_vma_lock addition to vma_adjust). Again not an issue unless you use migration (incidentally what memory compaction uses, and you know who the primary user of memory compaction is :). Secondly ksm_does_need_to_copy can't be fixed to cow only pages that are nonlinear as the page->mapping != anon_vma can now generate false positives. It's now trivial to fix with page->mapping->root != anon_vma->root. About your patch, it's a noop in my view... A single page_mapping check after rcu_read_lock is enough. And "anon_vma->root" can't change if page->mapping points to "anon_vma" at any time after rcu_read_lock returns. rcu_read_lock works for all anon_vma including anon_vma->root. If you were running the page_mapped() check on a different "page" then it could make a difference, but repeating it on the same page in the same rcu_read_lock protected critical section won't make a difference as far as the anon_vma freeing is concerned. --
Reviewed-by: Rik van Riel <riel@redhat.com> And yes, AFAIK this code lived just in -mm up to now. -- All rights reversed --
Hi Hugh, I don't get it, the anon_vma can be freed and reused only after we run rcu_read_unlock(). And the anon_vma->root can't change unless the anon_vma is freed and reused. Last but not the least by the time page->mapping points to "anon_vma" the "anon_vma->root" is already initialized and stable. The page_mapped test is only relevant against the rcu_read_lock, not the spin_lock, so how it can make a difference to run it twice inside the same rcu_read_lock protected critical section? The first one still is valid also after the anon_vma_lock() returns, it's not like that anon_vma_lock drops the rcu_read_lock internally. Furthermore no need of ACCESS_ONCE on the anon_vma->root because it can't change from under us as the anon_vma can't be freed from under us until rcu_read_unlock returns (after we verified the first time that page_mapped is true under the rcu_read_lock, which we already do before trying to take the anon_vma_lock). --
No. Between rcu_read_lock and rcu_read_unlock, once we've done the first (original) page_mapped test to make sure that this isn't just a long-stale page->mapping left over in there, SLAB_DESTROY_BY_RCU ensures that the slab page on which this "struct anon_vma" resides cannot be freed and reused for some other purpose e.g. a page of user data. But that piece of slab holding this "struct anon_vma" is liable to be freed and reused for another struct anon_vma at any point, until Yes, but RCU is not protecting against that: all it's doing is guaranteeing that when we "speculatively" spin_lock the anon_vma, we Yes, but two things to be careful of there: one, we leave page->mapping pointing to the anon_vma maybe long after that address has got reused for something else, it's only when the page is finally freed that it's cleared (and there certainly was a good racing reason for that, but I'd have to think long and hard to reconstruct the sequence - OTOH it was a race between page_remove_rmap and page_add_anon_rmap); and two, notice how anon_vma_prepare() sets anon_vma->root = anon_vma on a newly allocated anon_vma, before it gets anon_vma_lock - so anon_vma->root can change underneath us at any point there, until we've got anon_vma_lock _and_ checked stability I must rush off for a few hours: hopefully my assertions above shed some light., I think you're mistaking the role that RCU plays here. You remark in other mail "About your patch, it's a noop in my view..." - but seems quite an effective noop ;) Without the patch my kbuilds would hang usually within 6 hours - though one time they did manage 20 hours, I admit. They ran with the patch for 52 hours before I switched the machine over to something else. Hugh --
That's exactly correct, I thought it prevented reuse of the slab entry, not only of the whole slab... SLAB_DESTROY_BY_RCU is a lot more tricky to use than I though... However at the light of this, I think page_lock_anon_vma could have returned a freed and reused anon_vma well before the anon-vma changes. The anon_vma could have been freed after the first page_mapped check succeed but before taking the spinlock. I think, it worked fine because the rmap walks are robust enough just not to fall apart on a reused anon_vma while the lock is hold. It become a visible problem now because we were unlocking the wrong lock leading to a deadlock. But I guess it wasn't too intentional to return a reused anon_vma out of page_lock_anon_vma. --
What you say there is all exactly right, except for "I guess it wasn't too intentional": it was intentional, and known that it all worked out okay in the rare case when a reused anon_vma got fed into the loops - the anon_vma, after all, is nothing more than a list of places where you may find the page mapped, it has never asserted that a page will be found everywhere that the anon_vma lists. I would have liked to say "well known" above, but perhaps well known only to me: you're certainly not the first to be surprised by this. IIRC both Christoph and Peter have at different times proposed patches to tighten up page_lock_anon_vma() to avoid returning a stale/reused anon_vma, probably both were dropped because neither was actually necessary, until now: I guess it's a good thing for understandability that anon_vma->root->lock now requires that we weed out that case. Hugh --
Most people dealing with this for the first time get through a discovery period. The network guys had similar problems when they first tried to use Right. We need to verify that the object we have reached is the correct one. The basic problem with SLAB_DESTROY_BY_RCU is that you get a reference to an object that is guaranteed only to have the same type (the instance may fluctuate and be replaced from under you unless other measures are taken). Typically one must take a lock within the memory structure to pin down the object (or take a refcount). Only then can you follow pointers and such. It is only possible to verify that the right object has been reached *after* locking. Following a pointer without having determined that we hit the right object should not occur. A solution here would be to take the anon_vma->lock (prevents the object switching under us) and then verify that the mapping is the one we are looking for and that the pointer points to the right root. Then take the root lock. Hughs solution takes a global spinlock which will limit scalability. --
(I wouldn't describe that as a "problem with SLAB_DESTROY_BY_RCU": it's precisely the nature of SLAB_DESTROY_BY_RCU, what makes it useful Eh? My solution was a second page_mapped(page) test i.e. testing an atomic. Hugh --
Argh. Right. Looked like a global to me. Did not see the earlier local def. If you still use a pointer then what does insure that the root pointer was not changed after the ACCESS_ONCE? The free semantics of an anon_vma? Since there is no lock taken before the mapped check none of the earlier reads from the anon vma structure nor the page mapped check necessarily reflect a single state of the anon_vma. --
Nothing ensures that the root pointer was not changed after the ACCESS_ONCE, that's exactly why we use ACCESS_ONCE there: once we've got the lock and realize that what we've locked may not be what we wanted (or may change from what we were wanting at any moment, the page no longer being mapped there - but in that case we no longer want it), we have to be sure to unlock the one we locked, rather than the one which anon_vma->root might subsequently point to. (Umm, maybe I'm not the clearest of explainers, sorry! If you get my There's no lock (other than RCU's read "lock") taken before the original mapped check, and that's important, otherwise our attempt to lock might actually spinon or corrupt something that was long ago an anon_vma. But we do take the anon_vma->root->lock before the second mapped check which I added. If the page is still mapped at the point of that second check, then we know that we got the right anon_vma, that the page might be mapped in it, and anon_vma->root is not going to change underneath us before the page_unlock_anon_vma(). (The page may get unmapped at any time, the lock does not protect against that; but if it's still mapped once we hold the lock, free_pgtables() cannot free the anon_vma until we're done.) Hugh --
I do not see any check after we have taken the lock to verify that we You then are using an object from the anon_vma (the pointer) without a lock! This is unstable therefore unless there are other constraints. The anon_vma->lock must be taken before derefencing that pointer. The page may have been unmapped and mapped again between the two checks. Unlikely but I do not see a second check (*after* taking the lock) in the patch and the way the lock is taken can be a problem in itself. --
No second version of the patch, no. As I said already, it's that second page_mapped check which gives the guarantee that the anon_vma Yes, unlikely but possible. (Well, actually, is it possible? It can be unmapped on exit without any lock, but unmapping for pageout would require the page lock, would insert a swp_entry_t, and mapping again would go to do_swap_page which would again require the page lock. But never mind that, let's assume there is a way it can be unmapped and mapped again.) The thing is, page->mapping will point to the same anon_vma throughout, that only gets reset when the page is freed, and there should be nowhere else that modifies page->mapping once it's been set to anon_vma - if you know of somewhere, please point to it, --
Sorry, I seem to have hit some key which sent the mail off too soon, a
tab perhaps: finishing up...
if (page_mapped(page))
No, that's what we rely upon SLAB_DESTROY_BY_RCU for.
Hugh
--
As far as I can tell you would have to recheck the mapping pointer and the pointer to the root too after taking the lock because only taking the lock stabilitzes the object. Any other data you may have obtained before SLAB_DESTROY_BY_RCU does not guarantee that the object stays the same nor does it prevent any fields from changing. Going through a pointer with only SLAB_DESTROY_BY_RCU means that you can only rely on the atomicity guarantee for pointer updates. You get a valid pointer but pointer changes are not prevented by SLAB_DESTROY_BY_RCU. The only guarantee of that would be through other synchronization techniques. If you believe that the page lock provides sufficient synchronization that then this approach may be ok. --
That's a more interesting question than I'd realized. When page_lock_anon_vma() first came in (2.6.9) there was nothing which updated page->mapping of an anon page after it was set, until the page was freed. Since then we've gathered a few places which update it while holding the page lock (migrate.c, ksm.c) - no problem since the callers of page_lock_anon_vma() hold and must hold page lock. Well, there is the fairly recent call to page_lock_anon_vma() from memory-failure.c, and its even more recent use on hugepages: there's switching back and forth between p and hpage and page, but I think it does end up applying page_lock_anon_vma() to the very page that it locked earlier. Then there's the recently added page_move_anon_rmap(): fine in memory.c, the page lock is held; but apparently broken in hugetlb.c, where it's called only when the pagelock has not been taken! Horiguchi-san Cc'ed. __page_set_anon_rmap() looks like it might have changed anon page->mapping in 2.6.35, but Andrea has fixed that with PageAnon tests in 2.6.36-rc. Ah, but what if "exclusive" and non-exclusive calls to __page_set_anon_rmap() are racing? Not clear, it may be that Andrea has only narrowed a window not closed it (and I've not yet looked up the commit to see his intent); or it may be okay, that there cannot be a conflict of anon_vma in that case. Need to dig deeper. __hugepage_set_anon_rmap() appears to copy the 2.6.35 __page_set_anon_rmap(), and probably needs to add in Andrea's fix, or whatever else is needed there. This is a different problem (or it may turn out to be a non-existent problem, aside from the hugetlb.c case to be fixed there) than I was fixing with my patch, and can be patched separately; but It certainly looks as if it's worth adding a BUG_ON or VM_BUG_ON to check for a switch of anon_vma beneath us there. Plus a A change in the pointer to the root is covered by the ACCESS_ONCE: yes, it can change beneath us there, but only through the anon_vma being freed and ...
I had the impression that you rely on SLAB_DESTROY_BY_RCU for more than
what it gives you. If the lock taken is not directly in the structure that
is managed by slab but only reachable by a pointer then potential pointer
changes are also danger to consider.
I'd be much more comfortable if the following would be done
A. Pin the anon_vma by either
I. Take a refcount on the anon vma
II. Take a lock in the anon vma (something that is not pointed to)
B. Either
I. All values that have been used before the pinning are
verified after the pinning (and the lock is reacquired
if verification fails).
II. Or all functions using page_lock_anon_vma() must securely
work in the case that the anon_vma was reused for
something else before the vma lock was acquired.
--
Last time I looked they all work like that, they all use something akin to vma_address() which validates that the page we're interested in is indeed part of the vma we obtained from the rmap chain. Anyway, I'll try and refresh my preemptible mmu patch-set now that the merge window dust settled and post if again, hopefully we can stick it in -next. --
__page_set_anon_rmap doesn't require the PG_lock only if it's a newly allocated page and it is called from page_add_new_anon_rmap. So there cannot be concurrent __page_set_anon_rmap running on the same page. If there could be concurrent __page_set_anon_rmap running on the same page, the page_lock_anon_vma running on a changing page->mapping would be the last worry, as page_add_new_anon_rmap would overwrite _mapcount with 0 while do_page_add_anon_rmap runs, so corrupting the mapcount information leading to immediate crash during page freeing.... So definitely it must not happen and not only because of page_lock_anon_vma... and it's unlikely to go unnoticed. That change is to avoid altering the page->mapping for anon pages. It's only an optimization. No need to set the page->mapping back to the anon_vma->root for AnonPages (that in turn have already their page->mapping set) if we've already more finegrained information into the page->mapping. If we've already information in page->mapping (page is Anon) then setting to anon_vma->root can only be a coarser setting losing anon_vma child granularity. We must set to the root anon vma however when the page is swapcache but not anon yet... and if it's not exclusive we've to use the anon_vma->root, otherwise if it's being taken over by the local process we can use the local I think it's actually safe in anon_vma terms, setting the page->mapping to the anon_vma->root _always_ safe, but it should use anon_vma->root instead of list_entry (should still lead to the same result) and it can probably also optimize it if it's already an AnonPage like I did for the not-hugetlbfs case (which also includes transparent hugepages as they share the core VM paths). The lack of BUG_ON(!PageLocked(page)) in the hugetlb_add_anon_rmap is worth fixing... hugepage_add_new_anon_rmap runs lock_page before so most certainly is ok (maybe lock_page not needed if it's a new page?). hugepage_add_anon_rmap seems to run on a local new page too ...
