Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

Previous thread: [IRQ map] VIA C7 CN700 2.6.23-rc9-git USB IRQs disabled by Guennadi Liakhovetski on Thursday, October 4, 2007 - 7:19 pm. (8 messages)

Next thread: [PATCH] LOCKDEP: fix mismatched lockdep_depth/curr_chain_hash by Gregory Haskins on Friday, October 5, 2007 - 12:22 am. (2 messages)
To: Hugh Dickens <hugh@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>
Cc: Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, October 4, 2007 - 9:43 pm

David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add
ptep_test_and_clear_{dirty,young}" has introduced an SMP race which
affects the Xen pv-ops backend.

In Xen, pagetables are normally kept RO so that the hypervisor can
mediate all updates to them. If Xen sees a write to an active
(currently pointed to by cr3) or pinned (a currently inactive but
registered pagetable), it will trap the write fault and emulate the
instruction making the update; this means that most pagetable-modifying
code doesn't need to know or care that pagetables are RO.

When a pagetable is first created (either in execve or fork), the the
Xen paravirt backend pins the pagetable, and conversely, on exit it is
unpinned; this is done via the arch_dup_mmap() and activate_mm() hooks.
Pinning is done in two phases: first the pagetable pages are marked RO,
and then the pagetable is registered with Xen; unpinning is the
opposite. This works assuming that the pagetable is not in use, and not
yet visible to other cpus.

The race on pagetable creation is this: in kernel/fork.c:dup_mmap(), it
copies the old pagetable into the new one, and registers each vma with
the rmap prio tree. Once everything is copied, it calls
arch_dup_mmap(), which ends up doing the Xen pagetable pin. However,
because the pagetable is visible to other cpus via the prio tree,
pagetable modifications (specifically, clearing the access bit) can race
with pinning. If it hits between making the pagetable pages RO but
before they're registered with Xen, modifications to the flags will
fault, and Xen won't know to do the fixup.

The converse is also true in exit_mmap(): arch_exit_mmap is called
before removing the vmas from the prio tree, so it can race with unpinning.

The specific oops I'm seeing is this:

BUG: unable to handle kernel paging request at virtual address c5b023e8
printing eip:
c016d3f2
*pdpt = 000000004bc1a001
Oops: 0003 [#1]
PREEMPT SMP
Modules linked in:
CPU: 1
EIP: 0061:[<c016d3f2>] Not t...

To: Jeremy Fitzhardinge <jeremy@...>
Cc: David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, October 5, 2007 - 7:36 am

I see the discussion has somehow moved on to pagetable locking -
mysterious because the word "lock" never once appears in your
otherwise very helpful elucidation below, for which many thanks.

Maybe what I have to add is now of historical interest only, or none,
but I was prevented from answering your original mail earlier...

I think that race with Xen has been there all along, not introduced
by David's commit. Take a look at ptep_clear_flush_young() in the

To my naive mind, your problem actually lies in those two stages:

No. The asm-generic/pgtable.h version of ptep_test_and_clear_young()
does set_pte_at((__vma)->vm_mm, (__address), (__ptep), pte_mkold(__pte)),
which may be nice for Xen, but is dangerously racy on i386 SMP (I hope
it's not racy on the other architectures using it...): it's in danger
of losing the dirty bit if that gets set in userspace in between us
reading __pte and setting the pte_mkold version of __pte. There's
no issue if we were to lose the young bit while setting the dirty

Maybe (if this turns out to be more than just a muddle over pagetable
locking). But note that you have a problem, not just with the
ptep_clear_flush_young from page_referenced_one, but also with the
ptep_clear_flush (which ends up using native_ptep_get_and_clear on
i386 I think) from try_to_unmap_one: you just won't hit that one
as often as the page_referenced_one case.

Anything in include/asm-i386/pgtable.h which uses pte_update or
pte_update_defer ought to be considered; but I expect that you'll
find all the other instances are only used from safe contexts,
which cannot race with yours. The pte_update comes after the op
that concerns you: maybe there should be a pte_about_to_update

No. It is intentional that we make those ptes visible as early as
possible, so that concurrent pageout (and less importantly swapoff)
has the best chance of finding all references to a page (or swap ent).
If they only became visible at the final arch_dup_mmap stage, then
it might become...

To: Hugh Dickins <hugh@...>
Cc: David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, October 5, 2007 - 3:39 pm

It has to be taken with a grain of salt. The reason "lock" isn't
mentioned is because I mis-analyzed the situation, and overlooked that

Yes, but I don't think its a problem with correct locking. set_pte_at

No, its the Xen-specific kernel code which does the RO marking: it marks
the pagetables RO (using a hypercall to make the actual modifications),
and then tells the hypervisor to pin the whole pagetable. It can't be

Hm, OK.

J
-

To: Hugh Dickins <hugh@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, October 5, 2007 - 2:58 pm

On Fri, 5 Oct 2007 12:36:33 +0100 (BST)

It does. Telling Xen to pin the page as a page table page is
basically the first thing a Xen kernel does after marking the
page read-only.

This makes for a narrow race window, during which ptep_test_and_clear_young
cannot clear the referenced bit and may end up causing a crash. We do not
care about it not clearing the referenced bit during that window, since it
will be cleared during the next go-around and the race is very rare.

Hence, the only thing we need to fix is the crash.

We can do that by adding an entry for ptep_test_and_clear_young to the
exception table. This way we do not need to turn this into a new paravirt
ops hook (since the fast path is exactly the same as x86 native) and there
is no need for added complexity.

Also, Xen would not conflict with SPLIT_PTLOCK_CPUS.

--
All Rights Reversed
-

To: Rik van Riel <riel@...>
Cc: Hugh Dickins <hugh@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, October 5, 2007 - 3:40 pm

Is this the only possible case? It's just the one I found while testing

Well, isn't the correct fix to make Xen take all the pagetable locks
while pinning/unpinning? Adding exception handling to
test_and_clear_bit would solve this particular race, but are there
others (either now or potentially)? Seems fragile.

J
-

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Hugh Dickins <hugh@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, October 5, 2007 - 3:56 pm

On Fri, 05 Oct 2007 12:40:05 -0700

You're right, holding the pagetable lock(s) across pinning/unpinning
operations would avoid the problem too.

While we do not care about the accessed bit, there are similar operations
going on with the dirty bit - which is a lot more important :)

--
All Rights Reversed
-

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Hugh Dickens <hugh@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, October 4, 2007 - 10:44 pm

y'know, I think I think it's been several years since I saw a report of an
honest to goodness, genuine SMP race in core kernel. We used to be
infested by them, but the term has fallen into disuse. Interesting, but

I expect that 2) has the maximum niceness*suitable-for-2.6.23 product.

That's if you actually care much about kernel.org major releases - do many
people run kernel.org kernels on Xen? If "not many" then we could perhaps
do something more elaborate for 2.6.23.1. But adding ever more pvops as
core kernel evolves was always expected.
-

To: Andrew Morton <akpm@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Hugh Dickens <hugh@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Sunday, October 7, 2007 - 5:52 am

Does that include data races on weakly ordered systems, or UP races
(ie. with sleeping locks rather than spinning ones)? ;) Because we had
and have a few of those...
-

To: Andrew Morton <akpm@...>
Cc: Hugh Dickens <hugh@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, October 5, 2007 - 12:08 am

I was a bit surprised to find myself typing it too. I guess it could
also be a preempt race, which has been a bit more common. Anyway, its a
deliberately unlocked access to the pagetable structure, so not terribly

Well, given that there hasn't been a Xen-capable kernel.org release yet,

I think keep it simple for now; anything significant can wait for the
brave new world of unified x86.

J

-

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Hugh Dickens <hugh@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, October 4, 2007 - 9:52 pm

On Thu, 04 Oct 2007 18:43:32 -0700

Either of these two would work. Another alternative could be to
let test_and_clear_pte_flags have an exception table entry, where
we jump right to the next instruction if the instruction clearing
the flag fails.

That is the essentially variant you need for Xen, except the fast
path is still exactly the same it is as when running on native
hardware.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-

To: Rik van Riel <riel@...>
Cc: Hugh Dickens <hugh@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, October 5, 2007 - 12:15 am

Hm, that wouldn't end up clearing the bit. You'd need a Xen-specific
exception handler to do that, which would turn the whole thing into
Xen-specific code, and you're back at adding a pv-op.

J

-

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Hugh Dickens <hugh@...>, David Rientjes <rientjes@...>, Zachary Amsden <zach@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Rusty Russell <rusty@...>, Andi Kleen <ak@...>, Keir Fraser <keir@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, October 5, 2007 - 9:17 am

On Thu, 04 Oct 2007 21:15:18 -0700

Big deal. We don't care *that* much.

As long as the bit gets cleared 99.9% of the time, we're fine.
All that bit is for is determining when to swap out a page.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-

Previous thread: [IRQ map] VIA C7 CN700 2.6.23-rc9-git USB IRQs disabled by Guennadi Liakhovetski on Thursday, October 4, 2007 - 7:19 pm. (8 messages)

Next thread: [PATCH] LOCKDEP: fix mismatched lockdep_depth/curr_chain_hash by Gregory Haskins on Friday, October 5, 2007 - 12:22 am. (2 messages)