Hi ! Just yesterday, I was browsing through the users of set_pte_at() to check something, and stumbled on a (new ?) bug that will introduce subtle problems on at least powerpc and s390. No big deal, I'll send a fix, but I'm becoming concerned with how fragile our page table & PTE access has become. (The bug btw is that we ptep_get_and_clear followed by a set_pte_at, at least on those architectures, you -must- flush before you put something new after you have cleared a PTE, I'll have to fixup our implementation of the new pte_modify_start/commit). With the need of the various virtual machines on x86, we've seen new page table accessors being created like there is no tomorrow, changes in the PTEs are accessed that may or may not be things we can rely on being stable in arch code, etc... Unfortunately, the arch code often has a very intimate relationship to how page tables are handled. The rules for locking, what can and cannot be done within a single PTE lock section, what can or cannot be done on a PTE, for example after it's been cleared, etc... vary in subtle ways and the way the things are today, the situation is very messy and fragile. Maybe it's time to have one head in "charge" of the page table access to try to keep some sanity, maybe it's time to write down some rules (for example, can we rely now and forever that set_pte_at() will -never- be called to write on top of an already valid PTE ?, etc...). But maybe it's time to try to move the abstraction up a bit, maybe along the lines of what Nick proposed a while ago, some kind of transactional model. That would give a lot more freedom to architectures to have their own PTE access rules and optimisations. Comments ? Ideas ? Cheers, Ben. --
When I added the ptep_modify_* interface, it occurred to me that
assuming that ptep_get_and_clear would always prevent async pte updates
was a bit optimistic, or at least presumptuous. And certainly not
flushing the tlb seems like something that just happens to work on x86
(in fact I'm not quite sure how it does work on x86).
I didn't change the function of that code when I made the change, so the
bug was pre-existing; I think it has been like that for quite a while
(though I haven't done any git archaeology to back that up). So I don't
think there's a new bug here.
There have been a few optional-extras calls, which are all fine to leave
It seems to me that the rules are roughly "anything goes while you're
holding the pte lock, but it all must be sane by the time you release
I don't know if there's any such rule regarding set_pte_at(). Certainly
if you were overwriting an existing valid entry, you'd have to arrange
Do you have a reference to Nick's proposals?
A higher level interface might give us virtualization people more scope
to play with things. But given that the current interface mostly works
for everyone, a high level interface would tend to result in a lot of
duplicated code unless you pretty strictly stipluate something like
"implement the highest level interface you need *and no higher*; use
common library code for everything else".
I think documenting the real rules for the current interface would be a
more immediately fruitful path to start with.
J
--
The bug may have been there, as I said, lots of unwritten rules... sometimes broken. I'm not necessarily blaming you, but there have been lots of changes to the PTE accessors over the last 2 years and not always under any control :-) In our case, the consequence is that the entry can be re-hashed because the fact that it was already hashed and where it was hashed, which is encoded in the PTE, gets lost by the clear. That means a potential duplicate entry in the hash. A hard to hit race, but possible. Such a condition is architecturally illegal and can cause things ranging from incorrect translation to machine checks or checkstops (generally, on LPAR machines, what will happen is your partition will get killed). I know s390 has different issues & constraints. Martin told me during Again, I'm not necessarily talking about the very latest round of changes that you did here... More like a general approach as trying to find a better overall interface to PTE access which currently is a total Well, unfortunately it's a lot more complex than that. For example, on powerpc, we have to keep the hash table in sync and hash misses don't take the PTE lock. They are aking to a non-atomic HW walk of the page tables if you prefer. So we must be careful about things like never replacing a PTE without first flushing the hash table entry for the previous one (if it was hashed). We also need to ensure we flush before we unlock, but we do that by re-using the new lazy MMU hooks, as to not insert a new PTE (which can potentially be hashed). Well, for example, the generic copy-on-write case used to do just that, ie set_pte_at() over the previous one, then flush. That is not good for us and thus our set_pte_at() has special code to check whether there's already a present PTE there and does a synchronous flush if there is. However, that also changed, and nowadays, afaik, set_pte_at() is never called anymore to override a already present PTE. We are doing some work on 32 bits code that ...
I got something far enough to compile and boot (iirc on x86 and maybe ia64). Basically it was designed with fine grained pte locking in mind which is why I stopped working on it after the pmd locking from Hugh, but I imagine it could be used in some cases to do more intelligent things with pte manipulation. IIRC, basically you would have a pte_modify_start, which gives back a cookie and the pte value, and pte_modify_end which takes a new pte value and the cookie. This would allow the arch to see exactly how the pte was changed, but I imagine in some cases you would still prefer to do that with optimised direct functions rather than test and branches. So I don't know if it would be a magic cure ;) Part of the problem with the pte API (as well as the cache flush and tlb flush APIs) is that it often involves the core mm code telling the arch how it thinks ptes,tlbs,caches should be managed, rather than I think the better approach would be telling the arch what it wants to do. We are getting better slowly I think (eg. you note that set_pte_at is no longer used as a generic "do anything"), but I won't dispute that this whole area could use an overhaul; a document for all the rules, a single person or point of responsibility for those rules... --
From: Nick Piggin <npiggin@suse.de> I agree. To a certain extent this is what BSD does in it's pmap layer, except that they don't have the page table datastructure abstraction like Linus does in the generic code, and which I think was a smart design decision on our side. All of the pmap modules in BSD are pretty big and duplicate a lot of code that arch's don't have to be mindful about under Linux. --
I definitely agree, I don't think we want to go away from the page table as being the abstraction :-) But I'm wondering if we can do a little bit better with the accessors to those page tables. BTW. am I the only one to have got one copy of David's reply (that I'm quoting) coming with a From: Nick Piggin in the headers ? (apparently coming from kvack). Cheers, Ben. --
No. I see that, and so does marc. http://marc.info/?t=122184627700007&r=1&w=2 And I've only ever seen it from Dave on kvack. --
Can we nowadays -rely- on set_pte_at() never being called to overwrite an already valid PTE ? I mean, it looks like the generic code doesn't do it anymore but I wonder if it's reasonable to forbid that from coming back ? That would allow me to remove some hacks in ppc64 and simplify some upcoming ppc32 code. Cheers, Ben. --
A good first step might be to define some conventions. For example,
define that set_pte*() *always* means setting a non-valid pte to either
a new non-valid state (like a swap reference) or to a valid state.
modify_pte() would modify the flags of a valid
pte, giving a new valid pte. etc...
It may be that a given architecture collapses some or all of these down
to the same underlying functionality, but it would allow the core intent
to be clearly expressed.
What is the complete set of primitives we need? I also noticed that a
number of the existing pagetable operations are used only once or twice
in the core code; I wonder if we really need such special cases, or
whether we can make each arch pte operation carry a bit more weight?
Also, rather than leaving all the rule enforcing to documentation and a
maintainer, we should also consider having a debug mode which adds
enough paranoid checks to each operation so that any rule breakage will
fail obviously on all architectures.
J
--
Yup. Or make it clear that ptep_set_access_flags() should only be used to -relax- access (ie, set dirty, writeable, accessed, ... but not Yes, that was some of my concern. It's getting close to having one API We could do both. Now, regarding operations, let's first find the major call sites, see what I miss. I'm omitting free_* in memory.c as those are for freeing pte pages, not accessing PTEs themselves. I'm also ignoring read-only call sites and hugetlb for now. * None-iterative accessors - handle_pte_fault in memory.c, on "fixup" faults (pte is present and it's not a COW), for fixing up DIRTY and ACCESSED (btw, could we make that also fixup EXEC ? I would like this for some stuff I'm working on at the moment, ie set it if the vma has VM_EXEC and it was lost from the PTE as I might want to mask it out of PTEs under some circumstances). Textbook usage of ptep_set_access_flags(), so that's fine. - do_wp_page() in memory.c for COW or fixup of shared writeable mapping writeable-ness. Doesn't overwrite existing PTE for COW anymore, it uses clear_flush nowadays and fixup of shared writeable mapping uses ptep_set_access_flags() as it should, so that's all good. - insert_pfn() and insert_page() still in memory.c for fancy page faults. Just a trivial set_pte_at() of a !present one, no big deal here - RMAP ones ? Some ad-hoc stuff due to _notify thingies. * Iterative accessors (some don't batch, maybe they could/should). - zapping a mapping (zap_p*) in memory.c - fork (copy_p*) in memory.c could batch better maybe ? - setting linear user mappings (remap_p*) in memory.c, trivial set_pte_at() on a range, pte's should be !present I think. - mprotect (change_p*) in memory.c, which has the problem I mentioned - moving page tables (move_p*), pretty trivial clear_flush + set_pte_at - clear_regs_pte_range via walk_page_range in fs/proc/task_mmu.c, does a test_and_clear_young, flushes mm afterward, could use some lazy stuff so we can batch properly on ppc64. - ...
I don't think that is a huge problem as such... if there was lots of repeated uses of the API I'd also be concerned about mm/ code not being well factored :) My concern is that things aren't well documented, maybe not consistent enough, and are usually named according to what their implementation looks like on the favourite arch of the person who introduced them, rather than what the VM needs to get done :P Which leads some architectures (I'm looking at ia64 ;)) to reinvent things or just add new primitives rather than finding existing common code. Which makes the problem worse. And it makes generic VM developers not be able to follow what's going on This is one example of being too low level. It wouldn't be hard to have an arch where the SMC race does not apply, or even it is probably possible to avoid the flush in the case of single-threaded mm. Can't do that however, fremap is only ever doing clear_flush on present ptes, or set_pte_at on pte_none_ptes (and in this case it is going from !present to !present, which is probably also unusual in some cases for arch code to deal with, although it is not restricted to fremap I guess) --
>>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes: Nick> On Tue, Sep 23, 2008 at 04:49:32PM +1000, Benjamin Herrenschmidt Nick> I don't think that is a huge problem as such... if there was Nick> lots of repeated uses of the API I'd also be concerned about mm/ Nick> code not being well factored :) Is it worth taking another look at the page-table abstraction layer that Darren Williams posted here last year? PeterC --
The powerpc bug whereof you write appears to have been there since ... linux-2.4.0 or earlier: entry = ptep_get_and_clear(pte); set_pte(pte, pte_modify(entry, newprot)); But perhaps powerpc was slightly different back in those days. It sounds to me like a bug in your current ptep_get_and_clear(), Then I hope he will probably send Linus the fix. Though what we already have falls somewhat short of perfection, I've much more enthusiasm for fixing its bugs, than for any fancy redesign introducing its own bugs. Others have more stamina! Hugh --
Yes, I figured out the bug was already there. And no, it's not the right approach to have ptep_get_and_clear() flush because it would mean that call cannot batch flushes, and thus we would lose ability to Well, the current set accessor, as far as I'm concerned is a big pile of steaming shit that evolved from x86-specific gunk raped in different horrible ways to make it looks like it fits on other architectures and additionally mashed with goo to make it somewhat palatable by virtualization stuff. Yes, bugs can be fixed but it's still an horrible mess. Now, regarding the above bug, I'm afraid the only approaches I see that would work would be to have either a ptep_get_and_clear_flush(), which I suppose x86 virt. people will hate, or maybe to actually have a powerpc specific variant of the new start/commit hooks that does the flush. Ben. --
What do you propose then? Ideally one would like to get something that
works for powerpc, s390, all the wacky ia64 modes as well as x86. The
ia64 folks proposed something, but I've not looked at it closely. From
an x86 virtualization perspective, something that's basically x86 with
as much scope for batching and deferring as possible would be fine.
As a start, what's the state machine for a pte? What states can it be
in, and how does it move from state to state? It sounds like powerpc
has at least one extra state above x86 (hashed, with the hash key stored
ptep_get_and_clear() is not batchable anyway, because the x86
implementation requires an atomic xchg on the pte, which will likely
result in some sort of trap (and if it doesn't then it doesn't need
batching). The start/commit API was specifically so that we can do the
mprotect (and fork COW updates) in a batchable way (in Xen its
implemented with a pte update hypercall which updates the pte without
affecting the A/D bits).
J
--
That's where things get interesting. I liked Nick ideas of doing something transactional that could encompass the lock, bach and flushing We store in the PTE whether it was hashed, and the location within a hash bucket. (For each hash value, there's 8 buckets, or rather 16 if you count our secondary hashing). We must never write a new valid PTE after we cleared a hashed one without having a flush in between. On 32 bits we have less state (only the 'hashed' bit) but the problem is similar, though we handle it differently: we never clear the hash bit until we flush the hash, ie, pte_clear doesn't clear the hash bit. On 64-bit we do things differently, we do clear PTEs and pile up in a per-cpu batch what needs to be flushed, the flush then happens when Well, ptep_get_and_clear() used to be used by zap_pte_range() which I _HOPE_ was batchable on x86 :-) Nowadays, there's this new ptep_get_and_clear_full() (yet another totally meaningless name for an ad-hoc API added for some random special purpose) that zap_pte_range() uses. Maybe that one is now subtly different such as it can be used to batch on x86 ? In any case, powerpc batches -everything- (unless it's called *_flush in which case the flush is immediate) in a private per-cpu batch and I think we have different ideas of what batch means but yeah, we do batch everything including these on powerpc without the new start/commit interface. --
Yes, that sounds fine in principle, but the practise gets tricky. The
trouble with that kind of interface is that it can be fairly heavyweight
unless you amortise the cost of the transaction setup/commit with
Yeah. zap_pte_range isn't great when doing a munmap, but the _full gunk
lets it special-case process teardown and it ends up not being a
performance problem (in the Xen case, we've already switch to another
pagetable at that point, so it isn't really a pagetable any more and
Likely; it's one of those overused generic words. The specific meaning
I'm using is "we can roll a bunch of updates into a single hypercall".
Operations which do atomic RMW or fetch-set are typically not batchable
because the caller wants the result *now* and can't wait for a deferred
result.
J
--
I agree. Anyway, I'm sure if we get everybody around the same table, we can find something, maybe not that high level, but a better set of accessors that fits all needs with a more precise semantic. Anyway, I'll see if I can come up with ideas later that we can discuss. It's also all inline so it does give us flexibility in having things Yes. We used to do it differently but it was actually racy in a subtle Right. For us it's mostly about flushing the hash entries, though there is no fundamental difference I believe here, it's about updating a cache, whether it's the TLB, our hash table, hypervisor shadows, etc... and we -should- be able to find a more sensible common ground. Cheers, Ben. --
Whyever not the latter? Jeremy seems to have gifted that to you, for precisely such a purpose. Hugh p.s. I surely agree with you over the name ptep_get_and_clear_full(): horrid, even more confusing than the tlb->fullmm from which it derives its name. I expect I'd agree with you over a lot more too, but please, bugfixes first. --
Yeah. Not that I don't quite understand what the point of the start/modify/commit thing the way it's currently used in mprotect since we are doing the whole transaction for a single PTE change, ie how does that help with hypervisors vs. a single ptep_modify_protection() for example is beyond me :-) When I think about transactions, I think about starting a transaction, changing a -bunch- of PTEs, then commiting... Essentially I see the PTE lock thing as being a transaction. Cheers, Sure. --
I think we need to be a bit clearer about the terminology:
A batch is a bunch of things that can be optionally deferred so they can
be done in chunks.
A transaction is something that must either complete successfully, or
have no effect at all. Doing multiple things in a transaction means
that they must all complete or none. (In general we assume there's
nothing about these low-level pagetable operations which can fail, so we
can ignore the failure part of transactions.)
In this case, a batch is not a transaction. Doing things between
arch_enter_lazy_mmu_mode/arch_leave_lazy_mmu_mode makes no guarantees
about when operations are performed, other than guaranteeing that
they'll all be done by the time arch_leave_lazy_mmu_mode returns.
Everything about how things are chunked into batches are up to the
underlying architecture, and the calling code can't make any assumptions
about it (the specific problem we've had to fix in a couple of places is
things expecting to be able to read back their recent pagetable
modifications immediately after issuing the call).
The ptep_modify_prot_start/commit pair specifies a single pte update in
such a way to allow more implementation flexibility - ie, there's no
naked requirement for an atomic fetch-and-clear operation. I chose the
transaction-like terminology to emphasize that the start/commit
functions must be strictly paired; there's no way to fail or abort the
"transaction". A whole group of those start/commit pairs can be batched
together without affecting their semantics.
J
--
I still can't see the point of having now 3 functions instead of just one such as ptep_modify_protection(). I don't see what it buys you other than adding gratuituous new interfaces. Ben;. --
Yeah, that would work too; that's pretty much how Xen implements it
anyway. The main advantage of the start/commit pair is that the
resulting code was completely unchanged from the old code. The mprotect
sequence using ptep_modify_protection would end up reading the pte twice
before writing it.
J
--
Not necessarily .. depends how you factor out the interface to it. Anyway, not a big deal now. I'll do a patch to fix the hole on powerpc, and if my brain clicks, over the next few weeks, I'll see if I can come up with an overall nicer API covering all usages. In many case might just be a matter of giving a saner name to existing calls and documenting them properly tho :-) Cheers, Ben. --
As far as I can tell the current code should work. It is not pretty though, in particular the nasty pairing of flush_tlb_mm() with ptep_set_wrprotect() and flush_tlb_range() with change_protection() is fragile. For me the question is if we can find a sensible set of basic primitives that work for all architectures in a performant way. This is really hard.. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
