These patches make page tables relocatable for NUMA, memory defragmentation, and memory hotplug. The potential need to rewalk the page tables before making any changes causes a 1.6% performance degradation in the lmbench page miss micro benchmark.

Page table relocation is critical for several projects.

1) NUMA system process relocation. Currently, when a process is migrated from one node to another, the page tables are left behind. This means increased latency and inter-node traffic on all page faults. Migrating the page tables with the process will be a performance win.

2) Memory hotplug. Currently memory hotplug cannot move page tables out of the memory that is about to be removed. This code is a first step to being able to move page tables out of memory that is going to be unplugged.

3) Memory defragmentation. Currently page tables are the largest chunk of non-moveable memory (needs verification). By making page tables relocatable, we can decrease the number of memory fragments and allow for higher order allocations. This is important for supporting huge pages, which can greatly improve performance in some circumstances.

Currently the high level memory relocation code only supports 1 above. The low level routines can be used to migrate any page table to any new page. However, 1 seems to be the best case for correctness testing and is also the easiest place to hook into existing code, so that is what is currently supported. The low level page table relocation routines are generic and will be easy to use for 2 and 3.

Signed-off-by: rossb@google.com
-----
Major changes from the previous version include more comments and replacing the semaphore that serialized access to the relocation code with an integer count of the number of times it has been reentered. The latter led to an improvement from a 3% performance loss to a 1.6% performance loss vs no relocation code (There was also a version change from 2.6.23 to 2.6.25-rc9 since the last benchmark, so some ...
So you mean the check to see if there's a migration currently in

I've read through this patch a couple of times so far, but I still don't quite get it. The "why" rationale is good, but it would be nice to have a high-level "how" paragraph which explains the overall principle of operation. (OK, I think I see how all this fits together now.)

From looking at it, a few points to note:
- It only tries to move usermode pagetables. For the most part (at least on x86) the kernel pagetables are fairly static (and effectively statically allocated), but vmalloc does allocate new kernel pagetable memory. As a consequence, it doesn't need to worry about tlb-flushing global pages or unlocked updates to init_mm.
- It would be nice to explain the "delimbo" terminology. I got it in the end, but it took me a while to work out what you meant.

Open questions in my mind:
- How does it deal with migrating the accessed/dirty bits in ptes if cpus can be using old versions of the pte for a while after the copy? Losing dirty updates can lose data, so explicitly addressing this point in code and/or comments is important.
- Is this deeply incompatible with shared ptes?
- It assumes that each pagetable level is a page in size. This isn't even true on x86 (32-bit PAE pgds are not), and definitely not true on other architectures. It would make sense to skip migrating non-page-sized pagetable levels, but the code could/should check for it.
- Does it work on 2 and 3-level pagetable systems? Ideally the clever folding stuff would make it all fall out naturally, but somehow that never seems to end up working.
- Could you use the existing tlb batching machinery rather than MMF_NEED_FLUSH? They seem to overlap somewhat.
- What architectures does this support? You change a lot of arch files, but it looks to me like you've only implemented this for x86-64. Is that right? A lot of this patch won't apply to x86 at the moment because of ...
Yup. But the page fault code is so efficient, that a test and
There are comments in migrate.c on the how. If they are insufficient, please indicate what you would like to see. I've been staring at the
I never liked the delimbo terminology, but it's the best I've been able to come up with so far. I'm open to changing it. Otherwise I can
It doesn't currently. Although it's easy to fix. Just before the free, we just have to copy the dirty bits again. Slow, but not in a
Not deeply. It just doesn't support them at the moment (although it doesn't check either.) It would just need to do all the pmd's
I've never tried to compile it on anything other than a 4 level system. I suspect it will fail, but a couple of well placed #ifdef's
As of 2.6.22, I couldn't use any of the existing batching. They do overlap, but not 100% and I didn't want to impact the other batching
It currently only supports X86_64. There are only a couple of missing things to support other architectures. The tlb_reload code needs to be created on all architectures and the node specific page table allocation code needs to be created. I'm waiting for the x86 unification to settle out before doing another merge. My guess is that it should support all 4 level page table x86 variants at that point. The 3 level variants will take a little
It's had tons of testing on moving a few simple programs around on a fake numa system. It's had no testing on a real numa system and I
I think it was a simple little macro at one point. At this point, it
Let me know what you decide to do here. It shouldn't be too hard to
Not anymore. It's been replaced by the nesting count and just didn't
When I started, none of the other page table walkers were pure walkers. I'll take a look and see if I can find one I can use. Otherwise this is only really guaranteed to work on 4 level page tables at this point. It should
Correct, but I included it for completeness. We could eliminate it
Mostly I ...
Yeah, it's easy for that to happen. Those comments are helpful; I was
thinking very high-level things like:
* what initiates migration?
* how can an mm be under multiple levels of migration?
* ...?
Just a comment saying something like "a pagetable page is considered to
be in limbo if it has been copied, but may still be in use. It may be
either in a cpu's stale tlb entry, or in use by the kernel on another
cpu with a transient reference." would clarify what the delimbo is
But the issue I'm concerned about is what happens if a process writes
the page, causing its cpu to mark the (old, in-limbo) pte dirty.
Meanwhile someone else is scanning the pagetables looking for things to
evict. It checks the (shiny new) pte, finds it not dirty, and decides to
evict the apparently clean page.
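
To make the scenario above concrete, here is a toy userspace model of the lost dirty bit, together with the "copy the dirty bits again before the free" step mentioned earlier in the thread. It is only a sketch: the struct, the flag value, and every function name are made up for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_PAGE 4      /* tiny table, for illustration only */
#define PTE_DIRTY     0x1u   /* hypothetical flag bit, not the real x86 layout */

/* Hypothetical stand-in for one page of ptes. */
struct pt_page {
    unsigned int pte[PTES_PER_PAGE];
};

/* Copy a pagetable page; after this the old page is "in limbo". */
static void relocate_copy(struct pt_page *new_pg, const struct pt_page *old_pg)
{
    for (size_t i = 0; i < PTES_PER_PAGE; i++)
        new_pg->pte[i] = old_pg->pte[i];
}

/* A cpu still holding a stale reference marks the OLD pte dirty. */
static void stale_cpu_write(struct pt_page *old_pg, size_t idx)
{
    assert(idx < PTES_PER_PAGE);
    old_pg->pte[idx] |= PTE_DIRTY;
}

/* What a reclaim scan sees when it looks only at the NEW page. */
static bool scanner_sees_dirty(const struct pt_page *new_pg, size_t idx)
{
    assert(idx < PTES_PER_PAGE);
    return (new_pg->pte[idx] & PTE_DIRTY) != 0;
}

/* Fold any late dirty bits from the limbo page into the new page. */
static void merge_dirty_before_free(struct pt_page *new_pg,
                                    const struct pt_page *old_pg)
{
    for (size_t i = 0; i < PTES_PER_PAGE; i++)
        new_pg->pte[i] |= old_pg->pte[i] & PTE_DIRTY;
}
```

Calling relocate_copy(), then stale_cpu_write() on the old page, leaves scanner_sees_dirty() false on the new page until merge_dirty_before_free() runs. A scan that lands between the copy and the merge still sees a clean pte, so the merge alone narrows the window without closing it.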
What, for that matter, stops a page from being evicted from under a
limboed mapping? Does it get accounted for (I guess the existing tlb
flushing should be sufficient to keep it under control).
Also, what happens if a page happens to get migrated twice in quick
succession (i.e., while there's still an in-limbo page from the first
Well, x86-64 is 4 level, x86-32 PAE is 3 level, and x86-32 non-PAE is 2
level. I don't think there'd be too much crying if you didn't support
32-bit non-PAE, but 32-bit PAE is useful.
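
For reference, the level counts above combine with the earlier point about non-page-sized levels roughly as follows. This is a userspace sketch assuming 4 KiB pages and the usual x86 entry counts; the struct, the array names, and level_is_page_sized() are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_PAGE_SIZE 4096u   /* assumes 4 KiB pages */

/* Hypothetical description of one pagetable level. */
struct pt_level {
    const char *name;
    unsigned int entries;       /* entries per table at this level */
    unsigned int entry_bytes;   /* size of one entry */
};

/* x86-64: 4 levels, each 512 entries of 8 bytes = one page. */
static const struct pt_level x86_64_levels[] = {
    { "pgd", 512, 8 }, { "pud", 512, 8 }, { "pmd", 512, 8 }, { "pte", 512, 8 },
};

/* x86-32 PAE: 3 levels; the pgd has only 4 entries (32 bytes). */
static const struct pt_level x86_32_pae_levels[] = {
    { "pgd", 4, 8 }, { "pmd", 512, 8 }, { "pte", 512, 8 },
};

/* x86-32 non-PAE: 2 levels, each 1024 entries of 4 bytes = one page. */
static const struct pt_level x86_32_levels[] = {
    { "pgd", 1024, 4 }, { "pte", 1024, 4 },
};

/* The kind of check relocation code could make before moving a level. */
static bool level_is_page_sized(const struct pt_level *lvl)
{
    assert(lvl != NULL);
    return lvl->entries * lvl->entry_bytes == EXAMPLE_PAGE_SIZE;
}
```

With these numbers only the PAE pgd fails the check, which is the one level a page-granular relocation would have to skip on x86.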
I would say that x86 unification is still a fair way from "done", but
OK. I was planning on making the change anyway, and this is just
I wouldn't eliminate it for speed, but it could be misleading if someone
thought that it would actually do something. Don't know; no clear
answer. Given that delimbo_X are inlines, you could easily put a "if
(mm == &init_mm)" to skip everything, which would make these cases
compile to nothing (and "if (__builtin_constant_p(mm) && mm ==
&init_mm)" if you really want to make sure there's no additional
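
A minimal userspace sketch of the early-exit idea, in case it helps. The struct, the limbo_pending field, delimbo_slow(), and delimbo_example() are all hypothetical stand-ins, not the patch's actual delimbo_X helpers, and init_mm here is just a local object playing the kernel's role.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's mm_struct. */
struct mm_struct {
    int limbo_pending;   /* made-up flag: relocation work outstanding */
};

static struct mm_struct init_mm;   /* stand-in for the kernel's init_mm */
static int slow_path_calls;        /* counts how often the slow path runs */

/* Made-up slow path: pretend to resolve the in-limbo pagetable pages. */
static void delimbo_slow(struct mm_struct *mm)
{
    slow_path_calls++;
    mm->limbo_pending = 0;
}

static inline void delimbo_example(struct mm_struct *mm)
{
    assert(mm != NULL);

    /*
     * Kernel pagetables are never relocated, so skip everything.
     * When the caller passes &init_mm literally, the compiler can
     * fold this branch and the call compiles to nothing. The
     * stricter variant from the mail would read:
     *   if (__builtin_constant_p(mm) && mm == &init_mm)
     */
    if (mm == &init_mm)
        return;

    if (mm->limbo_pending)
        delimbo_slow(mm);
}
```

The plain comparison costs one test on non-constant callers; the __builtin_constant_p form trades that away by skipping only when the compiler can prove the argument is init_mm.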
It's OK for them to be using the old mm while you're migrating because
they'll be using the ...

The delimbo functions can be extended to deal with the dirty bit.
They already have to be called to make sure the cpu is looking at the
proper page flags. The easiest solution to the races is probably to
make the delimbo pte functions flush the tlb cache to make sure the
cpu will also be looking at the correct entry to update flags.
Otherwise the atomic ptep* functions would probably need to be
modified.
Ross
--
