These patches make page tables relocatable for NUMA, memory defragmentation, and memory hotplug. The potential need to rewalk the page tables before making any changes causes a 1.6% performance degradation in the lmbench page-miss microbenchmark.

Page table relocation is critical for several projects:

1) NUMA process relocation. Currently, when a process is migrated from one node to another, the page tables are left behind. This means increased latency and inter-node traffic on all page faults. Migrating the page tables along with the process will be a performance win.

2) Memory hotplug. Currently memory hotplug cannot move page tables out of the memory that is about to be removed. This code is a first step toward being able to move page tables out of memory that is going to be unplugged.

3) Memory defragmentation. Currently page tables are the largest chunk of non-movable memory (needs verification). By making page tables relocatable, we can decrease the number of memory fragments and allow higher-order allocations. This is important for supporting huge pages, which can greatly improve performance in some circumstances.

Currently the high-level memory relocation code only supports 1) above. The low-level routines can be used to migrate any page table to any new page. However, 1) seems to be the best case for correctness testing and is also the easiest place to hook into existing code, so that is what is currently supported. The low-level page table relocation routines are generic and will be easy to use for 2) and 3).

Signed-off-by: email@example.com

-----

Major changes from the previous version include more comments, and replacing the semaphore that serialized access to the relocation code with an integer count of the number of times it has been reentered. The latter led to an improvement from a 3% performance loss to a 1.6% performance loss vs. no relocation code. (There was also a version change from 2.6.23 to 2.6.25-rc9 since the last benchmark, so some ...
So you mean the check to see if there's a migration currently in ...

I've read through this patch a couple of times so far, but I still don't quite get it. The "why" rationale is good, but it would be nice to have a high-level "how" paragraph which explains the overall principle of operation. (OK, I think I see how all this fits together now.)

From looking at it, a few points to note:

- It only tries to move usermode pagetables. For the most part (at least on x86) the kernel pagetables are fairly static (and effectively statically allocated), but vmalloc does allocate new kernel pagetable memory. As a consequence, it doesn't need to worry about tlb-flushing global pages or unlocked updates to init_mm.

- It would be nice to explain the "delimbo" terminology. I got it in the end, but it took me a while to work out what you meant.

Open questions in my mind:

- How does it deal with migrating the accessed/dirty bits in ptes if cpus can be using old versions of the pte for a while after the copy? Losing dirty updates can lose data, so explicitly addressing this point in code and/or comments is important.

- Is this deeply incompatible with shared ptes?

- It assumes that each pagetable level is a page in size. This isn't even true on x86 (32-bit PAE pgds are not), and definitely not true on other architectures. It would make sense to skip migrating non-page-sized pagetable levels, but the code could/should check for it.

- Does it work on 2- and 3-level pagetable systems? Ideally the clever folding stuff would make it all fall out naturally, but somehow that never seems to end up working.

- Could you use the existing tlb batching machinery rather than MMF_NEED_FLUSH? They seem to overlap somewhat.

- What architectures does this support? You change a lot of arch files, but it looks to me like you've only implemented this for x86-64. Is that right? A lot of this patch won't apply to x86 at the moment because of ...
Yup. But the page fault code is so efficient that a test and ...

There are comments in migrate.c on the how. If they are insufficient, please indicate what you would like to see. I've been staring at the ...

I never liked the delimbo terminology, but it's the best I've been able to come up with so far. I'm open to changing it. Otherwise I can ...

It doesn't currently, although it's easy to fix. Just before the free, we have to copy the dirty bits again. Slow, but not in a ...

Not deeply. It just doesn't support them at the moment (although it doesn't check, either). It would just need to do all the pmds ...

I've never tried to compile it on anything other than a 4-level system. I suspect it will fail, but a couple of well-placed #ifdefs ...

As of 2.6.22, I couldn't use any of the existing batching. They do overlap, but not 100%, and I didn't want to impact the other batching ...

It currently only supports x86-64. There are only a couple of things missing to support other architectures: the tlb_reload code needs to be created on all architectures, and the node-specific page table allocation code needs to be created. I'm waiting for the x86 unification to settle out before doing another merge. My guess is that it should support all 4-level page table x86 variants at that point. The 3-level variants will take a little ...

It's had tons of testing moving a few simple programs around on a fake NUMA system. It's had no testing on a real NUMA system and I ...

I think it was a simple little macro at one point. At this point, it ...

Let me know what you decide to do here. It shouldn't be too hard to ...

Not anymore. It's been replaced by the nesting count and just didn't ...

When I started, none of the other page table walkers were pure walkers. I'll take a look and see if I can find one I can use. Otherwise this is only really guaranteed to work on 4-level page tables at this point. It should ...

Correct, but I included it for completeness. We could eliminate it ...

Mostly I ...
Yeah, it's easy for that to happen.

Those comments are helpful. I was thinking of very high-level things like:

* what initiates migration?
* how can an mm be under multiple levels of migration?
* ...?

Just a comment saying something like "a pagetable page is considered to be in limbo if it has been copied, but may still be in use. It may be either in a cpu's stale tlb entry, or in use by the kernel on another cpu with a transient reference." would clarify what the delimbo is ...

But the issue I'm concerned about is what happens if a process writes the page, causing its cpu to mark the (old, in-limbo) pte dirty. Meanwhile someone else is scanning the pagetables looking for things to evict. It checks the (shiny new) pte, finds it not dirty, and decides to evict the apparently clean page.

What, for that matter, stops a page from being evicted from under a limboed mapping? Does it get accounted for? (I guess the existing tlb flushing should be sufficient to keep it under control.)

Also, what happens if a page happens to get migrated twice in quick succession (ie, while there's still an in-limbo page from the first ...

Well, x86-64 is 4-level, x86-32 PAE is 3-level, and x86-32 non-PAE is 2-level. I don't think there'd be too much crying if you didn't support 32-bit non-PAE, but 32-bit PAE is useful.

I would say that x86 unification is still a fair way from "done", but ...

OK. I was planning on making the change anyway, and this is just ...

I wouldn't eliminate it for speed, but it could be misleading if someone thought that it would actually do something.

Don't know; no clear answer.

Given that delimbo_X are inlines, you could easily put an "if (mm == &init_mm)" to skip everything, which would make these cases compile to nothing (and "if (__builtin_constant_p(mm) && mm == &init_mm)" if you really want to make sure there's no additional ...

It's OK for them to be using the old mm while you're migrating, because they'll be using the ...
The delimbo functions can be extended to deal with the dirty bit. They already have to be called to make sure the cpu is looking at the proper page flags. The easiest solution to the races is probably to make the delimbo pte functions flush the tlb cache, to make sure the cpu will also be looking at the correct entry to update flags. Otherwise the atomic ptep* functions would probably need to be modified.

Ross
--