V1->V2
- Add sparc64 patch
- Single i386 and x86_64 patch
- Update attribution
- Update justification
- Update approvals
- Earlier discussion of V1 was at
  http://marc.info/?l=linux-kernel&m=117357922219342&w=2

This patchset introduces an arch independent framework to handle lists
of recently used page table pages. It is necessary for x86_64 and i386
to avoid the special casing of SLUB because these two platforms use
fields in the page struct (page->index and page->private) that SLUB
needs (and in fact SLAB also needs page->private if performing
debugging!). There is also the tendency of arches to use page flags to
mark page table pages. The slab also uses page flags. Separating page
table page allocation into quicklists avoids the danger of conflicts
and frees up page flags for SLUB and for the arch code.

Page table pages have the characteristic that they are typically zero
or in a known state when they are freed. This is usually exactly the
same state as needed after allocation. So it makes sense to build a
list of freed page table pages and then consume the pages already in
use first. Those pages have already been initialized correctly (thus no
need to zero them) and are likely already cached in such a way that the
MMU can use them most effectively. Page table pages are used in a
sparse way so zeroing them on allocation is not too useful.

Such an implementation already exists for ia64. However, that
implementation did not support constructors and destructors as needed
by i386 / x86_64. It also only supported a single quicklist. The
implementation here has constructor and destructor support as well as
the ability for an arch to specify how many quicklists are needed.

Quicklists are defined by an arch defining the necessary number of
quicklists in arch/<arch>/Kconfig. F.e. i386 needs two and thus has

config NR_QUICK
	int
	default 2

If an arch has requested quicklist support then pages can be allocated
from the quicklist (or from the page allocator if the ...
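The quicklist idea described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the code from the patch: every sketch_* name is made up, calloc() stands in for the page allocator, and there is no per-cpu handling. The link to the next free page is kept in the first word of the page itself and cleared again on allocation, so a recycled page comes back in the same known state it was freed in.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */
#define SKETCH_NR_QUICK  2	/* mirrors "config NR_QUICK ... default 2" */

/* One list of freed pages; the link lives in the first word of each page. */
struct sketch_quicklist {
	void *head;
	size_t nr_pages;
};

static struct sketch_quicklist sketch_lists[SKETCH_NR_QUICK];

/* Reuse a page from list nr if there is one, else get a zeroed page. */
static void *sketch_quicklist_alloc(int nr, void (*ctor)(void *))
{
	struct sketch_quicklist *q;
	void **page;

	assert(nr >= 0 && nr < SKETCH_NR_QUICK);
	q = &sketch_lists[nr];
	page = q->head;
	if (page) {
		q->head = page[0];	/* unlink */
		page[0] = NULL;		/* link word goes back to zero */
		q->nr_pages--;
		return page;		/* already constructed, no ctor call */
	}
	page = calloc(1, SKETCH_PAGE_SIZE);	/* stand-in for the page allocator */
	if (page && ctor)
		ctor(page);		/* only fresh pages are constructed */
	return page;
}

/* Put a page that is back in its known state onto list nr. */
static void sketch_quicklist_free(int nr, void *p)
{
	struct sketch_quicklist *q;
	void **page = p;

	assert(nr >= 0 && nr < SKETCH_NR_QUICK);
	q = &sketch_lists[nr];
	page[0] = q->head;
	q->head = page;
	q->nr_pages++;
}

/* Give pages back until at most keep remain; dtor runs only here. */
static void sketch_quicklist_trim(int nr, void (*dtor)(void *), size_t keep)
{
	struct sketch_quicklist *q;

	assert(nr >= 0 && nr < SKETCH_NR_QUICK);
	q = &sketch_lists[nr];
	while (q->nr_pages > keep) {
		void **page = q->head;

		q->head = page[0];
		q->nr_pages--;
		if (dtor)
			dtor(page);
		free(page);
	}
}
```

In this sketch the constructor runs only when a page comes from the backing allocator and the destructor only when a page leaves the list for good, which is what lets a list such as the pgd one carry initialized pages from one user to the next.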
Abstract quicklist from the IA64 implementation
Extract the quicklist implementation for IA64, clean it up
and generalize it to allow multiple quicklists and support
for constructors and destructors.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
arch/ia64/Kconfig | 4 ++
arch/ia64/mm/contig.c | 2 -
arch/ia64/mm/discontig.c | 2 -
arch/ia64/mm/init.c | 51 ---------------------------
include/asm-ia64/pgalloc.h | 82 ++++++++-------------------------------------
include/linux/quicklist.h | 81 ++++++++++++++++++++++++++++++++++++++++++++
mm/Kconfig | 5 ++
mm/Makefile | 2 +
mm/quicklist.c | 81 ++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 191 insertions(+), 119 deletions(-)
Index: linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/ia64/mm/init.c 2007-03-12 22:49:21.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c 2007-03-12 22:49:23.000000000 -0700
@@ -39,9 +39,6 @@
DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-DEFINE_PER_CPU(unsigned long *, __pgtable_quicklist);
-DEFINE_PER_CPU(long, __pgtable_quicklist_size);
-
extern void ia64_tlb_init (void);
unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL;
@@ -56,54 +53,6 @@ EXPORT_SYMBOL(vmem_map);
struct page *zero_page_memmap_ptr; /* map entry for zero page */
EXPORT_SYMBOL(zero_page_memmap_ptr);
-#define MIN_PGT_PAGES 25UL
-#define MAX_PGT_FREES_PER_PASS 16L
-#define PGT_FRACTION_OF_NODE_MEM 16
-
-static inline long
-max_pgt_pages(void)
-{
- u64 node_free_pages, max_pgt_pages;
-
-#ifndef CONFIG_NUMA
- node_free_pages = nr_free_pages();
-#else
- node_free_pages = node_page_state(numa_node_id(), NR_FREE_PAGES);
-#endif
- max_pgt_pages = node_free_pages / PGT_FRACTION_OF_NODE_MEM;
- max_pgt_pages = max(max_pgt_pages, MIN_PGT_PAGES);
- return ...

This doesn't work, and so CONFIG_QUICKLIST is always set. The NR_QUICK
thing seems a bit backwards anyways, perhaps it would make more sense
to have architectures set CONFIG_GENERIC_QUICKLIST in the same way that
the other GENERIC_xxx bits are defined, and then set NR_QUICK based off
of that. It's obviously going to be 2 or 1 for most people, and x86
seems to be the only one that needs 2.

How about this?

--

diff --git a/mm/Kconfig b/mm/Kconfig
index 7942b33..2f20860 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -163,3 +163,8 @@ config ZONE_DMA_FLAG
default "0" if !ZONE_DMA
default "1"
+config NR_QUICK
+ int
+ depends on GENERIC_QUICKLIST
+ default "2" if X86
+ default "1"
-
Both i386 and x86_64 currently need 2 and if other arches start using
quicklists then they would have the same issues. There may be other
cases in the future where these may be useful. So I think this is too

Is there a way of checking if a CONFIG_xxx is set to any value? Then we
could do

config QUICKLISTS
	depends on defined(NR_QUICK)

Alternately we could replace #ifdef CONFIG_QUICKLISTS with
#ifdef CONFIG_NR_QUICK ?
-
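The second alternative, keying the C code off CONFIG_NR_QUICK directly, might look roughly like the sketch below. It assumes the build only defines CONFIG_NR_QUICK on architectures that set it, which is the open question in this exchange; the define at the top and all sketch_* names are stand-ins for illustration, not part of any posted patch.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for what Kconfig would generate on an arch that
 * sets NR_QUICK. Remove this define to model an arch without quicklists.
 */
#define CONFIG_NR_QUICK 2

#ifdef CONFIG_NR_QUICK
#define SKETCH_NR_QUICKLISTS CONFIG_NR_QUICK

struct sketch_quicklist {
	void *page;
	int nr_pages;
};

/* The array is sized straight from the config value. */
static struct sketch_quicklist sketch_lists[CONFIG_NR_QUICK];

static struct sketch_quicklist *sketch_list(int nr)
{
	assert(nr >= 0 && nr < CONFIG_NR_QUICK);
	return &sketch_lists[nr];
}

static int sketch_have_quicklists(void)
{
	return 1;
}
#else
#define SKETCH_NR_QUICKLISTS 0

static int sketch_have_quicklists(void)
{
	return 0;
}
#endif
```

With this shape there is no separate CONFIG_QUICKLISTS symbol to keep in sync: the presence of the count is the feature test.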
From: David Miller <davem@davemloft.net>
[QUICKLIST]: Add sparc64 quicklist support.
I ported this to sparc64 as per the patch below, tested on
UP SunBlade1500 and 24 cpu Niagara T1000.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
arch/sparc64/Kconfig | 4 ++++
arch/sparc64/mm/init.c | 24 ------------------------
arch/sparc64/mm/tsb.c | 2 +-
include/asm-sparc64/pgalloc.h | 26 ++++++++++++++------------
4 files changed, 19 insertions(+), 37 deletions(-)
Index: linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/Kconfig 2007-03-12 22:49:19.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig 2007-03-12 22:53:30.000000000 -0700
@@ -26,6 +26,10 @@ config MMU
bool
default y
+config NR_QUICK
+ int
+ default 1
+
config STACKTRACE_SUPPORT
bool
default y
Index: linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/mm/init.c 2007-03-12 22:49:19.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c 2007-03-12 22:53:30.000000000 -0700
@@ -176,30 +176,6 @@ unsigned long sparc64_kern_sec_context _
int bigkernel = 0;
-struct kmem_cache *pgtable_cache __read_mostly;
-
-static void zero_ctor(void *addr, struct kmem_cache *cache, unsigned long flags)
-{
- clear_page(addr);
-}
-
-extern void tsb_cache_init(void);
-
-void pgtable_cache_init(void)
-{
- pgtable_cache = kmem_cache_create("pgtable_cache",
- PAGE_SIZE, PAGE_SIZE,
- SLAB_HWCACHE_ALIGN |
- SLAB_MUST_HWCACHE_ALIGN,
- zero_ctor,
- NULL);
- if (!pgtable_cache) {
- prom_printf("Could not create pgtable_cache\n");
- prom_halt();
- }
- tsb_cache_init();
-}
-
#ifdef CONFIG_DEBUG_DCFLUSH
atomic_t dcpage_flushes = ATOMIC_INIT(0);
#ifdef CONFIG_SMP
Index: ...
i386: Convert to quicklists
Implement the i386 management of pgd and pmds using quicklists.
The i386 management of page table pages currently uses page sized slabs.
The page state is therefore mainly determined by the slab code. However,
i386 also uses its own fields in the page struct to mark special pages
and to build a list of pgds using the ->private and ->index field (yuck!).
This has been finely tuned to work right with SLAB but SLUB needs more
control over the page struct. Currently the only way for SLUB to support
these slabs is through special casing PAGE_SIZE slabs.
If we use quicklists instead then we can avoid the mess, and also the
overhead of manipulating page sized objects through slab.
It also allows us to use standard list manipulation macros for the
pgd list using page->lru thereby simplifying the code.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
arch/i386/Kconfig | 4 ++
arch/i386/kernel/process.c | 1
arch/i386/kernel/smp.c | 2 -
arch/i386/mm/fault.c | 5 +--
arch/i386/mm/init.c | 25 -----------------
arch/i386/mm/pageattr.c | 2 -
arch/i386/mm/pgtable.c | 63 +++++++++++++++++----------------------------
include/asm-i386/pgalloc.h | 2 -
include/asm-i386/pgtable.h | 13 +++------
9 files changed, 39 insertions(+), 78 deletions(-)
Index: linux-2.6.21-rc3-mm2/arch/i386/mm/init.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/init.c 2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/init.c 2007-03-12 22:53:27.000000000 -0700
@@ -695,31 +695,6 @@ int remove_memory(u64 start, u64 size)
EXPORT_SYMBOL_GPL(remove_memory);
#endif
-struct kmem_cache *pgd_cache;
-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
- if (PTRS_PER_PMD > 1) {
- pmd_cache = ...
Convert x86_64 to using quicklists
This adds caching of pgds, puds, pmds and ptes. That way we can avoid
costly zeroing and initialization of special mappings in the pgd.
A second quicklist is useful to separate out PGD handling. We can carry
the initialized pgds over to the next process needing them.
Also clean up the pgd_list handling to use regular list macros. There
is no need anymore to avoid the lru field.
Move the add/removal of the pgds to the pgdlist into the
constructor / destructor. That way the implementation is congruent
with i386.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
arch/x86_64/Kconfig | 4 ++
arch/x86_64/kernel/process.c | 1
arch/x86_64/kernel/smp.c | 2 -
arch/x86_64/mm/fault.c | 5 +-
include/asm-x86_64/pgalloc.h | 76 +++++++++++++++++++++----------------------
include/asm-x86_64/pgtable.h | 3 -
mm/Kconfig | 5 ++
7 files changed, 52 insertions(+), 44 deletions(-)
Index: linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/Kconfig 2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig 2007-03-12 22:53:28.000000000 -0700
@@ -56,6 +56,10 @@ config ZONE_DMA
bool
default y
+config NR_QUICK
+ int
+ default 2
+
config ISA
bool
Index: linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-x86_64/pgalloc.h 2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h 2007-03-12 22:53:28.000000000 -0700
@@ -4,6 +4,10 @@
#include <asm/pda.h>
#include <linux/threads.h>
#include <linux/mm.h>
+#include <linux/quicklist.h>
+
+#define QUICK_PGD 0 /* We preserve special mappings over free */
+#define QUICK_PT 1 /* Other page table pages that are zero on free */
#define pmd_populate_kernel(mm, pmd, ...
Well if they're zero then perhaps they should be released to the page
allocator to satisfy the next __GFP_ZERO request. If that request is
for a pagetable page, we break even (except we get to remove
special-case code). If that __GFP_ZERO allocation was for some
application other than for a pagetable, we win.

iow, can we just nuke 'em?

(Will require some work in the page allocator)

(That work will open the path to using the idle thread to prezero pages)
-
Page allocator still requires interrupts to be disabled, which this
doesn't.

Considering there isn't much else that frees known zeroed pages, I
wonder if it is worthwhile.

Last time the zeroidle discussion came up was IIRC not actually real
performance gain, just cooking the 1024 CPU threaded pagefault
numbers ;)

--
SUSE Labs, Novell Inc.
-
If you want a zeroed page for pagecache and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you have good
reason to be upset.

In fact, if you want a _non_-zeroed page and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you still
have reason to be upset. You *want* that cache-hot page.

Generally, all these little private lists of pages (such as the ones
which slab had/has) are a bad deal. Cache effects preponderate and I do
think we're generally better off tossing the things into a central
pool.

Plus, we can get in a situation where we take a cache-cold, known-zero
page from the pte quicklist when there is a cache-hot, non-zero page
sitting in the page allocator. I suspect that zeroing the cache-hot
page would take a similar amount of time to a single miss against the
cache-cold page.

I'm not saying that I _know_ that the quicklists are pointless, but I
don't think it's established that they are pointful.

ISTR that experiments with removing the i386 quicklists made zero
difference, but that was an awfully long time ago. Significantly, it

Maybe, dunno. It was apparently a win on powerpc many years ago. I had
a fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of
the page. But it needed too much support in core VM to bother. Since
then we've grown per-cpu page magazines and __GFP_ZERO. Plus I'm not
aware of anyone having tried doing it on x86 with non-temporal stores.
-
On a Pentium 4? ;)

Sure, that is a minor detail, considering that you'll usually be
allocating

The thing is, pagetable pages are the one really good exception to the
rule that we should keep cache hot and initialise-on-demand. They
typically are fairly sparsely populated and sparsely accessed. Even for
last level page tables, I think it is reasonable to assume they will
usually be pretty cold.

And you want to allocate cache cold pages as well, for the same reasons
(you want to keep your cache hot pages for when they actually will be

For slab I understand. And a lot of users of slab constructors were
also silly, precisely because we should initialise on demand to keep
the cache hits up.

But cold(ish?) pagetable quicklists make sense, IMO (that is, if you
*must*

You can win on specifically constructed benchmarks, easily.

But considering all the other problems you're going to introduce, we'd
need a significant win on a significant something, IMO.

You waste memory bandwidth. You also use more CPU and memory cycles
speculatively, ergo you waste more power.

--
SUSE Labs, Novell Inc.
-
eh? I'd have thought that a pte page which has just gone through
zap_pte_range() will very often have a _lot_ of hot cachelines, and
that's a common case.

Yeah, prezeroing in idle is probably pointless. But I'm not aware of
anyone having tried it properly...
-
Ok, then what did I do wrong 3 years ago with the prezeroing patchsets? -
You merged part of it and were involved in the discussions.

General overviews:
http://lwn.net/Articles/117881/
http://lwn.net/Articles/128225/

The details on the problems with prezeroing and touching multiple
cachelines of the page:
http://www.gelato.unsw.edu.au/archives/linux-ia64/0412/12252.html
-
Well I guess that would be the case if you had just unmapped a 4MB
chunk that was pretty dense with pages.

My malloc seems to allocate and free in blocks of 128K, so that's only
going to give us 3% of the last level pte being cache hot when it gets
freed. Not sure what common mmap(file) access patterns look like.

The majority of programs I run have a smattering of llpt pages pretty
sparsely populated, covering text, libraries, heap, stack, vdso.

We don't actually have to zap_pte_range the entire page table in order
to free it (IIRC we used to have to, before the 4lpt patches).

But yeah let's see some tests. I would definitely want to avoid this
extra layer of complexity if it is just as good to return the pages

--
SUSE Labs, Novell Inc.
-
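The 3% figure can be checked with a little arithmetic. A minimal sketch, assuming i386 without PAE (4K pages and 1024 ptes per last-level table, so one table spans 4MB); the helper names are made up for illustration.

```c
#include <assert.h>

/* Number of ptes needed to map map_bytes with pages of page_size bytes. */
static unsigned long sketch_ptes_touched(unsigned long map_bytes,
					 unsigned long page_size)
{
	assert(page_size != 0);
	return map_bytes / page_size;
}

/* Share of one last-level table covered by the mapping, in 1/1000ths. */
static unsigned long sketch_table_permille(unsigned long map_bytes,
					   unsigned long page_size,
					   unsigned long ptes_per_table)
{
	unsigned long span = page_size * ptes_per_table;	/* 4MB here */

	assert(span != 0);
	return map_bytes * 1000UL / span;
}
```

For a 128K block this gives 32 ptes out of 1024, or about 31 per mille of the table, which matches the "3%" above under these assumptions.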
I'm trying to remember why we ever would have needed to zero out the pagetable pages if we're taking down the whole mm? Maybe it's because "oh, the arch wants to put this page into a quicklist to recycle it", which is all rather circular. It would be interesting to look at a) leave the page full of random garbage if we're releasing the whole mm and b) return it straight to the page allocator. -
Well we have the 'fullmm' case, which avoids all the locked pte
operations (for those architectures where hardware pt walking requires
atomicity).

However we still have to visit those to-be-unmapped parts of the page
table, to find the pages and free them. So we still at least need to
bring it into cache for the read... at which point, the store probably
isn't a big burden.

--
SUSE Labs, Novell Inc.
-
I suspect there are some tlb operations which could be skipped in that
case

It means all that data has to be written back. Yes, I expect it'll
prove to be less costly than the initial load.
-
Depends on the tlb flush implementation. The generic one doesn't look
like it is all that smart about optimising the fullmm case. It does
skip some

Still, it is something we could try.

--
SUSE Labs, Novell Inc.
-
Why not try to find a place to stash a linklist pointer and link them
all together? Saves the pulldown pagetable walk altogether.
J
-
Because we'd need one link per mm that a page is mapped in?

--
Mathematics is the supreme nostalgia of our time.
-
Can pagetable pages be shared between mms? (Kernel pmds in PAE excepted.)
J
-
Ahh, I think the issue is that we have to walk the page tables to drop
the reference count of the _actual pages_ they point to. The page
tables themselves could all be put on a list or two lists (one for
PMDs, one for everything else), but that wouldn't really be a win over
just walking the tree, especially given the extra list maintenance.

Because the fan-out is large, the bulk of the work is bringing the last
layer of the tree into cache to find all the pages in the address
space. And there's really no way around that.

--
Mathematics is the supreme nostalgia of our time.
-
From: Matt Mackall <mpm@selenic.com>

That's right. And I will note that historically we used to be much
worse in this area, as we used to walk the page table tree twice on
address space teardown (once to hit the PTE entries, once to free the
page tables).

Happily it is a one-pass algorithm now.

But, within active VMA ranges, we do have to walk all the bits at least
one time.
-
Well you -could- do this:

- reuse a long in struct page as a used map that divides the page up
  into 32 or 64 segments
- every time you set a PTE, set the corresponding bit in the mask
- when we zap, only visit the regions set in the mask

Thus, you avoid visiting most of a PMD page in the sparse case,
assuming PTEs aren't evenly spread across the PMD. This might not even
be too horrible as the appropriate struct page should be in cache with
the appropriate bits of the mm already locked, etc.

--
Mathematics is the supreme nostalgia of our time.
-
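The used-map scheme listed above can be sketched as follows. This is a standalone illustration only: the table is assumed to hold 512 eight-byte entries split into 64 segments, the map sits next to the entries rather than in a real struct page, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PTRS_PER_TABLE 512	/* assumed: 4K page of 8-byte entries */
#define SKETCH_SEGMENTS       64	/* one bit per segment in the map */
#define SKETCH_PER_SEGMENT    (SKETCH_PTRS_PER_TABLE / SKETCH_SEGMENTS)

struct sketch_table {
	uint64_t used_map;	/* stand-in for the long in struct page */
	uint64_t entry[SKETCH_PTRS_PER_TABLE];
};

/* Set an entry and mark the segment it falls in. */
static void sketch_set_entry(struct sketch_table *t, size_t idx, uint64_t val)
{
	assert(idx < SKETCH_PTRS_PER_TABLE);
	t->entry[idx] = val;
	t->used_map |= UINT64_C(1) << (idx / SKETCH_PER_SEGMENT);
}

/* Clear the table, visiting only marked segments; returns slots visited. */
static size_t sketch_zap(struct sketch_table *t)
{
	size_t visited = 0;
	size_t seg, i;

	for (seg = 0; seg < SKETCH_SEGMENTS; seg++) {
		if (!(t->used_map & (UINT64_C(1) << seg)))
			continue;	/* nothing was ever set here */
		for (i = seg * SKETCH_PER_SEGMENT;
		     i < (seg + 1) * SKETCH_PER_SEGMENT; i++) {
			t->entry[i] = 0;
			visited++;
		}
	}
	t->used_map = 0;
	return visited;
}
```

With three entries set in two segments, the zap touches 16 slots instead of 512. The map is conservative: a bit stays set until the next zap even if the entries in that segment were cleared individually in the meantime.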
And do the same in pte pages for actual mapped pages? Or do you think
they would be too densely populated for it to be worthwhile?
J
-
>>>>> "Jeremy" == Jeremy Fitzhardinge <jeremy@goop.org> writes:

Jeremy> And do the same in pte pages for actual mapped pages? Or do
Jeremy> you think they would be too densely populated for it to be
Jeremy> worthwhile?

We've been doing some measurements on how densely clumped ptes are. On
32-bit platforms, they're pretty dense. On IA64, quite a bit sparser,
depending on the workload of course. I think that's mostly because of
the larger pagesize on IA64 -- with 64k pages, you don't need very many
to map a small object.

I'm hoping IanW can give more details.
-
From: Matt Mackall <mpm@selenic.com>

Yes, I've even had that idea before.

You can even hide it behind pmd_none() et al., the generic VM doesn't
even have to know that the page table macros are doing this
optimization.
-
We never did need to modify ptes on exit() or other pagetable prunings
(not that they were ever done outside exit() before 2.6.x). The only
subtlety is that pruning on munmap() needs a TLB flush for the TLB
itself to drop the references to the pages referred to by the PTE's on
pruning in the presence of hardware pagetable walkers (in the exit()
case there are no user execution contexts left to potentially utilize
the dead translations so it's less important). That's handled by
tlb_remove_page() and shouldn't need any updates across such a change.

I believe the zeroing on teardown was largely a result of idiom vs. any
particular need. Essentially using ptep_get_and_clear() to handle the
non-pruning munmap() case in a manner unified with other pagetable
teardowns. Also likely is 2.4.x legacy from when that and possibly
earlier kernels maintained arch-private quicklists for pagetables.

There are furthermore distinctions to make between fork() and execve().
fork() stomps over the entire process address space copying pagetables
en masse. After execve() a process incrementally faults in PTE's one at
a time. It should be clear that if case analyses are of interest at
all, fork() will want cache-hot pages (cache-preloaded pages?) where
such are largely wasted on incremental faults after execve(). The copy
operations in fork() should probably also be examined in the context of
shared pagetables at some point.

-- wli
-
To make this perfectly clear, we can deal with the varying usage cases
with hot/cold flags to the pagetable allocator functions. Where bulk
copies such as fork() are happening, it makes perfect sense to
precharge the cache by eager zeroing. Where sparse single pte affairs
such as incrementally faulting things in after execve() are involved,
cache cold preconstructed pagetable pages are ideal. Address hints
could furthermore be used to precharge single cachelines (e.g. via
prefetch) in the sparse usage case.

-- wli
-
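One way to picture the hot/cold hint is the userspace sketch below. It is not an existing kernel interface: the enum, the single cold pool and every sketch_* name are assumptions made for illustration, and malloc() plus memset() stand in for the page allocator with eager zeroing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */

enum sketch_pt_hint {
	SKETCH_PT_HOT,	/* bulk user such as fork(): zero now, warm the cache */
	SKETCH_PT_COLD,	/* sparse user such as a single fault: reuse as is */
};

/* Pool of preconstructed (all zero) pages, linked via their first word. */
static void *sketch_cold_pool;

/* Caller guarantees the page is all zero again when it is freed. */
static void sketch_pt_free(void *p)
{
	assert(p != NULL);
	*(void **)p = sketch_cold_pool;
	sketch_cold_pool = p;
}

static void *sketch_pt_alloc(enum sketch_pt_hint hint)
{
	void *fresh;

	if (hint == SKETCH_PT_COLD && sketch_cold_pool) {
		void **page = sketch_cold_pool;

		sketch_cold_pool = page[0];
		page[0] = NULL;	/* back to all zero without touching the rest */
		return page;
	}
	fresh = malloc(SKETCH_PAGE_SIZE);
	if (fresh)
		memset(fresh, 0, SKETCH_PAGE_SIZE);	/* touches every cacheline */
	return fresh;
}
```

The hot path pays for zeroing the whole page but leaves it in cache for the bulk copy that follows; the cold path writes a single word and leaves the rest of the page wherever it already was.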
I don't see much point to them. For powerpc, I would rather grab an

My recollection was that it wasn't a win, but it was a long time ago...

Paul.
-
Nope that won't work.

1. We need to support other states of pages other than zeroed.

2. Prezeroing does not make much sense if a large portion of the page
   is being used. Performance is better if the whole page is zeroed
   directly before use. Prezeroing only makes sense for sparse

I already tried that 3 years ago and there was *no* benefit for usual
users of the page allocator. The advantage exists only if a small
portion of the page is used. F.e. for one cacheline there was a 4x
improvement. See lkml archives for prezeroing.
-
Unsurprised. Were non-temporal stores tried? -
pgd are not completely zeroed. They contain mappings that are always

Yes with no material change. The work led to making ia64 use
non-temporal stores for spin unlock but it was not useful for
prezeroing.
-
