Re: [QUICKLIST 0/4] Arch independent quicklists V2

Previous thread: [PATCH] Introduce load_TLS to the "for" loop. by Rusty Russell on Monday, March 12, 2007 - 11:39 pm. (6 messages)

Next thread: RE: Question: removal of syscall macros? by albcamus on Monday, March 12, 2007 - 8:06 pm. (1 message)
From: Christoph Lameter
Date: Tuesday, March 13, 2007 - 12:13 am

V1->V2
- Add sparc64 patch
- Single i386 and x86_64 patch
- Update attribution
- Update justification
- Update approvals
- Earlier discussion of V1 was at
  http://marc.info/?l=linux-kernel&m=117357922219342&w=2

This patchset introduces an arch independent framework to handle lists
of recently used page table pages. It is necessary for x86_64 and
i386 to avoid the special casing of SLUB because these two platforms
use fields in struct page (page->index and page->private)
that SLUB needs (and in fact SLAB also needs page->private if
performing debugging!). There is also the tendency of arches to use
page flags to mark page table pages. The slab also uses page flags.
Separating page table page allocation into quicklists avoids the danger
of conflicts and frees up page flags for SLUB and for the arch code.

Page table pages have the characteristics that they are typically zero
or in a known state when they are freed. This is usually exactly the
same state as is needed after allocation. So it makes sense to build a list
of freed page table pages and then reuse those previously used pages
first. Those pages have already been initialized correctly (thus no
need to zero them) and are likely already cached in such a way that
the MMU can use them most effectively. Page table pages are used in
a sparse way so zeroing them on allocation is not too useful.
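
To make the idea concrete, here is a minimal userspace sketch of such a
list. It is not the code from this patchset: the names ql_alloc and
ql_free are made up for illustration, calloc() stands in for the page
allocator, and the real thing would be per-CPU. The list is threaded
through the first word of each freed page, so no extra memory is needed,
and that word is cleared again on allocation so the page is handed out
all-zero.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* Hypothetical userspace stand-in for one per-CPU quicklist. */
struct quicklist {
	void *head;		/* first free page, or NULL */
	unsigned long nr_pages;	/* pages currently on the list */
};

/* Take a page from the list, falling back to a fresh zeroed page. */
static void *ql_alloc(struct quicklist *q)
{
	void **page = q->head;

	if (!page)
		return calloc(1, PAGE_SIZE);	/* stands in for the page allocator */

	q->head = page[0];	/* unlink: the first word held the next pointer */
	page[0] = NULL;		/* restore the all-zero state */
	q->nr_pages--;
	return page;
}

/* Put a page back. The caller guarantees it is all-zero again. */
static void ql_free(struct quicklist *q, void *p)
{
	void **page = p;

	page[0] = q->head;	/* thread the list through the page itself */
	q->head = page;
	q->nr_pages++;
}
```

A page freed with ql_free() is the next one returned by ql_alloc(), without
being zeroed a second time.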

Such an implementation already exists for ia64. However, that implementation
did not support constructors and destructors as needed by i386 / x86_64.
It also only supported a single quicklist. The implementation here has
constructor and destructor support as well as the ability for an arch to
specify how many quicklists are needed.
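
A rough userspace sketch of what constructor / destructor support and
multiple lists could look like follows. All names (ql_alloc, ql_free,
ql_trim, pgd_ctor, pgd_dtor, live_pgds) are hypothetical and the
signatures are not taken from the patch. The assumption made here is that
the constructor runs only when a page comes fresh from the allocator and
the destructor only when a page leaves the list for good, so a recycled
page keeps its constructed state.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define NR_QUICK  2	/* assumed: the value an arch would pick in Kconfig */

struct quicklist {
	void *head;
	unsigned long nr_pages;
};

static struct quicklist lists[NR_QUICK];	/* per-CPU in a real kernel */

/* Hypothetical ctor/dtor pair: just count constructed pages. */
static int live_pgds;
static void pgd_ctor(void *p) { (void)p; live_pgds++; }
static void pgd_dtor(void *p) { (void)p; live_pgds--; }

/* Allocate from list nr; construct only pages that are brand new. */
static void *ql_alloc(int nr, void (*ctor)(void *))
{
	struct quicklist *q = &lists[nr];
	void **page = q->head;

	if (page) {
		q->head = page[0];
		page[0] = NULL;	/* first word must be zero in the constructed state */
		q->nr_pages--;
		return page;
	}
	page = calloc(1, PAGE_SIZE);
	if (page && ctor)
		ctor(page);
	return page;
}

/* Return a page, still in its constructed state, to list nr. */
static void ql_free(int nr, void *p)
{
	struct quicklist *q = &lists[nr];
	void **page = p;

	page[0] = q->head;
	q->head = page;
	q->nr_pages++;
}

/* Shrink list nr to at most keep pages, destructing what is released. */
static void ql_trim(int nr, void (*dtor)(void *), unsigned long keep)
{
	struct quicklist *q = &lists[nr];

	while (q->nr_pages > keep) {
		void **page = q->head;

		q->head = page[0];
		page[0] = NULL;
		q->nr_pages--;
		if (dtor)
			dtor(page);
		free(page);
	}
}
```

With two lists, one can hold pages that keep special mappings across free
and the other pages that are plain zero, which is the split the x86
patches later in this series describe.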

Quicklists are defined by an arch defining the necessary number
of quicklists in arch/<arch>/Kconfig. F.e. i386 needs two and thus
has

config NR_QUICK
	int
	default 2

If an arch has requested quicklist support then pages can be allocated
from the quicklist (or from the page allocator if the ...
From: Christoph Lameter
Date: Tuesday, March 13, 2007 - 12:13 am

Abstract quicklist from the IA64 implementation

Extract the quicklist implementation for IA64, clean it up
and generalize it to allow multiple quicklists and support
for constructors and destructors.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/ia64/Kconfig          |    4 ++
 arch/ia64/mm/contig.c      |    2 -
 arch/ia64/mm/discontig.c   |    2 -
 arch/ia64/mm/init.c        |   51 ---------------------------
 include/asm-ia64/pgalloc.h |   82 ++++++++-------------------------------------
 include/linux/quicklist.h  |   81 ++++++++++++++++++++++++++++++++++++++++++++
 mm/Kconfig                 |    5 ++
 mm/Makefile                |    2 +
 mm/quicklist.c             |   81 ++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 191 insertions(+), 119 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/ia64/mm/init.c	2007-03-12 22:49:21.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/ia64/mm/init.c	2007-03-12 22:49:23.000000000 -0700
@@ -39,9 +39,6 @@
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 
-DEFINE_PER_CPU(unsigned long *, __pgtable_quicklist);
-DEFINE_PER_CPU(long, __pgtable_quicklist_size);
-
 extern void ia64_tlb_init (void);
 
 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL;
@@ -56,54 +53,6 @@ EXPORT_SYMBOL(vmem_map);
 struct page *zero_page_memmap_ptr;	/* map entry for zero page */
 EXPORT_SYMBOL(zero_page_memmap_ptr);
 
-#define MIN_PGT_PAGES			25UL
-#define MAX_PGT_FREES_PER_PASS		16L
-#define PGT_FRACTION_OF_NODE_MEM	16
-
-static inline long
-max_pgt_pages(void)
-{
-	u64 node_free_pages, max_pgt_pages;
-
-#ifndef	CONFIG_NUMA
-	node_free_pages = nr_free_pages();
-#else
-	node_free_pages = node_page_state(numa_node_id(), NR_FREE_PAGES);
-#endif
-	max_pgt_pages = node_free_pages / PGT_FRACTION_OF_NODE_MEM;
-	max_pgt_pages = max(max_pgt_pages, MIN_PGT_PAGES);
-	return ...
From: Paul Mundt
Date: Tuesday, March 13, 2007 - 2:05 am

This doesn't work, and so CONFIG_QUICKLIST is always set. The NR_QUICK
thing seems a bit backwards anyways, perhaps it would make more sense to
have architectures set CONFIG_GENERIC_QUICKLIST in the same way that the
other GENERIC_xxx bits are defined, and then set NR_QUICK based off of
that. It's obviously going to be 2 or 1 for most people, and x86 seems to
be the only one that needs 2.

How about this?

--

diff --git a/mm/Kconfig b/mm/Kconfig
index 7942b33..2f20860 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -163,3 +163,8 @@ config ZONE_DMA_FLAG
 	default "0" if !ZONE_DMA
 	default "1"
 
+config NR_QUICK
+	int
+	depends on GENERIC_QUICKLIST
+	default "2" if X86
+	default "1"
-

From: Christoph Lameter
Date: Thursday, March 15, 2007 - 1:51 pm

Both i386 and x86_64 currently need 2 and if other arches start using 
quicklists then they would have the same issues. There may be other 
cases in the future where these may be useful. So I think this is too 

Is there a way of checking if a CONFIG_xxx is set to any value?

Then we could do

config QUICKLISTS
	depends on defined(NR_QUICK)

Alternately we could replace #ifdef CONFIG_QUICKLISTS with
#ifdef CONFIG_NR_QUICK ?
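
The second alternative could look roughly like this in C. This is only a
sketch: CONFIG_NR_QUICK is hard-coded here to stand in for the generated
config header, and QUICKLIST_COUNT and quicklist_count() are made-up
names. It relies on the assumption that the symbol is simply not defined
at all on arches that do not declare NR_QUICK in their Kconfig.

```c
#include <assert.h>

/* Assumed: an arch that sets NR_QUICK gets this from the config header. */
#define CONFIG_NR_QUICK 2

#ifdef CONFIG_NR_QUICK
/* Quicklist support is compiled in, sized by the arch's request. */
struct quicklist {
	void *page;
	int nr_pages;
};
static struct quicklist quicklists[CONFIG_NR_QUICK];
#define QUICKLIST_COUNT CONFIG_NR_QUICK
#else
/* No NR_QUICK symbol: no quicklists, no separate bool option needed. */
#define QUICKLIST_COUNT 0
#endif

static int quicklist_count(void)
{
	return QUICKLIST_COUNT;
}
```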

-

From: Christoph Lameter
Date: Tuesday, March 13, 2007 - 12:13 am

From: David Miller <davem@davemloft.net>

[QUICKLIST]: Add sparc64 quicklist support.

I ported this to sparc64 as per the patch below, tested on
UP SunBlade1500 and 24 cpu Niagara T1000.

Signed-off-by: David S. Miller <davem@davemloft.net>

---
 arch/sparc64/Kconfig          |    4 ++++
 arch/sparc64/mm/init.c        |   24 ------------------------
 arch/sparc64/mm/tsb.c         |    2 +-
 include/asm-sparc64/pgalloc.h |   26 ++++++++++++++------------
 4 files changed, 19 insertions(+), 37 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/Kconfig	2007-03-12 22:49:19.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/Kconfig	2007-03-12 22:53:30.000000000 -0700
@@ -26,6 +26,10 @@ config MMU
 	bool
 	default y
 
+config NR_QUICK
+	int
+	default 1
+
 config STACKTRACE_SUPPORT
 	bool
 	default y
Index: linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/sparc64/mm/init.c	2007-03-12 22:49:19.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/sparc64/mm/init.c	2007-03-12 22:53:30.000000000 -0700
@@ -176,30 +176,6 @@ unsigned long sparc64_kern_sec_context _
 
 int bigkernel = 0;
 
-struct kmem_cache *pgtable_cache __read_mostly;
-
-static void zero_ctor(void *addr, struct kmem_cache *cache, unsigned long flags)
-{
-	clear_page(addr);
-}
-
-extern void tsb_cache_init(void);
-
-void pgtable_cache_init(void)
-{
-	pgtable_cache = kmem_cache_create("pgtable_cache",
-					  PAGE_SIZE, PAGE_SIZE,
-					  SLAB_HWCACHE_ALIGN |
-					  SLAB_MUST_HWCACHE_ALIGN,
-					  zero_ctor,
-					  NULL);
-	if (!pgtable_cache) {
-		prom_printf("Could not create pgtable_cache\n");
-		prom_halt();
-	}
-	tsb_cache_init();
-}
-
 #ifdef CONFIG_DEBUG_DCFLUSH
 atomic_t dcpage_flushes = ATOMIC_INIT(0);
 #ifdef CONFIG_SMP
Index: ...
From: Christoph Lameter
Date: Tuesday, March 13, 2007 - 12:13 am

i386: Convert to quicklists

Implement the i386 management of pgd and pmds using quicklists.

The i386 management of page table pages currently uses page sized slabs.
The page state is therefore mainly determined by the slab code. However,
i386 also uses its own fields in the page struct to mark special pages
and to build a list of pgds using the ->private and ->index field (yuck!).
This has been finely tuned to work right with SLAB but SLUB needs more
control over the page struct. Currently the only way for SLUB to support
these slabs is through special casing PAGE_SIZE slabs.

If we use quicklists instead then we can avoid the mess, and also the
overhead of manipulating page sized objects through slab.

It also allows us to use standard list manipulation macros for the
pgd list using page->lru thereby simplifying the code.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/i386/Kconfig          |    4 ++
 arch/i386/kernel/process.c |    1 
 arch/i386/kernel/smp.c     |    2 -
 arch/i386/mm/fault.c       |    5 +--
 arch/i386/mm/init.c        |   25 -----------------
 arch/i386/mm/pageattr.c    |    2 -
 arch/i386/mm/pgtable.c     |   63 +++++++++++++++++----------------------------
 include/asm-i386/pgalloc.h |    2 -
 include/asm-i386/pgtable.h |   13 +++------
 9 files changed, 39 insertions(+), 78 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/i386/mm/init.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/i386/mm/init.c	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/i386/mm/init.c	2007-03-12 22:53:27.000000000 -0700
@@ -695,31 +695,6 @@ int remove_memory(u64 start, u64 size)
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif
 
-struct kmem_cache *pgd_cache;
-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
-	if (PTRS_PER_PMD > 1) {
-		pmd_cache = ...
From: Christoph Lameter
Date: Tuesday, March 13, 2007 - 12:13 am

Convert x86_64 to using quicklists

This adds caching of pgds, puds, pmds and ptes. That way we can
avoid costly zeroing and initialization of special mappings in the
pgd.

A second quicklist is useful to separate out PGD handling. We can carry
the initialized pgds over to the next process needing them.

Also clean up the pgd_list handling to use regular list macros.
There is no need anymore to avoid the lru field.

Move the add/removal of the pgds to the pgdlist into the
constructor / destructor. That way the implementation is
congruent with i386.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/x86_64/Kconfig          |    4 ++
 arch/x86_64/kernel/process.c |    1 
 arch/x86_64/kernel/smp.c     |    2 -
 arch/x86_64/mm/fault.c       |    5 +-
 include/asm-x86_64/pgalloc.h |   76 +++++++++++++++++++++----------------------
 include/asm-x86_64/pgtable.h |    3 -
 mm/Kconfig                   |    5 ++
 7 files changed, 52 insertions(+), 44 deletions(-)

Index: linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/Kconfig	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig	2007-03-12 22:53:28.000000000 -0700
@@ -56,6 +56,10 @@ config ZONE_DMA
 	bool
 	default y
 
+config NR_QUICK
+	int
+	default 2
+
 config ISA
 	bool
 
Index: linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-x86_64/pgalloc.h	2007-03-12 22:49:20.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h	2007-03-12 22:53:28.000000000 -0700
@@ -4,6 +4,10 @@
 #include <asm/pda.h>
 #include <linux/threads.h>
 #include <linux/mm.h>
+#include <linux/quicklist.h>
+
+#define QUICK_PGD 0	/* We preserve special mappings over free */
+#define QUICK_PT 1	/* Other page table pages that are zero on free */
 
 #define pmd_populate_kernel(mm, pmd, ...
From: Andrew Morton
Date: Tuesday, March 13, 2007 - 1:53 am

Well if they're zero then perhaps they should be released to the page allocator
to satisfy the next __GFP_ZERO request.  If that request is for a pagetable
page, we break even (except we get to remove special-case code).  If that
__GFP_ZERO allocation was for some application other than for a pagetable, we
win.

iow, can we just nuke 'em?

(Will require some work in the page allocator)
(That work will open the path to using the idle thread to prezero pages)
-

From: Nick Piggin
Date: Tuesday, March 13, 2007 - 1:03 am

Page allocator still requires interrupts to be disabled, which this doesn't.

Considering there isn't much else that frees known zeroed pages, I wonder if
it is worthwhile.

Last time the zeroidle discussion came up there was, IIRC, no real performance
gain, just cooking the 1024 CPU threaded pagefault numbers ;)

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 
-

From: Andrew Morton
Date: Tuesday, March 13, 2007 - 4:52 am

If you want a zeroed page for pagecache and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you have good
reason to be upset.

In fact, if you want a _non_-zeroed page and someone has just stuffed a
known-zero, cache-hot page into the pagetable quicklists, you still have
reason to be upset.  You *want* that cache-hot page.

Generally, all these little private lists of pages (such as the ones which
slab had/has) are a bad deal.  Cache effects preponderate and I do think
we're generally better off tossing the things into a central pool.

Plus, we can get in a situation where we take a cache-cold, known-zero page
from the pte quicklist when there is a cache-hot, non-zero page sitting in
the page allocator.  I suspect that zeroing the cache-hot page would take a
similar amount of time to a single miss against the cache-cold page.

I'm not saying that I _know_ that the quicklists are pointless, but I don't
think it's established that they are pointful.

ISTR that experiments with removing the i386 quicklists made zero
difference, but that was an awfully long time ago.  Significantly, it

Maybe, dunno.  It was apparently a win on powerpc many years ago.  I had a
fiddle with it 5-6 years ago on x86 using a cache-disabled mapping of the
page.  But it needed too much support in core VM to bother.  Since then
we've grown per-cpu page magazines and __GFP_ZERO.  Plus I'm not aware of
anyone having tried doing it on x86 with non-temporal stores.

-

From: Nick Piggin
Date: Tuesday, March 13, 2007 - 4:06 am

On a Pentium 4? ;)

Sure, that is a minor detail, considering that you'll usually be allocating

The thing is, pagetable pages are the one really good exception to the
rule that we should keep cache hot and initialise-on-demand. They
typically are fairly sparsely populated and sparsely accessed. Even
for last level page tables, I think it is reasonable to assume they will
usually be pretty cold.

And you want to allocate cache cold pages as well, for the same reasons
(you want to keep your cache hot pages for when they actually will be

For slab I understand. And a lot of users of slab constructors were also
silly, precisely because we should initialise on demand to keep the cache
hits up.

But cold(ish?) pagetable quicklists make sense, IMO (that is, if you *must*

You can win on specifically constructed benchmarks, easily.

But considering all the other problems you're going to introduce, we'd need
a significant win on a significant something, IMO.

You waste memory bandwidth. You also use more CPU and memory cycles
speculatively, ergo you waste more power.

-- 
SUSE Labs, Novell Inc.
-

From: Andrew Morton
Date: Tuesday, March 13, 2007 - 5:15 am

eh?  I'd have thought that a pte page which has just gone through
zap_pte_range() will very often have a _lot_ of hot cachelines, and
that's a common case.


Yeah, prezeroing in idle is probably pointless.  But I'm not aware of
anyone having tried it properly...
-

From: Christoph Lameter
Date: Tuesday, March 13, 2007 - 4:20 am

Ok, then what did I do wrong 3 years ago with the prezeroing patchsets?


-

From: Andrew Morton
Date: Tuesday, March 13, 2007 - 5:30 am

Failed to provide us a link to it?
-

From: Christoph Lameter
Date: Thursday, March 15, 2007 - 1:23 pm

You merged part of it and were involved in the discussions.

General overviews:

http://lwn.net/Articles/117881/
http://lwn.net/Articles/128225/

The details on the problems with prezeroing and touching multiple 
cachelines of the page.

http://www.gelato.unsw.edu.au/archives/linux-ia64/0412/12252.html

-

From: Nick Piggin
Date: Tuesday, March 13, 2007 - 4:30 am

Well I guess that would be the case if you had just unmapped a 4MB
chunk that was pretty dense with pages.

My malloc seems to allocate and free in blocks of 128K, so that's
only going to give us 3% of the last level pte being cache hot when
it gets freed. Not sure what common mmap(file) access patterns
look like.
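
For reference, the 3% figure can be reproduced as below. The constants
are assumptions chosen to match the 4MB chunk mentioned above (4 KiB
pages, 1024 ptes per last level page, as on non-PAE i386), and the helper
names are made up.

```c
#include <assert.h>

#define PAGE_SIZE	4096UL	/* assumed page size */
#define PTRS_PER_PTE	1024UL	/* assumed ptes per last level page */

/* Bytes of address space mapped by one last level page table page. */
static unsigned long pte_page_span(void)
{
	return PAGE_SIZE * PTRS_PER_PTE;
}

/* Share of the ptes in one page touched by unmapping len bytes,
 * in tenths of a percent, rounded down. */
static unsigned long touched_permille(unsigned long len)
{
	return (len / PAGE_SIZE) * 1000 / PTRS_PER_PTE;
}
```

A 128K block covers 32 of the 1024 ptes, so touched_permille(128UL << 10)
gives 31, i.e. roughly 3%.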

The majority of programs I run have a smattering of llpt pages
pretty sparsely populated, covering text, libraries, heap, stack,
vdso.

We don't actually have to zap_pte_range the entire page table in
order to free it (IIRC we used to have to, before the 4lpt patches).

But yeah let's see some tests. I would definitely want to avoid this
extra layer of complexity if it is just as good to return the pages

-- 
SUSE Labs, Novell Inc.
-

From: Andrew Morton
Date: Tuesday, March 13, 2007 - 5:47 am

I'm trying to remember why we ever would have needed to zero out the pagetable
pages if we're taking down the whole mm?  Maybe it's because "oh, the
arch wants to put this page into a quicklist to recycle it", which is
all rather circular.

It would be interesting to look at a) leave the page full of random garbage
if we're releasing the whole mm and b) return it straight to the page allocator.
-

From: Nick Piggin
Date: Tuesday, March 13, 2007 - 5:01 am

Well we have the 'fullmm' case, which avoids all the locked pte operations
(for those architectures where hardware pt walking requires atomicity).

However we still have to visit those to-be-unmapped parts of the page table,
to find the pages and free them. So we still at least need to bring it into
cache for the read... at which point, the store probably isn't a big burden.

-- 
SUSE Labs, Novell Inc.
-

From: Andrew Morton
Date: Tuesday, March 13, 2007 - 6:11 am

I suspect there are some tlb operations which could be skipped in that case

It means all that data has to be written back.  Yes, I expect it'll prove
to be less costly than the initial load.
-

From: Nick Piggin
Date: Tuesday, March 13, 2007 - 5:18 am

Depends on the tlb flush implementation. The generic one doesn't look like
it is all that smart about optimising the fullmm case. It does skip some

Still, it is something we could try.

-- 
SUSE Labs, Novell Inc.
-

From: Jeremy Fitzhardinge
Date: Tuesday, March 13, 2007 - 10:30 am

Why not try to find a place to stash a linklist pointer and link them
all together?  Saves the pulldown pagetable walk altogether.

    J
-

From: Matt Mackall
Date: Tuesday, March 13, 2007 - 1:03 pm

Because we'd need one link per mm that a page is mapped in?

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Jeremy Fitzhardinge
Date: Tuesday, March 13, 2007 - 1:17 pm

Can pagetable pages be shared between mms?  (Kernel pmds in PAE excepted.)

    J
-

From: Matt Mackall
Date: Tuesday, March 13, 2007 - 1:21 pm

Ahh, I think the issue is that we have to walk the page tables to drop
the reference count of the _actual pages_ they point to. The page
tables themselves could all be put on a list or two lists (one for
PMDs, one for everything else), but that wouldn't really be a win over
just walking the tree, especially given the extra list maintenance.

Because the fan-out is large, the bulk of the work is bringing the last
layer of the tree into cache to find all the pages in the address
space. And there's really no way around that.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: David Miller
Date: Tuesday, March 13, 2007 - 2:07 pm

From: Matt Mackall <mpm@selenic.com>

That's right.

And I will note that historically we used to be much worse
in this area, as we used to walk the page table tree twice
on address space teardown (once to hit the PTE entries, once
to free the page tables).

Happily it is a one-pass algorithm now.

But, within active VMA ranges, we do have to walk all
the bits at least one time.
-

From: Matt Mackall
Date: Tuesday, March 13, 2007 - 2:14 pm

Well you -could- do this:

- reuse a long in struct page as a used map that divides the page up
  into 32 or 64 segments
- every time you set a PTE, set the corresponding bit in the mask
- when we zap, only visit the regions set in the mask

Thus, you avoid visiting most of a PMD page in the sparse case,
assuming PTEs aren't evenly spread across the PMD.

This might not even be too horrible as the appropriate struct page
should be in cache with the appropriate bits of the mm already locked,
etc.
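
A userspace sketch of the three steps above, under some assumptions: 512
eight-byte entries per page, a 64-bit mask, and the mask stored next to
the entries rather than in struct page. The names pt_page, set_entry and
zap are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_PAGE	512	/* assumed: 64-bit entries in a 4 KiB page */
#define SEGMENTS	64	/* one bit per segment in a 64-bit word */
#define PER_SEGMENT	(PTRS_PER_PAGE / SEGMENTS)

struct pt_page {
	uint64_t used_map;		/* would reuse a long in struct page */
	uint64_t entry[PTRS_PER_PAGE];
};

/* Set an entry and mark its segment as used. */
static void set_entry(struct pt_page *pt, unsigned idx, uint64_t val)
{
	pt->entry[idx] = val;
	pt->used_map |= 1ULL << (idx / PER_SEGMENT);
}

/* Clear every segment marked in the mask; return entries visited. */
static unsigned zap(struct pt_page *pt)
{
	unsigned visited = 0;

	for (unsigned seg = 0; seg < SEGMENTS; seg++) {
		if (!(pt->used_map & (1ULL << seg)))
			continue;	/* segment never touched, skip it */
		for (unsigned i = seg * PER_SEGMENT;
		     i < (seg + 1) * PER_SEGMENT; i++) {
			pt->entry[i] = 0;
			visited++;
		}
	}
	pt->used_map = 0;
	return visited;
}
```

With entries set in only two segments, zap() visits 16 slots instead of
all 512.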

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Jeremy Fitzhardinge
Date: Tuesday, March 13, 2007 - 2:36 pm

And do the same in pte pages for actual mapped pages?  Or do you think
they would be too densely populated for it to be worthwhile?

    J

-

From: Peter Chubb
Date: Tuesday, March 13, 2007 - 2:46 pm

>>>>> "Jeremy" == Jeremy Fitzhardinge <jeremy@goop.org> writes:


Jeremy> And do the same in pte pages for actual mapped pages?  Or do
Jeremy> you think they would be too densely populated for it to be
Jeremy> worthwhile?

We've been doing some measurements on how densely clumped ptes are.
On 32-bit platforms, they're pretty dense.  On IA64, quite a bit
sparser, depending on the workload of course.  I think that's mostly because
of the larger pagesize on IA64 -- with 64k pages, you don't need very
many to map a small object.

I'm hoping IanW can give more details.
-

From: David Miller
Date: Tuesday, March 13, 2007 - 2:48 pm

From: Matt Mackall <mpm@selenic.com>

Yes, I've even had that idea before.

You can even hide it behind pmd_none() et al., the generic VM
doesn't even have to know that the page table macros are doing
this optimization.
-

From: William Lee Irwin III
Date: Tuesday, March 13, 2007 - 6:12 pm

We never did need to modify ptes on exit() or other pagetable prunings
(not that they were ever done outside exit() before 2.6.x). The only
subtlety is that pruning on munmap() needs a TLB flush for the TLB
itself to drop the references to the pages referred to by the PTE's on
pruning in the presence of hardware pagetable walkers (in the exit()
case there are no user execution contexts left to potentially utilize
the dead translations so it's less important). That's handled by
tlb_remove_page() and shouldn't need any updates across such a change.

I believe the zeroing on teardown was largely a result of idiom vs.
any particular need. Essentially using ptep_get_and_clear() to handle
the non-pruning munmap() case in a manner unified with other pagetable
teardowns. Also likely is 2.4.x legacy from when that and possibly
earlier kernels maintained arch-private quicklists for pagetables.

There are furthermore distinctions to make between fork() and execve().
fork() stomps over the entire process address space copying pagetables
en masse. After execve() a process incrementally faults in PTE's one at
a time. It should be clear that if case analyses are of interest at
all, fork() will want cache-hot pages (cache-preloaded pages?) where
such are largely wasted on incremental faults after execve(). The copy
operations in fork() should probably also be examined in the context of
shared pagetables at some point.


-- wli
-

From: William Lee Irwin III
Date: Thursday, March 15, 2007 - 4:12 pm

To make this perfectly clear, we can deal with the varying usage cases
with hot/cold flags to the pagetable allocator functions. Where bulk
copies such as fork() are happening, it makes perfect sense to
precharge the cache by eager zeroing. Where sparse single pte affairs
such as incrementally faulting things in after execve() are involved,
cache cold preconstructed pagetable pages are ideal. Address hints
could furthermore be used to precharge single cachelines (e.g. via
prefetch) in the sparse usage case.
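
Something along these lines, as a userspace sketch only. The hint enum,
pt_alloc() and the single-page cold_pool are invented for illustration,
and __builtin_prefetch is a GCC/Clang builtin rather than a kernel
interface.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

enum pt_hint { PT_COLD, PT_HOT };	/* hypothetical allocator flag */

static void *cold_pool;	/* one preconstructed (already zero) page, or NULL */

/* offset is the address hint: where in the page the first pte will go. */
static void *pt_alloc(enum pt_hint hint, unsigned long offset)
{
	void *page;

	if (hint == PT_HOT) {
		/* Bulk user such as fork(): zero eagerly, which also
		 * pulls the whole page into cache. */
		page = malloc(PAGE_SIZE);
		if (page)
			memset(page, 0, PAGE_SIZE);
		return page;
	}

	/* Sparse user: prefer a preconstructed page and leave it cold. */
	page = cold_pool ? cold_pool : calloc(1, PAGE_SIZE);
	cold_pool = NULL;
	if (page)
		__builtin_prefetch((char *)page + offset, 1);	/* one line only */
	return page;
}
```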


-- wli
-

From: Paul Mackerras
Date: Tuesday, March 13, 2007 - 4:58 pm

I don't see much point to them.  For powerpc, I would rather grab an

My recollection was that it wasn't a win, but it was a long time ago...

Paul.
-

From: Christoph Lameter
Date: Tuesday, March 13, 2007 - 4:17 am

Nope, that won't work.

1. We need to support other states of pages other than zeroed.

2. Prezeroing does not make much sense if a large portion of the
   page is being used. Performance is better if the whole page 
   is zeroed directly before use. Prezeroing only makes sense for sparse

I already tried that 3 years ago and there was *no* benefit for usual
users of the page allocator. The advantage exists only if a small
portion of the page is used. F.e. for one cacheline there was a 4x
improvement. See lkml archives for prezeroing.


-

From: Andrew Morton
Date: Tuesday, March 13, 2007 - 5:27 am

Unsurprised.  Were non-temporal stores tried?
-

From: Christoph Lameter
Date: Thursday, March 15, 2007 - 1:28 pm

pgd are not completely zeroed. They contain mappings that are always 


Yes, with no material change. The work led to making ia64 use non
temporal stores for spin unlock but it was not useful for prezeroing.

-
