Re: Why preallocate pmd in x86 32-bit PAE?

Previous thread: [PATCH RESEND] xen: mask _PAGE_PCD from ptes by Jeremy Fitzhardinge on Thursday, November 15, 2007 - 5:49 pm. (1 message)

Next thread: [PATCH] x86: clean up nmi_32/64.c by Hiroshi Shimamoto on Thursday, November 15, 2007 - 6:14 pm. (1 message)
To: William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, H. Peter Anvin <hpa@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, November 15, 2007 - 5:57 pm

I'm looking at unifying asm-x86/pgalloc*.h, and so I'm trying to make
things as similar as possible between 32 and 64-bit.

One difference is that 64-bit incrementally allocates all levels of the
pagetable, whereas 32-bit PAE preallocates the 4 pmds when it allocates
the pgd. What's the rationale for this? What pitfalls would there be
in making them incrementally allocated?

Preallocation makes sense from the perspective that they will all be
allocated almost immediately in a typical process. But it is a somewhat
arbitrary difference from 64-bit, and since 64-bit can't reasonably
preallocate any pagetable levels, it seems sensible to change 32-bit to
match.

Thanks,
J
-

To: Jeremy Fitzhardinge <jeremy@...>
Cc: William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, H. Peter Anvin <hpa@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, November 15, 2007 - 6:12 pm

IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
have to be present.

What earlier CPU's did was to basically load all four values into the CPU
when you loaded %cr3. There was no "three-level page table walker" at all:
it was still a two-level page table walker, there were just four magic
internal page tables that were indexed off the two high bits.
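
To make the "indexed off the two high bits" point concrete, here is a minimal sketch of how a 32-bit linear address splits up under PAE with 4-KByte pages (2 + 9 + 9 + 12 bits). The names pae_split and struct pae_indices are made up for illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* PAE splits a 32-bit linear address as 2 + 9 + 9 + 12 bits (4-KByte pages). */
#define PAE_PGD_SHIFT   30      /* top 2 bits pick one of the 4 top-level entries */
#define PAE_PMD_SHIFT   21      /* next 9 bits index the page directory (pmd) */
#define PAE_PTE_SHIFT   12      /* next 9 bits index the page table */
#define PAE_INDEX_MASK  0x1ffu  /* 9-bit index */
#define PAE_OFFSET_MASK 0xfffu  /* 12-bit byte offset within the page */

/* Hypothetical helper type: the four fields of a PAE linear address. */
struct pae_indices {
    unsigned pgd;     /* 0..3 */
    unsigned pmd;     /* 0..511 */
    unsigned pte;     /* 0..511 */
    unsigned offset;  /* 0..4095 */
};

static struct pae_indices pae_split(uint32_t vaddr)
{
    struct pae_indices ix;

    ix.pgd = vaddr >> PAE_PGD_SHIFT;
    ix.pmd = (vaddr >> PAE_PMD_SHIFT) & PAE_INDEX_MASK;
    ix.pte = (vaddr >> PAE_PTE_SHIFT) & PAE_INDEX_MASK;
    ix.offset = vaddr & PAE_OFFSET_MASK;

    assert(ix.pgd < 4);  /* only two bits are left for the top level */
    return ix;
}
```

With the regular 3:1 split, any address at or above 0xC0000000 lands in top-level slot 3, which is the kernel's slot.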

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, H. Peter Anvin <hpa@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 1:12 pm

Hm, do you recall what processors that might affect? As far as I know,
current processors will ignore non-present top-level entries. Anyway,
we can mark them not present but still point them at empty_zero_page, so testing the present
bit will still be sufficient to tell if we need to allocate a new pmd,
but if the hardware decides to follow the page reference there's no harm
done. (Hm, unless the hardware decides it wants to set A or D bits in

That just means we need to reload cr3 after populating the pgd with a
new pmd, right?

J
-

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Linus Torvalds <torvalds@...>, William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 1:45 pm

PDPTR is documented to have P bits but none of the other control bits,
unlike other levels of the hierarchy.

The hardware never sets A or D bits on non-present pages, since all the
bits except P are reserved for the operating systems (and, besides, they

Yes. And as Linus said, it would be a new special case.

-hpa

-

To: Jeremy Fitzhardinge <jeremy@...>
Cc: William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, H. Peter Anvin <hpa@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 1:35 pm

Are you sure?

Anyway, this is not worth making a distinction for. Just pre-allocate all
of them. There really are just 4 PGD entries, and it really *is* different
from having a full three-level page table, and of the four PGD entries:

- one is used for the kernel mapping (assuming the regular 1:3 layout)
- AT LEAST two are required by user space anyway

so pre-allocating is never going to waste more than one page.

And you may feel that pre-allocating is a special case, but it's an
*easier* special case than the one that you are apparently thinking about
(which is to special-case according to CPU version).

So don't do it. Just preallocate for the magic 4-entry PGD. You can make
the special case just be something like

	/* Preallocate for small PGD's */
	#if PTRS_PER_PGD == 4
	for (i = 0; i < USER_PTRS_PER_PGD; i++) {
		pmd_t *pmd = pmd_alloc();
		set_pgd(pgd+i, __pgd(PAGE_PRESENT | __pa(pmd)));
	}
	#endif

or similar.
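
For anyone who wants to play with the idea outside the kernel, the following is a rough userspace sketch of the same preallocation pattern. Everything prefixed sim_ or SIM_ is hypothetical and only stands in for the kernel types and helpers; it assumes the regular 3:1 split, stores ordinary pointers where the kernel would store physical addresses, and leaves the kernel slot alone because that one is set up separately.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PTRS_PER_PGD       4       /* PAE: four top-level entries */
#define USER_PTRS_PER_PGD  3       /* assumes the regular 3:1 user/kernel split */
#define PTRS_PER_PMD       512
#define SIM_PAGE_PRESENT   0x1ull  /* stand-in for the kernel's present bit */

typedef uint64_t sim_pgd_t;                              /* hypothetical entry type */
typedef struct { uint64_t e[PTRS_PER_PMD]; } sim_pmd_t;  /* one 4-KByte pmd page */

/* Recover the pmd pointer stored in a populated entry. */
static sim_pmd_t *sim_pgd_pmd(sim_pgd_t entry)
{
    return (sim_pmd_t *)(uintptr_t)(entry & ~SIM_PAGE_PRESENT);
}

static int sim_pgd_present(sim_pgd_t entry)
{
    return (entry & SIM_PAGE_PRESENT) != 0;
}

/*
 * Allocate a zeroed 4-entry pgd with every user slot already populated.
 * Returns NULL, after freeing anything allocated so far, on failure.
 */
static sim_pgd_t *sim_pgd_alloc(void)
{
    sim_pgd_t *pgd = calloc(PTRS_PER_PGD, sizeof(*pgd));

    if (!pgd)
        return NULL;

    for (int i = 0; i < USER_PTRS_PER_PGD; i++) {
        sim_pmd_t *pmd = calloc(1, sizeof(*pmd));

        if (!pmd) {
            while (i--)
                free(sim_pgd_pmd(pgd[i]));
            free(pgd);
            return NULL;
        }
        /* The kernel would store a physical address here via __pa(). */
        pgd[i] = (uint64_t)(uintptr_t)pmd | SIM_PAGE_PRESENT;
    }
    return pgd;
}

/* Free the preallocated user pmds together with the pgd itself. */
static void sim_pgd_free(sim_pgd_t *pgd)
{
    for (int i = 0; i < USER_PTRS_PER_PGD; i++)
        free(sim_pgd_pmd(pgd[i]));
    free(pgd);
}
```

Because the user slots start out present, nothing ever has to flip a top-level entry from non-present to present on a live pagetable in this model.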

There is absolutely *zero* reason not to do this, and there is also zero
reason to make this be a "32-bit vs 64-bit" issue. The code can be there
in both, and the #if could even be all in C code (ie there may be reasons
to prefer writing it as

	/* The old-style PAE PGD needs to be preallocated */
	if (USER_PTRS_PER_PGD <= 4) {
		...
	}

and the compiler should even compile it away entirely for all practical

x86 page table walking never sets A/D bits on non-present entries.

That said, there's still a huge difference.

For "real" page table walking, you can always just insert entries without
flushing the cache if those entries weren't there before (because the TLB
is supposed to not cache negative entries).

Again, because of the way the magic 4-entry PGD works, that isn't true for
it. It caches the entries regardless, so if you change it from non-present
to present, you have to flush the TLB (well, "reload %cr3", which is the

BUT ONLY FOR THIS CASE!

And if you preallocate it, you make *that* special case ...

To: Linus Torvalds <torvalds@...>
Cc: William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, H. Peter Anvin <hpa@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 3:14 pm

Yes, OK, it makes sense. Conceptually they would be dynamically
allocated and freed, but they'd just happen to start allocated, to avoid
the tlb flush of populating the pgd of an active pagetable. If you
happened to do a 1G munmap, it may end up freeing and reallocating them,
but that's going to be very rare. Either way, the other special cases
are avoided (though pgd_populate would still need to be correct, on the
off chance it gets invoked).

J
-

To: Jeremy Fitzhardinge <jeremy@...>
Cc: William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, H. Peter Anvin <hpa@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 3:22 pm

I don't think we ever free the pmd's now, do we?

(Except for the *final* free, of course, when we release the whole VM).

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, H. Peter Anvin <hpa@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 3:43 pm

Not for 32-bit at the moment, but it does in principle. munmap ends up
calling free_pgtables, and so ends up calling pmd_free_range. That will
do a pud_clear to detach the pmd from the pagetable and call
__pmd_free_tlb, which ends up doing tlb_remove_page ->
free_page_and_swap_cache. 32-bit knobbles all this at the moment, but
it looks to me like it wouldn't be hard to make this work if the code is
all common with 64-bit.

J
-

To: Linus Torvalds <torvalds@...>
Cc: William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, H. Peter Anvin <hpa@...>, Linux Kernel Mailing List <linux-kernel@...>, Zachary Amsden <zach@...>
Date: Friday, November 16, 2007 - 2:30 pm

3.8.5 in vol 3a "Page-Directory and Page-Table Entries With Extended
Addressing Enabled":

The present flag (bit 0) in the page-directory-pointer-table entries
can be set to 0 or 1. If the present flag is clear, the remaining
bits in the page-directory-pointer-table entry are available to the
operating system. If the present flag is set, the fields of the
page-directory-pointer-table entry are defined in Figures 3-20 for
4-KByte pages and Figures 3-21 for 2-MByte pages.
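
Read literally, the quoted text means software only needs to test bit 0, and may use the other bits of a non-present entry for its own bookkeeping. A small sketch of that reading follows; the helper names are hypothetical, and whether every CPU actually honours this is exactly what is being discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define PDPTE_P 0x1ull  /* present flag, bit 0 */

static int pdpte_present(uint64_t entry)
{
    return (entry & PDPTE_P) != 0;
}

/*
 * Hypothetical helper: pack an OS-private value into a non-present
 * entry. The value is shifted up by one so that bit 0 stays clear.
 */
static uint64_t pdpte_make_not_present(uint64_t os_data)
{
    assert((os_data >> 63) == 0);  /* must fit in the remaining 63 bits */
    return os_data << 1;
}

/* Hypothetical helper: read the OS-private value back out. */
static uint64_t pdpte_os_data(uint64_t entry)
{
    assert(!pdpte_present(entry));  /* only meaningful when P is clear */
    return entry >> 1;
}
```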

So I would assume this works on all current CPUs, but I can imagine that

Yeah, I'm not so concerned about memory saving; I don't think there

I'm hoping to avoid special-casing anything, if I can help it, aside
from the normal 32/64-bit 2/3/4-level parameterising of the various

Perhaps. And there's the corresponding difference between 32 and 64 bit
on freeing a pagetable; 32-bit assumes the pgd destructor will free the
pmd, whereas 64-bit does it separately. Even in the current 32-bit
code, there's separate handling for PAE and non-PAE. I think it can all

Yes, that is a bit awkward; it means that 32-bit PAE would need a
separate pgd_populate. But that seems like a smaller change than 1)
making 32-bit PAE pgd-alloc preallocate the pmd, and 2) making pmd_free
noop on 32-bit PAE, and 3) making pgd_free free the preallocated pmd.
Perhaps 2 & 3 aren't necessary and can be the same as 64-bit.

Yep, absolutely.

J
-

To: Linus Torvalds <torvalds@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, William Lee Irwin III <wli@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, November 15, 2007 - 6:42 pm

This is true, although you could point a PGD to an all-zero page if you
really wanted to. You have to re-load CR3 after modifying the top-level

They still are. Loading CR3 in PAE really loads four registers from
memory. x86-64 is different, of course.
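
A toy model of that behaviour, as described in this thread: loading CR3 copies the four top-level entries into the CPU, and the walker then consults only those copies. The struct sim_cpu type and the sim_ functions are invented for illustration; this is a sketch of the described semantics, not of any real hardware interface.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PDPT_ENTRIES  4
#define PDPTE_PRESENT 0x1ull

/* Hypothetical model of a CPU running in PAE mode. */
struct sim_cpu {
    const uint64_t *cr3;           /* points at the in-memory 4-entry table */
    uint64_t pdpte[PDPT_ENTRIES];  /* internal copies, filled when cr3 is loaded */
};

/* Loading cr3 copies all four entries into the CPU. */
static void sim_load_cr3(struct sim_cpu *cpu, const uint64_t *pdpt)
{
    cpu->cr3 = pdpt;
    memcpy(cpu->pdpte, pdpt, sizeof(cpu->pdpte));
}

/* The walker looks at the internal copy, never at memory. */
static int sim_cpu_sees_present(const struct sim_cpu *cpu, uint32_t vaddr)
{
    unsigned idx = vaddr >> 30;  /* two high bits select the entry */

    assert(idx < PDPT_ENTRIES);
    return (cpu->pdpte[idx] & PDPTE_PRESENT) != 0;
}
```

In this model, populating a top-level entry in memory has no effect on the CPU until sim_load_cr3() runs again, which is the reason a CR3 reload is needed after changing an entry from non-present to present.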

-hpa
-

To: H. Peter Anvin <hpa@...>
Cc: Linus Torvalds <torvalds@...>, Jeremy Fitzhardinge <jeremy@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, November 15, 2007 - 8:40 pm

There may be bigger fish to fry in terms of per-process overhead, if
you're trying to cut that down. The trouble with trying to address
some of those is that there is mutual antagonism between compactness
and expansibility in the process address space layout, so you'll end
up instantiating a lot more than you want barring some sort of provision
for a compact address space layout. Pagetable sharing is a far more
powerful resource scalability method, though it also needs cooperation
in user address space layout to reap its gains.

There are other overheads, of course, though they're more typically
per-something besides processes.

-- wli
-

To: William Lee Irwin III <wli@...>
Cc: Linus Torvalds <torvalds@...>, Jeremy Fitzhardinge <jeremy@...>, Andi Kleen <ak@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, November 15, 2007 - 8:41 pm

I think Jeremy's question was due to trying to reduce the 32/64-bit
differences. Performance-wise, it might add a small amount to user
setup time (a typical 32-bit process will need all four, for the main
binary, libraries, stack and kernel, respectively) but it is probably
not significant (although I'd like to see numbers just in case).

-hpa

-

To: H. Peter Anvin <hpa@...>
Cc: William Lee Irwin III <wli@...>, Linus Torvalds <torvalds@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 7:16 am

With the new top down mmap layout and standard 3:1 split it should typically
only need two.

-Andi
-

To: Andi Kleen <ak@...>
Cc: William Lee Irwin III <wli@...>, Linus Torvalds <torvalds@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 11:45 am

Well, three with the kernel.

-hpa
-

To: H. Peter Anvin <hpa@...>
Cc: William Lee Irwin III <wli@...>, Linus Torvalds <torvalds@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 11:53 am

I didn't count kernel because it is always fixed anyways and about zero
overhead for the normal setup case.

-Andi

-

To: Andi Kleen <ak@...>
Cc: William Lee Irwin III <wli@...>, Linus Torvalds <torvalds@...>, Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Nick Piggin <nickpiggin@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, November 16, 2007 - 12:10 pm

Of course, but it was in the original list so...

-hpa
-
