I'm looking at unifying asm-x86/pgalloc*.h, and so I'm trying to make
things as similar as possible between 32 and 64-bit.

One difference is that 64-bit incrementally allocates all levels of the
pagetable, whereas 32-bit PAE preallocates the 4 pmds when it allocates
the pgd. What's the rationale for this? What pitfalls would there be
in making them incrementally allocated?

Preallocation makes sense from the perspective that they will all be
allocated almost immediately in a typical process. But it is a somewhat
arbitrary difference from 64-bit, and since 64-bit can't reasonably
preallocate any pagetable levels, it seems sensible to change 32-bit to
match.

Thanks,
J
-
IIRC, the present bit is ignored in the magic 4-entry PGD. All entries
have to be present.

What earlier CPU's did was to basically load all four values into the CPU
when you loaded %cr3. There was no "three-level page table walker" at all:
it was still a two-level page table walker, there were just four magic
internal page tables that were indexed off the two high bits.

Linus
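
To make the "indexed off the two high bits" point concrete, here is a minimal sketch of how a 32-bit PAE virtual address splits into table indices. The struct and function names (`pae_index`, `pae_split`) are made up for illustration and are not kernel identifiers; the bit positions are the standard PAE layout for 4-KByte pages.

```c
#include <stdint.h>

/* Hypothetical helper type: the pieces of a 32-bit PAE virtual address. */
struct pae_index {
	unsigned pgd;    /* bits 31:30, selects one of the 4 top-level entries */
	unsigned pmd;    /* bits 29:21, one of 512 page-directory entries */
	unsigned pte;    /* bits 20:12, one of 512 page-table entries */
	unsigned offset; /* bits 11:0, byte offset within the 4-KByte page */
};

/* Split a virtual address into its PAE table indices (4-KByte pages). */
static struct pae_index pae_split(uint32_t va)
{
	struct pae_index ix;

	ix.pgd = va >> 30;
	ix.pmd = (va >> 21) & 0x1ff;
	ix.pte = (va >> 12) & 0x1ff;
	ix.offset = va & 0xfff;
	return ix;
}
```

With only two bits at the top, each of the four entries covers 1 GB of address space, which is why the top level behaves more like a small register file than a real page-table level.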
-
Hm, do you recall what processors that might affect? As far as I know,
current processors will ignore non-present top-level entries. Anyway,
we can point them not present to empty_zero_page, so testing the present
bit will still be sufficient to tell if we need to allocate a new pmd,
but if the hardware decides to follow the page reference there's no harm
done. (Hm, unless the hardware decides it wants to set A or D bits in

That just means we need to reload cr3 after populating the pgd with a
new pmd, right?

J
-
PDPTR is documented to have P bits but none of the other control bits,
unlike other levels of the hierarchy.

The hardware never sets A or D bits on non-present pages, since all the
bits except P are reserved for the operating systems (and, besides, they

Yes. And as Linus said, it would be a new special case.

-hpa
-
Are you sure?
Anyway, this is not worth making a distinction for. Just pre-allocate all
of them. There really are just 4 PGD entries, and it really *is* different
from having a full three-level page table, and of the four PGD entries:

 - one is used for the kernel mapping (assuming the regular 1:3 layout)
 - AT LEAST two are required by user space anyway

so pre-allocating is never going to waste more than one page.

And you may feel that pre-allocating is a special case, but it's an
*easier* special case than the one that you are apparently thinking about
(which is to special-case according to CPU version).

So don't do it. Just preallocate for the magic 4-entry PGD. You can make
the special case just be something like

	/* Preallocate for small PGD's */
	#if PTRS_PER_PGD == 4
	for (i = 0; i < USER_PTRS_PER_PGD; i++) {
		pmd_t *pmd = pmd_alloc();
		set_pgd(pgd+i, __pgd(PAGE_PRESENT | __pa(pmd)));
	}
	#endif

or similar.
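
For readers who want to experiment outside the kernel, the preallocation idea can be sketched as a small userspace model. Everything here is an assumption made for illustration: the `model_*` names and types are invented, memory comes from `malloc`/`calloc` rather than the page allocator, and the kernel slot is simply left empty rather than pointing at a shared kernel pmd.

```c
#include <stdint.h>
#include <stdlib.h>

#define MODEL_PTRS_PER_PGD      4   /* PAE: four top-level entries */
#define MODEL_USER_PTRS_PER_PGD 3   /* assumes the usual 3:1 user/kernel split */
#define MODEL_PTRS_PER_PMD      512

/* Hypothetical stand-ins for pmd and pgd pages. */
typedef struct { uint64_t entry[MODEL_PTRS_PER_PMD]; } model_pmd;
typedef struct { model_pmd *pmd[MODEL_PTRS_PER_PGD]; } model_pgd;

/* Free a model pgd together with every pmd still attached to it. */
static void model_pgd_free(model_pgd *pgd)
{
	if (!pgd)
		return;
	for (int i = 0; i < MODEL_PTRS_PER_PGD; i++)
		free(pgd->pmd[i]);	/* free(NULL) is a no-op */
	free(pgd);
}

/* Allocate a pgd with all user pmds already in place. */
static model_pgd *model_pgd_alloc(void)
{
	model_pgd *pgd = malloc(sizeof(*pgd));

	if (!pgd)
		return NULL;
	for (int i = 0; i < MODEL_PTRS_PER_PGD; i++)
		pgd->pmd[i] = NULL;

	for (int i = 0; i < MODEL_USER_PTRS_PER_PGD; i++) {
		pgd->pmd[i] = calloc(1, sizeof(model_pmd));
		if (!pgd->pmd[i]) {
			model_pgd_free(pgd);	/* undo the partial allocation */
			return NULL;
		}
	}
	return pgd;
}

/* Count how many top-level slots currently have a pmd attached. */
static int model_pgd_populated(const model_pgd *pgd)
{
	int n = 0;

	for (int i = 0; i < MODEL_PTRS_PER_PGD; i++)
		n += pgd->pmd[i] != NULL;
	return n;
}
```

Because every user slot is populated before the pgd is ever handed to anyone, no later code path in this model needs to flip a top-level entry from empty to populated.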
There is absolutely *zero* reason not to do this, and there is also zero
reason to make this be a "32-bit vs 64-bit" issue. The code can be there
in both, and the #if could even be all in C code (ie there may be reasons
to prefer writing it as

	/* The old-style PAE PGD needs to be preallocated */
	if (USER_PTRS_PER_PGD <= 4) {
		...
	}

and the compiler should even compile it away entirely for all practical
x86 page table walking never sets A/D bits on non-present entries.
That said, there's still a huge difference.
For "real" page table walking, you can always just insert entries without
flushing the cache if those entries weren't there before (because the TLB
is supposed to not cache negative entries).

Again, because of the way the magic 4-entry PGD works, that isn't true for
it. It caches the entries regardless, so if you change it from non-present
to present, you have to flush the TLB (well, "reload %cr3", which is the

BUT ONLY FOR THIS CASE!
And if you preallocate it, you make *that* special case ...
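
The caching behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not a description of any particular CPU: `model_cpu`, `model_load_cr3` and `model_top_level_present` are invented names, and the "CPU" here is just a struct holding a private copy of the four top-level entries.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_PDPT_ENTRIES 4
#define MODEL_ENTRY_PRESENT 0x1ull

/* Toy CPU: keeps its own copy of the four top-level entries. */
struct model_cpu {
	uint64_t pdpte_cache[MODEL_PDPT_ENTRIES];
};

/* Loading cr3 snapshots all four entries from memory into the CPU. */
static void model_load_cr3(struct model_cpu *cpu,
			   const uint64_t pdpt[MODEL_PDPT_ENTRIES])
{
	memcpy(cpu->pdpte_cache, pdpt, sizeof(cpu->pdpte_cache));
}

/* The walker consults only the cached copy, never the table in memory. */
static int model_top_level_present(const struct model_cpu *cpu, uint32_t va)
{
	return (cpu->pdpte_cache[va >> 30] & MODEL_ENTRY_PRESENT) != 0;
}
```

In this model, writing a present entry into the in-memory table after `model_load_cr3` has no effect on `model_top_level_present` until `model_load_cr3` is called again, which mirrors the "reload %cr3" requirement. If all entries are present before the first load, that situation never arises.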
Yes, OK, it makes sense. Conceptually they would be dynamically
allocated and freed, but they'd just happen to start allocated, to avoid
the tlb flush of populating the pgd of an active pagetable. If you
happened to do a 1G munmap, it may end up freeing and reallocating them,
but that's going to be very rare. Either way, the other special cases
are avoided (though pgd_populate would still need to be correct, on the
off chance it gets invoked).

J
-
I don't think we ever free the pmd's now, do we?
(Except for the *final* free, of course, when we release the whole VM).
Linus
-
Not for 32-bit at the moment, but it does in principle. munmap ends up
calling free_pgtables, and so ends up calling pmd_free_range. That will
do a pud_clear to detach the pmd from the pagetable and call
__pmd_free_tlb, which ends up doing tlb_remove_page ->
free_page_and_swap_cache. 32-bit knobbles all this at the moment, but
it looks to me like it wouldn't be hard to make this work if the code is
all common with 64-bit.

J
-
3.8.5 in vol 3a "Page-Directory and Page-Table Entries With Extended
Addressing Enabled":

    The present flag (bit 0) in the page-directory-pointer-table entries
    can be set to 0 or 1. If the present flag is clear, the remaining
    bits in the page-directory-pointer-table entry are available to the
    operating system. If the present flag is set, the fields of the
    page-directory-pointer-table entry are defined in Figures 3-20 for
    4-KByte pages and Figures 3-21 for 2-MByte pages.

So I would assume this works on all current CPUs, but I can imagine that
Yeah, I'm not so concerned about memory saving; I don't think there

I'm hoping to avoid special-casing anything, if I can help it, aside
from the normal 32/64-bit 2/3/4-level parameterising of the various

Perhaps. And there's the corresponding difference between 32 and 64 bit
on freeing a pagetable; 32-bit assumes the pgd destructor will free the
pmd, whereas 64-bit does it separately. Even in the current 32-bit
code, there's separate handling for PAE and non-PAE. I think it can all

Yes, that is a bit awkward; it means that 32-bit PAE would need a
separate pgd_populate. But that seems like a smaller change than 1)
making 32-bit PAE pgd-alloc preallocate the pmd, and 2) making pmd_free
noop on 32-bit PAE, and 3) making pgd_free free the preallocated pmd.
Perhaps 2 & 3 aren't necessary and can be the same as 64-bit.

Yep, absolutely.

J
-
This is true, although you could point a PGD to an all-zero page if you
really wanted to. You have to re-load CR3 after modifying the top-level

They still are. Loading CR3 in PAE really loads four registers from
memory. x86-64 is different, of course.

-hpa
-
There may be bigger fish to fry in terms of per-process overhead, if
you're trying to cut that down. The trouble with trying to address
some of those is that there is mutual antagonism between compactness
and expansibility in the process address space layout, so you'll end
up instantiating a lot more than you want barring some sort of provision
for a compact address space layout. Pagetable sharing is a far more
powerful resource scalability method, though it also needs cooperation
in user address space layout to reap its gains.

There are other overheads, of course, though they're more typically
per-something besides processes.

-- wli
-
I think Jeremy's question was due to trying to reduce the 32/64-bit
differences. Performance-wise, it might add a small amount to user
setup time (a typical 32-bit process will need all four, for the main
binary, libraries, stack and kernel, respectively) but it is probably
not significant (although I'd like to see numbers just in case).

-hpa
-
With the new top down mmap layout and standard 3:1 split it should typically
only need two.

-Andi
-
Well, three with the kernel.
-hpa
-
I didn't count kernel because it is always fixed anyways and about zero
overhead for the normal setup case.

-Andi
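
The "four versus two" count in the exchange above can be checked with a small sketch that counts how many 1 GB top-level slots a set of addresses touches. The function name `count_top_level_slots` is invented, and the example addresses in the note below are illustrative assumptions about typical 32-bit layouts, not values taken from this thread.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the distinct top-level PAE slots (1 GB each, selected by
 * address bits 31:30) touched by the given virtual addresses.
 */
static unsigned count_top_level_slots(const uint32_t *addrs, size_t n)
{
	unsigned seen = 0;	/* bit s set => slot s is in use */
	unsigned count = 0;

	for (size_t i = 0; i < n; i++)
		seen |= 1u << (addrs[i] >> 30);
	for (unsigned s = 0; s < 4; s++)
		count += (seen >> s) & 1u;
	return count;
}
```

Assuming a binary near 0x08048000, a stack just below 0xc0000000, and libraries mapped either near 0x40000000 (older bottom-up layout) or just below the stack (top-down layout), the older layout touches three user slots and the top-down layout touches two; the kernel's slot at 0xc0000000 adds one more in both cases.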
-
Of course, but it was in the original list so...
-hpa
-
