Hi Ingo,

This series addresses various cleanups in pagetable allocation in the
direction of unifying 32/64 bits (that's still a while off yet).

The significant change in here is that I'm separating the lifetime of
a pmd from its pgd in the 32-bit PAE case. This makes it logically
the same as 64-bit pagetable allocation, and it overall simplifies the
code.

The patches are:
 - A pure Xen fix I tacked on for convenience
 - Use the same pgd_list mechanism for 32 and 64 bits
 - Add an mm parameter for paravirt_alloc_pd, for consistency
 - Some fixes to early_ioremap to make sure the right paravirt hooks
   are called appropriately
 - de-macro asm-x86/pgalloc_32.h
 - make mm/pgtable_32.c:pgd_ctor a single function
 - dynamically allocate pmds rather than always allocating them with
   the pgd
 - Add Xen bits for dealing with pmd allocation
 - Preallocate pmds to avoid excessive tlb flushes
 - Allocate and initialize kernel pmds when they're not shared
 - Avoid excessive tlb flushes when pulling down pmds.

I've done a number of randconfig test builds to shake out various
configurations on 32 and 64 bits.

One caveat: in order to demacro pgalloc_32.h, I had to rearrange some
headers in asm-generic/tlb.h, as it was including asm/pgalloc.h for no
good reason. As a result, any other file which was expecting to
implicitly pick up asm/pgalloc.h when including an asm/tlb.h header
may get header file problems. I have not done any cross builds to try
and track down any non-x86 fallout from this.

Thanks,
J
--
The constructors for PAE and non-PAE pgd_ctors are more or less
identical, and can be made into the same function.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: William Irwin <wli@holomorphy.com>
---
arch/x86/mm/pgtable_32.c | 43 +++++++++++--------------------------------
1 file changed, 11 insertions(+), 32 deletions(-)
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -224,50 +224,32 @@ static inline void pgd_list_del(pgd_t *p
list_del(&page->lru);
}
+#define UNSHARED_PTRS_PER_PGD \
+ (SHARED_KERNEL_PMD ? USER_PTRS_PER_PGD : PTRS_PER_PGD)
-
-#if (PTRS_PER_PMD == 1)
-/* Non-PAE pgd constructor */
static void pgd_ctor(void *pgd)
{
unsigned long flags;
- /* !PAE, no pagetable sharing */
+ /* Clear usermode parts of PGD */
memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
spin_lock_irqsave(&pgd_lock, flags);
- /* must happen under lock */
- clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
- swapper_pg_dir + USER_PTRS_PER_PGD,
- KERNEL_PGD_PTRS);
- paravirt_alloc_pd_clone(__pa(pgd) >> PAGE_SHIFT,
- __pa(swapper_pg_dir) >> PAGE_SHIFT,
- USER_PTRS_PER_PGD,
- KERNEL_PGD_PTRS);
- pgd_list_add(pgd);
- spin_unlock_irqrestore(&pgd_lock, flags);
-}
-#else /* PTRS_PER_PMD > 1 */
-/* PAE pgd constructor */
-static void pgd_ctor(void *pgd)
-{
- /* PAE, kernel PMD may be shared */
-
if (SHARED_KERNEL_PMD) {
+ /* must happen under lock */
clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
swapper_pg_dir + USER_PTRS_PER_PGD,
KERNEL_PGD_PTRS);
- } else {
- unsigned long flags;
+ paravirt_alloc_pd_clone(__pa(pgd) >> PAGE_SHIFT,
+ __pa(swapper_pg_dir) >> PAGE_SHIFT,
+ USER_PTRS_PER_PGD,
+ KERNEL_PGD_PTRS);
+ } else
+ pgd_list_add(pgd);
- memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
- spin_lock_irqsave(&pgd_lock, flags);
- pgd_list_add(pgd);
- spin_unlock_irqrestore(&pgd_lock, ...
Put appropriate pagetable update hooks in so that paravirt knows
what's going on in there.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
arch/x86/mm/ioremap.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -18,6 +18,7 @@
#include <asm/fixmap.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
#ifdef CONFIG_X86_64
@@ -265,7 +266,7 @@ void __init early_ioremap_init(void)
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
- set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
+ pmd_populate_kernel(&init_mm, pmd, bm_pte);
/*
* The boot-ioremap range spans multiple pmds, for which
@@ -295,6 +296,7 @@ void __init early_ioremap_clear(void)
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
pmd_clear(pmd);
+ paravirt_release_pt(__pa(bm_pte) >> PAGE_SHIFT);
__flush_tlb_all();
}
--
This seems to have ended up in
f6df72e71eba621b2f5c49b3a763116fac748f6e as:
+ paravirt_release_pt(__pa(pmd) >> PAGE_SHIFT);
and the pmd_populate_kernel hunk is missing altogether.

From: Ian Campbell <ijc@hellion.org.uk>
Date: Thu, 31 Jan 2008 18:56:06 +0000
Subject: [PATCH] x86: fix early_ioremap pagetable ops for paravirt.

Some important parts of f6df72e71eba621b2f5c49b3a763116fac748f6e got
dropped along the way, reintroduce them.
Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
---
arch/x86/mm/ioremap.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index ed4208e..93d931e 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -302,7 +302,7 @@ void __init early_ioremap_init(void)
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
- set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
+ pmd_populate_kernel(&init_mm, pmd, bm_pte);
/*
* The boot-ioremap range spans multiple pmds, for which
@@ -332,7 +332,7 @@ void __init early_ioremap_clear(void)
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
pmd_clear(pmd);
- paravirt_release_pt(__pa(pmd) >> PAGE_SHIFT);
+ paravirt_release_pt(__pa(bm_pte) >> PAGE_SHIFT);
__flush_tlb_all();
}
--
1.5.3.8
--
Ian Campbell
This fortune would be seven words long if it were six words shorter.
--
thanks, applied. AFAICS it should only affect paravirt, not the native
kernel, right?
Ingo
--
Looks like a mismerge/misapply dropped one of the cases of pte flag
masking for Xen. Also, only mask the flags for present ptes.
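As a rough user-space illustration of the rule described above (translate and strip the cache-control bits only when the pte is present), here is a minimal sketch. It is not the Xen code: fake_phys_to_machine() and sketch_make_pte() are made-up names, and the fixed-offset "translation" is an assumption standing in for the real p2m lookup. The three flag values are the architectural x86 bit positions.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdint.h>

/* x86 pte flag bits (architectural values). */
#define SK_PAGE_PRESENT 0x001ULL
#define SK_PAGE_PWT     0x008ULL
#define SK_PAGE_PCD     0x010ULL

/*
 * Hypothetical stand-in for phys_to_machine(): a real guest would look
 * the frame up in its physical-to-machine table. Here the frame address
 * is shifted by a fixed offset and the low 12 flag bits are preserved.
 */
static uint64_t fake_phys_to_machine(uint64_t pte)
{
	const uint64_t flags_mask = 0xfffULL;

	return ((pte & ~flags_mask) + 0x100000ULL) | (pte & flags_mask);
}

/* Translate and mask PCD/PWT only for present ptes; leave others alone. */
static uint64_t sketch_make_pte(uint64_t pte)
{
	if (pte & SK_PAGE_PRESENT) {
		pte = fake_phys_to_machine(pte);
		pte &= ~(SK_PAGE_PCD | SK_PAGE_PWT);
	}
	return pte;
}
```

With this sketch, a present pte such as 0x2000 | SK_PAGE_PRESENT | SK_PAGE_PCD comes back translated with PCD cleared, while the same value without the present bit is returned unchanged.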
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
arch/x86/xen/mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -244,8 +244,10 @@ unsigned long long xen_pgd_val(pgd_t pgd
pte_t xen_make_pte(unsigned long long pte)
{
- if (pte & 1)
+ if (pte & _PAGE_PRESENT) {
pte = phys_to_machine(XPADDR(pte)).maddr;
+ pte &= ~(_PAGE_PCD | _PAGE_PWT);
+ }
return (pte_t){ .pte = pte };
}
@@ -291,10 +293,10 @@ unsigned long xen_pgd_val(pgd_t pgd)
pte_t xen_make_pte(unsigned long pte)
{
- if (pte & _PAGE_PRESENT)
+ if (pte & _PAGE_PRESENT) {
pte = phys_to_machine(XPADDR(pte)).maddr;
-
- pte &= ~(_PAGE_PCD | _PAGE_PWT);
+ pte &= ~(_PAGE_PCD | _PAGE_PWT);
+ }
return (pte_t){ pte };
}
--
Add mm to paravirt_alloc_pd, partly to make it consistent with
paravirt_alloc_pt, and because later changes will make use of it.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
arch/x86/kernel/vmi_32.c | 2 +-
arch/x86/mm/init_32.c | 4 ++--
arch/x86/mm/pgtable_32.c | 4 +++-
include/asm-x86/paravirt.h | 6 +++---
include/asm-x86/pgalloc_32.h | 3 +--
5 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/vmi_32.c b/arch/x86/kernel/vmi_32.c
--- a/arch/x86/kernel/vmi_32.c
+++ b/arch/x86/kernel/vmi_32.c
@@ -398,7 +398,7 @@ static void vmi_allocate_pt(struct mm_st
vmi_ops.allocate_page(pfn, VMI_PAGE_L1, 0, 0, 0);
}
-static void vmi_allocate_pd(u32 pfn)
+static void vmi_allocate_pd(struct mm_struct *mm, u32 pfn)
{
/*
* This call comes in very early, before mem_map is setup.
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -67,7 +67,7 @@ static pmd_t * __init one_md_table_init(
if (!(pgd_val(*pgd) & _PAGE_PRESENT)) {
pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
- paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
+ paravirt_alloc_pd(&init_mm, __pa(pmd_table) >> PAGE_SHIFT);
set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
pud = pud_offset(pgd, 0);
BUG_ON(pmd_table != pmd_offset(pud, 0));
@@ -378,7 +378,7 @@ void __init native_pagetable_setup_start
pte_clear(NULL, va, pte);
}
- paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
+ paravirt_alloc_pd(&init_mm, __pa(base) >> PAGE_SHIFT);
}
void __init native_pagetable_setup_done(pgd_t *base)
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -321,13 +321,15 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
if (PTRS_PER_PMD == 1 || !pgd)
return pgd;
+ mm->pgd = pgd; /* so that alloc_pd can use it */
+
for (i = 0; i < ...
Use a standard list threaded through page->lru for maintaining the pgd
list on PAE. This is the same as 64-bit, and seems saner than using a
non-standard list via page->index.
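To make the "standard list threaded through page->lru" idea concrete, here is a small user-space sketch of an intrusive circular list in the style of the kernel's list_head. It is an illustration under assumptions, not kernel code: struct fake_page, list_node and the helper names are invented here, and the real code uses list_add(), list_del() and list_for_each_entry() from linux/list.h.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stddef.h>

/* Minimal circular doubly-linked list node, modelled on list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical stand-in for struct page: only the embedded node matters. */
struct fake_page {
	int id;
	struct list_node lru;
};

/* Recover the containing fake_page from a pointer to its lru member. */
#define node_to_page(ptr) \
	((struct fake_page *)((char *)(ptr) - offsetof(struct fake_page, lru)))

static void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

/* Insert node right after head. */
static void list_add_front(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Unlink node in O(1) using only its own neighbours. */
static void list_unlink(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node->prev = node;
}

/* Visit every entry, the way a pgd_list walker visits every pgd page. */
static int sum_ids(struct list_node *head)
{
	int sum = 0;

	for (struct list_node *n = head->next; n != head; n = n->next)
		sum += node_to_page(n)->id;
	return sum;
}
```

Because the head is a sentinel node rather than a bare pointer, an empty list is simply a head that points to itself, and walkers need no NULL checks.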
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
arch/x86/mm/fault.c | 10 +++-------
arch/x86/mm/pageattr.c | 2 +-
arch/x86/mm/pgtable_32.c | 19 +++++--------------
include/asm-x86/pgtable.h | 2 ++
include/asm-x86/pgtable_32.h | 2 --
include/asm-x86/pgtable_64.h | 3 ---
6 files changed, 11 insertions(+), 27 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -922,10 +922,8 @@ do_sigbus:
force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
}
-#ifdef CONFIG_X86_64
DEFINE_SPINLOCK(pgd_lock);
LIST_HEAD(pgd_list);
-#endif
void vmalloc_sync_all(void)
{
@@ -950,13 +948,11 @@ void vmalloc_sync_all(void)
struct page *page;
spin_lock_irqsave(&pgd_lock, flags);
- for (page = pgd_list; page; page =
- (struct page *)page->index)
+ list_for_each_entry(page, &pgd_list, lru) {
if (!vmalloc_sync_one(page_address(page),
- address)) {
- BUG_ON(page != pgd_list);
+ address))
break;
- }
+ }
spin_unlock_irqrestore(&pgd_lock, flags);
if (!page)
set_bit(pgd_index(address), insync);
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -100,7 +100,7 @@ static void __set_pmd_pte(pte_t *kpte, u
if (!SHARED_KERNEL_PMD) {
struct page *page;
- for (page = pgd_list; page; page = (struct page *)page->index) {
+ list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -210,27 +210,18 @@ void pmd_ctor(struct kmem_cache *cache,
* vmalloc faults work because ...
In x86 PAE mode, stop treating pmds as a special case. Previously
they were always allocated and freed with the pgd. This modifies the
code to be the same as 64-bit mode, where they are allocated on
demand.
This is a step on the way to unifying 32/64-bit pagetable allocation
as much as possible.
There is a complicating wart, however. When you install a new
reference to a pmd in the pgd, the processor isn't guaranteed to see
it unless you reload cr3. Since reloading cr3 also has the
side-effect of flushing the tlb, this is an expense that we want to
avoid wherever possible.
This patch simply avoids reloading cr3 unless the update is to the
current pagetable. Later patches will optimise this further.
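The policy in the paragraph above (reload cr3 only when the pgd being updated is the one currently loaded) can be modelled in user space as follows. This is a sketch under stated assumptions: struct fake_cpu, reload_cr3() and install_pmd() are hypothetical names, and the "reload" is just a counter so the effect can be observed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t entry_t;

/*
 * Hypothetical model of one CPU: which top-level table it has loaded,
 * and how many reloads (each implying a tlb flush) it has paid for.
 */
struct fake_cpu {
	entry_t *active_pgd;
	unsigned reloads;
};

static void reload_cr3(struct fake_cpu *cpu)
{
	cpu->reloads++;
}

/*
 * Install a pmd reference into a top-level slot. The reload is only
 * paid for when the CPU is currently running on this pgd.
 */
static void install_pmd(struct fake_cpu *cpu, entry_t *pgd,
			unsigned idx, entry_t val)
{
	assert(idx < 4);	/* PAE has four top-level entries */
	pgd[idx] = val;
	if (pgd == cpu->active_pgd)
		reload_cr3(cpu);
}
```

Updating a pgd that is not loaded costs nothing extra in this model, since that pgd will be read afresh whenever it is eventually loaded.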
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
---
arch/x86/mm/init_32.c | 13 -------
arch/x86/mm/pgtable_32.c | 68 --------------------------------------
include/asm-x86/pgalloc_32.h | 22 ++++++++++--
include/asm-x86/pgtable-3level.h | 39 +++++++++++++++------
include/asm-x86/pgtable_32.h | 3 -
5 files changed, 47 insertions(+), 98 deletions(-)
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -709,19 +709,6 @@ int arch_add_memory(int nid, u64 start,
}
#endif
-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
- if (PTRS_PER_PMD > 1) {
- pmd_cache = kmem_cache_create("pmd",
- PTRS_PER_PMD*sizeof(pmd_t),
- PTRS_PER_PMD*sizeof(pmd_t),
- SLAB_PANIC,
- pmd_ctor);
- }
-}
-
/*
* This function cannot be __init, since exceptions don't work in that
* section. Put this after the callers, so that it cannot be inlined.
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ ...
If SHARED_KERNEL_PMD is false, then we need to allocate and initialize
the kernel pmd. We can easily piggy-back this onto the existing pmd
prepopulation code.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
arch/x86/mm/pgtable_32.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -269,7 +269,7 @@ static void pgd_mop_up_pmds(pgd_t *pgdp)
{
int i;
- for(i = 0; i < USER_PTRS_PER_PGD; i++) {
+ for(i = 0; i < UNSHARED_PTRS_PER_PGD; i++) {
pgd_t pgd = pgdp[i];
if (pgd_val(pgd) != 0) {
@@ -289,6 +289,10 @@ static void pgd_mop_up_pmds(pgd_t *pgdp)
* processor notices the update. Since this is expensive, and
* all 4 top-level entries are used almost immediately in a
* new process's life, we just pre-populate them here.
+ *
+ * Also, if we're in a paravirt environment where the kernel pmd is
+ * not shared between pagetables (!SHARED_KERNEL_PMDS), we allocate
+ * and initialize the kernel pmds here.
*/
static int pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd)
{
@@ -297,13 +301,18 @@ static int pgd_prepopulate_pmd(struct mm
int i;
pud = pud_offset(pgd, 0);
- for (addr = i = 0; i < USER_PTRS_PER_PGD; i++, pud++, addr += PUD_SIZE) {
+ for (addr = i = 0; i < UNSHARED_PTRS_PER_PGD;
+ i++, pud++, addr += PUD_SIZE) {
pmd_t *pmd = pmd_alloc_one(mm, addr);
if (!pmd) {
pgd_mop_up_pmds(pgd);
return 0;
}
+
+ if (i >= USER_PTRS_PER_PGD)
+ memcpy(pmd, (pmd_t *)pgd_page_vaddr(swapper_pg_dir[i]),
+ sizeof(pmd_t) * PTRS_PER_PMD);
pud_populate(mm, pud, pmd);
}
@@ -346,4 +355,3 @@ void check_pgt_cache(void)
{
quicklist_trim(0, pgd_dtor, 25, 16);
}
-
--
Convert macros into inline functions, for better type-checking.

This patch required a little bit of fiddling with headers in order to
make __(pte|pmd)_free_tlb inline rather than macros.
asm-generic/tlb.h includes asm/pgalloc.h, though it doesn't directly
use any pgalloc definitions. I removed this include to avoid an
include cycle, but it may cause secondary compile failures by things
depending on the indirect inclusion; arch/x86/mm/hugetlbpage.c was one
such place; there may be others.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
arch/x86/mm/hugetlbpage.c | 3 +
arch/x86/mm/init_32.c | 1
include/asm-generic/tlb.h | 1
include/asm-x86/pgalloc_32.h | 61 ++++++++++++++++++++++++--------------
include/asm-x86/pgtable-3level.h | 2 -
include/linux/swap.h | 1
6 files changed, 43 insertions(+), 26 deletions(-)
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -15,6 +15,7 @@
#include <asm/mman.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
static unsigned long page_table_shareable(struct vm_area_struct *svma,
struct vm_area_struct *vma,
@@ -88,7 +89,7 @@ static void huge_pmd_share(struct mm_str
spin_lock(&mm->page_table_lock);
if (pud_none(*pud))
- pud_populate(mm, pud, (unsigned long) spte & PAGE_MASK);
+ pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK));
else
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -42,6 +42,7 @@
#include <asm/bugs.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
#include <asm/sections.h>
#include <asm/paravirt.h>
#include <asm/setup.h>
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
--- a/include/asm-generic/tlb.h
+++ ...
Deal properly with pmd-level pages being allocated and freed
dynamically. We can handle them more or less the same as pte pages.
Also, deal with early_ioremap pagetable manipulations.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
---
arch/x86/xen/enlighten.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -658,6 +658,13 @@ static __init void xen_alloc_pt_init(str
make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
}
+/* Early release_pt assumes that all pts are pinned, since there's
+ only init_mm and anything attached to that is pinned. */
+static void xen_release_pt_init(u32 pfn)
+{
+ make_lowmem_page_readwrite(__va(PFN_PHYS(pfn)));
+}
+
static void pin_pagetable_pfn(unsigned level, unsigned long pfn)
{
struct mmuext_op op;
@@ -669,7 +676,7 @@ static void pin_pagetable_pfn(unsigned l
/* This needs to make sure the new pte page is pinned iff its being
attached to a pinned pagetable. */
-static void xen_alloc_pt(struct mm_struct *mm, u32 pfn)
+static void xen_alloc_ptpage(struct mm_struct *mm, u32 pfn, unsigned level)
{
struct page *page = pfn_to_page(pfn);
@@ -678,12 +685,22 @@ static void xen_alloc_pt(struct mm_struc
if (!PageHighMem(page)) {
make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
- pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);
+ pin_pagetable_pfn(level, pfn);
} else
/* make sure there are no stray mappings of
this page */
kmap_flush_unused();
}
+}
+
+static void xen_alloc_pt(struct mm_struct *mm, u32 pfn)
+{
+ xen_alloc_ptpage(mm, pfn, MMUEXT_PIN_L1_TABLE);
+}
+
+static void xen_alloc_pd(struct mm_struct *mm, u32 pfn)
+{
+ xen_alloc_ptpage(mm, pfn, MMUEXT_PIN_L2_TABLE);
}
/* This should never happen until we're OK to use struct page */
@@ -788,6 +805,9 @@ static __init void xen_pagetable_setup_d
/* ...In PAE mode, an update to the pgd requires a cr3 reload to make sure
the processor notices the changes. Since this also has the
side-effect of flushing the tlb, its an expensive operation which we
want to avoid where possible.
This patch mitigates the cost of installing the initial set of pmds on
process creation by preallocating them when the pgd is allocated.
This avoids up to three tlb flushes during exec, as it creates the new
process address space while the pagetable is in active use.
The pmds will be freed as part of the normal pagetable teardown in
free_pgtables, which is called in munmap and process exit. However,
free_pgtables will only free parts of the pagetable which actually
contain mappings, so stray pmds may still be attached to the pgd at
pgd_free time. We must mop them up to prevent a memory leak.
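The preallocate-then-mop-up pattern described above can be sketched in user space like this. It is an illustration only: alloc_table(), free_table(), mop_up() and prepopulate() are hypothetical names, plain calloc() stands in for the pmd allocator, and alloc_budget is a test knob added here so that an allocation failure can be provoked.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdlib.h>

#define TABLE_SIZE 4096

static int live_tables;		/* tables currently allocated */
static int alloc_budget = -1;	/* test knob: -1 = unlimited, else allocations left */

static void *alloc_table(void)
{
	void *p;

	if (alloc_budget == 0)
		return NULL;	/* simulated out-of-memory */
	if (alloc_budget > 0)
		alloc_budget--;
	p = calloc(1, TABLE_SIZE);
	if (p)
		live_tables++;
	return p;
}

static void free_table(void *p)
{
	free(p);
	live_tables--;
}

/* Detach and free whatever is still attached to the top-level slots. */
static void mop_up(void **slots, int n)
{
	for (int i = 0; i < n; i++) {
		if (slots[i]) {
			void *t = slots[i];

			slots[i] = NULL;
			free_table(t);
		}
	}
}

/*
 * Fill all n slots up front. On failure, undo the partial work so that
 * nothing leaks. Returns 1 on success, 0 on failure.
 */
static int prepopulate(void **slots, int n)
{
	for (int i = 0; i < n; i++) {
		void *t = alloc_table();

		if (!t) {
			mop_up(slots, n);
			return 0;
		}
		slots[i] = t;
	}
	return 1;
}
```

The same mop_up() serves both the error path and final teardown, which mirrors how the patch reuses pgd_mop_up_pmds() when preallocation fails part-way.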
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
---
arch/x86/mm/pgtable_32.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 70 insertions(+)
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -258,17 +258,87 @@ static void pgd_dtor(void *pgd)
spin_unlock_irqrestore(&pgd_lock, flags);
}
+#ifdef CONFIG_X86_PAE
+/*
+ * Mop up any pmd pages which may still be attached to the pgd.
+ * Normally they will be freed by munmap/exit_mmap, but any pmd we
+ * preallocate which never got a corresponding vma will need to be
+ * freed manually.
+ */
+static void pgd_mop_up_pmds(pgd_t *pgdp)
+{
+ int i;
+
+ for(i = 0; i < USER_PTRS_PER_PGD; i++) {
+ pgd_t pgd = pgdp[i];
+
+ if (pgd_val(pgd) != 0) {
+ pmd_t *pmd = (pmd_t *)pgd_page_vaddr(pgd);
+
+ pgdp[i] = native_make_pgd(0);
+
+ paravirt_release_pd(pgd_val(pgd) >> PAGE_SHIFT);
+ pmd_free(pmd);
+ }
+ }
+}
+
+/*
* In PAE ...
PAE mode requires that we reload cr3 in order to guarantee that
changes to the pgd will be noticed by the processor. This means that
in principle pud_clear needs to reload cr3 every time. However,
because reloading cr3 implies a tlb flush, we want to avoid it where
possible.
pud_clear() is only used in a couple of places:
- in free_pmd_range(), when pulling down a range of process address space, and
- huge_pmd_unshare()
In both cases, the calling code will do a tlb flush anyway, so
there's no need to do it within pud_clear().
In free_pmd_range(), the pud_clear is immediately followed by
pmd_free_tlb(); we can hook that to make the mmu_gather do an
unconditional full flush to make sure cr3 gets reloaded.
In huge_pmd_unshare, it is followed by flush_tlb_range, which always
results in a full cr3-reload tlb flush.
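The batching idea (mark the gather as needing a full flush instead of flushing on every pud_clear) can be modelled with a much-reduced stand-in for mmu_gather. Everything here is an assumption made for illustration: struct fake_gather and its helpers are invented, need_full_flush plays the role of tlb->fullmm, and the two counters exist only so the outcome can be checked.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdbool.h>

/* Hypothetical, heavily simplified model of a tlb gather batch. */
struct fake_gather {
	bool need_full_flush;		/* plays the role of tlb->fullmm */
	unsigned pending_pages;		/* pages queued for freeing */
	unsigned full_flushes;		/* observation counters */
	unsigned partial_flushes;
};

/* Freeing a pmd page: queue it and demand a full flush at the end. */
static void gather_free_pmd(struct fake_gather *g)
{
	g->need_full_flush = true;
	g->pending_pages++;
}

/* Freeing a pte page: queue it; a ranged flush would be enough. */
static void gather_free_pte_page(struct fake_gather *g)
{
	g->pending_pages++;
}

/* End of the batch: exactly one flush, of whichever kind is required. */
static void gather_finish(struct fake_gather *g)
{
	if (g->pending_pages == 0)
		return;
	if (g->need_full_flush)
		g->full_flushes++;
	else
		g->partial_flushes++;
	g->pending_pages = 0;
	g->need_full_flush = false;
}
```

However many pmd pages are detached within one batch, the model performs a single full flush when the batch is finished, which is the saving the patch is after.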
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
---
include/asm-x86/pgalloc_32.h | 7 +++++++
include/asm-x86/pgtable-3level.h | 21 +++++++++++++++------
2 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/include/asm-x86/pgalloc_32.h b/include/asm-x86/pgalloc_32.h
--- a/include/asm-x86/pgalloc_32.h
+++ b/include/asm-x86/pgalloc_32.h
@@ -74,6 +74,13 @@ static inline void pmd_free(pmd_t *pmd)
static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
{
+ /* This is called just after the pmd has been detached from
+ the pgd, which requires a full tlb flush to be recognized
+ by the CPU. Rather than incurring multiple tlb flushes
+ while the address space is being pulled down, make the tlb
+ gathering machinery do a full flush when we're done. */
+ tlb->fullmm = 1;
+
paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
tlb_remove_page(tlb, virt_to_page(pmd));
}
diff --git a/include/asm-x86/pgtable-3level.h ...
hm, i tried this, and got an early crash:
[ 29.389844] VFS: Mounted root (ext3 filesystem) readonly.
[ 29.389872] debug: unmapping init memory c0b03000..c0b6f000
[ 29.440139] PM: Adding info for No Bus:vcs1
[ 29.463676] khelper used greatest stack depth: 2404 bytes left
[ 29.467238] PM: Adding info for No Bus:vcsa1
[ 29.541785] PANIC: double fault, gdt at c1d16000 [255 bytes]
[ 29.541785] double fault, tss at c1d19100
[ 29.541785] eip = c011fa95, esp = c3bf6000
[ 29.541785] eax = c3bf6010, ebx = c0b6fc08, ecx = 0000007b, edx = 00000000
[ 29.541785] esi = f76a7df4, edi = c011fa90
i think it's one of your patches :) Bisecting it down to the right one
now. Config attached.
Ingo
Wouldn't surprise me. Given that it's a non-PAE config, most of the
patches won't be in play, but perhaps I screwed up copying the kernel
pagetable entries into the pgd somehow.
J
--
and after a session of bisection, the winner patch is:
Subject: x86: unify PAE/non-PAE pgd_ctor
which is a tad unexpected, given the relatively harmless nature of the
patch. (but then again, nothing is really harmless in PAE land.)

btw., this is not fair i think: your patch was apparently caught by
the note the close proximity of c0b6f000 and ebx = c0b6fc08.
[ I regularly come up with such nasty tricks and debugging helpers like
that to catch bad patches off-guard. You have been warned! ;-) ]
Ingo
--
ok, i merged up your series with this patch removed. (it was possible
with a few manual fixups) That way the problem .config boots fine.
Ingo
--
Oh, well, good. At least off-the-cuff diagnosis was right. I must have
Hm, perhaps, but it could just as easily be coincidence. The place where
initmem is freed is close to where it first needs to rely on a
non-initmm pagetable. I presume that message means that c0b6f000 was
*not* freed.
J
--
