Hi Ingo,
This series lays the groundwork for 64-bit Xen support. It follows
the usual pattern: a series of general cleanups and improvements,
followed by additions and modifications needed to slide Xen in.
Most of the 64-bit paravirt-ops work has already been done and
integrated for some time, so the changes are relatively minor.
Interesting and potentially hazardous changes in this series are:
"paravirt/x86_64: move __PAGE_OFFSET to leave a space for hypervisor"
This moves __PAGE_OFFSET up to leave 16 PGD slots below it, from
0xffff810000000000 to 0xffff880000000000. I have no general
justification for this: the specific reason is that Xen claims the
first 16 kernel PGD slots for itself, and we must move up the mapping
to make room. In the process
I parameterised the compile-time construction of the initial
pagetables in head_64.S to cope with it.
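As a worked check of the two addresses quoted above, here is a minimal sketch. It assumes the usual x86-64 4-level layout, where each PGD slot covers 512 GB and the kernel half starts at 0xffff800000000000. The helper name kernel_slot_base() and the macro names are illustrative only; they are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT      39                            /* one PGD slot maps 2^39 bytes (512 GB) */
#define PGDIR_SIZE       (UINT64_C(1) << PGDIR_SHIFT)
#define KERNEL_HALF_BASE UINT64_C(0xffff800000000000)  /* lowest kernel-half address */
#define KERNEL_PGD_SLOTS 256u                          /* upper half of the 512 PGD entries */

/* Hypothetical helper: virtual address at which a kernel-half PGD slot begins. */
static uint64_t kernel_slot_base(unsigned int slot)
{
    assert(slot < KERNEL_PGD_SLOTS);
    return KERNEL_HALF_BASE + (uint64_t)slot * PGDIR_SIZE;
}
```

Under these assumptions the old __PAGE_OFFSET corresponds to slot 2 and the new one to slot 16, so slots 0 to 15 stay free for the hypervisor.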
"x86_64: adjust mapping of physical pagetables to work with Xen"
"x86_64: create small vmemmap mappings if PSE not available"
This rearranges the construction of the physical mapping so that it
works with Xen. This affects three aspects of the code:
1. It can't assume PSE is available (it isn't under Xen), so it will
only use PSE if the processor supports it.
2. It never replaces an existing mapping, so it can just extend the
early boot-provided mappings (either from head_64.S or the Xen domain
builder).
3. It makes sure that any page is iounmapped before attaching it to the
pagetable to avoid having writable aliases of pagetable pages.
The logical structure of the code is more or less unchanged, and still
works fine in the native case.
vmemmap mapping is likewise changed.
"x86_64: PSE no longer a hard requirement."
Because booting under Xen doesn't set PSE, it's no longer a hard
requirement for the kernel. PSE will be used wherever possible.
Overall diffstat:
arch/x86/Kconfig | 7 +
arch/x86/ia32/ia32entry.S | 37 +++--
arch/x86/kernel/aperture_64.c | 4
arch/x86/kernel/asm-offsets_32.c |...
--
This will significantly decrease the maximum amount of physical
Both sound like cases of "let's hack Linux to work around Xen problems"
-Andi
--
A bit, but not "significantly". We'd already discussed that if the
amount of physical memory starts approaching 2^48 then we'd hope that
the chips will grow some more virtual bits.
J
--
What does Linux expect to scale up to? Reserving 16 PML4 entries leaves
the kernel with 120TB of available 'negative' address space. Should be
plenty, I would think.
-- Keir
--
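The 120TB figure can be reproduced with a short calculation. This is a sketch assuming 4-level paging, with 256 PML4 entries in the kernel half and 512 GB per entry; kernel_space_tb() is a hypothetical helper, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define PML4_ENTRY_BYTES    (UINT64_C(1) << 39)  /* 512 GB per PML4 entry */
#define KERNEL_PML4_ENTRIES 256u                 /* upper half of the 512-entry PML4 */

/* Hypothetical helper: kernel-half address space, in TB, left above `reserved` entries. */
static unsigned int kernel_space_tb(unsigned int reserved)
{
    assert(reserved <= KERNEL_PML4_ENTRIES);
    uint64_t bytes = (uint64_t)(KERNEL_PML4_ENTRIES - reserved) * PML4_ENTRY_BYTES;
    return (unsigned int)(bytes >> 40);          /* 1 TB = 2^40 bytes */
}
```

With 16 entries reserved this gives 120 TB; with only the previous 2 entries below __PAGE_OFFSET it gives 127 TB, and the whole kernel half is 128 TB.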
There are already (ok, non-x86-64) systems shipping today with 10+TB of
addressable memory. 100+TB is not that far away with typical growth
rates. Besides, there has to be much more in the negative address space
than just the direct mapping.
So far we have always said that 64-bit Linux can support up to
1/4 * max VA of memory. With your change that formula would no longer
be true.
-Andi
--
There are obviously no x64 boxes around at the moment with >1TB of
regular shared memory, since no CPUs have more than 40 address lines.
100+TB RAM is surely years away.
If this is a blocker issue, we could just keep PAGE_OFFSET as it is when
Xen support is not configured into the kernel. Then those who are
concerned about 5% extra headroom at 100TB RAM sizes can configure their
kernel
Does the formula have any practical significance?
-- Keir
--
Yes, but why build something non-scalable now that you have to fix in a
few years? Especially when it comes with "i have no justification" in
Yes, because getting more than 48 bits of VA will be extremely costly in
terms of infrastructure, and assuming continuing growth rates and very
large machines, 46 bits is not all that much.
-Andi
--
This reduces native kernel max memory support from around 127 TB to
around 120 TB. We also limit the Xen hypervisor to ~7 TB of physical
memory - is that wise in the long run? Sure, current CPUs support 40
physical bits [1 TB] for now so it's all theoretical at this moment.
my guess is that CPU makers will first extend the physical lines all the
way up to 46-47 bits before they are willing to touch the logical model
and extend the virtual space beyond 48 bits (47 bits of that available
to kernel-space in practice - i.e. 128 TB).
So eventually, in a few years, we'll feel some sort of crunch when the
# of physical lines approaches the # of logical bits - just like when
That should be fine too - and probably useful for 64-bit kmemcheck
support as well.
To further increase the symmetry between 64-bit and 32-bit, could you
please also activate the mem=nopentium switch on 64-bit to allow the
forcing of a non-PSE native 64-bit bootup? (Obviously not a good idea
normally, as it wastes 0.1% of RAM and increases PTE related CPU cache
footprint and TLB overhead, but it is useful for debugging.)
a few other risk areas:
- the vmalloc-sync changes. Are you absolutely sure that it does not
  matter for performance?
- "The 32-bit early_ioremap will work equally well for 64-bit, so just
  use it." Famous last words ;-)
Anyway, that's all theory - i'll try out your patchset in -tip to see
what breaks in practice ;-)
Ingo
--
i've put the commits (and a good number of dependent commits) into the
new tip/x86/xen-64bit topic branch. It quickly broke the build in
testing:
include/asm/pgalloc.h: In function ‘paravirt_pgd_free’:
include/asm/pgalloc.h:14: error: parameter name omitted
arch/x86/kernel/entry_64.S: In file included from arch/x86/kernel/traps_64.c:51:
include/asm/pgalloc.h: In function ‘paravirt_pgd_free’:
include/asm/pgalloc.h:14: error: parameter name omitted
[...]
with this config:
http://redhat.com/~mingo/misc/config-Wed_Jun_25_16_37_51_CEST_2008.bad
this could easily be some integration mistake on my part, so please
double-check the end result.
Merging it into tip/master is a bit tricky, due to various interactions.
This should work fine if you check out the latest tip/master:
git-merge tip/x86/xen-64bit
[ ... fix up the trivial merge conflict ... ]
i've already merged tip/x86/xen-64bit-base topic into master, to make it
easier. (there were a few preconditions for the 64-bit Xen patches which
aren't carried in linux-next - such as the nmi-safe changes.)
Ingo
--
No, looks like my fault. The non-PARAVIRT version of
paravirt_pgd_free() is:
static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *) {}
but C doesn't like missing parameter names, even if unused.
This should fix it:
diff -r 19b73cc5fdf4 include/asm-x86/pgalloc.h
--- a/include/asm-x86/pgalloc.h Wed Jun 25 11:24:41 2008 -0400
+++ b/include/asm-x86/pgalloc.h Wed Jun 25 13:11:56 2008 -0700
@@ -11,7 +11,7 @@
#include <asm/paravirt.h>
#else
#define paravirt_pgd_alloc(mm) __paravirt_pgd_alloc(mm)
-static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *) {}
+static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *pgd) {}
static inline void paravirt_alloc_pte(struct mm_struct *mm, unsigned long pfn) {}
static inline void paravirt_alloc_pmd(struct mm_struct *mm, unsigned long pfn) {}
static inline void paravirt_alloc_pmd_clone(unsigned long pfn, unsigned long clonepfn,
--
that fixed the build but now we've got a boot crash with this config:
time.c: Detected 2010.304 MHz processor.
spurious 8259A interrupt: IRQ7.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<0000000000000000>]
PGD 0
Thread overran stack, or stack corrupted
Oops: 0010 [1] SMP
CPU 0
with:
http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
i've pushed out the current tip/xen-64bit branch, so that you can see
how things look at the moment, but i cannot put it into tip/master yet.
Ingo
--
I don't know if this will fix this bug, but it's definitely a bugfix.
It was trashing random pages by overwriting them with pagetables...
Subject: x86_64: memory mapping: don't trash large pmd mapping
Don't trash a large pmd's data when mapping physical memory.
This is a bugfix for "x86_64: adjust mapping of physical pagetables
to work with Xen".
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/mm/init_64.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
===================================================================
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -311,7 +311,8 @@
}
if (pmd_val(*pmd)) {
- phys_pte_update(pmd, address, end);
+ if (!pmd_large(*pmd))
+ phys_pte_update(pmd, address, end);
continue;
}
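To illustrate why the extra pmd_large() test matters, here is a toy model of the decision made for each pmd entry. It is only a sketch: the flag values are the architectural x86 present and page-size bits, while entry_is_large() and count_pte_updates() are hypothetical stand-ins rather than the kernel's helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRY_PRESENT UINT64_C(0x001)   /* _PAGE_PRESENT */
#define ENTRY_LARGE   UINT64_C(0x080)   /* _PAGE_PSE: the entry maps 2 MB directly */

/* Toy stand-in for pmd_large(): present and marked as a large page. */
static int entry_is_large(uint64_t pmd)
{
    return (pmd & (ENTRY_PRESENT | ENTRY_LARGE)) == (ENTRY_PRESENT | ENTRY_LARGE);
}

/*
 * Hypothetical walker: counts the entries whose pte page would be updated.
 * Empty entries go to the populate path instead. Large entries are skipped,
 * because their address bits name a 2 MB data page rather than a pte page,
 * and writing ptes there would overwrite ordinary memory.
 */
static size_t count_pte_updates(const uint64_t *pmds, size_t n)
{
    size_t updates = 0;

    assert(pmds != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (pmds[i] == 0)
            continue;               /* unmapped: handled elsewhere */
        if (!entry_is_large(pmds[i]))
            updates++;              /* small mapping: safe to descend */
    }
    return updates;
}
```

Without the entry_is_large() test every non-empty entry would be descended into, which matches the "trashing random pages" symptom described above.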
--
What stage during boot? I'm seeing an initrd problem, but that's
relatively late.
J
--
Blerg, a contextless NULL rip. Have you done any bisection on it?
Could you try again with the same config, but with
"CONFIG_PARAVIRT_DEBUG" enabled as well? That will BUG if it turns out
to be trying to call a NULL paravirt-op.
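The idea behind that debug option can be sketched in plain C. This is an illustration only: struct toy_pv_ops, TOY_PVOP_OK() and the toy_* functions are hypothetical, and assert() stands in for the kernel's BUG().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a paravirt ops table: one function pointer per hook. */
struct toy_pv_ops {
    unsigned long (*store_tr)(void);
    void (*load_gs_index)(unsigned int selector);
};

/* Toy "native" implementation of one hook. */
static unsigned long toy_native_store_tr(void)
{
    return 0;
}

/* Nonzero if the hook has been filled in and is safe to call. */
#define TOY_PVOP_OK(ops, hook) ((ops)->hook != NULL)

/*
 * Checked call: with the check in place, a missing hook is reported at the
 * call site instead of becoming a jump to address 0 with no context.
 */
static unsigned long toy_store_tr(const struct toy_pv_ops *ops)
{
    assert(TOY_PVOP_OK(ops, store_tr));   /* stand-in for the debug BUG */
    return ops->store_tr();
}
```

A table with an unset hook, such as load_gs_index left NULL on a path that later calls it, is the kind of mistake this check is meant to catch early.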
Yeah, I was expecting things to break somewhere with this lot :/
Could you add this patch? I don't think it will help this case, but
it's a bugfix.
J
Subject: x86_64: use SWAPGS_UNSAFE_STACK in ia32entry.S
Use SWAPGS_UNSAFE_STACK in ia32entry.S in the places where the active
stack is the usermode stack.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/ia32/ia32entry.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
===================================================================
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -98,7 +98,7 @@
CFI_SIGNAL_FRAME
CFI_DEF_CFA rsp,0
CFI_REGISTER rsp,rbp
- SWAPGS
+ SWAPGS_UNSAFE_STACK
movq %gs:pda_kernelstack, %rsp
addq $(PDA_STACKOFFSET),%rsp
/*
@@ -210,7 +210,7 @@
CFI_DEF_CFA rsp,PDA_STACKOFFSET
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
- SWAPGS
+ SWAPGS_UNSAFE_STACK
movl %esp,%r8d
CFI_REGISTER rsp,r8
movq %gs:pda_kernelstack,%rsp
--
plus -tip auto-testing found another build failure with:
http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:1201: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1205: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1209: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1213: Error: invalid character '_' in mnemonic
Ingo
--
I'm confused. How did this config both crash and not build?
J
--
this problem still reproduces. i've pushed out all fixes into
tip/x86/xen-64bit. That branch combined with the config above still
reproduces the build failure above.
Ingo
--
Subject: x86_64: fix non-paravirt compilation
Make sure SWAPGS and PARAVIRT_ADJUST_EXCEPTION_FRAME are properly
defined when CONFIG_PARAVIRT is off.
Fixes Ingo's build failure:
arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:1201: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1205: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1209: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1213: Error: invalid character '_' in mnemonic
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
include/asm-x86/irqflags.h | 22 +++++++++++++---------
include/asm-x86/processor.h | 3 ---
2 files changed, 13 insertions(+), 12 deletions(-)
===================================================================
--- a/include/asm-x86/irqflags.h
+++ b/include/asm-x86/irqflags.h
@@ -167,7 +167,20 @@
#define INTERRUPT_RETURN_NMI_SAFE NATIVE_INTERRUPT_RETURN_NMI_SAFE
#ifdef CONFIG_X86_64
+#define SWAPGS swapgs
+/*
+ * Currently paravirt can't handle swapgs nicely when we
+ * don't have a stack we can rely on (such as a user space
+ * stack). So we either find a way around these or just fault
+ * and emulate if a guest tries to call swapgs directly.
+ *
+ * Either way, this is a good way to document that we don't
+ * have a reliable stack. x86_64 only.
+ */
#define SWAPGS_UNSAFE_STACK swapgs
+
+#define PARAVIRT_ADJUST_EXCEPTION_FRAME /* */
+
#define INTERRUPT_RETURN iretq
#define USERGS_SYSRET64 \
swapgs; \
@@ -233,15 +246,6 @@
#else
#ifdef CONFIG_X86_64
-/*
- * Currently paravirt can't handle swapgs nicely when we
- * don't have a stack we can rely on (such as a user space
- * stack). So we either find a way around these or just fault
- * and emulate if a guest tries to call swapgs directly.
- *
- * Either way, this is a good way to document that we don't
- * have a reliable stack. x86_64 only.
- */
#define ARCH_LOCKDEP_SYS_EXIT call lock...
i've put tip/x86/xen-64bit into tip/master briefly and it quickly
triggered this crash on 64-bit x86:
Linux version 2.6.26-rc8-tip-00241-gc6c8cb2-dirty (mingo@dione) (gcc version 4.2.3) #12303 SMP Sun Jun 29 10:30:01 CEST 2008
Command line: root=/dev/sda6 console=ttyS0,115200 earlyprintk=serial,ttyS0,115200 debug initcall_debug apic=verbose sysrq_always_enabled ignore_loglevel selinux=0
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
BIOS-e820: 000000003fff0000 - 000000003fff3000 (ACPI NVS)
BIOS-e820: 000000003fff3000 - 0000000040000000 (ACPI data)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
KERNEL supported cpus:
Intel GenuineIntel
AMD AuthenticAMD
Centaur CentaurHauls
console [earlyser0] enabled
debug: ignoring loglevel setting.
Entering add_active_range(0, 0x0, 0x9f) 0 entries of 25600 used
Entering add_active_range(0, 0x100, 0x3fff0) 1 entries of 25600 used
last_pfn = 0x3fff0 max_arch_pfn = 0x3ffffffff
init_memory_mapping
kernel direct mapping tables up to 3fff0000 @ 8000-a000
PANIC: early exception 0e rip 10:ffffffff804b24e2 error 0 cr2 ffffffffff300000
Pid: 0, comm: swapper Not tainted 2.6.26-rc8-tip-00241-gc6c8cb2-dirty #12303
Call Trace:
[<ffffffff80efe196>] early_idt_handler+0x56/0x6a
[<ffffffff804b24e2>] ? __memcpy_fromio+0x12/0x30
[<ffffffff804b24d9>] ? __memcpy_fromio+0x9/0x30
[<ffffffff80f32f27>] dmi_scan_machine+0x57/0x1b0
[<ffffffff80f02c15>] setup_arch+0x3f5/0x5e0
[<ffffffff80efedd5>] start_kernel+0x75/0x350
[<ffffffff80efe289>] x86_64_start_reservations+0x89/0xa0
[<ffffffff80efe397>] x86_64_start_kernel+0xf7/0x100
RIP 0x10
with this config:
[ message continues ]
Looks like the setup.c unification missed the early_ioremap init from
the early_ioremap unification.
Unconditionally call early_ioremap_init().
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
diff -r 5c26177fdf8c arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c Sun Jun 29 16:57:52 2008 -0700
+++ b/arch/x86/kernel/setup.c Sun Jun 29 19:57:00 2008 -0700
@@ -523,11 +523,12 @@
memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
pre_setup_arch_hook();
early_cpu_init();
- early_ioremap_init();
reserve_setup_data();
#else
printk(KERN_INFO "Command line: %s\n", boot_command_line);
#endif
+
+ early_ioremap_init();
ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
screen_info = boot_params.screen_info;
--
applied to tip/x86/unify-setup - thanks Jeremy. I've reactivated the
x86/xen-64bit branch and i'm testing it currently.
Ingo
--
-tip auto-testing found pagetable corruption (CPA self-test failure):
[ 32.956015] CPA self-test:
[ 32.958822] 4k 2048 large 508 gb 0 x 2556[ffff880000000000-ffff88003fe00000] miss 0
[ 32.964000] CPA ffff88001d54e000: bad pte 1d4000e3
[ 32.968000] CPA ffff88001d54e000: unexpected level 2
[ 32.972000] CPA ffff880022c5d000: bad pte 22c000e3
[ 32.976000] CPA ffff880022c5d000: unexpected level 2
[ 32.980000] CPA ffff8800200ce000: bad pte 200000e3
[ 32.984000] CPA ffff8800200ce000: unexpected level 2
[ 32.988000] CPA ffff8800210f0000: bad pte 210000e3
config and full log can be found at:
http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
http://redhat.com/~mingo/misc/log-Mon_Jun_30_11_11_51_CEST_2008.bad
i've pushed that tree out into tip/tmp.xen-64bit.Mon_Jun_30_11_11. The
only new item in that tree over a well-tested base is x86/xen-64bit, so
i've taken it out again.
Ingo
--
Phew. OK, I've worked this out. Short version is that it's a false
alarm, and there was no real failure here. Long version:
* I changed the code to create the physical mapping pagetables to
reuse any existing mapping rather than replace it. Specifically,
reusing a pud pointed to by the pgd caused this symptom to appear.
* The specific PUD being reused is the one created statically in
head_64.S, which creates an initial 1GB mapping.
* That mapping doesn't have _PAGE_GLOBAL set on it, due to the
inconsistency between __PAGE_* and PAGE_*.
* The CPA test attempts to clear _PAGE_GLOBAL, and then checks to
see that the resulting range is 1) shattered into 4k pages, and 2)
has no _PAGE_GLOBAL.
* However, since it didn't have _PAGE_GLOBAL on that range to start
with, change_page_attr_clear() had nothing to do, and didn't
bother shattering the range,
* resulting in the reported messages
The simple fix is to set _PAGE_GLOBAL in level2_ident_pgt.
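The chain of events in the list above can be modelled in a few lines. This is a toy sketch, not the CPA code: struct toy_mapping and toy_attr_clear() are hypothetical, and only the _PAGE_GLOBAL bit position (bit 8) is taken from the x86 page-table format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_GLOBAL UINT64_C(0x100)   /* _PAGE_GLOBAL is bit 8 on x86 */

struct toy_mapping {
    uint64_t flags;   /* low attribute bits of a large-page entry */
    int      split;   /* nonzero once the large page has been broken into 4k pages */
};

/* Hypothetical model of clearing attribute bits: a no-op if none of them are set. */
static void toy_attr_clear(struct toy_mapping *m, uint64_t mask)
{
    assert(m != NULL);
    if ((m->flags & mask) == 0)
        return;            /* nothing to change, so the large page stays intact */
    m->split = 1;          /* changing a single 4k page forces a split first */
    m->flags &= ~mask;
}
```

An entry with flags 0xe3, as in the "bad pte" lines of the log, has no global bit, so in this model the clear does nothing and the self-test then finds an unsplit large page where it expected 4k pages.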
An additional fix would be to make CPA testing more robust by using some other
pagetable bit (one of the unused available-to-software ones). This
would solve spurious CPA test warnings under Xen which uses _PAGE_GLOBAL
for its own purposes (ie, not under guest control).
Also, we should revisit the use of _PAGE_GLOBAL in asm-x86/pgtable.h,
and use it consistently, and drop MAKE_GLOBAL. The first time I
proposed it it caused breakages in the very early CPA code; with luck
that's all fixed now.
Anyway, the simple fix below. I'll put together RFC patches for the
other suggestions. I also split the originating patch into tiny, tiny
bisectable pieces.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/kernel/head_64.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
===================================================================
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -374,7 +37...
--
great - i've applied your fix and re-integrated x86/xen-64bit, it's cool! :)
Ingo
--
hm, -tip testing still triggers a 64-bit bootup crash:
[ 0.000000] init_memory_mapping
[ 0.000000] kernel direct mapping tables up to 3fff0000 @ 8000-a000
PANIC: early exception 0e rip 10:ffffffff80418f81 error 0 cr2 ffffffffff300000
[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.26-rc8-tip #13363
[ 0.000000]
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff807f088b>] ? init_memory_mapping+0x341/0x56b
[ 0.000000] [<ffffffff80dba19f>] early_idt_handler+0x5f/0x73
[ 0.000000] [<ffffffff80418f81>] ? __memcpy_fromio+0xd/0x1e
[ 0.000000] [<ffffffff80de238a>] dmi_scan_machine+0x41/0x19b
[ 0.000000] [<ffffffff80dbeba8>] setup_arch+0x46d/0x5d8
[ 0.000000] [<ffffffff802896a0>] ? kernel_text_unlock+0x10/0x12
[ 0.000000] [<ffffffff80263b86>] ? raw_notifier_chain_register+0x9/0xb
[ 0.000000] [<ffffffff80dba140>] ? early_idt_handler+0x0/0x73
[ 0.000000] [<ffffffff80dbac5a>] start_kernel+0xf4/0x3b3
[ 0.000000] [<ffffffff80dba140>] ? early_idt_handler+0x0/0x73
[ 0.000000] [<ffffffff80dba2a4>] x86_64_start_reservations+0xa9/0xad
[ 0.000000] [<ffffffff80dba3b8>] x86_64_start_kernel+0x110/0x11f
[ 0.000000]
http://redhat.com/~mingo/misc/crash.log-Tue_Jul__1_10_55_47_CEST_2008.bad
http://redhat.com/~mingo/misc/config-Tue_Jul__1_10_55_47_CEST_2008.bad
Excluding the x86/xen-64bit topic solves the problem. It triggered on
two 64-bit machines so it seems readily reproducible with that config.
i've pushed the failing tree out to tip/tmp.xen-64bit.Tue_Jul__1_10_55
Ingo
--
The patch to fix this is on tip/x86/unify-setup: "x86: setup_arch() &&
early_ioremap_init()". Logically that patch should probably be in the
xen64 branch, since it's only meaningful with the early_ioremap unification.
J
--
ah, indeed - it was missing from tip/master due to:
| commit ac998c259605741efcfbd215533b379970ba1d9f
| Author: Ingo Molnar <mingo@elte.hu>
| Date: Mon Jun 30 12:01:31 2008 +0200
|
| Revert "x86: setup_arch() && early_ioremap_init()"
|
| This reverts commit 181b3601a1a7d2ac3ace6b23cb3204450a4f9a27.
because that change needed the other changes from xen-64bit.
will retry tomorrow.
Ingo
--
ok, i've re-added x86/xen-64bit and it's looking good in testing so far.
Ingo
--
got
[ffffe20000000000-ffffe27fffffffff] PGD ->ffff88000128a000 on node 0
[ffffe20000000000-ffffe2003fffffff] PUD ->ffff88000128b000 on node 0
[ffffe20000000000-ffffe200003fffff] PMD -> [ffff880001400000-ffff8800017fffff] on node 0
[ffffe20000200000-ffffe200005fffff] PMD -> [ffff880001600000-ffff8800019fffff] on node 0
[ffffe20000400000-ffffe200007fffff] PMD -> [ffff880001800000-ffff880001bfffff] on node 0
[ffffe20000600000-ffffe200009fffff] PMD -> [ffff880001a00000-ffff880001dfffff] on node 0
[ffffe20000800000-ffffe20000bfffff] PMD -> [ffff880001c00000-ffff880001ffffff] on node 0
[ffffe20000a00000-ffffe20000dfffff] PMD -> [ffff880001e00000-ffff8800021fffff] on node 0
[ffffe20000c00000-ffffe20000ffffff] PMD -> [ffff880002000000-ffff8800023fffff] on node 0
[ffffe20000e00000-ffffe200011fffff] PMD -> [ffff880002200000-ffff8800025fffff] on node 0
[ffffe20001000000-ffffe200013fffff] PMD -> [ffff880002400000-ffff8800027fffff] on node 0
[ffffe20001200000-ffffe200015fffff] PMD -> [ffff880002600000-ffff8800029fffff] on node 0
[ffffe20001400000-ffffe200017fffff] PMD -> [ffff880002800000-ffff880002bfffff] on node 0
[ffffe20001600000-ffffe200019fffff] PMD -> [ffff880002a00000-ffff880002dfffff] on node 0
[ffffe20001800000-ffffe20001bfffff] PMD -> [ffff880002c00000-ffff880002ffffff] on node 0
[ffffe20001a00000-ffffe20001dfffff] PMD -> [ffff880002e00000-ffff8800031fffff] on node 0
[ffffe20001c00000-ffffe20001ffffff] PMD -> [ffff880003000000-ffff8800033fffff] on node 0
[ffffe20001e00000-ffffe200021fffff] PMD -> [ffff880003200000-ffff8800035fffff] on node 0
[ffffe20002000000-ffffe200023fffff] PMD -> [ffff880003400000-ffff8800037fffff] on node 0
[ffffe20002200000-ffffe200025fffff] PMD -> [ffff880003600000-ffff8800039fffff] on node 0
[ffffe20002400000-ffffe200027fffff] PMD -> [ffff880003800000-ffff880003bfffff] on node 0
[ffffe20002600000-ffffe200029fffff] PMD -> [ffff880003a00000-ffff880003dfffff] on n...
I haven't seen those messages before. Can you explain what they mean?
J
--
that is for SPARSEMEM virtual memmap...
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
YH
--
I modified the vmemmap code so it would create 4k mappings if PSE isn't
supported. Did I get it wrong? It should have no effect when PSE is
available (which is any time you're not running under Xen).
J
--
it could be that the address-continuity check for the printout in
vmemmap_populate() has some problem...
YH
--
you moved p_end = p + PMD_SIZE before...
if (p_end != p || node_start != node) {
YH
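The ordering problem pointed out here can be reproduced with a small model of the range-coalescing logic. This is a sketch under simplifying assumptions: it tracks only the backing address (the node comparison is left out), and count_reported_ranges() and its update_first switch are hypothetical, not the vmemmap code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PMD_SIZE (UINT64_C(1) << 21)   /* 2 MB */

/*
 * Hypothetical model of the printout logic: given the backing address of each
 * successive 2 MB block, count how many separate ranges would be reported.
 * update_first = 1 reproduces the assignment being moved above the check.
 */
static size_t count_reported_ranges(const uint64_t *blocks, size_t n, int update_first)
{
    uint64_t p_end = 0;
    size_t ranges = 0;

    assert(blocks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t p = blocks[i];

        if (update_first)
            p_end = p + TOY_PMD_SIZE;   /* too early: p_end can never equal p */
        if (p_end != p)
            ranges++;                   /* discontinuity: a new range starts */
        p_end = p + TOY_PMD_SIZE;       /* intended place: after the comparison */
    }
    return ranges;
}
```

With the assignment ahead of the comparison, every contiguous block starts a new range, which appears consistent with the log above, where each PMD gets its own line and each line spans 4 MB instead of the run being merged.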
--
Ingo, please put the attached patch after Jeremy's xen pv64 patches.
YH
--
applied, thanks.
Ingo
--
Great. I'm hoping this stuff will be OK for the next merge, so I'm
primed for fast turnaround bugfixes ;)
Also, I have the series of followup patches to actually implement 64-bit
Xen which have much less impact on the non-Xen parts of the tree. I'll
probably mail them out later today.
Thanks,
J
--
Looks like you lost the other patch to put the early_ioremap_init in the
right place...
J
--
This config doesn't have CONFIG_DEBUG_KERNEL enabled, let alone
CONFIG_CPA_DEBUG. I've noticed this seems to happen quite a lot:
there's a disconnect between the log file and the config which is
supposed to have built the kernel. Is there a bug in your test
infrastructure?
J
--
sometimes the kernel preceding the currently built one is the buggy
one. i have them saved away, so the right one should be:
http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_03_04_CEST_2008.bad
Ingo
--
That config doesn't build for me. When I put it in place and do "make
oldconfig" it still asks for lots of config options (which I just set to
default). But when I build it fails with:
CC arch/x86/kernel/asm-offsets.s
In file included from include2/asm/page.h:40,
from include2/asm/pda.h:8,
from include2/asm/current.h:19,
from include2/asm/processor.h:15,
from /home/jeremy/hg/xen/paravirt/linux/include/linux/prefetch.h:14,
from /home/jeremy/hg/xen/paravirt/linux/include/linux/list.h:6,
from /home/jeremy/hg/xen/paravirt/linux/include/linux/module.h:9,
from /home/jeremy/hg/xen/paravirt/linux/include/linux/crypto.h:21,
from /home/jeremy/hg/xen/paravirt/linux/arch/x86/kernel/asm-offsets_64.c:7,
from /home/jeremy/hg/xen/paravirt/linux/arch/x86/kernel/asm-offsets.c:4:
include2/asm/page_64.h:46:2: error: #error "CONFIG_PHYSICAL_START must be a multiple of 2MB"
make[3]: *** [arch/x86/kernel/asm-offsets.s] Error 1
I can fix that, of course, but it doesn't give me confidence I'm testing
what you are...
J
--
the problem there is that the 32-bit config has:
CONFIG_PHYSICAL_START=0x100000
which the 64-bit make oldconfig picked up, but that start address is not
valid on 64-bit.
Ingo
--
Er, we're talking about 64-bit here, aren't we? The log messages are
from a 64-bit kernel.
Well, it was the wrong config anyway, which I guess is the source of
this confusion.
(I thought ARCH= to select 32/64 was going away now that the config has
the bitsize config?)
J
--
yep, correct - but it has to be done carefully - until now people (and
tools) could assume that 'make oldconfig' just creates stuff for their
native host architecture. But i agree in principle.
Ingo
--
it could be wrong? do we need that for 64 bit?
YH
--
Yes. I unified the early_ioremap implementations by making 64-bit use
the 32-bit one.
J
--
i'm testing on multiple systems in parallel, each is running randconfig
kernels. One 64-bit system found a build bug, the other one found a boot
crash. This can happen if certain configs build fine (but crash), and
certain configs don't even build. Each system does a random walk of the
config space.
I've applied your two fixes and i'm re-testing.
Ingo
--
Yes, but the URL for both the crash and the build failure pointed to the
Thanks,
J
--
yeah, i guess so. Right now i only ran into the build failure so there's
hope :)
Here's a config that fails to build for sure:
http://redhat.com/~mingo/misc/config-Fri_Jun_27_17_54_32_CEST_2008.bad
note, on 32-bit there's a yet unfixed initrd corruption bug i've
bisected back to:
| 510be56adc4bb9fb229637a8e89268d987263614 is first bad commit
| commit 510be56adc4bb9fb229637a8e89268d987263614
| Author: Yinghai Lu <yhlu.kernel@gmail.com>
| Date: Tue Jun 24 04:10:47 2008 -0700
|
| x86: introduce init_memory_mapping for 32bit
so if you see something like that it's probably not a bug introduced by
your changes. (and maybe you'll see why the above commit is buggy, i
haven't figured it out yet)
Ingo
--
Well, on a non-PSE system find_early_table_space() will not allocate
enough memory for ptes. But I posted the fix for that, and it's likely
you're using PSE anyway. Nothing pops out from a quick re-read, but it
could easily be mis-reserving the ramdisk memory or something.
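To give a feel for how much extra table space a non-PSE boot needs, here is a rough sketch. It assumes 4k pages and 8-byte entries and only counts the pte level; pte_table_bytes() and div_round_up() are hypothetical helpers, not the logic of find_early_table_space().

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE  UINT64_C(4096)
#define TOY_ENTRY_SIZE UINT64_C(8)     /* bytes per page-table entry */

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
    assert(d != 0);
    return (n + d - 1) / d;
}

/* Hypothetical estimate: bytes of pte pages needed to map `bytes` of memory. */
static uint64_t pte_table_bytes(uint64_t bytes, int have_pse)
{
    if (have_pse)
        return 0;    /* 2 MB mappings end at the pmd level, so no pte pages */

    uint64_t ptes = div_round_up(bytes, TOY_PAGE_SIZE);
    uint64_t raw = ptes * TOY_ENTRY_SIZE;

    /* Table space is handed out in whole pages. */
    return div_round_up(raw, TOY_PAGE_SIZE) * TOY_PAGE_SIZE;
}
```

Under these assumptions, mapping 1 GB without PSE needs 2 MB of pte pages that a PSE boot does not, which is the kind of space an estimate sized for PSE would leave out.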
J
--
There's no inherent reason why Xen itself needs to be able to have all
memory mapped at once. 32-bit Xen doesn't and can survive quite
happily. It's certainly nice to be able to access anything directly,
but it's just a performance optimisation. In practice, the guest
generally has almost everything interesting mapped anyway, and Xen
maintains a recursive mapping of the pagetable to make its access to the
pagetable very efficient, so it's only when a hypercall is doing
something to an unmapped page that there's an issue.
The main limitation the hole-size imposes is the max size of the machine
to physical map. That uses 8bytes/page, and reserves 256GB of space for
it, meaning that the current limit is 2^47 bytes - but there's another
256GB of reserved and unused space next to it, so that could be easily
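The 2^47 figure follows directly from the numbers in that paragraph. A minimal sketch, assuming 4 KB pages; m2p_coverage() is a hypothetical helper used only to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of machine memory that a machine-to-physical
 * table of `table_bytes` can describe, at `entry_bytes` per page.
 */
static uint64_t m2p_coverage(uint64_t table_bytes, uint64_t entry_bytes,
                             uint64_t page_bytes)
{
    assert(entry_bytes != 0);
    return (table_bytes / entry_bytes) * page_bytes;
}
```

256 GB at 8 bytes per page gives 2^35 entries, and 2^35 pages of 4 KB each is 2^47 bytes. By the same arithmetic a 512 GB table would describe 2^48 bytes.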
OK. Though it might be an idea to add "nopse" and start deprecating
Oh, I didn't mean to include that one. I think it's probably safe (from
both the performance and correctness standpoints), but it's not necessary for
Yep, thanks,
J
--
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/kernel/entry_64.S | 4 ++--
arch/x86/kernel/paravirt.c | 3 +++
include/asm-x86/elf.h | 2 +-
include/asm-x86/paravirt.h | 10 ++++++++++
include/asm-x86/system.h | 3 ++-
5 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1080,7 +1080,7 @@
/* Reload gs selector with exception handling */
/* edi: new selector */
-ENTRY(load_gs_index)
+ENTRY(native_load_gs_index)
CFI_STARTPROC
pushf
CFI_ADJUST_CFA_OFFSET 8
@@ -1094,7 +1094,7 @@
CFI_ADJUST_CFA_OFFSET -8
ret
CFI_ENDPROC
-ENDPROC(load_gs_index)
+ENDPROC(native_load_gs_index)
.section __ex_table,"a"
.align 8
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -331,6 +331,9 @@
.store_idt = native_store_idt,
.store_tr = native_store_tr,
.load_tls = native_load_tls,
+#ifdef CONFIG_X86_64
+ .load_gs_index = native_load_gs_index,
+#endif
.write_ldt_entry = native_write_ldt_entry,
.write_gdt_entry = native_write_gdt_entry,
.write_idt_entry = native_write_idt_entry,
diff --git a/include/asm-x86/elf.h b/include/asm-x86/elf.h
--- a/include/asm-x86/elf.h
+++ b/include/asm-x86/elf.h
@@ -83,9 +83,9 @@
(((x)->e_machine == EM_386) || ((x)->e_machine == EM_486))
#include <asm/processor.h>
+#include <asm/system.h>
#ifdef CONFIG_X86_32
-#include <asm/system.h> /* for savesegment */
#include <asm/desc.h>
#define elf_check_arch(x) elf_check_arch_ia32(x)
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -115,6 +115,9 @@
void (*set_ld...
--
patch logistics detail: the signoff order suggests it's been authored by
Eduardo - but there's no From line to that effect - should i change it
accordingly?
Ingo
--
Yes, it's Eduardo's. Huh, I have the From line here; must have got
stripped off by my script...
J
--
From: Eduardo Habkost <ehabkost@redhat.com>
We will need to set a pte on l3_user_pgt. Extract set_pte_vaddr_pud()
from set_pte_vaddr(); it accepts the l3 page table as a parameter.
This change should be a no-op for existing code.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/mm/init_64.c | 31 ++++++++++++++++++++-----------
include/asm-x86/pgtable_64.h | 3 +++
2 files changed, 23 insertions(+), 11 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -149,22 +149,13 @@
}
void
-set_pte_vaddr(unsigned long vaddr, pte_t new_pte)
+set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte)
{
- pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
- pr_debug("set_pte_vaddr %lx to %lx\n", vaddr, native_pte_val(new_pte));
-
- pgd = pgd_offset_k(vaddr);
- if (pgd_none(*pgd)) {
- printk(KERN_ERR
- "PGD FIXMAP MISSING, it should be setup in head.S!\n");
- return;
- }
- pud = pud_offset(pgd, vaddr);
+ pud = pud_page + pud_index(vaddr);
if (pud_none(*pud)) {
pmd = (pmd_t *) spp_getpage();
pud_populate(&init_mm, pud, pmd);
@@ -195,6 +186,24 @@
* (PGE mappings get flushed as well)
*/
__flush_tlb_one(vaddr);
+}
+
+void
+set_pte_vaddr(unsigned long vaddr, pte_t pteval)
+{
+ pgd_t *pgd;
+ pud_t *pud_page;
+
+ pr_debug("set_pte_vaddr %lx to %lx\n", vaddr, native_pte_val(pteval));
+
+ pgd = pgd_offset_k(vaddr);
+ if (pgd_none(*pgd)) {
+ printk(KERN_ERR
+ "PGD FIXMAP MISSING, it should be setup in head.S!\n");
+ return;
+ }
+ pud_page = (pud_t*)pgd_page_vaddr(*pgd);
+ set_pte_vaddr_pud(pud_page, vaddr, pteval);
}
/*
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
...
--
64-bit Xen pushes a couple of extra words onto an exception frame.
Add a hook to deal with them.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/kernel/asm-offsets_64.c | 1 +
arch/x86/kernel/entry_64.S | 2 ++
arch/x86/kernel/paravirt.c | 3 +++
arch/x86/xen/enlighten.c | 3 +++
include/asm-x86/paravirt.h | 9 +++++++++
include/asm-x86/processor.h | 2 ++
6 files changed, 20 insertions(+)
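The shape of the hook can be sketched in plain C as an ops table whose default entry does nothing. This is a userspace toy, not the entry_64.S code: the toy_* names are hypothetical, and modelling the frame as an array with exactly two extra leading words is an assumption standing in for "a couple of extra words".

```c
#include <assert.h>
#include <stddef.h>

/* Toy exception frame: words[0] is whatever the entry code sees first. */
struct toy_frame {
    const unsigned long *words;
    size_t nwords;
};

struct toy_irq_ops {
    void (*adjust_exception_frame)(struct toy_frame *f);
};

/* Native case: the frame already has the expected layout. */
static void toy_nop(struct toy_frame *f)
{
    (void)f;
}

/* Hypervisor case (assumed layout): skip two extra leading words. */
static void toy_drop_two(struct toy_frame *f)
{
    assert(f->nwords >= 2);
    f->words += 2;
    f->nwords -= 2;
}

/* Default to the no-op, as the native ops table in the patch does. */
static struct toy_irq_ops toy_ops = {
    .adjust_exception_frame = toy_nop,
};

/* Entry path: run the hook first, then read the frame as usual. */
static unsigned long toy_entry(struct toy_frame *f)
{
    toy_ops.adjust_exception_frame(f);
    return f->words[0];
}
```

With the no-op installed the entry path is unchanged. Swapping in toy_drop_two lets the same entry code see the layout it expects.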
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -61,6 +61,7 @@
OFFSET(PARAVIRT_PATCH_pv_irq_ops, paravirt_patch_template, pv_irq_ops);
OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
+ OFFSET(PV_IRQ_adjust_exception_frame, pv_irq_ops, adjust_exception_frame);
OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
OFFSET(PV_CPU_usergs_sysret32, pv_cpu_ops, usergs_sysret32);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -864,6 +864,7 @@
*/
.macro zeroentry sym
INTR_FRAME
+ PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq $0 /* push error code/oldrax */
CFI_ADJUST_CFA_OFFSET 8
pushq %rax /* push real oldrax to the rdi slot */
@@ -876,6 +877,7 @@
.macro errorentry sym
XCPT_FRAME
+ PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq %rax
CFI_ADJUST_CFA_OFFSET 8
CFI_REL_OFFSET rax,0
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -298,6 +298,9 @@
.irq_enable = native_irq_enable,
.safe_halt = native_safe_halt,
.halt = native_halt,
+#ifdef CONFIG_X86_64
+ .adjust_exception_frame = paravirt_nop,
+#endif
};
struct pv_cpu_ops pv_cpu_ops = {
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/...
This removes a pile of buggy open-coded implementations of savesegment
and loadsegment.
(They are buggy because they don't have memory barriers to prevent
them from being reordered with respect to memory accesses.)
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/kernel/cpu/common_64.c | 3 ++-
arch/x86/kernel/process_64.c | 28 +++++++++++++++-------------
2 files changed, 17 insertions(+), 14 deletions(-)
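To make the barrier point concrete, here is a minimal userspace sketch of a segment read that carries a "memory" clobber. toy_savesegment and toy_snapshot_ds are hypothetical names modelled on the description above, not copied from the kernel headers. The non-x86 fallback exists only so the sketch compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
/*
 * The "memory" clobber tells the compiler that the asm may touch memory,
 * so it must not move loads or stores across this statement.
 */
#define toy_savesegment(seg, value) \
    __asm__ __volatile__("mov %%" #seg ",%0" : "=r" (value) : : "memory")
#else
/* Assumption for portability of the sketch only: no segment registers. */
#define toy_savesegment(seg, value) ((value) = 0)
#endif

/* Read %ds between two stores; the clobber keeps the stores in order. */
static unsigned int toy_snapshot_ds(unsigned long *shared)
{
    unsigned int sel;

    assert(shared != NULL);
    *shared = 1;              /* must be emitted before the asm */
    toy_savesegment(ds, sel);
    *shared = 2;              /* must be emitted after the asm */
    return sel;
}
```

Without the clobber the compiler would be free to merge or move the two stores around the asm, which is the reordering hazard the open-coded versions have.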
diff --git a/arch/x86/kernel/cpu/common_64.c b/arch/x86/kernel/cpu/common_64.c
--- a/arch/x86/kernel/cpu/common_64.c
+++ b/arch/x86/kernel/cpu/common_64.c
@@ -480,7 +480,8 @@
struct x8664_pda *pda = cpu_pda(cpu);
/* Setup up data that may be needed in __get_free_pages early */
- asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0));
+ loadsegment(fs, 0);
+ loadsegment(gs, 0);
/* Memory clobbers used to order PDA accessed */
mb();
wrmsrl(MSR_GS_BASE, pda);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -362,10 +362,10 @@
p->thread.fs = me->thread.fs;
p->thread.gs = me->thread.gs;
- asm("mov %%gs,%0" : "=m" (p->thread.gsindex));
- asm("mov %%fs,%0" : "=m" (p->thread.fsindex));
- asm("mov %%es,%0" : "=m" (p->thread.es));
- asm("mov %%ds,%0" : "=m" (p->thread.ds));
+ savesegment(gs, p->thread.gsindex);
+ savesegment(fs, p->thread.fsindex);
+ savesegment(es, p->thread.es);
+ savesegment(ds, p->thread.ds);
if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) {
p->thread.io_bitmap_ptr = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
@@ -404,7 +404,9 @@
void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
- asm volatile("movl %0, %%fs; movl %0, %%es; movl %0, %%ds" :: "r"(0));
+ loadsegment(fs, 0);
+ loadsegment(es, 0);
+ loadsegment(ds, 0);
load_gs_index(0);
regs->ip = new_ip;
regs->sp = new_sp;
@@ -591,11 +593,...
Because Xen doesn't support PSE mappings in guests, all code which
assumed the presence of PSE has been changed to fall back to smaller
mappings if necessary. As a result, PSE is optional rather than
required (though still used wherever possible).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
include/asm-x86/required-features.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/asm-x86/required-features.h b/include/asm-x86/required-features.h
--- a/include/asm-x86/required-features.h
+++ b/include/asm-x86/required-features.h
@@ -42,7 +42,7 @@
#endif
#ifdef CONFIG_X86_64
-#define NEED_PSE (1<<(X86_FEATURE_PSE & 31))
+#define NEED_PSE 0
#define NEED_MSR (1<<(X86_FEATURE_MSR & 31))
#define NEED_PGE (1<<(X86_FEATURE_PGE & 31))
#define NEED_FXSR (1<<(X86_FEATURE_FXSR & 31))
--
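The effect of dropping PSE from the required set can be sketched as a mask check. This is a toy, not the kernel's boot-time verification code: the toy_* names are hypothetical, and only the bit positions (PSE 3, MSR 5, PGE 13 in the first feature word) are taken from the CPUID layout.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions within the first CPUID feature word. */
#define TOY_FEATURE_PSE 3
#define TOY_FEATURE_MSR 5
#define TOY_FEATURE_PGE 13

static uint32_t toy_bit(unsigned feature)
{
    assert(feature < 32);
    return UINT32_C(1) << feature;
}

/* Build the required mask, with PSE either mandatory or optional. */
static uint32_t toy_required(int need_pse)
{
    uint32_t mask = toy_bit(TOY_FEATURE_MSR) | toy_bit(TOY_FEATURE_PGE);

    if (need_pse)
        mask |= toy_bit(TOY_FEATURE_PSE);
    return mask;
}

/* Any bit set in the result is a required feature the CPU lacks. */
static uint32_t toy_missing(uint32_t required, uint32_t present)
{
    return required & ~present;
}
```

A CPU that reports MSR and PGE but not PSE fails the check while PSE is in the mask, and passes once the PSE term is 0.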
