Hi Linus, The patch below fixes VMI for 2.6.27-rc.. There were these 2 commits which made their way in 2.6.27-rc1 which broke VMI. x86: move fix mapping page table range early commit e7b3789524eecc96213dd69d6686efd429235051 Author: Yinghai Lu <yhlu.kernel@gmail.com> x86: use acpi_numa_init to parse on 32-bit numa commit 1c6e55032e24ff79668581a0f296c278ef7edd4e Author: Yinghai Lu <yhlu.kernel@gmail.com> VMI relies on relocating the fixmap area to make room for the hypervisor. These 2 commits started accessing the fixmap area's and using them before VMI got a chance to check if it wants to relocate the fixmap area. Once VMI got to the point of relocating the fixmap area's it resulted in BUG's. The patch below moves the vmi_init call right after max_low_pfn is initialized and before we touch the fixmap areas. Also added some comment so that people know that VMI may relocate the fixmaps. Please apply. Thanks, Alok -- From: Alok N Kataria <akataria@vmware.com> Move the vmi_init call right after max_low_pfn is initialized and before we touch the fixmap areas. Also, document the fact that VMI may relocate the fixmaps, so that the next programmer doesn't accidently break VMI. Signed-off-by: Alok N Kataria <akataria@vmware.com> Signed-off-by: Zachary Amsden <zach@vmware.com> --- arch/x86/kernel/setup.c | 19 +++++++++++-------- 1 files changed, 11 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 2d88858..132b8cd 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -742,6 +742,17 @@ void __init setup_arch(char **cmdline_p) high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1; #endif +#if defined(CONFIG_VMI) && defined(CONFIG_X86_32) + /* + * Must be after max_low_pfn is determined, and before kernel + * pagetables are setup. + * Also if a VMI ROM is found we relocate the fixmap area to reserve + * space for the hypervisor, make sure this is done before we ...
Could you describe this in more detail? I am not super-happy about this solution if there is a better one, like simply locating the fixmap area out of the way to start with. -hpa --
That can't be done until we know the size of the hole to relocate, which isn't known until we probe in the first meg of memory to find the associated ROM. It used to be the case that other things that poked around with platform specific memory and checking ROM areas lived around here in setup.c, so it was a nice place to put it. With all the abstraction and combination and overifdeffing going on here, that might no longer be the case. We could move it earlier, but then we'd need another hook to call in after max_low_pfn is known. Or we could remove the dependency on max_low_pfn and just create a liberal linear to physical mapping for lomem which spans all possible low memory; then it doesn't matter so much where it is called. Zach --
Okay, you lost me about halfway through that... could you perhaps describe the problem from the beginning, exactly what you're trying to do? -hpa --
A kernel compiled with VMI enabled may run on a non-VMI platform. If that is the case, the fixmap should not be relocated. If however, a VMI ROM is found, we need to hijack up to 64-MB of linear address space from the top of memory down. This means moving the fixmap down by the same amount. Right now the code is structured in such a way that it wants to know how much physical memory there is, so it can register a mapping table for mapping linear addresses in the lowmem area to physical addresses. This causes the code to depend on max_low_pfn being initialized, which accounts for the current placement. But it also must be called before anything that creates the fixmap, because the same code which registers the linear address mapping also reserves high memory above the fixmap. My point is 1) these could be two separate calls, or 2) the lowmem mapping table need not depend on max_low_pfn at all, it is safe to create an extra large mapping which covers all possible lowmem instead of the physical ram that is actually available. Zach --
I take it there are no alternatives other than putting this at the end Realistically speaking, any (virtual) machine which does *not* have a full complement of lowmem (i.e. less than 896 MB in the common case) will not suffer significatly from losing a few megabytes of address space. -hpa --
Nope, it must be in an area allowing for segmentation protection, while keeping the kernel on zero-based segments; that means only the end of Yes, the reason to make the fixmap moveable is to allow as much address space as possible for big memory (physical) machines. --
That being said, the fixmap area being movable more than kind of defeats a major point of the fixmap area; the addresses in it are no longer fixed. The only way I can see around that, though, is to move the 1:1 mapping base up by 2/4 MB (for PAE/no PAE, respectively) and put the fixmap area there. Kind of sucks, but would be doable. -hpa --
So if the address isn't fixed, you'll end up with an indirect pointer, and it would likely be much better to just use a fixed direct pointer that is not at the top. And anything that is within the top 31 bits of the address space should generate the same good code, since the fixed offset is always going to be a 32-bit thing anyway. So moving the FIXMAP area down by 4MB sounds like a fine thing to do with no real downside, if it then means that we don't need to move the FIXMAP area at all. Hmm? Am I missing something? Linus --
Just moving it down by 4 MB doesn't help, since the VMI guys want as much as 64 MB, which is half the standard vmalloc area and hence too much address space lost. We can't put it at the bottom of the vmalloc area, since that boundary is not fixed, either. The one remaining fixed boundary in the machine is the kernel-userspace boundary. Hence moving the 1:1 area up by one PDE unit and sticking the fixmap area in that region. -hpa --
Yeah, ok. Since this is a 32-bit only issue, 64MB is actually a fair chunk Yeah, ok, but I'd be more nervous about the validation issues there. There might be a lot of code that assumes that TASK_SIZE is the start of the 1:1 area. It does sound like a good approach, it just makes me worry about the test coverage. Linus --
Indeed. Unfortunately I don't see any other options. -hpa --
Well, here's an idea from outer space. The fixmap can't possibly be used until it's got a backing page table and initial mappings installed. One can imagine a world where references to the fixmap are left as unresolved, and then those unresolved symbols are linked to the fixmap area when it gets set up in the kernel page table. Voilla! The requisite foodling required to massage various gcci and lds into compliance with this scheme, not to mention the required module loading changes might be a bit of headache, and even then, I'm not sure that gcc will be smart enough to allow all the required relocations to generate optimal code. But the upshot would be the potential for dynamic registration of fixmap areas, yet still keeping direct pointers into the thing, and also removing all the ifdefs from the fixmap definitions for the various platform specific fixmap pages. Just leave dangling references to some fixed bad address (fixmap_hole) for things unused. And even allow kernel modules to register new fixmap types! All it requires is a well thought out strategy for naming fixmap pages and then two sprinkles of linker magic. You could even randomize the non-randomized VDSO location at boot-time. Whee! Zach --
fixmap.ko, except backwards?
That said, isn't this exactly what the immediate values stuff is
supposed to be able to do?
J
--
It's always been movable with CONFIG_PARAVIRT enabled. Xen needs to
reserve a hole at the top of the address space too.
J
--
Ok, since we are already past rc-2, I think we should fix the VMI problem sooner than later. Any approach that we eventually take to make the fixmap's actually *fixed*, would be independent of this fix. Below is the patch which does away from the dependency of activating VMI only after max_low_pfn is known. We move vmi_initialization very early in the setup_arch code. Patch on top of current git. Please have a look and apply. -- From: Alok N Kataria <akataria@vmware.com> The lowmem mapping table created by VMI need not depend on max_low_pfn at all. Instead we now create an extra large mapping which covers all possible lowmem instead of the physical ram that is actually available. This allows the vmi initialization to be done before max_low_pfn could be computed. We also move the vmi_init code very early in the boot process so that nobody accidentally breaks the fixmap dependancy. Signed-off-by: Alok N Kataria <akataria@vmware.com> Acked-by: Zachary Amsden <zach@vmware.com> --- arch/x86/kernel/setup.c | 16 ++++++++-------- arch/x86/kernel/vmi_32.c | 3 ++- 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 2d88858..6e5823b 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -604,6 +604,14 @@ void __init setup_arch(char **cmdline_p) early_cpu_init(); early_ioremap_init(); +#if defined(CONFIG_VMI) && defined(CONFIG_X86_32) + /* + * Must be before kernel pagetables are setup + * or fixmap area is touched. + */ + vmi_init(); +#endif + ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev); screen_info = boot_params.screen_info; edid_info = boot_params.edid_info; @@ -817,14 +825,6 @@ void __init setup_arch(char **cmdline_p) kvmclock_init(); #endif -#if defined(CONFIG_VMI) && defined(CONFIG_X86_32) - /* - * Must be after max_low_pfn is determined, and before kernel - * pagetables are setup. - */ - vmi_init(); -#endif - ...
Agreed. Applied to tip:x86/urgent. Thanks! -hpa --
Yes. You could just call reserve_top_address() at a suitably early
point to reserve the space. Its a pvops API call which has been there
since patch one or two of pvops.
It does exactly what the rest of this thread discusses, rendering it moot.
J
--
It's not moot. The fixmap area should never have been made movable. It's utter braindamage. Given the x86 architecture, it's inevitable that PV will want to reserve address space at the top of memory, and therefore the fixmap area needs to be moved out of that space. -hpa --
Shrug. It's been like that for a couple of years now. It was one of
the very first paravirt-ops patches. It wasn't controversial then, and
OK. But there's a few places where the code uses FIXADDR_TOP to mean
"top of kernel address space", so we'd need to come up with a proper
symbol for that.
J
--
The Linux kernel was never a paragon of perfection - it was never meant to be. Just because a bit of cruft went unnoticed into the kernel I suggest KERNEL_TOP. -hpa --
why not reserving that in e820 table? YH --
Virtual space, not physical space. -hpa --
I don't really see what the issue is.
Fixmaps are primarily used for things that need to be mapped early
before we can allocate address space dynamically. They're predominantly
used for boot-time init, and rarely on any performance-critical path.
The only vaguely regular use a fixmap gets during runtime is poking at
apics, and that's dominated by IO time, and kmap_atomic. Statically,
there's only 100 references in the kernel. And it only affects 32-bit.
Having fixmaps at link-time fixed addresses would be nice, I suppose,
Fine by me. It would be easy to plug KERNEL_TOP/__KERNEL_TOP in now,
and then fix up fixmap independently.
J
--
Yes, and we should add a symbol for the bottom of the 1:1 area as well (to disambiguate it from TASK_SIZE). -hpa --
Well, that's already called "PAGE_OFFSET". 64-bit needs to be careful
about the distinction anyway, because there's the sign extension hole
between user and kernel space. Xen squeezes itself in just above the
hole, so I moved PAGE_OFFSET up accordingly.
J
--
Ah, that explains that strange offset. I was wondering about that. -hpa --
Thinking about it, the fixmap really has to be as high as possible. If
it were any lower, then it would either truncate the 1:1 mapping, or
shadow some physical memory.
J
--
The proposal was to put it *before* the 1:1 mapping: FIX_HOLE would start at TASK_SIZE PAGE_OFFSET would shift to TASK_SIZE + PMD_PAGE_SIZE All the remaining offsets remain as-is. It's slightly less efficient than the current version, since we can't share the PDE page for the fixmap with the final vmalloc map, but not terribly so. -hpa --
Using my interplanetary ideas, linking the fixmap at runtime would allow optimal placement of the fixmap, any hypervisor areas, vmalloc, and pkmap space. That might allow one to increase the amount of lowmem available for direct mapping depending on some platform variables such as hypervisor reserved space, physical memory size, APIC present.. those aren't known until boot time. Actually, NCPUs is a good one, since we require atomic kmap space dependent on NCPUs, which could be given back to linear memory map. That might actually be more worthwhile than link time fixed addresses. Zach --
Yeah, but that's a huge project. -hpa --
Hi Peter, The first commit, x86: use acpi_numa_init to parse on 32-bit numa commit 1c6e55032e24ff79668581a0f296c278ef7edd4e Moves the call to dmi_scan_machine before the vmi_initialization is done, dmi_scan_machine internally calls early_ioremap, which does early_set_fixmap effectively making use of FIXMAP areas before VMI gets a chance to relocate it. Similarly, in the other commit, x86: move fix mapping page table range early commit e7b3789524eecc96213dd69d6686efd429235051 There is this new call to early_ioremap_page_table_range_init which is done from init_memory_mapping, this uses FIXADDR_TOP to initialize the page table range. Now if you look at vmi_init, we relocate the fixmap area by changing the __FIXADDR_TOP value. So this needs to happen before anybody starts I won't say that i completely understand this statement , but IMO the patch that i sent effectively does the same thing, we make sure that the fixmap area is set to a final value before anybody else starts using it. Thanks, --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List |
