Re: [PATCH]Fix broken VMI in 2.6.27-rc..

Previous thread: cat /sys/devices/system/cpu/cpuidle/current_driver by Justin Mattock on Thursday, August 7, 2008 - 12:01 pm. (1 message)

Next thread: Re: Please fix x86 defconfig regression by Rafael J. Wysocki on Thursday, August 7, 2008 - 1:46 pm. (2 messages)
From: Alok Kataria
Date: Thursday, August 7, 2008 - 12:12 pm

Hi Linus,

The patch below fixes VMI for 2.6.27-rc..

There were these 2 commits which made their way in 2.6.27-rc1 which
broke VMI.

x86: move fix mapping page table range early
commit e7b3789524eecc96213dd69d6686efd429235051
Author: Yinghai Lu <yhlu.kernel@gmail.com>

x86: use acpi_numa_init to parse on 32-bit numa
commit 1c6e55032e24ff79668581a0f296c278ef7edd4e
Author: Yinghai Lu <yhlu.kernel@gmail.com>

VMI relies on relocating the fixmap area to make room for the
hypervisor. These 2 commits started accessing the fixmap area's and
using them before VMI got a chance to check if it wants to relocate the
fixmap area. Once VMI got to the point of relocating the fixmap area's
it resulted in BUG's.

The patch below moves the vmi_init call right after max_low_pfn
is initialized and before we touch the fixmap areas. Also added some
comment so that people know that VMI may relocate the fixmaps.

Please apply.

Thanks,
Alok

--

From: Alok N Kataria <akataria@vmware.com>

Move the vmi_init call right after max_low_pfn is initialized and before
we touch the fixmap areas. Also, document the fact that VMI may relocate
the fixmaps, so that the next programmer doesn't accidently break VMI.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Zachary Amsden <zach@vmware.com>

---

 arch/x86/kernel/setup.c |   19 +++++++++++--------
 1 files changed, 11 insertions(+), 8 deletions(-)


diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2d88858..132b8cd 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -742,6 +742,17 @@ void __init setup_arch(char **cmdline_p)
 	high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
 #endif
 
+#if defined(CONFIG_VMI) && defined(CONFIG_X86_32)
+	/*
+	 * Must be after max_low_pfn is determined, and before kernel
+	 * pagetables are setup.
+	 * Also if a VMI ROM is found we relocate the fixmap area to reserve
+	 * space for the hypervisor, make sure this is done before we ...
From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 2:20 pm

Could you describe this in more detail?  I am not super-happy about this 
solution if there is a better one, like simply locating the fixmap area 
out of the way to start with.

	-hpa
--

From: Zachary Amsden
Date: Thursday, August 7, 2008 - 2:27 pm

That can't be done until we know the size of the hole to relocate, which
isn't known until we probe in the first meg of memory to find the
associated ROM.  It used to be the case that other things that poked
around with platform specific memory and checking ROM areas lived around
here in setup.c, so it was a nice place to put it.  With all the
abstraction and combination and overifdeffing going on here, that might
no longer be the case.

We could move it earlier, but then we'd need another hook to call in
after max_low_pfn is known.

Or we could remove the dependency on max_low_pfn and just create a
liberal linear to physical mapping for lomem which spans all possible
low memory; then it doesn't matter so much where it is called.

Zach

--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 2:34 pm

Okay, you lost me about halfway through that... could you perhaps 
describe the problem from the beginning, exactly what you're trying to do?

	-hpa
--

From: Zachary Amsden
Date: Thursday, August 7, 2008 - 2:42 pm

A kernel compiled with VMI enabled may run on a non-VMI platform.  If
that is the case, the fixmap should not be relocated.  If however, a VMI
ROM is found, we need to hijack up to 64-MB of linear address space from
the top of memory down.  This means moving the fixmap down by the same
amount.

Right now the code is structured in such a way that it wants to know how
much physical memory there is, so it can register a mapping table for
mapping linear addresses in the lowmem area to physical addresses.  This
causes the code to depend on max_low_pfn being initialized, which
accounts for the current placement.

But it also must be called before anything that creates the fixmap,
because the same code which registers the linear address mapping also
reserves high memory above the fixmap.

My point is 1) these could be two separate calls, or 2) the lowmem
mapping table need not depend on max_low_pfn at all, it is safe to
create an extra large mapping which covers all possible lowmem instead
of the physical ram that is actually available.

Zach

--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 2:52 pm

I take it there are no alternatives other than putting this at the end 

Realistically speaking, any (virtual) machine which does *not* have a 
full complement of lowmem (i.e. less than 896 MB in the common case) 
will not suffer significatly from losing a few megabytes of address space.

	-hpa

--

From: Zachary Amsden
Date: Thursday, August 7, 2008 - 2:55 pm

Nope, it must be in an area allowing for segmentation protection, while
keeping the kernel on zero-based segments; that means only the end of

Yes, the reason to make the fixmap moveable is to allow as much address
space as possible for big memory (physical) machines.

--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 3:17 pm

That being said, the fixmap area being movable more than kind of defeats 
a major point of the fixmap area; the addresses in it are no longer fixed.

The only way I can see around that, though, is to move the 1:1 mapping 
base up by 2/4 MB (for PAE/no PAE, respectively) and put the fixmap area 
there.  Kind of sucks, but would be doable.

	-hpa
--

From: Linus Torvalds
Date: Thursday, August 7, 2008 - 3:38 pm

So if the address isn't fixed, you'll end up with an indirect pointer, and 
it would likely be much better to just use a fixed direct pointer that is 
not at the top.

And anything that is within the top 31 bits of the address space should 
generate the same good code, since the fixed offset is always going to be 
a 32-bit thing anyway. So moving the FIXMAP area down by 4MB sounds like a 
fine thing to do with no real downside, if it then means that we don't 
need to move the FIXMAP area at all.

Hmm? Am I missing something?

		Linus
--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 3:58 pm

Just moving it down by 4 MB doesn't help, since the VMI guys want as 
much as 64 MB, which is half the standard vmalloc area and hence too 
much address space lost.  We can't put it at the bottom of the vmalloc 
area, since that boundary is not fixed, either.

The one remaining fixed boundary in the machine is the kernel-userspace 
boundary.  Hence moving the 1:1 area up by one PDE unit and sticking the 
fixmap area in that region.

	-hpa
--

From: Linus Torvalds
Date: Thursday, August 7, 2008 - 4:08 pm

Yeah, ok. Since this is a 32-bit only issue, 64MB is actually a fair chunk 

Yeah, ok, but I'd be more nervous about the validation issues there. There 
might be a lot of code that assumes that TASK_SIZE is the start of the 1:1 
area. It does sound like a good approach, it just makes me worry about the 
test coverage.

		Linus
--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 4:12 pm

Indeed.  Unfortunately I don't see any other options.

	-hpa
--

From: Zachary Amsden
Date: Thursday, August 7, 2008 - 4:26 pm

Well, here's an idea from outer space.  The fixmap can't possibly be
used until it's got a backing page table and initial mappings installed.
One can imagine a world where references to the fixmap are left as
unresolved, and then those unresolved symbols are linked to the fixmap
area when it gets set up in the kernel page table.  Voilla!

The requisite foodling required to massage various gcci and lds into
compliance with this scheme, not to mention the required module loading
changes might be a bit of headache, and even then, I'm not sure that gcc
will be smart enough to allow all the required relocations to generate
optimal code.

But the upshot would be the potential for dynamic registration of fixmap
areas, yet still keeping direct pointers into the thing, and also
removing all the ifdefs from the fixmap definitions for the various
platform specific fixmap pages.  Just leave dangling references to some
fixed bad address (fixmap_hole) for things unused.  And even allow
kernel modules to register new fixmap types!

All it requires is a well thought out strategy for naming fixmap pages
and then two sprinkles of linker magic.  You could even randomize the
non-randomized VDSO location at boot-time.  Whee!

Zach

--

From: Jeremy Fitzhardinge
Date: Thursday, August 7, 2008 - 4:49 pm

fixmap.ko, except backwards?

That said, isn't this exactly what the immediate values stuff is 
supposed to be able to do?

    J
--

From: Jeremy Fitzhardinge
Date: Thursday, August 7, 2008 - 4:23 pm

It's always been movable with CONFIG_PARAVIRT enabled.  Xen needs to 
reserve a hole at the top of the address space too.

    J
--

From: Alok Kataria
Date: Friday, August 8, 2008 - 12:15 pm

Ok, since we are already past rc-2, I think we should fix the VMI
problem sooner than later.  Any approach that we eventually take to make
the fixmap's actually *fixed*, would be independent of this fix. 

Below is the patch which does away from the dependency of activating VMI
only after max_low_pfn is known. We move vmi_initialization very early
in the setup_arch code. 

Patch on top of current git. Please have a look and apply.
--
From: Alok N Kataria <akataria@vmware.com>

The lowmem mapping table created by VMI need not depend on max_low_pfn
at all.  Instead we now create an extra large mapping which covers all
possible lowmem instead of the physical ram that is actually available.

This allows the vmi initialization to be done before max_low_pfn could
be computed. We also move the vmi_init code very early in the boot process
so that nobody accidentally breaks the fixmap dependancy.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Acked-by: Zachary Amsden <zach@vmware.com>
---

 arch/x86/kernel/setup.c  |   16 ++++++++--------
 arch/x86/kernel/vmi_32.c |    3 ++-
 2 files changed, 10 insertions(+), 9 deletions(-)


diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2d88858..6e5823b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -604,6 +604,14 @@ void __init setup_arch(char **cmdline_p)
 	early_cpu_init();
 	early_ioremap_init();
 
+#if defined(CONFIG_VMI) && defined(CONFIG_X86_32)
+	/*
+	 * Must be before kernel pagetables are setup
+	 * or fixmap area is touched.
+	 */
+	vmi_init();
+#endif
+
 	ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
 	screen_info = boot_params.screen_info;
 	edid_info = boot_params.edid_info;
@@ -817,14 +825,6 @@ void __init setup_arch(char **cmdline_p)
 	kvmclock_init();
 #endif
 
-#if defined(CONFIG_VMI) && defined(CONFIG_X86_32)
-	/*
-	 * Must be after max_low_pfn is determined, and before kernel
-	 * pagetables are setup.
-	 */
-	vmi_init();
-#endif
-
 ...
From: H. Peter Anvin
Date: Friday, August 8, 2008 - 3:23 pm

Agreed.  Applied to tip:x86/urgent.  Thanks!

	-hpa
--

From: Jeremy Fitzhardinge
Date: Thursday, August 7, 2008 - 4:21 pm

Yes.  You could just call reserve_top_address() at a suitably early 
point to reserve the space.  Its a pvops API call which has been there 
since patch one or two of pvops.

It does exactly what the rest of this thread discusses, rendering it moot.

    J
--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 4:27 pm

It's not moot.

The fixmap area should never have been made movable.  It's utter 
braindamage.

Given the x86 architecture, it's inevitable that PV will want to reserve 
address space at the top of memory, and therefore the fixmap area needs 
to be moved out of that space.

	-hpa
--

From: Jeremy Fitzhardinge
Date: Thursday, August 7, 2008 - 4:46 pm

Shrug.  It's been like that for a couple of years now.  It was one of 
the very first paravirt-ops patches.  It wasn't controversial then, and 

OK.  But there's a few places where the code uses FIXADDR_TOP to mean 
"top of kernel address space", so we'd need to come up with a proper 
symbol for that.

    J
--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 4:51 pm

The Linux kernel was never a paragon of perfection - it was never meant 
to be.  Just because a bit of cruft went unnoticed into the kernel 

I suggest KERNEL_TOP.

	-hpa

--

From: Yinghai Lu
Date: Thursday, August 7, 2008 - 5:01 pm

why not reserving that in e820 table?

YH
--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 5:11 pm

Virtual space, not physical space.

	-hpa
--

From: Jeremy Fitzhardinge
Date: Thursday, August 7, 2008 - 5:10 pm

I don't really see what the issue is.

Fixmaps are primarily used for things that need to be mapped early 
before we can allocate address space dynamically.  They're predominantly 
used for boot-time init, and rarely on any performance-critical path.  
The only vaguely regular use a fixmap gets during runtime is poking at 
apics, and that's dominated by IO time, and kmap_atomic.  Statically, 
there's only 100 references in the kernel.  And it only affects 32-bit.

Having fixmaps at link-time fixed addresses would be nice, I suppose, 

Fine by me.  It would be easy to plug KERNEL_TOP/__KERNEL_TOP in now, 
and then fix up fixmap independently.

    J
--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 5:13 pm

Yes, and we should add a symbol for the bottom of the 1:1 area as well 
(to disambiguate it from TASK_SIZE).

	-hpa

--

From: Jeremy Fitzhardinge
Date: Thursday, August 7, 2008 - 5:23 pm

Well, that's already called "PAGE_OFFSET".  64-bit needs to be careful 
about the distinction anyway, because there's the sign extension hole 
between user and kernel space.  Xen squeezes itself in just above the 
hole, so I moved PAGE_OFFSET up accordingly.

    J
--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 5:29 pm

Ah, that explains that strange offset.  I was wondering about that.

	-hpa

--

From: Jeremy Fitzhardinge
Date: Thursday, August 7, 2008 - 11:10 pm

Thinking about it, the fixmap really has to be as high as possible.  If 
it were any lower, then it would either truncate the 1:1 mapping, or 
shadow some physical memory.

    J
--

From: H. Peter Anvin
Date: Friday, August 8, 2008 - 9:13 am

The proposal was to put it *before* the 1:1 mapping:

FIX_HOLE would start at TASK_SIZE
PAGE_OFFSET would shift to TASK_SIZE + PMD_PAGE_SIZE

All the remaining offsets remain as-is.

It's slightly less efficient than the current version, since we can't 
share the PDE page for the fixmap with the final vmalloc map, but not 
terribly so.

	-hpa
--

From: Zachary Amsden
Date: Thursday, August 7, 2008 - 6:14 pm

Using my interplanetary ideas, linking the fixmap at runtime would allow
optimal placement of the fixmap, any hypervisor areas, vmalloc, and
pkmap space.  That might allow one to increase the amount of lowmem
available for direct mapping depending on some platform variables such
as hypervisor reserved space, physical memory size, APIC present.. those
aren't known until boot time.  Actually, NCPUs is a good one, since we
require atomic kmap space dependent on NCPUs, which could be given back
to linear memory map.

That might actually be more worthwhile than link time fixed addresses.

Zach

--

From: H. Peter Anvin
Date: Thursday, August 7, 2008 - 6:19 pm

Yeah, but that's a huge project.

	-hpa
--

From: Zachary Amsden
Date: Thursday, August 7, 2008 - 6:28 pm

But it's awesome.

--

From: Alok Kataria
Date: Thursday, August 7, 2008 - 2:41 pm

Hi Peter,

The first commit,
x86: use acpi_numa_init to parse on 32-bit numa
commit 1c6e55032e24ff79668581a0f296c278ef7edd4e

Moves the call to dmi_scan_machine before the vmi_initialization is
done, dmi_scan_machine internally calls early_ioremap, which does
early_set_fixmap effectively making use of FIXMAP areas before VMI gets
a chance to relocate it.

Similarly, in the other commit,
x86: move fix mapping page table range early
commit e7b3789524eecc96213dd69d6686efd429235051

There is this new call to early_ioremap_page_table_range_init which is
done from init_memory_mapping, this uses FIXADDR_TOP to initialize the
page table range. 

Now if you look at vmi_init, we relocate the fixmap area by changing the
__FIXADDR_TOP  value. So this needs to happen before anybody starts

I won't say that i completely understand this statement , but IMO the
patch that i sent effectively does the same thing, we make sure that the
fixmap area is set to a final value before anybody else starts using
it. 

Thanks,

--

Previous thread: cat /sys/devices/system/cpu/cpuidle/current_driver by Justin Mattock on Thursday, August 7, 2008 - 12:01 pm. (1 message)

Next thread: Re: Please fix x86 defconfig regression by Rafael J. Wysocki on Thursday, August 7, 2008 - 1:46 pm. (2 messages)