Hi Andi,
My eyes fell on the following table in the boot messages:
early res: 0 [0-fff] BIOS data page
early res: 1 [6000-7fff] SMP_TRAMPOLINE
early res: 2 [200000-374557] TEXT DATA BSS
early res: 3 [9fc00-a0bff] EBDA
early res: 4 [8000-afff] PGTABLE
The memory reserved for the EBDA overflows into the area normally
reserved for the VGA adaptor. It seems that you wanted to force
the allocation to cover whole pages, like:
early res: 3 [9f000-9ffff] EBDA
This is what this patch implements.
Is it really necessary to force the allocation to a page boundary?
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
arch/x86/kernel/head64.c | 12 +++++++-----
1 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index ad24408..2c52543 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -53,7 +53,7 @@ static void __init copy_bootdata(char *real_mode_data)
static __init void reserve_ebda(void)
{
- unsigned ebda_addr, ebda_size;
+ unsigned int ebda_addr, ebda_size, ebda_end;
/*
* there is a real-mode segmented pointer pointing to the
@@ -70,12 +70,14 @@ static __init void reserve_ebda(void)
/* Round EBDA up to pages */
if (ebda_size == 0)
ebda_size = 1;
+ if (ebda_size > 64)
+ ebda_size = 64;
ebda_size <<= 10;
- ebda_size = round_up(ebda_size + (ebda_addr & ~PAGE_MASK), PAGE_SIZE);
- if (ebda_size > 64*1024)
- ebda_size = 64*1024;
+
+ ebda_end = ALIGN(ebda_addr + ebda_size, PAGE_SIZE);
+ ebda_addr = ALIGN(ebda_addr - PAGE_SIZE, PAGE_SIZE);
- reserve_early(ebda_addr, ebda_addr + ebda_size, "EBDA");
+ reserve_early(ebda_addr, ebda_end, "EBDA");
}
void __init x86_64_start_kernel(char * real_mode_data)
--
well, that's what your EBDA descriptor says - it's set to 9fc00 which is 512 bytes below the VGA range. This behavior didnt really change over v2.6.24 (which reserved 'into' the VGA range too), it's just that in v2.6.25 we also print out these early reservations. Can you see any regression? There should be no harm from overlapping into the VGA range - these "reservations" only make RAM unavailable for normal allocations. your patch on the other hand rounds the EBDA area down which could in theory be unsafe on other boxes (where there could be real RAM above the EBDA area): the safest approach is to round the beginning of it down, the end of it up (to page boundary). Your patch _should_ be OK, but in practice it doesnt hurt to reserve a bit more around the edges than to accidentally give a page to the OS that the BIOS might rely upon. Ingo --
It's 1024 bytes below, but yes, the EBDA starts there. Then the first
two
bytes of the EBDA contain the value 0x0001, which means that its size is
Correct. I thought it was new code, but looking more closely, the
behaviour
has indeed not changed recently (note to self: git log -p somefile.c
does
Not really: ebda_addr is just a local variable. If the system needs to
find the start of the EBDA, it will just have to look at the 16 bit
value
The patch is exactly trying to do that. The code that was there seemed
to
imply that the author wanted to allocate whole pages, in such a way that
the allocation contained the whole EBDA. I think that is what it does
after this patch.
Greetings,
Alexander
--
Alexander van Heukelum
heukelum@fastmail.fm
--
http://www.fastmail.fm - And now for something completely different
--
In theory not, in practice it works around some problems in early allocations where the other users assume page alignment. At some point it needs to be cleaned up properly. -Andi --
It is, but that rounding gets done in reserve_bootmem() anyway, so there is no need for the arch-specific code to do it. The 32-bit EBDA code hard-codes a size of 4K, which is probably equally wrong; my gut feel is that the right thing to do is to reserve from the EBDA up to the 640K mark (some BIOSes use an area like that for SMM stuff), possibly with some sanity checking. -hpa --
On Sun, 24 Feb 2008 18:18:16 -0800, "H. Peter Anvin" <hpa@zytor.com>
Then, how about reserving everything from the end of conventional memory
up to the 1Mb mark? Like this:
/*
* The BIOS places the EBDA/XBDA at the top of conventional
* memory, and usually decreases the reported amount of
* conventional memory (int 0x12) too.
*/
static __init void reserve_ebda(void)
{
unsigned int lowmem, ebda_addr;
/* end of low (conventional) memory */
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
/* start of EBDA area */
ebda_addr = *(unsigned short *)__va(BIOS_EBDA_SEGMENT);
ebda_addr <<= 4;
/* Fixup: bios puts an EBDA in the top 64K segment */
/* of conventional memory, but does not adjust lowmem. */
if ((lowmem - ebda_addr) <= 0x10000)
lowmem = ebda_addr;
/* Fixup: bios does not report an EBDA at all. */
/* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
if ((ebda_addr == 0) && (lowmem >= 0x9f000))
lowmem = 0x9f000;
/* Paranoia: should never happen, but... */
if (lowmem >= 0x100000)
lowmem = 0xa0000;
/* reserve all memory between lowmem and the 1MB mark */
reserve_early(lowmem, 0x100000, "BIOS reserved");
}
Greetings,
Alexander
--
Alexander van Heukelum
heukelum@fastmail.fm
--
http://www.fastmail.fm - Choose from over 50 domains or use your own
--
looks very good to me - could you send a patch against x86.git#testing? Ingo --
Explicitly reserve_early the whole address range from the end of
conventional memory as reported by the bios data area up to the
1Mb mark. Regard the info retrieved from the BIOS data area with
a bit of paranoia, though, because some biosses forget to register
the EBDA correctly.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
--
I think this patch is against -x86.git#testing :-/.
Greetings,
Alexander
arch/x86/kernel/head64.c | 45 +++++++++++++++++++++++++++------------------
1 files changed, 27 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 38f32e7..b684552 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -49,33 +49,42 @@ static void __init copy_bootdata(char *real_mode_data)
}
}
-#define EBDA_ADDR_POINTER 0x40E
+#define BIOS_EBDA_SEGMENT 0x40E
+#define BIOS_LOWMEM_KILOBYTES 0x413
+/*
+ * The BIOS places the EBDA/XBDA at the top of conventional
+ * memory, and usually decreases the reported amount of
+ * conventional memory (int 0x12) too.
+ */
static __init void reserve_ebda(void)
{
- unsigned ebda_addr, ebda_size;
+ unsigned int lowmem, ebda_addr;
- /*
- * there is a real-mode segmented pointer pointing to the
- * 4K EBDA area at 0x40E
- */
- ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);
+ /* end of low (conventional) memory */
+ lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
+ lowmem <<= 10;
+
+ /* start of EBDA area */
+ ebda_addr = *(unsigned short *)__va(BIOS_EBDA_SEGMENT);
ebda_addr <<= 4;
- if (!ebda_addr)
- return;
+ /* Fixup: bios puts an EBDA in the top 64K segment */
+ /* of conventional memory, but does not adjust lowmem. */
+ if ((lowmem - ebda_addr) <= 0x10000)
+ lowmem = ebda_addr;
- ebda_size = *(unsigned short *)__va(ebda_addr);
+ /* Fixup: bios does not report an EBDA at all. */
+ /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
+ if ((ebda_addr == 0) && (lowmem >= ...The 32-bit code still uses reserve_bootmem, so this is not really
a unification with the 64-bit version of the ebda reservation code,
but at least it provides the same detection logic and reserves the
same areas.
This does not crash immediately on qemu. No further testing was
done! Otherwise:
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---
Hello Ian, Ingo,
This patch should do the same reservations as the reserve_early
patch for 64 bit. Maybe you could try if this solves the problems
you are seeing with Xen?
On the other hand, Xen should be able to use those legacy ranges
as normal memory. Why doesn't it work? Does Xen use a special
bootloader? In that case it could just add the reservation to
the e820 memory map that is provided.
Another possible approach to the whole problem is just to ammend
the e820 memory map in the boot-code. Thanks to hpa, it is now easy
to change this code. Then the kernel can just trust the e820 memory
map, and only 'sophisticated' bootloaders need to do something
similar by themselves.
Greetings,
Alexander
diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
index a1d7071..e12cc31 100644
--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -385,15 +385,47 @@ unsigned long __init find_max_low_pfn(void)
return max_low_pfn;
}
+#define BIOS_EBDA_SEGMENT 0x40E
+#define BIOS_LOWMEM_KILOBYTES 0x413
+
/*
- * workaround for Dell systems that neglect to reserve EBDA
+ * The BIOS places the EBDA/XBDA at the top of conventional
+ * memory, and usually decreases the reported amount of
+ * conventional memory (int 0x12) too. This also contains a
+ * workaround for Dell systems that neglect to reserve EBDA.
+ * The same workaround also avoids a problem with the AMD768MPX
+ * chipset: reserve a page before VGA to prevent PCI prefetch
+ * into it (errata #56). Usually the page is reserved anyways,
+ * unless you have no PS/2 mouse plugged in.
*/
static void __init ...I haven't tested extensively either but it does seem to solve the problem for Xen. Thanks! -- Ian Campbell In spite of everything, I still believe that people are good at heart. -- Anne Frank --
This patch adds explicit detection of the EBDA and reservation
of the rom and adapter address space 0xa0000-0x100000 to the
i386 kernels. It uses reserve_bootmem instead of reserve_early,
because reserve_early is not yet available on i386.
Before this patch, the EBDA size was hardcoded as 4Kb. Also, the
reservation of the adapter range was done by modifying the e820
map which is now not necessary any longer, and the code is removed
from copy_e820_map.
The changes in e820_64.c are only a change in the comment above
copy_e820_map, and some changes of the types of local variables
in that function such that the 32 and 64 bit versions become equal.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---
Thank you!
Ingo,
I think this is ready for -x86#testing.
It boots to a small userspace in qemu (i386).
If I should separate the cleanups, let me know.
Greetings,
Alexander
arch/x86/kernel/e820_32.c | 30 +++++++-------------------
arch/x86/kernel/e820_64.c | 15 +++++++++---
arch/x86/kernel/setup_32.c | 51 ++++++++++++++++++++++++++++++++-----------
3 files changed, 57 insertions(+), 39 deletions(-)
diff --git a/arch/x86/kernel/e820_32.c b/arch/x86/kernel/e820_32.c
index 4e16ef4..8ea7db2 100644
--- a/arch/x86/kernel/e820_32.c
+++ b/arch/x86/kernel/e820_32.c
@@ -444,44 +444,30 @@ int __init sanitize_e820_map(struct e820entry * biosmap, char * pnr_map)
* set up memory. If we aren't, we'll fake a memory map.
*
* We check to see that the memory map contains at least 2 elements
- * before we'll use it, because the detection code in setup.S may
- * not be perfect and most every PC known to man has two memory
+ * before we'll use it, because the detection code in boot/memory.c
+ * may not be perfect and every PC known to man has two memory
* regions: one from 0 to 640k, and one from 1mb up. (The IBM
* thinkpad 560x, for example, does not cooperate with the memory
* detection code.)
*/
-int __init copy_e820_map(struct ...I haven't investigated in any detail, but with 2.6.25-rc3 and your
patch I'm seeing a Xen guest hit this BUG:
void __init smp_alloc_memory(void)
{
trampoline_base = alloc_bootmem_low_pages(PAGE_SIZE);
/*
* Has to be in very low memory so we can execute
* real-mode AP code.
*/
if (__pa(trampoline_base) >= 0x9F000)
BUG();
}
Stack looks like:
[<c137ef97>] smp_alloc_memory+0x25 <--
[<c137ef97>] smp_alloc_memory+0x25
[<c137a500>] setup_arch+0x28e
[<c13735f7>] start_kernel+0x7a
[<c1379240>] xen_start_kernel+0x300
Cheers,
Mark.
--
On Fri, 29 Feb 2008 17:14:07 +0000, "Mark McLoughlin"
Ouch.
My first guess is that the BIOS data area is completely non-existent for
Xen.
Is it guaranteed that the memory is zeroed out on boot? In that case we
can
special-case it easily:
change:
/* Paranoia: should never happen, but... */
if (lowmem >= 0x100000)
lowmem = 0xa0000;
into:
/* Strange case, like Xen ;) */
if (lowmem == 0 || lowmem >= 0x100000)
lowmem = 0x9f000;
Can you test that?
Greetings,
Alexander
--
Alexander van Heukelum
heukelum@fastmail.fm
--
http://www.fastmail.fm - Send your email first class
--
Yes, that fixes boot for me. Thanks, Mark. --
On Fri, 29 Feb 2008 22:06:59 +0000, "Mark McLoughlin"
Thanks for testing this.
I'ld rather not count on Xen providing zeroed memory, though. I'll
try to find a different solution. It's weekend anyhow ;).
Greetings,
Alexander
--
Alexander van Heukelum
heukelum@fastmail.fm
--
http://www.fastmail.fm - Or how I learned to stop worrying and
love email again
--
This patch adds explicit detection of the EBDA and reservation
of the rom and adapter address space 0xa0000-0x100000 to the
i386 kernels. Before this patch, the EBDA size was hardcoded
as 4Kb. Also, the reservation of the adapter range was done by
modifying the e820 map which is now not necessary any longer,
and that code is removed from copy_e820_map.
The amount of conventional memory and the start of the EBDA are
detected by reading the BIOS data area directly. Paravirtual
environments do not provide this area, so we bail out early
in that case. They will just have to set up a correct memory
map to start with.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---
Hi Ingo,
This is the second attempt at a i386-version of the ebda
patch. I hope that one of the Xen people will be able to
check that this does not break their setups, but I think
it will be fine after their patch to exclude the 0x9f000-
0x100000 area explicitly in their setup.
Greetings,
Alexander
diff --git a/arch/x86/kernel/e820_32.c b/arch/x86/kernel/e820_32.c
index 4e16ef4..746d683 100644
--- a/arch/x86/kernel/e820_32.c
+++ b/arch/x86/kernel/e820_32.c
@@ -450,38 +450,25 @@ int __init sanitize_e820_map(struct e820entry * biosmap, char * pnr_map)
* thinkpad 560x, for example, does not cooperate with the memory
* detection code.)
*/
-int __init copy_e820_map(struct e820entry * biosmap, int nr_map)
+int __init copy_e820_map(struct e820entry *biosmap, int nr_map)
{
/* Only one memory region (or negative)? Ignore it */
if (nr_map < 2)
return -1;
do {
- unsigned long long start = biosmap->addr;
- unsigned long long size = biosmap->size;
- unsigned long long end = start + size;
- unsigned long type = biosmap->type;
+ u64 start = biosmap->addr;
+ u64 size = biosmap->size;
+ u64 end = start + size;
+ u32 type = biosmap->type;
/* Overflow in 64 bits? Ignore the memory map. */
if (start > end)
return -1;
- /*
- * Some BIOSes claim RAM ...Confirmed that with Ian's e820 map patch and your patch, Xen DomU boots This is a bit magic, is it worth splitting it out as something like is_paravirt_environment() ? Cheers, Mark. --
On Tue, 04 Mar 2008 11:44:42 +0000, "Mark McLoughlin" Yes, I guess it would make sense, but do we really want to start creating accessor functions to the boot_params struct? If we do, then maybe put this as an inline function in <asm-x86/bootparam.h>? I don't think <asm-x86/paravirt.h> is the right place for this. Should is_paravirt_environment then always return 0 if CONFIG_PARAVIRT is not set? Maybe a function like get_subarch would be preferable? or has_legacy_bios_areas? I don't know. If someone makes a reasonable suggestion, I'll code it up. Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - And now for something completely different --
hm, for now i've only got the patch below queued up for v2.6.25. Could you check whether just the patch below ontop of -rc3-ish upstream solves the problem too? The EBDA patch would be a bit risky now - it's queued up for v2.6.26 at the moment. Ingo ---------------> Subject: x86/xen: fix DomU boot problem From: Ian Campbell <ijc@hellion.org.uk> Date: Thu, 28 Feb 2008 23:16:49 +0000 Construct Xen guest e820 map with a hole between 640K-1M. It's pure luck that Xen kernels have gotten away with it in the past. The patch below seems like the right thing to do. It certainly boots in a domU without the DMI problem (without any of the other related patches such as Alexander's). Signed-off-by: Ian Campbell <ijc@hellion.org.uk> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Tested-by: Mark McLoughlin <markmc@redhat.com> Acked-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/xen/setup.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-x86.q/arch/x86/xen/setup.c =================================================================== --- linux-x86.q.orig/arch/x86/xen/setup.c +++ linux-x86.q/arch/x86/xen/setup.c @@ -38,7 +38,8 @@ char * __init xen_memory_setup(void) unsigned long max_pfn = xen_start_info->nr_pages; e820.nr_map = 0; - add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM); + add_memory_region(0, LOWMEMSIZE(), E820_RAM); + add_memory_region(HIGH_MEMORY, PFN_PHYS(max_pfn)-HIGH_MEMORY, E820_RAM); return "Xen"; } --
Sorry, perhaps I wasn't clear. Ian's patch to add a hole in the e820 map fixes the problem without any other patch. I was just confirming to Alexander that his EBDA patch didn't break anything further. Thanks, Mark. --
ah, great. You got me worried for a minute :) Ingo --
Yes, is_paravirt() already exists for this purpose.
The code looking at the boot params will only work if we actually booted
via the paravirt Linux boot protocol, which Xen doesn't at present.
J
--
On Tue, 04 Mar 2008 07:18:48 -0800, "Jeremy Fitzhardinge"
Hi,
If it exists, it is well-hidden. A grep for is_paravirt on the testing
I chose to code it exactly this way, because it is what is used in
head_32.S to choose how to start the kernel. Or is this code not
executed at all?
[excerpt form head_32.S]
cmpw $0x207, pa(boot_params + BP_version)
jb default_entry
/* Paravirt-compatible boot parameters. Look to see what
architecture
we're booting under. */
movl pa(boot_params + BP_hardware_subarch), %eax
cmpl $num_subarch_entries, %eax
jae bad_subarch
movl pa(subarch_entries)(,%eax,4), %eax
subl $__PAGE_OFFSET, %eax
jmp *%eax
[and]
subarch_entries:
.long default_entry /* normal x86/PC */
.long lguest_entry /* lguest hypervisor */
.long xen_entry /* Xen hypervisor */
num_subarch_entries = (. - subarch_entries) / 4
[end]
If this is indeed not executed, is there a way to detect whether
we can expect the environment to behave like a normal pc in terms
of magic addresses, bios areas, isa reserved address space and so
on?
Greetings,
Alexander
--
Alexander van Heukelum
heukelum@fastmail.fm
--
http://www.fastmail.fm - mmm... Fastmail...
--
The subarch stuff (boot_params.hdr.hardware_subarch) is indeed executed, at least with newer PV guests. However, as far as this kind of stuff, one really have to wonder to what extent the PV users really care about how much they perturb the guests. The canonical view of virtualization is that you should perturb your guests as little as possible, but that doesn't really seem to be considered in the observable bits of the PV universe as far as I can tell. A *massive* issue with hooks -- including paravirt_ops -- is that they are largely undocumented both in code and in specification, and usually hard-code the underlying implemenentation at a specific point in time: I have yet to see any sort of specification document for paravirt_ops, and most of the hooks are simply empty on hardware. This means that the burden has shifted onto the kernel maintainers to test every possible paravirt_ops client, because it is quite literally the only way to know when it's broken. I'm starting to feel that the PV clients need to document their environments and their constraints better for the benefit of the core maintainers. -hpa --
No, this isn't executed when booting under Xen. Xen is a bit magic, and
has its own kernel entrypoint which the domain builder finds via
Xen-specific ELF notes on the vmlinux. We plan to move to booting via
Just because something is paravirtualized and uses a non-PC
hardware_subarch doesn't mean this stuff isn't present. Even the
paravirt_enabled() test isn't accurate, because the environment may
still choose to emulate these things (or in the Xen dom0 case, it may
directly expose the real hardware).
J
--
Jeremy Fitzhardinge pointed out that looking at the boot_params
struct to determine if the system is running in a paravirtual
environment is not reliable for the Xen case, currently. He also
points out that there already exists a function to determine if
the system is running in a paravirtual environment. So let's use
that instead. This gets rid of the preprocessor test too.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---
paravirt_enabled() returns 0 if CONFIG_PARAVIRT is not set, or if a
paravirt-enabled kernel is run on bare hardware: dom0. If that is right,
then it is exactly what is needed.
Greetings,
Alexander
diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
index 20e537b..f595d7b 100644
--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -64,6 +64,7 @@
#include <setup_arch.h>
#include <bios_ebda.h>
#include <asm/cacheflush.h>
+#include <asm/processor.h>
/* This value is set up by the early boot code to point to the value
immediately after the boot time page tables. It contains a *physical*
@@ -408,12 +409,8 @@ static void __init reserve_ebda_region(void)
/* that area is absent. We'll just have to assume */
/* that the paravirt case can handle memory setup */
/* correctly, without our help. */
-#ifdef CONFIG_PARAVIRT
- if ((boot_params.hdr.version >= 0x207) &&
- (boot_params.hdr.hardware_subarch != 0)) {
+ if (paravirt_enabled())
return;
- }
-#endif
/* end of low (conventional) memory */
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
--
Jeremy Fitzhardinge pointed out that looking at the boot_params
struct to determine if the system is running in a paravirtual
environment is not reliable for the Xen case, currently. He also
points out that there already exists a function to determine if
the system is running in a paravirtual environment. So let's use
that instead. This gets rid of the preprocessor test too.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---
Same, for 64 bit.
Greetings,
Alexander
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 269a6b4..48be76c 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -72,12 +72,8 @@ static void __init reserve_ebda_region(void)
/* that area is absent. We'll just have to assume */
/* that the paravirt case can handle memory setup */
/* correctly, without our help. */
-#ifdef CONFIG_PARAVIRT
- if ((boot_params.hdr.version >= 0x207) &&
- (boot_params.hdr.hardware_subarch != 0)) {
+ if (paravirt_enabled())
return;
- }
-#endif
/* end of low (conventional) memory */
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
--
This patch is an add-on to the 64-bit ebda patch. It makes
the functions reserve_ebda_region (renamed from reserve_ebda)
and copy_e820_map equal to the 32-bit versions of the previous
patch.
Changes:
Use u64 and u32 for local variables in copy_e820_map.
The amount of conventional memory and the start of the EBDA are
detected by reading the BIOS data area directly. Paravirtual
environments do not provide this area, so we bail out early
in that case. They will just have to set up a correct memory
map to start with.
Add a safety net for zeroed out BIOS data area.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
---
diff --git a/arch/x86/kernel/e820_32.c b/arch/x86/kernel/e820_32.c
diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index dd68cfd..baa4223 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -621,10 +621,10 @@ static int __init copy_e820_map(struct e820entry *biosmap, int nr_map)
return -1;
do {
- unsigned long start = biosmap->addr;
- unsigned long size = biosmap->size;
- unsigned long end = start + size;
- unsigned long type = biosmap->type;
+ u64 start = biosmap->addr;
+ u64 size = biosmap->size;
+ u64 end = start + size;
+ u32 type = biosmap->type;
/* Overflow in 64 bits? Ignore the memory map. */
if (start > end)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index b684552..269a6b4 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -55,12 +55,30 @@ static void __init copy_bootdata(char *real_mode_data)
/*
* The BIOS places the EBDA/XBDA at the top of conventional
* memory, and usually decreases the reported amount of
- * conventional memory (int 0x12) too.
+ * conventional memory (int 0x12) too. This also contains a
+ * workaround for Dell systems that neglect to reserve EBDA.
+ * The same workaround also avoids a problem with the AMD768MPX
+ * chipset: reserve a page before VGA to prevent PCI prefetch
+ * into it ...The EBDA is optional anyway; I presume it should have a zero pointer if it isn't present. -hpa --
On Fri, 29 Feb 2008 10:44:36 -0800, "H. Peter Anvin" <hpa@zytor.com> Correct, but the value that indicates the size of the conventional memory area must be set. At least on any real hardware. My guess is that both values read as 0 on Xen, which causes 0-0x100000 to be reserved. If that is always the case the workaround is fine, I think. -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - mmm... Fastmail... --
Instead of using early reservations inside the kernel code,
we could use the realmode code to modify the e820 memmap.
This patch shows what that would look like. I have not looked
at the case where the BIOS does not provide an e820 memmap
yet. Probably a full solution would need to create a fake
e820 memmap in that case.
Comments?
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
diff --git a/arch/x86/boot/memory.c b/arch/x86/boot/memory.c
index e77d89f..522920a 100644
--- a/arch/x86/boot/memory.c
+++ b/arch/x86/boot/memory.c
@@ -23,6 +23,7 @@ static int detect_memory_e820(void)
int count = 0;
u32 next = 0;
u32 size, id;
+ u32 lowmem, ebda_addr;
u8 err;
struct e820entry *desc = boot_params.e820_map;
@@ -50,13 +51,51 @@ static int detect_memory_e820(void)
just return failure. */
if (id != SMAP) {
count = 0;
- break;
+ goto out;
}
count++;
desc++;
- } while (next && count < E820MAX);
-
+ } while (next && count < (E820MAX - 1));
+
+ /* Some BIOSes do not reserve the EBDA/XBDA area correctly in.
+ The e820-map. Find out where the EBDA resides by looking at
+ the BIOS data area and reserve the EBDA and the following
+ legacy adapter area explicitly. */
+#define BIOS_EBDA_SEGMENT 0x40E
+#define BIOS_LOWMEM_KILOBYTES 0x413
+
+ /* end of low (conventional) memory */
+ set_fs(0);
+ lowmem = rdfs16(BIOS_LOWMEM_KILOBYTES);
+ lowmem <<= 10;
+
+ /* start of EBDA area */
+ ebda_addr = rdfs16(BIOS_EBDA_SEGMENT);
+ ebda_addr <<= 4;
+
+ /* Fixup: bios puts an EBDA in the top 64K segment */
+ /* of conventional memory, but does not adjust lowmem. */
+ if ((lowmem - ebda_addr) <= 0x10000)
+ lowmem = ebda_addr;
+
+ /* Fixup: bios does not report an EBDA at all. */
+ /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
+ if ((ebda_addr == 0) && (lowmem >= 0x9f000))
+ lowmem = 0x9f000;
+
+ /* Paranoia: should never happen, but... */
+ if (lowmem >= 0x100000)
+ lowmem = 0xa0000;
+
+ /* reserve ...An e820 is already faked up in machine_specific_memory_setup() if one This won't work for Xen since the real-mode code never runs there. I think it could be fixed in xen_memory_setup() though if native goes down this route. -- Ian Campbell Nasrudin walked into a teahouse and declaimed, "The moon is more useful than the sun." "Why?", he was asked. "Because at night we need the light more." --
s/could/should/. You need to set up your memory map more sensibly; it's not just the kernel, user space tries to access these areas too. -hpa --
Agreed, I think it's pure luck that Xen kernels have gotten away with it in the past. The patch below seems like the right thing to do. It certainly boots in a domU without the DMI problem (without any of the other related patches such as Alexander's). However ddcprobe hangs when run -- need to investigate some more, vm86 in general works ok (i.e. my vm86 test code passes). BTW Jeremy, the kernel doesn't use XENMEM_memory_map -- any reason other than it not being useful at the time? These days the tools can push an arbitrary e820 down for a guest which might be useful to support, although nothing interesting is done with it today. Ian. x86/xen: Construct e820 map with a hole between 640K-1M. Signed-off-by: Ian Campbell <ijc@hellion.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> --- arch/x86/xen/setup.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 3bad477..2341492 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -38,7 +38,8 @@ char * __init xen_memory_setup(void) unsigned long max_pfn = xen_start_info->nr_pages; e820.nr_map = 0; - add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM); + add_memory_region(0, LOWMEMSIZE(), E820_RAM); + add_memory_region(HIGH_MEMORY, PFN_PHYS(max_pfn)-HIGH_MEMORY, E820_RAM); return "Xen"; } -- 1.5.4.2 -- Ian Campbell His mind is like a steel trap: full of mice. -- Foghorn Leghorn --
Seems to me to be the right thing to do ... Acked-by: Mark McLoughlin <markmc@redhat.com> Cheers, Mark. --
thanks - i picked it up for v2.6.25 merging. i'm wondering, what triggered this bug, and why didnt we have these problems in the past? Ingo --
Thanks, I'd been planning to resend once I got the ddcprobe thing sorted but I've been a bit snowed under (deadlines :-|). This patch is independently correct though and turns the ddcprobe failure from a domU kernel crash into a userspace hang. ddcprobe hangs because the page at pfn 0 is comprised entirely 0xc2 bytes (the Xen memory scrub poison value -- I'm running a debug hypervisor) which causes lrmi to try and jump to 0xc2c2*16+0xc2c2 to implement int $0xXX. gdb won't let me actually examine that address for some reason so I don't actually know what it's hitting to cause the loop/hang but no good can come of jumping to that address in any case... Sanest solution seems to me to be to explicitly zero PFN 0 -- I have a Did someone make the boot allocator start at the bottom instead of the top of memory or something? Ian. -- Ian Campbell Current Noise: Nile - Even The Gods Must Die There is a natural hootchy-kootchy to a goldfish. -- Walt Disney --
That's been bothering me too, but hadn't come up with anything until I
just now looked again.
I'm pretty certain that dmi_scan_machine() wasn't using a fixmap before
(in 2.6.24) since I had to fix a broken xen dom0 patch to handle this
new situation.
Looking at 2.6.24 bt_ioremap(), it would have been hitting this:
/*
* Don't remap the low PCI/ISA area, it's always mapped..
*/
if (phys_addr >= ISA_START_ADDRESS && last_addr < ISA_END_ADDRESS)
return phys_to_virt(phys_addr);
If I restore this behaviour (i.e. remove Ian's e820 hole and add the
patch below), then I see us still allocating a page table at 0xf0000 but
we don't then try and create a new writable mapping to it later, and
everything goes fine.
That's a little worrying though, surely? The mapping created by
kernel_physical_mapping_init() is writable, so shouldn't Xen barf if you
try and use a page within that mapping as a page table?
Cheers,
From: Mark McLoughlin <markmc@redhat.com>
Date: Tue, 4 Mar 2008 15:32:03 +0000
Subject: [PATCH] x86: Don't remap the ISA region in early_ioremap()
Restore the previous behaviour of bt_ioremap() where
it would use the existing mapping of the ISA region
rather than using a fixmap.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
arch/x86/mm/ioremap.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 882328e..89fdd8f 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -425,6 +425,13 @@ void __init *early_ioremap(unsigned long phys_addr, unsigned long size)
return NULL;
}
+ /*
+ * Don't remap the low PCI/ISA area, it's always mapped..
+ */
+ if (phys_addr >= ISA_START_ADDRESS && last_addr < ISA_END_ADDRESS)
+ return ...ah, ok, indeed Jeremy's gut feeling about early_ioremap() was right - we didnt use fixmaps for early_ioremap() [ ex. bt_ioremap() ] in .24. We try to slowly get rid of the linearity assumptions from arch/x86, this was one of the steps. Ingo --
I think it was a side effect of all the early_ioremap work, which
changed the order in which things happened at boot time.
J
--
which commits do you mean in particular? Ingo --
Good question. It might have been cpa as well. I'd have to look
through to confirm the theory, but I think Xen got broken for other
reasons around there, so simple bisect wouldn't be enough to identify
the problem.
J
--
i'm trying to do bug forensics to figure out where we broke Xen - i really have the feeling that there could be other side-effects of whatever change caused this. Ingo --
looks good to me - please resend with a note to add it to x86.git once you are happy with it. Seems like 2.6.25 material as well. Ingo --
Won't this waste 300+ KB of Perfectly Good RAM? Or I understood it incorrectly? I am aware that it would take more work to tell all kernel code that it shouldn't look for BIOS data on this region when running as a domU guest, but it seems that it would be a better solution. -- Eduardo --
No, the right thing is for Xen to not try to map RAM in this area. -hpa --
Why? If the Xen host is telling us there is valid RAM in this area, why can't we use it? Existing Xen hosts use those physical address ranges for RAM. We can't fix this on the e820 map on the guest side without making otherwise-valid RAM unused. -- Eduardo --
x86 has a number of historical assumptions, both in kernel and in userspace (consider dmidecode!) Trying to go against them is a losing proposition, *AND* a headache for the maintainers, which will have to consider "oh yes, and Xen does this dumb thing which goes against what all the hardware does." -hpa --
Ahem. "Hardware does this dumb thing that Xen avoids". kthxbye ;)
J
--
If hardware was trying to emulate Xen, you'd be right. -hpa --
For the moment that's true, but we should be able to release those
pages. On the other hand, there's been talk of making Xen hand out
memory in physically contigious 2M chunks, which means trying to unmap
300k won't be possible or worthwhile. I suspect real hardware will just
waste this memory too (it won't remap that 320k of ram to somewhere
else; it will just be shadowed) - which is nothing compared to chipsets
which will just throw away 1G to make space for PCI...
J
--
Not that 384K is all that much to worry about, but you could just map memory from 2 MB upward, and set the kernel load address to 2 MB (which is good for performance anyway.) There has been talk of making the default kernel load address 16 MB, to keep the DMA zone free. -hpa --
No, never looked at that stuff in detail. I'll bear it in mind.
J
--
On Mon, 25 Feb 2008 10:13:17 -0800, "H. Peter Anvin" <hpa@zytor.com>
Hello hpa,
I failed to find a comment referring to 0x413 or int 0x12 in
arch/x86/kernel. I do know the comment in Documentation/i386/boot.txt,
however, suggesting that "INT 12h" _should_ be used, but that it would
be pointless for zImage and old bzImage kernels, because they _have_
to be started from 0x90000 anyway. New bzImage kernels do not have
this limitation, and smart bootloaders simply put them at much
lower addresses, like 0x40000. (I know: you know.) My point is just
that the argument not to use 0x413 in the bootloader is not valid
for this case.
That leaves the possibility that code/data is inserted at the top of
conventional memory by either the BIOS or some kind of bootloader.
My view on this is that this code/data should be preserved: it
was specifically asked from us by lowering 0x413. One just loses
a tiny bit of usable memory, and if the program booting the kernel
knows better it can unload the driver if it wants to. It's just
Of course. That's also why I already added the old-Dell case ;).
But one problem at a time, please!
Greetings,
Alexander
--
Alexander van Heukelum
heukelum@fastmail.fm
--
http://www.fastmail.fm - I mean, what is it about a decent email service?
--
i've applied your patch to x86.git#testing. Feel free to send a code-unification patch too :-) Ingo --
So you've basically wasted a lot of memory with no concrete breakage known ...Please reconsider. -Andi --
Just to give some perspective of this: On my laptop here BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved) BIOS-e820: 00000000000d2000 - 0000000000100000 (reserved) This means it reserves only ~193KB in the 640k-1MB area With this patch it will reserve 384KB instead. This means 191KB are lost. While that doesn't sound too much it worth as much as 382 patches that reduce kernel code size by 512bytes or worth 3820 patches that reduce kernel code by 100 bytes in terms of memory consumption. Now such kernel code size patches are always popular, but why undo that work by throwing away perfectly good memory elsewhere? Or also the laptop kernel does Freeing unused kernel memory: 340k freed This means the 193KB now lost are worth 56% of the complete memory that is freed by __initdata/__init. Just maintaining these annotations is a lot of work, but why do all that if we then throw away than half as much memory as they save so easily? -Andi --
Hi Andi,
Can you provide the complete 'raw' e820 info? This is at least
not complete, and also might be after the sanitation. Do you
mean that (1) there is usable RAM somewhere between 0xa0000
and 0xd2000? Or that (2) this second line should be RAM, not
reserved?
Case (1) would surpise me, because I expect the VGA adapter
(which I assume is there...) to occupy 0xa0000 to 0xc0000 for
the framebuffer. Also, your case would be a lot stronger if
there were a line that explicitly indicated that there was
RAM there.
Case (2) would surprise me too, because a lot of software
would expect the system BIOS to reside in at least the area
0xf0000 to 0x100000. jmp 0xf000:fff0 for a reboot, no?
Laptops do not always have the VGA bios in the standard place,
so the range 0xc0000-0xd2000 could well be unused. Still, I
doubt that there is real RAM accessible in that region.
So I think you have not correctly interpreted the e820 info.
But, if you (or anyone else, for that matter) provide(s) a raw
e820 map that shows usable RAM in the region 0xa0000-0x100000,
The two lines you gave say that two regions are reserved. Nothing
tells what is in between those regions. If a region is not
I don't intend to. I have never seen a machine with usable
Agreed, if...
Greetings,
Alexander
--
Alexander van Heukelum
heukelum@fastmail.fm
--
http://www.fastmail.fm - mmm... Fastmail...
--
It doesn't waste any memory at all. It arguably wastes some *address space*, but that's an entirely different thing. Unless you have chipset-specific drivers to enable memory in the 640-1024K memory area, there is nothing there. -hpa --
OK, let me see if I can dig up the comment I thought I found at one point, and maybe found out the history behind it. Either that, or find out that my memory is faulty ;) -hpa --
It's not needed, the e820 maps are always correct for modern systems in The i386 kernel did this always, but I intentionally removed it from the 64bit kernel because all the modern BIOS seem to correctly report holes in this area. Only didn't do it on i386 because there were some concerns of very old systems not doing this correctly. My suspicion is that modern Windows systems rely on this, that is why BIOSes typically get it correct now. I think it should be only undone if you have a concrete case where it breaks not just based on someone's gut feel. Sure it's not a lot of memory, but why waste memory unnecessarily? -Andi --
