This may require users to have a newer kexec-tools that can build the e820
table for the second kernel from /sys/firmware/memmap instead of /proc/iomem.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Bernhard Walle <bwalle@suse.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -1279,6 +1279,10 @@ void __init e820_reserve_resources(void)
res = alloc_bootmem_low(sizeof(struct resource) * e820.nr_map);
for (i = 0; i < e820.nr_map; i++) {
+ if (e820.map[i].type != E820_RAM) {
+ res++;
+ continue;
+ }
end = e820.map[i].addr + e820.map[i].size - 1;
#ifndef CONFIG_RESOURCES_64BIT
if (end > 0x100000000ULL) {
--
Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

/proc/iomem is mostly about I/O resources, which you have just removed.
It is totally the wrong thing to register only RAM resources!

The use by kexec was and is just taking advantage of something that
already existed.
On Sun, Aug 24, 2008 at 7:52 PM, Eric W. Biederman
Story:

Before 2.6.26, the kernel first used insert_resource to put the lapic
address into the resource tree, and then used request_resource to add all
entries from the e820 table. So an e820 entry that overlapped the lapic
address was never added to the resource tree.

From 2.6.26 on, we first use insert_resource to put the e820 entries into
the resource tree, and only later use insert_resource for the lapic
address. So all entries from e820 now show up in the resource tree.

Problem: some devices on bus 0 have BAR resources whose addresses fall
into a reserved area in e820. When pcibios_allocate_bus_resources checks
those resources, request_resource(pr, res) fails. At this point pr is the
resource of the parent bus of those devices, and it is iomem_resource.
Those devices then get their resources reassigned by the OS.

That should be OK, but some chipsets put the HPET in BAR1, and that change
makes the HPET address inconsistent.

The system will hang...
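The conflict described above can be sketched with a toy resource tree. This is a simplified model written for illustration, not the kernel's code: struct toy_resource, toy_request_resource and toy_demo_bar_conflict are hypothetical names, and the 0xfed00000 addresses are only example values. It assumes the request-style call refuses any range that overlaps an existing child of the parent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the kernel's struct resource; not the real definition. */
struct toy_resource {
	uint64_t start, end;          /* inclusive address range */
	const char *name;
	struct toy_resource *sibling; /* next child of the same parent */
	struct toy_resource *child;   /* first child, kept sorted by start */
};

/*
 * Request-style registration: link 'new' under 'parent' only if it lies
 * inside the parent and overlaps no existing child.
 * Returns 0 on success, -1 on conflict.
 */
static int toy_request_resource(struct toy_resource *parent,
				struct toy_resource *new)
{
	struct toy_resource **p = &parent->child;

	assert(new->start <= new->end);
	if (new->start < parent->start || new->end > parent->end)
		return -1;

	for (;;) {
		struct toy_resource *tmp = *p;

		if (!tmp || tmp->start > new->end) {
			/* Free gap found: insert before 'tmp'. */
			new->sibling = tmp;
			*p = new;
			return 0;
		}
		p = &tmp->sibling;
		if (tmp->end < new->start)
			continue;	/* 'tmp' is entirely below 'new' */
		return -1;		/* ranges overlap */
	}
}

/*
 * An e820 "reserved" entry is registered first, then a device BAR that
 * sits inside it is requested under the same parent (the whole address
 * space, like iomem_resource). Returns the result of the BAR request.
 */
static int toy_demo_bar_conflict(void)
{
	struct toy_resource iomem = { 0, UINT64_MAX, "iomem", NULL, NULL };
	struct toy_resource reserved =
		{ 0xfed00000ULL, 0xfed00fffULL, "reserved", NULL, NULL };
	struct toy_resource bar =
		{ 0xfed00000ULL, 0xfed003ffULL, "BAR1", NULL, NULL };

	if (toy_request_resource(&iomem, &reserved) != 0)
		return 1;
	return toy_request_resource(&iomem, &bar);
}
```

In this model the BAR request fails only because the reserved entry is already a child of the root; with an empty root the same BAR is accepted. That failure is the point at which the PCI code would go on to assign a new address.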
Solutions would be:
1. Use a quirk to protect the HPET in the BAR.

[PATCH] x86: protect hpet in BAR for one ATI chipset v3

This keeps the kernel from allocating a new resource for the BAR when it
cannot claim the old address assigned by the BIOS. It is handled the same
way as an IO APIC address in a BAR.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
---
drivers/pci/quirks.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
Index: linux-2.6/drivers/pci/quirks.c
===================================================================
--- linux-2.6.orig/drivers/pci/quirks.c
+++ linux-2.6/drivers/pci/quirks.c
@@ -1918,6 +1918,22 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_B
PCI_DEVICE_ID_NX2_5709S,
quirk_brcm_570x_limit_vpd);
+static void __init quirk_hpet_in_bar(struct pci_dev *pdev)
+{
+ int i;
+ u64 base, size;
+
+ /* the BAR1 is the location of the HPET...we must
+ * not touch this, so forcibly insert it into the ...

See the RFC commit below for more details about the problem and the
various solutions we are thinking about.

The core problem is that the problem was hard to find and hard to debug:
it took the exceptional debugging effort of David Witbrodt to track it
down.

So we are trying structural fixes to improve the situation. Just reverting
the e820 changes breaks other things and is not the real fix anyway: the
real fix is to increase communication between PC platform devices/drivers
and the PCI code. DMI driven quirks are too limited as well, since more
such systems are suspected.

For now we've got the patch below from Yinghai, which hooks directly into
the x86 PCI discovery and reallocation code.

While that's already better than the initial DMI quirk, i think the real
fix should go one level higher, to the resource manager.

i'd rather see the e820 reserved entries show up there (losing system
setup information is almost always a bad idea, and the e820 map is central
enough to be one of the more reliable BIOS-provided data structures), but
with a different resource property: a 'sticky' resource bit which would
cause overlapping PCI devices that already have their BAR programmed to
stay there. We already have a certain amount of support for 'container'
resources (bridge resources for example).

That would automatically protect any hpet (or, in theory, ioapic) platform
devices from the PCI code's currently blind resource reprogramming logic.
These platform devices are not PCI enumerated, so we cannot just make the
platform drivers themselves be PCI drivers, and they are special in many
regards. (Often they are not PCI devices at all.)

Note that this is only about the (BIOS provided) e820 map. The core
problem is that inserting e820 map reserved entries as 'real' resources
can break real devices.

	Ingo

---------------->
From 1521c6b7a96e8d79c424216d9118859a017a4e9e Mon Sep 17 00:00:00 2001
From: Yinghai Lu <yhlu.kernel@gmail.com>
Date: Sun, 24 Aug 2008 21:41:28 ...
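One way to picture the 'sticky' resource idea from the message above is the small decision function below. It is only a sketch under assumptions: TOY_RES_STICKY, struct sticky_res and toy_check_bar are hypothetical names invented here, the flag is not an existing kernel IORESOURCE_* bit, and a real implementation in the resource manager would look different.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag for illustration; not a kernel IORESOURCE_* bit. */
#define TOY_RES_STICKY 0x1u

/* Toy description of an already registered region (e.g. from e820). */
struct sticky_res {
	uint64_t start, end;	/* inclusive address range */
	unsigned int flags;
};

enum bar_action {
	BAR_KEEP,	/* leave the BAR where the BIOS programmed it */
	BAR_REALLOCATE	/* treat as a conflict and assign a new address */
};

/*
 * Decide what to do with a BIOS-programmed BAR [bar_start, bar_end]
 * given the regions that are already registered.
 */
static enum bar_action toy_check_bar(const struct sticky_res *regions,
				     size_t n,
				     uint64_t bar_start, uint64_t bar_end)
{
	for (size_t i = 0; i < n; i++) {
		const struct sticky_res *r = &regions[i];

		if (r->end < bar_start || r->start > bar_end)
			continue;		/* no overlap with this region */
		if (r->flags & TOY_RES_STICKY)
			return BAR_KEEP;	/* sticky: BAR stays put */
		return BAR_REALLOCATE;		/* ordinary overlap: conflict */
	}
	return BAR_KEEP;			/* nothing overlaps at all */
}
```

With the flag set on an e820 reserved entry, a BAR that overlaps it is left alone; without the flag the same overlap is reported as a conflict, which corresponds to the behaviour that moved the HPET BAR.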
Agreed. And that is why I NAK'd YH's first patch, which just yanked all of
the reserved entries out of the resource map. This really does need to get
up to how we deal with resources.

The core problem is seeing the e820 reservation as a conflict, not
inserting the resources themselves.

The question: How do we deal more gracefully with BIOS bugs?

The problem: We don't have full system information, so we have to guess
and perform other magic to make the system work.

I bet if the HPET driver knew we had changed its BAR it would have worked,
but of course that won't work in general.

One of the other problems we have seen in this area, if memory serves, is
that BIOS reserved regions don't always split on the same boundaries as
real hardware.

The last time this class of problem came up we added insert_resource to
the resource allocator. It seems either we are not using it properly or it
is an insufficient fix.

Hmm. Why does pci_find_parent_resource fail?
On Mon, Aug 25, 2008 at 6:30 AM, Eric W. Biederman

It doesn't fail; it got [0, -1ULL], because that device 00:14.0 is on
bus 0, and that is a system with one HT chain.

YH
2.0.0 has that implemented.

Bernhard
--
Bernhard Walle, SUSE LINUX Products GmbH, Architecture Development
Yes. Can you guys make it possible to compile kexec-tools 2.0.0
statically, as an option?

YH
See http://article.gmane.org/gmane.linux.kernel.kexec/2223.

Bernhard
--
Bernhard Walle, SUSE LINUX Products GmbH, Architecture Development
I think this will also wipe out the ACPI related entries from /proc/iomem,
and kdump will be broken, as the second kernel needs to know about the
ACPI areas.

Though, if all these entries are available in /sys/firmware/memmap, then
one could probably modify kexec-tools to take the RAM entries from
/proc/iomem and the rest of the entries from /sys/firmware/memmap. I would
prefer not to do that, as it makes the logic twisted.

Thanks
Vivek
/sys/firmware/memmap has all of them, though the RAM entries in
/proc/iomem could be smaller than those in /sys/firmware/memmap because of
trimming from the command line.

YH
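As a rough illustration of what building an e820 table from /sys/firmware/memmap involves, the sketch below converts the text of one entry's start, end and type files into an e820-style record. It is not kexec-tools code: struct toy_e820_entry, toy_type_from_string and toy_memmap_to_e820 are hypothetical names, and the sketch assumes that start and end are hexadecimal strings with end inclusive, and that the type strings are the ones listed in the code. Reading the files from the numbered directories is left out.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy e820-style record; layout chosen for illustration only. */
struct toy_e820_entry {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

/* Returns nonzero if 's' begins with 'prefix' (ignores a trailing newline). */
static int toy_has_prefix(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Map the sysfs type string to an e820 type number.
 * Assumed strings; anything unrecognised is treated as reserved (2).
 */
static uint32_t toy_type_from_string(const char *s)
{
	if (toy_has_prefix(s, "System RAM"))
		return 1;	/* E820_RAM */
	if (toy_has_prefix(s, "ACPI Tables"))
		return 3;	/* E820_ACPI */
	if (toy_has_prefix(s, "ACPI Non-volatile Storage"))
		return 4;	/* E820_NVS */
	return 2;		/* E820_RESERVED */
}

/*
 * Convert the contents of one entry's start/end/type files.
 * Returns 0 on success, -1 if a number cannot be parsed or end < start.
 */
static int toy_memmap_to_e820(const char *start, const char *end,
			      const char *type, struct toy_e820_entry *out)
{
	char *p;
	uint64_t s, e;

	s = strtoull(start, &p, 16);	/* base 16 accepts the "0x" prefix */
	if (p == start)
		return -1;
	e = strtoull(end, &p, 16);
	if (p == end || e < s)
		return -1;

	out->addr = s;
	out->size = e - s + 1;		/* end is inclusive */
	out->type = toy_type_from_string(type);
	return 0;
}
```

A caller would read the three files of each numbered entry directory and pass their contents to toy_memmap_to_e820, collecting the results into the table handed to the second kernel.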
