Re: [tip:x86/urgent] x86-32: Make sure we can map all of lowmem if we need to

From: Stanislaw Gruszka
Date: Friday, December 3, 2010 - 4:16 am

On my T-60 laptop, an i686 system with a 2.6.37-rc4 kernel,
"echo c > /proc/sysrq-trigger" just hangs the system. Kdump
works on 2.6.36. Is this a known issue? If not, what info
should I provide to solve it? (I think the easiest way
to find the problem would be to bisect.)

Stanislaw
--

From: Maxim Uvarov
Date: Friday, December 3, 2010 - 8:46 am

I tested x86 QEMU yesterday with the latest git. It worked.



-- 
Best regards,
Maxim Uvarov
--

From: Stanislaw Gruszka
Date: Friday, December 3, 2010 - 10:11 am

Here is the photo 
http://people.redhat.com/sgruszka/20101203_005.jpg

There are two BUGs, first "sleeping function called from invalid
context" and then "unable to handle null pointer dereference".

Stanislaw
--

From: Neil Horman
Date: Friday, December 3, 2010 - 10:54 am

The warning about sleeping is an artifact of the fact that we panic the box with
irqs disabled, I think (although I would think the fault handler would have
re-enabled them properly).  Not sure what the NULL pointer is from.
--

From: Stanislaw Gruszka
Date: Tuesday, December 7, 2010 - 3:50 am

The NULL pointer dereference is OK; that's the way sysrq_handle_crash
triggers a crash. The problem here is that the secondary kdump kernel hangs
at start.

Bisection shows that the bad commit is

commit 72d7c3b33c980843e756681fb4867dc1efd62a76
Author: Yinghai Lu <yinghai@kernel.org>
Date:   Wed Aug 25 13:39:17 2010 -0700

    x86: Use memblock to replace early_res

Before this commit kdump works. After it the kernel doesn't compile (!?!). I
fixed compilation, but the crash kernel still could not even be loaded. I fixed
that using hunks from 9f4c13964b58608fbce05540743281ea3146c0e8 "x86, memblock:
Fix crashkernel allocation". After that the crash kernel can be loaded, but
it hangs at start, which is the problem that still happens in -rc4.
I'm attaching my config; I hope this is enough to fix it.

Stanislaw
From: Yinghai Lu
Date: Tuesday, December 7, 2010 - 12:24 pm

Please check the debug patches, and boot the first kernel and kexec the second kernel with "ignore_loglevel debug earlyprintk...."

Thanks

	Yinghai
From: Stanislaw Gruszka
Date: Wednesday, December 8, 2010 - 7:19 am

The second kernel does not print anything, so maybe it does not even start.
Dmesg from the primary kernel is attached.

Stanislaw
From: Yinghai Lu
Date: Thursday, December 9, 2010 - 12:16 am

please try attached debug patch.

Thanks

	Yinghai
From: Stanislaw Gruszka
Date: Thursday, December 9, 2010 - 5:41 am

With the debug patch the kdump kernel boots. Dmesgs from the kdump and
primary kernels are attached.

Stanislaw
From: Yinghai Lu
Date: Thursday, December 9, 2010 - 1:09 pm

thanks.

Please check if this one works. It only puts crashkernel low.

Yinghai
From: Stanislaw Gruszka
Date: Monday, December 13, 2010 - 3:08 am

Yes, with patch kdump works.

Thanks
Stanislaw
--

From: Yinghai Lu
Date: Monday, December 13, 2010 - 11:20 am

peter, vivek,

it seems 32bit kdump needs crashkernel much lower than we expect...

Maybe we have to use find_in_range_low() to make 32bit kdump happy.

Thanks

Yinghai

Subject: [PATCH] x86, memblock: Add memblock_x86_find_in_range_low()

The generic version goes from high to low, and it seems it cannot find
an area that is low and compact enough.

The x86 version goes from goal to limit, just like the way we used
for early_res.

This makes crashkernel happy with 32bit kdump.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/include/asm/memblock.h |    2 +
 arch/x86/kernel/setup.c         |    2 +-
 arch/x86/mm/memblock.c          |   52 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/mm/memblock.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/memblock.c
+++ linux-2.6/arch/x86/mm/memblock.c
@@ -346,3 +346,55 @@ u64 __init memblock_x86_hole_size(u64 st
 
 	return end - start - ((u64)ram << PAGE_SHIFT);
 }
+
+/* Check for already reserved areas */
+static inline bool __init check_with_memblock_reserved(u64 *addrp, u64 size, u64 align)
+{
+	u64 addr = *addrp;
+	bool changed = false;
+	struct memblock_region *r;
+again:
+	for_each_memblock(reserved, r) {
+		if ((addr + size) > r->base && addr < (r->base + r->size)) {
+			addr = round_up(r->base + r->size, align);
+			changed = true;
+			goto again;
+		}
+	}
+
+	if (changed)
+		*addrp = addr;
+
+	return changed;
+}
+
+/*
+ * Find a free area with specified alignment in a specific range from bottom up
+ */
+u64 __init memblock_x86_find_in_range_low(u64 start, u64 end, u64 size, u64 align)
+{
+	struct memblock_region *r;
+
+	for_each_memblock(memory, r) {
+		u64 ei_start = r->base;
+		u64 ei_last = ei_start + r->size;
+		u64 addr, last;
+
+		addr = round_up(ei_start, align);
+		if (addr < start)
+			addr = round_up(start, align);
+		if (addr >= ei_last)
+			continue;
+		while ...
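
The bottom-up idea described in the patch text (start at the low end of the range and step past every reserved area that overlaps the candidate) can be sketched in plain user-space C. This is only an illustration under stated assumptions: `struct region`, `find_in_range_low()` and `NOT_FOUND` are hypothetical stand-ins, memory regions are assumed to be sorted by base address, and the code is not the elided body of the patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memblock region; not the kernel's struct. */
struct region {
	uint64_t base;
	uint64_t size;
};

#define NOT_FOUND UINT64_MAX	/* illustrative error value */

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Bottom-up search: return the lowest aligned address in [start, end)
 * where [addr, addr + size) fits inside one memory region and overlaps
 * no reserved region. Memory regions are assumed sorted by base.
 */
uint64_t find_in_range_low(const struct region *mem, size_t nmem,
			   const struct region *rsv, size_t nrsv,
			   uint64_t start, uint64_t end,
			   uint64_t size, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);

	for (size_t i = 0; i < nmem; i++) {
		uint64_t lo = mem[i].base > start ? mem[i].base : start;
		uint64_t hi = mem[i].base + mem[i].size;
		uint64_t addr = round_up_u64(lo, align);
		int moved;

		if (hi > end)
			hi = end;

		/* Keep stepping past reserved areas until nothing overlaps. */
		do {
			moved = 0;
			for (size_t j = 0; j < nrsv; j++) {
				uint64_t rend = rsv[j].base + rsv[j].size;

				if (addr + size > rsv[j].base && addr < rend) {
					addr = round_up_u64(rend, align);
					moved = 1;
				}
			}
		} while (moved);

		if (addr < hi && size <= hi - addr)
			return addr;
	}
	return NOT_FOUND;
}
```

A top-down search over the same layout would instead return the highest address that fits, which is the behaviour the patch description attributes to the generic version.
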
From: H. Peter Anvin
Date: Monday, December 13, 2010 - 12:47 pm

Not this garbage again... sigh.  Once again, I will want to know what
the actual constraint is... not just "oh, this seems to work on this one
system."

I realize that the kdump interfaces are probably beyond saving -- we
have had this discussion enough times -- but I'm not happy about it and
I will really want to know what the heck the real issue is.

Furthermore, such a function should NOT be private to x86 core; if it's
needed at all it should live in the memblock core.

	-hpa
--

From: Vivek Goyal
Date: Tuesday, December 14, 2010 - 3:41 pm

Same here, Yinghai. We need to find out what the upper limit is for
loading an x86 32bit kernel; once we understand that, we can fail
the loading of the kdump kernel with an appropriate reason. Last time
our understanding was that as long as we allocate memory below 896MB
things should be fine.

Stanislaw, how much memory are you reserving, and at what address, with the
-rc4 kernel? Can you please look at /proc/iomem? And try to reserve the same
amount of memory at roughly the same address on the 2.6.36 kernel, and see if
kdump works.

Here is how I used to debug problems in the kdump path.

- Try earlyprintk for the second kernel.
- Try the --debug and --console-serial options with kexec while loading the
  second kernel. The important thing to know here is whether control reached
  purgatory or not.
- If that gives me nothing, then it boils down to putting some outb()
  statements in the first kernel and second kernel boot paths to find out
  where things went wrong.

  Because the issue was resolved by reserving memory in the low memory
  area, it sounds like the second kernel failed to boot early. So early
  printk might help; otherwise outb() and the serial console are your friends.

Thanks
Vivek
--

From: Stanislaw Gruszka
Date: Wednesday, December 15, 2010 - 3:39 am

I could debug this problem, but I do not suffer from free time right
now :-) It would be better if someone experienced with bootmem/kdump debugged
this. I just checked another laptop (T500, 2.6.37-rc5, x86_64, RHEL6 user space,
crashkernel=256M, 1.6G mem), and kdump does not work there either. So I do
not think the problem is hard to reproduce.

Stanislaw
--

From: Yinghai Lu
Date: Wednesday, December 15, 2010 - 3:41 pm

ok, will try to find some old machine with less memory and devices to duplicate the problem.

Yinghai
--

From: Yinghai Lu
Date: Wednesday, December 15, 2010 - 9:29 pm

please check

[PATCH] x86, crashkernel, 32bit: only try to get range under 512M

Stanislaw reported that kdump on 32bit is broken.

In misc.c the decompressor does a sanity check to make sure the heap
is under 512M.

So limit the crashkernel range in the first kernel to under 512M on 32bit systems.

Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/kernel/setup.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -499,7 +499,19 @@ static inline unsigned long long get_tot
 	return total << PAGE_SHIFT;
 }
 
+/*
+ * arch/x86/boot/compressed/misc.c will check heap size for decompresser
+ *  32bit will have more strict limitation
+ */
 #define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
+#define HEAP_LIMIT_32BIT 0x20000000
+
+#ifdef CONFIG_X86_64
+#define CRASH_KERNEL_LIMIT DEFAULT_BZIMAGE_ADDR_MAX
+#else
+#define CRASH_KERNEL_LIMIT HEAP_LIMIT_32BIT
+#endif
+
 static void __init reserve_crashkernel(void)
 {
 	unsigned long long total_mem;
@@ -521,7 +533,7 @@ static void __init reserve_crashkernel(v
 		 *  kexec want bzImage is below DEFAULT_BZIMAGE_ADDR_MAX
 		 */
 		crash_base = memblock_find_in_range(alignment,
-			       DEFAULT_BZIMAGE_ADDR_MAX, crash_size, alignment);
+			       CRASH_KERNEL_LIMIT, crash_size, alignment);
 
 		if (crash_base == MEMBLOCK_ERROR) {
 			pr_info("crashkernel reservation failed - No suitable area found.\n");
--

From: Stanislaw Gruszka
Date: Thursday, December 16, 2010 - 3:00 am

The patch fixes the problem on my T-60 laptop.

As expected, the patch does not help on my other T-500 x86_64 system;
kdump does not work there, but perhaps this is a different problem.
I'm going to check it.

Stanislaw
--

From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 9:16 am

I think limiting kdump below 512 MiB on 32 bits may make sense; perhaps
even on 64 bits.  It's pretty conservative, after all...

Opinions?

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: Vivek Goyal
Date: Thursday, December 16, 2010 - 9:22 am

Actually it will be good to know why 512MB. I know in the past we have
been talking of reserving memory in higher memory regions and Neil Horman
had been trying to boot bzImage in 64 bit mode so that it can be run
from higher addresses. 

So right now limiting it is easy, but it is desirable to be able to run
bzImage from as high an address as possible, and knowing why it is limited
to 512MB can help us see if there is a way to get rid of that limitation.

I probably would not worry about 32bit systems, but for 64 bit I
certainly want to make it boot from higher addresses (if it is
technically possible).

Thanks
Vivek
--

From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 9:53 am

It's worth noting that there is almost always going to be a need for
*some* low memory.

	-hpa

--

From: Yinghai Lu
Date: Saturday, December 18, 2010 - 2:50 pm

Can you try crashkernel=256M@128M on your T-500 x86_64 system?

Thanks

Yinghai
--

From: Vivek Goyal
Date: Thursday, December 16, 2010 - 7:39 am

Thanks Yinghai. I am wondering why on 32bit the heap has to be within 512MB.
I think you are referring to the following check in
arch/x86/boot/compressed/misc.c.

	if (end > ((-__PAGE_OFFSET-(512 <<20)-1) & 0x7fffffff))
		error("Destination address too large");

It was introduced here.

commit 968de4f02621db35b8ae5239c8cfc6664fb872d8
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Thu Dec 7 02:14:04 2006 +0100

    [PATCH] i386: Relocatable kernel support

Eric,

It has been a long time. By any chance, would you remember where the above
constraint comes from?

Thanks
--
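
The bound in that check is easier to read once it is evaluated for concrete values. A minimal sketch of the arithmetic, assuming a 32-bit unsigned `__PAGE_OFFSET`; the function name `heap_limit()` and its parameters are made up for illustration and do not exist in misc.c:

```c
#include <stdint.h>

/*
 * 32-bit form of the bound used by the decompressor check:
 *   (-__PAGE_OFFSET - (gap_mb << 20) - 1) & 0x7fffffff
 * page_offset and gap_mb are parameters here so that different
 * configurations can be compared.
 */
uint32_t heap_limit(uint32_t page_offset, uint32_t gap_mb)
{
	/* Unsigned arithmetic wraps modulo 2^32, like 32-bit negation. */
	return (0u - page_offset - (gap_mb << 20) - 1u) & 0x7fffffffu;
}
```

With PAGE_OFFSET=0xC0000000 and the 512 MiB gap this evaluates to 0x1fffffff, i.e. just under 512 MiB. Replacing 512 with 128 gives 0x37ffffff, the same value as DEFAULT_BZIMAGE_ADDR_MAX in setup.c.
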

From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 9:28 am

It might, in fact, be bogus; specifically a proxy for the fact that we
need the kernel memory including bss and brk below the lowmem boundary,
which isn't well-defined.

	-hpa


--

From: Yinghai Lu
Date: Thursday, December 16, 2010 - 10:28 am

the brk is complaining if i change that to 

 	if (end > ((-__PAGE_OFFSET-(128 <<20)-1) & 0x7fffffff))
 		error("Destination address too large");

brk is complaining when trying to get more for dmi ...
...
I'm in purgatory
bootconsole [uart0] enabled
Kernel Layout:
  .text: [0x2e000000-0x2e3f08ca]
.rodata: [0x2e3f2000-0x2e5a2fff]
  .data: [0x2e5a3000-0x2e5f6467]
  .init: [0x2e5f7000-0x2e670fff]
   .bss: [0x2e675000-0x2e76ffff]
   .brk: [0x2e770000-0x2e894fff]
    memblock_x86_reserve_range: [0x00001000-0x00001fff]    EX TRAMPOLINE
    memblock_x86_reserve_range: [0x2e000000-0x2e76ffff]    TEXT DATA BSS
    memblock_x86_reserve_range: [0x35bdd000-0x35f49fff]          RAMDISK
    memblock_x86_reserve_range: [0x0009c800-0x000fffff]  * BIOS reserved
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.37-rc5-tip+ (root@mpk12-3214-189-181) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #4 SMP Wed Dec 15 11:04:32 PST 2010
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  NSC Geode by NSC
  Cyrix CyrixInstead
  Centaur CentaurHauls
  Transmeta GenuineTMx86
  Transmeta TransmetaCPU
  UMC UMC UMC UMC
BIOS-provided physical RAM map:
 BIOS-e820: [0x00000000000100-0x0000000009c7ff] (usable)
 BIOS-e820: [0x0000000009c800-0x0000000009ffff] (reserved)
 BIOS-e820: [0x000000000e0000-0x000000000fffff] (reserved)
 BIOS-e820: [0x00000000100000-0x0000007ff9ffff] (usable)
 BIOS-e820: [0x0000007ffae000-0x0000007ffaffff] (usable)
 BIOS-e820: [0x0000007ffb0000-0x0000007ffbdfff] (ACPI data)
 BIOS-e820: [0x0000007ffbe000-0x0000007ffeffff] (ACPI NVS)
 BIOS-e820: [0x0000007fff0000-0x0000007fffffff] (reserved)
 BIOS-e820: [0x000000e0000000-0x000000efffffff] (reserved)
 BIOS-e820: [0x000000fec00000-0x000000fec00fff] (reserved)
 BIOS-e820: [0x000000fee00000-0x000000feefffff] (reserved)
 BIOS-e820: [0x000000ff700000-0x000000ffffffff] (reserved)
last_pfn = 0x7ffb0 max_arch_pfn = 0x1000000
NX (Execute Disable) protection: active
user-defined ...
From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 12:58 pm

I'm assuming it bails due to:

	BUG_ON((char *)(_brk_end + size) > __brk_limit);

... could you find out what _brk_end and __brk_limit are?

	-hpa
--

From: Yinghai Lu
Date: Thursday, December 16, 2010 - 3:57 pm

void __init print_kernel_layout(void)
{
        printk("Kernel Layout:\n");
        printk("  .text: [%#010lx-%#010lx]\n", __pa_symbol(&_text), __pa_symbol(&_etext) - 1);
        printk(".rodata: [%#010lx-%#010lx]\n", __pa_symbol(&__start_rodata), __pa_symbol(&__end_rodata) - 1);
        printk("  .data: [%#010lx-%#010lx]\n", __pa_symbol(&_sdata), __pa_symbol(&_edata) - 1);
        printk("  .init: [%#010lx-%#010lx]\n", __pa_symbol(&__init_begin), __pa_symbol(&__init_end) - 1);
        printk("   .bss: [%#010lx-%#010lx]\n", __pa_symbol(&__bss_start), __pa_symbol(&__bss_stop) - 1);
        printk("   .brk: [%#010lx-%#010lx]\n", __pa_symbol(&__brk_base), __pa_symbol(&__brk_limit) - 1);

so __brk_limit should be right?
--

From: Yinghai Lu
Date: Thursday, December 16, 2010 - 4:30 pm

DMI present.
_brk_end: ee8e6000, __brk_limit: ee895000 

--
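
The two values reported above can be compared directly against the BUG_ON condition quoted earlier in the thread. A small sketch, assuming 4 KiB pages; the helper names are hypothetical and are not kernel functions:

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u	/* assumption: 4 KiB pages */

/* Same comparison as the BUG_ON: would growing brk by size pass the limit? */
int brk_would_overflow(uint32_t brk_end, uint32_t brk_limit, uint32_t size)
{
	return brk_end + size > brk_limit;
}

/* Number of pages by which brk_end already exceeds brk_limit (0 if none). */
uint32_t brk_overrun_pages(uint32_t brk_end, uint32_t brk_limit)
{
	if (brk_end <= brk_limit)
		return 0;
	return (brk_end - brk_limit) / PAGE_SIZE_BYTES;
}
```

With _brk_end = 0xee8e6000 and __brk_limit = 0xee895000, brk has already gone 0x51000 bytes (81 pages) past its limit, so any further extension trips the check.
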

From: Yinghai Lu
Date: Thursday, December 16, 2010 - 4:49 pm

It looks like arch/x86/kernel/head_32.S
puts the page tables in _brk....

If the whole range is quite high, it will use more buffer in _brk for ...

The brk pre-calculation could be wrong and too small.

Yinghai
--

From: Yinghai Lu
Date: Thursday, December 16, 2010 - 5:39 pm

32bit assumes KERNEL_IMAGE_SIZE is 512M
arch/x86/include/asm/page_32_types.h:#define KERNEL_IMAGE_SIZE  (512 * 1024 * 1024)
arch/x86/include/asm/page_64_types.h:#define KERNEL_IMAGE_SIZE  (512 * 1024 * 1024)
arch/x86/kernel/head64.c:       BUILD_BUG_ON(MODULES_VADDR-KERNEL_IMAGE_START < KERNEL_IMAGE_SIZE);
arch/x86/kernel/head64.c:       BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
arch/x86/kernel/head64.c:       max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
arch/x86/kernel/head_32.S: *     (KERNEL_IMAGE_SIZE/4096) / 1024 pages (worst case, non PAE)
arch/x86/kernel/head_32.S: *     (KERNEL_IMAGE_SIZE/4096) / 512 + 4 pages (worst case for PAE)
arch/x86/kernel/head_32.S: * KERNEL_IMAGE_SIZE should be greater than pa(_end)
arch/x86/kernel/head_32.S:KERNEL_PAGES = (KERNEL_IMAGE_SIZE + MAPPING_BEYOND_END)>>PAGE_SHIFT 

and uses that to estimate the BRK size.

So we could change the BRK calculation code to handle 896M, or just limit crashkernel for 32bit to 512M...
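
The two worst-case formulas quoted from the head_32.S comments can be evaluated directly to see how the estimate changes between 512M and 896M. A minimal sketch, assuming 4 KiB pages; the function names are made up for illustration and are not kernel symbols:

```c
#include <stdint.h>

#define PAGE_SHIFT_4K 12	/* assumption: 4 KiB pages */

/* Convert a byte count into a count of 4 KiB pages. */
uint64_t bytes_to_pages(uint64_t bytes)
{
	return bytes >> PAGE_SHIFT_4K;
}

/* Page-table pages needed to map `pages` pages, non-PAE: pages / 1024. */
uint64_t pt_pages_nonpae(uint64_t pages)
{
	return pages / 1024;
}

/* Page-table pages needed to map `pages` pages, PAE: pages / 512 + 4. */
uint64_t pt_pages_pae(uint64_t pages)
{
	return pages / 512 + 4;
}
```

For 512M of mapped memory this gives 128 pages of page tables without PAE and 260 with PAE; for 896M it gives 224 and 452. An estimate sized for 512M is therefore too small once the early mapping has to reach further up.
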

handle 896M one:

---
 arch/x86/boot/compressed/misc.c |    2 +-
 arch/x86/kernel/head_32.S       |    4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/boot/compressed/misc.c
===================================================================
--- linux-2.6.orig/arch/x86/boot/compressed/misc.c
+++ linux-2.6/arch/x86/boot/compressed/misc.c
@@ -365,7 +365,7 @@ asmlinkage void decompress_kernel(void *
 	if (heap > 0x3fffffffffffUL)
 		error("Destination address too large");
 #else
-	if (heap > ((-__PAGE_OFFSET-(512<<20)-1) & 0x7fffffff))
+	if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
 		error("Destination address too large");
 #endif
 #ifndef CONFIG_RELOCATABLE
Index: linux-2.6/arch/x86/kernel/head_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/head_32.S
+++ linux-2.6/arch/x86/kernel/head_32.S
@@ -68,8 +68,10 @@ MAPPING_BEYOND_END = \
  * Worst-case size of the kernel mapping we need to make:
 ...
From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 6:06 pm

Grmf... this was originally 4 GiB, but someone tried to tighten the
bound.  I think we should set it back to 4 GiB; 896 MiB is still
approximate.


	-hpa

--

From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 6:21 pm

Thinking about it, we probably should *both* fix the brk and limit the
crashkernel to 512 MiB (for compatibility with older crashkernels.)

	-hpa
--

From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 6:51 pm

Can whoever has a test case for this please test the attached test patch?

	-hpa

From: Yinghai Lu
Date: Thursday, December 16, 2010 - 8:05 pm

it works ...

with PAGE_OFFSET=0xc0000000

I'm in purgatory
bootconsole [uart0] enabled
Kernel Layout:
  .text: [0x2e000000-0x2e3f08ca]
.rodata: [0x2e3f2000-0x2e5a2fff]
  .data: [0x2e5a3000-0x2e5f6467]
  .init: [0x2e5f7000-0x2e670fff]
   .bss: [0x2e675000-0x2e76ffff]
   .brk: [0x2e770000-0x2e954fff]
    memblock_x86_reserve_range: [0x00001000-0x00001fff]    EX TRAMPOLINE
    memblock_x86_reserve_range: [0x2e000000-0x2e76ffff]    TEXT DATA BSS
    memblock_x86_reserve_range: [0x35c20000-0x35f49fff]          RAMDISK
    memblock_x86_reserve_range: [0x0009c800-0x000fffff]  * BIOS reserved
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.37-rc5-tip+ (root@mpk12-3214-189-181) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #9 SMP Thu Dec 16 08:46:56 PST 2010
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  NSC Geode by NSC
  Cyrix CyrixInstead
  Centaur CentaurHauls
  Transmeta GenuineTMx86
  Transmeta TransmetaCPU
  UMC UMC UMC UMC
BIOS-provided physical RAM map:
 BIOS-e820: [0x00000000000100-0x0000000009c7ff] (usable)
 BIOS-e820: [0x0000000009c800-0x0000000009ffff] (reserved)
 BIOS-e820: [0x000000000e0000-0x000000000fffff] (reserved)
 BIOS-e820: [0x00000000100000-0x0000007ff9ffff] (usable)
 BIOS-e820: [0x0000007ffae000-0x0000007ffaffff] (usable)
 BIOS-e820: [0x0000007ffb0000-0x0000007ffbdfff] (ACPI data)
 BIOS-e820: [0x0000007ffbe000-0x0000007ffeffff] (ACPI NVS)
 BIOS-e820: [0x0000007fff0000-0x0000007fffffff] (reserved)
 BIOS-e820: [0x000000e0000000-0x000000efffffff] (reserved)
 BIOS-e820: [0x000000fec00000-0x000000fec00fff] (reserved)
 BIOS-e820: [0x000000fee00000-0x000000feefffff] (reserved)
 BIOS-e820: [0x000000ff700000-0x000000ffffffff] (reserved)
last_pfn = 0x7ffb0 max_arch_pfn = 0x1000000
NX (Execute Disable) protection: active
user-defined physical RAM map:
 user: [0x00000000000000-0x0000000009ffff] (usable)
 user: [0x0000002e000000-0x00000035f59fff] (usable)
 user: [0x0000007ffb0000-0x0000007ffeffff] ...
From: Yinghai Lu
Date: Thursday, December 16, 2010 - 8:07 pm

with PAGE_OFFSET=0x40000000

I'm in purgatory
bootconsole [uart0] enabled
Kernel Layout:
  .text: [0x2f000000-0x2f3fbf4a]
.rodata: [0x2f3fe000-0x2f5b1fff]
  .data: [0x2f5b2000-0x2f60e067]
  .init: [0x2f60f000-0x2f690fff]
   .bss: [0x2f695000-0x2f796fff]
   .brk: [0x2f797000-0x2fdbafff]
    memblock_x86_reserve_range: [0x00001000-0x00001fff]    EX TRAMPOLINE
    memblock_x86_reserve_range: [0x2f000000-0x2f796fff]    TEXT DATA BSS
    memblock_x86_reserve_range: [0x36d8a000-0x36f49fff]          RAMDISK
    memblock_x86_reserve_range: [0x0009c800-0x000fffff]  * BIOS reserved
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.37-rc5-tip+ (root@mpk12-3214-189-181) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #11 SMP Thu Dec 16 10:49:27 PST 2010
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  NSC Geode by NSC
  Cyrix CyrixInstead
  Centaur CentaurHauls
  Transmeta GenuineTMx86
  Transmeta TransmetaCPU
  UMC UMC UMC UMC
BIOS-provided physical RAM map:
 BIOS-e820: [0x00000000000100-0x0000000009c7ff] (usable)
 BIOS-e820: [0x0000000009c800-0x0000000009ffff] (reserved)
 BIOS-e820: [0x000000000e0000-0x000000000fffff] (reserved)
 BIOS-e820: [0x00000000100000-0x0000007ff9ffff] (usable)
 BIOS-e820: [0x0000007ffae000-0x0000007ffaffff] (usable)
 BIOS-e820: [0x0000007ffb0000-0x0000007ffbdfff] (ACPI data)
 BIOS-e820: [0x0000007ffbe000-0x0000007ffeffff] (ACPI NVS)
 BIOS-e820: [0x0000007fff0000-0x0000007fffffff] (reserved)
 BIOS-e820: [0x000000e0000000-0x000000efffffff] (reserved)
 BIOS-e820: [0x000000fec00000-0x000000fec00fff] (reserved)
 BIOS-e820: [0x000000fee00000-0x000000feefffff] (reserved)
 BIOS-e820: [0x000000ff700000-0x000000ffffffff] (reserved)
last_pfn = 0x7ffb0 max_arch_pfn = 0x1000000
NX (Execute Disable) protection: active
user-defined physical RAM map:
 user: [0x00000000000000-0x0000000009ffff] (usable)
 user: [0x0000002f000000-0x00000036f59fff] (usable)
 user: [0x0000007ffb0000-0x0000007ffeffff] (ACPI data)
DMI ...
From: tip-bot for H. Peter Anvin
Date: Thursday, December 16, 2010 - 8:19 pm

Commit-ID:  147dd5610c8d1bacb88a6c1dfdaceaf257946ed0
Gitweb:     http://git.kernel.org/tip/147dd5610c8d1bacb88a6c1dfdaceaf257946ed0
Author:     H. Peter Anvin <hpa@linux.intel.com>
AuthorDate: Thu, 16 Dec 2010 19:11:09 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Thu, 16 Dec 2010 19:11:09 -0800

x86-32: Make sure we can map all of lowmem if we need to

A relocatable kernel can be anywhere in lowmem -- and in the case of a
kdump kernel, is likely to be fairly high.  Since the early page
tables map everything from address zero up we need to make sure we
allocate enough brk that we can map all of lowmem if we need to.

Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4D0AD3ED.8070607@kernel.org>
---
 arch/x86/boot/compressed/misc.c |    2 +-
 arch/x86/kernel/head_32.S       |   12 +++++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 23f315c..325c052 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -355,7 +355,7 @@ asmlinkage void decompress_kernel(void *rmode, memptr heap,
 	if (heap > 0x3fffffffffffUL)
 		error("Destination address too large");
 #else
-	if (heap > ((-__PAGE_OFFSET-(512<<20)-1) & 0x7fffffff))
+	if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
 		error("Destination address too large");
 #endif
 #ifndef CONFIG_RELOCATABLE
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index bcece91..d7cdf5b 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -60,16 +60,18 @@
 #define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
 #endif
 
+/* Number of possible pages in the lowmem region */
+LOWMEM_PAGES = (((1<<32) - __PAGE_OFFSET) >> PAGE_SHIFT)
+	
 /* Enough space to fit pagetables for the low memory linear map */
-MAPPING_BEYOND_END = ...
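
For reference, the LOWMEM_PAGES expression added in the hunk above can be evaluated in C for the two PAGE_OFFSET values tested in this thread. This is a sketch only, assuming PAGE_SHIFT is 12; the function name is hypothetical, and the truncated remainder of the diff is not reconstructed here:

```c
#include <stdint.h>

#define PAGE_SHIFT_4K 12	/* assumption: 4 KiB pages */

/*
 * C version of: (((1<<32) - __PAGE_OFFSET) >> PAGE_SHIFT)
 * A 64-bit intermediate is needed because 1<<32 does not fit in 32 bits.
 */
uint64_t lowmem_pages(uint32_t page_offset)
{
	return ((1ULL << 32) - page_offset) >> PAGE_SHIFT_4K;
}
```

PAGE_OFFSET=0xc0000000 gives 262144 pages (1 GiB of possible lowmem), while PAGE_OFFSET=0x40000000 gives 786432 pages (3 GiB). That would be consistent with the much larger .brk range in the second boot log above.
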
From: Stanislaw Gruszka
Date: Friday, December 17, 2010 - 7:33 am

Fix confirmed, thanks

Stanislaw
--

From: Vivek Goyal
Date: Thursday, December 16, 2010 - 3:01 pm

Yinghai, 

On my system the above change works fine and I can boot into the second kernel.
So it will boil down to knowing what the exact constraints on the heap are for
decompression, and whether for 32bit we can allow the heap up to 896MB or not.

Thanks
Vivek
--

From: Yinghai Lu
Date: Thursday, December 16, 2010 - 3:58 pm

really? what is your CONFIG_PAGE_OFFSET? 0x40000000 or 0xc0000000?

	Yinghai
--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 9:15 am

I am using CONFIG_PAGE_OFFSET=0xC0000000

Thanks
Vivek
--

From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 6:15 pm

By the way, 896 MiB is almost certainly too aggressive; the vmalloc area
is adjustable and there are other bits that can chew off a few MiB of
address space.  I would suggest we either make it 512 or 768 MiB *and*
fix the brk limit.

Opinions?

	-hpa
--

From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 8:31 pm

I'd like to apply a modified version of this patch (attached.)

Ack/nak, people?

	-hpa
From: Yinghai
Date: Thursday, December 16, 2010 - 8:58 pm

Please don't do that to 64 bit

My big system with 1024g of memory and a lot of cards, running rhel 6, must have crashkernel=512m for kdump to work, and the second kernel needs to take pci=nomsi

Thanks

--

From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 9:08 pm

Hm, this seems like an epic FAIL.

First of all, the current code still limits it to 896 MiB, so 512 MiB is
not a significant restriction.

Second, this patch only applies if "crashkernel=" is not specified, so
it doesn't even apply to your situation.

Third, if you have to specify "crashkernel=" that means that there is
yet another problem here that should be fixed.

	-hpa
--

From: Yinghai Lu
Date: Thursday, December 16, 2010 - 9:46 pm

current code:
        /* 0 means: find the address automatically */
        if (crash_base <= 0) {
                const unsigned long long alignment = 16<<20;    /* 16M */

                /*
                 *  kexec want bzImage is below DEFAULT_BZIMAGE_ADDR_MAX
                 */
                crash_base = memblock_find_in_range(alignment,
                               DEFAULT_BZIMAGE_ADDR_MAX, crash_size, alignment);

                if (crash_base == MEMBLOCK_ERROR) {
                        pr_info("crashkernel reservation failed - No suitable area found.\n");
                        return;
                }
        } else {
                unsigned long long start;

                start = memblock_find_in_range(crash_base,
                                 crash_base + crash_size, crash_size, 1<<20);
                if (start != crash_base) {
                        pr_info("crashkernel reservation failed - memory is in use.\n");
                        return;
                }
        }

first branch: it will take only crash_size.

no, every kdump needs to specify crashkernel=128M or more.

	Yinghai
--

From: H. Peter Anvin
Date: Thursday, December 16, 2010 - 10:16 pm

Oh, you're referring to crashkernel size.  Okay, this is somewhat
different.  However, the margin is just too small on 64 bits, then.  How
far up can we actually get away with on 64 bits currently?  4 GiB?

	-hpa

--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 10:01 am

I agree here that we should not do it for 64 bit.

- Just because we need it for 32 bit does not mean we should limit it for
  64bit. And we do want to have the capability to boot the kernel from as
  high in memory as possible, so creating another artificial limit runs
  counter to that.

- I would not worry too much about backward compatibility, and would allow
  booting a 32bit kernel up to 768MB. The reason is that most of the
  distros use the same kernel for crash dumping as the regular kernel.
  Maintaining two separate kernels is a big hassle.

  So the small set of people who run into an issue would need to change the
  kernel command line to "crashkernel=128M@64M" or something similar.

Thanks
--
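
The "crashkernel=128M@64M" form used above is a size, optionally followed by "@" and a base address. A rough sketch of how such a value could be split up in user space, assuming only the simple "size[@offset]" form with K/M/G suffixes; the function names are hypothetical and this is not the kernel's own parser:

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a number with an optional K/M/G suffix and advance *s past it. */
static uint64_t parse_size(const char **s)
{
	char *end;
	uint64_t v = strtoull(*s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	*s = end;
	return v;
}

/* Parse "size[@offset]". Returns 0 on success, -1 on malformed input. */
int parse_crashkernel(const char *arg, uint64_t *size, uint64_t *base)
{
	const char *p = arg;

	*size = parse_size(&p);
	*base = 0;	/* 0 means: find the address automatically */
	if (p == arg || *size == 0)
		return -1;
	if (*p == '@') {
		const char *q = ++p;

		*base = parse_size(&p);
		if (p == q)
			return -1;
	}
	return *p == '\0' ? 0 : -1;
}
```

For "128M@64M" this yields a size of 0x8000000 and a base of 0x4000000; for a plain "256M" the base stays 0, which corresponds to the automatic placement branch of reserve_crashkernel() shown earlier.
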

From: H. Peter Anvin
Date: Friday, December 17, 2010 - 10:56 am

Do we have actual testing for how high the 64-bit kernel will load?  I'm
assuming that the usage of a 32-bit kdump kernel for a 64-bit main
kernel is nonexistent.

	-hpa
--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 11:02 am

In the past I have run into 1-2 folks who were using a 32bit kdump kernel
on a 64bit main kernel. But again, for those the workaround is to use the
other crashkernel= syntax and explicitly specify where to reserve
memory.

Thanks
Vivek
--

From: Yinghai Lu
Date: Friday, December 17, 2010 - 11:21 am

if bzImage is used, it is 896M.

or crashkernel=... will take two ranges like one high and one low.

also kexec bzImage in 64bit should use startup_64 aka 0x200 offset instead of startup_32 in arch/x86/boot/compressed/head_64.S

then bzImage can be put above 4G...

Thanks

Yinghai
--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 11:35 am

Strangely, on my x86_64 systems with 37-rc6, I am trying to reserve memory and
nothing shows up in /proc/iomem. dmesg says that I am reserving 128M at 64M
but there is nothing in /proc/iomem. Going back to the .36 kernel to see what
happens.

Ok, last time we had looked that kexec-tools had constraint to load

Neil had been trying that but AFAIK he had no success. I don't know the
details, but he was struggling with setting up page tables in kexec for
64bit startup.

But yes, making use of 64bit entry point is in the wish list.

Thanks
Vivek
--

From: H. Peter Anvin
Date: Friday, December 17, 2010 - 12:39 pm

Why?  896 MiB is a 32-bit kernel limitation which doesn't have anything
to do with the bzImage format.

So unless there is something going on here, I suspect you're just plain
flat wrong.

	-hpa
--

From: Yinghai Lu
Date: Friday, December 17, 2010 - 12:46 pm

kexec-tools have some checking when it loads bzImage.

	Yinghai
--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 12:50 pm

Yinghai,

I think x86_64 might have just inherited the settings of 32bit without
giving it too much of thought. At that point of time nobody bothered
to load the kernel from high addresses. So these might be artificial
limits.

Thanks
Vivek
--

From: Yinghai Lu
Date: Friday, December 17, 2010 - 12:52 pm

good point.  will check that.
--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 1:01 pm

Yinghai,

On x86_64, I am not seeing "Crash kernel" entry in /proc/iomem.

I see following in dmesg.

"[    0.000000] Reserving 128MB of memory at 64MB for crashkernel (System
RAM: 5120MB)"

Following is my /proc/iomem.

# cat /proc/iomem 
00000100-0000ffff : reserved
00010000-00096fff : System RAM
00097000-0009ffff : reserved
000c0000-000e7fff : pnp 00:0f
000e8000-000fffff : reserved
00100000-bffc283f : System RAM
  01000000-015d1378 : Kernel code
  015d1379-01aee00f : Kernel data
  01bc8000-024b4c4f : Kernel bss
bffc2840-bfffffff : reserved

So there is RAM available at the requested address, but still no entry for
"Crash kernel". This is with both the 2.6.36 and the 37-rc6 kernel. I am
wondering if insert_resource() is failing here?
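
The check done by eye above can be sketched as a small C helper. This is only an illustration under stated assumptions: find_iomem_entry() is a hypothetical name, it works on a text buffer that the caller has already read from /proc/iomem, and it assumes each line has the "start-end : name" layout shown in the listing.

```c
#include <stdio.h>
#include <string.h>

/*
 * Search /proc/iomem-style text for an entry with the given name.
 * Returns 1 and fills *start / *end if found, 0 otherwise.
 * Hypothetical helper for illustration; not part of any tool.
 */
int find_iomem_entry(const char *text, const char *name,
		     unsigned long long *start, unsigned long long *end)
{
	const char *line = text;

	while (line && *line) {
		unsigned long long s, e;
		char label[64];

		/* Leading indentation (nested resources) is skipped. */
		if (sscanf(line, " %llx-%llx : %63[^\n]", &s, &e, label) == 3 &&
		    strcmp(label, name) == 0) {
			*start = s;
			*end = e;
			return 1;
		}

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}
```

For a 128MB reservation at 64MB, one would expect the entry to span 0x04000000-0x0bffffff; calling find_iomem_entry(buf, "Crash kernel", &s, &e) on the listing above returns 0, since no such line is present.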

Thanks
Vivek
--

From: Yinghai Lu
Date: Friday, December 17, 2010 - 1:06 pm

It could also be memblock_x86_reserve() failing ...

Please check attached debug patch...

Thanks

	Yinghai
--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 1:34 pm

Looks like memblock_x86_reserve() is fine. Following is the dmesg output with
your debug patches applied.


[    0.000000]     memblock_x86_reserve_range: [0x01000000-0x024bcb77]    TEXT DATA BSS
[    0.000000]     memblock_x86_reserve_range: [0x7fafb000-0x7fff3fff]          RAMDISK
[    0.000000]     memblock_x86_reserve_range: [0x00097000-0x000fffff]  * BIOS reserved
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.37-rc6+ (root@chilli.lab.bos.redhat.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #73 SMP Fri Dec 17 15:24:34 EST 2010
[    0.000000] Command line: ro root=/dev/mapper/vg_chilli-lv_root rd_LVM_LV=vg_chilli/lv_root rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us console=tty0, console=ttyS0,115200n8 selinux=0 crashkernel=128M@64M kexec_jump_back_entry=0x6148206465520a0f
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000100 - 0000000000097000 (usable)
[    0.000000]  BIOS-e820: 0000000000097000 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000bffc2840 (usable)
[    0.000000]  BIOS-e820: 00000000bffc2840 - 00000000c0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI 2.5 present.
[    0.000000] DMI: 0A9Ch/HP xw6600 Workstation, BIOS 786F4 v00.32 09/18/2007
[    0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
[    0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
[    0.000000] No AGP bridge found
[    0.000000] last_pfn = 0x140000 max_arch_pfn = 0x400000000
[    0.000000] ...
--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 4:51 pm

Hi Yinghai,

Please ignore this. The problem was in my setup: some user space
script was setting kexec_crash_size = 0, hence freeing up the memory. I think
it is time to print a kernel message when the memory is freed/shrunk. I wasted
a lot of time debugging it.
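
A quick way to rule this out before debugging the kernel side is to read the current reservation size back from user space. The sketch below assumes the value is exposed as a decimal byte count in /sys/kernel/kexec_crash_size; read_crash_size() is a hypothetical helper name, and the path is passed in as a parameter rather than hard-coded.

```c
#include <stdio.h>

/*
 * Read a decimal byte count from the given file.
 * Returns 0 and stores the value in *bytes on success, -1 on failure.
 * Intended for a file such as /sys/kernel/kexec_crash_size (assumed
 * path and format); hypothetical helper for illustration only.
 */
int read_crash_size(const char *path, unsigned long long *bytes)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	ok = (fscanf(f, "%llu", bytes) == 1);
	fclose(f);

	return ok ? 0 : -1;
}
```

If the call succeeds and the value is 0 although crashkernel= was given on the command line, something in user space has probably shrunk the reservation after boot.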

Sorry for the noise here.

thanks
Vivek
--

From: H. Peter Anvin
Date: Friday, December 17, 2010 - 12:56 pm

Can we do this in the meantime, just so we fix the immediate problem?

	-hpa
--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 1:11 pm

Peter, kexec-tools on 64bit currently seems to allow loading a bzImage
up to 896MB. So I am not too keen to reduce that to 768MB in the kernel:
x86_64 could be booted from even higher addresses, but somebody
first has to do some auditing and experiments.

IMHO, we should have a 768MB limit for 32bit and continue with the 896MB limit for
64bit, and once somebody makes x86_64 boot reliably from even higher addresses,
we can change both the kernel and kexec-tools.

Thanks
Vivek
--

From: H. Peter Anvin
Date: Friday, December 17, 2010 - 1:59 pm

If we're splitting by architectures anyway, why not leave 32 bits at 512
MiB and thus keep older crashkernels usable, just in case someone has a
frozen toolset?

	-hpa
--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 2:13 pm

If you are more comfortable with 512MB for i386, that's fine with me. I
care more for 64bit at this point of time.

Thanks
Vivek
--

From: Stanislaw Gruszka
Date: Monday, December 20, 2010 - 9:31 am

I'm not sure what is going on, but I can no longer reproduce the kdump problem
with -rc6 on my T-500 x86_64 system.

I tested the patch below together with the previous patch "x86-32: Make sure
we can map all of lowmem if we need to", and on both my laptops (i686 and
x86_64) the system boots and kdump works.


--

From: tip-bot for H. Peter Anvin
Date: Friday, December 17, 2010 - 9:34 pm

Commit-ID:  7f8595bfacef279f06c82ec98d420ef54f2537e0
Gitweb:     http://git.kernel.org/tip/7f8595bfacef279f06c82ec98d420ef54f2537e0
Author:     H. Peter Anvin <hpa@linux.intel.com>
AuthorDate: Thu, 16 Dec 2010 19:20:41 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 17 Dec 2010 15:04:00 -0800

x86, kexec: Limit the crashkernel address appropriately

Keep the crash kernel address below 512 MiB for 32 bits and 896 MiB
for 64 bits.  For 32 bits, this retains compatibility with earlier
kernel releases, and makes it work even if the vmalloc= setting is
adjusted.

For 64 bits, we should be able to increase this substantially once a
hard-coded limit in kexec-tools is fixed.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <20101217195035.GE14502@redhat.com>
---
 arch/x86/kernel/setup.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 21c6746..c9089a1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -501,7 +501,18 @@ static inline unsigned long long get_total_mem(void)
 	return total << PAGE_SHIFT;
 }
 
-#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
+/*
+ * Keep the crash kernel below this limit.  On 32 bits earlier kernels
+ * would limit the kernel to the low 512 MiB due to mapping restrictions.
+ * On 64 bits, kexec-tools currently limits us to 896 MiB; increase this
+ * limit once kexec-tools are fixed.
+ */
+#ifdef CONFIG_X86_32
+# define CRASH_KERNEL_ADDR_MAX	(512 << 20)
+#else
+# define CRASH_KERNEL_ADDR_MAX	(896 << 20)
+#endif
+
 static void __init reserve_crashkernel(void)
 {
 	unsigned long long total_mem;
@@ -520,10 +531,10 @@ static void __init reserve_crashkernel(void)
 		const unsigned long long alignment = 16<<20;	/* 16M */
 
 		/*
-		 *  kexec want bzImage is below ...
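
The limits in the patch above are plain shifts: 512 << 20 is 0x20000000 and 896 << 20 is 0x38000000, so the old DEFAULT_BZIMAGE_ADDR_MAX of 0x37FFFFFF was 896 MiB minus one. As a rough user-space sketch of what such a limit means for a requested reservation, the helper below checks whether a range fits below a limit. The names MiB(), CRASH_KERNEL_ADDR_MAX_32, CRASH_KERNEL_ADDR_MAX_64 and crash_range_fits() are made up for this example, and treating the limit as an exclusive upper bound is an assumption, not something taken from the kernel code.

```c
#include <stdbool.h>

/* Convert a count of MiB to bytes. */
#define MiB(x) ((unsigned long long)(x) << 20)

/* The two limits from the patch, renamed here so both can coexist. */
#define CRASH_KERNEL_ADDR_MAX_32 MiB(512)
#define CRASH_KERNEL_ADDR_MAX_64 MiB(896)

/*
 * Return true if [base, base + size) lies entirely below limit.
 * Written to avoid overflow in base + size.
 */
bool crash_range_fits(unsigned long long base, unsigned long long size,
		      unsigned long long limit)
{
	return size != 0 && base < limit && size <= limit - base;
}
```

Under these assumptions, a 128 MiB region at 64 MiB fits under both limits, while a 128 MiB region at 448 MiB fits only under the 64-bit one.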
From: H. Peter Anvin
Date: Friday, December 17, 2010 - 12:50 pm

So kexec-tools are broken?

	-hpa
--

From: Américo Wang
Date: Monday, December 13, 2010 - 3:25 am

Yinghai, is it possible to add the debug patch to upstream too? For debugging
future kdump issues like this.

Thanks.
--

From: Maciej Rutecki
Date: Sunday, December 5, 2010 - 7:35 am

I created a Bugzilla entry at 
https://bugzilla.kernel.org/show_bug.cgi?id=24372
for your bug report, please add your address to the CC list in there, thanks!

-- 
Maciej Rutecki
http://www.maciek.unixy.pl
--
