[PATCH 3/5] Support second memory region in crash_shrink_memory()

From: Vitaly Mayatskikh
Date: Thursday, April 22, 2010 - 9:23 am

Patch applies to 2.6.34-rc5

On x86, even if the hardware is 64-bit capable, the kernel starts
execution in 32-bit mode. On a kdump-enabled system, the crashed kernel
switches to 32-bit mode and jumps into the new kernel. This
automatically limits the location of the dump-capture kernel image and
its initrd to the first 4 GB of memory. The switch to 32-bit mode is
performed by purgatory code, which has relocations of type
R_X86_64_32S (32-bit signed), and this cuts the "good" address space
for the crash kernel down to 2 GB. I/O regions may reduce this space
further.
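
To make the 2 GB figure concrete: an R_X86_64_32S relocation stores a
32-bit value that is sign-extended to 64 bits, so the target address
must be unchanged by that round trip. The helper below is a minimal
user-space sketch of that check. The name fits_reloc_32s() is
hypothetical and is not taken from kexec-tools or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if addr can be the target of a 32-bit
 * signed relocation, i.e. sign-extending its low 32 bits gives addr
 * back. That holds for [0, 2^31 - 1] and for the top 2 GB of the
 * 64-bit address space.
 */
static bool fits_reloc_32s(uint64_t addr)
{
	return addr <= 0x7fffffffULL || addr >= 0xffffffff80000000ULL;
}
```

For physical load addresses only the lower range is relevant, which is
why anything at or above 0x80000000 (2 GB) is out of reach.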

When a system has a lot of memory (hundreds of gigabytes), the
dump-capture kernel also needs a relatively large amount of memory to
account for the old kernel's pages. It may be impossible to reserve
enough memory below 2 or even 4 GB. The simplest solution is to break
the dump-capture kernel's reserved memory region into two pieces: the
first (small) region, for the kernel and initrd images, can easily be
placed in the "good" address space at the beginning of physical
memory, and the second region may be located anywhere.

This series of patches implements this approach. It also requires
changes in the kexec utility to make the feature work, but it is
backward-compatible: old versions of kexec will work with the new
kernel. I will post the kexec-tools patch upstream separately.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>

 Documentation/kdump/kdump.txt       |   40 ++++++++
 Documentation/kernel-parameters.txt |   19 +++-
 arch/x86/kernel/setup.c             |   56 +++++++----
 include/linux/kexec.h               |    6 +
 kernel/kexec.c                      |  182 ++++++++++++++++++++++++++---------
 5 files changed, 232 insertions(+), 71 deletions(-)

--

From: Vitaly Mayatskikh
Date: Thursday, April 22, 2010 - 9:23 am

Currently the crash kernel uses only one memory region (described by
struct resource). When this region gets large enough, it may be a
problem to place it in a valid address range.

This patch introduces a second memory region, which may also be used
by the crash kernel. The first region may be small enough to hold only
the kernel and initrd images at low addresses, and the second region
may be placed almost anywhere.

The second memory resource has a different name so as not to confuse
existing userspace utilities, like kexec.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
---
 include/linux/kexec.h |    1 +
 kernel/kexec.c        |   11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 03e8e8d..1a3b0a3 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -198,6 +198,7 @@ extern struct kimage *kexec_crash_image;
 /* Location of a reserved region to hold the crash kernel.
  */
 extern struct resource crashk_res;
+extern struct resource crashk_res_hi;
 typedef u32 note_buf_t[KEXEC_NOTE_BYTES/4];
 extern note_buf_t __percpu *crash_notes;
 extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 87ebe8a..1bd0199 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -49,7 +49,7 @@ u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
 size_t vmcoreinfo_size;
 size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
 
-/* Location of the reserved area for the crash kernel */
+/* Location of the reserved area for the crash kernel in low memory */
 struct resource crashk_res = {
 	.name  = "Crash kernel",
 	.start = 0,
@@ -57,6 +57,14 @@ struct resource crashk_res = {
 	.flags = IORESOURCE_BUSY | IORESOURCE_MEM
 };
 
+/* Location of the reserved area for the crash kernel in high memory */
+struct resource crashk_res_hi = {
+	.name  = "Crash high memory",
+	.start = 0,
+	.end   = 0,
+	.flags = IORESOURCE_BUSY | IORESOURCE_MEM
+};
+
 int ...
From: Vitaly Mayatskikh
Date: Thursday, April 22, 2010 - 9:23 am

The crashkernel= syntax of the kernel command line is extended to
allow reservation of two memory regions for the dump-capture kernel.

The syntax for the simple case changes from

    crashkernel=size[@offset]

to

    crashkernel=<low>/<high>

Where <low> and <high> are memory regions for the dump-capture kernel
in the usual crashkernel format (size@offset).

The crashkernel syntax involving conditional reservation based on
memory size changes from

    crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]

to

    crashkernel=<range1>:<low_size1>[/<high_size1>]
                [,<range2>:<low_size2>[/<high_size2>],...]
                [@low_offset][/high_offset]

The new syntax is backward compatible.
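
As an illustration of the simple <low>/<high> form, here is a minimal
user-space sketch of a parser. It is not the code from this patch (the
patch body is truncated in this archive). parse_mem() is an assumed
stand-in for the kernel's memparse(), and all function names are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for memparse(): a number with an optional K/M/G suffix. */
static unsigned long long parse_mem(const char *s, const char **end)
{
	char *e;
	unsigned long long v = strtoull(s, &e, 0);
	unsigned int shift = 0;

	switch (*e) {
	case 'G': case 'g': shift = 30; break;
	case 'M': case 'm': shift = 20; break;
	case 'K': case 'k': shift = 10; break;
	default: break;
	}
	if (shift) {
		v <<= shift;
		e++;		/* consume the suffix */
	}
	*end = e;
	return v;
}

/* Parse one "size[@offset]" region and return a pointer past it. */
static const char *parse_region(const char *cur,
				unsigned long long *size,
				unsigned long long *base)
{
	*size = parse_mem(cur, &cur);
	*base = 0;		/* 0 means: pick the address automatically */
	if (*cur == '@')
		*base = parse_mem(cur + 1, &cur);
	return cur;
}

/* Parse "<low>[/<high>]". Returns 0 on success, -1 on trailing garbage. */
static int parse_two_regions(const char *arg,
			     unsigned long long *lo_size,
			     unsigned long long *lo_base,
			     unsigned long long *hi_size,
			     unsigned long long *hi_base)
{
	const char *cur = parse_region(arg, lo_size, lo_base);

	*hi_size = 0;
	*hi_base = 0;
	if (*cur == '/')
		cur = parse_region(cur + 1, hi_size, hi_base);
	return (*cur == '\0' || *cur == ' ') ? 0 : -1;
}
```

An argument without a '/' leaves the high region empty, which is how
the old size[@offset] form keeps working in this sketch.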

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
---
 include/linux/kexec.h |    5 ++
 kernel/kexec.c        |  116 +++++++++++++++++++++++++++++++++++++------------
 2 files changed, 93 insertions(+), 28 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 1a3b0a3..d2063f8 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -207,6 +207,11 @@ extern size_t vmcoreinfo_max_size;
 
 int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base);
+int __init parse_crashkernel_ext(char *cmdline, unsigned long long system_ram,
+				 unsigned long long *crash_size,
+				 unsigned long long *crash_base,
+				 unsigned long long *crash_size_hi,
+				 unsigned long long *crash_base_hi);
 int crash_shrink_memory(unsigned long new_size);
 size_t crash_get_memory_size(void);
 
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 1bd0199..b8fd6eb 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1229,23 +1229,42 @@ module_init(crash_notes_memory_init)
  */
 
 
+static char * __init parse_crashkernel_region(char		*cmdline,
+					     unsigned long long	*crash_size,
+					     unsigned long long	*crash_base)
+{
+	char *cur = cmdline;
+
+	*crash_size = ...
From: Vitaly Mayatskikh
Date: Thursday, April 22, 2010 - 9:23 am

This patch changes crash_shrink_memory() to also work with the
previously added memory region. When a shrink occurs, the second
region is shrunk first.
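
The "second region first" ordering can be sketched as a pure size
calculation. This is an assumption about how the total is split, since
the rest of the diff is truncated in this archive. Only the ordering
comes from the description above. Page rounding and resource release
are left out, and struct shrink_plan and plan_shrink() are hypothetical
names.

```c
#include <assert.h>

/* Sizes, in bytes, that each region should have after the shrink. */
struct shrink_plan {
	unsigned long low_size;
	unsigned long hi_size;
};

/*
 * Split a requested total size across the two regions, taking memory
 * away from the high region before touching the low one.
 * Returns 0 on success, -1 if new_size exceeds the current total
 * (this interface cannot grow the reservation).
 */
static int plan_shrink(unsigned long low_size, unsigned long hi_size,
		       unsigned long new_size, struct shrink_plan *plan)
{
	if (new_size > low_size + hi_size)
		return -1;

	if (new_size >= low_size) {
		/* The low region is untouched; only the high one shrinks. */
		plan->low_size = low_size;
		plan->hi_size = new_size - low_size;
	} else {
		/* The high region is freed completely, then the low one shrinks. */
		plan->low_size = new_size;
		plan->hi_size = 0;
	}
	return 0;
}
```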

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
---
 kernel/kexec.c |   55 ++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index b8fd6eb..dfaa01e 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1117,10 +1117,36 @@ static void free_reserved_phys_range(unsigned long begin, unsigned long end)
 	}
 }
 
+int crash_shrink_region(struct resource *crashk, unsigned long new_size)
+{
+	unsigned long start, end, size;
+
+	start = crashk->start;
+	end = crashk->end;
+	size = end - start + 1;
+
+	if (!size || new_size == size) /* Nothing to free */
+		return 0;
+
+	if (new_size > size)
+		return -EINVAL;
+
+	start = roundup(start, PAGE_SIZE);
+	end = roundup(start + new_size, PAGE_SIZE);
+
+	free_reserved_phys_range(end, crashk->end);
+
+	if (start == end)
+		release_resource(crashk);
+	crashk->end = end - 1;
+
+	return 0;
+}
+
 int crash_shrink_memory(unsigned long new_size)
 {
 	int ret = 0;
-	unsigned long start, end;
+	unsigned long crash_size, low_size;
 
 	mutex_lock(&kexec_mutex);
 
@@ -1128,26 +1154,25 @@ int crash_shrink_memory(unsigned long new_size)
 		ret = -ENOENT;
 		goto unlock;
 	}
-	start = crashk_res.start;
-	end = crashk_res.end;
 
-	if (new_size >= end - start + 1) {
+	crash_size = low_size = crashk_res.end - crashk_res.start + 1;
+	crash_size += crashk_res_hi.end - crashk_res_hi.start + 1;
+
+	if (crash_size == new_size)
+		goto unlock;
+	if (crash_size < new_size) {
 		ret = -EINVAL;
-		if (new_size == end - start + 1)
-			ret = 0;
 		goto unlock;
 	}
 
-	start = roundup(start, PAGE_SIZE);
-	end = roundup(start + new_size, PAGE_SIZE);
-
-	free_reserved_phys_range(end, crashk_res.end);
-
-	if (start == end) {
-		crashk_res.end = ...
From: Vitaly Mayatskikh
Date: Thursday, April 22, 2010 - 9:23 am

Mention the new crashkernel= syntax in the documentation.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
---
 Documentation/kdump/kdump.txt       |   40 +++++++++++++++++++++++++++++++++++
 Documentation/kernel-parameters.txt |   19 +++++++++++-----
 2 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index cab61d8..9f93d17 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -266,7 +266,47 @@ This would mean:
     2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
     3) if the RAM size is larger than 2G, then reserve 128M
 
+Avoiding memory reservation problem on large systems
+====================================================
 
+For large systems with huge amount of memory dump-capture kernel
+requires more memory to handle properly old kernel's pages. However,
+it raises issues with h/w-dependent limitations on some platforms. For
+example, on x86-64 system kernel and initrd still have to be placed in
+first 2 gigabytes, because kernel starts executing in 32-bit mode, and
+kdump purgatory code can jump only to 32-bit signed addresses. This
+limitation is a real problem in cases, when dump-capturing region is
+large and cannot fit in good area. For such cases it's possible to use
+special crashkernel syntax:
+
+    crashkernel=<low>/<high>
+
+<low> and <high> are memory regions for dump-capture kernel in usual
+crashkernel format (size@offset). For example:
+
+    crashkernel=64M/1G@4G
+
+This would mean to allocate 64M of memory at the lowest valid address
+and to allocate 1G at physical address 4G.
+
+New syntax for extended format (in case of memory dependent
+reservation):
+
+    crashkernel=<range1>:<low_size1>[/<high_size1>]
+                [,<range2>:<low_size2>[/high_size2],...]
+                [@low_offset][/high_offset]
+    range=start-[end]
+
+For example:
+
+    crashkernel=2G-32G:256M,32G-:256M/1G@0/8G
+
+This ...
From: Vitaly Mayatskikh
Date: Thursday, April 22, 2010 - 9:23 am

This patch adds support for the second memory region to kexec on the
x86 platform.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
---
 arch/x86/kernel/setup.c |   56 +++++++++++++++++++++++++++++-----------------
 1 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index c4851ef..9b395bb 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -501,19 +501,11 @@ static inline unsigned long long get_total_mem(void)
 	return total << PAGE_SHIFT;
 }
 
-static void __init reserve_crashkernel(void)
+static int __init reserve_crashkernel_region(char *region_name,
+					     struct resource *crashk,
+					     unsigned long long crash_size,
+					     unsigned long long crash_base)
 {
-	unsigned long long total_mem;
-	unsigned long long crash_size, crash_base;
-	int ret;
-
-	total_mem = get_total_mem();
-
-	ret = parse_crashkernel(boot_command_line, total_mem,
-			&crash_size, &crash_base);
-	if (ret != 0 || crash_size <= 0)
-		return;
-
 	/* 0 means: find the address automatically */
 	if (crash_base <= 0) {
 		const unsigned long long alignment = 16<<20;	/* 16M */
@@ -522,7 +514,7 @@ static void __init reserve_crashkernel(void)
 				 alignment);
 		if (crash_base == -1ULL) {
 			pr_info("crashkernel reservation failed - No suitable area found.\n");
-			return;
+			return -EINVAL;
 		}
 	} else {
 		unsigned long long start;
@@ -531,20 +523,42 @@ static void __init reserve_crashkernel(void)
 				 1<<20);
 		if (start != crash_base) {
 			pr_info("crashkernel reservation failed - memory is in use.\n");
-			return;
+			return -EINVAL;
 		}
 	}
-	reserve_early(crash_base, crash_base + crash_size, "CRASH KERNEL");
+	reserve_early(crash_base, crash_base + crash_size, region_name);
 
 	printk(KERN_INFO "Reserving %ldMB of memory at %ldMB "
-			"for crashkernel (System RAM: %ldMB)\n",
+			"for crashkernel\n",
 			(unsigned long)(crash_size >> 20),
-			(unsigned long)(crash_base >> ...
From: Eric W. Biederman
Date: Thursday, April 22, 2010 - 3:07 pm

Have you tried loading a 64bit vmlinux directly into a higher address
range?  There may be a bit or two missing but you should be able to
load a linux kernel above 4GB.  I tested the basics of that mechanism
when I made the 64bit relocatable kernel.

I don't buy the argument that there is a direct connection between
the amount of memory you have and how much memory it takes to dump it.
Even an indirect connection seems suspicious.

Eric
--

From: H. Peter Anvin
Date: Thursday, April 22, 2010 - 3:37 pm

We actually have a 64-bit entry point even in bzImage; it is at offset
+0x200 from the 32-bit entry point.  Right now that offset is not
exported anywhere, but it has been stable for a very long time... at
least for as far back as the decompressor has been 64 bits.

The interface to the 64-bit code is by necessity wider, since there is
no such thing as paging off in 64-bit mode, but it probably isn't *too*
hard to figure out how page tables need to be set up in order to work
properly.  At that point, it would be good to document it.

	-hpa
--

From: Vivek Goyal
Date: Thursday, April 22, 2010 - 3:45 pm

I guess even if it works, for distributions it will become an additional
liability to carry vmlinux (instead of a relocatable bzImage). So we shall

The memory requirement of user space might be of interest, though, for
example dump filtering tools. I vaguely remember that they used to first
traverse all the memory pages, create some internal data structures and
then start dumping.

So the memory required by a filtering tool might be directly proportional
to the amount of memory present in the system.

Vitaly, have you really run into cases where the 2G upper limit is a
concern? What configuration do you have, how much memory does it have,
and how much memory are you planning to reserve for the kdump kernel?

Thanks
Vivek
--

From: Eric W. Biederman
Date: Thursday, April 22, 2010 - 5:48 pm

As Peter pointed out, we actually have everything we need except
a bit of documentation and the flag that says this is a 64bit kernel.

From a testing perspective a 64bit vmlinux should work today without
changes.  Once it is confirmed there is a solution with the 64bit
kernel we just need a small patch to boot.txt and a few tweaks to 

Assuming your dump filtering tool creates a bitmap of pages to be dumped
you get a ratio of 32K to 1.  Or 3MB for 100GB and 32MB for 1TB.
Which is noticeable in the worst case but definitely not enough to push
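
The 32K-to-1 ratio follows from one bit per page: with 4 KiB pages
(assumed here, the usual x86 page size), one bitmap byte covers
8 * 4096 = 32768 bytes of RAM. A small sketch of the arithmetic, using
the hypothetical helper name dump_bitmap_bytes():

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a one-bit-per-page bitmap, assuming 4 KiB pages. */
static uint64_t dump_bitmap_bytes(uint64_t ram_bytes)
{
	const uint64_t page_size = 4096;
	uint64_t pages = (ram_bytes + page_size - 1) / page_size;

	return (pages + 7) / 8;	/* round up to whole bytes */
}
```

For 100 GB this gives 3276800 bytes (about 3 MB), and for 1 TB exactly
32 MB, matching the figures above.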

A good question.

Eric
--

From: Cong Wang
Date: Thursday, April 22, 2010 - 10:21 pm

We have observed that on a machine with 66G of memory, when we use
crashkernel=1G@4G, kexec fails to load the crash kernel, although the
memory reservation _did_ succeed.

Thanks.
--

From: Eric W. Biederman
Date: Thursday, April 22, 2010 - 10:42 pm

Did you try loading vmlinux?  If not, this sounds like /sbin/kexec
not realizing that it can boot a 64bit bzImage in 64bit mode.

Eric

--

From: Vitaly Mayatskikh
Date: Thursday, April 22, 2010 - 11:43 pm

/sbin/kexec currently has hardcoded limitations for bzImage and
initrd:

include/x86/x86-linux.h:

#define DEFAULT_INITRD_ADDR_MAX 0x37FFFFFF
#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF

This is easy to override. However, the purgatory code still wants to
see the kernel below 2 GB (32-bit signed relocations).
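
For reference, 0x37FFFFFF is 896 MB minus one byte. The sketch below
shows the kind of range check such a limit implies. The two macros are
the ones quoted above, while range_below_max() is a hypothetical helper
and not a function from kexec-tools.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEFAULT_INITRD_ADDR_MAX 0x37FFFFFF
#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF

/*
 * Hypothetical helper: true if the non-empty range
 * [start, start + size - 1] ends at or below addr_max.
 * Written to avoid overflow in start + size.
 */
static bool range_below_max(uint64_t start, uint64_t size, uint64_t addr_max)
{
	return size != 0 && start <= addr_max && size - 1 <= addr_max - start;
}
```

With these limits a 64M region at 16M passes, while a region reserved
with crashkernel=1G@4G starts far above the maximum and fails the
check.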
-- 
wbr, Vitaly
--

From: Vivek Goyal
Date: Friday, April 23, 2010 - 7:44 am

Agreed. Doing a little more testing, fixing some issues if need be, and
making the 64-bit bzImage work is a better way than splitting the
reserved memory.
 
Thanks
Vivek
--

From: Vitaly Mayatskikh
Date: Friday, April 23, 2010 - 12:08 am

I tried it on a system with 96G of RAM. When I reserved 512M for the
kdump kernel, the system stopped loading somewhere in user space. With
a larger reserved area, /sbin/kexec can't load the kernel (because of
the hardcoded limitation in /sbin/kexec). After removing this
limitation, the kernel was loaded below 2G, but the system did not
even boot.

Unfortunately, I don't remember the exact details now and temporarily
have no access to that machine. I will try to get access and come back
with details.
-- 
wbr, Vitaly
--
