* PATCHSET INTRODUCTION
patch 1: Documentation.
patch 2: Adds a numa=possible=<N> command line option to set an additional N nodes
as being possible for memory hotplug.
patch 3: Add node hotplug emulation, introduce debugfs node/add_node interface
patch 4: Abstract cpu register functions, make these interfaces friendly to cpu
hotplug emulation
patch 5: Support cpu probe/release in x86; it provides a software method to hot
add/remove cpus through a sysfs interface.
patch 6: Fake CPU socket with logical CPU on x86, to prevent the scheduling
domain from building an incorrect hierarchy.
patch 7: Implement per-node add_memory debugfs interface
* FEEDBACKS & RESPONSES
v9:
Solve the bug reported by Eric B Munson: check the return value of cpu_down when doing
CPU release.
Solve the conflicts with Tejun Heo's NUMA unification code; re-work patch 5 based on his
patch.
Some small changes on debugfs per-node add_memory interface.
v8:
Reconsider David's proposal, accept the per-node add_memory interface on debugfs.
(p7).
v7:
David: We don't need two different interfaces, one in sysfs and one in debugfs,
to hotplug memory.
Response: We use debugfs for memory hotplug emulation only. We made no
modifications to the sysfs memory probe interface, so we removed the original
patch 7 from the patchset.
David: Suggest new probe files in debugfs for each online node:
/sys/kernel/debug/mem_hotplug/add_node (already exists)
/sys/kernel/debug/mem_hotplug/node0/add_memory
/sys/kernel/debug/mem_hotplug/node1/add_memory
Response: We need not make a simple thing so complicated. We'd prefer to
rename the mem_hotplug/probe interface as mem_hotplug/add_memory.
/sys/kernel/debug/mem_hotplug/add_node (already exists)
/sys/kernel/debug/mem_hotplug/add_memory (rename probe as add_memory)
v6:
Greg KH: Suggest to use interface mem_hotplug/add_node
David: Agree with Greg's suggestion
Response: We move the interface ...
From: Shaohui Zheng <shaohui.zheng@intel.com>
add a text file Documentation/x86/x86_64/numa_hotplug_emulator.txt
to explain the usage for the hotplug emulator.
Reviewed-By: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt 2010-12-10 13:43:19.573331001 +0800
@@ -0,0 +1,97 @@
+NUMA Hotplug Emulator for x86_64
+---------------------------------------------------
+
+NUMA hotplug emulator is able to emulate NUMA Node Hotplug
+thru a pure software way. It intends to help people easily debug
+and test node/CPU/memory hotplug related stuff on a
+none-NUMA-hotplug-support machine, even a UMA machine and virtual
+environment.
+
+1) Node hotplug emulation:
+
+Adds a numa=possible=<N> command line option to set an additional N nodes
+as being possible for memory hotplug. This set of possible nodes
+control nr_node_ids and the sizes of several dynamically allocated node
+arrays.
+
+This allows memory hotplug to create new nodes for newly added memory
+rather than binding it to existing nodes.
+
+For emulation on x86, it would be possible to set aside memory for hotplugged
+nodes (say, anything above 2G) and to add an additional four nodes as being
+possible on boot with
+
+ mem=2G numa=possible=4
+
+and then creating a new 128M node at runtime:
+
+ # echo 128M@0x80000000 > /sys/kernel/debug/mem_hotplug/add_node
+ On node 1 totalpages: 0
+ init_memory_mapping: 0000000080000000-0000000088000000
+ 0080000000 - 0088000000 page 2M
+
+Once the new node has been added, its memory can be onlined. If this
+memory represents memory section 16, for example:
+
+ # echo online > ...
From: Shaohui Zheng <shaohui.zheng@intel.com>
Abstract the cpu register functions and provide a more flexible interface,
register_cpu_node. The new interface makes it convenient to add a cpu
to a specified node; we can use it to add a cpu to a fake node.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/arch/x86/include/asm/cpu.h
===================================================================
--- linux-hpe4.orig/arch/x86/include/asm/cpu.h 2010-11-17 09:00:59.742608402 +0800
+++ linux-hpe4/arch/x86/include/asm/cpu.h 2010-11-17 09:01:10.192838977 +0800
@@ -27,6 +27,7 @@
#ifdef CONFIG_HOTPLUG_CPU
extern int arch_register_cpu(int num);
+extern int arch_register_cpu_node(int num, int nid);
extern void arch_unregister_cpu(int);
#endif
Index: linux-hpe4/arch/x86/kernel/topology.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/topology.c 2010-11-17 09:01:01.053461766 +0800
+++ linux-hpe4/arch/x86/kernel/topology.c 2010-11-17 10:05:32.934085248 +0800
@@ -52,6 +52,15 @@
}
EXPORT_SYMBOL(arch_register_cpu);
+int __ref arch_register_cpu_node(int num, int nid)
+{
+ if (num)
+ per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
+
+ return register_cpu_node(&per_cpu(cpu_devices, num).cpu, num, nid);
+}
+EXPORT_SYMBOL(arch_register_cpu_node);
+
void arch_unregister_cpu(int num)
{
unregister_cpu(&per_cpu(cpu_devices, num).cpu);
Index: linux-hpe4/drivers/base/cpu.c
===================================================================
--- linux-hpe4.orig/drivers/base/cpu.c 2010-11-17 09:01:01.053461766 +0800
+++ linux-hpe4/drivers/base/cpu.c 2010-11-17 10:05:32.943465010 +0800
@@ -208,17 +208,18 @@
static SYSDEV_CLASS_ATTR(offline, 0444, print_cpus_offline, NULL);
/*
- * register_cpu - Setup a sysfs device for a CPU.
+ * register_cpu_node - Setup a sysfs device for a CPU.
* @cpu - cpu->hotpluggable field set to 1 will generate a ...
From: David Rientjes <rientjes@google.com>
Adds a numa=possible=<N> command line option to set an additional N nodes
as being possible for memory hotplug. This set of possible nodes
controls nr_node_ids and the sizes of several dynamically allocated node
arrays.
This allows memory hotplug to create new nodes for newly added memory
rather than binding it to existing nodes.
The first use-case for this will be node hotplug emulation which will use
these possible nodes to create new nodes to test the memory hotplug
callbacks and surrounding memory hotplug code.
CC: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Documentation/x86/x86_64/boot-options.txt | 4 ++++
arch/x86/mm/numa_64.c | 18 +++++++++++++++---
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -174,6 +174,10 @@ NUMA
If given as an integer, fills all system RAM with N fake nodes
interleaved over physical nodes.
+ numa=possible=<N>
+ Sets an additional N nodes as being possible for memory
+ hotplug.
+
ACPI
acpi=off Don't enable ACPI
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -33,6 +33,7 @@ s16 apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
int numa_off __initdata;
static unsigned long __initdata nodemap_addr;
static unsigned long __initdata nodemap_size;
+static unsigned long __initdata numa_possible_nodes;
/*
* Map cpu index to node index
@@ -611,7 +612,7 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
#ifdef CONFIG_NUMA_EMU
if (cmdline && !numa_emulation(start_pfn, last_pfn, acpi, k8))
- return;
+ goto out;
...
On Fri, 10 Dec 2010 15:31:21 +0800
hm, I didn't know you could do that with labels.
--
Yeah, it's equivalent to __attribute__((unused)) and according to the gcc
manual section 6.30:
 In GNU C, an attribute specifier list may appear after the colon
 following a label, other than a case or default label. The only
 attribute it makes sense to use after a label is unused. This feature
 is intended for code generated by programs which contains labels that
 may be unused but which is compiled with ‘-Wall’. It would not
 normally be appropriate to use it in human-written code, though it
 could be useful in cases where the code that jumps to the label is
 contained within an #ifdef conditional.
I used it because I knew I wouldn't get away with putting a label inside
> > unsigned long __init numa_free_all_bootmem(void)
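The label attribute discussed above can be illustrated with a small userspace sketch. This is not the code from the patch: demo_init() and DEMO_NUMA_EMU are hypothetical names standing in for initmem_init() and CONFIG_NUMA_EMU, and the sketch assumes GCC or a compatible compiler. The only jump to the label sits inside a conditional block, so the label is marked unused to keep -Wall quiet when that block is compiled out.

```c
/* Hypothetical stand-in for CONFIG_NUMA_EMU; define as 0 to compile the jump out. */
#ifndef DEMO_NUMA_EMU
#define DEMO_NUMA_EMU 1
#endif

/*
 * Hypothetical function, not the kernel's initmem_init(). It returns a
 * step counter so that the path taken can be observed by the caller.
 */
static int demo_init(int emulation_ok)
{
	int steps = 0;

#if DEMO_NUMA_EMU
	/* The only jump to the label lives inside the conditional block. */
	if (emulation_ok)
		goto out;
#endif
	(void)emulation_ok;	/* avoid an unused-parameter warning when compiled out */
	steps += 10;		/* stands in for the non-emulated setup path */

out: __attribute__((unused));	/* the semicolon gives the label an empty statement */
	steps += 1;		/* shared tail, runs on both paths */
	return steps;
}
```

With DEMO_NUMA_EMU set to 1, demo_init(1) skips the setup path and returns 1, while demo_init(0) returns 11. With DEMO_NUMA_EMU set to 0 the goto disappears, and the attribute is what keeps the now-unreferenced label from producing a warning.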
From: Shaohui Zheng <shaohui.zheng@intel.com>
Add an add_memory interface under debugfs to support memory hotplug emulation
for each online node. Reserved memory can be added to the desired node through
this interface.
The layout on debugfs:
mem_hotplug/node0/add_memory
mem_hotplug/node1/add_memory
mem_hotplug/node2/add_memory
...
Add a memory section (128M) to node 3 (booted with mem=1024m):
echo 0x40000000 > mem_hotplug/node3/add_memory
CC: David Rientjes <rientjes@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/mm/memory_hotplug.c
===================================================================
--- linux-hpe4.orig/mm/memory_hotplug.c 2010-12-10 13:22:44.753331000 +0800
+++ linux-hpe4/mm/memory_hotplug.c 2010-12-10 13:41:48.803331000 +0800
@@ -933,6 +933,81 @@
static struct dentry *memhp_debug_root;
+#ifdef CONFIG_ARCH_MEMORY_PROBE
+
+static ssize_t add_memory_store(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 phys_addr = 0;
+ int nid = file->private_data - NULL;
+ int ret;
+
+ printk(KERN_INFO "Add a memory section to node: %d.\n", nid);
+ phys_addr = simple_strtoull(buf, NULL, 0);
+
+ ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+ if (ret)
+ count = ret;
+
+ return count;
+}
+
+static int add_memory_open(struct inode *inode, struct file *file)
+{
+ file->private_data = inode->i_private;
+ return 0;
+}
+
+static const struct file_operations add_memory_file_ops = {
+ .open = add_memory_open,
+ .write = add_memory_store,
+ .llseek = generic_file_llseek,
+};
+
+/*
+ * Create add_memory debugfs entry under specified node
+ */
+static int debugfs_create_add_memory_entry(int nid)
+{
+ char buf[32];
+ static struct dentry *node_debug_root;
+
+ snprintf(buf, sizeof(buf), "node%d", nid);
+ node_debug_root = debugfs_create_dir(buf, ...
On Fri, 10 Dec 2010 15:31:26 +0800
Even more unneeded initialisation. Please check the whole patchset for this.
It's bad because it can sometimes generate more code and because it can
sometimes hide bugs by
Well that was sneaky. It would be more conventional to just use the
typecast:
Was this usage of i_private and private_data documented in comments
hm, debugfs_create_dir() was poorly designed - it should return an
--
Yes, it is my habit to initialize a variable when defining it. I will check them
We ignored the warning for function simple_strtoull in the whole patchset.
Yes, I added the usage information when creating the add_memory entry; it seems
that I should also add a comment here.
/* the nid information was represented by the offset of pointer(NULL+nid) */
if (!debugfs_create_file("add_memory", S_IWUSR, node_debug_root,
Totally agree. I see the similar call on debugfs_create_dir. For the failure,
--
Thanks & Regards,
Shaohui
--
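The typecast that the review calls more conventional can be sketched in userspace as follows. The helper names nid_to_cookie() and cookie_to_nid() are hypothetical and do not appear in the patch; the sketch uses intptr_t where kernel code would more likely use a plain (long) cast. It only shows the round trip of a small integer through a void pointer, which is what storing the node id in i_private and reading it back from private_data amounts to.

```c
#include <stdint.h>

/*
 * Hypothetical helpers: carry a node id inside a void pointer, as the
 * patch does with inode->i_private, but using explicit casts instead of
 * the "pointer - NULL" arithmetic.
 */
static void *nid_to_cookie(int nid)
{
	return (void *)(intptr_t)nid;
}

static int cookie_to_nid(void *cookie)
{
	return (int)(intptr_t)cookie;
}
```

A caller would pass nid_to_cookie(nid) when creating the per-node file and recover the id with cookie_to_nid() in the write handler. Node 0 maps to a null pointer, so the cookie must never be tested for NULL as a validity check.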
From: Shaohui Zheng <shaohui.zheng@intel.com>
When hotplugging a CPU with the emulator, we use a logical CPU to emulate the
CPU hotplug process. On CPUs that support SMT, some logical CPUs share a
socket, but with the emulator they may be located in different NUMA nodes.
This misleads the scheduling domain into building an incorrect hierarchy, and
it causes the following call trace when the scheduling domains are rebalanced:
divide error: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu8/online
CPU 0
Modules linked in: fbcon tileblit font bitblit softcursor radeon ttm drm_kms_helper e1000e usbhid via_rhine mii drm i2c_algo_bit igb dca
Pid: 0, comm: swapper Not tainted 2.6.32hpe #78 X8DTN
RIP: 0010:[<ffffffff81051da5>] [<ffffffff81051da5>] find_busiest_group+0x6c5/0xa10
RSP: 0018:ffff880028203c30 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000015ac0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff880277e8cfa0 RDI: 0000000000000000
RBP: ffff880028203dc0 R08: ffff880277e8cfa0 R09: 0000000000000040
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f16cfc85770 CR3: 0000000001001000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81822000, task ffffffff8184a600)
Stack:
ffff880028203d60 ffff880028203cd0 ffff8801c204ff08 ffff880028203e38
<0> 0101ffff81018c59 ffff880028203e44 00000001810806bd ffff8801c204fe00
<0> 0000000528200000 ffffffff00000000 0000000000000018 0000000000015ac0
Call Trace:
<IRQ>
[<ffffffff81088ee0>] ? tick_dev_program_event+0x40/0xd0
[<ffffffff81053b2c>] rebalance_domains+0x17c/0x570
[<ffffffff81018c89>] ? read_tsc+0x9/0x20
[<ffffffff81088ee0>] ? ...
On Fri, 10 Dec 2010 15:31:25 +0800
Unneeded initialisation.
Does this cause an unused var warning when
--
I am trying to avoid too many ifdefs here; it seems it does trigger an unused var
warning when CONFIG_ARCH_CPU_PROBE_RELEASE=n. Good catch.
Agree, the comment is too simple; I should add better documentation for function
fake_cpu_socket_info.
--
Thanks & Regards,
Shaohui
--
From: Shaohui Zheng <shaohui.zheng@intel.com>
CPU physical hot-add/hot-remove is supported on some hardware, and it is
already supported in the current Linux kernel. The NUMA Hotplug Emulator
provides a mechanism to emulate the process in software. It can be used for
testing or debugging purposes.
CPU physical hotplug is different from logical CPU online/offline. Logical
online/offline is controlled by the interface /sys/devices/system/cpu/cpuX/online.
The CPU hotplug emulator uses the probe/release interface. It becomes possible to
do cpu hotplug automation and stress testing.
Add cpu interface probe/release under sysfs for x86_64. User can use this
interface to emulate the cpu hot-add and hot-remove process.
Directive:
*) Reserve CPU thru grub parameter like:
 maxcpus=4
 the rest CPUs will not be initialized.
*) Probe CPU
 we can use the probe interface to hot-add new CPUs:
 echo nid > /sys/devices/system/cpu/probe
*) Release a CPU
 echo cpu > /sys/devices/system/cpu/release
A reserved CPU will be hot-added to the specified node.
1) nid == 0, the CPU will be added to the real node which the CPU should be in
2) nid != 0, add the CPU to node nid even though it is a fake node.
CC: Ingo Molnar <mingo@elte.hu>
CC: Len Brown <len.brown@intel.com>
CC: Yinghai Lu <Yinghai.Lu@Sun.COM>
CC: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
---
This patch is based on Tejun's unification of the 32 and 64 bit NUMA boot
paths, specifically the patch at http://marc.info/?l=linux-kernel&m=129087151912379.
Index: linux-hpe4/arch/x86/kernel/acpi/boot.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/acpi/boot.c 2010-12-10 13:42:34.553331000 +0800
+++ linux-hpe4/arch/x86/kernel/acpi/boot.c 2010-12-10 14:48:32.113331001 +0800
@@ -668,8 +668,39 @@
}
EXPORT_SYMBOL(acpi_map_lsapic);
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+static void acpi_map_cpu2node_emu(int ...
Shaohui, What kernel is this series based on? I cannot get it to build when applied to mainline. I seem to be missing a definition for set_apicid_to_node. Eric
Eric,
There is a code conflict with Tejun's NUMA unification code, and Tejun's code
is still under review. This patchset resolves the conflict: the v9 emulator is
based on his patches, so we need to wait until his patches are accepted.
Tejun's patch: http://marc.info/?l=linux-kernel&m=129087151912379.
If you are doing some testing, you can try the v8 emulator.
--
Thanks & Regards,
Shaohui
--
On Fri, 10 Dec 2010 15:31:24 +0800
One definition per line makes for more maintainable code.
s/num/cpu/ would be conventional. "num" is a pretty poor identifier in
arch_cpu_probe() is global and exported to modules, but is undocumented.
If it had been documented, I might have been able to work out why arg
It's generally better to make kernel messages self-identifying.
Especially error messages. If someone comes along and sees "can not
release cpu 0" in their logs, they don't have a clue what caused it
Something like "When cpu hotplug emulation is enabled, register only
s/emulation, the/emulation. The/
--
Agree, I will put them on 2 lines and remove the initialisations.
It is a warning, so I ignored it.
Sorry, Andrew, I did not catch it. Do you mean to add the document before
Sorry, it is the same as with function arch_cpu_probe: I did not catch the
problem. Should I add documentation before the definition or the declaration? Or
--
Thanks & Regards,
Shaohui
--
Don't ignore warnings! At least, not until you've understood the reason
for them and have a *reason* to ignore them.
simple_strtoul() will silently accept input of the form "42foo",
treating it as "42". That's a userspace bug and the kernel should
report it. This means that the code should be changed to handle error
Sure, add a comment documenting the function.
Better, although "arch_cpu_release" isn't very meaningful to an
administrator. "NUMA hotplug remove" or something like that would be
more useful. All these messages should be looked at from the point of
view of the people who they are to serve.
Although in this special case, that's most likely to be a kernel
developer so I guess such clarity isn't needed.
--
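The "42foo" problem described above can be shown with a userspace sketch built on the standard strtoull(). The helper name parse_u64_strict() is hypothetical and is not part of the patchset; it only illustrates the kind of check the review asks for, namely rejecting input with trailing garbage instead of silently truncating it. Later kernels provide kstrtoull() for this purpose.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace helper: parse an unsigned 64-bit value and
 * reject anything strtoull() alone would silently truncate, such as
 * "42foo". Returns 0 on success, -EINVAL on malformed input and
 * -ERANGE on overflow.
 */
static int parse_u64_strict(const char *s, unsigned long long *out)
{
	char *end;
	unsigned long long val;

	/* Require a leading digit: rejects "", "-1", "+1" and leading blanks. */
	if (s == NULL || !isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, 0);	/* base 0: decimal, 0x hex or 0 octal */
	if (errno == ERANGE)
		return -ERANGE;
	if (*end == '\n')
		end++;			/* tolerate the newline that echo appends */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage, e.g. "42foo" */

	*out = val;
	return 0;
}
```

With this check, "0x40000000\n" parses to 0x40000000, while "42foo" returns -EINVAL so the error can be reported back to the writer.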
It is a tricky thing. When I debug it under a virtual machine and do a
cpu probe via the sysfs cpu/probe interface, the function arch_cpu_probe is
called __three__ times, but only one call is valid, so I added a check on `count` to
It is a good lesson for me. When I meet a similar problem next time, I
should think more from the user's point of view.
--
Thanks & Regards,
Shaohui
--
hm, why does it get called three times? Is that something which can/should be fixed in callers rather than in the callee? --
It might be a bug in the caller, but that is just a guess for now. I will investigate it.
--
Thanks & Regards,
Shaohui
--
From: David Rientjes <rientjes@google.com>
Add an interface to allow new nodes to be added when performing memory
hot-add. This provides a convenient interface to test memory hotplug
notifier callbacks and surrounding hotplug code when new nodes are
onlined without actually having a machine with such hotpluggable SRAT
entries.
This adds a new debugfs interface at /sys/kernel/debug/mem_hotplug/add_node
that behaves in a similar way to the memory hot-add "probe" interface.
Its format is size@start, where "size" is the size of the new node to be
added and "start" is the physical address of the new memory.
The new node id is a currently offline, but possible, node. The bit must
be set in node_possible_map so that nr_node_ids is sized appropriately.
For emulation on x86, for example, it would be possible to set aside
memory for hotplugged nodes (say, anything above 2G) and to add an
additional four nodes as being possible on boot with
 mem=2G numa=possible=4
and then creating a new 128M node at runtime:
 # echo 128M@0x80000000 > /sys/kernel/debug/mem_hotplug/add_node
 On node 1 totalpages: 0
 init_memory_mapping: 0000000080000000-0000000088000000
 0080000000 - 0088000000 page 2M
Once the new node has been added, its memory can be onlined. If this
memory represents memory section 16, for example:
 # echo online > /sys/devices/system/memory/memory16/state
 Built 2 zonelists in Node order, mobility grouping on. Total pages: 514846
 Policy zone: Normal
 [ The memory section(s) mapped to a particular node are visible via
 /sys/kernel/debug/mem_hotplug/node1, in this example. ]
The new node is now hotplugged and ready for testing.
CC: Haicheng Li <haicheng.li@intel.com>
CC: Greg KH <gregkh@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Documentation/memory-hotplug.txt | 24 +++++++++++++++
mm/memory_hotplug.c | 59 ++++++++++++++++++++++++++++++++++++++
2 files ...
On Fri, 10 Dec 2010 15:31:22 +0800
This will cause the write to return a smaller number than `count': a
short write. Some userspace code may then decide to write the remainder
of the data (which is the correct way to use the write() syscall).
Could be a bit dangerous, and perhaps simply declaring an error if too
PAGES_PER_SECTION has type unsigned long, so the rhs of this comparison
might overflow on 32-bit, should anyone ever try to use this code on
32-bit. otoh the compiler might do it as 64-bit because the lhs is
64-bit. Not
--
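The 32-bit overflow concern can be illustrated with a userspace sketch. The function names and values here are hypothetical, and uint32_t is used to model a 32-bit unsigned long; the comparison under review is not reproduced. The point is only that a shift is carried out in the type of its left operand, so the result wraps before it is widened unless the operand is cast to 64 bits first.

```c
#include <stdint.h>

/* Hypothetical model: uint32_t stands in for a 32-bit unsigned long. */
static uint64_t bytes_narrow(uint32_t pages, unsigned int page_shift)
{
	/* The shift is done in 32 bits; the wrapped result is then widened. */
	return pages << page_shift;
}

static uint64_t bytes_wide(uint32_t pages, unsigned int page_shift)
{
	/* Widen first so that the shift is carried out in 64 bits. */
	return (uint64_t)pages << page_shift;
}
```

For a 128M section (32768 pages with a page shift of 12) both versions agree, which is why the problem is easy to miss. With 1048576 pages the narrow version wraps to 0 while the wide version yields 4294967296.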
We traditionally haven't been using NODEMASK_ALLOC() in sysfs (or, in this case, debugfs) functions because they're never deep in a call chain. Even for 4K node support, which isn't a supported config on any arch that allows CONFIG_MEMORY_HOTPLUG, this would only be 512 bytes on the short stack. I agree with the remainder of the points in your review and will be sending fixes against -mm, thanks! --
I bet linux-2.6.227 supports a meganode. --
Shaohui, I'll reply to this message with an updated version of this patch to address Andrew's comments. You can merge it into your series or Andrew can take it separately (although it doesn't do much good without "x86: add numa=possible command line option" unless you have hotpluggable SRAT entries and CONFIG_ACPI_NUMA). --
Add an interface to allow new nodes to be added when performing memory
hot-add. This provides a convenient interface to test memory hotplug
notifier callbacks and surrounding hotplug code when new nodes are
onlined without actually having a machine with such hotpluggable SRAT
entries.
This adds a new debugfs interface at /sys/kernel/debug/hotplug/add_node
that behaves in a similar way to the memory hot-add "probe" interface.
Its format is size@start, where "size" is the size of the new node to be
added and "start" is the physical address of the new memory.
The new node id is a currently offline, but possible, node. The bit must
be set in node_possible_map so that nr_node_ids is sized appropriately.
For emulation on x86, for example, it would be possible to set aside
memory for hotplugged nodes (say, anything above 2G) and to add an
additional four nodes as being possible on boot with
 mem=2G numa=possible=4
and then creating a new 128M node at runtime:
 # echo 128M@0x80000000 > /sys/kernel/debug/hotplug/add_node
 On node 1 totalpages: 0
 init_memory_mapping: 0000000080000000-0000000088000000
 0080000000 - 0088000000 page 2M
Once the new node has been added, its memory can be onlined. If this
memory represents memory section 16, for example:
 # echo online > /sys/devices/system/memory/memory16/state
 Built 2 zonelists in Node order, mobility grouping on. Total pages: 514846
 Policy zone: Normal
 [ The memory section(s) mapped to a particular node are visible via
 /sys/devices/system/node/node1, in this example. ]
The new node is now hotplugged and ready for testing.
Signed-off-by: David Rientjes <rientjes@google.com>
---
Documentation/memory-hotplug.txt | 24 +++++++++++++
mm/memory_hotplug.c | 69 ++++++++++++++++++++++++++++++++++++++
2 files changed, 93 insertions(+), 0 deletions(-)
diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
--- a/Documentation/memory-hotplug.txt
+++ ...
Okay, thanks David. I will merge it into my series when I send the next version.
Thanks & Regards,
Shaohui
--
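For readers who want to script against the add_node interface, the size@start format described in David's patch can be sketched in userspace as below. This is not the kernel parser: parse_size() and parse_add_node() are hypothetical names, and the K/M/G suffix handling is an assumption modelled on how the example "128M@0x80000000" is written.

```c
#include <stdlib.h>

/*
 * Hypothetical userspace approximation of a size parser: a number in
 * any strtoull() base-0 form, optionally followed by a K, M or G suffix.
 * On return, *end points just past the consumed characters.
 */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long val = strtoull(s, end, 0);
	unsigned int shift = 0;

	switch (**end) {
	case 'K': case 'k': shift = 10; break;
	case 'M': case 'm': shift = 20; break;
	case 'G': case 'g': shift = 30; break;
	default: break;
	}
	if (shift) {
		val <<= shift;
		(*end)++;	/* consume the suffix character */
	}
	return val;
}

/*
 * Parse "size@start", e.g. "128M@0x80000000".
 * Returns 0 on success, -1 if the string is malformed.
 */
static int parse_add_node(const char *s, unsigned long long *size,
			  unsigned long long *start)
{
	char *p;
	char *q;

	*size = parse_size(s, &p);
	if (p == s || *p != '@')
		return -1;	/* no size given, or separator missing */
	p++;
	*start = parse_size(p, &q);
	if (q == p)
		return -1;	/* no start address given */
	if (*q == '\n')
		q++;		/* tolerate the newline that echo appends */
	return *q == '\0' ? 0 : -1;
}
```

Parsing "128M@0x80000000" yields a size of 134217728 bytes and a start of 0x80000000, so the node ends at 0x88000000, which matches the init_memory_mapping range shown in the example output.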
