Re: [5/7, v9] NUMA Hotplug Emulator: Support cpu probe/release in x86_64

From: shaohui.zheng
Date: Friday, December 10, 2010 - 12:31 am

* PATCHSET INTRODUCTION

patch 1: Documentation.
patch 2: Adds a numa=possible=<N> command line option to set an additional N nodes
		 as being possible for memory hotplug. 
	    
patch 3: Add node hotplug emulation, introduce debugfs node/add_node interface

patch 4: Abstract the cpu register functions to make the interface friendly to cpu
		 hotplug emulation
patch 5: Support cpu probe/release on x86. It provides a software method to hot
		 add/remove a cpu through a sysfs interface.
patch 6: Fake the CPU socket with a logical CPU on x86, to prevent the scheduling
		 domain from building an incorrect hierarchy.
patch 7: Implement per-node add_memory debugfs interface

* FEEDBACK & RESPONSES

v9:

Fix the bug reported by Eric B Munson: check the return value of cpu_down when
doing a CPU release.

Resolve the conflicts with Tejun Heo's NUMA unification code; patch 5 is re-worked on
top of his patch.

Some small changes to the debugfs per-node add_memory interface.

v8:

Reconsidered David's proposal and accepted the per-node add_memory interface in debugfs
(patch 7).

v7:

David:    We don't need two different interfaces, one in sysfs and one in debugfs,
          to hotplug memory.
Response: We use debugfs for memory hotplug emulation only. We did not make any
          modifications to the sysfs memory probe interface, so the original patch 7
          is removed from the patchset.
David:    Suggest new probe files in debugfs for each online node:
			/sys/kernel/debug/mem_hotplug/add_node (already exists)
			/sys/kernel/debug/mem_hotplug/node0/add_memory
			/sys/kernel/debug/mem_hotplug/node1/add_memory

Response: There is no need to make a simple thing so complicated. We'd prefer to
          rename the mem_hotplug/probe interface to mem_hotplug/add_memory.
			/sys/kernel/debug/mem_hotplug/add_node (already exists)
			/sys/kernel/debug/mem_hotplug/add_memory (rename probe as add_memory)

v6:

Greg KH:  Suggest to use interface mem_hotplug/add_node
David:    Agree with Greg's suggestion
Response: We move the interface ...
From: shaohui.zheng
Date: Friday, December 10, 2010 - 12:31 am

From: Shaohui Zheng <shaohui.zheng@intel.com>

Add a text file, Documentation/x86/x86_64/numa_hotplug_emulator.txt,
that explains how to use the hotplug emulator.

Reviewed-By: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt	2010-12-10 13:43:19.573331001 +0800
@@ -0,0 +1,97 @@
+NUMA Hotplug Emulator for x86_64
+---------------------------------------------------
+
+The NUMA hotplug emulator emulates NUMA node hotplug purely in
+software. It is intended to help people debug and test
+node/CPU/memory hotplug code on a machine without NUMA hotplug
+support, including a UMA machine or a virtual
+environment.
+
+1) Node hotplug emulation:
+
+Adds a numa=possible=<N> command line option to set an additional N nodes
+as being possible for memory hotplug.  This set of possible nodes
+controls nr_node_ids and the sizes of several dynamically allocated node
+arrays.
+
+This allows memory hotplug to create new nodes for newly added memory
+rather than binding it to existing nodes.
+
+For emulation on x86, it would be possible to set aside memory for hotplugged
+nodes (say, anything above 2G) and to add an additional four nodes as being
+possible on boot with
+
+	mem=2G numa=possible=4
+
+and then creating a new 128M node at runtime:
+
+	# echo 128M@0x80000000 > /sys/kernel/debug/mem_hotplug/add_node
+	On node 1 totalpages: 0
+	init_memory_mapping: 0000000080000000-0000000088000000
+	 0080000000 - 0088000000 page 2M
+
+Once the new node has been added, its memory can be onlined.  If this
+memory represents memory section 16, for example:
+
+	# echo online > ...
From: shaohui.zheng
Date: Friday, December 10, 2010 - 12:31 am

From: Shaohui Zheng <shaohui.zheng@intel.com>

Abstract the cpu register functions and provide a more flexible interface,
register_cpu_node. The new interface makes it convenient to add a cpu
to a specified node, so it can be used to add a cpu to a fake node.
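
The abstraction described above can be pictured with a small user-space model.
This is only a sketch: struct cpu_model, model_cpu_to_node() and the
"4 CPUs per node" mapping are hypothetical stand-ins, and whether the patch
keeps the old entry point as a wrapper is an assumption, since the diff is
truncated here.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct cpu. */
struct cpu_model {
	int node_id;    /* node the CPU was registered on */
	int registered; /* non-zero once registration succeeded */
};

/* Hypothetical stand-in for cpu_to_node(): the CPU's "real" node. */
static int model_cpu_to_node(int num)
{
	return num / 4; /* assume 4 CPUs per node, for illustration only */
}

/* Register a CPU on an explicitly chosen node, which may be a fake one. */
int model_register_cpu_node(struct cpu_model *cpu, int num, int nid)
{
	if (cpu == NULL || num < 0 || nid < 0)
		return -1;
	cpu->node_id = nid;
	cpu->registered = 1;
	return 0;
}

/* The node-agnostic entry point reduces to a thin wrapper. */
int model_register_cpu(struct cpu_model *cpu, int num)
{
	return model_register_cpu_node(cpu, num, model_cpu_to_node(num));
}
```

With this shape, existing callers keep their behaviour, while the emulator can
pass any node id it likes.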

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/arch/x86/include/asm/cpu.h
===================================================================
--- linux-hpe4.orig/arch/x86/include/asm/cpu.h	2010-11-17 09:00:59.742608402 +0800
+++ linux-hpe4/arch/x86/include/asm/cpu.h	2010-11-17 09:01:10.192838977 +0800
@@ -27,6 +27,7 @@
 
 #ifdef CONFIG_HOTPLUG_CPU
 extern int arch_register_cpu(int num);
+extern int arch_register_cpu_node(int num, int nid);
 extern void arch_unregister_cpu(int);
 #endif
 
Index: linux-hpe4/arch/x86/kernel/topology.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/topology.c	2010-11-17 09:01:01.053461766 +0800
+++ linux-hpe4/arch/x86/kernel/topology.c	2010-11-17 10:05:32.934085248 +0800
@@ -52,6 +52,15 @@
 }
 EXPORT_SYMBOL(arch_register_cpu);
 
+int __ref arch_register_cpu_node(int num, int nid)
+{
+	if (num)
+		per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
+
+	return register_cpu_node(&per_cpu(cpu_devices, num).cpu, num, nid);
+}
+EXPORT_SYMBOL(arch_register_cpu_node);
+
 void arch_unregister_cpu(int num)
 {
 	unregister_cpu(&per_cpu(cpu_devices, num).cpu);
Index: linux-hpe4/drivers/base/cpu.c
===================================================================
--- linux-hpe4.orig/drivers/base/cpu.c	2010-11-17 09:01:01.053461766 +0800
+++ linux-hpe4/drivers/base/cpu.c	2010-11-17 10:05:32.943465010 +0800
@@ -208,17 +208,18 @@
 static SYSDEV_CLASS_ATTR(offline, 0444, print_cpus_offline, NULL);
 
 /*
- * register_cpu - Setup a sysfs device for a CPU.
+ * register_cpu_node - Setup a sysfs device for a CPU.
  * @cpu - cpu->hotpluggable field set to 1 will generate a ...
From: shaohui.zheng
Date: Friday, December 10, 2010 - 12:31 am

From:  David Rientjes <rientjes@google.com>

Adds a numa=possible=<N> command line option to set an additional N nodes
as being possible for memory hotplug.  This set of possible nodes
controls nr_node_ids and the sizes of several dynamically allocated node
arrays.

This allows memory hotplug to create new nodes for newly added memory
rather than binding it to existing nodes.

The first use-case for this will be node hotplug emulation which will use
these possible nodes to create new nodes to test the memory hotplug
callbacks and surrounding memory hotplug code.
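
To illustrate how extra possible nodes feed into nr_node_ids, here is a
user-space sketch. It assumes the possible map fits in 64 bits and that the
additional nodes take the lowest unused ids; MODEL_MAX_NODES and both function
names are hypothetical, and the real patch body is not shown in this mail.

```c
#include <stdint.h>

#define MODEL_MAX_NODES 64 /* hypothetical limit; the kernel uses MAX_NUMNODES */

/* Mark the n lowest-numbered nodes that are not yet possible as possible. */
uint64_t add_possible_nodes(uint64_t possible_map, unsigned int n)
{
	unsigned int nid;

	for (nid = 0; nid < MODEL_MAX_NODES && n > 0; nid++) {
		if (!(possible_map & (UINT64_C(1) << nid))) {
			possible_map |= UINT64_C(1) << nid;
			n--;
		}
	}
	return possible_map;
}

/* nr_node_ids is one more than the highest possible node id. */
unsigned int model_nr_node_ids(uint64_t possible_map)
{
	unsigned int ids = 0;

	while (possible_map) {
		ids++;
		possible_map >>= 1;
	}
	return ids;
}
```

Under these assumptions, a machine with only node 0 booted with
numa=possible=4 ends up with nodes 0-4 possible, so node arrays are sized for
five entries.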

CC: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
 Documentation/x86/x86_64/boot-options.txt |    4 ++++
 arch/x86/mm/numa_64.c                     |   18 +++++++++++++++---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -174,6 +174,10 @@ NUMA
 		If given as an integer, fills all system RAM with N fake nodes
 		interleaved over physical nodes.
 
+  numa=possible=<N>
+		Sets an additional N nodes as being possible for memory
+		hotplug.
+
 ACPI
 
   acpi=off	Don't enable ACPI
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -33,6 +33,7 @@ s16 apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
 int numa_off __initdata;
 static unsigned long __initdata nodemap_addr;
 static unsigned long __initdata nodemap_size;
+static unsigned long __initdata numa_possible_nodes;
 
 /*
  * Map cpu index to node index
@@ -611,7 +612,7 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 
 #ifdef CONFIG_NUMA_EMU
 	if (cmdline && !numa_emulation(start_pfn, last_pfn, acpi, k8))
-		return;
+		goto out;
 ...
From: Andrew Morton
Date: Wednesday, December 22, 2010 - 5:27 pm

On Fri, 10 Dec 2010 15:31:21 +0800

hm, I didn't know you could do that with labels.

--

From: David Rientjes
Date: Wednesday, December 22, 2010 - 6:14 pm

Yeah, it's equivalent to __attribute__((unused)) and according to the gcc 
manual section 6.30:

	In GNU C, an attribute specifier list may appear after the colon 
	following a label, other than a case or default label. The only 
	attribute it makes sense to use after a label is unused. This 
	feature is intended for code generated by programs which contains 
	labels that may be unused but which is compiled with ‘-Wall’. It 
	would not normally be appropriate to use in it human-written code, 
	though it could be useful in cases where the code that jumps to 
	the label is contained within an #ifdef conditional.
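
A minimal sketch of the case the manual describes, for GCC or clang.
DEMO_NUMA_EMU is a hypothetical stand-in for a config option and
count_possible_nodes() is an illustrative name, not a function from the patch.

```c
/* Uncomment to compile the early-exit path. DEMO_NUMA_EMU is a hypothetical
 * stand-in for a kernel config option such as CONFIG_NUMA_EMU. */
/* #define DEMO_NUMA_EMU 1 */

int count_possible_nodes(int detected, int emulated, int extra)
{
	int nodes;

	(void)emulated; /* only read when DEMO_NUMA_EMU is defined */
#ifdef DEMO_NUMA_EMU
	if (emulated > 0) {
		nodes = emulated;
		goto out;
	}
#endif
	nodes = detected;

	/* Without the attribute, -Wall warns about an unused label whenever
	 * the #ifdef block above is compiled out. */
out: __attribute__((unused));
	return nodes + extra;
}
```

The kernel spells the same attribute __maybe_unused.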

I used it because I knew I wouldn't get away with putting a label inside 
> >  unsigned long __init numa_free_all_bootmem(void)
From: shaohui.zheng
Date: Friday, December 10, 2010 - 12:31 am

From:  Shaohui Zheng <shaohui.zheng@intel.com>

Add an add_memory interface under debugfs for each online node to support memory
hotplug emulation. Reserved memory can be added to the desired node through
this interface.

The layout on debugfs:
	mem_hotplug/node0/add_memory
	mem_hotplug/node1/add_memory
	mem_hotplug/node2/add_memory
	...

Add a memory section (128M) to node 3 (booted with mem=1024m):

	echo 0x40000000 > mem_hotplug/node3/add_memory

CC: David Rientjes <rientjes@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/mm/memory_hotplug.c
===================================================================
--- linux-hpe4.orig/mm/memory_hotplug.c	2010-12-10 13:22:44.753331000 +0800
+++ linux-hpe4/mm/memory_hotplug.c	2010-12-10 13:41:48.803331000 +0800
@@ -933,6 +933,81 @@
 
 static struct dentry *memhp_debug_root;
 
+#ifdef CONFIG_ARCH_MEMORY_PROBE
+
+static ssize_t add_memory_store(struct file *file, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	u64 phys_addr = 0;
+	int nid = file->private_data - NULL;
+	int ret;
+
+	printk(KERN_INFO "Add a memory section to node: %d.\n", nid);
+	phys_addr = simple_strtoull(buf, NULL, 0);
+
+	ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+	if (ret)
+		count = ret;
+
+	return count;
+}
+
+static int add_memory_open(struct inode *inode, struct file *file)
+{
+	file->private_data = inode->i_private;
+	return 0;
+}
+
+static const struct file_operations add_memory_file_ops = {
+	.open		= add_memory_open,
+	.write		= add_memory_store,
+	.llseek		= generic_file_llseek,
+};
+
+/*
+ * Create add_memory debugfs entry under specified node
+ */
+static int debugfs_create_add_memory_entry(int nid)
+{
+	char buf[32];
+	static struct dentry *node_debug_root;
+
+	snprintf(buf, sizeof(buf), "node%d", nid);
+	node_debug_root = debugfs_create_dir(buf, ...
From: Andrew Morton
Date: Wednesday, December 22, 2010 - 5:27 pm

On Fri, 10 Dec 2010 15:31:26 +0800

Even more unneeded initialisation.

Please check the whole patchset for this.  It's bad because it can
sometimes generate more code and because it can sometimes hide bugs by

Well that was sneaky.

It would be more conventional to just use the typecast:
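
As a user-space sketch of the cast-based form: intptr_t plays the role that a
(long) cast would play in the kernel, and both helper names are hypothetical.

```c
#include <stdint.h>

/* Store a small integer in a pointer-sized cookie (as with inode->i_private). */
void *nid_to_cookie(int nid)
{
	return (void *)(intptr_t)nid;
}

/* Recover the integer with an explicit cast instead of pointer arithmetic. */
int cookie_to_nid(const void *cookie)
{
	return (int)(intptr_t)cookie;
}
```

Unlike "private_data - NULL", this does not depend on arithmetic on void
pointers, which is a GNU extension.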



Was this usage of i_private and private_data documented in comments

hm, debugfs_create_dir() was poorly designed - it should return an


--

From: Shaohui Zheng
Date: Wednesday, December 22, 2010 - 7:00 pm

Yes, it is my habit to initialize a variable when defining it. I will check them 


We ignored the warning for function simple_strtoull in the whole patchset.

Yes, I added the usage information when creating the add_memory entry; it seems
that I should also add a comment here.

/* the nid information was represented by the offset of pointer(NULL+nid) */
	if (!debugfs_create_file("add_memory", S_IWUSR, node_debug_root,

Totally agree. I see the similar call on debugfs_create_dir. For the failure,


-- 
Thanks & Regards,
Shaohui

--

From: shaohui.zheng
Date: Friday, December 10, 2010 - 12:31 am

From: Shaohui Zheng <shaohui.zheng@intel.com>

When hotplugging a CPU with the emulator, we use a logical CPU to emulate the
CPU hotplug process. On CPUs that support SMT, some logical CPUs share the
same socket, but with the emulator they may be placed in different NUMA nodes.
This misleads the scheduling domain code into building an incorrect hierarchy, and it
causes the following call trace when rebalancing the scheduling domains:

divide error: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu8/online
CPU 0 
Modules linked in: fbcon tileblit font bitblit softcursor radeon ttm drm_kms_helper e1000e usbhid via_rhine mii drm i2c_algo_bit igb dca
Pid: 0, comm: swapper Not tainted 2.6.32hpe #78 X8DTN
RIP: 0010:[<ffffffff81051da5>]  [<ffffffff81051da5>] find_busiest_group+0x6c5/0xa10
RSP: 0018:ffff880028203c30  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000015ac0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff880277e8cfa0 RDI: 0000000000000000
RBP: ffff880028203dc0 R08: ffff880277e8cfa0 R09: 0000000000000040
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f16cfc85770 CR3: 0000000001001000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81822000, task ffffffff8184a600)
Stack:
 ffff880028203d60 ffff880028203cd0 ffff8801c204ff08 ffff880028203e38
<0> 0101ffff81018c59 ffff880028203e44 00000001810806bd ffff8801c204fe00
<0> 0000000528200000 ffffffff00000000 0000000000000018 0000000000015ac0
Call Trace:
 <IRQ> 
 [<ffffffff81088ee0>] ? tick_dev_program_event+0x40/0xd0
 [<ffffffff81053b2c>] rebalance_domains+0x17c/0x570
 [<ffffffff81018c89>] ? read_tsc+0x9/0x20
 [<ffffffff81088ee0>] ? ...
From: Andrew Morton
Date: Wednesday, December 22, 2010 - 5:27 pm

On Fri, 10 Dec 2010 15:31:25 +0800


Unneeded initialisation.

Does this cause an unused var warning when


--

From: Shaohui Zheng
Date: Wednesday, December 22, 2010 - 10:10 pm

I was trying to avoid too many ifdefs here; it seems it causes an unused var
warning when CONFIG_ARCH_CPU_PROBE_RELEASE=n. Good catch.


Agreed, the comment is too simple; I should add better documentation for the function
fake_cpu_socket_info.

-- 
Thanks & Regards,
Shaohui

--

From: shaohui.zheng
Date: Friday, December 10, 2010 - 12:31 am

From: Shaohui Zheng <shaohui.zheng@intel.com>

Physical CPU hot-add/hot-remove is supported on some hardware, and it
is already supported by the current Linux kernel. The NUMA Hotplug Emulator provides
a mechanism to emulate the process in software. It can be used for
testing or debugging purposes.

Physical CPU hotplug is different from logical CPU online/offline. Logical
online/offline is controlled by the interface /sys/devices/system/cpu/cpuX/online. The CPU
hotplug emulator uses a probe/release interface, which makes it possible to do cpu
hotplug automation and stress testing.

Add a cpu probe/release interface under sysfs for x86_64. Users can use this
interface to emulate the cpu hot-add and hot-remove process.

Directions:
*) Reserve CPUs through a grub parameter like:
	maxcpus=4

The remaining CPUs will not be initialized.

*) Probe a CPU
Use the probe interface to hot-add a new CPU:
	echo nid > /sys/devices/system/cpu/probe

*) Release a CPU
	echo cpu > /sys/devices/system/cpu/release

A reserved CPU will be hot-added to the specified node.
1) nid == 0: the CPU is added to the real node that the CPU
should be in.
2) nid != 0: the CPU is added to node nid even though it is a fake node.
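
The node selection rule above amounts to the following sketch.
probe_target_node() and its real_nid argument are hypothetical; in the kernel
the real node would come from the firmware topology.

```c
/* Pick the node a probed CPU is attached to, following the rule above. */
int probe_target_node(int requested_nid, int real_nid)
{
	/* 0 means "use the CPU's real node"; anything else is taken as-is. */
	return requested_nid == 0 ? real_nid : requested_nid;
}
```

One consequence of using 0 as the sentinel is that node 0 can only be chosen
when it is also the CPU's real node.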

CC: Ingo Molnar <mingo@elte.hu>
CC: Len Brown <len.brown@intel.com>
CC: Yinghai Lu <Yinghai.Lu@Sun.COM>
CC: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
---
This patch is based on Tejun's unification of the 32 and 64 bit NUMA boot paths,
 specifically the patch at http://marc.info/?l=linux-kernel&m=129087151912379.
Index: linux-hpe4/arch/x86/kernel/acpi/boot.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/acpi/boot.c	2010-12-10 13:42:34.553331000 +0800
+++ linux-hpe4/arch/x86/kernel/acpi/boot.c	2010-12-10 14:48:32.113331001 +0800
@@ -668,8 +668,39 @@
 }
 EXPORT_SYMBOL(acpi_map_lsapic);
 
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+static void acpi_map_cpu2node_emu(int ...
From: Eric B Munson
Date: Thursday, December 16, 2010 - 9:25 am

Shaohui,

What kernel is this series based on?  I cannot get it to build when applied
to mainline.  I seem to be missing a definition for set_apicid_to_node.

Eric

From: Shaohui Zheng
Date: Thursday, December 16, 2010 - 4:34 pm

Eric,
	There is a code conflict with Tejun's NUMA unification code, and Tejun's code is still under
review. This patchset resolves the conflict: the v9 emulator is based on his patches, so we
need to wait until his patches are accepted.

Tejun's patch: http://marc.info/?l=linux-kernel&m=129087151912379.

	If you are doing some testing, you can try the v8 emulator.

-- 
Thanks & Regards,
Shaohui

--

From: Andrew Morton
Date: Wednesday, December 22, 2010 - 5:27 pm

On Fri, 10 Dec 2010 15:31:24 +0800

One definition per line makes for more maintainable code.





s/num/cpu/ would be conventional.  "num" is a pretty poor identifier in


arch_cpu_probe() is global and exported to modules, but is undocumented.

If it had been documented, I might have been able to work out why arg


It's generally better to make kernel messages self-identifying. 
Especially error messages.  If someone comes along and sees "can not
release cpu 0" in their logs, they don't have a clue what caused it



Something like "When cpu hotplug emulation is enabled, register only

s/emulation, the/emulation.  The/





--

From: Shaohui Zheng
Date: Wednesday, December 22, 2010 - 6:34 pm

Agree, I will put them into 2 lines, and remove the initialisations.

it is a warning, so I ignore it.





Sorry, Andrew, I did not catch it. Do you mean to add the document before




Sorry, it is the same as with function arch_cpu_probe: I did not catch the
problem. Should I add documentation before the definition or the declaration? Or



-- 
Thanks & Regards,
Shaohui

--

From: Andrew Morton
Date: Wednesday, December 22, 2010 - 8:21 pm

Don't ignore warnings!  At least, not until you've understood the
reason for them and have a *reason* to ignore them.

simple_strtoul() will silently accept input of the form "42foo",
treating it as "42".  That's a userspace bug and the kernel should
report it.  This means that the code should be changed to handle error

Sure, add a comment documenting the function.


Better, although "arch_cpu_release" isn't very meaningful to an
administrator.  "NUMA hotplug remove" or something like that would be
more useful.

All these messages should be looked at from the point of view of the
people who they are to serve.  Although in this special case, that's
most likely to be a kernel developer so I guess such clarity isn't
needed.


--

From: Shaohui Zheng
Date: Wednesday, December 22, 2010 - 7:24 pm

It is a tricky thing. When I debug it under a virtual machine and do a cpu
probe via the sysfs cpu/probe interface, the function arch_cpu_probe is called
__three__ times, but only one call is valid, so I added a check on `count` to

It is a good lesson for me. When I meet a similar problem next time, I should
think more from the user's point of view.

-- 
Thanks & Regards,
Shaohui

--

From: Andrew Morton
Date: Wednesday, December 22, 2010 - 10:28 pm

hm, why does it get called three times?  Is that something which
can/should be fixed in callers rather than in the callee?

--

From: Shaohui Zheng
Date: Wednesday, December 22, 2010 - 9:30 pm

It might be a bug in the caller, but that is just a guess for now. I will investigate it.

-- 
Thanks & Regards,
Shaohui

--

From: shaohui.zheng
Date: Friday, December 10, 2010 - 12:31 am

From: David Rientjes <rientjes@google.com>

Add an interface to allow new nodes to be added when performing memory
hot-add.  This provides a convenient interface to test memory hotplug
notifier callbacks and surrounding hotplug code when new nodes are
onlined without actually having a machine with such hotpluggable SRAT
entries.

This adds a new debugfs interface at /sys/kernel/debug/mem_hotplug/add_node
that behaves in a similar way to the memory hot-add "probe" interface.
Its format is size@start, where "size" is the size of the new node to be
added and "start" is the physical address of the new memory.
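
A user-space sketch of parsing the size@start format. parse_size() is loosely
modeled on the kernel's memparse() but is not that function, and both names
here are hypothetical.

```c
#include <stdlib.h>

/* Parse a number with an optional K/M/G suffix. */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	if (*end == s) /* no digits at all */
		return 0;

	switch (**end) {
	case 'G':
	case 'g':
		v <<= 10; /* fall through */
	case 'M':
	case 'm':
		v <<= 10; /* fall through */
	case 'K':
	case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/* Parse "size@start"; returns 0 on success, -1 on malformed input. */
int parse_add_node(const char *s, unsigned long long *size,
		   unsigned long long *start)
{
	char *end;
	const char *p;

	*size = parse_size(s, &end);
	if (end == s || *end != '@')
		return -1;
	p = end + 1;
	*start = parse_size(p, &end);
	if (end == p || (*end != '\0' && *end != '\n'))
		return -1;
	return 0;
}
```

For the string "128M@0x80000000" this yields a size of 134217728 bytes and a
start address of 0x80000000.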

The new node id is a currently offline, but possible, node.  The bit must
be set in node_possible_map so that nr_node_ids is sized appropriately.

For emulation on x86, for example, it would be possible to set aside
memory for hotplugged nodes (say, anything above 2G) and to add an
additional four nodes as being possible on boot with

	mem=2G numa=possible=4

and then creating a new 128M node at runtime:

	# echo 128M@0x80000000 > /sys/kernel/debug/mem_hotplug/add_node
	On node 1 totalpages: 0
	init_memory_mapping: 0000000080000000-0000000088000000
	 0080000000 - 0088000000 page 2M

Once the new node has been added, its memory can be onlined.  If this
memory represents memory section 16, for example:

	# echo online > /sys/devices/system/memory/memory16/state
	Built 2 zonelists in Node order, mobility grouping on.  Total pages: 514846
	Policy zone: Normal

 [ The memory section(s) mapped to a particular node are visible via
   /sys/kernel/debug/mem_hotplug/node1, in this example. ]

The new node is now hotplugged and ready for testing.
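
The "memory section 16" in the example follows from the start address. A small
sketch of the arithmetic, assuming 128M sections (SECTION_SIZE_BITS is 27 on
x86_64) and that memoryN in sysfs is numbered by section; the function names
are hypothetical.

```c
#include <stdint.h>

#define MODEL_SECTION_SIZE_BITS 27 /* 128M sections, as on x86_64 */

/* Section number that contains a physical address. */
uint64_t phys_to_section_nr(uint64_t phys_addr)
{
	return phys_addr >> MODEL_SECTION_SIZE_BITS;
}

/* Number of sections needed to cover size_bytes, rounded up. */
uint64_t sections_spanned(uint64_t size_bytes)
{
	const uint64_t section = UINT64_C(1) << MODEL_SECTION_SIZE_BITS;

	return (size_bytes + section - 1) / section;
}
```

0x80000000 is 2G, and 2G / 128M = 16, so a 128M node added at that address
occupies exactly one section, number 16.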

CC: Haicheng Li <haicheng.li@intel.com>
CC: Greg KH <gregkh@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
 Documentation/memory-hotplug.txt |   24 +++++++++++++++
 mm/memory_hotplug.c              |   59 ++++++++++++++++++++++++++++++++++++++
 2 files ...
From: Andrew Morton
Date: Wednesday, December 22, 2010 - 5:27 pm

On Fri, 10 Dec 2010 15:31:22 +0800


This will cause the write to return a smaller number than `count': a
short write.  Some userspace code may then decide to write the
remainder of the data (which is the correct way to use the write()
syscall).
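
For reference, this is roughly what such userspace code looks like. It is a
generic sketch of a write loop, not code from any particular tool, and
write_all() is a hypothetical name.

```c
#include <errno.h>
#include <unistd.h>

/* Write all len bytes, retrying after short writes.
 * Returns 0 on success, -1 with errno set on failure. */
int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) { /* no progress; avoid spinning forever */
			errno = EIO;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}
```

If the store handler reports fewer bytes than it was given, a loop like this
calls it again with only the tail of the string, which the handler would then
try to parse as a fresh request.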

Could be a bit dangerous, and perhaps simply declaring an error if too

PAGES_PER_SECTION has type unsigned long, so the rhs of this comparison
might overflow on 32-bit, should anyone ever try to use this code on
32-bit.

otoh the compiler might do it as 64-bit because the lhs is 64-bit.  Not
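
A small sketch of the concern, using uint32_t to model a 32-bit unsigned long.
The function names are hypothetical, and the overflowing values are chosen for
illustration: with the x86_64 values (32768 pages per section, shift of 12) the
product is 128M and still fits in 32 bits.

```c
#include <stdint.h>

/* Model of a 32-bit unsigned long: the shift wraps modulo 2^32. */
uint32_t section_bytes_32(uint32_t pages_per_section, unsigned int page_shift)
{
	return pages_per_section << page_shift;
}

/* Widening before the shift keeps the full result. */
uint64_t section_bytes_64(uint32_t pages_per_section, unsigned int page_shift)
{
	return (uint64_t)pages_per_section << page_shift;
}
```

In C the shift is evaluated in the promoted type of its left operand, whatever
the type on the other side of the comparison, so an explicit cast to a 64-bit
type is the safe form.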


--

From: David Rientjes
Date: Wednesday, December 22, 2010 - 6:38 pm

We traditionally haven't been using NODEMASK_ALLOC() in sysfs (or, in this 
case, debugfs) functions because they're never deep in a call chain.  Even 
for 4K node support, which isn't a supported config on any arch that 
allows CONFIG_MEMORY_HOTPLUG, this would only be 512 bytes on the short 
stack.
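
The 512-byte figure is just the size of a 4096-bit bitmap. As a sketch
(nodemask_bytes() is a hypothetical helper, not a kernel function):

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for a bitmap with one bit per node, rounded up to whole longs. */
size_t nodemask_bytes(size_t max_nodes)
{
	size_t longs = (max_nodes + BITS_PER_ULONG - 1) / BITS_PER_ULONG;

	return longs * sizeof(unsigned long);
}
```

For 4096 nodes this is 4096 / 8 = 512 bytes, regardless of whether unsigned
long is 32 or 64 bits wide.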

I agree with the remainder of the points in your review and will be 
sending fixes against -mm, thanks!
--

From: Andrew Morton
Date: Wednesday, December 22, 2010 - 7:20 pm

I bet linux-2.6.227 supports a meganode.
--

From: David Rientjes
Date: Tuesday, December 28, 2010 - 12:34 am

Shaohui, I'll reply to this message with an updated version of this patch 
to address Andrew's comments.  You can merge it into your series or Andrew 
can take it separately (although it doesn't do much good without "x86: add 
numa=possible command line option" unless you have hotpluggable SRAT 
entries and CONFIG_ACPI_NUMA).
--

From: David Rientjes
Date: Tuesday, December 28, 2010 - 12:34 am

Add an interface to allow new nodes to be added when performing memory
hot-add.  This provides a convenient interface to test memory hotplug
notifier callbacks and surrounding hotplug code when new nodes are
onlined without actually having a machine with such hotpluggable SRAT
entries.

This adds a new debugfs interface at /sys/kernel/debug/hotplug/add_node
that behaves in a similar way to the memory hot-add "probe" interface.
Its format is size@start, where "size" is the size of the new node to be
added and "start" is the physical address of the new memory.

The new node id is a currently offline, but possible, node.  The bit must
be set in node_possible_map so that nr_node_ids is sized appropriately.

For emulation on x86, for example, it would be possible to set aside
memory for hotplugged nodes (say, anything above 2G) and to add an
additional four nodes as being possible on boot with

	mem=2G numa=possible=4

and then creating a new 128M node at runtime:

	# echo 128M@0x80000000 > /sys/kernel/debug/hotplug/add_node
	On node 1 totalpages: 0
	init_memory_mapping: 0000000080000000-0000000088000000
	 0080000000 - 0088000000 page 2M

Once the new node has been added, its memory can be onlined.  If this
memory represents memory section 16, for example:

	# echo online > /sys/devices/system/memory/memory16/state
	Built 2 zonelists in Node order, mobility grouping on.  Total pages: 514846
	Policy zone: Normal

 [ The memory section(s) mapped to a particular node are visible via
   /sys/devices/system/node/node1, in this example. ]

The new node is now hotplugged and ready for testing.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/memory-hotplug.txt |   24 +++++++++++++
 mm/memory_hotplug.c              |   69 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+), 0 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
--- a/Documentation/memory-hotplug.txt
+++ ...
From: Zheng, Shaohui
Date: Tuesday, December 28, 2010 - 7:31 pm

Okay, thanks David. I will merge it into my series when I send next version.

Thanks & Regards,
Shaohui
--
