[PATCH] Cleanup to make remove_memory() arch neutral

From: Gary Hade
Date: Friday, September 5, 2008 - 10:21 am

Resending with linux-kernel@vger.kernel.org and x86@kernel.org copied
this time.  No changes other than this and modified Subject line.  The
only response so far on linux-mm has been an Acked-by: from 
Yasunori Goto <y-goto@jp.fujitsu.com>


Add memory hotremove config option to x86_64

Memory hotremove functionality can currently be configured into
the ia64, powerpc, and s390 kernels.  This patch makes it possible
to configure the memory hotremove functionality into the x86_64
kernel as well. 

Signed-off-by: Gary Hade <garyhade@us.ibm.com>

---
 arch/x86/Kconfig      |    3 +++
 arch/x86/mm/init_64.c |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

Index: linux-2.6.27-rc5/arch/x86/Kconfig
===================================================================
--- linux-2.6.27-rc5.orig/arch/x86/Kconfig	2008-09-03 13:33:59.000000000 -0700
+++ linux-2.6.27-rc5/arch/x86/Kconfig	2008-09-03 13:34:55.000000000 -0700
@@ -1384,6 +1384,9 @@
 	def_bool y
 	depends on X86_64 || (X86_32 && HIGHMEM)

+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config HAVE_ARCH_EARLY_PFN_TO_NID
 	def_bool X86_64
 	depends on NUMA
Index: linux-2.6.27-rc5/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/x86/mm/init_64.c	2008-09-03 13:34:08.000000000 -0700
+++ linux-2.6.27-rc5/arch/x86/mm/init_64.c	2008-09-03 13:34:55.000000000 -0700
@@ -740,6 +740,24 @@
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif

+#ifdef CONFIG_MEMORY_HOTREMOVE
+int remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn, end_pfn;
+	unsigned long timeout = 120 * HZ;
+	int ret;
+	start_pfn = start >> PAGE_SHIFT;
+	end_pfn = start_pfn + (size >> PAGE_SHIFT);
+	ret = offline_pages(start_pfn, end_pfn, timeout);
+	if (ret)
+		goto out;
+	/* Arch-specific calls go here */
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(remove_memory);
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+
 #endif /* CONFIG_MEMORY_HOTPLUG */

 ...
From: Ingo Molnar
Date: Friday, September 5, 2008 - 10:44 am

so this will break the build on 32-bit, if CONFIG_MEMORY_HOTREMOVE=y? 
mm/memory_hotplug.c assumes that remove_memory() is provided by the 

hm, nothing appears to be arch-specific about this trivial wrapper 
around offline_pages().

Shouldn't this be moved to the CONFIG_MEMORY_HOTREMOVE portion of 
mm/memory_hotplug.c instead, as a weak function? That way architectures 
only have to enable ARCH_ENABLE_MEMORY_HOTREMOVE - and architectures 
with different/special needs can override it.

	Ingo
--
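
As a rough illustration of the weak-function idea Ingo raises above, the sketch below shows what a weak default could look like in plain C. It is a userspace stand-in, not kernel code: `offline_pages_stub()` and the fixed `PAGE_SHIFT` of 12 are assumptions made for the example, and the kernel would normally spell the attribute `__weak` rather than `__attribute__((weak))`.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the kernel's offline_pages(); always succeeds. */
static int offline_pages_stub(unsigned long start_pfn, unsigned long end_pfn)
{
	(void)start_pfn;
	(void)end_pfn;
	return 0;
}

/*
 * Generic default.  Because the symbol is weak, an architecture file that
 * defines a non-weak remove_memory() would replace this one at link time.
 */
__attribute__((weak)) int remove_memory(uint64_t start, uint64_t size)
{
	unsigned long start_pfn = start >> PAGE_SHIFT;
	unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT);

	return offline_pages_stub(start_pfn, end_pfn);
}
```

With no other definition linked in, callers get the generic version; an architecture with special needs would only have to provide its own ordinary (non-weak) `remove_memory()`.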

From: Badari Pulavarty
Date: Friday, September 5, 2008 - 11:14 am

Yes. All the archs (ppc64, ia64, s390, x86_64) have the exact same
function. No architecture has needed special handling so far (initial
versions of ppc64 needed extra handling, but I moved the code
to a different place).

We can make this generic and kill all arch-specific ones.
Initially, we didn't know whether any arch would need special handling,
so we ended up with private functions for each arch.

Yes. We should do that. I will send out a patch.

Thanks,
Badari

--

From: Ingo Molnar
Date: Friday, September 5, 2008 - 11:17 am

ok - if all architectures have the same function then please make it a 
regular function not a weak one, and remove all the duplications.

	Ingo
--
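
The function being duplicated across architectures does little more than convert a byte range into a page-frame range before calling offline_pages(). The following userspace sketch isolates that arithmetic so it can be checked by hand; `struct pfn_range`, `bytes_to_pfn_range()`, and the fixed `PAGE_SHIFT` of 12 are assumptions for illustration and do not appear in any of the patches.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages, as on x86_64 */

/* Hypothetical helper type: a half-open range of page frame numbers. */
struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* first PFN past the range */
};

/* Same conversion the per-arch copies perform before offline_pages(). */
static struct pfn_range bytes_to_pfn_range(uint64_t start, uint64_t size)
{
	struct pfn_range r;

	r.start_pfn = (unsigned long)(start >> PAGE_SHIFT);
	r.end_pfn = r.start_pfn + (unsigned long)(size >> PAGE_SHIFT);
	return r;
}
```

Under these assumptions, a 128 MiB block starting at 1 GiB (start = 0x40000000, size = 0x8000000) maps to PFNs 0x40000 up to, but not including, 0x48000.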

From: Badari Pulavarty
Date: Monday, September 8, 2008 - 2:52 pm

There is nothing architecture specific about remove_memory().
The remove_memory() function is common to all architectures which
support hotplug memory remove. Instead of duplicating it in every
architecture, collapse them into an arch-neutral function.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

 arch/ia64/mm/init.c   |   17 -----------------
 arch/powerpc/mm/mem.c |   17 -----------------
 arch/s390/mm/init.c   |   11 -----------
 mm/memory_hotplug.c   |   10 ++++++++++
 4 files changed, 10 insertions(+), 45 deletions(-)

Index: linux-2.6.27-rc5/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/ia64/mm/init.c	2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/arch/ia64/mm/init.c	2008-09-08 12:38:59.000000000 -0700
@@ -701,23 +701,6 @@ int arch_add_memory(int nid, u64 start, 
 
 	return ret;
 }
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int remove_memory(u64 start, u64 size)
-{
-	unsigned long start_pfn, end_pfn;
-	unsigned long timeout = 120 * HZ;
-	int ret;
-	start_pfn = start >> PAGE_SHIFT;
-	end_pfn = start_pfn + (size >> PAGE_SHIFT);
-	ret = offline_pages(start_pfn, end_pfn, timeout);
-	if (ret)
-		goto out;
-	/* we can free mem_map at this point */
-out:
-	return ret;
-}
-EXPORT_SYMBOL_GPL(remove_memory);
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif
 
 /*
Index: linux-2.6.27-rc5/arch/powerpc/mm/mem.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/powerpc/mm/mem.c	2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/arch/powerpc/mm/mem.c	2008-09-08 12:39:19.000000000 -0700
@@ -135,23 +135,6 @@ int arch_add_memory(int nid, u64 start, 
 
 	return __add_pages(zone, start_pfn, nr_pages);
 }
-
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int remove_memory(u64 start, u64 size)
-{
-	unsigned long start_pfn, end_pfn;
-	int ret;
-
-	start_pfn = start >> PAGE_SHIFT;
-	end_pfn = start_pfn + (size >> PAGE_SHIFT);
-	ret = offline_pages(start_pfn, ...
From: Andrew Morton
Date: Monday, September 8, 2008 - 5:56 pm

On Mon, 08 Sep 2008 14:52:34 -0700

I spent some time trying to build-test this on ia64 and gave up.  How
the heck do you turn on memory hotplug on ia64?

--

From: Randy Dunlap
Date: Monday, September 8, 2008 - 6:14 pm

After using ia64 defconfig, all I had to do was enable Sparse Memory model
instead of Discontiguous.


---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/
--

From: Yasunori Goto
Date: Monday, September 8, 2008 - 6:21 pm

EXPORT_SYMBOL_GPL(remove_memory) is removed.
It is required by drivers/acpi/acpi_memhotplug.ko.


-- 
Yasunori Goto 


--

From: Badari Pulavarty
Date: Tuesday, September 9, 2008 - 8:12 am

Thanks for catching it. I forgot that it was being used
by acpi. Since we didn't export it for ppc and s390,
I assumed it's safe to remove the export. Sorry !!

Thanks,
Badari

--

From: Badari Pulavarty
Date: Monday, September 8, 2008 - 2:56 pm

Cleaned up patch without remove_memory().
Depends on the "make remove_memory() arch neutral" patch.

Thanks,
Badari

Add memory hotremove config option to x86

Memory hotremove functionality can currently be configured into
the ia64, powerpc, and s390 kernels.  This patch makes it possible
to configure the memory hotremove functionality into the x86
kernel as well. 

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Gary Hade <garyhade@us.ibm.com>
---
 arch/x86/Kconfig |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6.27-rc5/arch/x86/Kconfig
===================================================================
--- linux-2.6.27-rc5.orig/arch/x86/Kconfig	2008-09-08 12:36:06.000000000 -0700
+++ linux-2.6.27-rc5/arch/x86/Kconfig	2008-09-08 12:45:30.000000000 -0700
@@ -1384,6 +1384,10 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 	depends on X86_64 || (X86_32 && HIGHMEM)
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+	depends on MEMORY_HOTPLUG
+
 config HAVE_ARCH_EARLY_PFN_TO_NID
 	def_bool X86_64
 	depends on NUMA


--

From: Andi Kleen
Date: Friday, September 5, 2008 - 11:04 am

You forgot to describe how you tested it. Does it actually work?
And why do you want to do it? What's the use case?

The general understanding was that it doesn't work very well on a real
machine at least, because there is no way to control how that memory maps
to real pluggable hardware (and you cannot completely empty a node at runtime),
and a hypervisor would likely use different interfaces anyway.

-Andi
--

From: Badari Pulavarty
Date: Friday, September 5, 2008 - 11:31 am

At this time we are interested in node removal (on x86_64).
It doesn't really work well at this time, because some of the structures
(pgdat etc.) are striped across all nodes. There is no easy way to
relocate them. Yasunori Goto is working on patches to address some of
these issues.

But we are considering adding support to restrict/skip bootmem
allocations on selected nodes. That way, we should be able to do
node remove.

(BTW, on ppc64 this works fine - since we are interested mostly in
removing *some* sections of memory to give it back to hypervisor - 
not entire node removal).

Thanks,
Badari

--

From: Andi Kleen
Date: Friday, September 5, 2008 - 11:54 am

That means you can never put any slab data on specific nodes.
And all the kernel subsystems on that node will not ever get local
memory.  How are you going to solve that?  And if you disallow
kernel allocations in such large memory areas you get many of the highmem
issues that plagued 32bit back in the 64bit kernel.

There are lots of other issues. It's quite questionable if this

Ok for hypervisors you can do it reasonably easy on x86 too, but it's likely
that some hypercall interface is better than going through
sysfs. 

-Andi
--

From: Badari Pulavarty
Date: Friday, September 5, 2008 - 3:34 pm

You are absolutely correct. There is no easy solution - one has
to lose performance in order to support node removal, along with
some old x86 issues :(

We were contemplating idea of limiting node removal to few

Same issues exist with ia64 and x86_64 won't be any worse off.
Gary was trying to enable the functionality so that we can at least
test out offlining memory sections more easily (test page migration,
isolation code and hash out issues).

Another possible idea being considered (still a lot of unknowns)
is to make use of the offline memory section feature for power management
(*cough*).

Anyway, as you can see this patch doesn't add any code - just
enables config option for x86_64. (if you are worried about

A sysfs interface already exists to offline sections of memory (same
interface as online).

The proposed patch provides an easy way to find out which sections of
memory belong to which node (could be useful on its own).

Thanks,
Badari

--
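
The existing sysfs interface mentioned above can be driven from a small C helper. This is a sketch under assumptions: it relies on the per-section `state` file under `/sys/devices/system/memory/`, the helper names `memory_state_path()` and `set_memory_state()` are made up for the example, and writing the file needs root and a kernel built with memory hotplug support.

```c
#include <stdio.h>

/* Build the sysfs path of one memory section's state file. */
static int memory_state_path(char *buf, size_t len, unsigned int section)
{
	int n = snprintf(buf, len, "/sys/devices/system/memory/memory%u/state",
			 section);

	/* Treat an encoding error or a truncated path as failure. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "online" or "offline" to a section; 0 on success, -1 on error. */
static int set_memory_state(unsigned int section, const char *state)
{
	char path[128];
	FILE *f;
	int ok;

	if (memory_state_path(path, sizeof(path), section) != 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(state, f) >= 0;
	/* A rejected request may only surface when the buffer is flushed. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

For example, `set_memory_state(32, "offline")` would attempt to offline section 32 and `set_memory_state(32, "online")` would bring it back; the section number here is arbitrary.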

From: Gary Hade
Date: Friday, September 5, 2008 - 12:53 pm

So far, I have tested it on a 2-node IBM x460, 2-node IBM x3950, and
a 4-node IBM x3950 M2 and have been able to successfully offline and
re-online all memory sections marked as removable multiple times with
no apparent problems.

By directing the change to -mm our hope is that others will try it


The inability to offline all non-primary node memory sections
certainly needs to be addressed.  The pgdat removal work that
Yasunori Goto has started will hopefully continue and help resolve
this issue.  We have only just started thinking about issues related
to resources other than CPUs and memory that will need to be released
in preparation for node removal (e.g. memory and i/o resources
assigned to PCI devices on a node targeted for removal).  Much of
this is new territory for us so any suggestions that you and others
can offer will be much appreciated.

Thanks for asking.

Gary

-- 
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503  IBM T/L: 775-4503
garyhade@us.ibm.com
http://www.ibm.com/linux/ltc

--

From: Andi Kleen
Date: Friday, September 5, 2008 - 1:04 pm

You make it sound like it's just some minor technical hurdle
that needs to be addressed. But from all analysis of these issues
I've seen so far it's extremely hard and all possible solutions
have serious issues. So before doing some baby steps there
should be at least some general idea how this thing is supposed

That's the easy stuff. The hard parts are all the kernel objects
that you cannot move.

-Andi

--

From: Gary Hade
Date: Friday, September 5, 2008 - 2:54 pm

I am not sure if I understand why you appear to be opposed to
enabling the hotremove function before all the issues related
to an eventual goal of being able to free all memory on a node
are addressed.  Even in the absence of solutions for these issues
it seems like there could still be other possible benefits such
as the ability to selectively expand and shrink available memory
for testing or debugging purposes.  I believe it would also be
helpful to those working on or testing possible solutions for
the removal issues.

Gary

-- 
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503  IBM T/L: 775-4503
garyhade@us.ibm.com
http://www.ibm.com/linux/ltc

--

From: Andi Kleen
Date: Friday, September 5, 2008 - 5:01 pm

I'm quite sceptical that it can ever be made to work in a useful
way for real hardware (as opposed to a hypervisor paravirtual setup
for which this interface is not the right way -- it should be done
in some specific driver instead).

And if it cannot be made to work then it will be a false promise
to the user. They will see it and think it will work, but it will
not.

This means I don't see a real use case for this feature.

-Andi

--

From: Yasunori Goto
Date: Saturday, September 6, 2008 - 12:06 am

I don't think such a driver is almighty.
IIRC, a balloon driver can cause fragmentation on a 24/7 system.

In addition, I have heard that memory hotplug would be useful for reducing
the power consumption of DIMMs.

I have to admit that memory hotplug has many issues, but I would like to
solve them step by step.


Thanks.
-- 
Yasunori Goto 


--

From: Andi Kleen
Date: Saturday, September 6, 2008 - 1:53 am

Sure the balloon driver can be likely improved too, it's just
that I don't think a balloon driver should call into the function

It's unclear that memory hotplug is the right model for DIMM power management.
The problem is that DIMMs are interleaved, so you again have to completely

Let's call it "node" or "hardware" memory hot unplug, not that
anyone confuses it with the easier VM based hot unplug or the really

The question is if they are even solvable in a useful way.
I'm not sure it's that useful to start and then find out
that it doesn't work anyways.

-Andi

--

From: Nick Piggin
Date: Sunday, September 7, 2008 - 10:52 pm

You use non-linear mappings for the kernel, so that kernel data is
not tied to a specific physical address. AFAIK, that is the only way
to really do it completely (like the fragmentation problem).

Of course, I don't think that would be a good idea to do that in the
foreseeable future.
--

From: Andi Kleen
Date: Monday, September 8, 2008 - 2:36 am

Even with that there are lots of issues, like keeping track of 

Agreed.

-Andi

-- 
ak@linux.intel.com
--

From: Nick Piggin
Date: Monday, September 8, 2008 - 2:46 am

Right, but the "high level" software solution is to have nonlinear
kernel mappings. Executing kernel code should not be so hard because
it could be handled just like executing user code (ie. the CPU that
is executing will subsequently fault and be blocked until the
relocation is complete).

DMAs aren't trivial at all, but I guess there could be say, a method
to submit and revoke areas of memory for DMA, and the submit would
block if the memory is currently being relocated underneath it (then
it would be able to find the new address).

Anyway, whatever the case, yeah I'm not trying to say it is trivial

Same as the "anti-frag" patches. We must not proceed with this kind of
thing on the justification that "in future we'll be able to unplug any
bit of memory". Because it is not just a matter of logical steps to
reach that point, but basically a fundamental rethink of how the kernel
memory mapping should work.

Other realistic justifications are OK, but if someone wants to unplug
everything, then please put effort into *first* making the kernel
mapping nonlinear, and then we can look at the complexity and
performance costs of that fundamental step.
--

From: Andi Kleen
Date: Monday, September 8, 2008 - 3:30 am

First, blocking arbitrary code is hard. There are some code parts
which are not allowed to block arbitrarily. Machine check or NMI
handlers come to mind, but there are likely more.

Then that would be essentially a hypervisor or micro kernel approach.
e.g. Xen does that already kind of, but even there it would
be quite hard to do fully in a general way. And for hardware hotplug
only the fully general way is actually useful, unfortunately.

-Andi
--

From: Nick Piggin
Date: Monday, September 8, 2008 - 4:19 am

Sorry, by "block", I really mean spin I guess. I mean that the CPU will
be forced to stop executing due to the page fault during this sequence:

for prot RO:
alloc new page
memcpy(new, old)
ptep_clear_flush(ptep)         <--- from here
set_pte(ptep, newpte)          <--- until here

for prot RW, the window also would include the memcpy, however if that
adds too much latency for execute/reads, then it can be mapped RO first,

What would be? Blocking in interrupts? Or non-linear kernel mapping in
general? Nonlinear kernel mapping I don't think anyone disputes is the
only way to defragment (for unplug or large allocations) arbitrary
physical memory with any sort of guarantee. In the future if TLB costs
grow very much larger, I think this might be worth considering.

But until that becomes inevitable, I really don't want to hack the VM
with crap like transparent variable order mappings etc. but rather

Yeah I don't really get the hardware hotplug thing. For reliability or
anything it should all be done in hardware (eg. warm/hot spare memory
module). For power I guess there is some argument, but I would prefer
to wait the trends out longer before committing to something big: a
non-volatile RAM replacement for DRAM, for example, might be achieved in
future.

But if anybody disagrees, they are sure free to implement non-linear
kernel mappings and physical defragmentation and shut me up with
real numbers!

--
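
The read-only relocation sequence listed above can be mimicked in userspace to make the ordering concrete. This toy model is a sketch resting on several assumptions: `struct toy_pte`, `relocate_ro_page()`, and the 4 KiB `TOY_PAGE_SIZE` are invented for illustration, a plain pointer stands in for a page-table entry, and there is no TLB flush or concurrent reader here.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* Toy "page-table entry": NULL means not present, so an access would fault. */
struct toy_pte {
	unsigned char *page;
};

/*
 * Move a read-only page to freshly allocated memory.
 * Returns the old page so the caller can free it once no reader can still
 * be using it, or NULL if allocation failed (the mapping is left untouched).
 */
static unsigned char *relocate_ro_page(struct toy_pte *pte)
{
	unsigned char *old = pte->page;
	unsigned char *new_page = malloc(TOY_PAGE_SIZE);	/* alloc new page */

	if (!new_page)
		return NULL;
	memcpy(new_page, old, TOY_PAGE_SIZE);	/* memcpy(new, old) */
	pte->page = NULL;	/* models ptep_clear_flush(): window opens */
	pte->page = new_page;	/* models set_pte(): window closes */
	return old;
}
```

In this model the only time the entry is "not present" is between the last two stores, which corresponds to the window marked "from here" and "until here" in the message above.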

From: Andi Kleen
Date: Monday, September 8, 2008 - 4:30 am

It's hard for NMIs at least. They cannot execute faults.

In the end you would need to define a core kernel which 
cannot be remapped and the rest which can and you end up

Well in general someone remapping all the memory beyond you.
That's essentially a hypervisor in my book.

-Andi
--

From: Nick Piggin
Date: Monday, September 8, 2008 - 6:48 am

Well, just for executing code (and reading RO data), then it shouldn't
matter at all actually if the CPU starts executing from the new page
or the old page, so long as there is a way to quiesce NMIs before freeing
the old page.

So the NMI can run, and read data, but it may have a problem with stores.
At least, some kind of redesign of NMI handlers might be required so that
they can make a note of the pending operation and try to do something
sane in that case. Or, there could be a small region of memory; a page or
two, which does not get migrated and NMIs can write to it. I don't think
you need to go so far as saying the entire kernel image must be non


I don't see it. It is among one of the things a hypervisor may do.
But anyway, call it what you will.
--

From: Ingo Molnar
Date: Saturday, September 6, 2008 - 7:33 am

What would be nice is to insert the information both during bootup and 
in /proc/meminfo and 'free' output that hot-removable memory segments 
are not generic free memory, it's currently a limited resource that 
might or might not be sufficient to serve a given workload.

Perhaps even exclude it from 'total' memory reported by meminfo - to be 
on the safe side of user expectations. In terms of user-space memory it 
is already generic swappable memory but in terms of kernel-space 
allocations it is not.

As i said it earlier in the thread, i certainly have no objections from 
the x86 maintenance side - nothing is worse than a generic kernel 
feature only available on certain less frequently used platforms. Memory 
hotplug has been available for some time in the MM and it's not really 
causing any maintenance trouble at the moment and it is not enabled by 
default either.

Having said that, i have my doubts about its generic utility (the power 
saving aspects are likely not realizable - nobody really wants DIMMs to 
just sit there unused and the cost of dynamic migration is just 
horrendous) - but as long as it's opt-in there's no reason to limit the 
availability of an in-kernel feature artificially.

Removing those limitations of kernel-space allocations should indeed be 
done in baby steps - and whether it's worth turning such memory into 
completely generic kernel memory is an open question.

But the fact that a piece of memory is not fully generic is no reason 
not to allow users to create special, capability-limited RAM resources 
like they can already do via hugetlbfs or ramfs, as long as the
capability limitations are advertised clearly.

Yes, memory hotplug has limitations we all understand, but still it's an 
arguably useful feature in some circumstances. If we never give a 
feature a chance to evolve on the main Linux platform that 90%+ of our
users use, it won't ever be truly useful.

Please send the new patches against -git or -tip and we can put them ...
From: kamezawa.hiroyu
Date: Saturday, September 6, 2008 - 9:00 am

I wonder why nobody talks about ZONE_MOVABLE... When I wrote memory
hotplug, I assumed the help of ZONE_MOVABLE and SPARSEMEM. It is shown in
meminfo. (I think memory hotplug is useful only when ZONE_MOVABLE is used.)

Most of problems which Goto wrote are mainly about placement of memmap and 
pgdat, zones. One example is that "when SPARSEMEM_VMEMMAP is enabled,

Nobody? Maybe it is just a trade-off problem on the user side.
Even without DIMM hotplug or a DIMM power-save mode, is making a DIMM idle
of no use? I think memory consumes much power when it is used.
Memory Hotplug and ZONE_MOVABLE can make some memory idle.
Hmm, would adding a feature like
 - offline some memory at boot.
 - online-memory-as-hugetlb mode
  
be useful for generic PC users?

Regards,
-Kame
--

From: kamezawa.hiroyu
Date: Saturday, September 6, 2008 - 9:05 am

But I have to point out HDD access consumes far more power than memory.
That's a trade-off problem that depends on usage, anyway.

Thanks,
-Kame
--

From: Ingo Molnar
Date: Saturday, September 6, 2008 - 9:17 am

yeah, most likely. (It's possible technically even on a native kernel - 

yeah - it's actually the way how hugetlb should be done. Plus expand 
gbpages to hugetlbfs and hotplug memory on Barcelona CPUs and you can do 
user-space apps that can run for a long time without any TLB misses. 
_That_ might make sense to explore in practice. (i'm not holding my 
breath though, TLB misses are _fast_ on the best x86 CPUs.)

But we won't be able to make such experiments without having the 
capability on x86. So i'd like to break the catch-22 by accepting all 
this into arch/x86, it certainly is simple and makes some sense, it's 
just that i'm not that convinced about it personally at the moment.

So feel free to turn it all into a killer feature (make hugetlb backed 
memory transparent to user-space, etc. etc.) that high-performance 
computing users strive for and all that will change. Please send the 
reshaped patches so we can move past the 'what if' discussion phase ;-)

	Ingo
--
