Re: [PATCH RFC] hotplug-memory: refactor online_pages to separate zone growth from page onlining

Previous thread: Re: [patch 20/21] forcedeth: fix locking bug with netconsole by Ingo Molnar on Friday, March 28, 2008 - 4:46 pm. (1 message)

Next thread: unexpected rename() behaviour by Ketil Froyn on Friday, March 28, 2008 - 5:07 pm. (5 messages)
From: Jeremy Fitzhardinge
Date: Friday, March 28, 2008 - 5:00 pm

The Xen balloon driver needs to separate the process of hot-installing
memory into two phases: one to allocate the page structures and
configure the zones, and another to actually online the pages of newly
installed memory.

This patch splits up the innards of online_pages() into two pieces which
correspond to these two phases.  The behaviour of online_pages() itself
is unchanged.
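
As a rough user-space model of the split (not the kernel code: the toy zone structure and the MODEL_ names are made up for illustration, and only the three function names come from the patch), the relationship between the entry points might look like this:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_PFN 64UL  /* hypothetical size of the toy "memory" */

/* Toy zone: only tracks the span and the count of usable pages. */
struct model_zone {
	unsigned long spanned_end;   /* one past the highest pfn covered */
	unsigned long present_pages; /* pages handed to the allocator */
	bool online[MODEL_MAX_PFN];
};

static struct model_zone zone;

/* Phase 1: grow the zone to cover the range; no page becomes usable. */
int prepare_online_pages(unsigned long pfn, unsigned long nr_pages)
{
	if (pfn + nr_pages > MODEL_MAX_PFN)
		return -1;
	if (pfn + nr_pages > zone.spanned_end)
		zone.spanned_end = pfn + nr_pages;
	return 0;
}

/* Phase 2: hand pages to the allocator; returns how many were onlined. */
unsigned long mark_pages_onlined(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long i, onlined = 0;

	assert(pfn + nr_pages <= zone.spanned_end);
	for (i = pfn; i < pfn + nr_pages; i++) {
		if (!zone.online[i]) {
			zone.online[i] = true;
			onlined++;
		}
	}
	zone.present_pages += onlined;
	return onlined;
}

/* Unchanged behaviour: both phases back to back over the same range. */
int online_pages(unsigned long pfn, unsigned long nr_pages)
{
	int ret = prepare_online_pages(pfn, nr_pages);

	if (ret)
		return ret;
	mark_pages_onlined(pfn, nr_pages);
	return 0;
}
```

In this model a caller such as the balloon driver can run phase 1 once over a large range and then call mark_pages_onlined() one page at a time, while existing callers of online_pages() see no difference.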

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
---
 include/linux/memory_hotplug.h |    3 +
 mm/memory_hotplug.c            |   66 ++++++++++++++++++++++++++++++++--------
 2 files changed, 57 insertions(+), 12 deletions(-)

===================================================================
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -57,7 +57,10 @@
 /* need some defines for these for archs that don't support it */
 extern void online_page(struct page *page);
 /* VM interface that may be used by firmware interface */
+extern int prepare_online_pages(unsigned long pfn, unsigned long nr_pages);
+extern unsigned long mark_pages_onlined(unsigned long pfn, unsigned long nr_pages);
 extern int online_pages(unsigned long, unsigned long);
+
 extern void __offline_isolated_pages(unsigned long, unsigned long);
 extern int offline_pages(unsigned long, unsigned long, unsigned long);
 
===================================================================
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -180,31 +180,35 @@
 	return 0;
 }
 
-
-int online_pages(unsigned long pfn, unsigned long nr_pages)
+/* Tell anyone who's interested that we're onlining some memory */
+static int notify_going_online(unsigned long pfn, unsigned long nr_pages)
 {
-	unsigned long flags;
-	unsigned long onlined_pages = 0;
-	struct zone *zone;
-	int need_zonelists_rebuild = 0;
+	struct ...
From: Dave Hansen
Date: Friday, March 28, 2008 - 5:47 pm

That's kind of a weird line in the patch.  How'd that get there?  Why are
you moving 'arg'?

This looks OK, but it does add ~45 lines of code, and I'm not immediately
sure how you're going to use it.  Could you address that a bit?

I do kinda wish you'd take a real hard look at what the new functions

The comment is good, but the function name is not. :)  How about

Isn't the comment on this one a bit redundant? :)

This looks to me to have become the real online_pages() now.  This
function is what goes and individually onlines pages.  If someone was
trying to figure out whether to call online_pages() or

We should really wrap up memory notify:

static int memory_notify(int state, unsigned long start_pfn,
			 unsigned long nr_pages, int status_change_nid)
{
	struct memory_notify arg;
	arg.start_pfn = start_pfn;
	arg.nr_pages = nr_pages;
	arg.status_change_nid = status_change_nid;
	return the_current_memory_notify(state, &arg);
}

We can use that in a couple of spots, right?
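
A minimal user-space sketch of how such a wrapper might be used at the call sites. Everything here is a stand-in: the MODEL_ state values are arbitrary, the_current_memory_notify() is the placeholder name from the message with a dummy body, and model_online_range() is a hypothetical caller. The wrapper is declared as returning int so that the notifier's result can be propagated.

```c
#include <assert.h>

/* Stand-ins for the kernel's notifier states; the values are arbitrary. */
enum { MODEL_GOING_ONLINE = 1, MODEL_ONLINE = 2 };

/* The three fields the wrapper in the message fills in. */
struct memory_notify {
	unsigned long start_pfn;
	unsigned long nr_pages;
	int status_change_nid;
};

/* Records what the last notification carried, for illustration only. */
static struct memory_notify last_arg;
static int last_state;

/* Placeholder for the existing notifier entry point named in the message. */
static int the_current_memory_notify(int state, struct memory_notify *arg)
{
	last_state = state;
	last_arg = *arg;
	return 0;
}

/* The proposed wrapper: callers pass values, not a struct they must fill. */
static int memory_notify(int state, unsigned long start_pfn,
			 unsigned long nr_pages, int status_change_nid)
{
	struct memory_notify arg;

	arg.start_pfn = start_pfn;
	arg.nr_pages = nr_pages;
	arg.status_change_nid = status_change_nid;
	return the_current_memory_notify(state, &arg);
}

/* One call per phase, with no local 'arg' left to recycle. */
static int model_online_range(unsigned long pfn, unsigned long nr_pages)
{
	int ret = memory_notify(MODEL_GOING_ONLINE, pfn, nr_pages, -1);

	if (ret)
		return ret;
	/* ... the pages would be onlined here ... */
	return memory_notify(MODEL_ONLINE, pfn, nr_pages, -1);
}
```

With the struct hidden inside the wrapper, each call site shrinks to a single line, and the question of which function owns 'arg' goes away.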

-- Dave

--

From: Jeremy Fitzhardinge
Date: Friday, March 28, 2008 - 7:08 pm

arg is the notifier arg.  This function is just wrapping up the 
GOING_ONLINE notifier.  The code is just a copy, so it isn't doing 
anything it wasn't doing before.

The original code also recycles arg for the ONLINE notifier, but that 

Sure.  When the balloon driver wants to increase the domain's size, but 
it finds it's run out of page structures to grow into, it hotplug-adds 
some memory.  This code uses the add_memory_resource() function I posted 
the patch for yesterday.  (Error-checking removed for brevity.)

static void balloon_expand(unsigned pages)
{
	struct resource *res;
	int ret;
	u64 size = (u64)pages * PAGE_SIZE;
	unsigned pfn;
	unsigned start_pfn, end_pfn;

	res = kzalloc(sizeof(*res), GFP_KERNEL);

	res->name = "Xen Balloon";
	res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;

	ret = allocate_resource(&iomem_resource, res, size, 0, -1,
				1ul << SECTION_SIZE_BITS, NULL, NULL);

	start_pfn = res->start >> PAGE_SHIFT;
	end_pfn = (res->end + 1) >> PAGE_SHIFT;

	ret = add_memory_resource(0, res);
	ret = prepare_online_pages(start_pfn, pages);

	for(pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		SetPageReserved(page);
		set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
		balloon_append(page);		/* add to a list of balloon pages */
	}
}


So this just gives us some page structures, but there's no underlying 
memory yet.  Later, the balloon driver starts populating the pages with 
real memory behind them:

	for (i = 0; i < nr_pages; i++) {
		page = balloon_retrieve();

		pfn = page_to_pfn(page);

		/* frame_list is set of real memory pages */
		set_phys_to_machine(pfn, frame_list[i]);

		/* Relinquish the page back to the allocator. */
		mark_pages_onlined(pfn, 1);

		/* Link back into the page tables if not highmem. */
		if (!PageHighMem(page)) {
			int ret;
			ret = HYPERVISOR_update_va_mapping(
				(unsigned long)__va(pfn << PAGE_SHIFT),
				mfn_pte(frame_list[i], PAGE_KERNEL),
				0);

		}


Sure. ...
From: Dave Hansen
Date: Friday, March 28, 2008 - 11:01 pm

There is at least one other user that I know of, which is the ehea
driver.  They're running through patches now to use it.

Anyway, we need the notifier.  We're only going to get more and more
drivers that need notification.  There may be one user now, but that's
no reason to rip it out.  Feel free to revisit it in a year if there's
still only one user. :)

-- Dave

--

From: Dave Hansen
Date: Saturday, March 29, 2008 - 9:06 am

I've been thinking about this some more, and I wish that you wouldn't
just throw this interface away or completely disable it.  It actually
does *exactly* what you want in a way. :)

When the /memoryXX/ directory appears, that means that the hardware has
found the memory, and that the 'struct page' is allocated and ready to
be initialized.

When the OS actually wants to use the memory (initialize the 'struct
page', and free_page() it), it does the 'echo online > /sys...'.  Both
the 'struct page' and the memory represented by it are untouched until
the "online".  This was originally in place to avoid fragmenting it
immediately in the case that the system did not need it.

To me, it sounds like the only different thing that you want is to make
sure that only partial sections are onlined.  So, shall we work with the
existing interfaces to online partial sections, or will we just disable
it entirely when we see Xen?

For Xen and KVM, how does it get decided that the guest needs more
memory?  Is this guest or host driven?  Both?  How is the guest
notified?  Is guest userspace involved at all?

-- Dave

--

From: Jeremy Fitzhardinge
Date: Saturday, March 29, 2008 - 4:53 pm

I had no intention of globally disabling it.  I just need to disable it 

Well, yes and no.

For the current balloon driver, it doesn't make much sense.  It would 
add a fair amount of complexity without any real gain.  It's currently 
based around alloc_page/free_page.  When it wants to shrink the domain 
and give memory back to the host, it allocates pages, adds the page 
structures to a ballooned pages list, and strips off the backing memory 
and gives it to the host.  Growing the domain is the converse: it gets 
pages from the host, pulls page structures off the list, binds them 
together and frees them back to the kernel.  If it runs out of ballooned 
page structures, it hotplugs in some memory to add more.
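
The shrink and grow paths described above can be sketched in user space roughly as follows. This is a toy model, not the driver: struct model_page, NO_FRAME and the two *_one() helpers are invented for illustration; only balloon_append() and balloon_retrieve() are names taken from the snippets earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGES 8       /* hypothetical number of page structures */
#define NO_FRAME (-1L)      /* marks a page with no backing memory */

struct model_page {
	long frame;               /* host frame backing this page, or NO_FRAME */
	struct model_page *next;  /* link on the ballooned-pages list */
};

static struct model_page pages[MODEL_PAGES];
static struct model_page *balloon_list; /* page structures without backing */
static unsigned long nr_ballooned;

static void balloon_append(struct model_page *page)
{
	page->next = balloon_list;
	balloon_list = page;
	nr_ballooned++;
}

static struct model_page *balloon_retrieve(void)
{
	struct model_page *page = balloon_list;

	if (page) {
		balloon_list = page->next;
		page->next = NULL;
		nr_ballooned--;
	}
	return page;
}

/* Shrink: take a page the kernel gave us and strip off its frame. */
static long balloon_shrink_one(struct model_page *page)
{
	long frame = page->frame;

	assert(frame != NO_FRAME);
	page->frame = NO_FRAME;
	balloon_append(page);
	return frame;   /* the caller hands this back to the host */
}

/* Grow: bind a host frame to a ballooned page structure.
 * Returns NULL when the list is empty, which is the point where the
 * driver would need to hotplug more page structures. */
static struct model_page *balloon_grow_one(long frame)
{
	struct model_page *page = balloon_retrieve();

	if (!page)
		return NULL;
	page->frame = frame;
	return page;    /* the caller frees this back to the kernel */
}
```

The NULL return from balloon_grow_one() is the only place where hotplug enters the picture, which is why the driver needs page structures without needing them onlined.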

That said, if (partial-)sections were much smaller - say 2-4 meg - and 
page migration/defrag worked reliably, then we could probably do without 
the balloon driver and do it all in terms of memory hot plug/unplug.  
That would give us a general mechanism which could either be driven from 
userspace, and/or have in-kernel Xen/kvm/s390/etc policy modules.  Aside 
from small sections, the only additional requirement would be an online 
hook which can actually attach backing memory to the pages being 

In Xen, either the host or the guest can set the target size for the 
domain, which is capped by the host-set limit.  Aside from possibly 
setting the target size, there's no usermode involvement in managing 
ballooning.  The virtio balloon driver is similar, though from a quick 
look it seems to be entirely driven by the host side.

    J
--

From: Anthony Liguori
Date: Saturday, March 29, 2008 - 5:26 pm

Ballooning on KVM (and s390) is very much a different beast from Xen.  
With Xen, ballooning is very similar to hotplug in that you're adding 
and removing physical memory from the guest.  The use of alloc_page() to 
implement it instead of hotplug is for the reasons Jeremy's outlined 
above.  Logically though, it's hotplug.

For KVM and s390, ballooning is really a primitive form of guest page 
hinting.  The host asks the guest to allocate some memory and the guest 
allocates what it can, and then tells the host which pages they were.  
It's basically saying the pages are Unused and then the host may move 
those pages from Up=>Uz which reduces the resident size of the guest.  
The virtual size stays the same though.  We can enforce limits on the 
resident size of the guest via the new cgroup memory controller.

The guest is free to reclaim those pages at any time it wants without 
informing the host.  In fact, we plan to utilize this by implementing a 
shrinker and OOM handler in the virtio balloon driver.

Hotplug is still useful for us as it's more efficient to hot-add 1gb of 
memory instead of starting out with an extra 1gb and ballooning down.  
We wouldn't want to hotplug away every page we balloon though as we want 
to be able to reclaim them if necessary without the host's intervention 

The host support for KVM ballooning is entirely in userspace, but that's 
orthogonal to the discussion at hand really.

Regards,


--

From: Dave Hansen
Date: Monday, March 31, 2008 - 9:42 am

Right, but by disabling it for your case, you have given up all of the
testing that others have done on it.  Let's try and see if we can get

How does this deal with things like present_pages in the zones?  Does
the total ram just grow with each hot-add, or does it grow on a per-page

Even with 1MB sections and a flat sparsemem map, you're only looking at
~500k of overhead for the sparsemem storage.  Less if you use vmemmap.  
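
One way to arrive at a figure of that size is sketched below. The inputs are assumptions, not values given in the message: a 36-bit (64GB, PAE) physical address space and an 8-byte entry per section.

```c
#include <assert.h>

/* Bytes of flat section-table storage for a given address space.
 * All three inputs are assumptions for the sake of the arithmetic. */
static unsigned long long section_table_bytes(unsigned int phys_bits,
					      unsigned int section_bits,
					      unsigned int entry_bytes)
{
	assert(phys_bits >= section_bits);
	/* One entry per possible section: 2^(phys_bits - section_bits). */
	return (1ULL << (phys_bits - section_bits)) * entry_bytes;
}
```

Under those assumptions, section_table_bytes(36, 20, 8) is 65536 entries of 8 bytes, or 524288 bytes (512KB), while 512MB sections (29 bits) need only 128 entries, or 1024 bytes.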

-- Dave

--

From: Jeremy Fitzhardinge
Date: Monday, March 31, 2008 - 11:06 am

I suppose, but I'm not sure I see the point.  What are the benefits of 
using this interface?  You mentioned that the interface exists so that 
it's possible to defer using a newly added piece of memory to avoid 
fragmentation.  I suppose I can see the point of that.

But in the xen-balloon case, the memory is added on-demand precisely 
when it's about to be used, and then onlined in pieces as needed.  
Extending the usermode interface to allow partial onlining/offlining 
doesn't seem very useful for the case of physical hotplug memory, and 
it's not at all clear how to do it in a useful way for the xen-balloon 
case.  Particularly for offlining, since you'd need to guarantee that 

Well, there are two ways of looking at it:

    either hot-plugging memory immediately adds pages, but they're also
    all immediately allocated and therefore unavailable for general use, or

    the pages are notionally physically added as they're populated by
    the host


In principle they're equivalent, but I could imagine the former has the 
potential to make the VM waste time scanning unfreeable pages.

I'm not sure the patches I've posted are doing this stuff correctly 


At the moment my concern is 32-bit x86, which doesn't support vmemmap or 
sections smaller than 512MB because of the shortage of page flags bits.
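
The page-flags pressure can be made concrete with a little arithmetic. Without vmemmap, sparsemem keeps each page's section number in page->flags, so that field needs (physical address bits - section size bits) bits. The 36-bit physical address width used below is an assumption (PAE); the message itself only gives the 512MB section size.

```c
#include <assert.h>

/* Bits of page->flags needed to hold a section number, assuming the
 * section number is stored there (classic sparsemem without vmemmap). */
static unsigned int section_number_bits(unsigned int phys_bits,
					unsigned int section_size_bits)
{
	assert(phys_bits >= section_size_bits);
	return phys_bits - section_size_bits;
}
```

With 512MB sections, section_number_bits(36, 29) is 7. Dropping to 1MB sections makes it section_number_bits(36, 20), which is 16 bits, or half of a 32-bit flags word that also has to hold the page flags themselves and the zone number.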

    J
--

From: Yasunori Goto
Date: Tuesday, April 1, 2008 - 12:17 am

Not only to avoid fragmentation, but also to notify user level so that
it can prepare for the memory-add event.
When memory is added, there is a notification via udev for each memory
device.
On our box, a node that includes some DIMMs and CPUs can be added by
hot-add, and there is another notification for the node from ACPI's
container device.
After the user-level preparation checks, the user (or a shell script)
can online the memory.

IIRC, some user-level applications require this notification.

Basically, I hope the user-level interface differs as little as possible
between physical hotplug and Xen.
So, I would like to understand why memory is added "on-demand" on Xen.
I thought the hypervisor gathers a section's worth of memory and moves
all of it from one guest to another at once. The gathering may take a
long time. But moving page by page may cause fragmentation, if my understanding

I don't understand either of your ideas yet. Could you tell me more?
One of them may be the same as my understanding, but I'm not sure.


Thanks.

-- 
Yasunori Goto 


--

From: Dave Hansen
Date: Wednesday, April 2, 2008 - 11:46 am

Yeah, I forgot that we didn't have vmemmap on x86-32.  Ugh.

OK, here's another idea: Xen (and the balloon driver) already handle a
case where a guest boots up with 2GB of memory but only needs 1GB,
right?  It will balloon the guest down to 1GB from 2GB.

Why don't we just have hotplug work that way?  When we want to take a
guest from 1GB to 1GB+1 page (or whatever), we just hotplug the entire
section (512MB or 1GB or whatever), actually online the whole thing,
then make the balloon driver take it back to where it *should* be.  That
way we're completely reusing existing components that have to be able to
handle this case anyway.

Yeah, this is suboptimal, and it has a possibility of fragmenting the
memory, but it will only be used for the x86-32 case.

-- Dave

--

From: Jeremy Fitzhardinge
Date: Wednesday, April 2, 2008 - 11:52 am

It also requires you actually have the memory on hand to populate the 
whole area.  512MB is still a significant chunk on a 2GB server; you may 
end up generating significant overall system memory pressure to scrape 
together the memory, only to immediately discard it again.

    J
--

From: Dave Hansen
Date: Wednesday, April 2, 2008 - 11:59 am

That's a very good point.  Can we make it so that the hypervisors don't
actually allocate the memory to the guest until its first touch?  If the
pages are on the freelist, their *contents* shouldn't be touched at all
during the onlining process.

Maybe we could put a special mark on the pages (please no page flag :)
and the allocator can jump in and ask for the page from the hypervisor
before returning it to the system.  I think Anthony had some ideas
around this area.  It's kinda a poor man's page hinting.

-- Dave

--

From: Jeremy Fitzhardinge
Date: Wednesday, April 2, 2008 - 2:03 pm

No, not in a Xen direct-pagetable guest.  The guest actually sees real 
hardware page numbers (mfns) when the hypervisor gives it a page.  By 
the time the hypervisor gives it a page reference, it is already 
guaranteeing that the page is available for guest use.  The only thing 
that we could do is prevent the guest from mapping the page, but that 
doesn't really achieve much.

I think we're getting off track here; this is a lot of extra complexity 
to justify allowing usermode to use /sys to online a chunk of hotplugged 
memory.

    J
--

From: Dave Hansen
Date: Wednesday, April 2, 2008 - 2:17 pm

Oh, once we've let Linux establish ptes to it, we've required that the
hypervisor have it around?  How does that work with the balloon driver?
Do we destroy the ptes when giving balloon memory back to the
hypervisor?

If we're talking about i386, then we're set.  We don't map the hot-added
memory at all because we only add highmem on i386.  The only time we map
these pages is *after* we actually allocate them when they get mapped

Either that, or we're going to develop the entire Xen/kvm memory hotplug
architecture around the soon-to-be-legacy i386 limitations. :)

-- Dave

--

From: Jeremy Fitzhardinge
Date: Wednesday, April 2, 2008 - 2:35 pm

Well, the balloon driver can balloon out lowmem pages, so we have to 
deal with mappings either way.  But balloon+hotplug would work 

Everything also applies to x86-64.


    J
--

From: Dave Hansen
Date: Wednesday, April 2, 2008 - 2:43 pm

Yeah, but I'm just talking about hotplugged memory.  When we add it, we
don't have to map the added pages (since they're highmem) and don't have
to touch their contents and zero them out, either.  Then, the balloon
driver can notice that the memory is too large, and start to balloon it

Not really, though.  We don't have the page->flags shortage or lack of
vmemmap on x86_64.

-- Dave

--

From: Jeremy Fitzhardinge
Date: Wednesday, April 2, 2008 - 3:13 pm

Not at present.  But I'd like to change it to manage memory in largepage 

I didn't think x86-64 had a notion of highmem.


Right now, I'd rather have a single mechanism that works for both.

    J
--

From: Dave Hansen
Date: Wednesday, April 2, 2008 - 4:27 pm

I think there are a few options here.  One is to check on the way out of
the allocator that we're not over some Xen-specific limit.  Basically
that we aren't about to touch a hardware page for which the hypervisor
hasn't allocated backing memory.

Another is to give pages sitting in the allocator some kind of
associated state or keep them on separate lists.  (I think this has
something in common with those s390 CMM patches).  When you want to
allocate a page, you not only pull it off the buddy lists, but you also
have to check with the hypervisor to make sure it has backing store
before you actually return it.  You make it non-volatile in CMM-speak (I
think).

If you can't allocate backing store for a page, you toss it over to the
balloon driver (whose whole job is to keep track of pages without
hypervisor backing anyway) and go back to the allocator for another
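
The second option (check for backing store on the way out of the allocator, and divert unbackable pages to the balloon) might be modelled in user space like this. It is only a sketch under invented names: the two lists, hypervisor_populate() and the frame budget are hypothetical stand-ins, not Xen or kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGES 4

struct model_page {
	bool backed;              /* does the hypervisor back this page? */
	struct model_page *next;
};

static struct model_page pages[MODEL_PAGES];
static struct model_page *free_list;     /* stands in for the buddy lists */
static struct model_page *balloon_list;  /* pages without backing */
static unsigned int host_frames_left;    /* hypothetical hypervisor budget */

static void push(struct model_page **list, struct model_page *page)
{
	page->next = *list;
	*list = page;
}

static struct model_page *pop(struct model_page **list)
{
	struct model_page *page = *list;

	if (page)
		*list = page->next;
	return page;
}

/* Hypothetical hypervisor request: succeeds while frames remain. */
static bool hypervisor_populate(struct model_page *page)
{
	if (page->backed)
		return true;
	if (host_frames_left == 0)
		return false;
	host_frames_left--;
	page->backed = true;
	return true;
}

/* Pull pages off the free list until one can be given backing store;
 * pages that cannot be backed are handed to the balloon instead. */
static struct model_page *model_alloc_page(void)
{
	struct model_page *page;

	while ((page = pop(&free_list)) != NULL) {
		if (hypervisor_populate(page))
			return page;
		push(&balloon_list, page);
	}
	return NULL;
}
```

The cost of this scheme shows up in model_alloc_page(): every allocation may involve a hypervisor round trip, and a single call can walk an unbounded number of unbacked pages before it succeeds or gives up.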

Yeah, that would be most ideal.  But, at the same time, you don't want
to hobble your rockstar x86_64 implementation with quirks inherited from
the crufty 32-bit junk. :)

-- Dave

--

From: KAMEZAWA Hiroyuki
Date: Thursday, April 3, 2008 - 12:03 am

On Wed, 02 Apr 2008 15:13:05 -0700

As I mentioned before, you can do that by plugging into online_page().

Now, online_page() is per-architecture. So, it's not so bad to make
online_page() just a callback.
=
int online_page(struct page *page)
{
	if (online_page_callback)
		return (*online_page_callback)(page);
	return arch_default_online_page(page);
}
=
Maybe it doesn't look so dirty.

Your balloon driver can overwrite this callback pointer.
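
A self-contained sketch of how a driver might install such a callback. The toy struct page, balloon_online_page() and the registration helper set_online_page_callback() are hypothetical, written for this illustration; only online_page(), online_page_callback and arch_default_online_page() are names from the message.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page: two flags are enough to see which path ran. */
struct page {
	int onlined;
	int ballooned;
};

/* NULL means "use the architecture default". */
static int (*online_page_callback)(struct page *page);

static int arch_default_online_page(struct page *page)
{
	page->onlined = 1;   /* the real one would free the page to the allocator */
	return 0;
}

int online_page(struct page *page)
{
	if (online_page_callback)
		return online_page_callback(page);
	return arch_default_online_page(page);
}

/* Hypothetical balloon hook: keep the page instead of freeing it. */
static int balloon_online_page(struct page *page)
{
	page->ballooned = 1;
	return 0;
}

/* Hypothetical registration helper; refuses to replace an existing hook. */
static int set_online_page_callback(int (*cb)(struct page *page))
{
	if (online_page_callback)
		return -1;
	online_page_callback = cb;
	return 0;
}
```

Routing the assignment through a helper rather than writing the pointer directly is a design choice of the sketch: it gives one place to reject a second driver that tries to take over the hook.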

Thanks,
-Kame

--

From: Anthony Liguori
Date: Wednesday, April 2, 2008 - 2:36 pm

s:Xen/kvm:Xen:g

We don't need anything special for KVM.  Bare metal memory hotplug 
should be sufficient provided userspace udev scripts are properly 
configured to offline memory automatically.

Regards,


--

From: KAMEZAWA Hiroyuki
Date: Friday, March 28, 2008 - 9:38 pm

Hi,

On Fri, 28 Mar 2008 17:00:05 -0700
I think you should add "onlined" member instead of reusing nr_pages.

But, in general, I have no objection to this way.

Thanks,
-Kame

--

From: Jeremy Fitzhardinge
Date: Friday, March 28, 2008 - 10:48 pm

I suppose.  What would I put into nr_pages?  And anyway, there are no 

The refactoring in general?

    J
--

From: KAMEZAWA Hiroyuki
Date: Friday, March 28, 2008 - 11:26 pm

On Fri, 28 Mar 2008 22:48:21 -0700
My point is "Notifier" is expected to work correctly and include precise

Separating online_pages() into some meaningful blocks. Then, you can
reuse some parts and avoid duplication.

Thanks,
-Kame

--
