The Xen balloon driver needs to separate the process of hot-installing
memory into two phases: one to allocate the page structures and
configure the zones, and another to actually online the pages of newly
installed memory.
This patch splits up the innards of online_pages() into two pieces which
correspond to these two phases. The behaviour of online_pages() itself
is unchanged.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
---
include/linux/memory_hotplug.h | 3 +
mm/memory_hotplug.c | 66 ++++++++++++++++++++++++++++++++--------
2 files changed, 57 insertions(+), 12 deletions(-)
===================================================================
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -57,7 +57,10 @@
/* need some defines for these for archs that don't support it */
extern void online_page(struct page *page);
/* VM interface that may be used by firmware interface */
+extern int prepare_online_pages(unsigned long pfn, unsigned long nr_pages);
+extern unsigned long mark_pages_onlined(unsigned long pfn, unsigned long nr_pages);
extern int online_pages(unsigned long, unsigned long);
+
extern void __offline_isolated_pages(unsigned long, unsigned long);
extern int offline_pages(unsigned long, unsigned long, unsigned long);
===================================================================
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -180,31 +180,35 @@
return 0;
}
-
-int online_pages(unsigned long pfn, unsigned long nr_pages)
+/* Tell anyone who's interested that we're onlining some memory */
+static int notify_going_online(unsigned long pfn, unsigned long nr_pages)
{
- unsigned long flags;
- unsigned long onlined_pages = 0;
- struct zone *zone;
- int need_zonelists_rebuild = 0;
+ struct ...
That's kind of a weird line in the patch. How'd that get there? Why are
you moving 'arg'?
This looks OK, but it does add ~45 lines of code, and I'm not immediately
sure how you're going to use it. Could you address that a bit?
I do kinda wish you'd take a real hard look at what the new functions
The comment is good, but the function name is not. :) How about
Isn't the comment on this one a bit redundant? :)
This looks to me to have become the real online_pages() now. This
function is what goes and individually onlines pages. If someone was
trying to figure out whether to call online_pages() or
We should really wrap up memory notify:
static int memory_notify(int state, unsigned long start_pfn,
			 unsigned long nr_pages, int status_change_nid)
{
	struct memory_notify arg;

	arg.start_pfn = start_pfn;
	arg.nr_pages = nr_pages;
	arg.status_change_nid = status_change_nid;
	/* Return the notifier result, so the function cannot be void. */
	return the_current_memory_notify(state, &arg);
}
We can use that in a couple of spots, right?
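To make the "couple of spots" concrete, here is a minimal userspace sketch of the same wrapper pattern. It is not kernel code: `demo_notify_chain()` stands in for the_current_memory_notify(), the `DEMO_*` event codes are invented for illustration, and the struct only mirrors the three fields used above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the notifier argument; only the fields used above. */
struct memory_notify {
	unsigned long start_pfn;
	unsigned long nr_pages;
	int status_change_nid;
};

/* Hypothetical event codes, for illustration only. */
enum { DEMO_GOING_ONLINE = 1, DEMO_ONLINE = 2 };

/* What the mock chain saw last, so a caller can inspect it. */
static struct memory_notify last_arg;
static int last_state;

/* Hypothetical stand-in for the real notifier call: records its input. */
static int demo_notify_chain(int state, const struct memory_notify *arg)
{
	last_state = state;
	last_arg = *arg;
	return 0;
}

/* The wrapper: fill in the argument block in one place, pass the result up. */
static int memory_notify_wrapped(int state, unsigned long start_pfn,
				 unsigned long nr_pages, int status_change_nid)
{
	struct memory_notify arg = {
		.start_pfn = start_pfn,
		.nr_pages = nr_pages,
		.status_change_nid = status_change_nid,
	};

	assert(nr_pages > 0);	/* an empty range is a caller bug in this sketch */
	return demo_notify_chain(state, &arg);
}
```

Both the GOING_ONLINE and the ONLINE call sites could then shrink to a single call each, with no 'arg' left to recycle between them.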
-- Dave
--
arg is the notifier arg. This function is just wrapping up the
GOING_ONLINE notifier. The code is just a copy, so it isn't doing
anything it wasn't doing before.
The original code also recycles arg for the ONLINE notifier, but that
Sure. When the balloon driver wants to increase the domain's size, but
it finds it's run out of page structures to grow into, it hotplug-adds
some memory. This code uses the add_memory_resource() function I posted
the patch for yesterday. (Error-checking removed for brevity.)
static void balloon_expand(unsigned pages)
{
	struct resource *res;
	int ret;
	u64 size = (u64)pages * PAGE_SIZE;
	unsigned pfn;
	unsigned start_pfn, end_pfn;

	res = kzalloc(sizeof(*res), GFP_KERNEL);
	res->name = "Xen Balloon";
	res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
	ret = allocate_resource(&iomem_resource, res, size, 0, -1,
				1ul << SECTION_SIZE_BITS, NULL, NULL);
	start_pfn = res->start >> PAGE_SHIFT;
	end_pfn = (res->end + 1) >> PAGE_SHIFT;
	ret = add_memory_resource(0, res);
	ret = prepare_online_pages(start_pfn, pages);
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		SetPageReserved(page);
		set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
		balloon_append(page);	/* add to a list of balloon pages */
	}
}
So this just gives us some page structures, but there's no underlying
memory yet. Later, the balloon driver starts populating the pages with
real memory behind them:
	for (i = 0; i < nr_pages; i++) {
		page = balloon_retrieve();
		pfn = page_to_pfn(page);
		/* frame_list is set of real memory pages */
		set_phys_to_machine(pfn, frame_list[i]);
		/* Relinquish the page back to the allocator. */
		mark_pages_onlined(pfn, 1);
		/* Link back into the page tables if not highmem. */
		if (!PageHighMem(page)) {
			int ret;

			ret = HYPERVISOR_update_va_mapping(
				(unsigned long)__va(pfn << PAGE_SHIFT),
				mfn_pte(frame_list[i], PAGE_KERNEL),
				0);
		}
	}
Sure. ...
There is at least one other user that I know of, which is the ehea
driver. They're running through patches now to use it.

Anyway, we need the notifier. We're only going to get more and more
drivers that need notification. There may be one user now, but that's
no reason to rip it out. Feel free to revisit it in a year if there's
still only one user. :)
-- Dave
--
I've been thinking about this some more, and I wish that you wouldn't
just throw this interface away or completely disable it. It actually
does *exactly* what you want in a way. :)

When the /memoryXX/ directory appears, that means that the hardware has
found the memory, and that the 'struct page' is allocated and ready to
be initialized.

When the OS actually wants to use the memory (initialize the 'struct
page', and free_page() it), it does the 'echo online > /sys...'. Both
the 'struct page' and the memory represented by it are untouched until
the "online". This was originally in place to avoid fragmenting it
immediately in the case that the system did not need it.

To me, it sounds like the only different thing that you want is to make
sure that only partial sections are onlined. So, shall we work with the
existing interfaces to online partial sections, or will we just disable
it entirely when we see Xen?

For Xen and KVM, how does it get decided that the guest needs more
memory? Is this guest or host driven? Both? How is the guest notified?
Is guest userspace involved at all?
-- Dave
--
I had no intention of globally disabling it. I just need to disable it
Well, yes and no.
For the current balloon driver, it doesn't make much sense. It would
add a fair amount of complexity without any real gain. It's currently
based around alloc_page/free_page. When it wants to shrink the domain
and give memory back to the host, it allocates pages, adds the page
structures to a ballooned pages list, and strips off the backing memory
and gives it to the host. Growing the domain is the converse: it gets
pages from the host, pulls page structures off the list, binds them
together and frees them back to the kernel. If it runs out of ballooned
page structures, it hotplugs in some memory to add more.
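As a rough illustration of that grow/shrink cycle, here is a toy userspace model. It is only a sketch under stated assumptions: every `demo_*` name, the fixed 64-entry pool and the chunk size of 8 are invented, and balloon_append()/balloon_retrieve() here just borrow the names used earlier (with a different signature) to manage a plain linked list.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PAGES     64	/* capacity of this toy model only */
#define DEMO_HOTPLUG_CHUNK  8	/* page structures added per mock hotplug */

struct demo_page {
	int backed;		/* 1 if host memory is attached */
	struct demo_page *next;	/* link on the ballooned list */
};

struct demo_domain {
	struct demo_page pages[DEMO_MAX_PAGES];
	size_t nr_structs;		/* page structures that exist so far */
	size_t nr_backed;		/* pages with memory behind them */
	struct demo_page *balloon;	/* structures with no backing memory */
};

static void balloon_append(struct demo_domain *d, struct demo_page *p)
{
	p->backed = 0;
	p->next = d->balloon;
	d->balloon = p;
}

static struct demo_page *balloon_retrieve(struct demo_domain *d)
{
	struct demo_page *p = d->balloon;

	if (p)
		d->balloon = p->next;
	return p;
}

/* Mock hotplug: create more page structures, none of them backed yet. */
static int demo_hotplug(struct demo_domain *d)
{
	size_t i;

	if (d->nr_structs + DEMO_HOTPLUG_CHUNK > DEMO_MAX_PAGES)
		return -1;
	for (i = 0; i < DEMO_HOTPLUG_CHUNK; i++)
		balloon_append(d, &d->pages[d->nr_structs++]);
	return 0;
}

/* Grow: bind host memory to ballooned structures; hotplug if the list is empty. */
static size_t demo_grow(struct demo_domain *d, size_t want)
{
	size_t done = 0;

	while (done < want) {
		struct demo_page *p = balloon_retrieve(d);

		if (!p) {
			if (demo_hotplug(d) < 0)
				break;	/* no room for more page structures */
			continue;
		}
		assert(!p->backed);
		p->backed = 1;
		d->nr_backed++;
		done++;
	}
	return done;
}

/* Shrink: strip the backing memory and park the structure on the list. */
static size_t demo_shrink(struct demo_domain *d, size_t want)
{
	size_t i, done = 0;

	for (i = 0; i < d->nr_structs && done < want; i++) {
		if (!d->pages[i].backed)
			continue;
		balloon_append(d, &d->pages[i]);
		d->nr_backed--;
		done++;
	}
	return done;
}
```

In this model a hotplug only happens on the grow path, and only once the ballooned list is empty, which is the on-demand behaviour described above.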
That said, if (partial-)sections were much smaller - say 2-4 meg - and
page migration/defrag worked reliably, then we could probably do without
the balloon driver and do it all in terms of memory hot plug/unplug.
That would give us a general mechanism which could be driven from
userspace and/or have in-kernel Xen/kvm/s390/etc policy modules. Aside
from small sections, the only additional requirement would be an online
hook which can actually attach backing memory to the pages being
In Xen, either the host or the guest can set the target size for the
domain, which is capped by the host-set limit. Aside from possibly
setting the target size, there's no usermode involvement in managing
ballooning. The virtio balloon driver is similar, though from a quick
look it seems to be entirely driven by the host side.
J
--
Ballooning on KVM (and s390) is very much a different beast from Xen.
With Xen, ballooning is very similar to hotplug in that you're adding
and removing physical memory from the guest. The use of alloc_page() to
implement it instead of hotplug is for the reasons Jeremy's outlined
above. Logically though, it's hotplug.

For KVM and s390, ballooning is really a primitive form of guest page
hinting. The host asks the guest to allocate some memory and the guest
allocates what it can, and then tells the host which pages they were.
It's basically saying the pages are Unused and then the host may move
those pages from Up=>Uz which reduces the resident size of the guest.
The virtual size stays the same though. We can enforce limits on the
resident size of the guest via the new cgroup memory controller.

The guest is free to reclaim those pages at any time it wants without
informing the host. In fact, we plan to utilize this by implementing a
shrinker and OOM handler in the virtio balloon driver.

Hotplug is still useful for us as it's more efficient to hot-add 1GB of
memory instead of starting out with an extra 1GB and ballooning down.
We wouldn't want to hotplug away every page we balloon though, as we
want to be able to reclaim them if necessary without the host's
intervention.

The host support for KVM ballooning is entirely in userspace, but
that's orthogonal to the discussion at hand really.

Regards,
--
Right, but by disabling it for your case, you have given up all of the
testing that others have done on it. Let's try and see if we can get
How does this deal with things like present_pages in the zones? Does
the total ram just grow with each hot-add, or does it grow on a per-page
Even with 1MB sections and a flat sparsemem map, you're only looking at
~500k of overhead for the sparsemem storage. Less if you use vmemmap.
-- Dave
--
I suppose, but I'm not sure I see the point. What are the benefits of
using this interface? You mentioned that the interface exists so that
it's possible to defer using a newly added piece of memory to avoid
fragmentation. I suppose I can see the point of that
But in the xen-balloon case, the memory is added on-demand precisely
when it's about to be used, and then onlined in pieces as needed.
Extending the usermode interface to allow partial onlining/offlining
doesn't seem very useful for the case of physical hotplug memory, and
it's not at all clear how to do it in a useful way for the xen-balloon
case. Particularly for offlining, since you'd need to guarantee that
Well, there are two ways of looking at it:
 1. either hot-plugging memory immediately adds pages, but they're also
    all immediately allocated and therefore unavailable for general
    use, or
 2. the pages are notionally physically added as they're populated by
    the host
In principle they're equivalent, but I could imagine the former has the
potential to make the VM waste time scanning unfreeable pages.
I'm not sure the patches I've posted are doing this stuff correctly
At the moment my concern is 32-bit x86, which doesn't support vmemmap or
sections smaller than 512MB because of the shortage of page flags bits.
J
--
Not only to avoid fragmentation, but also to notify user level so that
it can prepare for the memory-add event.

When memory is added, there is a notification via udev for each memory
device. On our box, one node which includes some DIMMs and CPUs can be
added by hot-add, and there is another notification for that one node
from ACPI's container device. After user level has checked and
prepared, the user (or a shell script) can online the memory. IIRC,
some user-level applications require this notification.

Basically, I hope the user-level interface differs as little as
possible between physical hotplug and Xen.

So, I would like to understand why memory is added "on-demand" on Xen.
I thought the hypervisor gathers a section's worth of memory and moves
all of it from one guest to another at one time. Gathering it may take
a long time. But moving page by page may cause fragmentation, if my
understanding
I don't understand both of your ideas yet. Could you tell me more? One
of them may be the same as my understanding. But I'm not sure.

Thanks.
--
Yasunori Goto
--
Yeah, I forgot that we didn't have vmemmap on x86-32. Ugh.

OK, here's another idea: Xen (and the balloon driver) already handle a
case where a guest boots up with 2GB of memory but only needs 1GB,
right? It will balloon the guest down to 1GB from 2GB.

Why don't we just have hotplug work that way? When we want to take a
guest from 1GB to 1GB+1 page (or whatever), we just hotplug the entire
section (512MB or 1GB or whatever), actually online the whole thing,
then make the balloon driver take it back to where it *should* be. That
way we're completely reusing existing components that have to be able
to handle this case anyway.

Yeah, this is suboptimal, and it has a possibility of fragmenting the
memory, but it will only be used for the x86-32 case.
-- Dave
--
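To put numbers on that proposal, the following small sketch works out how many whole sections would be hot-added and how many pages the balloon driver would then have to take back. The helper and struct names are invented for illustration, and the example figures assume 4 KiB pages.

```c
#include <assert.h>

/* Result of planning a "hotplug a whole section, then balloon down" step. */
struct demo_plan {
	unsigned long sections_to_add;
	unsigned long pages_to_balloon_out;
};

/*
 * Hypothetical helper: round the shortfall up to whole sections, then
 * report how many of the newly onlined pages are surplus to the target.
 */
static struct demo_plan demo_plan_growth(unsigned long cur_pages,
					 unsigned long target_pages,
					 unsigned long pages_per_section)
{
	struct demo_plan plan = { 0, 0 };
	unsigned long deficit;

	assert(pages_per_section > 0);
	if (target_pages <= cur_pages)
		return plan;	/* shrinking needs no hotplug at all */

	deficit = target_pages - cur_pages;
	plan.sections_to_add =
		(deficit + pages_per_section - 1) / pages_per_section;
	plan.pages_to_balloon_out =
		plan.sections_to_add * pages_per_section - deficit;
	return plan;
}
```

For the 1GB to 1GB+1 page case with 512MB sections, that is one section added (131072 pages) and 131071 pages ballooned straight back out.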
It also requires you actually have the memory on hand to populate the
whole area. 512MB is still a significant chunk on a 2GB server; you may
end up generating significant overall system memory pressure to scrape
together the memory, only to immediately discard it again.
J
--
That's a very good point. Can we make it so that the hypervisors don't
actually allocate the memory to the guest until its first touch? If the
pages are on the freelist, their *contents* shouldn't be touched at all
during the onlining process.

Maybe we could put a special mark on the pages (please no page flag :)
and the allocator can jump in and ask for the page from the hypervisor
before returning it to the system.

I think Anthony had some ideas around this area. It's kinda a poor
man's page hinting.
-- Dave
--
No, not in a Xen direct-pagetable guest. The guest actually sees real
hardware page numbers (mfns) when the hypervisor gives it a page. By
the time the hypervisor gives it a page reference, it is already
guaranteeing that the page is available for guest use. The only thing
that we could do is prevent the guest from mapping the page, but that
doesn't really achieve much.
I think we're getting off track here; this is a lot of extra complexity
to justify allowing usermode to use /sys to online a chunk of hotplugged
memory.
J
--
Oh, once we've let Linux establish ptes to it, we've required that the
hypervisor have it around? How does that work with the balloon driver?
Do we destroy the ptes when giving balloon memory back to the
hypervisor?

If we're talking about i386, then we're set. We don't map the hot-added
memory at all because we only add highmem on i386. The only time we map
these pages is *after* we actually allocate them when they get mapped
Either that, or we're going to develop the entire Xen/kvm memory
hotplug architecture around the soon-to-be-legacy i386 limitations. :)
-- Dave
--
Well, the balloon driver can balloon out lowmem pages, so we have to
deal with mappings either way. But balloon+hotplug would work
Everything also applies to x86-64.
J
--
Yeah, but I'm just talking about hotplugged memory. When we add it, we
don't have to map the added pages (since they're highmem) and don't
have to touch their contents and zero them out, either. Then, the
balloon driver can notice that the memory is too large, and start to
balloon it
Not really, though. We don't have the page->flags shortage or lack of
vmemmap on x86_64.
-- Dave
--
Not at present. But I'd like to change it to manage memory in largepage
I didn't think x86-64 had a notion of highmem.
Right now, I'd rather have a single mechanism that works for both.
J
--
I think there are a few options here. One is to check on the way out of
the allocator that we're not over some Xen-specific limit. Basically
that we aren't about to touch a hardware page for which the hypervisor
hasn't allocated backing memory.

Another is to give pages sitting in the allocator some kind of
associated state or keep them on separate lists. (I think this has
something in common with those s390 CMM patches.) When you want to
allocate a page, you not only pull it off the buddy lists, but you also
have to check with the hypervisor to make sure it has backing store
before you actually return it. You make it non-volatile in CMM-speak (I
think).

If you can't allocate backing store for a page, you toss it over to the
balloon driver (whose whole job is to keep track of pages without
hypervisor backing anyway) and go back to the allocator for another
Yeah, that would be most ideal. But, at the same time, you don't want
to hobble your rockstar x86_64 implementation with quirks inherited
from the crufty 32-bit junk. :)
-- Dave
--
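The second option above (check for backing store on the way out of the allocator, and hand unbacked pages to the balloon driver) can be sketched in userspace roughly as follows. This is a toy model, not an allocator patch: the array-based free list, the `demo_host_budget` counter and all `demo_*` names are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NPAGES 8

struct demo_page {
	int backed;	/* 1 once the mock hypervisor has supplied memory */
	int ballooned;	/* 1 once handed over to the mock balloon driver */
};

static struct demo_page demo_pages[DEMO_NPAGES];
static size_t demo_next_free;	/* toy free list: entries from here onwards */
static size_t demo_host_budget;	/* pages the mock hypervisor can still back */

/* Mock hypervisor call: make sure the page has backing store, or fail. */
static int demo_hyp_populate(struct demo_page *p)
{
	if (p->backed)
		return 0;
	if (demo_host_budget == 0)
		return -1;
	demo_host_budget--;
	p->backed = 1;
	return 0;
}

/*
 * Take pages off the free list until one is found that the hypervisor
 * can back; every page that cannot be backed goes to the balloon
 * instead of being returned to the caller.
 */
static struct demo_page *demo_alloc_page(void)
{
	while (demo_next_free < DEMO_NPAGES) {
		struct demo_page *p = &demo_pages[demo_next_free++];

		assert(!p->ballooned);
		if (demo_hyp_populate(p) == 0)
			return p;
		p->ballooned = 1;	/* no backing store: balloon it, retry */
	}
	return NULL;
}
```

The caller never sees an unbacked page in this model; the cost is an extra hypervisor check on every allocation, which is the trade-off being weighed in this thread.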
On Wed, 02 Apr 2008 15:13:05 -0700
As I mentioned before, you can do that by hooking online_page().
Now, online_page() is per-architecture, so it's not so bad to make
online_page() just a callback.
=
int online_page(struct page *page)
{
	if (online_page_callback)
		return (*online_page_callback)(page);
	return arch_default_online_page(page);
}
=
Maybe it doesn't look so dirty.
Your balloon driver can overwrite this callback pointer.
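For illustration, a self-contained userspace version of that callback idea might look like the sketch below. The cut-down `struct page`, `demo_set_online_page_callback()` and `demo_balloon_online_page()` are hypothetical names for this example only; they are not existing kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct page, just enough for this sketch. */
struct page {
	int online;	/* 1 once released to the (mock) allocator */
	int ballooned;	/* 1 if kept back by the (mock) balloon driver */
};

typedef int (*online_page_fn)(struct page *page);

/* NULL means "use the architecture default". */
static online_page_fn online_page_callback;

static int arch_default_online_page(struct page *page)
{
	page->online = 1;
	return 0;
}

/* The hook point: call the override if one is installed. */
static int online_page(struct page *page)
{
	assert(page != NULL);
	if (online_page_callback)
		return online_page_callback(page);
	return arch_default_online_page(page);
}

/* Hypothetical setter; returns the previous callback so it can be restored. */
static online_page_fn demo_set_online_page_callback(online_page_fn fn)
{
	online_page_fn old = online_page_callback;

	online_page_callback = fn;
	return old;
}

/* Hypothetical balloon override: keep the page instead of onlining it. */
static int demo_balloon_online_page(struct page *page)
{
	page->ballooned = 1;
	return 0;
}
```

A balloon driver would install its override before hot-adding memory and restore the previous pointer afterwards, so that ordinary hotplug keeps the default behaviour.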
Thanks,
-Kame
--
s:Xen/kvm:Xen:g
We don't need anything special for KVM. Bare metal memory hotplug
should be sufficient provided userspace udev scripts are properly
configured to offline memory automatically.
Regards,
--
Hi,
On Fri, 28 Mar 2008 17:00:05 -0700
I think you should add "onlined" member instead of reusing nr_pages.
But, in general, I have no objection to this way.
Thanks,
-Kame
--
I suppose. What would I put into nr_pages? And anyway, there are no
The refactoring in general?
J
--
On Fri, 28 Mar 2008 22:48:21 -0700
My point is "Notifier" is expected to work correctly and include precise
Separating online_pages() into some meaningful blocks. Then, you can
reuse some parts and avoid duplication.
Thanks,
-Kame
--
