This is a standalone version of the VMware Balloon driver. Unlike the
previous version, which tried to integrate the VMware ballooning transport
into the virtio subsystem and use the stock virtio_balloon driver, this one
implements both the controlling thread/algorithm and the hypervisor
transport.

We are submitting a standalone driver because the KVM maintainer (Avi
Kivity) expressed the opinion (rightly) that our transport does not fit well
into the virtqueue paradigm, and thus it does not make much sense to
integrate with virtio.

Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
---
 arch/x86/kernel/cpu/vmware.c | 2
 drivers/misc/Kconfig | 16 +
 drivers/misc/Makefile | 1
 drivers/misc/vmware_balloon.c | 745 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 764 insertions(+), 0 deletions(-)
 create mode 100644 drivers/misc/vmware_balloon.c

diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 1cbed97..dfdb4db 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -22,6 +22,7 @@
  */

 #include <linux/dmi.h>
+#include <linux/module.h>
 #include <asm/div64.h>
 #include <asm/vmware.h>
 #include <asm/x86_init.h>
@@ -101,6 +102,7 @@ int vmware_platform(void)

 	return 0;
 }
+EXPORT_SYMBOL(vmware_platform);

 /*
  * VMware hypervisor takes care of exporting a reliable TSC to the guest.
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 2191c8d..0d0d625 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -311,6 +311,22 @@ config TI_DAC7512
 	  This driver can also be built as a module. If so, the module
 	  will be calles ti_dac7512.

+config VMWARE_BALLOON
+	tristate "VMware Balloon Driver"
+	depends on X86
+	help
+	  This is VMware physical memory management driver which acts
+	  like a "balloon" that can be inflated to reclaim physical pages
+	  by reserving them in the guest and invalidating them in the
+	  monitor, freeing up the underlying machine pages so they can
+	  be ...
On Sun, 4 Apr 2010 14:52:02 -0700 I think I've forgotten what balloon drivers do. Are they as nasty a hack as I remember believing them to be? A summary of what this code sets out to do, and how it does it would be useful. Also please explain the applicability of this driver. Will xen use it? kvm? Out-of-tree code? The code implements a user-visible API (in /proc, at least). Please fully describe the proposed interface(s) in the changelog so we can Oh well, ho hum. Help is needed on working out what to do about this, please. Congrats on the new job, btw ;) --
(I haven't looked at Dmitry's patch yet, so this is from the Xen perspective.) In the simplest form, they just look like a driver which allocates a pile of pages, and the underlying memory gets returned to the hypervisor. When you want the memory back, it reattaches memory to the pageframes and releases the memory back to the kernel. This allows a virtual machine to shrink with respect to its original size. Going the other way - expanding beyond the memory allocation - is a bit trickier because you need to get some new page structures from somewhere. We don't do this in Xen yet, but I've done some experiments with hotplug memory to implement this. Or a simpler approach is to fake The basic idea of the driver is to allow a guest system to give up memory it isn't using so it can be reused by other virtual machines (or the host itself). Xen and KVM already have equivalents in the kernel. Now that I've had a quick look at Dmitry's patch, it's certainly along the same lines as the Xen code, but it isn't clear to me how much code they could end up sharing. There's a couple of similar-looking loops, but the bulk of the code appears to be VMware specific. One area that would be very useful as common code would be some kind of policy engine to drive the balloon driver. That is, something that can look at the VM's state and say "we really have a couple hundred MB of excess memory we could happily give back to the host". And - very important - "don't go below X MB, because then we'll die in a flaming swap storm". At the moment this is driven by vendor-specific tools with heuristics of varying degrees of sophistication (which could be as simple as absolutely manual control). The problem has two sides because there's the decision made by guests on how much memory they can afford to give up, and also on the host side who knows what the system-wide memory pressures are. And it can be affected by hypervisor-specific features, such as whether ...
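To make the inflate/deflate description above concrete, here is a toy user-space model in C. It is only a sketch: the names (toy_guest, toy_inflate, toy_deflate) are made up for illustration and do not come from the Xen, KVM, or VMware drivers. A real driver allocates actual pages and tells the hypervisor about each one; this model only tracks counts.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a guest; all names here are hypothetical. */
struct toy_guest {
    size_t total_pages;    /* pages the guest was booted with */
    size_t balloon_pages;  /* pages currently held by the balloon */
    size_t in_use_pages;   /* pages the guest kernel itself is using */
};

/* Pages the guest could still allocate for its own use. */
static size_t toy_free_pages(const struct toy_guest *g)
{
    assert(g->balloon_pages + g->in_use_pages <= g->total_pages);
    return g->total_pages - g->balloon_pages - g->in_use_pages;
}

/*
 * Inflate: the balloon "allocates" up to `want` free pages, which the
 * hypervisor can then reuse. Returns how many pages were actually taken.
 */
static size_t toy_inflate(struct toy_guest *g, size_t want)
{
    size_t avail = toy_free_pages(g);
    size_t n = want < avail ? want : avail;

    g->balloon_pages += n;
    return n;
}

/*
 * Deflate: the hypervisor backs the pages again and the balloon releases
 * them to the guest. Returns how many pages were given back.
 */
static size_t toy_deflate(struct toy_guest *g, size_t want)
{
    size_t n = want < g->balloon_pages ? want : g->balloon_pages;

    g->balloon_pages -= n;
    return n;
}
```

Note that in this model the guest can only shrink relative to its boot size: deflating never raises the usable memory above total_pages, which matches the remark that growing beyond the original allocation needs new page structures from somewhere.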
On Mon, 05 Apr 2010 15:03:08 -0700 So... does this differ in any fundamental way from what hibernation does, via shrink_all_memory()? --
Just the _all_ bit, and the fact that we need to report the freed page numbers to the hypervisor. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
On Tue, 06 Apr 2010 01:26:11 +0300 So... why not tweak that, rather than implementing some parallel thing? --
I guess the main difference is that freeing memory is not the primary goal; we want to make sure that guest does not use some of its memory without notifying hypervisor first. -- Dmitry --
That's maybe 5 lines of code. Most of the code is focused on interpreting requests from the hypervisor and replying with the page numbers. -- error compiling committee.c: too many arguments to function --
I think Avi was being facetious ("_all_"). Hibernation assumes everything in the machine is going to stop for a while. Ballooning assumes that the machine has a lower memory need for a while, but is otherwise fully operational. Think of it as hot-plug memory at a page granularity.

Historically, all OSes had a (relatively) fixed amount of memory and, since it was fixed in size, there was no sense wasting any of it. In a virtualized world, OSes should be trained to be much more flexible, as one virtual machine's "waste" could/should be another virtual machine's "want". Ballooning is currently the mechanism for this; it places memory pressure on the OS to encourage it to get by with less memory.

Unfortunately, it is difficult even within an OS to determine what memory is wasted and what memory might be used imminently... because LRU is only an approximation of the future. Hypervisors have an even more difficult problem, not only because they must infer this information from external events, but because they can double the problem if they infer the opposite of what the OS actually does.

As Jeremy mentioned, Transcendent Memory (and its Linux implementations "cleancache" and "frontswap") allows a guest kernel to give up memory for the broader good while still retaining a probability that it can get the same data back quickly. This results in more memory fluidity. Transcendent Memory ("tmem") still uses ballooning as the mechanism to create memory pressure... it just provides an insurance policy for that memory pressure.

Avi will point out that it is not clear that KVM can make use of or benefit from tmem, but we need not repeat that discussion here.

Dan --
On Mon, 5 Apr 2010 16:03:48 -0700 (PDT) shrink_all_memory() doesn't require that processes be stopped. If the existing code doesn't exactly match virtualisation's hotplug is different because it targets particular physical pages. For this requirement any old page will do. Preferably one which won't be needed soon, yes? --
The best page would be not an old page but an unused page. We do rely on the standard mechanisms to find pages that can be freed to inflate the balloon, but once pages are allocated they are not available until released. In the case of shrinking memory, it can be allocated and used as soon as we wake up (if the shrink was done in the course of a hibernation sequence). -- Dmitry --
Note that we're using shrink and grow in opposite senses.
shrink_all_memory() is trying to free as much kernel memory as possible,
which to the virtual machine's host looks like the guest is growing
(since it has claimed more memory for its own use). A balloon "shrink"
appears to Linux as allocated memory (i.e., locking down memory within
Linux to make it available to the rest of the system).
The fact that shrink_all_memory() has much deeper insight into the
current state of the vm subsystem is interesting; it has much more to
work with than a simple alloc/free page. Does it actively try to
reclaim cold, unlikely to be used stuff, first? It appears it does to
my mm/-naive eye.
I guess a way to use it in the short term is to have a loop of the form:
    while (guest_size > target) {
        shrink_all_memory(guest_size - target);  /* force pages to be free */
        while ((p = alloc_page(GFP_NORETRY)))    /* vacuum up pages */
            release_page_to_hypervisor(p);
        /* twiddle thumbs */
    }
...assuming the allocation would tend to pick up the pages that
shrink_all_memory just freed.
Or ideally, have a form of shrink_all_memory() which causes pages to
become unused, but rather than freeing them returns them to the caller.
And is there some way to get the vm subsystem to provide backpressure:
"I'm getting desperately short of memory!"? Experience has shown that
administrators often accidentally over-shrink their domains and
effectively kill them. Sometimes due to bad UI - entering the wrong
units - but also because they just don't know what the actual memory
demands are. Or they change over time.
Thanks,
J
--
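The shrink-then-vacuum loop sketched in the message above can be tried out in user space with stand-ins for the kernel calls. This is a hedged simulation, not kernel code: sim_shrink() stands in for shrink_all_memory(), sim_give_page() stands in for the alloc_page() plus release_page_to_hypervisor() pair, and all sim_* names are hypothetical. It differs from the pseudocode in two ways: the inner loop stops once the target is reached rather than draining every free page, and the outer loop gives up when a pass makes no progress instead of spinning.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated guest state; every name here is made up for illustration. */
struct sim_vm {
    size_t guest_size;  /* pages currently owned by the guest */
    size_t free_pages;  /* pages that are free right now */
    size_t cold_pages;  /* pages in use but reclaimable */
};

/* Stand-in for shrink_all_memory(): reclaim up to nr cold pages. */
static size_t sim_shrink(struct sim_vm *vm, size_t nr)
{
    size_t n = nr < vm->cold_pages ? nr : vm->cold_pages;

    vm->cold_pages -= n;
    vm->free_pages += n;
    return n;
}

/* Stand-in for alloc_page() + release to hypervisor: 1 on success. */
static int sim_give_page(struct sim_vm *vm)
{
    if (vm->free_pages == 0)
        return 0;
    vm->free_pages--;
    vm->guest_size--;
    return 1;
}

/* Try to bring the guest down to `target` pages; returns pages released. */
static size_t sim_balloon_to(struct sim_vm *vm, size_t target)
{
    size_t released = 0;

    while (vm->guest_size > target) {
        size_t before = released;

        sim_shrink(vm, vm->guest_size - target);  /* force pages free */
        while (vm->guest_size > target && sim_give_page(vm))
            released++;                           /* vacuum them up */
        if (released == before)
            break;  /* nothing reclaimable left: stop rather than spin */
    }
    return released;
}
```

With 1000 pages, 50 free and 200 cold, asking for a target of 500 releases only 250 pages and leaves the guest at 750; the caller can see the shortfall and treat it as a crude form of the backpressure asked about above.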
On Mon, 05 Apr 2010 16:28:38 -0700 Not really. One could presumably pull dopey tricks by hooking into slab shrinker registration or even ->writepage(). But cooking up something explicit doesn't sound too hard - the trickiest bit would be actually defining what it should do. --
The oft-suggested approach is to look at the I/O load from guests and give more memory to those that are thrashing. Of course not all I/O is directly due to memory pressure. -- error compiling committee.c: too many arguments to function --
Which is why it is very useful to be able to differentiate between: 1) refault I/O (due to pagecache too small, and PFRA choices) 2) swap I/O (due to memory pressure) 3) normal file dirty writes (due to an app's need for persistence) Again, the cleancache and frontswap hooks and APIs separate these out nicely. Dan "who worries he is sounding like a broken record" --
We also need to remember to consolidate the Xen and virtio-balloon drivers. They both have their own GFP flags, for instance, but I think they actually want the exact same thing. They could probably also share that snippet, right? -- Dave --
Sorry, I don't mean to be too self-serving. And I am far less an expert in Linux mm code than others involved in this discussion. But this backpressure metric is one thing that frontswap provides. It also provides an "insurance policy" for "desperately short of memory". It is the "yin" to the "yang" of cleancache.

If I understand the swap subsystem correctly, there IS NO "getting desperately short of memory" except when a swap device is unavailable or, more likely, too darn slow. Frontswap writes synchronously to pseudo-RAM (tmem, in the case of Xen) instead of a slow asynchronous swap device. It hooks directly into swap_writepage()/swap_readpage() in a very clean, well-defined (not dopey) way.

So -- I think -- it is a perfect feedback mechanism to tell a balloon driver (or equivalent), "I need more memory" while covering the short-term need until the balloon driver (and/or hypervisor) can respond. It works today with Xen, and Nitin Gupta is working on an in-kernel memory compression backend for it. And Chris Mason and I think it may also be a fine interface for SSD-used-as-RAM-extension.

So please consider frontswap and cleancache before "cooking up something [else] explicit"... these were previously part of Transcendent Memory postings*, but I have revised them to be more useful, well-defined, and standalone (from Xen/tmem) and will be re-posting the revised versions soon.

Dan

* See:
http://lwn.net/Articles/340080/
http://lkml.indiana.edu/hypermail/linux/kernel/0912.2/01322.html
OLS 2009 proceedings
LCA 2010 proceedings
--
Jeremy provided a very good writeup; I will also expand changelog in the

The driver is expected to be used on VMware platform - mainly ESX. Originally we tried to converge with KVM and use virtio and the stock virtio_balloon driver, but Avi mentioned that our code emulating virtqueue was more than the balloon code itself and thus using virtio did

Thanks ;). BTW, please send input stuff to my gmail address still. -- Dmitry --
Yeah. If we wanted commonality, we could make a balloon_core.c that contains the common code. IMO that's premature, but perhaps there's some meat there (like suspend/resume support and /proc//sys interface). -- error compiling committee.c: too many arguments to function --
I'm really not sure if it makes much sense. Ripping out virtdev/virtqueue

We do not need any special suspend/resume support - the freezeable workqueue is stopped when suspending. Thanks. -- Dmitry --
Ah, virtio_balloon should do the same. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I think it would be useful to have common:
1. User and kernel mode ABIs for controlling ballooning. It assumes
that the different balloon implementations are sufficiently
similar in semantics. (Once there's a kernel ABI, adding a
common user ABI is trivial.)
2. Policy driving the ballooning driver, at least from the guest
side. That is, some good metrics from the vm subsystem about
memory pressure (both positive and negative), and something to
turn those metrics into requests to the balloon driver.
1) is not a huge amount of code, but something consistent would be
nice. 2) is something we've been missing and is a bit of an open
question/research project anyway.
J
--
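As a very rough illustration of item 2 above (turning memory-pressure metrics into requests to the balloon driver), here is a toy guest-side policy in C. Everything in it is an assumption made for the sketch: the vm_sample fields, the headroom heuristic, and the next_target_mb() name are hypothetical, and no such interface exists in the vm subsystem. The one property taken from this thread is the hard floor: the policy never proposes a size below a configured minimum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of guest memory state. */
struct vm_sample {
    size_t free_mb;   /* memory the guest is not using right now */
    int    swapping;  /* nonzero if the guest is actively swapping */
};

/*
 * Propose the next guest size in MB.
 *  - under pressure: ask for `headroom_mb` more;
 *  - idle with more than `headroom_mb` free: give the excess back;
 *  - never go below `floor_mb`.
 */
static size_t next_target_mb(size_t current_mb, size_t floor_mb,
                             size_t headroom_mb, const struct vm_sample *s)
{
    size_t target = current_mb;

    assert(s != NULL);

    if (s->swapping) {
        target = current_mb + headroom_mb;
    } else if (s->free_mb > headroom_mb) {
        size_t excess = s->free_mb - headroom_mb;

        target = current_mb > excess ? current_mb - excess : 0;
    }
    return target < floor_mb ? floor_mb : target;
}
```

For example, a 1024 MB guest with 264 MB free, 64 MB of headroom and a 256 MB floor would propose 824 MB; the same sample on a 300 MB guest is clamped to 256 MB rather than dropping to 100 MB. The hard part, choosing metrics that actually predict demand, is exactly the open question described above and is not addressed by this sketch.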
3) Code that attempts to reclaim 2MB pages when possible -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Yes. Ballooning in 4k units is a bit silly.
J
--
Does it make sense to treat ballooning as a form of memory hotplug? -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
It's a fine granularity form of memory hotplug, yes. -- error compiling committee.c: too many arguments to function --
It has some similarities. The main difference is granularity;
ballooning works in pages (typically 4k, but 2M probably makes more
sense), whereas memory hotplug works in DIMM-like sizes (256MB+).
That's way too coarse for us; a domain might only have 256MB or less to
start with.
I experimented with a sort of hybrid scheme, in which I used hotplug
memory to add new struct pages to the system, but only incrementally
populated the underlying pages with the balloon driver. That worked
pretty well, but it doesn't fit very well with how memory hotplug works
(at least when I last looked at it a couple of years ago).
J
--
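The granularity gap mentioned above is easy to quantify. The following small C helper (units_in and the KIB/MIB macros are names invented for this sketch) counts how many balloon-sized units fit in a hotplug-sized region, using the sizes quoted in the message: 4k pages, 2M pages, and 256MB hotplug units.

```c
#include <assert.h>
#include <stddef.h>

#define KIB(x) ((size_t)(x) << 10)
#define MIB(x) ((size_t)(x) << 20)

/* Number of `unit_bytes`-sized units in a region; sizes must divide evenly. */
static size_t units_in(size_t region_bytes, size_t unit_bytes)
{
    assert(unit_bytes != 0 && region_bytes % unit_bytes == 0);
    return region_bytes / unit_bytes;
}
```

So one 256MB hotplug unit corresponds to units_in(MIB(256), KIB(4)) = 65536 separate 4k balloon operations, or 128 operations at 2M granularity, and a single 2M page covers 512 4k pages.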
This is a standalone version of the VMware Balloon driver. Ballooning is a
technique that allows the hypervisor to dynamically limit the amount of
memory available to the guest (with guest cooperation). In the overcommit
scenario, when the hypervisor detects that it needs to shuffle some memory,
it instructs the driver to allocate a certain number of pages, and the
underlying memory gets returned to the hypervisor. Later the hypervisor may
return memory to the guest by reattaching memory to the pageframes and
instructing the driver to "deflate" the balloon.

Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
---

Unlike the previous version, which tried to integrate the VMware ballooning
transport into the virtio subsystem and use the stock virtio_balloon driver,
this one implements both the controlling thread/algorithm and the hypervisor
transport.

We are submitting a standalone driver because the KVM maintainer (Avi
Kivity) expressed the opinion (rightly) that our transport does not fit well
into the virtqueue paradigm, and thus it does not make much sense to
integrate with virtio.

There were also some concerns about whether the current ballooning technique
is the right thing. If a better framework to achieve this appears, we are
prepared to evaluate it and switch to using it, but in the meantime we'd
like to get this driver upstream.

Changes since v1:
- added comments throughout the code;
- exported stats moved from /proc to debugfs;
- better changelog.

 arch/x86/kernel/cpu/vmware.c | 2
 drivers/misc/Kconfig | 16 +
 drivers/misc/Makefile | 1
 drivers/misc/vmware_balloon.c | 808 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 827 insertions(+), 0 deletions(-)
 create mode 100644 drivers/misc/vmware_balloon.c

diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 1cbed97..dfdb4db 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -22,6 +22,7 @@
  */

 #include <linux/dmi.h>
+#include <linux/module.h>
 #include <asm/div64.h> ...
Andrew, Do you see any issues with the driver? Will you be the one picking it up and queueing for mainline? Thanks, --
On Wed, 21 Apr 2010 12:59:35 -0700 Spose so. --
Good. I don't suppose we have a chance of making it into .34? Being a completely new driver and all... Thanks, Dmitry --
On Wed, 21 Apr 2010 13:52:08 -0700 It's foggy. Is there a good-sounding reason for pushing it in this late? --
We want to get the driver accepted in distributions so that users do not have to deal with an out-of-tree module, and many distributions have an "upstream first" requirement. The driver has been shipping for a number of years, and users running on the VMware platform will have it installed as part of VMware Tools even if it does not come from a distribution, thus there should not be additional risk in pulling the driver into mainline. The driver will only activate if the host is VMware, so everyone else should not be affected at all. Thanks, Dmitry --
On Thu, 15 Apr 2010 14:00:31 -0700 This is OK for both x86_32 and x86_64? afaict all the stats stuff is useless if CONFIG_DEBUG_FS=n. Perhaps in that case the vmballoon.stats field should be omitted and STATS_INC --
These control inflating/deflating rate of the balloon, measured in

OK, will do. Thanks Andrew. -- Dmitry --
OK, so here is the incremental patch addressing your comments. Or do you
want the entire thing resent?

Thanks.

-- 
Dmitry

vmware-balloon: miscellaneous fixes

- document rate allocation constants
- do not compile statistics code when debugfs is disabled
- fix compilation error when debugfs is disabled

Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
---
 drivers/misc/vmware_balloon.c | 38 +++++++++++++++++++++++++++++++-------
 1 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/vmware_balloon.c b/drivers/misc/vmware_balloon.c
index 90bba04..e7161c4 100644
--- a/drivers/misc/vmware_balloon.c
+++ b/drivers/misc/vmware_balloon.c
@@ -50,12 +50,28 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");

+/*
+ * Various constants controlling rate of inflating/deflating balloon,
+ * measured in pages.
+ */
+
+/*
+ * Rate of allocating memory when there is no memory pressure
+ * (driver performs non-sleeping allocations).
+ */
 #define VMW_BALLOON_NOSLEEP_ALLOC_MAX 16384U

+/*
+ * Rates of memory allocation when guest experiences memory pressure
+ * (driver performs sleeping allocations).
+ */
 #define VMW_BALLOON_RATE_ALLOC_MIN 512U
 #define VMW_BALLOON_RATE_ALLOC_MAX 2048U
 #define VMW_BALLOON_RATE_ALLOC_INC 16U

+/*
+ * Rates for releasing pages while deflating balloon.
+ */
 #define VMW_BALLOON_RATE_FREE_MIN 512U
 #define VMW_BALLOON_RATE_FREE_MAX 16384U
 #define VMW_BALLOON_RATE_FREE_INC 16U
@@ -85,6 +101,10 @@ MODULE_LICENSE("GPL");
 /* Maximum number of page allocations without yielding processor */
 #define VMW_BALLOON_YIELD_THRESHOLD 1024

+
+/*
+ * Hypervisor communication port definitions.
+ */
 #define VMW_BALLOON_HV_PORT 0x5670
 #define VMW_BALLOON_HV_MAGIC 0x456c6d6f
 #define VMW_BALLOON_PROTOCOL_VERSION 2
@@ -125,8 +145,7 @@ MODULE_LICENSE("GPL");
 __stat & -1UL; \
 })

-#define STATS_INC(stat) (stat)++
-
+#ifdef CONFIG_DEBUG_FS
 struct ...
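The "do not compile statistics code when debugfs is disabled" item can be illustrated with a small stand-alone C sketch. Since the diff above is cut off right after the #ifdef, this is not the actual patch content: the demo_* names are hypothetical, and only the STATS_INC macro name and the CONFIG_DEBUG_FS symbol come from the thread. The idea is that the stats structure and the counter increments exist only when CONFIG_DEBUG_FS is defined, and STATS_INC() expands to a no-op otherwise, so callers need no #ifdefs of their own.

```c
#include <assert.h>
#include <stddef.h>

#ifdef CONFIG_DEBUG_FS
/* Counters exist only when they can be exported through debugfs. */
struct demo_stats {
    unsigned int alloc;
};
#define STATS_INC(stat) ((stat)++)
#else
/* The argument is discarded, so it may name fields that do not exist. */
#define STATS_INC(stat) do { } while (0)
#endif

/* Hypothetical balloon state, loosely modelled on the discussion. */
struct demo_balloon {
    unsigned int size;  /* pages currently in the balloon */
#ifdef CONFIG_DEBUG_FS
    struct demo_stats stats;
#endif
};

/* Call sites look the same in both configurations. */
static void demo_alloc_page(struct demo_balloon *b)
{
    assert(b != NULL);
    STATS_INC(b->stats.alloc);
    b->size++;
}

/* Number of recorded allocations; always 0 when stats are compiled out. */
static unsigned int demo_alloc_count(const struct demo_balloon *b)
{
#ifdef CONFIG_DEBUG_FS
    return b->stats.alloc;
#else
    (void)b;
    return 0;
#endif
}
```

Building with -DCONFIG_DEBUG_FS turns the counters on; without it the struct shrinks and the increments disappear at preprocessing time, which is what makes the field safe to omit.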
