Re: [PATCH] VMware Balloon driver

From: Dmitry Torokhov
Date: Sunday, April 4, 2010 - 2:52 pm

This is a standalone version of the VMware Balloon driver. Unlike the
previous version, which tried to integrate the VMware ballooning transport
into the virtio subsystem and use the stock virtio_balloon driver, this one
implements both the controlling thread/algorithm and the hypervisor
transport.

We are submitting a standalone driver because the KVM maintainer (Avi
Kivity) expressed the opinion (rightly) that our transport does not fit
well into the virtqueue paradigm and thus it does not make much sense to
integrate with virtio.

Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
---

 arch/x86/kernel/cpu/vmware.c  |    2 
 drivers/misc/Kconfig          |   16 +
 drivers/misc/Makefile         |    1 
 drivers/misc/vmware_balloon.c |  745 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 764 insertions(+), 0 deletions(-)
 create mode 100644 drivers/misc/vmware_balloon.c


diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 1cbed97..dfdb4db 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -22,6 +22,7 @@
  */
 
 #include <linux/dmi.h>
+#include <linux/module.h>
 #include <asm/div64.h>
 #include <asm/vmware.h>
 #include <asm/x86_init.h>
@@ -101,6 +102,7 @@ int vmware_platform(void)
 
 	return 0;
 }
+EXPORT_SYMBOL(vmware_platform);
 
 /*
  * VMware hypervisor takes care of exporting a reliable TSC to the guest.
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 2191c8d..0d0d625 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -311,6 +311,22 @@ config TI_DAC7512
 	  This driver can also be built as a module. If so, the module
 	  will be calles ti_dac7512.
 
+config VMWARE_BALLOON
+	tristate "VMware Balloon Driver"
+	depends on X86
+	help
+	  This is VMware physical memory management driver which acts
+	  like a "balloon" that can be inflated to reclaim physical pages
+	  by reserving them in the guest and invalidating them in the
+	  monitor, freeing up the underlying machine pages so they can
+	  be ...
From: Andrew Morton
Date: Monday, April 5, 2010 - 2:24 pm

On Sun, 4 Apr 2010 14:52:02 -0700

I think I've forgotten what balloon drivers do.  Are they as nasty a
hack as I remember believing them to be?

A summary of what this code sets out to do, and how it does it would be
useful.

Also please explain the applicability of this driver.  Will xen use it?
kvm?  Out-of-tree code?

The code implements a user-visible API (in /proc, at least).  Please
fully describe the proposed interface(s) in the changelog so we can




Oh well, ho hum.  Help is needed on working out what to do about this,
please.

Congrats on the new job, btw ;)

--

From: Jeremy Fitzhardinge
Date: Monday, April 5, 2010 - 3:03 pm

(I haven't looked at Dmitry's patch yet, so this is from the Xen 
perspective.)

In the simplest form, they just look like a driver which allocates a 
pile of pages, and the underlying memory gets returned to the 
hypervisor.  When you want the memory back, it reattaches memory to the 
pageframes and releases the memory back to the kernel.  This allows a 
virtual machine to shrink with respect to its original size.
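
As a rough illustration of the inflate/deflate cycle described above, here is
a minimal user-space sketch in C. It is not taken from any of the drivers
discussed in this thread: malloc()/aligned_alloc() stand in for the kernel
page allocator, and hv_release_page()/hv_reattach_page() are hypothetical
no-op placeholders for whatever call a real hypervisor would provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SZ 4096u

/* One ballooned page; the list node is kept outside the page itself. */
struct balloon_page {
	void *mem;
	struct balloon_page *next;
};

struct balloon {
	struct balloon_page *pages;	/* pages currently held by the balloon */
	size_t nr_pages;
};

/* Hypothetical stand-ins for hypervisor calls; they do nothing here. */
static void hv_release_page(void *mem) { (void)mem; }
static void hv_reattach_page(void *mem) { (void)mem; }

/*
 * Inflate: take up to n pages from the allocator and hand the backing
 * memory to the host. Returns the number of pages actually added.
 */
static size_t balloon_inflate(struct balloon *b, size_t n)
{
	size_t done = 0;

	while (done < n) {
		struct balloon_page *bp = malloc(sizeof(*bp));

		if (!bp)
			break;
		bp->mem = aligned_alloc(PAGE_SZ, PAGE_SZ);
		if (!bp->mem) {
			free(bp);
			break;
		}
		hv_release_page(bp->mem);
		bp->next = b->pages;
		b->pages = bp;
		b->nr_pages++;
		done++;
	}
	return done;
}

/*
 * Deflate: ask the host to back up to n pages again, then return them
 * to the allocator. Returns the number of pages actually released.
 */
static size_t balloon_deflate(struct balloon *b, size_t n)
{
	size_t done = 0;

	while (done < n && b->pages) {
		struct balloon_page *bp = b->pages;

		assert(b->nr_pages > 0);
		b->pages = bp->next;
		hv_reattach_page(bp->mem);
		free(bp->mem);
		free(bp);
		b->nr_pages--;
		done++;
	}
	return done;
}
```

Starting from an empty balloon, inflating by 8 pages and then deflating by 3
leaves 5 pages held; deflating by more than is held simply empties it.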

Going the other way - expanding beyond the memory allocation - is a bit 
trickier because you need to get some new page structures from 
somewhere.   We don't do this in Xen yet, but I've done some experiments 
with hotplug memory to implement this.  Or a simpler approach is to fake 
The basic idea of the driver is to allow a guest system to give up 
memory it isn't using so it can be reused by other virtual machines (or 
the host itself).

Xen and KVM already have equivalents in the kernel.  Now that I've had a 
quick look at Dmitry's patch, it's certainly along the same lines as the 
Xen code, but it isn't clear to me how much code they could end up 
sharing.  There's a couple of similar-looking loops, but the bulk of the 
code appears to be VMware specific.

One area that would be very useful as common code would be some kind of 
policy engine to drive the balloon driver.  That is, something that can 
look at the VM's state and say "we really have a couple hundred MB of 
excess memory we could happily give back to the host".  And - very 
important - "don't go below X MB, because then we'll die in a flaming 
swap storm".
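
To make the two policy questions above concrete ("how much can we give
back" and "don't go below X MB"), here is a small sketch in C. The struct
fields and the headroom idea are assumptions made for illustration only; no
such interface exists in the patches discussed here, and a real policy would
need far better inputs than a single in-use estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Guest-side view of memory, in MB. All fields are hypothetical inputs. */
struct mem_state {
	uint64_t total_mb;	/* memory currently assigned to the guest */
	uint64_t in_use_mb;	/* estimate of what the guest really needs */
	uint64_t floor_mb;	/* never shrink the guest below this size */
	uint64_t headroom_mb;	/* slack kept on top of in_use_mb */
};

/*
 * Return how many MB the guest could hand back to the host without
 * dropping below either its estimated need plus headroom or the floor.
 */
static uint64_t balloon_excess_mb(const struct mem_state *s)
{
	uint64_t keep;

	assert(s != NULL);
	keep = s->in_use_mb + s->headroom_mb;
	if (keep < s->floor_mb)
		keep = s->floor_mb;	/* the "don't go below X MB" rule */
	if (keep >= s->total_mb)
		return 0;		/* nothing to spare */
	return s->total_mb - keep;
}
```

For example, a 2048 MB guest using about 900 MB with 128 MB of headroom and
a 512 MB floor could give back 1020 MB under this rule.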

At the moment this is driven by vendor-specific tools with heuristics of 
varying degrees of sophistication (which could be as simple as 
absolutely manual control).  The problem has two sides because there's 
the decision made by guests on how much memory they can afford to give 
up, and also on the host side who knows what the system-wide memory 
pressures are.  And it can be affected by hypervisor-specific features, 
such as whether ...
From: Andrew Morton
Date: Monday, April 5, 2010 - 3:17 pm

On Mon, 05 Apr 2010 15:03:08 -0700


So...  does this differ in any fundamental way from what hibernation
does, via shrink_all_memory()?

--

From: Avi Kivity
Date: Monday, April 5, 2010 - 3:26 pm

Just the _all_ bit, and the fact that we need to report the freed page 
numbers to the hypervisor.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Andrew Morton
Date: Monday, April 5, 2010 - 3:40 pm

On Tue, 06 Apr 2010 01:26:11 +0300

So...  why not tweak that, rather than implementing some parallel thing?
--

From: Dmitry Torokhov
Date: Monday, April 5, 2010 - 4:01 pm

I guess the main difference is that freeing memory is not the primary
goal; we want to make sure that guest does not use some of its memory
without notifying hypervisor first.

-- 
Dmitry

--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 9:28 am

That's maybe 5 lines of code.  Most of the code is focused on 
interpreting requests from the hypervisor and replying with the page 
numbers.

-- 
error compiling committee.c: too many arguments to function

--

From: Dan Magenheimer
Date: Monday, April 5, 2010 - 4:03 pm

I think Avi was being facetious ("_all_").  Hibernation assumes
everything in the machine is going to stop for awhile.  Ballooning
assumes that the machine has lower memory need for awhile, but
is otherwise fully operational.  Think of it as hot-plug memory
at a page granularity.

Historically, all OS's had a (relatively) fixed amount of memory
and, since it was fixed in size, there was no sense wasting any of it.
In a virtualized world, OS's should be trained to be much more
flexible as one virtual machine's "waste" could/should be another
virtual machine's "want".  Ballooning is currently the mechanism
for this; it places memory pressure on the OS to encourage it
to get by with less memory.  Unfortunately, it is difficult even
within an OS to determine what memory is wasted and what memory
might be used imminently... because LRU is only an approximation of
the future.  Hypervisors have an even more difficult problem not
only because they must infer this information from external events,
but they can double the problem if they infer the opposite of what
the OS actually does.

As Jeremy mentioned, Transcendent Memory (and its Linux implementations
"cleancache" and "frontswap") allows a guest kernel to give up memory
for the broader good while still retaining a probability that it
can get the same data back quickly.  This results in more memory
fluidity. Transcendent Memory ("tmem") still uses ballooning as
the mechanism to create memory pressure... it just provides an
insurance policy for that memory pressure.

Avi will point out that it is not clear that KVM can make use of
or benefit from tmem, but we need not repeat that discussion here.

Dan
--

From: Andrew Morton
Date: Monday, April 5, 2010 - 4:11 pm

On Mon, 5 Apr 2010 16:03:48 -0700 (PDT)

shrink_all_memory() doesn't require that processes be stopped.

If the existing code doesn't exactly match virtualisation's

hotplug is different because it targets particular physical pages.  For
this requirement any old page will do.  Preferably one which won't be
needed soon, yes?

--

From: Dmitry Torokhov
Date: Monday, April 5, 2010 - 4:28 pm

The best page would be not an old page but an unused page.

We do rely on the standard mechanisms to find pages that can be freed to
inflate the balloon, but once pages are allocated they are not available
until released. In the case of shrinking, memory can be allocated and used
as soon as we wake up (if the shrink was done in the course of a
hibernation sequence).

-- 
Dmitry
--

From: Jeremy Fitzhardinge
Date: Monday, April 5, 2010 - 4:28 pm

Note that we're using shrink and grow in opposite senses.  
shrink_all_memory() is trying to free as much kernel memory as possible, 
which to the virtual machine's host looks like the guest is growing 
(since it has claimed more memory for its own use).  A balloon "shrink" 
appears to Linux as allocated memory (ie, locking down memory within 
Linux to make it available to the rest of system).

The fact that shrink_all_memory() has much deeper insight into the 
current state of the vm subsystem is interesting; it has much more to 
work with than a simple alloc/free page.  Does it actively try to 
reclaim cold, unlikely to be used stuff, first?  It appears it does to 
my mm/ naive eye.

I guess a way to use it in the short term is to have a loop of the form:

	while (guest_size > target) {
		shrink_all_memory(guest_size - target);		/* force pages to be free */
		while ((p = alloc_page(GFP_NORETRY)) != NULL)	/* vacuum up pages */
			release_page_to_hypervisor(p);
		/* twiddle thumbs */
	}

...assuming the allocation would tend to pick up the pages that 
shrink_all_memory just freed.

Or ideally, have a form of shrink_all_memory() which causes pages to 
become unused, but rather than freeing them returns them to the caller.

And is there some way to get the vm subsystem to provide backpressure: 
"I'm getting desperately short of memory!"?  Experience has shown that 
administrators often accidentally over-shrink their domains and 
effectively kill them.  Sometimes due to bad UI - entering the wrong 
units - but also because they just don't know what the actual memory 
demands are.  Or they change over time.

Thanks,
     J
--

From: Andrew Morton
Date: Monday, April 5, 2010 - 4:34 pm

On Mon, 05 Apr 2010 16:28:38 -0700

Not really.  One could presumably pull dopey tricks by hooking into
slab shrinker registration or even ->writepage().  But cooking up
something explicit doesn't sound too hard - the trickiest bit would be
actually defining what it should do.

--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 9:30 am

The oft-suggested approach is to look at the I/O load from guests and 
give more memory to those that are thrashing.  Of course not all I/O is 
directly due to memory pressure.

-- 
error compiling committee.c: too many arguments to function

--

From: Dan Magenheimer
Date: Tuesday, April 6, 2010 - 10:27 am

Which is why it is very useful to be able to differentiate between:
1) refault I/O (due to pagecache too small, and PFRA choices)
2) swap I/O (due to memory pressure)
3) normal file dirty writes (due to an app's need for persistence)

Again, the cleancache and frontswap hooks and APIs separate these
out nicely.

Dan "who worries he is sounding like a broken record"
--

From: Dave Hansen
Date: Tuesday, April 6, 2010 - 4:20 pm

We also need to remember to consolidate the Xen and virtio-balloon
drivers.  They both have their own GFP flags, for instance, but I think
they actually want the exact same thing.  They could probably also share
that snippet, right?

-- Dave

--

From: Dan Magenheimer
Date: Monday, April 5, 2010 - 5:26 pm

Sorry, I don't mean to be too self-serving.  And I am far less
an expert in Linux mm code than others involved in this discussion.

But this backpressure metric is one thing that frontswap provides.
It also provides an "insurance policy" for "desperately short
of memory".  It is the "yin" to the "yang" of cleancache.

If I understand the swap subsystem correctly, there IS NO
"getting desperately short of memory" except when a swap
device is unavailable or, more likely, too darn slow.

Frontswap writes synchronously to pseudo-RAM (tmem, in the
case of Xen) instead of a slow asynchronous swap device.  It
hooks directly into swap_writepage()/swap_readpage() in
a very clean, well-defined (not dopey) way.
So -- I think -- it is a perfect feedback mechanism to
tell a balloon driver (or equivalent), "I need more memory"
while covering the short-term need until the balloon driver
(and/or hypervisor) can respond.

It works today with Xen, and Nitin Gupta is working on an
in-kernel memory compression backend for it.  And Chris Mason
and I think it may also be a fine interface for SSD-used-
as-RAM-extension.

So please consider frontswap and cleancache before "cooking
up something [else] explicit"...  these were previously part
of Transcendent Memory postings*, but I have revised them to
be more useful, well-defined, and standalone (from Xen/tmem)
and will be re-posting the revised versions soon.

Dan

* See:
http://lwn.net/Articles/340080/ 
http://lkml.indiana.edu/hypermail/linux/kernel/0912.2/01322.html 
OLS 2009 proceedings
LCA 2010 proceedings
--

From: Dmitry Torokhov
Date: Monday, April 5, 2010 - 3:58 pm

Jeremy provided a very good writeup; I will also expand the changelog in the

The driver is expected to be used on VMware platform - mainly ESX.
Originally we tried to converge with KVM and use virtio and
stock virtio_balloon driver but Avi mentioned that our code emulating
virtqueue was more than balloon code itself and thus using virtio did




Thanks ;). BTW, please still send input stuff to my gmail address.

-- 
Dmitry

--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 9:32 am

Yeah.  If we wanted commonality, we could make a balloon_core.c that 
contains the common code.  IMO that's premature, but perhaps there's 
some meat there (like suspend/resume support and the /proc or /sys interface).

-- 
error compiling committee.c: too many arguments to function

--

From: Dmitry Torokhov
Date: Tuesday, April 6, 2010 - 10:06 am

I am really not sure if it makes much sense. Ripping out virtdev/virtqueue

We do not need any special suspend/resume support - the freezeable
workqueue is stopped when suspending.

Thanks.

-- 
Dmitry
--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 10:42 am

Ah, virtio_balloon should do the same.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Jeremy Fitzhardinge
Date: Tuesday, April 6, 2010 - 11:25 am

I think it would be useful to have common:

   1. User and kernel mode ABIs for controlling ballooning.  It assumes
      that the different balloon implementations are sufficiently
      similar in semantics.   (Once there's a kernel ABI, adding a
      common user ABI is trivial.)
   2. Policy driving the ballooning driver, at least from the guest
      side.  That is, some good metrics from the vm subsystem about
      memory pressure (both positive and negative), and something to
      turn those metrics into requests to the balloon driver.

1) is not a huge amount of code, but something consistent would be 
nice.  2) is something we've been missing and is a bit of an open 
question/research project anyway.
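
One possible shape for such a common kernel-side interface is sketched below
in plain C. Every name here (balloon_ops, balloon_dev, balloon_adjust, the
toy backend) is hypothetical and invented for illustration; none of the
existing Xen, virtio or VMware drivers expose this. The sketch assumes the
backends agree on "set a target balloon size in pages" as the one shared
operation, and the pressure value stands in for the vm-subsystem metric that
item 2 says is still missing.

```c
#include <assert.h>
#include <stddef.h>

/* Operations a hypervisor-specific backend would provide (hypothetical). */
struct balloon_ops {
	const char *name;
	size_t (*current_pages)(void *priv);
	int (*set_target)(void *priv, size_t target_pages);
};

struct balloon_dev {
	const struct balloon_ops *ops;
	void *priv;
};

/*
 * Turn a pressure reading into a request to the backend:
 * pressure > 0 means the guest is short of memory (shrink the balloon),
 * pressure < 0 means it has memory to spare (grow the balloon),
 * pressure == 0 leaves the target unchanged.
 */
static int balloon_adjust(struct balloon_dev *dev, int pressure,
			  size_t step, size_t max_pages)
{
	size_t cur, target;

	assert(dev && dev->ops && dev->ops->current_pages &&
	       dev->ops->set_target);

	cur = dev->ops->current_pages(dev->priv);
	if (pressure > 0)
		target = cur > step ? cur - step : 0;
	else if (pressure < 0)
		target = (cur < max_pages && max_pages - cur > step) ?
			 cur + step : max_pages;
	else
		target = cur;

	return dev->ops->set_target(dev->priv, target);
}

/* Toy backend used only to exercise the interface. */
struct toy_backend {
	size_t pages;
};

static size_t toy_current(void *priv)
{
	return ((struct toy_backend *)priv)->pages;
}

static int toy_set_target(void *priv, size_t target_pages)
{
	((struct toy_backend *)priv)->pages = target_pages;
	return 0;
}

static const struct balloon_ops toy_ops = {
	"toy", toy_current, toy_set_target
};
```

The point of the split is that the policy side (balloon_adjust) never needs
to know which hypervisor is underneath; only the ops table differs.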

     J
--

From: Avi Kivity
Date: Tuesday, April 6, 2010 - 11:36 am

3) Code that attempts to reclaim 2MB pages when possible

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Jeremy Fitzhardinge
Date: Tuesday, April 6, 2010 - 12:18 pm

Yes.  Ballooning in 4k units is a bit silly.

     J

--

From: Pavel Machek
Date: Wednesday, April 7, 2010 - 10:30 pm

Does it make sense to treat ballooning as a form of memory hotplug? 

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Avi Kivity
Date: Thursday, April 8, 2010 - 12:18 am

It's a fine granularity form of memory hotplug, yes.

-- 
error compiling committee.c: too many arguments to function

--

From: Jeremy Fitzhardinge
Date: Thursday, April 8, 2010 - 10:01 am

It has some similarities.  The main difference is granularity;
ballooning works in pages (typically 4k, but 2M probably makes more
sense), whereas memory hotplug works in DIMM-like sizes (256MB+). 
That's way too coarse for us; a domain might only have 256MB or less to
start with.
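
The granularity gap is easy to put in numbers. The short C sketch below
(helper names are made up for illustration) counts how many units of each
size fit in a 256MB domain, using the sizes mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define KB(x) ((uint64_t)(x) << 10)
#define MB(x) ((uint64_t)(x) << 20)

/* Number of whole units of the given size that fit in 'bytes'. */
static uint64_t units_in(uint64_t bytes, uint64_t unit)
{
	assert(unit != 0);
	return bytes / unit;
}
```

For a 256MB domain, units_in(MB(256), KB(4)) is 65536 and
units_in(MB(256), MB(2)) is 128, while a 256MB hotplug section covers the
whole domain in a single unit, leaving no room to resize it partially.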

I experimented with a sort of hybrid scheme, in which I used hotplug
memory to add new struct pages to the system, but only incrementally
populated the underlying pages with the balloon driver.  That worked
pretty well, but it doesn't fit very well with how memory hotplug works
(at least when I last looked at it a couple of years ago).

    J
--

From: Dmitry Torokhov
Date: Thursday, April 15, 2010 - 2:00 pm

This is a standalone version of the VMware Balloon driver. Ballooning is a
technique that allows the hypervisor to dynamically limit the amount of
memory available to the guest (with the guest's cooperation). In an
overcommit scenario, when the hypervisor detects that it needs to shuffle
some memory, it instructs the driver to allocate a certain number of pages,
and the underlying memory gets returned to the hypervisor. Later the
hypervisor may return memory to the guest by reattaching memory to the
page frames and instructing the driver to "deflate" the balloon.

Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
---

Unlike the previous version, which tried to integrate the VMware ballooning
transport into the virtio subsystem and use the stock virtio_balloon driver,
this one implements both the controlling thread/algorithm and the
hypervisor transport.

We are submitting a standalone driver because the KVM maintainer (Avi
Kivity) expressed the opinion (rightly) that our transport does not fit
well into the virtqueue paradigm and thus it does not make much sense to
integrate with virtio.

There were also some concerns about whether the current ballooning
technique is the right approach. If a better framework for this appears,
we are prepared to evaluate it and switch to it, but in the meantime
we'd like to get this driver upstream.

Changes since v1:
	- added comments throughout the code;
	- exported stats moved from /proc to debugfs;
	- better changelog.

 arch/x86/kernel/cpu/vmware.c  |    2 
 drivers/misc/Kconfig          |   16 +
 drivers/misc/Makefile         |    1 
 drivers/misc/vmware_balloon.c |  808 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 827 insertions(+), 0 deletions(-)
 create mode 100644 drivers/misc/vmware_balloon.c


diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 1cbed97..dfdb4db 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -22,6 +22,7 @@
  */
 
 #include <linux/dmi.h>
+#include <linux/module.h>
 #include <asm/div64.h>
 ...
From: Dmitry Torokhov
Date: Wednesday, April 21, 2010 - 12:59 pm

Andrew,

Do you see any issues with the driver? Will you be the one picking it
up and queueing for mainline?

Thanks,

--

From: Andrew Morton
Date: Wednesday, April 21, 2010 - 1:18 pm

On Wed, 21 Apr 2010 12:59:35 -0700


Spose so.
--

From: Dmitry Torokhov
Date: Wednesday, April 21, 2010 - 1:52 pm

Good. I don't suppose we have a chance of making it into .34? Being a
completely new driver and all...

Thanks,

Dmitry

--

From: Andrew Morton
Date: Wednesday, April 21, 2010 - 2:13 pm

On Wed, 21 Apr 2010 13:52:08 -0700

It's foggy.  Is there a good-sounding reason for pushing it in this
late?

--

From: Dmitry Torokhov
Date: Wednesday, April 21, 2010 - 5:09 pm

We want to get the driver accepted in distributions so that users do not
have to deal with an out-of-tree module, and many distributions have an
"upstream first" requirement.

The driver has been shipping for a number of years, and users running on
the VMware platform will have it installed as part of VMware Tools even if
it does not come from a distribution, so there should be no additional
risk in pulling the driver into mainline.  The driver will only activate
if the host is VMware, so everyone else should not be affected at all.

Thanks,

Dmitry
--

From: Andrew Morton
Date: Wednesday, April 21, 2010 - 4:54 pm

On Thu, 15 Apr 2010 14:00:31 -0700


This is OK for both x86_32 and x86_64?


afaict all the stats stuff is useless if CONFIG_DEBUG_FS=n.  Perhaps in
that case the vmballoon.stats field should be omitted and STATS_INC

--

From: Dmitry Torokhov
Date: Wednesday, April 21, 2010 - 5:00 pm

These control the inflating/deflating rate of the balloon, measured in



OK, will do.

Thanks Andrew.

-- 
Dmitry 
--

From: Dmitry Torokhov
Date: Wednesday, April 21, 2010 - 6:02 pm

OK, so here is the incremental patch addressing your comments. Or do you
want the entire thing resent?

Thanks.

-- 
Dmitry


vmware-balloon: miscellaneous fixes

 - document rate allocation constants
 - do not compile statistics code when debugfs is disabled
 - fix compilation error when debugfs is disabled

Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
---

 drivers/misc/vmware_balloon.c |   38 +++++++++++++++++++++++++++++++-------
 1 files changed, 31 insertions(+), 7 deletions(-)


diff --git a/drivers/misc/vmware_balloon.c b/drivers/misc/vmware_balloon.c
index 90bba04..e7161c4 100644
--- a/drivers/misc/vmware_balloon.c
+++ b/drivers/misc/vmware_balloon.c
@@ -50,12 +50,28 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");
 
+/*
+ * Various constants controlling rate of inflating/deflating balloon,
+ * measured in pages.
+ */
+
+/*
+ * Rate of allocating memory when there is no memory pressure
+ * (driver performs non-sleeping allocations).
+ */
 #define VMW_BALLOON_NOSLEEP_ALLOC_MAX	16384U
 
+/*
+ * Rates of memory allocation when guest experiences memory pressure
+ * (driver performs sleeping allocations).
+ */
 #define VMW_BALLOON_RATE_ALLOC_MIN	512U
 #define VMW_BALLOON_RATE_ALLOC_MAX	2048U
 #define VMW_BALLOON_RATE_ALLOC_INC	16U
 
+/*
+ * Rates for releasing pages while deflating balloon.
+ */
 #define VMW_BALLOON_RATE_FREE_MIN	512U
 #define VMW_BALLOON_RATE_FREE_MAX	16384U
 #define VMW_BALLOON_RATE_FREE_INC	16U
@@ -85,6 +101,10 @@ MODULE_LICENSE("GPL");
 /* Maximum number of page allocations without yielding processor */
 #define VMW_BALLOON_YIELD_THRESHOLD	1024
 
+
+/*
+ * Hypervisor communication port definitions.
+ */
 #define VMW_BALLOON_HV_PORT		0x5670
 #define VMW_BALLOON_HV_MAGIC		0x456c6d6f
 #define VMW_BALLOON_PROTOCOL_VERSION	2
@@ -125,8 +145,7 @@ MODULE_LICENSE("GPL");
 	__stat & -1UL;					\
 })
 
-#define STATS_INC(stat) (stat)++
-
+#ifdef CONFIG_DEBUG_FS
 struct ...