Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

From: Dan Magenheimer
Date: Thursday, April 22, 2010 - 6:42 am

Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Patch applies to 2.6.34-rc5

In previous patch postings, frontswap was part of the Transcendent
Memory ("tmem") patchset.  This patchset focuses not on the underlying
technology (tmem) but on the useful functionality provided for Linux,
and provides a clean API so that frontswap can provide this
functionality via a Xen tmem driver OR completely independently of tmem.
For example: Nitin Gupta (of compcache and ramzswap fame) is implementing
an in-kernel compression "backend" for frontswap; some believe
frontswap will be a very nice interface for building RAM-like functionality
for pseudo-RAM devices such as SSD or phase-change memory; and a Pune
University team is looking at a backend for virtio (see OLS'2010).

A more complete description of frontswap can be found in the introductory
comment in mm/frontswap.c (in PATCH 2/4) which is included below
for convenience.

Note that an earlier version of this patch is now shipping in OpenSuSE 11.2
and will soon ship in a release of Oracle Enterprise Linux.  Underlying
tmem technology is now shipping in Oracle VM 2.2 and was just released
in Xen 4.0 on April 15, 2010.  (Search news.google.com for Transcendent
Memory)

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>

 include/linux/frontswap.h |   98 ++++++++++++++
 include/linux/swap.h      |    2 
 include/linux/swapfile.h  |   13 +
 mm/Kconfig                |   16 ++
 mm/Makefile               |    1 
 mm/frontswap.c            |  301 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_io.c              |   12 +
 mm/swap.c                 |    4 
 mm/swapfile.c             |   58 +++++++-
 9 files changed, 496 insertions(+), 9 deletions(-)

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device.  The storage is assumed to be
a synchronous concurrency-safe page-oriented pseudo-RAM ...
--

From: Avi Kivity
Date: Thursday, April 22, 2010 - 8:28 am

How baked in is the synchronous requirement?  Memory, for example, can 
be asynchronous if it is copied by a dma engine, and since there are 
hardware encryption engines, there may be hardware compression engines 
in the future.


-- 
error compiling committee.c: too many arguments to function

--

From: Dan Magenheimer
Date: Thursday, April 22, 2010 - 1:15 pm

Indeed.  But an asynchronous API is not appropriate for frontswap
(or cleancache).  The reason the hooks are so simple is because they
are assumed to be synchronous so that the page can be immediately

Yes, but for something like an SSD where copying can be used to
build up a full 64K write, the cost of copying memory may not be
counterproductive.
--

From: Avi Kivity
Date: Friday, April 23, 2010 - 2:48 am

Swapping is inherently asynchronous, so we'll have to wait for that to 
complete anyway (as frontswap does not guarantee swap-in will succeed).  
I don't doubt it makes things simpler, but also less flexible and useful.

Something else that bothers me is the double swapping.  Sure we're 
making swapin faster, but we're still loading the io subsystem with 
writes.  Much better to make swap-to-ram authoritative (and have the 

I don't understand.  Please clarify.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dan Magenheimer
Date: Friday, April 23, 2010 - 7:43 am

Each page is either in frontswap OR on the normal swap device,
never both.  So, yes, both reads and writes are avoided if memory
is available and there is no write issued to the io subsystem if
memory is available.  The is_memory_available decision is determined
by the hypervisor dynamically for each page when the guest attempts
a "frontswap_put".  So, yes, you are indeed "swapping to the
hypervisor" but, at least in the case of Xen, the hypervisor

In many cases, this is true.  For the swap subsystem, it may not always
be true, though I see recent signs that it may be headed in that
direction.  In any case, unless you see this SSD discussion as
critical to the proposed acceptance of the frontswap patchset,
let's table it until there's some prototyping done.

Thanks,
Dan
--

From: Avi Kivity
Date: Friday, April 23, 2010 - 7:52 am

I see.  So why not implement this as an ordinary swap device, with a 
higher priority than the disk device?  This way we reuse an API and keep 
things asynchronous, instead of introducing a special purpose API.

Doesn't this commit the hypervisor to retain this memory?  If so, isn't 
it simpler to give the page to the guest (so now it doesn't need to swap 
at all)?


I think it will be true in an overwhelming number of cases.  Flash is 

It isn't particularly related.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Avi Kivity
Date: Friday, April 23, 2010 - 8:00 am

Looks like "init" == open, "put_page" == write, "get_page" == read, 
"flush_page|flush_area" == trim.  The only difference seems to be that 
an overwriting put_page may fail.  Doesn't seem to be much of a win, 
since a guest can simply avoid issuing the duplicate put_page, so the 
hypervisor is still committed to holding this memory for the guest.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dan Magenheimer
Date: Saturday, April 24, 2010 - 5:41 pm

Yes, indeed, this is true.  That is why it is important for any
policy implemented behind frontswap to "bill" the guest if it
is attempting to keep frontswap pages in the hypervisor forever
and to prod the guest to reclaim them when it no longer needs
super-fast emergency swap space.  The frontswap patch already includes
the kernel mechanism to enable this and the prodding can be implemented
by a guest daemon (of which there already exists an existence proof).

(While devil's advocacy is always welcome, frontswap is NOT a
cool academic science project where these issues have not been
considered or tested.)
--

From: Avi Kivity
Date: Sunday, April 25, 2010 - 5:06 am

In this case you could use the same mechanism to stop new put_page()s?

Seems frontswap is like a reverse balloon, where the balloon is in 


Good to know.

-- 
error compiling committee.c: too many arguments to function

--

From: Nitin Gupta
Date: Friday, April 23, 2010 - 6:49 pm

ramzswap is exactly this: an ordinary swap device which stores every page
in (compressed) memory and it's enabled as the highest-priority swap. Currently,
it stores these compressed chunks in guest memory itself but it is not very
difficult to send these chunks out to host/hypervisor using virtio.
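
For illustration, giving one swap device priority over another is done from userspace through the flags argument of swapon(2); higher-priority areas are used first. The sketch below is not from ramzswap or the patch. The helper names are made up for this example, and only swapon() and the SWAP_FLAG_* constants come from <sys/swap.h>.

```c
#include <assert.h>
#include <sys/swap.h>

/* Build the swapon(2) flags word for an explicit priority.
 * Hypothetical helper; prio must fit in the priority field. */
int swap_flags_for_priority(int prio)
{
    assert(prio >= 0 && prio <= (SWAP_FLAG_PRIO_MASK >> SWAP_FLAG_PRIO_SHIFT));
    return SWAP_FLAG_PREFER |
           ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);
}

/* Enable swapping on `path` with the given priority.
 * Returns 0 on success, -1 on error (errno set), as swapon(2) does.
 * Requires CAP_SYS_ADMIN. */
int enable_swap_with_priority(const char *path, int prio)
{
    return swapon(path, swap_flags_for_priority(prio));
}
```

A RAM-backed device would be enabled with a larger priority value than the disk partition, for example enable_swap_with_priority("/dev/ramzswap0", 10), where the device path is an assumption for the example.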
 
However, it suffers from unnecessary block I/O layer overhead and requires
weird hooks in swap code, say to get notification when a swap slot is freed.
OTOH frontswap approach gets rid of any such artifacts and overheads.
(ramzswap: http://code.google.com/p/compcache/)

Thanks,
Nitin
--

From: Avi Kivity
Date: Saturday, April 24, 2010 - 11:27 am

Maybe we should optimize these overheads instead.  Swap used to always 
be to slow devices, but swap-to-flash has the potential to make swap act 
like an extension of RAM.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Nitin Gupta
Date: Saturday, April 24, 2010 - 8:11 pm

No: trim or discard is not useful. The problem is that we require a callback
_as soon as_ a page (swap slot) is freed. Otherwise, stale data quickly accumulates
in memory defeating the whole purpose of in-memory compressed swap devices (like ramzswap).
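
The difference between a per-free callback and a later discard can be shown with a toy model. This is a userspace sketch under simplifying assumptions, not ramzswap code: every name here is hypothetical, and a "discard scan" is reduced to a loop over the slot table.

```c
#include <assert.h>

#define RS_SLOTS 64

struct ram_swap {
    int in_use[RS_SLOTS];  /* slot is still referenced by the swap map */
    int stored[RS_SLOTS];  /* backend still holds data for the slot */
    int notify_on_free;    /* 1: callback on every free, 0: wait for a scan */
};

/* Swap a page out into the given slot. */
void rs_store(struct ram_swap *rs, int slot)
{
    assert(slot >= 0 && slot < RS_SLOTS);
    rs->in_use[slot] = 1;
    rs->stored[slot] = 1;
}

/* The swap slot is freed; only the callback variant drops the data now. */
void rs_free_slot(struct ram_swap *rs, int slot)
{
    assert(slot >= 0 && slot < RS_SLOTS);
    rs->in_use[slot] = 0;
    if (rs->notify_on_free)
        rs->stored[slot] = 0;
}

/* Number of pages the backend holds for slots that are no longer in use. */
int rs_stale_pages(const struct ram_swap *rs)
{
    int n = 0;
    for (int i = 0; i < RS_SLOTS; i++)
        if (rs->stored[i] && !rs->in_use[i])
            n++;
    return n;
}

/* Deferred cleanup: walk the table and drop stale data.
 * Returns how many pages were released. */
int rs_discard_scan(struct ram_swap *rs)
{
    int freed = 0;
    for (int i = 0; i < RS_SLOTS; i++) {
        if (rs->stored[i] && !rs->in_use[i]) {
            rs->stored[i] = 0;
            freed++;
        }
    }
    return freed;
}
```

With notify_on_free set, rs_stale_pages() stays at zero. Without it, every freed slot keeps occupying memory until the next rs_discard_scan(), which is the accumulation described above.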

Increasing the frequency of discards is also not an option:
 - Creating discard bio requests itself needs memory, and these swap devices
come into the picture only under low-memory conditions.
 - We need to regularly scan swap_map to issue these discards. Increasing discard
frequency also means more frequent scanning (which will still not be fast enough

Spending a lot of effort optimizing an overhead which can be completely avoided
is probably not worth it.

Also, I think the choice of a synchronous style API for frontswap and cleancache
is justified as they want to send pages to host *RAM*. If you want to use other
devices like SSDs, then these should be just added as another swap device as
we do currently -- these should not be used as frontswap storage directly.

Thanks,
Nitin
--

From: Avi Kivity
Date: Sunday, April 25, 2010 - 5:16 am

Doesn't flash have similar requirements?  The earlier you discard, the 



I'm not sure.  Swap-to-flash will soon be everywhere.   If it's slow, 

Even for copying to RAM an async API is wanted, so you can dma it 
instead of copying.

-- 
error compiling committee.c: too many arguments to function

--

From: Nitin Gupta
Date: Sunday, April 25, 2010 - 9:05 am

No. We do not want to issue discard for every page as soon as it is freed.
I'm not a flash expert but I guess issuing erase is just too expensive to be
issued so frequently. OTOH, ramzswap needs a callback for every page and as

Ok, but still all this bio allocation and block layer overhead seems
unnecessary and is easily avoidable. I think frontswap code needs

frontswap simply calls frontswap_flush_page() in swap_entry_free() i.e. as

Optimizing swap-to-flash is surely desirable but this problem is separate
from ramzswap or frontswap optimization. For the latter, I think dealing

Maybe incremental development is better? Stabilize and refine existing
code and gradually move to async API, if required in future?

Thanks,
Nitin

--

From: Avi Kivity
Date: Sunday, April 25, 2010 - 11:06 pm

Ok.  I agree it is silly to go through the block layer and end up 


Incremental development is fine, especially for ramzswap where the APIs 
are all internal.  I'm more worried about external interfaces, these 
stick around a lot longer and if not done right they're a pain forever.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Nitin Gupta
Date: Monday, April 26, 2010 - 6:47 am

No, we cannot issue a discard bio at this place since the swap_lock
spinlock is held.


Thanks,
Nitin
--

From: Dan Magenheimer
Date: Tuesday, April 27, 2010 - 1:29 am

OK, so on the one hand, you think that the proposed synchronous
interface for frontswap is insufficiently extensible for other
uses (presumably including KVM).  On the other hand, you agree
that using the existing I/O subsystem is unnecessarily heavyweight.
On the third hand, Nitin has answered your questions and spent
a good part of three years finding that extending the existing swap
interface to efficiently support swap-to-pseudo-RAM requires
some kind of in-kernel notification mechanism to which Linus
has already objected.

So you are instead proposing some new guest-to-host asynchronous
notification mechanism that doesn't use the existing bio
mechanism (and so presumably not irqs), imitates or can
utilize a dma engine, and uses fewer cpu cycles than copying
pages.  AND, for long-term maintainability, you'd like to avoid
creating a new guest-host API that does all this, even one that
is as simple and lightweight as the proposed frontswap hooks.

Does that summarize your objection well?
--

From: Avi Kivity
Date: Tuesday, April 27, 2010 - 2:21 am

No.  Adding a new async API that parallels the block layer would be 
madness.  My first preference would be to completely avoid new APIs.  I 
think that would work for swap-to-hypervisor but probably not for 
compcache.  Second preference is the synchronous API, third is a new 
async API.

-- 
error compiling committee.c: too many arguments to function

--

From: Dan Magenheimer
Date: Monday, April 26, 2010 - 5:50 am

Well if you are saying that your primary objection to the
frontswap synchronous API is that it is exposed to modules via
some EXPORT_SYMBOLs, we can certainly fix that, at least
unless/until there are other pseudo-RAM devices that can use it.

Would that resolve your concerns?
--

From: Avi Kivity
Date: Monday, April 26, 2010 - 6:43 am

By external interfaces I mean the guest/hypervisor interface.  
EXPORT_SYMBOL is an internal interface as far as I'm concerned.

Now, the frontswap interface is also an internal interface, but it's 
close to the external one.  I'd feel much better if it was asynchronous.

-- 
error compiling committee.c: too many arguments to function

--

From: Dan Magenheimer
Date: Saturday, April 24, 2010 - 5:30 pm

Umm... I think the difference between a "new" API and extending
an existing one here is a choice of semantics.  As designed, frontswap
is an extremely simple, only-very-slightly-intrusive set of hooks that
allows swap pages to, under some conditions, go to pseudo-RAM instead
of an asynchronous disk-like device.  It works today with at least
one "backend" (Xen tmem), is shipping today in real distros, and is
extremely easy to enable/disable via CONFIG or module... meaning
no impact on anyone other than those who choose to benefit from it.

"Extending" the existing swap API, which has largely been untouched for
many years, seems like a significantly more complex and error-prone
undertaking that will affect nearly all Linux users with a likely long
bug tail.  And, by the way, there is no existence proof that it
will be useful.


Well, we shall see.  It may also be the case that the existing
asynchronous swap API will work fine for some non-traditional RAM;
and it may also be the case that frontswap works fine for some
non-traditional RAM.  I agree there is fertile ground for exploration
here.  But let's not allow our speculation on what may or may
not work in the future halt forward progress of something that works

Yes, at a much larger more invasive cost to the kernel.  Frontswap


What I was referring to is that the existing swap code DOES NOT
always have the ability to collect N scattered pages before
initiating an I/O write suitable for a device (such as an SSD)
that is optimized for writing N pages at a time.  That is what
I meant by a timing constraint.  See references to page_cluster
in the swap code (and this is for contiguous pages, not scattered).
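
As an illustration of "collect N pages, then write", here is a minimal userspace sketch of a staging buffer. It is not kernel swap code. The names are hypothetical, and N = 16 is an assumption chosen so that sixteen 4 KiB pages make up the 64K write mentioned earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define BATCH_PAGES   16   /* assumed N: 16 * 4 KiB = 64 KiB */

struct write_batch {
    unsigned char buf[BATCH_PAGES * TOY_PAGE_SIZE]; /* staging area */
    size_t npages;         /* pages currently staged */
    size_t writes_issued;  /* full-size writes "submitted" so far */
};

/* Copy one (possibly scattered) page into the staging buffer.
 * Returns 1 if this page completed a batch and triggered a write,
 * 0 if the page is only staged. */
int batch_add_page(struct write_batch *b, const void *page)
{
    assert(b->npages < BATCH_PAGES);
    memcpy(b->buf + b->npages * TOY_PAGE_SIZE, page, TOY_PAGE_SIZE);
    if (++b->npages < BATCH_PAGES)
        return 0;
    /* A real implementation would submit b->buf to the device here. */
    b->writes_issued++;
    b->npages = 0;
    return 1;
}
```

The memcpy() per page is the copying cost discussed in the thread: it buys one device-sized write per N pages, at the price of holding pages in the staging buffer until the batch fills.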

Dan
--

From: Avi Kivity
Date: Sunday, April 25, 2010 - 5:11 am

My issue is with the API's synchronous nature.  Both RAM and more exotic 
memories can be used with DMA instead of copying.  A synchronous 


No need to change the kernel at all; the hypervisor controls the page 

I see.  Given that swap-to-flash will soon be way more common than 
frontswap, it needs to be solved (either in flash or in the swap code).

-- 
error compiling committee.c: too many arguments to function

--

From: Dan Magenheimer
Date: Sunday, April 25, 2010 - 8:29 am

When pages are 2MB, this may be true.  When pages are 4KB and 
copied individually, it may take longer to program a DMA engine 
than to just copy 4KB.

But in any case, frontswap works fine on all existing machines
today.  If/when most commodity CPUs have an asynchronous RAM DMA
engine, an asynchronous API may be appropriate.  Or the existing
swap API might be appropriate. Or the synchronous frontswap API
may work fine too.  Speculating further about non-existent
hardware that might exist in the (possibly far) future is irrelevant
to the proposed patch, which works today on all existing x86 hardware
and on shipping software.

--

From: Avi Kivity
Date: Sunday, April 25, 2010 - 11:01 pm

Of course, you have to use a batching API, like virtio or Xen's rings, 

DMA engines are present on commodity hardware now:

http://en.wikipedia.org/wiki/I/O_Acceleration_Technology

I don't know if consumer machines have them, but servers certainly do.  
modprobe ioatdma.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Pavel Machek
Date: Tuesday, April 27, 2010 - 5:56 am

If we added all the apis that worked when proposed, we'd have an
unmaintainable mess by about 1996.

Why can't frontswap just use existing swap api?
							Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Jeremy Fitzhardinge
Date: Monday, April 26, 2010 - 5:49 pm

Not in Xen PV guests (the hypervisor vets guest updates, but it can't
safely make its own changes to the pagetables).  (It's kind of annoying.)

    J
--

From: Dan Magenheimer
Date: Tuesday, April 27, 2010 - 7:32 am

Hi Pavel!

The existing swap API as it stands is inadequate for an efficient
synchronous interface (e.g. for swapping to RAM).  Both Nitin
and I independently have found this to be true.  But swap-to-RAM
is very useful in some cases (swap-to-kernel-compressed-RAM
and swap-to-hypervisor-RAM and maybe others) that were not even
conceived many years ago at the time the existing swap API was
designed for swap-to-disk.  Swap-to-RAM can relieve memory
pressure faster and more resource-efficiently than swap-to-device
but must assume that RAM available for swap-to-RAM is dynamic
(not fixed in size).  (And swap-to-SSD, when the SSD is an
I/O device on an I/O bus, is NOT the same as swap-to-RAM.)

In my opinion, frontswap is NOT a new API, but the simplest
possible extension of the existing swap API to allow for
efficient swap-to-RAM.  Avi's comments about a new API
(as he explained later in the thread) refer to a new API
between kernel and hypervisor, what is essentially the
Transcendent Memory interface.  Frontswap was separated from
the tmem dependency to enable Nitin's swap-to-kernel-compressed-RAM
and the possibility that there may be other interesting
swap-to-RAM uses.

Does this help?

Dan
--

From: Pavel Machek
Date: Thursday, April 29, 2010 - 6:02 am

So... how much slower is swapping to RAM over current interface when
compared to proposed interface, and how much is that slower than just
using the memory directly?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Pavel Machek
Date: Tuesday, April 27, 2010 - 5:55 am

Stop right here. Instead of improving existing swap api, you just
create one because it is less work.

We do not want apis to accumulate; please just fix the existing one.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Nitin Gupta
Date: Tuesday, April 27, 2010 - 7:43 am

I'm a bit confused: What do you mean by 'existing swap API'?
Frontswap simply hooks in swap_readpage() and swap_writepage() to
call frontswap_{get,put}_page() respectively. Now to avoid a hardcoded
implementation of these functions, it introduces struct frontswap_ops
so that custom implementations of the frontswap get/put/etc. functions
can be provided. This allows easy implementation of swap-to-hypervisor,
in-memory-compressed-swapping etc. with a common set of hooks.
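
The arrangement described here, a table of function pointers that a backend fills in, can be sketched in userspace C. This is an illustration, not the code from PATCH 2/4. The operation names follow those mentioned in this thread (put_page, get_page, flush_page), but the struct name, the signatures, the registration function and the toy backend are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_SLOTS     8

/* Hypothetical ops table; the real struct frontswap_ops may differ. */
struct toy_frontswap_ops {
    int  (*put_page)(unsigned long offset, const void *page);
    int  (*get_page)(unsigned long offset, void *page);
    void (*flush_page)(unsigned long offset);
};

/* Currently registered backend, or NULL when none is loaded. */
static const struct toy_frontswap_ops *toy_backend;

void toy_register_backend(const struct toy_frontswap_ops *ops)
{
    toy_backend = ops;
}

/* Toy backend: keeps pages in a small static array. */
static unsigned char toy_store[TOY_SLOTS][TOY_PAGE_SIZE];
static int toy_present[TOY_SLOTS];

static int toy_put_page(unsigned long offset, const void *page)
{
    if (offset >= TOY_SLOTS)
        return -1;                 /* a backend is free to refuse a put */
    memcpy(toy_store[offset], page, TOY_PAGE_SIZE);
    toy_present[offset] = 1;
    return 0;
}

static int toy_get_page(unsigned long offset, void *page)
{
    if (offset >= TOY_SLOTS || !toy_present[offset])
        return -1;
    memcpy(page, toy_store[offset], TOY_PAGE_SIZE);
    return 0;
}

static void toy_flush_page(unsigned long offset)
{
    if (offset < TOY_SLOTS)
        toy_present[offset] = 0;
}

const struct toy_frontswap_ops toy_ops = {
    .put_page   = toy_put_page,
    .get_page   = toy_get_page,
    .flush_page = toy_flush_page,
};

/* Hook points, standing in for the calls made from swap_writepage()
 * and swap_readpage(). A nonzero return means "use the swap device". */
int hook_put_page(unsigned long offset, const void *page)
{
    return toy_backend ? toy_backend->put_page(offset, page) : -1;
}

int hook_get_page(unsigned long offset, void *page)
{
    return toy_backend ? toy_backend->get_page(offset, page) : -1;
}

void hook_flush_page(unsigned long offset)
{
    if (toy_backend)
        toy_backend->flush_page(offset);
}
```

With no backend registered, every hook falls through to the ordinary swap path. Swapping in a different backend (hypervisor, compressed RAM) only means registering a different ops table.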

So, how can the frontswap approach be seen as introducing a new API?

Thanks,
Nitin






--

From: Pavel Machek
Date: Thursday, April 29, 2010 - 6:04 am

Yes, and that set of hooks is new API, right?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Dan Magenheimer
Date: Friday, April 23, 2010 - 9:26 am

No, ANY put_page can fail, and this is a critical part of the API
that provides all of the flexibility for the hypervisor and all
the guests. (See previous reply.)

The "duplicate put" semantics are carefully specified as there
are some coherency corner cases that are very difficult to handle
in the "backend" but very easy to handle in the kernel.  So the
specification explicitly punts these to the kernel.
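
To illustrate "ANY put_page can fail", the following userspace sketch models a swap-out path that tries the backend first and falls back to the swap device when the put is refused, so each page ends up in exactly one place. It is not the patch's code. All names are hypothetical, the backend's decision is reduced to a free-page counter, and duplicate puts are deliberately not modeled because their semantics are not spelled out in this thread.

```c
#include <assert.h>

#define TOY_NSLOTS 16

enum page_location { LOC_NONE, LOC_FRONTSWAP, LOC_SWAP_DEVICE };

static enum page_location slot_location[TOY_NSLOTS];
static int backend_free_pages;  /* stand-in for the backend's dynamic decision */

/* Simulate the backend gaining or losing capacity at any time. */
void backend_set_free_pages(int n)
{
    backend_free_pages = n;
}

/* Hypothetical put: accepted only while the backend has room. */
static int toy_frontswap_put(unsigned long offset)
{
    (void)offset;
    if (backend_free_pages <= 0)
        return -1;
    backend_free_pages--;
    return 0;
}

/* Stand-in for the write path: try the backend, else use the device.
 * Each offset may be swapped out once in this toy (no duplicate puts). */
enum page_location toy_swap_out(unsigned long offset)
{
    assert(offset < TOY_NSLOTS);
    assert(slot_location[offset] == LOC_NONE);
    if (toy_frontswap_put(offset) == 0)
        slot_location[offset] = LOC_FRONTSWAP;
    else
        slot_location[offset] = LOC_SWAP_DEVICE; /* real code would issue I/O */
    return slot_location[offset];
}
```

Because the caller handles a refused put by taking the normal swap path, the backend never has to promise capacity in advance; it decides page by page.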
--

From: Avi Kivity
Date: Saturday, April 24, 2010 - 11:25 am

The guest isn't required to do any put_page()s.  It can issue lots of 
them when memory is available, and keep them in the hypervisor forever.  
Failing new put_page()s isn't enough for a dynamic system, you need to 
be able to force the guest to give up some of its tmem.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dan Magenheimer
Date: Sunday, April 25, 2010 - 6:12 am

You are suggesting the hypervisor communicate dynamically-rapidly-changing
physical memory availability information to a userland daemon in each guest,
and each daemon communicate this information to each respective kernel
to notify the kernel that hypervisor memory is not available?

Seems very convoluted to me, and anyway it doesn't eliminate the need

That's a reasonable analogy.  Frontswap serves nicely as an
emergency safety valve when a guest has given up (too) much of
its memory via ballooning but unexpectedly has an urgent need
that can't be serviced quickly enough by the balloon driver.
--

From: Avi Kivity
Date: Sunday, April 25, 2010 - 6:18 am

Yeah, it's pretty ugly.  Balloons typically communicate without a daemon 

(or ordinary swap)

-- 
error compiling committee.c: too many arguments to function

--

From: Dave Hansen
Date: Thursday, April 29, 2010 - 6:45 pm

Frontswap and things like CMM2[1] have some fundamental advantages over
swapping and ballooning.  First of all, there are serious limits on
ballooning.  It's difficult for a guest to span a very wide range of
memory sizes without also including memory hotplug in the mix.  The ~1%
'struct page' penalty alone causes issues here.

A large portion of CMM2's gain came from the fact that you could take
memory away from guests without _them_ doing any work.  If the system is
experiencing a load spike, you increase load even more by making the
guests swap.  If you can just take some of their memory away, you can
smooth that spike out.  CMM2 and frontswap do that.  The guests
explicitly give up page contents that the hypervisor does not have to
first consult with the guest before discarding.

[1] http://www.kernel.org/doc/ols/2006/ols2006v2-pages-321-336.pdf 

-- Dave

--

From: Dave Hansen
Date: Friday, April 30, 2010 - 10:10 am

If you have a single swap device, sure.  But, I can also see a case
where you have a "fast" swap and "slow" swap.

The part of the argument about frontswap that I like is the lack of
sizing exposed to the guest.  When you're dealing with swap-only, you
are stuck adding or removing swap devices if you want to "grow/shrink"
the memory footprint.  If the host (or whatever is backing the
frontswap) wants to change the sizes, they're fairly free to.

The part that bothers me is that it just pushes the problem
elsewhere.  For KVM, we still have to figure out _somewhere_ what to do
with all those pages.  It's nice that the host would have the freedom to
either swap or keep them around, but it doesn't really fix the problem.

I do see the lack of sizing exposed to the guest as being a bad thing,
too.  Let's say we saved 25% of system RAM to back a frontswap-type
device on a KVM host.  The first time a user boots up their set of VMs
and 25% of their RAM is gone, they're going to start complaining,
despite the fact that their 25% smaller systems may end up being faster.

I think I'd be more convinced if we saw this thing actually get used
somehow.  How is a ram-backed frontswap better than a /dev/ramX-backed
swap file in practice?

-- Dave

--

From: Avi Kivity
Date: Friday, April 30, 2010 - 11:08 am

True.  My remarks only apply to frontswap-to-hypervisor, for internally 

So it seems a bare-metal hypervisor has less access to the bare metal 
than a non-bare-metal hypervisor?

Seriously, leave the bare-metal FUD to Simon.  People on this list know 
that kvm and Xen have exactly the same access to the hardware (well 

There's still an exit.  It's much faster than a vmx/svm vmexit but still 
nontrivial.


It's determined by the hypervisor, same as with tmem.  The guest swaps 
to a virtual disk, the hypervisor places the data in RAM if it's 

You can have multiple swap devices.

wrt SR/IOV, you'll see synchronous frontswap reduce throughput.  SR/IOV 
will swap with <1 exit/page and DMA guest pages, while frontswap/tmem 
will carry a 1 exit/page hit (even if no swap actually happens) and the 
copy cost (if it does).


In-kernel compressed swap does seem to be a good match for a synchronous 
API.  For future memory devices, or even bare-metal buzzword-compliant 
hypervisors, I disagree.  An asynchronous API is required for 
efficiency, and they'll all have swap capability sooner or later (kvm, 
vmware, and I believe xen 4 already do).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dan Magenheimer
Date: Friday, April 30, 2010 - 9:43 am

(I'll back down on the CMM2 comparisons until I can go

I think you are making a number of possibly false assumptions here:
1) The host [the frontswap backend may not even be a hypervisor]
2) can back it with disk storage [not if it is a bare-metal hypervisor]
3) avoid a pointless vmexit [no vmexit for a non-VMX (e.g. PV) guest]
4) when you're out of memory [how can this be determined outside of
   the hypervisor?]

And, importantly, "have your host expose a device which is write
cached by host memory"... you are implying that all guest swapping
should be done to a device managed/controlled by the host?  That
eliminates guest swapping to directIO/SRIOV devices doesn't it?

Anyway, I think we can see now why frontswap might not be a good
match for a hosted hypervisor (KVM), but that doesn't make it
any less useful for a bare-metal hypervisor (or TBD for in-kernel
compressed swap and TBD for possible future pseudo-RAM technologies).

Dan
--

From: Avi Kivity
Date: Friday, April 30, 2010 - 12:13 am

Frontswap does not do this.  Once a page has been frontswapped, the host 
is committed to retaining it until the guest releases it.  It's really 
not very different from a synchronous swap device.

I think cleancache allows the hypervisor to drop pages without the 
guest's immediate knowledge, but I'm not sure.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Dave Hansen
Date: Friday, April 30, 2010 - 9:04 am

Gah.  You're right.  I'm reading the two threads and confusing the
concepts.  I'm a bit less mystified why the discussion is revolving
around the swap device so much. :)

-- Dave

--

From: Pavel Machek
Date: Tuesday, April 27, 2010 - 10:55 pm

wtf? So let's fix the ballooning driver instead?

There's no reason it could not be as fast as frontswap, right?
Actually I'd expect it to be faster -- it can deal with big chunks.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Jeremy Fitzhardinge
Date: Friday, April 30, 2010 - 10:52 am

I'd argue the opposite.  There's no point in having the host do swapping
on behalf of guests if guests can do it themselves; it's just a
duplication of functionality.  You end up having two IO paths for each
guest, and the resulting problems in trying to account for the IO,
rate-limit it, etc.  If you can simply say "all guest disk IO happens
via this single interface", it's much easier to manage.

If frontswap has value, it's because it's providing a new facility to
guests that doesn't already exist and can't be easily emulated with
existing interfaces.

It seems to me the great strengths of the synchronous interface are:

    * it matches the needs of an existing implementation (tmem in Xen)
    * it is simple to understand within the context of the kernel code
      it's used in

Simplicity is important, because it allows the mm code to be understood
and maintained without having to have a deep understanding of
virtualization.  One of the problems with CMM2 was that it puts a lot of
intricate constraints on the mm code which can be easily broken, which
would only become apparent in subtle edge cases in a CMM2-using
environment.  An additional async frontswap-like interface - while not as
complex as CMM2 - still makes things harder for mm maintainers.

The downside is that it may not match some implementation in which the
get/put operations could take a long time (ie, physical IO to a slow
mechanical device).  But a general Linux principle is not to overdesign
interfaces for hypothetical users, only for real needs.

Do you think that you would be able to use frontswap in kvm if it were

Yes, that's comfortably within the "guests page themselves" model. 
Setting up a block device for the domain which is backed by pagecache
(something we usually try hard to avoid) is pretty straightforward.  But
it doesn't work well for Xen unless the blkback domain is sized so that
it has all of Xen's free memory in its pagecache.

That said, it does concern me that the host/hypervisor is ...
--

From: Avi Kivity
Date: Friday, April 30, 2010 - 11:24 am

The problem with relying on the guest to swap is that it's voluntary.  
The guest may not be able to do it.  When the hypervisor needs memory 
and guests don't cooperate, it has to swap.

But I'm not suggesting that the host swap on behalf of the guest.  
Rather, the guest swaps to (what it sees as) a device with a large 

With tmem you have to account for that memory, make sure it's 
distributed fairly, claim it back when you need it (requiring guest 
cooperation), live migrate and save/restore it.  It's a much larger 
change than introducing a write-back device for swapping (which has the 

If we use the existing paths, things are even simpler, and we match more 
needs (hypervisors with dma engines, the ability to reclaim memory 


For kvm (or Xen, with some modifications) all of the benefits of 
frontswap/tmem can be achieved with the ordinary swap.  It would need 
trim/discard support to avoid writing back freed data, but that's good 
for flash as well.

The advantages are:
- just works
- old guests
- <1 exit/page (since it's batched)
- no extra overhead if no free memory


Eventually you'll have to swap frontswap pages, or kill uncooperative 
guests.  At which point all of the simplicity is gone.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Jeremy Fitzhardinge
Date: Friday, April 30, 2010 - 11:59 am

Or fail whatever operation it's trying to do.  You can only use
overcommit to fake unlimited resources for so long before you need a

Well, with caveats.  To be useful with migration the backing store needs
to be shared like other storage, so you can't use a specific host-local
fast (ssd) swap device.  And because the device is backed by pagecache
with delayed writes, it has much weaker integrity guarantees than a
normal device, so you need to be sure that the guests are only going to
use it for swap.  Sure, these are deployment issues rather than code

Well, you still can't reclaim memory; you can write it out to storage. 
It may be cheaper/byte, but it's still a resource dedicated to the
guest.  But that's just a consequence of allowing overcommit, and to
what extent you're happy to allow it.

What kind of DMA engine do you have in mind?  Are there practical

It could be achieved with ballooning, but it isn't completely trivial. 
It wouldn't work terribly well with a driver domain setup, unless all
the swap-devices turned out to be backed by the same domain (which in
turn would need to know how to balloon in response to overall system
demand).  The partitioning of the pagecache among the guests would be at
the mercy of the mm subsystem rather than subject to any specific QoS or
other per-domain policies you might want to put in place (maybe fiddling

Killing guests is pretty simple.  Presumably the oom killer will get kvm
processes like anything else?

    J

--

From: Avi Kivity
Date: Saturday, May 1, 2010 - 1:28 am

Keep your commitment below RAM+swap and you'll be fine.  We want to 


You advertise it as a disk with write cache, so the guest is obliged to 
flush the cache if it wants a guarantee.  When it does, you flush your 
cache as well.  For swap, the guest will not issue any flushes.  This is 
already supported by qemu with cache=writeback.
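
The flush contract described above can be modelled in a few lines. This is only a toy sketch: the class and method names are invented for illustration and do not correspond to qemu or any block driver. Writes complete in host RAM, and only an explicit guest flush makes them durable.

```python
class WriteBackDisk:
    """Toy virtual disk with a write-back cache held in host RAM."""

    def __init__(self):
        self.cache = {}    # sector -> data, dirty and not yet durable
        self.backing = {}  # sector -> data, durable storage

    def write(self, sector, data):
        # Completes without touching the backing store.
        self.cache[sector] = data

    def read(self, sector):
        # Serve from the cache first, then fall back to durable storage.
        if sector in self.cache:
            return self.cache[sector]
        return self.backing.get(sector)

    def flush(self):
        # Guest-issued cache flush: make every cached write durable.
        self.backing.update(self.cache)
        self.cache.clear()


disk = WriteBackDisk()
disk.write(7, b"swapped page")   # a swap write: no flush is ever issued
```

In this model a guest that only swaps never calls flush(), so its pages stay in host RAM until the host chooses to write them back.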

I agree care is needed here.  You don't want to use the device for 

In general you want to run on RAM.  To maximise your RAM, you do things 
like page sharing and ballooning.  Both can fail, increasing the demand 
for RAM.  At that time you either kill a guest or swap to disk.

Consider a frontswap/tmem on bare-metal hypervisor cluster.  Presumably 
you give most of your free memory to guests.  A node dies.  Now you need 
to start its guests on the surviving nodes, but you're at the mercy of 
your guests to give up their tmem.

With an ordinary swap approach, you first flush cache to disk, and if 
that's not sufficient you start paging out guest memory.  You take a 

I/OAT (driver ioatdma).

When you don't have a lot of memory free, you can also switch from 
write cache to O_DIRECT, so you use the storage controller's dma engine 



Yes.  Of course, you want your management code never to allow this to 
happen.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dan Magenheimer
Date: Sunday, May 2, 2010 - 9:06 am

The first problem is that you are simulating a fast resource
(RAM) with a resource that is orders of magnitude slower with
NO visibility to the user that suffers the consequences.  A good
analogy (and no analogy is perfect) is if Linux discovers a 16MHz
80286 on a serial card in addition to the 32 3GHz cores on a
Nehalem box and, whenever the 32 cores are all busy, randomly
schedules a process on the 80286, while recording all CPU usage
data as if the 80286 is a "real" processor.... "Hmmm... why
did my compile suddenly run 100 times slower?"

The second problem is "double swapping": A guest may choose
a page to swap to "guest swap", but invisibly to the guest,
the host first must fetch it from "host swap".  (This may
seem like it is easy to avoid... it is not and happens more
frequently than you might think.)

Third, host swapping makes live migration much more difficult.
Either the host swap disk must be accessible to all machines
or data sitting on a local disk MUST be migrated along with
RAM (which is not impossible but complicates live migration
substantially).  Last I checked, VMware does not allow
page-sharing and live migration to both be enabled for the
same host.

If you talk to VMware customers (especially web-hosting services)
that have attempted to use overcommit technologies that require
host-swapping, you will find that they quickly become allergic
to memory overcommit and turn it off.  The end users (users of
the VMs that inexplicably grind to a halt) complain loudly.
As a result, RAM has become a bottleneck in many many systems,
which ultimately reduces the utility of servers and the value

True.  But in the Xen+tmem implementation there are disincentives
for a guest to unnecessarily retain pages put into frontswap,
so the host doesn't need to care that it can't discard the pages
as the guest is "billed" for them anyway.

So far we've been avoiding hypervisor policy implementation
questions and focused on mechanism (because, after all, this
is a *Linux ...
--

From: Avi Kivity
Date: Sunday, May 2, 2010 - 9:48 am

It's bad, but it's better than ooming.

The same thing happens with vcpus: you run 10 guests on one core, if 
they all wake up, your cpu is suddenly 10x slower and has 30000x 
interrupt latency (30ms vs 1us, assuming 3ms timeslices).  Your disks 
become slower as well.
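
The 30000x figure follows from simple arithmetic. The sketch below assumes plain round-robin scheduling in which every runnable guest gets one full timeslice before a given guest runs again; the function name is made up for illustration.

```python
def worst_case_latency_us(guests: int, timeslice_ms: float) -> float:
    """Worst-case wait before a guest runs again under simple round-robin,
    where each of the `guests` vcpus consumes one full timeslice."""
    return guests * timeslice_ms * 1000.0  # convert ms to us


baseline_us = 1.0                            # uncontended latency from the text
latency_us = worst_case_latency_us(10, 3.0)  # 10 guests, 3ms timeslices
slowdown = latency_us / baseline_us
print(latency_us, slowdown)
```

With 10 guests and 3ms timeslices this gives 30000us (30ms), which is 30000 times the 1us baseline.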

It's worse with memory, so you try to swap as a last resort.  However, 

True.  In fact when the guest and host use the same LRU algorithm, it 

kvm does live migration with swapping, and has no special code to 

Don't know about vmware, but kvm supports page sharing, swapping, and 

Choosing the correct overcommit ratio is certainly not an easy task.  
However, just hoping that memory will be available when you need it is 
not a good solution.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Dan Magenheimer
Date: Monday, May 3, 2010 - 7:59 am

Virtualization is all about statistical multiplexing of fixed
resources.  If all guests demand a resource simultaneously,
that is peak alignment == "bad luck".

(But, honestly, I don't even remember the point either of us

Will it suck to the point of eventually causing the live migration
to fail?  Or will swap-storms effectively cause denial-of-service
for other guests?

Anyway, if live migration works fine with mostly-swapped-out guests

Frontswap only relies on the guest having an existing swap device,
defined in /etc/fstab like any normal Linux swap device.  If this
is "relying on guest swapping", yes frontswap relies on guest swapping.

Or if you are referring to your "host can't force guest to


Your argument might make sense from a KVM perspective but is
not true of frontswap with Xen+tmem.  With KVM, the host's
swap disk(s) can all be used as "slow RAM".  With Xen, there is
no host swap disk.  So, yes, the degree of potential memory
overcommitment is smaller with Xen+tmem than with KVM.  In
order to avoid all the host problems with host-swapping,
frontswap+Xen+tmem intentionally limits the degree of memory
overcommitment... but this is just memory overcommitment done
intelligently.
--

From: Dan Magenheimer
Date: Sunday, May 2, 2010 - 10:06 am

Well, to be fair, I meant the disagreement of synchronous vs

Simple policies must exist and must be enforced by the hypervisor to ensure
this doesn't happen.  Xen+tmem provides these policies and enforces them.
And it enforces them very _dynamically_ to constantly optimize
RAM utilization across multiple guests each with dynamically varying RAM

Huge performance hits that are completely inexplicable to a user
give virtualization a bad reputation.  If the user (i.e. guest,
not host, administrator) can at least see "Hmmm... I'm doing a lot
of swapping, guess I'd better pay for more (virtual) RAM", then

Xen+tmem uses the SAME internal kernel interface.  The Xen-specific
code which performs the Xen-specific stuff (hypercalls) is only in

The missing part again is dynamicity.  How large is the virtual
disk?  Or are you proposing that disks can dramatically vary
in size across time?  I suspect that would be a very big patch.
And you're talking about a disk that doesn't have all the

A block device of what size?  Again, I don't think this will be

Ummm... no guest modifications, yet this special disk does everything
you've described above (and, to meet my dynamicity requirements,


Could you please explicitly identify what you are referring
to as a new external API?  The part that is different from


As noted VERY early in this thread, if/when it makes sense, frontswap
can do exactly the same thing by adding a buffering layer invisible

I think we agree that DMA makes sense when there is a lot of data to
copy and makes little sense when there is only a little (e.g. a
single page) to copy.  So I guess we need to understand what the
tradeoff is.  So, do you have any idea what the breakeven point is
for your favorite DMA engine for amount of data copied vs
1) locking the memory pages
2) programming the DMA engine
3) responding to the interrupt from the DMA engine
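
One way to frame the breakeven question is to compare the CPU time spent copying against the fixed CPU cost of the three steps listed above. The sketch below is a deliberately crude model, and every number in it is hypothetical, chosen only to show the calculation; it says nothing about any real DMA engine.

```python
def breakeven_bytes(fixed_overhead_ns: float, copy_ns_per_byte: float) -> float:
    """Transfer size at which the CPU time of a plain copy equals the fixed
    CPU cost of a DMA transfer (lock pages + program engine + take interrupt)."""
    return fixed_overhead_ns / copy_ns_per_byte


# Hypothetical costs, for illustration only.
lock_ns, program_ns, irq_ns = 500.0, 500.0, 1000.0
copy_ns_per_byte = 0.25  # corresponds to a 4 GB/s copy

n = breakeven_bytes(lock_ns + program_ns + irq_ns, copy_ns_per_byte)
print(n)
```

Under these made-up numbers the breakeven is 8000 bytes, about two 4096-byte pages; real hardware would have to be measured.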

And the simple act of waiting to collect enough pages to "batch"
means none of those pages can be used until ...
--

From: Avi Kivity
Date: Monday, May 3, 2010 - 1:46 am

Can you explain what "enforcing" means in this context?  You loaned the 

What you're saying is "don't overcommit".  That's a good policy for some 
scenarios but not for others.  Note it applies equally well for cpu as 
well as memory.

frontswap+tmem is not overcommit, it's undercommit.   You have spare 
memory, and you give it away.  It isn't a replacement.  However, without 


Exactly as large as the swap space which the guest would have in the 


If block layer overhead is a problem, go ahead and optimize it instead 
of adding new interfaces to bypass it.  Though I expect it wouldn't be 
needed, and if any optimization needs to be done it is in the swap layer.

Optimizing swap has the additional benefit of improving performance on 

What happens when no tmem is available?  You swap to a volume.  That's 



Something completely internal to the guest can be replaced by something 
completely different.  Something that talks to a hypervisor will need 

So, you take a synchronous copyful interface, add another copy to make 
it into an asynchronous interface, instead of using the original 

When swapping out, Linux already batches pages in the block device's 
request queue.  Swapping out is inherently asynchronous and batched, 
you're swapping out those pages _because_ you don't need them, and 
you're never interested in swapping out a single page.  Linux already 
reserves memory for use during swapout.  There's no need to re-solve 
solved problems.

Swapping in is less simple, it is mostly synchronous (in some cases it 
isn't: with many threads, or with the preswap patches (IIRC unmerged)).  
You can always choose to copy if you don't have enough to justify dma.

The networking stack seems to think 4096 bytes is a good size for dma 
(see net/core/user_dma.c, NET_DMA_DEFAULT_COPYBREAK).

-- 
error compiling committee.c: too many arguments to function

--

From: Avi Kivity
Date: Friday, April 30, 2010 - 9:16 am

But those are the guest's pages in the first place, that's not a new 
commitment.  CMM2 provides the hypervisor alternatives to swapping a 

They are not directly comparable.  In fact for dirty pages CMM2 is 
mostly a no-op - the host is forced to swap them out if it wants them.  
CMM2 brings value for demand zero or clean pages which can be restored 
by the guest without requiring swapin.

I think for dirty pages what CMM2 brings is the ability to discard them 

CMM2 is more directly comparable to ballooning rather than to 
frontswap.  Frontswap (and cleancache) work with storage that is 

The swap API (e.g. the block layer) itself is an asynchronous batched 
version of frontswap.  The complexity in CMM2 comes from the fact that 
it is communicating information about guest pages to the host, and from 

Given that whenever frontswap fails you need to swap anyway, it is 
better for the host to never fail a frontswap request and instead back 
it with disk storage if needed.  This way you avoid a pointless vmexit 
when you're out of memory.  Since it's disk backed it needs to be 
asynchronous and batched.

At this point we're back with the ordinary swap API.  Simply have your 
host expose a device which is write cached by host memory, you'll have 
all the benefits of frontswap with none of the disadvantages, and with 
no changes to guest code.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dan Magenheimer
Date: Saturday, May 1, 2010 - 10:10 am

OK, now I think I see the crux of the disagreement.

NO!  Frontswap on Xen+tmem never *never* _never_ NEVER results
in host swapping.  Host swapping is evil.  Host swapping is
the root of most of the bad reputation that memory overcommit
has gotten from VMware customers.  Host swapping can't be
avoided with some memory overcommit technologies (such as page
sharing), but frontswap on Xen+tmem CAN and DOES avoid it.

So, to summarize:

1) You agreed that a synchronous interface for frontswap makes
   sense for swap-to-in-kernel-compressed-RAM because it is
   truly swapping to RAM.
2) You have pointed out that an asynchronous interface for
   frontswap makes more sense for KVM than a synchronous
   interface, because KVM does host swapping.  Then you said
   if you have an asynchronous interface anyway, the existing
   swap code works just fine with no changes so frontswap
   is not needed at all... for KVM.
3) You have suggested that if Xen were more like KVM and required
   host-swapping, then Xen doesn't need frontswap either.

BUT frontswap on Xen+tmem always truly swaps to RAM.

So there are two users of frontswap for which the synchronous
interface makes sense.  I believe there may be more in the
future and you disagree but, as Jeremy said, "a general Linux
principle is not to overdesign interfaces for hypothetical users,
only for real needs."  We have demonstrated there is a need
with at least two users so the debate is only whether the
number of users is two or more than two.

Frontswap is a very non-invasive patch and is very cleanly
layered so that if it is not in the presence of either of 
the intended "users", it can be turned off in many different
ways with zero overhead (CONFIG'ed off) or extremely small overhead
(frontswap_ops is never set; or frontswap_ops is set but the
underlying hypervisor doesn't support it so frontswap_poolid
never gets set).
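
The layering described above can be pictured as a chain of cheap checks. This is a toy model, not the code in mm/frontswap.c: only the names frontswap_ops and frontswap_poolid come from the text, and the function, its parameters, and the "negative means unset" convention are assumptions made for illustration.

```python
def frontswap_active(config_on: bool, frontswap_ops, frontswap_poolid: int) -> bool:
    """Return True only if every layer is present.

    config_on        -- stands in for the compile-time option
    frontswap_ops    -- None until a backend registers itself
    frontswap_poolid -- negative (assumed convention) until a pool is created
    """
    if not config_on:
        return False            # CONFIG'ed off: hooks compile away entirely
    if frontswap_ops is None:
        return False            # no backend registered
    return frontswap_poolid >= 0  # backend present but hypervisor may lack support


print(frontswap_active(True, object(), -1))
```

Whenever the result is False, swap proceeds to the ordinary swap device as if frontswap did not exist.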

So... KVM doesn't need it and won't use it.  Do you, Avi, have
any other objections as to why the ...
--

From: Pavel Machek
Date: Sunday, May 2, 2010 - 12:11 am

Yet there are less invasive solutions available, like 'add trim
operation to swap_ops'.

So what needs to be said here is 'frontswap is XX times faster than
swap_ops based solution on workload YY'.
								       Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Nitin Gupta
Date: Sunday, May 2, 2010 - 12:57 am

Why is host-level swapping evil? In the KVM case, a VM is just another
process and the host will just swap out pages using the same LRU-like
scheme as with any other process, AFAIK.

Also, with frontswap, the host cannot discard pages at any time as is
the case with cleancache. So, while cleancache is obviously very
useful, the usefulness of frontswap remains doubtful.

IMHO, along with cleancache, we should just have in-memory
compressed swapping at *host* level i.e. no frontswap. I agree
that using frontswap hooks, it is easy to implement ramzswap
functionality but I think it's not worth replacing this driver
with frontswap hooks. This driver already has all the goodness:
asynchronous interface, ability to dynamically add/remove ramzswap
devices etc. All that is lacking in this driver is a more efficient
'discard' functionality so we can free a page as soon as it becomes
unused.

It should also be easy to extend this driver to allow sending pages
to host using virtio (for KVM) or Xen hypercalls, if frontswap is
needed at all.

So, IMHO we can focus on cleancache development and add missing
parts to ramzswap driver.

Thanks,
Nitin
--

From: Dan Magenheimer
Date: Sunday, May 2, 2010 - 10:22 am

Your analogy only holds when the host administrator is either
extremely greedy or stupid.  My analogy only requires some
statistical bad luck: Multiple guests with peaks and valleys

Hmmm... I'll bet I can break it pretty easily.  I think the
case you raised that you thought would cause host OOM'ing
will cause kvm live migration to fail.

Or maybe not... when a guest is in the middle of a live migration,
I believe (in Xen), the entire guest memory allocation (possibly
excluding ballooned-out pages) must be simultaneously in RAM briefly
in BOTH the host and target machine.  That is, live migration is
not "pipelined".  Is this also true of KVM?  If so, your
statement above is just waiting for a corner case to break it.

Choosing the _optimal_ overcommit ratio is impossible without a
prescient knowledge of the workload in each guest.  Hoping memory
will be available is certainly not a good solution, but if memory
is not available guest swapping is much better than host swapping.
And making RAM usage as dynamic as possible and live migration
as easy as possible are keys to maximizing the benefits (and
limiting the problems) of virtualization.

--

From: Avi Kivity
Date: Monday, May 3, 2010 - 2:39 am

10x vcpu is reasonable in some situations (VDI, powersave at night).  


No.  The entire guest address space can be swapped out on the source and 
target, less the pages being copied to or from the wire, and pages 
actively accessed by the guest.  Of course performance will suck if all 



That is why you need overcommit.  You make things dynamic with page 
sharing and ballooning and live migration, but at some point you need a 
failsafe fallback.  The only failsafe fallback I can see (where the host 
doesn't rely on guests) is swapping.

As far as I can tell, frontswap+tmem increases the problem.  You loan 
the guest some memory without the means to take it back, this increases 
memory pressure on the host.  The result is that if you want to avoid 
swapping (or are unable to) you need to undercommit host resources.  
Instead of sum(guest mem) + reserve < (host mem), you need sum(guest mem 
+ committed tmem) + reserve < (host mem).  You need more host memory, or 
fewer guests, or to be prepared to swap if the worst happens.
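
The two inequalities above can be written as admission checks. This is a minimal sketch with invented function names and made-up sizes (in GiB), only meant to show how committed tmem tightens the bound.

```python
def fits_without_tmem(guest_mem, reserve, host_mem):
    """sum(guest mem) + reserve < host mem"""
    return sum(guest_mem) + reserve < host_mem


def fits_with_tmem(guest_mem, committed_tmem, reserve, host_mem):
    """sum(guest mem + committed tmem) + reserve < host mem"""
    total = sum(g + t for g, t in zip(guest_mem, committed_tmem))
    return total + reserve < host_mem


# Hypothetical host: 64 GiB RAM, 4 GiB reserve, three 16 GiB guests.
guests = [16, 16, 16]
print(fits_without_tmem(guests, 4, 64))          # 48 + 4 < 64
print(fits_with_tmem(guests, [4, 4, 4], 4, 64))  # 60 + 4 < 64 no longer holds
```

The same three guests fit under the first rule but not under the second once each holds 4 GiB of committed tmem.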

-- 
error compiling committee.c: too many arguments to function

--

From: Avi Kivity
Date: Sunday, May 2, 2010 - 8:35 am

That's a bug.  You're giving the guest memory without the means to take 
it back.  The result is that you have to _undercommit_ your memory 
resources.

Consider a machine running a guest, with most of its memory free.  You 
give the memory via frontswap to the guest.  The guest happily swaps to 
frontswap, and uses the freed memory for something unswappable, like 
mlock()ed memory or hugetlbfs.

Now the second node dies and you need memory to migrate your guests 
into.  But you can't, and the hypervisor is at the mercy of the guest 
for getting its memory back; and the guest can't do it (at least not 

In this case the guest expects that swapped out memory will be slow 
(since it was freed via the swap API; it will be slow if the host happened 
to run out of tmem).  So by storing this memory on disk you aren't 
reducing performance beyond what you promised to the guest.

Swapping guest RAM will indeed cause a performance hit, but sometimes 


kvm's host swapping is unrelated.  Host swapping swaps guest-owned 
memory; that's not what we want here.  We want to cache guest swap in 
RAM, and that's easily done by having a virtual disk cached in main 
memory.  We're simply presenting a disk with a large write-back cache to 
the guest.

You could just as easily cache a block device in free RAM with Xen.  
Have a tmem domain behave as the backend for your swap device.  Use 
ballooning to force tmem to disk, or to allow more cache when memory is 
free.

Voila: you no longer depend on guests (you depend on the tmem domain, 
but that's part of the host code), you don't need guest modifications, 

For any hypervisor which implements virtual disks with write-back cache 


AND that's a problem because it puts the hypervisor at the mercy of the 


The problem is not the complexity of the patch itself.  It's the fact 
that it introduces a new external API.  If we refactor swapping, that 
stands in the way.

How much, that's up to the mm maintainers to say.  If it isn't a ...
--

From: Dan Magenheimer
Date: Monday, May 3, 2010 - 9:01 am

We're getting into hypervisor policy issues, but given that probably
nobody else is listening by now, I guess that's OK. ;-)

The enforcement is on the "put" side.  The page is not loaned,
it is freely given, but only if the guest is within its
contractual limitations (e.g. within its predefined "maxmem").
If the guest chooses to never remove the pages from frontswap,
that's the guest's option, but that part of the guest's
memory allocation can never be used for anything else so
it is in the guest's self-interest to "get" or "flush" the


Perhaps, but CPU overcommit has been a well-understood
part of computing for a very long time and users, admins,
and hosting providers all know how to recognize it and
deal with it.  Not so with overcommitment of memory;
the only exposure to memory limitations is "my disk light
is flashing a lot, I'd better buy more RAM".  Obviously,
this doesn't translate to virtualization very well.

And, as for your interrupt latency analogy, let's
revisit that if/when Xen or KVM support CPU overcommitment
for real-time-sensitive guests.  Until then, your analogy

But you are missing part of the magic:  Once the memory
page is no longer directly addressable (AND this implies not
directly writable) by the guest, the hypervisor can do interesting
things with it, such as compression and deduplication.
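
A toy store shows why pages that are not directly addressable can take less RAM than their logical size. Everything here is an illustrative assumption (class name, SHA-256 for duplicate detection, zlib for compression); it does not describe how Xen tmem is implemented, and it omits reference counting and flush.

```python
import hashlib
import zlib

PAGE_SIZE = 4096


class ToyPageStore:
    """Stores pages by content hash (deduplication) and compresses each blob."""

    def __init__(self):
        self.blobs = {}  # digest -> compressed page contents
        self.index = {}  # (guest, offset) -> digest

    def put(self, guest, offset, page):
        digest = hashlib.sha256(page).digest()
        if digest not in self.blobs:          # identical pages are stored once
            self.blobs[digest] = zlib.compress(page)
        self.index[(guest, offset)] = digest

    def get(self, guest, offset):
        return zlib.decompress(self.blobs[self.index[(guest, offset)]])

    def logical_bytes(self):
        return len(self.index) * PAGE_SIZE

    def physical_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = ToyPageStore()
zero_page = bytes(PAGE_SIZE)
for guest in ("a", "b", "c"):
    store.put(guest, 0, zero_page)
print(store.logical_bytes(), store.physical_bytes())
```

Three guests each store one page, but the store holds a single compressed blob, so the physical footprint is far below the logical one.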

As a result, the sum of pages used by all the guests exceeds
the total pages of RAM in the system.  Thus overcommitment.
I agree that the degree of overcommitment is less than possible
with host-swapping, but none of the evil issues of host-swapping
happen. Again, this is "intelligent overcommitment".  Other
existing forms are "overcommit and cross your fingers that bad

Uh, no.  As I've said, everything about frontswap is entirely
optional, both at compile-time and run-time.  A frontswap-enabled
guest is fully compatible with a hypervisor with no frontswap;
a frontswap-enabled hypervisor is fully compatible with a guest
with no frontswap.  The only ...
--

From: Pavel Machek
Date: Monday, May 3, 2010 - 12:32 pm

I don't see why no copying is a requirement. I believe requirement
should be "it is fast enough".
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Dan Magenheimer
Date: Sunday, May 2, 2010 - 8:05 am

As Nitin pointed out much earlier in this thread:

"No: trim or discard is not useful"

I also think that trim does not do anything for the widely

Are you asking me to demonstrate that swap-to-hypervisor-RAM is
faster than swap-to-disk?

--

From: Pavel Machek
Date: Sunday, May 2, 2010 - 1:06 pm

I would like comparison of swap-to-frontswap vs. swap-to-RAMdisk.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Dan Magenheimer
Date: Sunday, May 2, 2010 - 2:05 pm

Well, it's not really apples-to-apples because swap-to-RAMdisk
is copying to a chunk of RAM with a known permanently-fixed size
so it SHOULD be faster than swap-to-hypervisor, and should
*definitely* be faster than swap-to-in-kernel-compressed-RAM
but I suppose it is still an interesting comparison.  I'll
see what I can do, but it will probably be a couple days to
figure out how to measure it (e.g. without accidentally measuring
any swap-to-disk).
--

From: Avi Kivity
Date: Thursday, April 29, 2010 - 11:53 am

You can't have a negative balloon size.  The two models are not equivalent.

Balloon allows you to give up a page for which you have a struct page.  
Frontswap (and swap) allows you to gain a page for which you don't have 
a struct page, but you can't access it directly.  The similarity is that 
in both cases the host may want the guest to give up a page, but cannot 

There's no reason for swapping and ballooning to behave differently when 
swap backing storage is RAM (they probably do now since swap was tuned 
for disks, not flash, but that's a bug if it's true).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Dave Hansen
Date: Friday, April 30, 2010 - 9:08 am

Once pages were dirtied (or I guess just slightly before), they became
volatile, and I don't think the hypervisor could do anything with them.
It could still swap them out like usual, but none of the CMM-specific
optimizations could be performed.

CC'ing Martin since he's the expert. :)

-- Dave

--

From: Martin Schwidefsky
Date: Monday, May 10, 2010 - 9:05 am

On Fri, 30 Apr 2010 09:08:00 -0700

Well, almost correct :-)
A dirty page (or one that is about to become dirty) can be in one of two
CMMA states:
1) stable
This is the case for pages where the kernel is doing some operation on
the page that will make it dirty, e.g. I/O. Before the kernel can
allow the operation the page has to be made stable. If the state
conversion to stable fails because the hypervisor removed the page the
page needs to get deleted from page cache and recreated from scratch.
2) potentially-volatile
This state is used for page cache pages for which a writable mapping
exists. The page can be removed by the hypervisor as long as the
physical per-page dirty bit is not set. As soon as the bit is set the
page is considered stable although the CMMA state still is potentially-
volatile.

In both cases the only thing the hypervisor can do with a dirty page is
to swap it as usual.
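
The rule for the two states described above can be condensed into a small predicate. This is a sketch based only on the description in this message; the enum and function names are invented, and other CMMA states are left out.

```python
from enum import Enum


class CmmaState(Enum):
    STABLE = "stable"
    POTENTIALLY_VOLATILE = "potentially-volatile"


def hypervisor_may_discard(state: CmmaState, hw_dirty: bool) -> bool:
    """A page may be dropped only while it is potentially-volatile and the
    physical per-page dirty bit is still clear; otherwise it must be swapped."""
    return state is CmmaState.POTENTIALLY_VOLATILE and not hw_dirty


print(hypervisor_may_discard(CmmaState.POTENTIALLY_VOLATILE, hw_dirty=False))
```

Once the dirty bit is set the predicate is False even though the recorded state is unchanged, which matches "considered stable although the CMMA state still is potentially-volatile".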

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--

From: Dan Magenheimer
Date: Friday, April 30, 2010 - 8:59 am

Dave or others can correct me if I am wrong, but I think CMM2 also
handles dirty pages that must be retained by the hypervisor.  The
difference between CMM2 (for dirty pages) and frontswap is that
CMM2 sets hints that can be handled asynchronously while frontswap
provides explicit hooks that synchronously succeed/fail.
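
The synchronous succeed/fail behaviour can be sketched with a toy backend whose capacity may change at any moment. The names below are hypothetical and this is not the frontswap API. One assumption to note: when a put to an already-stored index is rejected, the toy drops the stale copy so that old data can never be read back.

```python
class ToyFrontswap:
    def __init__(self, capacity_pages: int):
        self.capacity_pages = capacity_pages  # the "hypervisor" may change this
        self.pages = {}

    def put(self, index, page) -> bool:
        """Synchronously accept or reject a page."""
        others = len(self.pages) - (1 if index in self.pages else 0)
        if others >= self.capacity_pages:
            self.pages.pop(index, None)  # never leave a stale copy behind
            return False
        self.pages[index] = page
        return True


def swap_out(frontswap, disk, index, page) -> str:
    """Try frontswap first; on rejection fall back to the swap device."""
    if frontswap.put(index, page):
        return "frontswap"
    disk[index] = page
    return "disk"


fs, disk = ToyFrontswap(capacity_pages=1), {}
print(swap_out(fs, disk, 0, b"a"))
```

Shrinking capacity_pages to 0 afterwards makes a second put to index 0 fail even though the first succeeded, which is the dynamic behaviour a static swap device never shows.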

In fact, Avi, CMM2 is probably a fairly good approximation of what
the asynchronous interface you are suggesting might look like.

Not to beat a dead horse, but there is a very key difference:
The size and availability of frontswap is entirely dynamic;
any page-to-be-swapped can be rejected at any time even if
a page was previously successfully swapped to the same index.
Every other swap device is much more static so the swap code
assumes a static device.  Existing swap code can account for
"bad blocks" on a static device, but this is far from sufficient

Yes, cleancache can drop pages at any time because (as the
name implies) only clean pages can be put into cleancache.

--

From: Dan Magenheimer
Date: Thursday, April 29, 2010 - 7:42 am

Hi Pavel --

The whole concept of RAM that _might_ be available to the
kernel and is _not_ directly addressable by the kernel takes
some thinking to wrap your mind around, but I assure you
there are very good use cases for it.  RAM owned and managed
by a hypervisor (using controls unknowable to the kernel)
is one example; this is Transcendent Memory.  RAM which
has been compressed is another example; Nitin is working
on this using the frontswap approach because of some
issues that arise with ramzswap (see elsewhere on this
thread).  There are likely more use cases.

So in that context, let me answer your questions, combined

If this was possible by fixing the balloon driver, VMware would
have done it years ago.  The problem is that the balloon driver
is acting on very limited information, namely ONLY what THIS
kernel wants; every kernel is selfish and (eventually) uses every
bit of RAM it can get.  This is especially true when swapping
is required (under memory pressure).

So, in general, ballooning is NOT faster because a balloon
request to "get" RAM must wait for some other balloon driver
in some other kernel to "give" RAM.  OR some other entity
must periodically scan every kernel's memory and guess at which
kernels are using memory inefficiently and steal it away before
a "needy" kernel asks for it.

While this does indeed "work" today in VMware, if you talk to
VMware customers that use it, many are very unhappy with the

Simply copying RAM from one page owned by the kernel to another
page owned by the kernel is pretty pointless as far as swapping
is concerned because it does nothing to reduce memory pressure,
so the comparison is a bit irrelevant.  But...

In my measurements, the overhead of managing "pseudo-RAM" pages
is in the same ballpark as copying the page.  Compression or
deduplication of course has additional costs.  See the
performance results at the end of the following two presentations
for some performance information when "pseudo-RAM" is ...
--

From: Avi Kivity
Date: Thursday, April 29, 2010 - 12:01 pm

Plus of course the asynchronicity and batching of the block layer.  Even 
if you don't use a dma engine, you improve performance by exiting once 
per several dozen pages instead of for every page, perhaps enough to 
allow the hypervisor to justify copying the memory with non-temporal moves.
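
The saving from batching is easy to quantify. The sketch below simply counts guest exits for a given batch size; the function name and the example sizes are made up for illustration.

```python
import math


def exits_needed(pages: int, batch_size: int) -> int:
    """Number of guest exits needed to hand `pages` pages to the host
    when up to `batch_size` pages are submitted per exit."""
    return math.ceil(pages / batch_size)


print(exits_needed(1024, 1))   # one exit per page
print(exits_needed(1024, 32))  # a few dozen pages per exit
```

For 1024 pages, a per-page interface exits 1024 times while batches of 32 need only 32 exits.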

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Avi Kivity
Date: Thursday, April 29, 2010 - 11:59 am

The concern is not with the hypervisor, but with Linux.  More external 

I'm convinced it's useful.  The API is so close to a block device 
(read/write with key/value vs read/write with sector/value) that we 
should make the effort not to introduce a new API.
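
The closeness of the two interfaces can be illustrated by layering a page-keyed store on top of a sector-addressed device. This is a toy sketch with invented class names and an assumed 512-byte sector; it is neither the frontswap API nor the kernel block layer.

```python
SECTOR_SIZE = 512
PAGE_SIZE = 4096
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE


class ToyBlockDevice:
    """read/write with sector/value."""

    def __init__(self):
        self.sectors = {}

    def write(self, sector, data):
        assert len(data) == SECTOR_SIZE
        self.sectors[sector] = data

    def read(self, sector):
        return self.sectors[sector]


class PageStoreOnBlock:
    """read/write with key/value, where the key is a page offset."""

    def __init__(self, dev):
        self.dev = dev

    def put(self, offset, page):
        assert len(page) == PAGE_SIZE
        first = offset * SECTORS_PER_PAGE  # key maps directly to a sector range
        for i in range(SECTORS_PER_PAGE):
            chunk = page[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE]
            self.dev.write(first + i, chunk)

    def get(self, offset):
        first = offset * SECTORS_PER_PAGE
        return b"".join(self.dev.read(first + i) for i in range(SECTORS_PER_PAGE))


store = PageStoreOnBlock(ToyBlockDevice())
store.put(3, bytes(range(256)) * 16)
```

The key-to-sector mapping is a single multiplication, which is the sense in which the two interfaces are close.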

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
