Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Patch applies to 2.6.34-rc5

In previous patch postings, frontswap was part of the Transcendent
Memory ("tmem") patchset. This patchset refocuses not on the underlying
technology (tmem) but instead on the useful functionality provided for
Linux, and provides a clean API so that frontswap can provide this very
useful functionality via a Xen tmem driver OR completely independent of
tmem. For example: Nitin Gupta (of compcache and ramzswap fame) is
implementing an in-kernel compression "backend" for frontswap; some
believe frontswap will be a very nice interface for building RAM-like
functionality for pseudo-RAM devices such as SSD or phase-change memory;
and a Pune University team is looking at a backend for virtio (see
OLS'2010).

A more complete description of frontswap can be found in the
introductory comment in mm/frontswap.c (in PATCH 2/4) which is included
below for convenience.

Note that an earlier version of this patch is now shipping in OpenSuSE
11.2 and will soon ship in a release of Oracle Enterprise Linux.
Underlying tmem technology is now shipping in Oracle VM 2.2 and was just
released in Xen 4.0 on April 15, 2010. (Search news.google.com for
Transcendent Memory)

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>

 include/linux/frontswap.h |   98 ++++++++++++++
 include/linux/swap.h      |    2
 include/linux/swapfile.h  |   13 +
 mm/Kconfig                |   16 ++
 mm/Makefile               |    1
 mm/frontswap.c            |  301 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_io.c              |   12 +
 mm/swap.c                 |    4
 mm/swapfile.c             |   58 +++++++-
 9 files changed, 496 insertions(+), 9 deletions(-)

Frontswap is so named because it can be thought of as the opposite of a
"backing" store for a swap device. The storage is assumed to be a
synchronous concurrency-safe page-oriented pseudo-RAM ...
How baked in is the synchronous requirement? Memory, for example, can be asynchronous if it is copied by a dma engine, and since there are hardware encryption engines, there may be hardware compression engines in the future. -- error compiling committee.c: too many arguments to function --
Indeed. But an asynchronous API is not appropriate for frontswap (or cleancache). The reason the hooks are so simple is because they are assumed to be synchronous so that the page can be immediately Yes, but for something like an SSD where copying can be used to build up a full 64K write, the cost of copying memory may not be counterproductive. --
Swapping is inherently asynchronous, so we'll have to wait for that to complete anyway (as frontswap does not guarantee swap-in will succeed). I don't doubt it makes things simpler, but also less flexible and useful. Something else that bothers me is the double swapping. Sure we're making swapin faster, but we're still loading the io subsystem with writes. Much better to make swap-to-ram authoritative (and have the I don't understand. Please clarify. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Each page is either in frontswap OR on the normal swap device, never both. So, yes, both reads and writes are avoided if memory is available and there is no write issued to the io subsystem if memory is available. The is_memory_available decision is determined by the hypervisor dynamically for each page when the guest attempts a "frontswap_put". So, yes, you are indeed "swapping to the hypervisor" but, at least in the case of Xen, the hypervisor In many cases, this is true. For the swap subsystem, it may not always be true, though I see recent signs that it may be headed in that direction. In any case, unless you see this SSD discussion as critical to the proposed acceptance of the frontswap patchset, let's table it until there's some prototyping done. Thanks, Dan --
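A minimal user-space C sketch of the behaviour described above, for readers following along: every swap-out first offers the page to the frontswap backend, and only a rejected put falls through to the ordinary swap device, so a slot is never in both places. All names here (backend_put, swap_out, slot_state, the free_pages counter) are hypothetical illustrations and are not taken from the patch; in the Xen case the accept/reject decision is made by the hypervisor per page, which this toy counter merely stands in for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 8

/* Where the contents of a swap slot currently live (model only). */
enum where { NOWHERE, IN_FRONTSWAP, ON_SWAP_DEVICE };

/* Hypothetical backend: accepts pages only while it has room left. */
struct backend {
    size_t free_pages;
};

/* Returns true if the backend accepted the page, false if it refused. */
static bool backend_put(struct backend *b)
{
    if (b->free_pages == 0)
        return false;
    b->free_pages--;
    return true;
}

static enum where slot_state[NSLOTS];

/*
 * Models the write side: try frontswap first; if the put is rejected,
 * the page goes to the normal swap device instead. Never both.
 */
static enum where swap_out(struct backend *b, size_t slot)
{
    assert(slot < NSLOTS);
    slot_state[slot] = backend_put(b) ? IN_FRONTSWAP : ON_SWAP_DEVICE;
    return slot_state[slot];
}
```

With a backend that has room for one page, the first swap_out() lands in frontswap and the second is written to the swap device; no I/O is issued for the first page at all.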
I see. So why not implement this as an ordinary swap device, with a higher priority than the disk device? this way we reuse an API and keep things asynchronous, instead of introducing a special purpose API. Doesn't this commit the hypervisor to retain this memory? If so, isn't it simpler to give the page to the guest (so now it doesn't need to swap at all)? I think it will be true in an overwhelming number of cases. Flash is It isn't particularly related. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Looks like "init" == open, "put_page" == write, "get_page" == read, "flush_page|flush_area" == trim. The only difference seems to be that an overwriting put_page may fail. Doesn't seem to be much of a win, since a guest can simply avoid issuing the duplicate put_page, so the hypervisor is still committed to holding this memory for the guest. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Yes, indeed, this is true. That is why it is important for any policy implemented behind frontswap to "bill" the guest if it is attempting to keep frontswap pages in the hypervisor forever and to prod the guest to reclaim them when it no longer needs super-fast emergency swap space. The frontswap patch already includes the kernel mechanism to enable this and the prodding can be implemented by a guest daemon (of which there already exists an existence proof). (While devil's advocacy is always welcome, frontswap is NOT a cool academic science project where these issues have not been considered or tested.) --
In this case you could use the same mechanism to stop new put_page()s? Seems frontswap is like a reverse balloon, where the balloon is in Good to know. -- error compiling committee.c: too many arguments to function --
ramzswap is exactly this: an ordinary swap device which stores every page in (compressed) memory and it is enabled as the highest-priority swap. Currently, it stores these compressed chunks in guest memory itself but it is not very difficult to send these chunks out to host/hypervisor using virtio. However, it suffers from unnecessary block I/O layer overhead and requires weird hooks in swap code, say to get notification when a swap slot is freed. OTOH frontswap approach gets rid of any such artifacts and overheads. (ramzswap: http://code.google.com/p/compcache/) Thanks, Nitin --
Maybe we should optimize these overheads instead. Swap used to always be to slow devices, but swap-to-flash has the potential to make swap act like an extension of RAM. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
No: trim or discard is not useful. The problem is that we require a callback _as soon as_ a page (swap slot) is freed. Otherwise, stale data quickly accumulates in memory, defeating the whole purpose of in-memory compressed swap devices (like ramzswap). Increasing the frequency of discards is also not an option:
- Creating discard bio requests itself needs memory, and these swap devices come into the picture only under low-memory conditions.
- We need to regularly scan swap_map to issue these discards. Increasing discard frequency also means more frequent scanning (which will still not be fast enough
Spending a lot of effort optimizing an overhead which can be completely avoided is probably not worth it. Also, I think the choice of a synchronous style API for frontswap and cleancache is justified as they want to send pages to host *RAM*. If you want to use other devices like SSDs, then these should be just added as another swap device as we do currently -- these should not be used as frontswap storage directly. Thanks, Nitin --
Doesn't flash have similar requirements? The earlier you discard, the I'm not sure. Swap-to-flash will soon be everywhere. If it's slow, Even for copying to RAM an async API is wanted, so you can dma it instead of copying. -- error compiling committee.c: too many arguments to function --
No. We do not want to issue discard for every page as soon as it is freed. I'm not a flash expert but I guess issuing erase is just too expensive to be issued so frequently. OTOH, ramzswap needs a callback for every page and as Ok, but still all this bio allocation and block layer overhead seems unnecessary and is easily avoidable. I think frontswap code needs frontswap simply calls frontswap_flush_page() in swap_entry_free() i.e. as Optimizing swap-to-flash is surely desirable but this problem is separate from ramzswap or frontswap optimization. For the latter, I think dealing Maybe incremental development is better? Stabilize and refine existing code and gradually move to async API, if required in future? Thanks, Nitin --
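To make the "callback as soon as the slot is freed" point concrete, here is a small user-space C model of it. It is only a sketch under stated assumptions: use_count stands in for swap_map, and model_store, model_flush_page, model_swap_entry_free and pages_held are hypothetical names, not the kernel functions. The flush is a plain synchronous call that allocates nothing, which is the property being argued for; a discard bio would need memory and a later scan of the map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 4

static unsigned char use_count[NSLOTS]; /* stand-in for swap_map */
static bool stored[NSLOTS];             /* backend holds a copy of the slot */

/* Record that a slot with the given number of users was put to the backend. */
static void model_store(size_t slot, unsigned char users)
{
    assert(slot < NSLOTS && users > 0);
    use_count[slot] = users;
    stored[slot] = true;
}

/* Synchronous flush: drops the backend copy immediately, no allocation. */
static void model_flush_page(size_t slot)
{
    stored[slot] = false;
}

/* Dropping the last reference flushes the backend copy at once. */
static void model_swap_entry_free(size_t slot)
{
    assert(slot < NSLOTS && use_count[slot] > 0);
    if (--use_count[slot] == 0)
        model_flush_page(slot);
}

/* Number of slots the backend is still holding. */
static size_t pages_held(void)
{
    size_t n = 0;
    for (size_t i = 0; i < NSLOTS; i++)
        n += stored[i];
    return n;
}
```

In this model a slot with two users stays in the backend after the first free and disappears on the second, so no stale copy ever outlives its last user.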
Ok. I agree it is silly to go through the block layer and end up Incremental development is fine, especially for ramzswap where the APIs are all internal. I'm more worried about external interfaces, these stick around a lot longer and if not done right they're a pain forever. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
No, we cannot issue discard bio at this place since swap_lock spinlock is held. Thanks, Nitin --
OK, so on the one hand, you think that the proposed synchronous interface for frontswap is insufficiently extensible for other uses (presumably including KVM). On the other hand, you agree that using the existing I/O subsystem is unnecessarily heavyweight. On the third hand, Nitin has answered your questions and spent a good part of three years finding that extending the existing swap interface to efficiently support swap-to-pseudo-RAM requires some kind of in-kernel notification mechanism to which Linus has already objected. So you are instead proposing some new guest-to-host asynchronous notification mechanism that doesn't use the existing bio mechanism (and so presumably not irqs), imitates or can utilize a dma engine, and uses less cpu cycles than copying pages. AND, for long-term maintainability, you'd like to avoid creating a new guest-host API that does all this, even one that is as simple and lightweight as the proposed frontswap hooks. Does that summarize your objection well? --
No. Adding a new async API that parallels the block layer would be madness. My first preference would be to completely avoid new APIs. I think that would work for swap-to-hypervisor but probably not for compcache. Second preference is the synchronous API, third is a new async API. -- error compiling committee.c: too many arguments to function --
Well if you are saying that your primary objection to the frontswap synchronous API is that it is exposed to modules via some EXPORT_SYMBOLs, we can certainly fix that, at least unless/until there are other pseudo-RAM devices that can use it. Would that resolve your concerns? --
By external interfaces I mean the guest/hypervisor interface. EXPORT_SYMBOL is an internal interface as far as I'm concerned. Now, the frontswap interface is also an internal interface, but it's close to the external one. I'd feel much better if it was asynchronous. -- error compiling committee.c: too many arguments to function --
Umm... I think the difference between a "new" API and extending an existing one here is a choice of semantics. As designed, frontswap is an extremely simple, only-very-slightly-intrusive set of hooks that allows swap pages to, under some conditions, go to pseudo-RAM instead of an asynchronous disk-like device. It works today with at least one "backend" (Xen tmem), is shipping today in real distros, and is extremely easy to enable/disable via CONFIG or module... meaning no impact on anyone other than those who choose to benefit from it. "Extending" the existing swap API, which has largely been untouched for many years, seems like a significantly more complex and error-prone undertaking that will affect nearly all Linux users with a likely long bug tail. And, by the way, there is no existence proof that it will be useful. Well, we shall see. It may also be the case that the existing asynchronous swap API will work fine for some non traditional RAM; and it may also be the case that frontswap works fine for some non traditional RAM. I agree there is fertile ground for exploration here. But let's not allow our speculation on what may or may not work in the future halt forward progress of something that works Yes, at a much larger more invasive cost to the kernel. Frontswap What I was referring to is that the existing swap code DOES NOT always have the ability to collect N scattered pages before initiating an I/O write suitable for a device (such as an SSD) that is optimized for writing N pages at a time. That is what I meant by a timing constraint. See references to page_cluster in the swap code (and this is for contiguous pages, not scattered). Dan --
My issue is with the API's synchronous nature. Both RAM and more exotic memories can be used with DMA instead of copying. A synchronous No need to change the kernel at all; the hypervisor controls the page I see. Given that swap-to-flash will soon be way more common than frontswap, it needs to be solved (either in flash or in the swap code). -- error compiling committee.c: too many arguments to function --
When pages are 2MB, this may be true. When pages are 4KB and copied individually, it may take longer to program a DMA engine than to just copy 4KB. But in any case, frontswap works fine on all existing machines today. If/when most commodity CPUs have an asynchronous RAM DMA engine, an asynchronous API may be appropriate. Or the existing swap API might be appropriate. Or the synchronous frontswap API may work fine too. Speculating further about non-existent hardware that might exist in the (possibly far) future is irrelevant to the proposed patch, which works today on all existing x86 hardware and on shipping software. --
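The 4KB-versus-2MB argument can be written down as a back-of-the-envelope model. The C sketch below assumes a deliberately simplified cost model that is not from the thread: copying costs CPU time proportional to size, while DMA costs a fixed CPU overhead per transfer (setup plus completion handling). The function names and any numbers fed into it are hypothetical placeholders, not measurements of any real engine.

```c
#include <assert.h>

/* CPU time (ns) spent copying `bytes` at `copy_bytes_per_ns`. */
static double copy_cpu_ns(double bytes, double copy_bytes_per_ns)
{
    assert(copy_bytes_per_ns > 0.0);
    return bytes / copy_bytes_per_ns;
}

/*
 * Transfer size at which the fixed DMA overhead equals the CPU cost of
 * simply copying: setup = bytes / rate, so bytes = setup * rate.
 */
static double breakeven_bytes(double dma_setup_ns, double copy_bytes_per_ns)
{
    assert(dma_setup_ns >= 0.0 && copy_bytes_per_ns > 0.0);
    return dma_setup_ns * copy_bytes_per_ns;
}

/* Nonzero if handing the transfer to the DMA engine costs less CPU time. */
static int dma_saves_cpu(double bytes, double dma_setup_ns,
                         double copy_bytes_per_ns)
{
    return copy_cpu_ns(bytes, copy_bytes_per_ns) > dma_setup_ns;
}
```

With placeholder inputs of 2000 ns fixed overhead and 4 bytes/ns copy rate, the breakeven is 8000 bytes: a single 4KB page is cheaper to copy, while a 2MB page is far past the breakeven. Real values would have to be measured on the hardware in question.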
Of course, you have to use a batching API, like virtio or Xen's rings, dma engines are present on commodity hardware now: http://en.wikipedia.org/wiki/I/O_Acceleration_Technology I don't know if consumer machines have them, but servers certainly do. modprobe ioatdma. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
If we added all the apis that worked when proposed, we'd have an unmaintainable mess by about 1996. Why can't frontswap just use existing swap api? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Not in Xen PV guests (the hypervisor vets guest updates, but it can't
safely make its own changes to the pagetables). (It's kind of annoying.)
J
--
Hi Pavel! The existing swap API as it stands is inadequate for an efficient synchronous interface (e.g. for swapping to RAM). Both Nitin and I independently have found this to be true. But swap-to-RAM is very useful in some cases (swap-to-kernel-compressed-RAM and swap-to-hypervisor-RAM and maybe others) that were not even conceived many years ago at the time the existing swap API was designed for swap-to-disk. Swap-to-RAM can relieve memory pressure faster and more resource-efficiently than swap-to-device but must assume that RAM available for swap-to-RAM is dynamic (not fixed in size). (And swap-to-SSD, when the SSD is an I/O device on an I/O bus, is NOT the same as swap-to-RAM.) In my opinion, frontswap is NOT a new API, but the simplest possible extension of the existing swap API to allow for efficient swap-to-RAM. Avi's comments about a new API (as he explained later in the thread) refer to a new API between kernel and hypervisor, what is essentially the Transcendent Memory interface. Frontswap was separated from the tmem dependency to enable Nitin's swap-to-kernel-compressed-RAM and the possibility that there may be other interesting swap-to-RAM uses. Does this help? Dan --
So... how much slower is swapping to RAM over current interface when compared to proposed interface, and how much is that slower than just using the memory directly? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Stop right here. Instead of improving the existing swap api, you just create a new one because it is less work. We do not want apis to accumulate; please just fix the existing one. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
I'm a bit confused: What do you mean by 'existing swap API'?
Frontswap simply hooks in swap_readpage() and swap_writepage() to
call frontswap_{get,put}_page() respectively. Now to avoid a hardcoded
implementation of these functions, it introduces struct frontswap_ops
so that custom implementations of the frontswap get/put/etc. functions
can be provided. This allows easy implementation of swap-to-hypervisor,
in-memory-compressed-swapping etc. with a common set of hooks.
So, how can the frontswap approach be seen as introducing a new API?
Thanks,
Nitin
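
For readers who have not looked at the patch, the shape of the hook described above can be sketched in user-space C roughly as follows. This is a toy model, not the code from mm/frontswap.c: the struct layout, the model_ prefixed names, the tiny page size and the in-memory ram_ops backend are all assumptions made for illustration. The point is only that the read/write paths call through an ops table, and a missing or refusing backend leaves the normal swap path untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 16 /* toy page size for the model */
#define NSLOTS  4

/* Hypothetical ops table: each backend supplies its own put/get. */
struct model_frontswap_ops {
    int (*put_page)(size_t slot, const unsigned char *page);
    int (*get_page)(size_t slot, unsigned char *page);
};

static const struct model_frontswap_ops *ops; /* NULL: no backend registered */

/* Toy in-memory backend standing in for tmem or compressed RAM. */
static unsigned char store[NSLOTS][PAGE_SZ];
static int present[NSLOTS];

static int ram_put(size_t slot, const unsigned char *page)
{
    if (slot >= NSLOTS)
        return -1; /* refuse: caller falls back to the swap device */
    memcpy(store[slot], page, PAGE_SZ);
    present[slot] = 1;
    return 0;
}

static int ram_get(size_t slot, unsigned char *page)
{
    if (slot >= NSLOTS || !present[slot])
        return -1; /* not here: caller reads from the swap device */
    memcpy(page, store[slot], PAGE_SZ);
    return 0;
}

static const struct model_frontswap_ops ram_ops = { ram_put, ram_get };

/* Hook on the write path: returns 1 if the backend took the page. */
static int model_swap_writepage(size_t slot, const unsigned char *page)
{
    return ops && ops->put_page(slot, page) == 0;
}

/* Hook on the read path: returns 1 if the backend supplied the page. */
static int model_swap_readpage(size_t slot, unsigned char *page)
{
    return ops && ops->get_page(slot, page) == 0;
}
```

Registering a backend is just pointing ops at its table (ops = &ram_ops in this model); with ops left NULL both hooks return 0 and the caller proceeds exactly as it would without frontswap.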
--
Yes, and that set of hooks is new API, right? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
No, ANY put_page can fail, and this is a critical part of the API that provides all of the flexibility for the hypervisor and all the guests. (See previous reply.) The "duplicate put" semantics are carefully specified as there are some coherency corner cases that are very difficult to handle in the "backend" but very easy to handle in the kernel. So the specification explicitly punts these to the kernel. --
The guest isn't required to do any put_page()s. It can issue lots of them when memory is available, and keep them in the hypervisor forever. Failing new put_page()s isn't enough for a dynamic system, you need to be able to force the guest to give up some of its tmem. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
You are suggesting the hypervisor communicate dynamically-rapidly-changing physical memory availability information to a userland daemon in each guest, and each daemon communicate this information to each respective kernel to notify the kernel that hypervisor memory is not available? Seems very convoluted to me, and anyway it doesn't eliminate the need That's a reasonable analogy. Frontswap serves nicely as an emergency safety valve when a guest has given up (too) much of its memory via ballooning but unexpectedly has an urgent need that can't be serviced quickly enough by the balloon driver. --
Yeah, it's pretty ugly. Balloons typically communicate without a daemon (or ordinary swap) -- error compiling committee.c: too many arguments to function --
Frontswap and things like CMM2[1] have some fundamental advantages over swapping and ballooning. First of all, there are serious limits on ballooning. It's difficult for a guest to span a very wide range of memory sizes without also including memory hotplug in the mix. The ~1% 'struct page' penalty alone causes issues here. A large portion of CMM2's gain came from the fact that you could take memory away from guests without _them_ doing any work. If the system is experiencing a load spike, you increase load even more by making the guests swap. If you can just take some of their memory away, you can smooth that spike out. CMM2 and frontswap do that. The guests explicitly give up page contents that the hypervisor does not have to first consult with the guest before discarding. [1] http://www.kernel.org/doc/ols/2006/ols2006v2-pages-321-336.pdf -- Dave --
If you have a single swap device, sure. But, I can also see a case where you have a "fast" swap and "slow" swap. The part of the argument about frontswap that I like is the lack of sizing exposed to the guest. When you're dealing with swap-only, you are stuck adding or removing swap devices if you want to "grow/shrink" the memory footprint. If the host (or whatever is backing the frontswap) wants to change the sizes, they're fairly free to. The part that bothers me is that it just pushes the problem elsewhere. For KVM, we still have to figure out _somewhere_ what to do with all those pages. It's nice that the host would have the freedom to either swap or keep them around, but it doesn't really fix the problem. I do see the lack of sizing exposed to the guest as being a bad thing, too. Let's say we saved 25% of system RAM to back a frontswap-type device on a KVM host. The first time a user boots up their set of VMs and 25% of their RAM is gone, they're going to start complaining, despite the fact that their 25% smaller systems may end up being faster. I think I'd be more convinced if we saw this thing actually get used somehow. How is a ram-backed frontswap better than a /dev/ramX-backed swap file in practice? -- Dave --
True. My remarks only apply to frontswap-to-hypervisor, for internally So it seems a bare-metal hypervisor has less access to the bare metal than a non-bare-metal hypervisor? Seriously, leave the bare-metal FUD to Simon. People on this list know that kvm and Xen have exactly the same access to the hardware (well There's still an exit. It's much faster than a vmx/svm vmexit but still nontrivial. It's determined by the hypervisor, same as with tmem. The guest swaps to a virtual disk, the hypervisor places the data in RAM if it's You can have multiple swap devices. wrt SR/IOV, you'll see synchronous frontswap reduce throughput. SR/IOV will swap with <1 exit/page and DMA guest pages, while frontswap/tmem will carry a 1 exit/page hit (even if no swap actually happens) and the copy cost (if it does). In-kernel compressed swap does seem to be a good match for a synchronous API. For future memory devices, or even bare-metal buzzword-compliant hypervisors, I disagree. An asynchronous API is required for efficiency, and they'll all have swap capability sooner or later (kvm, vmware, and I believe xen 4 already do). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
(I'll back down on the CMM2 comparisons until I can go I think you are making a number of possibly false assumptions here: 1) The host [the frontswap backend may not even be a hypervisor] 2) can back it with disk storage [not if it is a bare-metal hypervisor] 3) avoid a pointless vmexit [no vmexit for a non-VMX (e.g. PV) guest] 4) when you're out of memory [how can this be determined outside of the hypervisor?] And, importantly, "have your host expose a device which is write cached by host memory"... you are implying that all guest swapping should be done to a device managed/controlled by the host? That eliminates guest swapping to directIO/SRIOV devices doesn't it? Anyway, I think we can see now why frontswap might not be a good match for a hosted hypervisor (KVM), but that doesn't make it any less useful for a bare-metal hypervisor (or TBD for in-kernel compressed swap and TBD for possible future pseudo-RAM technologies). Dan --
Frontswap does not do this. Once a page has been frontswapped, the host is committed to retaining it until the guest releases it. It's really not very different from a synchronous swap device. I think cleancache allows the hypervisor to drop pages without the guest's immediate knowledge, but I'm not sure. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Gah. You're right. I'm reading the two threads and confusing the concepts. I'm a bit less mystified why the discussion is revolving around the swap device so much. :) -- Dave --
wtf? So let's fix the ballooning driver instead? There's no reason it could not be as fast as frontswap, right? Actually I'd expect it to be faster -- it can deal with big chunks. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
I'd argue the opposite. There's no point in having the host do swapping
on behalf of guests if guests can do it themselves; it's just a
duplication of functionality. You end up having two IO paths for each
guest, and the resulting problems in trying to account for the IO,
rate-limit it, etc. If you can simply say "all guest disk IO happens
via this single interface", it's much easier to manage.
If frontswap has value, it's because it's providing a new facility to
guests that doesn't already exist and can't be easily emulated with
existing interfaces.
It seems to me the great strengths of the synchronous interface are:
* it matches the needs of an existing implementation (tmem in Xen)
* it is simple to understand within the context of the kernel code
it's used in
Simplicity is important, because it allows the mm code to be understood
and maintained without having to have a deep understanding of
virtualization. One of the problems with CMM2 was that it puts a lot of
intricate constraints on the mm code which can be easily broken, which
would only become apparent in subtle edge cases in a CMM2-using
environment. An additional async frontswap-like interface - while not as
complex as CMM2 - still makes things harder for mm maintainers.
The downside is that it may not match some implementation in which the
get/put operations could take a long time (ie, physical IO to a slow
mechanical device). But a general Linux principle is not to overdesign
interfaces for hypothetical users, only for real needs.
Do you think that you would be able to use frontswap in kvm if it were
Yes, that's comfortably within the "guests page themselves" model.
Setting up a block device for the domain which is backed by pagecache
(something we usually try hard to avoid) is pretty straightforward. But
it doesn't work well for Xen unless the blkback domain is sized so that
it has all of Xen's free memory in its pagecache.
That said, it does concern me that the host/hypervisor is ...
The problem with relying on the guest to swap is that it's voluntary. The guest may not be able to do it. When the hypervisor needs memory and guests don't cooperate, it has to swap. But I'm not suggesting that the host swap on behalf of the guest. Rather, the guest swaps to (what it sees as) a device with a large With tmem you have to account for that memory, make sure it's distributed fairly, claim it back when you need it (requiring guest cooperation), live migrate and save/restore it. It's a much larger change than introducing a write-back device for swapping (which has the If we use the existing paths, things are even simpler, and we match more needs (hypervisors with dma engines, the ability to reclaim memory For kvm (or Xen, with some modifications) all of the benefits of frontswap/tmem can be achieved with the ordinary swap. It would need trim/discard support to avoid writing back freed data, but that's good for flash as well. The advantages are:
- just works
- old guests
- <1 exit/page (since it's batched)
- no extra overhead if no free memory
Eventually you'll have to swap frontswap pages, or kill uncooperative guests. At which point all of the simplicity is gone. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Or fail whatever operation it's trying to do. You can only use
overcommit to fake unlimited resources for so long before you need a
Well, with caveats. To be useful with migration the backing store needs
to be shared like other storage, so you can't use a specific host-local
fast (ssd) swap device. And because the device is backed by pagecache
with delayed writes, it has much weaker integrity guarantees than a
normal device, so you need to be sure that the guests are only going to
use it for swap. Sure, these are deployment issues rather than code
Well, you still can't reclaim memory; you can write it out to storage.
It may be cheaper/byte, but it's still a resource dedicated to the
guest. But that's just a consequence of allowing overcommit, and to
what extent you're happy to allow it.
What kind of DMA engine do you have in mind? Are there practical
It could be achieved with ballooning, but it isn't completely trivial.
It wouldn't work terribly well with a driver domain setup, unless all
the swap-devices turned out to be backed by the same domain (which in
turn would need to know how to balloon in response to overall system
demand). The partitioning of the pagecache among the guests would be at
the mercy of the mm subsystem rather than subject to any specific QoS or
other per-domain policies you might want to put in place (maybe fiddling
Killing guests is pretty simple. Presumably the oom killer will get kvm
processes like anything else?
J
--
Keep your commitment below RAM+swap and you'll be fine. We want to You advertise it as a disk with write cache, so the guest is obliged to flush the cache if it wants a guarantee. When it does, you flush your cache as well. For swap, the guest will not issue any flushes. This is already supported by qemu with cache=writeback. I agree care is needed here. You don't want to use the device for In general you want to run on RAM. To maximise your RAM, you do things like page sharing and ballooning. Both can fail, increasing the demand for RAM. At that time you either kill a guest or swap to disk. Consider a frontswap/tmem on bare-metal hypervisor cluster. Presumably you give most of your free memory to guests. A node dies. Now you need to start its guests on the surviving nodes, but you're at the mercy of your guests to give up their tmem. With an ordinary swap approach, you first flush cache to disk, and if that's not sufficient you start paging out guest memory. You take a I/OAT (driver ioatdma). When you don't have a lot of memory free, you can also switch from write cache to O_DIRECT, so you use the storage controller's dma engine Yes. Of course, you want your management code never to allow this to happen. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
The first problem is that you are simulating a fast resource (RAM) with a resource that is orders of magnitude slower with NO visibility to the user that suffers the consequences. A good analogy (and no analogy is perfect) is if Linux discovers a 16MHz 80286 on a serial card in addition to the 32 3GHz cores on a Nehalem box and, whenever the 32 cores are all busy, randomly schedules a process on the 80286, while recording all CPU usage data as if the 80286 is a "real" processor.... "Hmmm... why did my compile suddenly run 100 times slower?" The second problem is "double swapping": A guest may choose a page to swap to "guest swap", but invisibly to the guest, the host first must fetch it from "host swap". (This may seem like it is easy to avoid... it is not and happens more frequently than you might think.) Third, host swapping makes live migration much more difficult. Either the host swap disk must be accessible to all machines or data sitting on a local disk MUST be migrated along with RAM (which is not impossible but complicates live migration substantially). Last I checked, VMware does not allow page-sharing and live migration to both be enabled for the same host. If you talk to VMware customers (especially web-hosting services) that have attempted to use overcommit technologies that require host-swapping, you will find that they quickly become allergic to memory overcommit and turn it off. The end users (users of the VMs that inexplicably grind to a halt) complain loudly. As a result, RAM has become a bottleneck in many many systems, which ultimately reduces the utility of servers and the value True. But in the Xen+tmem implementation there are disincentives for a guest to unnecessarily retain pages put into frontswap, so the host doesn't need to care that it can't discard the pages as the guest is "billed" for them anyway. So far we've been avoiding hypervisor policy implementation questions and focused on mechanism (because, after all, this is a *Linux ...
It's bad, but it's better than ooming. The same thing happens with vcpus: you run 10 guests on one core, if they all wake up, your cpu is suddenly 10x slower and has 30000x interrupt latency (30ms vs 1us, assuming 3ms timeslices). Your disks become slower as well. It's worse with memory, so you try to swap as a last resort. However, True. In fact when the guest and host use the same LRU algorithm, it kvm does live migration with swapping, and has no special code to Don't know about vmware, but kvm supports page sharing, swapping, and Choosing the correct overcommit ratio is certainly not an easy task. However, just hoping that memory will be available when you need it is not a good solution. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Virtualization is all about statistical multiplexing of fixed resources. If all guests demand a resource simultaneously, that is peak alignment == "bad luck". (But, honestly, I don't even remember the point either of us Will it suck to the point of eventually causing the live migration to fail? Or will swap-storms effectively cause denial-of-service for other guests? Anyway, if live migration works fine with mostly-swapped-out guests Frontswap only relies on the guest having an existing swap device, defined in /etc/fstab like any normal Linux swap device. If this is "relying on guest swapping", yes frontswap relies on guest swapping. Or if you are referring to your "host can't force guest to Your argument might make sense from a KVM perspective but is not true of frontswap with Xen+tmem. With KVM, the host's swap disk(s) can all be used as "slow RAM". With Xen, there is no host swap disk. So, yes, the degree of potential memory overcommitment is smaller with Xen+tmem than with KVM. In order to avoid all the host problems with host-swapping, frontswap+Xen+tmem intentionally limits the degree of memory overcommitment... but this is just memory overcommitment done intelligently. --
Well, to be fair, I meant the disagreement of synchronous vs Simple policies must exist and must be enforced by the hypervisor to ensure this doesn't happen. Xen+tmem provides these policies and enforces them. And it enforces them very _dynamically_ to constantly optimize RAM utilization across multiple guests each with dynamically varying RAM Huge performance hits that are completely inexplicable to a user give virtualization a bad reputation. If the user (i.e. guest, not host, administrator) can at least see "Hmmm... I'm doing a lot of swapping, guess I'd better pay for more (virtual) RAM", then Xen+tmem uses the SAME internal kernel interface. The Xen-specific code which performs the Xen-specific stuff (hypercalls) is only in The missing part again is dynamicity. How large is the virtual disk? Or are you proposing that disks can dramatically vary in size across time? I suspect that would be a very big patch. And you're talking about a disk that doesn't have all the A block device of what size? Again, I don't think this will be Ummm... no guest modifications, yet this special disk does everything you've described above (and, to meet my dynamicity requirements, Could you please explicitly identify what you are referring to as a new external API? The part this is different from As noted VERY early in this thread, if/when it makes sense, frontswap can do exactly the same thing by adding a buffering layer invisible I think we agree that DMA makes sense when there is a lot of data to copy and makes little sense when there is only a little (e.g. a single page) to copy. So I guess we need to understand what the tradeoff is. So, do you have any idea what the breakeven point is for your favorite DMA engine for amount of data copied vs 1) locking the memory pages 2) programming the DMA engine 3) responding to the interrupt from the DMA engine And the simple act of waiting to collect enough pages to "batch" means none of those pages can be used until ...
Can you explain what "enforcing" means in this context? You loaned the What you're saying is "don't overcommit". That's a good policy for some scenarios but not for others. Note it applies equally well for cpu as well as memory. frontswap+tmem is not overcommit, it's undercommit. You have spare memory, and you give it away. It isn't a replacement. However, without Exactly as large as the swap space which the guest would have in the If block layer overhead is a problem, go ahead and optimize it instead of adding new interfaces to bypass it. Though I expect it wouldn't be needed, and if any optimization needs to be done it is in the swap layer. Optimizing swap has the additional benefit of improving performance on What happens when no tmem is available? you swap to a volume. That's Something completely internal to the guest can be replaced by something completely different. Something that talks to a hypervisor will need So, you take a synchronous copyful interface, add another copy to make it into an asynchronous interface, instead of using the original When swapping out, Linux already batches pages in the block device's request queue. Swapping out is inherently asynchronous and batched, you're swapping out those pages _because_ you don't need them, and you're never interested in swapping out a single page. Linux already reserves memory for use during swapout. There's no need to re-solve solved problems. Swapping in is less simple, it is mostly synchronous (in some cases it isn't: with many threads, or with the preswap patches (IIRC unmerged)). You can always choose to copy if you don't have enough to justify dma. The networking stack seems to think 4096 bytes is a good size for dma (see net/core/user_dma.c, NET_DMA_DEFAULT_COPYBREAK). -- error compiling committee.c: too many arguments to function --
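The "copy if you don't have enough to justify dma" idea above is a copybreak threshold. A minimal sketch of the decision, not the kernel's code: `choose_transfer` and `COPYBREAK_BYTES` are hypothetical names, the 4096-byte value simply echoes the figure quoted in the message, and treating a transfer of exactly the threshold size as DMA-worthy is an assumption:

```python
PAGE_SIZE = 4096
COPYBREAK_BYTES = 4096  # assumed threshold, echoing the 4096-byte figure cited above


def choose_transfer(nbytes, copybreak=COPYBREAK_BYTES):
    """Pick 'dma' when the transfer is large enough to amortize setup, else 'copy'.

    Setup cost here stands for pinning pages, programming the engine and
    taking the completion interrupt; small transfers just use the CPU.
    """
    return "dma" if nbytes >= copybreak else "copy"


print(choose_transfer(512))             # small buffer: plain copy
print(choose_transfer(PAGE_SIZE))       # a single page, at the threshold
print(choose_transfer(64 * PAGE_SIZE))  # a batch of pages
```

Where the real breakeven sits depends on the engine, which is exactly the question asked earlier in the thread.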
But those are the guest's pages in the first place, that's not a new commitment. CMM2 provides the hypervisor alternatives to swapping a They are not directly comparable. In fact for dirty pages CMM2 is mostly a no-op - the host is forced to swap them out if it wants them. CMM2 brings value for demand zero or clean pages which can be restored by the guest without requiring swapin. I think for dirty pages what CMM2 brings is the ability to discard them CMM2 is more directly comparable to ballooning rather than to frontswap. Frontswap (and cleancache) work with storage that is The swap API (e.g. the block layer) itself is an asynchronous batched version of frontswap. The complexity in CMM2 comes from the fact that it is communicating information about guest pages to the host, and from Given that whenever frontswap fails you need to swap anyway, it is better for the host to never fail a frontswap request and instead back it with disk storage if needed. This way you avoid a pointless vmexit when you're out of memory. Since it's disk backed it needs to be asynchronous and batched. At this point we're back with the ordinary swap API. Simply have your host expose a device which is write cached by host memory, you'll have all the benefits of frontswap with none of the disadvantages, and with no changes to guest code. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
OK, now I think I see the crux of the disagreement. NO! Frontswap on Xen+tmem never *never* _never_ NEVER results in host swapping. Host swapping is evil. Host swapping is the root of most of the bad reputation that memory overcommit has gotten from VMware customers. Host swapping can't be avoided with some memory overcommit technologies (such as page sharing), but frontswap on Xen+tmem CAN and DOES avoid it. So, to summarize: 1) You agreed that a synchronous interface for frontswap makes sense for swap-to-in-kernel-compressed-RAM because it is truly swapping to RAM. 2) You have pointed out that an asynchronous interface for frontswap makes more sense for KVM than a synchronous interface, because KVM does host swapping. Then you said if you have an asynchronous interface anyway, the existing swap code works just fine with no changes so frontswap is not needed at all... for KVM. 3) You have suggested that if Xen were more like KVM and required host-swapping, then Xen doesn't need frontswap either. BUT frontswap on Xen+tmem always truly swaps to RAM. So there are two users of frontswap for which the synchronous interface makes sense. I believe there may be more in the future and you disagree but, as Jeremy said, "a general Linux principle is not to overdesign interfaces for hypothetical users, only for real needs." We have demonstrated there is a need with at least two users so the debate is only whether the number of users is two or more than two. Frontswap is a very non-invasive patch and is very cleanly layered so that if it is not in the presence of either of the intended "users", it can be turned off in many different ways with zero overhead (CONFIG'ed off) or extremely small overhead (frontswap_ops is never set; or frontswap_ops is set but the underlying hypervisor doesn't support it so frontswap_poolid never gets set). So... KVM doesn't need it and won't use it. Do you, Avi, have any other objections as to why the ...
Yet there are less invasive solutions available, like 'add trim operation to swap_ops'. So what needs to be said here is 'frontswap is XX times faster than swap_ops based solution on workload YY'. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Why is host-level swapping evil? In the KVM case, a VM is just another process and the host will just swap out pages using the same LRU-like scheme as with any other process, AFAIK. Also, with frontswap, the host cannot discard pages at any time as is the case with cleancache. So, while cleancache is obviously very useful, the usefulness of frontswap remains doubtful. IMHO, along with cleancache, we should just have in-memory compressed swapping at *host* level i.e. no frontswap. I agree that using frontswap hooks, it is easy to implement ramzswap functionality but I think it's not worth replacing this driver with frontswap hooks. This driver already has all the goodness: asynchronous interface, ability to dynamically add/remove ramzswap devices etc. All that is lacking in this driver is a more efficient 'discard' functionality so we can free a page as soon as it becomes unused. It should also be easy to extend this driver to allow sending pages to host using virtio (for KVM) or Xen hypercalls, if frontswap is needed at all. So, IMHO we can focus on cleancache development and add missing parts to ramzswap driver. Thanks, Nitin --
Your analogy only holds when the host administrator is either extremely greedy or stupid. My analogy only requires some statistical bad luck: Multiple guests with peaks and valleys Hmmm... I'll bet I can break it pretty easily. I think the case you raised that you thought would cause host OOM'ing will cause kvm live migration to fail. Or maybe not... when a guest is in the middle of a live migration, I believe (in Xen), the entire guest memory allocation (possibly excluding ballooned-out pages) must be simultaneously in RAM briefly in BOTH the host and target machine. That is, live migration is not "pipelined". Is this also true of KVM? If so, your statement above is just waiting for a corner case to break it. Choosing the _optimal_ overcommit ratio is impossible without a prescient knowledge of the workload in each guest. Hoping memory will be available is certainly not a good solution, but if memory is not available guest swapping is much better than host swapping. And making RAM usage as dynamic as possible and live migration as easy as possible are keys to maximizing the benefits (and limiting the problems) of virtualization. --
10x vcpu is reasonable in some situations (VDI, powersave at night). No. The entire guest address space can be swapped out on the source and target, less the pages being copied to or from the wire, and pages actively accessed by the guest. Of course performance will suck if all That is why you need overcommit. You make things dynamic with page sharing and ballooning and live migration, but at some point you need a failsafe fallback. The only failsafe fallback I can see (where the host doesn't rely on guests) is swapping. As far as I can tell, frontswap+tmem increases the problem. You loan the guest some memory without the means to take it back, this increases memory pressure on the host. The result is that if you want to avoid swapping (or are unable to) you need to undercommit host resources. Instead of sum(guest mem) + reserve < (host mem), you need sum(guest mem + committed tmem) + reserve < (host mem). You need more host memory, or fewer guests, or to be prepared to swap if the worst happens. -- error compiling committee.c: too many arguments to function --
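The two inequalities in the message above can be checked with a few lines. A minimal sketch, with made-up sizes in GiB and a hypothetical helper name `fits`; it only restates the arithmetic and says nothing about how any hypervisor accounts memory:

```python
def fits(guest_mem, reserve, host_mem, committed_tmem=None):
    """True if sum(guest mem + committed tmem) + reserve < host mem.

    With committed_tmem omitted this reduces to the first inequality,
    sum(guest mem) + reserve < host mem.
    """
    tmem = committed_tmem if committed_tmem is not None else [0] * len(guest_mem)
    return sum(g + t for g, t in zip(guest_mem, tmem)) + reserve < host_mem


guests = [4, 4, 4]  # GiB per guest (illustrative numbers only)
without_tmem = fits(guests, reserve=1, host_mem=16)
with_tmem = fits(guests, reserve=1, host_mem=16, committed_tmem=[1, 1, 1])
print(without_tmem, with_tmem)
```

With these numbers, 4+4+4+1 = 13 is below 16 so the host fits, but committing 1 GiB of tmem per guest brings the total to 16 and the strict inequality no longer holds.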
That's a bug. You're giving the guest memory without the means to take it back. The result is that you have to _undercommit_ your memory resources. Consider a machine running a guest, with most of its memory free. You give the memory via frontswap to the guest. The guest happily swaps to frontswap, and uses the freed memory for something unswappable, like mlock()ed memory or hugetlbfs. Now the second node dies and you need memory to migrate your guests into. But you can't, and the hypervisor is at the mercy of the guest for getting its memory back; and the guest can't do it (at least not In this case the guest expects that swapped out memory will be slow (since it was freed via the swap API; it will be slow if the host happened to run out of tmem). So by storing this memory on disk you aren't reducing performance beyond what you promised to the guest. Swapping guest RAM will indeed cause a performance hit, but sometimes kvm's host swapping is unrelated. Host swapping swaps guest-owned memory; that's not what we want here. We want to cache guest swap in RAM, and that's easily done by having a virtual disk cached in main memory. We're simply presenting a disk with a large write-back cache to the guest. You could just as easily cache a block device in free RAM with Xen. Have a tmem domain behave as the backend for your swap device. Use ballooning to force tmem to disk, or to allow more cache when memory is free. Voila: you no longer depend on guests (you depend on the tmem domain, but that's part of the host code), you don't need guest modifications, For any hypervisor which implements virtual disks with write-back cache AND that's a problem because it puts the hypervisor at the mercy of the The problem is not the complexity of the patch itself. It's the fact that it introduces a new external API. If we refactor swapping, that stands in the way. How much, that's up to the mm maintainers to say. If it isn't a ...
We're getting into hypervisor policy issues, but given that probably nobody else is listening by now, I guess that's OK. ;-) The enforcement is on the "put" side. The page is not loaned, it is freely given, but only if the guest is within its contractual limitations (e.g. within its predefined "maxmem"). If the guest chooses to never remove the pages from frontswap, that's the guest's option, but that part of the guests memory allocation can never be used for anything else so it is in the guest's self-interest to "get" or "flush" the Perhaps, but CPU overcommit has been a well-understood part of computing for a very long time and users, admins, and hosting providers all know how to recognize it and deal with it. Not so with overcommitment of memory; the only exposure to memory limitations is "my disk light is flashing a lot, I'd better buy more RAM". Obviously, this doesn't translate to virtualization very well. And, as for your interrupt latency analogy, let's revisit that if/when Xen or KVM support CPU overcommitment for real-time-sensitive guests. Until then, your analogy But you are missing part of the magic: Once the memory page is no longer directly addressable (AND this implies not directly writable) by the guest, the hypervisor can do interesting things with it, such as compression and deduplication. As a result, the sum of pages used by all the guests exceeds the total pages of RAM in the system. Thus overcommitment. I agree that the degree of overcommitment is less than possible with host-swapping, but none of the evil issues of host-swapping happen. Again, this is "intelligent overcommitment". Other existing forms are "overcommit and cross your fingers that bad Uh, no. As I've said, everything about frontswap is entirely optional, both at compile-time and run-time. A frontswap-enabled guest is fully compatible with a hypervisor with no frontswap; a frontswap-enabled hypervisor is fully compatible with a guest with no frontswap. The only ...
I don't see why no copying is a requirement. I believe the requirement should be "it is fast enough". Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
As Nitin pointed out much earlier in this thread: "No: trim or discard is not useful" I also think that trim does not do anything for the widely Are you asking me to demonstrate that swap-to-hypervisor-RAM is faster than swap-to-disk? --
I would like comparison of swap-to-frontswap vs. swap-to-RAMdisk. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Well, it's not really apples-to-apples because swap-to-RAMdisk is copying to a chunk of RAM with a known permanently-fixed size so it SHOULD be faster than swap-to-hypervisor, and should *definitely* be faster than swap-to-in-kernel-compressed-RAM but I suppose it is still an interesting comparison. I'll see what I can do, but it will probably be a couple days to figure out how to measure it (e.g. without accidentally measuring any swap-to-disk). --
You can't have a negative balloon size. The two models are not equivalent. Balloon allows you to give up a page for which you have a struct page. Frontswap (and swap) allows you to gain a page for which you don't have a struct page, but you can't access it directly. The similarity is that in both cases the host may want the guest to give up a page, but cannot There's no reason for swapping and ballooning to behave differently when swap backing storage is RAM (they probably do now since swap was tuned for disks, not flash, but that's a bug if it's true). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Once pages were dirtied (or I guess just slightly before), they became volatile, and I don't think the hypervisor could do anything with them. It could still swap them out like usual, but none of the CMM-specific optimizations could be performed. CC'ing Martin since he's the expert. :) -- Dave --
On Fri, 30 Apr 2010 09:08:00 -0700

Well, almost correct :-) A dirty page (or one that is about to become dirty) can be in one of two CMMA states:

1) stable
This is the case for pages where the kernel is doing some operation on the page that will make it dirty, e.g. I/O. Before the kernel can allow the operation the page has to be made stable. If the state conversion to stable fails because the hypervisor removed the page the page needs to get deleted from page cache and recreated from scratch.

2) potentially-volatile
This state is used for page cache pages for which a writable mapping exists. The page can be removed by the hypervisor as long as the physical per-page dirty bit is not set. As soon as the bit is set the page is considered stable although the CMMA state still is potentially-volatile.

In both cases the only thing the hypervisor can do with a dirty page is to swap it as usual.

-- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
Dave or others can correct me if I am wrong, but I think CMM2 also handles dirty pages that must be retained by the hypervisor. The difference between CMM2 (for dirty pages) and frontswap is that CMM2 sets hints that can be handled asynchronously while frontswap provides explicit hooks that synchronously succeed/fail. In fact, Avi, CMM2 is probably a fairly good approximation of what the asynchronous interface you are suggesting might look like. Not to beat a dead horse, but there is a very key difference: The size and availability of frontswap is entirely dynamic; any page-to-be-swapped can be rejected at any time even if a page was previously successfully swapped to the same index. Every other swap device is much more static so the swap code assumes a static device. Existing swap code can account for "bad blocks" on a static device, but this is far from sufficient Yes, cleancache can drop pages at any time because (as the name implies) only clean pages can be put into cleancache. --
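To make the "entirely dynamic" point above concrete, here is a toy model of a synchronous put that may be refused at any time, with a fall-through to the ordinary swap device. Everything in it is hypothetical (class and function names, the dict standing in for a disk), and dropping the stale copy on a refused overwrite is an assumption made so the model stays consistent; it is not taken from the patch:

```python
class PseudoRamBackend:
    """Toy stand-in for hypervisor-owned RAM whose capacity can change under us."""

    def __init__(self, capacity_pages):
        self.capacity_pages = capacity_pages
        self.pages = {}

    def put(self, index, data):
        """Synchronously accept or refuse a page; the caller learns the result at once."""
        used_by_others = len(self.pages) - (index in self.pages)
        if used_by_others >= self.capacity_pages:
            self.pages.pop(index, None)  # assumed: a refused overwrite drops the stale copy
            return False
        self.pages[index] = data
        return True


def swap_out(backend, disk, index, data):
    """Try pseudo-RAM first; on refusal fall back to the normal swap device."""
    if backend.put(index, data):
        return "pseudo-ram"
    disk[index] = data
    return "disk"


def swap_in(backend, disk, index):
    data = backend.pages.get(index)
    return data if data is not None else disk[index]


backend, disk = PseudoRamBackend(capacity_pages=1), {}
first = swap_out(backend, disk, 0, b"a")   # accepted
backend.capacity_pages = 0                 # host reclaims its RAM in the meantime
second = swap_out(backend, disk, 0, b"b")  # same index, now refused
print(first, second, swap_in(backend, disk, 0))
```

The same index succeeds once and is refused later, which is the behaviour a static "bad blocks" map cannot describe.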
Hi Pavel -- The whole concept of RAM that _might_ be available to the kernel and is _not_ directly addressable by the kernel takes some thinking to wrap your mind around, but I assure you there are very good use cases for it. RAM owned and managed by a hypervisor (using controls unknowable to the kernel) is one example; this is Transcendent Memory. RAM which has been compressed is another example; Nitin is working on this using the frontswap approach because of some issues that arise with ramzswap (see elsewhere on this thread). There are likely more use cases. So in that context, let me answer your questions, combined If this was possible by fixing the balloon driver, VMware would have done it years ago. The problem is that the balloon driver is acting on very limited information, namely ONLY what THIS kernel wants; every kernel is selfish and (eventually) uses every bit of RAM it can get. This is especially true when swapping is required (under memory pressure). So, in general, ballooning is NOT faster because a balloon request to "get" RAM must wait for some other balloon driver in some other kernel to "give" RAM. OR some other entity must periodically scan every kernels memory and guess at which kernels are using memory inefficiently and steal it away before a "needy" kernel asks for it. While this does indeed "work" today in VMware, if you talk to VMware customers that use it, many are very unhappy with the Simply copying RAM from one page owned by the kernel to another page owned by the kernel is pretty pointless as far as swapping is concerned because it does nothing to reduce memory pressure, so the comparison is a bit irrelevant. But... In my measurements, the overhead of managing "pseudo-RAM" pages is in the same ballpark as copying the page. Compression or deduplication of course has additional costs. See the performance results at the end of the following two presentations for some performance information when "pseudo-RAM" is ...
Plus of course the asynchronicity and batching of the block layer. Even if you don't use a dma engine, you improve performance by exiting once per several dozen pages instead of for every page, perhaps enough to allow the hypervisor to justify copying the memory with non-temporal moves. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
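The saving from batching described above is easy to quantify. A minimal sketch, where `exits_needed` is a made-up helper and a batch of 32 pages is only an assumed stand-in for "several dozen":

```python
import math


def exits_needed(pages, batch_size=1):
    """Number of guest-to-hypervisor transitions to hand over `pages` pages."""
    return math.ceil(pages / batch_size)


per_page = exits_needed(256)               # one exit for every page
batched = exits_needed(256, batch_size=32)  # one exit per batch of pages
print(per_page, batched)
```

For 256 pages that is 256 exits versus 8; the count says nothing about the latency cost of waiting to fill a batch, which is the other side of the argument in this thread.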
The concern is not with the hypervisor, but with Linux. More external I'm convinced it's useful. The API is so close to a block device (read/write with key/value vs read/write with sector/value) that we should make the effort not to introduce a new API. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
