Thanks for the comment! Synchronous is required, but likely could be simulated by ensuring all coherency (and concurrency) requirements are met by some intermediate "buffering driver" -- at the cost of an extra page copy into a buffer and overhead of tracking the handles (poolid/inode/index) of pages in the buffer that are "in flight". This is an approach we are considering to implement an SSD backend, but hasn't been tested yet so, ahem, the proof will be in the put'ing. ;-) Dan --
Well, copying memory so you can use a zero-copy dma engine is counterproductive. Much easier to simulate an asynchronous API with a synchronous backend. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
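To make the "asynchronous API over a synchronous backend" direction concrete, here is a minimal userspace sketch in C. It is not frontswap code: `sync_put_page`, `async_put_page`, `put_done_fn`, `record_status` and the tiny in-memory `pool` are all hypothetical names invented for illustration. The point is only that the wrapper can run the synchronous store and then fire the completion callback before returning, so callers written against the asynchronous interface work unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE  4096
#define POOL_SLOTS 8

/* Hypothetical completion callback: status 0 means the page was stored. */
typedef void (*put_done_fn)(int status, void *ctx);

/* Hypothetical synchronous backend: a small in-memory page pool. */
static unsigned char pool[POOL_SLOTS][PAGE_SIZE];

/* Copies one page into the pool; returns 0 on success, -1 if index is bad. */
static int sync_put_page(size_t index, const void *page)
{
    if (index >= POOL_SLOTS)
        return -1;
    memcpy(pool[index], page, PAGE_SIZE);
    return 0;
}

/* Asynchronous-looking wrapper: the work is done synchronously and the
 * completion callback is invoked before this function returns. */
static void async_put_page(size_t index, const void *page,
                           put_done_fn done, void *ctx)
{
    assert(page != NULL);
    int status = sync_put_page(index, page);
    if (done)
        done(status, ctx);
}

/* Example callback: stores the status in the int that ctx points to. */
static void record_status(int status, void *ctx)
{
    *(int *)ctx = status;
}
```

A caller passes `record_status` and a pointer to an `int`; by the time `async_put_page` returns, the status is already filled in. Going the other way (a synchronous interface over a truly asynchronous device) needs buffering and in-flight tracking, which is the cost discussed above.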
Hmmm.... I now realize you are thinking of applying frontswap to a hosted hypervisor (e.g. KVM). Using frontswap with a bare-metal hypervisor (e.g. Xen) works fully synchronously, guarantees swap-in will succeed, never double-swaps, and doesn't load the io subsystem with writes. This all works very nicely today with a fully synchronous "backend" (e.g. with tmem in Xen 4.0). So, I agree, hiding a truly asynchronous interface behind frontswap's synchronous interface may have some thorny issues. I wasn't recommending that it should be done, just speculating how it might be done. This doesn't make frontswap any less If I understand correctly, SSDs work much more efficiently when writing 64KB blocks. So much more efficiently in fact that waiting to collect 16 4KB pages (by first copying them to fill a 64KB buffer) will be faster than page-at-a-time DMA'ing them. If so, the frontswap interface, backed by an asynchronous "buffering layer" which collects 16 pages before writing to the SSD, may work very nicely. Again this is still just speculation... I was only pointing out that zero-copy DMA may not always be the best solution. Thanks, Dan --
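The speculative "buffering layer" described above can be sketched roughly as follows, in userspace C. Everything here is an assumption made for illustration: the `page_handle` fields mirror the poolid/inode/index handle mentioned earlier in the thread, `device_write_64k` is a stand-in that only counts flushes instead of talking to an SSD, and the 16-page batch follows the 64KB figure in the message, which may not match a given drive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096
#define BATCH_PAGES 16   /* 16 x 4KB = 64KB, the write size assumed above */

/* Hypothetical handle identifying a buffered ("in flight") page. */
struct page_handle {
    uint32_t pool_id;
    uint64_t inode;
    uint32_t index;
};

struct batch_buffer {
    unsigned char data[BATCH_PAGES * PAGE_SIZE];
    struct page_handle handles[BATCH_PAGES]; /* which pages are buffered */
    size_t count;                            /* pages currently buffered */
    size_t flushes;                          /* 64KB writes issued so far */
};

/* Stand-in for the real device write: counts the flush and empties the
 * buffer. A real backend would submit b->data as one 64KB request. */
static void device_write_64k(struct batch_buffer *b)
{
    b->flushes++;
    b->count = 0;
}

/* Copy one page into the buffer (the extra page copy mentioned in the
 * thread). Returns 1 if this put filled the batch and triggered a flush,
 * 0 if the page is only buffered. */
static int batch_put(struct batch_buffer *b, struct page_handle h,
                     const void *page)
{
    assert(b->count < BATCH_PAGES);
    memcpy(b->data + b->count * PAGE_SIZE, page, PAGE_SIZE);
    b->handles[b->count] = h;
    b->count++;
    if (b->count == BATCH_PAGES) {
        device_write_64k(b);
        return 1;
    }
    return 0;
}
```

Until the sixteenth page arrives, a get for a buffered page would have to be served from `data` by searching `handles`; that lookup is omitted here.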
Perhaps I misunderstood. Isn't frontswap in front of the normal swap device? So we do have double swapping, first to frontswap (which is in memory, yes, but still a nonzero cost), then the normal swap device. The io subsystem is loaded with writes; you only save the reads. Better to swap to the hypervisor, and make it responsible for committing to disk on overcommit or keeping in RAM when memory is available. This way we avoid the write to disk if memory is in fact available (or at least defer it until later). This way you avoid both reads and writes The guest can easily (and should) issue 64k dmas using scatter/gather. No need for copying. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
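As a rough userspace analogue of the scatter/gather point, the POSIX `writev()` call lets a caller hand over sixteen separately located 4KB pages as one 64KB request without first copying them into a contiguous buffer. This is only an illustration of the idea: kernel block I/O uses its own scatter/gather machinery rather than `writev()`, and `write_pages_gathered` and `MAX_GATHER` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define PAGE_SIZE  4096
#define MAX_GATHER 16   /* 16 x 4KB pages = one 64KB request */

/* Write n pages that live at unrelated addresses with a single writev()
 * call. Returns the number of bytes written, or -1 on error. */
static ssize_t write_pages_gathered(int fd, void *const pages[], size_t n)
{
    struct iovec iov[MAX_GATHER];

    assert(n <= MAX_GATHER);
    for (size_t i = 0; i < n; i++) {
        iov[i].iov_base = pages[i];   /* no copy: just point at the page */
        iov[i].iov_len  = PAGE_SIZE;
    }
    return writev(fd, iov, (int)n);
}
```

As with any write, the call may transfer fewer bytes than requested, so a robust caller would check the return value and retry the remainder.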
Because the swapping API doesn't adapt well to dynamic changes in the size and availability of the underlying "swap" device, which Yes the hypervisor is committed to retain the memory. In some ways, giving a page of memory to a guest (via ballooning) is simpler and in some ways not. When a guest "owns" a page, it can do whatever it wants with it, independent of what is best for the "whole" virtualized system. When the hypervisor "owns" the page on behalf of the guest but the guest can't directly address it, the hypervisor has more flexibility. For example, tmem optionally compresses all frontswap pages, effectively doubling the size of its available memory. In the future, knowing that a guest application can never access the pages directly, it might store all frontswap pages in (slower but still synchronous) phase change memory or "far NUMA" Yes, fully supported in Xen 4.0. And as another example of flexibility, note that "lazy migration" of frontswap'ed pages I wasn't referring to hardware capability but to the availability Agreed. --
Can we extend it? Adding new APIs is easy, but harder to maintain in Ok. For non-traditional RAM uses I really think an async API is needed. If the API is backed by a CPU, synchronous operation is fine, but once it isn't RAM, it can be all kinds of interesting things. Note that even if you do give the page to the guest, you still control how it can access it, through the page tables. So for example you can easily compress a guest's pages without telling it about it; whenever it I have a feeling we're talking past each other here. Swap has no timing constraints; it is asynchronous and usually goes to slow devices. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
While I admit that I started this whole discussion by implying that frontswap (and cleancache) might be useful for SSDs, I think we are going far astray here. Frontswap is synchronous for a reason: It uses real RAM, but RAM that is not directly addressable by a (guest) kernel. SSDs (at least today) are still I/O devices; even though they may be very fast, they still live on a PCI (or slower) bus and use DMA. Frontswap is not intended for use with I/O devices. Today's memory technologies are either RAM that can be addressed by the kernel, or I/O devices that sit on an I/O bus. The exotic memories that I am referring to may be a hybrid: memory that is fast enough to live on a QPI/HyperTransport link, but slow enough that you wouldn't want to randomly mix and hand out to userland apps some pages from "exotic RAM" and some pages from "normal RAM". Such memory makes no sense today because OSes wouldn't know what to do with it. But it MAY make sense with frontswap (and cleancache). Nevertheless, frontswap works great today with a bare-metal hypervisor. I think it stands on its own merits, regardless of one's vision of future SSD/memory technologies. --
Even when frontswapping to RAM on a bare metal hypervisor it makes sense to use an async API, in case you have a DMA engine on board. -- error compiling committee.c: too many arguments to function --
They don't seem to have gained much ground in the FIVE YEARS since the patch was first posted to Linux, have they? Maybe it's because memory-to-memory copy using a CPU is so fast (especially for page-ish quantities of data) and is a small percentage of CPU utilization these days? --
Why do you say this? Servers have them and AFAIK networking uses them. There are other uses of the API in the code, but I don't know how much Copies take a small percentage of cpu because a lot of care goes into avoiding them, or placing them near the place where the copy is used. They certainly show up in high speed networking. A page-sized copy is small, but many of them will be expensive. -- error compiling committee.c: too many arguments to function --
Are there any production boxes that actually do this currently? I know IBM had 'expanded storage' on the 3090 series 20 years ago, haven't checked if the Z-series still do that. Was very cool at the time - supported 900+ users with 128M of main memory and 256M of expanded storage, because you got the first 3,000 or so page faults per second for almost free. Oh, and the 3090 had 2 special opcodes for "move page to/from expanded", so it was a very fast but still synchronous move (for whatever that's worth).
On Fri, Apr 23, 2010 at 6:47 AM, Dan Magenheimer wrote: I guess you are talking about the write amplification issue of SSDs. In fact, most new-generation drives have already solved the problem with a log-like structure. Even with the old drives, the size of the writes depends on the size of the erase block, which is not necessarily 64KB. Jiahua --
