RE: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

Previous thread: [PATCH] Staging: wlan-ng: fix many style warnings in hfa384x_usb.c by Alessandro Ghedini on Thursday, April 22, 2010 - 8:49 am. (3 messages)

Next thread: Re: [linux-pm] [patch] pm-qos refresh by Randy Dunlap on Thursday, April 22, 2010 - 8:52 am. (1 message)
From: Dan Magenheimer
Date: Thursday, April 22, 2010 - 8:48 am

Thanks for the comment!

Synchronous is required, but likely could be simulated by ensuring all
coherency (and concurrency) requirements are met by some intermediate
"buffering driver" -- at the cost of an extra page copy into a buffer
and overhead of tracking the handles (poolid/inode/index) of pages in
the buffer that are "in flight".  This is an approach we are considering
for implementing an SSD backend, but it hasn't been tested yet so, ahem,
the proof will be in the put'ing. ;-)
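
To make the "buffering driver" idea concrete, here is a minimal
user-space sketch, not kernel code and not anything from the frontswap
patch.  The class name, the method names, and the dict used as a
stand-in backend are all assumptions for illustration.  It shows the
two costs mentioned above: the extra page copy, and the tracking of
(poolid/inode/index) handles for pages that are still in flight.

```python
class BufferingDriver:
    """Hypothetical synchronous facade over a slow, asynchronous backend."""

    def __init__(self, backend):
        self.backend = backend   # dict-like stand-in for the real device
        self.in_flight = {}      # handle -> private copy of the page

    def put(self, poolid, inode, index, page):
        handle = (poolid, inode, index)
        # The extra page copy: the caller may reuse its page as soon as
        # put() returns, so the driver must keep its own copy.
        self.in_flight[handle] = bytes(page)
        return True

    def get(self, poolid, inode, index):
        handle = (poolid, inode, index)
        # Coherency: a page that has not reached the backend yet must be
        # served from the buffer, never from stale backend contents.
        if handle in self.in_flight:
            return self.in_flight[handle]
        return self.backend.get(handle)

    def complete_writes(self):
        """Stand-in for asynchronous completion: drain buffer to backend."""
        for handle, page in list(self.in_flight.items()):
            self.backend[handle] = page
            del self.in_flight[handle]
```

A real driver would also need locking and a bound on the buffer size;
both are left out of this sketch.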

Dan
--

From: Avi Kivity
Date: Thursday, April 22, 2010 - 9:13 am

Well, copying memory so you can use a zero-copy dma engine is 
counterproductive.

Much easier to simulate an asynchronous API with a synchronous backend.
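
As a rough illustration of that direction (an asynchronous-looking API
on top of a synchronous backend), the wrapper below simply does the
work immediately and then reports completion through a callback.  The
names put_async, sync_put and on_done are hypothetical; nothing here is
taken from the frontswap patch.

```python
def put_async(sync_put, handle, page, on_done):
    """Asynchronous-style wrapper around a synchronous put."""
    ok = sync_put(handle, page)  # blocks until the backend has finished
    on_done(handle, ok)          # completion fires before we return


# Toy synchronous backend: a dict keyed by handle.
store = {}

def sync_put(handle, page):
    store[handle] = bytes(page)
    return True


completed = []
put_async(sync_put, (0, 1, 2), b"\0" * 4096,
          lambda handle, ok: completed.append((handle, ok)))
```

No buffering or in-flight tracking is needed in this direction, because
the completion callback runs only after the data has been stored.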

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dan Magenheimer
Date: Friday, April 23, 2010 - 6:47 am

Hmmm.... I now realize you are thinking of applying frontswap to
a hosted hypervisor (e.g. KVM). Using frontswap with a bare-metal
hypervisor (e.g. Xen) works fully synchronously, guarantees swap-in
will succeed, never double-swaps, and doesn't load the io subsystem
with writes.  This all works very nicely today with a fully
synchronous "backend" (e.g. with tmem in Xen 4.0).

So, I agree, hiding a truly asynchronous interface behind
frontswap's synchronous interface may have some thorny issues.
I wasn't recommending that it should be done, just speculating
how it might be done.  This doesn't make frontswap any less

If I understand correctly, SSDs work much more efficiently when
writing 64KB blocks.  So much more efficiently in fact that waiting
to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
will be faster than page-at-a-time DMA'ing them.  If so, the
frontswap interface, backed by an asynchronous "buffering layer"
which collects 16 pages before writing to the SSD, may work
very nicely.  Again this is still just speculation... I was
only pointing out that zero-copy DMA may not always be the best
solution.
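
A small sketch of the speculated buffering layer, assuming 4KB pages
and the 64KB write size mentioned above (16 * 4096 = 65536 bytes).  The
class and the write_block callback are hypothetical, and a list stands
in for the SSD.

```python
PAGE_SIZE = 4096
PAGES_PER_WRITE = 16  # 16 * 4 KB = 64 KB, the figure speculated above


class BatchingBuffer:
    def __init__(self, write_block):
        self.write_block = write_block  # hypothetical: one device write
        self.pending = bytearray()

    def add_page(self, page):
        if len(page) != PAGE_SIZE:
            raise ValueError("expected exactly one page")
        self.pending += page            # the extra copy into the buffer
        if len(self.pending) == PAGE_SIZE * PAGES_PER_WRITE:
            self.write_block(bytes(self.pending))
            self.pending.clear()


writes = []                             # stand-in for the SSD
buf = BatchingBuffer(writes.append)
for i in range(32):
    buf.add_page(bytes([i]) * PAGE_SIZE)
```

Thirty-two page puts turn into two 64KB writes.  Whether that is
actually faster than page-at-a-time DMA depends on the device.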

Thanks,
Dan
--

From: Avi Kivity
Date: Friday, April 23, 2010 - 6:57 am

Perhaps I misunderstood.  Isn't frontswap in front of the normal swap 
device?  So we do have double swapping, first to frontswap (which is in 
memory, yes, but still a nonzero cost), then the normal swap device.  
The io subsystem is loaded with writes; you only save the reads.

Better to swap to the hypervisor, and make it responsible for committing 
to disk on overcommit or keeping in RAM when memory is available.  This 
way we avoid the write to disk if memory is in fact available (or at 
least defer it until later).  This way you avoid both reads and writes 

The guest can easily (and should) issue 64k dmas using scatter/gather.  
No need for copying.
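
A user-space analogy only, not guest DMA: os.writev (available on Unix)
hands sixteen separate 4KB buffers to the kernel in a single call, so
the caller never copies them into one contiguous 64KB staging buffer.
The temporary file is an assumption of this sketch and stands in for
the device.

```python
import os
import tempfile

PAGE_SIZE = 4096
# Sixteen separate page buffers, not one contiguous 64 KB region.
pages = [bytes([i]) * PAGE_SIZE for i in range(16)]

with tempfile.TemporaryFile() as f:
    fd = f.fileno()
    # One system call, sixteen source buffers, no staging copy by the caller.
    written = os.writev(fd, pages)
    # Read the data back from offset 0 to check what landed in the file.
    data = os.pread(fd, written, 0)
```

In general writev may write fewer bytes than requested, so production
code would check the return value and retry the remainder.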

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dan Magenheimer
Date: Friday, April 23, 2010 - 8:56 am

Because the swapping API doesn't adapt well to dynamic changes in
the size and availability of the underlying "swap" device, which

Yes the hypervisor is committed to retain the memory.  In
some ways, giving a page of memory to a guest (via ballooning)
is simpler and in some ways not.  When a guest "owns" a page,
it can do whatever it wants with it, independent of what is best
for the "whole" virtualized system.  When the hypervisor
"owns" the page on behalf of the guest but the guest can't
directly address it, the hypervisor has more flexibility.
For example, tmem optionally compresses all frontswap pages,
effectively doubling the size of its available memory.
In the future, knowing that a guest application can never
access the pages directly, it might store all frontswap pages in
(slower but still synchronous) phase change memory or "far NUMA"

Yes, fully supported in Xen 4.0.  And as another example of
flexibility, note that "lazy migration" of frontswap'ed pages

I wasn't referring to hardware capability but to the availability

Agreed.
--

From: Avi Kivity
Date: Saturday, April 24, 2010 - 11:22 am

Can we extend it?  Adding new APIs is easy, but harder to maintain in 

Ok.  For non-traditional RAM uses I really think an async API is 
needed.  If the API is backed by a CPU, a synchronous operation is fine, 
but once the backing store isn't RAM, it can be all kinds of interesting
things.

Note that even if you do give the page to the guest, you still control 
how it can access it, through the page tables.  So for example you can 
easily compress a guest's pages without telling it about it; whenever it 

I have a feeling we're talking past each other here.  Swap has no timing 
constraints; it is asynchronous and usually goes to slow devices.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dan Magenheimer
Date: Sunday, April 25, 2010 - 6:37 am

While I admit that I started this whole discussion by implying
that frontswap (and cleancache) might be useful for SSDs, I think
we are going far astray here.  Frontswap is synchronous for a
reason: It uses real RAM, but RAM that is not directly addressable
by a (guest) kernel.  SSDs (at least today) are still I/O devices;
even though they may be very fast, they still live on a PCI (or
slower) bus and use DMA.  Frontswap is not intended for use with
I/O devices.

Today's memory technologies are either RAM that can be addressed
by the kernel, or I/O devices that sit on an I/O bus.  The
exotic memories that I am referring to may be a hybrid:
memory that is fast enough to live on a QPI/hypertransport,
but slow enough that you wouldn't want to randomly mix and
hand out to userland apps some pages from "exotic RAM" and some
pages from "normal RAM".  Such memory makes no sense today
because OSes wouldn't know what to do with it.  But it MAY
make sense with frontswap (and cleancache).

Nevertheless, frontswap works great today with a bare-metal
hypervisor.  I think it stands on its own merits, regardless
of one's vision of future SSD/memory technologies.
--

From: Avi Kivity
Date: Sunday, April 25, 2010 - 7:15 am

Even when frontswapping to RAM on a bare metal hypervisor it makes sense 
to use an async API, in case you have a DMA engine on board.

-- 
error compiling committee.c: too many arguments to function

--

From: Dan Magenheimer
Date: Monday, April 26, 2010 - 5:45 am

They don't seem to have gained much ground in the FIVE YEARS
since the patch was first posted to Linux, have they?

Maybe it's because memory-to-memory copy using a CPU
is so fast (especially for page-ish quantities of data)
and is a small percentage of CPU utilization these days?
--

From: Avi Kivity
Date: Monday, April 26, 2010 - 6:48 am

Why do you say this?  Servers have them and AFAIK networking uses them.  
There are other uses of the API in the code, but I don't know how much 

Copies take a small percentage of cpu because a lot of care goes into 
avoiding them, or placing them near the place where the copy is used.  
They certainly show up in high speed networking.

A page-sized copy is small, but many of them will be expensive.

-- 
error compiling committee.c: too many arguments to function

--

From: Valdis.Kletnieks
Date: Tuesday, April 27, 2010 - 4:52 am

Are there any production boxes that actually do this currently? I know IBM had
'expanded storage' on the 3090 series 20 years ago, haven't checked if the
Z-series still do that.  Was very cool at the time - supported 900+ users with
128M of main memory and 256M of expanded storage, because you got the first
3,000 or so page faults per second for almost free.  Oh, and the 3090 had 2
special opcodes for "move page to/from expanded", so it was a very fast but
still synchronous move (for whatever that's worth).
--

From: Jiahua
Date: Friday, April 23, 2010 - 9:35 am

On Fri, Apr 23, 2010 at 6:47 AM, Dan Magenheimer

I guess you are talking about the write amplification issue of SSDs. In
fact, most of the new-generation drives have already solved the problem
with a log-like structure. Even with the old drives, the size of the
writes depends on the size of the erase block, which is not
necessarily 64KB.

Jiahua
--
