Re: [patch 0/6] Guest page hinting version 6.

From: Martin Schwidefsky
Date: Wednesday, March 12, 2008 - 6:21 am

Greetings,
I've dusted off the guest page hinting patches and ported them to today's
upstream git tree. There is one reject if applied to 2.6.24-rc5-mm1 but
that is easy to fix. The code still works as expected on my test system.

Our z/VM performance team recently published a report on guest page
hinting vs. the ballooner approach on SLES10 for a farm of web servers.
The code on SLES10 differs a bit from the upstream variant but the
performance results should still be valid.  You will find the report
here:

  http://www.vm.ibm.com/perf/reports/zvm/html/530cmm.html

(the VMRM-CMM the web page speaks about is the balloon approach,
 CMMA is the guest page hinting).

Both approaches to the memory overcommit problem show comparable benefits
for this workload, with an advantage for guest page hinting for a large
number of guests. For other workloads your mileage may vary.

The main benefit of guest page hinting vs. the ballooner is that there
is no need for a monitor that keeps track of the memory usage of all the
guests, for a complex algorithm that calculates the working set sizes, or
for calls into the guest kernel to control the size of the balloons.
The host just does normal LRU based paging. If the host picks one of the
pages the guest can recreate, the host can throw it away instead of writing
it to the paging device. Simple and elegant.
The main disadvantage is the complexity added to the guest's memory
management code to do the page state changes and to deal with discard
faults.
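
The host-side decision described above can be sketched roughly as follows.
This is only an illustration: the enum names and the "volatile" label for
pages the guest can recreate are assumptions made for this sketch and are
not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical guest-declared page states, named for illustration only. */
enum guest_page_state {
    GPS_STABLE,   /* host must preserve the content */
    GPS_UNUSED,   /* guest does not care about the content */
    GPS_VOLATILE  /* guest can recreate the content itself */
};

enum host_action {
    HOST_WRITE_OUT, /* write the page to the paging device */
    HOST_DISCARD    /* throw the page away, no I/O */
};

/* What the host does with a page its own LRU picked for eviction. */
enum host_action host_evict(enum guest_page_state state)
{
    switch (state) {
    case GPS_UNUSED:
    case GPS_VOLATILE:
        return HOST_DISCARD;
    case GPS_STABLE:
    default:
        return HOST_WRITE_OUT;
    }
}

/* Count how many evictions out of n would still cost paging I/O. */
size_t count_paging_writes(const enum guest_page_state *states, size_t n)
{
    size_t i, writes = 0;

    assert(states != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (host_evict(states[i]) == HOST_WRITE_OUT)
            writes++;
    return writes;
}
```

In this model only stable pages cost a write to the paging device; a later
guest access to a discarded page is what the discard fault handling in the
guest has to deal with.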

The last versions of the patches do not differ much; I consider the code
to be stable. My question now is how to proceed with the code. I sure
would love to see the code going upstream some day but that depends on
the mm developers as the code adds complexity that needs to be supported.
If the general feeling is that the advantages of this approach do not
warrant the added complexity this will likely be the last time you
will hear about guest page hinting.

--
blue ...
From: Rusty Russell
Date: Wednesday, March 12, 2008 - 3:41 pm

Well, I want this feature, but I agree about complexity.

AFAICT, the trivial subset of this is the hinting of Unused pages.  It seems 
that would buy us something, and perhaps be a stepping stone to full page 
hinting?

Cheers,
Rusty.
--

From: Martin Schwidefsky
Date: Thursday, March 13, 2008 - 2:47 am

I've been there but the unused page thing is so small that it doesn't
make sense to separate it from the patches. If I don't see any progress
then I will come up with a patch that adds the Unused state transitions
to the arch files of s390.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.


--

From: Hugh Dickins
Date: Thursday, March 13, 2008 - 9:57 am

Oh, that would be such a shame.  Your guest page hinting patches remind
me of that childhood thrill, when once a year the circus comes to town ;)

But seriously, I'm ashamed to see my name in the Cc list: it would
be very unfair if your patches never made it in, just because I've
failed to find the time to wrap my own puny brain around them.

It's very encouraging to see Jeremy and Rusty weighing in.  I hope
Zach will too, and I've added Andrea: their support would count a lot.
You have Nick on the list, good, I've added Christoph and Peter
(if you do resend, linux-mm might prove more useful than linux-kernel).

With support from rival virtualizers,
I do think you've a good chance of getting in.

Hugh
--

From: Martin Schwidefsky
Date: Thursday, March 13, 2008 - 10:14 am

It is an effort to get your head around it the first time. It gets

Grr, did I really forget to copy linux-mm?!? (..insert your favourite
four letter word here..). I absolutely intended to copy linux-mm but

Yes, it would be great if we can find another user for it.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.


--

From: Zachary Amsden
Date: Thursday, March 13, 2008 - 10:45 am

I agree the page hinting technique is generally useful, even
cross-architecture.

What doesn't appear to be useful however, is support for this under
VMware.  It can be done, even without the writable pte support (yes,
really).  But due to us exploiting optimizations at lower layers, it
doesn't appear that it will gain us any performance - and we must
already have the complex working set algorithms to support

I would say we support it, but I don't expect us to make use of the
infrastructure anytime soon.  For us it would make more sense to use the
swap-fault optimization, but this requires some significant design
changes in our monitor.

Either way, these are both great ideas and I would not want to be held
responsible for blocking their upstream progress.  Someday, with the
evolving x86 architecture (if we ever get per-page dirty bits), they
might make sense for us to do as well.

Zach

--

From: Andrea Arcangeli
Date: Thursday, March 13, 2008 - 12:45 pm

With non-paravirt all you can do is to swap the guest physical memory
(mmu notifiers allow linux to do that) or share memory (mmu notifiers
+ ksm allow linux to do that too). We also have complex working set
algorithms that we use to find which parts of the guest physical
address space are best to swap first: the core linux VM.

What paravirt allows us to do (and that's the whole point of the paper
I guess), is to go one step further than just guest swapping and to
ask the guest if the page really needs to be swapped or if it can be
freed right away. So this would be an extension of the mmu notifiers
(this also shows how the EMM API is too restrictive, while MMU notifiers
will allow that extension in the future) to avoid I/O sometimes if the
guest tells us through paravirt ops that it's not necessary to swap.

When talking with friends about ballooning I already once suggested to
auto inflate the balloon with pages in the freelist.

Now this paper goes well beyond the pages in the freelist (called
U/unused in the paper), this also covers cache and mapped-clean cache
in the guest. That would have been the next step.

Anyway plain ballooning remains useful as rss limiting or numa
compartments in the linux hypervisor, to provide unfairness to certain
guests.

I didn't read the patch yet, but I think paravirt knowledge about
U/unused pages is needed to avoid guest swapping. The cache and mapped
cache in the guest is a gray area, because linux as hypervisor will be
extremely efficient at swapping out and swapping in the guest cache
(host swapping guest cache may be faster than re-issuing a read-I/O
to refill the cache by itself, clearly with guest using
paravirt). Let's say I'm mostly interested in page-hinting for the
U pages initially.

I'm currently busy with two other features and trying to get mmu
notifier #v9 into mainline which is orders of magnitude more important
than avoiding a few swapouts sometimes (without mmu notifiers
everything else is irrelevant, including guest page ...
From: Zachary Amsden
Date: Thursday, March 13, 2008 - 2:41 pm

We can tap into those algorithms just as effectively using ballooning,
and we've optimized the sharing and working set models from outside of
the guest.  So while CMM gives slightly better information for a random
forced page eviction, the complexity doesn't appear to justify the
savings for a VMware implementation.


Ballooning still works if you use a kernel based balloon driver.  Using
madvise wouldn't be a reliable way to balloon anyway.  Are you talking
about an API to manage working sets and such from userspace?

Cheers,

Zach

--

From: Jeremy Fitzhardinge
Date: Thursday, March 13, 2008 - 11:41 am

I like the idea and it seems basically sound, but unfortunately Xen 
won't be able to make use of it in the near term, because it doesn't 
support any kind of backing for guest domain memory.  There has been 
some thought about adding this kind of functionality to Xen.  Keir, Ian: 
do you think this kind of support in the kernel would be useful to us?

One concern I have is that 4k is really a very fine grain.  We're 
thinking about moving Xen's memory management to operate in 2M chunk 
units, which would allow guests to directly use large pages with the 
corresponding reduction in TLB pressure.  One side-effect of this is 
that we'd need to change ballooning to be in 2M rather than 4k units in 
order to prevent physical memory fragmentation.

Page hinting at 4k resolution poses the same problem.  Would this 
technique still be useful operating on 2M chunks?  Certainly it seems 
less likely that you could easily get a whole 2M area with the same 
fine-grained properties that these patches track.  Would some kind of 
page/sub-page tracking be useful?
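
To make the granularity concern concrete: a 2M chunk covers 512 pages of 4k,
and under the assumption that a chunk can only be hinted as a whole when
every one of its 4k pages qualifies, a single stable page pins the entire
chunk. The helper below is a hypothetical sketch of that counting, not code
from Xen or from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SMALL_PAGE_SIZE (4UL * 1024)
#define CHUNK_SIZE      (2UL * 1024 * 1024)
#define PAGES_PER_CHUNK (CHUNK_SIZE / SMALL_PAGE_SIZE) /* 512 */

/*
 * Count the 2M chunks in which every 4k page is marked discardable.
 * page_discardable[i] is the per-4k-page hint; nr_pages must be a
 * multiple of PAGES_PER_CHUNK.
 */
size_t discardable_chunks(const bool *page_discardable, size_t nr_pages)
{
    size_t chunk, i, count = 0;

    assert(nr_pages % PAGES_PER_CHUNK == 0);
    for (chunk = 0; chunk < nr_pages / PAGES_PER_CHUNK; chunk++) {
        bool all = true;

        for (i = 0; i < PAGES_PER_CHUNK; i++) {
            if (!page_discardable[chunk * PAGES_PER_CHUNK + i]) {
                all = false; /* one stable page pins the chunk */
                break;
            }
        }
        if (all)
            count++;
    }
    return count;
}
```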

My other concern is just correctness over time on the Linux side.  We 
already have enough trouble keeping things like the pte and page 
structure state in sync, with resulting rare data-loss bugs.  Adding 
another layer which only applies in specific environments raises the 
possibility for new bugs to go unnoticed for a long time.  How can we 
structure the VM changes to make sure that it's robust in the face of 
maintenance?

    J
--

From: Hugh Dickins
Date: Thursday, March 13, 2008 - 11:55 am

Yes, that's the main concern, as whenever lots of subtlety is added.
I wonder if there's any chance of a CONFIG_DEBUG mode, which could be
run on anybody's x86 machine, without involving any virtualization, but
in which the PAGE_STATEs become essential to the correct working of the mm.

Hugh
--

From: Zachary Amsden
Date: Thursday, March 13, 2008 - 12:53 pm

How about a fake hypervisor, which is really just a random page evictor,
following the rules of CMM?

--

From: Jeremy Fitzhardinge
Date: Friday, March 14, 2008 - 11:30 am

Probably simpler to just have variants of the page_set_* functions which 
simulate the worst-possible host action immediately (ie, stealing pages, 
logically swapping them, etc).  That wouldn't give you full coverage, 
but it would go some way.  An async variant which schedules a change in 
a few milliseconds would help too.

I guess that's equivalent to having a special-purpose hypervisor built 
into the kernel (hm, sounds familiar...).
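
A rough userspace sketch of what such a debug variant might look like,
assuming hypothetical names (dbg_page_set_volatile, dbg_page_make_stable,
dbg_page_recreate) and a simplified page structure that are not taken from
the patches: the page is "stolen" and poisoned the moment it is made
volatile, so any caller that skips the recreate path trips over it at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DBG_PAGE_SIZE 4096
#define DBG_POISON    0xAA

/* Simplified stand-in for a page plus its hinting state. */
struct dbg_page {
    unsigned char data[DBG_PAGE_SIZE];
    bool is_volatile; /* guest said the content can be recreated */
    bool discarded;   /* simulated host has thrown the content away */
};

/* Debug variant: simulate the worst-case host action immediately. */
void dbg_page_set_volatile(struct dbg_page *p)
{
    assert(p != NULL);
    p->is_volatile = true;
    memset(p->data, DBG_POISON, sizeof(p->data)); /* make stale use visible */
    p->discarded = true;
}

/* Returns false if the content is gone and the caller must recreate it. */
bool dbg_page_make_stable(struct dbg_page *p)
{
    assert(p != NULL);
    if (p->discarded)
        return false;
    p->is_volatile = false;
    return true;
}

/* Refill the page (e.g. re-read from backing store) and mark it stable. */
void dbg_page_recreate(struct dbg_page *p, unsigned char fill)
{
    assert(p != NULL);
    memset(p->data, fill, sizeof(p->data));
    p->discarded = false;
    p->is_volatile = false;
}
```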

    J
--

From: Zachary Amsden
Date: Friday, March 14, 2008 - 2:32 pm

It needn't be that hard on s390, I believe you don't need to worry about
PTEs becoming asynchronous when stealing a page, since if I understand
the hypervisor architecture, there is a per-page mapping level
available, allowing you to generate discard faults on access.  It might
be possible to use this mapping layer without implementing a full blown
hypervisor.  Martin?

For x86, at discard time, you would have to manually walk and invalidate
any PTEs potentially mapping the discarded page, but there is already
this great thing called Xen paravirt-ops which actually does that for
completely different reasons (PT page protection).

I think a random exponential distribution for discard would be needed to
catch all the racy failure modes.
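
One way to approximate this in a test harness, sketched under assumptions:
instead of a continuous exponential distribution the code below draws a
geometric delay (its discrete counterpart) by discarding with a fixed
probability on every tick. The xorshift32 generator and all names are
choices made for this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* Small deterministic PRNG (xorshift32); the seed must be nonzero. */
struct discard_rng {
    uint32_t state;
};

uint32_t discard_rng_next(struct discard_rng *r)
{
    uint32_t x = r->state;

    assert(x != 0);
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    r->state = x;
    return x;
}

/*
 * Number of ticks until the simulated host discards a page, when each
 * tick discards with probability of roughly 1/one_in. The result is
 * geometrically distributed with a mean of about one_in ticks.
 */
unsigned int ticks_until_discard(struct discard_rng *r, uint32_t one_in)
{
    unsigned int ticks = 1;
    uint32_t threshold;

    assert(one_in >= 1);
    threshold = UINT32_MAX / one_in;
    while (discard_rng_next(r) > threshold)
        ticks++;
    return ticks;
}
```

Most discards then happen soon after the state change, with a long tail of
late ones, which is the mix that tends to expose narrow race windows.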

Zach

--

From: Jeremy Fitzhardinge
Date: Friday, March 14, 2008 - 2:37 pm

Yes, I don't expect it's a problem for s390, but the point is making 
something workalike enough to make sure there's an evenly distributed 

Not sure I follow.  Xen pvops pays attention to whether a particular 
page is being used as part of a pagetable, and changes its permissions 
accordingly.  But because pagetable pages are strictly kernel-only, we 
can get away with updating a single kernel-mapping pte which is shared 
across all processes.  In the guest page hinting case, we need to deal 
with general pages which can be mapped anywhere, so that really does 
require a full traversal of the pagetables.   Presumably rmap would be 
helpful here.

    J
--

From: Martin Schwidefsky
Date: Monday, March 17, 2008 - 2:21 am

Yes, on s390 the PTEs cannot be asynchronous because there is no need to
synchronize them in the first place. A mapping layer with all primitives
without using the SIE instruction will be difficult. For one, we cannot
use the ESSA instruction, which isolates the state changes, and the host
page table is tied to the SIE. The page state is stored in the page table
extension and the discard state is basically a specially marked invalid
pte in the host table. A mapping layer with some restrictions is

If you have to walk the guest page tables you call into the guest, no?
I would characterize this more as a ballooner since you need guest
activity to do the page stealing. The trick with guest page hinting is
that you do NOT call into the guest to do the discard. You'll a nested

We had quite a few of these racy failures. Nasty. Hard to find.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.


--

From: Martin Schwidefsky
Date: Tuesday, May 6, 2008 - 8:33 am

Traffic on the guest page hinting patches died down again. Until another
user shows up I guess that's it for the full version. In the meantime I'll
push the patch below which is the poor man's version that can be done
without common code changes. It uses the arch_alloc_page/arch_free_page
hooks to do the stable/unused state transitions. Better than nothing ..
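
The idea behind the light version can be modelled in userspace roughly as
follows. This is not the content of page-states.c: in the kernel the hooks
take a struct page pointer and an order, and on s390 the state change is done
with the ESSA instruction, whereas this sketch uses page frame numbers and a
plain array, with all sketch_* names being hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_PAGES 64

enum sketch_state {
    SKETCH_STABLE = 0, /* page has content the host must keep */
    SKETCH_UNUSED = 1  /* page is free, host may drop the content */
};

/* Stand-in for the per-page state the hypervisor keeps; all stable at start. */
unsigned char sketch_states[SKETCH_NR_PAGES];

/* Apply one state to a block of 2^order pages starting at pfn. */
void sketch_set_state(size_t pfn, unsigned int order, enum sketch_state s)
{
    size_t i, n = (size_t)1 << order;

    assert(pfn + n <= SKETCH_NR_PAGES);
    for (i = 0; i < n; i++)
        sketch_states[pfn + i] = (unsigned char)s;
}

/* Allocator hands out pages: their content matters again. */
void sketch_arch_alloc_page(size_t pfn, unsigned int order)
{
    sketch_set_state(pfn, order, SKETCH_STABLE);
}

/* Allocator takes pages back: tell the host they are unused. */
void sketch_arch_free_page(size_t pfn, unsigned int order)
{
    sketch_set_state(pfn, order, SKETCH_UNUSED);
}
```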

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

---
Subject: [PATCH] guest page hinting light

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Use the existing arch_alloc_page/arch_free_page callbacks to do
the guest page state transitions between stable and unused.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 arch/s390/Kconfig          |    7 +++
 arch/s390/mm/Makefile      |    1 
 arch/s390/mm/init.c        |    3 +
 arch/s390/mm/page-states.c |   79 +++++++++++++++++++++++++++++++++++++++++++++
 include/asm-s390/page.h    |   11 ++++++
 include/asm-s390/system.h  |    6 +++
 6 files changed, 107 insertions(+)

diff -urpN linux-2.6/arch/s390/Kconfig linux-2.6-patched/arch/s390/Kconfig
--- linux-2.6/arch/s390/Kconfig	2008-05-06 17:38:14.000000000 +0200
+++ linux-2.6-patched/arch/s390/Kconfig	2008-05-06 17:38:28.000000000 +0200
@@ -430,6 +430,13 @@ config CMM_IUCV
 	  Select this option to enable the special message interface to
 	  the cooperative memory management.
 
+config PAGE_STATES
+	bool "Unused page notification"
+	help
+	  This enables the notification of unused pages to the
+	  hypervisor. The ESSA instruction is used to do the states
+	  changes between a page that has content and the unused state.
+
 config VIRT_TIMER
 	bool "Virtual CPU timer support"
 	help
diff -urpN linux-2.6/arch/s390/mm/init.c linux-2.6-patched/arch/s390/mm/init.c
--- linux-2.6/arch/s390/mm/init.c	2008-05-06 17:38:14.000000000 +0200
+++ linux-2.6-patched/arch/s390/mm/init.c	2008-05-06 17:38:28.000000000 +0200
@@ -126,6 +126,9 @@ void __init mem_init(void)
         /* ...
From: Rik van Riel
Date: Tuesday, May 6, 2008 - 12:46 pm

On Tue, 06 May 2008 17:33:02 +0200

I suspect one of the problems is that there are too many state transitions
to have it implemented with a low overhead on anything but S390, and even
there you need millicoded instructions to handle things.

If the number of transitions can be reduced, page hinting could be useful
for KVM, too.

-- 
All Rights Reversed
--

From: Zachary Amsden
Date: Tuesday, May 6, 2008 - 8:49 pm

Spot on Rik, if every transition becomes a hypercall (and a synchronous
one at that), it isn't workable for us.  If, on the other hand, you
share the state bits between the guest and hypervisor, you need a giant
(standalone) bit array for per-page state, which is neither convenient
for Linux nor the hypervisor.  I believe s390 has an 'instruction' to
migrate the state bits into the hypervisor per-physical-page data
without requiring a hypercall.
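
For a sense of what a shared per-page state array would involve, here is a
sketch that assumes two bits of state per page (an assumption for this
example; the thread does not say how many bits are needed) packed four
pages to a byte. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_STATE  2u /* assumption: at most four states per page */
#define STATES_PER_BYTE (8u / BITS_PER_STATE)
#define STATE_MASK      ((1u << BITS_PER_STATE) - 1u)

/* Bytes needed to hold the state of every page of a guest. */
size_t state_array_bytes(uint64_t guest_bytes, uint64_t page_bytes)
{
    uint64_t pages = guest_bytes / page_bytes;

    return (size_t)((pages + STATES_PER_BYTE - 1) / STATES_PER_BYTE);
}

/* Read the state of page frame pfn from the packed array. */
unsigned int state_get(const uint8_t *arr, size_t pfn)
{
    unsigned int shift = (unsigned int)(pfn % STATES_PER_BYTE) * BITS_PER_STATE;

    return (arr[pfn / STATES_PER_BYTE] >> shift) & STATE_MASK;
}

/* Store a new state for page frame pfn, leaving its neighbours alone. */
void state_set(uint8_t *arr, size_t pfn, unsigned int state)
{
    unsigned int shift = (unsigned int)(pfn % STATES_PER_BYTE) * BITS_PER_STATE;
    uint8_t *byte = &arr[pfn / STATES_PER_BYTE];

    assert(state <= STATE_MASK);
    *byte = (uint8_t)((*byte & ~(STATE_MASK << shift)) | (state << shift));
}
```

With these assumptions a 4 GiB guest with 4 KiB pages has 1048576 pages and
needs 262144 bytes of state. The sketch ignores the harder part, which is
that guest and hypervisor would have to update the same bits atomically.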

Zach

--

From: Martin Schwidefsky
Date: Wednesday, May 7, 2008 - 12:00 am

That is why we invented the millicoded ESSA instruction on s390. We had
an emulation of the instruction to test things. It worked but was
awfully slow.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.


--
