Greetings,
I've dusted off the guest page hinting patches and ported them to today's upstream git tree. There is one reject if applied to 2.6.24-rc5-mm1, but that is easy to fix. The code still works as expected on my test system.
Our z/VM performance team recently published a report on guest page hinting vs. the ballooner approach on SLES10 for a farm of web servers. The code on SLES10 differs a bit from the upstream variant, but the performance results should still be valid. You will find the report here:
http://www.vm.ibm.com/perf/reports/zvm/html/530cmm.html
(the VMRM-CMM the web page speaks about is the balloon approach, CMMA is the guest page hinting). Both approaches to the memory overcommit problem show comparable benefits for this workload, with an advantage for guest page hinting for a large number of guests. For other workloads your mileage may vary.
The main benefit of guest page hinting vs. the ballooner is that there is no need for a monitor that keeps track of the memory usage of all the guests, for a complex algorithm that calculates the working set sizes, or for calls into the guest kernel to control the size of the balloons. The host just does normal LRU based paging. If the host picks one of the pages the guest can recreate, the host can throw it away instead of writing it to the paging device. Simple and elegant.
The main disadvantage is the added complexity that is introduced to the guest's memory management code to do the page state changes and to deal with discard faults.
The last versions of the patches do not differ much; I consider the code to be stable. My question now is how to proceed with the code. I sure would love to see the code going upstream some day, but that depends on the mm developers, as the code adds complexity that needs to be supported. If the general feeling is that the advantages of this approach do not warrant the added complexity, this will likely be the last time you will hear about guest page hinting.
-- blue ...
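As a rough illustration of the host-side decision described above (discard a page the guest can recreate, otherwise write it to the paging device), here is a minimal sketch in plain C. It is not taken from the patches: the state names (GPH_STABLE, GPH_UNUSED, GPH_VOLATILE), the "volatile" label for recreatable pages, and both function names are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states; the names are illustrative only. */
enum gph_state {
	GPH_STABLE,	/* host must preserve the content */
	GPH_UNUSED,	/* guest does not care about the content */
	GPH_VOLATILE,	/* guest can recreate the content (assumed name) */
};

enum host_action {
	HOST_WRITE_TO_PAGING_DEVICE,
	HOST_DISCARD,
};

/* What the host does with a page that its normal LRU picked for eviction. */
enum host_action host_evict(enum gph_state state)
{
	switch (state) {
	case GPH_UNUSED:
	case GPH_VOLATILE:
		return HOST_DISCARD;	/* no paging I/O needed */
	default:
		/* Stable or unknown: be conservative and keep the content. */
		return HOST_WRITE_TO_PAGING_DEVICE;
	}
}

/* Count how many of the n LRU victims still cost a paging write. */
size_t paging_writes_needed(const enum gph_state *states, size_t n)
{
	size_t i, writes = 0;

	assert(states != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (host_evict(states[i]) == HOST_WRITE_TO_PAGING_DEVICE)
			writes++;
	return writes;
}
```

The point of the sketch is only that the host needs no per-guest monitor: the decision is local to the page that LRU already selected.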
Well, I want this feature, but I agree about complexity. AFAICT, the trivial subset of this is the hinting of Unused pages. It seems that would buy us something, and perhaps be a stepping stone to full page hinting? Cheers, Rusty. --
I've been there but the unused page thing is so small that it doesn't make sense to separate it from the patches. If I don't see any progress then I will come up with a patch that adds the Unused state transitions to the arch files of s390. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
Oh, that would be such a shame. Your guest page hinting patches remind me of that childhood thrill, when once a year the circus comes to town ;) But seriously, I'm ashamed to see my name in the Cc list: it would be very unfair if your patches never made it in, just because I've failed to find the time to wrap my own puny brain around them. It's very encouraging to see Jeremy and Rusty weighing in. I hope Zach will too, and I've added Andrea: their support would count a lot. You have Nick on the list, good, I've added Christoph and Peter (if you do resend, linux-mm might prove more useful than linux-kernel). With support from rival virtualizers, I do think you've a good chance of getting in. Hugh --
It is an effort to get your head around it the first time. It gets ...
Grr, did I really forget to copy linux-mm?!? (..insert your favourite four letter word here..). I absolutely intended to copy linux-mm but ...
Yes, it would be great if we can find another user for it. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
I agree the page hinting technique is generally useful, even cross-architecture. What doesn't appear to be useful, however, is support for this under VMware. It can be done, even without the writable pte support (yes, really). But due to us exploiting optimizations at lower layers, it doesn't appear that it will gain us any performance - and we must already have the complex working set algorithms to support ...
I would say we support it, but I don't expect us to make use of the infrastructure anytime soon. For us it would make more sense to use the swap-fault optimization, but this requires some significant design changes in our monitor. Either way, these are both great ideas and I would not want to be held responsible for blocking their upstream progress. Someday, with the evolving x86 architecture (if we ever get per-page dirty bits), they might make sense for us to do as well. Zach --
With non-paravirt all you can do is to swap the guest physical memory (mmu notifiers allow linux to do that) or share memory (mmu notifiers + ksm allow linux to do that too). We also have complex working set algorithms that we use to find which parts of the guest physical address space are best to swap first: the core linux VM.
What paravirt allows us to do (and that's the whole point of the paper I guess) is to go one step further than just guest swapping and to ask the guest if the page really needs to be swapped or if it can be freed right away. So this would be an extension of the mmu notifiers (this also shows how the EMM API is too restrictive, while MMU notifiers will allow that extension in the future) to avoid I/O sometimes if the guest tells us it's not necessary to swap through paravirt ops.
When talking with friends about ballooning I already once suggested to auto inflate the balloon with pages in the freelist. Now this paper goes well beyond the pages in the freelist (called U/unused in the paper); it also covers cache and mapped-clean cache in the guest. That would have been the next step. Anyway, plain ballooning remains useful as rss limiting or numa compartments in the linux hypervisor, to provide unfairness to certain guests.
I didn't read the patch yet, but I think paravirt knowledge about U/unused pages is needed to avoid guest swapping. The cache and mapped cache in the guest is a gray area, because linux as hypervisor will be extremely efficient at swapping out and swapping in the guest cache (host swapping guest cache may be faster than re-issuing a read-I/O to refill the cache by itself, clearly with guest using paravirt). Let's say I'm mostly interested in page-hinting for the U pages initially.
I'm currently busy with two other features and trying to get mmu notifier #v9 into mainline, which is orders of magnitude more important than avoiding a few swapouts sometimes (without mmu notifiers everything else is irrelevant, including guest page ...
We can tap into those algorithms just as effectively using ballooning, and we've optimized the sharing and working set models from outside of the guest. So while CMM gives slightly better information for a random forced page eviction, the complexity doesn't appear to justify the savings for a VMware implementation. Ballooning still works if you use a kernel based balloon driver. Using madvise wouldn't be a reliable way to balloon anyway. Are you talking about an API to manage working sets and such from userspace? Cheers, Zach --
I like the idea and it seems basically sound, but unfortunately Xen
won't be able to make use of it in the near term, because it doesn't
support any kind of backing for guest domain memory. There has been
some thought about adding this kind of functionality to Xen. Keir, Ian:
do you think this kind of support in the kernel would be useful to us?
One concern I have is that 4k is really a very fine grain. We're
thinking about moving Xen's memory management to operate in 2M chunk
units, which would allow guests to directly use large pages with the
corresponding reduction in TLB pressure. One side-effect of this is
that we'd need to change ballooning to be in 2M rather than 4k units in
order to prevent physical memory fragmentation.
Page hinting at 4k resolution poses the same problem. Would this
technique still be useful operating on 2M chunks? Certainly it seems
less likely that you could easily get a whole 2M area with the same
fine-grained properties that these patches track. Would some kind of
page/sub-page tracking be useful?
My other concern is just correctness over time on the Linux side. We
already have enough trouble keeping things like the pte and page
structure state in sync, with resulting rare data-loss bugs. Adding
another layer which only applies in specific environments raises the
possibility for new bugs to go unnoticed for a long time. How can we
structure the VM changes to make sure that it's robust in the face of
maintenance?
J
--
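To make the 4k vs. 2M granularity concern above concrete: a 2M chunk covers 512 pages of 4k, and a chunk-level hint is only straightforward when all 512 share one state. The following is a minimal sketch under that assumption; the function name, the macro names and the one-byte-per-page state encoding are hypothetical and not part of Xen or the patches.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_PAGE_SIZE 4096u
#define CHUNK_SIZE      (2u * 1024u * 1024u)
#define PAGES_PER_CHUNK (CHUNK_SIZE / SMALL_PAGE_SIZE)	/* 512 */

/*
 * Returns 1 and stores the common state in *out if all 4k pages of the
 * chunk share one state; otherwise returns 0 and leaves *out untouched.
 */
int chunk_uniform_state(const unsigned char states[PAGES_PER_CHUNK],
			unsigned char *out)
{
	size_t i;

	assert(states && out);
	for (i = 1; i < PAGES_PER_CHUNK; i++)
		if (states[i] != states[0])
			return 0;	/* mixed chunk: no single hint fits */
	*out = states[0];
	return 1;
}
```

A single stable page among 511 unused ones makes the whole chunk non-uniform, which is the reason some kind of page/sub-page tracking is raised as a question.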
Yes, that's the main concern, as whenever lots of subtlety is added. I wonder if there's any chance of a CONFIG_DEBUG mode, which could be run on anybody's x86 machine, without involving any virtualization, but in which the PAGE_STATEs become essential to the correct working of the mm. Hugh --
How about a fake hypervisor, which is really just a random page evictor, following the rules of CMM? --
Probably simpler to just have variants of the page_set_* functions which
simulate the worst-possible host action immediately (ie, stealing pages,
logically swapping them, etc). That wouldn't give you full coverage,
but it would go some way. An async variant which schedules a change in
a few milliseconds would help too.
I guess that's equivalent to having a special-purpose hypervisor built
into the kernel (hm, sounds familiar...).
J
--
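The debug idea above (page_set_* variants that immediately simulate the worst host action) could look roughly like the following userspace sketch. Everything here is hypothetical: the dbg_* names, the poison value and the struct layout are invented for illustration, and the real page_set_* functions in the patches operate on struct page, not on a structure like this.

```c
#include <assert.h>
#include <string.h>

#define DBG_PAGE_SIZE 4096
#define DBG_POISON    0xa5	/* arbitrary poison byte */

enum dbg_state { DBG_STABLE, DBG_UNUSED };

struct dbg_page {
	enum dbg_state state;
	unsigned char data[DBG_PAGE_SIZE];
};

/*
 * Debug variant of "set unused": besides changing the state, act like the
 * worst-possible host and steal the content at once by poisoning it.
 */
void dbg_page_set_unused(struct dbg_page *page)
{
	assert(page);
	page->state = DBG_UNUSED;
	memset(page->data, DBG_POISON, sizeof(page->data));
}

/* Back to stable: from now on the content must be preserved. */
void dbg_page_set_stable(struct dbg_page *page)
{
	assert(page);
	page->state = DBG_STABLE;
}

/* Returns 1 if relying on the page content now would violate the rules. */
int dbg_page_read_is_bug(const struct dbg_page *page)
{
	assert(page);
	return page->state != DBG_STABLE;
}
```

Code that still depends on the old content after the transition then fails right away instead of only when a real host happens to discard the page. As noted above, this does not give full coverage; the async variant is not sketched here.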
It needn't be that hard on s390. I believe you don't need to worry about PTEs becoming asynchronous when stealing a page since, if I understand the hypervisor architecture, there is a per-page mapping level available, allowing you to generate discard faults on access. It might be possible to use this mapping layer without implementing a full-blown hypervisor. Martin?
For x86, at discard time, you would have to manually walk and invalidate any PTEs potentially mapping the discarded page, but there is already this great thing called Xen paravirt-ops which actually does that for completely different reasons (PT page protection). I think a random exponential distribution for discard would be needed to catch all the racy failure modes. Zach --
Yes, I don't expect it's a problem for s390, but the point is making
something workalike enough to make sure there's an evenly distributed ...
Not sure I follow. Xen pvops pays attention to whether a particular
page is being used as part of a pagetable, and changes its permissions
accordingly. But because pagetable pages are strictly kernel-only, we
can get away with updating a single kernel-mapping pte which is shared
across all processes. In the guest page hinting case, we need to deal
with general pages which can be mapped anywhere, so that really does
require a full traversal of the pagetables. Presumably rmap would be
helpful here.
J
--
Yes, on s390 the PTEs cannot be asynchronous because there is no need to synchronize them in the first place. A mapping layer with all primitives without using the SIE instruction will be difficult. For one, we cannot use the ESSA instruction which isolates the state changes, and the host page table is tied to the SIE. The page state is stored in the page table extension and the discard state is basically a specially marked invalid pte in the host table. A mapping layer with some restrictions is ...
If you have to walk the guest page tables you call into the guest, no? I would characterize this more as a ballooner since you need guest activity to do the page stealing. The trick with guest page hinting is that you do NOT call into the guest to do the discard. You'll a nested ...
We had quite a few of these racy failures. Nasty. Hard to find. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
Traffic on the guest page hinting patches died down again. Until another
user shows up I guess that's it for the full version. In the meantime I'll
push the patch below, which is the poor man's version that can be done
without common code changes. It uses the arch_alloc_page/arch_free_page
hooks to do the stable/unused state transitions. Better than nothing ..
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
---
Subject: [PATCH] guest page hinting light
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use the existing arch_alloc_page/arch_free_page callbacks to do
the guest page state transitions between stable and unused.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/Kconfig          |    7 +++
 arch/s390/mm/Makefile      |    1 +
 arch/s390/mm/init.c        |    3 +
 arch/s390/mm/page-states.c |   79 +++++++++++++++++++++++++++++++++++++++++++++
 include/asm-s390/page.h    |   11 ++++++
 include/asm-s390/system.h  |    6 +++
 6 files changed, 107 insertions(+)
diff -urpN linux-2.6/arch/s390/Kconfig linux-2.6-patched/arch/s390/Kconfig
--- linux-2.6/arch/s390/Kconfig 2008-05-06 17:38:14.000000000 +0200
+++ linux-2.6-patched/arch/s390/Kconfig 2008-05-06 17:38:28.000000000 +0200
@@ -430,6 +430,13 @@ config CMM_IUCV
 	  Select this option to enable the special message interface to
 	  the cooperative memory management.
 
+config PAGE_STATES
+	bool "Unused page notification"
+	help
+	  This enables the notification of unused pages to the
+	  hypervisor. The ESSA instruction is used to do the states
+	  changes between a page that has content and the unused state.
+
 config VIRT_TIMER
 	bool "Virtual CPU timer support"
 	help
diff -urpN linux-2.6/arch/s390/mm/init.c linux-2.6-patched/arch/s390/mm/init.c
--- linux-2.6/arch/s390/mm/init.c 2008-05-06 17:38:14.000000000 +0200
+++ linux-2.6-patched/arch/s390/mm/init.c 2008-05-06 17:38:28.000000000 +0200
@@ -126,6 +126,9 @@ void __init mem_init(void)
 	/* ...
On Tue, 06 May 2008 17:33:02 +0200
I suspect one of the problems is that there are too many state transitions to have it implemented with a low overhead on anything but S390, and even there you need millicoded instructions to handle things. If the number of transitions can be reduced, page hinting could be useful for KVM, too. -- All Rights Reversed --
Spot on Rik, if every transition becomes a hypercall (and a synchronous one at that), it isn't workable for us. If, on the other hand, you share the state bits between the guest and hypervisor, you need a giant (standalone) bit array for per-page state, which is neither convenient for Linux nor the hypervisor. I believe s390 has an 'instruction' to migrate the state bits into the hypervisor per-physical-page data without requiring a hypercall. Zach --
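For a sense of what a standalone per-page state bit array would involve, here is a minimal sketch assuming 2 bits of state per page packed four to a byte. The 2-bit width, the state_map_* names and the layout are assumptions for illustration only; nothing here is from the patches, and a real shared array would also need atomic updates, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATE_BITS      2u			/* assumed bits per page */
#define STATES_PER_BYTE (8u / STATE_BITS)
#define STATE_MASK      ((1u << STATE_BITS) - 1u)

struct state_map {
	uint8_t *bits;		/* packed states, STATES_PER_BYTE per byte */
	size_t nr_pages;
};

/* Bytes needed to hold the state of nr_pages pages. */
size_t state_map_bytes(size_t nr_pages)
{
	return (nr_pages + STATES_PER_BYTE - 1) / STATES_PER_BYTE;
}

unsigned state_map_get(const struct state_map *map, size_t pfn)
{
	unsigned shift;

	assert(pfn < map->nr_pages);
	shift = (unsigned)(pfn % STATES_PER_BYTE) * STATE_BITS;
	return (map->bits[pfn / STATES_PER_BYTE] >> shift) & STATE_MASK;
}

/* Not atomic: a guest and hypervisor sharing this would race. */
void state_map_set(struct state_map *map, size_t pfn, unsigned state)
{
	unsigned shift;
	uint8_t *byte;

	assert(pfn < map->nr_pages && state <= STATE_MASK);
	shift = (unsigned)(pfn % STATES_PER_BYTE) * STATE_BITS;
	byte = &map->bits[pfn / STATES_PER_BYTE];
	*byte = (uint8_t)((*byte & ~(STATE_MASK << shift)) | (state << shift));
}
```

With these assumptions, a guest with 1048576 pages (4 GiB of 4k pages) needs state_map_bytes(1048576) = 262144 bytes, i.e. 256 KiB of separately managed array.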
That is why we invented the millicoded ESSA instruction on s390. We had an emulation of the instruction to test things. It worked but was awfully slow. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
