login
Header Space

 
 

mmu notifier-core v14->v15 diff for review

Previous thread: [git pull] x86, generic: optimized inlining by Ingo Molnar on Saturday, April 26, 2008 - 12:11 pm. (1 message)

Next thread: current git: spinlock lockup while starting init by Sebastian Siewior on Saturday, April 26, 2008 - 1:53 pm. (2 messages)
To: Rusty Russell <rusty@...>, Robin Holt <holt@...>, Jack Steiner <steiner@...>, Christoph Lameter <clameter@...>
Cc: <akpm@...>, Nick Piggin <npiggin@...>, Steve Wise <swise@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-mm@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <kvm-devel@...>, <general@...>, Hugh Dickins <hugh@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>
Date: Saturday, April 26, 2008 - 12:46 pm

Hello everyone,

here it is the mmu notifier #v14.

	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-...

Please everyone involved review and (hopefully ;) ack that this is
safe to go in 2.6.26, the most important is to verify that this is a
noop when disarmed regardless of MMU_NOTIFIER=y or =n.

	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-...

I'll be sending that patch to Andrew inbox.

Signed-off-by: Andrea Arcangeli &lt;andrea@qumranet.com&gt;

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8d45fab..ce3251c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2ad6f54..853087a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -663,6 +663,108 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 	account_shadowed(kvm, gfn);
 }
 
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+	struct page *page = pfn_to_page((*spte &amp; PT64_BASE_ADDR_MASK) &gt;&gt; PAGE_SHIFT);
+	get_page(page);
+	rmap_remove(kvm, spte);
+	set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+	kvm_flush_remote_tlbs(kvm);
+	put_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte, *curr_spte;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!(*spte &amp; PT_PRESENT_MASK));
+		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+		curr_spte = spte;
+		spte = rmap_next(kvm, rmapp, spte);
+		kvm_unmap_spte(kvm, curr_spte);
+	}
+}
+
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+	int i;
+
+	/*
+	 * If mmap_sem isn't taken, we can look th...
To: Rusty Russell <rusty@...>, Robin Holt <holt@...>, Jack Steiner <steiner@...>, Christoph Lameter <clameter@...>
Cc: <akpm@...>, Nick Piggin <npiggin@...>, Steve Wise <swise@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-mm@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <kvm-devel@...>, <general@...>, Hugh Dickins <hugh@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Paul E. McKenney <paulmck@...>
Date: Thursday, May 1, 2008 - 2:12 pm

Hello everyone,

this is the v14 to v15 difference to the mmu-notifier-core patch. This
is just for review of the difference, I'll post full v15 soon, please
review the diff in the meantime. Lots of those cleanups are thanks to
Andrew review on mmu-notifier-core in v14. He also spotted the
GFP_KERNEL allocation under spin_lock where DEBUG_SPINLOCK_SLEEP
failed to catch it until I enabled PREEMPT (GFP_KERNEL there was
perfectly safe with all patchset applied but not ok if only
mmu-notifier-core was applied). As usual that bug couldn't hurt
anybody unless the mmu notifiers were armed.

I also wrote a proper changelog to the mmu-notifier-core patch that I
will append before the v14-&gt;v15 diff:

Subject: mmu-notifier-core

With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In GRU case
there's no actual secondary pte and there's only a secondary tlb
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case of KVM where the a
secondary tlb miss will walk sptes in hardware and it will refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page to be freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp safe logic we have in zap_page_range and that can be avo...
To: Andrea Arcangeli <andrea@...>
Cc: Rusty Russell <rusty@...>, Robin Holt <holt@...>, Jack Steiner <steiner@...>, Christoph Lameter <clameter@...>, <akpm@...>, Nick Piggin <npiggin@...>, Steve Wise <swise@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-mm@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <kvm-devel@...>, <general@...>, Hugh Dickins <hugh@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>
Date: Saturday, April 26, 2008 - 2:59 pm

You should not assume a struct page exists for any given spte. Instead, 

Perhaps I just have a weak stomach but I am uneasy having a function 
that takes a lock on exit. I walked through the logic and it doesn't 
appear to be wrong but it also is pretty clear that you could defer the 
acquisition of the lock to the caller (in this case, kvm_mmu_pte_write) 

Worst case, you pass 4 more pointer arguments here and, take the spin 
lock, and then depending on the result of mmu_guess_page_from_pte_write, 

Why move the destruction of the vm to the MMU notifier unregister hook? 
Does anything else ever call mmu_notifier_unregister that would 
implicitly destroy the VM?

Regards,

Anthony Liguori
--
To: Anthony Liguori <aliguori@...>
Cc: Rusty Russell <rusty@...>, Robin Holt <holt@...>, Jack Steiner <steiner@...>, Christoph Lameter <clameter@...>, <akpm@...>, Nick Piggin <npiggin@...>, Steve Wise <swise@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-mm@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <kvm-devel@...>, <general@...>, Hugh Dickins <hugh@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>
Date: Saturday, April 26, 2008 - 8:20 pm

Last email from muli@ibm in my inbox argues it's useless to build rmap
on mmio regions, so the above is more efficient so put_page runs
directly on the page without going back and forth between spte -&gt; pfn
-&gt; page -&gt; pfn -&gt; page in a single function.

Certainly if we start building rmap on mmio regions we'll have to

I agree out_lock is an uncommon exit path, the problem is that the
code was buggy, and I tried to fix it with the smallest possible
change and that resulting in an out_lock. That section likely need a
refactoring, all those update_pte fields should be at least returned
by the function guess_....  but I tried to reduce the changes to make
the issue more readable, I didn't want to rewrite certain functions

Yes that was my same idea, but that's left for a later patch. Fixing
this bug mixed with the mmu notifier patch was perhaps excessive

mmu notifier -&gt;release can run at anytime before the filehandle is
closed. -&gt;release has to zap all sptes and freeze the mmu (hence all
vcpus) to prevent any further page fault. After -&gt;release returns all
pages are freed (we'll never relay on the page pin to avoid the
rmap_remove put_page to be a relevant unpin event). So the idea is
that I wanted to maintain the same ordering of the current code in the
vm destroy event, I didn't want to leave a partially shutdown VM on
the vmlist. If the ordering is entirely irrelevant and the
kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then
I can avoid changes to kvm_main.c but I doubt.

I've done it in a way that archs not needing mmu notifiers like s390
can simply add the kvm_destroy_common_vm at the top of their
kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke
kvm_destroy_common_vm in the -&gt;release of the mmu notifiers.

This will ensure that everything will be ok regardless if exit_mmap is
called before/after exit_files, and it won't make a whole lot of
difference anymore, if the driver fd is pinned through vmas-&gt;vm_file
re...
To: Andrea Arcangeli <andrea@...>
Cc: Nick Piggin <npiggin@...>, Chris Wright <chrisw@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, <linux-kernel@...>, Avi Kivity <avi@...>, Marcelo Tosatti <marcelo@...>, <linux-mm@...>, Robin Holt <holt@...>, <general@...>, Hugh Dickins <hugh@...>, <akpm@...>, Steve Wise <swise@...>, Christoph Lameter <clameter@...>
Date: Saturday, April 26, 2008 - 9:54 pm

Avi can correct me if I'm wrong, but I don't think the consensus of that 
discussion was that we're going to avoid putting mmio pages in the 
rmap.  Practically speaking, replacing:

+	struct page *page = pfn_to_page((*spte &amp; PT64_BASE_ADDR_MASK) &gt;&gt; 
PAGE_SHIFT);
+	get_page(page);


With:

unsigned long pfn = (*spte &amp; PT64_BASE_ADDR_MASK) &gt;&gt; PAGE_SHIFT;
kvm_get_pfn(pfn);

Results in exactly the same code except the later allows mmio pfns in 
the rmap.  So ignoring the whole mmio thing, using accessors that are 

I appreciate the desire to minimize changes, but taking a lock on return 
seems to take that to a bit of an extreme.  It seems like a simple thing 

I see.  It seems a little strange to me as a KVM guest isn't really tied 
to the current mm.  It seems like the net effect of this is that we are 
now tying a KVM guest to an mm.

For instance, if you create a guest, but didn't assign any memory to it, 
you could transfer the fd to another process and then close the fd 
(without destroying the guest).  The other process then could assign 
memory to it and presumably run the guest.

With your change, as soon as the first process exits, the guest will be 
destroyed.  I'm not sure this behavioral difference really matters but 
it is a behavioral difference.

Regards,


--
To: Anthony Liguori <anthony@...>
Cc: Nick Piggin <npiggin@...>, Chris Wright <chrisw@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, <linux-kernel@...>, Avi Kivity <avi@...>, Marcelo Tosatti <marcelo@...>, <linux-mm@...>, Robin Holt <holt@...>, <general@...>, Hugh Dickins <hugh@...>, <akpm@...>, Steve Wise <swise@...>, Christoph Lameter <clameter@...>
Date: Saturday, April 26, 2008 - 11:05 pm

My first impression on that discussion was that pci-passthrough mmio
can't be swapped, can't require write throttling etc.. ;). From a
linux VM pagetable point of view rmap on mmio looks weird. However
thinking some more, it's not like in the linux kernel where write
protect through rmap is needed only for write-throttling MAP_SHARED
which clearly is strictly RAM, for sptes we need it for every cr3
touch too to trap pagetable updates (think ioremap done by guest
kernel). So I think Avi's take that we need rmap for everything mapped

Agreed especially at the light of the above. I didn't actually touch
that function for a while (I clearly wrote it before we started moving
the kvm mmu code from page to pfn), and it was still safe to use to
test the locking of the mmu notifier methods. My current main focus in
the last few days was to get the locking right against the last mmu
notifier code #v14 ;).

Now that I look into it more closely, the get_page/put_page are
unnecessary by now (it was necessary with the older patches that
didn't implement range_begin and that relied on page pinning).

Not just in that function, but all reference counting inside kvm is
now entirely useless and can be removed.

NOTE: it is safe to flush the tlb outside the mmu_lock if done inside
the mmu_notifier methods. But only mmu notifiers can defer the tlb
flush after releasing mmu_lock because the page can't be freed by the
VM until we return.

All other kvm code must instead definitely flush the tlb inside the
mmu_lock, otherwise when the mmu notifier code runs, it will see the
spte nonpresent and so the mmu notifier code will do nothing (it will
not wait kvm to drop the mmu_lock before allowing the main linux VM to
free the page).

The tlb flush must happen before the page is freed, and doing it
inside mmu_lock everywhere (except in mmu-notifier contex where it can
be done after releasing mmu_lock) guarantees it.

The positive side of the tradeoff of having to do the tlb flush inside
the mmu_lock, ...
To: Andrea Arcangeli <andrea@...>
Cc: Anthony Liguori <anthony@...>, Nick Piggin <npiggin@...>, Chris Wright <chrisw@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, <linux-kernel@...>, Avi Kivity <avi@...>, Marcelo Tosatti <marcelo@...>, <linux-mm@...>, Robin Holt <holt@...>, <general@...>, Hugh Dickins <hugh@...>, <akpm@...>, Steve Wise <swise@...>, Christoph Lameter <clameter@...>
Date: Monday, April 28, 2008 - 10:48 am

Hi Andrea,

Looks good.

Acked-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;

--
Previous thread: [git pull] x86, generic: optimized inlining by Ingo Molnar on Saturday, April 26, 2008 - 12:11 pm. (1 message)

Next thread: current git: spinlock lockup while starting init by Sebastian Siewior on Saturday, April 26, 2008 - 1:53 pm. (2 messages)
speck-geostationary