Hello everyone, here it is the mmu notifier #v14. http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-... Please everyone involved review and (hopefully ;) ack that this is safe to go in 2.6.26, the most important is to verify that this is a noop when disarmed regardless of MMU_NOTIFIER=y or =n. http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-... I'll be sending that patch to Andrew inbox. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 8d45fab..ce3251c 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 2ad6f54..853087a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -663,6 +663,108 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) account_shadowed(kvm, gfn); } +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte) +{ + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); + get_page(page); + rmap_remove(kvm, spte); + set_shadow_pte(spte, shadow_trap_nonpresent_pte); + kvm_flush_remote_tlbs(kvm); + put_page(page); +} + +static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp) +{ + u64 *spte, *curr_spte; + + spte = rmap_next(kvm, rmapp, NULL); + while (spte) { + BUG_ON(!(*spte & PT_PRESENT_MASK)); + rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte); + curr_spte = spte; + spte = rmap_next(kvm, rmapp, spte); + kvm_unmap_spte(kvm, curr_spte); + } +} + +void kvm_unmap_hva(struct kvm *kvm, unsigned long hva) +{ + int i; + + /* + * If mmap_sem isn't taken, we can look th...
Hello everyone, this is the v14 to v15 difference to the mmu-notifier-core patch. This is just for review of the difference, I'll post full v15 soon, please review the diff in the meantime. Lots of those cleanups are thanks to Andrew review on mmu-notifier-core in v14. He also spotted the GFP_KERNEL allocation under spin_lock where DEBUG_SPINLOCK_SLEEP failed to catch it until I enabled PREEMPT (GFP_KERNEL there was perfectly safe with all patchset applied but not ok if only mmu-notifier-core was applied). As usual that bug couldn't hurt anybody unless the mmu notifiers were armed. I also wrote a proper changelog to the mmu-notifier-core patch that I will append before the v14->v15 diff: Subject: mmu-notifier-core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avo...
You should not assume a struct page exists for any given spte. Instead, Perhaps I just have a weak stomach but I am uneasy having a function that takes a lock on exit. I walked through the logic and it doesn't appear to be wrong but it also is pretty clear that you could defer the acquisition of the lock to the caller (in this case, kvm_mmu_pte_write) Worst case, you pass 4 more pointer arguments here and, take the spin lock, and then depending on the result of mmu_guess_page_from_pte_write, Why move the destruction of the vm to the MMU notifier unregister hook? Does anything else ever call mmu_notifier_unregister that would implicitly destroy the VM? Regards, Anthony Liguori --
Last email from muli@ibm in my inbox argues it's useless to build rmap on mmio regions, so the above is more efficient so put_page runs directly on the page without going back and forth between spte -> pfn -> page -> pfn -> page in a single function. Certainly if we start building rmap on mmio regions we'll have to I agree out_lock is an uncommon exit path, the problem is that the code was buggy, and I tried to fix it with the smallest possible change and that resulting in an out_lock. That section likely need a refactoring, all those update_pte fields should be at least returned by the function guess_.... but I tried to reduce the changes to make the issue more readable, I didn't want to rewrite certain functions Yes that was my same idea, but that's left for a later patch. Fixing this bug mixed with the mmu notifier patch was perhaps excessive mmu notifier ->release can run at anytime before the filehandle is closed. ->release has to zap all sptes and freeze the mmu (hence all vcpus) to prevent any further page fault. After ->release returns all pages are freed (we'll never relay on the page pin to avoid the rmap_remove put_page to be a relevant unpin event). So the idea is that I wanted to maintain the same ordering of the current code in the vm destroy event, I didn't want to leave a partially shutdown VM on the vmlist. If the ordering is entirely irrelevant and the kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then I can avoid changes to kvm_main.c but I doubt. I've done it in a way that archs not needing mmu notifiers like s390 can simply add the kvm_destroy_common_vm at the top of their kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke kvm_destroy_common_vm in the ->release of the mmu notifiers. This will ensure that everything will be ok regardless if exit_mmap is called before/after exit_files, and it won't make a whole lot of difference anymore, if the driver fd is pinned through vmas->vm_file re...
Avi can correct me if I'm wrong, but I don't think the consensus of that discussion was that we're going to avoid putting mmio pages in the rmap. Practically speaking, replacing: + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); + get_page(page); With: unsigned long pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; kvm_get_pfn(pfn); Results in exactly the same code except the later allows mmio pfns in the rmap. So ignoring the whole mmio thing, using accessors that are I appreciate the desire to minimize changes, but taking a lock on return seems to take that to a bit of an extreme. It seems like a simple thing I see. It seems a little strange to me as a KVM guest isn't really tied to the current mm. It seems like the net effect of this is that we are now tying a KVM guest to an mm. For instance, if you create a guest, but didn't assign any memory to it, you could transfer the fd to another process and then close the fd (without destroying the guest). The other process then could assign memory to it and presumably run the guest. With your change, as soon as the first process exits, the guest will be destroyed. I'm not sure this behavioral difference really matters but it is a behavioral difference. Regards, --
My first impression on that discussion was that pci-passthrough mmio can't be swapped, can't require write throttling etc.. ;). From a linux VM pagetable point of view rmap on mmio looks weird. However thinking some more, it's not like in the linux kernel where write protect through rmap is needed only for write-throttling MAP_SHARED which clearly is strictly RAM, for sptes we need it for every cr3 touch too to trap pagetable updates (think ioremap done by guest kernel). So I think Avi's take that we need rmap for everything mapped Agreed especially at the light of the above. I didn't actually touch that function for a while (I clearly wrote it before we started moving the kvm mmu code from page to pfn), and it was still safe to use to test the locking of the mmu notifier methods. My current main focus in the last few days was to get the locking right against the last mmu notifier code #v14 ;). Now that I look into it more closely, the get_page/put_page are unnecessary by now (it was necessary with the older patches that didn't implement range_begin and that relied on page pinning). Not just in that function, but all reference counting inside kvm is now entirely useless and can be removed. NOTE: it is safe to flush the tlb outside the mmu_lock if done inside the mmu_notifier methods. But only mmu notifiers can defer the tlb flush after releasing mmu_lock because the page can't be freed by the VM until we return. All other kvm code must instead definitely flush the tlb inside the mmu_lock, otherwise when the mmu notifier code runs, it will see the spte nonpresent and so the mmu notifier code will do nothing (it will not wait kvm to drop the mmu_lock before allowing the main linux VM to free the page). The tlb flush must happen before the page is freed, and doing it inside mmu_lock everywhere (except in mmu-notifier contex where it can be done after releasing mmu_lock) guarantees it. The positive side of the tradeoff of having to do the tlb flush inside the mmu_lock, ...
Hi Andrea, Looks good. Acked-by: Marcelo Tosatti <mtosatti@redhat.com> --
| Netfilter kernel module | 8 hours ago | Linux kernel |
| serial driver xmit problem | 10 hours ago | Linux kernel |
| Why Windows is better than Linux | 10 hours ago | Linux general |
| How can I see my kernel messages in vt12? | 17 hours ago | Linux kernel |
| Grub | 1 day ago | Linux general |
| vmalloc_fault handling in x86_64 | 1 day ago | Linux kernel |
| epoll_wait()ing on epoll FD | 1 day ago | Linux kernel |
| Framebuffer in x86_64 causes problems to multiseat | 1 day ago | Linux kernel |
| Difference between 2.4 and 2.6 regarding thread creation | 1 day ago | Linux general |
| Compiling gfs2 on kernel 2.6.27 | 2 days ago | Linux kernel |
