From: Andrea Arcangeli <andrea@qumranet.com> With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if ...
FYI, I applied to patch to a tree that has the GRU driver. All regression tests passed. --- jack --
The hlist_del_init_rcu() primitive looks good. The rest of the RCU code looks fine assuming that "mn->ops->release()" either does call_rcu() to defer actual removal, or that the actual removal is deferred until after mmu_notifier_release() returns. --
Yes, actual removal is deferred until after mmu_notifier_release() Thanks for the review Paul! I should also have added your precious Acked-by to the 1/3 and 3/3 but the important is the ack by email ;) --
Ok, this looks ok as far as I'm concerned. I did not look at any details,
so obviously other VM people need to ack the parts they care about, but at
least I think this one is fine from a "big picture".
I think this should go in separately, and be split up into a patch of its
Similarly, even without any users, I think this can be posted as an
independent patch, just for setting things up, and to make the whole thing
easier to look through and review. So make this [2/3].
But before doing that, can you split up the low-level single-vma anon/file
locking/unlocking, please?
In other words, your 'mm_take_all_locks()' rigth now looks like it _works_
correctly, but it nests too deeply considering the complexity of it.
There's really subtle things going on inside that for-loop, and I think it
would be much better to split those low-level locking rules out.
...
ie, can you please make it be
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (signal_pending(current))
goto out_unlock;
if (vma->anon_vma)
vm_lock_anon_vma(vma->anon_vma);
if (vma->vm_file && vma->vm_file->f_mapping)
vm_lock_mapping(vma->vm_file->f_mapping);
}
and the same thing for unlocking.. Doesn't that look more obvious and
easier to understand from a high-level standpoing (and then the individual
locking rules for mappings/anon_vma's will also be more obvious, just
because they are separated from the higher-level code).
The comments are fine, but even with the comments I'd prefer you to write
the code so that you don't need to break up the conditionals over multiple
lines etc.
Anyway - I didn't look very much at the actual _notifier_ stuff (ie the
thing that I think should be [patch 3/3]), so I don't have any real
comments about that part - but I don't really care either. Becasue as long
as it doesn't mess up the core VM logic, I no longer have any real
objections.
I'd obviously want to see ack's by people like Andrew, Hugh and Nick, but
as far as I am ...I'll do all cleanups/splits very shortly and submit a v18 with 3 patches, all good, thanks a lot! --
