Re: [PATCH 001/001] mmu-notifier-core v17

Previous thread: 2.6.25-mm1: kmmcd NULL pointer dereference at klist_del+0xe/0x30 by Andy Isaacson on Friday, May 9, 2008 - 10:19 am. (4 messages)

Next thread: none
From: Andrea Arcangeli
Date: Friday, May 9, 2008 - 12:32 pm

From: Andrea Arcangeli <andrea@qumranet.com>

With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In GRU case
there's no actual secondary pte and there's only a secondary tlb
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case of KVM where the a
secondary tlb miss will walk sptes in hardware and it will refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page to be freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp safe logic we have in zap_page_range and that can be avoided
completely if the spte unmap event doesn't require an unpin of the
page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it avoids the
code that teardown the mappings of the secondary MMU, to implement a
logic like tlb_gather in zap_page_range that would require many IPI to
flush other cpu tlbs, for each fixed number of spte unmapped.

To make an example: if ...
From: Jack Steiner
Date: Monday, May 12, 2008 - 1:01 pm

FYI, I applied to patch to a tree that has the GRU driver. All regression
tests passed.

--- jack
--

From: Paul E. McKenney
Date: Friday, May 16, 2008 - 12:07 pm

The hlist_del_init_rcu() primitive looks good.

The rest of the RCU code looks fine assuming that "mn->ops->release()"
either does call_rcu() to defer actual removal, or that the actual
removal is deferred until after mmu_notifier_release() returns.

--

From: Andrea Arcangeli
Date: Thursday, June 5, 2008 - 9:51 am

Yes, actual removal is deferred until after mmu_notifier_release()

Thanks for the review Paul! I should also have added your precious
Acked-by to the 1/3 and 3/3 but the important is the ack by email ;)
--

From: Linus Torvalds
Date: Tuesday, June 3, 2008 - 9:26 am

Ok, this looks ok as far as I'm concerned. I did not look at any details, 
so obviously other VM people need to ack the parts they care about, but at 
least I think this one is fine from a "big picture".


I think this should go in separately, and be split up into a patch of its 

Similarly, even without any users, I think this can be posted as an 
independent patch, just for setting things up, and to make the whole thing 
easier to look through and review. So make this [2/3].

But before doing that, can you split up the low-level single-vma anon/file 
locking/unlocking, please?

In other words, your 'mm_take_all_locks()' rigth now looks like it _works_ 
correctly, but it nests too deeply considering the complexity of it. 
There's really subtle things going on inside that for-loop, and I think it 
would be much better to split those low-level locking rules out.

...

ie, can you please make it be

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (signal_pending(current))
			goto out_unlock;
		if (vma->anon_vma)
			vm_lock_anon_vma(vma->anon_vma);
		if (vma->vm_file && vma->vm_file->f_mapping)
			vm_lock_mapping(vma->vm_file->f_mapping);
	}

and the same thing for unlocking.. Doesn't that look more obvious and 
easier to understand from a high-level standpoing (and then the individual 
locking rules for mappings/anon_vma's will also be more obvious, just 
because they are separated from the higher-level code).

The comments are fine, but even with the comments I'd prefer you to write 
the code so that you don't need to break up the conditionals over multiple 
lines etc.

Anyway - I didn't look very much at the actual _notifier_ stuff (ie the 
thing that I think should be [patch 3/3]), so I don't have any real 
comments about that part - but I don't really care either. Becasue as long 
as it doesn't mess up the core VM logic, I no longer have any real 
objections.

I'd obviously want to see ack's by people like Andrew, Hugh and Nick, but 
as far as I am ...
From: Andrea Arcangeli
Date: Tuesday, June 3, 2008 - 10:35 am

I'll do all cleanups/splits very shortly and submit a v18 with 3
patches, all good, thanks a lot!
--

Previous thread: 2.6.25-mm1: kmmcd NULL pointer dereference at klist_del+0xe/0x30 by Andy Isaacson on Friday, May 9, 2008 - 10:19 am. (4 messages)

Next thread: none