Re: [PATCH] mmu notifiers #v8

Previous thread: Re: 2.6.24-git4+ regression by Mike Galbraith on Tuesday, February 19, 2008 - 1:15 am. (1 message)

Next thread: [PATCH] x86_64: fix dma_alloc_pages by Yinghai Lu on Tuesday, February 19, 2008 - 3:21 am. (2 messages)
From: Nick Piggin
Date: Tuesday, February 19, 2008 - 1:43 am

Well I started reviewing the mmu notifier code, but it is kind of hard to
know what you're talking about just by reading through code and not trying
your suggestions for yourself...

So I implemented mmu notifiers slightly differently. Andrea's mmu notifiers
are rather similar. However I have tried to make a point of minimising the
impact the the core mm/. I don't see why we need to invalidate or flush
anything when changing the pte to be _more_ permissive, and I don't
understand the need for invalidate_begin/invalidate_end pairs at all.
What I have done is basically create it so that the notifiers get called
basically in the same place as the normal TLB flushing is done, and nowhere
else.

I also wanted to avoid calling notifier code from inside eg. hardware TLB
or pte manipulation primitives. These things are already pretty well
spaghetti, so I'd like to just place them right where needed first... I
think eventually it will need a bit of a rethink to make it more consistent
and more general. But I prefer to do put them in the caller for the moment.

I have also attempted to write a skeleton driver. Not like Christoph's
drivers, but one that actually does something. This one can mmap a
window into its own virtual address space. It's not perfect yet (I need
to replace page_mkwrite with ->fault in the core mm before I can get
enough information to do protection properly I think). However I think it
may be race-free in the fault vs unmap paths. It's pretty complex, I must
say.

---

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -228,6 +228,9 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
 	struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: ...
From: Nick Piggin
Date: Tuesday, February 19, 2008 - 1:44 am

Index: linux-2.6/drivers/char/mmu_notifier_skel.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/char/mmu_notifier_skel.c
@@ -0,0 +1,255 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/miscdevice.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/mmu_notifier.h>
+#include <linux/radix-tree.h>
+#include <linux/seqlock.h>
+#include <asm/tlbflush.h>
+
+static DEFINE_SPINLOCK(mmn_lock);
+static RADIX_TREE(rmap_tree, GFP_ATOMIC);
+static seqcount_t rmap_seq = SEQCNT_ZERO;
+
+static int __rmap_add(unsigned long mem, unsigned long vaddr)
+{
+	int err;
+
+	err = radix_tree_insert(&rmap_tree, mem >> PAGE_SHIFT, (void *)vaddr);
+
+	return err;
+}
+
+static void __rmap_del(unsigned long mem)
+{
+	void *ret;
+
+	ret = radix_tree_delete(&rmap_tree, mem >> PAGE_SHIFT);
+	BUG_ON(!ret);
+}
+
+static unsigned long rmap_find(unsigned long mem)
+{
+	unsigned long vaddr;
+
+	rcu_read_lock();
+	vaddr = (unsigned long)radix_tree_lookup(&rmap_tree, mem >> PAGE_SHIFT);
+	rcu_read_unlock();
+
+	return vaddr;
+}
+
+static struct page *follow_page_atomic(struct mm_struct *mm, unsigned long address, int write)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, address);
+        if (!vma || (vma->vm_start > address))
+                return NULL;
+
+	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+		return NULL;
+
+	return follow_page(vma, address, FOLL_GET|(write ? FOLL_WRITE : 0));
+}
+
+static int mmn_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long source_vaddr = (unsigned long)vmf->pgoff << PAGE_SHIFT;
+	unsigned long dest_vaddr = (unsigned long)vmf->virtual_address;
+	unsigned long pfn;
+	struct page *page;
+	pgprot_t prot;
+	int write = vmf->flags & FAULT_FLAG_WRITE;
+	int ...
From: Robin Holt
Date: Tuesday, February 19, 2008 - 4:59 am

Because XPMEM needs to be able to sleep during its callout.  For that,
we need to move this outside of the page table lock and suddenly we
need the begin/end pair again.  There was considerable discussion about
this exact point numerous times.  We tried to develop the most inclusive
design possible.  Our design would even be extendable to IB, assuming they
made some very disruptive changes to their MPI and communication libraries.
IB would suffer the same problems XPMEM does in that the TLB entries
need to be removed on a remote host which is operating completely
independently.

Thanks,
Robin
--

From: Andrea Arcangeli
Date: Tuesday, February 19, 2008 - 6:58 am

I also tried hard to minimise the impact of the core mm/, I also
argued with Christoph that cluttering mm/ wasn't a good idea for
things like age_page that could be a 1 liner change instead of a

Note that in my patch the invalidate_pages in mprotect can be
trivially switched to a mprotect_pages with proper params. This will
prevent page faults completely in the secondary MMU (there will only
be tlb misses after the tlb flush just like for the core linux pte),
and it'll allow all the secondary MMU pte blocks (512/1024 at time
with my PT lock design) to be updated to have proper permissions

The need of the pairs is crystal clear to me: range_begin is needed
for GRU _but_only_if_ range_end is called after releasing the
reference that the VM holds on the page. _begin will flush the GRU tlb
and at the same time it will take a mutex that will block further GRU
tlb-miss-interrupts (no idea how they manange those nightmare locking,
I didn't even try to add more locking to KVM and I get away with the
fact KVM takes the pin on the page itself).

My patch calls invalidate_page/pages before the reference is released
on the page, so GRU will work fine despite lack of
range_begin. Furthermore with my patch GRU will be auto-serialized by


Your patch should also work for KVM but it's suboptimal, my patch can
be orders of magnitude more efficient for GRU thanks to the
invalidate_pages optimization. Christoph complained about having to
call one method per pte.

And adding invalidate_range is useless unless you fully support
xpmem. You're calling invalidate_range in places that can't sleep...

No idea why xpmem needs range_begin, I perfectly understand why GRU
needs _begin with Chrisotph's patch (gru lacks the page pin) but I
dunno why xpmem needs range_begin (xpmem has the page pin so I also
think it could avoid using range_begin). Still to support GRU you need
both to call invalidate_range in places that can sleep and you need
the external rmap notifier. The moment you add xpmem into ...
From: Jack Steiner
Date: Tuesday, February 19, 2008 - 7:27 am

As it turns out, no actual mutex is required. _begin_ simply increments a
count of active range invalidates, _end_ decrements the count. New TLB
dropins are deferred while range callouts are active.

This would appear to be racy but the GRU has special hardware that
simplifies locking. When the GRU sees a TLB invalidate, all outstanding
misses & potentially inflight TLB dropins are marked by the GRU with a
"kill" bit. When the dropin finally occurs, the dropin is ignored & the
instruction is simply restarted. The instruction will fault again & the TLB
dropin will be repeated.  This is optimized for the case where invalidates
are rare - true for users of the GRU.


In general, though, I agree. Most users of mmu_notifiers would likely
required a mutex or something equivalent.


--- jack



--

From: Nick Piggin
Date: Tuesday, February 19, 2008 - 4:04 pm

OK (thanks to Robin as well). Now I understand why you are using it,
but I don't understand why you don't defer new TLBs after the point
where the linux pte changes. If you can do that, then you look and
act much more like a TLB from the point of view of the Linux vm.


--

From: Andrea Arcangeli
Date: Tuesday, February 19, 2008 - 5:52 pm

Christoph was forced to put the invalidate_range callback _after_
dropping the PT lock because xpmem has to wait I/O there. But
invalidate_range is called after freeing the VM reference on the pages
so then GRU needed a _range_begin too because GRU has to flush the tlb
before the VM reference on the page is released (xpmem and KVM pin the
pages mapped by the secondary mmu, GRU doesn't). So then
invalidate_range was renamed to invalidate_range_end.
--

From: Robin Holt
Date: Tuesday, February 19, 2008 - 7:46 pm

Currently, xpmem blocks faults for the range specified at the _begin
callout, then shoots down remote TLBs and does the put_page for all the
pages in the specified range.  The _end callout merely removes the block.
We do not do any wait for I/O.  By the time we return from the _begin
callout, all activity by the remotes is stopped, pages are dereferenced,
and future faults are blocked until released by the _end callout.

Thanks,
Robin
--

From: Christoph Lameter
Date: Wednesday, February 27, 2008 - 3:50 pm

The skeletons shows how to do most of it using a spinlock and a 
counter.

--

From: Nick Piggin
Date: Tuesday, February 19, 2008 - 3:59 pm

That's why I don't understand the need for the pairs: it should be


I thought that could be used by a non-sleeping user (not intending
to try supporting sleeping users). If it is useless then it should
go away (BTW. I didn't see your recent patch, some of my confusion
I think stems from Christoph's novel way of merging and splitting

Sorry, I kind of didn't have time to follow the conversation so well
before; are there patches posted for gru and/or xpmem?

--

From: Andrea Arcangeli
Date: Tuesday, February 19, 2008 - 5:46 pm

See the last patch I posted to Andrew, you've probably looked at the
old patches, the old patches didn't work for GRU and didn't work for
xpmem and they weren't optimized to cluster the invalidates for each

I kept improving my patch in case the VM maintainers would consider
xpmem requirements not workable from a linux-VM point of view, and
they preferred to have something obviously safe, strightforward and
non intrusive, despite it doesn't support the only sleeping user out
there I know of (xpmem). My patch supports KVM and GRU (and any other

There's some xpmem code posted but the posted one isn't using the mmu
notifiers yet. GRU code may be available from Jack. I only know for
sure their requirements in terms of mmu notifiers.
--

From: Christoph Lameter
Date: Wednesday, February 27, 2008 - 3:55 pm

What is so novel about introducing functionality step by step?
--

From: Nick Piggin
Date: Tuesday, February 19, 2008 - 4:11 pm

Sorry, I realise I still didn't get this through my head yet (and also
have not seen your patch recently). So I don't know exactly what you
are doing...

But why does _anybody_ (why does Christoph's patches) need to invalidate
when they are going to be more permissive? This should be done lazily by
the driver, I would have thought.
--

From: Jack Steiner
Date: Tuesday, February 19, 2008 - 4:40 pm

Agree. Although for most real applications, the performance difference
is probably negligible.

--- jack
--

From: Nick Piggin
Date: Wednesday, February 20, 2008 - 9:42 pm

But importantly, doing it that way means you share test coverage with
the CPU TLB flushing code, and you don't introduce a new concept to the
VM.

So, it _has_ to be lazy flushing, IMO (as there doesn't seem to be a
good reason otherwise). mprotect shouldn't really be a special case,
because it still has to flush the CPU tlbs as well when restricting
access.
--

From: Jack Steiner
Date: Friday, February 22, 2008 - 9:31 am

Here is the source of the GRU driver. It is still in development but
it compiles & runs (on IA64) in a system simulator.

The GRU is a hardware resource located in the chipset. It is
mmaped into the user address space. The GRU contains functions such
as load/store, scatter/gather, bcopy, etc. It is directly accessed
by user instructions using user virtual addresses. GRU instructions
(ex., bcopy) use user virtual addresses for operands. The GRU
contains a large TLB that is functionally very similar to processor TLBs.


This version uses the V7 mmu notifier patch from Christoph. The changes
to switch to Andrea's patch are trivial. (Note, however, that XPMEM still
requires Christoph's patch).


The interesting parts relating to mmu_notifiers are in the
following functions:
	gru_try_dropin() - does TLB dropins
	gru_flush_tlb_range() - TLB flushing
	gru_mmuops_...() - all functions starting with "gru_mmuops_"
	gru_register_mmu_notifier() - registers notifiers


I have no doubt that there are bugs in the code. If you find them, please
let me know where they are ....    :-)

Other comments appreciated, too.




Portions are rough but this 
 arch/ia64/sn/kernel/sn2/sn2_smp.c |    5 
 drivers/Makefile                  |    1 
 drivers/gru/Makefile              |    4 
 drivers/gru/gru.h                 |  348 +++++++++++++
 drivers/gru/gru_instructions.h    |  502 +++++++++++++++++++
 drivers/gru/grufault.c            |  557 ++++++++++++++++++++++
 drivers/gru/grufile.c             |  453 +++++++++++++++++
 drivers/gru/gruhandles.h          |  655 +++++++++++++++++++++++++
 drivers/gru/grukservices.c        |  129 +++++
 drivers/gru/grulib.h              |   84 +++
 drivers/gru/grumain.c             |  958 ++++++++++++++++++++++++++++++++++++++
 drivers/gru/grummuops.c           |  376 ++++++++++++++
 drivers/gru/gruprocfs.c           |  309 ++++++++++++
 drivers/gru/grutables.h           |  517 ++++++++++++++++++++
 drivers/sn/Kconfig                |    7 
 15 files ...
From: Andrea Arcangeli
Date: Tuesday, February 19, 2008 - 6:09 pm

The last version was posted here:


This can be done lazily by the driver yes. The place where I've an
invalidate_pages in mprotect however can also become less permissive.
It's simpler to invalidate always and it's not guaranteed the
secondary mmu page fault is capable of refreshing the spte across a
writeprotect fault. In the future this can be changed to
mprotect_pages though, so no page fault will happen in the secondary
mmu.
--

From: Andrea Arcangeli
Date: Wednesday, February 20, 2008 - 3:39 am

Given Nick's comments I ported my version of the mmu notifiers to
latest mainline. There are no known bugs AFIK and it's obviously safe
(nothing is allowed to schedule inside rcu_read_lock taken by
mmu_notifier() with my patch).

XPMEM simply can't use RCU for the registration locking if it wants to
schedule inside the mmu notifier calls. So I guess it's better to add
the XPMEM invalidate_range_end/begin/external-rmap as a whole
different subsystem that will have to use a mutex (not RCU) to
serialize, and at the same time that CONFIG_XPMEM will also have to
switch the i_mmap_lock to a mutex. I doubt xpmem fits inside a
CONFIG_MMU_NOTIFIER anymore, or we'll all run a bit slower because of
it. It's really a call of how much we want to optimize the MMU
notifier, by keeping things like RCU for the registration.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -46,6 +46,7 @@
 	__young = ptep_test_and_clear_young(__vma, __address, __ptep);	\
 	if (__young)							\
 		flush_tlb_page(__vma, __address);			\
+	__young |= mmu_notifier_age_page((__vma)->vm_mm, __address);	\
 	__young;							\
 })
 #endif
@@ -86,6 +87,7 @@ do {									\
 	pte_t __pte;							\
 	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
 	flush_tlb_page(__vma, __address);				\
+	mmu_notifier(invalidate_page, (__vma)->vm_mm, __address);	\
 	__pte;								\
 })
 #endif
diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
--- a/include/asm-s390/pgtable.h
+++ b/include/asm-s390/pgtable.h
@@ -735,6 +735,7 @@ static inline pte_t ptep_clear_flush(str
 {
 	pte_t pte = *ptep;
 	ptep_invalidate(vma->vm_mm, address, ptep);
+	mmu_notifier(invalidate_page, vma->vm_mm, address);
 	return pte;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ ...
From: Andrea Arcangeli
Date: Wednesday, February 20, 2008 - 3:45 am

This is the same as before but against the mmu notifier #v6 patch,
running on top of 2.6.25-rc latest, and in this last update I fixed
the last race condition with a seqlock. I described the exact fix in a
earlier email, in short the seqlock-write is in the
invalidate_page/pages, and the reader will re-issue gfn_to_page if it
finds a seqlock read failure (see the change to paging_tmpl.h). With
this on top of mmu notifier #v6 there are no more practical or
theoretical known problems, nor in the kvm swapping, nor in the mmu
notifier patch (which also supports all sleeping users not just KVM,
without requiring a page pin).

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 41962e7..e1287ab 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6656efa..9151d64 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -533,6 +533,110 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 		kvm_flush_remote_tlbs(kvm);
 }
 
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+	struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+	get_page(page);
+	rmap_remove(kvm, spte);
+	set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+	kvm_flush_remote_tlbs(kvm);
+	__free_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte, *curr_spte;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!(*spte & PT_PRESENT_MASK));
+		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+		curr_spte = spte;
+		spte = rmap_next(kvm, rmapp, spte);
+		kvm_unmap_spte(kvm, ...
From: Andrea Arcangeli
Date: Wednesday, February 27, 2008 - 3:06 pm

Same as before but one one hand ported to #v7 API and on the other
hand ported to latest kvm.git.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 41962e7..e1287ab 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4583329..4067b0f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -642,6 +642,110 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 	account_shadowed(kvm, gfn);
 }
 
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+	struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+	get_page(page);
+	rmap_remove(kvm, spte);
+	set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+	kvm_flush_remote_tlbs(kvm);
+	__free_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte, *curr_spte;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!(*spte & PT_PRESENT_MASK));
+		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+		curr_spte = spte;
+		spte = rmap_next(kvm, rmapp, spte);
+		kvm_unmap_spte(kvm, curr_spte);
+	}
+}
+
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+	int i;
+
+	/*
+	 * If mmap_sem isn't taken, we can look the memslots with only
+	 * the mmu_lock by skipping over the slots with userspace_addr == 0.
+	 */
+	spin_lock(&kvm->mmu_lock);
+	for (i = 0; i < kvm->nmemslots; i++) {
+		struct kvm_memory_slot *memslot = &kvm->memslots[i];
+		unsigned long start = memslot->userspace_addr;
+		unsigned long end;
+
+		/* mmu_lock protects userspace_addr */
+		if (!start)
+			continue;
+
+		end = start + ...
From: izik eidus
Date: Thursday, February 28, 2008 - 1:42 am

From: Robin Holt
Date: Wednesday, February 20, 2008 - 4:33 am

But won't that other "subsystem" cause us to have two seperate callouts
that do equivalent things and therefore force a removal of this and go
back to what Christoph has currently proposed?

Robin
--

From: Andrea Arcangeli
Date: Wednesday, February 20, 2008 - 5:03 am

The point is that a new kind of notifier that only supports sleeping
users will allow to keep optimizing the mmu notifier patch for the
non-sleeping users. If we keep going Christoph's way of having a
single notifier that fits all he will have to:

1) drop the entire RCU locking from its patches (making all previous
   rcu discussions and fixes void) those discussions only made sense
   if applied to _my_ patch, not Christoph's patches as long as you
   pretend to sleep in any of his mmu notifier methods like invalidate_range_*.

2) probably modify the linux VM to replace the i_mmap_lock and perhaps
   PT lock with a mutex (see Nick's comments for details)

I'm unconvinced both the main linux VM and the mmu notifier should be
changed like this just to support xpmem. All non-sleeping users don't
need that. Nevertheless I'm fully welcome to support xpmem (and it's
not my call nor my interest to comment if allocating skbs in
try_to_unmap in order to unpin pages is workable, let's assume it's
workable for the sake of this discussion) with a new config option
that will also alter how the core VM works, in order to fully support
the sleeping users for filebacked mappings.

This will also create less confusion in the registration. With
Christoph's one-config-option-fits-all you had to half register into
the mmu notifier (the sleeping calls, so not invalidate_page) and full
register in the external rmap notifier, and I had to only half
register into the mmu notifier (not range_begin) and not register in
the rmap external notifier.

With two separate config options for sleeping and non sleeping users,
I'll 100% register in the mmu notifier methods, and the non-sleeping
users will 100% register the xpmem methods. You won't have to have
designed the mmu notifier patches to understand how to use it.

In theory both KVM and GRU are free to use the xpmem methods too (the
invalidate_page will be page_t based instead of [mm,addr] based, but
that's possible to handle with KVM changes if ...
From: Robin Holt
Date: Wednesday, February 20, 2008 - 5:24 am

We do not need to do any allocation in the messaging layer, all
structures used for messaging are allocated at module load time.
The allocation discussions we had early on were about trying to
rearrange you notifiers to allow a seperate worker thread to do the
invalidate and then the main thread would spin waiting for the worker to
complete.  That was canned by the moving your notifier to before the

So, fundamentally, how would they be different?  Would we be required to
add another notifier list to the mm and have two seperate callout
points?  Reduction would end up with the same half-registered
half-not-registered situation you point out above.  Then further
reduction would lead to the elimination of the callouts you have just
proposed and using the _begin/_end callouts and we are back to
Christoph's current patch.

Robin
--

From: Andrea Arcangeli
Date: Wednesday, February 20, 2008 - 5:32 am

I thought you called some net/* function inside the mmu notifier

Did you miss Nick's argument that we'd need to change some VM lock to
mutex and solve lock issues first? Are you implying mutex are more
efficient for the VM? (you may seek support from preempt-rt folks at
least) or are you implying the VM would better run slower with mutex
in order to have a single config option?
--

From: Robin Holt
Date: Wednesday, February 20, 2008 - 6:15 am

Nope, that was the discussions with the IB folks.  We only use XPC and

That would be if we needed to support file backed mappings and hugetlbfs
mappings.  Currently (and for the last 6 years), XPMEM has not supported
either of those.  I don't view either as being a realistic possibility,
but it is certainly something we would need to address before either
could be supported.

Robin
--

From: Nick Piggin
Date: Wednesday, February 20, 2008 - 10:02 pm

Just from a high level view, in some cases we can just say that no we
aren't going to support this. And this may well be one of those cases.

The more constraints placed on the VM, the harder it becomes to
improve and adapt in future. And this seems like a pretty big restriction.
(especially if we can eg. work around it completely by having a special
purpose driver to get_user_pages on comm buffers as I suggested in the
other mail).

At any rate, I believe Andrea's patch really places minimal or no further
constraints than a regular CPU TLB (or the hash tables that some archs
implement). So we're kind of in 2 different leagues here.
--

From: Robin Holt
Date: Wednesday, February 20, 2008 - 7:41 am

Whoa there.  In Christoph's patch, we did not use rcu for the list.  It
was a simple hlist_head.  The list manipulations were done under
down_write(&current->mm->mmap_sem) and would therefore not be racy.  All
the callout locations are already acquiring the mmap_sem at least
readably, so we should be safe.  Maybe I missed a race somewhere.

Thanks,
Robin
--

From: Andrea Arcangeli
Date: Wednesday, February 20, 2008 - 8:34 am

You missed quite a few, see when atomic=1 and when mmu_rmap_notifier
is invoked for example.
--

From: Jack Steiner
Date: Wednesday, February 20, 2008 - 2:03 pm

I ported the GRU driver to use the latest #v6 patch and ran a series of
tests on it using our system simulator. The simulator is slow so true
stress or swapping is not possible - at least within a finite amount of
time.

Functionally, the #v6 patch seems to work for the GRU. However, I did
notice two significant differences that make the #v6 performance worse for
the GRU than Christoph's patch.  I think one difference is easily fixable
but the other is more difficult:

	- the location of the mmu_notifier_release() callout is at a
	  different place in the 2 patches. Christoph has the callout
	  BEFORE the call to unmap_vmas() whereas you have it AFTER. The
	  net result is that the GRU does a LOT of 1-page TLB flushes
	  during process teardown.  These flushes are not done with
	  Christops's patch.

	- the range callouts in Christoph's patch benefit the GRU because
	  multiple TLB entries can be flushed with a single GRU
	  instruction (the GRU hardware supports a range flush using a
	  vaddr & length).  The #v6 patch does a TLB flush for each page in
	  the range.  Flushing on the GRU is slow so being able to flush
	  multiple pages with a single request is a benefit.

Seems like the latter difference could be significant for other users
of mmu notifiers.


--- jack
--

From: Nick Piggin
Date: Wednesday, February 20, 2008 - 9:54 pm

Thanks! Yes the seqlock you are using now ends up looking similar
to what I did and I couldn't find a hole in that either. So I
think this is going to work.

I do prefer some parts of my patch, however for everyone's sanity,
I think you should be the maintainer of the mmu notifiers, and I
will send you incremental changes that can be discussed more easily

I agree: your coherent, non-sleeping mmu notifiers are pretty simple
and unintrusive. The sleeping version is fundamentally going to either
need to change VM locks, or be non-coherent, so I don't think there is
a question of making one solution fit everybody. So the sleeping /
xrmap patch should be kept either completely independent, or as an
add-on to this one.

I will post some suggestions to you when I get a chance.

 
--

From: Andrea Arcangeli
Date: Thursday, February 21, 2008 - 7:40 am

The need to change the VM locks to fit the sleepable "mmu notifier"
needs, I think is the major reason why the sleeping patch should be a
separate config option unless you think the i_mmap_lock will benefit
the VM for its own good regardless of the sleepable mmu
notifiers. Otherwise we'll end up merging in mainline an API that can
only satisfy the needs of the "sleeping users" that are only
interested about anonymous memory. While the basic concept of the mmu
notifiers is to cover the whole user visible address space, not just
anonymous memory! Furthermore XPMEM users already asked to work on
tmpfs/MAP_SHARED too...

Originally the trick that I was trying to remove the "atomic" param,
was to defer the invalidate_range after dropping the i_mmap_lock. But
clearly in truncate we'll have no more guarantees that nor the vma nor
the MM still exists after spin_unlock(i_mmap_lock) is called... So
it's simply impossible to call the mmu notifier out of the i_mmap_lock
for truncate, and Christoph's patch looks unfixable without altering
the VM core locking. Christoph's API one-config-fits-all can't really
fit-all, but only the anonymous memory.

However if I wear a KVM hat, I cannot care less what is merged as long
as .25 will be able to fully swap reliably a virtualized guest OS ;).
This is why I'm totally willing to support any decision in favor of
anything (including your own patch that would only work for KVM) that

I really want suggestions on Jack's concern about issuing an
invalidate per pte entry or per-pte instead of per-range. I'll answer
that in a separate email. For KVM my patch is already close to optimal
because each single spte invalidate requires a fixed amount of work,
but for GRU a large invalidate-range would be more efficient.

To address the GRU _valid_ concern, I can create a second version of
my patch with range_begin/end instead of invalidate_pages, that still
won't support sleeping users like XPMEM but only KVM and GRU. Then
it's up to Christoph when he ...
From: Jack Steiner
Date: Thursday, February 21, 2008 - 9:10 am

I don't know how much significance to place on this data, but it is
a real data point.

I ran the GRU regression test suite on kernels with both types of
mmu_notifiers. The kernel/driver using Christoph's patch had
1/7 the number of TLB invalidates as Andrea's patch.

This reduction is due to both differences I mentioned yesterday:
	- different location of callout for address space teardown
	- range callouts

Unfortunately, the current driver does not allow me to quantify
which of the differences is most significant.

Also, I'll try to post the driver within the next few days. It is
still in development but it compiles and can successfully run most
workloads on a system simulator.

--- jack
--

From: Andrea Arcangeli
Date: Wednesday, February 27, 2008 - 12:26 pm

Hello,

I hope this will can be considered final for .25 and be merged. Risk
is zero, the only discussion here is to make an API that will last
forever, functionality-wise all these patches provides zero risk and
zero overhead when MMU_NOTIFIER=n. This last patch covers KVM and GRU
and hopefully all other non-blocking users optimally, and the below
API will hopefully last forever (but even if it lasts just for .25 and
.26 is changed that's fine with us, it's a kernel _internal_ API
anyway, there's absolutely nothing visible to userland).

What Christoph need to do when he's back from vacations to support
sleepable mmu notifiers is to add a CONFIG_XPMEM config option that
will switch the i_mmap_lock from a semaphore to a mutex (any other
change to this patch will be minor compared to that) so XPMEM hardware
will have kernels compiled that way. I don't see other sane ways to
remove the "atomic" parameter from the API (apparently required by
Andrew for merging something not restricted to the xpmem current usage
with only anonymous memory) and I don't want to have such a
locking-change intrusive dependency for all other non-blocking users
that are fine without having to alter how the VM works (for example
KVM and GRU). Very minor changes will be required to this patch to
make it work after the VM locking will be altered (for example the
CONFIG_XPMEM should also switch the mmu_register/unregister locking
from RCU to mutex as well). XPMEM then will only compile if
CONFIG_XPMEM=y and in turn the invalidate_range_* will support
scheduling inside.

I don't think pretending to merge all in one block (I mean including
xpmem support that requires blocking methods) is good idea anymore as
long as we agree the "atomic" parameter shouldn't be merged. But we
can quite easily agree on the below to be optimal for GRU/KVM and
trivially extendible once a CONFIG_XPMEM will be added. So this first
part can go in now I think.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: ...
From: Peter Zijlstra
Date: Wednesday, February 27, 2008 - 1:04 pm

From: Christoph Lameter
Date: Wednesday, February 27, 2008 - 4:06 pm

Ok so it somehow works slowly with GRU and you are happy with it. What 

Would it not be better to have a solution that fits all instead of hacking 
something in now and then having to modify it later?


Hmmm.. There were earlier discussions of changing the anon vma lock to a 
rw lock because of contention issues in large systems. Maybe we can just 
generally switch the locks taken while walking rmaps to semaphores? That 
would still require to put the invalidate outside of the pte lock.

--

From: Andrea Arcangeli
Date: Wednesday, February 27, 2008 - 4:43 pm

As far as GRU is concerned, performance is the same as with your patch

If RDMA/IB folks needed to block in invalidate_range, I guess they
need to do so on top of tmpfs too, and that never worked with your

The whole point is that your solution fits only GRU and KVM too.

XPMEM in your patch works in a hacked mode limited to anonymous memory
only, Robin already received incoming mail asking to allow xpmem to
work on more than anonymous memory, so your solution-that-fits-all
doesn't actually fit some of Robin's customer needs. So if it doesn't
even entirely satisfy xpmem users, imagine the other potential

anon_vma lock can remain a spinlock unless you also want to schedule
inside try_to_unmap.

If converting the i_mmap_lock to a mutex is a big trouble, another way
that might work to allow invalidate_range to block, would be to try to
boost the mm_users to prevent the mmu_notifier_release to run in
another cpu the moment after i_mmap_lock spinlock is unlocked. But
even if that works, it'll run slower and the mmu notifiers RCU locking
should be switched to a mutex, so it'd be nice to have it as a
separate option.
--

From: Christoph Lameter
Date: Wednesday, February 27, 2008 - 5:08 pm

Either that or a separate rmap as also mentioned before.


--

From: Andrea Arcangeli
Date: Wednesday, February 27, 2008 - 5:21 pm

Yes, it can be made to work with even more extended VM changes than to
only allow invalidate_range to schedule. Those core VM changes should
only be done "by default" (w/o CONFIG_XPMEM=y), if they're doing good
to the VM regardless of xpmem requirements. And I'm not really sure of
that. I think they don't do any good or they would be a mutex

I'm not suggesting not to address the issues, just that those issues
requires VM core changes, and likely those changes should be
switchable under a CONFIG_XPMEM, so I see no reason to delay the mmu
notifier until those changes are done and merged too. It's kind of a

DRI also wants invalidate_page by (mm,addr).
--

From: Christoph Lameter
Date: Wednesday, February 27, 2008 - 5:24 pm

No its the core problem of the mmu notifier. It needs to be usable for a 
lot of scenarios.

 
--

From: Christoph Lameter
Date: Thursday, February 28, 2008 - 12:48 pm

This is not going to work even if the mutex would work as easily as you 

Changing the locking for the callouts for users of the mmu notivier that 
f.e. require a response via the network (RDMA, XPMEM etc) is not trivial 
at all. RCU lock cannot be used. So we are looking at totally disjunct 

Who disarms the notifier? Why is the method not called to disarm the 

Well a bit better but now we have to modify both the macro and the code 

The release should be called much earlier to allow the driver to release 
all resources in one go. This way each vma must be processed individually. 
For our gobs of memory this method may create a scaling problem on exit().

--

From: Andrea Arcangeli
Date: Thursday, February 28, 2008 - 2:52 pm

The notifier is auto-disarmed by mmu_notifier_release, your patch
works the same way. ->release is further called just in case anybody

Good point, it has to be called earlier for GRU, but it's not a
performance issue. GRU doesn't pin the pages so it should make the
global invalidate in ->release _before_ unmap_vmas. Linux can't fault
in the ptes anymore because mm_users is zero so there's no need of a
->release_begin/end, the _begin is enough.

In #v6 I was invalidating inside unmap_vmas so it was ok. The
performance issues you're talking about refers to #v6 I guess, for #v7
there's a single call.

Thanks!

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2039,6 +2039,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();
@@ -2050,7 +2051,6 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
-	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
--

From: Christoph Lameter
Date: Thursday, February 28, 2008 - 3:00 pm

Mutex is not acceptable for performance reasons. I think we can just drop 
the RCU lock if we simply unregister the mmu notifier in release and 
forbid the drivers from removing themselves from the notification 
chain. They can simply do nothing until release. At that time there is no 

I do not follow you about the _begin without end but the following fix 
seems okay.
--

From: Jack Steiner
Date: Thursday, February 28, 2008 - 4:17 pm

I disagree. The location of the callout IS a performance issue. In simple
comparisons of the 2 patches (Christoph's vs. Andrea's), Andrea's has a 7X
increase in the number of TLB purges being issued to the GRU. TLB flushing
is slow and can impact the performance of of tasks using the GRU.

--- jack
--

From: Andrea Arcangeli
Date: Thursday, February 28, 2008 - 5:24 pm

Are you sure that you're referring to #v7?
--

From: Christoph Lameter
Date: Thursday, February 28, 2008 - 6:13 pm

Jack: AFAICT Andrea moved the release callout and things will be 
fine in the next release.

 
--

From: Christoph Lameter
Date: Thursday, February 28, 2008 - 4:05 pm

Still think that the lock here is not of too much use and can be easily 

Andrew recomended local variables for parameters used multile times. This 


One could avoid a hlist_for_each_entry_safe here by simply always deleting 
the first object. 

Also re the _notify variants: The binding to pte_clear_flush_young etc 
will become a problem for notifiers that want to sleep because 
pte_clear_flush is usually called with the pte lock held. See f.e. 
try_to_unmap_one, page_mkclean_one etc.

It would be better if the notifier calls could be moved outside of the 
pte lock.



--

From: Andrea Arcangeli
Date: Thursday, February 28, 2008 - 5:40 pm

I don't exactly see what "buggy macro" meant? I already use
parenthesis as needed to avoid the need of local variables to be

I thought I tried the (void) but it didn't work and your solution
worked, but perhaps I did something wrong, I'll try again with (void)

Agreed, the current construct come from the fact we previously didn't
assume nobody could ever call mmu_notifier_unregister by the time

Calling __free_page out of the PT lock is much bigger
change. do_wp_page will require changes anyway when the sleepable

The point is that it can't make a difference right now, and my
objective was to avoid unnecessary source code duplication (later it
will be necessary, right now it isn't). By the time you rework
do_wp_page, removing _notify will be a very minor detail compared to
the rest of the changes to do_wp_page IMHO. Expanding it now won't
provide a real advantage later.
--

From: Andrew Morton
Date: Thursday, February 28, 2008 - 5:56 pm

multiple refernces to the argument, so

	mmu_notifier(foo, bar(), zot);

will call bar() either once or twice.

Unlikely in this case, but bad practice.  Easily fixable by using another
temporary.

--

From: Christoph Lameter
Date: Thursday, February 28, 2008 - 6:03 pm

I thought you wanted to get rid of the sync via pte lock?

What is the trouble with the current do_wp_page modifications? There is 
no need for invalidate_page() there so far. invalidate_range() does the 
trick there.

--

From: Andrea Arcangeli
Date: Friday, February 29, 2008 - 6:09 am

Sure. _notify is happening inside the pt lock by coincidence, to
reduce the changes to mm/* as long as the mmu notifiers aren't


No trouble, it's just that I didn't want to mangle over the logic of
do_wp_page unless it was strictly required, the patch has to be
obviously safe. You need to keep that bit of your patch to make the
mmu notifiers sleepable.
--

From: Christoph Lameter
Date: Friday, February 29, 2008 - 12:46 pm

Ok if this is a coincidence then it would be better to separate the 
notifier callouts from the pte macro calls.
--

From: Andrea Arcangeli
Date: Sunday, March 2, 2008 - 8:54 am

Difference between #v7 and #v8:

1) s/age_page/clear_flush_young/ (Nick's suggestion)
2) macro fix (Andrew)
3) move release before final unmap_vmas (for GRU, Jack/Christoph)
4) microoptimize mmu_notifier_unregister (Christoph)
5) use mmap_sem for registration serialization (Christoph)

The (void)xxx in macros doesn't work with "args". Christoph's solution
look best in avoiding warnings, even if it forces to make the mmu
notifier operation structure visible even if MMU_NOTIFIER=n (that's
the only downside).

I didn't drop invalidate_page, because invalidate_range_begin/end
would be slower for usages like KVM/GRU (we don't need a begin/end
there because where invalidate_page is called, the VM holds a
reference on the page). do_wp_page should also use invalidate_page
since it can free the page after dropping the PT lock without losing
any performance (that's not true for the places where invalidate_range
is called).

It'd be nice if everyone involved can agree to converge on this API
for .25. KVM/GRU (and perhaps Quadrics) and similar usages will be
fully covered in .25. This is a kernel internal API so there's no
problem if all the methods will become sleep capable only starting
only in .26. The brainer part of the VM work to do to make it sleep
capable is pretty much orthogonal with this patch.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/mmu_notifier.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -228,6 +229,8 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
 	struct mem_cgroup *mem_cgroup;
 #endif
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h ...
From: Andrea Arcangeli
Date: Sunday, March 2, 2008 - 9:03 am

Here an example of the futher orthogonal work to do on top of #v8
during .26-rc to make the whole mmu notifier API sleep capable.

1) Every single ptep_clear_flush_young_notify and
ptep_clear_flush_notify must be converted like the below. The below is
the conversion of a single one. do_wp_page has been converted by
Christoph already but with invalidate_range (should be changed to
invalidate_page by releasing the refcount on the page after calling
invalidate_page). Hope it's clear why I'd rather not depend on these
changes to be merged in .25 in order to have the mmu notifier included
in .25.

2) Then after all this conversion work is finished, it's trivial to
delete ptep_clear_flush_young_notify and ptep_clear_flush_notify from
mmu_notifier.h (they will be unused macros once the conversion is
complete).

3) After that the VM has to be changed to convert anon_vma lock and
i_mmap_lock spinlocks to mutex/rwsemaphore.

4) Then finally the mmu_notifier_unregister must be dropped to make the
mmu notifier sleep capable with RCU in the mmu_notifier() fast path.

It's unclear at this point if 3/4 should be switchable and happening
under a CONFIG_XPMEM or similar or if everyone will benefit from those
spinlock becoming mutex (the only one that is certain to appreciate
such a change is preempt-rt, the rest of the userbase I don't know for
sure and I'd be more confortable with a TPC number comparison before
doing such a chance by default, but I leave the commentary on such a
change to linux-mm in a separate thread).

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -274,7 +274,7 @@ static int page_referenced_one(struct pa
 	unsigned long address;
 	pte_t *pte;
 	spinlock_t *ptl;
-	int referenced = 0;
+	int referenced = 0, clear_flush_young = 0;
 
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -287,8 +287,11 @@ static int page_referenced_one(struct pa
 	if (vma->vm_flags & VM_LOCKED) ...
From: Peter Zijlstra
Date: Sunday, March 2, 2008 - 9:23 am

Or require PREEMPTIBLE_RCU, that can handle sleeps..

--

From: Nick Piggin
Date: Sunday, March 2, 2008 - 8:29 pm

I have a couple of "cleanup" patches that change the structure of this
to something I prefer. Others may not, but I'll post them for debate

I'm still not completely happy with this. I had a very quick look
at the GRU driver, but I don't see why it can't be implemented
more like the regular TLB model, and have TLB insertions depend on
the linux pte, and do invalidates _after_ restricting permissions
to the pte.

Ie. I'd still like to get rid of invalidate_range_begin, and get

If we can agree on the API, then I don't see any reason why it can't
go into 2.6.25, unless someome wants more time to review it (but
2.6.25 release should be quite far away still so there should be quite
a bit of time).
--

From: Andrea Arcangeli
Date: Monday, March 3, 2008 - 5:51 am

_begin exists because by the time _end is called, the VM already
dropped the reference on the page. This way we can do a single
invalidate no matter how large the range is. I don't see ways to
remove _begin while still invoking _end a single time for the whole

Cool! ;)
--

From: Nick Piggin
Date: Monday, March 3, 2008 - 6:10 am

Is this just a GRU problem? Can't we just require them to take a ref
on the page (IIRC Jack said GRU could be changed to more like a TLB
model).
--

From: Andrea Arcangeli
Date: Monday, March 3, 2008 - 6:24 am

Yes, it's just a GRU problem, it tries to optimize performance by
calling follow_page only in the fast path, and fallbacks to
get_user_pages; put_page in the slow path. xpmem could also send the
message in _begin and wait the message in _end, to reduce the wait
time. But if you forge GRU to call get_user_pages only (like KVM
does), the _begin can be removed. In theory we could also optimize KVM
to use follow_page only if the pte is already established. I'm not
sure how much that is a worthwhile optimization though.

However note that Quadrics also had a callback before and one after,
so they may be using the callback before for similar
optimizations. But functionality-wise _end is the only required bit if
everyone takes refcounts like KVM and XPMEM do.
--

From: Jack Steiner
Date: Monday, March 3, 2008 - 8:18 am

Maintaining a long-term reference on a page is a problem. The GRU does not
currently maintain tables to track the pages for which dropins have been done.

The GRU has a large internal TLB and is designed to reference up to 8PB of
memory. The size of the tables to track this many referenced pages would be
a problem (at best).
--

From: Nick Piggin
Date: Monday, March 3, 2008 - 9:59 am

Is it any worse a problem than the pagetables of the processes which have
their virtual memory exported to GRU? AFAIKS, no; it is on the same
magnitude of difficulty. So you could do it without introducing any
fundamental problem (memory usage might be increased by some constant
factor, but I think we can cope with that in order to make the core patch
really nice and simple).

It is going to be really easy to add more weird and wonderful notifiers
later that deviate from our standard TLB model. It would be much harder to
remove them. So I really want to see everyone conform to this model first.
Numbers and comparisons can be brought out afterwards if people want to
attempt to make such changes.
--

From: Jack Steiner
Date: Monday, March 3, 2008 - 11:06 am

The range invalidates have a performance advantage for the GRU. TLB invalidates
on the GRU are relatively slow (usec) and interfere somewhat with the performance
of other active GRU instructions. Invalidating a large chunk of addresses with
a single GRU TLBINVAL operation is must faster than issuing a stream of single
page TLBINVALs.


Functionally, the GRU is very close to what I would consider to be the
"standard TLB" model. Dropins and flushs map closely to processor dropins
and flushes for cpus.  The internal structure of the GRU TLB is identical to
the TLB of existing cpus.  Requiring the GRU driver to track dropins with
long term page references seems to me a deviation from having the basic
mmuops support a "standard TLB" model. AFAIK, no other processor requires
this.

Tracking TLB dropins (and long term page references) could be done but it
adds significant complexity and scaling issues. The size of the tables to
track many TB (to PB) of memory can get large. If the memory is being
referenced by highly threaded applications, then the problem becomes even
more complex. Either tables must be replicated per-thread (and require even
more memory), or the table structure becomes even more complex to deal with
node locality, cacheline bouncing, etc.


Agree.
--

From: Avi Kivity
Date: Monday, March 3, 2008 - 11:09 am

In theory this would apply to kvm as well (coalesce tlb flush IPIs, 
lookup shadow page table once), but is it really a fast path?  What 
triggers range operations for your use cases?

-- 
error compiling committee.c: too many arguments to function

--

From: Jack Steiner
Date: Monday, March 3, 2008 - 11:23 am

Although not frequent, an unmap of a multiple TB object could be quite painful
if each page was invalidated individually instead of 1 invalidate for the entire range.
This is even worse if the application is threaded and the object has been reference by
many GRUs (there are 16 GRU ports per node - each potentially has to be invalidated).

Forks (again, not frequent) would be another case.



--

From: Nick Piggin
Date: Monday, March 3, 2008 - 11:45 am

That is because the CPU TLBs have the mmu_gather batching APIs which
avoid the problem. It would be possible to do something similar for
GRU which would involve taking a reference for each page-to-be-invalidated
in invalidate_page, and release them when you invalidate_range. Or else
do some other scheme which makes mmu notifiers work similarly to the
mmu gather API. But not just go an invent something completely different

I don't think it would be that significant in terms of complexity or
scaling.

For a quick solution, you could stick a radix tree in each of your mmu
page tables, and locking is pretty scalable.

After that, I would really like to see whether the numbers justify
larger changes.

--

From: Jack Steiner
Date: Monday, March 3, 2008 - 12:15 pm

Correct. If the mmu_gather were passed on the mmuops callout and the callout were
done at the same point as the tlb_finish_mmu(), the GRU could
efficiently work w/o the range invalidates. A range invalidate might still
be slightly more efficient but not measureable so. The net difference is

I'm still concerned about performance. Each dropin would first have to access
an additional data structure that would most likely be non-node-local and
non-cache-resident. The net effect would be measurable but not a killer.

I haven't thought about locking requirements for the radix tree. Most accesses
would be read-only & updates infrequent. Any chance of an RCU-based radix
implementation?  Otherwise, don't we add the potential for hot locks/cachelines
for threaded applications ???
--

From: Peter Zijlstra
Date: Tuesday, March 4, 2008 - 3:35 am

The current radix tree implementation in the kernel is RCU capable. We
just don't have many RCU users yet.

--

From: Jack Steiner
Date: Tuesday, March 4, 2008 - 7:44 am

Ahhh. You are right. I thought I looked but obviously missed it.
--

From: Christoph Lameter
Date: Monday, March 3, 2008 - 12:02 pm

Still do not see how that could be done. The model here is tightly bound 
to ptes. AFAICT this could be implemented in arch code like the paravirt 
ops.

 
--

From: Christoph Lameter
Date: Monday, March 3, 2008 - 12:01 pm

Isnt this more a job for paravirt ops if it is so tightly bound to page 

API still has rcu issues and the example given for making things sleepable 
is only working for the aging callback. The most important callback is for 
try_to_unmao and page_mkclean. This means the API is still not generic 
enough and likely not extendable as needed in its present form.




--

From: Andrea Arcangeli
Date: Monday, March 3, 2008 - 2:15 pm

I converted only one of those _notify as an example of how it should
be done, because I assumed you volunteer to convert the other ones
yourself during .26. It's useless to convert all of them right now,
because the i_mmap_lock and anon_vma locks are still going to be
spinlocks in .25.
--

From: Nick Piggin
Date: Tuesday, March 4, 2008 - 5:37 pm

Um, it's bound to the *Linux page tables*, yes. And I have no idea why
you would use the paravirt ops for this.

 
--

From: Christoph Lameter
Date: Wednesday, March 5, 2008 - 11:48 am

paravirt ops allows interception of page table operations?

--

From: Nick Piggin
Date: Wednesday, March 5, 2008 - 7:59 pm

Maybe possible but it's totally the wrong API for it.

--

From: Nick Piggin
Date: Sunday, March 2, 2008 - 8:33 pm

[patch] mmu-v8: demacro


Remove the macros from mmu_notifier.h, in favour of functions.

This requires untangling the include order circular dependencies as well,
so just remove struct mmu_notifier_head in favour of just using the hlist
in mm_struct.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -55,12 +55,13 @@ struct mmu_notifier {
 
 #ifdef CONFIG_MMU_NOTIFIER
 
-struct mmu_notifier_head {
-	struct hlist_head head;
-};
-
 #include <linux/mm_types.h>
 
+static inline int mm_has_notifiers(struct mm_struct *mm)
+{
+	return unlikely(!hlist_empty(&mm->mmu_notifier_list));
+}
+
 /*
  * Must hold the mmap_sem for write.
  *
@@ -79,33 +80,59 @@ extern void mmu_notifier_register(struct
  */
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
 				    struct mm_struct *mm);
-extern void mmu_notifier_release(struct mm_struct *mm);
-extern int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+
+extern void __mmu_notifier_release(struct mm_struct *mm);
+extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long address);
+extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address);
+extern void __mmu_notifier_invalidate_range_begin(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
+
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_release(mm);
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					  unsigned long address)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_clear_flush_young(mm, address);
+	return 0;
+}
+
+static inline ...
From: Christoph Lameter
Date: Monday, March 3, 2008 - 12:03 pm

I like the patch.


--

From: Nick Piggin
Date: Sunday, March 2, 2008 - 8:34 pm

This one on top of the previous patch

[patch] mmu-v8: typesafe

Move definition of struct mmu_notifier and struct mmu_notifier_ops under
CONFIG_MMU_NOTIFIER to ensure they doesn't get dereferenced when they
don't make sense.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -3,8 +3,12 @@
 
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <linux/mm_types.h>
 
 struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
 
 struct mmu_notifier_ops {
 	/*
@@ -53,10 +57,6 @@ struct mmu_notifier {
 	const struct mmu_notifier_ops *ops;
 };
 
-#ifdef CONFIG_MMU_NOTIFIER
-
-#include <linux/mm_types.h>
-
 static inline int mm_has_notifiers(struct mm_struct *mm)
 {
 	return unlikely(!hlist_empty(&mm->mmu_notifier_list));
--

From: Christoph Lameter
Date: Monday, March 3, 2008 - 12:04 pm

The callbacks take a mmu_notifier parameter. So how does this compile for 
!MMU_NOTIFIER?

--

From: Nick Piggin
Date: Sunday, March 2, 2008 - 8:39 pm

Here is just a couple of checkpatch fixes on top of the last patches.

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h
+++ linux-2.6/include/linux/mmu_notifier.h
@@ -46,7 +46,7 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_begin)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);       
+				       unsigned long start, unsigned long end);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start, unsigned long end);
@@ -137,7 +137,7 @@ static inline void mmu_notifier_mm_init(
 #define ptep_clear_flush_notify(__vma, __address, __ptep)		\
 ({									\
 	pte_t __pte;							\
-	struct vm_area_struct * ___vma = __vma;				\
+	struct vm_area_struct *___vma = __vma;				\
 	unsigned long ___address = __address;				\
 	__pte = ptep_clear_flush(___vma, ___address, __ptep);		\
 	mmu_notifier_invalidate_page(___vma->vm_mm, ___address);	\
@@ -147,7 +147,7 @@ static inline void mmu_notifier_mm_init(
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
-	struct vm_area_struct * ___vma = __vma;				\
+	struct vm_area_struct *___vma = __vma;				\
 	unsigned long ___address = __address;				\
 	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
 	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
--

From: Andrea Arcangeli
Date: Monday, March 3, 2008 - 2:37 pm

The only difference are Nick's changes (thanks Nick, nice work!) plus
a fix to make it compile.

About the removal of _begin I'm not strongly opposed to it, but I
personally think that it's unnecessary if _begin avoids to build new
data structures with a fixed ram (and cpu) cost per_page_ and at the
same time deferring _end after the whole tlb_gather page freeing is
reducing the number of invalidates.

.26 will allow all the methods to sleep by following the roadmap
described in the #v8 patch.

KVM so far is swapping fine on top of this.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -228,6 +228,9 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
 	struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,194 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_ops {
+	/*
+	 * Called when nobody can register any more notifier in the mm
+	 * and after the "mn" notifier has been disarmed already.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * clear_flush_young is called after the VM is
+	 * test-and-clearing the young/accessed bitflag in the
+	 * pte. This way the VM will provide proper aging to the
+	 * accesses to the page through the secondary MMUs and not
+	 * only to the ones through the Linux pte.
+	 */
+	int ...
From: Andrea Arcangeli
Date: Monday, March 3, 2008 - 3:05 pm

Notably the registration now requires the mmap_sem in write mode.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 41962e7..e1287ab 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4583329..4067b0f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -642,6 +642,110 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 	account_shadowed(kvm, gfn);
 }
 
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+	struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+	get_page(page);
+	rmap_remove(kvm, spte);
+	set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+	kvm_flush_remote_tlbs(kvm);
+	__free_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte, *curr_spte;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!(*spte & PT_PRESENT_MASK));
+		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+		curr_spte = spte;
+		spte = rmap_next(kvm, rmapp, spte);
+		kvm_unmap_spte(kvm, curr_spte);
+	}
+}
+
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+	int i;
+
+	/*
+	 * If mmap_sem isn't taken, we can look the memslots with only
+	 * the mmu_lock by skipping over the slots with userspace_addr == 0.
+	 */
+	spin_lock(&kvm->mmu_lock);
+	for (i = 0; i < kvm->nmemslots; i++) {
+		struct kvm_memory_slot *memslot = &kvm->memslots[i];
+		unsigned long start = memslot->userspace_addr;
+		unsigned long end;
+
+		/* mmu_lock protects userspace_addr */
+		if (!start)
+			continue;
+
+		end = start + (memslot->npages << ...
From: izik eidus
Date: Monday, March 3, 2008 - 5:44 pm

i wrote to you about this before (i didnt get answer for this so i write 
again)

--

From: Christoph Lameter
Date: Tuesday, March 4, 2008 - 12:31 am

Stripped things down and did what Andrea and I talked about last Friday.
No invalidate_page callbacks. No ops anymore. Simple linked list for 
notifier. No RCU. Added the code to rmap.h and rmap.c (after all it is 
concerned with handling mappings).



This patch implements a simple callback for device drivers that establish
their own references to pages (KVM, GRU, XPmem, RDMA/Infiniband, DMA engines
etc). These references are unknown to the VM (therefore external).

With these callbacks it is possible for the device driver to release external
references when the VM requests it. This enables swapping, page migration and
allows support of remapping, permission changes etc etc for externally
mapped memory.

With this functionality it becomes possible to avoid pinning or mlocking
pages (commonly done to stop the VM from unmapping pages).

A device driver must subscribe to a process using

	emm_register_notifier

The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space. When the process terminates
the callback function is called with emm_release.

Callbacks are performed before and after the unmapping action of the VM.

	emm_invalidate_start	before
	emm_invalidate_end	after

Callbacks are mostly performed in a non atomic context. However, in
various places spinlocks are held to traverse rmaps. So this patch here
is only useful for those devices that can remove mappings in an atomic
context (f.e. KVM/GRU).

If the rmap traversal spinlocks are converted to semaphores then all 
callbacks willbe performed in a nonatomic context. Callouts can stay 
where they are.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm_types.h |    3 +
 include/linux/rmap.h     |   51 +++++++++++++++++++++++++++++++++
 kernel/fork.c            |    3 +
 mm/Kconfig               |    5 +++
 mm/filemap_xip.c         |    5 +++
 mm/fremap.c              |    2 +
 mm/hugetlb.c             |    4 ++
 mm/memory.c   ...
From: Christoph Lameter
Date: Tuesday, March 4, 2008 - 12:34 am

Not there but the system boots and is usable. Complains about atomic 
contexts because the tlb functions use a get_cpu() and thus disable preempt.

Not sure yet what to do about the cond_resched_lock stuff etc.


Convert i_mmap_lock to i_mmap_sem

The conversion to a rwsemaphore allows callbacks during rmap traversal
for files in a non atomic context. A rw style lock allows concurrent
walking of the reverse map.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/x86/mm/hugetlbpage.c |    4 ++--
 fs/hugetlbfs/inode.c      |    4 ++--
 fs/inode.c                |    2 +-
 include/linux/fs.h        |    2 +-
 include/linux/mm.h        |    2 +-
 kernel/fork.c             |    4 ++--
 mm/filemap.c              |    8 ++++----
 mm/filemap_xip.c          |    4 ++--
 mm/fremap.c               |    4 ++--
 mm/hugetlb.c              |   11 +++++------
 mm/memory.c               |   28 ++++++++--------------------
 mm/migrate.c              |    4 ++--
 mm/mmap.c                 |   16 ++++++++--------
 mm/mremap.c               |    4 ++--
 mm/rmap.c                 |   20 +++++++++-----------
 15 files changed, 51 insertions(+), 66 deletions(-)

Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c	2008-03-03 22:59:25.386848427 -0800
+++ linux-2.6/arch/x86/mm/hugetlbpage.c	2008-03-03 22:59:31.174878038 -0800
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
Index: ...
From: Andrea Arcangeli
Date: Tuesday, March 4, 2008 - 6:30 am

I could have ripped invalidate_page from my patch too, except I didn't
want to slow down those paths for the known-common-users when not even
GRU would get any benefit from two hooks when only one is needed.

When working with single pages it's more efficient and preferable to
call invalidate_page and only later release the VM reference on the
page.
--

From: Christoph Lameter
Date: Tuesday, March 4, 2008 - 12:00 pm

But as you pointed out before that path is a slow path anyways. Its rarely 
taken. Having a single eviction callback simplifies design.

Plus the device driver can still check if the mapping was of PAGE_SIZE and 
then implement its own optimization.
 
--

From: Andrea Arcangeli
Date: Tuesday, March 4, 2008 - 3:20 pm

It's a slow path but I don't see why you think two hooks are better
than one, when only one is necessary.

I once ripped invalidate_page while working on #v8 but then I
reintroduced it because I thought reducing the total number of hooks
was beneficial to the core linux VM (even if only a
microoptimization, I sure agree about that, but it's trivial to add
one hook instead of two hooks there, so a microoptimization was worth
it IMHO).

Your API is also too restrictive, if we'll happen to need one more
method that doesn't take just (start,end) we'll have to cause all
drivers to have significant changes instead of one-liners to use

IMHO the design is actually same and I don't understand why you
rewrote it once more time in a less flexibile way (on a style side
you're not even using hlist), dropping RCU (not sure how you replace
it with), etc....

Your implementation has the same bug you had in your first V1, see how
you're not clearing the spte young bits if the pte young bit is
set. Once you fix that, your change in the ptep_clear_flush_young path
will look remarkably similar to the patch I posted incremental with
#v8 to make ->clear_flush_young sleep capable...

Converging in a single design is great, but it'd be nice if we could
converge into a single implementation, and my last patch doesn't have
any bug and I think it's quite nicer too (also including Nick cleanup
work) but then I may be biased ;).

But as usual I'm entirely satisfied by your brand new EMM Notifier to
be merged and all perfecting work done on my MMU notifier patch over
the weeks by multiple developers (including you) to be dropped for
good, as long as we can enable the new advanced KVM features in
2.6.25.
--

From: Christoph Lameter
Date: Tuesday, March 4, 2008 - 3:35 pm

Well the problem is if one does not have the begin/end hooks then 
reliable clearing of the mapping may not be possible. begin/end allow
holding off new references and that avoids the issue that would come

What would that be? I think the API need to stay as simple as possible. 
And this set is pretty minimal and easy to understand. Not having the 

All of that is needed in order to allow sleeping in the future. Your 
version locks us into atomic callbacks. It also makes the API needlessly 
complex.


It is the atomic dead end that we want to avoid. And your patch is exactly 

Well I really want us to have one API that is suitable for multiple 
purposes and that allows a generic use by device drivers for multiple 
purposes. The discussion in the last month have made that possible. I am 
glad that you do not see any major issues with the patch. I sure wish I 
would not have to post a competing patchset because I want things to be 
merged ASAP and get this over with. But we need to have at minimum clear 
way to support sleeping with the existing API in the future.

--

From: Peter Zijlstra
Date: Tuesday, March 4, 2008 - 3:42 pm

Not really, if it requires moving the VM locks to sleepable locks under
a .config option, I think its also fair to require PREEMPT_RCU.

OTOH, if you want to unconditionally move the VM locks to sleepable
locks you have a point.

--

From: Christoph Lameter
Date: Tuesday, March 4, 2008 - 4:14 pm

Which would make the patchset pretty complex. RCU is not needed with a 
single linked list. Linked list operations can exploit atomic pointer 
updates and we only tear down the list when a single execution thread 
remains.


Having said that: Here a couple of updates to address Andrea's complaint 
that we not check the referenced bit from the external mapper when the 
rerferences bit is set on an OS pte.

Plus two barriers to ensure that a new emm notifier object becomes
visible before the base pointer is updated.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/rmap.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-03-04 14:36:36.321922321 -0800
+++ linux-2.6/mm/rmap.c	2008-03-04 15:10:46.159429369 -0800
@@ -298,10 +298,10 @@ static int page_referenced_one(struct pa
 
 	(*mapcount)--;
 	pte_unmap_unlock(pte, ptl);
-	if (!referenced)
-		/* rmap lock held */
-		referenced = emm_notify(mm, emm_referenced,
-					address, address + PAGE_SIZE);
+
+	/* rmap lock held */
+	if (emm_notify(mm, emm_referenced, address, address + PAGE_SIZE))
+			referenced = 1;
 out:
 	return referenced;
 }
@@ -1057,6 +1057,7 @@ EXPORT_SYMBOL_GPL(emm_notifier_release);
 void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
 {
 	e->next = mm->emm_notifier;
+	smp_wmb();
 	mm->emm_notifier = e;
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
@@ -1069,6 +1070,7 @@ int __emm_notify(struct mm_struct *mm, e
 	int x;
 
 	while (e) {
+		smp_rmb();
 		if (e->func) {
 			x = e->func(e, mm, op, start, end);
 			if (x)
--

From: Peter Zijlstra
Date: Tuesday, March 4, 2008 - 4:25 pm

We generally require comments around barriers..

--

From: Peter Zijlstra
Date: Tuesday, March 4, 2008 - 4:30 pm

FWIW, I'll cut the kvm and openfabrics lists from any future posts.
I'm getting tired of the bounces.

--

From: Avi Kivity
Date: Tuesday, March 4, 2008 - 10:09 pm

Isn't that out of the question for .25?

I really wish we can get the atomic variant in now, and add on 
sleepability in .26, updating users if necessary.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Robin Holt
Date: Wednesday, March 5, 2008 - 2:47 am

I keep hearing this mantra.  What is so compelling about the .25
release?  When seems to be more important than what.  While I understand
product release cycles, etc. and can certainly agree with them. I would
like to know with what I am being asked to agree.

That said, I agree we should probably finish getting the comments on
Andrea's most recent patch, if any, cleared up and put that one in.

Robin
--

From: Avi Kivity
Date: Wednesday, March 5, 2008 - 2:53 am

kvm gained the ability to swap in 2.6.25.  Without mmu notifiers, 

Great.  Thanks.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Dor Laor
Date: Wednesday, March 5, 2008 - 3:02 am

The main reason is that several kvm exciting features are dependent on
mmu notifiers:
- It enables full guest swapping (as opposed to partial today)
- It enables memory ballooning
- It enables running Izik Eidus's Kernel Shared Pages module that unify
  guest pages together.

The patchset is kernel-internal, stable and reviewed. Even if the
interface will be changed in .26 it won't have noticeable effect.

So since its stable, internal, reviewed, needed to enable important kvm
features we like to see it in for .25.

Regards,

--

From: Andrea Arcangeli
Date: Friday, March 7, 2008 - 8:17 am

I preferred to answer with code to avoid any possible misunderstanding
(I through already tried to explain with words and I obviously failed
miserably if you ended up writing such an erratic weird claim like
above ;).

This below simple patch invalidates the "invalidate_page" part, the
next patch will invalidate the RCU part, and btw in a way that doesn't
forbid unregistering the mmu notifiers at runtime (like your brand new
EMM does).

This is incremental with my #v9. I still ask Andrew/Linus to merge the
#v9 patch I posted a few days ago in .25 so KVM/GRU will be 100%
covered in a optimal way on all respects and with maximum flexibility
for future changes of API (to allow for future methods that may take
more than start,end, this was pointed out once by both me and Avi). My
#v9 is zero risk for .25 and it sure worth merging now.

Then in .26 we'll modify the semantics of the API to be blocking
starting with the below patchx. This is a kernel _internal_ API, and
we aren't distributions that have to respect kabi here, but even if we
were, making methods sleepable is a 100% backwards compatible
semantical change, so there's no possible reason to defer the #v9
merging. The changes in .26 will be transparent to any user (even if
they don't need to! even if we turn out to be totally wrong about .26
requiring a minor change of API everything will be perfectly
fine). Nothing of this is visible to userland so we can change it at
any time as we wish.

The reason I keep this incremental (unlike your EMM that does
everything all at the same time mixed in a single patch) is to
decrease the non obviously safe mangling over mm/* during .25. The
below patch is simple, but not as obviously safe as
s/ptep_clear_flush/ptep_clear_flush_notify/.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -134,27 +134,6 @@ static inline void ...
From: Andrea Arcangeli
Date: Friday, March 7, 2008 - 8:23 am

This combines the non-sleep-capable RCU locking of #v9 with a seqlock
so the mmu notifier fast path will require zero cacheline
writes/bouncing while still providing mmu_notifier_unregister and
allowing to schedule inside the mmu notifier methods. If we drop
mmu_notifier_unregister we can as well drop all seqlock and
rcu_read_lock()s. But this locking scheme combination is sexy enough
and 100% scalable (the mmu_notifier_list cacheline will be preloaded
anyway and that will most certainly include the sequence number value
in l1 for free even in Christoph's NUMA systems) so IMHO it worth to
keep mmu_notifier_unregister.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/seqlock.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -230,6 +231,7 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_MMU_NOTIFIER
 	struct hlist_head mmu_notifier_list;
+	seqlock_t mmu_notifier_lock;
 #endif
 };
 
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -130,6 +130,7 @@ static inline void mmu_notifier_mm_init(
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
 	INIT_HLIST_HEAD(&mm->mmu_notifier_list);
+	seqlock_init(&mm->mmu_notifier_lock);
 }
 
 
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -20,7 +20,9 @@ void __mmu_notifier_release(struct mm_st
 void __mmu_notifier_release(struct mm_struct *mm)
 {
 	struct mmu_notifier *mn;
+	unsigned seq;
 
+	seq = read_seqbegin(&mm->mmu_notifier_lock);
 	while (unlikely(!hlist_empty(&mm->mmu_notifier_list))) {
 		mn = hlist_entry(mm->mmu_notifier_list.first,
 				 struct mmu_notifier,
@@ -28,6 +30,7 @@ void ...
From: Andrea Arcangeli
Date: Friday, March 7, 2008 - 8:52 am

This is a rediff of Christoph's plain i_mmap_lock2rwsem patch on top
of #v9 1/4 + 2/4 + 3/4 (hence this is called 4/4). This is mostly to
show that after 3/4, any patch that plugs on the EMM patchset will
plug nicely on top of my MMU notifer patchset too.

The patch trigger bug checks here in modprobe:

    BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);

kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 252k freed
------------[ cut here ]------------
kernel BUG at mm/mmap.c:2063!
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in:
Pid: 1123, comm: modprobe.sh Not tainted 2.6.25-rc3 #22
RIP: 0010:[<ffffffff80269368>]  [<ffffffff80269368>] exit_mmap+0xef/0xfa
RSP: 0000:ffff81003c79bed8  EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff810001004840 RCX: ffff81003c79bee0
RDX: 0000000000000000 RSI: ffff81003c5e8918 RDI: ffff81003d8048c0
RBP: 0000000000000000 R08: 0000000000000008 R09: ffff810002c00040
R10: 0000000000000002 R11: ffff810001009180 R12: ffff81003c57b800
R13: 0000000000000000 R14: 00000000005f0db0 R15: 00007fff3f2af234
FS:  00007f283714b6f0(0000) GS:ffffffff80694000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000458f40 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe.sh (pid: 1123, threadinfo ffff81003c79a000, task ffff81003cf9ca50)
Stack:  0000000000000091 ffff810001004840 ffff81003c57b800 ffff81003c57b880
 0000000000000000 ffffffff8022f7bf 0000000000000001 0000000000000001
 ffff81003cf9ca50 ffffffff802349b6 0000000000000292 ffffffff80354c63
Call Trace:
 [<ffffffff8022f7bf>] mmput+0x30/0x9d
 [<ffffffff802349b6>] do_exit+0x223/0x66c
 [<ffffffff80354c63>] __up_read+0x13/0x8a
 [<ffffffff80234e6e>] do_group_exit+0x6f/0x8a
 ...
From: Christoph Lameter
Date: Friday, March 7, 2008 - 1:03 pm

Yes this was never intended for .25. I think we need to split this into a 
copule of patches. One needs to get rid of the spinlock dropping, then one 
that deals with the read concurrency issues and finally one that converts 
the spinlock. Thanks for looking at it.

--

From: Christoph Lameter
Date: Wednesday, March 19, 2008 - 2:27 pm

You need this patch to address the issues (that I already mentioned when I 
sent the patch to you). New EMM notifier patch with sleeping coming soon.

From: Christoph Lameter <clameter@sgi.com>
Subject: Move tlb flushing into free_pgtables

Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables. Moving the tlb flushing into free_pgtables allows
sleeping in part of free_pgtables().

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm.h |    4 ++--
 mm/memory.c        |   14 ++++++++++----
 mm/mmap.c          |    6 +++---
 3 files changed, 15 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-03-19 13:30:51.460856986 -0700
+++ linux-2.6/include/linux/mm.h	2008-03-19 13:31:20.809377398 -0700
@@ -751,8 +751,8 @@ int walk_page_range(const struct mm_stru
 		    void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-		unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+						unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-03-19 13:29:06.007351495 -0700
+++ linux-2.6/mm/memory.c	2008-03-19 13:46:31.352774359 -0700
@@ -271,9 +271,11 @@ void free_pgd_range(struct mmu_gather **
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-		unsigned long floor, unsigned ...
From: Andrea Arcangeli
Date: Friday, March 7, 2008 - 10:50 am

My objective was to allow mmu_notifier_register/unregister to be
called with the same mmu notifier object, I didn't mean the object
could have been freed until ->release is called. However you reminded
me that after unregistering ->release won't be called so unregister
isn't very useful and I doubt we can keep it ;).

In the meantime I've also been thinking that we could need the
write_seqlock in mmu_notifier_register, to know when to restart the
loop if somebody does a mmu_notifier_register;
synchronize_rcu(). Otherwise there's no way to be sure the mmu
notifier will start firing immediately after synchronize_rcu. I'm
unsure if it's acceptable that in-progress mmu notifier invocations,
don't need to notice the fact that somebody did mmu_notifier_register;
synchronize_rcu. If they don't need to notice, then we can just drop
unregister and all rcu_read_lock()s instead of adding write_seqlock to
the register operation.

Overall my effort is to try to avoid expand the list walk with
explicit memory barriers like in EMM while trying to be equally
efficient.

Another issue is that the _begin/_end logic doesn't provide any
guarantee that the _begin will start firing before _end, if a kernel
module is loaded while another cpu is already running inside some
munmap operation etc.. The KVM usage of mmu notifier has no problem
with that detail, but KVM doesn't use _begin at all, I wonder if
others would have problems. This is a kind of a separate problem, but
quite related to the question if the notifiers must be guaranteed to
start firing immediately after mmu_notifier_unregister;synchronize_rcu
or not, that's why I mentioned it here.

Once I get comments on the suggested direction for these details, I'll
quickly repost a replacement patch for 3/4.

Thanks Peter!
--

From: Peter Zijlstra
Date: Friday, March 7, 2008 - 11:01 am

I think we can do with a smb_wmb(); like Christoph (and like
hlist_add_rcu()), but replace the smb_rmb() Christoph has with a
smp_read_barrier_depends().

That should give much the same results.

The reason Christoph can do without RCU is because he doesn't allow
unregister, and as soon as you drop that you'll end up with something

Curious problem indeed. Would it make sense to require registering these
MMU notifiers when the process is still single threaded along with the
requirement that they can never be removed again from a running process?

For KVM this should be quite doable, but I must admit I haven't been
paying enough attention to know if its possible for these other users.


--

From: Andrea Arcangeli
Date: Friday, March 7, 2008 - 11:45 am

Not sure to follow, what do you mean "he doesn't allow"? We'll also
have to rip unregister regardless after you pointed out the ->release
won't be called after calling my mmu_notifier_unregister in 3/4. If
you figured out how to retain mmu_notifier_unregister I'm not seeing

I'm afraid that won't help much (even if the mmu notifiers users could
cope with that restriction like KVM can) because the VM will run
concurrently in another CPU despite the task is single threaded. See
2/4 in try_to_unmap_cluster: _start/end are not only invoked in the
context of the current task.

PS. this problem I pointed out of _end possibly called before _begin
is the same for #v9 and EMM V1 as far as I can tell.
--

From: Andrea Arcangeli
Date: Friday, March 7, 2008 - 12:47 pm

Given I don't see other (buggy ;) ways anymore to retain
mmu_notifier_unregister, I did like in EMM and I dropped the
unregister function.

To me it looks like this will be enough and equally efficient as the
expanded version in EMM that is not using the highlevel hlist_rcu
macros. If you can see any pitfall let me know! Thanks a lot for the
help.

------
This is a replacement for the previously posted 3/4, one of the pieces
to allow the mmu notifier methods to sleep.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -70,17 +70,6 @@ static inline int mm_has_notifiers(struc
  */
 extern void mmu_notifier_register(struct mmu_notifier *mn,
 				  struct mm_struct *mm);
-/*
- * Must hold the mmap_sem for write.
- *
- * RCU is used to traverse the list. A quiescent period needs to pass
- * before the "struct mmu_notifier" can be freed. Alternatively it
- * can be synchronously freed inside ->release when the list can't
- * change anymore and nobody could possibly walk it.
- */
-extern void mmu_notifier_unregister(struct mmu_notifier *mn,
-				    struct mm_struct *mm);
-
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long address);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -43,12 +43,10 @@ int __mmu_notifier_clear_flush_young(str
 	struct hlist_node *n;
 	int young = 0;
 
-	rcu_read_lock();
 	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist) {
 		if (mn->ops->clear_flush_young)
 			young |= mn->ops->clear_flush_young(mn, mm, address);
 	}
-	rcu_read_unlock();
 
 	return young;
 }
@@ -59,12 +57,10 @@ void __mmu_notifier_invalidate_page(stru
 	struct mmu_notifier *mn;
 	struct hlist_node *n;
 
-	rcu_read_lock();
 	hlist_for_each_entry_rcu(mn, ...
From: Christoph Lameter
Date: Friday, March 7, 2008 - 1:15 pm

Looks good. That is what we talked about last week. What guarantees now 
that we see the cacheline referenced after the cacheline that 
contains the pointer that was changed? hlist_for_reach does a 
rcu_dereference with implied memory barrier? So its like EMM?
--

From: Christoph Lameter
Date: Friday, March 7, 2008 - 1:12 pm

Hmmm.. We could just push that on the driver saying that is has to 
tolerate it. Otherwise how can we solve this?

--

From: Christoph Lameter
Date: Friday, March 7, 2008 - 1:10 pm

The smp_rmb is such a big problem? You have seqlock, rcu etc all in there 

Ahh. Yes that is an interesting issue. If a device driver cannot handle 
this then _begin must prohibit module loading. That means not allowing 
stop_machine_run I guess which should not be that difficult.
--

From: Christoph Lameter
Date: Friday, March 7, 2008 - 1:00 pm

Well its adds lots of processing. Not sure if its really worth it. Seems 
that this scheme cannot work since the existence of the structure passed 
to the callbacks is not guaranteed since the RCU locks are not held. You 

So this is only for sanity checking? The BUG_ON detects concurrent 


Well that gets pretty sophisticated here. If you drop the rcu lock then 
the entity pointed to by mn can go away right? So how can you pass that 
structure to clear_flush_young? What is guaranteeing the existence of the 

Ditto structure can vanish since no existence guarantee exists.


--

From: Christoph Lameter
Date: Friday, March 7, 2008 - 12:54 pm

There was never a chance to merge for .25. Lets drop that and focus on 


Hmmmm.. Okay you going for range invalidate here like EMM but there are 
still some invalidate_pages() left.

--

From: Andrea Arcangeli
Date: Tuesday, March 4, 2008 - 6:21 am

Hello Izik,


Ouch I must have lost your previous comment with a too-fast pgdown in

Right, thanks!!
--

From: Nick Piggin
Date: Wednesday, February 20, 2008 - 9:47 pm

I think we just have to make sure that it _can_ do writeprotect
faults. AFAIKS, that will be possible if the driver registers a
.page_mkwrite handler (actually not quite -- page_mkwrite is fairly
crap, so I have a patch to merge it together with .fault so we get
address information as well). Anyway, I really think we should do

Possibly, but hopefully not needed for performance. Let's wait and
see.
--

From: Robin Holt
Date: Tuesday, February 19, 2008 - 7:49 pm

I don't believe it should, but it probably does right now.  I do know
the case where a write fault where there is no need for a COW does
not call out on the PTE change.  I see no reason the others should not
handle this as well.  Just off the top of my head, I can only think of
the mprotect case needing to special case the more permissive state and
I don't think that changes PTEs at all, merely updates the VMA.

Thanks,
Robin
--

From: Christoph Lameter
Date: Wednesday, February 27, 2008 - 3:56 pm

Correct. If you find such places then we can avoid the invalidates there.
 
--

Previous thread: Re: 2.6.24-git4+ regression by Mike Galbraith on Tuesday, February 19, 2008 - 1:15 am. (1 message)

Next thread: [PATCH] x86_64: fix dma_alloc_pages by Yinghai Lu on Tuesday, February 19, 2008 - 3:21 am. (2 messages)