login
Header Space

 
 

Re: [PATCH 08 of 11] anon-vma-rwsem

Previous thread: [PATCH] MN10300: Make cpu_relax() invoke barrier() by David Howells on Wednesday, May 7, 2008 - 10:31 am. (1 message)

Next thread: Re: 2.6.25.2 - Jiffies/Time jumping back and forth (Regereesion from 2.6.24) by Martin Knoblauch on Wednesday, May 7, 2008 - 11:02 am. (1 message)
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, <akpm@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 10:35 am

Hello,

this is the last update of the mmu notifier patch.

Jack asked a __mmu_notifier_register to call under mmap_sem in write mode.

Here an update with that change plus allowing -&gt;release not to be implemented
(two liner change to mmu_notifier.c).

The entire diff between v15 and v16 mmu-notifier-core was posted in separate
email.
--
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, <akpm@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 10:36 am

# HG changeset patch
# User Andrea Arcangeli &lt;andrea@qumranet.com&gt;
# Date 1210115797 -7200
# Node ID 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b
# Parent  94eaa1515369e8ef183e2457f6f25a7f36473d70
export zap_page_range for XPMEM

XPMEM would have used sys_madvise() except that madvise_dontneed()
returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator.  XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson &lt;dcn@sgi.com&gt;
Signed-off-by: Andrea Arcangeli &lt;andrea@qumranet.com&gt;

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a
 
 	return unmap_vmas(vma, address, end, &amp;nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.
--
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, <akpm@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 10:36 am

# HG changeset patch
# User Andrea Arcangeli &lt;andrea@qumranet.com&gt;
# Date 1210115798 -7200
# Node ID eb924315351f6b056428e35c983ad28040420fea
# Parent  5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b
mmap sems

This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson &lt;dcn@sgi.com&gt;
Signed-off-by: Andrea Arcangeli &lt;andrea@qumranet.com&gt;

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki
  *
  *  -&gt;i_mutex			(generic_file_buffered_write)
  *    -&gt;mmap_sem		(fault_in_pages_readable-&gt;do_page_fault)
+ *
+ *    When taking multiple mmap_sems, one should lock the lowest-addressed
+ *    one first proceeding on up to the highest-addressed one.
  *
  *  -&gt;i_mutex
  *    -&gt;i_alloc_sem             (various)
--
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, <akpm@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 10:35 am

# HG changeset patch
# User Andrea Arcangeli &lt;andrea@qumranet.com&gt;
# Date 1210115508 -7200
# Node ID 94eaa1515369e8ef183e2457f6f25a7f36473d70
# Parent  6b384bb988786aa78ef07440180e4b2948c4c6a2
mm_lock-rwsem

Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.

Signed-off-by: Andrea Arcangeli &lt;andrea@qumranet.com&gt;

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc
 				   unsigned long flags, struct page **pages);
 
 struct mm_lock_data {
-	spinlock_t **i_mmap_locks;
-	spinlock_t **anon_vma_locks;
-	size_t nr_i_mmap_locks;
-	size_t nr_anon_vma_locks;
+	struct rw_semaphore **i_mmap_sems;
+	struct rw_semaphore **anon_vma_sems;
+	size_t nr_i_mmap_sems;
+	size_t nr_anon_vma_sems;
 };
 extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
 extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st
 
 static int mm_lock_cmp(const void *a, const void *b)
 {
-	unsigned long _a = (unsigned long)*(spinlock_t **)a;
-	unsigned long _b = (unsigned long)*(spinlock_t **)b;
+	unsigned long _a = (unsigned long)*(struct rw_semaphore **)a;
+	unsigned long _b = (unsigned long)*(struct rw_semaphore **)b;
 
 	cond_resched();
 	if (_a &lt; _b)
@@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co
 	return 0;
 }
 
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems,
 				  int anon)
 {
 	struct vm_area_struct *vma;
@@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct
 	for (vma = mm-&gt;mmap; vma; vma = vma-&gt;vm_next) {
 		if (anon) {
 			if (vma-&gt;anon_vma)
-				locks[i++] = &amp;vma-&gt;anon_vma-&gt;lock;
+				sems[i++] = &amp;vma-&gt;...
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, <akpm@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 10:35 am

# HG changeset patch
# User Andrea Arcangeli &lt;andrea@qumranet.com&gt;
# Date 1210115136 -7200
# Node ID 6b384bb988786aa78ef07440180e4b2948c4c6a2
# Parent  58f716ad4d067afb6bdd1b5f7042e19d854aae0d
anon-vma-rwsem

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It includes possible concurrency.

Rcu is used in some context to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allow us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This results in f.e. the Aim9 brk
  performance test to got down by 10-15%.

Signed-off-by: Christoph Lameter &lt;cl...
To: Andrea Arcangeli <andrea@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 4:56 pm

This also looks very debatable indeed. The only performance numbers quoted 

which just seems like a total disaster.

The whole series looks bad, in fact. Lack of authorship, bad single-line 
description, and the code itself sucks so badly that it's not even funny.

NAK NAK NAK. All of it. It stinks.

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 5:26 pm

Glad you agree. Note that the fact the whole series looks bad, is
_exactly_ why I couldn't let Christoph keep going with
mmu-notifier-core at the very end of his patchset. I had to move it at
the top to have a chance to get the KVM and GRU requirements merged
in 2.6.26.

I think the spinlock-&gt;rwsem conversion is ok under config option, as
you can see I complained myself to various of those patches and I'll
take care they're in a mergeable state the moment I submit them. What
XPMEM requires are different semantics for the methods, and we never
had to do any blocking I/O during vmtruncate before, now we have to.
And I don't see a problem in making the conversion from
spinlock-&gt;rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on
anything but ia64.

Please ignore all patches but mmu-notifier-core. I regularly forward
_only_ mmu-notifier-core to Andrew, that's the only one that is in
merge-ready status, everything else is just so XPMEM can test and we
can keep discussing it to bring it in a mergeable state like
mmu-notifier-core already is.
--
To: Andrea Arcangeli <andrea@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 6:42 pm

That is currently true but we are also working on XPMEM for x86_64.
The new XPMEM code should be posted within a few weeks.

--- jack
--
To: Andrea Arcangeli <andrea@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 5:36 pm

I really suspect we don't really have to, and that it would be better to 

The thing is, I didn't like that one *either*. I thought it was the 
biggest turd in the series (and by "biggest", I literally mean "most lines 
of turd-ness" rather than necessarily "ugliest per se").

I literally think that mm_lock() is an unbelievable piece of utter and 
horrible CRAP.

There's simply no excuse for code like that.

If you want to avoid the deadlock from taking multiple locks in order, but 
there is really just a single operation that needs it, there's a really 
really simple solution.

And that solution is *not* to sort the whole damn f*cking list in a 
vmalloc'ed data structure prior to locking!

Damn.

No, the simple solution is to just make up a whole new upper-level lock, 
and get that lock *first*. You can then take all the multiple locks at a 
lower level in any order you damn well please. 

And yes, it's one more lock, and yes, it serializes stuff, but:

 - that code had better not be critical anyway, because if it was, then 
   the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_

 - parallelism is overrated: it doesn't matter one effing _whit_ if 
   something is a hundred times more parallel, if it's also a hundred 
   times *SLOWER*.

So dang it, flush the whole damn series down the toilet and either forget 
the thing entirely, or re-do it sanely.

And here's an admission that I lied: it wasn't *all* clearly crap. I did 
like one part, namely list_del_init_rcu(), but that one should have been 
in a separate patch. I'll happily apply that one.

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 8:38 pm

That fix is going to be fairly difficult.  I will argue impossible.

First, a little background.  SGI allows one large numa-link connected
machine to be broken into seperate single-system images which we call
partitions.

XPMEM allows, at its most extreme, one process on one partition to
grant access to a portion of its virtual address range to processes on
another partition.  Those processes can then fault pages and directly
share the memory.

In order to invalidate the remote page table entries, we need to message
(uses XPC) to the remote side.  The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range().  Between the
messaging and the acquiring a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier invalidate_*
callouts().

On a side note, we currently have XPMEM working on x86_64 SSI, and
ia64 cross-partition.  We are in the process of getting XPMEM working
on x86_64 cross-partition in support of UV.

Thanks,
Robin Holt
--
To: Robin Holt <holt@...>
Cc: Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Tuesday, May 13, 2008 - 8:06 am

Why do you need to take mmap_sem in order to shoot down pagetables of
the process? It would be nice if this can just be done without
sleeping.
--
To: Nick Piggin <nickpiggin@...>
Cc: Robin Holt <holt@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Tuesday, May 13, 2008 - 11:32 am

We are trying to shoot down page tables of a different process running
on a different instance of Linux running on Numa-link connected portions
of the same machine.

The messaging is clearly going to require sleeping.  Are you suggesting
we need to rework XPC communications to not require sleeping?  I think
that is going to be impossible since the transfer engine requires a
sleeping context.

Additionally, the call to zap_page_range expects to have the mmap_sem
held.  I suppose we could use something other than zap_page_range and
atomically clear the process page tables.  Doing that will not alleviate
the need to sleep for the messaging to the other partitions.

Thanks,
Robin
--
To: Robin Holt <holt@...>
Cc: Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 14, 2008 - 12:11 am

Right. You can zap page tables without sleeping, if you're careful. I
don't know that we quite do that for anonymous pages at the moment, but it

I guess that you have found a way to perform TLB flushing within coherent
domains over the numalink interconnect without sleeping. I'm sure it would
be possible to send similar messages between non coherent domains.

So yes, I'd much rather rework such highly specialized system to fit in
closer with Linux than rework Linux to fit with these machines (and

zap_page_range does not expect to have mmap_sem held. I think for anon
pages it is always called with mmap_sem, however try_to_unmap_anon is
not (although it expects page lock to be held, I think we should be able

No, but I'd venture to guess that is not impossible to implement even
on your current hardware (maybe a firmware update is needed)?

--
To: Nick Piggin <npiggin@...>
Cc: Robin Holt <holt@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 14, 2008 - 7:26 am

I assume by coherent domains, your are actually talking about system
images.  Our memory coherence domain on the 3700 family is 512 processors
on 128 nodes.  On the 4700 family, it is 16,384 processors on 4096 nodes.
We extend a "Read-Exclusive" mode beyond the coherence domain so any
processor is able to read any cacheline on the system.  We also provide
uncached access for certain types of memory beyond the coherence domain.

For the other partitions, the exporting partition does not know what
virtual address the imported pages are mapped.  The pages are frequently
mapped in a different order by the MPI library to help with MPI collective
operations.

For the exporting side to do those TLB flushes, we would need to replicate
all that importing information back to the exporting side.

Additionally, the hardware that does the TLB flushing is protected
by a spinlock on each system image.  We would need to change that
simple spinlock into a type of hardware lock that would work (on 3700)
outside the processors coherence domain.  The only way to do that is to
use uncached addresses with our Atomic Memory Operations which do the
cmpxchg at the memory controller.  The uncached accesses are an order

But it isn't that we are having a problem adapting to just the hardware.

zap_page_range calls unmap_vmas which walks to vma-&gt;next.  Are you saying
that can be walked without grabbing the mmap_sem at least readably?
I feel my understanding of list management and locking completely

Are you suggesting the sending side would not need to sleep or the
receiving side?  Assuming you meant the sender, it spins waiting for the
remote side to acknowledge the invalidate request?  We place the data
into a previously agreed upon buffer and send an interrupt.  At this
point, we would need to start spinning and waiting for completion.
Let's assume we never run out of buffer space.

The receiving side receives an interrupt.  The interrupt currently wakes
an XPC thread to do the work of transfering...
To: Robin Holt <holt@...>
Cc: Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Thursday, May 15, 2008 - 3:57 am

Right. Or the exporting side could be passed tokens that it tracks itself,

I'm not sure if you're thinking about what I'm thinking of. With the
scheme I'm imagining, all you will need is some way to raise an IPI-like
interrupt on the target domain. The IPI target will have a driver to
handle the interrupt, which will determine the mm and virtual addresses
which are to be invalidated, and will then tear down those page tables
and issue hardware TLB flushes within its domain. On the Linux side,


Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range. But

FWIW, mmap_sem isn't held to protect vma-&gt;next there anyway, because at
that point the vmas are detached from the mm's rbtree and linked list.


Sure, you obviously would need to rework your code because it's been
written with the assumption that it can sleep.

What is XPMEM exactly anyway? I'd assumed it is a Linux driver.

--
To: Nick Piggin <npiggin@...>
Cc: Robin Holt <holt@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Thursday, May 15, 2008 - 1:33 pm

How is that synchronized with code that walks the same pagetable. These 
walks may not hold mmap_sem either. I would expect that one could only 
remove a portion of the pagetable where we have some sort of guarantee 
that no accesses occur. So the removal of the vma prior ensures that?


--
To: Christoph Lameter <clameter@...>
Cc: Robin Holt <holt@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Thursday, May 15, 2008 - 7:52 pm

I don't really understand the question. If you remove the pte and invalidate
the TLBS on the remote image's process (importing the page), then it can
of course try to refault the page in because it's vma is still there. But
you catch that refault in your driver , which can prevent the page from
being faulted back in.
--
To: Nick Piggin <npiggin@...>
Cc: Christoph Lameter <clameter@...>, Robin Holt <holt@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Friday, May 16, 2008 - 7:23 am

I think Christoph's question has more to do with faults that are
in flight.  A recently requested fault could have just released the
last lock that was holding up the invalidate callout.  It would then
begin messaging back the response PFN which could still be in flight.
The invalidate callout would then fire and do the interrupt shoot-down
while that response was still active (essentially beating the inflight
response).  The invalidate would clear up nothing and then the response
would insert the PFN after it is no longer the correct PFN.

Thanks,
Robin
--
To: Nick Piggin <npiggin@...>
Cc: Christoph Lameter <clameter@...>, Robin Holt <holt@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Friday, May 16, 2008 - 7:50 am

I just looked over XPMEM.  I think we could make this work.  We already
have a list of active faults which is protected by a simple spinlock.
I would need to nest this lock within another lock protected our PFN
table (currently it is a mutex) and then the invalidate interrupt handler
would need to mark the fault as invalid (which is also currently there).

I think my sticking points with the interrupt method remain at fault
containment and timeout.  The inability of the ia64 processor to handle
provide predictive failures for the read/write of memory on other
partitions prevents us from being able to contain the failure.  I don't
think we can get the information we would need to do the invalidate
without introducing fault containment issues which has been a continous
area of concern for our customers.


Thanks,
Robin
--
To: Robin Holt <holt@...>
Cc: Christoph Lameter <clameter@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Tuesday, May 20, 2008 - 1:31 am

Really? You can get the information through via a sleeping messaging API,
but not a non-sleeping one? What is the difference from the hardware POV?

--
To: Nick Piggin <npiggin@...>
Cc: Robin Holt <holt@...>, Christoph Lameter <clameter@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Tuesday, May 20, 2008 - 6:01 am

That was covered in the early very long discussion about 28 seconds.
The read timeout for the BTE is 28 seconds and it automatically retried
for certain failures.  In interrupt context, that is 56 seconds without
any subsequent interrupts of that or lower priority.

Thanks,
Robin
--
To: Robin Holt <holt@...>
Cc: Christoph Lameter <clameter@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Tuesday, May 20, 2008 - 6:50 am

I thought you said it would be possible to get the required invalidate
information without using the BTE. Couldn't you use XPMEM pages in
the kernel to read the data out of, if nothing else?
--
To: Nick Piggin <npiggin@...>
Cc: Robin Holt <holt@...>, Christoph Lameter <clameter@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Tuesday, May 20, 2008 - 7:05 am

I was wrong about that.  I thought it was safe to do an uncached write,
but it turns out any processor write is uncontained and the MCA that
surfaces would be fatal.  Likewise for the uncached read.
--
To: Robin Holt <holt@...>
Cc: Christoph Lameter <clameter@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Tuesday, May 20, 2008 - 7:14 am

Oh, so the BTE transfer is purely for fault isolation. I was thinking
you guys might have sufficient control of the hardware to be able to
do it at the level of CPU memory operations, but if it is some
limitation of ia64, then I guess that's a problem.

How do you do fault isolation of userspace XPMEM accesses?
--
To: Nick Piggin <npiggin@...>
Cc: Robin Holt <holt@...>, Christoph Lameter <clameter@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Tuesday, May 20, 2008 - 7:26 am

The MCA handler can see the fault was either in userspace (processor
priviledge level I believe) or in the early kernel entry where it is
saving registers.  When it sees that condition, it kills the users
process.  While in kernel space, there is no equivalent of the saving
user state that forces the processor stall.
--
To: Nick Piggin <npiggin@...>
Cc: Robin Holt <holt@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Thursday, May 15, 2008 - 7:01 am

We are pursuing Linus' suggestion currently.  This discussion is
completely unrelated to that work.


We would need to deposit the payload into a central location to do the
invalidate, correct?  That central location would either need to be
indexed by physical cpuid (65536 possible currently, UV will push that
up much higher) or some sort of global id which is difficult because
remote partitions can reboot giving you a different view of the machine
and running partitions would need to be updated.  Alternatively, that
central location would need to be protected by a global lock or atomic
type operation, but a majority of the machine does not have coherent
access to other partitions so they would need to use uncached operations.
Essentially, take away from this paragraph that it is going to be really
slow or really large.

Then we need to deposit the information needed to do the invalidate.

Lastly, we would need to interrupt.  Unfortunately, here we have a
thundering herd.  There could be up to 16256 processors interrupting the
same processor.  That will be a lot of work.  It will need to look up the
mm (without grabbing any sleeping locks in either xpmem or the kernel)
and do the tlb invalidates.

Unfortunately, the sending side is not free to continue (in most cases)
until it knows that the invalidate is completed.  So it will need to spin
waiting for a completion signal will could be as simple as an uncached
word.  But how will it handle the possible failure of the other partition?
How will it detect that failure and recover?  A timeout value could be
difficult to gauge because the other side may be off doing a considerable

It is an assumption based upon some of the kernel functions we call
doing things like grabbing mutexes or rw_sems.  That pushes back to us.
I think the kernel's locking is perfectly reasonable.  The problem we run
into is we are trying to get from one context in one kernel to a different

XPMEM allows one process to make a portion of its virtual address r...
To: Robin Holt <holt@...>
Cc: Nick Piggin <npiggin@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Thursday, May 15, 2008 - 7:12 am

You don't need to interrupt every time.  Place your data in a queue (you 
do support rmw operations, right?) and interrupt.  Invalidates from 
other processors will see that the queue hasn't been processed yet and 
skip the interrupt.

-- 
error compiling committee.c: too many arguments to function

--
To: Robin Holt <holt@...>
Cc: Nick Piggin <npiggin@...>, Nick Piggin <nickpiggin@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 14, 2008 - 11:18 am

One thing to realize is that most of the time (read: pretty much *always*) 
when we have the problem of wanting to sleep inside a spinlock, the 
solution is actually to just move the sleeping to outside the lock, and 
then have something else that serializes things.

That way, the core code (protected by the spinlock, and in all the hot 
paths) doesn't sleep, but the special case code (that wants to sleep) can 
have some other model of serialization that allows sleeping, and that 
includes as a small part the spinlocked region.

I do not know how XPMEM actually works, or how you use it, but it 
seriously sounds like that is how things *should* work. And yes, that 
probably means that the mmu-notifiers as they are now are simply not 
workable: they'd need to be moved up so that they are inside the mmap 
semaphore but not the spinlocks.

Can it be done? I don't know. But I do know that I'm unlikely to accept a 
noticeable slowdown in some very core code for a case that affects about 
0.00001% of the population. In other words, I think you *have* to do it.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Nick Piggin <nickpiggin@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 14, 2008 - 1:57 pm

The problem is that the code in rmap.c try_to_umap() and friends loops 
over reverse maps after taking a spinlock. The mm_struct is only known 
after the rmap has been acccessed. This means *inside* the spinlock.

That is why I tried to convert the locks to scan the revese maps to 
semaphores. If that is done then one can indeed do the callouts outside of 

With larger number of processor semaphores make a lot of sense since the 
holdoff times on spinlocks will increase. If we go to sleep then the 
processor can do something useful instead of hogging a cacheline.

A rw lock there can also increase concurrency during reclaim espcially if 
the anon_vma chains and the number of address spaces mapping a page is 
high.

--
To: Christoph Lameter <clameter@...>
Cc: Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Nick Piggin <nickpiggin@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 14, 2008 - 2:27 pm

So you queue them. That's what we do with things like the dirty bit. We 
need to hold various spinlocks to look up pages, but then we can't 
actually call the filesystem with the spinlock held.

Converting a spinlock to a waiting lock for things like that is simply not 
acceptable. You have to work with the system.

Yeah, there's only a single bit worth of information on whether a page is 
dirty or not, so "queueing" that information is trivial (it's just the 
return value from "page_mkclean_file()". Some things are harder than 
others, and I suspect you need some kind of "gather" structure to queue up 
all the vma's that can be affected.

But it sounds like for the case of rmap, the approach of:

 - the page lock is the higher-level "sleeping lock" (which makes sense, 
   since this is very close to an IO event, and that is what the page lock 
   is generally used for)

   But hey, it could be anything else - maybe you have some other even 
   bigger lock to allow you to handle lots of pages in one go.

 - with that lock held, you do the whole rmap dance (which requires 
   spinlocks) and gather up the vma's and the struct mm's involved. 

 - outside the spinlocks you then do whatever it is you need to do.

This doesn't sound all that different from TLB shoot-down in SMP, and the 
"mmu_gather" structure. Now, admittedly we can do the TLB shoot-down while 
holding the spinlocks, but if we couldn't that's how we'd still do it: 
it would get more involved (because we'd need to guarantee that the gather 
can hold *all* the pages - right now we can just flush in the middle if we 
need to), but it wouldn't be all that fundamentally different.

And no, I really haven't even wanted to look at what XPMEM really needs to 
do, so maybe the above thing doesn't work for you, and you have other 
issues. I'm just pointing you in a general direction, not trying to say 
"this is exactly how to get there". 

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Nick Piggin <nickpiggin@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Friday, May 16, 2008 - 9:38 pm

Implementation of what Linus suggested: Defer the XPMEM processing until 
after the locks are dropped. Allow immediate action by GRU/KVM.


This patch implements a callbacks for device drivers that establish external
references to pages aside from the Linux rmaps. Those either:

1. Do not take a refcount on pages that are mapped from devices. They
have a TLB cache like handling and must be able to flush external references
from atomic contexts. These devices do not need to provide the _sync methods.

2. Do take a refcount on pages mapped externally. These are handling by
marking pages as to be invalidated in atomic contexts. Invalidation
may be started by the driver. A _sync variant for the individual or
range unmap is called when we are back in a nonatomic context. At that
point the device must complete the removal of external references
and drop its refcount.

With the mm notifier it is possible for the device driver to release external
references after the page references are removed from a process that made
them available.

With the notifier it becomes possible to get pages unpinned on request and thus
avoid issues that come with having a large amount of pinned pages.

A device driver must subscribe to a process using

        mm_register_notifier(struct mm_struct *, struct mm_notifier *)

The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space.

When the process terminates then first the -&gt;release method is called to
remove all pages still mapped to the proces.

Before the mm_struct is freed the -&gt;destroy() method is called which
should dispose of the mm_notifier structure.

The following callbacks exist:

invalidate_range(notifier, mm_struct *, from , to)

	Invalidate a range of addresses. The invalidation is
	not required to complete immediately.

invalidate_range_sync(notifier, mm_struct *, from, to)

	This is called after some invalidate_range callouts.
	The driver may only return when the inva...
To: Linus Torvalds <torvalds@...>
Cc: Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Nick Piggin <nickpiggin@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 14, 2008 - 12:22 pm

We are in the process of attempting this now.  Unfortunately for SGI,
Christoph is on vacation right now so we have been trying to work it
internally.

We are looking through two possible methods, one we add a callout to the
tlb flush paths for both the mmu_gather and flush_tlb_page locations.
The other we place a specific callout seperate from the gather callouts
in the paths we are concerned with.  We will look at both more carefully
before posting.


In either implementation, not all call paths would require the stall
to ensure data integrity.  Would it be acceptable to always put a
sleepable stall in even if the code path did not require the pages be
unwritable prior to continuing?  If we did that, I would be freed from
having a pool of invalidate threads ready for XPMEM to use for that work.
Maybe there is a better way, but the sleeping requirement we would have
on the threads make most options seem unworkable.


Thanks,
Robin
--
To: Robin Holt <holt@...>
Cc: Nick Piggin <npiggin@...>, Nick Piggin <nickpiggin@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 14, 2008 - 12:56 pm

I'm not understanding the question. If you can do you management outside 
of the spinlocks, then you can obviously do whatever you want, including 
sleping. It's changing the existing spinlocks to be sleepable that is not 
acceptable, because it's such a performance problem.

		Linus
--
To: Robin Holt <holt@...>
Cc: Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 8:55 pm

You simply will *have* to do it without locally holding all the MM 
spinlocks. Because quite frankly, slowing down all the normal VM stuff for 
some really esoteric hardware simply isn't acceptable. We just don't do 
it.

So what is it that actually triggers one of these events?

The most obvious solution is to just queue the affected pages while 
holding the spinlocks (perhaps locking them locally), and then handling 
all the stuff that can block after releasing things. That's how we 
normally do these things, and it works beautifully, without making 
everything slower.

Sometimes we go to extremes, and actually break the locks are restart 
(ugh), and it gets ugly, but even that tends to be preferable to using the 
wrong locking.

The thing is, spinlocks really kick ass. Yes, they restrict what you can 
do within them, but if 99.99% of all work is non-blocking, then the really 
odd rare blocking case is the one that needs to accomodate, not the rest.

				Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 6:22 pm

I'll let you discuss with Christoph and Robin about it. The moment I
heard the schedule inside -&gt;invalidate_page() requirement I reacted
the same way you did. But I don't see any other real solution for XPMEM
other than spin-looping for ages halting the scheduler for ages, while
the ack is received from the network device.

But mm_lock is required even without XPMEM. And srcu is also required
without XPMEM to allow -&gt;release to schedule (however downgrading srcu
to rcu will result in a very small patch, srcu and rcu are about the

I think it's a great smp scalability optimization over the global lock

Unfortunately the lock you're talking about would be:

static spinlock_t global_lock = ...

There's no way to make it more granular.

So every time before taking any -&gt;i_mmap_lock _and_ any anon_vma-&gt;lock
we'd need to take that extremely wide spinlock first (and even worse,
later it would become a rwsem when XPMEM is selected making the VM
even slower than it already becomes when XPMEM support is selected at


mmu_notifier_register is fine to be hundred times slower (preempt-rt

Sure, I'll split it from the rest if the mmu-notifier-core isn't merged.

My objective has been:

1) add zero overhead to the VM before anybody starts a VM with kvm and
   still zero overhead for all other tasks except the task where the
   VM runs.  The only exception is the unlikely(!mm-&gt;mmu_notifier_mm)
   check that is optimized away too when CONFIG_KVM=n. And even for
   that check my invalidate_page reduces the number of branches to the
   absolute minimum possible.

2) avoid any new cacheline collision in the fast paths to allow numa
   systems not to nearly-crash (mm-&gt;mmu_notifier_mm will be shared and
   never written, except during the first mmu_notifier_register)

3) avoid any risk to introduce regressions in 2.6.26 (the patch must
   be obviously safe). Even if mm_lock would be a bad idea like you
   say, it's order of magnitude safer even if entirely broken then
   ...
To: Andrea Arcangeli <andrea@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 6:44 pm

Right. So what? 

It's still about a million times faster than what the code does now.

You comment about "great smp scalability optimization" just shows that 
you're a moron. It is no such thing. The fact is, it's a horrible 
pessimization, since even SMP will be *SLOWER*. It will just be "less 
slower" when you have a million CPU's and they all try to do this at the 
same time (which probably never ever happens).

In other words, "scalability" is totally meaningless. The only thing that 
matters is *performance*. If the "scalable" version performs WORSE, then 

So what you're saying is that performance doesn't matter?

So why do you do the ugly crazy hundred-line implementation, when a simple 
two-liner would do equally well?

Your arguments are crap.

Anyway, discussion over. This code doesn't get merged. It doesn't get 
merged before 2.6.26, and it doesn't get merged _after_ either.

Rewrite the code, or not. I don't care. I'll very happily not merge crap 
for the rest of my life.

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 6:58 pm

mmu_notifier_register only runs when windows or linux or macosx
boots. Who could ever care of the msec spent in mm_lock compared to
the time it takes to linux to boot?

What you're proposing is to slowdown AIM and certain benchmarks 20% or

If you want the global lock I'll do it no problem, I just think it's
obviously inferior solution for 99% of users out there (including kvm
users that will also have to take that lock while kvm userland runs).

In my view the most we should do in this area is to reduce further the
max number of locks to take if max_map_count already isn't enough.
--
To: Andrea Arcangeli <andrea@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 7:09 pm

Andrea, you're *this* close to going to my list of people who it is not 
worth reading email from, and where it's better for everybody involved if 
I just teach my spam-filter about it.

That code was CRAP.

That code was crap whether it's used once, or whether it's used a million 
times. Stop making excuses for it just because it's not performance- 
critical.

So give it up already. I told you what the non-crap solution was. It's 
simpler, faster, and is about two lines of code compared to the crappy 
version (which was what - 200 lines of crap with a big comment on top of 
it just to explain the idiocy?).

So until you can understand the better solution, don't even bother 
emailing me, ok? Because the next email I get from you that shows the 
intelligence level of a gnat, I'll just give up and put you in a 
spam-filter.

Because my IQ goes down just from reading your mails. I can't afford to 
continue.

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Jack Steiner <steiner@...>, Robin Holt <holt@...>, Nick Piggin <npiggin@...>, Peter Zijlstra <a.p.zijlstra@...>, <kvm-devel@...>, Kanoj Sarcar <kanojsarcar@...>, Roland Dreier <rdreier@...>, Steve Wise <swise@...>, <linux-kernel@...>, Avi Kivity <avi@...>, <linux-mm@...>, <general@...>, Hugh Dickins <hugh@...>, Rusty Russell <rusty@...>, Anthony Liguori <aliguori@...>, Chris Wright <chrisw@...>, Marcelo Tosatti <marcelo@...>, Eric Dumazet <dada1@...>, Paul E. McKenney <paulmck@...>
Date: Wednesday, May 7, 2008 - 7:02 pm

To remove mm_lock without adding an horrible system-wide lock before
every i_mmap_lock etc.. we've to remove
invalidate_range_begin/end. Then we can return to an older approach of
doing only invalidate_page and serializing it with the PT lock against
get_user_pages. That works fine for KVM but GRU will have to flush the
tlb once every time we drop the PT lock, that means once per each 512
ptes on x86-64 etc... instead of a single time for the whole range
regardless how large the range is.
--
To: Andrea Arcangeli <andrea@...>
Cc: <torvalds@...>, <clameter@...>, <steiner@...>, <holt@...>, <npiggin@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...>
Date: Wednesday, May 7, 2008 - 6:31 pm

On Thu, 8 May 2008 00:22:05 +0200

Nope.  We only need to take the global lock before taking *two or more* of
the per-vma locks.

I really wish I'd thought of that.
--
To: Andrew Morton <akpm@...>
Cc: <torvalds@...>, <clameter@...>, <steiner@...>, <holt@...>, <npiggin@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...>
Date: Wednesday, May 7, 2008 - 6:44 pm

I don't see how you can avoid taking the system-wide-global lock
before every single anon_vma-&gt;lock/i_mmap_lock out there without
mm_lock.

Please note, we can't allow a thread to be in the middle of
zap_page_range while mmu_notifier_register runs.

vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
than one lock and we've to still take the global-system-wide lock
_before_ this single i_mmap_lock and no other lock at all.

Please elaborate, thanks!
--
To: Andrea Arcangeli <andrea@...>
Cc: Andrew Morton <akpm@...>, <torvalds@...>, <clameter@...>, <steiner@...>, <holt@...>, <npiggin@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...>
Date: Wednesday, May 7, 2008 - 7:28 pm

You said yourself that mmu_notifier_register can be as slow as you
want ... what about you use stop_machine for it ? I'm not even joking

Ben.


--
To: Benjamin Herrenschmidt <benh@...>
Cc: Andrew Morton <akpm@...>, <clameter@...>, <steiner@...>, <holt@...>, <npiggin@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...>
Date: Wednesday, May 7, 2008 - 7:45 pm

We can put a cap of time + a cap of vmas. It's not important if it
fails, the only useful case we know it, and it won't be slow at
all. The failure can happen because the cap of time or the cap of vmas
or the cap vmas triggers or there's a vmalloc shortage. We handle the
failure in userland of course. There are zillon of allocations needed
anyway, any one of them can fail, so this isn't a new fail path, is
the same fail path that always existed before mmu_notifiers existed.

I can't possibly see how adding a new global wide lock that forces all
truncate to be serialized against each other, practically eliminating
the need of the i_mmap_lock, could be superior to an approach that
doesn't cause the overhead to the VM at all, and only require kvm to
pay for an additional cost when it startup.

Furthermore the only reason I had to implement mm_lock was to fix the
invalidate_range_start/end model, if we go with only invalidate_page
and invalidate_pages called inside the PT lock and we use the PT lock
to serialize, we don't need a mm_lock anymore and no new lock from the
VM either. I tried to push for that, but everyone else wanted
invalidate_range_start/end. I only did the only possible thing to do:
to make invalidate_range_start safe to make everyone happy without
slowing down the VM.
--
To: Benjamin Herrenschmidt <benh@...>
Cc: Andrew Morton <akpm@...>, <clameter@...>, <steiner@...>, <holt@...>, <npiggin@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...>
Date: Wednesday, May 7, 2008 - 9:34 pm

Sorry for not having completely answered to this. I initially thought
stop_machine could work when you mentioned it, but I don't think it
can even removing xpmem block-inside-mmu-notifier-method requirements.

For stop_machine to solve this (besides being slower and potentially
not more safe as running stop_machine in a loop isn't nice), we'd need
to prevent preemption in between invalidate_range_start/end.

I think there are two ways:

1) add global lock around mm_lock to remove the sorting

2) remove invalidate_range_start/end, nuke mm_lock as consequence of
   it, and replace all three with invalidate_pages issued inside the
   PT lock, one invalidation for each 512 pte_t modified, so
   serialization against get_user_pages becomes trivial but this will
   be not ok at all for SGI as it increases a lot their invalidation
   frequency

For KVM both ways are almost the same.

I'll implement 1 now then we'll see...
--
To: Andrea Arcangeli <andrea@...>
Cc: Benjamin Herrenschmidt <benh@...>, Andrew Morton <akpm@...>, <clameter@...>, <steiner@...>, <holt@...>, <npiggin@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...>
Date: Tuesday, May 13, 2008 - 8:14 am

This is what I suggested to begin with before this crazy locking was
developed to handle these corner cases... because I wanted the locking
to match with the tried and tested Linux core mm/ locking rather than
introducing this new idea.

I don't see why you're bending over so far backwards to accommodate
this GRU thing that we don't even have numbers for and could actually
potentially be batched up in other ways (eg. using mmu_gather or
mmu_gather-like idea).

The bare essential, matches-with-Linux-mm mmu notifiers that I first
saw of yours was pretty elegant and nice. The idea that "only one
solution must go in and handle everything perfectly" is stupid because
it is quite obvious that the sleeping invalidate idea is just an order
of magnitude or two more complex than the simple atomic invalidates
needed by you. We should and could easily have had that code upstream
long ago :(

I'm not saying we ignore the sleeping or batching cases, but we should
introduce the ideas slowly and carefully and assess the pros and cons
--
To: Nick Piggin <nickpiggin@...>
Cc: Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, <clameter@...>, <steiner@...>, <holt@...>, <npiggin@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...>
Date: Wednesday, May 14, 2008 - 1:43 am

I agree, we're better off generalizing the mmu_gather batching
instead...

I had some never-finished patches to use the mmu_gather for pretty much
everything except single page faults, tho various subtle differences
between archs and lack of time caused me to let them take the dust and
not finish them...

I can try to dig some of that out when I'm back from my current travel,
though it's probably worth re-doing from scratch now.

Ben.


--
To: Benjamin Herrenschmidt <benh@...>
Cc: Nick Piggin <nickpiggin@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, <clameter@...>, <holt@...>, <npiggin@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...>
Date: Wednesday, May 14, 2008 - 9:15 am

Unfortunately, we are at least several months away from being able to
provide numbers to justify batching - assuming it is really needed.  We need
large systems running real user workloads. I wish we had that available
right now, but we don't.

It also depends on what you mean by "no batching". If you mean that the
notifier gets called for each pte that is removed from the page table, then
the overhead is clearly very high for some operations. Consider the unmap of
a very large object. A TLB flush per page will be too costly.

However, something based on the mmu_gather seems like it should provide
exactly what is needed to do efficient flushing of the TLB. The GRU does not
require that it be called in a sleepable context. As long as the notifier
callout provides the mmu_gather and vaddr range being flushed, the GRU can


-- jack
--
To: Benjamin Herrenschmidt <benh@...>
Cc: Nick Piggin <nickpiggin@...>, Andrea Arcangeli <andrea@...>, Andrew Morton <akpm@...>, <clameter@...>, <steiner@...>, <holt@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...>
Date: Wednesday, May 14, 2008 - 2:06 am

Well, the first thing would be just to get rid of the whole start/end
idea, which completely departs from the standard Linux system of
clearing ptes, then flushing TLBs, then freeing memory.

The onus would then be on GRU to come up with some numbers to justify
batching, and a patch which works nicely with the rest of the Linux
mm. And yes, mmu-gather is *the* obvious first choice of places to

I always liked the idea as you know. But I don't think that should
be mixed in with the first iteration of the mmu notifiers patch
anyway. GRU actually can work without batching, but there is simply
some (unquantified to me) penalty for not batching it. I think it
is far better to first put in a clean and simple and working functionality
first. The idea that we have to unload some monster be-all-and-end-all
solution onto mainline in a single go seems counter productive to me.
--
To: Andrea Arcangeli <andrea@...>
Cc: <torvalds@...>, <clameter@...>, <steiner@...>, <holt@...>, <npiggin@...>, <a.p.zijlstra@...>, <kvm-devel@...>, <kanojsarcar@...>, <rdreier@...>, <swise@...>, <linux-kernel@...>, <avi@...>, <linux-mm@...>, <general@...>, <hugh@...>, <rusty@...>, <aliguori@...>, <chrisw@...>, <marcelo@...>, <dada1@...>, <paulmck@...