Hello, this is the last update of the mmu notifier patch. Jack asked a __mmu_notifier_register to call under mmap_sem in write mode. Here an update with that change plus allowing ->release not to be implemented (two liner change to mmu_notifier.c). The entire diff between v15 and v16 mmu-notifier-core was posted in separate email. --
# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1210115127 -7200 # Node ID c5badbefeee07518d9d1acca13e94c981420317c # Parent e20917dcc8284b6a07cfcced13dda4cbca850a9c get_task_mm get_task_mm should not succeed if mmput() is running and has reduced the mm_users count to zero. This can occur if a processor follows a tasks pointer to an mm struct because that pointer is only cleared after the mmput(). If get_task_mm() succeeds after mmput() reduced the mm_users to zero then we have the lovely situation that one portion of the kernel is doing all the teardown work for an mm while another portion is happily using it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -465,7 +465,8 @@ struct mm_struct *get_task_mm(struct tas if (task->flags & PF_BORROWED_MM) mm = NULL; else - atomic_inc(&mm->mm_users); + if (!atomic_inc_not_zero(&mm->mm_users)) + mm = NULL; } task_unlock(task); return mm; --
You can drop this patch. This turned out to be a race in xpmem. It "appeared" as if it were a race in get_task_mm, but it really is not. The current->mm field is cleared under the task_lock and the task_lock is grabbed by get_task_mm. I have been testing you v15 version without this patch and not encountere the problem again (now that I fixed my xpmem race). Thanks, Robin --
Great. About your other deadlock I'm curious if my deadlock fix for the i_mmap_sem patch helped. That was crashing kvm with a VM 2G in the swap + a swaphog allocating and freeing another 2G of swap in a loop. I couldn't reproduce any other problem with KVM since I fixed that bit regardless if I apply only mmu-notifier-core (2.6.26 version) or the full patchset (post 2.6.26). --
# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1210096013 -7200 # Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 mmu-notifier-core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to ...
On Wed, 07 May 2008 16:35:51 +0200 Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed. --
On Wed, 07 May 2008 16:35:51 +0200 Should that be "does" or "does not". "does", I suppose. It should refer to hlist_unhashed() The term "on entry" is a bit ambiguous - we normally use that as shorthand to mean "on entry to the function". OK? And another. Fair enough. --
On Wed, 07 May 2008 16:35:51 +0200 The patch looks OK to me. The proposal is that we sneak this into 2.6.26. Are there any sufficiently-serious objections to this? The patch will be a no-op for 2.6.26. This is all rather unusual. For the record, could we please review the reasons for wanting to do this? Thanks. --
As far as I can tell, authorship has been destroyed by at least two of the patches (ie Christoph seems to be the author, but Andrea seems to have Yeah, too late and no upside. That "locking" code is also too ugly to live, at least without some serious arguments for why it has to be done that way. Sorting the locks? In a vmalloc'ed area? And calling this something innocuous like "mm_lock()"? Hell no. That code needs some serious re-thinking. Linus --
I can't follow this, please be more specific. About the patches I merged from Christoph, I didn't touch them at all (except for fixing a kernel crashing bug in them plus some reject fix). Initially I didn't even add a signed-off-by: andrea, and I only had the signed-off-by: christoph. But then he said I had to add my signed-off-by too, while I thought at most an acked-by was required. So if I got any attribution on Christoph work it's only because he explicitly requested it as it was passing through my maintenance line. In any case, all patches except mmu-notifier-core are irrelevant in this context and I'm entirely fine to give Christoph the whole attribution of the whole patchset including the whole mmu-notifier-core where most of the code is mine. We had many discussions with Christoph, Robin and Jack, but I can assure you nobody had a single problem with regard to attribution. About all patches except mmu-notifier-core: Christoph, Robin and everyone (especially myself) agrees those patches can't yet be merged in 2.6.26. With regard to the post-2.6.26 material, I think adding a config option to make the change at compile time, is ok. And there's no other way to deal with it in a clean way, as vmtrunate has to teardown pagetables, and if the i_mmap_lock is a spinlock there's no way to notify secondary mmus about it, if the ->invalidate_range_start method has to allocate an skb, send it through the network and wait for I/O No upside to all people setting CONFIG_KVM=n true, but no downside for them either, that's the important fact! And for all the people setting CONFIG_KVM!=n, I should provide some background here. KVM MM development is halted without this, that includes: paging, ballooning, tlb flushing at large, pci-passthrough removing page pin as a whole, etc... Everyone on kvm-devel talks about mmu-notifiers, check the last VT-d patch form Intel where Antony (IBM/qemu/kvm) wonders how to handle things without mmu notifiers (mlock whatever). Rusty agreed ...
The patches were sent to lkml without *any* indication that you weren't actually the author. Is that an excuse for UTTER AND TOTAL CRAP? Linus --
I rechecked and I guarantee that the patches where Christoph isn't listed are developed by myself and he didn't write a single line on them. In any case I expect Christoph to review (he's CCed) and to point me to any attribution error. The only mistake I did once in that area was to give too _few_ attribution to myself and he asked me to add myself in the signed-off so I added myself by Christoph own request, but be sure I didn't remove him! --
> I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself and he didn't write a single line on > them. In any case I expect Christoph to review (he's CCed) and to > point me to any attribution error. The only mistake I did once in that > area was to give too _few_ attribution to myself and he asked me to > add myself in the signed-off so I added myself by Christoph own > request, but be sure I didn't remove him! I think the point you're missing is that any patches written by Christoph need a line like From: Christoph Lameter <clameter@sgi.com> at the top of the body so that Christoph becomes the author when it is committed into git. The Signed-off-by: line needs to be preserved too of course, but it is not sufficient by itself. - R. --
Ok so I see the problem Linus is referring to now (I received the hint by PM too), I thought the order of the signed-off-by was relevant, it clearly isn't or we're wasting space ;) --
The order of the signed-offs are somewhat relevant, but no, sign-offs don't mean authorship. See the rules for sign-off: you can sign off on another persons patches, even if they didn't sign off on them themselves. That's clause (b) in particular. So yes, quite often you'd _expect_ the first sign-off to match the author, but that's a correlation, not a causal relationship. Linus --
By PM (guess he's scared to post to this thread ;) Chris is telling me, what you mean perhaps is I should add a From: Christoph in the body of the email if the first signed-off-by is from Christoph, to indicate the first signoff was by him and the patch in turn was started by him. I thought the order of the signoffs was enough, but if that From was mandatory and missing, if there's any error it obviously wasn't intentional especially given I only left a signed-off-by: christoph on his patches until he asked me to add my signoff too. Correcting it is trivial given I carefully ordered the signoff so that the author is at the top of the signoff list. At least for mmu-notifier-core given I obviously am the original author of that code, I hope the From: of the email was enough even if an additional From: andrea was missing in the body. Also you can be sure that Christoph and especially Robin (XPMEM) will be more than happy if all patches with Christoph at the top of the signed-off-by will be merged in 2.6.26 despite there wasn't From: christoph at the top of the body ;). So I don't see a big deal here... --
Ok, this whole series of patches have just been such a disaster that I'm (a) disgusted that _anybody_ sent an Acked-by: for any of it, and (b) that I'm still looking at it at all, but I am. And quite frankly, the more I look, and the more answers from you I get, the less I like it. And I didn't like it that much to start with, as you may have noticed. You say that "At least for mmu-notifier-core given I obviously am the original author of that code", but that is not at all obvious either. One of the reasons I stated that authorship seems to have been thrown away is very much exactly in that first mmu-notifier-core patch: + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <clameter@sgi.com> so I would very strongly dispute that it's "obvious" that you are the original author of the code there. So there was a reason why I said that I thought authorship had been lost somewhere along the way. Linus --
How long have you been doing kernel development? How about you read SubmittingPatches a few times before you show just how clueless you are? Hint: look for the string that says "From:". Also look at the section that talks about "summary phrase". You got it all wrong, and you don't even seem to realize that you got it wrong, even when I told you. Linus --
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1210115129 -7200
# Node ID d60d200565abde6a8ed45271e53cde9c5c75b426
# Parent c5badbefeee07518d9d1acca13e94c981420317c
invalidate_page outside PT lock
Moves all mmu notifier methods outside the PT lock (first and not last
step to make them sleep capable).
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -210,35 +210,6 @@ static inline void mmu_notifier_mm_destr
__mmu_notifier_mm_destroy(mm);
}
-/*
- * These two macros will sometime replace ptep_clear_flush.
- * ptep_clear_flush is impleemnted as macro itself, so this also is
- * implemented as a macro until ptep_clear_flush will converted to an
- * inline function, to diminish the risk of compilation failure. The
- * invalidate_page method over time can be moved outside the PT lock
- * and these two macros can be later removed.
- */
-#define ptep_clear_flush_notify(__vma, __address, __ptep) \
-({ \
- pte_t __pte; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __pte = ptep_clear_flush(___vma, ___address, __ptep); \
- mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
- __pte; \
-})
-
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
-({ \
- int __young; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
- __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
- ___address); \
- __young; \
-})
-
#else /* CONFIG_MMU_NOTIFIER */
static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -274,9 +245,6 @@ static inline void mmu_notifier_mm_destr
{
}
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
-#define ...On Wed, 07 May 2008 16:35:53 +0200 This patch appears to undo some of the changes made by patch 01/11. Would it be an idea to merge them into one, so the first patch introduces the right conventions directly? -- All rights reversed. --
The only reason this isn't merged into one, is that this requires non obvious (not difficult though) to the core VM code. I wanted to keep an obviously safe approach for 2.6.26. The other conventions are only needed by XPMEM and XPMEM can't work without all other patches anyway. --
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1210115130 -7200
# Node ID 34f6a4bf67ce66714ba2d5c13a5fed241d34fb09
# Parent d60d200565abde6a8ed45271e53cde9c5c75b426
free-pgtables
Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.
Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.
The first pointer argument to free_pgtables() can then be dropped.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru
void *private);
void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
- unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+ unsigned long ceiling);
int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@ void free_pgd_range(struct mmu_gather **
} while (pgd++, addr = next, addr != end);
}
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
- unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+ unsigned long ceiling)
{
+ struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = ...On Wed, 07 May 2008 16:35:54 +0200 Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed. --
# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1210115131 -7200 # Node ID 20bc6a66a86ef6bd60919cc77ff51d4af741b057 # Parent 34f6a4bf67ce66714ba2d5c13a5fed241d34fb09 unmap vmas tlb flushing Move the tlb flushing inside of unmap vmas. This saves us from passing a pointer to the TLB structure around and simplifies the callers. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, - struct vm_area_struct *start_vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -849,7 +849,6 @@ static unsigned long unmap_page_range(st /** * unmap_vmas - unmap a range of memory covered by a list of vma's - * @tlbp: address of the caller's struct mmu_gather * @vma: the starting vma * @start_addr: virtual address at which to start unmapping * @end_addr: virtual address at which to end unmapping @@ -861,20 +860,13 @@ static unsigned long unmap_page_range(st * Unmap all pages in the vma list. * * We aim to not hold locks for too long (for scheduling latency reasons). - * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to - * return the ending mmu_gather to the caller. + * So zap pages in ZAP_BLOCK_SIZE bytecounts. * * Only addresses between `start' and `end' will be unmapped. * * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the ...
On Wed, 07 May 2008 16:35:55 +0200 Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed. --
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1210115132 -7200
# Node ID 0621238970155f8ff2d60ca4996dcdd470f9c6ce
# Parent 20bc6a66a86ef6bd60919cc77ff51d4af741b057
rwsem contended
Add a function to rw_semaphores to check if there are any processes
waiting for the semaphore. Add rwsem_needbreak to sched.h that works
in the same way as spinlock_needbreak().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore
*/
extern void downgrade_write(struct rw_semaphore *sem);
+extern int rwsem_is_contended(struct rw_semaphore *sem);
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
* nested locking. NOTE: rwsems are not allowed to recurse
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2030,6 +2030,15 @@ static inline int spin_needbreak(spinloc
#endif
}
+static inline int rwsem_needbreak(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_PREEMPT
+ return rwsem_is_contended(sem);
+#else
+ return 0;
+#endif
+}
+
/*
* Reevaluate whether the task has signals pending delivery.
* Wake the task if so.
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho
spin_unlock_irqrestore(&sem->wait_lock, flags);
}
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ /*
+ * Racy check for an empty list. False positives or negatives
+ * would be okay. False positive may cause a useless dropping of
+ * locks. False negatives may cause locks to be held a bit
+ * longer until the next check.
+ */
+ return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
EXPORT_SYMBOL(__init_rwsem);
...# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1210115135 -7200
# Node ID 58f716ad4d067afb6bdd1b5f7042e19d854aae0d
# Parent 0621238970155f8ff2d60ca4996dcdd470f9c6ce
i_mmap_rwsem
The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi
expand_stack(), it is hard to come up with a destructive scenario without
having the vmlist protection in this case.
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
if (!vma_shareable(vma, addr))
return;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
out:
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}
/*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ ...# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1210115136 -7200 # Node ID 6b384bb988786aa78ef07440180e4b2948c4c6a2 # Parent 58f716ad4d067afb6bdd1b5f7042e19d854aae0d anon-vma-rwsem Convert the anon_vma spinlock to a rw semaphore. This allows concurrent traversal of reverse maps for try_to_unmap() and page_mkclean(). It also allows the calling of sleeping functions from reverse map traversal as needed for the notifier callbacks. It includes possible concurrency. Rcu is used in some context to guarantee the presence of the anon_vma (try_to_unmap) while we acquire the anon_vma lock. We cannot take a semaphore within an rcu critical section. Add a refcount to the anon_vma structure which allow us to give an existence guarantee for the anon_vma structure independent of the spinlock or the list contents. The refcount can then be taken within the RCU section. If it has been taken successfully then the refcount guarantees the existence of the anon_vma. The refcount in anon_vma also allows us to fix a nasty issue in page migration where we fudged by using rcu for a long code path to guarantee the existence of the anon_vma. I think this is a bug because the anon_vma may become empty and get scheduled to be freed but then we increase the refcount again when the migration entries are removed. The refcount in general allows a shortening of RCU critical sections since we can do a rcu_unlock after taking the refcount. This is particularly relevant if the anon_vma chains contain hundreds of entries. However: - Atomic overhead increases in situations where a new reference to the anon_vma has to be established or removed. Overhead also increases when a speculative reference is used (try_to_unmap, page_mkclean, page migration). - There is the potential for more frequent processor change due to up_xxx letting waiting tasks run first. This results in f.e. the Aim9 brk performance test to got down by 10-15%. Signed-off-by: Christoph Lameter ...
This also looks very debatable indeed. The only performance numbers quoted which just seems like a total disaster. The whole series looks bad, in fact. Lack of authorship, bad single-line description, and the code itself sucks so badly that it's not even funny. NAK NAK NAK. All of it. It stinks. Linus --
Glad you agree. Note that the fact the whole series looks bad, is _exactly_ why I couldn't let Christoph keep going with mmu-notifier-core at the very end of his patchset. I had to move it at the top to have a chance to get the KVM and GRU requirements merged in 2.6.26. I think the spinlock->rwsem conversion is ok under config option, as you can see I complained myself to various of those patches and I'll take care they're in a mergeable state the moment I submit them. What XPMEM requires are different semantics for the methods, and we never had to do any blocking I/O during vmtruncate before, now we have to. And I don't see a problem in making the conversion from spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on anything but ia64. Please ignore all patches but mmu-notifier-core. I regularly forward _only_ mmu-notifier-core to Andrew, that's the only one that is in merge-ready status, everything else is just so XPMEM can test and we can keep discussing it to bring it in a mergeable state like mmu-notifier-core already is. --
I really suspect we don't really have to, and that it would be better to The thing is, I didn't like that one *either*. I thought it was the biggest turd in the series (and by "biggest", I literally mean "most lines of turd-ness" rather than necessarily "ugliest per se"). I literally think that mm_lock() is an unbelievable piece of utter and horrible CRAP. There's simply no excuse for code like that. If you want to avoid the deadlock from taking multiple locks in order, but there is really just a single operation that needs it, there's a really really simple solution. And that solution is *not* to sort the whole damn f*cking list in a vmalloc'ed data structure prior to locking! Damn. No, the simple solution is to just make up a whole new upper-level lock, and get that lock *first*. You can then take all the multiple locks at a lower level in any order you damn well please. And yes, it's one more lock, and yes, it serializes stuff, but: - that code had better not be critical anyway, because if it was, then the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_ - parallelism is overrated: it doesn't matter one effing _whit_ if something is a hundred times more parallel, if it's also a hundred times *SLOWER*. So dang it, flush the whole damn series down the toilet and either forget the thing entirely, or re-do it sanely. And here's an admission that I lied: it wasn't *all* clearly crap. I did like one part, namely list_del_init_rcu(), but that one should have been in a separate patch. I'll happily apply that one. Linus --
I'll let you discuss with Christoph and Robin about it. The moment I heard the schedule inside ->invalidate_page() requirement I reacted the same way you did. But I don't see any other real solution for XPMEM other than spin-looping for ages halting the scheduler for ages, while the ack is received from the network device. But mm_lock is required even without XPMEM. And srcu is also required without XPMEM to allow ->release to schedule (however downgrading srcu to rcu will result in a very small patch, srcu and rcu are about the I think it's a great smp scalability optimization over the global lock Unfortunately the lock you're talking about would be: static spinlock_t global_lock = ... There's no way to make it more granular. So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock we'd need to take that extremely wide spinlock first (and even worse, later it would become a rwsem when XPMEM is selected making the VM even slower than it already becomes when XPMEM support is selected at mmu_notifier_register is fine to be hundred times slower (preempt-rt Sure, I'll split it from the rest if the mmu-notifier-core isn't merged. My objective has been: 1) add zero overhead to the VM before anybody starts a VM with kvm and still zero overhead for all other tasks except the task where the VM runs. The only exception is the unlikely(!mm->mmu_notifier_mm) check that is optimized away too when CONFIG_KVM=n. And even for that check my invalidate_page reduces the number of branches to the absolute minimum possible. 2) avoid any new cacheline collision in the fast paths to allow numa systems not to nearly-crash (mm->mmu_notifier_mm will be shared and never written, except during the first mmu_notifier_register) 3) avoid any risk to introduce regressions in 2.6.26 (the patch must be obviously safe). Even if mm_lock would be a bad idea like you say, it's order of magnitude safer even if entirely broken then messing with ...
On Thu, 8 May 2008 00:22:05 +0200 Nope. We only need to take the global lock before taking *two or more* of the per-vma locks. I really wish I'd thought of that. --
I don't see how you can avoid taking the system-wide-global lock before every single anon_vma->lock/i_mmap_lock out there without mm_lock. Please note, we can't allow a thread to be in the middle of zap_page_range while mmu_notifier_register runs. vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more than one lock and we've to still take the global-system-wide lock _before_ this single i_mmap_lock and no other lock at all. Please elaborate, thanks! --
On Thu, 8 May 2008 00:44:06 +0200 umm... CPU0: CPU1: spin_lock(a->lock); spin_lock(b->lock); spin_lock(b->lock); spin_lock(a->lock); bad. CPU0: CPU1: spin_lock(global_lock) spin_lock(global_lock); spin_lock(a->lock); spin_lock(b->lock); spin_lock(b->lock); spin_lock(a->lock); Is OK. CPU0: CPU1: spin_lock(global_lock) spin_lock(a->lock); spin_lock(b->lock); spin_lock(b->lock); spin_unlock(b->lock); spin_lock(a->lock); spin_unlock(a->lock); also OK. As long as all code paths which can take two-or-more locks are all covered by the global lock there is no deadlock scenario. If a thread takes just a single instance of one of these locks without taking the global_lock then there is also no deadlock. Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly added mm_lock() thing and we also take both those locks at the same time in regular code, we're probably screwed. --
No, just use the normal static ordering for that case: one type of lock
goes before the other kind. If those locks nest in regular code, you have
to do that *anyway*.
The code that can take many locks, will have to get the global lock *and*
order the types, but that's still trivial. It's something like
spin_lock(&global_lock);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (vma->anon_vma)
spin_lock(&vma->anon_vma->lock);
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
}
spin_unlock(&global_lock);
and now everybody follows the rule that "anon_vma->lock" precedes
"i_mmap_lock". So there can be no ABBA deadlock between the normal users
and the many-locks version, and there can be no ABBA deadlock between
many-locks-takers because they use the global_lock to serialize.
This really isn't rocket science, guys.
(I really hope and believe that they don't nest anyway, and that you can
just use a single for-loop for the many-lock case)
Linus
--
Multiple vmas may share the same mapping or refer to the same anonymous vma. The above code will deadlock since we may take some locks multiple times. --
Ok, so that actually _is_ a problem. It would be easy enough to also add just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's suggestion of just using stop-machine is actually the right one just because it's _so_ simple. (That said, we're not running out of vm flags yet, and if we were, we could just add another word. We're already wasting that space right now on 64-bit by calling it "unsigned long"). Linus --
Also, stop-machine will not work if we come to the conclusion that i_mmap_lock and anon_vma->lock need to be sleepable locks. Thanks, Robin Holt --
Set the vma flag when we locked it and then skip when we find it locked right? This would be in addition to the global lock? stop-machine would work for KVM since its a once in a Guest OS time of thing. But GRU, KVM and eventually Infiniband need the ability to attach in a reasonable timeframe without causing major hiccups for other We sure have enough flags. --
Yes. And clear it before unlocking (and again, testing if it's already clear - you mustn't unlock twice, so you must only unlock when the bit was set). You also (obviously) need to have somethign that guarantees that the lists themselves are stable over the whole sequence, but I assume you already have mmap_sem for reading (since you'd need it anyway just to follow the list). And if you have it for writing, it can obviously *act* as the global lock, since it would already guarantee mutual exclusion on that mm->mmap list. Linus --
Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are unique), we need to mark the address spaces/anonvma's. So the flag would need to be in the "struct anon_vma" (and struct address_space), not in the vma itself. My bad. So the flag wouldn't be one of the VM_xyzzy flags, and would require adding a new field to "struct anon_vma()" And related to that brain-fart of mine, that obviously also means that yes, the locking has to be stronger than "mm->mmap_sem" held for writing, so yeah, it would have be a separate global spinlock (or perhaps a blocking lock if you have some reason to protect anything else with this too). Linus --
So because the bitflag can't prevent taking the same lock twice on two different vmas in the same mm, we still can't remove the sorting, and the global lock won't buy much other than reducing the collisions. I can add that though. I think it's more interesting to put a cap on the number of vmas to min(1024,max_map_count). The sort time on an 8k array runs in constant time. kvm runs with 127 vmas allocated... --
Andrea. Take five minutes. Take a deep breadth. And *think* about actually reading what I wrote. The bitflag *can* prevent taking the same lock twice. It just needs to be in the right place. IOW, just make it be in that anon_vma (and the address_space). No sorting Shut up already. It's not constant time just because you can cap the overhead. We're not in a university, and we care about performance, not your made-up big-O notation. Linus --
It's not that I didn't read it, but to do it I've to grow every anon_vma by 8 bytes. I thought it was implicit that the conclusion of your email is that it couldn't possibly make sense to grow the size of each anon_vma by 33%, when nobody loaded the kvm or gru or xpmem kernel modules. It surely isn't my preferred solution, while capping the number of vmas to 1024 means sort() will make around 10240 steps, Matt call tell the exact number. The big cost shouldn't be in sort. Even 512 vmas will be more than enough for us infact. Note that I've a cond_resched in the sort compare function and I can re-add the signal_pending check. I had the signal_pending check in the original version that didn't use sort() but was doing an inner loop, I thought signal_pending wasn't necessary after speeding it up with sort(). But I can add it again, so then we'll only fail to abort inside sort() and we'll be able to break the loop while taking all the spinlocks, but with such as small array that can't be an issue and the result will surely run faster than stop_machine with zero ram and cpu overhead for the VM (besides stop_machine can't work or we'd need to disable preemption between invalidate_range_start/end, even removing the xpmem schedule-inside-mmu-notifier requirement). --
Andrea, I'm not interested. I've stated my standpoint: the code being discussed is crap. We're not doing that. Not in the core VM. I gave solutions that I think aren't crap, but I already also stated that I have no problems not merging it _ever_ if no solution can be found. The whole issue simply isn't even worth the pain, imnsho. Linus --
Hi Andrew, But the problem is that we've to stop the critical section in the place I marked with "========" while mmu_notifier_register runs. Otherwise the driver calling mmu_notifier_register won't know if it's safe to start establishing secondary sptes/tlbs. If the driver will establish sptes/tlbs with get_user_pages/follow_page the page could be freed immediately later when zap_page_range starts. So if CPU1 doesn't take the global_lock before proceeding in zap_page_range (inside vmtruncate i_mmap_lock that is represented as b->lock above) we're in trouble. What we can do is to replace the mm_lock with a spin_lock(&global_lock) only if all places that takes i_mmap_lock takes the global lock first and that hurts scalability of the fast paths that are performance critical like vmtruncate and anon_vma->lock. Perhaps they're not so performance critical, but surely much more performant critical than mmu_notifier_register ;). The idea of polluting various scalable paths like truncate() syscall in the VM with a global spinlock frightens me, I'd rather return to invalidate_page() inside the PT lock removing both invalidate_range_start/end. Then all serialization against the mmu notifiers will be provided by the PT lock that the secondary mmu page fault also has to take in get_user_pages (or follow_page). In any case that is a better solution that won't slowdown the VM when MMU_NOTIFIER=y even if it's a bit slower for GRU, for KVM performance is about the same with or without invalidate_range_start/end. I didn't think anybody could care about how long mmu_notifier_register takes until it returns compared to all heavyweight operations that happens to start a VM (not only in the kernel but in the guest too). Infact if it's security that we worry about here, can put a cap of _time_ that mmu_notifier_register can take before it fails, and we fail to start a VM if it takes more than 5sec, that's still fine as the failure could happen for other reasons too like vmalloc ...
If mmy_notifier_register() takes the global lock, it cannot happen here. It will be blocked (by CPU0), so there's no way it can then cause an ABBA deadlock. It will be released when CPU0 has taken *all* the locks it NO! You replace mm_lock() with the sequence that Andrew gave you (and I described): spin_lock(&global_lock) .. get all locks UNORDERED .. spin_unlock(&global_lock) and you're now done. You have your "mm_lock()" (which still needs to be renamed - it should be a "mmu_notifier_lock()" or something like that), but you don't need the insane sorting. At most you apparently need a way to recognize duplicates (so that you don't deadlock on yourself), which looks like a simple bit-per-vma. The global lock doesn't protect any data structures itself - it just protects two of these mm_lock() functions from ABBA'ing on each other! Linus --
Andrea's mm_lock could have wider impact. It is the first effective way that I have seen of temporarily holding off reclaim from an address space. It sure is a brute force approach. --
Well, I don't think the naming necessarily has to be about notifiers, but it should be at least a *bit* more scary than "mm_lock()", to make it clear that it's pretty dang expensive. Even without the vmalloc and sorting, if it would be used by "normal" things it would still be very expensive for some cases - running thngs like ElectricFence, for example, will easily generate thousands and thousands of vma's in a process. Linus --
The only improvement I can imagine on mm_lock, is after changing the name to global_mm_lock() to reestablish the signal_pending check in the loop that takes the spinlock and to backoff and put the cap to 512 vmas so the ram wasted on anon-vmas wouldn't save more than 10-100usec at most (plus the vfree that may be a bigger cost but we're ok to pay it and it surely isn't security related). Then on the long term we need to talk to Matt on returning a parameter to the sort function to break the loop. After that we remove the 512 vma cap and mm_lock is free to run as long as it wants like /dev/urandom, nobody can care less how long it will run before returning as long as it reacts to signals. This is the right way if we want to support XPMEM/GRU efficiently and without introducing unnecessary regressions in the VM fastpaths and VM footprint. --
Look Linus has told you what to do. Why not simply do it? --
When it looked like we could use vm_flags to remove sort, that looked an ok optimization, no problem with optimizations, I'm all for optimizations if they cost nothing to the VM fast paths and VM footprint. But removing sort isn't worth it if it takes away ram from the VM even when global_mm_lock will never be called. sort is like /dev/urandom so after sort is fixed to handle signals (and I expect Matt will help with this) we'll remove the 512 vmas cap. In the meantime we can live with the 512 vmas cap that guarantees sort won't take more than a few dozen usec. Removing sort() is the only thing that the anon vma bitflag can achieve and it's clearly not worth it and it would go in the wrong direction (fixing sort to handle signals is clearly the right direction, if sort is a concern at all). Adding the global lock around global_mm_lock to avoid one global_mm_lock to collide against another global_mm_lock is sure ok with me, if that's still wanted now that it's clear removing sort isn't worth it, I'm neutral on this. Christoph please go ahead and add the bitflag to anon-vma yourself if you want. If something isn't technically right I don't do it no matter who asks it. --
Andrea, you really are a piece of work. Your arguments have been bogus crap that didn't even understand what was going on from the beginning, and now you continue to do that. What exactly "takes away ram" from the VM? The notion of adding a flag to "struct anon_vma"? The one that already has a 4 byte padding thing on x86-64 just after the spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two bytes of padding if we didn't just make the spinlock type unconditionally 32 bits rather than the 16 bits we actually _use_? IOW, you didn't even look at it, did you? But whatever. I clearly don't want a patch from you anyway, so .. Linus --
Actually I looked both at the struct and at the slab alignment just in
case it was changed recently. Now after reading your mail I also
compiled it just in case.
2.6.26-rc1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma 260 576 24 144 1 : tunables 120 60 8 : slabdata 4 4 0
^^ ^^^
2.6.26-rc1 + below patch
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -27,6 +27,7 @@ struct anon_vma {
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas */
+ int flag:1;
};
#ifdef CONFIG_MMU
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma 250 560 32 112 1 : tunables 120 60 8 : slabdata 5 5 0
^^ ^^^
Not a big deal sure to grow it 33%, it's so small anyway, but I don't
see the point in growing it. sort() can be interrupted by signals, and
until it can we can cap it to 512 vmas making the worst case taking
dozen usecs, I fail to see what you have against sort().
Again: if a vma bitflag + global lock could have avoided sort and run
O(N) instead of current O(N*log(N)) I would have done that
immediately, infact I was in the process of doing it when you posted
the followup. Nothing personal here, just staying technical. Hope you
too.
--
So you need to add the flag _after_ ->lock and _before_ ->head....
Pekka
--
Oh should have taken my morning coffee first, before ->lock works obviously as well. --
Sorry, Linus's right: I didn't realize the "after the spinlock" was
literally after the spinlock, I didn't see the 4 byte padding when I
read the code and put the flag:1 in. If put between ->lock and ->head
it doesn't take more memory on x86-64 as described literlly. So the
next would be to find another place like that in the address
space. Perhaps after the private_lock using the same trick or perhaps
the slab alignment won't actually alter the number of slabs per page
regardless.
I leave that to Christoph, he's surely better than me at doing this, I
give it up entirely and I consider my attempt to merge a total failure
and I strongly regret it.
On a side note the anon_vma will change to this when XPMEM support is
compiled in:
struct anon_vma {
- spinlock_t lock; /* Serialize access to vma list */
+ atomic_t refcount; /* vmas on the list */
+ struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head; /* List of private "related" vmas
*/
};
not sure if it'll grow in size or not after that but let's say it's
not a big deal.
--
Put the flag after the spinlock, not after the "list_head". Also, we'd need to make it unsigned short flag:1; _and_ change spinlock_types.h to make the spinlock size actually match the required size (right now we make it an "unsigned int slock" even when we actually only use 16 bits). See the #if (NR_CPUS < 256) code in <asm-x86/spinlock.h>. Linus --
Btw, this is an issue only on 32-bit x86, because on 64-bit one we already have the padding due to the alignment of the 64-bit pointers in the list_head (so there's already empty space there). On 32-bit, the alignment of list-head is obviously just 32 bits, so right now the structure is "perfectly packed" and doesn't have any empty space. But that's just because the spinlock is unnecessarily big. (Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the structure really will grow. That's a very odd configuration, though, and not one I feel we really need to care about). Linus --
I see two ways to implement it:
1) use #ifdef and make it zero overhead for 64bit only without playing
any non obvious trick.
struct anon_vma {
spinlock_t lock;
#ifdef CONFIG_MMU_NOTIFIER
int global_mm_lock:1;
#endif
struct address_space {
spinlock_t private_lock;
#ifdef CONFIG_MMU_NOTIFIER
int global_mm_lock:1;
#endif
2) add a:
#define AS_GLOBAL_MM_LOCK (__GFP_BITS_SHIFT + 2) /* global_mm_locked */
and use address_space->flags with bitops
And as Andrew pointed me out by PM, for the anon_vma we can use the
LSB of the list.next/prev because the list can't be browsed when the
lock is taken, so taking the lock and then setting the bit and
clearing the bit before unlocking is safe. The LSB will always read 0
even if it's under list_add modification when the global spinlock isn't
taken. And after taking the anon_vma lock we can switch it the LSB
from 0 to 1 without races and the 1 will be protected by the
global spinlock.
The above solution is zero cost for 32bit too, so I prefer it.
So I now agree with you this is a great idea on how to remove sort()
and vmalloc and especially vfree without increasing the VM footprint.
I'll send an update with this for review very shortly and I hope this
goes in so KVM will be able to swap and do many other things very well
starting in 2.6.26.
Thanks a lot,
Andrea
--
Another possibility, would something like this work? /* * null out the begin function, no new begin calls can be made */ rcu_assing_pointer(my_notifier.invalidate_start_begin, NULL); /* * lock/unlock all rmap locks in any order - this ensures that any * pending start() will have its end() function called. */ mm_barrier(mm); /* * now that no new start() call can be made and all start()/end() pairs * are complete we can remove the notifier. */ mmu_notifier_remove(mm, my_notifier); This requires a mmu_notifier instance per attached mm and that __mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain the function. But I think its enough to ensure that: for each start an end will be called It can however happen that end is called without start - but we could handle that I think. --
We don't need that, it's perfectly ok if start is called but end is not, it's ok to unregister in the middle as I guarantee ->release is called before mmu_notifier_unregister returns (if ->release is needed at all, not the case for KVM/GRU). Unregister is already solved with srcu/rcu without any additional complication as we don't need the guarantee that for each start an end The only reason mm_lock() was introduced is to solve "register", to guarantee that for each end there was a start. We can't handle end called without start in the driver. The reason the driver must be prevented to register in the middle of start/end, if that if it ever happens the driver has no way to know it must stop the secondary mmu page faults to call get_user_pages and instantiate sptes/secondarytlbs on pages that will be freed as soon as zap_page_range starts. --
Right - then I got it backwards. Never mind me then.. --
I thought the thing to remove was the "get all locks". I didn't realize the major problem was only the sorting of the array. I'll add the global lock, it's worth it as it drops the worst case number of steps by log(65536) times. Furthermore surely two concurrent mm_notifier_lock will run faster as it'll decrease the cacheline collisions. Since you ask to call it mmu_notifier_lock I'll also move it to mmu_notifier.[ch] as consequence. --
You said yourself that mmu_notifier_register can be as slow as you want ... what about you use stop_machine for it ? I'm not even joking Ben. --
We can put a cap of time + a cap of vmas. It's not important if it fails, the only useful case we know it, and it won't be slow at all. The failure can happen because the cap of time or the cap of vmas or the cap vmas triggers or there's a vmalloc shortage. We handle the failure in userland of course. There are zillon of allocations needed anyway, any one of them can fail, so this isn't a new fail path, is the same fail path that always existed before mmu_notifiers existed. I can't possibly see how adding a new global wide lock that forces all truncate to be serialized against each other, practically eliminating the need of the i_mmap_lock, could be superior to an approach that doesn't cause the overhead to the VM at all, and only require kvm to pay for an additional cost when it startup. Furthermore the only reason I had to implement mm_lock was to fix the invalidate_range_start/end model, if we go with only invalidate_page and invalidate_pages called inside the PT lock and we use the PT lock to serialize, we don't need a mm_lock anymore and no new lock from the VM either. I tried to push for that, but everyone else wanted invalidate_range_start/end. I only did the only possible thing to do: to make invalidate_range_start safe to make everyone happy without slowing down the VM. --
Sorry for not having completely answered to this. I initially thought stop_machine could work when you mentioned it, but I don't think it can even removing xpmem block-inside-mmu-notifier-method requirements. For stop_machine to solve this (besides being slower and potentially not more safe as running stop_machine in a loop isn't nice), we'd need to prevent preemption in between invalidate_range_start/end. I think there are two ways: 1) add global lock around mm_lock to remove the sorting 2) remove invalidate_range_start/end, nuke mm_lock as consequence of it, and replace all three with invalidate_pages issued inside the PT lock, one invalidation for each 512 pte_t modified, so serialization against get_user_pages becomes trivial but this will be not ok at all for SGI as it increases a lot their invalidation frequency For KVM both ways are almost the same. I'll implement 1 now then we'll see... --
This is what I suggested to begin with before this crazy locking was developed to handle these corner cases... because I wanted the locking to match with the tried and tested Linux core mm/ locking rather than introducing this new idea. I don't see why you're bending over so far backwards to accommodate this GRU thing that we don't even have numbers for and could actually potentially be batched up in other ways (eg. using mmu_gather or mmu_gather-like idea). The bare essential, matches-with-Linux-mm mmu notifiers that I first saw of yours was pretty elegant and nice. The idea that "only one solution must go in and handle everything perfectly" is stupid because it is quite obvious that the sleeping invalidate idea is just an order of magnitude or two more complex than the simple atomic invalidates needed by you. We should and could easily have had that code upstream long ago :( I'm not saying we ignore the sleeping or batching cases, but we should introduce the ideas slowly and carefully and assess the pros and cons --
I agree, we're better off generalizing the mmu_gather batching instead... I had some never-finished patches to use the mmu_gather for pretty much everything except single page faults, tho various subtle differences between archs and lack of time caused me to let them take the dust and not finish them... I can try to dig some of that out when I'm back from my current travel, though it's probably worth re-doing from scratch now. Ben. --
Well, the first thing would be just to get rid of the whole start/end idea, which completely departs from the standard Linux system of clearing ptes, then flushing TLBs, then freeing memory. The onus would then be on GRU to come up with some numbers to justify batching, and a patch which works nicely with the rest of the Linux mm. And yes, mmu-gather is *the* obvious first choice of places to I always liked the idea as you know. But I don't think that should be mixed in with the first iteration of the mmu notifiers patch anyway. GRU actually can work without batching, but there is simply some (unquantified to me) penalty for not batching it. I think it is far better to first put in a clean and simple and working functionality first. The idea that we have to unload some monster be-all-and-end-all solution onto mainline in a single go seems counter productive to me. --
Unfortunately, we are at least several months away from being able to provide numbers to justify batching - assuming it is really needed. We need large systems running real user workloads. I wish we had that available right now, but we don't. It also depends on what you mean by "no batching". If you mean that the notifier gets called for each pte that is removed from the page table, then the overhead is clearly very high for some operations. Consider the unmap of a very large object. A TLB flush per page will be too costly. However, something based on the mmu_gather seems like it should provide exactly what is needed to do efficient flushing of the TLB. The GRU does not require that it be called in a sleepable context. As long as the notifier callout provides the mmu_gather and vaddr range being flushed, the GRU can -- jack --
Right. So what? It's still about a million times faster than what the code does now. You comment about "great smp scalability optimization" just shows that you're a moron. It is no such thing. The fact is, it's a horrible pessimization, since even SMP will be *SLOWER*. It will just be "less slower" when you have a million CPU's and they all try to do this at the same time (which probably never ever happens). In other words, "scalability" is totally meaningless. The only thing that matters is *performance*. If the "scalable" version performs WORSE, then So what you're saying is that performance doesn't matter? So why do you do the ugly crazy hundred-line implementation, when a simple two-liner would do equally well? Your arguments are crap. Anyway, discussion over. This code doesn't get merged. It doesn't get merged before 2.6.26, and it doesn't get merged _after_ either. Rewrite the code, or not. I don't care. I'll very happily not merge crap for the rest of my life. Linus --
mmu_notifier_register only runs when windows or linux or macosx boots. Who could ever care of the msec spent in mm_lock compared to the time it takes to linux to boot? What you're proposing is to slowdown AIM and certain benchmarks 20% or If you want the global lock I'll do it no problem, I just think it's obviously inferior solution for 99% of users out there (including kvm users that will also have to take that lock while kvm userland runs). In my view the most we should do in this area is to reduce further the max number of locks to take if max_map_count already isn't enough. --
To remove mm_lock without adding an horrible system-wide lock before every i_mmap_lock etc.. we've to remove invalidate_range_begin/end. Then we can return to an older approach of doing only invalidate_page and serializing it with the PT lock against get_user_pages. That works fine for KVM but GRU will have to flush the tlb once every time we drop the PT lock, that means once per each 512 ptes on x86-64 etc... instead of a single time for the whole range regardless how large the range is. --
Andrea, you're *this* close to going to my list of people who it is not worth reading email from, and where it's better for everybody involved if I just teach my spam-filter about it. That code was CRAP. That code was crap whether it's used once, or whether it's used a million times. Stop making excuses for it just because it's not performance- critical. So give it up already. I told you what the non-crap solution was. It's simpler, faster, and is about two lines of code compared to the crappy version (which was what - 200 lines of crap with a big comment on top of it just to explain the idiocy?). So until you can understand the better solution, don't even bother emailing me, ok? Because the next email I get from you that shows the intelligence level of a gnat, I'll just give up and put you in a spam-filter. Because my IQ goes down just from reading your mails. I can't afford to continue. Linus --
That fix is going to be fairly difficult. I will argue impossible. First, a little background. SGI allows one large numa-link connected machine to be broken into seperate single-system images which we call partitions. XPMEM allows, at its most extreme, one process on one partition to grant access to a portion of its virtual address range to processes on another partition. Those processes can then fault pages and directly share the memory. In order to invalidate the remote page table entries, we need to message (uses XPC) to the remote side. The remote side needs to acquire the importing process's mmap_sem and call zap_page_range(). Between the messaging and the acquiring a sleeping lock, I would argue this will require sleeping locks in the path prior to the mmu_notifier invalidate_* callouts(). On a side note, we currently have XPMEM working on x86_64 SSI, and ia64 cross-partition. We are in the process of getting XPMEM working on x86_64 cross-partition in support of UV. Thanks, Robin Holt --
You simply will *have* to do it without locally holding all the MM spinlocks. Because quite frankly, slowing down all the normal VM stuff for some really esoteric hardware simply isn't acceptable. We just don't do it. So what is it that actually triggers one of these events? The most obvious solution is to just queue the affected pages while holding the spinlocks (perhaps locking them locally), and then handling all the stuff that can block after releasing things. That's how we normally do these things, and it works beautifully, without making everything slower. Sometimes we go to extremes, and actually break the locks are restart (ugh), and it gets ugly, but even that tends to be preferable to using the wrong locking. The thing is, spinlocks really kick ass. Yes, they restrict what you can do within them, but if 99.99% of all work is non-blocking, then the really odd rare blocking case is the one that needs to accomodate, not the rest. Linus --
Why do you need to take mmap_sem in order to shoot down pagetables of the process? It would be nice if this can just be done without sleeping. --
We are trying to shoot down page tables of a different process running on a different instance of Linux running on Numa-link connected portions of the same machine. The messaging is clearly going to require sleeping. Are you suggesting we need to rework XPC communications to not require sleeping? I think that is going to be impossible since the transfer engine requires a sleeping context. Additionally, the call to zap_page_range expects to have the mmap_sem held. I suppose we could use something other than zap_page_range and atomically clear the process page tables. Doing that will not alleviate the need to sleep for the messaging to the other partitions. Thanks, Robin --
Right. You can zap page tables without sleeping, if you're careful. I don't know that we quite do that for anonymous pages at the moment, but it I guess that you have found a way to perform TLB flushing within coherent domains over the numalink interconnect without sleeping. I'm sure it would be possible to send similar messages between non coherent domains. So yes, I'd much rather rework such highly specialized system to fit in closer with Linux than rework Linux to fit with these machines (and zap_page_range does not expect to have mmap_sem held. I think for anon pages it is always called with mmap_sem, however try_to_unmap_anon is not (although it expects page lock to be held, I think we should be able No, but I'd venture to guess that is not impossible to implement even on your current hardware (maybe a firmware update is needed)? --
I assume by coherent domains, your are actually talking about system images. Our memory coherence domain on the 3700 family is 512 processors on 128 nodes. On the 4700 family, it is 16,384 processors on 4096 nodes. We extend a "Read-Exclusive" mode beyond the coherence domain so any processor is able to read any cacheline on the system. We also provide uncached access for certain types of memory beyond the coherence domain. For the other partitions, the exporting partition does not know what virtual address the imported pages are mapped. The pages are frequently mapped in a different order by the MPI library to help with MPI collective operations. For the exporting side to do those TLB flushes, we would need to replicate all that importing information back to the exporting side. Additionally, the hardware that does the TLB flushing is protected by a spinlock on each system image. We would need to change that simple spinlock into a type of hardware lock that would work (on 3700) outside the processors coherence domain. The only way to do that is to use uncached addresses with our Atomic Memory Operations which do the cmpxchg at the memory controller. The uncached accesses are an order But it isn't that we are having a problem adapting to just the hardware. zap_page_range calls unmap_vmas which walks to vma->next. Are you saying that can be walked without grabbing the mmap_sem at least readably? I feel my understanding of list management and locking completely Are you suggesting the sending side would not need to sleep or the receiving side? Assuming you meant the sender, it spins waiting for the remote side to acknowledge the invalidate request? We place the data into a previously agreed upon buffer and send an interrupt. At this point, we would need to start spinning and waiting for completion. Let's assume we never run out of buffer space. The receiving side receives an interrupt. The interrupt currently wakes an XPC thread to do the work of ...
One thing to realize is that most of the time (read: pretty much *always*) when we have the problem of wanting to sleep inside a spinlock, the solution is actually to just move the sleeping to outside the lock, and then have something else that serializes things. That way, the core code (protected by the spinlock, and in all the hot paths) doesn't sleep, but the special case code (that wants to sleep) can have some other model of serialization that allows sleeping, and that includes as a small part the spinlocked region. I do not know how XPMEM actually works, or how you use it, but it seriously sounds like that is how things *should* work. And yes, that probably means that the mmu-notifiers as they are now are simply not workable: they'd need to be moved up so that they are inside the mmap semaphore but not the spinlocks. Can it be done? I don't know. But I do know that I'm unlikely to accept a noticeable slowdown in some very core code for a case that affects about 0.00001% of the population. In other words, I think you *have* to do it. Linus --
We are in the process of attempting this now. Unfortunately for SGI, Christoph is on vacation right now so we have been trying to work it internally. We are looking through two possible methods, one we add a callout to the tlb flush paths for both the mmu_gather and flush_tlb_page locations. The other we place a specific callout seperate from the gather callouts in the paths we are concerned with. We will look at both more carefully before posting. In either implementation, not all call paths would require the stall to ensure data integrity. Would it be acceptable to always put a sleepable stall in even if the code path did not require the pages be unwritable prior to continuing? If we did that, I would be freed from having a pool of invalidate threads ready for XPMEM to use for that work. Maybe there is a better way, but the sleeping requirement we would have on the threads make most options seem unworkable. Thanks, Robin --
I'm not understanding the question. If you can do you management outside of the spinlocks, then you can obviously do whatever you want, including sleping. It's changing the existing spinlocks to be sleepable that is not acceptable, because it's such a performance problem. Linus --
The problem is that the code in rmap.c try_to_umap() and friends loops over reverse maps after taking a spinlock. The mm_struct is only known after the rmap has been acccessed. This means *inside* the spinlock. That is why I tried to convert the locks to scan the revese maps to semaphores. If that is done then one can indeed do the callouts outside of With larger number of processor semaphores make a lot of sense since the holdoff times on spinlocks will increase. If we go to sleep then the processor can do something useful instead of hogging a cacheline. A rw lock there can also increase concurrency during reclaim espcially if the anon_vma chains and the number of address spaces mapping a page is high. --
So you queue them. That's what we do with things like the dirty bit. We need to hold various spinlocks to look up pages, but then we can't actually call the filesystem with the spinlock held. Converting a spinlock to a waiting lock for things like that is simply not acceptable. You have to work with the system. Yeah, there's only a single bit worth of information on whether a page is dirty or not, so "queueing" that information is trivial (it's just the return value from "page_mkclean_file()". Some things are harder than others, and I suspect you need some kind of "gather" structure to queue up all the vma's that can be affected. But it sounds like for the case of rmap, the approach of: - the page lock is the higher-level "sleeping lock" (which makes sense, since this is very close to an IO event, and that is what the page lock is generally used for) But hey, it could be anything else - maybe you have some other even bigger lock to allow you to handle lots of pages in one go. - with that lock held, you do the whole rmap dance (which requires spinlocks) and gather up the vma's and the struct mm's involved. - outside the spinlocks you then do whatever it is you need to do. This doesn't sound all that different from TLB shoot-down in SMP, and the "mmu_gather" structure. Now, admittedly we can do the TLB shoot-down while holding the spinlocks, but if we couldn't that's how we'd still do it: it would get more involved (because we'd need to guarantee that the gather can hold *all* the pages - right now we can just flush in the middle if we need to), but it wouldn't be all that fundamentally different. And no, I really haven't even wanted to look at what XPMEM really needs to do, so maybe the above thing doesn't work for you, and you have other issues. I'm just pointing you in a general direction, not trying to say "this is exactly how to get there". Linus --
Implementation of what Linus suggested: Defer the XPMEM processing until
after the locks are dropped. Allow immediate action by GRU/KVM.
This patch implements a callbacks for device drivers that establish external
references to pages aside from the Linux rmaps. Those either:
1. Do not take a refcount on pages that are mapped from devices. They
have a TLB cache like handling and must be able to flush external references
from atomic contexts. These devices do not need to provide the _sync methods.
2. Do take a refcount on pages mapped externally. These are handling by
marking pages as to be invalidated in atomic contexts. Invalidation
may be started by the driver. A _sync variant for the individual or
range unmap is called when we are back in a nonatomic context. At that
point the device must complete the removal of external references
and drop its refcount.
With the mm notifier it is possible for the device driver to release external
references after the page references are removed from a process that made
them available.
With the notifier it becomes possible to get pages unpinned on request and thus
avoid issues that come with having a large amount of pinned pages.
A device driver must subscribe to a process using
mm_register_notifier(struct mm_struct *, struct mm_notifier *)
The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space.
When the process terminates then first the ->release method is called to
remove all pages still mapped to the proces.
Before the mm_struct is freed the ->destroy() method is called which
should dispose of the mm_notifier structure.
The following callbacks exist:
invalidate_range(notifier, mm_struct *, from , to)
Invalidate a range of addresses. The invalidation is
not required to complete immediately.
invalidate_range_sync(notifier, mm_struct *, from, to)
This is called after some invalidate_range callouts.
The driver may only return when the ...Right. Or the exporting side could be passed tokens that it tracks itself, I'm not sure if you're thinking about what I'm thinking of. With the scheme I'm imagining, all you will need is some way to raise an IPI-like interrupt on the target domain. The IPI target will have a driver to handle the interrupt, which will determine the mm and virtual addresses which are to be invalidated, and will then tear down those page tables and issue hardware TLB flushes within its domain. On the Linux side, Oh, I get that confused because of the mixed up naming conventions there: unmap_page_range should actually be called zap_page_range. But FWIW, mmap_sem isn't held to protect vma->next there anyway, because at that point the vmas are detached from the mm's rbtree and linked list. Sure, you obviously would need to rework your code because it's been written with the assumption that it can sleep. What is XPMEM exactly anyway? I'd assumed it is a Linux driver. --
We are pursuing Linus' suggestion currently. This discussion is completely unrelated to that work. We would need to deposit the payload into a central location to do the invalidate, correct? That central location would either need to be indexed by physical cpuid (65536 possible currently, UV will push that up much higher) or some sort of global id which is difficult because remote partitions can reboot giving you a different view of the machine and running partitions would need to be updated. Alternatively, that central location would need to be protected by a global lock or atomic type operation, but a majority of the machine does not have coherent access to other partitions so they would need to use uncached operations. Essentially, take away from this paragraph that it is going to be really slow or really large. Then we need to deposit the information needed to do the invalidate. Lastly, we would need to interrupt. Unfortunately, here we have a thundering herd. There could be up to 16256 processors interrupting the same processor. That will be a lot of work. It will need to look up the mm (without grabbing any sleeping locks in either xpmem or the kernel) and do the tlb invalidates. Unfortunately, the sending side is not free to continue (in most cases) until it knows that the invalidate is completed. So it will need to spin waiting for a completion signal will could be as simple as an uncached word. But how will it handle the possible failure of the other partition? How will it detect that failure and recover? A timeout value could be difficult to gauge because the other side may be off doing a considerable It is an assumption based upon some of the kernel functions we call doing things like grabbing mutexes or rw_sems. That pushes back to us. I think the kernel's locking is perfectly reasonable. The problem we run into is we are trying to get from one context in one kernel to a different XPMEM allows one process to make a portion of its virtual ...
You don't need to interrupt every time. Place your data in a queue (you do support rmw operations, right?) and interrupt. Invalidates from other processors will see that the queue hasn't been processed yet and skip the interrupt. -- error compiling committee.c: too many arguments to function --
How is that synchronized with code that walks the same pagetable. These walks may not hold mmap_sem either. I would expect that one could only remove a portion of the pagetable where we have some sort of guarantee that no accesses occur. So the removal of the vma prior ensures that? --
I don't really understand the question. If you remove the pte and invalidate the TLBS on the remote image's process (importing the page), then it can of course try to refault the page in because it's vma is still there. But you catch that refault in your driver , which can prevent the page from being faulted back in. --
I think Christoph's question has more to do with faults that are in flight. A recently requested fault could have just released the last lock that was holding up the invalidate callout. It would then begin messaging back the response PFN which could still be in flight. The invalidate callout would then fire and do the interrupt shoot-down while that response was still active (essentially beating the inflight response). The invalidate would clear up nothing and then the response would insert the PFN after it is no longer the correct PFN. Thanks, Robin --
I just looked over XPMEM. I think we could make this work. We already have a list of active faults which is protected by a simple spinlock. I would need to nest this lock within another lock protected our PFN table (currently it is a mutex) and then the invalidate interrupt handler would need to mark the fault as invalid (which is also currently there). I think my sticking points with the interrupt method remain at fault containment and timeout. The inability of the ia64 processor to handle provide predictive failures for the read/write of memory on other partitions prevents us from being able to contain the failure. I don't think we can get the information we would need to do the invalidate without introducing fault containment issues which has been a continous area of concern for our customers. Thanks, Robin --
Really? You can get the information through via a sleeping messaging API, but not a non-sleeping one? What is the difference from the hardware POV? --
That was covered in the early very long discussion about 28 seconds. The read timeout for the BTE is 28 seconds and it automatically retried for certain failures. In interrupt context, that is 56 seconds without any subsequent interrupts of that or lower priority. Thanks, Robin --
I thought you said it would be possible to get the required invalidate information without using the BTE. Couldn't you use XPMEM pages in the kernel to read the data out of, if nothing else? --
I was wrong about that. I thought it was safe to do an uncached write, but it turns out any processor write is uncontained and the MCA that surfaces would be fatal. Likewise for the uncached read. --
Oh, so the BTE transfer is purely for fault isolation. I was thinking you guys might have sufficient control of the hardware to be able to do it at the level of CPU memory operations, but if it is some limitation of ia64, then I guess that's a problem. How do you do fault isolation of userspace XPMEM accesses? --
The MCA handler can see the fault was either in userspace (processor priviledge level I believe) or in the early kernel entry where it is saving registers. When it sees that condition, it kills the users process. While in kernel space, there is no equivalent of the saving user state that forces the processor stall. --
That is currently true but we are also working on XPMEM for x86_64. The new XPMEM code should be posted within a few weeks. --- jack --
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1210115508 -7200
# Node ID 94eaa1515369e8ef183e2457f6f25a7f36473d70
# Parent 6b384bb988786aa78ef07440180e4b2948c4c6a2
mm_lock-rwsem
Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc
unsigned long flags, struct page **pages);
struct mm_lock_data {
- spinlock_t **i_mmap_locks;
- spinlock_t **anon_vma_locks;
- size_t nr_i_mmap_locks;
- size_t nr_anon_vma_locks;
+ struct rw_semaphore **i_mmap_sems;
+ struct rw_semaphore **anon_vma_sems;
+ size_t nr_i_mmap_sems;
+ size_t nr_anon_vma_sems;
};
extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st
static int mm_lock_cmp(const void *a, const void *b)
{
- unsigned long _a = (unsigned long)*(spinlock_t **)a;
- unsigned long _b = (unsigned long)*(spinlock_t **)b;
+ unsigned long _a = (unsigned long)*(struct rw_semaphore **)a;
+ unsigned long _b = (unsigned long)*(struct rw_semaphore **)b;
cond_resched();
if (_a < _b)
@@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co
return 0;
}
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems,
int anon)
{
struct vm_area_struct *vma;
@@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (anon) {
if (vma->anon_vma)
- locks[i++] = &vma->anon_vma->lock;
+ sems[i++] = &vma->anon_vma->sem;
} else {
if ...# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1210115798 -7200 # Node ID eb924315351f6b056428e35c983ad28040420fea # Parent 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b mmap sems This patch adds a lock ordering rule to avoid a potential deadlock when multiple mmap_sems need to be locked. Signed-off-by: Dean Nelson <dcn@sgi.com> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki * * ->i_mutex (generic_file_buffered_write) * ->mmap_sem (fault_in_pages_readable->do_page_fault) + * + * When taking multiple mmap_sems, one should lock the lowest-addressed + * one first proceeding on up to the highest-addressed one. * * ->i_mutex * ->i_alloc_sem (various) --
# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1210115797 -7200 # Node ID 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b # Parent 94eaa1515369e8ef183e2457f6f25a7f36473d70 export zap_page_range for XPMEM XPMEM would have used sys_madvise() except that madvise_dontneed() returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages XPMEM imports from other partitions and is also true for uncached pages allocated locally via the mspec allocator. XPMEM needs zap_page_range() functionality for these types of pages as well as 'normal' pages. Signed-off-by: Dean Nelson <dcn@sgi.com> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a return unmap_vmas(vma, address, end, &nr_accounted, details); } +EXPORT_SYMBOL_GPL(zap_page_range); /* * Do a quick page-table lookup for a single page. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build with CONFIG_SSB_PCIHOST=n |
| Linux Kernel Mailing List | 9p: block-based virtio client |
