Hello, This is the latest and greatest version of the mmu notifier patch #v13. Changes are mainly in the mm_lock that uses sort() suggested by Christoph. This reduces the complexity from O(N**2) to O(N*log(N)). I folded the mm_lock functionality together with the mmu-notifier-core 1/12 patch to make it self-contained. I recommend merging 1/12 into -mm/mainline ASAP. Lack of mmu notifiers is holding off KVM development. We are going to rework the way the pages are mapped and unmapped to work with pure pfn for pci passthrough without the use of page pinning, and we can't without mmu notifiers. This is not just a performance matter. KVM/GRU and AFAICT Quadrics are all covered by applying the single 1/12 patch that shall be shipped with 2.6.26. The risk of brekage by applying 1/12 is zero. Both when MMU_NOTIFIER=y and when it's =n, so it shouldn't be delayed further. XPMEM support comes with the later patches 2-12, risk for those patches is >0 and this is why the mmu-notifier-core is numbered 1/12 and not 12/12. Some are simple and can go in immediately but not all are so simple. 2-12/12 are posted as usual for review by the VM developers and so Robin can keep testing them on XPMEM and they can be merged later without any downside (they're mostly orthogonal with 1/12). --
# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1208872186 -7200 # Node ID 3c804dca25b15017b22008647783d6f5f3801fa9 # Parent ea87c15371b1bd49380c40c3f15f1c7ca4438af5 Fix ia64 compilation failure because of common code include bug. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include <linux/rbtree.h> #include <linux/rwsem.h> #include <linux/completion.h> +#include <linux/cpumask.h> #include <asm/page.h> #include <asm/mmu.h> --
Looks like this is not complete. There are numerous .h files missing which means that various structs are undefined (fs.h and rmap.h are needed f.e.) which leads to surprises when dereferencing fields of these struct. It seems that mm_types.h is expected to be included only in certain contexts. Could you make sure to include all necessary .h files? Or add some docs to clarify the situation here. --
Robin, what other changes did you need to compile? I only did that one because I didn't hear any more feedback from you after I sent that patch, so I assumed it was enough. --
It was perfect. Nothing else was needed. Thanks, Robin --
# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1208872186 -7200 # Node ID a6672bdeead0d41b2ebd6846f731d43a611645b7 # Parent 3c804dca25b15017b22008647783d6f5f3801fa9 get_task_mm should not succeed if mmput() is running and has reduced the mm_users count to zero. This can occur if a processor follows a tasks pointer to an mm struct because that pointer is only cleared after the mmput(). If get_task_mm() succeeds after mmput() reduced the mm_users to zero then we have the lovely situation that one portion of the kernel is doing all the teardown work for an mm while another portion is happily using it. Signed-off-by: Christoph Lameter <clameter@sgi.com> diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -442,7 +442,8 @@ if (task->flags & PF_BORROWED_MM) mm = NULL; else - atomic_inc(&mm->mm_users); + if (!atomic_inc_not_zero(&mm->mm_users)) + mm = NULL; } task_unlock(task); return mm; --
I thought I had to signoff if I conributed with anything that could resemble copyright? Given I only merged that patch, I can add an Acked-by if you like, but merging this in my patchset was already an implicit ack ;-). --
No you have to include a signoff if the patch goes through your custody chain. This one did. Also add a From: Christoph Lameter <clameter@sgi.com> somewhere if you want to signify that the patch came from me. --
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872186 -7200
# Node ID ac9bb1fb3de2aa5d27210a28edf24f6577094076
# Parent a6672bdeead0d41b2ebd6846f731d43a611645b7
Moves all mmu notifier methods outside the PT lock (first and not last
step to make them sleep capable).
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -169,27 +169,6 @@
INIT_HLIST_HEAD(&mm->mmu_notifier_list);
}
-#define ptep_clear_flush_notify(__vma, __address, __ptep) \
-({ \
- pte_t __pte; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __pte = ptep_clear_flush(___vma, ___address, __ptep); \
- mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
- __pte; \
-})
-
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
-({ \
- int __young; \
- struct vm_area_struct *___vma = __vma; \
- unsigned long ___address = __address; \
- __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
- __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
- ___address); \
- __young; \
-})
-
#else /* CONFIG_MMU_NOTIFIER */
static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -221,9 +200,6 @@
{
}
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-
#endif /* CONFIG_MMU_NOTIFIER */
#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -194,11 +194,13 @@
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush_notify(vma, address, pte);
+ pteval = ptep_clear_flush(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
...Reverts a part of an earlier patch. Why isnt this merged into 1 of 12? --
To give zero regression risk to 1/12 when MMU_NOTIFIER=y or =n and the mmu notifiers aren't registered by GRU or KVM. Keep in mind that the whole point of my proposed patch ordering from day 0, is to keep as 1/N, the absolutely minimum change that fully satisfy GRU and KVM requirements. 4/12 isn't required by GRU/KVM so I keep it in a later patch. I now moved mmu_notifier_unregister in a later patch too for the same reason. --
We want a full solution and this kind of patching makes the patches difficuilt to review because later patches revert earlier ones. --
I know you rather want to see KVM development stalled for more months than to get a partial solution now that already covers KVM and GRU with the same API that XPMEM will also use later. It's very unfair on your side to pretend to stall other people development if what you need has stronger requirements and can't be merged immediately. This is especially true given it was publically stated that XPMEM never passed all regression tests anyway, so you can't possibly be in such an hurry like we are, we can't progress without this. Infact we can but it would be an huge effort and it would run _slower_ and it would all need to be deleted once mmu notifiers are in. Note that the only patch that you can avoid with your approach is mm_lock-rwsem, given that's software developed and not human developed I don't see a big deal of wasted effort. The main difference is the ordering. Most of the code is orthogonal so there's not much to revert. --
XPMEM has passed all regression tests using your version 12 notifiers. I have a bug in xpmem which shows up on our 8x oversubscription tests, but that is clearly my bug to figure out. Unfortunately it only shows up on a 128 processor machine so I have 1024 stack traces to sort through each time it fails. Does take a bit of time and a lot of SGI is under an equally strict timeline. We really needed the sleeping version into 2.6.26. We may still be able to get this accepted by vendor distros if we make 2.6.27. Thanks, Robin --
That's great news, thanks! I'd greatly appreciate if you could test #v13 too as I posted it. It already passed GRU and KVM regressions tests and it should work fine for XPMEM too. You can ignore the purely cosmetical error I managed to introduce in mm_lock_cmp (I implemented a BUG_ON that would have trigger if that wasn't a purely cosmetical issue, and it clearly doesn't trigger so you can be sure it's only cosmetical ;). Once I get confirmation that everyone is ok with #v13 I'll push a #v14 before Saturday with that cosmetical error cleaned up and mmu_notifier_unregister moved at the end (XPMEM will have unregister This is what I meant. As opposed we don't have any known bug left in this area, infact we need mmu_notifiers to _fix_ issues I identified that can't be fixed efficiently without mmu notifiers, and we need the mmu notifier to go I don't think vendor distro are less likely to take the patches 2-12 if 1/N (aka mmu-notifier-core) is merged in 2.6.26 especially at the light of kabi. --
I think GRU needs _unregister as well. Thanks, Robin --
The difference is that the non-sleeping variant can be shown not to affect stability or performance, even if configed in, as long as its not used. The sleeping variant will raise performance and stability concerns. I have zero objections to sleeping mmu notifiers; I only object to tying the schedules of the two together. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
We have had this workaround effort done years ago and have been suffering the ill effects of pinning for years. Had to deal with it again and again so I guess we do not matter? Certainly we have no interest in stalling KVM development. --
Yes. In addition to the pinning, there's lot of additional tlb flushing work to do in kvm without mmu notifiers as the swapcache could be freed by the vm the instruction after put_page unpins the page for whatever reason. --
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208870142 -7200
# Node ID ea87c15371b1bd49380c40c3f15f1c7ca4438af5
# Parent fb3bc9942fb78629d096bd07564f435d51d86e5f
Core of mmu notifiers.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1050,6 +1050,27 @@
unsigned long addr, unsigned long len,
unsigned long flags, struct page **pages);
+/*
+ * mm_lock will take mmap_sem writably (to prevent all modifications
+ * and scanning of vmas) and then also takes the mapping locks for
+ * each of the vma to lockout any scans of pagetables of this address
+ * space. This can be used to effectively holding off reclaim from the
+ * address space.
+ *
+ * mm_lock can fail if there is not enough memory to store a pointer
+ * array to all vmas.
+ *
+ * mm_lock and mm_unlock are expensive operations that may take a long time.
+ */
+struct mm_lock_data {
+ spinlock_t **i_mmap_locks;
+ spinlock_t **anon_vma_locks;
+ size_t nr_i_mmap_locks;
+ size_t nr_anon_vma_locks;
+};
+extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
+extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
+
extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -225,6 +225,9 @@
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
#endif
+#ifdef CONFIG_MMU_NOTIFIER
+ struct hlist_head mmu_notifier_list;
+#endif
};
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode ...This compare function looks unusual...
It should work, but sort() could be faster if the
if (a == b) test had a chance to be true eventually...
static int mm_lock_cmp(const void *a, const void *b)
{
unsigned long la = (unsigned long)*(spinlock_t **)a;
unsigned long lb = (unsigned long)*(spinlock_t **)b;
cond_resched();
if (la < lb)
return -1;
if (la > lb)
return 1;
return 0;
}
--
If your intent is to use the assumption that there are going to be few equal entries, you should have used likely(la > lb) to signal it's rarely going to return zero or gcc is likely free to do whatever it wants with the above. Overall that function is such a slow path that this is going to be lost in the noise. My suggestion would be to defer microoptimizations like this after 1/12 will be applied to mainline. Thanks! --
You need to compare *a to *b (at least, that's what you're doing for the < case). -- error compiling committee.c: too many arguments to function --
I am saying your intent was probably to test else if ((unsigned long)*(spinlock_t **)a == (unsigned long)*(spinlock_t **)b) return 0; Because a and b are pointers to the data you want to compare. You need Hum, it's not a micro-optimization, but a bug fix. :) Sorry if it was not clear --
The good thing is that even if this bug would lead to a system crash, it would be still zero risk for everybody that isn't using KVM/GRU actively with mmu notifiers. The important thing is that this patch has zero risk to introduce regressions into the kernel, both when enabled and disabled, it's like a new driver. I'll shortly resend 1/12 and likely 12/12 for theoretical correctness. For now you can go ahead testing with this patch as it'll work fine despite of the bug (if it wasn't the case I would have noticed already ;). --
Thanks for adding most of my enhancements. But 1. There is no real need for invalidate_page(). Can be done with invalidate_start/end. Needlessly complicates the API. One of the objections by Andrew was that there mere multiple callbacks that perform similar functions. 2. The locks that are used are later changed to semaphores. This is f.e. true for mm_lock / mm_unlock. The diffs will be smaller if the lock conversion is done first and then mm_lock is introduced. The way the patches are structured means that reviewers cannot review the final version of mm_lock etc etc. The lock conversion needs to come first. 3. As noted by Eric and also contained in private post from yesterday by me: The cmp function needs to retrieve the value before doing comparisons which is not done for the == of a and b. --
While I agree with that reading of Andrew's email about invalidate_page, I think the GRU hardware makes a strong enough case to justify the two seperate callouts. Due to the GRU hardware, we can assure that invalidate_page terminates all pending GRU faults (that includes faults that are just beginning) and can therefore be completed without needing any locking. The invalidate_page() callout gets turned into a GRU flush instruction and we return. Because the invalidate_range_start() leaves the page table information available, we can not use a single page _start to mimick that functionality. Therefore, there is a documented case justifying the seperate callouts. I agree the case is fairly weak, but it does exist. Given Andrea's unwillingness to move and Jack's documented case, it is my opinion the most likely compromise is to leave in the invalidate_page() callout. Thanks, Robin --
I retrieved the value, which is why mm_lock works perfectly on #v13 as well as #v12. It's not mandatory to ever return 0, so it won't produce any runtime error (there is a bugcheck for wrong sort ordering in my patch just in case it would generate any runtime error and it never did, or I would have noticed before submission), which is why I didn't need to release any hotfix yet and I'm waiting more time to get more comments before sending an update to clean up that bit. Mentioning this as the third and last point I guess shows how strong are your arguments against merging my mmu-notifier-core now, so in the end doing that cosmetical error payed off somehow. I'll send an update in any case to Andrew way before Saturday so hopefully we'll finally get mmu-notifiers-core merged before next week. Also I'm not updating my mmu-notifier-core patch anymore except for strict bugfixes so don't worry about any more cosmetical bugs being introduced while optimizing the code like it happened this time. The only other change I did has been to move mmu_notifier_unregister at the end of the patchset after getting more questions about its reliability and I documented a bit the rmmod requirements for ->release. we'll think later if it makes sense to add it, nobody's using it anyway. --
XPMEM is using it. GRU will be as well (probably already does). --
Yeppp. The GRU driver unregisters the notifier when all GRU mappings are unmapped. I could make it work either way - either with or without an unregister function. However, unregister is the most logical action to take when all mappings have been destroyed. --- jack --
This is true for KVM as well, unregister would be the most logical action to take when the kvm device is closed and the vm destroyed. However we can't implement mm_lock in O(N*log(N)) without triggering RAM allocations. And the size of those ram allocations are unknown at the time unregister runs (they also depend on the max_nr_vmas sysctl). So on a second thought not even passing the array from register to unregister would solve it (unless we allocate max_nr_vmas and we block the sysctl to alter max_nr_vmas if not all unregister run yet).That's clearly unacceptable. The only way to avoid failing because of vmalloc space shortage or oom, would be to provide a O(N*N) fallback. But one that can't be interrupted by sigkill! sigkill interruption was ok in #v12 because we didn't rely on mmu_notifier_unregister to succeed. So it avoided any DoS but it still can't provide any reliable unregister. So in the end unregistering with kill -9 leading to ->release in O(1) sounds safer solution for the long term. You can't loop if unregister fails and pretend your module not to have deadlocks. Yes, waiting ->release add up a bit of complexity but I think it worth it, and there weren't genial ideas on how to avoid O(N*N) complexity and allocations too in mmu_notifier_unregister yet. Until that genius idea will materialize we'll stick with ->release in O(1) as the only safe unregister so we guarantee the admin will be in control of his hardware in O(1) with kill -9 no matter if /dev/kvm and /dev/gru are owned by sillyuser. I'm afraid if you don't want to worst-case unregister with ->release you need to have a better idea than my mm_lock and personally I can't see any other way than mm_lock to ensure not to miss range_begin... All the above is in 2.6.27 context (for 2.6.26 ->release is the way, even if the genius idea would materialize). --
If unregister fails then the driver should not detach from the address space immediately but wait until -->release is called. That may be a possible solution. It will be rare that the unregister fails. --
This is the current idea, exactly. Unless we find a way to replace mm_lock with something else, I don't see a way to make mmu_notifier_unregister reliable without wasting ram. --
But wait, mmu_notifier_register absolutely requires mm_lock to ensure that when the kvm->arch.mmu_notifier_invalidate_range_count is zero (large variable name, it'll get shorter but this is to explain), really no cpu is in the middle of range_begin/end critical section. That's why we've to take all the mm locks. But we cannot care less if we unregister in the middle, unregister only needs to be sure that no cpu could possibly still using the ram of the notifier allocated by the driver before returning. So I'll implement unregister in O(1) and without ram allocations using srcu and that'll fix all issues with unregister. It'll return "void" to make it crystal clear it can't fail. It turns out unregister will make life easier to kvm as well, mostly to simplify the teardown of the /dev/kvm closure. Given this can be a considered a bugfix to mmu_notifier_unregister I'll apply it to 1/N and I'll release a new mmu-notifier-core patch for you to review before I resend to Andrew before Saturday. Thanks! --
I'm not sure anymore this can be considered a bugfix given how large change this resulted in the locking and register/unregister/release behavior. Here a full draft patch for review and testing. Works great with KVM so far at least... - mmu_notifier_register has to run on current->mm or on get_task_mm() (in the later case it can mmput after mmu_notifier_register returns) - mmu_notifier_register in turn can't race against mmu_notifier_release as that runs in exit_mmap after the last mmput - mmu_notifier_unregister can run at any time, even after exit_mmap completed. No mm_count pin is required, it's taken automatically by register and released by unregister - mmu_notifier_unregister serializes against all mmu notifiers with srcu, and it serializes especially against a concurrent mmu_notifier_unregister with a mix of a spinlock and SRCU - the spinlock let us keep track who run first between mmu_notifier_unregister and mmu_notifier_release, this makes life much easier for the driver to handle as the driver is then guaranteed that ->release will run. - The first that runs executes ->release method as well after dropping the spinlock but before releasing the srcu lock - it was unsafe to unpin the module count from ->release, as release itself has to run the 'ret' instruction to return back to the mmu notifier code - the ->release method is mandatory as it has to run before the pages are freed to zap all existing sptes - the one that arrives second between mmu_notifier_unregister and mmu_notifier_register waits the first with srcu As said this is a much larger change than I hoped, but as usual it can only affect KVM/GRU/XPMEM if something is wrong with this. I don't exclude we'll have to backoff to the previous mm_users model. The main issue with taking a mm_users pin is that filehandles associated with vmas aren't closed by exit() if the mm_users is pinned (that simply leaks ram with kvm). It looks more correct not to relay on ...
I am not certain of this, but it seems like this patch leaves things in a somewhat asymetric state. At the very least, I think that asymetry should be documented in the comments of either mmu_notifier.h or .c. Before I do the first mmu_notifier_register, all places that test for mm_has_notifiers(mm) will return false and take the fast path. After I do some mmu_notifier_register()s and their corresponding mmu_notifier_unregister()s, The mm_has_notifiers(mm) will return true and the slow path will be taken. This, despite all registered notifiers having unregistered. It seems to me the work done by mmu_notifier_mm_destroy should really be done inside the mm_lock()/mm_unlock area of mmu_unregister and mm_notifier_release when we have removed the last entry. That would give the users job the same performance after they are done using the special device that they had prior to its use. On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote: I don't think this is needed. I think you really want data->nr_i_mmap_locks. The second paragraph of this comment seems extraneous. These two sentences don't make any sense to me. I am not certain about the need to do the release callout when the driver has already told this subsystem it is done. For XPMEM, this callout would immediately return. I would expect it to be the same or GRU. Thanks, Robin --
There's no mm_lock/unlock for mmu_unregister anymore. That's the whole That's not feasible. Otherwise mmu_notifier_mm will go away at any time under both _release from exit_mmap and under _unregister too. exit_mmap holds an mm_count implicit, so freeing mmu_notifier_mm after the last mmdrop makes it safe. mmu_notifier_unregister also holds the mm_count because mm_count was pinned by mmu_notifier_register. That solves the issue with mmu_notifier_mm going away from under mmu_notifier_unregister and _release and that's why it can only be freed after mm_count == 0. There's at least one small issue I noticed so far, that while _release don't need to care about _register, but _unregister definitely need to care about _register. I've to take the mmap_sem in addition or in replacement of the unregister_lock. The srcu_read_lock can also likely moved just before releasing the unregister_lock but that's just a It's not needed right, but I thought it was cleaner if they all use "ret" after I had to change the code at the end of the function. Anyway I'll delete this to make the patch shorter and only Indeed. It never happens that there are zero vmas with filebacked Well that was a short explanation of why the mmu_notifier_mm structure can only be freed after the last mmdrop, which is what you asked at The point is that you don't want to run it twice. And without this you will have to serialize against ->release yourself in the driver. It's much more convenient if you know that ->release will be called just once, and before mmu_notifier_unregister returns. It could be called by _release even after you're already inside _unregister, _release may reach the spinlock before _unregister, you won't notice the difference. Solving this race in the driver looked too complex, I rather solve it once inside the mmu notifier code to be sure. Missing a release event is fatal because all sptes have to be dropped before _release returns. The requirement is the same for _unregister, all sptes ...
In the end the best is to use the spinlock around those list_add/list_del they all run in O(1) with the hlist and they take a few asm insn. This also avoids to take the mmap_sem in exit_mmap, at exit_mmap time nobody should need to use mmap_sem anymore, it might work but this looks cleaner. The lock is dynamically allocated only when the notifiers are registered, so the few bytes taken by it aren't relevant. A full new update will some become visible here: http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-... Please have a close look again. Your help is extremely appreciated and very helpful as usual! Thanks a lot. diff -urN xxx/include/linux/mmu_notifier.h xx/include/linux/mmu_notifier.h --- xxx/include/linux/mmu_notifier.h 2008-04-24 19:41:15.000000000 +0200 +++ xx/include/linux/mmu_notifier.h 2008-04-24 19:38:37.000000000 +0200 @@ -15,7 +15,7 @@ struct hlist_head list; struct srcu_struct srcu; /* to serialize mmu_notifier_unregister against mmu_notifier_release */ - spinlock_t unregister_lock; + spinlock_t lock; }; struct mmu_notifier_ops { diff -urN xxx/mm/memory.c xx/mm/memory.c --- xxx/mm/memory.c 2008-04-24 19:41:15.000000000 +0200 +++ xx/mm/memory.c 2008-04-24 19:38:37.000000000 +0200 @@ -605,16 +605,13 @@ * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. */ - ret = 0; if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { if (!vma->anon_vma) - goto out; + return 0; } - if (unlikely(is_vm_hugetlb_page(vma))) { - ret = copy_hugetlb_page_range(dst_mm, src_mm, vma); - goto out; - } + if (is_vm_hugetlb_page(vma)) + return copy_hugetlb_page_range(dst_mm, src_mm, vma); if (is_cow_mapping(vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); @@ -636,7 +633,6 @@ if (is_cow_mapping(vma->vm_flags)) mmu_notifier_invalidate_range_end(src_mm, vma->vm_start, end); -out: ...
I grabbed these and built them. Only change needed was another include.
After that, everything built fine and xpmem regression tests ran through
the first four sets. The fifth is the oversubscription test which trips
my xpmem bug. This is as good as the v12 runs from before.
Since this include and the one for mm_types.h both are build breakages
for ia64, I think you need to apply your ia64_cpumask and the following
(possibly as a single patch) first or in your patch 1. Without that,
ia64 doing a git-bisect could hit a build failure.
Index: mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h
===================================================================
--- mmu_v14_pre3_xpmem_v003_v1.orig/include/linux/srcu.h 2008-04-26 06:41:54.000000000 -0500
+++ mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h 2008-04-26 07:01:17.292071827 -0500
@@ -27,6 +27,8 @@
#ifndef _LINUX_SRCU_H
#define _LINUX_SRCU_H
+#include <linux/mutex.h>
+
struct srcu_struct_array {
int c[2];
};
--
Agreed, so it doesn't risk to break ia64 compilation, thanks for the
great XPMEM feedback!
Also note, I figured out that mmu_notifier_release can actually run
concurrently against other mmu notifiers in case there's a vmtruncate
(->release could already run concurrently if invoked by _unregister,
the only guarantee is that ->release will be called one time and only
one time and that no mmu notifier will ever run after _unregister
returns).
In short I can't keep the list_del_init in _release and I need a
list_del_init_rcu instead to fix this minor issue. So this won't
really make much difference after all.
I'll release #v14 with all this after a bit of kvm testing with it...
diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -755,6 +755,14 @@ static inline void hlist_del_init(struct
}
}
+static inline void hlist_del_init_rcu(struct hlist_node *n)
+{
+ if (!hlist_unhashed(n)) {
+ __hlist_del(n);
+ n->pprev = NULL;
+ }
+}
+
/**
* hlist_replace_rcu - replace old entry by new one
* @old : the element to be replaced
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -22,7 +22,10 @@ struct mmu_notifier_ops {
/*
* Called either by mmu_notifier_unregister or when the mm is
* being destroyed by exit_mmap, always before all pages are
- * freed. It's mandatory to implement this method.
+ * freed. It's mandatory to implement this method. This can
+ * run concurrently to other mmu notifier methods and it
+ * should teardown all secondary mmu mappings and freeze the
+ * secondary mmu.
*/
void (*release)(struct mmu_notifier *mn,
struct mm_struct *mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -19,12 +19,13 @@
/*
* This function can't run concurrently against mmu_notifier_register
- * or any other mmu notifier ...Now that mmu-notifier-core #v14 seems finished and hopefully will appear in 2.6.26 ;), I started exercising more the kvm-mmu-notifier code with the full patchset applied and not only with mmu-notifier-core. I soon found the full patchset has a swap deadlock bug. Then I tried without using kvm (so with mmu notifier disarmed) and I could still reproduce the crashes. After grabbing a few stack traces I tracked it down to a bug in the i_mmap_lock->i_mmap_sem conversion. If you oversubscription means swapping, you should retest with this applied on #v14 i_mmap_sem patch as it would eventually deadlock with all tasks allocating memory in D state without this. Now the full patchset is as rock solid as with only mmu-notifier-core applied. It's swapping 2G memhog on top of a 3G VM with 2G of ram for the last hours without a problem. Everything is working great with KVM at least. Talking about post 2.6.26: the refcount with rcu in the anon-vma conversion seems unnecessary and may explain part of the AIM slowdown too. The rest looks ok and probably we should switch the code to a compile-time decision between rwlock and rwsem (so obsoleting the current spinlock). diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1008,7 +1008,7 @@ static int try_to_unmap_file(struct page list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) vma->vm_private_data = NULL; out: - up_write(&mapping->i_mmap_sem); + up_read(&mapping->i_mmap_sem); return ret; } --
You are going to take a semphore in an rcu section? Guess you did not activate all debugging options while testing? I was not aware that you can take a sleeping lock from a non preemptible context. --
I'd hoped to discuss this topic after mmu-notifier-core was already merged, but let's do it anyway. My point of view is that there was no rcu when I wrote that code, yet there was no reference count and yet all locking looks still exactly the same as I wrote it. There's even still the page_table_lock to serialize threads taking the mmap_sem in read mode against the first vma->anon_vma = anon_vma during the page fault. Frankly I've absolutely no idea why rcu is needed in all rmap code when walking the page->mapping. Definitely the PG_locked is taken so there's no way page->mapping could possibly go away under the rmap code, hence the anon_vma can't go away as it's queued in the vma, and the vma has to go away before the page is zapped out of the pte. So there are some possible scenarios: 1) my original anon_vma code was buggy not taking the rcu_read_lock() and somebody fixed it (I tend to exclude it) 2) somebody has seen a race that doesn't exist and didn't bother to document it other than with this obscure comment * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. I tend to exclude it too as VM folks are too smart for this to be the case. 3) somebody did some microoptimization using rcu and we surely can undo that microoptimization to get the code back to my original code that didn't need rcu despite it worked exactly the same, and that is going to be cheaper to use with semaphores than doubling the number of locked ops for every lock instruction. Now the double atomic op may not be horrible when not contented, as it works on the same cacheline but with cacheline bouncing with contention it sounds doubly horrible than a single cacheline bounce and I don't see the point of it as you can't use rcu anyways, so you can't possibly take advantage of whatever microoptimization done over the original locking. --
zap_pte_range can race with the rmap code and it does not take the page lock. The page may not go away since a refcount was taken but the mapping can go away. Without RCU you have no guarantee that the anon_vma is existing when you take the lock. Cachelines are acquired for exclusive use for a mininum duration. Multiple atomic operations can be performed after a cacheline becomes exclusive without danger of bouncing. --
There's some room for improvement, like using down_read_trylock, if
that succeeds we don't need to increase the refcount and we can keep
the rcu_read_lock held instead.
Secondly we don't need to increase the refcount in fork() when we
queue the vma-copy in the anon_vma. You should init the refcount to 1
when the anon_vma is allocated, remove the atomic_inc from all code
(except when down_read_trylock fails) and then change anon_vma_unlink
to:
up_write(&anon_vma->sem);
if (empty)
put_anon_vma(anon_vma);
While the down_read_trylock surely won't help in AIM, the second
change will reduce a bit the overhead in the VM core fast paths by
avoiding all refcounting changes by checking the list_empty the same
way the current code does. I really like how I designed the garbage
collection through list_empty and that's efficient and I'd like to
keep it.
I however doubt this will bring us back to the same performance of the
current spinlock version, as the real overhead should come out of
overscheduling in down_write ai anon_vma_link. Here an initially
spinning lock would help but that's gray area, it greatly depends on
timings, and on very large systems where a cacheline wait with many
cpus forking at the same time takes more than scheduling a semaphore
may not slowdown performance that much. So I think the only way is a
configuration option to switch the locking at compile time, then XPMEM
will depend on that option to be on, I don't see a big deal and this
guarantees embedded isn't screwed up by totally unnecessary locks on UP.
--
You have said this continually about a CONFIG option. I am unsure how that could be achieved. Could you provide a patch? Thanks, Robin --
I'm busy with the reserved ram patch against 2.6.25 and latest kvm.git that is moving from pages to pfn for pci passthrough (that change will also remove the page pin with mmu notifiers). Unfortunately reserved-ram bugs out again in the blk-settings.c on real hardware. The fix I pushed in .25 for it, works when booting kvm (that's how I tested it) but on real hardware sata b_pfn happens to be 1 page less than the result of the min comparison and I'll have to figure out what happens (only .24 code works on real hardware..., at least my fix is surely better than the previous .25-pre code). I've other people waiting on that reserved-ram to be working, so once I've finished, I'll do the optimization to anon-vma (at least the removal of the unnecessary atomic_inc from fork) and add the config option. Christoph if you've interest in evolving anon-vma-sem and i_mmap_sem yourself in this direction, you're very welcome to go ahead while I finish sorting out reserved-ram. If you do, please let me know so we don't duplicate effort, and it'd be absolutely great if the patches could be incremental with #v14 so I can merge them trivially later and upload a new patchset once you're finished (the only outstanding fix you have to apply on top of #v14 that is already integrated in my patchset, is the i_mmap_sem deadlock fix I posted and that I'm sure you've already applied on top of #v14 before doing any more development on it). Thanks! --
In case you didn't notice this already, for a further explanation of why semaphores runs slower for small critical sections and why the conversion from spinlock to rwsem should happen under a config option, see the "AIM7 40% regression with 2.6.26-rc1" thread. --
[I'm scarcely following the mmu notifiers to-and-fro, which seems to be in good hands, amongst faster thinkers than me: who actually need and can test this stuff. Don't let me slow you down; but I can quickly clarify on this history.] No, the locking was different as you had it, Andrea: there was an extra bitspin lock, carried over from the pte_chains days (maybe we changed the name, maybe we disagreed over the name, I forget), which mainly guarded the page->mapcount. I thought that was one lock more than we needed, and eliminated it in favour of atomic page->mapcount in 2.6.9. Here's the relevant extracts from ChangeLog-2.6.9: [PATCH] rmaplock: PageAnon in mapping First of a batch of five patches to eliminate rmap's page_map_lock, replace its trylocking by spinlocking, and use anon_vma to speed up swapoff. Patches updated from the originals against 2.6.7-mm7: nothing new so I won't spam the list, but including Manfred's SLAB_DESTROY_BY_RCU fixes, and omitting the unuse_process mmap_sem fix already in 2.6.8-rc3. This patch: Replace the PG_anon page->flags bit by setting the lower bit of the pointer in page->mapping when it's anon_vma: PAGE_MAPPING_ANON bit. We're about to eliminate the locking which kept the flags and mapping in synch: it's much easier to work on a local copy of page->mapping, than worry about whether flags and mapping are in synch (though I imagine it could be done, at greater cost, with some barriers). [PATCH] rmaplock: kill page_map_lock The pte_chains rmap used pte_chain_lock (bit_spin_lock on PG_chainlock) to lock its pte_chains. We kept this (as page_map_lock: bit_spin_lock on PG_maplock) when we moved to objrmap. But the file objrmap locks its vma tree with mapping->i_mmap_lock, and the anon objrmap locks its vma list with anon_vma->lock: so isn't the page_map_lock superfluous? Pretty much, yes. The mapcount was protected by it, and needs to become an atomic: starting at -1 like page _count, so nr_mapped can be tracked ...
Hi Hugh!! Still I think it'd be great if you could review mmu-notifier-core v14. You and Nick are the core VM maintainers so it'd be great to hear any feedback about it. I think it's fairly easy to classify the patch as obviously safe as long as mmu notifiers are disarmed. Here a link for your convenience. Thanks a lot for the explanation! --
XPMEM requires more patches anyway. Note that in previous email you told me you weren't using it. I think GRU can work fine on 2.6.26 without mmu_notifier_unregister, like KVM too. You've simply to unpin the module count in ->release. The most important bit is that you've to do that anyway in case mmu_notifier_unregister fails (and it can fail because of vmalloc space shortage because somebody loaded some framebuffer driver or whatever). --
I said I could test without it. It is needed for the final version. It also makes the API consistent. What you are proposing is equivalent to having a file you can open but never close. This whole discussion seems ludicrous. You could refactor the code to get the sorted list of locks, pass that list into mm_lock to do the locking, do the register/unregister, then pass the same list into mm_unlock. If the allocation fails, you could fall back to the older slower method If you are not going to provide the _unregister callout you need to change the API so I can scan the list of notifiers to see if my structures are already registered. We register our notifier structure at device open time. If we receive a _release callout, we mark our structure as unregistered. At device close time, if we have not been unregistered, we call _unregister. If you take away _unregister, I have an xpmem kernel structure in use _AFTER_ the device is closed with no indication that the process is using it. In that case, I need to get an extra reference to the module in my device open method and hold that reference until the _release callout. Additionally, if the users program reopens the device, I need to scan the mmu_notifiers list to see if this tasks notifier is already registered. I view _unregister as essential. Did I miss something? Thanks, Robin --
That's not entirely true, you can close the file just fine it by killing the tasks leading to an mmput. From an user prospective in KVM terms, it won't make a difference as /dev/kvm will remain open and it'll pin the module count until the kvm task is killed anyway, I assume for GRU it's similar. Until I had the idea of how to implement an mm_lock to ensure the mmu_notifier_register could miss a running invalidate_range_begin, it wasn't even possible to implement a mmu_notifier_unregister (see EMM patches) and it looked like you were ok with that API that missed Correct, but it will keep the vmalloc ram pinned during the runtime. There's no reason to keep that ram allocated per-VM while the Correct, I already thought about that. This is exactly why I'm deferring this for later! Or those perfectionism not needed for KVM/GRU will keep delaying indefinitely the part that is already converged and that's enough for KVM and GRU (and for this specific bit, actually enough for XPMEM as well). We can make a second version of mm_lock_slow to use if mm_lock fails, in mmu_notifier_unregister, with N^2 complexity later, after the As said 1/N isn't enough for XPMEM anyway. 1/N has to only include the absolute minimum and zero risk stuff, that is enough for both KVM and Yes exactly, but you've to do that anyway, if mmu_notifier_unregister fails because some driver already allocated all vmalloc space (even x86-64 hasn't indefinite amount of vmalloc because of the vmalloc being in the end of the address space) unless we've a N^2 fallback, but the N^2 fallback will make the code more easily dosable and unkillable, so if I would be an admin I'd prefer having to quickly kill -9 a task in O(N) than having to wait some syscall that runs in O(N^2) to complete before the task quits. So the fallback to a slower algorithm isn't necessarily what will really happen after 2.6.26 is released, we'll see. Relaying on ->release for the module unpin sounds preferable, and it's certainly the only ...
Why is there still the hlist stuff being used for the mmu notifier list? And why is this still unsafe? There are cases in which you do not take the reverse map locks or mmap_sem while traversing the notifier list? This hope for inclusion without proper review (first for .25 now for .26) seems to interfere with the patch cleanup work and cause delay after delay for getting the patch ready. On what basis do you think that there is a chance of any of these patches making it into 2.6.26 given that this patchset has never been vetted in Andrew's tree? --
What's the problem with hlist, it saves 8 bytes for each mm_struct, Let's say I try to be optimistic and hope the right thing will happen given this is like a new driver that can't hurt anybody but KVM and GRU if there's any bug. But in my view what interfere with proper review for .26 are the endless discussions we're doing ;). --
list heads in mm_struct and in the mmu_notifier struct seemed to There is a potential issue in move_ptes where you call invalidate_range_end after dropping i_mmap_sem whereas my patches did the opposite. Mmap_sem saves you there? --
Yes, there's really no risk of races in this area after introducing mm_lock, any place that mangles over ptes and doesn't hold any of the three locks is buggy anyway. I appreciate the audit work (I also did it and couldn't find bugs but the more eyes the better). --
I guess I would need to merge some patches together somehow to be able to review them properly like I did before <sigh>. I have not reviewed the latest code completely. --
I guess I have to prepare another patchset then? --
If you want to embarrass yourself three time in a row go ahead ;). I thought two failed takeovers was enough. --
Apologies for my previous not too polite comment in answer to the above, but I thought this double patchset was over now that you converged in #v12 and obsoleted EMM and after the last private discussions. There's nothing personal here on my side, just a bit of general frustration on this matter. I appreciate all great contribution from you, as last your idea to use sort(), but I can't really see any possible benefit or justification anymore from keeping two patchsets floating around given we already converged on the mmu-notifier-core, and given it's almost certain mmu-notifier-core will go in -mm in time for 2.6.26. Let's put it this way, if I fail to merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my entire patchset and leave maintainership to you so you move 1/N to N/N and remove mm_lock-sem patch (everything else can remain the same as it's all orthogonal so changing the order is a matter of minutes). --
No I really want you to do this. I have no interest in a takeover in the future and have done the EMM stuff only because I saw no other way forward. I just want this be done the right way for all parties with patches that are nice and mergeable. --
Ok if you want me to do this, I definitely prefer the core to go in now. It's so much easier to concentrate on two problems at different times then to attack both problems at the same time given they're mostly completely orthogonal problems. Given we already solved one problem, I'd like to close it before concentrating on the second problem. I already told you it was my interest to support XPMEM too. For example it was me to notice we couldn't possibly remove can_sleep parameter from invalidate_range without altering the locking as vmas were unstable outside of one of the three core vm locks. That finding resulted in much bigger patches than we hoped (like Andrew previously sort of predicted) and you did all great work to develop those. From my part, once the converged part is in, it'll be a lot easier to fully concentrate on the rest. My main focus right now is to produce a mmu-notifier-core that is entirely bug free for .26. --
Takeover? I'd be happy if I would not have to deal with this issue. These patches were necessary because you were not listening to feedback plus there is the issue that your patchsets were not easy to review or diff against. I had to merge several patches to get to a useful patch. You have always picked up lots of stuff from my patchsets. Lots of work that could have been avoided by proper patchsets in the first place. --
You may have spotted this already. If so, just ignore this. It looks like there is a bug in copy_page_range() around line 667. It's possible to do a mmu_notifier_invalidate_range_start(), then return -ENOMEM w/o doing a corresponding mmu_notifier_invalidate_range_end(). --- jack --
No I didn't spot it yet, great catch!! ;) Thanks a lot. I think we can
take example by Jack and use our energy to spot any bug in the
mmu-notifier-core like with his above auditing effort (I'm quite
certain you didn't reprouce this with real oom ;) so we get a rock
solid mmu-notifier implementation in 2.6.26 so XPMEM will also benefit
later in 2.6.27 and I hope the last XPMEM internal bugs will also be
fixed by that time.
(for the not going to become mmu-notifier users, nothing to worry
about for you, unless you used KVM or GRU actively with mmu-notifiers
this bug would be entirely harmless with both MMU_NOTIFIER=n and =y,
as previously guaranteed)
Here the still untested fix for review.
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -597,6 +597,7 @@
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
+ int ret;
/*
* Don't copy ptes where a page fault will fill them correctly.
@@ -604,33 +605,39 @@
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.
*/
+ ret = 0;
if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
if (!vma->anon_vma)
- return 0;
+ goto out;
}
- if (is_vm_hugetlb_page(vma))
- return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+ if (unlikely(is_vm_hugetlb_page(vma))) {
+ ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
+ goto out;
+ }
if (is_cow_mapping(vma->vm_flags))
mmu_notifier_invalidate_range_start(src_mm, addr, end);
+ ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
- if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
- vma, addr, next))
- return -ENOMEM;
+ if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+ vma, addr, next))) {
+ ret = -ENOMEM;
+ break;
+ }
} while ...# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872186 -7200
# Node ID ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
# Parent ac9bb1fb3de2aa5d27210a28edf24f6577094076
Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.
Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.
The first pointer argument to free_pgtables() can then be dropped.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -751,8 +751,8 @@
void *private);
void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
- unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+ unsigned long ceiling);
int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@
} while (pgd++, addr = next, addr != end);
}
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
- unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+ unsigned long ceiling)
{
+ struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@
unlink_file_vma(vma);
if (is_vm_hugetlb_page(vma)) ...Why are the subjects all screwed up? They are the first line of the description instead of the subject line of my patches. --
# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1208872186 -7200 # Node ID fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5 # Parent ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0 Move the tlb flushing inside of unmap vmas. This saves us from passing a pointer to the TLB structure around and simplifies the callers. Signed-off-by: Christoph Lameter <clameter@sgi.com> diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -723,8 +723,7 @@ struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t); unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, - struct vm_area_struct *start_vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -804,7 +804,6 @@ /** * unmap_vmas - unmap a range of memory covered by a list of vma's - * @tlbp: address of the caller's struct mmu_gather * @vma: the starting vma * @start_addr: virtual address at which to start unmapping * @end_addr: virtual address at which to end unmapping @@ -816,20 +815,13 @@ * Unmap all pages in the vma list. * * We aim to not hold locks for too long (for scheduling latency reasons). - * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to - * return the ending mmu_gather to the caller. + * So zap pages in ZAP_BLOCK_SIZE bytecounts. * * Only addresses between `start' and `end' will be unmapped. * * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the whole unmapped address - * range after unmap_vmas() returns. So the only responsibility here is to - * ensure that any thus-far ...
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872187 -7200
# Node ID 8965539f4d174c79bd37e58e8b037d5db906e219
# Parent fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5
Add a function to rw_semaphores to check if there are any processes
waiting for the semaphore. Add rwsem_needbreak to sched.h that works
in the same way as spinlock_needbreak().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -59,6 +59,8 @@
*/
extern void downgrade_write(struct rw_semaphore *sem);
+extern int rwsem_is_contended(struct rw_semaphore *sem);
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
* nested locking. NOTE: rwsems are not allowed to recurse
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1984,6 +1984,15 @@
#endif
}
+static inline int rwsem_needbreak(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_PREEMPT
+ return rwsem_is_contended(sem);
+#else
+ return 0;
+#endif
+}
+
/*
* Reevaluate whether the task has signals pending delivery.
* Wake the task if so.
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -305,6 +305,18 @@
spin_unlock_irqrestore(&sem->wait_lock, flags);
}
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ /*
+ * Racy check for an empty list. False positives or negatives
+ * would be okay. False positive may cause a useless dropping of
+ * locks. False negatives may cause locks to be held a bit
+ * longer until the next check.
+ */
+ return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
EXPORT_SYMBOL(__init_rwsem);
EXPORT_SYMBOL(__down_read);
EXPORT_SYMBOL(__down_read_trylock);
diff --git a/lib/rwsem.c b/lib/rwsem.c
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -251,6 +251,18 @@
return sem;
}
+int rwsem_is_contended(struct ...# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872187 -7200
# Node ID 6e04df1f4284689b1c46e57a67559abe49ecf292
# Parent 8965539f4d174c79bd37e58e8b037d5db906e219
The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@
expand_stack(), it is hard to come up with a destructive scenario without
having the vmlist protection in this case.
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@
if (!vma_shareable(vma, addr))
return;
- spin_lock(&mapping->i_mmap_lock);
+ down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
out:
- spin_unlock(&mapping->i_mmap_lock);
+ up_read(&mapping->i_mmap_sem);
}
/*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@
pgoff = offset >> PAGE_SHIFT;
i_size_write(inode, ...# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1208872187 -7200 # Node ID bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2 # Parent 6e04df1f4284689b1c46e57a67559abe49ecf292 Convert the anon_vma spinlock to a rw semaphore. This allows concurrent traversal of reverse maps for try_to_unmap() and page_mkclean(). It also allows the calling of sleeping functions from reverse map traversal as needed for the notifier callbacks. It includes possible concurrency. Rcu is used in some context to guarantee the presence of the anon_vma (try_to_unmap) while we acquire the anon_vma lock. We cannot take a semaphore within an rcu critical section. Add a refcount to the anon_vma structure which allow us to give an existence guarantee for the anon_vma structure independent of the spinlock or the list contents. The refcount can then be taken within the RCU section. If it has been taken successfully then the refcount guarantees the existence of the anon_vma. The refcount in anon_vma also allows us to fix a nasty issue in page migration where we fudged by using rcu for a long code path to guarantee the existence of the anon_vma. I think this is a bug because the anon_vma may become empty and get scheduled to be freed but then we increase the refcount again when the migration entries are removed. The refcount in general allows a shortening of RCU critical sections since we can do a rcu_unlock after taking the refcount. This is particularly relevant if the anon_vma chains contain hundreds of entries. However: - Atomic overhead increases in situations where a new reference to the anon_vma has to be established or removed. Overhead also increases when a speculative reference is used (try_to_unmap, page_mkclean, page migration). - There is the potential for more frequent processor change due to up_xxx letting waiting tasks run first. This results in f.e. the Aim9 brk performance test to got down by 10-15%. Signed-off-by: Christoph Lameter <clameter@sgi.com> diff ...
# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1208872187 -7200
# Node ID f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
# Parent bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1062,10 +1062,10 @@
* mm_lock and mm_unlock are expensive operations that may take a long time.
*/
struct mm_lock_data {
- spinlock_t **i_mmap_locks;
- spinlock_t **anon_vma_locks;
- size_t nr_i_mmap_locks;
- size_t nr_anon_vma_locks;
+ struct rw_semaphore **i_mmap_sems;
+ struct rw_semaphore **anon_vma_sems;
+ size_t nr_i_mmap_sems;
+ size_t nr_anon_vma_sems;
};
extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2243,8 +2243,8 @@
static int mm_lock_cmp(const void *a, const void *b)
{
cond_resched();
- if ((unsigned long)*(spinlock_t **)a <
- (unsigned long)*(spinlock_t **)b)
+ if ((unsigned long)*(struct rw_semaphore **)a <
+ (unsigned long)*(struct rw_semaphore **)b)
return -1;
else if (a == b)
return 0;
@@ -2252,7 +2252,7 @@
return 1;
}
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems,
int anon)
{
struct vm_area_struct *vma;
@@ -2261,59 +2261,59 @@
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (anon) {
if (vma->anon_vma)
- locks[i++] = &vma->anon_vma->lock;
+ sems[i++] = &vma->anon_vma->sem;
} else {
if (vma->vm_file && vma->vm_file->f_mapping)
- locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+ sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem;
}
}
if (!i)
...Doing the right patch ordering would have avoided this patch and allow better review. --
I didn't actually write this patch myself. This did it instead: s/anon_vma_lock/anon_vma_sem/ s/i_mmap_lock/i_mmap_sem/ s/locks/sems/ s/spinlock_t/struct rw_semaphore/ so it didn't look a big deal to redo it indefinitely. The right patch ordering isn't necessarily the one that reduces the total number of lines in the patchsets. The mmu-notifier-core is already converged and can go in. The rest isn't converged at all... nearly nobody commented on the other part (the few comments so far were negative), so there's no good reason to delay indefinitely what is already converged, given it's already feature complete for certain users of the code. My patch ordering looks more natural to me. What is finished goes in, the rest is orthogonal anyway. --
I would not want to review code that is later reverted or essentially changed in later patches. I only review your patches because we have a high interest in the patch. I suspect that others will be more willing to review this material if it would be done the right way. If you cannot produce an easily reviewable and properly formatted patchset that follows conventions then I will have to do it because we really need to get this merged. --
# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1208872187 -7200 # Node ID 128d705f38c8a774ac11559db445787ce6e91c77 # Parent f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93 XPMEM would have used sys_madvise() except that madvise_dontneed() returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages XPMEM imports from other partitions and is also true for uncached pages allocated locally via the mspec allocator. XPMEM needs zap_page_range() functionality for these types of pages as well as 'normal' pages. Signed-off-by: Dean Nelson <dcn@sgi.com> diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -909,6 +909,7 @@ return unmap_vmas(vma, address, end, &nr_accounted, details); } +EXPORT_SYMBOL_GPL(zap_page_range); /* * Do a quick page-table lookup for a single page. --
# HG changeset patch # User Andrea Arcangeli <andrea@qumranet.com> # Date 1208872187 -7200 # Node ID e847039ee2e815088661933b7195584847dc7540 # Parent 128d705f38c8a774ac11559db445787ce6e91c77 This patch adds a lock ordering rule to avoid a potential deadlock when multiple mmap_sems need to be locked. Signed-off-by: Dean Nelson <dcn@sgi.com> diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -79,6 +79,9 @@ * * ->i_mutex (generic_file_buffered_write) * ->mmap_sem (fault_in_pages_readable->do_page_fault) + * + * When taking multiple mmap_sems, one should lock the lowest-addressed + * one first proceeding on up to the highest-addressed one. * * ->i_mutex * ->i_alloc_sem (various) --
I believe the differences between your patch set and Christoph's need to be understood and a compromise approach agreed upon. Those differences, as I understand them, are: 1) invalidate_page: You retain an invalidate_page() callout. I believe we have progressed that discussion to the point that it requires some direction for Andrew, Linus, or somebody in authority. The basics of the difference distill down to no expected significant performance difference between the two. The invalidate_page() callout potentially can simplify GRU code. It does provide a more complex api for the users of mmu_notifier which, IIRC, Christoph had interpretted from one of Andrew's earlier comments as being undesirable. I vaguely recall that sentiment as having been expressed. 2) Range callout names: Your range callouts are invalidate_range_start and invalidate_range_end whereas Christoph's are start and end. I do not believe this has been discussed in great detail. I know I have expressed a preference for your names. I admit to having failed to follow up on this issue. I certainly believe we could come to an agreement quickly if pressed. 3) The structure of the patch set: Christoph's upcoming release orders the patches so the prerequisite patches are seperately reviewable and each file is only touched by a single patch. Additionally, that allows mmu_notifiers to be introduced as a single patch with sleeping functionality from its inception and an API which remains unchanged. Your patch set, however, introduces one API, then turns around and changes that API. Again, the desire to make it an unchanging API was expressed by, IIRC, Andrew. This does represent a risk to XPMEM as the non-sleeping API may become entrenched and make acceptance of the sleeping version less acceptable. Can we agree upon this list of issues? Thank you, Robin Holt --
invalidate_page as demonstrated in KVM pseudocode doesn't change the locking requirements, and it has the benefit of reducing the window of time the secondary page fault has to be masked and at the same time _halves_ the number of _hooks_ in the VM every time the VM deal with single pages (example: do_wp_page hot path). As long as we can't fully converge because of point 3, it'd rather keep invalidate_page to be I think using ->start ->end is a mistake, think when we later add mprotect_range_start/end. Here too I keep the better names only because we can't converge on point 3 (the API will eventually change, like every other kernel interal API, even core things like __free_page Each file touched by a single patch? I doubt... The split is about the same, the main difference is the merge ordering, I always had the zero risk part at the head, he moved it at the tail when he incorporated This is a kernel internal API, so it will definitely change over time. It's nothing close to a syscall. Also note: the API is obviously defined in mmu_notifier.h and none of the 2-12 patches touches mmu_notifier.h. So the extension of the method semantics is 100% backwards compatible. My patch order and API backward compatible extension over the patchset is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support XPMEM as well. KVM/GRU won't notice any difference once the support for XPMEM is added, but even if the API would completely change in 2.6.27, that's still better than no functionality at all in 2.6.26. --
Christoph, Jack and I just discussed invalidate_page(). I don't think the point Andrew was making is that compelling in this circumstance. The code has change fairly remarkably. Would you have any objection to putting it back into your patch/agreeing to it remaining in Andrea's patch? If not, I think we can put this issue aside until Andrew gets out of the merge window and can decide it. Either way, the patches become much more similar with this in. Thanks, Robin --
One solution would be to separate the invalidate_page() callout into a patch at the very end that can be omitted. AFACIT There is no compelling reason to have this callback and it complicates the API for the device driver writers. Not having this callback makes the way that mmu notifiers are called from the VM uniform which is a desirable goal. --
I agree that the invalidate_page optimization can be moved to a separate patch. That will be a patch that will definitely alter the API in a not backwards compatible way (unlike 2-12 in my #v13, which are all backwards compatible in terms of mmu notifier API). invalidate_page is beneficial to both mmu notifier users, and a bit beneficial to the do_wp_page users too. So there's no point to remove it from my mmu-notifier-core as long as the mmu-notifier-core is 1/N in my patchset, and N/N in your patchset, the differences caused by that ordering difference are a bigger change than invalidate_page existing or not. As I expected invalidate_page provided significant benefits (not just to GRU but to KVM too) without altering the locking scheme at all, this is because the page fault handler has to notice if begin->end both runs anyway after follow_page/get_user_pages. So it's a no brainer to keep and my approach will avoid a not backwards compatible breakage of the API IMHO. Not a big deal, nobody can care if the API will change, it will definitely change eventually, it's a kernel internal one, but given I've already invalidate_page in my patch there's no reason to remove it as long as mmu-notifier-core remains N/N on your patchset. --
Please redo the patchset with the right order. To my knowledge there is no chance of this getting merged for 2.6.26. --
FWIW, I have updated the GRU driver to use this patch (plus the fixeups). No problems. AFAICT, everything works. --- jack --
