Core code for mmu notifiers.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---
include/linux/list.h | 14 ++
include/linux/mm_types.h | 6 +
include/linux/mmu_notifier.h | 210 +++++++++++++++++++++++++++++++++++++++++++
include/linux/page-flags.h | 10 ++
kernel/fork.c | 2
mm/Kconfig | 4
mm/Makefile | 1
mm/mmap.c | 2
mm/mmu_notifier.c | 101 ++++++++++++++++++++
9 files changed, 350 insertions(+)
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h 2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h 2008-01-28 11:35:22.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
#endif
};
+struct mmu_notifier_head {
+ struct hlist_head head;
+};
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
};
#endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h 2008-01-28 11:43:03.000000000 -0800
@@ -0,0 +1,210 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU motifier
+ *
+ * Notifier functions for hardware and software that establishes external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * These are ...mmu core: Need to use hlist_del
Wrong type of list del in mmu_notifier_release()
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/mmu_notifier.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c 2008-01-28 14:02:18.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-01-28 14:02:30.000000000 -0800
@@ -23,7 +23,7 @@ void mmu_notifier_release(struct mm_stru
&mm->mmu_notifier.head, hlist) {
if (mn->ops->release)
mn->ops->release(mn, mm);
- hlist_del(&mn->hlist);
+ hlist_del_rcu(&mn->hlist);
}
rcu_read_unlock();
synchronize_rcu();
--
> +void mmu_notifier_release(struct mm_struct *mm) USE_AFTER_FREE!!! I made this same comment as well as other relavent comments last week. Robin --
Must have slipped somehow. Patch needs to be applied after the rcu fix.
Please repeat the other relevant comments if they are still relevant.... I
thought I had worked through them.
mmu_notifier_release: remove mmu_notifier struct from list before calling ->release
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/mmu_notifier.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c 2008-01-28 17:17:05.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-01-28 17:17:10.000000000 -0800
@@ -21,9 +21,9 @@ void mmu_notifier_release(struct mm_stru
rcu_read_lock();
hlist_for_each_entry_safe_rcu(mn, n, t,
&mm->mmu_notifier.head, hlist) {
+ hlist_del_rcu(&mn->hlist);
if (mn->ops->release)
mn->ops->release(mn, mm);
- hlist_del_rcu(&mn->hlist);
}
rcu_read_unlock();
synchronize_rcu();
--
Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a It's out of my reach how can you be ok with lock=1. You said you have to block, if you can deal with lock=1 once, why can't you deal with The down_write is garbage. The caller should put it around mmu_notifier_register if something. The same way the caller should call synchronize_rcu after mmu_notifier_register if it needs synchronous behavior from the notifiers. The default version of mmu_notifier_register shouldn't be cluttered with unnecessary locking. --
Ooops my spinlock was gone from the notifier head.... so the above comment is wrong sorry! I thought down_write was needed to serialize against some _external_ event, not to serialize the list updates in place of my explicit lock. The critical section is so small that a semaphore is the wrong locking choice, that's why I assumed it was for an external event. Anyway RCU won't be optimal for a huge flood of register/unregister, I agree the down_write shouldn't create much contention and it saves 4 bytes from each mm_struct, and we can always change it to a proper spinlock later if needed. --
Andrew does not like #ifdefs and it makes it possible to verify calling Not sure yet. We may have to do more in that area. Need to have feedback from Robin. --
You could define mmu_notifier_head as an empty struct in that case. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I am going to seperate my comments into individual replies to help This is a use-after-free issue. The hlist_del_rcu needs to be done before the callout as the structure containing the mmu_notifier structure will need to be freed from within the ->release callout. Thanks, Robin --
Does this ->release actually release the 'nm' and its associated hlist? I see in this thread that this ordering is deemed "use after free" which implies so. If it does that seems wrong. This is an RCU hlist, therefore the list integrity must be maintained through the next grace period in case there are parallell readers using the element, in particular its forward For this to be updating the list, you must have some form of "write-side" exclusion as these primatives are not "parallel write safe". It would I am not sure it makes sense to add a _safe_rcu variant. As I understand things an _safe variant is used where we are going to remove the current list element in the middle of a list walk. However the key feature of an RCU data structure is that it will always be in a "safe" state until any parallel readers have completed. For an hlist this means that the removed entry and its forward link must remain valid for as long as there may be a parallel reader traversing this list, ie. until the next grace period. If this link is valid for the parallel reader, then it must be valid for us, and if so it feels that hlist_for_each_entry_rcu should be sufficient to cope in the face of entries being unlinked as we traverse the list. -apw --
That is not quite so, list elements must be preserved, not the list
It does make sense, hlist_del_rcu() maintains the fwd reference, but it
does unlink it from the list proper. As long as there is a write side
exclusion around the actual removal as you noted.
rcu_read_lock();
hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) {
if (foo) {
spin_lock(write_lock);
hlist_del_rcu(tpos);
spin_unlock(write_unlock);
}
}
rcu_read_unlock();
is a safe construct in that the list itself stays a proper list, and
even items that might be caught in the to-be-deleted entries will have a
fwd way out.
--
Right that was fixed in a later release and discussed extensively later. It was dropped in V5. --
