Re: [patch 1/6] mmu_notifier: Core code

Previous thread: [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps by Christoph Lameter on Monday, January 28, 2008 - 1:28 pm. (4 messages)

Next thread: Regala una stella, magari per san Valentino by laRecensione idee utili on Monday, January 28, 2008 - 9:56 am. (1 message)
From: Christoph Lameter
Date: Monday, January 28, 2008 - 1:28 pm

Core code for mmu notifiers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 include/linux/list.h         |   14 ++
 include/linux/mm_types.h     |    6 +
 include/linux/mmu_notifier.h |  210 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/page-flags.h   |   10 ++
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 
 mm/Makefile                  |    1 
 mm/mmap.c                    |    2 
 mm/mmu_notifier.c            |  101 ++++++++++++++++++++
 9 files changed, 350 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-28 11:35:22.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+	struct hlist_head head;
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-28 11:43:03.000000000 -0800
@@ -0,0 +1,210 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU motifier
+ *
+ * Notifier functions for hardware and software that establishes external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are ...
From: Christoph Lameter
Date: Monday, January 28, 2008 - 3:06 pm

mmu core: Need to use hlist_del

Wrong type of list del in mmu_notifier_release()

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/mmu_notifier.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-28 14:02:18.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-28 14:02:30.000000000 -0800
@@ -23,7 +23,7 @@ void mmu_notifier_release(struct mm_stru
 					  &mm->mmu_notifier.head, hlist) {
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
-			hlist_del(&mn->hlist);
+			hlist_del_rcu(&mn->hlist);
 		}
 		rcu_read_unlock();
 		synchronize_rcu();

--

From: Robin Holt
Date: Monday, January 28, 2008 - 5:05 pm

> +void mmu_notifier_release(struct mm_struct *mm)

USE_AFTER_FREE!!!  I made this same comment as well as other relavent
comments last week.


Robin
--

From: Christoph Lameter
Date: Monday, January 28, 2008 - 6:19 pm

Must have slipped somehow. Patch needs to be applied after the rcu fix.

Please repeat the other relevant comments if they are still relevant.... I 
thought I had worked through them.



mmu_notifier_release: remove mmu_notifier struct from list before calling ->release

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/mmu_notifier.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-28 17:17:05.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-28 17:17:10.000000000 -0800
@@ -21,9 +21,9 @@ void mmu_notifier_release(struct mm_stru
 		rcu_read_lock();
 		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_rcu(&mn->hlist);
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
-			hlist_del_rcu(&mn->hlist);
 		}
 		rcu_read_unlock();
 		synchronize_rcu();
--

From: Andrea Arcangeli
Date: Tuesday, January 29, 2008 - 6:59 am

Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a

It's out of my reach how can you be ok with lock=1. You said you have
to block, if you can deal with lock=1 once, why can't you deal with

The down_write is garbage. The caller should put it around
mmu_notifier_register if something. The same way the caller should
call synchronize_rcu after mmu_notifier_register if it needs
synchronous behavior from the notifiers. The default version of
mmu_notifier_register shouldn't be cluttered with unnecessary locking.
--

From: Andrea Arcangeli
Date: Tuesday, January 29, 2008 - 7:34 am

Ooops my spinlock was gone from the notifier head.... so the above
comment is wrong sorry! I thought down_write was needed to serialize
against some _external_ event, not to serialize the list updates in
place of my explicit lock. The critical section is so small that a
semaphore is the wrong locking choice, that's why I assumed it was for
an external event. Anyway RCU won't be optimal for a huge flood of
register/unregister, I agree the down_write shouldn't create much
contention and it saves 4 bytes from each mm_struct, and we can always
change it to a proper spinlock later if needed.
--

From: Christoph Lameter
Date: Tuesday, January 29, 2008 - 12:49 pm

Andrew does not like #ifdefs and it makes it possible to verify calling 

Not sure yet. We may have to do more in that area. Need to have feedback 
from Robin.
--

From: Avi Kivity
Date: Tuesday, January 29, 2008 - 1:41 pm

You could define mmu_notifier_head as an empty struct in that case.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Robin Holt
Date: Tuesday, January 29, 2008 - 9:07 am

I am going to seperate my comments into individual replies to help

This is a use-after-free issue.  The hlist_del_rcu needs to be done before
the callout as the structure containing the mmu_notifier structure will
need to be freed from within the ->release callout.

Thanks,
Robin
--

From: Andy Whitcroft
Date: Tuesday, February 5, 2008 - 11:05 am

Does this ->release actually release the 'nm' and its associated hlist?
I see in this thread that this ordering is deemed "use after free" which
implies so.

If it does that seems wrong.  This is an RCU hlist, therefore the list
integrity must be maintained through the next grace period in case there
are parallell readers using the element, in particular its forward

For this to be updating the list, you must have some form of "write-side"
exclusion as these primatives are not "parallel write safe".  It would

I am not sure it makes sense to add a _safe_rcu variant.  As I understand
things an _safe variant is used where we are going to remove the current
list element in the middle of a list walk.  However the key feature of an
RCU data structure is that it will always be in a "safe" state until any
parallel readers have completed.  For an hlist this means that the removed
entry and its forward link must remain valid for as long as there may be
a parallel reader traversing this list, ie. until the next grace period.
If this link is valid for the parallel reader, then it must be valid for
us, and if so it feels that hlist_for_each_entry_rcu should be sufficient
to cope in the face of entries being unlinked as we traverse the list.

-apw
--

From: Peter Zijlstra
Date: Tuesday, February 5, 2008 - 11:17 am

That is not quite so, list elements must be preserved, not the list


It does make sense, hlist_del_rcu() maintains the fwd reference, but it
does unlink it from the list proper. As long as there is a write side
exclusion around the actual removal as you noted.

rcu_read_lock();
hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) {

	if (foo) {
		spin_lock(write_lock);
		hlist_del_rcu(tpos);
		spin_unlock(write_unlock);
	}
}
rcu_read_unlock();

is a safe construct in that the list itself stays a proper list, and
even items that might be caught in the to-be-deleted entries will have a
fwd way out.

--

From: Christoph Lameter
Date: Tuesday, February 5, 2008 - 11:19 am

Right that was fixed in a later release and discussed extensively later. 

It was dropped in V5.

--

Previous thread: [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps by Christoph Lameter on Monday, January 28, 2008 - 1:28 pm. (4 messages)

Next thread: Regala una stella, magari per san Valentino by laRecensione idee utili on Monday, January 28, 2008 - 9:56 am. (1 message)