Re: [PATCH] ummunotify: Userspace support for MMU notifications

Previous thread: [PATCH 16/23] Make register values available to PowerPC panic notifiers by David VomLehn on Sunday, April 11, 2010 - 11:04 pm. (1 message)

Next thread: by Ian Munsie on Sunday, April 11, 2010 - 11:43 pm. (6 messages)
From: Eric B Munson
Date: Sunday, April 11, 2010 - 11:22 pm

Andrew,

I am resubmitting this patch because I believe that the discussion
has shown this to be an acceptable solution.  I have fixed the 32 bit
build errors, but other than that change, the code is the same as
Roland's V3 patch.

From: Roland Dreier <rolandd@cisco.com>

As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunotify, that creates a /dev/ummunotify
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunotify_event in
    <linux/ummunotify.h>).  select()/poll()/epoll() and SIGIO are
    handled for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.

Thanks to Jason Gunthorpe <jgunthorpe <at> obsidianresearch.com> for
suggestions on the interface design.  Also thanks to Jeff Squyres
<jsquyres <at> cisco.com> for prototyping support for this in Open MPI, which
helped find several bugs during development.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Eric B Munson ...
From: Pavel Machek
Date: Monday, April 12, 2010 - 1:20 pm

I do not know. I still believe that this does not belong in the
kernel; application should not need to trace itself to know what it does.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Andrew Morton
Date: Monday, April 12, 2010 - 4:03 pm

On Mon, 12 Apr 2010 07:22:17 +0100


But this info could be reassembled by tracking syscall activity, yes? 
Perhaps some discussion here explaining why the (possibly enhanced)

OK, what's missing from this whole description and from ummunotify.txt
is: how does one specify the target process?  Does /dev/ummunotify
implicitly attach to current->mm?  If so, why, and what are the
implications of this?

If instead it is possible to attach to some other process's mmu
activity (/proc/<pid>/ummunotity?) then how is that done and what are
the security/permissions implications?

Also, the whole thing is obviously racy: by the time userspace finds
out that something has happened, it might have changed.  This
inevitably reduces the applicability/usefulness of the whole thing as
compared to some synchronous mechanism which halts the monitored thread
until the request has been processed and acked.  All this should (IMO)
be explored, explained and justified.

Also, what prevents the obvious DoS which occurs when I register for
events and just let them queue up until the kernel runs out of memory? 
presumably events get dropped - what are the reliability implications
of this and how is the max queue length managed?


What happens if I register two regions with the same cookie?  Will
UMMUNOTIFY_UNREGISTER_REGION unregister both, or did I cause a kernel
leak?

If it's possible to register for events from a different process then
how does the lifetime management happen?  What happens when the target
process exits?

What happens if I close() the fd when there are outstanding events? 

What happens if I close() the fd when there are still-registered

So the implementation keeps track of queued events down the
operation/start-address/end-address level and, for each new event,
checks whether that event is wholly contained within one of the
preceding start-address/end-address ranges and if that queued event
"invalidated" the new event's range, the new event is suppressed?

Sounds complex ...
From: Jason Gunthorpe
Date: Monday, April 12, 2010 - 4:59 pm

Just to summarize some of the key points of this thingy, as related to
your comments:
 1) It is really very narrowly focused on a particular problem MPI and
    RDMA have due to the way their APIs don't really match. Roland
    tried to make the interface general..  Maybe that is a mistake ..
 2) A 'self-tracing' scheme is used, again, because of an API
    mistmatching between a MPI library and it's own
    applications. Attempting to hook the appropriate calls has
    proven unsatisfactory (missing cases, and slow).
 3) Being intended for MPI applications, performance is a huge
    concern. Synchronous operation is very undesirable. Tracing APIs
    are lossy - and there is no recovery option if an event is lost.
 4) Realistically the only thing MPI cares about is if a virtual page
    is unmapped/remapped. Loosing events is unacceptable.
 5) This isn't really tracing. There is no queue. There aren't really
    events. This works more like the diry/access bit in a page table,
    it doesn't matter how many times something has been modified, only
    that it has at least once since last time you looked.
    
    This means the memory used is proportional to the number of
    page-ranges you watch, and the number of events against those
    page-ranges doesn't matter. No other API has this property.

Basically, this entire scheme is designed to detect that when a == b,
the internal state held by some_mpi_call is no longer valid, in
this kind of situation:
 a = mmap(ONE_PAGE);
 some_mpi_call(a);
 munmap(a);
 b = mmap(ONE_PAGE);   // Kernel picks b == a
 some_mpi_call(b);

All the races you point out, just don't matter for the MPI use
case. Essentially, if the app hits those races, then it is using the
MPI library in a buggy way.

That said, this could be explained better in the documentation file. :)

I'm sure Eric can go through the rest of your questions in greater

The only case that matters for the generation counter optimization is
a false negative. As long as ...
From: Håkon Bugge
Date: Tuesday, April 13, 2010 - 1:29 am

I am not sure I agree with the premises here. ptMalloc and malloc hooks are not related to the issue in my opinion. User space library calls do not change virtual to physical mapping, system calls do. The following sys calls might change virtual to physical mapping: munmap(), mremap(), sbrk(), madvice(). What we need is glibc to provide hooks for these 4 sys calls and the general syscall() when its argument is one of the four mentioned syscalls. To me, that is what is needed, and the ummunotify direction seems way too complicated to me.

It is further claimed that "… other tricks are not robust". I wrote the code used in Scali/Platform MPI handling the issue. I do not think its fair to claim that this MPI  is not robust in this matter nor that is performance is bad.


Thanks, Håkon


--

From: Roland Dreier
Date: Tuesday, April 13, 2010 - 10:57 am

> I am not sure I agree with the premises here. ptMalloc and malloc
 > hooks are not related to the issue in my opinion. User space library
 > calls do not change virtual to physical mapping, system calls do. The
 > following sys calls might change virtual to physical mapping:
 > munmap(), mremap(), sbrk(), madvice(). What we need is glibc to
 > provide hooks for these 4 sys calls and the general syscall() when
 > its argument is one of the four mentioned syscalls. To me, that is
 > what is needed, and the ummunotify direction seems way too
 > complicated to me.

Are those system calls the only possible way that virtual to physical
mappings can change?  Can't page migration or something like that
potentially affect things?  And even if you did have hooks into every
system call that mattered (keep in mind that relying on glibc is not
enough, since an MPI application may not use glibc) would decoding them
and figuring out what happened really be preferable to a single event
type that tells you exactly what address range was affected?

 > It is further claimed that "… other tricks are not robust". I wrote
 > the code used in Scali/Platform MPI handling the issue. I do not
 > think its fair to claim that this MPI is not robust in this matter
 > nor that is performance is bad.

The Open MPI developers have spent a lot of effort trying to handle this
purely in userspace and still do not believe that a truly robust
solution is possible without kernel help.  Perhaps they can expand on
what the obstacles are.

 - R.
-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--

From: Peter Zijlstra
Date: Tuesday, April 13, 2010 - 11:02 am

Yeah, virtual<->physical maps can change through swapping, page
migration, memory compaction, huge-page aggregation (the latter two not
yet being upstream).

Even mlock() doesn't pin virtual<->physical maps.
--

From: Håkon Bugge
Date: Tuesday, April 13, 2010 - 10:18 pm

Assuming this holds true, RDMA will not work. And with no RDMA, we do not need ummunotify. Is that your argument?

Seriously, RDMA requires the virtual to physical mapping to remain constant for a period of time. And that time period is from memory registration to de-registration. If the virtual to physical mapping changes in that period for the registered memory area, I can't see how an HCA can handle that.

For MPI applications, the MPI API defines that the buffers used in communication cannot be changed or freed while a non-blocking transfer is in progress. So far, we are good. But memory registration (and in particular de-registration) is a costly process. Since MPI is about performance, the MPI library would like to  re-use earlier memory registrations for other transfers. The MPI library records earlier memory mappings (VA+bound) and checks if the VA+bound of a new transfer is already contained in memory revisions registered earlier.


Can you elaborate on what you mean here?


Thanks, Håkon

--

From: Gleb Natapov
Date: Wednesday, April 14, 2010 - 1:52 am

Pages registered for RDMA are GUPed so no method above should touch
them. Fork+cow or unmap/map on the other hand can change
virtual<->physical maps. GUPed pages are still GUPed, but they are no
longer mapped into process' virtual address space. MPI copes with
Fork+cow by marking registered memory as MADV_DONTFORK.

--
			Gleb.
--

From: Peter Zijlstra
Date: Thursday, April 15, 2010 - 6:48 am

Sure, holding a page-ref will pin the physical page, but that is not
something userspace usually has means to.

--

From: Gleb Natapov
Date: Wednesday, April 14, 2010 - 2:06 am

The problem is that glibc doesn't provide correct type of hooks for MPI
to use. You can hook into free(), but the hook is called when
application frees memory, not when memory is returned back to the kernel
and since MPI wants to cache registration across free()/malloc() if
possible those hooks are not good enough. To overcome this MPI tries to
provide its own memory management library (luckily glibc defines
most/all memory management functions as weak) with proper hooks present, but
that poses a whole lot of other problems and memory management is really
not MPI's job. Even if glibc will provide proper hooks some day HPC users
may want to use other, more performance oriented, memory management
libraries instead of using build in glibc one. Relying on glibc hooks
will prevent them from doing so.

--
			Gleb.
--

From: Jeff Squyres
Date: Wednesday, April 14, 2010 - 7:36 am

By "truly robust" we mean that some other user-level code can't override the hooks installed by the MPI (user level) middleware.  All current glibc hooks are overridable by other user-level code -- and sometimes real applications do this (for their own good reasons).  Most of the time, apps blithely override our hooks because they either don't know or can't know that our hooks are installed.  It can be dicey to know what you can and cannot override pre-main(), for example (e.g., via the __malloc_initialize_hook).

Opening up a direct channel to the kernel and saying "hey, tell me when something changes" is robust because no other entity can hijack your notifications.  It also allows us to avoid using pre-main hooks, and makes it so that we don't have to hook into the memory subsystem (usually replacing it with our own).  Both of these things are extremely distasteful -- fixing these two things alone make doing something like ummunotify worthwhile, IMHO.

-- 
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

--

From: Eric B Munson
Date: Saturday, April 17, 2010 - 10:41 am

I am reworking the Documentation to address all these questions and will
resubmit when finished.

Thanks for the feedback,
Eric
From: Randy Dunlap
Date: Wednesday, April 14, 2010 - 9:43 am

From: Randy Dunlap <randy.dunlap@oracle.com>

Add ummunotify.h to Kbuild list for export to userspace,
fixing 27 build errors in umn-test.c when O=builddir is used.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
 include/linux/Kbuild |    1 +
 1 file changed, 1 insertion(+)

maybe another item to add to SubmitChecklist :(


--- lnx-2634-rc4.orig/include/linux/Kbuild
+++ lnx-2634-rc4/include/linux/Kbuild
@@ -163,6 +163,7 @@ header-y += tipc_config.h
 header-y += toshiba.h
 header-y += udf_fs_i.h
 header-y += ultrasound.h
+header-y += ummunotify.h
 header-y += un.h
 header-y += utime.h
 header-y += veth.h
--

From: Eric B Munson
Date: Saturday, April 17, 2010 - 10:44 am

Thanks, will wrap this into V2.

Eric
From: Roland Dreier
Date: Sunday, April 18, 2010 - 7:38 am

Eric, sorry for not seeing this earlier, but it looks like the last
patch I emailed out did not include all the tiny changes I made prior to
asking Linus to pull this (which he ended up not doing).  For example
this fix to the include Kbuild file was there.

You could look at:
http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=cfe60a43c34...

to see what was actually in my pull request.  There may be other trivial
tweaks worth including.

Thanks for resurrecting this work BTW.

 - R.
-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--

Previous thread: [PATCH 16/23] Make register values available to PowerPC panic notifiers by David VomLehn on Sunday, April 11, 2010 - 11:04 pm. (1 message)

Next thread: by Ian Munsie on Sunday, April 11, 2010 - 11:43 pm. (6 messages)