Andrew,
I am resubmitting this patch because I believe that the discussion
has shown this to be an acceptable solution. I have fixed the 32 bit
build errors, but other than that change, the code is the same as
Roland's V3 patch.
From: Roland Dreier <rolandd@cisco.com>
As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc. Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.
We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts. Specifically, we implement a
new character device driver, ummunotify, that creates a /dev/ummunotify
node. A userspace process can open this node read-only and use the fd
as follows:
1. ioctl() to register/unregister an address range to watch in the
kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).
2. read() to retrieve events generated when a mapping in a watched
address range is invalidated (cf struct ummunotify_event in
<linux/ummunotify.h>). select()/poll()/epoll() and SIGIO are
handled for this IO.
3. mmap() one page at offset 0 to map a kernel page that contains a
generation counter that is incremented each time an event is
generated. This allows userspace to have a fast path that checks
that no events have occurred without a system call.
Thanks to Jason Gunthorpe <jgunthorpe <at> obsidianresearch.com> for
suggestions on the interface design. Also thanks to Jeff Squyres
<jsquyres <at> cisco.com> for prototyping support for this in Open MPI, which
helped find several bugs during development.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Eric B Munson ...I do not know. I still believe that this does not belong in the kernel; application should not need to trace itself to know what it does. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
On Mon, 12 Apr 2010 07:22:17 +0100 But this info could be reassembled by tracking syscall activity, yes? Perhaps some discussion here explaining why the (possibly enhanced) OK, what's missing from this whole description and from ummunotify.txt is: how does one specify the target process? Does /dev/ummunotify implicitly attach to current->mm? If so, why, and what are the implications of this? If instead it is possible to attach to some other process's mmu activity (/proc/<pid>/ummunotity?) then how is that done and what are the security/permissions implications? Also, the whole thing is obviously racy: by the time userspace finds out that something has happened, it might have changed. This inevitably reduces the applicability/usefulness of the whole thing as compared to some synchronous mechanism which halts the monitored thread until the request has been processed and acked. All this should (IMO) be explored, explained and justified. Also, what prevents the obvious DoS which occurs when I register for events and just let them queue up until the kernel runs out of memory? presumably events get dropped - what are the reliability implications of this and how is the max queue length managed? What happens if I register two regions with the same cookie? Will UMMUNOTIFY_UNREGISTER_REGION unregister both, or did I cause a kernel leak? If it's possible to register for events from a different process then how does the lifetime management happen? What happens when the target process exits? What happens if I close() the fd when there are outstanding events? What happens if I close() the fd when there are still-registered So the implementation keeps track of queued events down the operation/start-address/end-address level and, for each new event, checks whether that event is wholly contained within one of the preceding start-address/end-address ranges and if that queued event "invalidated" the new event's range, the new event is suppressed? Sounds complex ...
Just to summarize some of the key points of this thingy, as related to
your comments:
1) It is really very narrowly focused on a particular problem MPI and
RDMA have due to the way their APIs don't really match. Roland
tried to make the interface general.. Maybe that is a mistake ..
2) A 'self-tracing' scheme is used, again, because of an API
mistmatching between a MPI library and it's own
applications. Attempting to hook the appropriate calls has
proven unsatisfactory (missing cases, and slow).
3) Being intended for MPI applications, performance is a huge
concern. Synchronous operation is very undesirable. Tracing APIs
are lossy - and there is no recovery option if an event is lost.
4) Realistically the only thing MPI cares about is if a virtual page
is unmapped/remapped. Loosing events is unacceptable.
5) This isn't really tracing. There is no queue. There aren't really
events. This works more like the diry/access bit in a page table,
it doesn't matter how many times something has been modified, only
that it has at least once since last time you looked.
This means the memory used is proportional to the number of
page-ranges you watch, and the number of events against those
page-ranges doesn't matter. No other API has this property.
Basically, this entire scheme is designed to detect that when a == b,
the internal state held by some_mpi_call is no longer valid, in
this kind of situation:
a = mmap(ONE_PAGE);
some_mpi_call(a);
munmap(a);
b = mmap(ONE_PAGE); // Kernel picks b == a
some_mpi_call(b);
All the races you point out, just don't matter for the MPI use
case. Essentially, if the app hits those races, then it is using the
MPI library in a buggy way.
That said, this could be explained better in the documentation file. :)
I'm sure Eric can go through the rest of your questions in greater
The only case that matters for the generation counter optimization is
a false negative. As long as ...I am not sure I agree with the premises here. ptMalloc and malloc hooks are not related to the issue in my opinion. User space library calls do not change virtual to physical mapping, system calls do. The following sys calls might change virtual to physical mapping: munmap(), mremap(), sbrk(), madvice(). What we need is glibc to provide hooks for these 4 sys calls and the general syscall() when its argument is one of the four mentioned syscalls. To me, that is what is needed, and the ummunotify direction seems way too complicated to me. It is further claimed that "… other tricks are not robust". I wrote the code used in Scali/Platform MPI handling the issue. I do not think its fair to claim that this MPI is not robust in this matter nor that is performance is bad. Thanks, Håkon --
> I am not sure I agree with the premises here. ptMalloc and malloc > hooks are not related to the issue in my opinion. User space library > calls do not change virtual to physical mapping, system calls do. The > following sys calls might change virtual to physical mapping: > munmap(), mremap(), sbrk(), madvice(). What we need is glibc to > provide hooks for these 4 sys calls and the general syscall() when > its argument is one of the four mentioned syscalls. To me, that is > what is needed, and the ummunotify direction seems way too > complicated to me. Are those system calls the only possible way that virtual to physical mappings can change? Can't page migration or something like that potentially affect things? And even if you did have hooks into every system call that mattered (keep in mind that relying on glibc is not enough, since an MPI application may not use glibc) would decoding them and figuring out what happened really be preferable to a single event type that tells you exactly what address range was affected? > It is further claimed that "… other tricks are not robust". I wrote > the code used in Scali/Platform MPI handling the issue. I do not > think its fair to claim that this MPI is not robust in this matter > nor that is performance is bad. The Open MPI developers have spent a lot of effort trying to handle this purely in userspace and still do not believe that a truly robust solution is possible without kernel help. Perhaps they can expand on what the obstacles are. - R. -- Roland Dreier <rolandd@cisco.com> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html --
Yeah, virtual<->physical maps can change through swapping, page migration, memory compaction, huge-page aggregation (the latter two not yet being upstream). Even mlock() doesn't pin virtual<->physical maps. --
Assuming this holds true, RDMA will not work. And with no RDMA, we do not need ummunotify. Is that your argument? Seriously, RDMA requires the virtual to physical mapping to remain constant for a period of time. And that time period is from memory registration to de-registration. If the virtual to physical mapping changes in that period for the registered memory area, I can't see how an HCA can handle that. For MPI applications, the MPI API defines that the buffers used in communication cannot be changed or freed while a non-blocking transfer is in progress. So far, we are good. But memory registration (and in particular de-registration) is a costly process. Since MPI is about performance, the MPI library would like to re-use earlier memory registrations for other transfers. The MPI library records earlier memory mappings (VA+bound) and checks if the VA+bound of a new transfer is already contained in memory revisions registered earlier. Can you elaborate on what you mean here? Thanks, Håkon --
Pages registered for RDMA are GUPed so no method above should touch them. Fork+cow or unmap/map on the other hand can change virtual<->physical maps. GUPed pages are still GUPed, but they are no longer mapped into process' virtual address space. MPI copes with Fork+cow by marking registered memory as MADV_DONTFORK. -- Gleb. --
Sure, holding a page-ref will pin the physical page, but that is not something userspace usually has means to. --
The problem is that glibc doesn't provide correct type of hooks for MPI to use. You can hook into free(), but the hook is called when application frees memory, not when memory is returned back to the kernel and since MPI wants to cache registration across free()/malloc() if possible those hooks are not good enough. To overcome this MPI tries to provide its own memory management library (luckily glibc defines most/all memory management functions as weak) with proper hooks present, but that poses a whole lot of other problems and memory management is really not MPI's job. Even if glibc will provide proper hooks some day HPC users may want to use other, more performance oriented, memory management libraries instead of using build in glibc one. Relying on glibc hooks will prevent them from doing so. -- Gleb. --
By "truly robust" we mean that some other user-level code can't override the hooks installed by the MPI (user level) middleware. All current glibc hooks are overridable by other user-level code -- and sometimes real applications do this (for their own good reasons). Most of the time, apps blithely override our hooks because they either don't know or can't know that our hooks are installed. It can be dicey to know what you can and cannot override pre-main(), for example (e.g., via the __malloc_initialize_hook). Opening up a direct channel to the kernel and saying "hey, tell me when something changes" is robust because no other entity can hijack your notifications. It also allows us to avoid using pre-main hooks, and makes it so that we don't have to hook into the memory subsystem (usually replacing it with our own). Both of these things are extremely distasteful -- fixing these two things alone make doing something like ummunotify worthwhile, IMHO. -- Jeff Squyres jsquyres@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ --
I am reworking the Documentation to address all these questions and will resubmit when finished. Thanks for the feedback, Eric
From: Randy Dunlap <randy.dunlap@oracle.com> Add ummunotify.h to Kbuild list for export to userspace, fixing 27 build errors in umn-test.c when O=builddir is used. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> --- include/linux/Kbuild | 1 + 1 file changed, 1 insertion(+) maybe another item to add to SubmitChecklist :( --- lnx-2634-rc4.orig/include/linux/Kbuild +++ lnx-2634-rc4/include/linux/Kbuild @@ -163,6 +163,7 @@ header-y += tipc_config.h header-y += toshiba.h header-y += udf_fs_i.h header-y += ultrasound.h +header-y += ummunotify.h header-y += un.h header-y += utime.h header-y += veth.h --
Eric, sorry for not seeing this earlier, but it looks like the last patch I emailed out did not include all the tiny changes I made prior to asking Linus to pull this (which he ended up not doing). For example this fix to the include Kbuild file was there. You could look at: http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=commit;h=cfe60a43c34... to see what was actually in my pull request. There may be other trivial tweaks worth including. Thanks for resurrecting this work BTW. - R. -- Roland Dreier <rolandd@cisco.com> || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html --
