Linus, please consider pulling from
master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
This tree is also available from kernel.org mirrors at:
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
This will get "ummunotify," a new character device that allows a
userspace library to register for MMU notifications; this is
particularly useful for MPI implementions (message passing libraries
used in HPC) to be able to keep track of what wacky things consumers
do to their memory mappings. My colleague Jeff Squyres from the Open
MPI project posted a blog entry about why MPI wants this:
http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
His summary of ummunotify:
"It’s elegant, doesn’t require strange linker tricks, and seems to
work in all cases. Yay!"
This code went through several review iterations on lkml and was in
-mm and -next for quite a few weeks. Andrew is OK with merging it (I
think -- Andrew please correct me if I misunderstood you).
Roland Dreier (1):
ummunotify: Userspace support for MMU notifications
Documentation/Makefile | 3 +-
Documentation/ummunotify/Makefile | 7 +
Documentation/ummunotify/ummunotify.txt | 150 ++++++++
Documentation/ummunotify/umn-test.c | 200 +++++++++++
drivers/char/Kconfig | 12 +
drivers/char/Makefile | 1 +
drivers/char/ummunotify.c | 566 +++++++++++++++++++++++++++++++
include/linux/Kbuild | 1 +
include/linux/ummunotify.h | 121 +++++++
9 files changed, 1060 insertions(+), 1 deletions(-)
create mode 100644 Documentation/ummunotify/Makefile
create mode 100644 Documentation/ummunotify/ummunotify.txt
create mode 100644 Documentation/ummunotify/umn-test.c
create mode 100644 drivers/char/ummunotify.c
create mode 100644 ...I don't remember seeing discussion of this on lkml. Yes it is in -next... Basically it allows app to 'trace itself'? ...with interesting mmap() interface, exporting int to userspace, hoping it behaves atomically...? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> I don't remember seeing discussion of this on lkml. Yes it is in > -next... eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2 and earlier patches. > Basically it allows app to 'trace itself'? ...with interesting mmap() > interface, exporting int to userspace, hoping it behaves atomically...? Yes, it allows app to trace what the kernel does to memory mappings. I don't believe there's any real issue to atomicity of mmap'ed memory, since userspace really just tests whether read value is == to old read value or not. - R. _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Well... it seems little overspecialized. Just modifying libc to That still needs memory barriers etc.. to ensure reliable operation, no? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
That is what MPI people are doing today and their feedback is that it
doesn't work - there are a lot of ways to mess with memory and no good
choices to hook the raw syscalls and keep sensible performance.
The main focus of this is high performance MPI apps, so lower overhead
on critical paths like memory allocation is part of the point. It is
ment to go hand-in-hand with the specialized RDMA memory pinning
No, I don't think so..
The application is expected to provide sequencing of some sort between
the memory call (mmap/munmap/brk/etc) and the int check - usually just
by running in the same thread, or through some kind of locking scheme.
As long as the mmu notifiers run immediately in the same context as
the mmap/etc then it should be fine.
For example, the most common problem to solve looks like this:
x = mmap(...)
do RDMA with x
[..]
mmunmap(x);
[..]
y = mmap(..);
do RDMA with y
if by chance x == y things explode.
So this API puts the int test directly before 'do RDMA with'.
Due to the above kind of argument the net requirement is either to
completely synchronously (and with low overhead) hook every
mmap/munmap/brk/etc call into the kernel and do the accounting work,
or have a very low over head check every time the memory region is
about to be used.
Jason
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Hi Linus, Sorry to hassle you about this, but I would like to know where things stand. I know (from the reflink discussion if nothing else) that you're definitely not bashful about telling people when their code sucks, so this silent treatment has me really flustered. I've been showering and brushing my teeth and everything, honest! Seriously, this code solves a problem that the MPI/HPC people have been complaining about for quite a while, and if possible I'd like to get this upstream. Or if you have a better idea, I'm all ears... Thanks, Roland > Linus, please consider pulling from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify > > This will get "ummunotify," a new character device that allows a > userspace library to register for MMU notifications; this is > particularly useful for MPI implementions (message passing libraries > used in HPC) to be able to keep track of what wacky things consumers > do to their memory mappings. My colleague Jeff Squyres from the Open > MPI project posted a blog entry about why MPI wants this: > > http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/ > > His summary of ummunotify: > > "It’s elegant, doesn’t require strange linker tricks, and seems to > work in all cases. Yay!" > > This code went through several review iterations on lkml and was in > -mm and -next for quite a few weeks. Andrew is OK with merging it (I > think -- Andrew please correct me if I misunderstood you). > > Roland Dreier (1): > ummunotify: Userspace support for MMU notifications > > Documentation/Makefile | 3 +- > Documentation/ummunotify/Makefile | 7 + > Documentation/ummunotify/ummunotify.txt | 150 ++++++++ > Documentation/ummunotify/umn-test.c ...
I just haven't had time to look into the issue, so I'm merging the code that I know I need to merge, and hopefully I'll have a breather later when I can actually look at code and the thread it all spawned., Linus _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Anton Blanchard suggested a while back that this might be integrated with perf-counters, since perf-counters already does mmap() tracking and also provides events through an mmap()'ed buffer. Has anybody looked into this? If someone did and I missed the discussion on why it isn't appropriate, kindly point me in the right direction ;-) _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> Anton Blanchard suggested a while back that this might be integrated > with perf-counters, since perf-counters already does mmap() tracking and > also provides events through an mmap()'ed buffer. > > Has anybody looked into this? I didn't see the original suggestion. Certainly hooking in to existing infrastructure for user/kernel communication would be good. The fit doesn't seem great to me, although I am rather naive about perf counters. The problem that ummunotify is trying to solve is to let an app say 'for these 1000 address ranges (that possibly only cover a small part of my total address space), please let me know when the mappings are invalidated for any reason'. So getting those events in the kernel is no problem -- we have the MMU notifier hooks that tell us exactly what we need to know. The issue is purely the way userspace registers interest in address ranges, and how to kernel returns the events. For perf counters it seems that one would have to create a new counter for each address range... is that correct? And also I don't know if perf counter has an analog for the fast path optimization that ummunotify provides via a mmap'ed generation counter (a quick way for userspace to see 'nothing happened since last time you checked'). - R. _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> So getting those events in the kernel is no problem -- we have the MMU > notifier hooks that tell us exactly what we need to know. The issue is > purely the way userspace registers interest in address ranges, and how > to kernel returns the events. > > For perf counters it seems that one would have to create a new counter > for each address range... is that correct? And also I don't know if > perf counter has an analog for the fast path optimization that > ummunotify provides via a mmap'ed generation counter (a quick way for > userspace to see 'nothing happened since last time you checked'). Oh I forgot... ummunotify also preallocates everything etc. so that there is no way for events to be lost. Which saves userspace from having to trash everything cached and start over, which it would have to do if it misses an invalidate event. And AFAIK, pref counters does have the possibility of overflowing a buffer and losing an event, right? - R. _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Well, you cannot pre-allocate everything, either you get back-logged evens in kernel space leading to a kernel DoS, or you loose events. Perf counters have two modes, a RO mmap() and a RW mmap(). The RO mode will automagically overwrite its tail data without regard for userspace having observed it. In the RW mode userspace has to advance the tail, the kernel will drop events when full and insert a PERF_EVENT_LOST event once there is room again. Hmm, or are you saying you can only get 1 event per registered range and allocate the thing on registration? That'd need some registration limit to avoid DoS scenarios. _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> Hmm, or are you saying you can only get 1 event per registered range and > allocate the thing on registration? That'd need some registration limit > to avoid DoS scenarios. Yes, that's what I do. You're right, I should add a limit... although their are lots of ways for userspace to consume arbitrary amounts of kernel resources already. - R. _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> > > Hmm, or are you saying you can only get 1 event per registered range and > > > allocate the thing on registration? That'd need some registration limit > > > to avoid DoS scenarios. > > > > Yes, that's what I do. You're right, I should add a limit... although > > their are lots of ways for userspace to consume arbitrary amounts of > > kernel resources already. > > I'd be good to work at reducing that number, not adding to it ;-) Yes, definitely. I'll add a quick ummunotify module parameter that limits the number of registrations per process. > But yeah, I currently don't see a very nice match to perf counters. OK. It would be nice to tie into something more general, but I think I agree -- perf counters are missing the filtering and the "no lost events" that ummunotify does have. And I'm not sure it's worth messing up the perf counters design just to jam one more not totally related thing in. - R. _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
The filtering can be done and has been done - see Li Zefan's patchset that uses filter expressions to do per event in-kernel filtering. The OOM DoS is a bug in your patches i think, which perfcounters solves ;-) Ingo _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
I believe that extending perf counters to do what you want is better than adding one more, very strange, user<->kernel interface. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Performance events filtering is being worked on and now with the proper non-DoS limit you've added you can lose events too, dont you? So it's all a question of how much buffering to add - and with perf events too Nobody suggested details for any redesign yet (so far it seems like a perfect match, to me at least) so i'm wondering what messup you are Agreed. I test-pulled this code and had a look at it. I think this could be done in a simpler, less limited, more generic, more useful form by using some variation of perf events. You should be able to get all that you want by adding two TRACE_EVENT() tracepoints and using the existing perf event syscall to get the events to user-space. Meaning that this: 9 files changed, 1060 insertions(+), 1 deletions(-) Would be replaced with something like: 2 files changed, 100 insertions(+), 0 deletions(-) [ the +100 lines would (roughly) would add tracepoints to invalidate_page and invalidate_range_start. (possibly via mmu_notifier_register() like the ummunotify code does) Most of that linecount would be comments. ] Another upside, beyond the reduction in complexity is that we'd have one less special char driver based ABI. Which is a big plus in my opinion, especially if this goes towards HPC folks and if it's used for real. Why should such a MM capability hidden behind a character device and an ioctl? The perf event approach is beneficial to non-HPC as well: MM instrumentation for example - page range invalidates are interesting to all sorts of modi of analysis. A question: what is the typical size/scope of the rbtree of the watched regions of memory in practical (test) deployments of the ummunofity code? Per tracepoint filtering is possible via the perf event patches Li Zefan has posted to lkml recently, under this subject: [PATCH 0/6] perf trace: Add filter support They are still being worked on but it's very clear that flexible in-kernel filtering support will be a ...
No, the ummunotify does not loose events, that is the fundamental
difference between it and all tracing schemes.
Every call to ibv_reg_mr is paired with a call to ummunotify to create
a matching watcher. Both calls allocate some kernel memory, if one
fails the entire operation fails and userspace can do whatever it does
on memory allocation failure.
After that point the scheme is perfectly lossless.
Performance event filtering would use the same kind of kernel memory,
call ibv_reg_mr, then install a filter, both allocate kernel memory,
if one fails the op fails. But then when the ring buffer overflows
you've lost events.
All the tracing schemes are lossy - since they loose events when the
ring buffer fills up. So to do that we either need to make a recovery
scheme of some sort, or make trace points that are blocking..
So, here is a concrete proposal how ummunotify could be absorbed by
perf events tracing, with filters.
- The filter expression must be able to trigger on a MMU event,
triggering on the intersection of the MMU event address range and
filter expression address range.
- The traces must be choosen so that there is exactly one filter
expression per ibv_reg_mr region
- Each filter has a clearable saturating counter that increments every
time the filter matches an event
- Each filter has a 64 bit user space assigned tag.
- An API similar to ummunotify exists:
struct perf_filter_tag foo[100]
int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
- Optionally - the mmap ring would contain only 64 bit user space
filter tags, not trace events.
This would then duplicate the functions of ummunotify, including the
lossless collection of events. The flow would more or less be the
same:
struct my_data *ptr = calloc()
ptr->reg_handle = ibv_reg_mr(base,len)
ptr->filter_handle = perf_filter_register("string matching base->len",ptr)
[..]
// fast path
if (atomically(perf_map->head) != last_perf_map_head) {
struct ...> Performance events filtering is being worked on and now with the proper > non-DoS limit you've added you can lose events too, dont you? So it's > all a question of how much buffering to add - and with perf events too > you can buffer arbitrary large amount of events. No, the idea for non-DoS for ummunotify is that we would limit the number of regions the application can register; so an application might hit the limit up front but no runtime loss of events once a region was registered successfully. > I think this could be done in a simpler, less limited, more generic, > more useful form by using some variation of perf events. > > You should be able to get all that you want by adding two TRACE_EVENT() > tracepoints and using the existing perf event syscall to get the events > to user-space. Yes, I would like to use perf events too. Would it be plausible to create a way for userspace to create a "counter" for each address range being watched? Then events would not be lost, because those counters would become non-zero. > Meaning that this: > 9 files changed, 1060 insertions(+), 1 deletions(-) Note that lots/ of the files touched here are in Documentation or are one-line changes to Makefiles etc. - R. _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
I'd be good to work at reducing that number, not adding to it ;-) But yeah, I currently don't see a very nice match to perf counters. _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
You're right in that perf-counter currently doesn't provide a way to specify these ranges, we simply track all mmap() traffic. The thing is that mmap() data is basically a side channel. For your usage you'd basically have to open a NOP counter and only observe the mmap data. We could look at ways of adding ranges.. We do have a means of detecting if new data is available, we keep a data head index. If that moves, you've got new stuff. _______________________________________________ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
