Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

Previous thread: [ofa-general] [GIT PULL] please pull infiniband.git by Roland Dreier on Thursday, September 10, 2009 - 9:23 pm. (1 message)

Next thread: [ofa-general] Re: [GIT PULL] please pull ummunotify by Roland Dreier on Thursday, September 10, 2009 - 11:03 pm. (1 message)
From: Roland Dreier
Date: Thursday, September 10, 2009 - 9:38 pm

Linus, please consider pulling from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This will get "ummunotify," a new character device that allows a
userspace library to register for MMU notifications; this is
particularly useful for MPI implementions (message passing libraries
used in HPC) to be able to keep track of what wacky things consumers
do to their memory mappings.  My colleague Jeff Squyres from the Open
MPI project posted a blog entry about why MPI wants this:

http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/

His summary of ummunotify:

  "It’s elegant, doesn’t require strange linker tricks, and seems to
   work in all cases.  Yay!"

This code went through several review iterations on lkml and was in
-mm and -next for quite a few weeks.  Andrew is OK with merging it (I
think -- Andrew please correct me if I misunderstood you).

Roland Dreier (1):
      ummunotify: Userspace support for MMU notifications

 Documentation/Makefile                  |    3 +-
 Documentation/ummunotify/Makefile       |    7 +
 Documentation/ummunotify/ummunotify.txt |  150 ++++++++
 Documentation/ummunotify/umn-test.c     |  200 +++++++++++
 drivers/char/Kconfig                    |   12 +
 drivers/char/Makefile                   |    1 +
 drivers/char/ummunotify.c               |  566 +++++++++++++++++++++++++++++++
 include/linux/Kbuild                    |    1 +
 include/linux/ummunotify.h              |  121 +++++++
 9 files changed, 1060 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ummunotify/Makefile
 create mode 100644 Documentation/ummunotify/ummunotify.txt
 create mode 100644 Documentation/ummunotify/umn-test.c
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 ...
From: Pavel Machek
Date: Tuesday, September 15, 2009 - 4:34 am

I don't remember seeing discussion of this on lkml. Yes it  is in
-next...

Basically it allows app to 'trace itself'? ...with interesting mmap()
interface, exporting int to userspace, hoping it behaves atomically...? 
	
							Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Roland Dreier
Date: Tuesday, September 15, 2009 - 7:57 am

> I don't remember seeing discussion of this on lkml. Yes it  is in
 > -next...

eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
and earlier patches.

 > Basically it allows app to 'trace itself'? ...with interesting mmap()
 > interface, exporting int to userspace, hoping it behaves atomically...? 

Yes, it allows app to trace what the kernel does to memory mappings.  I
don't believe there's any real issue to atomicity of mmap'ed memory,
since userspace really just tests whether read value is == to old read
value or not.

 - R.
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Pavel Machek
Date: Monday, September 28, 2009 - 1:49 pm

Well... it seems little overspecialized. Just modifying libc to

That still needs memory barriers etc.. to ensure reliable operation,
no?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Jason Gunthorpe
Date: Monday, September 28, 2009 - 2:40 pm

That is what MPI people are doing today and their feedback is that it
doesn't work - there are a lot of ways to mess with memory and no good
choices to hook the raw syscalls and keep sensible performance.

The main focus of this is high performance MPI apps, so lower overhead
on critical paths like memory allocation is part of the point. It is
ment to go hand-in-hand with the specialized RDMA memory pinning

No, I don't think so..

The application is expected to provide sequencing of some sort between
the memory call (mmap/munmap/brk/etc) and the int check - usually just
by running in the same thread, or through some kind of locking scheme.

As long as the mmu notifiers run immediately in the same context as
the mmap/etc then it should be fine.

For example, the most common problem to solve looks like this:

   x = mmap(...)
   do RDMA with x
   [..]
   mmunmap(x);

   [..]
   y = mmap(..);
   do RDMA with y
     if by chance x == y things explode.

So this API puts the int test directly before 'do RDMA with'.

Due to the above kind of argument the net requirement is either to
completely synchronously (and with low overhead) hook every
mmap/munmap/brk/etc call into the kernel and do the accounting work,
or have a very low over head check every time the memory region is
about to be used.

Jason
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Roland Dreier
Date: Wednesday, September 16, 2009 - 9:30 am

Hi Linus,

Sorry to hassle you about this, but I would like to know where things
stand.  I know (from the reflink discussion if nothing else) that you're
definitely not bashful about telling people when their code sucks, so
this silent treatment has me really flustered.  I've been showering and
brushing my teeth and everything, honest!

Seriously, this code solves a problem that the MPI/HPC people have been
complaining about for quite a while, and if possible I'd like to get
this upstream.  Or if you have a better idea, I'm all ears...

Thanks,
  Roland

 > Linus, please consider pulling from
 > 
 >     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
 > 
 > This tree is also available from kernel.org mirrors at:
 > 
 >     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
 > 
 > This will get "ummunotify," a new character device that allows a
 > userspace library to register for MMU notifications; this is
 > particularly useful for MPI implementions (message passing libraries
 > used in HPC) to be able to keep track of what wacky things consumers
 > do to their memory mappings.  My colleague Jeff Squyres from the Open
 > MPI project posted a blog entry about why MPI wants this:
 > 
 > http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
 > 
 > His summary of ummunotify:
 > 
 >   "It’s elegant, doesn’t require strange linker tricks, and seems to
 >    work in all cases.  Yay!"
 > 
 > This code went through several review iterations on lkml and was in
 > -mm and -next for quite a few weeks.  Andrew is OK with merging it (I
 > think -- Andrew please correct me if I misunderstood you).
 > 
 > Roland Dreier (1):
 >       ummunotify: Userspace support for MMU notifications
 > 
 >  Documentation/Makefile                  |    3 +-
 >  Documentation/ummunotify/Makefile       |    7 +
 >  Documentation/ummunotify/ummunotify.txt |  150 ++++++++
 >  Documentation/ummunotify/umn-test.c   ...
From: Linus Torvalds
Date: Wednesday, September 16, 2009 - 9:40 am

I just haven't had time to look into the issue, so I'm merging the code 
that I know I need to merge, and hopefully I'll have a breather later when 
I can actually look at code and the thread it all spawned.,

		Linus
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Peter Zijlstra
Date: Thursday, September 17, 2009 - 4:30 am

Anton Blanchard suggested a while back that this might be integrated
with perf-counters, since perf-counters already does mmap() tracking and
also provides events through an mmap()'ed buffer.

Has anybody looked into this?

If someone did and I missed the discussion on why it isn't appropriate,
kindly point me in the right direction ;-)


_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Roland Dreier
Date: Thursday, September 17, 2009 - 7:24 am

> Anton Blanchard suggested a while back that this might be integrated
 > with perf-counters, since perf-counters already does mmap() tracking and
 > also provides events through an mmap()'ed buffer.
 > 
 > Has anybody looked into this?

I didn't see the original suggestion.  Certainly hooking in to existing
infrastructure for user/kernel communication would be good.

The fit doesn't seem great to me, although I am rather naive about perf
counters.  The problem that ummunotify is trying to solve is to let an
app say 'for these 1000 address ranges (that possibly only cover a small
part of my total address space), please let me know when the mappings
are invalidated for any reason'.

So getting those events in the kernel is no problem -- we have the MMU
notifier hooks that tell us exactly what we need to know.  The issue is
purely the way userspace registers interest in address ranges, and how
to kernel returns the events.

For perf counters it seems that one would have to create a new counter
for each address range... is that correct?  And also I don't know if
perf counter has an analog for the fast path optimization that
ummunotify provides via a mmap'ed generation counter (a quick way for
userspace to see 'nothing happened since last time you checked').

 - R.
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Roland Dreier
Date: Thursday, September 17, 2009 - 7:32 am

> So getting those events in the kernel is no problem -- we have the MMU
 > notifier hooks that tell us exactly what we need to know.  The issue is
 > purely the way userspace registers interest in address ranges, and how
 > to kernel returns the events.
 > 
 > For perf counters it seems that one would have to create a new counter
 > for each address range... is that correct?  And also I don't know if
 > perf counter has an analog for the fast path optimization that
 > ummunotify provides via a mmap'ed generation counter (a quick way for
 > userspace to see 'nothing happened since last time you checked').

Oh I forgot... ummunotify also preallocates everything etc. so that
there is no way for events to be lost.  Which saves userspace from
having to trash everything cached and start over, which it would have to
do if it misses an invalidate event.

And AFAIK, pref counters does have the possibility of overflowing a
buffer and losing an event, right?

 - R.
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Peter Zijlstra
Date: Thursday, September 17, 2009 - 7:49 am

Well, you cannot pre-allocate everything, either you get back-logged
evens in kernel space leading to a kernel DoS, or you loose events.

Perf counters have two modes, a RO mmap() and a RW mmap(). The RO mode
will automagically overwrite its tail data without regard for userspace
having observed it.

In the RW mode userspace has to advance the tail, the kernel will drop
events when full and insert a PERF_EVENT_LOST event once there is room
again.

Hmm, or are you saying you can only get 1 event per registered range and
allocate the thing on registration? That'd need some registration limit
to avoid DoS scenarios.

_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Roland Dreier
Date: Thursday, September 17, 2009 - 8:03 am

> Hmm, or are you saying you can only get 1 event per registered range and
 > allocate the thing on registration? That'd need some registration limit
 > to avoid DoS scenarios.

Yes, that's what I do.  You're right, I should add a limit... although
their are lots of ways for userspace to consume arbitrary amounts of
kernel resources already.

 - R.
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Roland Dreier
Date: Thursday, September 17, 2009 - 8:45 am

> > > Hmm, or are you saying you can only get 1 event per registered range and
 > > > allocate the thing on registration? That'd need some registration limit
 > > > to avoid DoS scenarios.
 > > 
 > > Yes, that's what I do.  You're right, I should add a limit... although
 > > their are lots of ways for userspace to consume arbitrary amounts of
 > > kernel resources already.
 > 
 > I'd be good to work at reducing that number, not adding to it ;-)

Yes, definitely.  I'll add a quick ummunotify module parameter that
limits the number of registrations per process.

 > But yeah, I currently don't see a very nice match to perf counters.

OK.  It would be nice to tie into something more general, but I think I
agree -- perf counters are missing the filtering and the "no lost
events" that ummunotify does have.  And I'm not sure it's worth messing
up the perf counters design just to jam one more not totally related
thing in.

 - R.
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Ingo Molnar
Date: Friday, September 18, 2009 - 4:50 am

The filtering can be done and has been done - see Li Zefan's patchset 
that uses filter expressions to do per event in-kernel filtering.

The OOM DoS is a bug in your patches i think, which perfcounters solves 
;-)

	Ingo
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Pavel Machek
Date: Tuesday, September 29, 2009 - 10:13 am

I believe that extending perf counters to do what you want is better
than adding one more, very strange, user<->kernel interface.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Ingo Molnar
Date: Wednesday, September 30, 2009 - 2:44 am

Performance events filtering is being worked on and now with the proper 
non-DoS limit you've added you can lose events too, dont you? So it's 
all a question of how much buffering to add - and with perf events too 

Nobody suggested details for any redesign yet (so far it seems like a 
perfect match, to me at least) so i'm wondering what messup you are 

Agreed.


I test-pulled this code and had a look at it.

I think this could be done in a simpler, less limited, more generic, 
more useful form by using some variation of perf events.

You should be able to get all that you want by adding two TRACE_EVENT() 
tracepoints and using the existing perf event syscall to get the events 
to user-space.

Meaning that this:

  9 files changed, 1060 insertions(+), 1 deletions(-)

Would be replaced with something like:

  2 files changed, 100 insertions(+), 0 deletions(-)

[ the +100 lines would (roughly) would add tracepoints to 
  invalidate_page and invalidate_range_start. (possibly via 
  mmu_notifier_register() like the ummunotify code does) Most of that 
  linecount would be comments. ]

Another upside, beyond the reduction in complexity is that we'd have one 
less special char driver based ABI. Which is a big plus in my opinion, 
especially if this goes towards HPC folks and if it's used for real. Why 
should such a MM capability hidden behind a character device and an 
ioctl?

The perf event approach is beneficial to non-HPC as well: MM 
instrumentation for example - page range invalidates are interesting to 
all sorts of modi of analysis.

A question: what is the typical size/scope of the rbtree of the watched 
regions of memory in practical (test) deployments of the ummunofity 
code?

Per tracepoint filtering is possible via the perf event patches Li Zefan 
has posted to lkml recently, under this subject:

   [PATCH 0/6] perf trace: Add filter support

They are still being worked on but it's very clear that flexible 
in-kernel filtering support will be a ...
From: Jason Gunthorpe
Date: Wednesday, September 30, 2009 - 9:02 am

No, the ummunotify does not loose events, that is the fundamental
difference between it and all tracing schemes.

Every call to ibv_reg_mr is paired with a call to ummunotify to create
a matching watcher. Both calls allocate some kernel memory, if one
fails the entire operation fails and userspace can do whatever it does
on memory allocation failure.

After that point the scheme is perfectly lossless.

Performance event filtering would use the same kind of kernel memory,
call ibv_reg_mr, then install a filter, both allocate kernel memory,
if one fails the op fails. But then when the ring buffer overflows
you've lost events.

All the tracing schemes are lossy - since they loose events when the
ring buffer fills up. So to do that we either need to make a recovery
scheme of some sort, or make trace points that are blocking..

So, here is a concrete proposal how ummunotify could be absorbed by
perf events tracing, with filters.
- The filter expression must be able to trigger on a MMU event,
  triggering on the intersection of the MMU event address range and
  filter expression address range.
- The traces must be choosen so that there is exactly one filter
  expression per ibv_reg_mr region
- Each filter has a clearable saturating counter that increments every
  time the filter matches an event
- Each filter has a 64 bit user space assigned tag.
- An API similar to ummunotify exists:
     struct perf_filter_tag foo[100]
     int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
- Optionally - the mmap ring would contain only 64 bit user space
  filter tags, not trace events.

This would then duplicate the functions of ummunotify, including the
lossless collection of events. The flow would more or less be the
same:

 struct my_data *ptr = calloc()
 ptr->reg_handle = ibv_reg_mr(base,len)
 ptr->filter_handle = perf_filter_register("string matching base->len",ptr)

 [..]
 // fast path
 if (atomically(perf_map->head) != last_perf_map_head) {
     struct ...
From: Roland Dreier
Date: Wednesday, September 30, 2009 - 10:06 am

> Performance events filtering is being worked on and now with the proper 
 > non-DoS limit you've added you can lose events too, dont you? So it's 
 > all a question of how much buffering to add - and with perf events too 
 > you can buffer arbitrary large amount of events.

No, the idea for non-DoS for ummunotify is that we would limit the
number of regions the application can register; so an application might
hit the limit up front but no runtime loss of events once a region was
registered successfully.

 > I think this could be done in a simpler, less limited, more generic, 
 > more useful form by using some variation of perf events.
 > 
 > You should be able to get all that you want by adding two TRACE_EVENT() 
 > tracepoints and using the existing perf event syscall to get the events 
 > to user-space.

Yes, I would like to use perf events too.  Would it be plausible to
create a way for userspace to create a "counter" for each address range
being watched?  Then events would not be lost, because those counters
would become non-zero.

 > Meaning that this:
 >   9 files changed, 1060 insertions(+), 1 deletions(-)

Note that lots/ of the files touched here are in Documentation or are
one-line changes to Makefiles etc.

 - R.
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Peter Zijlstra
Date: Thursday, September 17, 2009 - 8:22 am

I'd be good to work at reducing that number, not adding to it ;-)

But yeah, I currently don't see a very nice match to perf counters.

_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Peter Zijlstra
Date: Thursday, September 17, 2009 - 7:43 am

You're right in that perf-counter currently doesn't provide a way to
specify these ranges, we simply track all mmap() traffic.

The thing is that mmap() data is basically a side channel. For your
usage you'd basically have to open a NOP counter and only observe the
mmap data.

We could look at ways of adding ranges..

We do have a means of detecting if new data is available, we keep a data
head index. If that moves, you've got new stuff.



_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Previous thread: [ofa-general] [GIT PULL] please pull infiniband.git by Roland Dreier on Thursday, September 10, 2009 - 9:23 pm. (1 message)

Next thread: [ofa-general] Re: [GIT PULL] please pull ummunotify by Roland Dreier on Thursday, September 10, 2009 - 11:03 pm. (1 message)