On Mon, Apr 12, 2010 at 04:03:59PM -0700, Andrew Morton wrote:
Just to summarize some of the key points of this thingy, as related to
your comments:
1) It is really very narrowly focused on a particular problem MPI and
RDMA have due to the way their APIs don't really match. Roland
tried to make the interface general. Maybe that is a mistake.
2) A 'self-tracing' scheme is used, again, because of an API
mismatch between an MPI library and its own
applications. Attempting to hook the appropriate calls has
proven unsatisfactory (missing cases, and slow).
3) Being intended for MPI applications, performance is a huge
concern. Synchronous operation is very undesirable. Tracing APIs
are lossy - and there is no recovery option if an event is lost.
4) Realistically the only thing MPI cares about is whether a virtual
page is unmapped/remapped. Losing events is unacceptable.
5) This isn't really tracing. There is no queue. There aren't really
events. This works more like the dirty/accessed bit in a page table:
it doesn't matter how many times something has been modified, only
that it has been at least once since the last time you looked.
This means the memory used is proportional to the number of
page-ranges you watch, and the number of events against those
page-ranges doesn't matter. No other API has this property.
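Roughly, and with made-up names (watch_add, note_unmap and
watch_test_and_clear are illustrations, not the proposed interface),
the dirty-bit-like behaviour in 5) can be sketched in user-space C as
one sticky flag per watched range:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the proposed kernel interface. */
#define MAX_WATCHES 16

struct watch {
    uintptr_t start;   /* first byte of the watched range */
    uintptr_t end;     /* one past the last byte */
    bool changed;      /* sticky: set by any overlapping unmap, cleared on read */
};

static struct watch watches[MAX_WATCHES];
static size_t nr_watches;

/* Register a range. Storage grows with the number of ranges, never with events. */
static struct watch *watch_add(uintptr_t start, uintptr_t end)
{
    assert(start < end && nr_watches < MAX_WATCHES);
    struct watch *w = &watches[nr_watches++];
    w->start = start;
    w->end = end;
    w->changed = false;
    return w;
}

/* Called for every unmap/remap of [start, end): only sets flags, queues nothing. */
static void note_unmap(uintptr_t start, uintptr_t end)
{
    for (size_t i = 0; i < nr_watches; i++)
        if (start < watches[i].end && watches[i].start < end)
            watches[i].changed = true;
}

/* "Has this range changed since I last looked?" Reading clears the flag. */
static bool watch_test_and_clear(struct watch *w)
{
    bool was_changed = w->changed;
    w->changed = false;
    return was_changed;
}
```

A thousand calls to note_unmap() against the same range leave exactly
the same state behind as one call, which is why nothing can overflow
and nothing can be lost.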
Basically, this entire scheme is designed to detect that when a == b,
the internal state held by some_mpi_call is no longer valid, in
this kind of situation:
    a = mmap(ONE_PAGE);
    some_mpi_call(a);
    munmap(a);
    b = mmap(ONE_PAGE);   // Kernel picks b == a
    some_mpi_call(b);
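For anyone who wants to poke at this, here is a small sketch of the
same sequence with real mmap()/munmap() calls. The one-entry
address cache inside some_mpi_call() is an assumption standing in for
whatever per-buffer state an MPI library keeps, and whether the kernel
actually hands back the same address is up to the kernel, so the code
only reports what happened:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical one-entry cache standing in for the MPI library's state. */
static uintptr_t cached_addr;

/* Returns true on a cache hit, i.e. the library would reuse its old state. */
static bool some_mpi_call(void *buf)
{
    bool hit = ((uintptr_t)buf == cached_addr);
    cached_addr = (uintptr_t)buf;
    return hit;
}

static void *map_one_page(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);
    return p;
}

/*
 * Runs the map/call/unmap/map/call sequence and reports both addresses.
 * Returns true when the second call hit the cache, which happens exactly
 * when the kernel reused the address, and is exactly the stale case.
 */
static bool stale_hit_demo(uintptr_t *a_out, uintptr_t *b_out)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    void *a = map_one_page(len);
    some_mpi_call(a);
    munmap(a, len);

    void *b = map_one_page(len);   /* kernel may pick b == a */
    bool hit = some_mpi_call(b);
    munmap(b, len);

    *a_out = (uintptr_t)a;
    *b_out = (uintptr_t)b;
    return hit;
}
```

An address-keyed cache has no way to tell the two mappings apart on
its own; something has to tell it that the page went away in between.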
All the races you point out just don't matter for the MPI use
case. Essentially, if the app hits those races, then it is using the
MPI library in a buggy way.
That said, this could be explained better in the documentation file. :)
I'm sure Eric can go through the rest of your questions in greater
detail.
The only case that matters for the generation counter optimization is
a false negative. As long as user space does:
    u64 val = *counter;
    if (val != last_counter) {
        last_counter = val;
        /* ... make the syscall to fetch what changed ... */
    }
then you can get false positives, as you point out, but never a false
negative. A false positive results in an extra syscall, and the kernel
just returns no data.
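Spelled out a bit more, as a user-space sketch only: the counter
variable, kernel_note_change() and fetch_changes() below are
stand-ins I made up for the shared counter, the kernel side and the
syscall, and C11 atomics are assumed for the read. The point is that
the value saved is the one that was read before the fetch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the generation counter the kernel would export (hypothetical). */
static _Atomic uint64_t counter;
static uint64_t last_counter;
static unsigned fetch_calls;    /* how many "syscalls" were made */

/* Stand-in for the syscall that returns the changed ranges. */
static void fetch_changes(void) { fetch_calls++; }

/* Stand-in for the kernel bumping the generation on every change. */
static void kernel_note_change(void) { atomic_fetch_add(&counter, 1); }

/*
 * Fast path: one load and one compare. The value stored in last_counter
 * is the one read *before* the fetch, so a change that races with the
 * fetch leaves counter != last_counter and costs one extra (empty) fetch
 * on the next poll. It can never be missed.
 */
static bool poll_changes(void)
{
    uint64_t val = atomic_load(&counter);
    if (val == last_counter)
        return false;           /* nothing changed, no syscall */
    last_counter = val;
    fetch_changes();
    return true;
}

/* Usage: two changes between polls cost one fetch, not two. */
static void demo(void)
{
    assert(!poll_changes());
    kernel_note_change();
    kernel_note_change();
    assert(poll_changes());
    assert(!poll_changes());
    assert(fetch_calls == 1);
}
```

Re-reading the counter after the fetch and saving that value instead
would be the way to get a false negative, so don't do that.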
Regards,
Jason