Published on KernelTrap (http://kerneltrap.org)

Linux: Distributed mmap API

By Amit Shah
Created Feb 26 2004 - 11:00

Daniel Phillips recently posted a patch to the Linux kernel mailing
list that implements a distributed mmap() API: "This function by itself
is enough to support a crude but useful form of distributed mmap where
a shared file is cached only on one cluster node at a time."

The mmap() system call maps a portion of a file or device into a process's address space, so that its contents can be read and written like ordinary memory. A distributed mmap API lets a process call mmap() on files that live on remote machines and appear in the local namespace through a distributed filesystem.
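For readers unfamiliar with the call, here is a minimal user-space sketch of ordinary, local mmap() usage. It is not taken from Daniel's patch; the helper name count_byte_mmap is made up for illustration, and the code assumes a POSIX system.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count occurrences of byte `c` in a file by
 * mapping it read-only instead of calling read(). Returns -1 on error. */
long count_byte_mmap(const char *path, char c)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping remains valid after close() */
    if (p == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    long n = 0;
    for (size_t i = 0; i < len; i++)   /* file contents read as plain memory */
        if (p[i] == c)
            n++;

    munmap(p, len);
    return n;
}
```

On a distributed filesystem the same call would be issued against a file whose data lives on another node, which is why the kernel needs a way to drop such mappings when another node takes over the file.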

Daniel's patch is one of two that together implement the simple core API for distributed mmap(). With this simple API, cache invalidation works only on whole files, not on portions of a file, and multiple readers may not cache the same data simultaneously. An improved version of the kernel API, addressing both of these limitations, is proposed for later.
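The two limitations can be pictured with a toy user-space model. This is only a sketch under stated assumptions: struct shared_file and fault_on_node are hypothetical names, and the real mechanism (a page fault hook plus global locking in the distributed filesystem) is reduced here to a single "owner" field.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Toy cluster-wide state for one shared file: which node, if any,
 * currently caches it. All names here are hypothetical. */
struct shared_file {
    int owner;          /* node id holding the cache, or NO_OWNER */
    int invalidations;  /* number of whole-file invalidations so far */
};

void shared_file_init(struct shared_file *f)
{
    f->owner = NO_OWNER;
    f->invalidations = 0;
}

/* Stand-in for a page fault on `node`. If another node caches the
 * file, that node must drop its entire cached copy before the
 * faulting node may cache anything. Returns true if an invalidation
 * of another node was needed. */
bool fault_on_node(struct shared_file *f, int node)
{
    assert(node >= 0);

    if (f->owner == node)
        return false;               /* already cached locally */

    bool invalidated = (f->owner != NO_OWNER);
    if (invalidated)
        f->invalidations++;         /* whole file, never a sub-range */

    f->owner = node;
    return invalidated;
}
```

In this model even two nodes that only read the file bounce ownership back and forth, which is the cost of the "one node at a time" rule.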


From: Daniel Phillips [email blocked]
Subject: [RFC] Distributed mmap API
Date: Wed, 25 Feb 2004 22:20:11 +0100

This is the function formerly known as invalidate_mmap_range, with the
addition of a new code path in the zap_ call chain to handle MAP_PRIVATE
properly.  This function by itself is enough to support a crude but useful
form of distributed mmap where a shared file is cached only on one cluster
node at a time.

To use this, the distributed filesystem has to hook do_no_page to intercept
page faults and carry out the needed global locking.  The locking itself does
not require any new kernel hooks.  In brief, the patch here and another patch
to be presented for the do_no_page hook, together provide the core kernel API
for a simplified, distributed mmap.  (Note that there may be a workaround for
the lack of a do_no_page hook, but certainly not as simple and robust.)

To put this in perspective, I'll mention the two big limitations of the
simplified API:

  1) Invalidation is always a whole file at a time
  2) Multiple readers may not cache the same data simultaneously

To handle sub-file cache granularity, we also need to be able to flush dirty
data and evict cache pages with sub-file granularity, giving a trio of cache
management functions:

    unmap_mapping_range(mapping, start, length) /* this patch */
    write_mapping_range(mapping, start, length) /* start IO for dirty cache */
    evict_mapping_range(mapping, start, length) /* wait on IO and evict cache */

To handle (2) above, the distributed filesystem will need to hook and modify
the behaviour of do_wp_page so that it can intercept memory writes to shared
cache pages.

To summarize the current proposal, and where we need to go in the future:

  Simple core kernel API for simplistic distributed memory map
  ------------------------------------------------------------

     - unmap_mapping_range export (this patch)
     - do_no_page hook

  Improved core kernel API for optimal distributed memory map
  -----------------------------------------------------------

     - unmap_mapping_range export (this patch)
     - write_mapping_range export
     - evict_mapping_range export
     - do_no_page hook
     - do_wp_page hook

There's no big rush to move on to the optimal version just now, since the simplistic
version is already a big step forward.

I'd like to take this opportunity to apologize to Paul for derailing his more
modest proposal, but unfortunately, the semantics that could be obtained that
way are fatally flawed: private mmaps just won't work.  What I've written here
is about the minimum that supports acceptable mmap semantics.

And finally, the EXPORT_SYMBOL_GPL issue: after much fretting I've changed it
to just EXPORT_SYMBOL in this patch, because I feel that we have better ways
to further our goals of free and open software than to try to use this
particular API as a battering ram.  Of course it's not my decision, I just
want to register my vote here.

Regards,

Daniel
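As an illustration of the trio of range functions named in the email, the sketch below models them in user space on a small array of pages. It is not kernel code: the toy_ names, the page flags, and the page-index units are assumptions made for this example, and the real functions operate on an address_space with locking and real I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8

/* Hypothetical per-page state, loosely mirroring page cache flags. */
struct toy_page { bool cached, mapped, dirty, under_io; };
struct toy_mapping { struct toy_page pages[TOY_NPAGES]; };

/* Clamp [start, start + length) to the file; returns the end index. */
static size_t toy_end(size_t start, size_t length)
{
    if (start >= TOY_NPAGES)
        return start;                       /* empty range */
    if (length > TOY_NPAGES - start)
        return TOY_NPAGES;
    return start + length;
}

/* Model of unmap_mapping_range: remove user mappings in the range,
 * leaving the pages themselves in the cache. */
void toy_unmap_range(struct toy_mapping *m, size_t start, size_t length)
{
    for (size_t i = start; i < toy_end(start, length); i++)
        m->pages[i].mapped = false;
}

/* Model of write_mapping_range: start I/O for dirty cached pages.
 * Returns the number of pages for which I/O was started. */
size_t toy_write_range(struct toy_mapping *m, size_t start, size_t length)
{
    size_t started = 0;
    for (size_t i = start; i < toy_end(start, length); i++) {
        struct toy_page *p = &m->pages[i];
        if (p->cached && p->dirty) {
            p->dirty = false;
            p->under_io = true;
            started++;
        }
    }
    return started;
}

/* Model of evict_mapping_range: "wait" for I/O, then drop pages that
 * are clean and no longer mapped. Returns the number evicted. */
size_t toy_evict_range(struct toy_mapping *m, size_t start, size_t length)
{
    size_t evicted = 0;
    for (size_t i = start; i < toy_end(start, length); i++) {
        struct toy_page *p = &m->pages[i];
        p->under_io = false;                /* I/O completes instantly here */
        if (p->cached && !p->mapped && !p->dirty) {
            p->cached = false;
            evicted++;
        }
    }
    return evicted;
}
```

Calling the three in order on a sub-range (unmap, write, evict) hands just that part of the file to another node, while pages outside the range stay cached and mapped locally.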



Source URL:
http://kerneltrap.org/node/2500