login
Header Space

 
 

Filesystem Namespace Unification

September 8, 2007 - 4:33pm
Submitted by Jeremy on September 8, 2007 - 4:33pm.
Linux news

Bharata Rao posted a query to the Linux Kernel mailing list looking for ideas on how to best handle filesystem namespace unification with Union Mount, "typically this is done by reading the directory entries of all the union'ed layers (starting from the top and working downwards) and merging the result by eliminating the duplicate entries. This is done by extending the getdents/readdir system calls to support the notion of union'ed directories." He went on to describe a few different methods they were considering for implementation noting, "he main problem in getting a sane readdir() implementation in Union Mount is the fact that a single vfs object (file structure) is used to represent more than one (underlying) directory. Because of this, it is unclear as to how lseek(2) needs to behave in such cases."

Two other similar solutions were discussed, aufs and UnionFS. In response to the Bharata's questions, Erez Zadok explained, "in the long run, storing a cache of the readdir state on disk, is the best approach by far. Since you already spend the CPU and memory resources to create a merged view, storing it on disk as a contiguous file isn't that much more effort. That effort pays off later on esp. if the directories don't change often"


From:	Bharata B Rao [email blocked]
Reply-To: [email blocked]
Subject: [RFC] Union Mount: Readdir approaches
Date:	Fri, 7 Sep 2007 11:16:18 +0530

Hi,

Any filesystem namespace unification solution (Union Mount, Unionfs) needs to
provide a unified or merged view of directory contents. Typically this is
done by reading the directory entries of all the union'ed layers (starting
from the top and working downwards) and merging the result by eliminating
the duplicate entries. This is done by extending the getdents/readdir system
calls to support the notion of union'ed directories.

Here I try to describe briefly the approaches we have tried with Union Mount
and the issues we are facing. The intention is to solicit feedback and
suggestions on how to solve this problem of merging directory contents.

Approach 1
----------
This is the approach which is present in the current Union Mount patches.

In this approach, the directory entries are read starting from the topmost
directory and they are maintained in a cache. Subsequently, when the entries
from the lower directories are read, they are checked for duplicates (in the
cache) before being passed out to the user space. There can be multiple calls
to readdir/getdents routines for reading the entries of a single directory.
But union directory cache is not maitained across these calls. Instead
for every call, the previously read entries are re-read into the cache
and newly read entires are compared against these for duplicates before
being they are returned to user space.

Since Union Mount does unioning at VFS layer, the file descriptor
of the topmost directory represents all the underlying directories. So
even though readdir() happens on the same file struct, it needs to internally
obtain file structs for lower directories and issue readdir() on them
separately. While this itself is fine, it causes a problem when the
directory is lseek'ed.

In this approach, the file position of the topmost directory is linearly
scaled to accommodate the (inode) sizes of the underlying directories.
For eg, if the union has two layers (with inode sizes 100 each), then
at the end of reading the lower directory, the file->f_pos of the topmost
directory would be 200.  So, it is possible for the file position
(of the topmost directory) to exceed it's inode size.

This approach works only if all the layers of the union have flat file
directories (like ext2). And it fails for those filesystems which return
a directory offset (dirent->d_off) greater than the inode size. And it allows
a sane lseek() behaviour on union'ed directories (only for flat file
directories).

Ref: http://lkml.org/lkml/2007/7/30/174

Approach 2
----------
This approach was part of a few versions of Union Mount patches.

Like Approach 1, this also maintains a cache of directory
entries, but this cache persists across readdir calls. So unlike the
previous approach, here the cache is not destroyed and rebuilt for
every readdir(). Instead the cache information is left hanging off
the file structure of the topmost directory. In addition to the
cached entries, it also stores information about the current
directory(vfsmnt, dentry) in the union stack on which last readdir() was done
and also the offset (file->f_pos) at which the readdir() stopped.
Subsequent readdir() will fetch this information from file struct (of the top
layer) and start reading the entries at where the previous readdir() had
stopped. And this approach doesn't try to interpret or change file->f_pos.

Since the information about the directory and the offset within it is
stored separately when the readdir() ends, this approach works for all
filesystems, irrespective of the method used by them to encode directory's
file->f_pos. But again, this can't support a sane lseek() on union'ed
directories in its current form.

Ref: http://lkml.org/lkml/2007/6/20/21

Approach 3
----------
Chistoph Hellwig suggested that to avoid a single file struct representing
multiple objects, ->readdir() which is currently a file operation should
instead be changed to inode operation. But it is not sure atm how this could
help because getdents(2) and readdir(2) restrict us to use a open file
descriptor and there is one fd for union'ed directories. Moreover we need
a single offset element to correctly represent the offset into lower
directories so as to support lseek(2) correctly.

Ref: http://lkml.org/lkml/2007/6/20/107

Questions
---------
The main problem in getting a sane readdir() implementation in Union Mount
is the fact that a single vfs object (file structure) is used to represent
more than one (underlying) directory. Because of this, it is unclear as to
how lseek(2) needs to behave in such cases.

First of all, should we even expect a sane lseek(2) on union mounted
directories ? If not, will the Approach 2, which works uniformly for
all filesystem types be acceptable ?

If lseek(2) needs to be supported, then how do we define the seek behaviour
when two different types of directories are union'ed ? For eg. how do we define
lseek(2) on a union of ext2 directory on top of a nfs directory ? Since both
of them use different encoding methods for filp->f_pos, how do we establish
a common lseek(2) behaviour here ?

And finally, what is the use case for directory seek ? Would anybody walk
directory by directory by seeking into a directory file ?

Regards,
Bharata.


From: [email blocked] Subject: Re: [RFC] Union Mount: Readdir approaches Date: Fri, 07 Sep 2007 16:31:26 +0900 Hello Bharata, I am developing a linux stackable/unification filesystem too. Bharata B Rao: > Questions > --------- ::: > First of all, should we even expect a sane lseek(2) on union mounted > directories ? If not, will the Approach 2, which works uniformly for > all filesystem types be acceptable ? > > If lseek(2) needs to be supported, then how do we define the seek behaviour > when two different types of directories are union'ed ? For eg. how do we define > lseek(2) on a union of ext2 directory on top of a nfs directory ? Since both > of them use different encoding methods for filp->f_pos, how do we establish > a common lseek(2) behaviour here ? > > And finally, what is the use case for directory seek ? Would anybody walk > directory by directory by seeking into a directory file ? Although I don't remember exactly, NFS or smbfs seek for a directory. Additionally, any user process can call seekdir or something. So I believe any stackable/unification filesystem should support it. Here is my approach. While I don't think it is the best approach since it consumes much memory and cpu, I hope it help you. (or assist you? Sorry, I don't know correct English word) - the stackable fs has its own inode, file, dentry object, which has an array for the underlying inode pointers. and the whiteout is a regular file with a special name, instead of a flag in inode. this is the most different architecture from your unification embeded in VFS. - the vritual dir inode object has a cache for its child entries. it is called vdir. the cache has a version and a customizlable lifetime too. - all the existing underlying (same-named) dir are opened too. the file objects are stored in the virtual file object as an array. - the virtual file object has a cache for its child entries too. it is a copy of the one in the inode object. When the first readdir is issued: - call vfs_readdir for every underlying opened dir (file) object. - store every entry to either the hash table for the result or the whiteout, when the same-named entry didn't exist in the tables. - to improvement the performance, the allocated memory for the hash tables are managed in a pointer array. and the elements are concatinated logically by the pointer. - the pointer for the result-table, the version, and the currect jiffies are set to vdir, which is a cache in an inode. - all cache are copied to a member in a file object. - the index of the cache memory block and the offset in an array is handled as the seek position. In the case of the application issued this sequence: - opendir() - readdir() - creat or unlink an entry under the dir - readdir() When an entry under the dir was removed or added, the inode version will be updated. Since readdir can compare it with the cached version or the lifetime (jiffies) in the file object, it can refresh the entries. But in this case, it doesn't, since the file position is not 0. If the application needs the latest entries, it has to call rewinddir. The cache in the file object will updated only the case of obsoleted AND the file position is 0. When a dir who has already its vdir is opened, the cache in the inode object will be used without calling vfs_readdir, after checking the version and the lifetime which are stored in the inode object. If it is obsoleted, vfs_readdir will be called again in order to update the cache in the inode. If you are interested in this approach, please refer to http://aufs.sf.net. It is working and used by several people. Junjiro Okajima
From: Bharata B Rao [email blocked] Reply-To: [email blocked] Subject: Re: [RFC] Union Mount: Readdir approaches Date: Fri, 7 Sep 2007 13:28:55 +0530 On Fri, Sep 07, 2007 at 04:31:26PM +0900, [email blocked] wrote: > > When the first readdir is issued: > - call vfs_readdir for every underlying opened dir (file) object. > - store every entry to either the hash table for the result or the > whiteout, when the same-named entry didn't exist in the tables. > - to improvement the performance, the allocated memory for the hash > tables are managed in a pointer array. and the elements are > concatinated logically by the pointer. > - the pointer for the result-table, the version, and the currect jiffies > are set to vdir, which is a cache in an inode. > - all cache are copied to a member in a file object. > - the index of the cache memory block and the offset in an array is > handled as the seek position. Ok, interesting approach. So you define the seek behaviour on your directory cache rather than allowing the underlying filesystems to interpret the seek. I guess we can do something similar with Union Mounts also. > If you are interested in this approach, please refer to > http://aufs.sf.net. It is working and used by several people. Will look at it. And thanks Junjiro for your detailed explanation of the aufs approach. Regards, Bharata.
From: "Josef 'Jeff' Sipek" [email blocked] Subject: Re: [RFC] Union Mount: Readdir approaches Date: Fri, 7 Sep 2007 13:39:41 -0400 On Fri, Sep 07, 2007 at 01:28:55PM +0530, Bharata B Rao wrote: > On Fri, Sep 07, 2007 at 04:31:26PM +0900, [email blocked] wrote: > > > > When the first readdir is issued: > > - call vfs_readdir for every underlying opened dir (file) object. > > - store every entry to either the hash table for the result or the > > whiteout, when the same-named entry didn't exist in the tables. > > - to improvement the performance, the allocated memory for the hash > > tables are managed in a pointer array. and the elements are > > concatinated logically by the pointer. > > - the pointer for the result-table, the version, and the currect jiffies > > are set to vdir, which is a cache in an inode. > > - all cache are copied to a member in a file object. > > - the index of the cache memory block and the offset in an array is > > handled as the seek position. > > Ok, interesting approach. So you define the seek behaviour on your > directory cache rather than allowing the underlying filesystems to > interpret the seek. I guess we can do something similar with Union > Mounts also. Unless I missunderstood something, Unionfs uses the same approach. Even Unionfs's ODF branch does the same thing. The major difference is that we keep the cache in a file on a disk. Josef 'Jeff' Sipek. -- Evolution, n.: A hypothetical process whereby infinitely improbable events occur with alarming frequency, order arises from chaos, and no one is given credit.
From: Erez Zadok [email blocked] Subject: Re: [RFC] Union Mount: Readdir approaches Date: Fri, 7 Sep 2007 13:54:18 -0400 In message <20070907173941.GB20360@filer.fsl.cs.sunysb.edu>, "Josef 'Jeff' Sipek" writes: > On Fri, Sep 07, 2007 at 01:28:55PM +0530, Bharata B Rao wrote: > > > > Ok, interesting approach. So you define the seek behaviour on your > directory cache rather than allowing the underlying filesystems to > > interpret the seek. I guess we can do something similar with Union > > Mounts also. > > Unless I missunderstood something, Unionfs uses the same approach. Even > Unionfs's ODF branch does the same thing. The major difference is that we > keep the cache in a file on a disk. Yup. Bharata, in the long run, storing a cache of the readdir state on disk, is the best approach by far. Since you already spend the CPU and memory resources to create a merged view, storing it on disk as a contiguous file isn't that much more effort. That effort pays off later on esp. if the directories don't change often: - you get a compatible behavior with seekdir/telldir (no matter how braindead that interface is :-) - for subsequent directory reading, your performance actually improves because you don't have to repeat the duplicate elimination and whiteout processing -- just read the cached file from disk as any other file. You then benefit from traditional readahead, and from not having to cache the entire contents of the readdir state file, so it falls under normal paging/flushing policies. Any policy which merges the readdir info and keeps it in memory indefinitely is problematic -- you increase average memory pressure on the system over a longer period of time; and when you purge your readdir state from memory, you have to recreate it from scratch, re-consuming the same CPU/memory resources. Our ODF code implements the readdir state caching policy, as described in the ODF design document here: <http://www.filesystems.org/unionfs-odf.txt&gt; Finally, I don't think it'll be so easy to get rid of seekdir/telldir, b/c some of it is the default behavior of non-linux NFS/smb clients (we've seen it with Solaris NFS clients). Erez.



Related Links:

Look at Plan9, as usual

September 11, 2007 - 2:59pm
Wladimir Mutel (not verified)

http://plan9.bell-labs.com/magic/man2html/1/bind
Everything is taken into account there.
Semantics is non-POSIX, however.
Look in awe at 'ns' and 'cpu' commands as well :>

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
speck-geostationary