It is about time to take a step back and describe what I have been
implementing. A tux3 filesystem is a heirarchical structure with a
simple form:
Superblock
Free map (btree)
Free btree index*
Bitmap block or extents*
Volume table (radix tree)
Volume*
Metablock
Log chain*
Version table (radix tree)
Version index block*
Version*
Atime table (btree)
Atime index block*
Atime block*!
Inode table (btree)
Inode*
Inode attribute*!
Data attribute* (btree)
Data btree index*
Data tree leaf*
Extent*!
Versioned elements are marked by !, repeated elements by *.
There is a single free map from which extents for all other objects are
allocated.
The volume table is a new addition not central to the goals of Tux3,
but a nice feature to have given that it comes nearly for free. One
Tux3 volume can have an arbitrary number of separate filesystems tucked
inside it, indexed by a simple integer parameter at mount time. People
say they like this idea and it imposes no significant complexity, so it
goes in. I am not sure I would ever use it, personally. I like my
volumes to be nice separate pieces of disk, but I suppose that is just
me being old fashioned.
Each volume has a metablock pointing at the forward log chain for the
volume, a version table that describes the heirarchical relationship
between versions (snapshots), an atime table to take care of that
horrid legacy Unix feature, and an inode table containing files and
attributes of files. The atime tables, version tables and atom tables
could possibly just be types of files. I have not gotten there quite
yet, so let's see what works out best in practice.
Versioning takes place in three places, versioned pointers in the atime
btree, versioned extents in a file data btree and versioned ...I have talked about the benefits of deferring namespace operations such
as file create and unlink. For Tux3, the main motivation is to remove
the need to share write access to inode table blocks between front end
filesystem operations and back end cache flushing. There are other
good reasons for doing it too:
* Take long-running filesystem operations outside the i_mutex
directory lock, which is heavily contented in some situations
* Maybe there are opportunities to batch up edits to directory
blocks and perform them more efficiently
* Can sometimes handle a create-delete sequence entirely in cache
Anyway, here is a very crude patch that just defers unlinks of regular
files under Ext2. It relies on receiving an "fsync" to the directory to
actually remove the name from the directory entry block. I wrote a
little fsync utility for doing that:
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
int main(int argc, char *argv[])
{
if ((fsync(open(argv[1], O_RDONLY))) == -1) {
printf("failed to fsync %s: %s (%i)\n", argv[1], strerror(errno), errno);
}
return 0;
}
Hirofumi showed how it can be done with dd too, a nice trick. Anyway,
after a rm, the fsync has to be done by hand or the on-disk unlink will
never happen. But if you ls the directory, it will seem to be gone and
it cannot be opened. To be precise, the dentry for the file has been
turned into a "negative dentry" that says "this file does not exist",
and the negative dentry is locked into cache so that the filesystem
unlink can be done later. Of course, Ext3 will keep track of
directories that have deferred unlinked dentries, and automatically do
the fsync.
Besides the dentry, the cached inode has to be freed if its link count
has reached zero. We want to defer the inode free also, it is the
inode delete that causes the Tux3 layering problem described ...Hi Daniel, While I'm certainly not qualified to comment on the _content_ of the patch, I have been watching Linux filesystem flamewars for a while... I _strongly_ suggest that you write to linux-fsdevel@vger about these VFS changes, and sooner rather than later. (It probably would be a good idea to directly CC at least Al Viro and Christoph Hellwig; Ted Ts'o, Christoph Lameter, Linus, and Andrew Morton are also good candidates.) Tux3 looks very promising, and I would hate to see it get tied up in avoidable VFS battles when it's otherwise ready for merging. Thanks, -- BKS _______________________________________________ Tux3 mailing list Tux3@tux3.org http://tux3.org/cgi-bin/mailman/listinfo/tux3
It is planned that Tux3 will not depend on any core kernel changes. There will be a patch for deferred namespace operations, and if the patch is not applied, then Tux3 will use the alternative mechanism we have discussed, which will be less efficient but functional. So there is no need for any flamewar: we work with or without the patch. So I would prefer that the concept be functional, without any really stupid oversights such as the one Hirofumi just pointed out to me, before placing the idea within flaming distance of the usual suspects. Regards, Daniel _______________________________________________ Tux3 mailing list Tux3@tux3.org http://tux3.org/cgi-bin/mailman/listinfo/tux3
Tux3 has grown simpler as it has aged. Back in august when Tux3 was two weeks old, I described what seemed to be a fairly simple filesystem structure: http://kerneltrap.org/Linux/Tux3_Hierarchical_Structure It turned out that a lot of fluff could still be cut. The volume table was dropped for reasons I will get into below, and two internal structures, the version table and allocation bitmap, were recast as normal files. We now have: Superblock Inode table node* Inode table leaf* Inode* Inode attribute*! Data btree attribute Data btree node* Data btree leaf* Extent*! Versioned elements are marked by !, repeated elements by *. Now Tux3 is just a btree of btrees, an inode table btree from which file btrees descend. Here it is in pictures, courtesy of Hirofumi Ogawa's brilliant graphical Tux3 structure dumper: http://tux3.org/images/tux3.structure.png http://tux3.org/images/tux3.structure.svg The only structural elaborations still due to arrive are the log chain for atomic commit and "metablocks" into which frequently updated superblock fields will be moved, giving a form that is likely to serve well for a long time: Superblock Metablock* log chain => log block => log block... Inode table node* Inode table leaf* Inode* Inode attribute*! Data btree attribute Data btree node* Data btree leaf* Extent*! To be sure, one could simplify even further by having just a single btree that includes file data in a single unified key space as Hammer does. But the btree of btrees arrangement maps better to Linux's cache design: we cache inodes, so file operations normally deal with a very shallow btree referenced by a cached inode instead of a btree deep enough to map every object on an entire volume. We are also able ...
