Re: [Tux3] Deferred delete prototype for Ext2

Previous thread: [Tux3] Page cache emulation now up and running by Daniel Phillips on Sunday, August 10, 2008 - 4:14 am. (7 messages)

Next thread: [Tux3] All about the free tree by Daniel Phillips on Wednesday, August 13, 2008 - 2:43 pm. (4 messages)
From: Daniel Phillips
Date: Monday, August 11, 2008 - 10:40 pm

It is about time to take a step back and describe what I have been
implementing.  A tux3 filesystem is a heirarchical structure with a
simple form:

Superblock
   Free map (btree)
      Free btree index*
         Bitmap block or extents*
   Volume table (radix tree)
      Volume*
         Metablock
            Log chain*
         Version table (radix tree)
            Version index block*
               Version*
         Atime table (btree)
            Atime index block*
               Atime block*!
         Inode table (btree)
            Inode*
               Inode attribute*!
               Data attribute* (btree)
                  Data btree index*
                     Data tree leaf*
                        Extent*!

Versioned elements are marked by !, repeated elements by *.

There is a single free map from which extents for all other objects are
allocated.

The volume table is a new addition not central to the goals of Tux3,
but a nice feature to have given that it comes nearly for free.  One
Tux3 volume can have an arbitrary number of separate filesystems tucked
inside it, indexed by a simple integer parameter at mount time.  People
say they like this idea and it imposes no significant complexity, so it
goes in.  I am not sure I would ever use it, personally.  I like my
volumes to be nice separate pieces of disk, but I suppose that is just
me being old fashioned.

Each volume has a metablock pointing at the forward log chain for the
volume, a version table that describes the heirarchical relationship
between versions (snapshots), an atime table to take care of that
horrid legacy Unix feature, and an inode table containing files and
attributes of files.  The atime tables, version tables and atom tables
could possibly just be types of files.  I have not gotten there quite
yet, so let's see what works out best in practice.

Versioning takes place in three places, versioned pointers in the atime
btree, versioned extents in a file data btree and versioned ...
From: Daniel Phillips
Date: Tuesday, December 2, 2008 - 9:16 pm

I have talked about the benefits of deferring namespace operations such
as file create and unlink.  For Tux3, the main motivation is to remove
the need to share write access to inode table blocks between front end
filesystem operations and back end cache flushing.  There are other
good reasons for doing it too:

  * Take long-running filesystem operations outside the i_mutex
    directory lock, which is heavily contented in some situations

  * Maybe there are opportunities to batch up edits to directory
    blocks and perform them more efficiently

  * Can sometimes handle a create-delete sequence entirely in cache

Anyway, here is a very crude patch that just defers unlinks of regular
files under Ext2.  It relies on receiving an "fsync" to the directory to
actually remove the name from the directory entry block.  I wrote a
little fsync utility for doing that:

#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>

int main(int argc, char *argv[])
{
        if ((fsync(open(argv[1], O_RDONLY))) == -1) {
                printf("failed to fsync %s: %s (%i)\n", argv[1], strerror(errno), errno);
        }
        return 0;
}

Hirofumi showed how it can be done with dd too, a nice trick.  Anyway,
after a rm, the fsync has to be done by hand or the on-disk unlink will
never happen.  But if you ls the directory, it will seem to be gone and
it cannot be opened.  To be precise, the dentry for the file has been
turned into a "negative dentry" that says "this file does not exist",
and the negative dentry is locked into cache so that the filesystem
unlink can be done later.  Of course, Ext3 will keep track of
directories that have deferred unlinked dentries, and automatically do
the fsync.

Besides the dentry, the cached inode has to be freed if its link count
has reached zero.  We want to defer the inode free also, it is the
inode delete that causes the Tux3 layering problem described ...
From: Benjamin K. Stuhl
Date: Tuesday, December 2, 2008 - 10:14 pm

Hi Daniel,
  While I'm certainly not qualified to comment on the _content_ of the patch, I 
have been watching Linux filesystem flamewars for a while... I _strongly_ 
suggest that you write to linux-fsdevel@vger about these VFS changes, and 
sooner rather than later. (It probably would be a good idea to directly CC at 
least Al Viro and Christoph Hellwig; Ted Ts'o, Christoph Lameter, Linus, and 
Andrew Morton are also good candidates.) Tux3 looks very promising, and I 
would hate to see it get tied up in avoidable VFS battles when it's otherwise 
ready for merging.

Thanks,
-- BKS

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
From: Daniel Phillips
Date: Wednesday, December 3, 2008 - 12:34 am

It is planned that Tux3 will not depend on any core kernel changes.
There will be a patch for deferred namespace operations, and if the
patch is not applied, then Tux3 will use the alternative mechanism
we have discussed, which will be less efficient but functional.  So
there is no need for any flamewar: we work with or without the patch.

So I would prefer that the concept be functional, without any really
stupid oversights such as the one Hirofumi just pointed out to me,
before placing the idea within flaming distance of the usual suspects.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
From: Daniel Phillips
Date: Wednesday, January 7, 2009 - 12:43 am

Tux3 has grown simpler as it has aged.  Back in august when Tux3 was two 
weeks old, I described what seemed to be a fairly simple filesystem 
structure:

   http://kerneltrap.org/Linux/Tux3_Hierarchical_Structure

It turned out that a lot of fluff could still be cut.  The volume table 
was dropped for reasons I will get into below, and two internal 
structures, the version table and allocation bitmap, were recast as 
normal files.  We now have:

Superblock
   Inode table node*
      Inode table leaf*
         Inode*
            Inode attribute*!
            Data btree attribute
               Data btree node*
                  Data btree leaf*
                     Extent*!

Versioned elements are marked by !, repeated elements by *.

Now Tux3 is just a btree of btrees, an inode table btree from which file 
btrees descend.  Here it is in pictures, courtesy of Hirofumi Ogawa's 
brilliant graphical Tux3 structure dumper:

   http://tux3.org/images/tux3.structure.png
   http://tux3.org/images/tux3.structure.svg

The only structural elaborations still due to arrive are the log chain 
for atomic commit and "metablocks" into which frequently updated 
superblock fields will be moved, giving a form that is likely to serve 
well for a long time:

Superblock
   Metablock*
      log chain => log block => log block...
      Inode table node*
         Inode table leaf*
            Inode*
               Inode attribute*!
               Data btree attribute
                  Data btree node*
                     Data btree leaf*
                        Extent*!

To be sure, one could simplify even further by having just a single 
btree that includes file data in a single unified key space as Hammer 
does.  But the btree of btrees arrangement maps better to Linux's cache 
design: we cache inodes, so file operations normally deal with a very 
shallow btree referenced by a cached inode instead of a btree deep 
enough to map every object on an entire volume.  We are also able ...
Previous thread: [Tux3] Page cache emulation now up and running by Daniel Phillips on Sunday, August 10, 2008 - 4:14 am. (7 messages)

Next thread: [Tux3] All about the free tree by Daniel Phillips on Wednesday, August 13, 2008 - 2:43 pm. (4 messages)