[Tux3] Feature interaction between multiple volumes and atomic update

Previous thread: [Tux3] [PATCH] Fix maximum file size boundary checking. I added a test which attempts to by Shapor Naghibzadeh on Thursday, August 28, 2008 - 12:29 am. (10 messages)

Next thread: Re: [Tux3] Feature interaction between multiple volumes and atomic update by Matthew Dillon on Friday, August 29, 2008 - 8:31 pm. (6 messages)
From: Daniel Phillips
Date: Thursday, December 11, 2008 - 4:16 am

The attached patch demonstrates deferred inode creation, based on
yesterday's split-up of ext2_new_inode into a front end part and a back
end part named ext2_assign_ino.  Inode assignment is driven from the
directory flush routine, because inode allocation wants to know the
containing directory in order to guide its choice of inode number.  

Actually, if a new inode is linked more than once before a flush, we
don't really know which directory it will be allocated into.  This is a
little different from the current Ext2 behavior, which allocates it near
the directory it is first linked into.  I doubt this makes much
difference.
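
To make the multiple-link case concrete, here is a minimal userspace
sketch of deferred number assignment.  It is a toy model, not the patch:
every name in it (toy_inode, toy_dir, toy_link, toy_flush_dir,
toy_assign_ino) and the group-based numbering are assumptions made up
for illustration.  An inode number of zero stands for "not yet
assigned", as in the i_ino test in the patch excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: names and the numbering scheme are hypothetical. */
enum { NGROUPS = 4, INOS_PER_GROUP = 8, MAX_LINKS = 4 };

struct toy_inode {
	unsigned long ino;              /* 0 means "not yet assigned" */
};

struct toy_dir {
	unsigned group;                 /* allocation group the directory lives in */
	struct toy_inode *links[MAX_LINKS]; /* entries created since the last flush */
	size_t nlinks;
};

unsigned long next_free[NGROUPS];       /* inode numbers already used per group */

/* Back end: pick an inode number in the flushing directory's group. */
void toy_assign_ino(struct toy_dir *dir, struct toy_inode *inode)
{
	assert(inode->ino == 0);
	assert(dir->group < NGROUPS);
	assert(next_free[dir->group] < INOS_PER_GROUP);
	inode->ino = dir->group * INOS_PER_GROUP + next_free[dir->group] + 1;
	next_free[dir->group]++;
}

/* Front end: record the link now and leave the inode unnumbered. */
void toy_link(struct toy_dir *dir, struct toy_inode *inode)
{
	assert(dir->nlinks < MAX_LINKS);
	dir->links[dir->nlinks++] = inode;
}

/* Flush: number every entry that still lacks an inode number. */
void toy_flush_dir(struct toy_dir *dir)
{
	for (size_t i = 0; i < dir->nlinks; i++)
		if (!dir->links[i]->ino)
			toy_assign_ino(dir, dir->links[i]);
	dir->nlinks = 0;
}
```

If one inode is linked into two directories and the second directory
happens to be flushed first, the inode lands in the second directory's
group.  The first-linked directory no longer decides, which is the
behavior difference described above.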

There is one puzzling glitch having to do with the inode dirty list.
I had to mask off the inode dirty state in ext2_assign_ino before
marking the inode dirty, otherwise the inode was not actually being
placed on the sb->s_dirty list.  This means that somebody had set the
inode dirty flag(s) (plural, because I_DIRTY is actually three dirty bits
ORed together) without placing the inode on the dirty list.  But I ran
out of time trying to find out who that was.  The most likely suspect is
me of course, but nothing obvious jumped out.
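
The symptom can be reproduced in a small userspace model.  This is only
a sketch under assumptions: the TOY_ names are hypothetical, the bit
values are illustrative, and toy_mark_dirty is a simplification of how a
mark-dirty routine might skip the list insertion when it finds a dirty
bit already set.  The real kernel routine is more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: three dirty bits, combined into one mask. */
enum {
	TOY_DIRTY_SYNC     = 1 << 0,
	TOY_DIRTY_DATASYNC = 1 << 1,
	TOY_DIRTY_PAGES    = 1 << 2,
	TOY_DIRTY = TOY_DIRTY_SYNC | TOY_DIRTY_DATASYNC | TOY_DIRTY_PAGES,
};

struct toy_inode {
	unsigned state;         /* dirty bits */
	bool on_dirty_list;     /* stands in for membership of sb->s_dirty */
};

/*
 * Assumed behavior: the inode is queued only on the clean-to-dirty
 * transition.  If any dirty bit is already set, the inode is presumed
 * to be on the list and is not queued again.
 */
void toy_mark_dirty(struct toy_inode *inode, unsigned flags)
{
	bool was_dirty = (inode->state & TOY_DIRTY) != 0;

	inode->state |= flags;
	if (!was_dirty)
		inode->on_dirty_list = true;
}

/* The workaround from the text: clear the dirty bits, then mark dirty. */
void toy_assign_ino_workaround(struct toy_inode *inode)
{
	inode->state &= ~(unsigned)TOY_DIRTY;
	toy_mark_dirty(inode, TOY_DIRTY_SYNC);
}
```

In this model, an inode that has a dirty bit set but was never queued
stays off the list no matter how often it is marked dirty again.
Masking the bits first forces the clean-to-dirty transition, so the
inode is queued.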

My locking here is extremely suspect:

    +int ext2_flush_dir(struct dentry *dir)
    ...
    +                       if (!dentry->d_inode->i_ino) {
    +                               show_inode("assign ino", dentry->d_inode);
    +                               spin_unlock(&dentry->d_lock); // this is probably wrong
    +                               spin_unlock(&dcache_lock);
    +                               ext2_assign_ino(dir->d_inode, dentry->d_inode);
    +                               spin_lock(&dcache_lock);
    +                               spin_lock(&dentry->d_lock);
    +                       }

What are these locks protecting again?  Why is it ok to drop them and
retake them here?  (My theory is that the directory i_mutex is doing
the protecting, but if so, what is ...
From: Daniel Phillips
Date: Friday, August 29, 2008 - 5:41 pm

It turns out that multiple independent volumes sharing the same
allocation space is a feature that does not quite come for free as I
had earlier claimed.  The issue is this:

 * Tux3 guarantees that when fsync (or another filesystem sync) returns,
   the entire volume including all subvolumes is in a consistent
   state.  In particular, any block in use by the subvolume being
   synced is persistently recorded as in use, and no block that is not
   in use by (the persistent image of) any subvolume is recorded as in
   use.

 * It is desirable that an fsync apply only to the subvolume being
   synced, even if other subvolumes are mounted and in use at the same
   time.  Otherwise, syncing a given subvolume would require time
   proportional to the number of subvolumes simultaneously mounted,
   which would be a regression compared to having the volumes actually
   separate.  Since the multiple subvolume feature has a marginal use
   case anyway, such a drawback would verge on being fatal for this
   feature.

 * Therefore it seems logical that Tux3 should have a separate forward
   log for each subvolume to allow independent syncing of subvolumes.
   But global allocation state must always be consistent regardless of
   the order in which subvolumes are synced.

 * We do not want to have a separate log dedicated to block allocation
   because that would require updating two logs in many cases where
   only one log update would otherwise be required.

 * An unexpected interruption may occur when any combination of
   subvolumes is mounted and active.  But on restart, nothing requires
   that the same set of subvolumes be remounted.

 * If a subvolume is not mounted, then it is not desirable for Tux3 to
   recreate the cache state of that subvolume.  Recreating cache state
   is fundamental to the Tux3 integrity recovery design.  In other words,
   we do not want to replay the log into cache for every subvolume that
   was mounted at the time of a crash.
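
The guarantee in the first point above can be written down as a check
over a toy model.  This is a sketch under assumptions, not Tux3 code:
the volume is shrunk to 32 blocks, each persistent subvolume image and
the shared allocation map are plain bitmasks, and all names
(blockmap_t and the toy_ functions) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: bit i set means block i is in use (32-block volume). */
typedef uint32_t blockmap_t;

/* Every block used by the synced subvolume is recorded as in use. */
bool toy_synced_blocks_recorded(blockmap_t allocated, blockmap_t synced)
{
	return (synced & ~allocated) == 0;
}

/* No block is recorded as in use unless some subvolume image uses it. */
bool toy_no_orphan_allocations(blockmap_t allocated,
			       const blockmap_t *subvols, size_t count)
{
	blockmap_t in_use = 0;

	for (size_t i = 0; i < count; i++)
		in_use |= subvols[i];
	return (allocated & ~in_use) == 0;
}

/* Both halves of the guarantee, for a sync of subvols[synced_index]. */
bool toy_sync_guarantee_holds(blockmap_t allocated,
			      const blockmap_t *subvols, size_t count,
			      size_t synced_index)
{
	assert(synced_index < count);
	return toy_synced_blocks_recorded(allocated, subvols[synced_index]) &&
	       toy_no_orphan_allocations(allocated, subvols, count);
}
```

Note that the first half only constrains the subvolume being synced,
while the second half ranges over every subvolume.  That asymmetry is
why the shared allocation state has to stay consistent no matter which
subvolume is synced first.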

So what do we do? ...