The attached patch demonstrates deferred inode creation, based on
yesterday's split-up of ext2_new_inode into a front end part and a back
end part named ext2_assign_ino. Inode assignment is driven from the
directory flush routine, because inode allocation wants to know the
containing directory in order to guide its choice of inode number.
Actually, if a new inode is linked more than once before a flush, we
don't really know which directory it will be allocated near: it depends
on which directory gets flushed first. This is a little different from
the current Ext2 behavior, which allocates it near the directory it is
first linked into. I doubt this makes much difference.
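
To make the flush-driven assignment concrete, here is a toy user-space
model. It is not the patch: the toy_* names and the per-directory
next_ino placement hint are made up for illustration, and the only thing
carried over from the patch is the convention that an inode number of
zero means "not assigned yet".

```c
#include <stddef.h>

/* Toy stand-ins for illustration; not the kernel's inode/dentry types. */
struct toy_inode {
	unsigned long ino;		/* 0 means no number assigned yet */
};

struct toy_dentry {
	struct toy_inode *inode;
	struct toy_dentry *next;	/* next child of the same directory */
};

struct toy_dir {
	unsigned long next_ino;		/* hypothetical "allocate near here" hint */
	struct toy_dentry *children;
};

/* Back end: choose a number using the containing directory as the hint. */
static void toy_assign_ino(struct toy_dir *dir, struct toy_inode *inode)
{
	inode->ino = dir->next_ino++;
}

/* Flush: give a number to every child whose assignment was deferred. */
static int toy_flush_dir(struct toy_dir *dir)
{
	int assigned = 0;

	for (struct toy_dentry *d = dir->children; d; d = d->next) {
		if (d->inode && !d->inode->ino) {
			toy_assign_ino(dir, d->inode);
			assigned++;
		}
	}
	return assigned;
}
```

If one new inode is linked into two toy directories and the second one
happens to be flushed first, the inode takes its number from the second
directory's range, and the later flush of the first directory finds
nothing left to assign.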

There is one puzzling glitch having to do with the inode dirty list.
I had to mask off the inode dirty state in ext2_assign_ino before
marking the inode dirty, otherwise the inode was not actually being
placed on the sb->s_dirty list. That means somebody had set the inode
dirty flag(s) (plural, because I_DIRTY is actually 3 dirty bits ORed
together) without placing the inode on the dirty list. But I ran out of
time trying to find out who that was. The most likely suspect is me of
course, but nothing obvious jumped out.
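
For anyone who wants to see why a stale dirty bit would have this
effect, here is a toy user-space model. It is only loosely patterned on
my understanding of the early-exit and clean-to-dirty checks in
__mark_inode_dirty, so treat it as an assumption rather than the real
code. The TOY_* flags stand in for the three bits that make up I_DIRTY,
and on_dirty_list stands in for membership of sb->s_dirty.

```c
#include <stdbool.h>

/* Stand-ins for the three bits that are ORed together to form I_DIRTY. */
#define TOY_DIRTY_SYNC		1u
#define TOY_DIRTY_DATASYNC	2u
#define TOY_DIRTY_PAGES		4u
#define TOY_DIRTY (TOY_DIRTY_SYNC | TOY_DIRTY_DATASYNC | TOY_DIRTY_PAGES)

struct toy_inode {
	unsigned state;
	bool on_dirty_list;	/* models being on sb->s_dirty */
};

static void toy_mark_dirty(struct toy_inode *inode, unsigned flags)
{
	/* Every requested bit is already set: assume it is already listed. */
	if ((inode->state & flags) == flags)
		return;

	bool was_dirty = inode->state & TOY_DIRTY;

	inode->state |= flags;

	/* Only a clean -> dirty transition puts the inode on the list. */
	if (!was_dirty)
		inode->on_dirty_list = true;
}

/* The workaround described above: clear any stale dirty bits first. */
static void toy_mark_dirty_masked(struct toy_inode *inode, unsigned flags)
{
	inode->state &= ~TOY_DIRTY;
	toy_mark_dirty(inode, flags);
}
```

In this model an inode that somehow has a dirty bit set but is not on
the list never gets listed by toy_mark_dirty, whichever bit is
requested, while toy_mark_dirty_masked does list it. That matches the
symptom, but says nothing about who set the bit in the first place.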

My locking here is extremely suspect:

+int ext2_flush_dir(struct dentry *dir)
...
+	if (!dentry->d_inode->i_ino) {
+		show_inode("assign ino", dentry->d_inode);
+		spin_unlock(&dentry->d_lock); // this is probably wrong
+		spin_unlock(&dcache_lock);
+		ext2_assign_ino(dir->d_inode, dentry->d_inode);
+		spin_lock(&dcache_lock);
+		spin_lock(&dentry->d_lock);
+	}

What are these locks protecting again? Why is it ok to drop them and
retake them here? (My theory is that the directory i_mutex is doing
the protecting, but if so, what is ...

It turns out that multiple independent volumes sharing the same
allocation space is a feature that does not quite come for free as I had
earlier claimed. The issue is this:

 * Tux3 guarantees that when fsync (or other filesystem sync) returns
   then the entire volume including all subvolumes is in a consistent
   state. In particular, any block in use by the subvolume being synced
   is persistently recorded as in use, and no block that is not in use
   by (the persistent image of) any subvolume is recorded as in use.

 * It is desirable that a fsync apply only to the subvolume being
   synced, even if other subvolumes are mounted and in use at the same
   time. Otherwise, syncing a given subvolume would require time
   proportional to the number of subvolumes simultaneously mounted,
   which would be a regression compared to having the volumes actually
   separate. Since the multiple subvolume feature has a marginal use
   case anyway, such a drawback would verge on being fatal for this
   feature.

 * Therefore it seems logical that Tux3 should have a separate forward
   log for each subvolume to allow independent syncing of subvolumes.
   But global allocation state must always be consistent regardless of
   the order in which subvolumes are synced.

 * We do not want to have a separate log dedicated to block allocation
   because that would require updating two logs in many cases where
   only one log update would otherwise be required.

 * An unexpected interruption may occur when any combination of
   subvolumes is mounted and active. But on restart, nothing requires
   that the same set of subvolumes be remounted.

 * If a subvolume is not mounted, then it is not desirable for Tux3 to
   recreate the cache state of that subvolume. Recreating cache state
   is fundamental to the Tux3 integrity recovery design. In other
   words, we do not want to replay the log into cache for every
   subvolume that was mounted at the time of a crash.

So what do we do? ...
