Re: [Tux3] Comparison to Hammer fs design

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: <kernel@...>
Date: Tuesday, August 5, 2008 - 4:21 am

On Sunday 27 July 2008 20:46, Matthew Dillon wrote:

The reason for my unclear crash recovery algorithm is, it was not fully
designed when we started chatting, only the basic principles of robust
micro-transaction commit.  You forced me to go work out the details of
the high level transaction architecture.

The big points you got me focussed on were, what about mass commits a la
fsync, and how do you sort out the ties that arise between otherwise
independent high level operations, and how do you preserve the ordering
of low level operations to reflect the ordering constraints of the high
level operations?  Which is what you were trying to say with your create
a, create a/b example, I now realize.


I am well down path to identifying a similar frontend/backend
separation in Tux3.  The back end data consists of all the dirty
metadata and file data cache blocks (buffer cache and page cache(s)
respectively) and the "phase" boundaries, which separate the results of
successive large groups of high level operations so that there are no
disk-image dependencies between them.

I now have two nice tools for achieving such a clean separation between
transaction phases:

  1) Logical log records

  2) Dirty cache block cloning

The latter takes advantage of logical log records to limit the effect of
a high level operation to just the leaf blocks affected.  Where two
operations that are supposed to be in two different transaction phases
affect the same leaf block(s) then the block(s) will be cloned to
achieve proper physical separation.  So I can always force a clean
boundary between phases without generating a lot of dirty blocks to do
it, an important property when it comes to anti-deadlock memory
provisions.

The payoff for a clean frontend/backend separation is, as you say, a
clean design.  The art in this is to accomplish the separation without
any throughput compromise, which I am willing to believe you have done
in Hammer, and which I now wish to achieve by a somewhat different
means in Tux3.


I need a means of remembering which blocks are members of which phases,
and an rbtree may be indeed better than the list I was thinking of.  A
hash table is probably ok too.  I also have to remember the logical
update records that were staged when the high level operation took place
in order to avoid having the btree leaf changes climb up into the index
nodes copy on write style, and to avoid constant changes to the free map
btree.

I can coopt some bits of the buffer flags to keep track of which phase a
given dirty buffer belongs to.  To avoid having to attach buffers to
every dirty page cache page, I might use some page flags in that case. 
Or that state can be in a secondary bookkeeping structure like the
rbtree or hash.  Details to be settled before the kernel coding starts.


Roger, same here.  The inode locks and things like rename lock are only
held during the course of the high level operation.  Tux3 does not have
to worry about that at all, the VFS takes care of it nicely.  Tux3 only
has to lock the data structures that keep track of the phases
(spinlocks) and the btrees embedded in the buffer and page cache
(possibly hashed mutexes).

Note: when I write VFS I mean "virtual filesystem switch" and when you
write VFS you mean "virtual filesystem", which in Linux is just called
a filesystem.


Here, Tux3 (and Hammer) can always set up a very long forward log chain,
only updating the header once per multi-transaction commit phase.


It is beautiful, now that I understand it.


Interlock at the micro-transaction level really sucks.  There are way
too many interlocks.  It is much nicer just to have the phase
boundaries, and not care about the dependencies between micro
transactions withing the transaction phase, knowing that they will all
be satisfied because the entire phase will be committed atomically.

Now here is another way of looking at the difference between undo and
redo.  Each of my redo log entries leaves the filesystem meta-metadata
in a consistent state in the sense that all btree structure will be
consistent.  But the logical metadata may be inconsistent, for example,
a directory may have appear to have been deleted on disk while some
deletions of files it contained were not persistently recorded.  I think
one can take advantage of such partial, low level consistency in order
to do a rapid scan and repair, preserving as much as possible of what
the filesystem was doing before the crash of a busy system with lots of
transactions in flight.


Yes, this was the point that finally made me understand how your undo
approach works.  Sorry for making you explain it so many times ;-)


I presume that if one userland process is blocked on fsync, the others
are still able to continue dumping their operations into cache?


There is no need for multiple flushing threads in Tux3, I will use bio
submissions instead, which are completely asynchronous, returning
completion status in interrupt mode as they do.


It's a thought.  But anyway, I don't see a big downside to the PHTree
approach.  It might even end up faster than HTree (which does not
maintain dirent stability) by virtue of the slightly smaller (5-10%)
cache footprint.  If classic HTree turns out to be faster, I could offer
both as an option.  The code that differs between them should be about
10-20%.


You are right about the cookie 64 bitness, I don't know where I picked
up that lore, mea culpa.


I overlooked the part where you mentioned giving some lexical ordering
to your hash to get better locality between lexically related records. 
I am not sure this is a win, it will require a lot of analysis, which
starts to drift off the critical path for me, since crystallizing the
transaction design is really a lot more important.

An inverse idea to making the hash nonrandom is to make the goal inum
depend somewhat on the name hash for a newly created file.  So
relatively fewer PHTree index leaf nodes, which contain [hash, dirblock]
pairs, will be have to be reloaded into cache during deletion of a
massive directory in physical dirent order.  The goal inum has to depend
mostly on the physical dirent location though, to avoid thrashing the
inode table during mass creation, deletion or stat.  I am not sure
whether there is a gain to be had vs keeping the random hash completely
random, which does deliver a benefit in balancing the tree better.


Yes, the common case of replication is only against the "current"
version and that optimization will work well there.  There might be
multiple current versions in a Tux3 volume,

In the case of many virtual servers running off the same filesystem
volume, each with its own writable snapshot, and each replicating
thatsnapshot somewhere, I need to be able to track dirtiness in multiple
versions simultaneously.  Or just not worry to much about that and sweep
through the entire inode table btree.  Multiple dirty block streams
could be output in one pass in this case.


It flushes itself out when it is not pinned by something like ramfs for
example, which has no backing store.  The VFS takes care of the
create/delete dependencies.


That sounds a lot like the <fs>_get_block interface the Linux filesystem
block IO library uses to map buffers to physical addresses.  That is just
a set of convenience functions the filesystem can use to satisfy such
vmm calls into the filesystem as ->writepage.  The filesystem just has
to supply its own _get_block method.  Between that and mount options
parsing, there is not a whole lot more you have to do to bring up a
filesystem on Linux, that is, if you are not really worried about
performance.  Mapping each block with a separate call into the
filesystem actually sucks pretty badly CPU wise, but this is masked by
slow spindle speeds that stretch the CPU out over time.  That masking
disappears with some of the high performance storage hardware coming
out now and the wasteful effects of that interface start to show.


vm_page_array... is that like a Linux page cache?  (aka mapping, aka
address_space)  Buffers on page cache pages do not have to be
page-aligned in Linux, or rather, each page in Linux is either fully
covered with buffers or has none at all.  What sucks about all this is
having buffers _and_ page descriptors, where the two perform very
similar functions and duplicate a lot of state, causing endless bugs
due to skew in the cached state between the two kinds of objects.


That is exactly the idea.  The question is whether to have a dedicated
btree or to use a file which is also a btree.  The orphan file (if that
turns out to be the mechanism) would implement a simple linked list of
orphans, and the transitory log entries would reduce the frequency the
orphan file has to be updated, allowing it to be updated in batches.

This post required a week to respond to because I thought I should also
post some code in parallel.

Regards,

Daniel
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: [Tux3] Comparison to Hammer fs design, Matthew Dillon, (Sun Jul 27, 11:46 pm)
Re: [Tux3] Comparison to Hammer fs design, Daniel Phillips, (Tue Aug 5, 4:21 am)