Comparing HAMMER And Tux3

Submitted by Jeremy
on August 7, 2008 - 11:25am

"The big advantage Hammer has over Tux3 is, it is up and running and released in the Dragonfly distro," began Daniel Phillips, offering a comparison between the two filesystem. He continued, "the biggest disadvantage is, it runs on BSD, not Linux, and it so heavily implements functionality that is provided by the VFS and block layer in Linux that a port would be far from trivial. It will likely happen eventually, but probably in about the same timeframe that we can get Tux3 up and stable." This led into a lengthy and interesting technical discussion between Daniel and HAMMER author Matthew Dillon, comparing the design of the two filesystems.

Matthew reviewed the Tux3 notes and replied, "it sounds like Tux3 is using many similar ideas [as HAMMER]. I think you are on the right track. I will add one big note of caution, drawing from my experience implementing HAMMER, because I think you are going to hit a lot of the same issues. I spent 9 months designing HAMMER and 9 months implementing it. During the course of implementing it I wound up throwing away probably 80% of the original design outright." Daniel noted that he's been working on the Tux3 design for around ten years, "and working seriously on the simplifying elements for the last three years or so, either entirely on paper or in related work like ddsnap and LVM3." Matthew cautioned, "I can tell you've been thinking about Tux for a long time. If I had one worry about your proposed implementation it would be in the area of algorithmic complexity. You have to deal with the in-memory cache, the log, the B-Tree, plus secondary indexing for snapshotted elements and a ton of special cases all over the place. Your general lookup code is going to be very, very complex. My original design for HAMMER was a lot more complex (if you can believe it!) then the end result. A good chunk of what I had to do going from concept to reality was deflate a lot of that complexity." The friendly conversation offers a very detailed look at the design choices made in each of these file systems.


From: Pedro F. Giffuni <giffunip@...>
Subject: [Tux3] Comparison to Hammer fs design
Date: Jul 25, 12:22 pm 2008

Hi;

The announcement of yet another filesystem:

http://lkml.org/lkml/2008/7/23/257

led to some comments about hammer fs:

http://tux3.org/pipermail/tux3/2008-July/000006.html

enjoy,

Pedro.

From: Daniel Phillips
Subject: [Tux3] Comparison to Hammer fs design
Date: Thu Jul 24 13:26:27 PDT 2008

I read Matt Dillon's Hammer filesystem design with interest:

   http://apollo.backplane.com/DFlyMisc/hammer01.pdf

Link kindly provided by pgquiles.  The big advantage Hammer has over 
Tux3 is, it is up and running and released in the Dragonfly distro.  
The biggest disadvantage is, it runs on BSD, not Linux, and it so 
heavily implements functionality that is provided by the VFS and block 
layer in Linux that a port would be far from trivial.  It will likely 
happen eventually, but probably in about the same timeframe that we can 
get Tux3 up and stable.

Tux3 is a simpler design than Hammer as far as I can see, and stays 
closer to our traditional ideas of how a filesystem behaves, for 
example there is no requirement for a background process to be 
continuously running through the filesystem reblocking it to recover 
space.  Though Tux3 does have the notion of followup metadata passes 
to "promote" logical forward log changes to physical changes to btree 
nodes etc.  However this does not have to be a daemon, it can just be 
something that happens every so many write transactions in the process 
context that did the write.  Avoiding daemons in filesystems is good - 
each one needs special attention to avoid deadlock, and they mess up 
the the ps list, a minor but esthetic consideration.

Matt hit on a similar idea to versioned pointers, that is, his birth and 
death version numbers for disk records.  So we both saw independently 
that recursive copy on write as in WAFL, ZFS and Btrfs is suboptimal.

I found that only the birth version is actually required, simply because 
file data elements never actually die, they is only ever overwritten or 
truncated away.  Therefore, a subsequent birth always implies the death 
of the previous data element, and only the birth version has to be 
stored, which I simply call the "version".  Data element death by 
truncate is handled by the birth of a new (versioned) size attribute.

Eventually Matt should realize that too, and rev Hammer to improve its 
storage efficiency.  Oddly, Hammer only seems to support a linear chain 
of versions, whereas I have shown that with no increase in the size of 
metadata (except for the once-per-volume version tree) you can store 
writable versions with arbitrary parentage.  I think Matt should take 
note of that too and incorporate it in Hammer.

Some aspects of the Hammer seem quite inefficient, so I wonder what he 
means when he says it performs really well.  In comparison to what?  
Well I don't have a lot more to say about that until Tux3 gets to the 
benchmark stage, and they we will be benchmarking mainly against Ext3, 
XFS and Btrfs.

Matt seems somewhat cavalier about running out of space on small 
volumes, whereas I think a filesystem should scale all the way from a 
handful of meg to at least terabytes and preferably petabytes.  The 
heavy use of a vacuum like reblocking process seems undesirable to me.  
I like my disk lights to go out as soon as the data is safely on the 
platter, not continue flashing for minutes or hours after every period 
of activity.  Admittedly, I did contemplate something similar for 
ddsnap, to improve write efficiency.  I now think that fragmentation 
can be held down to a dull roar without relying on a defragger, and 
that defragging should only be triggered at planned times by an 
administrator.  We will see what happens in practice.

Tux3 has a much better btree fanout than Hammer, 256 vs 64 for Hammer 
using the same size 4K btree index blocks.  Fanout is an important 
determinant of the K in O(log(N)) btree performance, which turns out to 
be very important when comparing different filesystems, all of which 
theoretically are Log(N), but some of which have an inconveniently 
large K (ZFS comes to mind).  I always try to make the fanout as high 
as possible in my btree, which for example is a major reason that the 
HTree index for Ext3 performs so well.

Actually, I think I can boost the Tux3 inode table btree fanout up to 
512 by having a slightly different format for the next-to-terminal 
inode table index blocks with 16 bits inum, 48 bits leaf block, because 
at the near-terminal index nodes the inum space is already divided down 
to a small range.

More comments about Hammer later as I learn more about it.

Regards,

Daniel

From: Matthew Dillon <dillon@...>
Subject: Re: [Tux3] Comparison to Hammer fs design
Date: Jul 25, 2:53 pm 2008

:Hi;
:
:The announcement of yet another filesystem:
:
:http://lkml.org/lkml/2008/7/23/257
:
:led to some comments about hammer fs:
:
:http://tux3.org/pipermail/tux3/2008-July/000006.html
:
:enjoy,
:
: Pedro.

Those are interesting comments. I think I found Daniel's email address
so I am adding him to the To: Dan, feel free to post this on your Tux
groups if you want.

I did consider multiple-parentage... that is the ability to have a
writable snapshot that 'forks' the filesystem history. It would be
an ultra cool feature to have but I couldn't fit it into the B-Tree
model I was using. Explicit snapshotting would be needed to make it
work, and the snapshot id would have to be made part of the B-Tree key,
which is fine. HAMMER is based around implicit snapshots (being able
to do an as-of historical lookup without having explicitly snapshotted
bits of the history). I would caution against using B-Tree iterations
in historical lookups, B-Trees are only fast if you can pretty much
zero in on the exact element you want as part of the primary search
mechanic. Once you have to iterate or chain performance goes out the
window. I know this because there are still two places where HAMMER
sometimes has to iterate due to the inexact nature of an as-of lookup.
Multiple-parentage almost certainly require two inequality searches,
or one inequality search and an iteration. A single timeline only
requires one inequality search.

I couldn't get away from having a delete_tid (the 'death version
numbers'). I really tried :-) There are many cases where one is only
deleting, rather then overwriting. Both numbers are then needed to
be able to properly present a historical view of the filesystem.
For example, when one is deleting a directory entry a snapshot after
that deletion must show the directory entry gone. Both numbers are
also needed to be able to properly prune out unwanted historical data
from the filesystem. HAMMER's pruning algorithm (cleaning out old
historical data which is no longer desired) creates holes in the
sequence so once you start pruning out unwanted historical data
the delete_tid of a prior record will not match the create_tid of the
following one (historically speaking).

At one point I did collapse the holes, rematching the delete_tid with
the create_tid of the following historical record, but I had to remove
that code because it seriously complicated the mirroring implementation.
I wanted each mirror to be able to have its own historical retention
policy independant of the master. e.g. so you could mirror to a backup
system which would retain a longer and more coarse-grained history then
the production system.

--

I also considered increasing the B-Tree fan-out to 256 but decided
against it because insertions and deletions really bloated up the
UNDO FIFO. Frankly I'm still undecided as to whether that was a good
idea, I would have prefered 256. I can actually compile in 256 by
changing a #define, but I released with 64 because I hit up against
a number of performance issues: bcopy() overheads for insertions
and deletions in certain tests became very apparent. Locking
conflicts became a much more serious issue because I am using whole-node
locks rather then element locks. And, finally, the UNDO records got
really bloated. I would need to adjust the node locking and UNDO
generation code significantly to remove the bottlenecks before I could
go to a 256-element B-Tree node.

HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
another limiting factor for the fan-out I can use. My elements
are 64 bytes each. 64x64 = 4K per B-Tree node. I decided to go
with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key,
64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and
other auxillary info. That's instead of using a radix-compressed key.
Radix compressed keys would have doubled the complexity of the B-Tree
code, particularly with the middle-of-tree pointer caching that
HAMMER does.

The historical access mechanic alone added major complexity to the
B-Tree algorithms. I will note here that B-Tree searches have a very
high cpu overhead no matter how you twist it, and being able to cache
cursors within the B-Tree is absolutely required if you want the
filesystem to perform well. If you always start searches at the root
your cpu overhead will be horrendous... so plan on caching cursors
from the get-go.

If I were to do radix compression I would also want to go with a
fully dynamic element size and fully dynamic fan-out in order to
best-compress each B-Tree node. Definitely food for thought. I'd
love to do something like that. I think radix compression would
remove much of the topological bloat the B-Tree creates verses using
blockmaps, generally speaking.

Space management is currently HAMMER's greatest weakness, but it only
applies to small storage systems. Several things in HAMMER's design
are simply not condusive to a small storage. The storage model is not
fine-grained and (unless you are deleting a lot of stuff) needs
reblocking to actually recover the freed space. The flushing algorithms
need around 100MB of UNDO FIFO space on-media to handle worst-case
dependancies (mainly directory/file visibility issues on crash recovery),
and the front-end caching of modifying operations, since they don't
know exactly how much actual storage will be needed, need ~16MB of
wiggle room to be able to estimate the on-media storage required to
back the operations cached by the front-end. Plus, on top of that,
any sort of reblocking also needs some wiggle room to be able to clean
out partially empty big-blocks and fill in new ones.

On the flip side, the reblocker doesn't just de-fragment the filesystem,
it is also intended to be used for expanding and contracting partitions,
and adding removing volumes in the next release. Definitely a
multi-purpose utility.

So I'm not actually being cavalier about it, its just that I had to
make some major decisions on the algorithm design and I decided to
weight the design more towards performance and large-storage, and
small-storage suffered a bit.

--

In anycase, it sounds like Tux3 is using many similar ideas. I think
you are on the right track. I will add one big note of caution, drawing
from my experience implementing HAMMER, because I think you are going
to hit a lot of the same issues.

I spent 9 months designing HAMMER and 9 months implementing it. During
the course of implementing it I wound up throwing away probably 80% of
the original design outright. Amoung the major components I had to
rewrite were the LOG mechanic (which ultimately became the meta-data
UNDO FIFO), and the fine-grained storage mechanic (which ultimately
became coarse-grained). The UNDO FIFO was actually the saving grace,
once that was implemented most of the writer-ordering dependancies went
away (devolved into just flushing meta-data buffers after syncing the
UNDO FIFO)... I suddenly had much, much more freedom in designing
the other on-disk structures and algorithms.

What I found while implementing HAMMER was that the on-disk topological
design essentially dictated the extent of HAMMER's feature set AND
most of its deficiencies (such as having to reblock to recover space),
and the algorithms I chose often dictated other restrictions. But the
major restrictions came from the on-disk structures.

Because of the necessarily tight integration between subsystems I found
myself having to do major redesigns during the implementation phase.
Fixing one subsystem created a cascade effect that required tweaking other
subsystems. Even adding new features, such as the mirroring, required
significant changes in the B-Tree deadlock recovery code. I couldn't get
away from it. Ultimately I chose to simplify some of the algorithms
rather then have to go through another 4 months of rewriting. All
major designs are an exercise in making trade-offs in topology, feature
set, algorithmic complexity, debuggability, robustness, etc. The list
goes on forever.

Laters!

-Matt
Matthew Dillon


From: Daniel Phillips <phillips@...> Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 25, 6:54 pm 2008

(resent after subscribing to the the kernel@crater)

On Friday 25 July 2008 11:53, Matthew Dillon wrote:
>
> :Hi;
> :
> :The announcement of yet another filesystem:
> :
> :http://lkml.org/lkml/2008/7/23/257
> :
> :led to some comments about hammer fs:
> :
> :http://tux3.org/pipermail/tux3/2008-July/000006.html
> :
> :enjoy,
> :
> : Pedro.
>
> Those are interesting comments. I think I found Daniel's email address
> so I am adding him to the To: Dan, feel free to post this on your Tux
> groups if you want.

How about a cross post?

> I did consider multiple-parentage... that is the ability to have a
> writable snapshot that 'forks' the filesystem history. It would be
> an ultra cool feature to have but I couldn't fit it into the B-Tree
> model I was using. Explicit snapshotting would be needed to make it
> work, and the snapshot id would have to be made part of the B-Tree key,
> which is fine. HAMMER is based around implicit snapshots (being able
> to do an as-of historical lookup without having explicitly snapshotted
> bits of the history).

Yes, that is the main difference indeed, essentially "log everything" vs
"commit" style versioning. The main similarity is the lifespan oriented
version control at the btree leaves.

> I would caution against using B-Tree iterations
> in historical lookups, B-Trees are only fast if you can pretty much
> zero in on the exact element you want as part of the primary search
> mechanic. Once you have to iterate or chain performance goes out the
> window. I know this because there are still two places where HAMMER
> sometimes has to iterate due to the inexact nature of an as-of lookup.
> Multiple-parentage almost certainly require two inequality searches,
> or one inequality search and an iteration. A single timeline only
> requires one inequality search.

Once I get down to the leaf level I binary search on logical address in
the case of a file index btree or on version in the case of an inode
table block, so this cost is still Log(N) with a small k. For a
heavily versioned inode or file region this might sometimes result in
an overflow block or two that has to be linearly searched which is not
a big problem so long as it is a rare case, which it really ought to be
for the kinds of filesystem loads I have seen. A common example of a
worst case is /var/log/messages, where the mtime and size are going to
change in pretty much every version, so if you have hourly snapshots
and hold them for three months it adds up to about 2200 16 byte inode
table attributes to record, about 8 inode table leaf blocks. I really
do not see that as a problem. If it goes up to 100 leaf blocks to
search then that could be a problem.

Usually, I will only be interested in the first and last of those
blocks, the first holding the rarely changing attributes including the
root of the file index btree and the last holding the most recent
incarnation of the mtime/size attribute.

The penultimate inode table index block tells me how many blocks a
given inode lives in because several blocks will have the same inum
key. So the lookup algorithm for a massively versioned file becomes:
1) read the first inode table block holding that inum; 2) read the last
block with the same inum. The latter operation only needs to consult
the immediate parent index block, which is locked in the page cache at
that point.

It will take a little care and attention to the inode format,
especially the inode header, to make sure that I can reliably do that
first/last probe optimization for the "head" version, but it does seem
worth the effort. For starters I will just do a mindless linear walk
of an "overflow" inode and get fancy if that turns out to be a problem.

> I couldn't get away from having a delete_tid (the 'death version
> numbers'). I really tried :-) There are many cases where one is only
> deleting, rather then overwriting.

By far the most common case I would think. But check out the versioned
pointer algorithms. Surprisingly that just works, which is not exactly
obvious:

Versioned pointers: a new method of representing snapshots
http://lwn.net/Articles/288896/

I was originally planning to keep all versions of a truncate/rewrite
file in the same file index, but recently I realized that that is dumb
because there will never be any file data shared by a successor version
in that case. So the thing to do is just create an entirely new
versioned file data attribute for each rewrite, bulking up the inode
table entry a little but greatly constraining the search for versions
to delete and reducing cache pressure by not loading unrelated version
data when traversing a file.

> Both numbers are then needed to
> be able to properly present a historical view of the filesystem.
> For example, when one is deleting a directory entry a snapshot after
> that deletion must show the directory entry gone.

...which happens in Tux3 courtesy of the fact that the entire block
containing the dirent will have been versioned, with the new version
showing the entry gone. Here is one of two places where I violate my
vow to avoid copying an entire block when only one data item in it
changes (the other being the atime table). I rely on two things to
make this nice: 1) Most dirent changes will be logically logged and
only rolled up into the versioned file blocks when there are enough to
be reasonably sure that each changed directory block will be hit
numerous times in each rollup episode. (Storing the directory blocks
in dirent-create order as in PHTree makes this very likely for mass
deletes.) 2) When we care about this is usually during a mass delete,
where most or all dirents in each directory file block are removed
before moving on to the next block.

> Both numbers are
> also needed to be able to properly prune out unwanted historical data
> from the filesystem. HAMMER's pruning algorithm (cleaning out old
> historical data which is no longer desired) creates holes in the
> sequence so once you start pruning out unwanted historical data
> the delete_tid of a prior record will not match the create_tid of the
> following one (historically speaking).

Again, check out the versioned pointer algorithms. You can tell what
can be pruned just by consulting the version tree and the create_tids
(birth versions) for a particular logical address. Maybe the hang is
that you do not organize the btrees by logical address (or inum in the
case of the inode table tree). I thought you did but have not read
closely enough to be sure.

> At one point I did collapse the holes, rematching the delete_tid with
> the create_tid of the following historical record, but I had to remove
> that code because it seriously complicated the mirroring implementation.
> I wanted each mirror to be able to have its own historical retention
> policy independant of the master. e.g. so you could mirror to a backup
> system which would retain a longer and more coarse-grained history then
> the production system.

Fair enough. I have an entirely different approach to what you call
mirroring and what I call delta replication. (I reserve the term
mirroring to mean mindless duplication of physical media writes.) This
method proved out well in ddsnap:

http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fc...

(Sorry about the massive URL, you can blame Linus for that;-)

What I do in ddsnap is compute all the blocks that differ between two
versions and apply those to a remote volume already holding the first
of the two versions, yielding a replica of the second version that is
logically but not physically identical. The same idea works for a
versioned filesystem: compute all the leaf data that differs between
two versions, per inum, and apply the resulting delta to the
corresponding inums in the remote replica. The main difference vs a
ddsnap volume delta is that not all of the changes are physical blocks,
there are also changed inode attributes, so the delta stream format
has to be elaborated accordingly.

> I also considered increasing the B-Tree fan-out to 256 but decided
> against it because insertions and deletions really bloated up the
> UNDO FIFO. Frankly I'm still undecided as to whether that was a good
> idea, I would have prefered 256. I can actually compile in 256 by
> changing a #define, but I released with 64 because I hit up against
> a number of performance issues: bcopy() overheads for insertions
> and deletions in certain tests became very apparent. Locking
> conflicts became a much more serious issue because I am using whole-node
> locks rather then element locks. And, finally, the UNDO records got
> really bloated. I would need to adjust the node locking and UNDO
> generation code significantly to remove the bottlenecks before I could
> go to a 256-element B-Tree node.

I intend to log insertions and deletions logically, which keeps each
down to a few bytes until a btree rollup episode comes along to perform
updating of the btree nodes in bulk. I am pretty sure this will work
for you as well, and you might want to check out that forward logging
trick.

That reminds me, I was concerned about the idea of UNDO records vs
REDO. I hope I have this right: you delay acknowledging any write
transaction to the submitter until log commit has progressed beyond the
associated UNDO records. Otherwise, if you acknowledge, crash, and
prune all UNDO changes, then you would end up with applications
believing that they had got things onto stable storage and be wrong
about that. I have no doubt you did the right thing there, but it is
not obvious from your design documentation.

> HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
> another limiting factor for the fan-out I can use. My elements
> are 64 bytes each.

Yes, I mostly have 16 byte elements and am working on getting most of
them down to 12 or 8.

> 64x64 = 4K per B-Tree node. I decided to go
> with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key,
> 64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and
> other auxillary info. That's instead of using a radix-compressed key.
> Radix compressed keys would have doubled the complexity of the B-Tree
> code, particularly with the middle-of-tree pointer caching that
> HAMMER does.

I use two-stage lookup, or four stage if you count searches within
btree blocks. This makes the search depth smaller in the case of small
files, and in the case of really huge files it adds depth exactly as
appropriate. The index blocks end up cached in the page cache (the
buffer cache is just a flavor of page cache in recent Linux) so there
is little repeated descending through the btree indices. Instead, most
of the probing is through the scarily fast radix tree code, which has
been lovingly optimized over the years to avoid cache line misses and
SMP lock contention.

I also received a proposal for a "fat" btree index from a collaborator
(Maciej) that included the file offset but I really did not like the...
fattening. A two level btree yields less metadata overall which in my
estimation is more important than saving some bree probes. The way
things work these days, falling down from level 1 cache to level 2 or
from level 2 to level 3 costs much more than executing some extra CPU
instructions. So optimization strategy inexorably shifts away from
minimizing instructions towards minimizing cache misses.

> The historical access mechanic alone added major complexity to the
> B-Tree algorithms. I will note here that B-Tree searches have a very
> high cpu overhead no matter how you twist it, and being able to cache
> cursors within the B-Tree is absolutely required if you want the
> filesystem to perform well. If you always start searches at the root
> your cpu overhead will be horrendous... so plan on caching cursors
> from the get-go.

Fortunately, we get that for free on Linux, courtesy of the page cache
radix trees :-)

I might eventually add some explicit cursor caching, but various
artists over the years have noticed that it does not make as much
difference as you might think.

> If I were to do radix compression I would also want to go with a
> fully dynamic element size and fully dynamic fan-out in order to
> best-compress each B-Tree node. Definitely food for thought.

Indeed. But Linux is braindamaged about large block size, so there is
very strong motivation to stay within physical page size for the
immediate future. Perhaps if I get around to a certain hack that has
been perenially delayed, that situation will improve:

"Variable sized page objects"
http://lwn.net/Articles/37795/

> I'd love to do something like that. I think radix compression would
> remove much of the topological bloat the B-Tree creates verses using
> blockmaps, generally speaking.

Topological bloat?

> Space management is currently HAMMER's greatest weakness, but it only
> applies to small storage systems. Several things in HAMMER's design
> are simply not condusive to a small storage. The storage model is not
> fine-grained and (unless you are deleting a lot of stuff) needs
> reblocking to actually recover the freed space. The flushing algorithms
> need around 100MB of UNDO FIFO space on-media to handle worst-case
> dependancies (mainly directory/file visibility issues on crash recovery),
> and the front-end caching of modifying operations, since they don't
> know exactly how much actual storage will be needed, need ~16MB of
> wiggle room to be able to estimate the on-media storage required to
> back the operations cached by the front-end. Plus, on top of that,
> any sort of reblocking also needs some wiggle room to be able to clean
> out partially empty big-blocks and fill in new ones.

Ah, that is a very nice benefit of Tux3-style forward logging I forgot
to mention in the original post: transaction size is limited only by
available free space on the volume, because log commit blocks can be
anywhere. Last night I dreamed up a log commit block variant where the
associated transaction data blocks can be physically discontiguous, to
make this assertion come true, shortly after reading Stephen Tweedie's
horror stories about what you have to do if you must work around not
being able to put an entire, consistent transaction onto stable media
in one go:

http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html

Most of the nasties he mentions just vanish if:

a) File data is always committed atomically
b) All commit transactions are full transactions

He did remind me (via time travel from year 2000) of some details I
should write into the design explicitly, for example, logging orphan
inodes that are unlinked while open, so they can be deleted on replay
after a crash. Another nice application of forward logging, which
avoids the seek-happy linked list through the inode table that Ext3
does.

> On the flip side, the reblocker doesn't just de-fragment the filesystem,
> it is also intended to be used for expanding and contracting partitions,
> and adding removing volumes in the next release. Definitely a
> multi-purpose utility.

Good point. I expect Tux3 will eventually have a reblocker (aka
defragger). There are some really nice things you can do, like:

1) Set a new version so logical changes cease for the parent
version.

2) We want to bud off a given directory tree into its own volume,
so start by deleting the subtree in the current version. If
any link counts in the directory tree remain nonzero in the
current version then there are hard links into the subtree, so
fail now and drop the new version.

3) Reblock a given region of the inode table tree and all the files
in it into one physically contiguous region of blocks

4) Add a free tree and other once-per volume structures to the new
home region.

5) The new region is now entirely self contained and even keeps its
version history. At the volume manager level, map it to a new,
sparse volume that just has a superblock in the zeroth extent and
the new, mini filesystem at some higher logical address. Remap
the region in the original volume to empty space and add the
empty space to the free tree.

6) Optionally reblock the newly budded filesystem to the base of the
new volume so utilties that do not not play well with sparse
volumes do not do silly things.

> So I'm not actually being cavalier about it, its just that I had to
> make some major decisions on the algorithm design and I decided to
> weight the design more towards performance and large-storage, and
> small-storage suffered a bit.

Cavalier was a poor choice of words, the post was full of typos as well
so I am not proud of it. You are solving a somewhat different problem
and you have code out now, which is a huge achievement. Still I think
you can iteratively improve your design using some of the techniques I
have stumbled upon. There are probably some nice tricks I can get from
your code base too once I delve into it.

> In anycase, it sounds like Tux3 is using many similar ideas. I think
> you are on the right track.

Thankyou very much. I think you are on the right track too, which you
have a rather concrete way of proving.

> I will add one big note of caution, drawing
> from my experience implementing HAMMER, because I think you are going
> to hit a lot of the same issues.
>
> I spent 9 months designing HAMMER and 9 months implementing it. During
> the course of implementing it I wound up throwing away probably 80% of
> the original design outright. Amoung the major components I had to
> rewrite were the LOG mechanic (which ultimately became the meta-data
> UNDO FIFO), and the fine-grained storage mechanic (which ultimately
> became coarse-grained). The UNDO FIFO was actually the saving grace,
> once that was implemented most of the writer-ordering dependancies went
> away (devolved into just flushing meta-data buffers after syncing the
> UNDO FIFO)... I suddenly had much, much more freedom in designing
> the other on-disk structures and algorithms.
>
> What I found while implementing HAMMER was that the on-disk topological
> design essentially dictated the extent of HAMMER's feature set AND
> most of its deficiencies (such as having to reblock to recover space),
> and the algorithms I chose often dictated other restrictions. But the
> major restrictions came from the on-disk structures.
>
> Because of the necessarily tight integration between subsystems I found
> myself having to do major redesigns during the implementation phase.
> Fixing one subsystem created a cascade effect that required tweaking other
> subsystems. Even adding new features, such as the mirroring, required
> significant changes in the B-Tree deadlock recovery code. I couldn't get
> away from it. Ultimately I chose to simplify some of the algorithms
> rather then have to go through another 4 months of rewriting. All
> major designs are an exercise in making trade-offs in topology, feature
> set, algorithmic complexity, debuggability, robustness, etc. The list
> goes on forever.
>

The big ahas! that eliminated much of the complexity in the Tux3 design
were:

* Forward logging - just say no to incomplete transactions
* Version weaving - just say no to recursive copy on write

Essentially I have been designing Tux3 for ten years now and working
seriously on the simplifying elements for the last three years or so,
either entirely on paper or in related work like ddsnap and LVM3.

One of the nice things about moving on from design to implementation of
Tux3 is that I can now background the LVM3 design process and see what
Tux3 really wants from it. I am determined to match every checkbox
volume management feature of ZFS as efficiently or more efficiently,
without violating the traditional layering between filesystem and
block device, and without making LVM3 a Tux3-private invention.

> Laters!

Hopefully not too much later. I find this dialog very fruitful. I just
wish such dialog would occur more often at the design/development stage
in Linux and other open source work instead of each group obsessively
ignoring all "competing" designs and putting huge energy into chatting
away about the numerous bugs that arise from rushing their design or
ignoring the teachings of history.

Regards,

Daniel


From: Matthew Dillon <dillon@...> Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 27, 5:31 pm 2008

:When a forward commit block is actually written it contains a sequence
:number and a hash of its transaction in order to know whether the
:...
:Note: I am well aware that a debate will ensue about whether there is
:any such thing as "acceptable risk" in relying on a hash to know if a
:commit has completed. This occurred in the case of Graydon Hoare's
:Monotone version control system and continues to this day, but the fact
:is, the cool modern version control systems such as Git and Mercurial
:now rely very successfully on such hashes. Nonetheless, the debate
:will keep going, possibly as FUD from parties who just plain want to
:use some other filesystem for their own reasons. To quell that
:definitively I need a mount option that avoids all such commit risk,
:perhaps by providing modest sized journal areas salted throughout the
:volume whose sole purpose is to record log commit blocks, which then
:are not forward. Only slightly less efficient than forward logging
:and better than journalling, which has to seek far away to the journal
:and has to provide journal space for the biggest possible journal
:transaction as opposed to the most commit blocks needed for the largest
:possible VFS transaction (probably one).

Well, you've described both sides of the debate quite well. I for
one am in the 0-risk camp. ReiserFS got into trouble depending on
a collision space (albeit a small one). I did extensive testing of
64 bit collision spaces when I wrote Diablo (A USENET news system
contemporary with INN, over 10 years ago) and even with a 64 bit space
collisions could occur quite often simply due to the volume of article
traffic.

I think the cost can be reduced to the point where there's no need
to allow any risk at all. After all, we're only talking about meta-data
updates here for the most part. Even under heavy loads it can take
30 seconds for enough dirty meta-data to build up to require a flush.
One can flush the mess, then seek/update the volume header.

Or doing something like using a fixed LOG area, allowing it to be
preformatted... that would remove 100% of the risk and not require
a volume header update (drafting note: edited up from further below).

I know volume headers seem old-school. It was a quick and dirty
solution in HAMMER that I will be changing.

:Actually, the btree node images are kept fully up to date in the page
:cache which is the only way the high level filesystem code accesses
:them. They do not reflect exactly what is on the disk, but they do
:reflect exactly what would be on disk if all the logs were fully
:rolled up ("replay").

Ok, this is simpler. Hmm. That brings up the question with regards
to data allocations and the log. Are you going to use a fixed log
space and allocate data-store separately or is the log a continuously
running pointer across the disk?

My understanding is it is a continuously running space.

:A forward log that carries the edits to some dirty cache block pins
:that dirty block in memory and must be rolled up into a physical log
:before the cache block can be flushed to disk. Fortunately, such a
:rollup requires only a predictable amount of memory: space to load
:enough of the free tree to allocate space for the rollup log, enough
:...

Ok, here I spent about 30 minutes constructing a followup but then
you answered some of the points later on, so what I am going to do
is roll it up into a followup on one of your later points :-)

:One traditional nasty case that becomes really nice with logical
:forward logging is truncate of a gigantic file. We just need to
:commit a logical update like ['resize', inum, 0] then the inode data
:truncate can proceed as convenient. Another is orphan inode handling
:where an open file has been completely unlinked, in which case we
:log the logical change ['free', inum] then proceed with the actual
:delete when the file is closed or when the log is replayed after a
:surprise reboot.
:...
:Logical log replay is not idempotent, so special care has to be taken
:on replay to ensure that specified changes have not already been

Wait, it isn't? I thought it was. I think it has to be because
the related physical B-Tree modifications required can be unbounded,
and because physical B-Tree modifications are occuring in parallel the
related physical operations cause the logical operations to become
bound together, meaning the logical ops *cannot* be independantly
backed out

Here's an example:

rm a/b/c <---- 1TB file
rmdir a/b
rmdir a

There are three problems here.

First, the logical log must be idempotent, or at least mostly
idempotent, because the physical portions of the three separate
logical transactions will complete at different times.

Second, the logical log entry for "rm a/b/c" cannot be destroyed
(due to log cycling based on available free space) until after the
related physical operation has completed, which could occur seconds
to minutes later.

The second issue could become a big issue for you because it means
the log space required will become unbounded or nearly unbounded.
If the remove requires 10's to 100's of megabytes of physical log
entries, all appending to the log, and the physical blocks backing
the log cannot be recovered due to that 'rm a/b/c' needing to remain
in the log until the physical operation is complete, then I think you
have a problem. You would probably need to special-case it.

A third issue effecting the amount of free space required to complete
an operation is going to be the worst case physical log space required
for a single low-level B-Tree transaction, which will roughly be
related to the depth of the B-Tree, multiplied by the number of modifying
B-Tree operations being done in parallel (e.g. from many individual
processes).

:...
:ugly or unreliable when there are lot of stacked changes. Instead I
:introduce the rule that a logical change can only be applied to a known
:good version of the target object, which promise is fullfilled via the
:physical logging layer.
:...
:then apply the logical changes to it there. Where interdependencies
:exist between updates, for example the free tree should be updated to
:reflect a block freed by merging two btree nodes, the entire collection
:of logical and physical changes has to be replayed in topologically
:sorted order, the details of which I have not thought much about other
:than to notice it is always possible.

I came up with a really simple solution to the freemap-reuse issue
which you could apply to allow such replays to be idempotent. You
simply do not allow a freed block to be reallocated until after all
the information related to the logical operation that freed it has
been solidly committed. Thus you can replay the low level frees
without accidently re-freeing a block that has been reallocated.

In HAMMER I delay reuse of freed blocks because UNDO's are only for
meta-data, not file data, and thus freed data blocks cannot be reused
until after the related freeing operations has been completely
committed so crash recovery can do the rollback. You could use them
to make otherwise non-idempotent operations idempotent.

:When replay is completed, we have a number of dirty cache blocks which
:are identical to the unflushed cache blocks at the time of a crash,
:and we have not yet flushed any of those to disk. (I suppose this gets
:interesting and probably does need some paranoid flushing logic in
:replay to handle the bizarre case where a user replays on a smaller
:memory configuration than they crashed on.) The thing is, replay
:returns the filesystem to the logical state it was in when the crash
:happened. This is a detail that journalling filesystem authors tend
:to overlook: actually flushing out the result of the replay is
:pointless and only obscures the essential logic. Think about a crash
:during replay, what good has the flush done?

Well, you have to flush at some point anyway because you only have
so much kernel memory to play with.... You do not have enough kernel
memory to cache the dirty buffers for all the physical activity related
to a particular logical operation (such as truncating a 1TB file).

:I do not see why this example cannot be logically logged in pieces:
:
: ['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"]
: ['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"]
: ['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"]
: ['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"]
: ['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"]
:
:Logical updates on one line are in the same logical commit. Logical
:allocations of blocks to record the possibly split btree leaves and new
:allocations omitted for clarity. The omitted logical updates are
:bounded by the depth of the btrees. To keep things simple, the logical
:log format should be such that it is impossible to overflow one commit
:block with the updates required to represent a single vfs level
:transaction.

Hmm. Maybe. This helps you identify dependancies that must be
replayed, but it might be easier just to make the entire logical
log idempotent and just replay everything. I think the only gotcha
there is handling link counts. That would still be far less complex
then the alternative.

:I suspect there may be some terminology skew re the term "transaction".
:Tux3 understands this as "VFS transaction", which does not include
:trying to make an entire write (2) call atomic for example, but only
:such things as allocate+write for a single page cache page or
:allocate+link for a sys_link call. Fsync(2) is not a transaction but
:a barrier that Tux3 is free to realize via multiple VFS transactions,
:which the Linux VFS now takes care of pretty well now after many years
:of patching up the logic.

I kinda mix them up. I think of them as transactions at different
levels of granularity.

:...
:the resulting blocks are logged logically, but linked into the parent
:btree block using a logical update stored in the commit block of the
:physical log transaction. How cool is that?

That is definitely cool.

:transactions are not just throwaway things, they are the actual new
:data. Only the commit block is discarded, which I suppose will leave
:a lot of one block holes around the volume, but then I do not have to
:require that the commit block be immediately adjacent to the body of
:the transaction, which will allow me to get good value out of such
:holes. On modern rotating media, strictly linear transfers are not
:that much more efficient than several discontiguous transfers that all
:land relatively close to each other.

True enough. Those holes will create significant fragmentation
once you cycle through available space on the media, though.

:> So your crash recovery code will have to handle=20
:> both meta-data undo and completed and partially completed transaction=
:s.
:
:Tux3 does not use undo logging, only redo, so a transaction is complete
:as soon as there is enough information on durable media to replay the
:redo records.

Ok, that works. Be careful w/ your physical logging, it can get
expensive. I think it may be more expensive then you are currently
estimating.

One of the reasons why HAMMER uses an UNDO log is that it is a lot
less expensive then a REDO log. I'll show you why:

Physical operations related to a single flush group in HAMMER:

modify element in B-Tree node A (create UNDO w/ original contents of
B-Tree node A)
modify element in B-Tree node A (no new UNDO needs to be created)
modify element in B-Tree node A (no new UNDO needs to be created)
modify element in B-Tree node A (no new UNDO needs to be created)
modify element in B-Tree node A (no new UNDO needs to be created)

(NOTE: I do realize that the REDO log can be compressed just as the
UNDO one, by recording actual data shifts from B-Tree insertions
and deletions, it gets really complex when you do that, though, and
would not be idempotent).

See the advantage? Many B-Tree operations working on the same node,
or revisiting that node later on, within the same flush group, do not
have to lay down more then one UNDO record for that B-Tree node.

The physical ops are modifying already modified space so I don't
have to record each op in the UNDO log. Upon crash recovery the whole
mess is undone by that first UNDO record.

I just run the UNDOs backwards, which allows me to cache the in-flush
undo areas with a fixed set of small throw-away tracking structures
in system memory. If a structure gets thrown away it just means an
extranious UNDO record will be appended to the UNDO FIFO.

:Yes. Each new commit records the sequence number of the oldest commit
:that should be replayed. So the train lays down track in front of
:itself and pulls it up again when the caboose passes. For now, there
:is just one linear sequence of commits, though I could see elaborating
:that to support efficient clustering.

Yah. I see. Noting that issue I brought up earlier about the
"rm a/b/c". Locking the caboose of your log until all related physical
operations have been completed could create a problem.

:One messy detail: each forward log transaction is written into free
:space wherever physically convenient, but we need to be sure that that
:free space is not allocated for data until log rollup has proceeded
:past that transaction. One way to do this is to make a special check
:against the list of log transactions in flight at the point where
:extent allocation thinks it has discovered a suitable free block, which
:is the way ddsnap currently implements the idea. I am not sure whether
:I am going to stick with that method for Tux3 or just update the disk
:image of the free tree to include the log transaction blocks and
:somehow avoid logging those particular free tree changes to disk. Hmm,
:a choice of two ugly but workable methods, but thankfully neither
:affects the disk image.

You might be able to adapt the delayed-reuse concept I used in HAMMER
to simplify the problem, possibly to the point of making those
operations replayable.

Another thing I did in HAMMER was reserve space on the frontend for
data blocks without modifying the freemap at all. This allows the
frontend to write file data blocks to the physical media without
having to wait for synchronization with meta-data or the backend or
anything.

I keep track of the reserved space with small in-memory structures
(whos function is to prevent other allocations from using the reserved
space until it can be committed). Freemap updates are then committed
in the same flush that commits the related B-Tree insertions and
deletions. Thus the freemap is kept in sync.

:...
:> I also noticed that very large files also wind up with multiple versi=
:ons
:> of the inode, such as when writing out a terrabyte-sized file.=20
:
:Right, when writing the file takes longer than the snapshot interval.

Yah. It isn't as big a deal when explicit snapshots are taken. It
is an issue for HAMMER because the history is more of a continuum.
Effectively the system syncer gives us a 30-60 second granular snapshot
without having to lift a finger (which incidently also makes having
an 'undo' utility feasible).

:So the only real proliferation is the size/mtime attributes, which gets
:back to what I was thinking about providing quick access for the
:"current" version (whatever that means).

st_size is the one thing we can't get away from. For HAMMER,
mtime updates don't roll new inodes and both mtime and atime are
locked to the ctime when accessed via a snapshot (so one can use
a (tar | md5) as a means of validating the integrity of the snapshot).

:Note: one slightly bizarre property of this scheme is that an inode
:can be reused in any version in which its link count is zero, and the
:data blocks referenced by different versions in the same file index
:can be completely unrelated. I doubt there is a use for that.

It might complicate mirroring / clustering-style operations, but
you may not be targetting that sort of thing. HAMMER is intended to
become a cluster filesystem so I use 64 bit inode numbers which are
never reused for the entire life of the filesystem.

:Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to
:beat that compactness. We have learned through bitter experience that
:anything other than an Ext2/UFS style physically stable block of
:dirents makes it difficult to support NFS telldir cookies accurately
:because NFS vs gives us only a 31 bit cookie to work with, and that is
:not enough to store a cursor for, say, a hash order directory
:traversal. This is the main reason that I have decided to go back to
:basics for the Tux3 directory format, PHTree, and make it physically
:stable.

The cookies are 64 bits in DragonFly. I'm not sure why Linux would
still be using 32 bit cookies, file offsets are 64 bits so you
should be able to use 64 bit cookies.

For NFS in DragonFly I use a 64 bit cookie where 32 bits is a hash key
and 32 bits is an iterator to deal with hash collisions. Poof,
problem solved.

:In the PHTree directory format lookups are also trivial: the directory
:btree is keyed by a hash of the name, then each dirent block (typically
:one) that has a name with that hash is searched linearly. Dirent block
:pointer/hash pairs are at the btree leaves. A one million entry
:directory has about 5,000 dirent blocks referenced by about 1000 btree
:leaf blocks, in turn referenced by three btree index blocks (branching
:factor of 511 and 75% fullness). These blocks all tend to end up in
:the page cache for the directory file, so searching seldom references
:the file index btree.

That seems reasonable. I'll look up the PHTree stuff, it sounds
interesting. Personally speaking I don't mind eating 64 bytes per
directory entry. I can accomodate file names up to 16 bytes in
the B-Tree element itself so 95% of the namespace doesn't even
require a look-aside data reference.

The one mistake I made, which I didn't have time to fix in the
first HAMMER release, is that my directory hash is a standard crc
which means that records in the B-Tree aren't even close to being
in any sort of lexical order. I will probably wind up changing
the hash calculation a bit to get some lexical sorting information
stuffed into it... should be doable since the hash keys are 64 bits.

:will pass Hammer in lookup speed for some size of directory because of
:the higher btree fanout. PHTree starts from way behind courtesy of the
:32 cache lines that have to be hit on average for the linear search,
:amounting to more than half the CPU cost of performing a lookup in a
:million element directory, so the crossover point is somewhere up in
:the millions of entries.

Well, the B-Tree fanout isn't actually that big a deal. Remember
HAMMER reblocks B-Tree nodes along with everything else. B-Tree
nodes in a traversal are going to wind up in the same 16K filesystem
block.

That leaves just element space overhead, which frankly is not much
of an issue on a production system because kernels have to cache a lot
more then just the directory entry. The in-memory vnode and inode
structures take up a lot more space then the cached directory entry.
So even though HAMMER's directory entry is a 64-byte B-Tree element
and roughly double the overhead of a UFS-style directory entry, the
overall cache footprint in the system is only ~20% worse.

:..
:>=20
:> Beyond that the B-Tree is organized by inode number and file offset.
:> In the case of a directory inode, the 'offset' is the hash key, so
:> directory entries are organized by hash key (part of the hash key is
:> an iterator to deal with namespace collisions).
:>=20
:> The structure seems to work well for both large and small files, for
:> ls -lR (stat()ing everything in sight) style traversals as well as=20
:> tar-like traversals where the file contents for each file is read.
:
:That is really sweet, provided that you are ok with the big keys. It
:was smart to go with that regular structure, giving you tons of control
:over everything and get the system up and running early, thus beating the
:competition. I think you can iterate from there to compress things and
:be even faster. Though if I were to presume to choose your priorities
:for you, it would be to make the reblocking optional.

Yes, my thoughts exactly. Using a fairly large but extremely flexible
B-Tree element chopped the code complexity down by a factor of 4 at
least. If I hadn't done that HAMMER would have taken another year
to finish.

I plan on expanding efficiency mainly by making the data references
fully extent-based, with each B-Tree element referencing an extent.
In fact, read()s can already handle arbitrary extents but write
clustering and historical access issues forced me to use only two
block sizes for writes (16K and 64K). A HAMMER B-Tree element
is theoretically capable of accessing up to 2G of linear storage.

:...
:> that the B-Tree can seek to directly. In fact, at one point I tried
:> indexing on delete_tid instead of create_tid, but created one hell of=
: a
:> mess in that I had to physically *move* B-Tree elements being deleted
:> instead of simply marking them as deleted.
:
:This is where I do not quite follow you. File contents are never
:really deleted in subsequent versions, they are just replaced. To
:truncate a file, just add or update size attribute for the target
:version and recover any data blocks past the truncate point that
:belongs only to the target version.

Truncation is a deletion. That is, the B-Tree elements have to
be marked as having been deleted so if you truncate-extend a file
you see the new space as being full of zeros.

Take a 1MB file. Truncate it to 512K, then truncate-extend the
file back to 1MB. Or seek-write to extend the file...
same difference, really.

:Dirent deletion is handled by updating the dirent block, possibly
:versioning the entire block. Should the entire block ever be deleted
:by a directory repacking operation (never actually happens on the vast
:majority of Linux systems) that will be handled as a file truncate.
:
:I see that in Hammer, the name is deleted from the btree instead, so
:fair enough, that is actual deletion of an object. But it could be
:handled alternatively by adding a "this object is not here" object,
:which might be enough to get rid of delete_tid entirely and just have
:create_tid.

Hmm. I think the delete_tid is less complex. I'm not really
concerned about saving 8 bytes in the B-Tree element. I don't
want to make it bigger then the 64 bytes it already is, but neither
am I going to add whole new subsystems just to get rid of 8 bytes.

:I plan to scan the whole btree initially, which is what we do in ddsnap
:and it works fine. But by looking at the mtime attributes in the inode
:I can see in which versions the file data was changed and thus not have
:to scan most files. Eventually, building in some accelerator to be able
:to skip most inode leaves as well would be a nice thing to do. I will
:think about that in the background.

Yah, that makes sense. You won't have to scan every last tidbit of
information. HAMMER would have had to if I had not added the
propagating transaction id feature to it.

You are still going to have a scaleability issue but it sounds like
you could implement the exact same optimization and completely solve
the problem.

:> You may want to consider something similar. I think using the
:> forward-log to optimize incremental mirroring operations is also fine
:> as long as you are willing to take the 'hit' of having to scan (though
:...
:
:There is a slight skew in your perception of the function of forward
:logging here, I think. Forward logs transactions are not long-lived
:disk objects, they just serve to batch together lots of little updates
:into a few full block "physical" updates that can be written to the
:disk in seek-optimized order. It still works well for this application.

Yes, I understand now that you've expanded on that topic.

:> The way it works is that HAMMER's frontend is almost entirely
:> disconnected from the backend. All meta-data operations are cached
:> in-memory. create, delete, append, truncate, rename, write...
:> you name it. Nothing the frontend does modifies any meta-data
:> or mata-data buffers.
:
:Things are a little different in Linux. The VFS takes care of most of
:what you describe as your filesystem front end, and in fact, the VFS is
:capable of running as a filesystem entirely on its own, just by
:supplying a few stub methods to bypass the backing store (see ramfs).
:I think this is pretty much what you call your front end, though I
:probably missed some major functionality you provide there that the
:VFS does not directly provide.

Well, for file data, up to a point, and HAMMER does just use the
kernel's buffer cache for file data.

Not for directory entries. I'm pretty sure Linux uses a throw-away
namecache just like we do, so unless it has a mechanic for flush
sequencing of dirty namecache entries you need in-VFS structural
support for caching the operations.

In particular, caching a namespace deletion (rm, rename, etc)
without touching the meta-data requires implementing a merged lookup
so you can cancel-out the directory entry that still exists on-media.
So it isn't entirely trivial.

Things related to file data are handled by the kernel for the most
part, but truncations still need considerable support within the
VFS to properly handle seek-write or truncate file extensions.
In order to avoid touching any meta-data when handling a truncation,
a similar canceling field in the in-memory inode needs to be
maintained until the backend is able to delete the related data
blocks. Lookups have to take into account the cached truncation offset
so as not to read data from the media past the truncation point (e.g.
if you seek-write-extend or truncate-extend the file immediately
after truncating the file).

Most filesystems will dirty meta-data buffers related to the media
storage as part of executing an operation on the frontend. I don't
know of any which have the level of separation that HAMMER has.

:Note: no other filesystem on Linux current works this way. They all
:pretty much rely on the block IO library which implements the fairly
:goofy one-block-at-a-time transfer regimen. The idea of setting up bio
:transfers directly from the filesystem is unfortunately something new.
:We should have been doing this for a long time, because the interface
:is quite elegant, and actually, that is just what the block IO library
:is doing many levels down below. It is time to strip away some levels
:and just do what needs to be done in the way it ought to be done.

Jeeze, BSD's have been doing that forever. That's what VOP_BMAP is
used for. I'm a little surprised that Linux doesn't do that yet.
I'll expand on that down further.

:...
:> the media. The UNDO *is* flushed to the media, so the flush groups
:> can build a lot of UNDO up and flush it as they go if necessary.
:
:Hmm, and I swear I did not write the above before reading this paragraph.
:Many identical ideas there.

Yah, I'm editing my top-side responses as I continue to read down.
But I'm gleaning a lot of great information from this conversation.

:> for fsync() operations... those would go much faster using a forward
:> log then the mechanism I use because only the forward-log would have
:> to be synchronized (not the meta-data changes) to 'commit' the work.
:> I like that, it would be a definite advantage for database operations.
:
:I think you will like it even more when you realize that updating the
:filesystem header is not required for most transactions.

A definite advantage for a quick write()/fsync() like a database might
do. I might be able to implement a forward-log for certain logical
operations.

Due to my use of a fixed UNDO FIFO area I can pre-format it and I can
put a 64 bit sequence number at the head of each block. There would be
no chance of it hitting a conflict. Hmm. Yes, I do think I could get
rid of the volume header update without too much effort.

:A versioned extent:
:
: struct extent { unsigned version:10, count:6, block:24; };
:
:Directory format to index the extent within a leaf:
:
: struct entry { unsigned loglo:24, offset:8 };
: struct group { unsigned loghi:24, count:8 };
:
:Leaf layout (abridged):
:
: struct leaf { unsigned groups; struct group[]; struct entry[]; struct el=
:ements[] };
:...
:See, I am really obsessive when it comes to saving bits. None of these
:compression hacks costs a lot of code or has a big chance of hiding a
:bug, because the extra boundary conditions are strictly local.

Man, those are insanely small structures. Watch out for the cpu
and in-kernel-memory tracking overhead.

: * The linux page cache not only caches data but provides an efficient
: access path to logical blocks that avoids having to consult the
: filesystem metadata repeatedly under heavy traffic.

Actually, BSD buffer caches are logical caches, and always have been.
Linux actually took a page from our book on that one. In fact, if
you do a google search I'll bet you will see my name in some of those
Linux VM system debates 5-10 years ago :-)

The typical BSD (Open, Net, Free, DragonFly, etc) buffer cache structure
is a logically indexed entity which can also contain a cached physical
translation (which is how the direct data bypass works).

:...
:> Are you just going to leave them in the log and not push a B-Tree
:> for them?=20
:
:It did not occur to me to leave them in the logical log, but yes that
:is a nice optimization. They will only stay there for a while, until a
:rollup is triggered, moving them into the inode table proper. It gets
:a little tricky to figure out whether the amortization due to the
:logical logging step is worth the extra write. There is a crossover
:somewhere, maybe around a hundred bytes or so, ideal for symlinks and
:acls.

I'm just going to throw this out. In an earlier incarnation of the
HAMMER design I had a forward log which embedded data, and the idea
was that I would leave the data in-place in the log. I was going to
use a blockmap to turn the log into a sparse-map as a means of
recovering space from it.

I eventually threw away that idea but it occurs to me that you might
be able to use a similar idea to compress small bits of meta-data.

:> :I might eventually add some explicit cursor caching, but various
:> :artists over the years have noticed that it does not make as much
:> :difference as you might think.
:>=20
:> For uncacheable data sets the cpu overhead is almost irrelevant.
:
:For rotating media, true. There is also flash, and Ramback...

Yes, true enough. Still only a 20% cache hit though, due to the many
other in-memory structures involved (namecache, vnode, in-memory inode,
etc).

:> I think it will be an issue for people trying to port HAMMER. I'm tr=
:ying
:> to think of ways to deal with it. Anyone doing an initial port can=20
:> just drop all the blocks down to 16K, removing the problem but
:> increasing the overhead when working with large files.
:
:Do the variable sized page hack :-]

Noooo! My head aches.

:Or alternatively look at the XFS buffer cache shim layer, which
:Christoph manhandled into kernel after years of XFS not being accepted
:into mainline because of it.

The nice advantage of controlling the OS code is I could change
DragonFly's kernel to handle mixed block sizes in its buffer cache.
But I still want to find a less invasive way in the HAMMER code
to make porting to other OS's easier.

:Ext2 and Ext3) so going to extents is not optional. I plan a limit of
:64 blocks per extent, cutting the metadata overhead down to 16K per
:gigabyte, which is hard to complain about. Larger extents in the file
:index would not buy much and would mess up the nice, 8 byte
:one-size-fits-all versioned extent design.

That is reasonable. I don't plan on going past about a megabyte
per extent myself, there's just no reason to go much bigger.

:> Orphan inodes in HAMMER will always be committed to disk with a
:> 0 link count. The pruner deals with them after a crash. Orphan
:> inodes can also be commited due to directory entry dependancies.
:
:That is nice, but since I want to run without a pruner I had better
:stick with the logical logging plan. The link count of the inode will
:also be committed as zero, or rather, the link attribute will be
:entirely missing at that point, which allows for a cross check.

It won't work for you if your logical log is meant to be short-lived.
Programs can hold unlinked inodes open for hours, even days. Even
longer. You would have to rewrite the entry in order to be able to
move the caboose. Well, that isn't a bad idea, just more complication.

:> I do wish we had something like LVM on BSD systems. You guys are very
:> lucky in that regard. LVM is really nice.
:
:But I thought you BSD guys have GEOM. I think GEOM is really nice, and
:I think that Linux LVM has so many problems that I have started work on
:a replacement, LVM3.

GEOM is fairly simplistic compared to LVM, but I'll keep your comments
regarding LVM in mind (never having used it). GEOM's claim to fame is
its encryption layer.

Maybe the ticket for DragonFly is to simply break storage down into
a reasonable number of pieces, like cutting up a drive into 256 pieces,
and create a layer to move and glue those pieces together into larger
logical entities.

:> BTW it took all day to write this!
:
:It took two days to write this ;-)
:
:I hope the fanout effect does not mean the next post will take four days.
:
:Regards,
:
:Daniel

I'm editing down as much as I can :-) This one took 6 hours, time to
get lunch!

-Matt
Matthew Dillon


From: Daniel Phillips <phillips@...> Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 27, 9:48 pm 2008

On Sunday 27 July 2008 14:31, Matthew Dillon wrote:
> :When a forward commit block is actually written it contains a sequence
> :number and a hash of its transaction in order to know whether the
> :...
> :Note: I am well aware that a debate will ensue about whether there is
> :any such thing as "acceptable risk" in relying on a hash to know if a
> :commit has completed. This occurred in the case of Graydon Hoare's
> :Monotone version control system and continues to this day, but the fact
> :is, the cool modern version control systems such as Git and Mercurial
> :now rely very successfully on such hashes. Nonetheless, the debate
> :will keep going, possibly as FUD from parties who just plain want to
> :use some other filesystem for their own reasons. To quell that
> :definitively I need a mount option that avoids all such commit risk,
> :perhaps by providing modest sized journal areas salted throughout the
> :volume whose sole purpose is to record log commit blocks, which then
> :are not forward. Only slightly less efficient than forward logging
> :and better than journalling, which has to seek far away to the journal
> :and has to provide journal space for the biggest possible journal
> :transaction as opposed to the most commit blocks needed for the largest
> :possible VFS transaction (probably one).
>
> Well, you've described both sides of the debate quite well. I for
> one am in the 0-risk camp. ReiserFS got into trouble depending on
> a collision space (albeit a small one). I did extensive testing of
> 64 bit collision spaces when I wrote Diablo (A USENET news system
> contemporary with INN, over 10 years ago) and even with a 64 bit space
> collisions could occur quite often simply due to the volume of article
> traffic.
>
> I think the cost can be reduced to the point where there's no need
> to allow any risk at all. After all, we're only talking about meta-data
> updates here for the most part. Even under heavy loads it can take
> 30 seconds for enough dirty meta-data to build up to require a flush.
> One can flush the mess, then seek/update the volume header.

So I do not want users to fixate on that detail. The mount option
allows them to choose between "fast but theoretically riskier" and
"warm n fuzzy zero risk but not quite so fast". If the example of Ext3
is anything to go by, almost everybody chooses the "ordered data" mode
over the "journal data" mode given the tradeoff that the latter is
about 30% slower but offers better data integrity for random file
rewrites.

The tradeoff for the option in Tux3 will be maybe 1 - 10% slower in
return for a miniscule reduction in the risk of a false positive on
replay. Along the lines of deciding to live underground to avoid the
risk of being hit by a meteorite. Anyway, Tux3 will offer the option
and everybody will be happy.

Actually implementing the option is pretty easy because the behavior
variation is well localized.

> Or doing something like using a fixed LOG area, allowing it to be
> preformatted... that would remove 100% of the risk and not require
> a volume header update (drafting note: edited up from further below).
>
> I know volume headers seem old-school. It was a quick and dirty
> solution in HAMMER that I will be changing.
>
> :Actually, the btree node images are kept fully up to date in the page
> :cache which is the only way the high level filesystem code accesses
> :them. They do not reflect exactly what is on the disk, but they do
> :reflect exactly what would be on disk if all the logs were fully
> :rolled up ("replay").
>
> Ok, this is simpler. Hmm. That brings up the question with regards
> to data allocations and the log. Are you going to use a fixed log
> space and allocate data-store separately or is the log a continuously
> running pointer across the disk?
>
> My understanding is it is a continuously running space.

Yes, a continuous running space, not preallocated. The forward log
will insert itself into any free blocks that happen to be near the
the transaction goal location. Such coopted free space will be
implicitly unavailable for block allocation, slightly complicating the
block allocation code which has to take into consideration both the
free space map and all the outstanding log transactions.

> :A forward log that carries the edits to some dirty cache block pins
> :that dirty block in memory and must be rolled up into a physical log
> :before the cache block can be flushed to disk. Fortunately, such a
> :rollup requires only a predictable amount of memory: space to load
> :enough of the free tree to allocate space for the rollup log, enough
> :...
>
> Ok, here I spent about 30 minutes constructing a followup but then
> you answered some of the points later on, so what I am going to do
> is roll it up into a followup on one of your later points :-)
>
> :One traditional nasty case that becomes really nice with logical
> :forward logging is truncate of a gigantic file. We just need to
> :commit a logical update like ['resize', inum, 0] then the inode data
> :truncate can proceed as convenient. Another is orphan inode handling
> :where an open file has been completely unlinked, in which case we
> :log the logical change ['free', inum] then proceed with the actual
> :delete when the file is closed or when the log is replayed after a
> :surprise reboot.
> :...
> :Logical log replay is not idempotent, so special care has to be taken
> :on replay to ensure that specified changes have not already been
>
> Wait, it isn't? I thought it was. I think it has to be because
> the related physical B-Tree modifications required can be unbounded,

Logical logging is not idempotent by nature because of the uncertainty
of the state of the object edited by a logical operation: has the
operation already been applied or not? If you know something about the
structure of the target you can usually tell. Suppose the operation is
a dirent create. It has been applied to the target if the direct
already exists, otherwise not. I do not like this style of special
case hacking to force the logical edit to be idempotent.

Instead, I choose to be sure about the state of the target object by
introducing the rule that after a logical log operation has been
generated, nothing is allowed to write to the object being edited. The
logical operation pins the state of the disk image of target object.
The object stays pinned until the logical log entry has been retired
by a physical commit to the object that updates the object and retires
the logical edit in one atomic log transaction.

I was working on a specific code example of this with pseudocode and
all disk transfers enumerated, but then your email arrived. Well after
this response the example will probably be better.

> and because physical B-Tree modifications are occuring in parallel the
> related physical operations cause the logical operations to become
> bound together, meaning the logical ops *cannot* be independantly
> backed out

Tux3 never backs anything out, so there is some skew to clear up.

The logical and physical log operations are indeed bound together at
both the disk image and the processor level, but not in the way that
you might expect. The primary purpose of rolling up logical log
entries into physical updates is to control resource consumption;
the secondary purpose is to reduce the amount of replay required to
reconstruct "current" memory images of the btree blocks on reboot or
remount.

In the first case, resource exhaustion, a high level vfs transaction
may have to wait for a rollup to finish, cleaning a number of dirty
buffer cache blocks and releasing resources needed for the rollup,
such as bio structs and lists of transaction components.

Such blocking is nominally enforced by the VMM by sending the
requesting process off to scan for and try to free up some dirty
memory (a horribly bad design idea in Linux that causes recursive
calls into the filesystem, but that is the way it works) during which
time the process is effectively blocked. Tux3 will add a more
responsible level of resource provisioning by limiting the number of
transactions that can be in flight, much as Ext3 does. This allows
the filesystem to fullfill a promise like "never will use up all of
kernel reserve memory after having been given privileged access to
it in order to clean and recover some dirty cache blocks".

In the second case, keeping the replay path short, high level
operations can proceed in parallel with physical updates, because
a transaction is _logically_ completed as soon as committing the
associated logical edits to disk has completed.

So high level operations block on logical commit completions, not on
physical commit completions except in the case of resource exhaustion.

> Here's an example:
>
> rm a/b/c <---- 1TB file
> rmdir a/b
> rmdir a
>
> There are three problems here.
>
> First, the logical log must be idempotent, or at least mostly
> idempotent, because the physical portions of the three separate
> logical transactions will complete at different times.

If the user writes:

rm a/b/c&
rm a/b&
rm a&

Then they deserve what they get, which is a probable complaint that a
directory does not exist. If the operations happen to execute in a
favorable order then they will all succeed. With Tux3 this is possible
because destroying the 1TB file can proceed asynchronously while the VFS
transaction completes and returns as soon as a commit containing

['unlink', inum_a/b/c, dnum_a/b]
['destroy', inum_a/b/c]

has been logged. So deleting the 1TB file can be very fast, although
it may be some time before all the occupied disk space becomes
available for reuse.

To make the deletion visible to high level filesystem operations, some
cached disk blocks have to be updated. For a huge deletion like this
it makes sense to invent a versioned "orphan" attribute, which is added
to the inode table leaf, meaning that any data lookups arriving from
still-open files should ignore the file index tree entirely.

> Second, the logical log entry for "rm a/b/c" cannot be destroyed
> (due to log cycling based on available free space) until after the
> related physical operation has completed, which could occur seconds
> to minutes later.

True, but that only pins one commit block on disk.

> The second issue could become a big issue for you because it means
> the log space required will become unbounded or nearly unbounded.
> If the remove requires 10's to 100's of megabytes of physical log
> entries, all appending to the log, and the physical blocks backing
> the log cannot be recovered due to that 'rm a/b/c' needing to remain
> in the log until the physical operation is complete, then I think you
> have a problem. You would probably need to special-case it.

I do not think I have that problem because recovering the space used
by the big file can proceed incrementally: free a few blocks from the
inode index leaves; commit the revised file index blocks and free tree
blocks; repeat as necessary. After a crash, find the logical inode
destroy log entry and resume the destroy.

> A third issue effecting the amount of free space required to complete
> an operation is going to be the worst case physical log space required
> for a single low-level B-Tree transaction, which will roughly be
> related to the depth of the B-Tree, multiplied by the number of modifying
> B-Tree operations being done in parallel (e.g. from many individual
> processes).

Yes indeed. A matter of totalling all that up, which is fortunately
bounded to a nice low number. Then in grand Linux tradition, ignore
the totals and just rely on there being "megabytes" of reserve memory
to handle the worst case. Obviously a bad idea, but that is how we
have done things for many years.

> :...
> :ugly or unreliable when there are lot of stacked changes. Instead I
> :introduce the rule that a logical change can only be applied to a known
> :good version of the target object, which promise is fullfilled via the
> :physical logging layer.
> :...
> :then apply the logical changes to it there. Where interdependencies
> :exist between updates, for example the free tree should be updated to
> :reflect a block freed by merging two btree nodes, the entire collection
> :of logical and physical changes has to be replayed in topologically
> :sorted order, the details of which I have not thought much about other
> :than to notice it is always possible.
>
> I came up with a really simple solution to the freemap-reuse issue
> which you could apply to allow such replays to be idempotent. You
> simply do not allow a freed block to be reallocated until after all
> the information related to the logical operation that freed it has
> been solidly committed. Thus you can replay the low level frees
> without accidently re-freeing a block that has been reallocated.
>
> In HAMMER I delay reuse of freed blocks because UNDO's are only for
> meta-data, not file data, and thus freed data blocks cannot be reused
> until after the related freeing operations has been completely
> committed so crash recovery can do the rollback. You could use them
> to make otherwise non-idempotent operations idempotent.

It sounds like a good idea, I will ponder.

> :When replay is completed, we have a number of dirty cache blocks which
> :are identical to the unflushed cache blocks at the time of a crash,
> :and we have not yet flushed any of those to disk. (I suppose this gets
> :interesting and probably does need some paranoid flushing logic in
> :replay to handle the bizarre case where a user replays on a smaller
> :memory configuration than they crashed on.) The thing is, replay
> :returns the filesystem to the logical state it was in when the crash
> :happened. This is a detail that journalling filesystem authors tend
> :to overlook: actually flushing out the result of the replay is
> :pointless and only obscures the essential logic. Think about a crash
> :during replay, what good has the flush done?
>
> Well, you have to flush at some point anyway because you only have
> so much kernel memory to play with.... You do not have enough kernel
> memory to cache the dirty buffers for all the physical activity related
> to a particular logical operation (such as truncating a 1TB file).

I hope that question is cleared up now.

> :I do not see why this example cannot be logically logged in pieces:
> :
> : ['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"]
> : ['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"]
> : ['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"]
> : ['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"]
> : ['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"]
> :
> :Logical updates on one line are in the same logical commit. Logical
> :allocations of blocks to record the possibly split btree leaves and new
> :allocations omitted for clarity. The omitted logical updates are
> :bounded by the depth of the btrees. To keep things simple, the logical
> :log format should be such that it is impossible to overflow one commit
> :block with the updates required to represent a single vfs level
> :transaction.
>
> Hmm. Maybe. This helps you identify dependancies that must be
> replayed, but it might be easier just to make the entire logical
> log idempotent and just replay everything. I think the only gotcha
> there is handling link counts. That would still be far less complex
> then the alternative.

The presence of a "link" record in the logical log implies a link count
in addition to whatever is recorded in the on-disk inode table leaf.
On the other hand, the "current" image of the inode table leaf in the
buffer cache has the correct link count when the sys_link returns.

> :transactions are not just throwaway things, they are the actual new
> :data. Only the commit block is discarded, which I suppose will leave
> :a lot of one block holes around the volume, but then I do not have to
> :require that the commit block be immediately adjacent to the body of
> :the transaction, which will allow me to get good value out of such
> :holes. On modern rotating media, strictly linear transfers are not
> :that much more efficient than several discontiguous transfers that all
> :land relatively close to each other.
>
> True enough. Those holes will create significant fragmentation
> once you cycle through available space on the media, though.

It depends on how high the density of such holes is. I will try to keep
it down to a few per megabyte. The nice think about having some one
block holes strewn around the disk is, there is always a nearby place
for a commit block to camp out for a while.

> :> So your crash recovery code will have to handle=20
> :> both meta-data undo and completed and partially completed transaction=
> :
> :Tux3 does not use undo logging, only redo, so a transaction is complete
> :as soon as there is enough information on durable media to replay the
> :redo records.
>
> Ok, that works. Be careful w/ your physical logging, it can get
> expensive. I think it may be more expensive then you are currently
> estimating.

I am hoping not to mess that up ;-)

The thing about physical logging in Tux3 is, the logged data blocks are
normally immediately linked into the index blocks as actual file data
or btree metadata blocks, so they are only written once, and they are
hopefully written near where the rest of the structure lives. It is
only in the optional "update in place" style of physical logging that
the commit data blocks are written twice, which will be no worse than a
traditional journal and probably a lot better because of not needing to
seek to the journal region.

> One of the reasons why HAMMER uses an UNDO log is that it is a lot
> less expensive then a REDO log. I'll show you why:
>
> Physical operations related to a single flush group in HAMMER:
>
> modify element in B-Tree node A (create UNDO w/ original contents of
> B-Tree node A)
> modify element in B-Tree node A (no new UNDO needs to be created)
> modify element in B-Tree node A (no new UNDO needs to be created)
> modify element in B-Tree node A (no new UNDO needs to be created)
> modify element in B-Tree node A (no new UNDO needs to be created)

Nice, now I understand. But do you not have to hold all filesystem
transactions that depend on the modification until the btree node has
completed writing to disk? With logical logging you only have to wait
on the logical commit to complete, into which may be packed a number of
other changes for efficiency.

> (NOTE: I do realize that the REDO log can be compressed just as the
> UNDO one, by recording actual data shifts from B-Tree insertions
> and deletions, it gets really complex when you do that, though, and
> would not be idempotent).

As described above, it is not how I do it.

> See the advantage? Many B-Tree operations working on the same node,
> or revisiting that node later on, within the same flush group, do not
> have to lay down more then one UNDO record for that B-Tree node.
>
> The physical ops are modifying already modified space so I don't
> have to record each op in the UNDO log. Upon crash recovery the whole
> mess is undone by that first UNDO record.
>
> I just run the UNDOs backwards, which allows me to cache the in-flush
> undo areas with a fixed set of small throw-away tracking structures
> in system memory. If a structure gets thrown away it just means an
> extranious UNDO record will be appended to the UNDO FIFO.

I think I see it, but I have my doubts because you have to block
transactions waiting for the up to date copy of the btree data to land
on disk. Either that, or you may give userspace the impression that
some transaction has gotten onto stable storage when that is not the
case.

If you do in fact block transactions until the current image of the
btree leaf has been written out then the blocking behavior is no
better than REDO style, and REDO can batch together many updates to
different logical blocks into a single commit block, which seems like
fewer transfers overall.

> :Yes. Each new commit records the sequence number of the oldest commit
> :that should be replayed. So the train lays down track in front of
> :itself and pulls it up again when the caboose passes. For now, there
> :is just one linear sequence of commits, though I could see elaborating
> :that to support efficient clustering.
>
> Yah. I see. Noting that issue I brought up earlier about the
> "rm a/b/c". Locking the caboose of your log until all related physical
> operations have been completed could create a problem.

I hope that is cleared up now.

> :One messy detail: each forward log transaction is written into free
> :space wherever physically convenient, but we need to be sure that that
> :free space is not allocated for data until log rollup has proceeded
> :past that transaction. One way to do this is to make a special check
> :against the list of log transactions in flight at the point where
> :extent allocation thinks it has discovered a suitable free block, which
> :is the way ddsnap currently implements the idea. I am not sure whether
> :I am going to stick with that method for Tux3 or just update the disk
> :image of the free tree to include the log transaction blocks and
> :somehow avoid logging those particular free tree changes to disk. Hmm,
> :a choice of two ugly but workable methods, but thankfully neither
> :affects the disk image.
>
> You might be able to adapt the delayed-reuse concept I used in HAMMER
> to simplify the problem, possibly to the point of making those
> operations replayable.
>
> Another thing I did in HAMMER was reserve space on the frontend for
> data blocks without modifying the freemap at all. This allows the
> frontend to write file data blocks to the physical media without
> having to wait for synchronization with meta-data or the backend or
> anything.
>
> I keep track of the reserved space with small in-memory structures
> (whos function is to prevent other allocations from using the reserved
> space until it can be committed). Freemap updates are then committed
> in the same flush that commits the related B-Tree insertions and
> deletions. Thus the freemap is kept in sync.

I am gravitating towards that style for Tux3's commit-related short
term allocations.

> :Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to
> :beat that compactness. We have learned through bitter experience that
> :anything other than an Ext2/UFS style physically stable block of
> :dirents makes it difficult to support NFS telldir cookies accurately
> :because NFS vs gives us only a 31 bit cookie to work with, and that is
> :not enough to store a cursor for, say, a hash order directory
> :traversal. This is the main reason that I have decided to go back to
> :basics for the Tux3 directory format, PHTree, and make it physically
> :stable.
>
> The cookies are 64 bits in DragonFly. I'm not sure why Linux would
> still be using 32 bit cookies, file offsets are 64 bits so you
> should be able to use 64 bit cookies.

It is not Linux that perpetrates this outrage, it is NVFS v2. We can't
just tell everybody that their NFS v2 clients are now broken.

> For NFS in DragonFly I use a 64 bit cookie where 32 bits is a hash key
> and 32 bits is an iterator to deal with hash collisions. Poof,
> problem solved.

Which was my original proposal to solve the problem. Then Ted told me
about NFS v2 :-O

Actually, NFS hands you a 62 bit cookie with the high bits of both s32
parts unused. NFS v2 gives you a 31 bit cookie. Bleah.

> :In the PHTree directory format lookups are also trivial: the directory
> :btree is keyed by a hash of the name, then each dirent block (typically
> :one) that has a name with that hash is searched linearly. Dirent block
> :pointer/hash pairs are at the btree leaves. A one million entry
> :directory has about 5,000 dirent blocks referenced by about 1000 btree
> :leaf blocks, in turn referenced by three btree index blocks (branching
> :factor of 511 and 75% fullness). These blocks all tend to end up in
> :the page cache for the directory file, so searching seldom references
> :the file index btree.
>
> That seems reasonable. I'll look up the PHTree stuff, it sounds
> interesting. Personally speaking I don't mind eating 64 bytes per
> directory entry. I can accomodate file names up to 16 bytes in
> the B-Tree element itself so 95% of the namespace doesn't even
> require a look-aside data reference.
>
> The one mistake I made, which I didn't have time to fix in the
> first HAMMER release, is that my directory hash is a standard crc
> which means that records in the B-Tree aren't even close to being
> in any sort of lexical order. I will probably wind up changing
> the hash calculation a bit to get some lexical sorting information
> stuffed into it... should be doable since the hash keys are 64 bits.

Yes, I noticed that. Check out dx_hack_hash:

http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38

It distributed hashes of ascii strings very well for me, with few
funnels. It way outperformed some popular hashes like TEA. Ted's cut
down crytographic hash is yet more uniform but costs much more CPU.

> :will pass Hammer in lookup speed for some size of directory because of
> :the higher btree fanout. PHTree starts from way behind courtesy of the
> :32 cache lines that have to be hit on average for the linear search,
> :amounting to more than half the CPU cost of performing a lookup in a
> :million element directory, so the crossover point is somewhere up in
> :the millions of entries.
>
> Well, the B-Tree fanout isn't actually that big a deal. Remember
> HAMMER reblocks B-Tree nodes along with everything else. B-Tree
> nodes in a traversal are going to wind up in the same 16K filesystem
> block.

It affects the number of probes you have to do for the lookup.
Binsearching hits one cache line per test while each node lookup hits a
lot more.

> That leaves just element space overhead, which frankly is not much
> of an issue on a production system because kernels have to cache a lot
> more then just the directory entry. The in-memory vnode and inode
> structures take up a lot more space then the cached directory entry.
> So even though HAMMER's directory entry is a 64-byte B-Tree element
> and roughly double the overhead of a UFS-style directory entry, the
> overall cache footprint in the system is only ~20% worse.

It is not just the cache footprint but the time it takes to get your
key tables into L1 cache. To be sure, we are talking about small
details here, but these small details can add up to a difference of a
factor of two in execution speed. Which maybe you don't care much about
now that you are orders of magnitude faster than what came before ;-)

> :...
> :> that the B-Tree can seek to directly. In fact, at one point I tried
> :> indexing on delete_tid instead of create_tid, but created one hell of=
> : a
> :> mess in that I had to physically *move* B-Tree elements being deleted
> :> instead of simply marking them as deleted.
> :
> :This is where I do not quite follow you. File contents are never
> :really deleted in subsequent versions, they are just replaced. To
> :truncate a file, just add or update size attribute for the target
> :version and recover any data blocks past the truncate point that
> :belongs only to the target version.
>
> Truncation is a deletion. That is, the B-Tree elements have to
> be marked as having been deleted so if you truncate-extend a file
> you see the new space as being full of zeros.
>
> Take a 1MB file. Truncate it to 512K, then truncate-extend the
> file back to 1MB. Or seek-write to extend the file...
> same difference, really.

Following my prescription above, the file will be full of zeros when
re-extended because the data pointers were removed. But I think you
are right, truncate really is a delete and I actually argued myself
into understanding that.

> :I plan to scan the whole btree initially, which is what we do in ddsnap
> :and it works fine. But by looking at the mtime attributes in the inode
> :I can see in which versions the file data was changed and thus not have
> :to scan most files. Eventually, building in some accelerator to be able
> :to skip most inode leaves as well would be a nice thing to do. I will
> :think about that in the background.
>
> Yah, that makes sense. You won't have to scan every last tidbit of
> information. HAMMER would have had to if I had not added the
> propagating transaction id feature to it.
>
> You are still going to have a scaleability issue but it sounds like
> you could implement the exact same optimization and completely solve
> the problem.

Yes, I have thought about it a little more, and I imagine something
like a small array of dirty bits that climb up the btree with the help
of logical logging, where each bit means that there is something
interesting for some corresponding replication target to look at,
somewhere down in the subtree.

> :> The way it works is that HAMMER's frontend is almost entirely
> :> disconnected from the backend. All meta-data operations are cached
> :> in-memory. create, delete, append, truncate, rename, write...
> :> you name it. Nothing the frontend does modifies any meta-data
> :> or mata-data buffers.
> :
> :Things are a little different in Linux. The VFS takes care of most of
> :what you describe as your filesystem front end, and in fact, the VFS is
> :capable of running as a filesystem entirely on its own, just by
> :supplying a few stub methods to bypass the backing store (see ramfs).
> :I think this is pretty much what you call your front end, though I
> :probably missed some major functionality you provide there that the
> :VFS does not directly provide.
>
> Well, for file data, up to a point, and HAMMER does just use the
> kernel's buffer cache for file data.
>
> Not for directory entries. I'm pretty sure Linux uses a throw-away
> namecache just like we do, so unless it has a mechanic for flush
> sequencing of dirty namecache entries you need in-VFS structural
> support for caching the operations.

The Linux dentry cache actually implements proper namespace semantics
all by itself without needing to flush anything, which is what Ramfs
is. Ramfs just lives entirely in cache without even being backed by
a ramdisk, until the power goes off.

> In particular, caching a namespace deletion (rm, rename, etc)
> without touching the meta-data requires implementing a merged lookup
> so you can cancel-out the directory entry that still exists on-media.
> So it isn't entirely trivial.

Linux implements "negative dentries" to say "this entry is not here".

> Things related to file data are handled by the kernel for the most
> part, but truncations still need considerable support within the
> VFS to properly handle seek-write or truncate file extensions.
> In order to avoid touching any meta-data when handling a truncation,
> a similar canceling field in the in-memory inode needs to be
> maintained until the backend is able to delete the related data
> blocks. Lookups have to take into account the cached truncation offset
> so as not to read data from the media past the truncation point (e.g.
> if you seek-write-extend or truncate-extend the file immediately
> after truncating the file).

I have not looked at that part for a while now, and I did not look at
it just now, but Al Viro has been fiddling with it for years getting the
bugs out one at a time. The last major one I heard of was some time
ago. It works somehow, I should look more closely at it.

> Most filesystems will dirty meta-data buffers related to the media
> storage as part of executing an operation on the frontend. I don't
> know of any which have the level of separation that HAMMER has.

Modern Linux filesystems get close I think. Particularly in journalled
data mode, Ext3 marks all the buffers it deals with as "don't touch" to
the VFS and VMM, which have no idea how to obey the necessary ordering
constraints. There is also this thing called the "journalling block
device" that provides an abstract implementation of a physical journal,
which is actually used by more than one filesystem. (Three I think,
including Ext4, however I now hear noise about rewriting it.)

> :Note: no other filesystem on Linux current works this way. They all
> :pretty much rely on the block IO library which implements the fairly
> :goofy one-block-at-a-time transfer regimen. The idea of setting up bio
> :transfers directly from the filesystem is unfortunately something new.
> :We should have been doing this for a long time, because the interface
> :is quite elegant, and actually, that is just what the block IO library
> :is doing many levels down below. It is time to strip away some levels
> :and just do what needs to be done in the way it ought to be done.
>
> Jeeze, BSD's have been doing that forever. That's what VOP_BMAP is
> used for. I'm a little surprised that Linux doesn't do that yet.
> I'll expand on that down further.

Either you forgot to expand or I missed it. I am interested.

> :Directory format to index the extent within a leaf:
> :
> : struct entry { unsigned loglo:24, offset:8 };
> : struct group { unsigned loghi:24, count:8 };
> ...
> Man, those are insanely small structures. Watch out for the cpu
> and in-kernel-memory tracking overhead.

I have written the code, mostly, and it is tight. A similar idea has
been part of ddsnap for 5 years now.

> : * The linux page cache not only caches data but provides an efficient
> : access path to logical blocks that avoids having to consult the
> : filesystem metadata repeatedly under heavy traffic.
>
> Actually, BSD buffer caches are logical caches, and always have been.
> Linux actually took a page from our book on that one. In fact, if
> you do a google search I'll bet you will see my name in some of those
> Linux VM system debates 5-10 years ago :-)

I remember it well. You were the one who put Rik up the reverse map
design that caused the main fireworks between 2.3 and 2.4.9. (I was
the one who finally got it work by social-engineering the merging of
the reverse map part with Andrea's more robust LRU design.)

I also remember the time when BSD buffers were far superior to Linux
ones, yes. In 2.0 the arrangement sucked enormously: sort-of coherency
between the buffer cache and the page cache was achieved by physically
copying changes from one to the other.

Today, the buffer cache and page cache are nearly fully unified. But
the unification will never be complete until the page size is variable,
so the remaining tasks done by buffers can be done by pages instead.

Anyway, this part of your OS is more similar than different, which will
help a lot with a port.

> The typical BSD (Open, Net, Free, DragonFly, etc) buffer cache structure
> is a logically indexed entity which can also contain a cached physical
> translation (which is how the direct data bypass works).

Linux caches physical translations by gluing one or more buffer heads
onto each page, which is one of the few remaining uses of buffer heads.
To finally get rid of buffer heads the physical cache needs to be done
some other way. I am sure there are lots of better ways.

> I'm just going to throw this out. In an earlier incarnation of the
> HAMMER design I had a forward log which embedded data, and the idea
> was that I would leave the data in-place in the log. I was going to
> use a blockmap to turn the log into a sparse-map as a means of
> recovering space from it.
>
> I eventually threw away that idea but it occurs to me that you might
> be able to use a similar idea to compress small bits of meta-data.

This is similar to what I do when I write out a physical block and use
a logical log entry to make the on-disk parent point at it (actually,
it is a promise to make the parent point at it some time in the future.
But then I do not let those things live very long. Eventually it might
make sense to let them live a little longer, perhaps by generalizing
the idea of the single linear log to a forest of little logs, some of
which could be left around for a long time instead of being quickly
rolled up into the canonical structures.

> :> Orphan inodes in HAMMER will always be committed to disk with a
> :> 0 link count. The pruner deals with them after a crash. Orphan
> :> inodes can also be commited due to directory entry dependancies.
> :
> :That is nice, but since I want to run without a pruner I had better
> :stick with the logical logging plan. The link count of the inode will
> :also be committed as zero, or rather, the link attribute will be
> :entirely missing at that point, which allows for a cross check.
>
> It won't work for you if your logical log is meant to be short-lived.
> Programs can hold unlinked inodes open for hours, even days. Even
> longer. You would have to rewrite the entry in order to be able to
> move the caboose. Well, that isn't a bad idea, just more complication.

Indeed, I need to handle that. Since I do want to stick with a simple
linear sequence of logical log commits for now, I do not want to leave
any of them sitting around for a log time. One easy thing to do is to
put any orphans aside at rollup time, in an unversioned orphan file, say.

> Maybe the ticket for DragonFly is to simply break storage down into
> a reasonable number of pieces, like cutting up a drive into 256 pieces,
> and create a layer to move and glue those pieces together into larger
> logical entities.

Linux LVM is all about device mapper, and actually, device mapper does
not really have a reason to be a separate subsystem, it can just be the
way the block layer works. If interested, check out my posts on bio
stacking and rolling device mapper's "clone_and_map" idea up into the
generic block layer. I think all unixen should do this.

Other than that, device mapper is just about making device numbers
point at different virtual devices, transparently to userspace. Even
device stacking works like that: extract the "virtual" device (which
may or may not be a real device) from the device number you are going
to remap, store it in some other device number, create a new virtual
device on top of the other device number, finally store the new virtual
device in the old device number. The old device number has now been
transparently stacked. Nothing to it.

> I'm editing down as much as I can :-) This one took 6 hours, time to
> get lunch!

Only 3 hours so far here... hmm, make that four.

Daniel


From: Matthew Dillon <dillon@...> Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 27, 11:46 pm 2008

:Tux3 never backs anything out, so there is some skew to clear up.

Got it. I'm still not clear on the physical vs logical log
interactions required for crash recovery, but I think I understand
everything else.

:> One of the reasons why HAMMER uses an UNDO log is that it is a lot
:> less expensive then a REDO log. I'll show you why:
:>
:> Physical operations related to a single flush group in HAMMER:
:>
:> modify element in B-Tree node A (create UNDO w/ original contents of
:> B-Tree node A)
:> modify element in B-Tree node A (no new UNDO needs to be created)
:> modify element in B-Tree node A (no new UNDO needs to be created)
:> modify element in B-Tree node A (no new UNDO needs to be created)
:> modify element in B-Tree node A (no new UNDO needs to be created)
:
:Nice, now I understand. But do you not have to hold all filesystem
:transactions that depend on the modification until the btree node has
:completed writing to disk? With logical logging you only have to wait
:on the logical commit to complete, into which may be packed a number of
:other changes for efficiency.

The only major interlock between frontend searches and backend flushes
occurs on an inode-by-inode basis and only for the duration of the
modifying B-Tree operation, which does *not* include the writing of
the modified meta-data buffer to disk. Once the operation concludes
we are left with dirty meta-data buffers which for all intents and
purposes have transfered the in-memory cached B-Tree records (held
as separate entities in a red-black tree in-memory) into the media
view of the B-Tree (having performed the actual B-Tree operations
necessary to insert or delete the record), and the inode lock is
then released. The flusher flushes each inode on the flush list in
turn, with certain long-lived inodes (e.g. truncation of a 1TB
file to 0) only getting partial goodness and then being moved to
the next flush list. The inode lock is held for each individual
inode as it is being flushed, one inode at a time.

The actual dirty meta-data buffers are flushed later on, without
any inode lock held and without having any blocking effect on the
frontend. This media-write sequence occurs at the end, after all
inodes have been processed. UNDO finishes writing, we wait for
I/O to complete, we update the volume header, we wait for I/O
to complete, then we asynchronously flush the dirty media buffers
(and don't wait for that I/O to complete so the writes can be merged
with the beginning of the next flush group).

That is what I meant when I said the frontend was disconnected
from the backend. The worst the frontend ever does is issue reads
on media meta-data as part of any merged memory-media B-Tree searches
it must perform to locate information. Any modifying operations made
by the frontend will create records in the in-memory red-black tree,
but will *NOT* do any modifying oprations on meta-data buffers
representing the on-media B-Tree. Frontend lookups are resolved by
doing a merged-search with two cursors (one on the in-memory red-black
tree, and one on the on-media B-Tree).

I considered doing the interlock at the record level but hit
some hicups in that one of the cursors may have already passed the
memory record being committed but the other cursor (the media view)
may not have yet passed the B-Tree index where that record would be
written to. So if a commit were to occur right then, without a
governing inode lock, the merged search might see a duplicate record
(or no record in either view in the case of a deletion).

:> I just run the UNDOs backwards, which allows me to cache the in-flush
:> undo areas with a fixed set of small throw-away tracking structures
:> in system memory. If a structure gets thrown away it just means an
:> extranious UNDO record will be appended to the UNDO FIFO.
:
:I think I see it, but I have my doubts because you have to block
:transactions waiting for the up to date copy of the btree data to land
:on disk. Either that, or you may give userspace the impression that
:some transaction has gotten onto stable storage when that is not the
:case.

Userland doesn't care whether the actual transactions have made it
to disk or not unless it actually calls fsync(). If it does then
it will block through the necessary flush cycles. fsync() on
HAMMER is pretty slow as it will likely be syncing a lot more
in the flush group then just the requested inode.

If userland does not call fsync() (or 'sync', etc) then it doesn't
block at all, except as described above during the modify-buffer
phase of the B-Tree updates (on an inode-by-inode basis so actual
blocking is virtually non-existant). It is often possible to copy
an entire directory tree without initiating a backend flush.

Media B-Tree elements, inclusive of any held in dirty meta-data buffers,
are locked on a B-Tree node by B-Tree node basis. These are very
short lived locks and have nothing to do with the actual flushing
of the dirty meta-data buffers to the physical media. The frontend
always gets shared locks, the backend always gets exclusive locks.
Some blocking between the frontend and backend occurs but not a whole
lot. The backend flusher actually blocks against itself more often
(it runs multiple threads), due to the locality of reference within
the B-Tree for nearby inodes.

:> The cookies are 64 bits in DragonFly. I'm not sure why Linux would
:> still be using 32 bit cookies, file offsets are 64 bits so you
:> should be able to use 64 bit cookies.
:
:It is not Linux that perpetrates this outrage, it is NVFS v2. We can't
:just tell everybody that their NFS v2 clients are now broken.

Oh, we don't care about NFSv2 all that much any more. NFSv3 is the
bare minimum. NFSv2 is extremely old, nobody should be using it any
more. Even NFSv3 is getting fairly long in the tooth now.

:> For NFS in DragonFly I use a 64 bit cookie where 32 bits is a hash key
:> and 32 bits is an iterator to deal with hash collisions. Poof,
:> problem solved.
:
:Which was my original proposal to solve the problem. Then Ted told me
:about NFS v2 :-O
:
:Actually, NFS hands you a 62 bit cookie with the high bits of both s32
:parts unused. NFS v2 gives you a 31 bit cookie. Bleah.

I'd recommend dropping support for NFSv2. It is not really worth
supporting any more. Does it even support 64 bit inodes? (I don't
remember), or 64 bit file offsets? NFSv2 is garbage.

You should be able to use 63 bits of the cookie, I don't know why
you wouldn't use the high bit of the lsb 32 bit part. There is no
requirement that that bit be 0. In fact, the RFC says the cookie is
a 64 bit unsigned integer and you should be able to use all 64 bits.
If linux is not allowing all 64 bits to be used then it's a serious
bug in linux. The cookies are supposed to be opaque, just like the
file handle.

:> which means that records in the B-Tree aren't even close to being
:> in any sort of lexical order. I will probably wind up changing
:> the hash calculation a bit to get some lexical sorting information
:> stuffed into it... should be doable since the hash keys are 64 bits.
:
:Yes, I noticed that. Check out dx_hack_hash:
:
: http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38
:..
:It distributed hashes of ascii strings very well for me, with few
:funnels. It way outperformed some popular hashes like TEA. Ted's cut
:down crytographic hash is yet more uniform but costs much more CPU.

I like :-). I'll probably end up with multiple directory hash formats
too before this is over. Fortunately I have room in the inode to
specify the format the directory will use.

:...
:> Well, the B-Tree fanout isn't actually that big a deal. Remember
:> HAMMER reblocks B-Tree nodes along with everything else. B-Tree
:> nodes in a traversal are going to wind up in the same 16K filesystem
:> block.
:
:It affects the number of probes you have to do for the lookup.
:Binsearching hits one cache line per test while each node lookup hits a
:lot more.
:...
:It is not just the cache footprint but the time it takes to get your
:key tables into L1 cache. To be sure, we are talking about small
:details here, but these small details can add up to a difference of a
:factor of two in execution speed. Which maybe you don't care much about
:now that you are orders of magnitude faster than what came before ;-)

I'm more worried about extra disk seeks, bloated UNDO records, and
blockages due to B-Tree locking. The fully cached stuff already
performs extremely well and I don't feel that I have to eek out
much more performance out of it. And, frankly, the L1/L2 cache
characteristics are a wash and I'm already burning a lot of inefficient
data accesses by doing the power-of-2 search on the elements in the
first place.

Finding a better way to home in on the B-Tree element, aka getting
that power-of-2 search out of the critical path... that would be worth
doing. Worrying about an occassional node traversal is way below
the noise floor.

:Yes, I have thought about it a little more, and I imagine something
:like a small array of dirty bits that climb up the btree with the help
:of logical logging, where each bit means that there is something
:interesting for some corresponding replication target to look at,
:somewhere down in the subtree.

You can do with one transaction id / mtime field in the B-Tree node
header. Strictly speaking you don't need a per-element field on
internal elements, though I have one there too because I had the space.

:> Not for directory entries. I'm pretty sure Linux uses a throw-away
:> namecache just like we do, so unless it has a mechanic for flush
:> sequencing of dirty namecache entries you need in-VFS structural
:> support for caching the operations.
:
:The Linux dentry cache actually implements proper namespace semantics
:all by itself without needing to flush anything, which is what Ramfs
:is. Ramfs just lives entirely in cache without even being backed by
:a ramdisk, until the power goes off.

That's cool. You will want to flush it since you are backing the
system caches with a real filesystem, and you would probably need
a tracking structure of some sort anyway to deal with the
directory entry create/delete dependancies.

:> In particular, caching a namespace deletion (rm, rename, etc)
:> without touching the meta-data requires implementing a merged lookup
:> so you can cancel-out the directory entry that still exists on-media.
:> So it isn't entirely trivial.
:
:Linux implements "negative dentries" to say "this entry is not here".

Yes, BSD's have those too (and always have), but like the positive
entries they are a throw-away cache.

:> blocks. Lookups have to take into account the cached truncation offset
:> so as not to read data from the media past the truncation point (e.g.
:> if you seek-write-extend or truncate-extend the file immediately
:> after truncating the file).
:
:I have not looked at that part for a while now, and I did not look at
:it just now, but Al Viro has been fiddling with it for years getting the
:bugs out one at a time. The last major one I heard of was some time
:ago. It works somehow, I should look more closely at it.

I can believe that. It took me a while to get cached truncation
offsets working in HAMMER. It's probably the most complex bit of
code after the merged memory/media-B-Tree iterator.

:> Jeeze, BSD's have been doing that forever. That's what VOP_BMAP is
:> used for. I'm a little surprised that Linux doesn't do that yet.
:> I'll expand on that down further.
:
:Either you forgot to expand or I missed it. I am interested.

Oops, I think I did forget to expand that. Basically buffer cache
buffers under kernel management are logically indexed. Each vnode
has its own tree of buffer cache buffers.

When the kernel wants to flush dirty buffers or cluster adjacent
buffers together for I/O it uses VOP_BMAP to map the logical offsets
to physical offsets (so kernel-supported clustering is thus based
on physical offsets). The VFS is responsible for implementing the BMAP
operation. It is the BMAP operation which assigns actual physical
media blocks to a high level logically mapped filesystem buffer.

The mapping is persistent as long as the buffer cache buffer exists.
Of course clean buffers can be thrown away by the kernel so at some
point in the future the kernel might re-request the VOP_BMAP for
that logical offset.

The VFS can choose when to implement the physical mapping. It can
tell the kernel to take a hike and do it when the kernel actually
writes the buffer out (being a logical buffer the flush goes through
the VFS via VOP_STRATEGY), it can do it early on when the actual
read() or write() opration occurs, etc.

:I remember it well. You were the one who put Rik up the reverse map
:design that caused the main fireworks between 2.3 and 2.4.9. (I was
:the one who finally got it work by social-engineering the merging of
:the reverse map part with Andrea's more robust LRU design.)
:
:I also remember the time when BSD buffers were far superior to Linux
:ones, yes. In 2.0 the arrangement sucked enormously: sort-of coherency
:between the buffer cache and the page cache was achieved by physically
:copying changes from one to the other.
:
:Today, the buffer cache and page cache are nearly fully unified. But
:the unification will never be complete until the page size is variable,
:so the remaining tasks done by buffers can be done by pages instead.
:
:Anyway, this part of your OS is more similar than different, which will
:help a lot with a port.

Hey, I like the term fireworks. The VM system is the most complex
piece of the kernel, in both Linux and BSD. I'm not surprised.

BSDs have always had a unified buffer / VM cache, but until recently
VFS's have had to have knowledge of the VM system to implement
VOP_PUTPAGES and VOP_GETPAGES. One of the things I did in DragonFly
was finally finish that unification up so VFSs now only need to
implement VOP_READ and VOP_WRITE, and do not have to implement
VOP_PUTPAGES and VOP_GETPAGES any more.

BSD buffer cache buffers are backed by vm_page_array[], an array of
VM pages, and the base of the buffer does not have to be page-aligned.
So variable page sizes are not necessary.

FreeBSD is about 80% of the way there. They haven't taken the final
steps yet. I don't know about NetBSD and OpenBSD.

:Indeed, I need to handle that. Since I do want to stick with a simple
:linear sequence of logical log commits for now, I do not want to leave
:any of them sitting around for a log time. One easy thing to do is to
:put any orphans aside at rollup time, in an unversioned orphan file, say.

I had a little time to think about it. I think the easiest solution
is, if you are forced to flush an open inode with 0 links, to actually
record the open state in a separate B-Tree so upon remounting after a
crash you can simply scan that B-Tree and finish off the inodes
you find in that tree. That way it would not have to be held in the
log at all as anything other then a transitory entry.

:Linux LVM is all about device mapper, and actually, device mapper does
:not really have a reason to be a separate subsystem, it can just be the
:way the block layer works. If interested, check out my posts on bio
:stacking and rolling device mapper's "clone_and_map" idea up into the
:generic block layer. I think all unixen should do this.

I'll look it up!

:Daniel
:

-Matt
Matthew Dillon


From: Daniel Phillips <phillips@...> Subject: Re: [Tux3] Comparison to Hammer fs design Date: Aug 5, 4:21 am 2008

On Sunday 27 July 2008 20:46, Matthew Dillon wrote:
> :Tux3 never backs anything out, so there is some skew to clear up.
>
> Got it. I'm still not clear on the physical vs logical log
> interactions required for crash recovery, but I think I understand
> everything else.

The reason for my unclear crash recovery algorithm is, it was not fully
designed when we started chatting, only the basic principles of robust
micro-transaction commit. You forced me to go work out the details of
the high level transaction architecture.

The big points you got me focussed on were, what about mass commits a la
fsync, and how do you sort out the ties that arise between otherwise
independent high level operations, and how do you preserve the ordering
of low level operations to reflect the ordering constraints of the high
level operations? Which is what you were trying to say with your create
a, create a/b example, I now realize.

> :> One of the reasons why HAMMER uses an UNDO log is that it is a lot
> :> less expensive then a REDO log. I'll show you why:
> :>
> :> Physical operations related to a single flush group in HAMMER:
> :>
> :> modify element in B-Tree node A (create UNDO w/ original contents of
> :> B-Tree node A)
> :> modify element in B-Tree node A (no new UNDO needs to be created)
> :> modify element in B-Tree node A (no new UNDO needs to be created)
> :> modify element in B-Tree node A (no new UNDO needs to be created)
> :> modify element in B-Tree node A (no new UNDO needs to be created)
> :
> :Nice, now I understand. But do you not have to hold all filesystem
> :transactions that depend on the modification until the btree node has
> :completed writing to disk? With logical logging you only have to wait
> :on the logical commit to complete, into which may be packed a number of
> :other changes for efficiency.
>
> The only major interlock between frontend searches and backend flushes
> occurs on an inode-by-inode basis and only for the duration of the
> modifying B-Tree operation, which does *not* include the writing of
> the modified meta-data buffer to disk.

I am well down path to identifying a similar frontend/backend
separation in Tux3. The back end data consists of all the dirty
metadata and file data cache blocks (buffer cache and page cache(s)
respectively) and the "phase" boundaries, which separate the results of
successive large groups of high level operations so that there are no
disk-image dependencies between them.

I now have two nice tools for achieving such a clean separation between
transaction phases:

1) Logical log records

2) Dirty cache block cloning

The latter takes advantage of logical log records to limit the effect of
a high level operation to just the leaf blocks affected. Where two
operations that are supposed to be in two different transaction phases
affect the same leaf block(s) then the block(s) will be cloned to
achieve proper physical separation. So I can always force a clean
boundary between phases without generating a lot of dirty blocks to do
it, an important property when it comes to anti-deadlock memory
provisions.

The payoff for a clean frontend/backend separation is, as you say, a
clean design. The art in this is to accomplish the separation without
any throughput compromise, which I am willing to believe you have done
in Hammer, and which I now wish to achieve by a somewhat different
means in Tux3.

> Once the operation concludes
> we are left with dirty meta-data buffers which for all intents and
> purposes have transfered the in-memory cached B-Tree records (held
> as separate entities in a red-black tree in-memory) into the media
> view of the B-Tree (having performed the actual B-Tree operations
> necessary to insert or delete the record), and the inode lock is
> then released. The flusher flushes each inode on the flush list in
> turn, with certain long-lived inodes (e.g. truncation of a 1TB
> file to 0) only getting partial goodness and then being moved to
> the next flush list. The inode lock is held for each individual
> inode as it is being flushed, one inode at a time.

I need a means of remembering which blocks are members of which phases,
and an rbtree may be indeed better than the list I was thinking of. A
hash table is probably ok too. I also have to remember the logical
update records that were staged when the high level operation took place
in order to avoid having the btree leaf changes climb up into the index
nodes copy on write style, and to avoid constant changes to the free map
btree.

I can coopt some bits of the buffer flags to keep track of which phase a
given dirty buffer belongs to. To avoid having to attach buffers to
every dirty page cache page, I might use some page flags in that case.
Or that state can be in a secondary bookkeeping structure like the
rbtree or hash. Details to be settled before the kernel coding starts.

> The actual dirty meta-data buffers are flushed later on, without
> any inode lock held and without having any blocking effect on the
> frontend.

Roger, same here. The inode locks and things like rename lock are only
held during the course of the high level operation. Tux3 does not have
to worry about that at all, the VFS takes care of it nicely. Tux3 only
has to lock the data structures that keep track of the phases
(spinlocks) and the btrees embedded in the buffer and page cache
(possibly hashed mutexes).

Note: when I write VFS I mean "virtual filesystem switch" and when you
write VFS you mean "virtual filesystem", which in Linux is just called
a filesystem.

> This media-write sequence occurs at the end, after all
> inodes have been processed. UNDO finishes writing, we wait for
> I/O to complete, we update the volume header, we wait for I/O
> to complete, then we asynchronously flush the dirty media buffers
> (and don't wait for that I/O to complete so the writes can be merged
> with the beginning of the next flush group).

Here, Tux3 (and Hammer) can always set up a very long forward log chain,
only updating the header once per multi-transaction commit phase.

> That is what I meant when I said the frontend was disconnected
> from the backend. The worst the frontend ever does is issue reads
> on media meta-data as part of any merged memory-media B-Tree searches
> it must perform to locate information. Any modifying operations made
> by the frontend will create records in the in-memory red-black tree,
> but will *NOT* do any modifying oprations on meta-data buffers
> representing the on-media B-Tree. Frontend lookups are resolved by
> doing a merged-search with two cursors (one on the in-memory red-black
> tree, and one on the on-media B-Tree).

It is beautiful, now that I understand it.

> I considered doing the interlock at the record level but hit
> some hicups in that one of the cursors may have already passed the
> memory record being committed but the other cursor (the media view)
> may not have yet passed the B-Tree index where that record would be
> written to. So if a commit were to occur right then, without a
> governing inode lock, the merged search might see a duplicate record
> (or no record in either view in the case of a deletion).

Interlock at the micro-transaction level really sucks. There are way
too many interlocks. It is much nicer just to have the phase
boundaries, and not care about the dependencies between micro
transactions withing the transaction phase, knowing that they will all
be satisfied because the entire phase will be committed atomically.

Now here is another way of looking at the difference between undo and
redo. Each of my redo log entries leaves the filesystem meta-metadata
in a consistent state in the sense that all btree structure will be
consistent. But the logical metadata may be inconsistent, for example,
a directory may have appear to have been deleted on disk while some
deletions of files it contained were not persistently recorded. I think
one can take advantage of such partial, low level consistency in order
to do a rapid scan and repair, preserving as much as possible of what
the filesystem was doing before the crash of a busy system with lots of
transactions in flight.

> :> I just run the UNDOs backwards, which allows me to cache the in-flush
> :> undo areas with a fixed set of small throw-away tracking structures
> :> in system memory. If a structure gets thrown away it just means an
> :> extranious UNDO record will be appended to the UNDO FIFO.
> :
> :I think I see it, but I have my doubts because you have to block
> :transactions waiting for the up to date copy of the btree data to land
> :on disk. Either that, or you may give userspace the impression that
> :some transaction has gotten onto stable storage when that is not the
> :case.
>
> Userland doesn't care whether the actual transactions have made it
> to disk or not unless it actually calls fsync(). If it does then
> it will block through the necessary flush cycles. fsync() on
> HAMMER is pretty slow as it will likely be syncing a lot more
> in the flush group then just the requested inode.

Yes, this was the point that finally made me understand how your undo
approach works. Sorry for making you explain it so many times ;-)

> If userland does not call fsync() (or 'sync', etc) then it doesn't
> block at all, except as described above during the modify-buffer
> phase of the B-Tree updates (on an inode-by-inode basis so actual
> blocking is virtually non-existant). It is often possible to copy
> an entire directory tree without initiating a backend flush.

I presume that if one userland process is blocked on fsync, the others
are still able to continue dumping their operations into cache?

> Media B-Tree elements, inclusive of any held in dirty meta-data buffers,
> are locked on a B-Tree node by B-Tree node basis. These are very
> short lived locks and have nothing to do with the actual flushing
> of the dirty meta-data buffers to the physical media. The frontend
> always gets shared locks, the backend always gets exclusive locks.
> Some blocking between the frontend and backend occurs but not a whole
> lot. The backend flusher actually blocks against itself more often
> (it runs multiple threads), due to the locality of reference within
> the B-Tree for nearby inodes.

There is no need for multiple flushing threads in Tux3, I will use bio
submissions instead, which are completely asynchronous, returning
completion status in interrupt mode as they do.

> I'd recommend dropping support for NFSv2. It is not really worth
> supporting any more. Does it even support 64 bit inodes? (I don't
> remember), or 64 bit file offsets? NFSv2 is garbage.

It's a thought. But anyway, I don't see a big downside to the PHTree
approach. It might even end up faster than HTree (which does not
maintain dirent stability) by virtue of the slightly smaller (5-10%)
cache footprint. If classic HTree turns out to be faster, I could offer
both as an option. The code that differs between them should be about
10-20%.

> You should be able to use 63 bits of the cookie, I don't know why
> you wouldn't use the high bit of the lsb 32 bit part. There is no
> requirement that that bit be 0. In fact, the RFC says the cookie is
> a 64 bit unsigned integer and you should be able to use all 64 bits.
> If linux is not allowing all 64 bits to be used then it's a serious
> bug in linux. The cookies are supposed to be opaque, just like the
> file handle.

You are right about the cookie 64 bitness, I don't know where I picked
up that lore, mea culpa.

> :> which means that records in the B-Tree aren't even close to being
> :> in any sort of lexical order. I will probably wind up changing
> :> the hash calculation a bit to get some lexical sorting information
> :> stuffed into it... should be doable since the hash keys are 64 bits.
> :
> :Yes, I noticed that. Check out dx_hack_hash:
> :
> : http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38
> :..
> :It distributed hashes of ascii strings very well for me, with few
> :funnels. It way outperformed some popular hashes like TEA. Ted's cut
> :down crytographic hash is yet more uniform but costs much more CPU.
>
> I like :-). I'll probably end up with multiple directory hash formats
> too before this is over. Fortunately I have room in the inode to
> specify the format the directory will use.

I overlooked the part where you mentioned giving some lexical ordering
to your hash to get better locality between lexically related records.
I am not sure this is a win, it will require a lot of analysis, which
starts to drift off the critical path for me, since crystallizing the
transaction design is really a lot more important.

An inverse idea to making the hash nonrandom is to make the goal inum
depend somewhat on the name hash for a newly created file. So
relatively fewer PHTree index leaf nodes, which contain [hash, dirblock]
pairs, will be have to be reloaded into cache during deletion of a
massive directory in physical dirent order. The goal inum has to depend
mostly on the physical dirent location though, to avoid thrashing the
inode table during mass creation, deletion or stat. I am not sure
whether there is a gain to be had vs keeping the random hash completely
random, which does deliver a benefit in balancing the tree better.

> :Yes, I have thought about it a little more, and I imagine something
> :like a small array of dirty bits that climb up the btree with the help
> :of logical logging, where each bit means that there is something
> :interesting for some corresponding replication target to look at,
> :somewhere down in the subtree.
>
> You can do with one transaction id / mtime field in the B-Tree node
> header. Strictly speaking you don't need a per-element field on
> internal elements, though I have one there too because I had the space.

Yes, the common case of replication is only against the "current"
version and that optimization will work well there. There might be
multiple current versions in a Tux3 volume,

In the case of many virtual servers running off the same filesystem
volume, each with its own writable snapshot, and each replicating
thatsnapshot somewhere, I need to be able to track dirtiness in multiple
versions simultaneously. Or just not worry to much about that and sweep
through the entire inode table btree. Multiple dirty block streams
could be output in one pass in this case.

> :> Not for directory entries. I'm pretty sure Linux uses a throw-away
> :> namecache just like we do, so unless it has a mechanic for flush
> :> sequencing of dirty namecache entries you need in-VFS structural
> :> support for caching the operations.
> :
> :The Linux dentry cache actually implements proper namespace semantics
> :all by itself without needing to flush anything, which is what Ramfs
> :is. Ramfs just lives entirely in cache without even being backed by
> :a ramdisk, until the power goes off.
>
> That's cool. You will want to flush it since you are backing the
> system caches with a real filesystem, and you would probably need
> a tracking structure of some sort anyway to deal with the
> directory entry create/delete dependancies.

It flushes itself out when it is not pinned by something like ramfs for
example, which has no backing store. The VFS takes care of the
create/delete dependencies.

> :> Jeeze, BSD's have been doing that forever. That's what VOP_BMAP is
> :> used for. I'm a little surprised that Linux doesn't do that yet.
> :> I'll expand on that down further.
> :
> :Either you forgot to expand or I missed it. I am interested.
>
> Oops, I think I did forget to expand that. Basically buffer cache
> buffers under kernel management are logically indexed. Each vnode
> has its own tree of buffer cache buffers.
>
> When the kernel wants to flush dirty buffers or cluster adjacent
> buffers together for I/O it uses VOP_BMAP to map the logical offsets
> to physical offsets (so kernel-supported clustering is thus based
> on physical offsets). The VFS is responsible for implementing the BMAP
> operation. It is the BMAP operation which assigns actual physical
> media blocks to a high level logically mapped filesystem buffer.
>
> The mapping is persistent as long as the buffer cache buffer exists.
> Of course clean buffers can be thrown away by the kernel so at some
> point in the future the kernel might re-request the VOP_BMAP for
> that logical offset.
>
> The VFS can choose when to implement the physical mapping. It can
> tell the kernel to take a hike and do it when the kernel actually
> writes the buffer out (being a logical buffer the flush goes through
> the VFS via VOP_STRATEGY), it can do it early on when the actual
> read() or write() opration occurs, etc.

That sounds a lot like the _get_block interface the Linux filesystem
block IO library uses to map buffers to physical addresses. That is just
a set of convenience functions the filesystem can use to satisfy such
vmm calls into the filesystem as ->writepage. The filesystem just has
to supply its own _get_block method. Between that and mount options
parsing, there is not a whole lot more you have to do to bring up a
filesystem on Linux, that is, if you are not really worried about
performance. Mapping each block with a separate call into the
filesystem actually sucks pretty badly CPU wise, but this is masked by
slow spindle speeds that stretch the CPU out over time. That masking
disappears with some of the high performance storage hardware coming
out now and the wasteful effects of that interface start to show.

> BSD buffer cache buffers are backed by vm_page_array[], an array of
> VM pages, and the base of the buffer does not have to be page-aligned.
> So variable page sizes are not necessary.

vm_page_array... is that like a Linux page cache? (aka mapping, aka
address_space) Buffers on page cache pages do not have to be
page-aligned in Linux, or rather, each page in Linux is either fully
covered with buffers or has none at all. What sucks about all this is
having buffers _and_ page descriptors, where the two perform very
similar functions and duplicate a lot of state, causing endless bugs
due to skew in the cached state between the two kinds of objects.

> :Indeed, I need to handle that. Since I do want to stick with a simple
> :linear sequence of logical log commits for now, I do not want to leave
> :any of them sitting around for a log time. One easy thing to do is to
> :put any orphans aside at rollup time, in an unversioned orphan file, say.
>
> I had a little time to think about it. I think the easiest solution
> is, if you are forced to flush an open inode with 0 links, to actually
> record the open state in a separate B-Tree so upon remounting after a
> crash you can simply scan that B-Tree and finish off the inodes
> you find in that tree. That way it would not have to be held in the
> log at all as anything other then a transitory entry.

That is exactly the idea. The question is whether to have a dedicated
btree or to use a file which is also a btree. The orphan file (if that
turns out to be the mechanism) would implement a simple linked list of
orphans, and the transitory log entries would reduce the frequency the
orphan file has to be updated, allowing it to be updated in batches.

This post required a week to respond to because I thought I should also
post some code in parallel.

Regards,

Daniel


From: Matthew Dillon <dillon@...> Subject: Re: [Tux3] Comparison to Hammer fs design Date: Aug 5, 1:33 pm 2008

:The payoff for a clean frontend/backend separation is, as you say, a
:clean design. The art in this is to accomplish the separation without
:any throughput compromise, which I am willing to believe you have done
:in Hammer, and which I now wish to achieve by a somewhat different
:means in Tux3.

It's been a major advantage for HAMMER. The major improvement I made
to make sure throughput was retained was to create a space reservation
subsystem based around a structure called hammer_reserve. This allows
the frontend to reserve space on the media without having to modify
any meta-data or commit to the allocation.

This in turn allows the frontend to reserve space for bulk data writes
and actually perform the writes directly to the media, and then queue
just the meta-data, along with the reservation to the backend. The
backend then finalizes the reservation (doing the actual allocation)
at the same time it lays down the related meta-data.

The result is that you get nice pipelining when writing lots of bulk
data. It is possible to literally write gigabytes of bulk data
before the backend needs to flush any of the related meta-data.

Separating the two out also made testing the filesystem a lot easier.
Most of the frontend could be tested without a working backend.

:Note: when I write VFS I mean "virtual filesystem switch" and when you
:write VFS you mean "virtual filesystem", which in Linux is just called
:a filesystem.

Yah :-)

:Now here is another way of looking at the difference between undo and
:redo. Each of my redo log entries leaves the filesystem meta-metadata
:in a consistent state in the sense that all btree structure will be
:consistent. But the logical metadata may be inconsistent, for example,
:a directory may have appear to have been deleted on disk while some
:deletions of files it contained were not persistently recorded. I think
:one can take advantage of such partial, low level consistency in order
:to do a rapid scan and repair, preserving as much as possible of what
:the filesystem was doing before the crash of a busy system with lots of
:transactions in flight.

Yah, I hit up against similar issues which I resolved with the
'flush group' abstraction, which groups together dependant operations.
If the group would become too large (e.g. you have an arbitrarily long
chain of dependancies), it can break the chain by flushing out the
inode at the break point with an adjusted nlinks count. That way
each flush group can be made logically consistent.

:...
:> phase of the B-Tree updates (on an inode-by-inode basis so actual
:> blocking is virtually non-existant). It is often possible to copy
:> an entire directory tree without initiating a backend flush.
:
:I presume that if one userland process is blocked on fsync, the others
:are still able to continue dumping their operations into cache?

Yah. fsync() just queues the inode to the flusher and the flusher
bangs away at it until its all out, then wakes up the waiting fsync
code.

:...
:> Some blocking between the frontend and backend occurs but not a whole
:> lot. The backend flusher actually blocks against itself more often
:> (it runs multiple threads), due to the locality of reference within
:> the B-Tree for nearby inodes.
:
:There is no need for multiple flushing threads in Tux3, I will use bio
:submissions instead, which are completely asynchronous, returning
:completion status in interrupt mode as they do.

The issue HAMMER had is that the flusher has to perform multiple media
B-Tree operations, which can block on reads. I wound up going with
multiple threads to parallelize the blocked reads.

:It's a thought. But anyway, I don't see a big downside to the PHTree
:approach. It might even end up faster than HTree (which does not
:maintain dirent stability) by virtue of the slightly smaller (5-10%)
:cache footprint. If classic HTree turns out to be faster, I could offer
:both as an option. The code that differs between them should be about
:10-20%.

None of the standards require directory stability within the cached
dirent buffer. That is, if libc caches just the base seek point for
a set of directory entries, there is no requirement that the positioning
of elements beyond the first element at that base seek point be
consistent.

The standards only require stability at explicit seek points.
In fact, the standards explicitly say that trying to seek around inside
a directory after performing a create or delete has undefined operation.
The only thing you can count on is a linear iteration of the
directory.

:> You should be able to use 63 bits of the cookie, I don't know why
:> you wouldn't use the high bit of the lsb 32 bit part. There is no
:> requirement that that bit be 0. In fact, the RFC says the cookie is
:> a 64 bit unsigned integer and you should be able to use all 64 bits.
:> If linux is not allowing all 64 bits to be used then it's a serious
:> bug in linux. The cookies are supposed to be opaque, just like the
:> file handle.
:
:You are right about the cookie 64 bitness, I don't know where I picked
:up that lore, mea culpa.

I seem to recall linux having issues with those bits too, it did strike
a note. I don't remember from where but it was a long time ago.
Frankly, any linuxisms that don't allow the use of at least 63 bits
need to be fixed in linux.

:> I like :-). I'll probably end up with multiple directory hash formats
:> too before this is over. Fortunately I have room in the inode to
:> specify the format the directory will use.
:
:I overlooked the part where you mentioned giving some lexical ordering
:to your hash to get better locality between lexically related records.
:I am not sure this is a win, it will require a lot of analysis, which
:starts to drift off the critical path for me, since crystallizing the
:transaction design is really a lot more important.
:
:An inverse idea to making the hash nonrandom is to make the goal inum
:depend somewhat on the name hash for a newly created file. So
:relatively fewer PHTree index leaf nodes, which contain [hash, dirblock]
:pairs, will be have to be reloaded into cache during deletion of a
:massive directory in physical dirent order. The goal inum has to depend
:mostly on the physical dirent location though, to avoid thrashing the
:inode table during mass creation, deletion or stat. I am not sure
:whether there is a gain to be had vs keeping the random hash completely
:random, which does deliver a benefit in balancing the tree better.

That's an interesting idea.

:> BSD buffer cache buffers are backed by vm_page_array[], an array of
:> VM pages, and the base of the buffer does not have to be page-aligned.
:> So variable page sizes are not necessary.
:
:vm_page_array... is that like a Linux page cache? (aka mapping, aka
:address_space) Buffers on page cache pages do not have to be

Yes.

:page-aligned in Linux, or rather, each page in Linux is either fully
:covered with buffers or has none at all. What sucks about all this is
:having buffers _and_ page descriptors, where the two perform very
:similar functions and duplicate a lot of state, causing endless bugs
:due to skew in the cached state between the two kinds of objects.

BSD buffers can overlap pages. e.g. you can have 4K pages and 6K buffers,
so every other page is overlapped by two buffers. FreeBSD and DragonFly
take care of keeping track what portions of the VM pages are actually
dirty (and it gets more complicated if the pages are mapped into
userland).

:This post required a week to respond to because I thought I should also
:post some code in parallel.
:
:Regards,
:
:Daniel

Don't wear yourself out! This conversation has been fun and I've
gotten a lot of good ideas out of it, plus a much better understanding
of REDO.

-Matt
Matthew Dillon


My head aches

Anonymous (not verified)
on
August 8, 2008 - 8:24am

My head aches by just looking at these posts. These guys are seriously immerged in their work.

It's great to see to people

Anonymous (not verified)
on
August 28, 2008 - 11:39pm

It's great to see to people on what are essentially "competing" operating systems and "competing" filesystem projects will to share the technical and design process.

Precisely the sort of collaboration that attracted me to OSS in the first place.

It's great to see to people

Anonymous (not verified)
on
August 28, 2008 - 11:39pm

It's great to see to people on what are essentially "competing" operating systems and "competing" filesystem projects willing to share the technical and design process.

Precisely the sort of collaboration that attracted me to OSS in the first place.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.