"Work continues to progress well but I've hit a few snags," noted Matthew Dillon, referring to the ongoing development of his HAMMER filesystem. He began by highlighting a number of problems with the current design, then adding, "everything else now works, and works well, including and most especially the historical access features." He continued:
"I've come to the conclusion that I am going to have to make a fairly radical change to the on-disk structures to solve these problems. On the plus side, these changes will greatly simplify the filesystem topology and greatly reduce its complexity. On the negative side, recovering space will not be instantaneous and will basically require data to be copied from one part of the filesystem to another."
Matt detailed his solution, which included getting rid of the previously described clusters, super-clusters, A-lists, and per-cluster B-Tree's, "instead have just one global B-Tree for the entire filesystem, able to access any record anywhere in the filesystem", adding that the filesystem would be implemented "as one big huge circular FIFO, pretty much laid down linearly on disk, with a B-Tree to locate and access data." He detailed the many improvements, noting that this also makes it possible to provide efficient real-time mirroring. He concluded, "it will probably take a week or two to rewire the topology and another week or two to debug it. Despite the massive rewiring, the new model is much, MUCH simpler then the old, and all the B-Tree code is retained (just extended to operate across the entire filesystem instead of just within a single cluster)."
From: Matthew Dillon <dillon@...> Subject: HAMMER update 06-Feb-2008 Date: Feb 6, 5:55 pm 2008Work continues to progress well but I've hit a few snags:
* My current recovery model is just not working for cross-cluster
transactions.* I can't fix the write performance issue due to the way the cluster
localization and spike code works.* Handling nearly full filesystems is proving to be a problem due to
the inability to estimate space requirements which in turn are
related to the cluster localization the current model uses.* Efficient real-time mirroring is just not easy to do w/ the current
cluster-localized model.Everything else now works, and works well, including and most especially
the historical access features. I've already fallen in love with being
able to create a softlink to a '@@0x' string to access
snapshots.I've come to the conclusion that I am going to have to make a fairly
radical change to the on-disk structures to solve these problems. On
the plus side, these changes will greatly simplify the filesystem topology
and greatly reduce its complexity. On the negative side, recovering
space will not be instantanious and will basically require data to
be copied from one part of the filesystem to another. My take is that
since large filesystems (our target) do not usually fill up very quickly,
and indeed it can take hours even if you are writing 60MB/sec to the disk,
It is well worth taking a hit on free space recovery if it accomplishes
all of the filesystem's other goals.This may seem a bit radical as solutions go, but the more I look at it
the more I like it because it really does neatly tie up every single
remaining issue.* Get rid of super-clusters, clusters, and typed buffers. Get rid of them
entirely. Just have an array of buffers after the volume header.* Get rid of the A-list's (the recursive bitmap radix tree allocation
model). Poof, gone. All of them, gone. Have no random block
allocation model at all.* Get rid of the per-cluster B-Tree's. Instead have just one global
B-Tree for the entire filesystem, able to access any record anywhere
in the filesystem (currently a per-cluster B-Tree can only access
information within that cluster).* Implement the filesystem as one big huge circular FIFO, pretty much
laid down linearly on disk, with a B-Tree to locate and access data.* Random modifications (required for manipulating the B-Tree and marking
records as deleted) will append undo records to this FIFO and the only
write ordering requirement will be that the buffers containing these
undo records be committed before the buffers containing the random
modifications are committed.This completely solves the crash recovery issue, regardless of the
size of a transaction. If you crash half way through doing a 200MB
write() it will be able to unwind the whole thing.This (plus the B-Tree and layout changes) completely solves the
write performance issue.* Physical recovery of 'disk space' in the filesystem requires advancing
the circular FIFO's beginning index, which means copying data we
wish to retain from the front of the FIFO to the end and discarding
data at the head that we wish to destroy.This is the one big gotcha to the model, but the key point here is
that the copying can occur off-peak, or whenever disk bandwidth is
available, and can operate on large linear swaths of disk at once
(with some random async writes for B-Tree fixups).* Since information is laid down linearly, on-disk space can be reserved
for in-memory buffer and record caches, solving the issue of how
to deal with nearly full filesystems.* Ultimately physical recovery can be optimized by introducing a
large-block allocation model 'behind' the circular FIFO abstraction.This will solve the issue of dealing with bad blocks on the disk.
* The filesystem can be recovered from horrible data loss basically by
doing a copy-scan to first remove all auxillary data (like B-Tree nodes
and older undo records that are no longer needed), and then regenerating
the B-Tree from scratch for the entire filesystem.This would of course take a long time but it would only be needed to
recover data after a head crash, where physical corruption might be
present.* This also neatly solves real-time mirroring issues. A real-time mirror
can simply track the FIFO linearly, and can record its FIFO index if
interrupted to pick up where it left off later on. You can't get
much better then that.I had given up on efficient real time mirroring when I settled on the
localized cluster model. Now its back.It will probably take a week or two to rewire the topology and another
week or two to debug it. Despite the massive rewiring, the new model
is much, MUCH simpler then the old, and all the B-Tree code is retained
(just extended to operate across the entire filesystem instead of just
within a single cluster).-Matt
Matthew Dillon
From: Simon 'corecode' Schubert <corecode@...> Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 7:09 pm 2008Matthew Dillon wrote:
> * Implement the filesystem as one big huge circular FIFO, pretty much
> laid down linearly on disk, with a B-Tree to locate and access data.
>
> * Random modifications (required for manipulating the B-Tree and marking
> records as deleted) will append undo records to this FIFO and the only
> write ordering requirement will be that the buffers containing these
> undo records be committed before the buffers containing the random
> modifications are committed.This sounds quite like LFS now. LFS however split the volume in smaller
blocks which could be "used", "empty" or "open", IIRC. Their background
cleaner then could push remaining data from used blocks to a currently
open one, marking the block "empty" after that, allowing the FS to write
to the blocks again.They always had problems with the cleaner, I think.
cheers
simon--
Serve - BSD +++ RENT this banner advert +++ ASCII Ribbon /"\
Work - Mac +++ space for low €€€ NOW!1 +++ Campaign \ /
Party Enjoy Relax | http://dragonflybsd.org Against HTML \
Dude 2c 2 the max ! http://golden-apple.biz Mail + News / \
From: Michael Neumann <mneumann@...> Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 8:02 pm 2008Simon 'corecode' Schubert wrote:
> Matthew Dillon wrote:
>> * Implement the filesystem as one big huge circular FIFO, pretty much
>> laid down linearly on disk, with a B-Tree to locate and access
>> data.
>>
>> * Random modifications (required for manipulating the B-Tree and
>> marking
>> records as deleted) will append undo records to this FIFO and
>> the only
>> write ordering requirement will be that the buffers containing
>> these
>> undo records be committed before the buffers containing the
>> random modifications are committed.
>
> This sounds quite like LFS now. LFS however split the volume in smaller
> blocks which could be "used", "empty" or "open", IIRC. Their background
> cleaner then could push remaining data from used blocks to a currently
> open one, marking the block "empty" after that, allowing the FS to write
> to the blocks again.How about "Generational Garbarge Collection"? Assuming that there are
some files that will never be deleted this could give slighly better
performance.Keep a "copy count" (a copy occurs if the cleaner has to copy data from
the left end to the right end of the FIFO). If that increases over, say
3, copy it into the old generation FIFO.One problem of course is how to dimension each generation, and how many
to use.I think that's basically how LFS works.
Regards,
Michael
From: Ivan Voras <ivoras@...> Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 8:05 pm 2008Matthew Dillon wrote:
> * Get rid of the A-list's (the recursive bitmap radix tree allocati=
on
> model). Poof, gone. All of them, gone. Have no random block
> allocation model at all.
>=20
> * Get rid of the per-cluster B-Tree's. Instead have just one globa=
l
> B-Tree for the entire filesystem, able to access any record anywh=
ere
> in the filesystem (currently a per-cluster B-Tree can only access=> information within that cluster).
>=20
> * Implement the filesystem as one big huge circular FIFO, pretty mu=
ch
> laid down linearly on disk, with a B-Tree to locate and access da=
ta.How will this affect parallel IO (reads, but especially writes)? Would=20
having such a global structure serialize it? (I'm assuming, possibly=20
wrongly, that having trees per-cluster allowed you to lock individual=20
clusters).
From: Matthew Dillon <dillon@...> Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 9:15 pm 2008:How will this affect parallel IO (reads, but especially writes)? Would=20
:having such a global structure serialize it? (I'm assuming, possibly=20
:wrongly, that having trees per-cluster allowed you to lock individual=20
:clusters).Reads will not be effected at all... the locking occurs at the B-Tree
node layer.Writes will not be serialized and will still be asynchronous so the
most typical striping setups on multi-disk filesystems should still
yield very high performance. Writes WILL be far more likely to be
sequential which should actually improve write performance. Also
keep in mind that writes are buffered by the buffer cache, so there
is a caching layer between userland and the physical disk.Mixed data writes (parallel write operations by multiple processes in
different parts of the filesystem) will generally lay down new
information sequentially on disk, which can be detrimental for read
performance since the individual files will not be entirely sequential.
I seem to recall a paper at a USENIX long ago where someone tested
locality of reference for reads after laying down writes from
parallel sources sequentially, and it was no worse then trying to zone
the disparate writes, so I'm not really worried about this case.Also, once you get over a track or two's worth of data, it costs about
the same to seek 3 tracks as it does to seek 10 tracks, so as long as
writes are not *completely* strewn about due to lots of parallel write
activity occuring, it shouldn't be a problem. They won't be because
writes are cached in the buffer cache prior to being flushed out. We
should get nice long bursts of sequentially ordered data on disk.--
I don't like to think that I wasted a ton of time building the
cluster mechanism, and its kinda sad to see so much code removed. But
most of the work over the last few months has been B-Tree centric,
implementing the inode cache, high level VOPs, record structures, etc...
and those parts of the codebase remain intact.It really got to the point where implementing the last bits was starting
to take way way too much time. When things start to take that much time
to do, I know I've made a mistake somewhere in the design. Better to
fix it now then to try to slog through the complexity later on.-Matt
Matthew Dillon
From: Matthew Dillon <dillon@...> Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 9:40 pm 2008:This sounds quite like LFS now. LFS however split the volume in smaller =
:
:blocks which could be "used", "empty" or "open", IIRC. Their background =
:
:cleaner then could push remaining data from used blocks to a currently=20
:open one, marking the block "empty" after that, allowing the FS to write =
:
:to the blocks again.
:
:They always had problems with the cleaner, I think.
:
:cheers
: simonIt's the 'pushing data around' action that causes the most problems
with recovery and cleaning, or more specifically, recovery after you
crash while cleaning. It can get messy really fast, especially if you
are trying to manage bitmaps.With the cluster idea I got stuck the moment a transaction spanned
more the one cluster. I suddenly had two recovery domains related
to one transaction, and I couldn't find a solution to it.A filesystem-wide FIFO abstraction solves all three problems by
providing exactly one synchronization point (the circular fifo's indices)
for the entire filesystem.I really do think that the performance issues with cleaning HAMMER
will be addressable through the use of a very-large-block blockmap,
or by implementing a set of named very-large blocks. I have some
other ideas as well, such as treating the data associated with records
as an out-of-band entity that is not made part of the FIFO itself.One last thing that must be considered is real time mirroring. There's
something to be said for having a single logical FIFO for the entire
filesystem in a clustered environment. It makes non-queued mirroring
ridiculously easy to implement.-Matt
From: Matthew Dillon <dillon@...> Subject: Re: HAMMER update 06-Feb-2008 Date: Feb 6, 9:25 pm 2008:How about "Generational Garbarge Collection"? Assuming that there are
:some files that will never be deleted this could give slighly better
:performance.
:
:Keep a "copy count" (a copy occurs if the cleaner has to copy data from
:the left end to the right end of the FIFO). If that increases over, say
:3, copy it into the old generation FIFO.
:
:One problem of course is how to dimension each generation, and how many
:to use.
:
:I think that's basically how LFS works.
:
:Regards,
:
: MichaelThe idea I have is turn the physical layout of the disk into a
logical linear layout. That is, instead of actually laying the FIFO
down linearly on the physical disk, instead create a very large backing
blockmap that allows large (say 16MB) chunks of the disk to be shifted
around simply by adjusting the block map.Now the FIFO copying can simply shift the underlying block map to move
a 16MB block from the beginning of the FIFO to the end of the FIFO
if no garbage collection is otherwise required.One can also have the idea of 'named very large blocks', where each one
represents a separate recovery zone. This is like a blockmap allocation
mechanism that allows you to move the block from the beginning of the
FIFO to the end without having to rewire the references to elements
within that block.I think the initial implementation has to just work with the disk. I
am making sure my design is condusive to these sorts of optimizations
but I am not going to try to implement a blockmap immediately.What is becoming very clear to me now that I'm actually coding this stuff,
is that my original clustering idea was simply too highly integrated into
the general filesystem operations, requiring complex code all over the
place. The new idea is far, FAR less complex and yet as we see there are
plenty of layer-based solutions that can solve the copy case issues
without requiring the filesystem core to have much knowledge about the
additional layer.-Matt
Matthew Dillon

the hammer also rises?
"instead have just one global B-Tree for the entire filesystem, able to access any record anywhere in the filesystem"
Reiser FS?
Ever needed to force fsck a
Ever needed to force fsck a reiserfs volume that contains deleted backup images of other reiserfs volumes? Do you know what happens?
Nobody knows that
>Do you know what happens?
Only reiserfsck knows the answer, but sure they are Very Bad Things.
I thought Reiser4 solved these issues. XFS uses a similar structure.
I think it's not fair to blame an structure based on a bad designed filesystem using it.
You're right,
You're right, but I didn't blame the structure. The original poster referred to reiserfs.
All programming languages
All programming languages evolve towards lisp, and now all filesystems evolve towards Reiser FS?
Between Hammer and Btrfs,
Between Hammer and Btrfs, things are looking exciting in the Linux filesystem world! The lack of a ZFS port now doesn't seem so bad. :-)
HAMMER is for DragonflyBSD,
HAMMER is for DragonflyBSD, not GNU/Linux.
If you're talking about the
If you're talking about the kernel , you can definitely lose the "GNU/" part. Even accoring to rms.
It can be ported to Linux.
It can be ported to Linux.