"Work continues to progress well but I've hit a few snags," noted Matthew Dillon [1], referring to the ongoing development of his HAMMER filesystem. He began by highlighting a number of problems with the current design, then adding, "everything else now works, and works well, including and most especially the historical access features." He continued:
"I've come to the conclusion that I am going to have to make a fairly radical change to the on-disk structures to solve these problems. On the plus side, these changes will greatly simplify the filesystem topology and greatly reduce its complexity. On the negative side, recovering space will not be instantaneous and will basically require data to be copied from one part of the filesystem to another."
Matt detailed his solution, which included getting rid of the previously described clusters, super-clusters, A-lists, and per-cluster B-Tree's, "instead have just one global B-Tree for the entire filesystem, able to access any record anywhere in the filesystem", adding that the filesystem would be implemented "as one big huge circular FIFO, pretty much laid down linearly on disk, with a B-Tree to locate and access data." He detailed the many improvements, noting that this also makes it possible to provide efficient real-time mirroring. He concluded, "it will probably take a week or two to rewire the topology and another week or two to debug it. Despite the massive rewiring, the new model is much, MUCH simpler then the old, and all the B-Tree code is retained (just extended to operate across the entire filesystem instead of just within a single cluster)."
From: Matthew Dillon <dillon@...>
Subject: HAMMER update 06-Feb-2008
[1]Date: Feb 6, 5:55 pm 2008
Work continues to progress well but I've hit a few snags:
* My current recovery model is just not working for cross-cluster
transactions.
* I can't fix the write performance issue due to the way the cluster
localization and spike code works.
* Handling nearly full filesystems is proving to be a problem due to
the inability to estimate space requirements which in turn are
related to the cluster localization the current model uses.
* Efficient real-time mirroring is just not easy to do w/ the current
cluster-localized model.
Everything else now works, and works well, including and most especially
the historical access features. I've already fallen in love with being
able to create a softlink to a '@@0x<timestamp>' string to access
snapshots.
I've come to the conclusion that I am going to have to make a fairly
radical change to the on-disk structures to solve these problems. On
the plus side, these changes will greatly simplify the filesystem topology
and greatly reduce its complexity. On the negative side, recovering
space will not be instantanious and will basically require data to
be copied from one part of the filesystem to another. My take is that
since large filesystems (our target) do not usually fill up very quickly,
and indeed it can take hours even if you are writing 60MB/sec to the disk,
It is well worth taking a hit on free space recovery if it accomplishes
all of the filesystem's other goals.
This may seem a bit radical as solutions go, but the more I look at it
the more I like it because it really does neatly tie up every single
remaining issue.
* Get rid of super-clusters, clusters, and typed buffers. Get rid of them
entirely. Just have an array of buffers after the volume header.
* Get rid of the A-list's (the recursive bitmap radix tree allocation
model). Poof, gone. All of them, gone. Have no random block
allocation model at all.
* Get rid of the per-cluster B-Tree's. Instead have just one global
B-Tree for the entire filesystem, able to access any record anywhere
in the filesystem (currently a per-cluster B-Tree can only access
information within that cluster).
* Implement the filesystem as one big huge circular FIFO, pretty much
laid down linearly on disk, with a B-Tree to locate and access data.
* Random modifications (required for manipulating the B-Tree and marking
records as deleted) will append undo records to this FIFO and the only
write ordering requirement will be that the buffers containing these
undo records be committed before the buffers containing the random
modifications are committed.
This completely solves the crash recovery issue, regardless of the
size of a transaction. If you crash half way through doing a 200MB
write() it will be able to unwind the whole thing.
This (plus the B-Tree and layout changes) completely solves the
write performance issue.
* Physical recovery of 'disk space' in the filesystem requires advancing
the circular FIFO's beginning index, which means copying data we
wish to retain from the front of the FIFO to the end and discarding
data at the head that we wish to destroy.
This is the one big gotcha to the model, but the key point here is
that the copying can occur off-peak, or whenever disk bandwidth is
available, and can operate on large linear swaths of disk at once
(with some random async writes for B-Tree fixups).
* Since information is laid down linearly, on-disk space can be reserved
for in-memory buffer and record caches, solving the issue of how
to deal with nearly full filesystems.
* Ultimately physical recovery can be optimized by introducing a
large-block allocation model 'behind' the circular FIFO abstraction.
This will solve the issue of dealing with bad blocks on the disk.
* The filesystem can be recovered from horrible data loss basically by
doing a copy-scan to first remove all auxillary data (like B-Tree nodes
and older undo records that are no longer needed), and then regenerating
the B-Tree from scratch for the entire filesystem.
This would of course take a long time but it would only be needed to
recover data after a head crash, where physical corruption might be
present.
* This also neatly solves real-time mirroring issues. A real-time mirror
can simply track the FIFO linearly, and can record its FIFO index if
interrupted to pick up where it left off later on. You can't get
much better then that.
I had given up on efficient real time mirroring when I settled on the
localized cluster model. Now its back.
It will probably take a week or two to rewire the topology and another
week or two to debug it. Despite the massive rewiring, the new model
is much, MUCH simpler then the old, and all the B-Tree code is retained
(just extended to operate across the entire filesystem instead of just
within a single cluster).
-Matt
Matthew Dillon
<dillon@backplane.com>From: Simon 'corecode' Schubert <corecode@...> Subject: Re: HAMMER update 06-Feb-2008 [1]Date: Feb 6, 7:09 pm 2008 Matthew Dillon wrote: > * Implement the filesystem as one big huge circular FIFO, pretty much > laid down linearly on disk, with a B-Tree to locate and access data. > > * Random modifications (required for manipulating the B-Tree and marking > records as deleted) will append undo records to this FIFO and the only > write ordering requirement will be that the buffers containing these > undo records be committed before the buffers containing the random > modifications are committed. This sounds quite like LFS now. LFS however split the volume in smaller blocks which could be "used", "empty" or "open", IIRC. Their background cleaner then could push remaining data from used blocks to a currently open one, marking the block "empty" after that, allowing the FS to write to the blocks again. They always had problems with the cleaner, I think. cheers simon -- Serve - BSD +++ RENT this banner advert +++ ASCII Ribbon /"\ Work - Mac +++ space for low €€€ NOW!1 +++ Campaign \ / Party Enjoy Relax | http://dragonflybsd.org [2] Against HTML \ Dude 2c 2 the max ! http://golden-apple.biz [3] Mail + News / \
From: Michael Neumann <mneumann@...> Subject: Re: HAMMER update 06-Feb-2008 [3]Date: Feb 6, 8:02 pm 2008 Simon 'corecode' Schubert wrote: > Matthew Dillon wrote: >> * Implement the filesystem as one big huge circular FIFO, pretty much >> laid down linearly on disk, with a B-Tree to locate and access >> data. >> >> * Random modifications (required for manipulating the B-Tree and >> marking >> records as deleted) will append undo records to this FIFO and >> the only >> write ordering requirement will be that the buffers containing >> these >> undo records be committed before the buffers containing the >> random modifications are committed. > > This sounds quite like LFS now. LFS however split the volume in smaller > blocks which could be "used", "empty" or "open", IIRC. Their background > cleaner then could push remaining data from used blocks to a currently > open one, marking the block "empty" after that, allowing the FS to write > to the blocks again. How about "Generational Garbarge Collection"? Assuming that there are some files that will never be deleted this could give slighly better performance. Keep a "copy count" (a copy occurs if the cleaner has to copy data from the left end to the right end of the FIFO). If that increases over, say 3, copy it into the old generation FIFO. One problem of course is how to dimension each generation, and how many to use. I think that's basically how LFS works. Regards, Michael
From: Ivan Voras <ivoras@...> Subject: Re: HAMMER update 06-Feb-2008 [3]Date: Feb 6, 8:05 pm 2008 Matthew Dillon wrote: > * Get rid of the A-list's (the recursive bitmap radix tree allocati= on > model). Poof, gone. All of them, gone. Have no random block > allocation model at all. >=20 > * Get rid of the per-cluster B-Tree's. Instead have just one globa= l > B-Tree for the entire filesystem, able to access any record anywh= ere > in the filesystem (currently a per-cluster B-Tree can only access= > information within that cluster). >=20 > * Implement the filesystem as one big huge circular FIFO, pretty mu= ch > laid down linearly on disk, with a B-Tree to locate and access da= ta. How will this affect parallel IO (reads, but especially writes)? Would=20 having such a global structure serialize it? (I'm assuming, possibly=20 wrongly, that having trees per-cluster allowed you to lock individual=20 clusters).
From: Matthew Dillon <dillon@...> Subject: Re: HAMMER update 06-Feb-2008 [3]Date: Feb 6, 9:15 pm 2008 :How will this affect parallel IO (reads, but especially writes)? Would=20 :having such a global structure serialize it? (I'm assuming, possibly=20 :wrongly, that having trees per-cluster allowed you to lock individual=20 :clusters). Reads will not be effected at all... the locking occurs at the B-Tree node layer. Writes will not be serialized and will still be asynchronous so the most typical striping setups on multi-disk filesystems should still yield very high performance. Writes WILL be far more likely to be sequential which should actually improve write performance. Also keep in mind that writes are buffered by the buffer cache, so there is a caching layer between userland and the physical disk. Mixed data writes (parallel write operations by multiple processes in different parts of the filesystem) will generally lay down new information sequentially on disk, which can be detrimental for read performance since the individual files will not be entirely sequential. I seem to recall a paper at a USENIX long ago where someone tested locality of reference for reads after laying down writes from parallel sources sequentially, and it was no worse then trying to zone the disparate writes, so I'm not really worried about this case. Also, once you get over a track or two's worth of data, it costs about the same to seek 3 tracks as it does to seek 10 tracks, so as long as writes are not *completely* strewn about due to lots of parallel write activity occuring, it shouldn't be a problem. They won't be because writes are cached in the buffer cache prior to being flushed out. We should get nice long bursts of sequentially ordered data on disk. -- I don't like to think that I wasted a ton of time building the cluster mechanism, and its kinda sad to see so much code removed. But most of the work over the last few months has been B-Tree centric, implementing the inode cache, high level VOPs, record structures, etc... and those parts of the codebase remain intact. It really got to the point where implementing the last bits was starting to take way way too much time. When things start to take that much time to do, I know I've made a mistake somewhere in the design. Better to fix it now then to try to slog through the complexity later on. -Matt Matthew Dillon <dillon@backplane.com>
From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER update 06-Feb-2008
[3]Date: Feb 6, 9:40 pm 2008
:This sounds quite like LFS now. LFS however split the volume in smaller =
:
:blocks which could be "used", "empty" or "open", IIRC. Their background =
:
:cleaner then could push remaining data from used blocks to a currently=20
:open one, marking the block "empty" after that, allowing the FS to write =
:
:to the blocks again.
:
:They always had problems with the cleaner, I think.
:
:cheers
: simon
It's the 'pushing data around' action that causes the most problems
with recovery and cleaning, or more specifically, recovery after you
crash while cleaning. It can get messy really fast, especially if you
are trying to manage bitmaps.
With the cluster idea I got stuck the moment a transaction spanned
more the one cluster. I suddenly had two recovery domains related
to one transaction, and I couldn't find a solution to it.
A filesystem-wide FIFO abstraction solves all three problems by
providing exactly one synchronization point (the circular fifo's indices)
for the entire filesystem.
I really do think that the performance issues with cleaning HAMMER
will be addressable through the use of a very-large-block blockmap,
or by implementing a set of named very-large blocks. I have some
other ideas as well, such as treating the data associated with records
as an out-of-band entity that is not made part of the FIFO itself.
One last thing that must be considered is real time mirroring. There's
something to be said for having a single logical FIFO for the entire
filesystem in a clustered environment. It makes non-queued mirroring
ridiculously easy to implement.
-MattFrom: Matthew Dillon <dillon@...>
Subject: Re: HAMMER update 06-Feb-2008
[3]Date: Feb 6, 9:25 pm 2008
:How about "Generational Garbarge Collection"? Assuming that there are
:some files that will never be deleted this could give slighly better
:performance.
:
:Keep a "copy count" (a copy occurs if the cleaner has to copy data from
:the left end to the right end of the FIFO). If that increases over, say
:3, copy it into the old generation FIFO.
:
:One problem of course is how to dimension each generation, and how many
:to use.
:
:I think that's basically how LFS works.
:
:Regards,
:
: Michael
The idea I have is turn the physical layout of the disk into a
logical linear layout. That is, instead of actually laying the FIFO
down linearly on the physical disk, instead create a very large backing
blockmap that allows large (say 16MB) chunks of the disk to be shifted
around simply by adjusting the block map.
Now the FIFO copying can simply shift the underlying block map to move
a 16MB block from the beginning of the FIFO to the end of the FIFO
if no garbage collection is otherwise required.
One can also have the idea of 'named very large blocks', where each one
represents a separate recovery zone. This is like a blockmap allocation
mechanism that allows you to move the block from the beginning of the
FIFO to the end without having to rewire the references to elements
within that block.
I think the initial implementation has to just work with the disk. I
am making sure my design is condusive to these sorts of optimizations
but I am not going to try to implement a blockmap immediately.
What is becoming very clear to me now that I'm actually coding this stuff,
is that my original clustering idea was simply too highly integrated into
the general filesystem operations, requiring complex code all over the
place. The new idea is far, FAR less complex and yet as we see there are
plenty of layer-based solutions that can solve the copy case issues
without requiring the filesystem core to have much knowledge about the
additional layer.
-Matt
Matthew Dillon
<dillon@backplane.com>Related links:
- Archive of above thread [3]
- Archive of above thread [3]
- Archive of above thread [3]
- Archive of above thread [3]