"HAMMER won't be ready for sure (things take however long they take), but the hardest part is working and stable and I'm just down to garbage collection and crash recovery," noted Matthew Dillon, discussing the status of what is ultimately intended to be a highly available clustering filesystem. The upcoming DragonFlyBSD release this month was originally intended to be 2.0 with a beta quality HAMMER, but the decision was recently made to call the release 1.12 while HAMMER continues to stabilize. Matt continued, "HAMMER is really shaping up now. Here's what works now: all filesystem operations; all historical operations; all Pruning features". During the discussion, he was asked how he planned to support multi-master replication, in reply to which he began:
"My current plan is to use a quorum algorithm similar to the one I wrote for the backplane database years ago. But there are really two major (and very complex) pieces to the puzzle. Not only do we need a quorum algorithm, but we need a distributed cache coherency algorithm as well. With those two pieces individual machines will be able to proactively cache filesystem data and guarantee transactional consistency across the cluster."
From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER update 06-Feb-2008
[0]Date: Feb 8, 8:22 pm 2008
:There's always the option of releasing a 1.12 version now (it's not like
:there haven't been enough changes to justify a new release). The 2.0 release
:is likely to get a lot of downloads, so I think shipping it with a pre-alpha
:hammer is a waste of an opportunity to attract more people. Not to mention
:that it's hard to put a time bound on this kind of development. So, let's
:just admit that and ship 2.0 as soon as Matt declares it ready for beta
:testing, regardless of what time of year it is. A real beta-state hammer
:justifies a 2.0 release on its own IMHO. Also, this will let Matt work on
:hammer without any tight deadlines.
:
:Aggelos
Yes, and I've agonized over this very possibility. Maybe the thing
to do is to poll the people on kernel@. HAMMER won't be ready for
sure (things take however long they take), but the hardest part of it
is working and stable and I'm just down to garbage collection and
crash recovery. Crazily enough, that is what all the major surgery
yesterday, and the continuing work, is about.
So what do people think? Should this month's release be 1.12 or 2.0 ?
-Matt
Matthew Dillon
<dillon@backplane.com>From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER update 06-Feb-2008
[0]Date: Feb 10, 5:00 pm 2008
Ok, it's going to be 1.12. As much as I hate 2.0 being so late,
there's no point calling it 2.0 without a production-ready HAMMER.
Considering how few items are left on the list for HAMMER, it will
likely be production-ready long before the mid-year release.
-Matt
Matthew Dillon
<dillon@backplane.com>From: Matthew Dillon <dillon@...>
Subject: 1.12 Branching complete
[0]Date: Feb 11, 10:58 pm 2008
1.12 has been branched! I've done a rough nrelease build with HEAD
but haven't tested that a nrelease build with 1.12 produces the correct
version information yet. Other then that it went pretty smoothly.
There were no installer issues once I completely regenerated the
binary packages. I have updated the nrelease Makefile to use the
new binary packages (including a completely new bootstrap), for fresh
installs.
-MattFrom: Matthew Dillon <dillon@...>
Subject: HAMMER update 10-Feb-2008
[0]Date: Feb 10, 5:19 pm 2008
HAMMER is really shaping up now. Here's what works now:
* All filesystem operations
* All historical operations
* All Pruning features
Here's what is left:
* freemap code (allocate and free big-blocks, which are 8MB blocks).
Currently a hack so everything else can be tested, nothing is
actually freed.
* undo fifo and related recovery code. Most of the API calls are
in place, the back-end buffer reservation, flushes, and recovery
need to be implemented.
* big-block cleaning code (this is different from the pruning code).
* Structural locking. The B-Tree is fine-grained locked but the
locks for the blockmap are just a hack (one big lock).
These are all fairly low difficulty items, most of the infrastructure
needed to support their function is already in place and the FIFO
infrastructure has already been tested (just not mapped onto a blockmap
yet).
I have already run some tests with regards to the blockmap allocation
model and it looks very good. What I did was implement an array of
blockmap entry structures rather then just an array of pointers to the
actual physical big-blocks. The blockmap entry structure not only has
a pointer to the underlying physical big-block, it also has a
bytes_free field which specifies how many bytes in the underlying
big-block are free.
This is the only tracking done by the blockmap. It does not actually
try to track WHERE in the big-block the free areas are... figuring
that out will be up to the cleaning code. What this gives us is the
following:
* Extremely fast freeing of on-disk storage elements. The target
physical block doesn't have to be read or written, only the governing
blockmap entry. With 8MB big-blocks and 32-byte blockmap entries one
16K buffer can track 4GB worth of underlying storage, which means
that freeing large amounts of sparse information does not cause the
disk to seek all over the place.
This is far, FAR better then the cluster model I was using last week
and had to throw away. Massively better. Like night and day.
* The all-free case can be detected and used to immediately return a
completely-free bigblock to the free pool. I've done some testing
and what this means is that removing large files or medium-sized
sub-trees WILL in fact result in some immediate gratification.
The space freed from sparse removal and pruning will take time
to actually become reusable as the cleaning code will have to go
through and finish cleaning out the big-block(s) in question.
* The cleaning code is not complicated in the least. All it needs to
do is scan the B-Tree and check the blockmap entries for related
references. If the associated big-block has greater then a certain
percentage of space free, the cleaning code will attempt to pack
the remaining data (as it comes across it in the B-Tree) into a new
block. Since the B-Tree elements and records must be manipulated
no matter which side you approach cleaning and packing from, this is
no more difficult then trying to reverse engineer the remaining
contents of a big-block.
-Matt
Matthew Dillon
<dillon@backplane.com>
From: walt <wa1ter@...>
Subject: Re: HAMMER update 10-Feb-2008
[0]Date: Feb 10, 9:41 pm 2008
Matthew Dillon wrote:
> HAMMER is really shaping up now...
> This is far, FAR better then the cluster model I was using last week
> and had to throw away. Massively better. Like night and day.
I'm fond of telling political hotheads (which you once were, but no longer ;o)
that, before they destroy the system devised by their predecessors, they owe
it to themselves to stop and find out exactly what problems their predecessors
thought they were solving when they invented the existing system.
So -- in this instance you are both the establishment and the revolutionary at
the same time. Could you explain in extra-bonehead language what problems you
were solving with the cluster model, and if you are still solving those problems
with the newer model?
Thanks!
From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER update 10-Feb-2008
[0]Date: Feb 10, 11:21 pm 2008
:So -- in this instance you are both the establishment and the revolutionary at
:the same time. Could you explain in extra-bonehead language what problems you
:were solving with the cluster model, and if you are still solving those problems
:with the newer model?
:
:Thanks!
The three major pieces didn't integrate together well enough. It's
really that simple. Individually they were great. Together they
were not.
* Small (64MB) clusters with localized B-Trees.
Plus: Recovering the filesystem from a massive failure not difficult.
Minus: Data and records cannot roam about, they have to be placed in
the particular cluster covering their key.
* Fine-grained A-list allocator.
Plus: Fine grained allocation and freeing.
Minus: It didn't help the problem of dead space in the cluster, or
for that matter help the issue of dealing with full clusters,
because particular records and data had to go into particular
clusters.
* Per-cluster recovery mechanic.
Plus: Very easy to recover a fixed portion of the B-Tree.
Minus: Couldn't cross cluster boundaries. Couldn't deal with
transactions that crossed cluster boundaries. Requires strict
write ordering and for clusters to be marked 'open' (virtually
synchronously it turned out) to even detect that recovery was needed.
I still would have needed to implement an UNDO FIFO on top of
everything else, and despite the per-buffer locality of reference
the model had it still made an excessive number of little modifications
all over the buffer that would have made the UNDO FIFO rather
expensive.
So I had a fine-grained allocator that couldn't contribute to solving
the issue of what to do when clusters were too full or too empty, a
recovery mechanic that couldn't guarantee consistency for transactions
which spanned more then one cluster, and clusters that wound up being
too small (64MB) and caused management I/O to become too dispersed.
Add to that the fact that the model required an excessive number of
special cases in-code. Do a cvs diff just of the B-Tree and cursor
code from before I ripped it out to now and you will see the new code
is much, much cleaner, with virtually no special cases.
--
The new model works a lot a better. Fine-grained allocation through
what will effectively be a sequential (blockmap) index. Extremely
good localization and packing of data (even better then the cluster
model had, and the cluster model was pretty good in that regard).
The ability (due to the blockmap) to rearrange the physical location
of blocks of nodes, records, and data, in 8 MB chunks, with no effort.
Coarse grained freeing which is extremely fast (emphasis on extremely..
it is ultra fast), but is still able to detect the 'all free' case
for a big-block, for some immediate gratification when deleting or
pruning large amounts of material.
For the UNDO FIFO, which I would have had to write anyway, the new
model generates far fewer dispersed modifications, resulting in
a smaller UNDO stream. A small UNDO stream reduces the number of
buffer dependancies to, in most cases, not more then one (i.e. 25 buffers
may need 1 buffer to be committed to disk before they can be committed).
The core allocator, next on my list to write, is literally just a
big-block allocator, dealing with large 8M blocks and not having to
deal with any fine-grained issues within those blocks. You can't get
much more straightforward then that!
-Matt
Matthew Dillon
<dillon@backplane.com>From: Jason Smethers <jason@...> Subject: Re: HAMMER update 10-Feb-2008 [0]Date: Feb 10, 10:51 pm 2008 Matthew Dillon wrote: > Here's what is left: > * Structural locking. The B-Tree is fine-grained locked but the > locks for the blockmap are just a hack (one big lock). Have you decided how to implement multi-master replication yet? -- Jason Smethers
From: Matthew Dillon <dillon@...> Subject: Re: HAMMER update 10-Feb-2008 [0]Date: Feb 10, 11:27 pm 2008 :Matthew Dillon wrote: :> Here's what is left: :> * Structural locking. The B-Tree is fine-grained locked but the :> locks for the blockmap are just a hack (one big lock). : :Have you decided how to implement multi-master replication yet? : :-- :Jason Smethers My current plan is to use a quorum algorithm similar to the one I wrote for the backplane database years ago. But there are really two major (and very complex) pieces to the puzzle. Not only do we need a quorum algorithm, but we need a distributed cache coherency algorithm as well. With those two pieces individual machines will be able to proactively cache filesystem data and guarantee transactional consistency across the cluster. The quorum algorithm is fairly straightforward, all the complexity there is basically on how to deal with broken connections, missing hosts, and things of that ilk. The caching algorithm is going to be a lot more complex. It will have to have timeouts with quorum-based watchdogs for refreshment in order to deal with hosts that drop out of the cluster, on top of everything else it has to do. It will have to be range-based at multiple levels in order to limit the memory footprint and work with super-large filesystems. I don't even want to start thinking about it yet. What HAMMER will provide to the system as a whole is the real-time mirroring aspects, inode numbers and transaction id's which are forever-unique (allowing data to be passively cached indefinitely), historical access for as-of transactions which will allow parallel transactions to occur and detect collisions at commit time rather then during the transaction, and a bunch of other features. -Matt Matthew Dillon <dillon@backplane.com>
Related links:
- Archive of above thread [0]
- Archive of above thread [0]
- Archive of above thread [0]
- Archive of above thread [0]
- Archive of above thread [0]
- Archive of above thread [0]