Published on KernelTrap (http://kerneltrap.org)

HAMMER Filesystem Design

By Jeremy
Created Oct 10 2007 - 20:51

"I am going to start committing bits and pieces of the HAMMER filesystem over the next two months," announced Matthew Dillon [1] on the DragonFly BSD kernel mailing list. He expects the filesystem to be functional by the 2.0 release in December: "I am making good progress and I believe it will be beta quality by the release. It took nearly the whole year to come up with a workable design. I thought I had it at the beginning of the year but I kept running into issues and had to redesign the thing several times since then." Matthew then posted a detailed design document [2] for the new filesystem.

During the follow-up discussion, Matthew was asked if HAMMER would be a ZFS killer. He responded, "ZFS serves a different purpose and I think it is cool, but as time has progressed I find myself liking ZFS's design methodology less and less, and I am very glad I decided against trying to port it." He noted it is essential to have redundant copies of data, but added, "the problem ZFS has is that it is TOO redundant. You just don't need that scale of redundancy if you intend to operate in a multi-master replicated environment because you not only have wholly independent (logical) copies of the filesystem, they can also all be live and online at the same time." As for how DragonFly's new filesystem will address redundancy, he explained:

"HAMMER's approach to redundancy is logical replication of the entire filesystem. That is, wholly independent copies operating on different machines in different locations. Ultimately HAMMER's mirroring features will be used to further our clustering goals. The major goal of this project is transparent clustering and a major requirement for that is to have a multi-master replicated environment. That is the role HAMMER will eventually fill. We won't have multi-master in 2.0, but there's a good chance we will have it by the end of next year."


From: Matthew Dillon <dillon@...>
Subject: HAMMER filesystem update
 [2]Date: Oct 10, 2:41 pm 2007

    I am going to start committing bits and pieces of the HAMMER filesystem
    over the next two months.  Note that the filesystem will not be
    operational until we get closer to the 2.0 release in December so
    these bits and pieces will not be tied into buildworld/buildkernel until
    then.

    I am making good progress and I believe it will be beta quality by the
    release.  It took nearly the whole year to come up with a workable
    design.  I thought I had it at the beginning of the year but I kept
    running into issues and had to redesign the thing several times
    since then.  I finally had to compromise a bit on the efficiency of the
    backup/mirroring data stream and the filesystem relies a lot more on
    heuristics and background balancing than I want it to, but other
    than that the design will meet all the goals that were laid out at
    the beginning of the year.

    I will post the design document the implementation is being based on in
    a moment.

						-Matt

From: Matthew Dillon <dillon@...>
Subject: HAMMER filesystem update - design document
 [2]Date: Oct 10, 3:33 pm 2007

    Ok, here's the final design document that I am now implementing.
    Again, I expect most or all of these features to be ready and the
    filesystem to be beta-quality by the December release.


			       Hammer Filesystem

(I) General Storage Abstraction

    HAMMER uses a basic 16K filesystem buffer for all I/O.  Buffers are
    collected into clusters, clusters are collected into volumes, and a
    single HAMMER filesystem may span multiple volumes.

    HAMMER maintains a small hinted radix tree for block management in
    each layer.  A small radix tree in the volume header manages cluster
    allocations within a volume, one in the cluster header manages buffer
    allocations within a cluster, and most buffers (pure data buffers
    excepted) will embed a small tree to manage item allocations within
    the buffer.

    Volumes are typically specified as disk partitions, with one volume
    designated as the root volume containing the root cluster.  The root
    cluster does not need to be contained in volume 0 nor does it have to
    be located at any particular offset.

    Data can be migrated on a cluster-by-cluster or volume-by-volume basis
    and any given volume may be expanded or contracted while the filesystem
    is live.   Whole volumes can be added and (with appropriate data
    migration) removed.

    HAMMER's storage management limits it to 32768 volumes, 32768 clusters
    per volume, and 32768 16K filesystem buffers per cluster.   A volume
    is thus limited to 16TB and a HAMMER filesystem as a whole is limited
    to 524288TB.  HAMMER's on-disk structures are designed to allow future
    expansion through expansion of these limits.  In particular, the volume
    id is intended to be expanded to a full 32 bits in the future and using
    a larger buffer size will also greatly increase the cluster and volume
    size limitations by increasing the number of elements the buffer-
    restricted radix trees can manage.
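
As a quick check of the arithmetic above (an illustration added here, not part of the design document), the quoted limits can be multiplied out directly. Reading "TB" as a binary terabyte (2^40 bytes) is an assumption, chosen because it reproduces the quoted figures exactly.

```python
# Limits quoted in the design document.
BUFFER_SIZE = 16 * 1024            # 16K filesystem buffer, in bytes
BUFFERS_PER_CLUSTER = 32768
CLUSTERS_PER_VOLUME = 32768
VOLUMES_PER_FILESYSTEM = 32768

MB = 1024 ** 2
TB = 1024 ** 4                     # assumption: binary terabyte

cluster_bytes = BUFFER_SIZE * BUFFERS_PER_CLUSTER
volume_bytes = cluster_bytes * CLUSTERS_PER_VOLUME
filesystem_bytes = volume_bytes * VOLUMES_PER_FILESYSTEM

print(cluster_bytes // MB, "MB maximum per cluster")
print(volume_bytes // TB, "TB maximum per volume")
print(filesystem_bytes // TB, "TB maximum per filesystem")
```

The per-cluster figure is only the ceiling implied by the 32768-buffer limit; the document does not say what cluster size will be used in practice.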

    HAMMER breaks all of its information down into objects and records.
    Records have a creation and deletion transaction id which allows HAMMER
    to maintain a historical store.  Information is only physically deleted
    based on the data retention policy.  Those portions of the data retention
    policy affecting near-term modifications may be acted upon by the live
    filesystem but all historical vacuuming is handled by a helper process.

    All information in a HAMMER filesystem is CRCd to detect corruption.

(II) Filesystem Object Topology

    The objects and records making up a HAMMER filesystem are organized into
    a single, unified B-Tree.  Each cluster maintains a B-Tree of the
    records contained in that cluster and a unified B-Tree is constructed by
    linking clusters together.  HAMMER issues PUSH and PULL operations
    internally to open up space for new records and to balance the global
    B-Tree.  These operations may have the side effect of allocating
    new clusters or freeing clusters which become unused.

    B-Tree operations tend to be limited to a single cluster.  That is,
    the B-Tree insertion and deletion algorithm is not extended to the
    whole unified tree.  If insufficient space exists in a cluster HAMMER
    will allocate a new cluster, PUSH a portion of the existing
    cluster's record store to the new cluster, and link the existing
    cluster's B-Tree to the new one.
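
The overflow behaviour described above can be sketched with a toy model (added for illustration, not taken from the design document). Everything here is an assumption: the class name, the use of flat sorted lists instead of real B-Trees, and the choice to PUSH the upper half of the keys. The document says only that "a portion" of the record store is pushed.

```python
import bisect


class ToyCluster:
    """Toy stand-in for a cluster: a bounded sorted key list plus links."""

    def __init__(self, capacity):
        assert capacity >= 2, "a split needs at least two keys"
        self.capacity = capacity
        self.keys = []       # record keys stored in this cluster, sorted
        self.children = []   # (lowest_key, ToyCluster) links, sorted by key

    def insert(self, key):
        # Follow a link if a child cluster covers this key.
        for low, child in reversed(self.children):
            if key >= low:
                child.insert(key)
                return
        if len(self.keys) >= self.capacity:
            self._push()
            self.insert(key)  # retry now that space has been opened up
            return
        bisect.insort(self.keys, key)

    def _push(self):
        # Allocate a new cluster, move the upper half of our keys into it,
        # and link it from this cluster.
        mid = len(self.keys) // 2
        child = ToyCluster(self.capacity)
        child.keys = self.keys[mid:]
        self.keys = self.keys[:mid]
        self.children.append((child.keys[0], child))
        self.children.sort(key=lambda link: link[0])

    def walk(self):
        # In-order traversal of the linked clusters.
        yield from self.keys
        for _, child in self.children:
            yield from child.walk()


root = ToyCluster(capacity=4)
for k in [5, 1, 9, 3, 7, 2, 8, 6, 4, 0]:
    root.insert(k)
print(list(root.walk()))
```

Each insert touches one cluster unless that cluster is full, which mirrors the point that B-Tree operations tend to stay within a single cluster. The sketch does no rebalancing; the document leaves that to a background process.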

    Because B-Tree operations tend to be restricted and because HAMMER tries
    to avoid balancing clusters in the critical path, HAMMER employs a
    background process to keep the topology as a whole in balance.  One
    side effect of this is that HAMMER is fairly loose when it comes to
    inserting new clusters into the topology.

    HAMMER objects revolve around the concept of an object identifier.
    The obj_id is a 64 bit quantity which uniquely identifies a filesystem
    object for the entire life of the filesystem.  This uniqueness allows
    backups and mirrors to retain varying amounts of filesystem history by
    removing any possibility of conflict through identifier reuse.  HAMMER
    typically iterates object identifiers sequentially and expects to never
    run out.  At a creation rate of 100,000 objects per second it would
    take HAMMER around 6 million years to run out of identifier space.
    The characteristics of the HAMMER obj_id also allow HAMMER to operate
    in a multi-master clustered environment.
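
The "around 6 million years" figure can be reproduced with a few lines of arithmetic (an illustration added here, not part of the design document). The only assumption is a year of 365.25 days.

```python
OBJ_ID_SPACE = 2 ** 64                 # 64 bit object identifier
CREATION_RATE = 100_000                # objects per second, from the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year

years = OBJ_ID_SPACE / CREATION_RATE / SECONDS_PER_YEAR
print(f"{years / 1e6:.2f} million years to exhaust the obj_id space")
```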

    A filesystem object is made up of records.  Each record references a
    variable-length store of related data, a 64 bit key, and a creation
    and deletion transaction id which is indexed along with the key.

    HAMMER utilizes a 64 bit key to index all records.  Regular files use
    the base data offset of the record as the key while directories use a
    namekey hash as the key and store one directory entry per record.  For
    all intents and purposes a directory can store an unlimited number of
    files. 
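
A minimal sketch of the two key schemes just described (added for illustration, not part of the design document). The field layout, the inclusion of obj_id in the sort order, and the use of BLAKE2b as a stand-in namekey hash are all assumptions; the document does not specify the hash function or the exact index layout.

```python
import hashlib
from typing import NamedTuple


class RecordKey(NamedTuple):
    obj_id: int      # 64 bit object identifier
    key: int         # 64 bit record key
    create_tid: int  # creation transaction id, indexed along with the key


def file_key(base_offset: int) -> int:
    # Regular files: the key is the base data offset of the record.
    return base_offset


def dir_key(name: str) -> int:
    # Directories: the key is a hash of the entry name.  BLAKE2b truncated
    # to 8 bytes is a hypothetical stand-in for HAMMER's namekey hash.
    digest = hashlib.blake2b(name.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


index = sorted([
    RecordKey(obj_id=2, key=file_key(16384), create_tid=11),
    RecordKey(obj_id=2, key=file_key(0), create_tid=10),
    RecordKey(obj_id=1, key=dir_key("passwd"), create_tid=10),
])
for rec in index:
    print(rec)
```

Because a directory holds one record per entry and the key space is 64 bits wide, the number of entries is bounded by the key space rather than by a fixed-size directory block.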

    HAMMER is also capable of associating any number of out-of-band
    attributes with a filesystem object using a separate key space.  This
    key space may be used for extended attributes, ACLs, and anything else
    the user desires.

(III) Access to historical information

    A HAMMER filesystem can be mounted with an as-of date to access a
    snapshot of the system.  Snapshots do not have to be explicitly taken
    but are instead based on the retention policy you specify for any
    given HAMMER filesystem.  It is also possible to access individual files
    or directories (and their contents) using an as-of extension on the
    file name.

    HAMMER uses the transaction ids stored in records to present a snapshot
    view of the filesystem as-of any time in the past, with a granularity
    based on the retention policy chosen by the system administrator.  This
    feature also effectively implements file versioning.
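
The as-of view can be sketched as a filter over creation and deletion transaction ids (an illustration added here, not part of the design document). The record layout and the convention that a missing deletion id means "still live" are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    key: int
    data: bytes
    create_tid: int
    delete_tid: Optional[int] = None  # assumption: None means still live


def visible_as_of(rec: Record, as_of_tid: int) -> bool:
    # A record is visible if it was created at or before the as-of point
    # and had not yet been deleted at that point.
    if rec.create_tid > as_of_tid:
        return False
    return rec.delete_tid is None or rec.delete_tid > as_of_tid


def snapshot(records, as_of_tid):
    return [r for r in records if visible_as_of(r, as_of_tid)]


# Overwriting a block is modelled as deleting the old record and creating
# a new one in the same transaction.
history = [
    Record(key=0, data=b"v1", create_tid=10, delete_tid=20),
    Record(key=0, data=b"v2", create_tid=20),
]
print(snapshot(history, 15))
print(snapshot(history, 25))
```

Every as-of point sees exactly one version of the block, which is why the same mechanism doubles as file versioning.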

(IV) Mirrors and Backups

    HAMMER is organized in a way that allows an information stream to be
    generated for mirroring and backup purposes.  This stream includes all
    historical information available in the source.  No queueing is required
    so there is no limit to the number of mirrors or backups you can have
    and no limit to how long any given mirror or backup can be taken offline.
    Resynchronization of the stream is not considered to be an expensive
    operation.

    Mirrors and backups are maintained logically, not physically, and may
    have their own, independent retention policies.  For example, your live
    filesystem could have a fairly rough retention policy, even none at all,
    then be streamed to an on-site backup and from there to an off-site
    backup, each with different retention policies.
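
One way to picture an independent retention policy is as a list of tiers, each keeping one snapshot per interval up to some age (an illustration added here, not part of the design document). The tier representation and the `prune` function are hypothetical. The example tiers borrow figures Matt gives later in this thread: 30 second snapshots for an hour, then hourly for a day.

```python
def prune(timestamps, now, tiers):
    """Return the snapshot timestamps a policy would keep.

    tiers is a list of (max_age_seconds, interval_seconds), finest first.
    Within each tier the newest snapshot in every interval is kept;
    anything older than the last tier is dropped.
    """
    keep = []
    seen = set()
    for ts in sorted(timestamps, reverse=True):
        age = now - ts
        for max_age, interval in tiers:
            if age <= max_age:
                bucket = (interval, ts // interval)
                if bucket not in seen:
                    seen.add(bucket)
                    keep.append(ts)
                break
    return sorted(keep)


# Hypothetical policies for a live filesystem and one of its mirrors.
live_policy = [(3600, 30), (86400, 3600)]
mirror_policy = [(86400, 3600), (30 * 86400, 86400)]

now = 40 * 86400
stamps = range(0, now + 1, 30)     # pretend a snapshot exists every 30s
print(len(prune(stamps, now, live_policy)), "kept on the live filesystem")
print(len(prune(stamps, now, mirror_policy)), "kept on the mirror")
```

Because mirroring is logical, each target can apply its own policy to the same stream without affecting the others.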

(V) Transactions and Recovery

    HAMMER implements an instant-mount capability and will recover information
    on a cluster-by-cluster basis as it is being accessed.

    HAMMER numbers each record it lays down and stores a synchronization
    point in the cluster header.  Clusters are synchronously marked 'open'
    when undergoing modification.  If HAMMER encounters a cluster which is
    unexpectedly marked open it will perform a recovery operation on the
    cluster and throw away any records beyond the synchronization point.
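
The per-cluster recovery rule can be sketched as follows (an illustration added here, not part of the design document). The class and field names are hypothetical, and records are reduced to their sequence numbers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    sync_point: int                  # highest record number known to be synced
    is_open: bool = False            # set synchronously before modification
    records: list = field(default_factory=list)  # record numbers, in order


def recover_on_access(cluster: ToyCluster) -> int:
    """Recover a cluster lazily; return the number of records discarded."""
    if not cluster.is_open:
        return 0                     # cleanly closed, nothing to do
    before = len(cluster.records)
    # Throw away anything laid down after the synchronization point.
    cluster.records = [n for n in cluster.records if n <= cluster.sync_point]
    cluster.is_open = False
    return before - len(cluster.records)


crashed = ToyCluster(sync_point=3, is_open=True, records=[1, 2, 3, 4, 5])
print(recover_on_access(crashed), "records discarded")
print(crashed.records)
```

Because the check runs only when a cluster is first touched, mounting does not have to scan the whole filesystem, which is what makes the instant-mount claim plausible.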

    HAMMER supports a userland transactional facility.  Userland can query
    the current (filesystem wide) transaction id, issue numerous operations
    and on recovery can tell HAMMER to revert all records with a greater
    transaction id for any particular set of files.  Multiple userland
    applications can use this feature simultaneously as long as the files
    they are accessing do not overlap.  It is also possible for userland
    to set up an ordering dependency and maintain completely asynchronous
    operation while still being able to guarantee recovery to a fairly
    recent transaction id.
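
The userland facility can be modelled with an in-memory toy (an illustration added here, not part of the design document). `ToyStore` and its methods are hypothetical; the document does not describe the actual interface userland would use.

```python
class ToyStore:
    """Toy model: every write gets the next filesystem-wide transaction id."""

    def __init__(self):
        self.tid = 0
        self.records = []            # (tid, filename, data) tuples

    def current_tid(self):
        return self.tid

    def write(self, filename, data):
        self.tid += 1
        self.records.append((self.tid, filename, data))

    def revert(self, files, tid):
        # Drop records newer than tid, but only for the named files.
        self.records = [
            r for r in self.records
            if not (r[1] in files and r[0] > tid)
        ]


store = ToyStore()
store.write("ledger", b"balance=100")
checkpoint = store.current_tid()     # the application remembers this id
store.write("ledger", b"balance=50") # part of an unfinished operation
store.write("other", b"unrelated")   # belongs to a different application

# On recovery the application rolls back only its own files.
store.revert({"ledger"}, checkpoint)
print(store.records)
```

Restricting the revert to a set of files is what lets several applications use the facility at once, provided their file sets do not overlap.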

(VI) Database files

    HAMMER uses 64 bit keys internally and makes key-based files directly
    available to userland.  Key-based files are not regular files and do not
    operate using a normal data offset space.

    You cannot copy a database file using a regular file copier.  The
    file type will not be S_IFREG but instead will be S_IFDB.   The file
    must be opened with O_DATABASE.  Reads which normally seek the file
    forward will instead iterate through the records and lseek/qseek can
    be used to acquire or set the key prior to the read/write operation.

From: Bill Hacker <wbh@...>
Subject: Re: HAMMER filesystem update - design document
 [2]Date: Oct 10, 4:30 pm 2007

Matthew Dillon wrote:
>     Ok, here's the final design document that I am now implementing.
>     Again, I expect most or all of these features to be ready and the
>     filesystem to be beta-quality by the December release.
>
>
> 			       Hammer Filesystem
>
*snip*

Matt,

Awesome!

Tells me: "ZFS, bend over, grab your ankles and kiss your an(atomy) 'Goodbye'"

From the amount of work that has HAD to go into this, it also tells me you are:

A) probably single, or soon will be

and

B) don't sleep much anyway!

;-)

Looking forward to a 'test drive'...

Bill Hacker

From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER filesystem update - design document
 [2]Date: Oct 10, 5:25 pm 2007

:Awesome!
:
:Tells me: "ZFS, bend over, grab your ankles and kiss your an(atomy) 'Goodbye'"
:
: From the amount of work that has HAD to go into this, it also tells me you are:
:
:A) probably single, or soon will be and

    Alas.

:B) don't sleep much anyway!
:...
:Bill Hacker

    I get a good 8 hours of sleep.  As I get older I find myself unable
    to pull all-nighters any more without really screwing up the entire
    next day.

    --

    ZFS serves a different purpose and I think it is cool, but as time
    has progressed I find myself liking ZFS's design methodology less and
    less, and I am very glad I decided against trying to port it.

    I do not think it is a good idea to put all one's marbles in a single
    copy of a filesystem, no matter how redundant its storage model is,
    and there isn't much point having that level of redundancy if the
    intent is to operate in a replicated environment.  The problem ZFS has
    is that it is TOO redundant.  You just don't need that scale of
    redundancy if you intend to operate in a multi-master replicated
    environment because you not only have wholly independent (logical)
    copies of the filesystem, they can also all be live and online at the
    same time.

    HAMMER's approach to redundancy is logical replication of the entire
    filesystem.  That is, wholly independent copies operating on different
    machines in different locations.  Ultimately HAMMER's mirroring
    features will be used to further our clustering goals.  The major goal
    of this project is transparent clustering and a major requirement for
    that is to have a multi-master replicated environment.  That is the
    role HAMMER will eventually fill.  We won't have multi-master in 2.0,
    but there's a good chance we will have it by the end of next year.

						-Matt

From: Gergo Szakal <bastyaelvtars@...>
Subject: Re: HAMMER filesystem update - design document
 [2]Date: Oct 10, 6:00 pm 2007

I am asking some questions from the user's point of view. Sorry if it
is covered in the document, I may have overlooked it.

So, the filesystem is going to be the volume manager as well (like in
ZFS), right? Will filesystems strictly be bounded to 'partitions' or
'slices'?

Another question: will this mirroring capability allow for an FS-level
RAID like RAIDZ? I wonder whether the filesystem can be extended so it
can achieve this.

Disclaimer: yes, those are ZFS features which I am asking about, but
no, I don't want a cluster-friendly ZFS ripoff, just asking.

-- 
Gergo Szakal MD <bastyaelvtars@gmail.com>
University Of Szeged, HU
Faculty Of General Medicine

/* Please do not CC me with replies, thank you. */

From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER filesystem update - design document
 [2]Date: Oct 10, 6:38 pm 2007

:So, the filesystem is going to be the volume manager as well (like in
:ZFS), right? Will filesystems strictly be bounded to 'partitions' or
:'slices'?
:
:Another question: will this mirroring capability allow for an FS-level
:RAID like RAIDZ? I wonder whether the filesystem can be extended so it
:can achieve this.
:
:Disclaimer: yes, those are ZFS features which I am asking about, but
:no, I don't want a cluster-friendly ZFS ripoff, just asking.
:
:-- 
:Gergo Szakal MD <bastyaelvtars@gmail.com>

    No, it isn't a volume manager, it's simply that the filesystem can be
    made up of multiple volumes.

    Each cluster (say, a 256M chunk) is integrated into the filesystem-wide
    B-Tree and can only be addressed by its parent or by the parent
    pointers of its children.  This means that clusters can be migrated
    with minimal work and thus can be migrated while the filesystem is
    live.  We don't have the situation such as we have in UFS where random
    inodes in the filesystem directly reference random data blocks
    elsewhere in the filesystem.

    For example, if you had a HAMMER filesystem backed by two volumes you
    could add a third volume, migrate all the data from the first volume
    to the new volume, and then remove the first volume (make it not part
    of the filesystem any more).  Similarly you could migrate the clusters
    at the end of a volume elsewhere and then contract that volume, or you
    could expand a volume and tell HAMMER to use the new space.

    I am not going to try to implement RAID inside HAMMER when RAID can be
    done with a software or hardware solution in another layer.  HAMMER
    will do what hardware and software storage solutions can't easily or
    efficiently do, which is logical replication of the entire filesystem.

    A logical replication allows the different replication targets to
    retain varying amounts of filesystem history.  For example, your
    production filesystem might retain 30 second snapshots for an hour and
    hourly for the day, while one of your replication targets might retain
    hourly snapshots for a day and daily snapshots for a month, etc.

    Ultimately we will have a multi-master environment which will silently
    handle whole or partial filesystem failures.  In this case the type of
    redundancy you need at the storage layer will depend on the number of
    physical disks you need to use for each copy of the filesystem.  If
    your filesystem fits on one or two physical disks then you wouldn't
    need any RAID at all.  If each copy needs a bank of physical disks
    then you might want the bank of disks to be RAIDed.  At that point
    you'd use a hardware or software RAID solution.

    But is RAID absolutely necessary?  Probably not.  Consider a replicated
    filesystem with each copy backed by an array of disks.  Now say you
    have a disk failure.  The copy of the filesystem containing the disk
    failure loses a portion of its B-Tree.  It doesn't need to recover
    the disk, you would just pull it and slap in a new one and the
    filesystem would reload that portion of the B-Tree from one of the
    other replicated copies to repair itself.

:University Of Szeged, HU
:Faculty Of General Medicine
:
:/* Please do not CC me with replies, thank you. */
:

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

From: Thomas E. Spanjaard <tgen@...>
Subject: Re: HAMMER filesystem update - design document
 [2]Date: Oct 10, 7:45 pm 2007

Matthew Dillon wrote:
>     But is RAID absolutely necessary?  Probably not.  Consider a replicated
>     filesystem with each copy backed by an array of disks.  Now say you
>     have a disk failure.  The copy of the filesystem containing the disk
>     failure loses a portion of its B-Tree.  It doesn't need to recover
>     the disk, you would just pull it and slap in a new one and the
>     filesystem would reload that portion of the B-Tree from one of the
>     other replicated copies to repair itself.

This is the functional equivalent of a RAID1, and that is all HAMMER
provides; the point of RAIDZ (and RAID3,4,5,6,etc) is that you don't
need 2n bytes worth of disk for n bytes worth of usable storage, yet
keeping some level of resilience. There is something to be said for this
kind of scheme, namely not wasting as much disk space, but in the case
of RAID1,0,10,01, moving that to a different layer (e.g. Vinum) is good
enough.

In a clustering environment, it's not likely that you'll want anything
other than full replication, but at least on single-node storage
systems, using storage more efficiently has its uses; even though it
means longer recovery times.

Cheers,
-- 
         Thomas E. Spanjaard
         tgen@netphreax.net [3]

From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER filesystem update - design document
 [3]Date: Oct 10, 9:14 pm 2007

:This is the functional equivalent of a RAID1, and that is all HAMMER
:provides; the point of RAIDZ (and RAID3,4,5,6,etc) is that you don't
:need 2n bytes worth of disk for n bytes worth of usable storage, yet
:keeping some level of resilience. There is something to be said for this
:kind of scheme, namely not wasting as much disk space, but in the case
:of RAID1,0,10,01, moving that to a different layer (e.g. Vinum) is good
:enough.

    Yes and no.  The reason it isn't quite the same is that RAID storage
    has no ability to recover from corruption generated by the filesystem
    code itself or corruption caused by other parts of the kernel or by
    hardware snafus which occur prior to the data getting onto the platter.

    When you do logical replication, however, the possibility of this sort
    of corruption seeping into all the replicated copies is greatly
    reduced and the replicated copies can check against each other to
    detect even more such cases.  So with replication you get a degree of
    detection plus the ability to recover (correct) the corrupted data.

    Also one always has one and possibly several backups, both on-site and
    off-site.  A standard RAID system does not give you a functional
    backup of your data, it just gives you redundancy.  Replication coupled
    with HAMMER's historical data store gives you a functional backup AND
    replication at the same time, without having to add yet more physical
    storage.  That is a big deal.

:In a clustering environment, it's not likely that you'll want anything
:other than full replication, but at least on single-node storage
:systems, using storage more efficiently has its uses; even though it
:means longer recovery times.
:
:Cheers,
:-- 
:         Thomas E. Spanjaard

    This is something I have been thinking about.  It would be possible
    to replicate just a portion of a filesystem but doing it properly
    would require HAMMER to support a 'filesystem within a filesystem'
    abstraction in order to be able to use the same object ids in the
    replicated subset that the originator used.

    Even though only a subset of files are being replicated the target
    must be able to store objects across the source's entire object id
    space.  So what you want to do is create a filesystem within the
    target's filesystem to hold the replication of the subset.

    e.g. something like this (pseudo code):

	mkfilesystem /hammer/my_source_backup
	replicate /elsewhere/my_source /hammer/my_source_backup

	mkfilesystem /hammer/my_pictures_backup
	replicate /elsewhere/my_pictures /hammer/my_pictures_backup

    HAMMER is specified such that this sort of thing could be implemented,
    but not for 2.0.  Basically a filesystem-within-a-filesystem would be
    implemented by creating a HAMMER object whose records are the inodes
    of the pseudo-filesystem.  The key space is large enough to hold the
    entire object id space.

						-Matt


Source URL:
http://kerneltrap.org/DragonFlyBSD/HAMMER_Filesystem_Design