DragonFlyBSD: Designing a Highly Available Clustering Filesystem

Submitted by Jeremy
on February 27, 2007 - 10:43pm

Matt Dillon [interview] posted the design synopsis of a new highly available clustered filesystem he will soon begin writing for DragonFlyBSD. The feature summary at the beginning of his document included, "on-demand filesystem check and recovery; infinite snapshots; multi-master operation, including the ability to self-heal a corrupted filesystem by accessing replicated data; infinite logless replication, meaning that replication targets can be offline for 'days' without effecting performance or operation; 64 bit file space, 64 bit filesystem space, no space restrictions whatsoever; reliably handles data storage for huge multi-hundred-terrabyte filesystems without fear of unrecoverable corruption; cluster operation, provides the ability to commit data to locally replicated store independantly of other replication nodes, with access governed by cache coherency protocols; independant index, data is laid out in a highly recoverable fashion, independant of index generation, and indexes can be regenerated from scratch and thus indexes can be updated asynchronously." He then goes into detail on each of these points and many more, explaining how he intends to implement the new filesystem.

The new filesystem is currently unnamed, though Matt noted, "it doesn't have to translate as an acronym. At the moment 'HAMMER' is my favorite. I like the idea of a hammer :-)" It was suggested that this could mean, "high-availability multi-master extra reliable file system", though Matt was not impressed with this. Another proposed idea that Matt liked was HACFS, or "High-Availability Clustered File System".


From: Matthew Dillon [email blocked]
To: DragonFly kernel List
Date: 	Wed, 21 Feb 2007 14:22:16 -0800 (PST)
Subject: Initial filesystem design synopsis.


    Here is my initial outline of the filesystem design.  It is open
    for discussion.  Please feel to ask questions for anything you do
    not understand.  I do not intend to start coding anything for at
    least two weeks.

    There are currently two rough spots in the design.  First, how to
    handle segment overflows in a multi-master environment.  Such overflows
    can occur when the individual masters or slaves have different historical
    data retention policies.  Second, where to store the regeneratable
    indexes.

    Plus, I need a name for this baby. I can't use DFS, however much I 
    want to, because the term is already over-used.

					-Matt
					Matthew Dillon 
 [email blocked]


Feature Summary:

    - On-demand filesystem check and recovery.  No need to scan the entire
      filesystem before going live after a reboot.

    - Infinite snapshots (e.g. on the 30 second filesystem sync), ability
      to collapse snapshots for any time interval as a means of
      recovering space.

    - Multi-master operation, including the ability to self-heal a
      corrupted filesystem by accessing replicated data.  Multi-master 
      means that each replicated store can act as a master.  New
      filesystem ops can run independantly on any of the replicated stores
      and will flow to the others.  This is done by giving the controller
      of each store a certain amount of guarenteed free space that can be
      filled without having to notify other controllers.

    - Infinite logless Replication.  No requirement to keep a log of changes
      for the purposes of replication, meaning that replication targets can
      be offline for 'days' without effecting performance or operation.
      ('mobile' computing, but also replication over slow links for backup
      purposes and other things).  Certain information is still logged,
      but only to streamline performance.

    - 64 bit file space, 64 bit filesystem space.  No space restrictions
      whatsoever.

    - Reliably handles data storage for huge multi-hundred-terrabyte
      filesystems without fear of unrecoverable corruption.

    - Cluster operation - ability to commit data to locally replicated
      store independantly of other replication nodes, with access governed
      by cache coherency protocols.

    - Independant index.  Data is laid out in a highly recoverable fashion,
      independant of index generation.  Indexes can be regenerated from
      scratch and thus indexes can be updated asynchronously.

Segmentation:

    The physical storage backing a filesystem is broken up into large
    1MB-4GB segments (64MB is a typical value).  Each segment is
    self-identifying and contains its own header, data table, and record
    table.  The operating system glues together filesystems and determines
    availability based on the segments it finds.

    All segments on a block device are the same size (power-of-2 restricted)
    and the OS can probe the disk without any other information simply by
    locating a segment header, starting with the largest possible segment
    size.  Segment size is typically chosen such that a scan of all segment
    headers takes no longer then a few seconds.

    - The segment header contains all the information required to associate
      the segment with a filesystem.   Certain information in the segment
      header is updated with an ordered write to 'commit' other work already
      flushed out into the segment.  The segment header also contains a 
      bit indicating whether the segment is 'open' or not, to aid in 
      on-demand recovery.

      The segment ID is stored in the segment header, allowing segments
      to be easily relocated.  That is, the segment's location in the
      physical backing store is irrelevant.

    - The data table consists of pure data, laid out linearly in the forward
      direction within the segment.   Data blocks are variable-sized entities
      containing pure data, with no other identifying information, suitable
      for direct DMA.  The segment header has a simple append index for
      the data table.

    - The record table consists of fixed-sized records and a reference to
      data in the data table.  The record table is built backwards from
      the end of the segment.  The segment header has a simple pre-append
      index for the record table.  Each record consists of:

      * an object identifier (e.g. inode number)
      * creation transid
      * deletion transid (updated in-place)
      * indexing key (offset or filename key)
      * data table reference (offset, bytes)  (up to a 1GB contiguous ref)
      * integrity check

    All data elements are completely identified by the record table.  All
    modifications are historical in nature... that is, no actual data is
    destroyed and one can access a 'snapshot' of the segment as of any
    transaction id (i.e. timestamp) simply by ignoring those records
    marked as deleted prior to the specified time or created after
    the specified time.  

    Certain updates to the object table occur in-place.  In particular,
    the deletion transaction id is updated in-place.  However, such
    updates are not considered 'committed' until the segment header itself
    is updated with the latest committed transaction id and a recovery
    operation will undo any deletion transaction id greater then the
    committed transaction id in the segment header, as well as free
    any uncommitted objects and their related data.

    The entire filesystem can be recovered from just the record and data
    tables.  Indexing and cross-segment spanning is described in later
    sections.

Physical space recovery:

    Physical space recovery requires actually destroying records marked
    as deleted.  Snapshots can be collapsed by destroying records whos
    creation AND deletion id's fall within the collapsed space.  The oldest
    data is freed up by destroying records whos deletion id is less then
    the terminal timestamp. 

    Record destruction creates holes in both the data table and the record
    table.  Any holes adjacent to the data table append point or the record
    table prepend point are immediately recovered by adjusting the 
    appropriate indices in the segment header.  The operating system may
    cache a record of non-adjacent holes (in memory) and reuse the space,
    and can also generate an in-memory index of available holes on the
    fly when space is very tight (which requires scanning the record table),
    but otherwise the recovery of any space not adjacent to the data table
    append point requires a performance reorganization of the segment.

Locating Data objects, and the Master Record.

    Data objects are basically the amalgamation of all records with
    the same 64 bit object identifier.  The top N bits of this identifier
    indicate the segment the master record for the data object resides in.
    All 64 bits are made available to userland.

    The filesystem needs to be free to relocate the master record for
    a data object on the fly.  Relocation is a dynamic process whereby 
    the master record in a segment becomes a forwarding record to another
    segment.  Any references to a forwarding record is adjusted on the fly
    in order to collapse the chain, and any intermediate forwarding records
    can be physically destroyed once all references to them have been
    adjusted.  However, the ORIGINAL forwarding record must be retained
    for all time in order to maintain the uniqueness of the originally
    assigned user-visible inode number and to give us a way to locate
    the master record given the inode number.  We cannot change the inode
    number.  Overhead is minimal.

    A forwarding record is created in two situations: (1) To move the 
    master record to improve performance and (2) If the current segment
    does not have sufficient space to increase the size the master record
    if it becomes necessary to do so.

    A forwarding record is a degenerate case of a master record and the
    forwarding information is embedded in the record itself, with no
    data table reference at all.  The space used is only the space required
    to store a record table entry.

    The master record for a data object is a special record in the record
    table.  It is special because it is NOT part of the historical data
    store.  The creation and deletion transaction ids represent the creation
    or deletion of the entire data object, and the data block contains
    information on where to find the other bits and pieces of the data
    object (in particular, if the data object spans more then one segment).
    A CORRUPTED MASTER RECORD CAN BE COMPLETELY RECONSTRUCTED BY SCANNING
    THE FILESYSTEM.

    The master record can be thought of as a persistent cache.  It gives
    the filesystem a way to quickly locate all the segments that might
    contain records for the data object and reserves a limited amount of
    space for the filesystem to store direct references to data residing
    in the same segment as the master record.

    Master record updates are fairly rare.  For the most part a master
    record must only be updated if a file or directory grows into a
    new segment.

Performance reorganization:

    Segments can be repacked in order to clean up fragmentation and 
    reorganize data and records for more optimal access.  Repacking is
    accomplished by copying the segment's data into a wholely new
    segment on the physical disk, then destroying the old segment.

    Since a segment is identified by its header the actual location of
    the segment on the physical media is irrelevant.

    For example, master records can wind up strewn about the record
    table for a segment.  Repacking would pack all of those master
    records next to each other.

    Similarly, a file or directory might have bits and pieces of data
    strewn about a segment.  Repacking would de-fragment those entities,
    as well as pack together the data related to small files and 
    separate out the larger chunks related to larger files.

Segment Allocation:

    Remember that since the actual physical location of a segment is
    independant of the segment identifier (typically an 8 or 16 bit
    number), the allocation of segment numbers does not have to be
    linear.  The filesystem will typically be generous in its allocation
    of segment numbers in order to allow for spill over growth of files
    into logically adjacent segments, thus simplifying the segment
    range stored in the master record that describes which segments
    might contain data records for a data object.  For example,
    the first segment allocated by the filesystem when using a 16 bit
    segment id would not be 0x0000 or 0xffff, but probably 0x8000.  The
    next segment allocated (when not doing a spill-over) might be 0x4000
    or 0xc000, and so forth.  The entire segment space is used even
    if there are fewer actual (physical) segments.

Large cross-segment objects:

    A Data object can wind up being far larger then a segment.  For that
    matter, even a small data object with a lot of history can wind up being
    far larger then a segment.

    The filesystem does its best to organize the data for such objects
    to reduce the size of the master records required to describe them
    and to separate historical data from current data, to maintain the
    performance of the live filesystem.

Indexing:

    An object's master record generally describes the segments containing
    the data for the object, and may contain direct references to data
    in the same segment as the master record (an optimization or performance 
    reorganization that is desireable for files smaller then a single
    segment).

    However, data objects can grow to be many records due to fragmentation,
    simply being large files, or due to the retention of a great deal of
    history.  The records pertaining to these objects may have to be indexed
    beyond the space the filesystem is willing to reserve in the master
    record in order to improve access.

    To make it clear, indexing is an optimization.  The index is not
    required to recover a segment or to recover a filesystem.  The intent
    is for the master record to always contain sufficient information
    to reduce the number I/O's required to resolve a file access request
    to a reasonable number, even if no index is available.

    Indexing can occur in three places:

    * First, it can occur in the segment itself to organize the records
      in that segment.  Such indexes are usually persistently cached in the
      dead space between the data table and the record table, and the 
      filesystem heuristically determines how much space is needed for
      that.  If the data table or the record table grows into the index
      the filesystem can choose to relocate, regenerate, or shift the index
      data, or to disallow the growth (extend the data object into a new
      segment).  These heuristics are fairly simple.

    * Second, it can occur in the master record to roughly organize the
      data space... for example so the filesystem does not have to check
      all segments if a particular file offset (or offset range) and all
      of its history is known to reside in just one.  The filesystem
      typically is only willing to allow so much space in the master record
      for a data object to store such an index.  If this level of index
      becomes too large it is basically simplified in-place and starts
      requiring the use of the per-segment index to further resolve the
      location of data records for the object.

    * Third, it can be generated and cached in memory.  When dealing with
      very large multi-segment files it may be beneficial to scan
      the relatively few records in the record table for the related segments
      and simply index them in memory, rather then store an index on-disk.

      For example if you are using 64MB segments and have a 20GB file,
      literally only 320 disk accesses (with a few data records read in
      each access) are required to construct an index of the entire 20GB
      file and the index itself would require very little memory.

    Indexes are asynchronous entities.  The 'official' filesystem store is
    the data table and the record table.  The host can cache updates in
    memory, and asynchronously update the performance-improving index.  

    Indexes residing in segments are regenerated if the segment is marked
    open on initial access (due to an unclean shutdown).  This allows
    persistent per-segment indexes to be updated without requiring any
    particular transactional disk synchronization or block ordering.

    Indexes are generally implemented using a B+Tree structure.

Replication:

    Data and record elements are only directly referenced by their offset
    within a segment when referenced from within that same segment.  This
    means that replication is possible on a segment-by-segment basis.

    Segment headers contain asynchronously updated record update log areas
    for deletion events (because these modify the deletion timestamp in
    an existing record rather then append a new record).  These log areas
    are of limited size and can overflow under normal operation.  They are
    ONLY used to streamline replication.  If a replication target is not
    fast enough, it will have to resynchronize by scanning the records
    on the source (which itself doesn't usually take very long to do so it
    isn't considered a big deal).  But otherwise it can check the log area
    and simply pick up where it left off.  The intention is to completely
    support both very slow replication slaves and mostly off-line slaves.

    The only thing that is actually replicated are the record table
    entries and (with the exception of a master record) the data table
    entries.  The data table entry for a master record is never replicated,
    but (re)generated by the replication target.  The replication target
    is responsible for maintaining its own indexes and managing its own
    physical segment space.  This gives any replication target a great
    deal of robustness and flexibility.

    Also note that the physical location of the logical segments on the
    replication target is entirely up to the replication target.  Replication
    is done logically, not physically.

Segment Expansion - Virtual segments

    A replication target may wish to retain historical data that the 
    replication source has not, or destroy historical data that the
    replication source intends to retain.  This can result in the replication
    target running out of room in a logical segment, and can also complicate
    recovery operations if the replication source needs to self-heal a
    corrupted segment.  This presents a problem because all filesystem
    accesses and all replication and recovery activities are segment-centric.

    To deal with this situation a logical segment can be turned into a 
    virtual segment.  A virtual segment is made up of multiple logical
    segments, all with the same identifier plus additional information
    in the segment distinguishing the pieces from each other.  Each logical
    segment is maintained independantly but API accesses check both
    (or all N such segments) when doing lookups, and write to whichever
    one has free space.  

    Virtual segments are pretty standard fare on replication slaves which
    are intended to archive a great deal more history then replication
    masters, but are intended to be very rare features of replication
    masters.  If a virtual segment must be created on a replication master
    the filesystem does all it can to shift data off the virtual segment
    in order to be able to collapse it back into a logical segment.

    Virtual segmentation is not space efficient.

Files and Directories

    As has been described, a data object (which can be a file or directory
    or other filesystem entity) is identified by a data object id, which
    is effectively its inode, and a 64 bit key field.  The key field is
    typically a base offset for a file or a sortable key for a directory
    entry.  Negative numbers are used for meta-data.  For example, -1 will
    be used for the inode data & stat information.  Other negative numbers
    will be used to support other meta-data such as ACLs.

    Since records support variable-length data, the data storage for a file
    is effectively extent-based.  Minimum granularity will be something like
    32 bytes.

    Files are thus fairly straight forward.

    From the point of view of the filesystem each directory entry will
    be a data record.  The data record will basically just contain the
    inode number, type, and filename.  It will be variable-length insofar
    as the filename is variable length.

    Most files will be representable in a single extent (or at least
    can be optimized into a single extent), and so will not depend very
    heavily on indexing.

    Directory lookups WILL depend on the indexing of the directory records
    for efficiency, otherwise every record would have to be scanned to
    lookup a filename.  In this regard a B+Tree is probably the best 
    solution.

Inode Number Uniqueness

    Inode numbers are 64 bits where the top N bits represent the segment
    in which the inode was originally allocated, identifying the master
    record for the data object.  Inode numbers do not change even if
    the master record is relocated and the original master record replaced
    with a forwarding record.  The forwarding record must be retained
    in order to guarentee the uniqueness of the inode number.

    This also allows inode numbers to be allocated on a per-segment basis,
    with 48 bits of uniqueness (one trillion file creates & deletes) within
    each segment.

    The filesystem will always allocate new inode numbers using a sequence
    number maintained in the segment header.  When all 48 bits are used up,
    however, the filesystem must check new inode numbers against existing
    numbers (including any deleted historical objects).  

    Inode number uniqueness over all time can no longer be guarenteed, but
    inode number uniqueness over a period of time CAN still be guarenteed
    by retaining the master record for deleted files for a period of time
    EVEN AFTER THEY HAVE BEEN DESTROYED.  The system operator can specify
    the retention time with the caveat that certain benchmarks might cause
    the disk to fill up or become somewhat less efficient even when 
    historical data is purged.

    It is recommended that any exported or clustered filesystems set the
    inode number retention time to at least a week.  Inode number uniqueness
    is crucial to cluster operation.


From: Bob Bishop [email blocked] Date: Thu, 22 Feb 2007 10:58:46 +0000 Subject: Initial filesystem design synopsis. Hi, At 22:22 21/02/2007, Matthew Dillon wrote: [...] Plus, I need a name for this baby. I can't use DFS, however much I want to, because the term is already over-used. If you went for 128-bit object IDs you could call it UFSv6 :-) :-) But seriously, instances of this filesystem will potentially be very big, very active *and* have very long lifetimes - is 64 bits really enough? -- Bob Bishop +44 (0)118 940 1243 rb@gid.co.uk fax +44 (0)118 940 1295
From: Matthew Dillon [email blocked] Date: Thu, 22 Feb 2007 10:24:58 -0800 (PST) Subject: Initial filesystem design synopsis. :Hi, : :At 22:22 21/02/2007, Matthew Dillon wrote: :>[...] :> Plus, I need a name for this baby. I can't use DFS, however much I :> want to, because the term is already over-used. : :If you went for 128-bit object IDs you could call it UFSv6 :-) :-) : :But seriously, instances of this filesystem will potentially be very :big, very active *and* have very long lifetimes - is 64 bits really enough? : :-- :Bob Bishop +44 (0)118 940 1243 :[email blocked] fax +44 (0)118 940 1295 What would be a reasonable limit of the number of replication targets operating in master mode (master == changes can be made, slave == changes can only be brought in through replication protocols). Lets choose 64. Now, what would be a reasonable limit for the number of discrete MODIFYING operations per second? Lets choose 10000000 (ten million). So, that's 6 bits to identify replication master, and 24 bits for sub-second resolution, leaving 40 bits for one-second resolution. 40 bits of one-second resolution comes to around 34865 years. And its even better then that. For this filesystem, transaction ids are created by synchronization events for modified data, not modifying events. Such events occur far less often then the actual modifying operations execuuted by programs. 64 bits is enough. -Matt Matthew Dillon [email blocked]
From: Thomas E. Spanjaard [email blocked] Date: Wed, 21 Feb 2007 23:47:41 +0000 Subject: Initial filesystem design synopsis. Matthew Dillon wrote: Segmentation: The physical storage backing a filesystem is broken up into large 1MB-4GB segments (64MB is a typical value). Each segment is self-identifying and contains its own header, data table, and record table. The operating system glues together filesystems and determines availability based on the segments it finds. I think the more common term for this kind of thing is 'allocation group'. - The data table consists of pure data, laid out linearly in the forward direction within the segment. Data blocks are variable-sized entities containing pure data, with no other identifying information, suitable for direct DMA. The segment header has a simple append index for the data table. And 'extent' for the variable-sized entities :). - The record table consists of fixed-sized records and a reference to data in the data table. The record table is built backwards from the end of the segment. Doesn't this prepending stuff incur a significant performance penalty for operations that walk the record table in a chronological/otherwise 'fifo' ordered fashion? Record destruction creates holes in both the data table and the record table. Any holes adjacent to the data table append point or the record table prepend point are immediately recovered by adjusting the appropriate indices in the segment header. The operating system may cache a record of non-adjacent holes (in memory) and reuse the space, and can also generate an in-memory index of available holes on the fly when space is very tight (which requires scanning the record table), but otherwise the recovery of any space not adjacent to the data table append point requires a performance reorganization of the segment. I think these lists/trees should be kept sorted, at least on-disk for performance reasons (random reads/writes on rotational media is a bummer given current seek times). Generally, I can't help but feel that the clustering/replication stuff needs to be separate from the 'actual on-disk' filesystem. Cheers, -- Thomas E. Spanjaard [email blocked]
From: Matthew Dillon [email blocked] Date: Thu, 22 Feb 2007 00:59:00 -0800 (PST) Subject: Initial filesystem design synopsis. :> - The record table consists of fixed-sized records and a reference to :> data in the data table. The record table is built backwards from :> the end of the segment. : :Doesn't this prepending stuff incur a significant performance penalty :for operations that walk the record table in a chronological/otherwise :'fifo' ordered fashion? There is no reason why it would. Records are small fixed-sized entities while disk I/O's tend to be much larger. Disks cache whole tracks anyhow (and in fact most disks record sectors in reverse order on the track). So, basically it doesn't matter with regards to accessing the record array. :> Record destruction creates holes in both the data table and the record :> table. Any holes adjacent to the data table append point or the record :> table prepend point are immediately recovered by adjusting the :> appropriate indices in the segment header. The operating system may :> cache a record of non-adjacent holes (in memory) and reuse the space, :> and can also generate an in-memory index of available holes on the :> fly when space is very tight (which requires scanning the record table), :> but otherwise the recovery of any space not adjacent to the data table :> append point requires a performance reorganization of the segment. : :I think these lists/trees should be kept sorted, at least on-disk for :performance reasons (random reads/writes on rotational media is a bummer :given current seek times). You can't reorganize the record array without doing a performance reorganization of the segment (as defined by the document). The reason is that the records in question represent information that we cannot afford to lose, and there is no easy way to order disk I/O to maintain an array in sorted order without potentially losing data if an operating system crash were to occur during the I/O. (Note I'm not talking about a disk crash here, I'm talking about an OS crash). Also, the concept of a 'sorted order' is not entirely applicable to a historical data store where the historical (deleted) records will overlap the current records in many different ways. Trying to keep track of it all in a single sort leads to severe inefficiencies when doing lookups on files with a lot of historical records. Instead one must maintain an index of the records separate from the record list itself (maybe even more then one index, in fact). The index can be in-memory or on-disk. There's a nice trick one can do with indexes... there is no need to insert new records into an index when they are created. Instead you can also maintain a very short list of, say, 8 unindexed records which are always checked against in any search and then insert the unindexed records into the actual index in-bulk, with cache locality of reference for both code and data (which is far more efficient then inserting them one at a time when they were originally prepended). :Generally, I can't help but feel that the clustering/replication stuff :needs to be separate from the 'actual on-disk' filesystem. : :Cheers, :-- : Thomas E. Spanjaard : [email blocked] One thing I've learned from doing the Backplane database is that you can't separate the storage topology from the clustering algorithms without creating a huge mess. They have to be basically compatible with each other for things to operate smoothly. -Matt
From: Jan Grant [email blocked] Date: Thu, 22 Feb 2007 08:39:43 +0000 (GMT) Subject: Initial filesystem design synopsis. Quick question... On Wed, 21 Feb 2007, Matthew Dillon wrote: > - Infinite snapshots > > - Multi-master operation > > - Infinite logless Replication transid space: monotonic increasing on each replication target, or a fine-grained synchronised timestamp*, or something else? Cheers, jan * which I guess means that dfly replication targets need to occupy roughly the same relativistic reference frames in operation :-) -- jan grant, ISYS, University of Bristol. http://www.bris.ac.uk/ Tel +44 (0)117 3317661 http://ioctl.org/jan/ printf 'cat\nhello world' | `sh -c 'read c; echo $c'`
From: Matthew Dillon [email blocked] Date: Thu, 22 Feb 2007 01:09:27 -0800 (PST) Subject: Initial filesystem design synopsis. :Quick question... : :On Wed, 21 Feb 2007, Matthew Dillon wrote: : :> - Infinite snapshots :> :> - Multi-master operation :> :> - Infinite logless Replication : :transid space: monotonic increasing on each replication target, or a :fine-grained synchronised timestamp*, or something else? : :Cheers, :jan Monotonic increasing AND a fine-grained timestamp. Low bits of the timestamp (sub-nanosecond equivalent) would simply be used to identify the replication target, allowing each target to 'allocate' transaction ids independantly (and also incidently tell us which 'master' was responsible for the original op that is now being replicated). A newly created transaction id would at a minimum have to be larger then the last transaction id... and if this goes beyond the current 'real time', the host must sleep for a few microseconds to allow real time to catch up. (In reality the granularity can be selected such that it is possible to allocate hundreds of thousands or millions of transids a second across the entire cluster, so this isn't an issue). The transaction id must be translatable into a timestamp of sorts (beyond the monotonic requirement), just to make snapshot handling sane. The problem with such a scheme is, of course, that a host which is not properly time synchronized can throw a big wrench in the works. And, also, conceviably someone could set the system time to 0xffffffffffffffff and the filesystem would barf (not be able to allocate any new transaction ids because, well, it just ran out!). Sanity checks in the code can handle unsynchronized hosts, guarentee monotonic increasing transaction ids, and prevent the filesystem from becoming corrupted. Deliberately generating absurd time stamps would be a bigger problem... for example, basing your cluster on a single machine's RTC would be a bad idea. At the very least you would want an NTP-synchronized time source. Monotonic increasing transaction ids are *CRITICAL* to replication protocols. Absolutely critical. It's the difference between having to keep a physical log of changes (with unbounded size), and just having to store the last transaction ID you had synchronized to. -Matt Matthew Dillon [email blocked]
From: Oliver Fromme [email blocked] Date: 22 Feb 2007 15:18:21 GMT Subject: Initial filesystem design synopsis. Matthew Dillon [email blocked] wrote: > [new file system design] I have a question about a specific scenario. You mention multi-master, replication and off-line operation. Suppose I have two machines in a replicated multi-master setup, i.e. each of them has a full copy of the file system to work with, and both have write access which is replicated to the partner. This could ba a set of redundant servers, or maybe a desktop machine plus a notebook. Now what happens if the two nodes are disconnected for a certain period of time? E.g. there's a network outage, or the notebook is taken off-line on a trip. In such a situation, both nodes should still be fully functional with write access, of course. Therefore, each node must maintain a queue of changes that need to be replicated to the other node as soon as the connection is restored. When the connecton between the machines is restored, the file system has to be synchronized somehow. That should happen automatically without user intervention. Such a synchronization means that each node has to send its queue of changes to the other node. However -- and here is the question -- what happens if there are any conflicts? For example, what happens the same file has been changed in different ways on the nodes during off-line operation? > Plus, I need a name for this baby. I can't use DFS, however much I > want to, because the term is already over-used. Well, then use a Greek letter and call it "delta FS". :-) That would be ΔFS or δfs in HTML, or ΔFS (upper case) or δFS (lower case) if your browser doesn't know the Greek letter entity names (but they're standard since HTML 4.0 which is quite a while). Given your feature summary, a few interesting abbreviations can be made, for example: high-availability replicated distributed file system == HARDFS high-availability clustered replicated file system == HACRFS (pronounced "hacker FS", of course) high-availability multi-master extra reliable file system == HAMMERFS The "R" can mean replicated or reliable (or robust), whatever you prefer, and the "E" can mean extra, eminently, exceedingly or exceptionally. Best regards Oliver PS: Oh, just one thing that you didn't mention in your feature list ... It would be very useful to support checksums for all file system data (file data and meta data), so any form of corruption can be reliably detected on the file system level. ZFS supports it, and GELI in FreeBSD has grown support for it, too. They do it on the block level, I think (i.e. a checksum per file system block). No more silent corruption by broken hard disks. -- Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd Any opinions expressed in this message may be personal to the author and may not necessarily reflect the opinions of secnetix in any way.
From: Matthew Dillon [email blocked] Date: Thu, 22 Feb 2007 10:30:39 -0800 (PST) Subject: Initial filesystem design synopsis. :I have a question about a specific scenario. You mention :multi-master, replication and off-line operation. Suppose :I have two machines in a replicated multi-master setup, :i.e. each of them has a full copy of the file system to :work with, and both have write access which is replicated :to the partner. This could ba a set of redundant servers, :or maybe a desktop machine plus a notebook. : :Now what happens if the two nodes are disconnected for a :certain period of time? E.g. there's a network outage, or :the notebook is taken off-line on a trip. In such a :situation, both nodes should still be fully functional :with write access, of course. Therefore, each node must :maintain a queue of changes that need to be replicated to :the other node as soon as the connection is restored. You have two options: First, operate the cluster filesystem in quorum mode, where modifying operations are only allowed if a quorum of replication masters is accessible. This is the most typical scenario. Conflicts do not occur when using a quorum protocol. Second, operate the cluster filesystem in independant-operation mode, which is how one would have to operate something like a laptop, where work must be synchronized at a later date and for which conflicts may occur. In this situation conflicts must be brought to the attention of the user and the user must choose how to resolve them. :Given your feature summary, a few interesting abbreviations :can be made, for example: : :high-availability replicated distributed file system :== HARDFS : :high-availability clustered replicated file system :== HACRFS (pronounced "hacker FS", of course) : :high-availability multi-master extra reliable file system :== HAMMERFS : :The "R" can mean replicated or reliable (or robust), :whatever you prefer, and the "E" can mean extra, eminently, :exceedingly or exceptionally. : :Best regards : Oliver HAMMER is a good name (though I dont like the expanded acronym). I like HACFS better then HACRFS. I like HAMMER better then HACFS but HACFS's expanded acronym is better. -Matt
From: Oliver Fromme [email blocked] Date: 23 Feb 2007 11:39:02 GMT Subject: Initial filesystem design synopsis. Matthew Dillon wrote: > > :high-availability multi-master extra reliable file system > :== HAMMERFS > : > :The "R" can mean replicated or reliable (or robust), > :whatever you prefer, and the "E" can mean extra, eminently, > :exceedingly or exceptionally. > : > :Best regards > : Oliver > > HAMMER is a good name (though I dont like the expanded acronym). > I like HACFS better then HACRFS. > > I like HAMMER better then HACFS but HACFS's expanded acronym is > better. I guss it's the "E" that you don't like, right?. Well, it could mean a lot of other things. There are plenty of cool buzzwords starting with e. ;-) For example, actually it could mean "exabyte", because you're going to have 64 bit files, and 2^64 happens to be sixteen exabyte (1 EB is 2^60 bytes). Or "high-avilability, multi-master-enabled, replicated file system". Simple. And finally: If you like that name, does it _have_ to have a good expansion? From a marketing and publicity point of view, the name is everything. Nobody will remember the expansion of the acronym anyway. Best regards Oliver -- Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd Any opinions expressed in this message may be personal to the author and may not necessarily reflect the opinions of secnetix in any way.
From: Matthew Dillon [email blocked] Date: Sat, 24 Feb 2007 10:05:17 -0800 (PST) Subject: Initial filesystem design synopsis. :And finally: If you like that name, does it _have_ to :have a good expansion? From a marketing and publicity :point of view, the name is everything. Nobody will :remember the expansion of the acronym anyway. : :Best regards : Oliver : :-- :Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing True enough. It doesn't have to translate as an acronym. At the moment 'HAMMER' is my favorite. I like the idea of a hammer :-) I am still refining the filesystem spec. The greatest weakness in it right now is where to store large indexes (primarily for directories). -Matt Matthew Dillon [email blocked]

Related Links:

Anything like this on Linux?

Anonymous (not verified)
on
February 28, 2007 - 12:13pm

Apart from GFS?

Yes

Anonymous (not verified)
on
February 28, 2007 - 3:23pm

http://www.lustre.org/

Warning on big parallel/clustered file systems - here there be a zillion little dragons that will really ruin your day.

Yes, I currently use ocfs2

Anonymous (not verified)
on
March 4, 2007 - 12:58am

Yes, I currently use ocfs2 (oracle clustered file system 2) in production for a web server farm.

It doesn't have nearly the features this proposed filesystem will have, but if this ever makes it into the linux kernel, I'll implement it immediately. Ocfs2 runs really well (if you increase the heartbeat threshold to a value around 900 in case of high load, else they self fence==bad)

Michael

I might point out that me

Michael (not verified)
on
March 13, 2007 - 6:29pm

I might point out that me having the heartbeat threshold as high as 900 was bad. OCFS2 can't establish any new locks if a machine goes down until the heartbeat is determined dead (900 means 30 minutes====very bad)

Setting this to 15 is a much more realistic value, and the machines don't self fence.

Michael

Free ones..

JerkerNyberg (not verified)
on
March 5, 2007 - 6:09am

Free shared storage file systems: GFS and OCFS.

Free distributed parallel: PVFS and Lustre. In development also Ceph and GlusterFS. Lustre, Ceph and GlusterFS has fault tolerance in their roadmaps.

Free distributed parallel fault tolerant: GFarmFS.

HAMMER

on
March 2, 2007 - 5:13pm

Why not just eschew the acronym and call it HammerFS? Or, go the route of the Compiler Language With No Pronounceable Acronym, which is, for obvious reasons, abbreviated "INTERCAL." ;-)

(Not trying to malign the filesystem, just trying to make a joke.)

Since it has no limits and

TDog (not verified)
on
March 3, 2007 - 12:21am

Since it has no limits and there is nothing it can't do, why not call it GodFS ?

Oh dear, that is really not

umka@ (not verified)
on
March 4, 2007 - 9:24am

Oh dear, that is really not so easy as you may be think ;) Do you have any related experience? Would be interesting to know.

hmm...

turn.self.off (not verified)
on
March 4, 2007 - 10:03am

i would love to see this in combo with the plan 9 style mounting of machine resources as if they where local :D

then i can set up a kind of cluster at home that behave as a single large computer.

maybe have a multimedia computer (or HTPC if you will) mounted into the same overall system as my home computer(s), my laptop (that i can dismount and bring with me, and that will resync the moment i mount it back into the home network) and more.

i see potential here people ;) clustering is for more then number crunching :P

HammerFS is excellent ...

none@no.no (not verified)
on
March 8, 2007 - 3:10am

HammerFS is excellent ... works as a word and can be expanded to "high-availability, multi-master-enabled, replicated" ... file system.

Totally agree that clustered file systems are "massively" :-) useful for consumers. People need vast replicable fault tolerant storage for the *tons* of data they are going to have. They simply aren't all going to be storing that on A3 or in their 200gig limit (someday) Google accounts!

ZFS (now working really quite well in FreeBSD) actually adds some neat things from a consumer electronics and data storage perspective: people want to be able to keep static media "offline" and have it come "online" really easily it's a super key part of everyone's multimedia/domotics/communications technologies of the future. Unburdened by a huge legacy userbase/codebase - Dragonfly is doing great things these days. Keep up the good work!

Checksums please!

Anonymous (not verified)
on
March 8, 2007 - 7:40am

Sounds good, but all those features, but no checksums? Ouch.

Read the mailing list

Anonymous (not verified)
on
March 12, 2007 - 8:38am

Read the mailing list archive. Of course there will be checksums.

DOOFS

Anonymous (not verified)
on
March 22, 2007 - 7:34am

I vote for DOOFS. Distributed Object-Oriented File System

What about DillonFs? its

Anonymous (not verified)
on
April 24, 2007 - 9:08am

What about DillonFs? its nice.... reiser have too filesystem...

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.