"I will be continuing to commit bits and pieces of HAMMER, but note that it will probably not even begin to work for quite some time," Matthew Dillon reported on the new clustering filesystem he's developing for DragonFlyBSD. He noted, "I am still on track for it to make it into the end-of-year release." Matt continued:
"My B-Tree implementation also allows HAMMER to cache B-Tree nodes and start lookups from any internal node rather then having to start at the root. You can do this in a standard B-Tree too but it isn't necessarily efficient for certain boundary cases. In my implementation I store boundaries for the left AND right side which means a search starting in the middle of the tree knows exactly where to go and will never have to retrace its steps."
"I've never looked at the Reiser code though the comments I get from friends who use it are on the order of 'extremely reliable but not the fastest filesystem in the world'," Matt Dillon explained when asked to compare his new clustering HAMMER filesystem with ReiserFS, both of which utilize BTrees to organize objects and records. He continued, "I don't expect HAMMER to be slow. A B-Tree typically uses a fairly small radix in the 8-64 range (HAMMER uses 8 for now). A standard indirect block methodology typically uses a much larger radix, such as 512, but is only able to organize information in a very restricted, linear way." He continued to describe numerous plans he has for optimizing performance, "my expectation is that this will lead to a fairly fast filesystem. We will know in about a month :-)"
Among the optimizations planned, Matt explained, "the main thing you want to do is to issue large I/Os which cover multiple B-Tree nodes and then arrange the physical layout of the B-Tree such that a linear I/O will cover the most likely path(s), thus reducing the actual number of physical I/Os needed." He noted, "HAMMER will also be able to issue 100% asynchronous I/Os for all B-Tree operations, because it doesn't need an intact B-Tree for recovery of the filesystem." He went on to describe another potential optimization allowed by the filesystem's design: "HAMMER is designed to allow cluster-by-cluster reoptimization of the storage layout. Anything that isn't optimally laid out at the time it was created can be re-laid-out at some later time, e.g. with a continuously running background process or a nightly cron job or something of that ilk. This will allow HAMMER to choose to use an expedient layout instead of an optimal one in its critical path and then 'fix' the layout later on to make re-accesses optimal."
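That "expedient now, optimal later" approach sketches naturally as a background pass over the volume. All of the helper names below (cluster_first, cluster_layout_degraded, cluster_reblock) are hypothetical stand-ins invented for this sketch, not HAMMER interfaces:

```c
/*
 * Illustrative background re-layout pass over a HAMMER-like volume.
 * Every helper below is a hypothetical stand-in for whatever
 * per-cluster iteration and reblocking primitives the filesystem
 * actually exposes.
 */
struct cluster;

struct cluster *cluster_first(void);
struct cluster *cluster_next(struct cluster *cl);
int  cluster_layout_degraded(const struct cluster *cl);
void cluster_reblock(struct cluster *cl);  /* rewrite into an optimal layout */

/*
 * Run continuously at low priority, or from a nightly cron job:
 * anything written with an expedient layout in the critical path
 * gets fixed up here so that re-accesses become optimal.
 */
void
reblock_pass(void)
{
    struct cluster *cl;

    for (cl = cluster_first(); cl != NULL; cl = cluster_next(cl)) {
        if (cluster_layout_degraded(cl))
            cluster_reblock(cl);
    }
}
```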
"I am going to start committing bits and pieces of the HAMMER filesystem over the next two months," announced Matthew Dillon on the Dragonfly BSD kernel mailing list. He noted that the filesystem should be functional by the 2.0 release in December, "I am making good progress and I believe it will be beta quality by the release. It took nearly the whole year to come up with a workable design. I thought I had it at the beginning of the year but I kept running into issues and had to redesign the thing several times since then." Matthew then posted a detailed design document for the new filesystem.
During the followup discussion, Matthew was asked if HAMMER would be a ZFS killer. He responded, "ZFS serves a different purpose and I think it is cool, but as time has progressed I find myself liking ZFS's design methodology less and less, and I am very glad I decided against trying to port it." He noted it is essential to have redundant copies of data, but added, "the problem ZFS has is that it is TOO redundant. You just don't need that scale of redundancy if you intend to operate in a multi-master replicated environment, because you not only have wholly independent (logical) copies of the filesystem, they can also all be live and online at the same time." As for how DragonFly's new filesystem will address redundancy, he explained:
"HAMMER's approach to redundancy is logical replication of the entire filesystem. That is, wholely independant copies operating on different machines in different locations. Ultimately HAMMER's mirroring features will be used to further our clustering goals. The major goal of this project is transparent clustering and a major requirement for that is to have a multi-master replicated environment. That is the role HAMMER will eventually fill. We wont have multi-master in 2.0, but there's a good chance we will have it by the end of next year."
Matthew Dillon has announced the release of DragonFly BSD 1.10, the sixth major DragonFly release since the project's creation in 2003. The release notes say "we consider 1.10 to be more stable than 1.8," and summarize some of the new features:
"Several big-ticket items are present in this release. Our default ATA driver has been switched to NATA (ported from FreeBSD). NATAs big claim to fame is support for AHCI which is the native SATA protocol standard. It is far, far better then the old ATA/IDE protocol. DragonFly now has non-booting support for GPT partitioning and 64 bit disklabels. Non-booting means we don't have boot support for these formats yet. DragonFly's Light Weight Process abstraction is now finished and working via libthread_xu but the default threading library is not quite ready to be changed from libc_r yet. All threaded programs now link against an actual 'libpthread' which is a softlink to libc_r or libthread_xu, allowing the new threading library to be tested more fully."
"1.10 has been branched," DragonFlyBSD creator Matt Dillon announced, noting that the official release is expected soon, "no release date has been set yet but this coming weekend is looking real good now." Among the new features of DragonFly 1.10 are improved virtual kernel support, a new disk management infrastructure, improvements to wireless networking, and support for the new syslink protocol.
DragonFlyBSD has a stable release every six months. The current development branch is numbered 1.11, with the next stable release at the end of the year numbered 2.0. The 1.10 release has been delayed about a week while some final bugs were addressed. Matt noted:
"The 1.10 release is looking a lot better now. We are basically just waiting for a new pkgsrc bootstrap kit and a little more testing. All major issues except booting a machine with a USB root with EHCI loaded have been resolved."
DragonFlyBSD founder Matthew Dillon [interview] posted an update on his syslink protocol, which he defined as "a message based protocol that can devolve down into almost direct procedure calls when two localized resources talk to each other." The syslink API will be used to talk both to local resources on the same node and to remote resources on a different node. Earlier documentation further explained the networking nature of the protocol: "the Syslink protocol is used to glue the cluster mesh together. It is based on the concept of reliable packets and buffered streams. Adding a new node to the mesh is as simple as obtaining a stream connection to any node already in the mesh, or tying into a packet switch with UDP." In another email, Matthew explained how various DragonFlyBSD nodes utilize syslink to automatically establish the optimal physical route.
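The "devolve down into almost direct procedure calls" property can be pictured as a dispatch step that short-circuits when both endpoints live on the same node. The following is an invented illustration, not the syslink implementation; the message layout and helper names are assumptions:

```c
#include <stdint.h>

/*
 * Hypothetical sketch of syslink-style dispatch.  A request addressed
 * to a local resource devolves into (almost) a direct procedure call;
 * a request addressed to a remote resource is marshalled into a
 * message and handed to a reliable packet/stream transport.  All
 * names here are invented for the illustration.
 */
struct slmsg {
    uint64_t dst_sysid;    /* which resource this request targets */
    uint32_t cmd;          /* operation code                      */
    uint32_t payload_len;  /* payload bytes that follow           */
};

void *lookup_local_resource(uint64_t sysid);     /* NULL if remote   */
void  local_invoke(void *res, uint32_t cmd);     /* direct call path */
void  syslink_transmit(const struct slmsg *msg); /* reliable send    */

void
syslink_request(uint64_t sysid, uint32_t cmd)
{
    void *res = lookup_local_resource(sysid);

    if (res != NULL) {
        local_invoke(res, cmd);  /* two local resources: near-direct call */
    } else {
        struct slmsg msg = { .dst_sysid = sysid, .cmd = cmd,
                             .payload_len = 0 };
        syslink_transmit(&msg);  /* crosses the machine boundary */
    }
}
```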
In his recent email, Matthew described the latest syslink issue he has solved: "in order to transport requests across a machine boundary (that is, outside the domain of a direct memory access), it is necessary to assign a unique identifier to the resource." He detailed how he had originally planned to rework dozens of major system structures to use the syslink API, but instead will now "rework JUST the reference counting methodology used in these resource structures." The end result is "a common ref counting API and a little structure that includes a 64 bit unique sysid, red-black tree node, the ref count, and a pointer to a resource type structure (e.g. identifying it as a vnode, vm object, or whatever). When any of the above resources are allocated, they will be indexed in a Red-Black tree. In other words it will be possible to identify every single resource in the system by traversing the red-black tree." He went on to summarize, "and that, folks, gives us the building blocks we need to represent resources in a cluster. This also means I don't have to rewrite the APIs. Instead I can simply write new RPC APIs for accesses made via syslink ids and, poof, now all of a system's resources will become accessible remotely, with only modest effort."
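The structure Matthew describes translates almost directly into C. Here is a minimal sketch using the BSD red-black tree macros, with illustrative names rather than the committed DragonFly API:

```c
#include <stdint.h>
#include <sys/tree.h>  /* BSD red-black tree macros (RB_ENTRY, RB_HEAD) */

struct sysref_class;   /* identifies the resource type: vnode, vm object, ... */

/*
 * Sketch of the common reference-counting structure described above,
 * embedded in each major resource.  Names are illustrative, not the
 * committed DragonFly API.
 */
struct sysref {
    uint64_t             sysid;    /* 64 bit cluster-unique identifier */
    int                  refcnt;   /* the common reference count       */
    RB_ENTRY(sysref)     rbnode;   /* entry in the global RB tree      */
    struct sysref_class *srclass;  /* pointer to the resource type     */
};

/*
 * Every resource is indexed here on allocation, so every resource in
 * the system can be identified by traversing one red-black tree,
 * which is the building block for exporting resources via syslink ids.
 */
RB_HEAD(sysref_tree, sysref);
```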
Matt Dillon [interview] posted the design synopsis of a new highly available clustered filesystem he will soon begin writing for DragonFlyBSD. The feature summary at the beginning of his document included, "on-demand filesystem check and recovery; infinite snapshots; multi-master operation, including the ability to self-heal a corrupted filesystem by accessing replicated data; infinite logless replication, meaning that replication targets can be offline for 'days' without affecting performance or operation; 64 bit file space, 64 bit filesystem space, no space restrictions whatsoever; reliably handles data storage for huge multi-hundred-terabyte filesystems without fear of unrecoverable corruption; cluster operation, provides the ability to commit data to locally replicated store independently of other replication nodes, with access governed by cache coherency protocols; independent index, data is laid out in a highly recoverable fashion, independent of index generation, and indexes can be regenerated from scratch and thus indexes can be updated asynchronously." He then went into detail on each of these points and many more, explaining how he intends to implement the new filesystem.
The new filesystem is currently unnamed, though Matt noted, "it doesn't have to translate as an acronym. At the moment 'HAMMER' is my favorite. I like the idea of a hammer :-)" It was suggested that this could mean, "high-availability multi-master extra reliable file system", though Matt was not impressed with this. Another proposed idea that Matt liked was HACFS, or "High-Availability Clustered File System".
Matt Dillon [interview] decided on an official version numbering scheme for DragonFlyBSD releases. After ruling out the use of dates in release names, he settled on using odd numbers to denote works in progress and even numbers to denote releases. For example, 1.0, 1.2, 1.4, and so on would be considered releases, whereas 1.1, 1.3, 1.5, and so on would be considered works in progress.
Four tags will also be used, -CURRENT, -WORKING, -RELEASE and -STABLE. The -CURRENT tag indicates "a build based on the head of the CVS tree." The -WORKING tag indicates "a build based on our current stable tag". The -RELEASE tag indicates "a build based on a release branch." And the -STABLE tag indicates "a build based on a post-release branch." Matt added, "you can probably see why I am also using odd/even numbering... so people can just glance at the number to get an idea of the relative time frame without necessarily understanding what all the keywords mean." Following this scheme, the next stable release will be DragonFly 1.2-RELEASE.
Matt Dillon [story] provides an interesting and detailed explanation of future development plans for DragonFly's I/O subsystem. Originally inspired by the PIPE code improvements of FreeBSD's Alan Cox, and demonstrated in DragonFly's unique XIO and MSFBUF APIs, the goal of this work is to avoid KVA mappings for I/O requests and the resulting overhead of interprocessor interrupts on SMP systems. In theory, this yields high performance: efficient I/O combined with the ability of any subsystem layer to hand data down to busdma with zero memory-to-memory copies. Matt expands:
"What we are going to do is extend the msf_buf abstraction to cover these needs and provide a set of API calls that allows upper layers to supply data in any form and lower level layers to request data in any form, including with address restrictions. msf_buf's already have a page-list (XIO) and KVA mapping abstraction. We are going to add a bounce-buffer abstraction and then work on a bunch of new API calls for msf_bufs to cover the needs of various subsystems."
There appears to be a lot of interesting work going on in DragonFly; read more for the entirety of Matt's post.
In response to a question raised on the dragonfly-kernel mailing list, Matt Dillon [interview] gives an overview of plans to revise the DragonFly userland scheduler, the second of two schedulers that DragonFly currently utilizes in a layered fashion. Matt explains: "There are actually two schedulers... there's the LWKT scheduler, which is as close to perfect as it's possible to be, and there is the userland scheduler, which is the one that needs work."
Matt goes on to describe the benefits of the intended redesign: "The biggest advantage of this methodology is that we can in fact implement multiple userland schedulers and switch between them on the fly, because the userland scheduler is only a relatively simple layer operating on top of LWKT. So it is theoretically easy to switch the userland portion of the scheduler on a live system."
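One way to picture such a swappable userland scheduler is as a policy reduced to a small table of function pointers sitting on top of the fixed LWKT layer; switching on a live system then amounts to installing a different table. The interface below is invented for illustration and is not DragonFly's actual usched API:

```c
struct lwp;  /* a userland thread/process context */

/*
 * Illustrative sketch: the userland scheduling policy reduced to a
 * small ops vector layered on top of the fixed LWKT scheduler.  The
 * names are invented for the example.
 */
struct usched_ops {
    const char *name;
    void (*acquire_curproc)(struct lwp *lp);  /* lp wants to run in userland  */
    void (*release_curproc)(struct lwp *lp);  /* lp is re-entering the kernel */
    void (*set_priority)(struct lwp *lp, int prio);
};

static const struct usched_ops *usched_active;

/*
 * Because the policy is just a table, switching it on a live system
 * is a pointer swap; the LWKT scheduler underneath is untouched.
 */
void
usched_switch(const struct usched_ops *newpolicy)
{
    usched_active = newpolicy;
}
```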
In addition to on-the-fly switching of schedulers, there are also plans to allow multiple schedulers to run in parallel. The full thread, with relevant links, follows.