Matthew Dillon created DragonFly BSD [1] in June of 2003 as a fork of the FreeBSD 4.8 codebase. KernelTrap first spoke with Matthew back in January of 2002 while he was still a FreeBSD developer and a year before his current project was started. He explains that the DragonFly project's primary goal is to design a "fully cross-machine coherent and transparent cluster OS capable of migrating processes (and thus the work load) on the fly."
In this interview, Matthew discusses his incentive for starting a new BSD project and briefly compares DragonFly to FreeBSD and the other BSD projects. He goes on to discuss the new features in today's DragonFly 1.10 release. He also offers an in-depth explanation of the project's cluster goals, including a thorough description of his ambitious new clustering filesystem. Finally, he reflects back on some of his earlier experiences with FreeBSD and Linux, and explains the importance of the BSD license.
DragonFlyBSD Versus Other BSD Projects
Jeremy Andrews: It's been five and a half years since I last interviewed you on KernelTrap. At that time you were a FreeBSD committer, and DragonFlyBSD had not yet been born. Can you summarize what happened to cause you to leave the FreeBSD project and start the DragonFly project?
JA: How does the DragonFly development process differ from the FreeBSD development process?
Matthew Dillon: The development process works about the same way but we make it clear that there is no concept of long-term 'ownership' for any of the code that goes into CVS. Code that goes in is considered to be community property. Individual developers only 'own' a piece of code while they are actively working on it. Once something has gone in and is no longer being worked on actively anyone else can come in with fixes or cleanups.
We aren't as rabidly opposed to cleanups as other projects. We consider cleanups to be good business because they result in a more readable and more understandable code base down the line.
The DragonFly release process is similar but somewhat more evolved in order to not waste developers' time as much as the FreeBSD release process did. The code freeze is usually no longer than a week and once the release is branched in CVS people are free to commit any work in HEAD that was held off. We also went with a full Live-CD image almost immediately, something I am happy to see other projects doing now as well.
There is a much greater emphasis on code and algorithmic quality over code performance, because there isn't much point releasing a system that performs highly but also crashes occasionally.
JA: Have you done any performance comparisons between DragonFly and any other BSD projects?
Matthew Dillon: I haven't, particularly. We know where we stand simply by virtue of where we are in the MP work, which is what most performance comparisons measure these days... somewhere between Open/Net and FreeBSD. Removing the big giant lock has not been a big priority (and won't be for 2.0 either) and I keep hoping that another developer will pick up that ball and run with it.
JA: What is the "MP work"?
Matthew Dillon: Multi-processor, aka SMP. In the context of the OS code it means getting rid of the global locks that effectively prevent code from running in a kernel context on more than one CPU at a time.
JA: Do you continue to share code with the FreeBSD project?
Matthew Dillon: We pull in driver updates when appropriate (NATA being a prime example), but we do a considerable amount of driver work ourselves now, at least with regards to network drivers. We probably port more stuff from NetBSD and OpenBSD than we do from FreeBSD at this point.
JA: How many developers are actively contributing to DragonFly?
Matthew Dillon: About a dozen people are heavily involved, with numerous other people contributing around the edges.
JA: Has the number of active developers been growing since the project's inception?
Matthew Dillon: We have added a new committer on average about once every few months.
JA: Does code have to be approved by you or any other developers before it is merged?
Matthew Dillon: We rely on the people to whom we've given commit bits to have common sense, and by and large that has worked.
JA: How stable is each release of DragonFly?
Matthew Dillon: DragonFly releases tend to be very stable. I can't really say how it stacks up to recent FreeBSD releases but DragonFly's stability is roughly similar to the stability that the FreeBSD-4 series was known for.
Clustering
JA: What are the main goals and design differences of DragonFly that separate it from the other BSD projects?
Matthew Dillon: My primary goal is to eventually have a fully cross-machine coherent and transparent cluster OS capable of migrating processes (and thus the work load) on the fly. Doing this properly requires direct, integrated support in the kernel. We are probably two years away from accomplishing this goal.
JA: What pieces of this goal are already implemented, even partially?
Matthew Dillon: Range locking mechanics for I/O atomicity have been partially implemented. We do range locking but we have not yet removed the vnode lock for I/O operations. Syslink has been mostly implemented (but not the over-the-wire version for remote machine access yet). A userland vfs API using syslink is partially implemented and will be completed for 2.0. A system identifier abstraction for major data structures called sysid (kern_sysid.c) is partially implemented.... this is how machines will reference shared resources in a clustered environment. The cluster filesystem design is well under way, as previously described.
JA: What pieces are not yet implemented?
Matthew Dillon: Cache Coherency mechanics have not been implemented. Resource sharing and execution context migration mechanics have not been implemented. Cache coherency is the big ticket item here, resource sharing naturally follows once we have proper cache coherency mechanics.
JA: What is your inspiration and need for all the clustering features in DragonFly?
Matthew Dillon: I don't want to have to shut down a machine that has been running for months and months just to fix a piece of hardware. I want to be able to shove all of its functions onto another box while it is still live and then be able to physically power down the original box without it affecting operations.
More broadly speaking, I see a huge potential in internet-based clustered computing where resources such as CPU and storage are not localized.
DragonFly 1.10:
JA: What architectures are currently supported by the DragonFly code base?
Matthew Dillon: Just i386 for now. Architectures haven't been a big priority and tend to interfere with necessary machine dependent infrastructure work. There is more interest now that most of the potentially interfering infrastructure work has been completed, and because we now have a 'virtual kernel' architecture that serves as a great template for new architectural ports. We certainly want to get 64 bit support in sooner than later but it is not the number one priority for the project.
JA: What is the 'virtual kernel' architecture?
Matthew Dillon: It refers primarily to the way the kernel source code is organized in the source tree. While working on virtual kernel support a great deal of separation and clarification of the machine-specific and architecture-specific source files occurred.
JA: DragonFly 1.10 was released today, nearly 6 months after DragonFly 1.8. Does the project have a regular release cycle?
Matthew Dillon: 1.10 may be a bit late due to a slew of last-minute work.
DragonFly does a release every 6 months or so, one in summer, one in winter. This somewhat slower release cycle allows our relatively few developers to focus on development rather than focus on release engineering.
I think FreeBSD has contemplated going to a slower release schedule as well, though I don't know what the status of their discussion is.
JA: What issues are delaying the 1.10 release?
Matthew Dillon: Primarily little niggling bugs that were thought to be less important than they turned out to be. We broke the kernel's ability to boot with a root vinum, there was a kernel memory leak related to exec*(), a particular sound driver wasn't happy, a bug was found in msdosfs, and a few other things like that. Vinum (the software RAID driver) also had some serious interaction issues with the new ATA driver (NATA) because it was ignoring DMA limits specified by the new driver. That opened up a Pandora's box that required adding an abstraction layer in the vnode subsystem to break up large I/O strategy requests.
JA: What are some of the main new features found in DragonFly 1.10 compared to version 1.8?
Matthew Dillon: I haven't put together the feature list yet but off the top of my head our virtual kernel support (running a virtual DragonFly kernel as a user process under a real DragonFly kernel) is much better now than it was in 1.8. We have introduced a new disk management infrastructure which supports GPT and 64 bit native disklabels (though not the boot code yet to allow one to boot from the above). A great deal of wireless networking work has gone in and been stabilized. We also ported and stabilized FreeBSD's new ATA driver (we call it NATA) and it will become the default in 1.10. Package Source support is also considerably better now. Syslink is in and working (syslink isn't really user visible yet, though).
JA: What is GPT, and what are the advantages of 64 bit native disklabels?
Matthew Dillon: GPT is a partitioning scheme put forth by Intel as part of the EFI standard for BIOS interactions during low level boot. Among other things it expands the 32 bit block number limit that the DOS partition table had (which is responsible for the 2TB-per-device limitations we see on PCs today) to 64 bits.
64 bit disklabel support is a DragonFly-specific feature that replaces the old crufty 32 bit BSD disklabels with 64 bit disklabels. That is, BSD disklabels also had the problem of specifying block ranges with 32 bit quantities as well as having to store them as absolute sector offsets on-disk and translate them back and forth on the fly when reading or writing a label. Standard BSD disklabels were also in-band in that the first partition was allowed to overlap the label area itself, which causes all sorts of problems. The 64 bit disklabel support is slice-relative, sector-size-agnostic (uses 64 bit byte offsets rather than sector numbers), and does not allow partitions to overlap the label area.
Both GPT and 64 bit disklabel support are experimental as of this release. We do not yet support booting from such partitions because we haven't yet written boot code to support them.
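The 2TB-per-device figure can be checked with a little arithmetic. The sketch below assumes the traditional 512-byte sector size, which the interview does not state explicitly, and the function name is purely illustrative.

```python
# Addressable capacity for a given block-number width.
# Assumption: 512-byte sectors (not stated in the interview).
SECTOR_BYTES = 512
TIB = 2 ** 40


def max_capacity_bytes(block_number_bits, sector_bytes=SECTOR_BYTES):
    """Return the largest device size addressable with this block-number width."""
    return (2 ** block_number_bits) * sector_bytes


dos_limit = max_capacity_bytes(32)  # DOS partition table: 32 bit block numbers
gpt_limit = max_capacity_bytes(64)  # GPT: 64 bit block numbers

print(dos_limit // TIB, "TiB")  # 2 TiB
print(gpt_limit // TIB, "TiB")  # 8589934592 TiB
```

With larger sectors the 32 bit limit rises proportionally, which is one reason a sector-size-agnostic byte offset is attractive for a disklabel.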
JA: In addition to stability, what other improvements have been made to wireless networking?
Matthew Dillon: It isn't my area of expertise but my understanding from browsing all the commits is that a great deal of device support has been added in addition to the stability work.
JA: You mentioned your new ATA driver, NATA, having been ported from FreeBSD. What was the driver called in FreeBSD?
Matthew Dillon: 'ata'. In FreeBSD their new ata driver replaced the old one entirely. In DragonFly, kernels can be built with either the old or the new driver, but the two are mutually exclusive.
This is typical of how we operate. We generally integrate major upgrades in parallel with existing ones and allow both to be used for quite a period of time before switching the default. NATA has been in the tree in some form or another for almost a year but was only considered good enough to make the default as of this release. Similarly we have multiple versions of GCC in the tree (3.4.6 and 4.1.2 at the moment), and numerous people are running with a 4.1.2 default, but the official default is still 3.4.6. In the case of GCC the compiler can be selected on an individual-use basis with a simple environment variable.
JA: What are the advantages of the NATA driver?
Matthew Dillon: The biggest advantage of the new driver is that it supports something called AHCI which is the native command queuing protocol used by SATA controllers. SATA controllers can operate in one of two modes: Emulated mode or native mode. In emulated mode SATA controllers look like your standard run of the mill IDE/ATA/ATAPI controller. IDE protocols are over 20 years old and horrible beyond measure. In native (AHCI) mode SATA controllers look much more like modern SCSI controllers, are far easier to manage, and do not have any of the severe limitations the IDE protocols had.
This is all the work of FreeBSD's Soren Schmidt.
Highly Available Clustering Filesystem:
JA: In February you posted some updates on the DragonFly mailing lists about a highly available clustering filesystem that you are designing. What is the current status of this filesystem? What are the long term goals?
Matthew Dillon: A really good cluster filesystem is a prerequisite to having a really good cluster OS. I had hoped to have it finished for this release but a ton of things came up that had to be addressed and I ran out of time. The cluster filesystem is my personal priority for 2.0.
The filesystem has several major goals.
- Infinite snapshots. You can think of this kinda like journaling but more in a transactional sense where there is no explicit snapshotting event and instead you simply mount the filesystem 'as of' a certain date to get a snapshot as-of that date.
- Today's storage media is far larger than most people need. Until told otherwise the filesystem will not destroy any historical data. A continuous cleaning system will allow you to manage the granularity of the snapshots... for example, every 30 seconds for the last hour, every minute for the last day, every hour after a day, every day after a week, once a week after a month, and so forth. Cleaning involves 'collapsing' modifications made within the selected time quantum, which has the side effect of freeing space.
- Integrated backup mechanism. Because we have infinite snapshots we do not have the races associated with dump/restore or tar or other standard backup mechanisms. Backing up the filesystem can be thought of as a continuous stream of changes, but a stream which does not require any 'queuing' of the backup data per se, which means that you can back up to multiple targets including very slow targets and mostly off-line targets without worrying about the backlog built up for any given target creating problems on the live system.
The filesystem is being designed to make streaming backups very efficient, meaning that one will not have to do a full filesystem scan to make an incremental backup. Backup targets themselves will also be live filesystems, not archives, and can independently manage their snapshot granularity.
- The backup mechanism will also be used for replication in a cluster, or even replication without a cluster, for redundancy. This won't exist in the first release of the filesystem but the infrastructure will be designed to support it.
- Storage media migration. The filesystem will be able to cross multiple storage media boundaries natively and not require a volume manager. Additionally it will be possible to migrate large chunks (probably in the 2-4G range) between physical storage media while the filesystem is live, which is a major requirement for any filesystem one wishes to maintain long-term.
Those are the basics of the design. Look for it in 2.0!
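The snapshot-granularity schedule described in the list above can be illustrated with a small sketch. This is not the filesystem's cleaning algorithm, which the interview does not specify. It assumes that 'collapsing' a quantum means keeping only the newest snapshot in it, treats a month as 30 days, and uses hypothetical names throughout.

```python
HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY
MONTH = 30 * DAY  # assumption: a month is treated as 30 days

# (maximum age in seconds, quantum in seconds), following the example schedule:
# every 30 seconds for the last hour, every minute for the last day, every hour
# after a day, every day after a week, once a week after a month.
TIERS = [
    (HOUR, 30),
    (DAY, 60),
    (WEEK, HOUR),
    (MONTH, DAY),
    (float("inf"), WEEK),
]


def quantum_for_age(age_seconds):
    """Return the snapshot granularity that applies to data of this age."""
    for max_age, quantum in TIERS:
        if age_seconds < max_age:
            return quantum
    return WEEK  # fallback; not reached for finite ages


def prune(timestamps, now):
    """Return the snapshot timestamps to keep: the newest one per quantum."""
    kept = {}
    for ts in sorted(timestamps):
        quantum = quantum_for_age(now - ts)
        # Later timestamps in the same bucket overwrite earlier ones.
        kept[(quantum, ts // quantum)] = ts
    return sorted(kept.values())
```

Calling `prune` repeatedly as time passes coarsens older history, because a snapshot that once sat in a 30-second bucket eventually ages into a one-minute, one-hour, or coarser bucket.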
JA: When you talk about preserving historical data, does this essentially mean that the file system offers a built in version control system?
Matthew Dillon: Yes, you can think of it that way.
JA: How will one control and access this historical data?
Matthew Dillon: Either directly specify a date when mounting, allowing one to mount arbitrary snapshots (as many as you like) in parallel with the live filesystem mount, or through an extension of the file or directory name to specify the as-of date, which I have not yet come up with.
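The 'as-of' idea can be sketched in a few lines: if every version of a file is retained with its commit time, reading at a given date is a search for the newest version that is not later than that date. The class and method names here are hypothetical and say nothing about the filesystem's real on-disk layout.

```python
import bisect


class VersionedFile:
    """Keeps every version of a file's contents, indexed by commit time."""

    def __init__(self):
        self._times = []     # commit timestamps, ascending
        self._contents = []  # _contents[i] was written at _times[i]

    def write(self, timestamp, data):
        # Assumption: writes arrive in non-decreasing timestamp order.
        self._times.append(timestamp)
        self._contents.append(data)

    def read_as_of(self, timestamp):
        """Return the contents visible at `timestamp`, or None if not yet created."""
        i = bisect.bisect_right(self._times, timestamp)
        return self._contents[i - 1] if i else None


f = VersionedFile()
f.write(100, b"first draft")
f.write(200, b"second draft")
print(f.read_as_of(150))  # b'first draft'
```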
JA: Does your new filesystem have a name?
Matthew Dillon: Not yet. 'DFS' is taken unfortunately.
JA: Why do you need filesystems implemented in userland?
Matthew Dillon: Being able to implement a filesystem in userland greatly reduces development time. From an operational standpoint you want any high performance filesystem to run in the kernel. On the other hand, there is no reason why one would ever need to run a low performance filesystem, such as one for a CD/DVD, msdosfs, or a cross-OS emulated filesystem, in the kernel. Running those things in userland insulates the kernel from filesystem bugs which might otherwise crash the kernel.
JA: Are there any existing free or proprietary filesystems that you're aware of that meet all or most of these goals?
Matthew Dillon: Some of them but not all of them together. ZFS, EXT3, and Reiser all have individual features that are desirable.
JA: How does your filesystem compare to ZFS?
Matthew Dillon: ZFS solves a different problem, but I hope to achieve similar storage redundancy in our filesystem by virtue of making live mirrors practical. ZFS takes an integrated filesystem+storage-layer approach to redundancy but I think that is a mistake. One really needs to take a whole-filesystem approach to redundancy, meaning that redundancy at the filesystem layer needs to operate multiple independent filesystems which happen to implement protocols allowing them to remain coherent with each other, versus operate multiple independent storage systems as a single filesystem.
Think of it like this: When you make a backup of a filesystem to tape, or you make an archive of a filesystem, and then at some point down the line something blows up and you need to restore it, suddenly you are faced with the situation of having to spend an entire day or even longer rebuilding your live filesystem from your archived backup. That is unacceptable today. Most people I know have switched to filesystem replication as their backup scheme. For example, I back up all my DragonFly systems by doing a daily snapshot to independent storage on another machine in my LAN, and do a weekly snapshot of that system to an off-site backup machine. Both the LAN backup box and the offsite backup box replicate the entire directory and file structure and use hardlinks for those files which have not changed:
backup# df -i /backup
Filesystem   1K-blocks      Used     Avail Capacity    iused   ifree %iused  Mounted on
/dev/ad6s1d  726621736 299012871 369479127      45% 40997899 4790899    90%  /backup
The LAN backup box keeps daily snapshots for the last two months and the off-site backup box keeps weekly snapshots for the last 6 months.
One of the major ideas behind the new filesystem is to integrate the concept of making backups and maintaining live mirrors and offsite mirrors directly into the filesystem.
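The hardlink-based backup scheme described above (the manual scheme in use today, not the new filesystem) can be sketched roughly as follows. The interview does not say which tool is used, so this is only an illustration under assumptions: the function name and arguments are hypothetical, files are compared byte for byte, and symlinks, permissions on directories, and error handling are ignored.

```python
import filecmp
import os
import shutil


def snapshot(source, previous, target):
    """Copy `source` into `target`, hardlinking files unchanged since `previous`.

    `previous` is the path of the last snapshot, or None for the first one.
    """
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        os.makedirs(os.path.normpath(os.path.join(target, rel)), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            new = os.path.join(target, rel, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                # Unchanged: share the inode with the previous snapshot.
                os.link(old, new)
            else:
                # New or modified: store a fresh copy.
                shutil.copy2(src, new)
```

Each snapshot directory then looks like a full copy of the tree, while unchanged files consume only a directory entry, which is consistent with the high inode usage relative to block usage in the `df -i` output above.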
JA: How possible will it be to port this filesystem to other operating systems? That is, how intimately is it tied to the design of DragonFly?
Matthew Dillon: I am developing it in userland using syslink so theoretically it would not be too hard to port, but it will use DragonFly's VFS API which is substantially different than the API found in other BSDs.
Syslink Protocol:
JA: You've also discussed your Syslink Protocol on the DragonFly mailing lists. Can you explain a little about what the Syslink protocol is, what changes it has made to the kernel, and how it is used?
Matthew Dillon: Yes, 1.10 will have the first real cut of the syslink protocol ready to go and syslink will be used for our userland VFS implementation. You can think of syslink as a glorified communications pipe with a twist.
- Instead of sending a byte stream you are sending messages and getting replies.
- Message formatting is formalized, endian-translatable, and verified by the transport mechanism (i.e. the kernel in this case).
- The kernel keeps track of messages which have not been replied to, and if the communications pipe is broken any unreplied messages will be replied to by the kernel with an error code.
- Plus there is an out-of-band 'DMA' data mechanism. Messages have a limited size but may include data payloads up to 128KB. The data payloads are out of band... they are not transferred as a byte stream over the syslink 'pipe' but instead are implemented separately. So, for example, the OS syslink API can opt to memory-map the data buffers between sender and receiver if it wishes, and a truly remote syslink can choose to transport the data in-band or via a separate connection or something of that sort.
This allows the basic message stream to be byte oriented (aka copied rather than mapped) without imposing copying overhead on the stuff that truly matters, that being the related data.
Syslink kinda sounds like Mach messaging but it isn't, really. It is designed with an eventual use as the primary form of communication between hosts in a cluster. Initially a userland VFS interface will be implemented using it with the intent of producing an extremely robust result. We need to be able to implement filesystems in userland and yet still guarantee that nothing bad happens if the related userland process is killed.
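The reply-tracking guarantee described above, where every message gets a reply even if the other end dies, can be modelled in a few lines. This is a toy model and not the syslink API: the class, its methods, and the error value are all hypothetical, and the transport itself is reduced to an in-memory list.

```python
import itertools


class MessagePipe:
    """Toy model of a message channel that tracks unreplied messages."""

    BROKEN = "connection-broken"  # hypothetical error code

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # message id -> callback awaiting the reply
        self.outbox = []    # (message id, payload) pairs handed to the receiver

    def send(self, payload, on_reply):
        """Queue a message; `on_reply(result, error)` is called exactly once."""
        msg_id = next(self._ids)
        self._pending[msg_id] = on_reply
        self.outbox.append((msg_id, payload))
        return msg_id

    def reply(self, msg_id, result):
        """Deliver the receiver's reply to the original sender."""
        self._pending.pop(msg_id)(result, None)

    def close(self):
        """Simulate a broken pipe: answer every unreplied message with an error."""
        pending, self._pending = self._pending, {}
        for on_reply in pending.values():
            on_reply(None, self.BROKEN)
```

The point of the design is that the sender never blocks forever: if the userland filesystem process is killed, the transport (the kernel, in the interview's description) completes each outstanding request with an error instead of leaving it hanging.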
DragonFly 2.0:
JA: What major plans do you have for the next version of DragonFly?
Matthew Dillon: The cluster filesystem is the major goal for 2.0.
JA: Are you aware of any other major projects that will be focused on by other DragonFly developers for the 2.0 release?
Matthew Dillon: I am hoping we'll get some progress on 64 bit cpu support.
JA: Referring back to all of the technologies that you've described, where did you learn to design and implement this?
Matthew Dillon: That's hard to say. I tend to soak up everything around me. Many concepts formulated for the new filesystem are based on work I did designing and building a database at a startup a few years ago. This was known as the Backplane Database and it was a really nicely engineered piece of work with multi-master quorum operation, historical queries, and non-queued streaming backups.
BSD License:
JA: How important to you is it that your code is released under the BSD license?
Matthew Dillon: Unbelievably important. I have never subscribed to the almost religious fervor surrounding the GPL; in particular, I do not like the idea of trying to impose the concept of freedom on people by attaching strings. The GPL has created a misguided sense of self importance in the open source world.
Simply getting openly specified software and algorithms into the mainstream has a far larger effect than any license. BSD conforms more to the concept of pure invention. More importantly, in large collaborative projects the BSD license allows the individual authors to use both the project as a whole and bits and pieces of collaborative work they have contributed to no matter where their life takes them, including into commercial settings and even proprietary commercial settings.
BSD is a way of saying that we are not so greedy that we have to hog-tie anyone else who wants to use and profit from our work. Or, in another sense, BSD is a way of confirming that actually making money from an open-source project is a very rare event and some of us aren't really interested in that aspect of the work.
Frankly it is not so easy to 'steal' open source projects as people seem to think. The BSD license acknowledges this fact while also acknowledging and even supporting both commercial use and the occasional commercial proprietization of project code. In a sense, it doesn't really matter whether code is proprietized or not because short of rewriting it completely any commercial success (take Apple's use of BSD and Mach for example) will inherently force that commercial entity into the use of a great deal of openly specified protocols. Just because they can add little proprietary bits and pieces here and there does not change the fact that 95% of their work base will not be proprietary, so the goal of forcing the world into using more open standards, something I *DO* want, is achieved just as well with BSD as it is with GPL.
It is really unfortunate that the fanatics don't realize this. They hold up few and far-between examples of so-called 'stealing' and the so-called protection that the GPL affords against such 'stealing' without any real understanding of what is actually accomplished. There is very little difference between the concept of 'integration' and the concept of 'stealing' in the open-source world. They are more like shades of grey.
If I were to write a large proprietary commercial application that happens to run on Linux (and many such examples exist), the integrated result is for all intents and purposes a black box, GPL or not. And yet, even in that black box a staggering number of open standards are going to be put into play by virtue of the use of open-source, and the use of such standards has a snowballing effect that in almost all cases prevents any significant proprietization over the long term, regardless of the license. And even when proprietization does occur (take Microsoft's stupid extensions to Kerberos for example) it is questionable whether such proprietization actually helps the commercial entity doing it versus the black box nature that their product already is, and it certainly has no significant effect on the open-source world.
From my point of view, this means that the GPL basically just devolves down into, in effect, giving a project protection from competition if the project wishes to go commercial. MySQL is a good example. As people have realized, just because the base code is free doesn't mean that anyone can continue to maintain and develop it. Using the BSD license is basically saying that one has no serious monetary interest in any of the work derived from that project, and that one has no interest in imposing strings on people who might want to use the work.
Personal Life:
JA: Briefly digressing from kernel development and referring to comments you made in our earlier interview [2], are you still snowboarding?
Matthew Dillon: Yes! Not taking big jumps any more, though.
JA: Did you finish the row boat you were working on?
Matthew Dillon: Yup, we sure did. That was way back in 2002:
http://apollo.backplane.com/Boat2002/ [3]
JA: Are you still living and working in Berkeley?
Matthew Dillon: Yup.
History:
JA: I was digging through some old newsgroup mailing lists, and found a post from 1994 in which you compare Linux to BSD, claiming that you prefer Linux and found BSD "well, stuffy". The thread is dated April 1st. Was it a serious post?
Matthew Dillon: 1994! Looks serious to me. And look! The very next posting was Jordan Hubbard responding. I worked on and used the Linux kernel while still living up in the Tahoe area, before starting BEST Internet (as people may remember, we used BSDi at BEST, then switched to SGI, then switched to FreeBSD). It was a long time ago, before I really got back into BSD. I was still heavily involved with the Amiga at that time too. Picture me sitting in a cubby hole in a large room (more like the entire floor of a house) stacked to the brim with electronics, testing equipment, technical reference books, with an Amiga 3000 sitting on the desk and a Linux box with 80 individual wires (my version of a SCSI cable) hanging out from it going to an external 80 megabyte full height Seagate hard drive.
Of course, all my points about programs making assumptions about BSDisms are now true of Linux. Programs make tons of assumptions about Linuxisms these days.
I am rather amused that I made the observation of the stuffiness of the BSD projects way back then. It turned out to be one of my chief complaints about those projects over the years.
JA: Thank you for all your time!